Blazegraph (by SYSTAP) / BLZG-773

HAJournalServer deadlock: pipelineRemove() and getLeaderId()

    Details

      Description

      
      Found one Java-level deadlock:
      =============================
      "com.bigdata.journal.jini.ha.HAJournal.executorService5430":
        waiting for ownable synchronizer 0x0000000700e6e3e8, (a java.util.concurrent.locks.ReentrantLock$NonfairSync),
        which is held by "com.bigdata.quorum.zk.ZKQuorumImpl$ZkQuorumWatcher1"
      "com.bigdata.quorum.zk.ZKQuorumImpl$ZkQuorumWatcher1":
        waiting for ownable synchronizer 0x00000007081c3a28, (a java.util.concurrent.locks.ReentrantLock$NonfairSync),
        which is held by "com.bigdata.journal.jini.ha.HAJournal.executorService5430"
      
      Java stack information for the threads listed above:
      ===================================================
      "com.bigdata.journal.jini.ha.HAJournal.executorService5430":
      	at sun.misc.Unsafe.park(Native Method)
      	- parking to wait for  <0x0000000700e6e3e8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
      	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:842)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1178)
      	at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:186)
      	at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:262)
      	at com.bigdata.quorum.AbstractQuorum.getLeaderId(AbstractQuorum.java:1024)
      	at com.bigdata.quorum.AbstractQuorum.assertLeader(AbstractQuorum.java:1079)
      	at com.bigdata.quorum.AbstractQuorumMember.assertLeader(AbstractQuorumMember.java:216)
      	at com.bigdata.ha.QuorumPipelineImpl$RobustReplicateTask.assertQuorumState(QuorumPipelineImpl.java:878)
      	at com.bigdata.ha.QuorumPipelineImpl$RobustReplicateTask.innerReplicate(QuorumPipelineImpl.java:1070)
      	at com.bigdata.ha.QuorumPipelineImpl$RobustReplicateTask.retrySend(QuorumPipelineImpl.java:1129)
      	at com.bigdata.ha.QuorumPipelineImpl$RobustReplicateTask.call(QuorumPipelineImpl.java:1031)
      	at com.bigdata.ha.QuorumPipelineImpl$RobustReplicateTask.call(QuorumPipelineImpl.java:772)
      	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
      	at java.lang.Thread.run(Thread.java:662)
      "com.bigdata.quorum.zk.ZKQuorumImpl$ZkQuorumWatcher1":
      	at sun.misc.Unsafe.park(Native Method)
      	- parking to wait for  <0x00000007081c3a28> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
      	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:842)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1178)
      	at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:186)
      	at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:262)
      	at com.bigdata.ha.QuorumPipelineImpl.pipelineRemove(QuorumPipelineImpl.java:415)
      	at com.bigdata.quorum.AbstractQuorumMember.pipelineRemove(AbstractQuorumMember.java:300)
      	at com.bigdata.quorum.AbstractQuorum$QuorumWatcherBase.pipelineRemove(AbstractQuorum.java:2409)
      	at com.bigdata.quorum.zk.ZKQuorumImpl$ZkQuorumWatcher.access$4300(ZKQuorumImpl.java:1244)
      	at com.bigdata.quorum.zk.ZKQuorumImpl$ZkQuorumWatcher$QuorumPipelineWatcher.remove(ZKQuorumImpl.java:1861)
      	at com.bigdata.quorum.zk.ZKQuorumImpl$ZkQuorumWatcher$AbstractSetQuorumWatcher.applyOrderedSetSemantics(ZKQuorumImpl.java:1721)
      	at com.bigdata.quorum.zk.ZKQuorumImpl$ZkQuorumWatcher$QuorumPipelineWatcher.handleEvent(ZKQuorumImpl.java:1851)
      	at com.bigdata.quorum.zk.ZKQuorumImpl$ZkQuorumWatcher$HandleEventRunnable.run(ZKQuorumImpl.java:1389)
      	at com.bigdata.quorum.zk.ZKQuorumImpl$ZkQuorumWatcher.accept(ZKQuorumImpl.java:1327)
      	at com.bigdata.quorum.zk.ZKQuorumImpl$ZkQuorumWatcher$AbstractSetQuorumWatcher.start(ZKQuorumImpl.java:1696)
      	at com.bigdata.quorum.zk.ZKQuorumImpl$ZkQuorumWatcher.setupWatchers(ZKQuorumImpl.java:1495)
      	at com.bigdata.quorum.zk.ZKQuorumImpl$ZkQuorumWatcher.handleExpired(ZKQuorumImpl.java:1429)
      	at com.bigdata.quorum.zk.ZKQuorumImpl$ZkQuorumWatcher.access$2100(ZKQuorumImpl.java:1244)
      	at com.bigdata.quorum.zk.ZKQuorumImpl$ZkQuorumWatcher$HandleEventRunnable.run(ZKQuorumImpl.java:1374)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
      	at java.lang.Thread.run(Thread.java:662)
      
      Found 1 deadlock.
      

        Activity

        bryanthompson added a comment -

        This deadlock is a lock ordering problem. Both AbstractQuorum and QuorumPipelineImpl have an internal lock.

        In one code path, we start with a zookeeper event. This event is dropped onto an executor in ZKQuorumImpl. In this case, the event was actually a zk client session expire notice. The same problem could have been observed for other events, but this one generates the most changes in the internal reflection of the quorum model. This code path takes the AbstractQuorum lock first and then attempts to take the QuorumPipelineImpl lock.

        	at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:262) <== QuorumPipelineImpl lock.
        	at com.bigdata.ha.QuorumPipelineImpl.pipelineRemove(QuorumPipelineImpl.java:415)
        	at com.bigdata.quorum.AbstractQuorumMember.pipelineRemove(AbstractQuorumMember.java:300)
        	at com.bigdata.quorum.AbstractQuorum$QuorumWatcherBase.pipelineRemove(AbstractQuorum.java:2409) <== AbstractQuorum lock.
        	at com.bigdata.quorum.zk.ZKQuorumImpl$ZkQuorumWatcher.access$4300(ZKQuorumImpl.java:1244)
        	at com.bigdata.quorum.zk.ZKQuorumImpl$ZkQuorumWatcher$QuorumPipelineWatcher.remove(ZKQuorumImpl.java:1861)
        	at com.bigdata.quorum.zk.ZKQuorumImpl$ZkQuorumWatcher$AbstractSetQuorumWatcher.applyOrderedSetSemantics(ZKQuorumImpl.java:1721)
        	at com.bigdata.quorum.zk.ZKQuorumImpl$ZkQuorumWatcher$QuorumPipelineWatcher.handleEvent(ZKQuorumImpl.java:1851)
        	at com.bigdata.quorum.zk.ZKQuorumImpl$ZkQuorumWatcher$HandleEventRunnable.run(ZKQuorumImpl.java:1389)
        	at com.bigdata.quorum.zk.ZKQuorumImpl$ZkQuorumWatcher.accept(ZKQuorumImpl.java:1327)
        	at com.bigdata.quorum.zk.ZKQuorumImpl$ZkQuorumWatcher$AbstractSetQuorumWatcher.start(ZKQuorumImpl.java:1696)
        	at com.bigdata.quorum.zk.ZKQuorumImpl$ZkQuorumWatcher.setupWatchers(ZKQuorumImpl.java:1495)
        	at com.bigdata.quorum.zk.ZKQuorumImpl$ZkQuorumWatcher.handleExpired(ZKQuorumImpl.java:1429)
        

        The other code path takes the QuorumPipelineImpl lock first and then takes the AbstractQuorum lock.

        	at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:262) <== AbstractQuorum lock.
        	at com.bigdata.quorum.AbstractQuorum.getLeaderId(AbstractQuorum.java:1024)
        	at com.bigdata.quorum.AbstractQuorum.assertLeader(AbstractQuorum.java:1079)
        	at com.bigdata.quorum.AbstractQuorumMember.assertLeader(AbstractQuorumMember.java:216)
        	at com.bigdata.ha.QuorumPipelineImpl$RobustReplicateTask.assertQuorumState(QuorumPipelineImpl.java:878)
        	at com.bigdata.ha.QuorumPipelineImpl$RobustReplicateTask.innerReplicate(QuorumPipelineImpl.java:1070) <== QuorumPipelineImpl lock
        	at com.bigdata.ha.QuorumPipelineImpl$RobustReplicateTask.retrySend(QuorumPipelineImpl.java:1129)
        
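        To make the cycle concrete, below is a minimal, self-contained Java sketch of the inverted lock ordering. The class and lock names are stand-ins for the internal locks of AbstractQuorum and QuorumPipelineImpl (this is not the actual Blazegraph code); running it parks both threads, and jstack reports the same two-lock cycle shown above.

        import java.util.concurrent.locks.ReentrantLock;

        // Minimal stand-alone illustration of the inverted lock ordering. The lock
        // names below are stand-ins for the internal locks of AbstractQuorum and
        // QuorumPipelineImpl; this is not the actual Blazegraph code.
        public class LockOrderingDeadlockDemo {

            private static final ReentrantLock quorumLock = new ReentrantLock();   // stands in for the AbstractQuorum lock
            private static final ReentrantLock pipelineLock = new ReentrantLock(); // stands in for the QuorumPipelineImpl lock

            public static void main(final String[] args) {

                // Path 1 (watcher thread): quorum lock first, then pipeline lock, as in
                // QuorumWatcherBase.pipelineRemove() -> QuorumPipelineImpl.pipelineRemove().
                final Thread watcher = new Thread(new Runnable() {
                    public void run() {
                        quorumLock.lock();
                        try {
                            pause(100); // widen the race window
                            pipelineLock.lock(); // blocks: held by the replicate thread
                            try { /* apply the pipeline change */ } finally { pipelineLock.unlock(); }
                        } finally {
                            quorumLock.unlock();
                        }
                    }
                }, "watcher");

                // Path 2 (replication thread): pipeline lock first, then quorum lock, as in
                // RobustReplicateTask.innerReplicate() -> AbstractQuorum.getLeaderId().
                final Thread replicate = new Thread(new Runnable() {
                    public void run() {
                        pipelineLock.lock();
                        try {
                            pause(100);
                            quorumLock.lock(); // blocks: held by the watcher thread
                            try { /* assert leader, replicate the payload */ } finally { quorumLock.unlock(); }
                        } finally {
                            pipelineLock.unlock();
                        }
                    }
                }, "replicate");

                watcher.start();
                replicate.start(); // both threads park waiting on each other's lock
            }

            private static void pause(final long ms) {
                try {
                    Thread.sleep(ms);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
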
        bryanthompson added a comment -

        Martyn and I have discussed this deadlock in some depth. A few observations:


        - The basic issue is that the QuorumWatcher calls into the QuorumClient synchronously, either through the QuorumMember interfaces (e.g., QuorumPipelineImpl) or through the QuorumListener interface. Because those calls are synchronous, the called methods cannot reliably request locks without the potential of creating a deadlock.


        - ZKQuorumImpl could tunnel and obtain the QuorumPipelineImpl lock before applying the changes from the zk event. However, this is pretty ugly.


        - Another approach is to have the pipelineAdd(), pipelineRemove(), and pipelineChange() methods use non-blocking techniques. E.g., those methods could drop a message onto a queue. The write replication methods on that class could then spin through the queue, applying the messages once they take the QuorumPipelineImpl lock and then again when they release that lock. This behavior could be encapsulated in a lock() and unlock() pattern. This would decouple the pipeline change notification messages from the write replication messages while still maintaining the critical sections imposed by the QuorumPipelineImpl lock today. (A first sketch of this idea appears after this list.)


        - Martyn points out that we could use a ReadWriteLock in AbstractQuorum, taking only the read lock if we are testing the quorum state and taking the write lock if we are modifying it. This could potentially increase concurrency and seems reasonable. We would have to review the public API since a read lock cannot be upgraded into a write lock; the write lock would have to be taken at the top level, or we could create another kind of lock ordering problem. (A second sketch after this list illustrates this.)


        - We observed a potential issue where the last follower in the pipeline would hold the QuorumPipelineImpl lock while receiving the payload from the upstream service. We have pushed the logic into a static ReceiveTask (similar to the SendBufferTask and the ReceiveAndReplicateTask). The lock is now held only when setting up that task, but the task itself runs without the lock.

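        As a first sketch of the queue-based approach: the names below (QueuedPipelineChangesSketch, PipelineChange, drainDeferredChanges(), applyChange()) are hypothetical stand-ins, not the actual QuorumPipelineImpl API. The point is only that the watcher callbacks enqueue without blocking, while the deferred changes are applied inside the lock()/unlock() pattern used by the write replication path.

        import java.util.Queue;
        import java.util.UUID;
        import java.util.concurrent.ConcurrentLinkedQueue;
        import java.util.concurrent.locks.ReentrantLock;

        // Hypothetical sketch only; not the actual QuorumPipelineImpl API.
        class QueuedPipelineChangesSketch {

            enum ChangeType { ADD, REMOVE, CHANGE }

            static final class PipelineChange {
                final ChangeType type;
                final UUID serviceId;
                PipelineChange(final ChangeType type, final UUID serviceId) {
                    this.type = type;
                    this.serviceId = serviceId;
                }
            }

            private final ReentrantLock pipelineLock = new ReentrantLock();
            private final Queue<PipelineChange> deferredChanges = new ConcurrentLinkedQueue<PipelineChange>();

            // Invoked from the QuorumWatcher thread: never blocks on the pipeline lock.
            void pipelineRemove(final UUID serviceId) {
                deferredChanges.add(new PipelineChange(ChangeType.REMOVE, serviceId));
            }

            // lock()/unlock() pattern used by the write replication methods.
            void lock() {
                pipelineLock.lock();
                drainDeferredChanges(); // apply changes queued while the lock was not held
            }

            void unlock() {
                try {
                    drainDeferredChanges(); // apply changes queued during the critical section
                } finally {
                    pipelineLock.unlock();
                }
            }

            private void drainDeferredChanges() {
                PipelineChange c;
                while ((c = deferredChanges.poll()) != null) {
                    applyChange(c);
                }
            }

            private void applyChange(final PipelineChange c) {
                // placeholder: reconfigure the send/receive services while holding the lock
            }
        }

        And a second sketch of the ReadWriteLock alternative (again with hypothetical names). Because a read lock cannot be upgraded to a write lock, any path that mutates the quorum state has to take the write lock at the top level.

        import java.util.UUID;
        import java.util.concurrent.locks.ReadWriteLock;
        import java.util.concurrent.locks.ReentrantReadWriteLock;

        // Hypothetical sketch only; not the actual AbstractQuorum implementation.
        class QuorumStateReadWriteLockSketch {

            private final ReadWriteLock lock = new ReentrantReadWriteLock();
            private UUID leaderId; // guarded by lock

            UUID getLeaderId() { // read-only test of the quorum state
                lock.readLock().lock();
                try {
                    return leaderId;
                } finally {
                    lock.readLock().unlock();
                }
            }

            void setLeader(final UUID id) { // state change: write lock taken at the top level
                lock.writeLock().lock();
                try {
                    leaderId = id;
                } finally {
                    lock.writeLock().unlock();
                }
            }
        }
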
        bryanthompson added a comment -

        Martyn has traced the problem where a pipeline reorganization would sometimes leave the quorum not fully met to a sequence of pipeline events (a PIPELINE_ADD followed by a PIPELINE_CHANGE) such that the send/receive services wind up being torn down when there is an attempt to replicate a write. He has code that elides one of the queued pipeline events to work around the problem. We need to review that code and the pipeline event handling to validate the workaround or identify an alternative solution. A rough sketch of the elision idea follows.

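        As an illustration of eliding a queued pipeline event: the event types and the specific rule shown here (dropping a PIPELINE_CHANGE that immediately follows a PIPELINE_ADD for the same service) are assumptions made for this sketch only; the actual rule lives in QuorumPipelineImpl (elideEvents(), see the commit notes in the next comment).

        import java.util.Deque;
        import java.util.Iterator;
        import java.util.UUID;

        // Hypothetical sketch only; not the actual elideEvents() implementation.
        final class PipelineEventElisionSketch {

            enum EventType { PIPELINE_ADD, PIPELINE_REMOVE, PIPELINE_CHANGE }

            static final class Event {
                final EventType type;
                final UUID serviceId;
                Event(final EventType type, final UUID serviceId) {
                    this.type = type;
                    this.serviceId = serviceId;
                }
            }

            static void elideEvents(final Deque<Event> queue) {
                final Iterator<Event> it = queue.iterator();
                if (!it.hasNext())
                    return;
                final Event first = it.next();
                if (!it.hasNext())
                    return;
                final Event second = it.next();
                if (first.type == EventType.PIPELINE_ADD
                        && second.type == EventType.PIPELINE_CHANGE
                        && first.serviceId.equals(second.serviceId)) {
                    // Elide the redundant CHANGE so that applying the queue does not tear
                    // down the send/receive services while a write is being replicated.
                    it.remove(); // removes the element last returned by next(), i.e. second
                }
            }
        }
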
        bryanthompson added a comment -

        See [1,2]. Reconciled edits with Martyn, primarily with respect to (a) elision of pipeline change events in QuorumPipelineImpl to resolve a deadlock [2]; and (b) closing a concurrency hole when a service joins with a met quorum after the GATHER and before the PREPARE.

        Committed revision r7206.

        - QuorumPipelineImpl : incorporated changes for event elision. This
          provides a fix for "HAJournalServer deadlock: pipelineRemove() and
          getLeaderId()" (http://sourceforge.net/apps/trac/bigdata/ticket/681).
          I also fixed whitespace formatting for the indentation in
          elideEvents().  This was some odd mixture of tabs and spaces that
          was causing indent rendering issues.
        
        - BufferedWrite: Ignored add of unused logger.
        
        - WriteCacheService : did not accept log @ ERROR messages that were
          designated as "REMOVE TRACE".
        
        - HA messages: modified some IHAMessages that did not override
          toString() to provide useful representations in log output.
        
        - IHAAwaitServiceJoinRequest : javadoc and copyright notice.
        
        - IHAWriteMessage : javadoc update for the commit counter semantics.
        
        - HAReceiveService: added readFuture reference into the log message
          for changeDownstream.
        
        - HASendService: incorporated tryno information into channel reopen
          log messages per mgc.
        
        - HAGlue : folded in changes to awaitServiceJoin(), specifically it
          will now return the most recent consensus release time on the
          leader.  This closes a hole when a service joins after the GATHER
          and before the PREPARE.  By using the leader's consensus release
          time, the service will not permit transactions to start against a
          commit point that has been recycled by the leader.
        
        - AbstractJournal: reconciled.
        
          - commitNow() was reusing the nonJoinedServices definition from the
            GATHER.  I added a JoinedAndNonJoinedServices helper class.
            Distinct instances of this class are now used for the GATHER and
            for the PREPARE/COMMIT.
        
          - doLocalCommit(): I did not pick up the 2nd half of this
            if/then/else.  Why is it there?
        
                    if ((shouldFlush || true) && doubleSync) {
        
                        _bufferStrategy.force(false/* metadata */);
        
                    } else {
                        if (log.isInfoEnabled())
                            log.info("ALWAYS FORCE TO DISK");
                        _bufferStrategy.force(false/* metadata */);
                    }
        
        - AbstractHATransactionService: changed updateReleaseTime() method
          into an override of setReleaseTime() which is public (versus
          protected) in order to expose this to the HAJournalServer.
        
        - Journal: 
        
            - accepted implementation of updateReleaseTime() method from
              AbstractHATransactionService.
        
            - modified runWithBarrierLock() to log around the critical section
              so we can observe when this section runs and finishes.
        
        - HAJournal: Modified to log @ INFO the request and result for an HA
          digest.  
        
        - HAJournalServer: reconciled all changes.
        
          - ServiceLeaveTask: Yes, it should use quorum.token(), not
            NO_QUORUM. There is a difference between the quorum token (which
            just reflects zookeeper) and the journal token (which reflects
            zookeeper plus whether or not the local journal is HAReady).
          
                        journal.setQuorumToken(getQuorum().token()); // This is correct.
        
        		   - vs -
        
                        journal.setQuorumToken(Quorum.NO_QUORUM); // This is wrong.
        
        - TestHA3JournalServer: reconciled numerous changes.
        
          - Modified code that was using a sleep to (presumably) wait until
            the current quorum broke to instead just await the next quorum
            meet.
        
        //        Thread.sleep(50); // wait a while for A to shutdown
        //
        //      final long token2 = quorum.awaitQuorum(awaitQuorumTimeout * 2,
        //      TimeUnit.MILLISECONDS);
        
                // Wait for the next quorum meet on (token+1).
                final long token2 = awaitNextQuorumMeet(token);
            
        - AbstractHA3JournalServerTestCase: converted stderr to log @ INFO.
        
        - AbstractHAJournalServerTestCase: accepted log message @ INFO, but
          made it conditional.
        
        Test failures:
        
        - The 4 "overrides" tests need to be revisited. They still fail.
        
        - testQuorumABC_HAStatusUpdatesWithFailovers: The problem appears to
          be that [leader] can not be compared with [serverB] and [serverC]
          using reference testing (==). [leader] has the same data in this
          case as [serverC] (same UUID, same TcpEndpoint).
        
          junit.framework.AssertionFailedError: Did not elect leader consistent with expectations:
           leader=Proxy[HAGlue,BasicInvocationHandler[BasicObjectEndpoint[b44e72e1-dd5c-4dbc-9640-129bdab11007,TcpEndpoint[192.168.1.135:55983]]]],
          serverB=Proxy[HAGlue,BasicInvocationHandler[BasicObjectEndpoint[073e0614-26a6-49be-83f4-381ce6338306,TcpEndpoint[192.168.1.135:55965]]]],
          serverC=Proxy[HAGlue,BasicInvocationHandler[BasicObjectEndpoint[b44e72e1-dd5c-4dbc-9640-129bdab11007,TcpEndpoint[192.168.1.135:55983]]]]
        	at junit.framework.Assert.fail(Assert.java:47)
        	at com.bigdata.journal.jini.ha.TestHA3JournalServer.testQuorumABC_HAStatusUpdatesWithFailovers(TestHA3JournalServer.java:2946)
        
        - testStressStartAB_C_MultiTransactionResync_200_5: I have observed a
          failure of this test (after 33 out of 50 trials). A subsequent run
          of 50 trials succeeded.  Good, but not perfect.
        
          junit.framework.AssertionFailedError: Fail after 33 trials : java.util.concurrent.TimeoutException
        	at junit.framework.TestCase2.fail(TestCase2.java:90)
        	at com.bigdata.journal.jini.ha.TestHA3JournalServer.testStressStartAB_C_MultiTransactionResync_200_5(TestHA3JournalServer.java:620)
        
        - testStress_RebuildWithPipelineReorganisation: failed on 7th run.
        
          java.lang.RuntimeException: junit.framework.AssertionFailedError
        	at com.bigdata.io.TestCase3.assertCondition(TestCase3.java:250)
        	at com.bigdata.journal.jini.ha.AbstractHA3JournalServerTestCase.awaitFullyMetQuorum(AbstractHA3JournalServerTestCase.java:1990)
        	at com.bigdata.journal.jini.ha.TestHA3JournalServer.testStartABC_RebuildWithPipelineReorganisation(TestHA3JournalServer.java:1071)
        	at com.bigdata.journal.jini.ha.TestHA3JournalServer._testStress_RebuildWithPipelineReorganisation(TestHA3JournalServer.java:1090)
        

        [1] https://sourceforge.net/apps/trac/bigdata/ticket/530 (Journal HA)
        [2] https://sourceforge.net/apps/trac/bigdata/ticket/681 (HAJournalServer deadlock: pipelineRemove() and getLeaderId())

        bryanthompson added a comment -

        This issue has been resolved by (a) a refactoring of QuorumPipelineImpl to queue and then process the events; and (b) eliding events. See the notes above.


          People

          • Assignee: bryanthompson
          • Reporter: bryanthompson