Details

      Description

      There is a stochastic failure mode in the HA3 test suite when starting 3 services concurrently.

        Activity

        bryanthompson added a comment -

        The change set above led to a code path in which replicateAndApplyWriteSet() could fail to transition to RunMet, as described in this code comment:

                            /*
                             * This can happen if there is a data race with a live write
                             * that is the first write cache block for the write set
                             * that we would replicate from the ResyncTask. In this
                             * case, we have lost the race to the live write and this
                             * service has already joined as a follower. We can safely
                             * return here since the test in this if() is the same as
                             * the condition variable in the loop for the ResyncTask.
                             * 
                             * @see #resyncTransitionToMetQuorum()
                             */
        
                            if (journal.getHAReady() != token) {
                                /*
                                 * Service must be HAReady before exiting RESYNC
                                 * normally.
                                 */
                                throw new AssertionError();
                            }
        
        

        The code used to do a return at this point, and the loop in the caller would then terminate because the service was joined. Since the loop in the caller is now while(true) {...}, this code path must make an explicit transition into another run state (RunMet).

        Note that the code path does verify that the service is not only JOINED but also HAReady.

        Therefore, all we need to do is add the transition to RunMet:

                            // Transition to RunMet.
                            enterRunState(new RunMetTask(token, leaderId));
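
        To make the control flow concrete, here is a small, self-contained sketch (not the actual HAJournalServer code) of a while(true) resync loop in which the lost-race branch must hand off explicitly via enterRunState(new RunMetTask(...)) rather than simply returning. Everything other than the enterRunState/RunMetTask/getHAReady/replicateAndApplyWriteSet names is a hypothetical stand-in.

            import java.util.UUID;

            // Self-contained sketch only: stand-ins for the real HAJournalServer/HAJournal state.
            public class ResyncLoopSketch {

                /** Stand-in for the run-state task that the service transitions into. */
                static final class RunMetTask {
                    final long token;
                    final UUID leaderId;
                    RunMetTask(long token, UUID leaderId) { this.token = token; this.leaderId = leaderId; }
                }

                private volatile long haReadyToken = -1L; // token for which the journal is HAReady
                private volatile boolean joined = false;  // has this service joined the met quorum?

                private long getHAReady() { return haReadyToken; }

                private void enterRunState(RunMetTask task) {
                    System.out.println("entering RunMet for token=" + task.token);
                }

                /** One catch-up pass; here it just simulates losing the race to a live write. */
                private void replicateAndApplyWriteSet(long token, UUID leaderId) {
                    joined = true;
                    haReadyToken = token;
                }

                void resyncLoop(final long token, final UUID leaderId) {
                    while (true) {
                        if (joined) {
                            if (getHAReady() != token) {
                                // Service must be HAReady before exiting RESYNC normally.
                                throw new AssertionError();
                            }
                            // A bare return would leave the service stuck in the RESYNC run
                            // state, so hand off explicitly to RunMet instead.
                            enterRunState(new RunMetTask(token, leaderId));
                            return;
                        }
                        replicateAndApplyWriteSet(token, leaderId);
                    }
                }

                public static void main(String[] args) {
                    new ResyncLoopSketch().resyncLoop(1L, UUID.randomUUID());
                }
            }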
        

        Committed revision r7288.

        bryanthompson added a comment -


        - Why is the barrier count not reflecting the leader's consensus vote? nparties 2 versus 3? (Because only follower responses are counted here; I have renamed the responses field to followerResponses to clarify this and also updated the logged message. The leader-side barrier handling is sketched after this list.)

        - What is the impact of the mock notify message? That the consensus release time cannot advance? (We now explicitly mark the mock GATHER responses and then ignore them on the leader. Followers that provide a mock GATHER response will vote NO for the PREPARE. Also added a unit test for an ABC() simultaneous restart once the services already have some data and are not at commitCounter=0.)

        - Why is the leader reporting that it is forcing a barrier break in messageFollowers()? (Not sure. Added a workaround using "consensus==null" as the test condition to drive the barrier.reset() call.)

        - What happened to the NotReady exception? (It will cause the follower to fail in prepare2Phase, but we never get past the flush() in this test run.)

        - Why did the commit never finish? (We were stuck in flush(). Not sure why.)

        - Modified QuorumPipelineImpl.retrySend() to log @ WARN if we have to do a retrySend(). (Note that it will also log @ ERROR if RobustReplicateTask.call() is unable to send, whether or not we then transition into retrySend().)
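
        For the first three points, a minimal, self-contained sketch of the leader-side GATHER barrier under stated assumptions: the barrier parties cover only the followers (hence nparties 2 for a 3-service quorum), mock responses are flagged and ignored when computing the consensus release time, a follower that supplied a mock response votes NO at PREPARE, and consensus==null drives the barrier.reset() call. Apart from followerResponses, the consensus==null test, and java.util.concurrent.CyclicBarrier, the names here are hypothetical stand-ins rather than the real Bigdata classes.

            import java.util.Map;
            import java.util.UUID;
            import java.util.concurrent.ConcurrentHashMap;
            import java.util.concurrent.CyclicBarrier;

            // Hypothetical sketch of the leader-side GATHER barrier; not the real Bigdata code.
            public class GatherBarrierSketch {

                /** A follower's response to the GATHER message. */
                static final class GatherResponse {
                    final UUID serviceId;
                    final long releaseTime;
                    final boolean mock; // explicitly marked mock responses are ignored by the leader
                    GatherResponse(UUID serviceId, long releaseTime, boolean mock) {
                        this.serviceId = serviceId;
                        this.releaseTime = releaseTime;
                        this.mock = mock;
                    }
                }

                private final CyclicBarrier barrier;
                private final Map<UUID, GatherResponse> followerResponses = new ConcurrentHashMap<>();
                private volatile Long consensus = null;

                GatherBarrierSketch(final int replicationFactor) {
                    // Only follower responses are counted, so nparties is 2 for a 3-service quorum.
                    final int nparties = replicationFactor - 1;
                    this.barrier = new CyclicBarrier(nparties, this::computeConsensus);
                }

                /** Barrier action: compute the consensus release time from non-mock responses only. */
                private void computeConsensus() {
                    long min = Long.MAX_VALUE;
                    int counted = 0;
                    for (GatherResponse r : followerResponses.values()) {
                        if (r.mock) continue; // ignore mock GATHER responses on the leader
                        min = Math.min(min, r.releaseTime);
                        counted++;
                    }
                    if (counted > 0) {
                        consensus = min;
                    } else {
                        consensus = null;
                    }
                }

                /** Called (conceptually via RMI) as each follower reports its release time. */
                void onFollowerResponse(final GatherResponse r) throws Exception {
                    followerResponses.put(r.serviceId, r);
                    barrier.await();
                }

                /** If no consensus could be computed, break the barrier explicitly. */
                void afterGather() {
                    if (consensus == null) {
                        barrier.reset(); // the consensus==null workaround described above
                    }
                }

                /** A follower that had to supply a mock GATHER response votes NO for the PREPARE. */
                static boolean prepareVote(final boolean suppliedMockGatherResponse) {
                    return !suppliedMockGatherResponse;
                }
            }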

        Committed revision r7289.

        bryanthompson added a comment -

        - done: Tag the Gather messages with the commitCounter and a one-up gather attempt number (or the timestamp on the leadersValue) and verify that the gather task in prepare2Phase() was the RIGHT gather task. (newCommitCounter and newCommitTime are now passed into the GATHER protocol. If the follower is not at the expected commit counter (newCommitCounter-1) then the GatherTask will fail. Also, if the leader receives a response from a GatherTask that is for the wrong newCommitCounter or newCommitTime then the GATHER will fail. This check is sketched below.)
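
        A minimal sketch of that tagging check, under assumptions: the request/response types here are hypothetical stand-ins for the real GATHER messages, and only newCommitCounter, newCommitTime, and the (newCommitCounter-1) follower check come from the description above.

            // Hypothetical sketch of the GATHER tagging: the leader stamps each GATHER with the
            // commit point it is preparing; both sides fail fast on a mismatch. Not the real classes.
            public class GatherTagSketch {

                /** Request sent by the leader to start a GatherTask on a follower. */
                static final class GatherRequest {
                    final long newCommitCounter;
                    final long newCommitTime;
                    GatherRequest(long newCommitCounter, long newCommitTime) {
                        this.newCommitCounter = newCommitCounter;
                        this.newCommitTime = newCommitTime;
                    }
                }

                /** Response from the follower, echoing the tag it was given. */
                static final class GatherResponse {
                    final long newCommitCounter;
                    final long newCommitTime;
                    final long pinnedReleaseTime;
                    GatherResponse(long newCommitCounter, long newCommitTime, long pinnedReleaseTime) {
                        this.newCommitCounter = newCommitCounter;
                        this.newCommitTime = newCommitTime;
                        this.pinnedReleaseTime = pinnedReleaseTime;
                    }
                }

                /** Follower side: refuse to run a GatherTask for the wrong commit point. */
                static GatherResponse runGatherTask(final GatherRequest req, final long currentCommitCounter) {
                    if (currentCommitCounter != req.newCommitCounter - 1) {
                        throw new IllegalStateException("follower at commitCounter=" + currentCommitCounter
                                + " but GATHER is for newCommitCounter=" + req.newCommitCounter);
                    }
                    final long pinnedReleaseTime = 0L; // ... compute this follower's release time ...
                    return new GatherResponse(req.newCommitCounter, req.newCommitTime, pinnedReleaseTime);
                }

                /** Leader side: reject a response that is tagged for the wrong commit point. */
                static void acceptResponse(final GatherRequest req, final GatherResponse resp) {
                    if (resp.newCommitCounter != req.newCommitCounter
                            || resp.newCommitTime != req.newCommitTime) {
                        throw new IllegalStateException("GATHER response for the wrong commit point");
                    }
                    // ... otherwise fold resp.pinnedReleaseTime into the consensus calculation ...
                }
            }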


        - done: We were failing to check the Future of the RMI requests to start a GatherTask on the follower. This led to a deadlock in one of the testStartAB_C_MultiTransactionResync_0tx_then_500ms_delay test runs. This was fixed by checking the Future for the RMI in the monitoring Runnable. If any RMI fails, then the GATHER is cancelled. (This monitoring is sketched below.)
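
        And a minimal sketch of that fix, under assumptions: the Futures for the RMI requests are checked from a monitoring task rather than fired and forgotten, and any RMI failure cancels the GATHER. The executor wiring, the Gather interface, and the method names are hypothetical.

            import java.util.ArrayList;
            import java.util.List;
            import java.util.concurrent.ExecutorService;
            import java.util.concurrent.Executors;
            import java.util.concurrent.Future;

            // Hypothetical sketch: check each RMI Future so a failed request cannot leave the
            // GATHER waiting forever on a barrier party that will never arrive.
            public class GatherRmiMonitorSketch {

                interface Gather {
                    void cancel(String reason);
                }

                static void startGather(final List<Runnable> rmiCalls, final Gather gather) {
                    final ExecutorService executor = Executors.newCachedThreadPool();
                    try {
                        final List<Future<?>> rmiFutures = new ArrayList<>();
                        for (Runnable rmi : rmiCalls) {
                            rmiFutures.add(executor.submit(rmi)); // issue the per-follower RMIs
                        }
                        // Monitoring runnable: verify that each RMI actually succeeded.
                        executor.submit(() -> {
                            for (Future<?> f : rmiFutures) {
                                try {
                                    f.get(); // throws if the RMI to that follower failed
                                } catch (Exception e) {
                                    gather.cancel("RMI to follower failed: " + e);
                                    return;
                                }
                            }
                        });
                    } finally {
                        executor.shutdown();
                    }
                }
            }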

        Committed revision r7290.

        bryanthompson added a comment -

        Diagnosed an error with Martyn where the initial KB create failed. The root cause was a failure of the BigdataRDFServletContext to wait until the leader was HAReady before attempting to create the KB. The request to create the KB was concurrent with an abort() that was done by the leader as part of its transition into the HAReady(Leader) state. The fix was made to CreateKB.
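
        A minimal sketch of the await-then-create pattern implied by that fix, under assumptions: block until the service reports HAReady before attempting the KB create, so the create cannot race with the leader's internal abort(). The HAJournal interface, the awaitHAReady signature, and the timeout are hypothetical stand-ins rather than the real BigdataRDFServletContext/CreateKB API.

            import java.util.concurrent.TimeUnit;
            import java.util.concurrent.TimeoutException;

            // Hypothetical sketch: do not attempt the default KB create until the leader is HAReady.
            public class CreateKBSketch {

                interface HAJournal {
                    /** Blocks until the service is HAReady, returning the quorum token. */
                    long awaitHAReady(long timeout, TimeUnit unit) throws InterruptedException, TimeoutException;
                }

                static void createDefaultKB(final HAJournal journal, final String namespace)
                        throws InterruptedException, TimeoutException {
                    // Wait for HAReady(Leader) first; previously the create ran concurrently
                    // with the abort() issued by the leader during that transition and failed.
                    final long token = journal.awaitHAReady(30, TimeUnit.SECONDS);
                    // ... now safe to create the KB instance for the namespace at this token ...
                    System.out.println("creating KB '" + namespace + "' at token " + token);
                }
            }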

        Committed revision r7293.

        bryanthompson added a comment -

        We have had three CI runs in a row in which the only HA test failures are ones attributed to other problems. Those test failures are:

        com.bigdata.journal.jini.ha.TestHAJournalServerOverride.testStartABC_userLevelAbortDoesNotCauseQuorumBreak - not implemented
        com.bigdata.journal.jini.ha.TestHAJournalServerOverride.testStartABC_commit2Phase_B_fails - test needs to be updated.
        com.bigdata.journal.jini.ha.TestHAJournalServerOverride.testStartAB_BounceLeader - see BLZG-809.
        

        This ticket is closed pending new errors from CI.


          People

          • Assignee: martyncutcher
          • Reporter: bryanthompson
          • Votes: 0
          • Watchers: 2
