Details

      Description

      The RESYNC logic can fail, causing a transition to the OPERATOR state. We have analyzed this failure mode. It appears when the quorum leader does not have a live HALog file open and the service in RESYNC is at the same commit point as the quorum leader. The service in RESYNC then requests the HALog file for commitCounter+1, which would be the live HALog. Since that live HALog file is missing, the service transitions to the OPERATOR state. Note that new writes on the leader would cause the live HALog file to spring into existence, at which point at attempt to RESYNC would succeed.

        Activity

        Hide
        bryanthompson bryanthompson added a comment -

        The root cause that abort2Phase() discards the live HALog file and does not immediately re-create that HALog file. Given A + B + C, if all three services are at the same commit point and there is an attempt to commit which fails, causing abort2Phase() to be invoked. If C is then shutdown it will fail to RESYNC (per this ticket) since the HALog file is missing on the leader (A).

        abort2Phase() called doLocalAbort() which called through to discardWriteSet(), which was calling disableHALog() and deleting the live HALog file. We have modified discardWriteSet() to create the new live HALog file IF the service is joined with the met quorum at the time that discardWriteSet() was invoked. This should fix the problem with abort2Phase() leaving the leader without a live HALog file.

        We have also modified getHARootBlocksForWriteSet() to conditionally create the live HALog file if it does not exist. This is the method that is invoked by the RESYNC task. Thus, by ensuring that the live HALog file exists here we can make RESYNC much more robust. We believe that a data race exists between when a service is elected as the leader and when the HALog file would normally be created. It is possible that a 3rd service in a simultaneous restart of {A,B,C} could attempt to obtain the live HALog file before the leader would have otherwise created it. Explicitly creating the live HALog within getHARootBlocksForWriteSet() closes out this possible deadlock.


        - conditionalCreateHALog() was moved to HALogNexus.


        - AbstractJournal._abort() was made robust to the case were the HAQuorumService was not running.


        - HALogWriter: there were some methods that failed to obtain the m_stateLock before examining fields that were protected by that lock. Those methods included: getCommitCounter(), getSequence(), and isHALogOpen().


        - CommitCounterUtility: exposed method to format the commitCounter as a 21-digit string with leading zeros.


        - HAStatuServlet: modified to disclose the most current historical HALog file (if any) and the live HALog file (if it exists). This was done to help diagnose the problem associated with this ticket.


        - HAJournalServer::ResyncTask().doRun()
        - remove the conditionalCreateHALog() invocation. There was no reason to create the live HALog on the service that was trying to resync. The code comment indicated that this might have been done because of test suite expectations, but the test suite runs fine without this.


        - HA CI test suite: Added a number of startABC_restartX tests in an effort to recreate the problem associated with this ticket. These tests did not managed to recreate the problem, but they are being retained because I do not see obvious overlap in the existing test suite with these tests.


        - TestHAJournalServerOverride.testStartABC_abort2Phase_restartC() :: Created a unit test that replicates the problem against the revision of the code before this commit. This test can be observed to fail if you disable the HALog create in both HAJournal.getHARootBlocksForWriteSet() and HAJournalServer.discardWriteSet().

        Note: If this solution proves to be robust, then we could consider re-enabling the automatic logic to do a disaster rebuild rather than transitioning to an OPERATOR state. However, transitioning to an OPERATOR state is perhaps wise regardless.

        HA CI is locally green.

        Committed revision r7508.

        Show
        bryanthompson bryanthompson added a comment - The root cause that abort2Phase() discards the live HALog file and does not immediately re-create that HALog file. Given A + B + C, if all three services are at the same commit point and there is an attempt to commit which fails, causing abort2Phase() to be invoked. If C is then shutdown it will fail to RESYNC (per this ticket) since the HALog file is missing on the leader (A). abort2Phase() called doLocalAbort() which called through to discardWriteSet(), which was calling disableHALog() and deleting the live HALog file. We have modified discardWriteSet() to create the new live HALog file IF the service is joined with the met quorum at the time that discardWriteSet() was invoked. This should fix the problem with abort2Phase() leaving the leader without a live HALog file. We have also modified getHARootBlocksForWriteSet() to conditionally create the live HALog file if it does not exist. This is the method that is invoked by the RESYNC task. Thus, by ensuring that the live HALog file exists here we can make RESYNC much more robust. We believe that a data race exists between when a service is elected as the leader and when the HALog file would normally be created. It is possible that a 3rd service in a simultaneous restart of {A,B,C} could attempt to obtain the live HALog file before the leader would have otherwise created it. Explicitly creating the live HALog within getHARootBlocksForWriteSet() closes out this possible deadlock. - conditionalCreateHALog() was moved to HALogNexus. - AbstractJournal._abort() was made robust to the case were the HAQuorumService was not running. - HALogWriter: there were some methods that failed to obtain the m_stateLock before examining fields that were protected by that lock. Those methods included: getCommitCounter(), getSequence(), and isHALogOpen(). - CommitCounterUtility: exposed method to format the commitCounter as a 21-digit string with leading zeros. - HAStatuServlet: modified to disclose the most current historical HALog file (if any) and the live HALog file (if it exists). This was done to help diagnose the problem associated with this ticket. - HAJournalServer::ResyncTask().doRun() - remove the conditionalCreateHALog() invocation. There was no reason to create the live HALog on the service that was trying to resync. The code comment indicated that this might have been done because of test suite expectations, but the test suite runs fine without this. - HA CI test suite: Added a number of startABC_restartX tests in an effort to recreate the problem associated with this ticket. These tests did not managed to recreate the problem, but they are being retained because I do not see obvious overlap in the existing test suite with these tests. - TestHAJournalServerOverride.testStartABC_abort2Phase_restartC() :: Created a unit test that replicates the problem against the revision of the code before this commit. This test can be observed to fail if you disable the HALog create in both HAJournal.getHARootBlocksForWriteSet() and HAJournalServer.discardWriteSet(). Note: If this solution proves to be robust, then we could consider re-enabling the automatic logic to do a disaster rebuild rather than transitioning to an OPERATOR state. However, transitioning to an OPERATOR state is perhaps wise regardless. HA CI is locally green. Committed revision r7508.

          People

          • Assignee:
            bryanthompson bryanthompson
            Reporter:
            bryanthompson bryanthompson
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved: