  1. Blazegraph (by SYSTAP)
  2. BLZG-809

HAJournalServer needs to handle ZK client connection loss.

    Details

      Description

      Bouncing the ZK client connection leaves the reflected state maintained by ZKQuorumImpl out of sync with the state in zookeeper. Some events can be lost, and none of the events for the removal of this service's ephemeral znodes will be observed. Because of the lost events, actions such as <code>conditionalWithdrawVoteImpl</code> can deadlock: the local reflection of the quorum state never observes the actual quorum state changes. Handling these problems correctly requires special consideration.
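
      The divergence described above can be illustrated with a minimal, JDK-only simulation. This is a sketch, not the ZKQuorumImpl code: FakeZk, LocalReflection, BounceDemo and the znode path are hypothetical names invented for illustration, and no real ZooKeeper API is used.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical stand-in for the authoritative state held in zookeeper. */
final class FakeZk {
    final Set<String> znodes = new TreeSet<>();
}

/** Hypothetical local reflection: it changes only when a watcher event is delivered. */
final class LocalReflection {
    final Set<String> members = new TreeSet<>();
    boolean connected = true;

    // Events that arrive while the client is disconnected are silently lost.
    void onCreated(String path) { if (connected) members.add(path); }
    void onDeleted(String path) { if (connected) members.remove(path); }
}

final class BounceDemo {
    /** Returns the znodes the local view still believes in but that are gone from zookeeper. */
    static Set<String> staleAfterBounce() {
        FakeZk zk = new FakeZk();
        LocalReflection local = new LocalReflection();
        String path = "/quorum/member/A"; // hypothetical path, for illustration only

        // The service joins: its ephemeral znode is created and the event is observed.
        zk.znodes.add(path);
        local.onCreated(path);

        // The client connection bounces: zookeeper drops the ephemeral znode,
        // but the corresponding delete event never reaches the local reflection.
        local.connected = false;
        zk.znodes.remove(path);
        local.onDeleted(path);

        // The client reconnects, but nothing replays the missed event.
        local.connected = true;

        Set<String> stale = new TreeSet<>(local.members);
        stale.removeAll(zk.znodes);
        return stale;
    }
}
```

      Any action that waits for the local view to reflect the removal would wait indefinitely in this model, which is the shape of the deadlock noted for <code>conditionalWithdrawVoteImpl</code>.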

      There are two unit tests which simulate the loss of the ZK client connection:

      - testStartAB_BounceFollower()
      - testStartAB_BounceLeader()
      

      We need to extend these tests to HA3 configurations as well.

      We have looked at a few possible approaches to fix this. Generating mock events to inform the service about its own non-existence seems to be the simplest. We need to investigate the code in ZKQuorumImpl's watcher that handles reconnects, along with the zk accessor pattern, to identify a clear location where we can intercept the event and intervene.
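
      A minimal sketch of the mock-event idea follows, using only the JDK. It assumes that the intercept point can call a handler when the session is lost, and that teardown should run in the reverse of the order member add, pipeline add, cast vote, service join. MockEventSketch, Kind, Event, Reflection and onSessionExpired are hypothetical names, not the actual Blazegraph types.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class MockEventSketch {

    /** Hypothetical event kinds, listed in the assumed teardown order. */
    enum Kind { SERVICE_LEAVE, VOTE_WITHDRAWN, PIPELINE_REMOVE, MEMBER_REMOVE }

    static final class Event {
        final Kind kind;
        final String serviceId;
        Event(Kind kind, String serviceId) { this.kind = kind; this.serviceId = serviceId; }
    }

    /** Hypothetical local reflection of the quorum state. */
    static final class Reflection {
        final Set<String> members = new LinkedHashSet<>();
        final Set<String> pipeline = new LinkedHashSet<>();
        final Set<String> votes = new LinkedHashSet<>();
        final Set<String> joined = new LinkedHashSet<>();

        /** Single entry point, so mock events and real watcher events share one code path. */
        void apply(Event e) {
            switch (e.kind) {
                case SERVICE_LEAVE:   joined.remove(e.serviceId);   break;
                case VOTE_WITHDRAWN:  votes.remove(e.serviceId);    break;
                case PIPELINE_REMOVE: pipeline.remove(e.serviceId); break;
                case MEMBER_REMOVE:   members.remove(e.serviceId);  break;
            }
        }
    }

    /** Builds the events that were never delivered for this service's own ephemeral znodes. */
    static List<Event> mockEventsForSelf(Reflection r, String self) {
        List<Event> out = new ArrayList<>();
        if (r.joined.contains(self))   out.add(new Event(Kind.SERVICE_LEAVE, self));
        if (r.votes.contains(self))    out.add(new Event(Kind.VOTE_WITHDRAWN, self));
        if (r.pipeline.contains(self)) out.add(new Event(Kind.PIPELINE_REMOVE, self));
        if (r.members.contains(self))  out.add(new Event(Kind.MEMBER_REMOVE, self));
        return out;
    }

    /** Called from the (assumed) intercept point once the session is known to be lost. */
    static List<Event> onSessionExpired(Reflection r, String self) {
        List<Event> events = mockEventsForSelf(r, self);
        for (Event e : events) {
            r.apply(e);
        }
        return events;
    }
}
```

      Routing the mock events through the same apply method as real events means that waiters on the local reflection would be released by the same mechanism either way. Whether the real watcher offers such a single intercept point is exactly what still needs to be investigated.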

      Tearing down the Quorum or HAQuorumService by themselves does not look promising. The next level of escalation would be to tear down the HAJournalServer to the point where we would re-open the journal and reconfigure everything. One concern with this approach is that the NSS or exported proxy would also be torn down.

            People

            Assignee:
            martyncutcher
            Reporter:
            bryanthompson