Blazegraph (by SYSTAP) / BLZG-786

HAJournalServer reports "follower" but is in SeekConsensus and is not participating in commits.

    Details

      Description

      A problem has been observed where the 2nd follower in an HA3 cluster is not participating in commits: its quorum service is in SeekConsensus and stuck at commit point 84 while the rest of the cluster has moved on, yet it still reports itself as HAReady and as a Follower.

      Quorum Services:

      http://bigdata17.bigdata.com:8090 : leader, pipelineOrder=0, writePipelineAddr=/192.168.1.17:9090, service=other, extendedRunState={server=Running, quorumService=RunMet @ 8767, haReady=110, haStatus=Leader, serviceId=393e60b6-7cba-42c8-a9b4-254b2a0d209a, now=1372865255795}
      http://bigdata16.bigdata.com:8090 : follower, pipelineOrder=1, writePipelineAddr=/192.168.1.16:9090, service=other, extendedRunState={server=Running, quorumService=RunMet @ 8767, haReady=110, haStatus=Follower,  serviceId=597c9d44-fe4f-4ff1-b80e-93947aa50612, now=1372865255663}
      http://bigdata15.bigdata.com:8090 : follower, pipelineOrder=2, writePipelineAddr=/192.168.1.15:9090, service=self, extendedRunState={server=Running, quorumService=SeekConsensus @ 84, haReady=110, haStatus=Follower, serviceId=e2e2db6c-0690-4d26-8227-f348ab5ed0c6, now=1372865255605}
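
The inconsistency in the listing above can be detected mechanically. The following is a minimal sketch, not part of the Blazegraph code base: the class and method names are hypothetical, and it assumes only the `quorumService=...` and `haStatus=...` fields exactly as they appear in the status lines above. It flags any service that reports Leader or Follower while its quorum service is in a run state other than RunMet.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for scanning status-page lines; not a Blazegraph class.
class QuorumStatusCheck {

    // Field formats are taken from the extendedRunState text in this report.
    private static final Pattern RUN_STATE = Pattern.compile("quorumService=(\\w+)");
    private static final Pattern HA_STATUS = Pattern.compile("haStatus=(\\w+)");

    // Returns the first captured group, or null if the field is absent.
    static String field(Pattern p, String line) {
        Matcher m = p.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    // True when the line claims a joined role (Leader/Follower)
    // but the quorum service is not in RunMet.
    static boolean isInconsistent(String line) {
        String runState = field(RUN_STATE, line);
        String haStatus = field(HA_STATUS, line);
        if (runState == null || haStatus == null) {
            return false; // not a status line we understand
        }
        boolean joinedRole = haStatus.equals("Leader") || haStatus.equals("Follower");
        return joinedRole && !runState.equals("RunMet");
    }

    public static void main(String[] args) {
        String line = "extendedRunState={server=Running, "
                + "quorumService=SeekConsensus @ 84, haReady=110, haStatus=Follower}";
        System.out.println(isInconsistent(line));
    }
}
```

Applied to the three lines above, only the bigdata15 entry (SeekConsensus @ 84 with haStatus=Follower) would be flagged.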
      

      The root cause is probably related to how we manage the transition from RunMet to a service leave in preparation for SeekConsensus. We need to develop test coverage around these abnormal transitions. That coverage can be provided by overriding the HAGlue RMI interface under test suite control. A means to write such tests has already been implemented in the TestHAJournalServerOverrides test suite.
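
To illustrate the general idea of overriding a remote interface under test control, here is a minimal sketch using a JDK dynamic proxy. It does not show how TestHAJournalServerOverrides actually works: `CommitGlue` is a hypothetical two-method stand-in (the real HAGlue interface is not reproduced here), and `FaultInjection.failing` is an invented helper. The proxy forwards every call to a delegate except for the named methods, which throw instead, so a test could force a failure at a chosen point in a protocol.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Set;

// Hypothetical fault-injection helper; not part of the Blazegraph test suite.
class FaultInjection {

    // Hypothetical stand-in for a remote interface. This is NOT the real HAGlue.
    interface CommitGlue {
        boolean prepare(long commitCounter);
        void commit(long commitCounter);
    }

    // Wraps delegate so that any method named in failingMethods throws
    // IllegalStateException; all other calls pass through unchanged.
    static <T> T failing(Class<T> iface, T delegate, Set<String> failingMethods) {
        InvocationHandler handler = (proxy, method, args) -> {
            if (failingMethods.contains(method.getName())) {
                throw new IllegalStateException("injected failure: " + method.getName());
            }
            try {
                return method.invoke(delegate, args);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // surface the delegate's own exception
            }
        };
        return iface.cast(Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[] { iface }, handler));
    }
}
```

A test would obtain the wrapped instance, drive the service through the transition of interest, and assert on the resulting run state. Which HAGlue methods to fail in order to reproduce this issue is not established by this report.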

      An attempt to restart the 2nd follower (bigdata15) caused updates to fail on the leader:

      ERROR: 12628203 2013-07-03 11:34:42,104      com.bigdata.rdf.sail.webapp.BigdataRDFContext.queryService23 com.bigdata.ha.QuorumCommitImpl.prepare2Phase(QuorumCommitImpl.java:375): java.util.concurrent.ExecutionException: java.lang.IllegalStateException: commitCounter: ( old=84 + 1 ) != new=9763
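
The message suggests a precondition that a new commit must advance the commit counter by exactly one. The sketch below reconstructs that check from the message text alone; it is not the code at QuorumCommitImpl.java:375, and the class and method names are hypothetical. A service stuck at commit point 84 cannot accept a prepare for commit 9763, which is what the leader observed.

```java
// Hypothetical reconstruction of the invariant implied by the error message;
// not the actual QuorumCommitImpl code.
class CommitCounterCheck {

    // Throws if newCommitCounter is not exactly oldCommitCounter + 1.
    static void assertNextCommit(long oldCommitCounter, long newCommitCounter) {
        if (oldCommitCounter + 1 != newCommitCounter) {
            throw new IllegalStateException("commitCounter: ( old=" + oldCommitCounter
                    + " + 1 ) != new=" + newCommitCounter);
        }
    }

    public static void main(String[] args) {
        assertNextCommit(84, 85); // consecutive commit points: accepted
        try {
            assertNextCommit(84, 9763); // the values from this report: rejected
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```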
      

      It also caused the status page on the leader to render incorrectly.

      And the following trace appears on the 2nd follower after 2 restarts:

      WARN : 16425 2013-07-03 11:40:47,231      com.bigdata.journal.jini.ha.HAJournal.executorService3 com.bigdata.journal.jini.ha.HAJournalServer$HAQuorumService$RunStateCallable.call(HAJournalServer.java:1127): SeekConsensus: exit, runStateFuture=com.bigdata.concurrent.FutureTaskMon@405a02ef
      WARN : 16425 2013-07-03 11:40:47,231      com.bigdata.journal.jini.ha.HAJournal.executorService1 com.bigdata.journal.jini.ha.HAJournalServer$HAQuorumService$RunStateCallable.call(HAJournalServer.java:1127): Error: exit, runStateFuture=com.bigdata.concurrent.FutureTaskMon@5576b9ea
      WARN : 16425 2013-07-03 11:40:47,231      com.bigdata.journal.jini.ha.HAJournal.executorService4 com.bigdata.journal.jini.ha.HAJournalServer$HAQuorumService.setRunState(HAJournalServer.java:1043): runState=SeekConsensus, oldRunState=Error, serviceName=com.bigdata.journal.jini.ha.HAJournalServer@bigdata15.bigdata.com#451982499
      ERROR: 16426 2013-07-03 11:40:47,232      com.bigdata.journal.jini.ha.HAJournal.executorService4 com.bigdata.journal.jini.ha.HAJournalServer$HAQuorumService$RunStateCallable.call(HAJournalServer.java:1088): java.lang.IllegalStateException: HALogWriter is open.
      java.lang.IllegalStateException: HALogWriter is open.
      	at com.bigdata.journal.jini.ha.HAJournalServer$HAQuorumService$SeekConsensusTask.doRun(HAJournalServer.java:1741)
      	at com.bigdata.journal.jini.ha.HAJournalServer$HAQuorumService$SeekConsensusTask.doRun(HAJournalServer.java:1712)
      	at com.bigdata.journal.jini.ha.HAJournalServer$HAQuorumService$RunStateCallable.call(HAJournalServer.java:1074)
      	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
      	at com.bigdata.concurrent.FutureTaskMon.run(FutureTaskMon.java:63)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
      	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
      	at java.lang.Thread.run(Thread.java:662)
      

      This problem repeats and does not resolve.

            People

            Assignee:
            martyncutcher
            Reporter:
            bryanthompson