Details

      Description

      Review the commit2Phase semantics for the case where a majority of services vote YES for prepare2Phase and then one or more services fail the commit2Phase (e.g., with a spurious exception).

      The code currently does an abort2Phase. This rolls back the commit point as long as we can talk to the services.

      I think that the code should instead cause the service that failed the commit2Phase (locally) to move into an error task and that the other services should retain the commit point. If the quorum breaks as a result, then operator intervention is required (unless a quorum can form around the services that were joined plus any that might be transiently out).

      As part of this task:


      - abort2Phase should carry the commit counter to be aborted, and the followers should check it (a minimal sketch of that check follows this list). Also, in some cases we may only want to issue the abort messages to the followers that actually prepared.


      - review the sequence in which the HALog closing root block is put down. It is currently written after the new root block is put down on the journal.
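
      For illustration, a minimal sketch of the follower-side check for the first bullet above (the message method and class names are hypothetical stand-ins, not the current API):

      {{{
      // Hypothetical stand-ins for the real HA message and journal types.
      interface IHA2PhaseAbortMessageSketch {
          long getCommitCounterToAbort(); // the commit counter the leader is aborting
      }

      class FollowerAbortCheck {

          private final long preparedCommitCounter; // remembered when PREPARE was accepted

          FollowerAbortCheck(final long preparedCommitCounter) {
              this.preparedCommitCounter = preparedCommitCounter;
          }

          /** Reject an abort2Phase that does not name the commit point this follower prepared. */
          void checkAbort(final IHA2PhaseAbortMessageSketch msg) {
              final long toAbort = msg.getCommitCounterToAbort();
              if (toAbort != preparedCommitCounter) {
                  throw new IllegalStateException("abort2Phase for commitCounter=" + toAbort
                          + ", but this follower prepared commitCounter=" + preparedCommitCounter);
              }
              // Otherwise discard the prepared (but not yet committed) write set.
          }
      }
      }}}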

      See {{{
      TestHAJournalServerOverride.testStartABC_commit2Phase_B_fails()
      TestHAJournalServerOverride.testABC_stopZookeeper_failA_startZookeeper_quorumMeetsAgainOnNewToken()
      TestHAJournalServerOverride.testAB_stopZookeeper_failB_startZookeeper_quorumBreaksThenMeets()
      }}}

      Note: We have seen the "empty HALog" problem [1] again. This appears to be caused by the leader failing to write the closing root block on the HALog. It was observed on an HA3 configuration during longevity testing. There was a 60 second period during which the leader went through 86 commit points using the BSBM EXPLORE + UPDATE workload. The HALog files for those commit points all lacked the closing root block. At the end of that 60 second period, the leader was failed. After a machine and service restart, the leader had all data on its local journal (as verified by DumpJournal on all 3 nodes). This strongly suggests that the leader went through the 2-phase commit and wrote the root blocks on the local journal, but somehow failed to write the root blocks on the HALog files (though it is possible that the leader went into a follower mode and resynced, catching up on the necessary commit points - the HAJournalServer.log file is mysteriously silent during this period). This was on bigdata17. The machine seized up somewhere around the time this occurred. The leader could not be restarted due to the presence of these logically empty HALog files. The HALog files were validated against those on the other nodes. They had all of the HAWriteMessages (and presumably the payloads for those messages) and appeared to lack only the closing root block to be complete. That closing root block is written by logRootBlock(), which is invoked from commit2Phase(). This was observed against r7485.
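
      The ordering at issue can be pictured with a minimal sketch (hypothetical helper names, not the actual AbstractJournal code): the new root block goes down on the journal first and the closing root block goes onto the HALog second, so a failure between the two steps produces exactly this kind of logically empty HALog.

      {{{
      // Hypothetical sketch of the commit2Phase ordering under review.
      class CommitOrderingSketch {

          static class RootBlock {
              final long commitCounter;
              RootBlock(final long commitCounter) { this.commitCounter = commitCounter; }
          }

          void putDownJournalRootBlock(final RootBlock rb) { /* write + sync the new root block on the journal */ }

          void putDownHALogClosingRootBlock(final RootBlock rb) { /* logRootBlock(): close the live HALog */ }

          void commit2Phase(final RootBlock newRootBlock) {
              putDownJournalRootBlock(newRootBlock);
              // A failure here leaves the journal at the new commit point while the
              // HALog for that write set never receives its closing root block.
              putDownHALogClosingRootBlock(newRootBlock);
          }
      }
      }}}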

      The first HALog written on bigdata17 without a closing root block was:

      -rw-r--r-- 1 root root  18648 2013-10-29 00:31:28.245476239 -0400 000000000000000112396.ha-log
      

      All HALog files from that commit point through the final HALog file on bigdata17 lacked a closing root block. There were 86 such files laid down over a 60 second window. The final HALog written on bigdata17 was:

      -rw-r--r-- 1 root root 222832 2013-10-29 00:32:00.434880535 -0400 000000000000000112480.ha-log
      

      The zookeeper logs on bigdata16 show that the zookeeper client for bigdata17 was expired at a timestamp shortly after the final HALog was written:

      2013-10-28 15:48:22,265 - INFO  [ProcessThread:-1:PrepRequestProcessor@387] - Processed session termination for sessionid: 0x141fc46c8720006
      2013-10-29 00:32:11,306 - ERROR [LearnerHandler-/192.168.1.17:50115:LearnerHandler@461] - Unexpected exception causing shutdown while sock still open
      java.net.SocketTimeoutException: Read timed out
      	at java.net.SocketInputStream.socketRead0(Native Method)
      	at java.net.SocketInputStream.read(SocketInputStream.java:150)
      	at java.net.SocketInputStream.read(SocketInputStream.java:121)
      	at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
      	at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
      	at java.io.DataInputStream.readInt(DataInputStream.java:387)
      	at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
      	at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:84)
      	at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
      	at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:375)
      2013-10-29 00:32:11,309 - WARN  [LearnerHandler-/192.168.1.17:50115:LearnerHandler@474] - ******* GOODBYE /192.168.1.17:50115 ********
      2013-10-29 00:32:30,000 - INFO  [SessionTracker:ZooKeeperServer@314] - Expiring session 0x241fc46c8740003, timeout of 40000ms exceeded
      2013-10-29 00:32:30,001 - INFO  [ProcessThread:-1:PrepRequestProcessor@387] - Processed session termination for sessionid: 0x241fc46c8740003
      2013-10-29 00:32:30,009 - INFO  [CommitProcessor:2:NIOServerCnxn@1435] - Closed socket connection for client /192.168.1.17:54381 which had sessionid 0x241fc46c8740003
      2013-10-29 00:32:32,025 - INFO  [ProcessThread:-1:PrepRequestProcessor@407] - Got user-level KeeperException when processing sessionid:0x241fc46c8740006 type:setData cxid:0x294 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:/benchmark/com.bigdata.journal.jini.ha.HAJournalServer/HAJournal-1/quorum Error:KeeperErrorCode = BadVersion for /benchmark/com.bigdata.journal.jini.ha.HAJournalServer/HAJournal-1/quorum
      2013-10-29 00:32:32,142 - INFO  [ProcessThread:-1:PrepRequestProcessor@407] - Got user-level KeeperException when processing sessionid:0x241fc46c8740007 type:create cxid:0x1cd zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:/benchmark/com.bigdata.journal.jini.ha.HAJournalServer/HAJournal-1/quorum/votes/1383021120406 Error:KeeperErrorCode = NodeExists for /benchmark/com.bigdata.journal.jini.ha.HAJournalServer/HAJournal-1/quorum/votes/1383021120406
      2013-10-29 00:32:32,223 - INFO  [ProcessThread:-1:PrepRequestProcessor@407] - Got user-level KeeperException when processing sessionid:0x241fc46c8740007 type:create cxid:0x1fe zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:/benchmark/com.bigdata.journal.jini.ha.HAJournalServer/HAJournal-1/quorum/votes/1383021120406 Error:KeeperErrorCode = NodeExists for /benchmark/com.bigdata.journal.jini.ha.HAJournalServer/HAJournal-1/quorum/votes/1383021120406
      ...No more events in this timeframe.
      

      The LUS on bigdata15 shows that bigdata17 was dead by 2013-10-29 at 00:32:45.

      Oct 29, 2013 12:32:45 AM com.sun.jini.reggie.RegistrarImpl$EventTask run
      INFO: exception sending event
      java.rmi.ConnectIOException: I/O exception connecting to BasicObjectEndpoint[2acdf1ee-7748-4cf3-987a-d1e8d768ed9e,TcpEndpoint[192.168.1.17:48992]]; nested exception is:
      	java.net.NoRouteToHostException: No route to host
      	at net.jini.jeri.BasicInvocationHandler.wrapSafeIOException(BasicInvocationHandler.java:893)
      	at net.jini.jeri.BasicInvocationHandler.invokeRemoteMethodOnce(BasicInvocationHandler.java:711)
      	at net.jini.jeri.BasicInvocationHandler.invokeRemoteMethod(BasicInvocationHandler.java:659)
      	at net.jini.jeri.BasicInvocationHandler.invoke(BasicInvocationHandler.java:528)
      	at com.sun.proxy.$Proxy7.notify(Unknown Source)
      	at com.sun.jini.reggie.RegistrarImpl$EventTask.run(RegistrarImpl.java:1863)
      	at com.sun.jini.thread.TaskManager$TaskThread.run(TaskManager.java:331)
      Caused by: java.net.NoRouteToHostException: No route to host
      	at java.net.PlainSocketImpl.socketConnect(Native Method)
      	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
      	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
      	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
      	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
      	at java.net.Socket.connect(Socket.java:579)
      	at net.jini.jeri.tcp.TcpEndpoint$ConnectionEndpointImpl.connectToSocketAddress(TcpEndpoint.java:678)
      	at net.jini.jeri.tcp.TcpEndpoint$ConnectionEndpointImpl.connectToHost(TcpEndpoint.java:608)
      	at net.jini.jeri.tcp.TcpEndpoint$ConnectionEndpointImpl.connect(TcpEndpoint.java:543)
      	at net.jini.jeri.connection.ConnectionManager.connect(ConnectionManager.java:242)
      	at net.jini.jeri.connection.ConnectionManager$ReqIterator.next(ConnectionManager.java:664)
      	at net.jini.jeri.BasicObjectEndpoint$1.next(BasicObjectEndpoint.java:370)
      	at net.jini.jeri.BasicInvocationHandler.invokeRemoteMethodOnce(BasicInvocationHandler.java:708)
      	... 5 more
      
      Oct 29, 2013 12:35:17 AM net.jini.lookup.JoinManager$ProxyReg fail
      INFO: JoinManager - failure for lookup service proxy: com.sun.jini.reggie.ConstrainableRegistrarProxy[registrar=56cd681c-6e6a-49d2-9f3f-0605980f72f9 Proxy[Registrar,BasicInvocationHandler[BasicObjectEndpoint[21e42e7a-11fd-4360-a3ff-8def549f53b7,TcpEndpoint[192.168.1.17:53382]]]]]
      java.rmi.ConnectIOException: I/O exception connecting to BasicObjectEndpoint[21e42e7a-11fd-4360-a3ff-8def549f53b7,TcpEndpoint[192.168.1.17:53382]]; nested exception is:
      	java.net.NoRouteToHostException: No route to host
      	at net.jini.jeri.BasicInvocationHandler.wrapSafeIOException(BasicInvocationHandler.java:893)
      	at net.jini.jeri.BasicInvocationHandler.invokeRemoteMethodOnce(BasicInvocationHandler.java:711)
      	at net.jini.jeri.BasicInvocationHandler.invokeRemoteMethod(BasicInvocationHandler.java:659)
      	at net.jini.jeri.BasicInvocationHandler.invoke(BasicInvocationHandler.java:528)
      	at com.sun.jini.reggie.$Proxy5.renewServiceLease(Unknown Source)
      	at com.sun.jini.reggie.ServiceLease.doRenew(ServiceLease.java:85)
      	at com.sun.jini.lease.AbstractLease.renew(AbstractLease.java:83)
      	at net.jini.lease.LeaseRenewalManager.renewAll(LeaseRenewalManager.java:1404)
      	at net.jini.lease.LeaseRenewalManager.access$500(LeaseRenewalManager.java:319)
      	at net.jini.lease.LeaseRenewalManager$RenewTask.run(LeaseRenewalManager.java:451)
      	at com.sun.jini.thread.TaskManager$TaskThread.run(TaskManager.java:331)
      Caused by: java.net.NoRouteToHostException: No route to host
      	at java.net.PlainSocketImpl.socketConnect(Native Method)
      	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
      	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
      	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
      	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
      	at java.net.Socket.connect(Socket.java:579)
      	at net.jini.jeri.tcp.TcpEndpoint$ConnectionEndpointImpl.connectToSocketAddress(TcpEndpoint.java:678)
      	at net.jini.jeri.tcp.TcpEndpoint$ConnectionEndpointImpl.connectToHost(TcpEndpoint.java:608)
      	at net.jini.jeri.tcp.TcpEndpoint$ConnectionEndpointImpl.connect(TcpEndpoint.java:543)
      	at net.jini.jeri.connection.ConnectionManager.connect(ConnectionManager.java:242)
      	at net.jini.jeri.connection.ConnectionManager$ReqIterator.next(ConnectionManager.java:664)
      	at net.jini.jeri.BasicObjectEndpoint$1.next(BasicObjectEndpoint.java:370)
      	at net.jini.jeri.BasicInvocationHandler.invokeRemoteMethodOnce(BasicInvocationHandler.java:708)
      	... 9 more
      

      Similar events were observed on bigdata16:

      Oct 29, 2013 12:32:44 AM com.sun.jini.reggie.RegistrarImpl$EventTask run
      INFO: exception sending event
      java.rmi.ConnectIOException: I/O exception connecting to BasicObjectEndpoint[2acdf1ee-7748-4cf3-987a-d1e8d768ed9e,TcpEndpoint[192.168.1.17:48992]]; nested exception is:
      	java.net.NoRouteToHostException: No route to host
      	at net.jini.jeri.BasicInvocationHandler.wrapSafeIOException(BasicInvocationHandler.java:893)
      	at net.jini.jeri.BasicInvocationHandler.invokeRemoteMethodOnce(BasicInvocationHandler.java:711)
      	at net.jini.jeri.BasicInvocationHandler.invokeRemoteMethod(BasicInvocationHandler.java:659)
      	at net.jini.jeri.BasicInvocationHandler.invoke(BasicInvocationHandler.java:528)
      	at com.sun.proxy.$Proxy5.notify(Unknown Source)
      	at com.sun.jini.reggie.RegistrarImpl$EventTask.run(RegistrarImpl.java:1863)
      	at com.sun.jini.thread.TaskManager$TaskThread.run(TaskManager.java:331)
      Caused by: java.net.NoRouteToHostException: No route to host
      	at java.net.PlainSocketImpl.socketConnect(Native Method)
      	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
      	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
      	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
      	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
      	at java.net.Socket.connect(Socket.java:579)
      	at net.jini.jeri.tcp.TcpEndpoint$ConnectionEndpointImpl.connectToSocketAddress(TcpEndpoint.java:678)
      	at net.jini.jeri.tcp.TcpEndpoint$ConnectionEndpointImpl.connectToHost(TcpEndpoint.java:608)
      	at net.jini.jeri.tcp.TcpEndpoint$ConnectionEndpointImpl.connect(TcpEndpoint.java:543)
      	at net.jini.jeri.connection.ConnectionManager.connect(ConnectionManager.java:242)
      	at net.jini.jeri.connection.ConnectionManager$ReqIterator.next(ConnectionManager.java:664)
      	at net.jini.jeri.BasicObjectEndpoint$1.next(BasicObjectEndpoint.java:370)
      	at net.jini.jeri.BasicInvocationHandler.invokeRemoteMethodOnce(BasicInvocationHandler.java:708)
      	... 5 more
      
      Oct 29, 2013 12:35:13 AM net.jini.lookup.JoinManager$ProxyReg fail
      INFO: JoinManager - failure for lookup service proxy: com.sun.jini.reggie.ConstrainableRegistrarProxy[registrar=56cd681c-6e6a-49d2-9f3f-0605980f72f9 Proxy[Registrar,BasicInvocationHandler[BasicObjectEndpoint[21e42e7a-11fd-4360-a3ff-8def549f53b7,TcpEndpoint[192.168.1.17:53382]]]]]
      java.rmi.ConnectIOException: I/O exception connecting to BasicObjectEndpoint[21e42e7a-11fd-4360-a3ff-8def549f53b7,TcpEndpoint[192.168.1.17:53382]]; nested exception is:
      	java.net.NoRouteToHostException: No route to host
      	at net.jini.jeri.BasicInvocationHandler.wrapSafeIOException(BasicInvocationHandler.java:893)
      	at net.jini.jeri.BasicInvocationHandler.invokeRemoteMethodOnce(BasicInvocationHandler.java:711)
      	at net.jini.jeri.BasicInvocationHandler.invokeRemoteMethod(BasicInvocationHandler.java:659)
      	at net.jini.jeri.BasicInvocationHandler.invoke(BasicInvocationHandler.java:528)
      	at com.sun.jini.reggie.$Proxy2.renewServiceLease(Unknown Source)
      	at com.sun.jini.reggie.ServiceLease.doRenew(ServiceLease.java:85)
      	at com.sun.jini.lease.AbstractLease.renew(AbstractLease.java:83)
      	at net.jini.lease.LeaseRenewalManager.renewAll(LeaseRenewalManager.java:1404)
      	at net.jini.lease.LeaseRenewalManager.access$500(LeaseRenewalManager.java:319)
      	at net.jini.lease.LeaseRenewalManager$RenewTask.run(LeaseRenewalManager.java:451)
      	at com.sun.jini.thread.TaskManager$TaskThread.run(TaskManager.java:331)
      Caused by: java.net.NoRouteToHostException: No route to host
      	at java.net.PlainSocketImpl.socketConnect(Native Method)
      	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
      	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
      	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
      	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
      	at java.net.Socket.connect(Socket.java:579)
      	at net.jini.jeri.tcp.TcpEndpoint$ConnectionEndpointImpl.connectToSocketAddress(TcpEndpoint.java:678)
      	at net.jini.jeri.tcp.TcpEndpoint$ConnectionEndpointImpl.connectToHost(TcpEndpoint.java:608)
      	at net.jini.jeri.tcp.TcpEndpoint$ConnectionEndpointImpl.connect(TcpEndpoint.java:543)
      	at net.jini.jeri.connection.ConnectionManager.connect(ConnectionManager.java:242)
      	at net.jini.jeri.connection.ConnectionManager$ReqIterator.next(ConnectionManager.java:664)
      	at net.jini.jeri.BasicObjectEndpoint$1.next(BasicObjectEndpoint.java:370)
      	at net.jini.jeri.BasicInvocationHandler.invokeRemoteMethodOnce(BasicInvocationHandler.java:708)
      	... 9 more
      

      No relevant messages were found on the console, LUS, or dmesg for bigdata17.

      [1] http://sourceforge.net/apps/trac/bigdata/ticket/679 (HAJournalServer can not restart due to logically empty log file)

        Activity

        beebs Brad Bebee created issue -
        bryanthompson bryanthompson added a comment -

        I have reviewed and revised the following aspects of the commit and 2-phase commit protocol:


        - Journal: cancellation of the Future in the GATHER consensus protocol is no longer logged as an ERROR. This is just a data race.


        - HAJournalServer: enterErrorState() is now written to be defensive. No exception will be thrown out of it, even if the HAQuorumService is not running (quorum.terminate()).


        - HAJournalServer.HAQuorumService.start()/terminate() now use an AtomicBoolean to guard those methods and to make enterErrorState() a NOP while the HAQuorumService is inside either of them (a minimal sketch of this guard appears after the stack trace below).


        - HAJournalServer.logRootBlock() no longer has the isJoinedService boolean. This method is only called in the 2-phase commit logic and the 2-phase commit is only executed for services that were joined with the met quorum as of the atomic decision point in commitNow().


        - commitNow(): refactored to use a CommitState object, pushing the different steps of the commit down into methods on that object. The error handling for the two-phase commit logic was simplified to make it easier to understand.


        - The Prepare2Phase and Commit2Phase tasks were simplified. The core code was pushed down into private inner methods. This makes it easier to analyze the error handling code paths.


        - TestDumpJournal: added coverage of some more conditions in an attempt to replicate an error observed on the HA3 cluster. I was not able to replicate the problem. It may have been related to allocator recycling or to the abnormal failure mode on bigdata17 described in this ticket:

        ERROR: 2885231 2013-10-28 16:05:28,194      qtp230584058-49 com.bigdata.rdf.sail.webapp.StatusServlet.doGet(StatusServlet.java:863): java.lang.RuntimeException: java.lang.ClassCastException: com.bigdata.btree.BTree cannot be cast to com.bigdata.journal.Name2Addr
        java.lang.RuntimeException: java.lang.ClassCastException: com.bigdata.btree.BTree cannot be cast to com.bigdata.journal.Name2Addr
        	at com.bigdata.journal.DumpJournal.dumpNamedIndicesMetadata(DumpJournal.java:726)
        	at com.bigdata.journal.DumpJournal.dumpJournal(DumpJournal.java:599)
        	at com.bigdata.rdf.sail.webapp.StatusServlet.doGet(StatusServlet.java:485)
        	at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
        	at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
        	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:534)
        	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:475)
        	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:929)
        	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:403)
        	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:864)
        	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
        	at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:47)
        	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:114)
        	at org.eclipse.jetty.server.Server.handle(Server.java:352)
        	at org.eclipse.jetty.server.HttpConnection.handleRequest(HttpConnection.java:596)
        	at org.eclipse.jetty.server.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:1051)
        	at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:590)
        	at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:212)
        	at org.eclipse.jetty.server.HttpConnection.handle(HttpConnection.java:426)
        	at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:508)
        	at org.eclipse.jetty.io.nio.SelectChannelEndPoint.access$000(SelectChannelEndPoint.java:34)
        	at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:40)
        	at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:451)
        	at java.lang.Thread.run(Thread.java:724)
        Caused by: java.lang.ClassCastException: com.bigdata.btree.BTree cannot be cast to com.bigdata.journal.Name2Addr
        	at com.bigdata.journal.AbstractJournal.getIndexWithCommitRecord(AbstractJournal.java:4751)
        	at com.bigdata.journal.DumpJournal.dumpNamedIndicesMetadata(DumpJournal.java:706)
        	... 23 more
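
        A minimal sketch of the AtomicBoolean guard on start()/terminate() mentioned above (not the actual HAJournalServer code):

        {{{
        import java.util.concurrent.atomic.AtomicBoolean;

        // Sketch: enterErrorState() is a NOP while start()/terminate() is in progress.
        class QuorumServiceGuardSketch {

            private final AtomicBoolean inStartOrTerminate = new AtomicBoolean(false);

            void start() {
                if (!inStartOrTerminate.compareAndSet(false, true))
                    return; // start/terminate already in progress
                try {
                    // ... bring up the quorum service ...
                } finally {
                    inStartOrTerminate.set(false);
                }
            }

            void terminate() {
                if (!inStartOrTerminate.compareAndSet(false, true))
                    return;
                try {
                    // ... tear down the quorum service ...
                } finally {
                    inStartOrTerminate.set(false);
                }
            }

            void enterErrorState() {
                if (inStartOrTerminate.get())
                    return; // NOP while start()/terminate() is running
                // ... otherwise submit the error task ...
            }
        }
        }}}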
        

        TODO:


        - Review abort2Phase(). For some invocation contexts, this should be restricted to services that were joined with the met quorum and that voted YES for the PREPARE message.


        - QuorumCommitImpl: I have reviewed these methods, including the cancellation of remote futures and the error paths. The code was modified to use the local executor service to submit the RMI requests in parallel, and the code paths for interrupt and error handling (for RMI requests and for failures to locate the proxy for a service) were simplified. However, the new code causes some problems with the HA CI test suite and is being withheld while I investigate those issues in more depth.

        I have run through the HA CI test suite, the RWStore test suite, the WORM test suite, and TestBigdataSailWithQuads. All is green.

        See BLZG-159

        Committed revision r7501.

        bryanthompson bryanthompson added a comment -

        The issue with the withheld version of QuorumCommitImpl has been tracked down to the non-blocking semantics of HAGlue.getRootBlock(). I have introduced a blocking version of the IHARootBlockRequest message for this method. When that is used from within awaitCommitCounter() in the test suite, the test suite runs green for the tests that were failing. Thus, this was a data race problem in the test suite.

        AbstractServer: clean up dead code.
        HAJournalServer: clean up dead code.
        AbstractHA3BackupTestCase: clean up import.

        AbstractHAJournalServerTestCase: modify awaitCommitCounter() to use the blocking version of HAGlue.getRootBlock(). This fixes a test suite problem where increased concurrency in the 2-phase PREPARE, COMMIT, and ABORT protocols could cause test suite failures.
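
        A minimal sketch of that change in the test helper (the interfaces are hypothetical stand-ins for HAGlue and the root block view; the boolean follows the blocking/non-blocking flag described below):

        {{{
        // Hypothetical stand-ins; the real types are HAGlue and the root block view.
        interface HAGlueSketch {
            RootBlockViewSketch getRootBlock(boolean isNonBlocking) throws Exception;
        }

        interface RootBlockViewSketch {
            long getCommitCounter();
        }

        class AwaitCommitCounterSketch {
            /** Ask each service for its root block, blocking out any in-flight 2-phase commit. */
            static void awaitCommitCounter(final long expected, final HAGlueSketch... services)
                    throws Exception {
                for (HAGlueSketch service : services) {
                    final long actual = service.getRootBlock(false /* blocking */).getCommitCounter();
                    if (actual != expected)
                        throw new AssertionError("expected commitCounter=" + expected + ", actual=" + actual);
                }
            }
        }
        }}}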

        QuorumCommitImpl: Increased parallelism. The PREPARE, COMMIT, and ABORT behaviors are now executed in parallel on the followers and the leader. Simplified the code patterns for error handling.
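
        The parallel submission pattern, reduced to a hedged sketch (the class and method here are illustrative only; the real code lives in QuorumCommitImpl):

        {{{
        import java.util.ArrayList;
        import java.util.List;
        import java.util.concurrent.Callable;
        import java.util.concurrent.ExecutionException;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Future;

        class ParallelTwoPhaseSketch {
            /**
             * Submit the per-service PREPARE/COMMIT/ABORT tasks to the local executor so the
             * leader and the followers execute the phase in parallel, then count the successes.
             */
            static int runPhase(final ExecutorService executor, final List<Callable<Void>> perServiceTasks)
                    throws InterruptedException {
                final List<Future<Void>> futures = new ArrayList<Future<Void>>();
                for (Callable<Void> task : perServiceTasks)
                    futures.add(executor.submit(task)); // RMI to a follower, or a local call on the leader
                int nOk = 0;
                for (Future<Void> f : futures) {
                    try {
                        f.get(); // per-service failures surface here without cancelling the others
                        nOk++;
                    } catch (ExecutionException ex) {
                        // treat as a failed vote/behavior for that service
                    }
                }
                return nOk; // number of services that completed the phase successfully
            }
        }
        }}}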

        (I)HARootBlockRequest: Added boolean parameter for blocking versus non-blocking request.

        AbstractJournal: Added support for blocking versus non-blocking request.

        Committed revision r7505.

        bryanthompson bryanthompson added a comment -

        Throughput for BSBM UPDATE+EXPLORE has gone up. Without concurrent query on the leader, performance now rises to 2100 QMpH. I attribute this to the change to QuorumCommitImpl above. That change was expected to decrease the 2-phase commit latency by executing the PREPARE, COMMIT, and ABORT operations in parallel. The main cost for large write sets is the flush of the data down the pipeline, which is already extremely efficient. The main remaining cost is the concurrency of the 2-phase commit, including the messages mentioned above and the GATHER protocol (which is already executed concurrently).

        QMpH:                   1257.17 query mixes per hour
        QMpH:                   1303.51 query mixes per hour
        QMpH:                   1319.24 query mixes per hour
        QMpH:                   1357.83 query mixes per hour
        QMpH:                   1568.23 query mixes per hour
        QMpH:                   2133.59 query mixes per hour
        QMpH:                   2165.44 query mixes per hour
        QMpH:                   2185.65 query mixes per hour
        QMpH:                   2165.26 query mixes per hour
        QMpH:                   2170.46 query mixes per hour
        
        bryanthompson bryanthompson added a comment -

        I have analyzed the root cause of the override test in which B is told to reject the commit. The test actually relies on an exception being thrown back when the commit2Phase message is received. Thus, B is not actually encountering an error when handling that message and is not being transitioned automatically to the ERROR state.

        I have further cleaned up the 2-phase commit logic in both QuorumCommitImpl and AbstractJournal. The COMMIT behavior for the 2-phase commit is now rejected IFF we lack a majority of services that successfully executed the COMMIT (and for which the leader has received notice of that success). The leader will fail over if the COMMIT phase fails, and the transaction will be failed. (If the PREPARE phase fails, the transaction is failed, but the leader does not fail over.)
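
        A minimal sketch of that acceptance rule (names are illustrative only):

        {{{
        class CommitOutcomeSketch {
            /** For replication factor k, a COMMIT is accepted iff at least (k + 1) / 2 services succeeded. */
            static boolean commitAccepted(final int replicationFactor, final int nCommitOk) {
                final int majority = (replicationFactor + 1) / 2;
                return nCommitOk >= majority;
            }
            // Example: HA3 needs nCommitOk >= 2; with only one successful COMMIT the
            // transaction is failed and the leader fails over.
        }
        }}}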

        I am now encountering a problem with this test where B goes into the Operator state because it is looking for an HALog that does not exist. I had observed this in the recent longevity tests. Now we have a unit test that can recreate the problem, which is great.

        http://192.168.1.133:8091 : is not joined, pipelineOrder=2, writePipelineAddr=localhost/127.0.0.1:9091, service=other, extendedRunState={server=Running, quorumService=Operator @ 1, haReady=-1, haStatus=NotReady, serviceId=e832b5c0-6c9b-444d-8946-514eacc5e329, now=1383332021358, msg=[HALog not available: commitCounter=2]}
        

        QuorumCommitImpl was modified to return a class containing a summary of the commit response. This makes it possible to write more flexible behaviors in AbstractJournal.commitNow().

        bryanthompson bryanthompson added a comment -

        A copy of the logs for this run has been saved on dropbox: https://www.dropbox.com/sh/fque4cmga579033/J9swV2_ggY

        When analyzing the log files, note that the test harness was paused for several minutes while I inspected the /status page and documented the problem.

        bryanthompson bryanthompson added a comment -

        Committed revision r7507 for the two comments above. HA CI is green locally (except for known failures). Awaiting feedback from CI.

        bryanthompson bryanthompson added a comment -

        I have looked at the logs for the problem above. This is a case where we have moved from SeekConsensus into Resync because the quorum is already met. However, when we enter Resync the HALog for the next commit point does not yet exist - this is probably a data race with the creation of that HALog file. Resync then transitions to Operator.

        In this case, we are trying to replicate the current commit point on the leader, but that commit point might not yet exist (because the leader is in the 2-phase commit protocol). The fix for this might be as easy as retrying the resync or awaiting the creation of the HALog on the leader (e.g., by contending for the AbstractJournal's inner read/write lock).
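
        As a hedged sketch of the "await the HALog" option (helper names are hypothetical):

        {{{
        class AwaitHALogSketch {

            interface HALogLookup {
                boolean haLogExists(long commitCounter); // e.g., an RMI against the leader
            }

            /** Poll for the leader's HALog with a bounded timeout instead of failing immediately or retrying forever. */
            static boolean awaitHALog(final HALogLookup leader, final long commitCounter,
                    final long timeoutMillis) throws InterruptedException {
                final long deadline = System.currentTimeMillis() + timeoutMillis;
                while (System.currentTimeMillis() < deadline) {
                    if (leader.haLogExists(commitCounter))
                        return true; // created by the concurrent 2-phase commit on the leader
                    Thread.sleep(100); // brief backoff, then look again
                }
                return false; // genuinely missing: escalate (Operator state, replicate from a peer, fail over, ...)
            }
        }
        }}}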

        However, there are three potential problems:

        1. Infinite retry of resync if the HALog never becomes available (solved if we contend for a lock).

        2. Distributed deadlock through contention for the AbstractJournal's read/write lock.

        3. We need to clearly identify the case where the missing HALog file is not the one that is being laid down for the leader. For example, if someone has deleted some HALog files from the leader and we are looking for one of those files. Once clearly identified, this case could be handled either by any of: (a) transitioning to the Operator state with a message that clearly identifies the problem; (b) replicating the missing HALog from another service in the quorum; (c) if another joined service has the missing HALog files, then failing over to that service; or (d) online disaster recovery.


        - We need more unit tests that can hit this problem.


        - This is clearly the root cause that led us to disable the online disaster recovery by default. The data race between the resync on one service and the 2-phase commit on another service can lead to resync falsely believing that there is a missing HALog on the leader when it is in fact just looking during the middle of the 2-phase commit that will create the HALog file that it needs (as soon as we lay down the closing root block on the old HALog, the new HALog will be created).

        I need to look at this a little more. The data race is between:

        HAJournal::getHALogRootBlocksForWriteSet() (as invoked from ResyncTask.call())
        
        and
        
        AbstractJournal::commitNow()
        
        bryanthompson bryanthompson added a comment -

        Ticket BLZG-849 was created for the issue described and has been resolved in r7508.

        bryanthompson bryanthompson added a comment -

        I have added a unit test where the commit2Phase() message fails on B before the root block is written onto the journal. This test passes as long as the HALog files are not purged with each commit. That is, as long as the DefaultRestorePolicy is in effect. If the HALog files are purged with each commit, then the test still fails, but the failure is because the necessary HALog for B to resync has been purged from both A and C when they went through a commit point with a fully met quorum.

        bryanthompson bryanthompson added a comment -

        Added test coverage for a spurious exception thrown out of commit2Phase() before the root block is written on the Journal.

        Committed revision r7509.

        bryanthompson bryanthompson added a comment -

        I am going to close this ticket for now. We have increased code coverage and confidence in the 2-phase commit protocol. We could probably spend a month writing tests of this, but I am no longer aware of any known problem points, and the test failures have been corrected through a combination of (a) a refactoring of the 2-phase commit logic; (b) some logic changes in the 2-phase commit error handling code paths; (c) increased concurrency in QuorumCommitImpl, with a corresponding increase in the sustained commit rate for the HA3 cluster; and (d) both fixing existing unit tests and writing new ones.


          People

          • Assignee: bryanthompson
          • Reporter: bryanthompson