Details

      Description

      Review the commit2Phase semantics when a majority of services vote YES for phase2Phase and then one or more services fail (spurious exception) the commit2Phase.

      The code currently does an abort2Phase. This rolls back the commit point as long as we can talk to the services.

      I think that the code should instead cause the service that failed the commit2Phase (locally) to move into an error task and that the other services should retain the commit point. If the quorum breaks as a result, then operator intervention is required (unless a quorum can form around the services that were joined plus any that might be transiently out).

      As part of this task:


      - abort2Phase should have the commit counter to be aborted. check this on the followers. also, for some cases we may only want to issue the abort messages to the prepared followers.


      - review the sequence in which the HALog closing root block is put down. This is done after the new root block on the journal.

      See {{{
      TestHAJournalServerOverride.testStartABC_commit2Phase_B_fails()
      TestHAJournalServerOverride.testABC_stopZookeeper_failA_startZookeeper_quorumMeetsAgainOnNewToken()
      TestHAJournalServerOverride.testAB_stopZookeeper_failB_startZookeeper_quorumBreaksThenMeets()
      }}}

      Note: We have seen the "empty HALog" problem [1] again. This appears to be caused by the leader when it fails to write the closing root block on the HALog. This was observed on an HA3 configuration during longevity testing. There was a 60 second period during which the leader went through 86 commit points using the BSBM EXPLORE + UPDATE workload. The HALog files for those commit points all lacked the closing root block. At the end of that 60 second period, the leader was failed. After a machine and service restart, the leader had all data on its local journal (as verified by DumpJournal for all 3 nodes). This strongly suggests that the leader went through the 2-phase commit and wrote the root blocks on the local journal, but somehow failed to write the root blocks on the HALog files (though it is possible that the leader went into a follower mode and resynced, catching up on the necessary commit points
      - the HAJournalServer.log file is mysteriously silent during this period). This was on bigdata17. The machine seized up somewhere around when this occurred. The leader could not be restarted due to the presence of these logically empty HALog files. The HALog files were validated against those on the other nodes. They had all of the HAWriteMessages (and presumably the payloads for those messages) and appeared to lack only the closing root block to be complete. That closing root block is written by logRootBlock() which is invoked from commit2Phase(). This was observed against r7485.

      The first log written on bigdata without a closing root block was:

      -rw-r--r-- 1 root root  18648 2013-10-29 00:31:28.245476239 -0400 000000000000000112396.ha-log
      

      All HALog files from that commit point until the final HALog file on bigdata did not have a closing root block. There were 86 such files laid down over a 60 second window. The final log written on bigdata17 was:

      -rw-r--r-- 1 root root 222832 2013-10-29 00:32:00.434880535 -0400 000000000000000112480.ha-log
      

      The zookeeper logs on bigdata16 shows that the zookeeper client for bigdata17 was expired at a timestamp that is

      2013-10-28 15:48:22,265 - INFO  [ProcessThread:-1:PrepRequestProcessor@387] - Processed session termination for sessionid: 0x141fc46c8720006
      2013-10-29 00:32:11,306 - ERROR [LearnerHandler-/192.168.1.17:50115:LearnerHandler@461] - Unexpected exception causing shutdown while sock still open
      java.net.SocketTimeoutException: Read timed out
      	at java.net.SocketInputStream.socketRead0(Native Method)
      	at java.net.SocketInputStream.read(SocketInputStream.java:150)
      	at java.net.SocketInputStream.read(SocketInputStream.java:121)
      	at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
      	at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
      	at java.io.DataInputStream.readInt(DataInputStream.java:387)
      	at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
      	at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:84)
      	at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
      	at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:375)
      2013-10-29 00:32:11,309 - WARN  [LearnerHandler-/192.168.1.17:50115:LearnerHandler@474] - ******* GOODBYE /192.168.1.17:50115 ********
      2013-10-29 00:32:30,000 - INFO  [SessionTracker:ZooKeeperServer@314] - Expiring session 0x241fc46c8740003, timeout of 40000ms exceeded
      2013-10-29 00:32:30,001 - INFO  [ProcessThread:-1:PrepRequestProcessor@387] - Processed session termination for sessionid: 0x241fc46c8740003
      2013-10-29 00:32:30,009 - INFO  [CommitProcessor:2:NIOServerCnxn@1435] - Closed socket connection for client /192.168.1.17:54381 which had sessionid 0x241fc46c8740003
      2013-10-29 00:32:32,025 - INFO  [ProcessThread:-1:PrepRequestProcessor@407] - Got user-level KeeperException when processing sessionid:0x241fc46c8740006 type:setData cxid:0x294 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:/benchmark/com.bigdata.journal.jini.ha.HAJournalServer/HAJournal-1/quorum Error:KeeperErrorCode = BadVersion for /benchmark/com.bigdata.journal.jini.ha.HAJournalServer/HAJournal-1/quorum
      2013-10-29 00:32:32,142 - INFO  [ProcessThread:-1:PrepRequestProcessor@407] - Got user-level KeeperException when processing sessionid:0x241fc46c8740007 type:create cxid:0x1cd zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:/benchmark/com.bigdata.journal.jini.ha.HAJournalServer/HAJournal-1/quorum/votes/1383021120406 Error:KeeperErrorCode = NodeExists for /benchmark/com.bigdata.journal.jini.ha.HAJournalServer/HAJournal-1/quorum/votes/1383021120406
      2013-10-29 00:32:32,223 - INFO  [ProcessThread:-1:PrepRequestProcessor@407] - Got user-level KeeperException when processing sessionid:0x241fc46c8740007 type:create cxid:0x1fe zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:/benchmark/com.bigdata.journal.jini.ha.HAJournalServer/HAJournal-1/quorum/votes/1383021120406 Error:KeeperErrorCode = NodeExists for /benchmark/com.bigdata.journal.jini.ha.HAJournalServer/HAJournal-1/quorum/votes/1383021120406
      ...No more events in this timeframe.
      

      The LUS on bigdata15 shows that bigdata17 was dead by 10-28-2013 at 00:32:45.

      Oct 29, 2013 12:32:45 AM com.sun.jini.reggie.RegistrarImpl$EventTask run
      INFO: exception sending event
      java.rmi.ConnectIOException: I/O exception connecting to BasicObjectEndpoint[2acdf1ee-7748-4cf3-987a-d1e8d768ed9e,TcpEndpoint[192.168.1.17:48992]]; nested exception is:
      	java.net.NoRouteToHostException: No route to host
      	at net.jini.jeri.BasicInvocationHandler.wrapSafeIOException(BasicInvocationHandler.java:893)
      	at net.jini.jeri.BasicInvocationHandler.invokeRemoteMethodOnce(BasicInvocationHandler.java:711)
      	at net.jini.jeri.BasicInvocationHandler.invokeRemoteMethod(BasicInvocationHandler.java:659)
      	at net.jini.jeri.BasicInvocationHandler.invoke(BasicInvocationHandler.java:528)
      	at com.sun.proxy.$Proxy7.notify(Unknown Source)
      	at com.sun.jini.reggie.RegistrarImpl$EventTask.run(RegistrarImpl.java:1863)
      	at com.sun.jini.thread.TaskManager$TaskThread.run(TaskManager.java:331)
      Caused by: java.net.NoRouteToHostException: No route to host
      	at java.net.PlainSocketImpl.socketConnect(Native Method)
      	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
      	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
      	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
      	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
      	at java.net.Socket.connect(Socket.java:579)
      	at net.jini.jeri.tcp.TcpEndpoint$ConnectionEndpointImpl.connectToSocketAddress(TcpEndpoint.java:678)
      	at net.jini.jeri.tcp.TcpEndpoint$ConnectionEndpointImpl.connectToHost(TcpEndpoint.java:608)
      	at net.jini.jeri.tcp.TcpEndpoint$ConnectionEndpointImpl.connect(TcpEndpoint.java:543)
      	at net.jini.jeri.connection.ConnectionManager.connect(ConnectionManager.java:242)
      	at net.jini.jeri.connection.ConnectionManager$ReqIterator.next(ConnectionManager.java:664)
      	at net.jini.jeri.BasicObjectEndpoint$1.next(BasicObjectEndpoint.java:370)
      	at net.jini.jeri.BasicInvocationHandler.invokeRemoteMethodOnce(BasicInvocationHandler.java:708)
      	... 5 more
      
      Oct 29, 2013 12:35:17 AM net.jini.lookup.JoinManager$ProxyReg fail
      INFO: JoinManager - failure for lookup service proxy: com.sun.jini.reggie.ConstrainableRegistrarProxy[registrar=56cd681c-6e6a-49d2-9f3f-0605980f72f9 Proxy[Registrar,BasicInvocationHandler[BasicObjectEndpoint[21e42e7a-11fd-4360-a3ff-8def549f53b7,TcpEndpoint[192.168.1.17:53382]]]]]
      java.rmi.ConnectIOException: I/O exception connecting to BasicObjectEndpoint[21e42e7a-11fd-4360-a3ff-8def549f53b7,TcpEndpoint[192.168.1.17:53382]]; nested exception is:
      	java.net.NoRouteToHostException: No route to host
      	at net.jini.jeri.BasicInvocationHandler.wrapSafeIOException(BasicInvocationHandler.java:893)
      	at net.jini.jeri.BasicInvocationHandler.invokeRemoteMethodOnce(BasicInvocationHandler.java:711)
      	at net.jini.jeri.BasicInvocationHandler.invokeRemoteMethod(BasicInvocationHandler.java:659)
      	at net.jini.jeri.BasicInvocationHandler.invoke(BasicInvocationHandler.java:528)
      	at com.sun.jini.reggie.$Proxy5.renewServiceLease(Unknown Source)
      	at com.sun.jini.reggie.ServiceLease.doRenew(ServiceLease.java:85)
      	at com.sun.jini.lease.AbstractLease.renew(AbstractLease.java:83)
      	at net.jini.lease.LeaseRenewalManager.renewAll(LeaseRenewalManager.java:1404)
      	at net.jini.lease.LeaseRenewalManager.access$500(LeaseRenewalManager.java:319)
      	at net.jini.lease.LeaseRenewalManager$RenewTask.run(LeaseRenewalManager.java:451)
      	at com.sun.jini.thread.TaskManager$TaskThread.run(TaskManager.java:331)
      Caused by: java.net.NoRouteToHostException: No route to host
      	at java.net.PlainSocketImpl.socketConnect(Native Method)
      	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
      	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
      	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
      	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
      	at java.net.Socket.connect(Socket.java:579)
      	at net.jini.jeri.tcp.TcpEndpoint$ConnectionEndpointImpl.connectToSocketAddress(TcpEndpoint.java:678)
      	at net.jini.jeri.tcp.TcpEndpoint$ConnectionEndpointImpl.connectToHost(TcpEndpoint.java:608)
      	at net.jini.jeri.tcp.TcpEndpoint$ConnectionEndpointImpl.connect(TcpEndpoint.java:543)
      	at net.jini.jeri.connection.ConnectionManager.connect(ConnectionManager.java:242)
      	at net.jini.jeri.connection.ConnectionManager$ReqIterator.next(ConnectionManager.java:664)
      	at net.jini.jeri.BasicObjectEndpoint$1.next(BasicObjectEndpoint.java:370)
      	at net.jini.jeri.BasicInvocationHandler.invokeRemoteMethodOnce(BasicInvocationHandler.java:708)
      	... 9 more
      

      Similar events were observed on bigdata16:

      Oct 29, 2013 12:32:44 AM com.sun.jini.reggie.RegistrarImpl$EventTask run
      INFO: exception sending event
      java.rmi.ConnectIOException: I/O exception connecting to BasicObjectEndpoint[2acdf1ee-7748-4cf3-987a-d1e8d768ed9e,TcpEndpoint[192.168.1.17:48992]]; nested exception is:
      	java.net.NoRouteToHostException: No route to host
      	at net.jini.jeri.BasicInvocationHandler.wrapSafeIOException(BasicInvocationHandler.java:893)
      	at net.jini.jeri.BasicInvocationHandler.invokeRemoteMethodOnce(BasicInvocationHandler.java:711)
      	at net.jini.jeri.BasicInvocationHandler.invokeRemoteMethod(BasicInvocationHandler.java:659)
      	at net.jini.jeri.BasicInvocationHandler.invoke(BasicInvocationHandler.java:528)
      	at com.sun.proxy.$Proxy5.notify(Unknown Source)
      	at com.sun.jini.reggie.RegistrarImpl$EventTask.run(RegistrarImpl.java:1863)
      	at com.sun.jini.thread.TaskManager$TaskThread.run(TaskManager.java:331)
      Caused by: java.net.NoRouteToHostException: No route to host
      	at java.net.PlainSocketImpl.socketConnect(Native Method)
      	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
      	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
      	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
      	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
      	at java.net.Socket.connect(Socket.java:579)
      	at net.jini.jeri.tcp.TcpEndpoint$ConnectionEndpointImpl.connectToSocketAddress(TcpEndpoint.java:678)
      	at net.jini.jeri.tcp.TcpEndpoint$ConnectionEndpointImpl.connectToHost(TcpEndpoint.java:608)
      	at net.jini.jeri.tcp.TcpEndpoint$ConnectionEndpointImpl.connect(TcpEndpoint.java:543)
      	at net.jini.jeri.connection.ConnectionManager.connect(ConnectionManager.java:242)
      	at net.jini.jeri.connection.ConnectionManager$ReqIterator.next(ConnectionManager.java:664)
      	at net.jini.jeri.BasicObjectEndpoint$1.next(BasicObjectEndpoint.java:370)
      	at net.jini.jeri.BasicInvocationHandler.invokeRemoteMethodOnce(BasicInvocationHandler.java:708)
      	... 5 more
      
      Oct 29, 2013 12:35:13 AM net.jini.lookup.JoinManager$ProxyReg fail
      INFO: JoinManager - failure for lookup service proxy: com.sun.jini.reggie.ConstrainableRegistrarProxy[registrar=56cd681c-6e6a-49d2-9f3f-0605980f72f9 Proxy[Registrar,BasicInvocationHandler[BasicObjectEndpoint[21e42e7a-11fd-4360-a3ff-8def549f53b7,TcpEndpoint[192.168.1.17:53382]]]]]
      java.rmi.ConnectIOException: I/O exception connecting to BasicObjectEndpoint[21e42e7a-11fd-4360-a3ff-8def549f53b7,TcpEndpoint[192.168.1.17:53382]]; nested exception is:
      	java.net.NoRouteToHostException: No route to host
      	at net.jini.jeri.BasicInvocationHandler.wrapSafeIOException(BasicInvocationHandler.java:893)
      	at net.jini.jeri.BasicInvocationHandler.invokeRemoteMethodOnce(BasicInvocationHandler.java:711)
      	at net.jini.jeri.BasicInvocationHandler.invokeRemoteMethod(BasicInvocationHandler.java:659)
      	at net.jini.jeri.BasicInvocationHandler.invoke(BasicInvocationHandler.java:528)
      	at com.sun.jini.reggie.$Proxy2.renewServiceLease(Unknown Source)
      	at com.sun.jini.reggie.ServiceLease.doRenew(ServiceLease.java:85)
      	at com.sun.jini.lease.AbstractLease.renew(AbstractLease.java:83)
      	at net.jini.lease.LeaseRenewalManager.renewAll(LeaseRenewalManager.java:1404)
      	at net.jini.lease.LeaseRenewalManager.access$500(LeaseRenewalManager.java:319)
      	at net.jini.lease.LeaseRenewalManager$RenewTask.run(LeaseRenewalManager.java:451)
      	at com.sun.jini.thread.TaskManager$TaskThread.run(TaskManager.java:331)
      Caused by: java.net.NoRouteToHostException: No route to host
      	at java.net.PlainSocketImpl.socketConnect(Native Method)
      	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
      	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
      	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
      	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
      	at java.net.Socket.connect(Socket.java:579)
      	at net.jini.jeri.tcp.TcpEndpoint$ConnectionEndpointImpl.connectToSocketAddress(TcpEndpoint.java:678)
      	at net.jini.jeri.tcp.TcpEndpoint$ConnectionEndpointImpl.connectToHost(TcpEndpoint.java:608)
      	at net.jini.jeri.tcp.TcpEndpoint$ConnectionEndpointImpl.connect(TcpEndpoint.java:543)
      	at net.jini.jeri.connection.ConnectionManager.connect(ConnectionManager.java:242)
      	at net.jini.jeri.connection.ConnectionManager$ReqIterator.next(ConnectionManager.java:664)
      	at net.jini.jeri.BasicObjectEndpoint$1.next(BasicObjectEndpoint.java:370)
      	at net.jini.jeri.BasicInvocationHandler.invokeRemoteMethodOnce(BasicInvocationHandler.java:708)
      	... 9 more
      

      No relevant messages were found on the console, LUS, or dmesg for bigdata17.

      [1] http://sourceforge.net/apps/trac/bigdata/ticket/679 (HAJournalServer can not restart due to logically empty log file)

        Activity

          People

          • Assignee:
            bryanthompson bryanthompson
            Reporter:
            bryanthompson bryanthompson
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved: