Details

      Description

      A deadlock was observed in HA after running the BSBM 100M UPDATE mixture for several hours. The deadlock is related to the release time consensus protocol for HA. However, the code path involved in the deadlock is also a known hot spot for the standalone deployment model.

      The following method

      {{{
      DefaultResourceLocator.locateResourceOn(final IIndexManager indexManager, final String namespace, final long timestamp)
      }}}

      uses

      {{{
      AbstractJournal.getCommitRecord(long timestamp)
      }}}

      to obtain the commitTime corresponding to the timestamp. The latter method is a hot spot since it requires the field read/write lock for the AbstractJournal, and access to the CommitRecordIndex must be mutually exclusive between readers and mutation (concurrent readers are allowed). However, this method is most commonly invoked using a read-only tx, and that is in fact true for the observed deadlock, since the resource is being located for a QUERY, as shown by the following stack traces:

      This thread was blocked on AbstractJournal.fieldReadWriteLock while holding the NamedLock in DefaultResourceLocator.

      	at com.bigdata.journal.AbstractJournal.getCommitRecord(AbstractJournal.java:4131) <== DEADLOCK
      	at com.bigdata.relation.locator.DefaultResourceLocator.locateResourceOn(DefaultResourceLocator.java:655)
      	at com.bigdata.relation.locator.DefaultResourceLocator.locateResource(DefaultResourceLocator.java:560)
      	at com.bigdata.relation.locator.DefaultResourceLocator.cacheMiss(DefaultResourceLocator.java:381)
      	at com.bigdata.relation.locator.DefaultResourceLocator.locate(DefaultResourceLocator.java:333)
      	at com.bigdata.rdf.sail.BigdataSail$BigdataSailReadOnlyConnection.newTx(BigdataSail.java:4104)
      	at com.bigdata.rdf.sail.BigdataSail$BigdataSailReadOnlyConnection.<init>(BigdataSail.java:3986)
      

      New read-only queries (such as the following trace) were unable to start since the NamedLock was held by the thread above:

      	at com.bigdata.concurrent.NamedLock.acquireLock(NamedLock.java:106) <== DEADLOCK
      	at com.bigdata.relation.locator.DefaultResourceLocator.locate(DefaultResourceLocator.java:313)
      	at com.bigdata.rdf.sail.BigdataSail$BigdataSailReadOnlyConnection.newTx(BigdataSail.java:4104)
      	at com.bigdata.rdf.sail.BigdataSail$BigdataSailReadOnlyConnection.<init>(BigdataSail.java:3986)
      

      The HA commit protocol was causing the deadlock: the thread performing the commit was blocked at the CyclicBarrier in the release time consensus protocol while holding the monitors shown below:

      	at java.util.concurrent.CyclicBarrier.await(CyclicBarrier.java:399)
      	at com.bigdata.journal.Journal$BarrierState.messageFollowers(Journal.java:708)
      	at com.bigdata.journal.Journal$BarrierState.access$700(Journal.java:314)
      	at com.bigdata.journal.Journal$InnerJournalTransactionService.updateReleaseTimeConsensus(Journal.java:1090)
      	at com.bigdata.journal.AbstractJournal.commitNow(AbstractJournal.java:2937)
      	at com.bigdata.journal.AbstractJournal.commit(AbstractJournal.java:2797)
      	at com.bigdata.rdf.store.LocalTripleStore.commit(LocalTripleStore.java:80) <== DEADLOCK : LocalTripleStore.this
      	- locked <0x000000070358f3a0> (a com.bigdata.rdf.store.LocalTripleStore)
      	at com.bigdata.rdf.sail.BigdataSail$BigdataSailConnection.commit2(BigdataSail.java:2977) <== DEADLOCK: BigdataSail.this
      	- locked <0x0000000782cf3fd8> (a com.bigdata.rdf.sail.BigdataSail$BigdataSailConnection)
      	at com.bigdata.rdf.sail.BigdataSailRepositoryConnection.commit2(BigdataSailRepositoryConnection.java:272)
      

      Also, new UPDATEs could not run since the monitor for the unisolated view of the LocalTripleStore was held by the thread requesting the commit (which is shown immediately above):

      	at com.bigdata.rdf.store.AbstractTripleStore.getLexiconRelation(AbstractTripleStore.java:1944)
      	- waiting to lock <0x000000070358f3a0> (a com.bigdata.rdf.store.LocalTripleStore) <== DEADLOCK : LocalTripleStore.this
      	at com.bigdata.rdf.sail.sparql.BigdataASTContext.<init>(BigdataASTContext.java:96)
      	at com.bigdata.rdf.sail.sparql.Bigdata2ASTSPARQLParser.<init>(Bigdata2ASTSPARQLParser.java:100)
      	at com.bigdata.rdf.sail.webapp.BigdataRDFContext.getQueryTask(BigdataRDFContext.java:1775)
      	at com.bigdata.rdf.sail.webapp.QueryServlet.doUpdate(QueryServlet.java:345)
      	at com.bigdata.rdf.sail.webapp.QueryServlet.doPost(QueryServlet.java:151)
      	at com.bigdata.rdf.sail.webapp.RESTServlet.doPost(RESTServlet.java:197)
      

      Looking at the followers, I see the 2nd follower at:

      	at $Proxy2.notifyEarliestCommitTime(Unknown Source)
      	at com.bigdata.journal.Journal$InnerJournalTransactionService$GatherTask.call(Journal.java:1392)
      	at com.bigdata.journal.Journal$InnerJournalTransactionService$GatherTask.call(Journal.java:1337)
      

      and the first follower is not running a GatherTask at all.

      This last observation leads me to wonder whether the problem had to do with a non-atomic decision concerning the #of parties that would participate in the release time consensus protocol, or whether the code running on the HA3 cluster was not monitoring the quorum state for a quorum break or service leave.

      Thus, I think that this issue was misdiagnosed. The problem is that new transaction starts block on getCommitRecordIndex() in the DefaultResourceLocator during commitNow(). However, that contention can be removed by using the LocalTransactionManager to resolve the ITx for the timestamp and then taking the readsOnCommitTime from that ITx. This should increase query throughput on the leader when concurrent UPDATEs are running.
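
      For context, the contended path being avoided looks roughly like the sketch below. The names are hypothetical stand-ins, not the actual AbstractJournal/CommitRecordIndex code: the point is simply that each new read-only tx start takes the journal's read lock and then coordinates access to the commit record index, so it serializes against a concurrent commit that holds the write lock.

      {{{
      import java.util.Map;
      import java.util.TreeMap;
      import java.util.concurrent.locks.ReentrantReadWriteLock;

      // Illustrative sketch only: hypothetical stand-ins, not the bigdata classes.
      class CommitRecordLookupSketch {

          // Stand-in for AbstractJournal's field read/write lock.
          private final ReentrantReadWriteLock fieldReadWriteLock = new ReentrantReadWriteLock();

          // Stand-in for the CommitRecordIndex: commitTime -> commit record address.
          private final TreeMap<Long, Long> commitRecordIndex = new TreeMap<Long, Long>();

          // Resolve the most recent commit record with commitTime <= timestamp.
          Long getCommitRecord(final long timestamp) {
              // Blocks whenever a commit holds the write lock, so every new
              // read-only tx start is serialized against commit processing.
              fieldReadWriteLock.readLock().lock();
              try {
                  // Readers and mutation on the index must be mutually exclusive;
                  // synchronized is used here only to keep the sketch short.
                  synchronized (commitRecordIndex) {
                      final Map.Entry<Long, Long> e = commitRecordIndex.floorEntry(timestamp);
                      return e == null ? null : e.getValue();
                  }
              } finally {
                  fieldReadWriteLock.readLock().unlock();
              }
          }
      }
      }}}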

      See https://sourceforge.net/apps/trac/bigdata/ticket/673 (DGC in release time consensus protocol causes native thread leak in HAJournalServer at each commit)

        Activity

        bryanthompson added a comment -

        Added non-blocking fast path to DefaultResourceLocator:

                        /*
                         * Note: Using the local transaction manager to resolve a
                         * read-only transaction identifier to the ITx object and then
                         * retrieving the readsOnCommitTime from that ITx is a
                         * non-blocking code path. It is preferable to looking up the
                         * commitRecord in the CommitRecordIndex, which can be
                         * contended. This non-blocking code path will handle any
                         * read-only tx, but it will not be able to locate a timestamp
                         * not protected by a tx.
                         */

                        final ITx tx = journal.getTransactionManager().getTx(
                                timestamp);

                        if (tx != null) {

                            // Fast path.
                            commitTime2 = readTime = tx.getReadsOnCommitTime();

                        } else {

        Committed revision r7140.

        bryanthompson added a comment -

        I have re-run BSBM 100M UPDATE (on the leader) with concurrent BSBM EXPLORE (on the followers). No deadlock was observed with 120 presentations of each benchmark executing concurrently against the HA3 cluster.

        Another deadlock was observed against r7142. However, this time the followers were able to continue to make progress against their EXPLORE workloads. I was also able to trace this deadlock to its root cause.

        Another deadlock on the HA3 cluster with UPDATE (leader) and EXPLORE (followers).

        See https://sourceforge.net/apps/trac/bigdata/ticket/677 (HA deadlock under UPDATE + QUERY)

        1. The followers are continuing to make progress on queries.

        2. The following query only appears to fail when the leader is running UPDATE. Perhaps this is related to the release time?

        PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
        PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
        PREFIX dc: <http://purl.org/dc/elements/1.1/>
        
        SELECT DISTINCT ?offer ?price
        WHERE {
            ?offer bsbm:product <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer5082/Product257215> .
            ?offer bsbm:vendor ?vendor .
            ?offer dc:publisher ?vendor .
            ?vendor bsbm:country <http://downlode.org/rdf/iso-3166/countries#US> .
            ?offer bsbm:deliveryDays ?deliveryDays .
            FILTER (?deliveryDays <= 3)
            ?offer bsbm:price ?price .
            ?offer bsbm:validTo ?date .
            FILTER (?date > "2008-06-20T00:00:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> )
        }
        ORDER BY xsd:double(str(?price))
        LIMIT 10
        

        3. The leader is deadlocked while attempting to execute:

        DELETE WHERE
        { <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromVendor1589/Offer3146763> ?p ?o }
        

        4. Review of the code shows only two threads waiting at the CyclicBarrier. This is the same symptom as the previous stack trace. I connected to the JVM with a debugger and set a breakpoint on the ScheduledFuture that monitors the quorum state while waiting on the CyclicBarrier. I observed that the barrier was set up for 3 parties, which is what was expected.
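
        As a rough sketch of the monitoring pattern referenced here (the names BarrierQuorumMonitorSketch, expectedParties, and joinedServices are hypothetical, not the Journal$BarrierState code): a scheduled task periodically checks the joined services while the leader waits on the CyclicBarrier, and resets the barrier if an expected party has left.

        {{{
        import java.util.Set;
        import java.util.UUID;
        import java.util.concurrent.CyclicBarrier;
        import java.util.concurrent.ScheduledExecutorService;
        import java.util.concurrent.ScheduledFuture;
        import java.util.concurrent.TimeUnit;
        import java.util.function.Supplier;

        // Illustrative sketch only: hypothetical names, not the bigdata code.
        class BarrierQuorumMonitorSketch {

            static ScheduledFuture<?> monitor(final CyclicBarrier barrier,
                    final Set<UUID> expectedParties,
                    final Supplier<Set<UUID>> joinedServices,
                    final ScheduledExecutorService scheduler) {

                return scheduler.scheduleWithFixedDelay(() -> {
                    if (!joinedServices.get().containsAll(expectedParties)) {
                        // A joined service left the met quorum (or the quorum broke):
                        // reset the barrier so threads blocked in await() fail fast
                        // with BrokenBarrierException instead of hanging the commit.
                        barrier.reset();
                    }
                }, 0 /* initialDelay */, 100 /* delay */, TimeUnit.MILLISECONDS);
            }
        }
        }}}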

        The serviceIds are as follows, where A=leader=bigdata17,
        B=bigdata16 (1st follower), and C=bigdata15 (2nd follower):

           A: 48c82ba6-0627-49e7-a598-56ff3f183c8b
           B: 0754bf65-2931-4603-b7c9-c41032d74c36
           C: 4b1c9327-76c9-451b-abf6-d0bad0d989e8
        

        Based on the leadersValue and the responses, (B) has not reported back yet.

           leadersValue: com.bigdata.ha.msg.HANotifyReleaseTimeRequest@9008c59{serviceUUID=48c82ba6-0627-49e7-a598-56ff3f183c8b,pinnedCommitTime=1368817072774,pinnedCommitCounter=53128,timestamp=1368817073043}
        
           responses: {4b1c9327-76c9-451b-abf6-d0bad0d989e8=com.bigdata.ha.msg.HANotifyReleaseTimeRequest@6318ff42{serviceUUID=4b1c9327-76c9-451b-abf6-d0bad0d989e8,pinnedCommitTime=1368817072774,pinnedCommitCounter=53128,timestamp=1368817073076}}
        

        I then connected to B (bigdata16) using a debugger and set a breakpoint on BasicHA.getHAStatus(). I triggered the breakpoint using GET ?/status on that server. This allowed me to inspect the gatherFuture, which was non-null. Drilling down, I observed that the GatherTask had encountered an error:

        com.bigdata.util.ClocksNotSynchronizedException
        

        This appears to be the root cause for this ticket. Some amount of clock skew caused an exception in the GatherTask, which then failed to message the leader; the leader remained blocked on the CyclicBarrier. The GatherTask must respond no matter what.
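
        A minimal sketch of that "always respond" rule is below. The types LeaderProxy and ReleaseTimeResponse are hypothetical stand-ins (the real protocol uses HANotifyReleaseTimeRequest and notifyEarliestCommitTime()); the point is simply that the gather must report back even when it fails, so the leader's barrier can break.

        {{{
        import java.util.concurrent.Callable;

        // Illustrative sketch only: hypothetical types, not the bigdata GatherTask.
        class GatherTaskSketch implements Callable<Void> {

            interface LeaderProxy { // hypothetical stand-in for the leader's proxy
                void notifyEarliestCommitTime(ReleaseTimeResponse response);
            }

            static class ReleaseTimeResponse { // hypothetical stand-in for the response message
                final long pinnedCommitTime;
                final Throwable cause;
                ReleaseTimeResponse(final long pinnedCommitTime, final Throwable cause) {
                    this.pinnedCommitTime = pinnedCommitTime;
                    this.cause = cause;
                }
            }

            private final LeaderProxy leader;

            GatherTaskSketch(final LeaderProxy leader) {
                this.leader = leader;
            }

            public Void call() throws Exception {
                ReleaseTimeResponse response;
                try {
                    // Gather the local pinned commit time; this may throw, e.g.
                    // on a clock skew check (ClocksNotSynchronizedException).
                    response = new ReleaseTimeResponse(gatherPinnedCommitTime(), null);
                } catch (Throwable t) {
                    // Never fail silently: report the error to the leader so the
                    // barrier still breaks and the commit can be rejected.
                    response = new ReleaseTimeResponse(-1L, t);
                }
                leader.notifyEarliestCommitTime(response);
                return null;
            }

            private long gatherPinnedCommitTime() {
                return System.currentTimeMillis(); // placeholder for the real gather logic
            }
        }
        }}}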

        I was able to cure the deadlock by restarting one of the HAJournalServer instances (I restarted the 2nd follower, bigdata15, but it should not matter which one is restarted, since the leader monitors the quorum state and will break the CyclicBarrier if a joined service leaves the met quorum while it is waiting on the barrier). This caused the commit to fail (it should have succeeded since there were still two joined services, but this is a known flaw), at which point the UPDATE workload resumed.

        bryanthompson added a comment -

        Bug fix for the consensus releaseTime protocol. The GatherTask was hitting a clock skew error and then failing to message the leader with an appropriate response, so the barrier would not break and the commit could neither proceed nor be rejected.

        Committed Revision r7145.

        bryanthompson added a comment -

        AbstractJournal: Added logic to make AbstractJournal.commit() spin for up to 5000ms if required to obtain a commitTime that moves forward with respect to the then current lastCommitTime. This addresses a potential source of problems with clock skew in HA and could permit us to allow more skew to be present among the nodes of an HA cluster (though I have not yet relaxed that constraint in Journal.assertBefore()).
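
        A minimal sketch of that spin behavior, assuming a hypothetical helper method (not the actual AbstractJournal.commit() code) and using the 5000ms budget mentioned above:

        {{{
        // Illustrative sketch only: hypothetical helper, not AbstractJournal.commit().
        final class CommitTimeSketch {

            // Wait up to 5000ms for a timestamp strictly greater than lastCommitTime.
            static long nextCommitTime(final long lastCommitTime) throws InterruptedException {
                final long deadlineNanos = System.nanoTime() + 5000L * 1000000L; // 5000ms budget
                long t = System.currentTimeMillis();
                while (t <= lastCommitTime) {
                    if (System.nanoTime() >= deadlineNanos) {
                        throw new RuntimeException(
                                "Clock did not advance past lastCommitTime=" + lastCommitTime);
                    }
                    Thread.sleep(1); // small skew: just wait the clock out
                    t = System.currentTimeMillis();
                }
                return t; // commitTime moves forward w.r.t. lastCommitTime
            }
        }
        }}}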

        AbstractJournal: added getRootBlockWithLock() variant.

        HAJournalServer.awaitJournalToken(): modified to use getRootBlockWithLock() to ensure atomicity of the state changes that awaitJournalToken() is monitoring.

        Journal: More cleanup on the consensus release time protocol. These changes improve reporting of the root cause of an exception in the release time consensus protocol and shift reporting of clock synchronization problems entirely to the leader (based on the timestamps obtained on the leader and the followers).
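
        As a rough sketch of that leader-side check, along the lines of the assertBefore() constraint mentioned above (the tolerance value and helper name here are assumptions, not the actual constants):

        {{{
        // Illustrative sketch only: assumed tolerance and names, not Journal.assertBefore().
        final class ClockSkewCheckSketch {

            static final long MAX_SKEW_MILLIS = 5000; // assumed tolerance, not the real constant

            // Leader-side check that t1 (e.g. the leader's timestamp) is not more
            // than the allowed skew ahead of t2 (e.g. a follower's timestamp).
            static void assertBefore(final long t1, final long t2) {
                if (t1 - t2 > MAX_SKEW_MILLIS) {
                    throw new RuntimeException(
                            "Clocks not synchronized: t1=" + t1 + ", t2=" + t2);
                }
            }
        }
        }}}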

        @see https://sourceforge.net/apps/trac/bigdata/ticket/530 (Journal HA)

        Committed revision r7151.


          People

          • Assignee:
            bryanthompson
          • Reporter:
            bryanthompson
          • Votes:
            0
          • Watchers:
            2
