Details

      Description

      This ticket is in response to https://phabricator.wikimedia.org/T97468

      The Wikidata development group had been running a long-running update test against Blazegraph 1.5.1. They recently deleted a large namespace through the UI (the workbench, I assume). An error was reported but not saved. Sometime thereafter, SPARQL UPDATE requests began failing with the trace below. Unfortunately the log files were not preserved. Group commit was not enabled. Other details (journal properties, etc.) are on that Wikidata ticket.

      19:46:43.487 [main] DEBUG o.w.query.rdf.tool.rdf.RdfRepository - Setting last updated time to Tue Apr 28 16:47:33 UTC 2015
      Exception in thread "main" org.wikidata.query.rdf.tool.exception.ContainedException: Non-200 response from triple store:  HttpResponseProxy{HTTP/1.1 500 Server Er
      ror [Content-Type: text/plain, Transfer-Encoding: chunked, Server: Jetty(9.2.z-SNAPSHOT)] ResponseEntityProxy{[Content-Type: text/plain,Chunked: true]}} body=
      SPARQL-UPDATE: updateStr=DELETE {
        <http://www.wikidata.org> <http://schema.org/dateModified> ?o .
      }
      WHERE {
        <http://www.wikidata.org> <http://schema.org/dateModified> ?o .
      };
      INSERT DATA {
        <http://www.wikidata.org> <http://schema.org/dateModified> "2015-04-28T16:47:33.000Z"^^<xsd:dateTime> .
      }
      
      java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: org.openrdf.query.UpdateExecutionException: java.lang.RuntimeException: Problem 
      with entry at -711345699177889370: lastRootBlock=rootBlock{ rootBlock=0, challisField=2740925, version=3, nextOffset=3775731623797917, localTime=1430247841782 [Tu
      esday, April 28, 2015 7:04:01 PM UTC], firstCommitTime=1428692901050 [Friday, April 10, 2015 7:08:21 PM UTC], lastCommitTime=1430247841781 [Tuesday, April 28, 201
      5 7:04:01 PM UTC], commitCounter=2740925, commitRecordAddr={off=NATIVE:-165623078,len=422}, commitRecordIndexAddr={off=NATIVE:-155674208,len=220}, blockSequence=1
      , quorumToken=-1, metaBitsAddr=2455328113296235, metaStartAddr=952388, storeType=RW, uuid=f8287971-835c-4bb1-9b83-215d38565a61, offsetBits=42, checksum=-498521427
      , createTime=1428691460132 [Friday, April 10, 2015 6:44:20 PM UTC], closeTime=0}
              at java.util.concurrent.FutureTask.report(FutureTask.java:122)
              at java.util.concurrent.FutureTask.get(FutureTask.java:188)
              at com.bigdata.rdf.sail.webapp.BigdataServlet.submitApiTask(BigdataServlet.java:261)
              at com.bigdata.rdf.sail.webapp.QueryServlet.doSparqlUpdate(QueryServlet.java:359)
              at com.bigdata.rdf.sail.webapp.QueryServlet.doPost(QueryServlet.java:165)
              at com.bigdata.rdf.sail.webapp.RESTServlet.doPost(RESTServlet.java:237)
              at com.bigdata.rdf.sail.webapp.MultiTenancyServlet.doPost(MultiTenancyServlet.java:137)
              at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
              at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
              at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769)
              at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
              at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
              at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
              at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
              at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125)
              at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
              at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
              at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059)
              at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
              at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
              at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
              at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
              at org.eclipse.jetty.server.Server.handle(Server.java:497)
              at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311)
              at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248)
              at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
              at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610)
              at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539)
              at java.lang.Thread.run(Thread.java:745)
      Caused by: java.util.concurrent.ExecutionException: org.openrdf.query.UpdateExecutionException: java.lang.RuntimeException: Problem with entry at -711345699177889370: lastRootBlock=rootBlock{ rootBlock=0, challisField=2740925, version=3, nextOffset=3775731623797917, localTime=1430247841782 [Tuesday, April 28, 2015 7:04:01 PM UTC], firstCommitTime=1428692901050 [Friday, April 10, 2015 7:08:21 PM UTC], lastCommitTime=1430247841781 [Tuesday, April 28, 2015 7:04:01 PM UTC], commitCounter=2740925, commitRecordAddr={off=NATIVE:-165623078,len=422}, commitRecordIndexAddr={off=NATIVE:-155674208,len=220}, blockSequence=1, quorumToken=-1, metaBitsAddr=2455328113296235, metaStartAddr=952388, storeType=RW, uuid=f8287971-835c-4bb1-9b83-215d38565a61, offsetBits=42, checksum=-498521427, createTime=1428691460132 [Friday, April 10, 2015 6:44:20 PM UTC], closeTime=0}
              at java.util.concurrent.FutureTask.report(FutureTask.java:122)
              at java.util.concurrent.FutureTask.get(FutureTask.java:188)
              at com.bigdata.rdf.sail.webapp.QueryServlet$SparqlUpdateTask.call(QueryServlet.java:460)
              at com.bigdata.rdf.sail.webapp.QueryServlet$SparqlUpdateTask.call(QueryServlet.java:371)
              at com.bigdata.rdf.task.ApiTaskForIndexManager.call(ApiTaskForIndexManager.java:68)
              at java.util.concurrent.FutureTask.run(FutureTask.java:262)
              at com.bigdata.rdf.task.AbstractApiTask.submitApiTask(AbstractApiTask.java:365)
              at com.bigdata.rdf.sail.webapp.BigdataServlet.submitApiTask(BigdataServlet.java:258)
              ... 26 more   
      Caused by: org.openrdf.query.UpdateExecutionException: java.lang.RuntimeException: Problem with entry at -711345699177889370: lastRootBlock=rootBlock{ rootBlock=0, challisField=2740925, version=3, nextOffset=3775731623797917, localTime=1430247841782 [Tuesday, April 28, 2015 7:04:01 PM UTC], firstCommitTime=1428692901050 [Friday, April 10, 2015 7:08:21 PM UTC], lastCommitTime=1430247841781 [Tuesday, April 28, 2015 7:04:01 PM UTC], commitCounter=2740925, commitRecordAddr={off=NATIVE:-165623078,len=422}, commitRecordIndexAddr={off=NATIVE:-155674208,len=220}, blockSequence=1, quorumToken=-1, metaBitsAddr=2455328113296235, metaStartAddr=952388, storeType=RW, uuid=f8287971-835c-4bb1-9b83-215d38565a61, offsetBits=42, checksum=-498521427, createTime=1428691460132 [Friday, April 10, 2015 6:44:20 PM UTC], closeTime=0}
              at com.bigdata.rdf.sparql.ast.eval.ASTEvalHelper.executeUpdate(ASTEvalHelper.java:1303)
              at com.bigdata.rdf.sail.BigdataSailUpdate.execute2(BigdataSailUpdate.java:152)
              at com.bigdata.rdf.sail.webapp.BigdataRDFContext$UpdateTask.doQuery(BigdataRDFContext.java:1683)
              at com.bigdata.rdf.sail.webapp.BigdataRDFContext$AbstractQueryTask.innerCall(BigdataRDFContext.java:1310)
              at com.bigdata.rdf.sail.webapp.BigdataRDFContext$AbstractQueryTask.call(BigdataRDFContext.java:1275)
              at com.bigdata.rdf.sail.webapp.BigdataRDFContext$AbstractQueryTask.call(BigdataRDFContext.java:517)
              at java.util.concurrent.FutureTask.run(FutureTask.java:262)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
              ... 1 more
      Caused by: java.lang.RuntimeException: Problem with entry at -711345699177889370: lastRootBlock=rootBlock{ rootBlock=0, challisField=2740925, version=3, nextOffset=3775731623797917, localTime=1430247841782 [Tuesday, April 28, 2015 7:04:01 PM UTC], firstCommitTime=1428692901050 [Friday, April 10, 2015 7:08:21 PM UTC], lastCommitTime=1430247841781 [Tuesday, April 28, 2015 7:04:01 PM UTC], commitCounter=2740925, commitRecordAddr={off=NATIVE:-165623078,len=422}, commitRecordIndexAddr={off=NATIVE:-155674208,len=220}, blockSequence=1, quorumToken=-1, metaBitsAddr=2455328113296235, metaStartAddr=952388, storeType=RW, uuid=f8287971-835c-4bb1-9b83-215d38565a61, offsetBits=42, checksum=-498521427, createTime=1428691460132 [Friday, April 10, 2015 6:44:20 PM UTC], closeTime=0}
              at com.bigdata.journal.AbstractJournal.commit(AbstractJournal.java:3113)
              at com.bigdata.rdf.store.LocalTripleStore.commit(LocalTripleStore.java:99)
              at com.bigdata.rdf.sail.BigdataSail$BigdataSailConnection.commit2(BigdataSail.java:3370)
              at com.bigdata.rdf.sail.BigdataSailRepositoryConnection.commit2(BigdataSailRepositoryConnection.java:272)
              at com.bigdata.rdf.sparql.ast.eval.AST2BOpUpdate.convertCommit(AST2BOpUpdate.java:353)
              at com.bigdata.rdf.sparql.ast.eval.AST2BOpUpdate.convertUpdate(AST2BOpUpdate.java:307)
              at com.bigdata.rdf.sparql.ast.eval.ASTEvalHelper.executeUpdate(ASTEvalHelper.java:1295)
              ... 9 more
      Caused by: java.lang.RuntimeException: Problem with entry at -711345699177889370
              at com.bigdata.rwstore.RWStore.freeDeferrals(RWStore.java:4967)
              at com.bigdata.rwstore.RWStore.checkDeferredFrees(RWStore.java:3539)
              at com.bigdata.journal.RWStrategy.checkDeferredFrees(RWStrategy.java:781)
              at com.bigdata.journal.AbstractJournal$CommitState.writeCommitRecord(AbstractJournal.java:3476)
              at com.bigdata.journal.AbstractJournal$CommitState.access$2800(AbstractJournal.java:3278)
              at com.bigdata.journal.AbstractJournal.commitNow(AbstractJournal.java:4088)
              at com.bigdata.journal.AbstractJournal.commit(AbstractJournal.java:3111)
              ... 15 more   
      Caused by: java.lang.RuntimeException: addr=-157604858 : cause=java.lang.IllegalStateException: Bad Address: length requested greater than allocated slot
              at com.bigdata.rwstore.RWStore.getData(RWStore.java:2190)
              at com.bigdata.rwstore.RWStore.getData(RWStore.java:1989)
              at com.bigdata.rwstore.RWStore.getData(RWStore.java:2033)
              at com.bigdata.rwstore.RWStore.getData(RWStore.java:1989)
              at com.bigdata.rwstore.RWStore.freeDeferrals(RWStore.java:4857)
              at com.bigdata.rwstore.RWStore.freeDeferrals(RWStore.java:4953)
              ... 21 more   
      Caused by: java.lang.IllegalStateException: Bad Address: length requested greater than allocated slot
              at com.bigdata.rwstore.RWStore.getData(RWStore.java:2082)
              ... 26 more   
      
              at org.wikidata.query.rdf.tool.rdf.RdfRepository.execute(RdfRepository.java:296)
              at org.wikidata.query.rdf.tool.rdf.RdfRepository.updateLeftOffTime(RdfRepository.java:255)
              at org.wikidata.query.rdf.tool.Update.run(Update.java:235)
              at org.wikidata.query.rdf.tool.Update.main(Update.java:100)
      

      ----

      TODO:


      - Done: add an option to discard the deferred deletes so people can continue to write if there is a problem with the deferred deletes (a utility class was written).


      - Create/modify a long-running test focused on mutation (add/delete, exercising not just our REST API but also SPARQL UPDATE); a minimal driver sketch follows this list.


      - Modify that test to attempt to cancel the DELETE of large namespaces and verify that we properly discard the write set. I suspect the bug is a failure to discard the write set when the namespace delete fails: a subsequent UPDATE request then folds the write set of the failed namespace DELETE into that update, so the failed delete is never properly rolled back. Things to be suspicious of include the acquisition and release of the global lock on the journal and the error-handling paths around the mutation associated with the namespace delete.


      - We can also write tests focused more on interrupting namespace create/delete operations, perhaps with a concurrent writer task on some other namespace, to verify that the create/delete semantics remain ACID.


      - Why would BLZG-181 not have caught this problem? (BLZG-181 only protects against a call to abort() that fails. If there is a write set and either (a) a commit fails and the code does not then go on to invoke abort(), or (b) the code simply leaves the scope in which the write set was applied without either committing or aborting, then buffered writes can eventually become durable. In BLZG-1079 the situation was (a): the mutations were not correctly discarded because abort() was not called, which led to the same symptoms as in this ticket. A sketch of the expected commit-or-rollback discard pattern appears after the reference list below.)


      - Look at TestHA1GroupCommit (in particular test_HA1_GroupCommit_createDestroy_ManyNamespaces()) and TestHA3GroupCommit.
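
      A minimal driver sketch for the mutation-focused test above: it repeatedly inserts and deletes triples via SPARQL UPDATE over HTTP so that commits and deferred frees are exercised continuously. Illustration only; the endpoint URL, namespace, and iteration count are assumptions and would need to match the deployment under test.

      import java.io.OutputStream;
      import java.net.HttpURLConnection;
      import java.net.URL;
      import java.net.URLEncoder;
      import java.nio.charset.StandardCharsets;

      // Sketch of a SPARQL UPDATE mutation loop (endpoint and namespace are assumptions).
      public class UpdateStressSketch {

          private static final String ENDPOINT =
                  "http://localhost:9999/blazegraph/namespace/kb/sparql"; // assumption

          public static void main(final String[] args) throws Exception {
              for (int i = 0; i < 100_000; i++) {
                  final String subject = "<http://example.org/stress/item" + i + ">";
                  // Insert a triple, then delete it again, so every iteration mutates the store.
                  post("INSERT DATA { " + subject + " <http://example.org/p> \"v" + i + "\" . }");
                  post("DELETE WHERE { " + subject + " ?p ?o . }");
              }
          }

          // POST a SPARQL UPDATE as a form-encoded 'update' parameter.
          private static void post(final String update) throws Exception {
              final HttpURLConnection con =
                      (HttpURLConnection) new URL(ENDPOINT).openConnection();
              con.setRequestMethod("POST");
              con.setDoOutput(true);
              con.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
              final byte[] body = ("update=" + URLEncoder.encode(update, "UTF-8"))
                      .getBytes(StandardCharsets.UTF_8);
              try (OutputStream os = con.getOutputStream()) {
                  os.write(body);
              }
              final int rc = con.getResponseCode();
              if (rc != 200) {
                  throw new RuntimeException("Non-200 response from triple store: " + rc);
              }
              con.disconnect();
          }
      }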

      See https://phabricator.wikimedia.org/T97468 (original wikidata ticket)
      See BLZG-201 SPARQL UPDATE with runtime errors causes problems with lexicon indices (similar symptom)
      See TestNanoSparqlClient._test967() in 1.5.1
      See StressTest_REST_MultiTenancy in 1.5.2 development master (current version of the same stress test)
      See TestConcurrentKBCreate for a concurrency test of CreateKB and some tests of DestroyKB
      See BLZG-1073 Add the ability to track and cancel all queries issued through a BigdataSailRemoteRepositoryConnection
      See BLZG-1082 RWStore commit state not correctly rolled back if abort fails on empty journal
      See BLZG-1083 RWStorage stats cleanup
      See BLZG-181 Add critical section protection to AbstractJournal.abort() and BigdataSailConnection.rollback()
      See BLZG-206 Code review and concepts for provably correctness for abort code paths
      See BLZG-1255 AbstractJournal has RWStrategy tests where it should test on IAllocationManager
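
      For reference, a minimal sketch of the commit-or-rollback discard contract that the namespace-delete items above are probing, written against the Sesame/openrdf connection API used by the Blazegraph SAIL. The repository setup and the example statement are assumptions; this illustrates the expected client-visible contract, not the actual server-side code path.

      import org.openrdf.model.ValueFactory;
      import org.openrdf.repository.Repository;
      import org.openrdf.repository.RepositoryConnection;
      import org.openrdf.repository.RepositoryException;

      // Sketch: if the commit fails (or the mutation is interrupted), the write set must be
      // rolled back before the connection is released; otherwise a later commit on the same
      // journal can make the buffered writes durable.
      public class DiscardPatternSketch {

          public static void applyMutation(final Repository repo) throws RepositoryException {
              final RepositoryConnection cx = repo.getConnection();
              try {
                  cx.begin();
                  final ValueFactory vf = cx.getValueFactory();
                  cx.add(vf.createURI("http://example.org/s"),
                         vf.createURI("http://example.org/p"),
                         vf.createLiteral("o"));
                  cx.commit();
              } catch (final Throwable t) {
                  // Critical: discard the write set on ANY failure, including a failed
                  // commit, so it cannot be folded into a later commit.
                  cx.rollback();
                  throw new RepositoryException(t);
              } finally {
                  cx.close();
              }
          }
      }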


          Activity

          jjc Jeremy Carroll added a comment -

          I believe the disconnect happened during an operation that would have deleted a few triples (roughly 30), and the DROP GRAPH came later, which is when we actually see the issue. The evidence is that we have two stack traces, one for the disconnect and one for an abort, in sequence but without timestamps.

          martyncutcher martyncutcher added a comment -

          I have identified a problem restoring large deferred free lists: the data length was decoded incorrectly because of an incorrect bit mask.
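
          An illustration of this class of bug, using a hypothetical packed header layout rather than the actual RWStore blob-header format: masking a decoded length with too few bits silently truncates large values, which then surfaces downstream as a "length requested greater than allocated slot" failure.

          // Illustration only: hypothetical header layout, not the RWStore format.
          public class BitMaskDecodeDemo {

              // Hypothetical layout: low 16 bits hold flags, the upper bits hold the data length.
              static long encode(final int flags, final int length) {
                  return ((long) length << 16) | (flags & 0xFFFFL);
              }

              // BUG: the mask keeps only 16 bits of the length, so values >= 65536 wrap around.
              static int decodeLengthWrong(final long header) {
                  return (int) ((header >>> 16) & 0xFFFF);
              }

              // FIX: widen the mask so the full length survives the decode.
              static int decodeLengthRight(final long header) {
                  return (int) (header >>> 16);
              }

              public static void main(final String[] args) {
                  final long header = encode(0x1, 1_000_000); // a "large" deferred free list entry
                  System.out.println(decodeLengthWrong(header));  // 16960   (silently truncated)
                  System.out.println(decodeLengthRight(header));  // 1000000
              }
          }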

          bryanthompson bryanthompson added a comment -

          Problem has been diagnosed by Martyn as a bad bit mask in the decode of the blob headers.

          PR: https://github.com/SYSTAP/bigdata/pull/182

          CI is good.

          Code review is done. I linked the fix back to the ticket and also added documentation on the unit test that was able to provide a reproducer for the ticket.

          Code review commit is 134a75913ac95aacdc3082e14621b4e705a1b3a8 in github branch BLAZEGRAPH_DEFER_FREE_FIX_1236.

          bradbebee Merge to master.

          bryanthompson bryanthompson added a comment -

          I have substantially cleaned up this ticket, moving a number of items into:

          BLZG-192 (group commit)
          BLZG-1574 (RWStore / Journal / Tx relayering)
          BLZG-1575 (small slots optimization issues)

          The root cause bug for this ticket has been resolved.

          There is still an open task to provide better reporting for the free list and allocator consistency verification (BLZG-1573).

          bryanthompson bryanthompson added a comment -

          Moved the remaining ticket into the next release.


            People

            • Assignee: martyncutcher
              Reporter: bryanthompson
            • Votes: 0
              Watchers: 6
