Uploaded image for project: 'Blazegraph (by SYSTAP)'
  1. Blazegraph (by SYSTAP)
  2. BLZG-1236

Recycler error in 1.5.1

    XMLWordPrintable

    Details

      Description

      This ticket is in response to https://phabricator.wikimedia.org/T97468

      The wikidata development group had been running a long running update test against blazegraph 1.5.1. They recently deleted a large namespace through the UI (workbench I assume). An error was reported, but not saved. Sometime thereafter SPARQL UPDATE requests began failing with the following trace. Unfortunately the log files were not preserved. Group commit was not enabled. Other details are on that wikidata ticket (journal properties, etc).

      19:46:43.487 [main] DEBUG o.w.query.rdf.tool.rdf.RdfRepository - Setting last updated time to Tue Apr 28 16:47:33 UTC 2015
      Exception in thread "main" org.wikidata.query.rdf.tool.exception.ContainedException: Non-200 response from triple store:  HttpResponseProxy{HTTP/1.1 500 Server Er
      ror [Content-Type: text/plain, Transfer-Encoding: chunked, Server: Jetty(9.2.z-SNAPSHOT)] ResponseEntityProxy{[Content-Type: text/plain,Chunked: true]}} body=
      SPARQL-UPDATE: updateStr=DELETE {
        <http://www.wikidata.org> <http://schema.org/dateModified> ?o .
      }
      WHERE {
        <http://www.wikidata.org> <http://schema.org/dateModified> ?o .
      };
      INSERT DATA {
        <http://www.wikidata.org> <http://schema.org/dateModified> "2015-04-28T16:47:33.000Z"^^<xsd:dateTime> .
      }
      
      java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: org.openrdf.query.UpdateExecutionException: java.lang.RuntimeException: Problem 
      with entry at -711345699177889370: lastRootBlock=rootBlock{ rootBlock=0, challisField=2740925, version=3, nextOffset=3775731623797917, localTime=1430247841782 [Tu
      esday, April 28, 2015 7:04:01 PM UTC], firstCommitTime=1428692901050 [Friday, April 10, 2015 7:08:21 PM UTC], lastCommitTime=1430247841781 [Tuesday, April 28, 201
      5 7:04:01 PM UTC], commitCounter=2740925, commitRecordAddr={off=NATIVE:-165623078,len=422}, commitRecordIndexAddr={off=NATIVE:-155674208,len=220}, blockSequence=1
      , quorumToken=-1, metaBitsAddr=2455328113296235, metaStartAddr=952388, storeType=RW, uuid=f8287971-835c-4bb1-9b83-215d38565a61, offsetBits=42, checksum=-498521427
      , createTime=1428691460132 [Friday, April 10, 2015 6:44:20 PM UTC], closeTime=0}
              at java.util.concurrent.FutureTask.report(FutureTask.java:122)
              at java.util.concurrent.FutureTask.get(FutureTask.java:188)
              at com.bigdata.rdf.sail.webapp.BigdataServlet.submitApiTask(BigdataServlet.java:261)
              at com.bigdata.rdf.sail.webapp.QueryServlet.doSparqlUpdate(QueryServlet.java:359)
              at com.bigdata.rdf.sail.webapp.QueryServlet.doPost(QueryServlet.java:165)
              at com.bigdata.rdf.sail.webapp.RESTServlet.doPost(RESTServlet.java:237)
              at com.bigdata.rdf.sail.webapp.MultiTenancyServlet.doPost(MultiTenancyServlet.java:137)
              at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
              at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
              at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769)
              at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
              at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
              at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
              at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
              at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125)
              at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
              at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
              at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059)
              at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
              at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
              at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
              at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
              at org.eclipse.jetty.server.Server.handle(Server.java:497)
              at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311)
              at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248)
              at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
              at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610)
              at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539)
              at java.lang.Thread.run(Thread.java:745)
      Caused by: java.util.concurrent.ExecutionException: org.openrdf.query.UpdateExecutionException: java.lang.RuntimeException: Problem with entry at -711345699177889370: lastRootBlock=rootBlock{ rootBlock=0, challisField=2740925, version=3, nextOffset=3775731623797917, localTime=1430247841782 [Tuesday, April 28, 2015 7:04:01 PM UTC], firstCommitTime=1428692901050 [Friday, April 10, 2015 7:08:21 PM UTC], lastCommitTime=1430247841781 [Tuesday, April 28, 2015 7:04:01 PM UTC], commitCounter=2740925, commitRecordAddr={off=NATIVE:-165623078,len=422}, commitRecordIndexAddr={off=NATIVE:-155674208,len=220}, blockSequence=1, quorumToken=-1, metaBitsAddr=2455328113296235, metaStartAddr=952388, storeType=RW, uuid=f8287971-835c-4bb1-9b83-215d38565a61, offsetBits=42, checksum=-498521427, createTime=1428691460132 [Friday, April 10, 2015 6:44:20 PM UTC], closeTime=0}
              at java.util.concurrent.FutureTask.report(FutureTask.java:122)
              at java.util.concurrent.FutureTask.get(FutureTask.java:188)
              at com.bigdata.rdf.sail.webapp.QueryServlet$SparqlUpdateTask.call(QueryServlet.java:460)
              at com.bigdata.rdf.sail.webapp.QueryServlet$SparqlUpdateTask.call(QueryServlet.java:371)
              at com.bigdata.rdf.task.ApiTaskForIndexManager.call(ApiTaskForIndexManager.java:68)
              at java.util.concurrent.FutureTask.run(FutureTask.java:262)
              at com.bigdata.rdf.task.AbstractApiTask.submitApiTask(AbstractApiTask.java:365)
              at com.bigdata.rdf.sail.webapp.BigdataServlet.submitApiTask(BigdataServlet.java:258)
              ... 26 more   
      Caused by: org.openrdf.query.UpdateExecutionException: java.lang.RuntimeException: Problem with entry at -711345699177889370: lastRootBlock=rootBlock{ rootBlock=0, challisField=2740925, version=3, nextOffset=3775731623797917, localTime=1430247841782 [Tuesday, April 28, 2015 7:04:01 PM UTC], firstCommitTime=1428692901050 [Friday, April 10, 2015 7:08:21 PM UTC], lastCommitTime=1430247841781 [Tuesday, April 28, 2015 7:04:01 PM UTC], commitCounter=2740925, commitRecordAddr={off=NATIVE:-165623078,len=422}, commitRecordIndexAddr={off=NATIVE:-155674208,len=220}, blockSequence=1, quorumToken=-1, metaBitsAddr=2455328113296235, metaStartAddr=952388, storeType=RW, uuid=f8287971-835c-4bb1-9b83-215d38565a61, offsetBits=42, checksum=-498521427, createTime=1428691460132 [Friday, April 10, 2015 6:44:20 PM UTC], closeTime=0}
              at com.bigdata.rdf.sparql.ast.eval.ASTEvalHelper.executeUpdate(ASTEvalHelper.java:1303)
              at com.bigdata.rdf.sail.BigdataSailUpdate.execute2(BigdataSailUpdate.java:152)
              at com.bigdata.rdf.sail.webapp.BigdataRDFContext$UpdateTask.doQuery(BigdataRDFContext.java:1683)
              at com.bigdata.rdf.sail.webapp.BigdataRDFContext$AbstractQueryTask.innerCall(BigdataRDFContext.java:1310)
              at com.bigdata.rdf.sail.webapp.BigdataRDFContext$AbstractQueryTask.call(BigdataRDFContext.java:1275)
              at com.bigdata.rdf.sail.webapp.BigdataRDFContext$AbstractQueryTask.call(BigdataRDFContext.java:517)
              at java.util.concurrent.FutureTask.run(FutureTask.java:262)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
              ... 1 more
      Caused by: java.lang.RuntimeException: Problem with entry at -711345699177889370: lastRootBlock=rootBlock{ rootBlock=0, challisField=2740925, version=3, nextOffset=3775731623797917, localTime=1430247841782 [Tuesday, April 28, 2015 7:04:01 PM UTC], firstCommitTime=1428692901050 [Friday, April 10, 2015 7:08:21 PM UTC], lastCommitTime=1430247841781 [Tuesday, April 28, 2015 7:04:01 PM UTC], commitCounter=2740925, commitRecordAddr={off=NATIVE:-165623078,len=422}, commitRecordIndexAddr={off=NATIVE:-155674208,len=220}, blockSequence=1, quorumToken=-1, metaBitsAddr=2455328113296235, metaStartAddr=952388, storeType=RW, uuid=f8287971-835c-4bb1-9b83-215d38565a61, offsetBits=42, checksum=-498521427, createTime=1428691460132 [Friday, April 10, 2015 6:44:20 PM UTC], closeTime=0}
              at com.bigdata.journal.AbstractJournal.commit(AbstractJournal.java:3113)
              at com.bigdata.rdf.store.LocalTripleStore.commit(LocalTripleStore.java:99)
              at com.bigdata.rdf.sail.BigdataSail$BigdataSailConnection.commit2(BigdataSail.java:3370)
              at com.bigdata.rdf.sail.BigdataSailRepositoryConnection.commit2(BigdataSailRepositoryConnection.java:272)
              at com.bigdata.rdf.sparql.ast.eval.AST2BOpUpdate.convertCommit(AST2BOpUpdate.java:353)
              at com.bigdata.rdf.sparql.ast.eval.AST2BOpUpdate.convertUpdate(AST2BOpUpdate.java:307)
              at com.bigdata.rdf.sparql.ast.eval.ASTEvalHelper.executeUpdate(ASTEvalHelper.java:1295)
              ... 9 more
      Caused by: java.lang.RuntimeException: Problem with entry at -711345699177889370
              at com.bigdata.rwstore.RWStore.freeDeferrals(RWStore.java:4967)
              at com.bigdata.rwstore.RWStore.checkDeferredFrees(RWStore.java:3539)
              at com.bigdata.journal.RWStrategy.checkDeferredFrees(RWStrategy.java:781)
              at com.bigdata.journal.AbstractJournal$CommitState.writeCommitRecord(AbstractJournal.java:3476)
              at com.bigdata.journal.AbstractJournal$CommitState.access$2800(AbstractJournal.java:3278)
              at com.bigdata.journal.AbstractJournal.commitNow(AbstractJournal.java:4088)
              at com.bigdata.journal.AbstractJournal.commit(AbstractJournal.java:3111)
              ... 15 more   
      Caused by: java.lang.RuntimeException: addr=-157604858 : cause=java.lang.IllegalStateException: Bad Address: length requested greater than allocated slot
              at com.bigdata.rwstore.RWStore.getData(RWStore.java:2190)
              at com.bigdata.rwstore.RWStore.getData(RWStore.java:1989)
              at com.bigdata.rwstore.RWStore.getData(RWStore.java:2033)
              at com.bigdata.rwstore.RWStore.getData(RWStore.java:1989)
              at com.bigdata.rwstore.RWStore.freeDeferrals(RWStore.java:4857)
              at com.bigdata.rwstore.RWStore.freeDeferrals(RWStore.java:4953)
              ... 21 more   
      Caused by: java.lang.IllegalStateException: Bad Address: length requested greater than allocated slot
              at com.bigdata.rwstore.RWStore.getData(RWStore.java:2082)
              ... 26 more   
      
              at org.wikidata.query.rdf.tool.rdf.RdfRepository.execute(RdfRepository.java:296)
              at org.wikidata.query.rdf.tool.rdf.RdfRepository.updateLeftOffTime(RdfRepository.java:255)
              at org.wikidata.query.rdf.tool.Update.run(Update.java:235)
              at org.wikidata.query.rdf.tool.Update.main(Update.java:100)
      

      ----

      TODO:


      - done. Option to discard the deferred deletes so people can continue to write if there is a problem with deferred deletes (utility class was written).


      - Create/modify long running test focused on mutation (add/delete and not just our REST API, but also SPARQL UPDATE).


      - Modify that test to attempt to cancel DELETE of large namespaces and make sure that we are properly discarding the write set. I suspect that the bug is a failure to properly discard the write set if the namespace delete fails. A subsequent UPDATE request then folded the write set of the DELETE namespace into that subsequent update with the result that the failed namespace delete was not properly rolled back. Things to be suspicious of include the acquisition and release of the global lock on the journal and error handling paths around the mutation associated with the namespace delete.


      - We can also write tests that are more focused on interrupt of namespace create / delete operations, perhaps with a concurrent writer task on some other namespace, to verify that the create/delete semantics remain ACID.


      - Why would BLZG-181 not have caught this problem? (BLZG-181 only protects against a call to abort() that fails. If there is a write set and (a) a failed commit where the code does not then go on to invoke abort() -or- (b) the code simply leaves the scope in which the write set was applied without either committing or aborting then we can have a situation in which buffered writes eventually become durable. In BLZG-1079, the situation was (a). The mutations were not correctly discarded since abort() was not called. This lead to the same symptoms as per this ticket.)


      - Look at TestHA1GroupCommit and TestHA3GroupCommit test_HA1_GroupCommit_createDestroy_ManyNamespaces()

      See https://phabricator.wikimedia.org/T97468 (original wikidata ticket)
      See BLZG-201 SPARQL UPDATE with runtime errors causes problems with lexicon indices (similar symptom)
      See TestNanoSparqlClient._test967() in 1.5.1
      See StressTest_REST_MultiTenancy in 1.5.2 development master (current version of the same stress test)
      See TestConcurrentKBCreate for a concurrency test of CreateKB and some tests of DestroyKB
      See BLZG-1073 Add the ability to track and cancel all queries issued through a BigdataSailRemoteRepositoryConnection
      See BLZG-1082 RWStore commit state not correctly rolled back if abort fails on empty journal
      See BLZG-1083 RWStorage stats cleanup
      See BLZG-181 Add critical section protection to AbstractJournal.abort() and BigdataSailConnection.rollback()
      See BLZG-1082 RWStore commit state not correctly rolled back if abort fails on empty journal
      See BLZG-206 Code review and concepts for provably correctness for abort code paths
      See BLZG-1255 AbstractJournal has RWStrategy tests where it should test on IAllocationManager

        Attachments

          Issue Links

            Activity

              People

              Assignee:
              martyncutcher martyncutcher
              Reporter:
              bryanthompson bryanthompson
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved: