Details
- Type: Bug
- Status: Done
- Priority: Highest
- Resolution: Done
- Affects Version/s: BLAZEGRAPH_RELEASE_1_5_1
- Fix Version/s: BLAZEGRAPH_2_0_0
- Component/s: RWStore, Wikidata Query Service
- Labels: None
Description
This ticket is in response to https://phabricator.wikimedia.org/T97468.
The wikidata development group had been running a long-running update test against Blazegraph 1.5.1. They recently deleted a large namespace through the UI (the workbench, I assume). An error was reported but not saved. Sometime thereafter, SPARQL UPDATE requests began failing with the following trace. Unfortunately, the log files were not preserved. Group commit was not enabled. Other details (journal properties, etc.) are on the wikidata ticket.
19:46:43.487 [main] DEBUG o.w.query.rdf.tool.rdf.RdfRepository - Setting last updated time to Tue Apr 28 16:47:33 UTC 2015
Exception in thread "main" org.wikidata.query.rdf.tool.exception.ContainedException: Non-200 response from triple store: HttpResponseProxy{HTTP/1.1 500 Server Error [Content-Type: text/plain, Transfer-Encoding: chunked, Server: Jetty(9.2.z-SNAPSHOT)] ResponseEntityProxy{[Content-Type: text/plain,Chunked: true]}}
body= SPARQL-UPDATE: updateStr=
DELETE { <http://www.wikidata.org> <http://schema.org/dateModified> ?o . }
WHERE { <http://www.wikidata.org> <http://schema.org/dateModified> ?o . };
INSERT DATA { <http://www.wikidata.org> <http://schema.org/dateModified> "2015-04-28T16:47:33.000Z"^^<xsd:dateTime> . }
java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: org.openrdf.query.UpdateExecutionException: java.lang.RuntimeException: Problem with entry at -711345699177889370: lastRootBlock=rootBlock{ rootBlock=0, challisField=2740925, version=3, nextOffset=3775731623797917, localTime=1430247841782 [Tuesday, April 28, 2015 7:04:01 PM UTC], firstCommitTime=1428692901050 [Friday, April 10, 2015 7:08:21 PM UTC], lastCommitTime=1430247841781 [Tuesday, April 28, 2015 7:04:01 PM UTC], commitCounter=2740925, commitRecordAddr={off=NATIVE:-165623078,len=422}, commitRecordIndexAddr={off=NATIVE:-155674208,len=220}, blockSequence=1, quorumToken=-1, metaBitsAddr=2455328113296235, metaStartAddr=952388, storeType=RW, uuid=f8287971-835c-4bb1-9b83-215d38565a61, offsetBits=42, checksum=-498521427, createTime=1428691460132 [Friday, April 10, 2015 6:44:20 PM UTC], closeTime=0}
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:188)
	at com.bigdata.rdf.sail.webapp.BigdataServlet.submitApiTask(BigdataServlet.java:261)
	at com.bigdata.rdf.sail.webapp.QueryServlet.doSparqlUpdate(QueryServlet.java:359)
	at com.bigdata.rdf.sail.webapp.QueryServlet.doPost(QueryServlet.java:165)
	at com.bigdata.rdf.sail.webapp.RESTServlet.doPost(RESTServlet.java:237)
	at com.bigdata.rdf.sail.webapp.MultiTenancyServlet.doPost(MultiTenancyServlet.java:137)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
	at org.eclipse.jetty.server.Server.handle(Server.java:497)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248)
	at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.ExecutionException: org.openrdf.query.UpdateExecutionException: java.lang.RuntimeException: Problem with entry at -711345699177889370: lastRootBlock=rootBlock{ rootBlock=0, challisField=2740925, version=3, nextOffset=3775731623797917, localTime=1430247841782 [Tuesday, April 28, 2015 7:04:01 PM UTC], firstCommitTime=1428692901050 [Friday, April 10, 2015 7:08:21 PM UTC], lastCommitTime=1430247841781 [Tuesday, April 28, 2015 7:04:01 PM UTC], commitCounter=2740925, commitRecordAddr={off=NATIVE:-165623078,len=422}, commitRecordIndexAddr={off=NATIVE:-155674208,len=220}, blockSequence=1, quorumToken=-1, metaBitsAddr=2455328113296235, metaStartAddr=952388, storeType=RW, uuid=f8287971-835c-4bb1-9b83-215d38565a61, offsetBits=42, checksum=-498521427, createTime=1428691460132 [Friday, April 10, 2015 6:44:20 PM UTC], closeTime=0}
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:188)
	at com.bigdata.rdf.sail.webapp.QueryServlet$SparqlUpdateTask.call(QueryServlet.java:460)
	at com.bigdata.rdf.sail.webapp.QueryServlet$SparqlUpdateTask.call(QueryServlet.java:371)
	at com.bigdata.rdf.task.ApiTaskForIndexManager.call(ApiTaskForIndexManager.java:68)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at com.bigdata.rdf.task.AbstractApiTask.submitApiTask(AbstractApiTask.java:365)
	at com.bigdata.rdf.sail.webapp.BigdataServlet.submitApiTask(BigdataServlet.java:258)
	... 26 more
Caused by: org.openrdf.query.UpdateExecutionException: java.lang.RuntimeException: Problem with entry at -711345699177889370: lastRootBlock=rootBlock{ rootBlock=0, challisField=2740925, version=3, nextOffset=3775731623797917, localTime=1430247841782 [Tuesday, April 28, 2015 7:04:01 PM UTC], firstCommitTime=1428692901050 [Friday, April 10, 2015 7:08:21 PM UTC], lastCommitTime=1430247841781 [Tuesday, April 28, 2015 7:04:01 PM UTC], commitCounter=2740925, commitRecordAddr={off=NATIVE:-165623078,len=422}, commitRecordIndexAddr={off=NATIVE:-155674208,len=220}, blockSequence=1, quorumToken=-1, metaBitsAddr=2455328113296235, metaStartAddr=952388, storeType=RW, uuid=f8287971-835c-4bb1-9b83-215d38565a61, offsetBits=42, checksum=-498521427, createTime=1428691460132 [Friday, April 10, 2015 6:44:20 PM UTC], closeTime=0}
	at com.bigdata.rdf.sparql.ast.eval.ASTEvalHelper.executeUpdate(ASTEvalHelper.java:1303)
	at com.bigdata.rdf.sail.BigdataSailUpdate.execute2(BigdataSailUpdate.java:152)
	at com.bigdata.rdf.sail.webapp.BigdataRDFContext$UpdateTask.doQuery(BigdataRDFContext.java:1683)
	at com.bigdata.rdf.sail.webapp.BigdataRDFContext$AbstractQueryTask.innerCall(BigdataRDFContext.java:1310)
	at com.bigdata.rdf.sail.webapp.BigdataRDFContext$AbstractQueryTask.call(BigdataRDFContext.java:1275)
	at com.bigdata.rdf.sail.webapp.BigdataRDFContext$AbstractQueryTask.call(BigdataRDFContext.java:517)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	... 1 more
Caused by: java.lang.RuntimeException: Problem with entry at -711345699177889370: lastRootBlock=rootBlock{ rootBlock=0, challisField=2740925, version=3, nextOffset=3775731623797917, localTime=1430247841782 [Tuesday, April 28, 2015 7:04:01 PM UTC], firstCommitTime=1428692901050 [Friday, April 10, 2015 7:08:21 PM UTC], lastCommitTime=1430247841781 [Tuesday, April 28, 2015 7:04:01 PM UTC], commitCounter=2740925, commitRecordAddr={off=NATIVE:-165623078,len=422}, commitRecordIndexAddr={off=NATIVE:-155674208,len=220}, blockSequence=1, quorumToken=-1, metaBitsAddr=2455328113296235, metaStartAddr=952388, storeType=RW, uuid=f8287971-835c-4bb1-9b83-215d38565a61, offsetBits=42, checksum=-498521427, createTime=1428691460132 [Friday, April 10, 2015 6:44:20 PM UTC], closeTime=0}
	at com.bigdata.journal.AbstractJournal.commit(AbstractJournal.java:3113)
	at com.bigdata.rdf.store.LocalTripleStore.commit(LocalTripleStore.java:99)
	at com.bigdata.rdf.sail.BigdataSail$BigdataSailConnection.commit2(BigdataSail.java:3370)
	at com.bigdata.rdf.sail.BigdataSailRepositoryConnection.commit2(BigdataSailRepositoryConnection.java:272)
	at com.bigdata.rdf.sparql.ast.eval.AST2BOpUpdate.convertCommit(AST2BOpUpdate.java:353)
	at com.bigdata.rdf.sparql.ast.eval.AST2BOpUpdate.convertUpdate(AST2BOpUpdate.java:307)
	at com.bigdata.rdf.sparql.ast.eval.ASTEvalHelper.executeUpdate(ASTEvalHelper.java:1295)
	... 9 more
Caused by: java.lang.RuntimeException: Problem with entry at -711345699177889370
	at com.bigdata.rwstore.RWStore.freeDeferrals(RWStore.java:4967)
	at com.bigdata.rwstore.RWStore.checkDeferredFrees(RWStore.java:3539)
	at com.bigdata.journal.RWStrategy.checkDeferredFrees(RWStrategy.java:781)
	at com.bigdata.journal.AbstractJournal$CommitState.writeCommitRecord(AbstractJournal.java:3476)
	at com.bigdata.journal.AbstractJournal$CommitState.access$2800(AbstractJournal.java:3278)
	at com.bigdata.journal.AbstractJournal.commitNow(AbstractJournal.java:4088)
	at com.bigdata.journal.AbstractJournal.commit(AbstractJournal.java:3111)
	... 15 more
Caused by: java.lang.RuntimeException: addr=-157604858 : cause=java.lang.IllegalStateException: Bad Address: length requested greater than allocated slot
	at com.bigdata.rwstore.RWStore.getData(RWStore.java:2190)
	at com.bigdata.rwstore.RWStore.getData(RWStore.java:1989)
	at com.bigdata.rwstore.RWStore.getData(RWStore.java:2033)
	at com.bigdata.rwstore.RWStore.getData(RWStore.java:1989)
	at com.bigdata.rwstore.RWStore.freeDeferrals(RWStore.java:4857)
	at com.bigdata.rwstore.RWStore.freeDeferrals(RWStore.java:4953)
	... 21 more
Caused by: java.lang.IllegalStateException: Bad Address: length requested greater than allocated slot
	at com.bigdata.rwstore.RWStore.getData(RWStore.java:2082)
	... 26 more
	at org.wikidata.query.rdf.tool.rdf.RdfRepository.execute(RdfRepository.java:296)
	at org.wikidata.query.rdf.tool.rdf.RdfRepository.updateLeftOffTime(RdfRepository.java:255)
	at org.wikidata.query.rdf.tool.Update.run(Update.java:235)
	at org.wikidata.query.rdf.tool.Update.main(Update.java:100)
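The innermost cause is RWStore.freeDeferrals() hitting a corrupt deferred-delete record ("Bad Address: length requested greater than allocated slot") while writing the commit record. Because the deferred-free list is only drained when a commit succeeds, the same bad record is re-read on every subsequent commit attempt, which is why every later SPARQL UPDATE fails with the identical trace. A minimal toy model of this dynamic (the class and method names here are invented for illustration; this is not Blazegraph's actual API):

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of RWStore-style deferred frees; all names are invented. */
class ToyRWStore {
    // Deferred-delete records that are recycled at each commit point.
    private final Deque<Long> deferredFrees = new ArrayDeque<>();
    private final long badAddr; // stands in for the corrupt record at addr=-157604858

    ToyRWStore(long badAddr) { this.badAddr = badAddr; }

    void deferFree(long addr) { deferredFrees.add(addr); }

    /** Analogue of freeDeferrals(): reading a corrupt record throws, and the
     *  queue is only drained on success, so the same record poisons every
     *  subsequent commit. */
    void commit() {
        for (long addr : deferredFrees) {
            if (addr == badAddr) {
                throw new IllegalStateException("Bad Address: " + addr);
            }
        }
        deferredFrees.clear(); // only reached when all records were readable
    }

    /** The recovery utility's strategy: drop the deferred frees entirely so
     *  writers can continue, at the cost of leaking the unrecycled storage. */
    void discardDeferredFrees() { deferredFrees.clear(); }
}

public class DeferredFreeDemo {
    /** Returns true when commit() throws, false when it succeeds. */
    public static boolean commitFails(ToyRWStore store) {
        try {
            store.commit();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        ToyRWStore store = new ToyRWStore(-157604858L);
        store.deferFree(1L);
        store.deferFree(-157604858L);           // the corrupt entry
        System.out.println(commitFails(store)); // true
        System.out.println(commitFails(store)); // true: the retry fails identically
        store.discardDeferredFrees();
        System.out.println(commitFails(store)); // false: commits work again
    }
}
```

This is why the "discard the deferred deletes" utility noted in the TODO list below is sufficient to unblock writers even though it does not repair the corrupt record itself.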
----
TODO:
- Done: a utility class was written that provides an option to discard the deferred deletes, so people can continue to write if there is a problem with the deferred deletes.
- Create or modify a long-running test focused on mutation (add/delete, exercising not just our REST API but also SPARQL UPDATE).
- Modify that test to attempt to cancel the DELETE of large namespaces and make sure that we properly discard the write set. I suspect the bug is a failure to properly discard the write set when a namespace delete fails: a subsequent UPDATE request then folds the write set of the failed DELETE into that update, with the result that the failed namespace delete is not properly rolled back. Things to be suspicious of include the acquisition and release of the global lock on the journal, and the error-handling paths around the mutation associated with the namespace delete.
- We can also write tests that are more focused on interrupting namespace create/delete operations, perhaps with a concurrent writer task on some other namespace, to verify that the create/delete semantics remain ACID.
- Why would BLZG-181 not have caught this problem? (BLZG-181 only protects against a call to abort() that fails. If there is a write set and either (a) a commit fails and the code does not then go on to invoke abort(), or (b) the code simply leaves the scope in which the write set was applied without either committing or aborting, then we can have a situation in which buffered writes eventually become durable. In BLZG-1079 the situation was (a): the mutations were not correctly discarded since abort() was not called, which led to the same symptoms as this ticket.)
- Look at TestHA1GroupCommit and TestHA3GroupCommit test_HA1_GroupCommit_createDestroy_ManyNamespaces()
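Failure mode (a) above can be illustrated with a toy journal (hypothetical names, not the AbstractJournal API): if a failed commit is not followed by abort(), the stale write set silently rides along into the next successful commit.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy journal with a buffered write set; all names are invented. */
class ToyJournal {
    private final List<String> durable = new ArrayList<>();
    private final List<String> writeSet = new ArrayList<>();

    void write(String record) { writeSet.add(record); }

    /** A commit that can be forced to fail before making the write set durable. */
    void commit(boolean fail) {
        if (fail) throw new RuntimeException("commit failed");
        durable.addAll(writeSet);
        writeSet.clear();
    }

    /** Discards the buffered write set: the step that is skipped in case (a). */
    void abort() { writeSet.clear(); }

    List<String> durableRecords() { return durable; }
}

public class AbortLeakDemo {
    /** Runs the scenario: a failed commit, optionally followed by abort(),
     *  then a subsequent successful commit. Returns what became durable. */
    public static List<String> run(boolean callAbortAfterFailure) {
        ToyJournal journal = new ToyJournal();
        journal.write("failed-namespace-delete"); // write set of the failing operation
        try {
            journal.commit(true);                 // the commit fails...
        } catch (RuntimeException e) {
            if (callAbortAfterFailure) {
                journal.abort();                  // ...and correct handling aborts
            }
        }
        journal.write("subsequent-update");       // the next SPARQL UPDATE
        journal.commit(false);                    // ...which commits successfully
        return journal.durableRecords();
    }

    public static void main(String[] args) {
        // Without abort(), the failed operation's writes leak into the next commit.
        System.out.println(run(false)); // [failed-namespace-delete, subsequent-update]
        System.out.println(run(true));  // [subsequent-update]
    }
}
```

The proposed tests would need to drive the real code through the equivalent of run(false) and assert that nothing from the failed namespace delete becomes durable.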
See https://phabricator.wikimedia.org/T97468 (original wikidata ticket)
See BLZG-201 SPARQL UPDATE with runtime errors causes problems with lexicon indices (similar symptom)
See TestNanoSparqlClient._test967() in 1.5.1
See StressTest_REST_MultiTenancy in 1.5.2 development master (current version of the same stress test)
See TestConcurrentKBCreate for a concurrency test of CreateKB and some tests of DestroyKB
See BLZG-1073 Add the ability to track and cancel all queries issued through a BigdataSailRemoteRepositoryConnection
See BLZG-1082 RWStore commit state not correctly rolled back if abort fails on empty journal
See BLZG-1083 RWStorage stats cleanup
See BLZG-181 Add critical section protection to AbstractJournal.abort() and BigdataSailConnection.rollback()
See BLZG-206 Code review and concepts for provable correctness for abort code paths
See BLZG-1255 AbstractJournal has RWStrategy tests where it should test on IAllocationManager
Attachments
Issue Links
- relates to:
  - BLZG-1420 Investigate PhysicalAddressExceptions from BLZG-1418 (Done)
  - BLZG-1282 HA3 tests of allocator / store state after abort() (Accepted)
  - BLZG-1283 Re-open backing strategy after a failed abort (Accepted)
  - BLZG-1255 AbstractJournal has RWStrategy tests where it should test on IAllocationManager (Open)
  - BLZG-206 Code review and concepts for provable correctness for abort code paths (Open)
  - BLZG-14 HA doLocalAbort() should interrupt NSS requests and AbstractTasks (In Progress)
  - BLZG-1266 Create stress test for REST API with coverage of all methods and abort code paths (In Progress)
  - BLZG-1574 Relayer Journal / RWStore / Transaction interfaces (Open)