  1. Blazegraph (by SYSTAP)
  2. BLZG-1863

HA3 / StatementBuffer - poison pill does not cause Future to terminate

    Details

    • Type: Bug
    • Status: Duplicate
    • Priority: Medium
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels: None

      Description

      This was revealed during an HA3 stress test run. There is currently an HA3 deadlock in CI for this job: https://ci.blazegraph.com/view/Mapgraph/job/db-enterprise/905/

      Brad has obtained stack traces, which I will attach.

      I see the code on the leader blocked in this bit:

      
      			// Drop a poison pill on the queue.
      			try {
      				
      				queue.put((Batch) Batch.POISON_PILL);
      
      				// block and wait until the flush is done.
      				final Future<Void> ft = this.ft;
      				if (ft != null) {
      					ft.get(); // <== Blocked here.
      				}
      
      

      This is demonstrated by the following stack trace:

      "com.bigdata.rdf.sail.webapp.BigdataRDFContext.queryService1" daemon prio=10 tid=0x00007fcc78340000 nid=0x36e1 waiting on condition [0x00007fcce0bca000]
         java.lang.Thread.State: WAITING (parking)
      	at sun.misc.Unsafe.park(Native Method)
      	- parking to wait for  <0x00000000fd182640> (a java.util.concurrent.FutureTask)
      	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
      	at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:425)
      	at java.util.concurrent.FutureTask.get(FutureTask.java:187)
      	at com.bigdata.rdf.rio.StatementBuffer.flush(StatementBuffer.java:940)
      	at com.bigdata.rdf.sail.BigdataSail$BigdataSailConnection.flushStatementBuffers(BigdataSail.java:3904)
      	- locked <0x00000000c0611e88> (a com.bigdata.rdf.sail.BigdataSail$BigdataSailConnection)
      	at com.bigdata.rdf.sail.BigdataSail$BigdataSailConnection.commit2(BigdataSail.java:3687)
      	- locked <0x00000000c0611e88> (a com.bigdata.rdf.sail.BigdataSail$BigdataSailConnection)
      	at com.bigdata.rdf.sail.BigdataSailRepositoryConnection.commit2(BigdataSailRepositoryConnection.java:330)
      	at com.bigdata.rdf.sparql.ast.eval.AST2BOpUpdate.convertCommit(AST2BOpUpdate.java:375)
      	at com.bigdata.rdf.sparql.ast.eval.AST2BOpUpdate.convertUpdate(AST2BOpUpdate.java:321)
      	at com.bigdata.rdf.sparql.ast.eval.ASTEvalHelper.executeUpdate(ASTEvalHelper.java:1072)
      	at com.bigdata.rdf.sail.BigdataSailUpdate.execute2(BigdataSailUpdate.java:152)
      	at com.bigdata.rdf.sail.webapp.BigdataRDFContext$UpdateTask.doQuery(BigdataRDFContext.java:1934)
      	at com.bigdata.rdf.sail.webapp.BigdataRDFContext$AbstractQueryTask.innerCall(BigdataRDFContext.java:1536)
      	at com.bigdata.rdf.sail.webapp.BigdataRDFContext$AbstractQueryTask.call(BigdataRDFContext.java:1501)
      	at com.bigdata.rdf.sail.webapp.BigdataRDFContext$AbstractQueryTask.call(BigdataRDFContext.java:714)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
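
A thread dump such as the one above can also be checked from inside the JVM. The following is a minimal sketch, assuming only the standard `Thread.getAllStackTraces()` API; the class name `ThreadFrameScan` and the method `anyThreadIn` are hypothetical and are not part of Blazegraph. It reports whether any live thread currently has a frame in a class whose name starts with a given prefix.

```java
import java.util.Map;

public class ThreadFrameScan {

    // Returns true if any live thread has a stack frame whose declaring
    // class name starts with the given prefix (inner classes included).
    static boolean anyThreadIn(final String classNamePrefix) {
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            for (StackTraceElement frame : e.getValue()) {
                if (frame.getClassName().startsWith(classNamePrefix)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The calling thread is itself executing inside ThreadFrameScan.
        System.out.println(anyThreadIn(ThreadFrameScan.class.getName()));
        // Hypothetical prefix that matches no loaded class.
        System.out.println(anyThreadIn("no.such.pkg.NoSuchClass"));
    }
}
```

Passing a prefix such as `com.bigdata.rdf.rio.StatementBuffer` (the class that appears in the trace above) would match both the blocked `flush()` caller and any thread running code in one of its nested classes, so a narrower prefix is needed to look for a specific consumer.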
      
      

      We need to figure out whether this is a concurrency hole in StatementBuffer or an HA3 failover issue.

      It is curious that no DrainQueueCallable instance is running, even though we are using a queue and ft != null, so a Future does exist. Even if that Future were no longer the current one (e.g., a new one had been created and we missed it), Future.isDone() should be true for any old Future, so ft.get() should have returned immediately.
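
The expectation in the paragraph above can be illustrated with plain `java.util.concurrent` classes. This is a minimal sketch, not the StatementBuffer code: `PoisonPillSketch`, `drainTask`, and `POISON_PILL` are hypothetical stand-ins for the real consumer and sentinel. The first method shows the normal case, where a running consumer sees the pill and its Future completes. The second shows one generic way `get()` can block while no consumer thread exists: a `FutureTask` that no thread ever runs. It is offered as an illustration of the library's behaviour, not as a diagnosis of this issue.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PoisonPillSketch {

    // Hypothetical sentinel; stands in for Batch.POISON_PILL.
    static final Object POISON_PILL = new Object();

    // Hypothetical consumer; stands in for DrainQueueCallable.
    static Callable<Void> drainTask(final BlockingQueue<Object> queue) {
        return new Callable<Void>() {
            @Override
            public Void call() throws Exception {
                while (queue.take() != POISON_PILL) {
                    // A real consumer would write the batch here.
                }
                return null;
            }
        };
    }

    // Normal case: the consumer takes the pill, so the Future completes.
    static boolean pillTerminatesRunningConsumer() throws Exception {
        final BlockingQueue<Object> queue = new LinkedBlockingQueue<Object>();
        final ExecutorService service = Executors.newSingleThreadExecutor();
        try {
            final Future<Void> ft = service.submit(drainTask(queue));
            queue.put(new Object()); // an ordinary batch
            queue.put(POISON_PILL);
            ft.get(5, TimeUnit.SECONDS); // bounded wait instead of get()
            return ft.isDone();
        } finally {
            service.shutdownNow();
        }
    }

    // A FutureTask that is never run: isDone() stays false and an
    // unbounded get() would never return.
    static boolean getBlocksWhenTaskNeverRuns() throws Exception {
        final BlockingQueue<Object> queue = new LinkedBlockingQueue<Object>();
        final FutureTask<Void> ft = new FutureTask<Void>(drainTask(queue));
        queue.put(POISON_PILL); // nobody is consuming
        try {
            ft.get(200, TimeUnit.MILLISECONDS);
            return false; // not expected: the task was never executed
        } catch (TimeoutException expected) {
            return !ft.isDone();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(pillTerminatesRunningConsumer());
        System.out.println(getBlocksWhenTaskNeverRuns());
    }
}
```

Using a timed `get` in a stress test, as in the sketch, turns a silent hang into a `TimeoutException` at the point where the Future's state can still be inspected.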

      Equally, it is possible that the problem is an HA3 issue: the replicated writes may not be getting flushed because of some failover state that has neither corrected itself nor timed out.

        Attachments

        1. 13710.txt
          45 kB
        2. 13794.txt
          36 kB
        3. 13880.txt
          36 kB
        4. jstack.985.txt
          48 kB

              People

              Assignee: martyncutcher
              Reporter: bryanthompson
              Votes: 0
              Watchers: 3