Blazegraph (by SYSTAP) / BLZG-807

Verify that IRunningQuery instances (and nested queries) are correctly cancelled when interrupted

    Details

      Description

      Once the caller has a TupleQueryResult (or GraphQueryResult) object, invoking .close() on that object should do an IRunningQuery.cancel(true/* mayInterruptIfRunning */).

      This ticket is a task request to verify the correct functioning of this feature.
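      For reference, here is a minimal caller-side sketch of the pattern under test. The repository setup and the process() consumer are hypothetical; only the evaluate()/close() calls are the openrdf API under discussion.

          final TupleQuery query = cxn.prepareTupleQuery(QueryLanguage.SPARQL, queryStr);
          final TupleQueryResult result = query.evaluate();
          try {
              while (result.hasNext()) {
                  process(result.next()); // hypothetical consumer; may stop early
              }
          } finally {
              // Closing the result is expected to propagate to
              // IRunningQuery.cancel(true/* mayInterruptIfRunning */),
              // interrupting any operators (and nested queries) still running.
              result.close();
          }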

      Correct interruption of some query plans is heavily tested by BSBM queries that use a LIMIT. When the LIMIT is reached, the SliceOp internally causes the query to be cancelled.

      SliceOp:

          if (halt) {

              if (log.isInfoEnabled())
                  log.info("Slice will interrupt query.");

              context.getRunningQuery().halt((Void) null);

          }

      AbstractRunningQuery:

          final public void halt(final Void v) {
      
              lock.lock();
      
              try {
      
                  // signal normal completion.
                  future.halt((Void) v);
      
                  // interrupt anything which is running.
                  cancel(true/* mayInterruptIfRunning */);
      
              } finally {
      
                  lock.unlock();
      
              }
      
          }
      

      However, we need to either (a) audit the BSBM query plans to verify that this works for nested query plans or (b) develop new tests that look for a failure to cancel a sub-select or other interesting query structures in a timely fashion.

      Note: The query deadline is a completely different mechanism. The query timeout is expressed in seconds through the openrdf API (AbstractQuery.setMaxQueryTime(int seconds)). The maxQueryTime is translated into a milliseconds timeout by BigdataSailTupleQuery.evaluate() and friends and then converted into a deadline within the RunState of the AbstractRunningQuery. The RunState only examines the deadline when bigdata operators start or stop. If a query has a bad plan, it might swamp the heap and cause long GC pauses that prevent it from making progress; in that case it might not halt at the deadline because bops do not start/stop. Other problems that could result in a missed deadline include an operator that builds a very large hash index from an unconstrained index scan, etc.
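      As a hedged illustration of that path (the caller-side calls are the openrdf API; the internal conversion shown is a paraphrase of the description above, not the actual BigdataSailTupleQuery code):

          final TupleQuery query = cxn.prepareTupleQuery(QueryLanguage.SPARQL, queryStr);
          query.setMaxQueryTime(10); // openrdf API: timeout in seconds

          // Paraphrase of the internal translation: seconds -> milliseconds
          // timeout -> absolute deadline, which the RunState examines only
          // when operators start or stop.
          final long timeoutMillis = TimeUnit.SECONDS.toMillis(10);
          final long deadline = System.currentTimeMillis() + timeoutMillis;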

        Activity

        bryanthompson added a comment -

        We are able to replicate a problem for a query using a named subquery.

        select ?snippet
        with {
          select * { ?snippet ?p ?o . }
        } as %filter
        where {
          include %filter .
        }
        

        which gets translated into:

        WITH {
          QueryType: SELECT
          SELECT * 
            JoinGroupNode {
              StatementPatternNode(VarNode(snippet), VarNode(p), VarNode(o)) [scope=DEFAULT_CONTEXTS]
            }
        } AS %filter
        QueryType: SELECT
        includeInferred=true
        timeout=10000
        SELECT ( VarNode(snippet) AS VarNode(snippet) )
          JoinGroupNode {
            INCLUDE %filter
          }
        

        The timeout (in milliseconds) is imposed by the application harness (which specifies a timeout in seconds).

        This query fails to terminate correctly when TupleQueryResult.close() is invoked by the application. We can observe the cancellation of the top-level query, but not of the named subquery. I am investigating why the cancellation of the named subquery is not observed. I do see the ChunkedRunningQuery.ChunkTask being interrupted, but that interrupt is not being propagated into the JVMNamedSubqueryOp.ControllerTask or SubqueryTask.

        I believe that the root cause may be that we rely on closing the source iterator for the chunk task (an execution pass for some bop) to terminate that execution pass. However, that might not have any effect on the JVMNamedSubqueryOp since it runs the sub-select before it drains the source. Other operators may also be failing to cancel queries as eagerly as possible.

        I am currently looking at ChunkedRunningQuery.ChunkTask.call() and examining whether it can be made to interrupt the operator evaluation task in a reliable and safe manner by cancelling the FutureTask returned from BOp.eval().
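        The following is a minimal, self-contained sketch of that idea (not the actual ChunkTask code; the executor, the looping task, and the simulated interrupt are stand-ins): when the thread waiting on the operator's future is interrupted, cancel the FutureTask with mayInterruptIfRunning=true so the interrupt reaches the operator evaluation task.

        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.FutureTask;

        public class CancelOperatorTaskSketch {
            public static void main(String[] args) throws Exception {
                final ExecutorService executor = Executors.newSingleThreadExecutor();
                // Stand-in for the operator evaluation task returned by BOp.eval().
                final FutureTask<Void> opTask = new FutureTask<>(() -> {
                    while (!Thread.currentThread().isInterrupted()) {
                        Thread.sleep(10); // simulate evaluating one chunk
                    }
                    return null;
                });
                executor.execute(opTask);

                // Simulate cancellation of the query: interrupt the waiting thread.
                final Thread waiter = Thread.currentThread();
                new Thread(() -> {
                    try { Thread.sleep(100); } catch (InterruptedException ignored) { }
                    waiter.interrupt();
                }).start();

                try {
                    opTask.get(); // analogous to ChunkTask.call() awaiting the operator
                } catch (InterruptedException ex) {
                    // Propagate the interrupt into the operator task so it halts too.
                    opTask.cancel(true /* mayInterruptIfRunning */);
                } finally {
                    executor.shutdownNow();
                }
            }
        }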

        bryanthompson added a comment -

        Of interest, Mike suggests that this problem is not observed with a SPARQL named subquery. I have not attempted to confirm that yet.

        bryanthompson added a comment -

        There were some problems with the customer's test harness. Specifically, it was shutting down the Journal (and hence the executor service on which the query was running) without waiting for the query to terminate after it had been cancelled. This was causing RejectedExecutionException instances to be thrown when the query attempted to notify the query controller that a given operator had halted.

        Once that issue was corrected, it became obvious that the root cause was in fact the failure to propagate the interrupt out of BlockingBuffer.BlockingIterator.hasNext(), as suggested by the customer in BLZG-798. With this change, the query with the nested subquery now terminates in a timely manner.
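        To make the shape of that fix concrete, here is a hypothetical fragment (not the actual BlockingBuffer.BlockingIterator code; the queue, next, and open fields are made up) of a hasNext() that propagates the interrupt instead of swallowing it:

        public boolean hasNext() {
            while (next == null && open) {
                try {
                    // Wait briefly for the producer to add another chunk.
                    next = queue.poll(10, TimeUnit.MILLISECONDS);
                } catch (InterruptedException ex) {
                    // Propagate rather than retry: re-assert the interrupt status,
                    // mark the iterator closed, and report exhaustion so consumers
                    // (including nested subqueries) unwind promptly.
                    Thread.currentThread().interrupt();
                    open = false;
                    return false;
                }
            }
            return next != null;
        }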

        I am running through the SPARQL test suite locally before a commit. I will commit the updated version of the customer's test case as well.

        We will need to do some longevity testing and performance testing on this change to verify that there are no undesired side-effects which arise from propagating that interrupt.

        I have also looked at the testOrderByQueriesAreInterruptable() test in the RepositoryConnectionTest class. I have lifted a copy of that test into our code. Examination of this test shows that the query is cancelled in a timely fashion IF the ORDER BY operator has not yet begun to execute. This is in keeping with the semantics of deadline as implemented by bigdata. A deadline is only examined when we start or stop the evaluation of a query operator.

        If we need to make deadlines responsive for operators that are long running, then we would have to do something like schedule a future to cancel the query if it was still running after a deadline.
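        A rough sketch of that approach (hypothetical; assumes a ScheduledExecutorService, an IRunningQuery handle named runningQuery that can be cancelled like a Future, and an absolute deadline in milliseconds):

        final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        final long delayMillis = deadline - System.currentTimeMillis();
        scheduler.schedule(() -> {
            // If the query outlived its deadline (e.g., stuck inside a long-running
            // operator such as ORDER BY), cancel it and interrupt its operators.
            if (!runningQuery.isDone()) {
                runningQuery.cancel(true /* mayInterruptIfRunning */);
            }
        }, delayMillis, TimeUnit.MILLISECONDS);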

        Changes are to:


        - BlockingBuffer.BlockingIterator.hasNext() - the interrupt is now propagated.
        - ChunkedRunningQuery - javadoc only.
        - BigdataConnectionTest - lifted a version of testOrderByQueriesAreInterruptable() into our version of that test suite.

        Local CI is good for the AST evaluation and TestBigdataSailWithQuads test suites.

        @see https://sourceforge.net/apps/trac/bigdata/ticket/707 (BlockingBuffer.close() does not unblock threads)

        Committed revision r7252.

        bryanthompson added a comment -

        CI is good.

        LUBM U50 performance is good.

        BSBM 100M performance might have a slight regression. I am re-running with the same JVM as the historical runs. I think that this was normally tested on bigdata11, but I am testing on bigdata17. I have verified that there is no regression against bigdata11.

        This ticket is closed.


          People

          • Assignee: bryanthompson
          • Reporter: bryanthompson
          • Votes: 0
          • Watchers: 3
