Uploaded image for project: 'Blazegraph (by SYSTAP)'
  1. Blazegraph (by SYSTAP)
  2. BLZG-1163

Optimized variable projection into subqueries/subgroups

    Details

      Description

      In patterns like the one from Ticket BLZG-913

      PREFIX  rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
      PREFIX  rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      PREFIX  xsd: <http://www.w3.org/2001/XMLSchema#>
      
      SELECT  ?ps ?p ?o
      WHERE { 
        GRAPH <http://example.com/graph1>
        {
          ?ps ?p ?o.
          {
            SELECT ?ps
            WHERE 
            {
              ?ps a  <http://example.com/data/Person>. 
            }
          }
        }
      }
      

      a hash index is set up for the binding set produced by the outer triple pattern (?ps, ?p, ?o), for later reuse, and subsequently the variable ?ps is projected and flooded into the subquery. In the projection step, we need to apply an additional JVMDistinctBindingSetsOp to avoid duplicates (cf. ticket BLZG-913). However, this DISTINCT comes for free: the key is (always?) defined by exactly those variables that are projected into the subquery, so it is already computed when setting up the hash index.

      Note that we might even apply this pattern for queries where, for instance, the SELECT subquery is replaced through an OPTIONAL join group. Currently, in such cases no projection is applied at all, but it should be possible to use the same pattern (i.e., a distinct projection).

      Think about the benefits of such an optimization and a possible generalization. This might be considered in the scope of a general strategy/framework to drop variables that are no longer required.

        Activity

        Hide
        michaelschmidt michaelschmidt added a comment -

        Actually implemented flooding of projected vdistinct ariables inside optionals as part of Ticket BLZG-1119 (i.e., what's mentioned in the second paragraph). So all that's left to be done is implementing the optimization for both cases. As part of these activities, it might make sense to unify the code constructing the operators, if feasible (putting it into a parameterized method, currently it's more or less duplicated).

        Show
        michaelschmidt michaelschmidt added a comment - Actually implemented flooding of projected vdistinct ariables inside optionals as part of Ticket BLZG-1119 (i.e., what's mentioned in the second paragraph). So all that's left to be done is implementing the optimization for both cases. As part of these activities, it might make sense to unify the code constructing the operators, if feasible (putting it into a parameterized method, currently it's more or less duplicated).
        Hide
        michaelschmidt michaelschmidt added a comment -

        Here's a related query that fails and might be investigated as part of the ticket's implementation:

        SELECT *
        WHERE
        { 
          BIND(1 as ?A)
          OPTIONAL { { BIND(2 as ?B) } UNION { BIND(3 as ?C) } }
          OPTIONAL { BIND( 'unbound' as ?B ) }
          OPTIONAL { BIND( 'unbound' as ?C ) }
        } 
        

        -> throws an exception:

             [java] java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Required property: com.bigdata.bop.controller.SubqueryAnnotations.subquery : class com.bigdata.bop.controller.JVMNamedSubqueryOp
             [java] 	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
             [java] 	at java.util.concurrent.FutureTask.get(FutureTask.java:188)
             [java] 	at com.bigdata.rdf.sail.webapp.QueryServlet.doSparqlQuery(QueryServlet.java:517)
             [java] 	at com.bigdata.rdf.sail.webapp.QueryServlet.doPost(QueryServlet.java:191)
             [java] 	at com.bigdata.rdf.sail.webapp.RESTServlet.doPost(RESTServlet.java:237)
             [java] 	at com.bigdata.rdf.sail.webapp.MultiTenancyServlet.doPost(MultiTenancyServlet.java:144)
             [java] 	at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
             [java] 	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
             [java] 	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769)
             [java] 	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
             [java] 	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
             [java] 	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
             [java] 	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
             [java] 	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125)
             [java] 	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
             [java] 	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
             [java] 	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059)
             [java] 	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
             [java] 	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
             [java] 	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
             [java] 	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
             [java] 	at org.eclipse.jetty.server.Server.handle(Server.java:497)
             [java] 	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311)
             [java] 	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248)
             [java] 	at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
             [java] 	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610)
             [java] 	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539)
             [java] 	at java.lang.Thread.run(Thread.java:745)
             [java] Caused by: java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Required property: com.bigdata.bop.controller.SubqueryAnnotations.subquery : class com.bigdata.bop.controller.JVMNamedSubqueryOp
             [java] 	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
             [java] 	at java.util.concurrent.FutureTask.get(FutureTask.java:188)
             [java] 	at com.bigdata.rdf.sail.webapp.QueryServlet$SparqlQueryTask.call(QueryServlet.java:710)
             [java] 	at com.bigdata.rdf.sail.webapp.QueryServlet$SparqlQueryTask.call(QueryServlet.java:534)
             [java] 	at com.bigdata.rdf.task.ApiTaskForIndexManager.call(ApiTaskForIndexManager.java:67)
             [java] 	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
             [java] 	at com.bigdata.rdf.task.AbstractApiTask.submitApiTask(AbstractApiTask.java:315)
             [java] 	at com.bigdata.rdf.sail.webapp.BigdataServlet.submitApiTask(BigdataServlet.java:220)
             [java] 	... 26 more
             [java] Caused by: java.lang.IllegalStateException: Required property: com.bigdata.bop.controller.SubqueryAnnotations.subquery : class com.bigdata.bop.controller.JVMNamedSubqueryOp
             [java] 	at com.bigdata.bop.CoreBaseBOp.getRequiredProperty(CoreBaseBOp.java:243)
             [java] 	at com.bigdata.bop.controller.JVMNamedSubqueryOp.<init>(JVMNamedSubqueryOp.java:122)
             [java] 	at com.bigdata.bop.controller.JVMNamedSubqueryOp.<init>(JVMNamedSubqueryOp.java:143)
             [java] 	at com.bigdata.rdf.sparql.ast.eval.AST2BOpUtility.addNamedSubquery(AST2BOpUtility.java:864)
             [java] 	at com.bigdata.rdf.sparql.ast.eval.AST2BOpUtility.addNamedSubqueries(AST2BOpUtility.java:805)
             [java] 	at com.bigdata.rdf.sparql.ast.eval.AST2BOpUtility.convertQueryBaseWithScopedVars(AST2BOpUtility.java:391)
             [java] 	at com.bigdata.rdf.sparql.ast.eval.AST2BOpUtility.convert(AST2BOpUtility.java:247)
             [java] 	at com.bigdata.rdf.sparql.ast.eval.ASTEvalHelper.evaluateTupleQuery(ASTEvalHelper.java:244)
             [java] 	at com.bigdata.rdf.sail.BigdataSailTupleQuery.evaluate(BigdataSailTupleQuery.java:93)
             [java] 	at com.bigdata.rdf.sail.BigdataSailTupleQuery.evaluate(BigdataSailTupleQuery.java:75)
             [java] 	at org.openrdf.repository.sail.SailTupleQuery.evaluate(SailTupleQuery.java:75)
             [java] 	at com.bigdata.rdf.sail.webapp.BigdataRDFContext$TupleQueryTask.doQuery(BigdataRDFContext.java:1409)
             [java] 	at com.bigdata.rdf.sail.webapp.BigdataRDFContext$AbstractQueryTask.innerCall(BigdataRDFContext.java:1296)
             [java] 	at com.bigdata.rdf.sail.webapp.BigdataRDFContext$AbstractQueryTask.call(BigdataRDFContext.java:1261)
             [java] 	at com.bigdata.rdf.sail.webapp.BigdataRDFContext$AbstractQueryTask.call(BigdataRDFContext.java:503)
             [java] 	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
             [java] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
             [java] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        
        
        Show
        michaelschmidt michaelschmidt added a comment - Here's a related query that fails and might be investigated as part of the ticket's implementation: SELECT * WHERE { BIND(1 as ?A) OPTIONAL { { BIND(2 as ?B) } UNION { BIND(3 as ?C) } } OPTIONAL { BIND( 'unbound' as ?B ) } OPTIONAL { BIND( 'unbound' as ?C ) } } -> throws an exception: [java] java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Required property: com.bigdata.bop.controller.SubqueryAnnotations.subquery : class com.bigdata.bop.controller.JVMNamedSubqueryOp [java] at java.util.concurrent.FutureTask.report(FutureTask.java:122) [java] at java.util.concurrent.FutureTask.get(FutureTask.java:188) [java] at com.bigdata.rdf.sail.webapp.QueryServlet.doSparqlQuery(QueryServlet.java:517) [java] at com.bigdata.rdf.sail.webapp.QueryServlet.doPost(QueryServlet.java:191) [java] at com.bigdata.rdf.sail.webapp.RESTServlet.doPost(RESTServlet.java:237) [java] at com.bigdata.rdf.sail.webapp.MultiTenancyServlet.doPost(MultiTenancyServlet.java:144) [java] at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) [java] at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) [java] at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769) [java] at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585) [java] at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) [java] at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577) [java] at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223) [java] at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125) [java] at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515) [java] at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) [java] at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059) [java] at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) [java] at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215) [java] at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110) [java] at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) [java] at org.eclipse.jetty.server.Server.handle(Server.java:497) [java] at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311) [java] at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248) [java] at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540) [java] at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610) [java] at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539) [java] at java.lang.Thread.run(Thread.java:745) [java] Caused by: java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Required property: com.bigdata.bop.controller.SubqueryAnnotations.subquery : class com.bigdata.bop.controller.JVMNamedSubqueryOp [java] at java.util.concurrent.FutureTask.report(FutureTask.java:122) [java] at java.util.concurrent.FutureTask.get(FutureTask.java:188) [java] at com.bigdata.rdf.sail.webapp.QueryServlet$SparqlQueryTask.call(QueryServlet.java:710) [java] at com.bigdata.rdf.sail.webapp.QueryServlet$SparqlQueryTask.call(QueryServlet.java:534) [java] at com.bigdata.rdf.task.ApiTaskForIndexManager.call(ApiTaskForIndexManager.java:67) [java] at java.util.concurrent.FutureTask.run(FutureTask.java:262) [java] at com.bigdata.rdf.task.AbstractApiTask.submitApiTask(AbstractApiTask.java:315) [java] at com.bigdata.rdf.sail.webapp.BigdataServlet.submitApiTask(BigdataServlet.java:220) [java] ... 26 more [java] Caused by: java.lang.IllegalStateException: Required property: com.bigdata.bop.controller.SubqueryAnnotations.subquery : class com.bigdata.bop.controller.JVMNamedSubqueryOp [java] at com.bigdata.bop.CoreBaseBOp.getRequiredProperty(CoreBaseBOp.java:243) [java] at com.bigdata.bop.controller.JVMNamedSubqueryOp.<init>(JVMNamedSubqueryOp.java:122) [java] at com.bigdata.bop.controller.JVMNamedSubqueryOp.<init>(JVMNamedSubqueryOp.java:143) [java] at com.bigdata.rdf.sparql.ast.eval.AST2BOpUtility.addNamedSubquery(AST2BOpUtility.java:864) [java] at com.bigdata.rdf.sparql.ast.eval.AST2BOpUtility.addNamedSubqueries(AST2BOpUtility.java:805) [java] at com.bigdata.rdf.sparql.ast.eval.AST2BOpUtility.convertQueryBaseWithScopedVars(AST2BOpUtility.java:391) [java] at com.bigdata.rdf.sparql.ast.eval.AST2BOpUtility.convert(AST2BOpUtility.java:247) [java] at com.bigdata.rdf.sparql.ast.eval.ASTEvalHelper.evaluateTupleQuery(ASTEvalHelper.java:244) [java] at com.bigdata.rdf.sail.BigdataSailTupleQuery.evaluate(BigdataSailTupleQuery.java:93) [java] at com.bigdata.rdf.sail.BigdataSailTupleQuery.evaluate(BigdataSailTupleQuery.java:75) [java] at org.openrdf.repository.sail.SailTupleQuery.evaluate(SailTupleQuery.java:75) [java] at com.bigdata.rdf.sail.webapp.BigdataRDFContext$TupleQueryTask.doQuery(BigdataRDFContext.java:1409) [java] at com.bigdata.rdf.sail.webapp.BigdataRDFContext$AbstractQueryTask.innerCall(BigdataRDFContext.java:1296) [java] at com.bigdata.rdf.sail.webapp.BigdataRDFContext$AbstractQueryTask.call(BigdataRDFContext.java:1261) [java] at com.bigdata.rdf.sail.webapp.BigdataRDFContext$AbstractQueryTask.call(BigdataRDFContext.java:503) [java] at java.util.concurrent.FutureTask.run(FutureTask.java:262) [java] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [java] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        Hide
        michaelschmidt michaelschmidt added a comment -

        Let?s consider the following pattern
        ?
        {
        Q_before
        optionalGroup_1
        optionalGroup_2
        ?
        optionalGroup_n
        }

        where we assume that Q_before is translated already and optionalGroup_1, ? optionalGroup_n are complex OPTIONAL groups containing arbitrary subqueries. Then, starting with the result of Q_before, my proposed approach (building/modifying what was done in ASTComplexOptionalOptimizer so far) is as follows:

        <algorithm>
        Let q_cur :=

        {Q_before}

        (assume a hash index has been calculated, say hashIndex(Q_before)

        > Iteration loop
        project the ?relevant? variables from q_cur (using DISTINCT, to avoid duplicates) and pass them into group_1 -> this helps to avoid the computation of unnecessary bindings
        (ii) evaluate group_1 by joining it with the projected set computed in , i.e. computing \join optionalGroup_1
        (iii) project the ?relevant? variables from the evaluation result of group_1 -> this helps to avoid passing around variables that are not used in subsequent evaluation steps or the final output,
        (iv) set q_cur := result from step (iii), compute hashIndex(q_cur)
        (v) go back to step and continue with the next group (if there is no more group, quit iteration loop)

        Finally, join Q_before with q_cur using hashIndex(q_cur)
        </algorithm>

        Note that the final join is required to make sure that bindings from q_cur that were projected out in the first call of step are added again. Also note that, since we iteratively pass in the current mapping sets, it suffices to perform the final join with the latest result (previously, in the ASTComplexJoinOptimizer there were joins with all these subqueries in the end (this was leading to wrong multiplicities in the general case).

        The two major questions that are not answered above are:

                1. Step : what are the ?relevant? variables?
                  What we can always safely pass in (while still adhering to the bottom up semantics) are the known bindings for the definitely produced variables of optionalGroup_i, let?s call them definite_i. Whenever definite_i is a proper subset of the the variables that are possibly bound in q_cur, we need to add a DISTINCT projection for these variables before passing them in (see example 1 below). It is also safe to pass in any subset of definite_i, under the assumption that we add a DISTINCT projection (not sure whether this would make sense in some cases though).
          1. Example 1

        > optionalGroup_1 := OPTIONAL { ?B -> 1 OPTIONAL { ?C -> 2 } }

        1. Example 1.1
          Assume our input mapping set is q_init := q_cur :=
          Unknown macro: { { ?A \-> 1, ?B \-> 1 }, { ?A -> 2, ?B-> 1}

          }. According to the rule above, it is valid to pass in the bindings for ?B. Projecting for DISTINCT ?B?s in q_cur, we thus pass in bs_in :=

          Unknown macro: { { ?B \-> 1 } }

          , which gives us

          Unknown macro: { { ?B \-> 1, ?C \-> 2 } } as the result of evaluation optionalGroup_1. Joining this result in the end with the hashIndex(q_init) gives us the expected result, namely { { ?A \-> 1, ?B \-> 1, ?C \-> 2 }, { ?A -> 2, ?B-> 1, ?C -> 2} }. Note that without DISTINCT projection we get the multiplicities wrong.

          # Example 1.2
          As a second example, assume our input mapping set looks like q_init := q_cur := { { ?A \-> 1, ?C->1 } } instead. Following the rules mentioned above, we push in the empty mapping bs_in := { { } }, since ?B is not bound in q_init. This gives us the inner result bs_inner := { { ?B -> 1, ?C -> 2 } }

          , so the final result remains

          Unknown macro: { { ?A \-> 1, ?C \-> 1 } }

          . In particular, note that in this case we are not allowed to flood in the (projected) maybe variables, namely bs_in :=

          Unknown macro: { { ?C \-> 1} }

          : this would give us bs_inner :=

          Unknown macro: { { ?B \-> 1, ?C \-> 1 } }

          as the result of evaluating the inner group and

          Unknown macro: { { ?A \-> 1, ?B \-> 1, ?C \-> 1 } }

          as the (wrong) overall result.

        It should be noted that the pattern in the example query is ill designed. For well designed patterns it should still be safe to pass in the maybe variables rather than only definite variables. Also note that we may safely flood in any variable that is not contained in the maybe variables of the group pattern (non conflicting vars). As a consequence, if the maybe variables of group_i equal the certain variables of group_i, it is safe to pass an arbitrary DISTINCT projection on the binding set. So the crucial condition hence is that none of the maybe vars are included (or, even more precisely: none of the maybe vars used in an ill designed way).

                1. Case (iii): what to project away after having evaluated an optionalGroup?
                  In some cases we may want to project away unneeded variables delivered by the optional group. The rule here is as follows: we need to retain all variables that are either reused in ?upper? parts of the query or in any of the subsequent join groups, namely optionalGroup_(i+1), ?, optionalGroup_n. You may want to think of this as all variables along the ancestor and following-sibling axis. Whenever this set of variables is a proper subset of the maybe vars of the respective join group, it thus might make sense to add an additional projection in step (ii.2), though. I have to think about multiplicities, but my initial guess would be that in this case we must not add a DISTINCT operator.

        This is not specific to the scenario we?re considering, we might add such a projection at any time in the query tree.

        ?

        Note that this strategy is partially implemented in the master/current BLZG-1163 branch, as part of the fixes in the above mentioned bugs.

        There are a couple more open questions here that we might come up with:
        In case , it might make sense to pass in only a subset of the variables. This is something we could possibly decide based on appropriate cost measures. I would consider such optimizations as an edge case, though.
        It is not yet clear in which cases the hash index might be used for distinct projection -> given that we project in a subset of the join variables only (i.e., the join variables may contain maybe vars, while the projection does not), the hash index might or might not be used for distinct projection; this should be considered when implementing 1071
        There are some critical edge cases with FILTER expressions inside OPTIONAL. According to the SPARQL semantics ((http://www.w3.org/TR/rdf-sparql-query/#convertGraphPattern), these filters are ?lifted up?, in particular they may (and often do) operator over variables bound in through Q_before, but not mentioned in other places in the subgroup. Hence, the filter variables are not necessarily definitely bound variables. In such cases, the strategy sketched above fails, as it does not floods in bindings for these variables. Currently, there is some special handing in ASTSubGroupJoinVarOptimizer. However, (a) I do not like the idea of this optimizer doing correctness critical tasks and (b) I?m not sure about the semantics of such cases in ill designed patterns, i.e. where the filter variable (from an outer query) is bound inside an inner OPTIONAL inside the join group, pattern Q_before binds ?a, and the subgroup is OPTIONAL { pattern-without-?a FILTER (?a=?) OPTIONAL { BIND-?a-somehow } }.
        The strategy naturally carries over to the flooding of bindings into subqueries (in fact, treating subqueries is less complex as the OPTIONAL case)

        I would propose to come up with a more generalised framework (e.g. in the form of an optimiser or general translation patterns implemented in streamlined functions). This could be done as part of BLZG-1163. Any thoughts on the above? At some point (next sprint?) I?d like to discuss how this maps to the operators in StaticAnalysis, what you think about it, and how to best implement it.

        Note that we should also adjust the documentation of ASTComplexOptionalOptimizer at some point (which is not inline with what's going on anymore).

        Show
        michaelschmidt michaelschmidt added a comment - Let?s consider the following pattern ? { Q_before optionalGroup_1 optionalGroup_2 ? optionalGroup_n } where we assume that Q_before is translated already and optionalGroup_1, ? optionalGroup_n are complex OPTIONAL groups containing arbitrary subqueries. Then, starting with the result of Q_before, my proposed approach (building/modifying what was done in ASTComplexOptionalOptimizer so far) is as follows: <algorithm> Let q_cur := {Q_before} (assume a hash index has been calculated, say hashIndex(Q_before) > Iteration loop project the ?relevant? variables from q_cur (using DISTINCT, to avoid duplicates) and pass them into group_1 -> this helps to avoid the computation of unnecessary bindings (ii) evaluate group_1 by joining it with the projected set computed in , i.e. computing \join optionalGroup_1 (iii) project the ?relevant? variables from the evaluation result of group_1 -> this helps to avoid passing around variables that are not used in subsequent evaluation steps or the final output, (iv) set q_cur := result from step (iii), compute hashIndex(q_cur) (v) go back to step and continue with the next group (if there is no more group, quit iteration loop) Finally, join Q_before with q_cur using hashIndex(q_cur) </algorithm> Note that the final join is required to make sure that bindings from q_cur that were projected out in the first call of step are added again. Also note that, since we iteratively pass in the current mapping sets, it suffices to perform the final join with the latest result (previously, in the ASTComplexJoinOptimizer there were joins with all these subqueries in the end (this was leading to wrong multiplicities in the general case). The two major questions that are not answered above are: Step : what are the ?relevant? variables? What we can always safely pass in (while still adhering to the bottom up semantics) are the known bindings for the definitely produced variables of optionalGroup_i, let?s call them definite_i. Whenever definite_i is a proper subset of the the variables that are possibly bound in q_cur, we need to add a DISTINCT projection for these variables before passing them in (see example 1 below). It is also safe to pass in any subset of definite_i, under the assumption that we add a DISTINCT projection (not sure whether this would make sense in some cases though). Example 1 > optionalGroup_1 := OPTIONAL { ?B -> 1 OPTIONAL { ?C -> 2 } } Example 1.1 Assume our input mapping set is q_init := q_cur := Unknown macro: { { ?A \-> 1, ?B \-> 1 }, { ?A -> 2, ?B-> 1} }. According to the rule above, it is valid to pass in the bindings for ?B. Projecting for DISTINCT ?B?s in q_cur, we thus pass in bs_in := Unknown macro: { { ?B \-> 1 } } , which gives us Unknown macro: { { ?B \-> 1, ?C \-> 2 } } as the result of evaluation optionalGroup_1. Joining this result in the end with the hashIndex(q_init) gives us the expected result, namely { { ?A \-> 1, ?B \-> 1, ?C \-> 2 }, { ?A -> 2, ?B-> 1, ?C -> 2} }. Note that without DISTINCT projection we get the multiplicities wrong. # Example 1.2 As a second example, assume our input mapping set looks like q_init := q_cur := { { ?A \-> 1, ?C->1 } } instead. Following the rules mentioned above, we push in the empty mapping bs_in := { { } }, since ?B is not bound in q_init. This gives us the inner result bs_inner := { { ?B -> 1, ?C -> 2 } } , so the final result remains Unknown macro: { { ?A \-> 1, ?C \-> 1 } } . In particular, note that in this case we are not allowed to flood in the (projected) maybe variables, namely bs_in := Unknown macro: { { ?C \-> 1} } : this would give us bs_inner := Unknown macro: { { ?B \-> 1, ?C \-> 1 } } as the result of evaluating the inner group and Unknown macro: { { ?A \-> 1, ?B \-> 1, ?C \-> 1 } } as the (wrong) overall result. It should be noted that the pattern in the example query is ill designed. For well designed patterns it should still be safe to pass in the maybe variables rather than only definite variables. Also note that we may safely flood in any variable that is not contained in the maybe variables of the group pattern (non conflicting vars). As a consequence, if the maybe variables of group_i equal the certain variables of group_i, it is safe to pass an arbitrary DISTINCT projection on the binding set. So the crucial condition hence is that none of the maybe vars are included (or, even more precisely: none of the maybe vars used in an ill designed way). Case (iii): what to project away after having evaluated an optionalGroup? In some cases we may want to project away unneeded variables delivered by the optional group. The rule here is as follows: we need to retain all variables that are either reused in ?upper? parts of the query or in any of the subsequent join groups, namely optionalGroup_(i+1), ?, optionalGroup_n. You may want to think of this as all variables along the ancestor and following-sibling axis. Whenever this set of variables is a proper subset of the maybe vars of the respective join group, it thus might make sense to add an additional projection in step (ii.2), though. I have to think about multiplicities, but my initial guess would be that in this case we must not add a DISTINCT operator. This is not specific to the scenario we?re considering, we might add such a projection at any time in the query tree. ? Note that this strategy is partially implemented in the master/current BLZG-1163 branch, as part of the fixes in the above mentioned bugs. There are a couple more open questions here that we might come up with: In case , it might make sense to pass in only a subset of the variables. This is something we could possibly decide based on appropriate cost measures. I would consider such optimizations as an edge case, though. It is not yet clear in which cases the hash index might be used for distinct projection -> given that we project in a subset of the join variables only (i.e., the join variables may contain maybe vars, while the projection does not), the hash index might or might not be used for distinct projection; this should be considered when implementing 1071 There are some critical edge cases with FILTER expressions inside OPTIONAL. According to the SPARQL semantics (( http://www.w3.org/TR/rdf-sparql-query/#convertGraphPattern ), these filters are ?lifted up?, in particular they may (and often do) operator over variables bound in through Q_before, but not mentioned in other places in the subgroup. Hence, the filter variables are not necessarily definitely bound variables. In such cases, the strategy sketched above fails, as it does not floods in bindings for these variables. Currently, there is some special handing in ASTSubGroupJoinVarOptimizer. However, (a) I do not like the idea of this optimizer doing correctness critical tasks and (b) I?m not sure about the semantics of such cases in ill designed patterns, i.e. where the filter variable (from an outer query) is bound inside an inner OPTIONAL inside the join group, pattern Q_before binds ?a, and the subgroup is OPTIONAL { pattern-without-?a FILTER (?a=?) OPTIONAL { BIND-?a-somehow } }. The strategy naturally carries over to the flooding of bindings into subqueries (in fact, treating subqueries is less complex as the OPTIONAL case) I would propose to come up with a more generalised framework (e.g. in the form of an optimiser or general translation patterns implemented in streamlined functions). This could be done as part of BLZG-1163 . Any thoughts on the above? At some point (next sprint?) I?d like to discuss how this maps to the operators in StaticAnalysis, what you think about it, and how to best implement it. Note that we should also adjust the documentation of ASTComplexOptionalOptimizer at some point (which is not inline with what's going on anymore).
        Hide
        michaelschmidt michaelschmidt added a comment -

        The following changes were made (partially related to ticket BLZG-1119):


        - ASTComplexOptionalOptimizer was fixed (strategy in place caused problems); however, the new optimizer is currently not superior to the standard translation pattern, so the class is not included in the DefaultOptimizerList and marked as deprecated
        - Changed translation strategy for subqueries and complex (optional) join groups in AST2BOpUtility, now projecting in the relevant variables into the respective subgroup
        - Built-in support in HashIndexOp (HTreeHashJoinUtility, JVMHashJoinUtility, HashIndexOp) for computing distinct projections over a given set of variables
        - Fixed in ASTSubGroupJoinVarOptimizer, which caused problems related to bottom-up vs. top-down semantics
        - Relaxed JVMDistinctFilter to deal with projection for empty variable set
        - Added many test case (and adjusted some test cases to new query execution plans)

        Show
        michaelschmidt michaelschmidt added a comment - The following changes were made (partially related to ticket BLZG-1119 ): - ASTComplexOptionalOptimizer was fixed (strategy in place caused problems); however, the new optimizer is currently not superior to the standard translation pattern, so the class is not included in the DefaultOptimizerList and marked as deprecated - Changed translation strategy for subqueries and complex (optional) join groups in AST2BOpUtility, now projecting in the relevant variables into the respective subgroup - Built-in support in HashIndexOp (HTreeHashJoinUtility, JVMHashJoinUtility, HashIndexOp) for computing distinct projections over a given set of variables - Fixed in ASTSubGroupJoinVarOptimizer, which caused problems related to bottom-up vs. top-down semantics - Relaxed JVMDistinctFilter to deal with projection for empty variable set - Added many test case (and adjusted some test cases to new query execution plans)
        Hide
        michaelschmidt michaelschmidt added a comment -

        Implemented in branch ticket_1071, issued pull request.

        Show
        michaelschmidt michaelschmidt added a comment - Implemented in branch ticket_1071, issued pull request.
        Hide
        michaelschmidt michaelschmidt added a comment -

        Ran performance and regression test for fix:


        - BSBM: showing about the same performance
        - govtrack: observed performance gain of ~10-20% for affected queries (query10, query40-1, query41)

        Merged branch into master. Once CI on master succeeds as well, this ticket will be closed.

        Show
        michaelschmidt michaelschmidt added a comment - Ran performance and regression test for fix: - BSBM: showing about the same performance - govtrack: observed performance gain of ~10-20% for affected queries (query10, query40-1, query41) Merged branch into master. Once CI on master succeeds as well, this ticket will be closed.
        Hide
        michaelschmidt michaelschmidt added a comment -

        CI on master looks good, closing issue.

        Show
        michaelschmidt michaelschmidt added a comment - CI on master looks good, closing issue.

          People

          • Assignee:
            michaelschmidt michaelschmidt
            Reporter:
            michaelschmidt michaelschmidt
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved: