  1. Blazegraph (by SYSTAP)
  2. BLZG-510

Optimize OPTIONALs with multiple statement patterns

    Details

      Description

      mroycsi wrote: Is there an obvious reason why these two queries differ so drastically in speed? The first one takes 8 seconds, while the second takes only 123 milliseconds.

      SELECT *
      WHERE {
            OPTIONAL {
                  {
                        ?_var1 rdf:type <http://suawa.org/mediadb#Album>.
                        ?_var1 p1:genre ?_var8.
                        ?_var8 dc:title ?_var9.
                        FILTER ((?_var9 in("Folk", "Hip-Hop"))).
                        OPTIONAL {
                              ?_var1 dc:title ?_var10
                        }.
                        OPTIONAL {
                              ?_var1 p1:mainArtist ?_var12.
                              ?_var12 dc:title ?_var11
                        }
                  }
            }.
      }
      ORDER BY ?_var1
      LIMIT 20
      
      SELECT *
      WHERE {
            OPTIONAL {
                  {
                        ?_var1 rdf:type <http://suawa.org/mediadb#Album>.
                        ?_var1 p1:genre ?_var8.
                        ?_var8 dc:title ?_var9.
                        FILTER ((?_var9 in("Folk", "Hip-Hop"))).
                        OPTIONAL {
                              ?_var1 dc:title ?_var10
                        }.
                        OPTIONAL {
                              ?_var1 p1:mainArtist ?_var12.
                        }
                  }
            }.
      }
      ORDER BY ?_var1
      LIMIT 20
      

        Activity

        bryanthompson bryanthompson added a comment -

        Added a unit test for an optional group with a filter which always fails. This demonstrates a bug/feature in the QueryEngine. If an operator has never been run, then the lastPass annotation will not cause that operator to be triggered. However, it must be triggered in order for the optional hash join at the end of the optional group to be executed when the optional group does not produce any solutions. It also appears that we need to trigger an aggregation operator on a last pass evaluation even if there are no solutions since it looks like SPARQL wants us to output a solution for a single empty group in that case. I will take up this issue with lastPass evaluation next.

        We also noticed a bug in QueryResultUtil where it will mistakenly accept some solutions as equals() when one solution in fact has bindings which are not present in the other solution. This bug was masking some test failures. Mike is going to take a look at the problem with QueryResultUtil.
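The kind of comparison bug described above can be illustrated with a minimal sketch. This is not the QueryResultUtil code; the function names are hypothetical and solutions are modeled as plain Python dicts mapping variable names to values. A one-sided check accepts a pair of solutions whenever every binding of the first appears in the second, so extra bindings in the second go unnoticed.

```python
def one_sided_match(a: dict, b: dict) -> bool:
    """Hypothetical buggy check: only verifies that a's bindings appear in b."""
    return all(var in b and b[var] == val for var, val in a.items())


def same_solution(a: dict, b: dict) -> bool:
    """Stricter check: both solutions must bind exactly the same variables
    to equal values."""
    if a.keys() != b.keys():
        return False
    return all(a[var] == b[var] for var in a)


expected = {"x": 1}
actual = {"x": 1, "y": 2}  # carries a binding the expected solution lacks

loose = one_sided_match(expected, actual)   # True: the extra binding is missed
strict = same_solution(expected, actual)    # False: the mismatch is detected
```

A test harness built on the one-sided check would pass even though the two solutions differ, which is how such a bug can mask test failures.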

        Committed revision r5321.

        bryanthompson bryanthompson added a comment -


        - Added an AST optimizer level unit test for lifting out a subquery
          involving LIMIT and ORDER BY. All the cases which are also handled
          by ASTSparql11SubqueryOptimizer have unit tests now.

        - Modified logic to replace the JoinGroupNode when lifting out a
          sub-select rather than leaving the INCLUDE embedded in the existing
          JoinGroupNode. The named subquery name is now something simpler than
          a UUID as well (which makes it predictable for the unit tests).

        - Change to lift out subqueries involving aggregates.

        - Change to lift out SubqueryRoot if the runOnce query hint is
          specified.

        - Change to lift out subqueries where none of the projected variables
          are incoming bound, with a unit test.

          Note: ASK (aka EXISTS/NOT-EXISTS) subqueries are currently NOT lifted.
          I need to look into this case further. An ASK subquery currently only
          projects the anonymous variable which gets bound depending on whether
          or not the ASK subquery has a solution. I need to look at what should
          be projected into an ASK subquery and make sure that we are doing that
          right and then look at what would be involved in lifting an ASK
          subquery.

        - Moved some of the logic for the identification of join variables into
          the StaticAnalysis class. We need to develop this logic further and
          write unit tests for it as well. The code needs to decide the join
          variables based on the position in which the subquery will actually be
          run. For the moment it should assume that INCLUDEs run first in a
          join group while non-lifted sub-selects run after the required
          statement patterns. However, it does not do that and we will have to
          modify this again as part of the RTO integration.

        - Added convenience method to conditionally create and return the
          NamedSubqueriesNode and cleaned up code which was doing this in a
          variety of different places.

        Committed revision r5323.
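The rule "lift out subqueries where none of the projected variables are incoming bound" can be sketched as a set intersection. This is an illustration only, not the StaticAnalysis code: the function names are hypothetical, and variables are modeled as plain strings. It assumes the join variables of a sub-select are the projected variables that are already bound when the sub-select runs.

```python
def join_variables(incoming_bound, projected):
    """Variables shared between what is already bound on entry and what the
    sub-select projects (assumed definition for this sketch)."""
    return set(incoming_bound) & set(projected)


def can_lift(incoming_bound, projected):
    """A sub-select with no join variables does not depend on the incoming
    solutions, so it could be run once as a named subquery."""
    return not join_variables(incoming_bound, projected)


# The sub-select projects ?album only; ?artist is bound on entry.
independent = can_lift({"artist"}, {"album"})            # no overlap
dependent = can_lift({"artist"}, {"album", "artist"})    # ?artist overlaps
```

The result depends on which variables count as incoming bound, which is why the position at which the subquery actually runs in the join group matters.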

        bryanthompson bryanthompson added a comment -

        MikeP has committed code to run UNION using the TEE pattern, but we have not yet applied this pattern to running multiple named subqueries in parallel. This change was r5322.

        bryanthompson bryanthompson added a comment -

        Bug fix to QueryBase. The parent reference was not being cleared when a group node was lifted into a named subquery. This could cause static analysis on the named subquery to produce the wrong join variables because it was following the parent reference back into the original parent group in the main query. This was affecting some cases where we would then apply the hash join pattern to solve an optional group in the named subquery.
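A stale parent reference of this kind can be reproduced in a few lines. The sketch below is not the QueryBase code; `GroupNode`, `incoming_bound`, and `lift` are hypothetical names, and the analysis is reduced to collecting the variables bound by ancestor groups.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GroupNode:
    name: str
    bound_vars: set = field(default_factory=set)
    parent: Optional["GroupNode"] = None

    def incoming_bound(self) -> set:
        """Variables bound by ancestor groups, found by walking parent links."""
        seen, node = set(), self.parent
        while node is not None:
            seen |= node.bound_vars
            node = node.parent
        return seen


def lift(group: GroupNode) -> GroupNode:
    """Detach a group so it can be analyzed as a standalone named subquery."""
    group.parent = None  # without this, analysis walks back into the main query
    return group


main = GroupNode("main", {"s"})
optional = GroupNode("optional", {"o"}, parent=main)

before = optional.incoming_bound()   # sees ?s from the main query
lift(optional)
after = optional.incoming_bound()    # empty once the link is cleared
```

If the parent link survives the lift, the analysis still reports `s` as incoming bound for a group that now runs on its own, which corresponds to the wrong join variables described above.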

        Committed revision r5354.

        bryanthompson bryanthompson added a comment -

        Modified RunState to always trigger any operator that is evaluated on the query controller and has requested last pass evaluation. startQuery() now immediately modifies the RunState for any such operator so that it appears as if the operator has already been evaluated zero times. (The code makes a distinction between "never evaluated", in which case various collections do not have an entry for the bopId of the operator, and "evaluated at least once", in which case those collections are non-empty. For evaluated "zero" times, we make the appropriate collections non-empty but populate them with zero counters, etc.)
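The difference between "never evaluated" and "evaluated zero times" can be sketched with a single counter map. This is a simplified model, not the actual RunState class: `RunStateSketch`, its `preregister` flag, and its method names are hypothetical, and the several collections mentioned above are collapsed into one dict keyed by bopId.

```python
class RunStateSketch:
    def __init__(self, last_pass_ops, preregister=True):
        self.last_pass_ops = set(last_pass_ops)
        self.eval_count = {}  # bopId -> number of evaluations
        if preregister:
            # Make each last-pass operator look as if it has already been
            # evaluated zero times.
            for bop_id in self.last_pass_ops:
                self.eval_count[bop_id] = 0

    def on_evaluated(self, bop_id):
        self.eval_count[bop_id] = self.eval_count.get(bop_id, 0) + 1

    def last_pass_targets(self):
        """Last-pass operators that have an entry and so can be triggered."""
        return sorted(b for b in self.last_pass_ops if b in self.eval_count)


old_behavior = RunStateSketch([7], preregister=False)
new_behavior = RunStateSketch([7])

skipped = old_behavior.last_pass_targets()     # operator 7 is never triggered
triggered = new_behavior.last_pass_targets()   # operator 7 gets its last pass
```

In this model an operator that received no solutions has no entry and is skipped, while pre-registering a zero counter is enough to make it eligible for the last pass.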

        The test suite for RunState was updated to expect this mock evaluation pattern.

        I've made the sub-group using hash join pattern the default with this commit.
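To show why the join at the end of an optional group has to run even when the group produces no solutions, here is a minimal sketch of an optional (left outer) hash join. It is not the Blazegraph operator; `optional_hash_join` is a hypothetical name, solutions are plain dicts, and the join variables are assumed to be bound in every solution.

```python
from collections import defaultdict


def optional_hash_join(left, right, join_vars):
    """Join left with right on join_vars; keep unmatched left solutions."""
    index = defaultdict(list)
    for r in right:
        index[tuple(r[v] for v in join_vars)].append(r)

    out = []
    for l in left:
        matches = index.get(tuple(l[v] for v in join_vars), [])
        if matches:
            out.extend({**l, **r} for r in matches)
        else:
            out.append(dict(l))  # optional semantics: pass the solution through
    return out


albums = [{"album": "a1"}, {"album": "a2"}]

# The optional group produced nothing: every album must still be emitted.
no_matches = optional_hash_join(albums, [], ["album"])

# The optional group matched only a1.
some_matches = optional_hash_join(
    albums, [{"album": "a1", "title": "Blue"}], ["album"]
)
```

If the join step never ran because the group produced no solutions, the two album solutions in the first call would be lost rather than passed through.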

        This issue is closed. Committed revision r5355.

        Related issues which remain include:

        [1] https://sourceforge.net/apps/trac/bigdata/ticket/397 (AST Optimizer for queries with multiple complex optional groups)
        [2] https://sourceforge.net/apps/trac/bigdata/ticket/396 (Modify the static query optimizer to use the required join group as an input to the optional groups.)
        [3] https://sourceforge.net/apps/trac/bigdata/ticket/395 (HTree hash join performance)


          People

          • Assignee:
            bryanthompson bryanthompson
            Reporter:
            bryanthompson bryanthompson
