Uploaded image for project: 'Blazegraph (by SYSTAP)'
  1. Blazegraph (by SYSTAP)
  2. BLZG-1108

GROUP BY optimization using distinct-term-scan and fast-range-count

    Details

      Description

      The following query can be rewritten to use the distinct term scan (to produce the bindings for ?z) and then a group-by over the fast-range counts for the as-bound values of ?x rdf:type ?z. This would take the runtime cost from on the order of 4-8 seconds for a large data set to the order of a few 10s of milliseconds.

      SELECT  (COUNT(*) as ?count) ?z WHERE {  ?x rdf:type ?z  } GROUP BY ?z;
      

        Activity

        Hide
        michaelschmidt michaelschmidt added a comment -

        Implemented in branch group-by-optimization.

        The implementation approach is as follows:

          • A new class ASTSimpleGroupByAndCountOptimizer.java **
            The classe implements the core optimization step, which optimizes
            <code>
            SELECT (COUNT as ?count) ?z WHERE { ?x rdf:type ?z } GROUP BY ?z
            </code>

        and similar patterns. The optimizer aims at establishing an execution plan that applies a combination of distinct term scan pattern (to efficiently compute the distinct values for the group variable) and fast range count pattern to efficiently calculate the COUNT, without materialization of the variables on which the COUNT operation is performed.

        The basic idea is to
        replace the GROUP BY pattern through a SELECT DISTINCT subquery to calculate the distinct bindings for variable ?z first, and
        (ii) use a fast range count operator to efficiently calculate the COUNT.

        Note that the subquery produced in step may (where possible) be optimized by the ASTDistinctTermScanOptimizer, i.e. if possible the subquery producing the ?z bindings will be replaced by a distinct term scan in a later optimization step.

          • As part of the fix, the ASTDistinctTermScanOptimizer has been adjusted. **
            Previously, the optimizer only applied to SELECT DISTINCT ?var { tp } where the triple pattern contains variables only. However, distinct term scans are also applicable for triple patterns containing constants whenever the concatenation of the constants and the distinct variable is matched by an existing index. The getApplicableKeyOrderIfExists method checks for such an index being available. In case an index is found, the triple pattern is annotated with the index for later optimization (we must make sure to use exactly this index for the distinct term scan operation.
          • The enhancement of the ASTDistinctTermScanOptimizer **
            -> to deal with triple patterns containing constants required adjustments to the DistinctTermScanOp, which now switches between a DistinctTermAdvancer (in case there are no constants in the tp) and a DistinctMultiTermAdvancer (in case there are one or more constants). Further, code was added to account for the fact that we are not necessarily iterating over the first iv encoded in the key in case we're dealing with constants.
          • Addition test case and adjustments **
            New test cases were implemented in TestSimpleGroupByAndCountOptimizer, as well as minor adjustments/fixes to existing test cases added; adjustments basically come due to the facts that the DistinctTermScanOptimizer now adds query hints (namely, the key order to use) and (ii) that some test cases for the DistinctTermScanOptimizer now succeed in applying the optimization (for triple patterns containing constants)
          • A minor bug in the FastRangeCountOp was fixed (the solution set was not properly passed) **
        Show
        michaelschmidt michaelschmidt added a comment - Implemented in branch group-by-optimization. The implementation approach is as follows: A new class ASTSimpleGroupByAndCountOptimizer.java ** The classe implements the core optimization step, which optimizes <code> SELECT (COUNT as ?count) ?z WHERE { ?x rdf:type ?z } GROUP BY ?z </code> and similar patterns. The optimizer aims at establishing an execution plan that applies a combination of distinct term scan pattern (to efficiently compute the distinct values for the group variable) and fast range count pattern to efficiently calculate the COUNT, without materialization of the variables on which the COUNT operation is performed. The basic idea is to replace the GROUP BY pattern through a SELECT DISTINCT subquery to calculate the distinct bindings for variable ?z first, and (ii) use a fast range count operator to efficiently calculate the COUNT. Note that the subquery produced in step may (where possible) be optimized by the ASTDistinctTermScanOptimizer, i.e. if possible the subquery producing the ?z bindings will be replaced by a distinct term scan in a later optimization step. As part of the fix, the ASTDistinctTermScanOptimizer has been adjusted. ** Previously, the optimizer only applied to SELECT DISTINCT ?var { tp } where the triple pattern contains variables only. However, distinct term scans are also applicable for triple patterns containing constants whenever the concatenation of the constants and the distinct variable is matched by an existing index. The getApplicableKeyOrderIfExists method checks for such an index being available. In case an index is found, the triple pattern is annotated with the index for later optimization (we must make sure to use exactly this index for the distinct term scan operation. The enhancement of the ASTDistinctTermScanOptimizer ** -> to deal with triple patterns containing constants required adjustments to the DistinctTermScanOp, which now switches between a DistinctTermAdvancer (in case there are no constants in the tp) and a DistinctMultiTermAdvancer (in case there are one or more constants). Further, code was added to account for the fact that we are not necessarily iterating over the first iv encoded in the key in case we're dealing with constants. Addition test case and adjustments ** New test cases were implemented in TestSimpleGroupByAndCountOptimizer, as well as minor adjustments/fixes to existing test cases added; adjustments basically come due to the facts that the DistinctTermScanOptimizer now adds query hints (namely, the key order to use) and (ii) that some test cases for the DistinctTermScanOptimizer now succeed in applying the optimization (for triple patterns containing constants) A minor bug in the FastRangeCountOp was fixed (the solution set was not properly passed) **
        Hide
        bryanthompson bryanthompson added a comment -

        Done with code review. Looks good.

        Change log.
        - Merged in delta from the master.
        - Lots of final attributes. Helps to reason about data races, etc.
        - Some import organization.


        - TODO A correct rejection case should be written for the fast-range-count optimizer when the triple or quad store is using isolatable indices (see BigdataSail.Options.ISOLATABLE_INDICES) since the fast-range-count is an over-estimate if delete markers are in use rather than being exact. This issue predates this branch
        - we did not write that test case when initially implementing the fast-range-count optimizer.

        This looks read to be merged back to the master.

        I would like to see whether a performance regression exists for this branch and also for the jetty-client refactor branch (which is nearly through its code review). See BIGDATA_JETTY_CLIENT_2.

        Show
        bryanthompson bryanthompson added a comment - Done with code review. Looks good. Change log. - Merged in delta from the master. - Lots of final attributes. Helps to reason about data races, etc. - Some import organization. - TODO A correct rejection case should be written for the fast-range-count optimizer when the triple or quad store is using isolatable indices (see BigdataSail.Options.ISOLATABLE_INDICES) since the fast-range-count is an over-estimate if delete markers are in use rather than being exact. This issue predates this branch - we did not write that test case when initially implementing the fast-range-count optimizer. This looks read to be merged back to the master. I would like to see whether a performance regression exists for this branch and also for the jetty-client refactor branch (which is nearly through its code review). See BIGDATA_JETTY_CLIENT_2.
        Hide
        michaelschmidt michaelschmidt added a comment -

        Checked in final changes and merged into master branch.

        Show
        michaelschmidt michaelschmidt added a comment - Checked in final changes and merged into master branch.
        Hide
        michaelschmidt michaelschmidt added a comment -

        Fixed undeterministic behaviour in order to repair test case (and made variables final) caused by HashSets.

        Fixes (make code deterministic):
        - bigdata-rdf/src/java/com/bigdata/rdf/sparql/ast/optimizers/ASTDistinctTermScanOptimizer.java

        • Associated fixes in Test Case:*
          - bigdata-rdf/src/test/com/bigdata/rdf/sparql/ast/optimizers/TestASTDistinctTermScanOptimizer.java

        -> fixed belatedly (ticket closed already) in branch quads-loading-in-triple-mode, issuing pull request

        Show
        michaelschmidt michaelschmidt added a comment - Fixed undeterministic behaviour in order to repair test case (and made variables final) caused by HashSets. Fixes (make code deterministic): - bigdata-rdf/src/java/com/bigdata/rdf/sparql/ast/optimizers/ASTDistinctTermScanOptimizer.java Associated fixes in Test Case:* - bigdata-rdf/src/test/com/bigdata/rdf/sparql/ast/optimizers/TestASTDistinctTermScanOptimizer.java -> fixed belatedly (ticket closed already) in branch quads-loading-in-triple-mode, issuing pull request

          People

          • Assignee:
            michaelschmidt michaelschmidt
            Reporter:
            bryanthompson bryanthompson
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved: