Details

    • Type: New Feature
    • Status: Accepted
    • Resolution: Unresolved
    • Affects Version/s: QUADS_QUERY_BRANCH
    • Fix Version/s: None
    • Component/s: Query Engine
    • Labels:
      None

      Description

      Bigdata scores over 17,000 QMpH (Query Mixes Per Hour) without Q5 of BSBM and only ~3000 QMpH when it is included in the query mix.

      Examine the evaluation plan for this Q5 in detail as it has a tremendous impact on our BSBM performance. Without it we see the following on the 100M data set:

      17301	QMpH
      ---------------------
      430.15	Q1
      275.63	Q2
      91.07	Q3 
      135.11	Q4
      N/A	Q5
      52.87	Q7
      60.11	Q8
      210.07	Q9
      83.68	Q10
      125.33	Q11
      173.77	Q12
      

      Looking at the 100M performance matrix[1], it is clear that we are still a bit slow on Q3 and Q9 when compared to some of the other stores, but fixing Q5 should go a long way to improving our total performance on that benchmark.

      [1] http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/V6/

        Activity

        Hide
        bryanthompson bryanthompson added a comment -

        I've been looking at this a bit using the RuntimeQueryOptimizer (RTO) [1]. I think that we need to review the filter attachment decisions made by RTO before going much further in this direction.

        [1] https://sourceforge.net/apps/trac/bigdata/ticket/64

        Show
        bryanthompson bryanthompson added a comment - I've been looking at this a bit using the RuntimeQueryOptimizer (RTO) [1] . I think that we need to review the filter attachment decisions made by RTO before going much further in this direction. [1] https://sourceforge.net/apps/trac/bigdata/ticket/64
        Hide
        bryanthompson bryanthompson added a comment -

        By applying constraints to the join graph, the RTO now produces a solution for BSBM Q5 which is 2x faster than the static query optimizer. In addition, running Q5 through the RTO without constraints now ensures that the unconstrained joins are run last, which is appropriate. As can be seen from the query evaluation times (10 trials), the RTO solution w/o the constraints is only slightly better than the static query optimizer. However, the RTO solution w/ the constraints is nearly 2x better than the static query optimizer.

        RTO solutions with and without constraints

        w/o : Total times: static=8741, runtime=8025, delta(static-runtime)=716
        w/  : Total times: static=7201, runtime=3686, delta(static-runtime)=3515
        

        The difference in the identified join paths may be readily identified per the tables below. The cumulative cost metric is significantly lower when the RTO considers the constraints. Constrained joins via the constraints are possible between vertices (3,4) and (5,6). In fact, the RTO automatically arranges those constraints in ordered pairs. It also runs vertex 0 last, which is responsible for materializing the rdfs:label. All in all, this appears to be a very sensible evaluation plan for BSBM Q5.

        RTO estimation of the best join path w/o constraints:

        *** Selected join path: [1, 2, 0, 4, 6, 3, 5]
        vertex sourceCard  *          f (   out/    in/  read) =    estCard  : sumEstCard
             1        N/A  *            (   N/A/   N/A/   N/A) =         16E :         16
             2         16E *     150.00 (   600/     4/ 10921) =       2400  :      13337
             0       2400  *       1.00 (   600/   600/   600) =       2400  :      16337
             4       2400  *       1.00 (   600/   600/   600) =       2400  :      19337
             6       2400  *       1.00 (   600/   600/   600) =       2400  :      22337
             3       2400  *       1.00 (   600/   600/   600) =       2400  :      25337
             5        N/A  *        N/A (   N/A/   N/A/   N/A) =        N/A  :        N/A
        

        RTO estimation of the best join path w/ constraints:

        *** Selected join path: [1, 2, 4, 3, 6, 5, 0]
        vertex sourceCard  *          f (   out/    in/  read) =    estCard  : sumEstCard
             1        N/A  *            (   N/A/   N/A/   N/A) =         16E :         16
             2         16E *     150.00 (   600/     4/ 10921) =       2400  :      13337
             4       2400  *       1.00 (   600/   600/   600) =       2400  :      16337
             3       2400  *       0.16 (    97/   600/   600) =        387  :      17324
             6        387  *       1.00 (    97/    97/    97) =        387  :      17808
             5        387  *       0.28 (    27/    97/    97) =        107  :      18012
             0        107  *       1.00 (    27/    27/    27) =        107  :      18146
        
        Show
        bryanthompson bryanthompson added a comment - By applying constraints to the join graph, the RTO now produces a solution for BSBM Q5 which is 2x faster than the static query optimizer. In addition, running Q5 through the RTO without constraints now ensures that the unconstrained joins are run last, which is appropriate. As can be seen from the query evaluation times (10 trials), the RTO solution w/o the constraints is only slightly better than the static query optimizer. However, the RTO solution w/ the constraints is nearly 2x better than the static query optimizer. RTO solutions with and without constraints w/o : Total times: static=8741, runtime=8025, delta(static-runtime)=716 w/ : Total times: static=7201, runtime=3686, delta(static-runtime)=3515 The difference in the identified join paths may be readily identified per the tables below. The cumulative cost metric is significantly lower when the RTO considers the constraints. Constrained joins via the constraints are possible between vertices (3,4) and (5,6). In fact, the RTO automatically arranges those constraints in ordered pairs. It also runs vertex 0 last, which is responsible for materializing the rdfs:label. All in all, this appears to be a very sensible evaluation plan for BSBM Q5. RTO estimation of the best join path w/o constraints: *** Selected join path: [1, 2, 0, 4, 6, 3, 5] vertex sourceCard * f ( out/ in/ read) = estCard : sumEstCard 1 N/A * ( N/A/ N/A/ N/A) = 16E : 16 2 16E * 150.00 ( 600/ 4/ 10921) = 2400 : 13337 0 2400 * 1.00 ( 600/ 600/ 600) = 2400 : 16337 4 2400 * 1.00 ( 600/ 600/ 600) = 2400 : 19337 6 2400 * 1.00 ( 600/ 600/ 600) = 2400 : 22337 3 2400 * 1.00 ( 600/ 600/ 600) = 2400 : 25337 5 N/A * N/A ( N/A/ N/A/ N/A) = N/A : N/A RTO estimation of the best join path w/ constraints: *** Selected join path: [1, 2, 4, 3, 6, 5, 0] vertex sourceCard * f ( out/ in/ read) = estCard : sumEstCard 1 N/A * ( N/A/ N/A/ N/A) = 16E : 16 2 16E * 150.00 ( 600/ 4/ 10921) = 2400 : 13337 4 2400 * 1.00 ( 600/ 600/ 600) = 2400 : 16337 3 2400 * 0.16 ( 97/ 600/ 600) = 387 : 17324 6 387 * 1.00 ( 97/ 97/ 97) = 387 : 17808 5 387 * 0.28 ( 27/ 97/ 97) = 107 : 18012 0 107 * 1.00 ( 27/ 27/ 27) = 107 : 18146
        Hide
        bryanthompson bryanthompson added a comment -

        It appears that the constraints are being attached to the last join in Rule2BOpUtility. I need to fix this before we can derive any benefit from the optimized query since it will do too much work if FILTER evaluation is deferred until the solutions are materialized.

        Show
        bryanthompson bryanthompson added a comment - It appears that the constraints are being attached to the last join in Rule2BOpUtility. I need to fix this before we can derive any benefit from the optimized query since it will do too much work if FILTER evaluation is deferred until the solutions are materialized.
        Hide
        bryanthompson bryanthompson added a comment -

        https://sourceforge.net/apps/trac/bigdata/ticket/260 has been filed for dynamically attaching constraints. I will retest BSBM Q5 (the non-reduced query mix) once this issue has been resolved.

        Show
        bryanthompson bryanthompson added a comment - https://sourceforge.net/apps/trac/bigdata/ticket/260 has been filed for dynamically attaching constraints. I will retest BSBM Q5 (the non-reduced query mix) once this issue has been resolved.
        Hide
        bryanthompson bryanthompson added a comment -

        BSBM Q5 uses filters which express a key-range constraint on an access path as an arithmetic function. For example:

        FILTER (?simProperty1 < (?origProperty1 + 120) && ?simProperty1 > (?origProperty1 - 120))
        

        If this is not lifted into a key-range constraint on the access path then the following triple pattern will always be 1-bound, reading a large portion of the database each time it is evaluated.

        ?product bsbm:productPropertyNumeric1 ?simProperty1 .
        

        Therefore, the next attack on this query is to take up https://sourceforge.net/apps/trac/bigdata/ticket/238 (lift key-range constraint onto access path).

        Show
        bryanthompson bryanthompson added a comment - BSBM Q5 uses filters which express a key-range constraint on an access path as an arithmetic function. For example: FILTER (?simProperty1 < (?origProperty1 + 120) && ?simProperty1 > (?origProperty1 - 120)) If this is not lifted into a key-range constraint on the access path then the following triple pattern will always be 1-bound, reading a large portion of the database each time it is evaluated. ?product bsbm:productPropertyNumeric1 ?simProperty1 . Therefore, the next attack on this query is to take up https://sourceforge.net/apps/trac/bigdata/ticket/238 (lift key-range constraint onto access path).
        Hide
        mikepersonick mikepersonick added a comment -

        1. add isNumericResult to IValueExpression interface
        2. isNumericResult=true for MathBOp and numeric IConstants
        3. look for the circumstance where a variable is both LT and GT an isNumericResult VE
        4. create a RangeBOp that produces two IVs
        - from and to
        5. modify Predicate to work with a RangeBOp for a term (right now it only works with IVariableOrConstants
        6. swap in the RangeBOp in place of the IVariable term for predicates in the query that use the ranged variable
        7. change AbstractKeyOrder
        - how it constructs its from and to keys for predicates
        8. RangeBOp is kinda like an IVariable in that it can be unbound (if it relies on a value expression which itself contains unbound variables
        - i.e. MathBOp). Keep an eye out for gotchas here

        Show
        mikepersonick mikepersonick added a comment - 1. add isNumericResult to IValueExpression interface 2. isNumericResult=true for MathBOp and numeric IConstants 3. look for the circumstance where a variable is both LT and GT an isNumericResult VE 4. create a RangeBOp that produces two IVs - from and to 5. modify Predicate to work with a RangeBOp for a term (right now it only works with IVariableOrConstants 6. swap in the RangeBOp in place of the IVariable term for predicates in the query that use the ranged variable 7. change AbstractKeyOrder - how it constructs its from and to keys for predicates 8. RangeBOp is kinda like an IVariable in that it can be unbound (if it relies on a value expression which itself contains unbound variables - i.e. MathBOp). Keep an eye out for gotchas here
        Hide
        mikepersonick mikepersonick added a comment -

        that comment was supposed to be for ticket 238

        https://sourceforge.net/apps/trac/bigdata/ticket/238

        Show
        mikepersonick mikepersonick added a comment - that comment was supposed to be for ticket 238 https://sourceforge.net/apps/trac/bigdata/ticket/238
        Hide
        bryanthompson bryanthompson added a comment -

        Mike wrote: You could try hand coding a pipeline that looked like this:

        PREFIX BIGDATA_QUERY_HINTS: <http://www.bigdata.com/queryHints#com.bigdata.rdf.sail.QueryHints.optimizer=None>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
        SELECT DISTINCT ?product ?productLabel
        WHERE { 
            {
              %ProductXYZ% bsbm:productFeature ?prodFeature .
              ?product bsbm:productFeature ?prodFeature .
              FILTER (%ProductXYZ% != ?product)
            } hashjoin(sharedVar=?product) {
              ?product bsbm:productPropertyNumeric1 ?simProperty1 .
              %ProductXYZ% bsbm:productPropertyNumeric1 ?origProperty1 .
              FILTER (?simProperty1 < (?origProperty1 + 120) && ?simProperty1 > (?origProperty1 - 120))
            } hashjoin(sharedVar=?product) {
              ?product bsbm:productPropertyNumeric2 ?simProperty2 .
              %ProductXYZ% bsbm:productPropertyNumeric2 ?origProperty2 .
              FILTER (?simProperty2 < (?origProperty2 + 170) && ?simProperty2 > (?origProperty2 - 170))
            }
            ?product rdfs:label ?productLabel .
        }
        ORDER BY ?productLabel
        LIMIT 5
        
        Show
        bryanthompson bryanthompson added a comment - Mike wrote: You could try hand coding a pipeline that looked like this: PREFIX BIGDATA_QUERY_HINTS: <http://www.bigdata.com/queryHints#com.bigdata.rdf.sail.QueryHints.optimizer=None> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/> SELECT DISTINCT ?product ?productLabel WHERE { { %ProductXYZ% bsbm:productFeature ?prodFeature . ?product bsbm:productFeature ?prodFeature . FILTER (%ProductXYZ% != ?product) } hashjoin(sharedVar=?product) { ?product bsbm:productPropertyNumeric1 ?simProperty1 . %ProductXYZ% bsbm:productPropertyNumeric1 ?origProperty1 . FILTER (?simProperty1 < (?origProperty1 + 120) && ?simProperty1 > (?origProperty1 - 120)) } hashjoin(sharedVar=?product) { ?product bsbm:productPropertyNumeric2 ?simProperty2 . %ProductXYZ% bsbm:productPropertyNumeric2 ?origProperty2 . FILTER (?simProperty2 < (?origProperty2 + 170) && ?simProperty2 > (?origProperty2 - 170)) } ?product rdfs:label ?productLabel . } ORDER BY ?productLabel LIMIT 5
        Hide
        michaelschmidt michaelschmidt added a comment -

        Here are a couple more ideas for optimizing Q5:

        • The single result triple patterns binding ?origProperty1 and ?origProperty2 are evaluated late, inducing a join of cardinality 80k. As they're both single-valued, they should probably be evaluated right at the beginning. Such patterns can conceptually be understood as a BIND, and their values could be propagated early on, binding and partially pre-evaluating the FILTER conditions.
        • FILTER decomposition and inlining of FILTER expressions might help to apply filters early on, see BLZG-1131
        • Having the ?origProperty1 and ?origProperty2 evaluated first, we could also apply the FILTERs earlier, namely right after (or even while) joining in the triple patterns for ?simProperty1
        • It seems to me that a hash join for joining the 72k intermediate results for the ?product is not the most efficient choice. It might be worth ordering the products (or extracting them in order) and implementing merge joins with inlined FILTER constraints.
        Show
        michaelschmidt michaelschmidt added a comment - Here are a couple more ideas for optimizing Q5: The single result triple patterns binding ?origProperty1 and ?origProperty2 are evaluated late, inducing a join of cardinality 80k. As they're both single-valued, they should probably be evaluated right at the beginning. Such patterns can conceptually be understood as a BIND, and their values could be propagated early on, binding and partially pre-evaluating the FILTER conditions. FILTER decomposition and inlining of FILTER expressions might help to apply filters early on, see BLZG-1131 Having the ?origProperty1 and ?origProperty2 evaluated first, we could also apply the FILTERs earlier, namely right after (or even while ) joining in the triple patterns for ?simProperty1 It seems to me that a hash join for joining the 72k intermediate results for the ?product is not the most efficient choice. It might be worth ordering the products (or extracting them in order) and implementing merge joins with inlined FILTER constraints.

          People

          • Assignee:
            michaelschmidt michaelschmidt
            Reporter:
            bryanthompson bryanthompson
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated: