Blazegraph (by SYSTAP) / BLZG-537

Default graphs always use SCAN + FILTER and lack an efficient PARALLEL SUBQUERY code path

    Details

    • Type: New Feature
    • Status: In Progress
    • Resolution: Unresolved
    • Affects Version/s: TERMS_REFACTOR_BRANCH
    • Fix Version/s: None
    • Component/s: Query Plan Generator
    • Labels: None

      Description

      Evidence suggests that the SCAN+FILTER pattern is either always or nearly always more efficient than an expander pattern in which the contexts in the data set are "joined" into the query.
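      To make the tradeoff concrete, here is a minimal, self-contained sketch (hypothetical Java types, not the Blazegraph/bigdata API) of the two strategies for a default graph access path over a quad store: SCAN + FILTER reads the statement index once and filters on the context position, while the "expander" / PARALLEL SUBQUERY approach issues one as-bound probe per graph in the data set.

      import java.util.*;
      import java.util.stream.*;

      // Hypothetical model of a quad store; illustrative only.
      public class DefaultGraphAccessSketch {

          record Quad(String s, String p, String o, String c) {}
          record Triple(String s, String p, String o) {}

          // SCAN + FILTER: one pass over the statement index, keeping only quads
          // whose context is in the default graph data set, then projecting the
          // DISTINCT (s,p,o) triples (RDF merge semantics of the default graph).
          static Set<Triple> scanAndFilter(List<Quad> statementIndex, Set<String> defaultGraphs) {
              return statementIndex.stream()
                      .filter(q -> defaultGraphs.contains(q.c()))
                      .map(q -> new Triple(q.s(), q.p(), q.o()))
                      .collect(Collectors.toCollection(LinkedHashSet::new));
          }

          // PARALLEL SUBQUERY ("expander"): one as-bound probe per graph in the
          // data set, merged and de-duplicated on (s,p,o).
          static Set<Triple> parallelSubquery(List<Quad> statementIndex, Set<String> defaultGraphs) {
              return defaultGraphs.parallelStream()
                      .flatMap(g -> statementIndex.stream().filter(q -> q.c().equals(g)))
                      .map(q -> new Triple(q.s(), q.p(), q.o()))
                      .collect(Collectors.toCollection(LinkedHashSet::new));
          }
      }

      The subquery form pays a per-graph probe cost, which appears to dominate in the timings reported below once the data set contains more than a handful of graphs.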


          Activity

          bryanthompson added a comment -

          mroy wrote:

          I was actually just looking at this last night. I had set the filter threshold down to 5 graphs a few months ago and kind of forgot about it until last night, when I ran a query with 2 graphs in the default graph data set: the query took 20 seconds, but with the threshold set to always do filter, it took only 200ms.
          
          I'm attaching 2 example queries, one with 12 graphs in the default graph list, and one with all 200+ graphs in the default graph list.
          For defaultGraphs12.rq (12 graphs):  200ms filter | 15000ms subquery
          For defaultGraphs230.rq (237 graphs): 300ms filter | 157000ms subquery
          
          In my experience so far, I haven't found an example where the subquery version is faster than the filter version.
          

          Concerning named graph query patterns, mroy wrote:

          I haven't tested this as much, but the few cases I've tried, the calculated scan cost is lower than the subquery cost, so it uses scan anyway.
          The one case where this doesn't seem to calculate correctly is when you have millions of graphs and the subquery sample calculates its cost as 0, I'm assuming because the sampled graphs don't contain the triple patterns in question. In that case it used subquery, and a query that took 5 seconds with scan ran indefinitely with subquery.
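          A simplified, runnable version of that decision rule (hypothetical names; compare the cost comparison from AST2BOpJoins quoted in a later comment) shows how a sample-based estimate of 0 makes the subquery plan look free:

          // Illustrative only: simplified scan-vs-subquery plan choice; not the bigdata code.
          public class PlanChoiceSketch {

              static String choosePlan(double scanCost, Double subqueryCost /* null if not sampled */) {
                  if (subqueryCost == null || scanCost < subqueryCost) {
                      return "SCAN + FILTER";
                  }
                  return "PARALLEL SUBQUERY";
              }

              public static void main(String[] args) {
                  // Normal case: the cheaper estimate wins.
                  System.out.println(choosePlan(200, 15_000.0));   // SCAN + FILTER
                  // Failure mode described above: the sampled graphs contain no matches,
                  // the subquery estimate collapses to 0, and PARALLEL SUBQUERY is chosen
                  // even though it then runs far longer than the scan.
                  System.out.println(choosePlan(5_000, 0.0));      // PARALLEL SUBQUERY
              }
          }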
          
          bryanthompson added a comment -

          Matt,

          I obtained baseline numbers for those queries as follows (the first column is milliseconds). Baseline performance is OK. Both queries wound up running against the SCAN + FILTER pattern by default.

          12020	queries/defaultGraphs12
          11193	queries/defaultGraphs12
          11708	queries/defaultGraphs12
          1052	queries/defaultGraphs230
          910	queries/defaultGraphs230
          626	queries/defaultGraphs230
          

          I then modified AST2BOpJoins as follows. This should force the use of the "expander" pattern.

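                // Hard-coding 'false' short-circuits the cost comparison, so this
                // branch (which would select SCAN + FILTER) is never taken and the
                // expander pattern is always used, per the note above.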
                  if (false && (subqueryCostReport == null
                          || scanCostReport.cost < subqueryCostReport.cost)) {
          

          The 12 graph query runs in pretty much the same time:

          solutions=3, chunks=1, elapsed=10935ms.
          solutions=3, chunks=1, elapsed=10182ms.
          solutions=3, chunks=1, elapsed=10729ms.
          

          However, I can confirm that the 230 graph query runs quite a long time.

          I think that this is most likely a bug in the DGExpander code. I see that it is using the parallel expander logic there. My inclination is to replace this expander pattern with a hash join against an inline access path, using either the DataSetJoin or a JVMHashSolutionSetHashJoin. That will let us drop a relatively complicated class (DGExpander). Once that is done we can examine the SCAN + FILTER versus "EXPANDER" pattern tradeoffs again. Of course, the "EXPANDER" at that point is just another join. The decision between "SCAN + FILTER" and "EXPAND" should be made up a level, in the query planner (AST2BOpUtility). It can just add either one join with an IN filter (SCAN + FILTER) or two joins (EXPAND), where the additional join is the join against the data set, as sketched below.
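          As a rough, runnable illustration of the "two joins" shape (hypothetical solution/quad model, not the bigdata operators): a DataSetJoin-style operator first binds the graph variable to each context in the data set (an inline access path over the contexts), and the statement pattern is then joined with that variable already bound.

          import java.util.*;
          import java.util.stream.*;

          // Hypothetical solution/quad model; illustrative only.
          public class DataSetJoinSketch {

              record Quad(String s, String p, String o, String c) {}
              record Binding(Map<String, String> vars) {}

              // Join #1: one output solution per (input solution, graph-in-data-set) pair.
              static Stream<Binding> dataSetJoin(Stream<Binding> in, Set<String> dataSet) {
                  return in.flatMap(b -> dataSet.stream().map(g -> {
                      Map<String, String> m = new HashMap<>(b.vars());
                      m.put("?g", g);
                      return new Binding(m);
                  }));
              }

              // Join #2: the statement pattern join, with ?g already bound by join #1.
              static Stream<Binding> statementPatternJoin(Stream<Binding> in, List<Quad> index) {
                  return in.flatMap(b -> index.stream()
                          .filter(q -> q.c().equals(b.vars().get("?g")))
                          .map(q -> {
                              Map<String, String> m = new HashMap<>(b.vars());
                              m.put("?s", q.s()); m.put("?p", q.p()); m.put("?o", q.o());
                              return new Binding(m);
                          }));
              }
          }

          The SCAN + FILTER alternative collapses this into a single statement-pattern join carrying an IN constraint on the context variable, which is the choice the planner would make "up a level" as described above.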

          In the meanwhile, I think that just disabling the DGExpander path, as you have done, should work fine. Clearly running the DGExpander is a loss. I am going to turn it off for now in SVN and reference this issue.

          bryanthompson added a comment -

          The run time for the 230 graph query with the EXPANDER was:

          solutions=10, chunks=1, elapsed=138056ms.
          

          I see that the named graphs "EXPANDER" path actually DOES use the DataSetJoin. I will look at putting that into place in the default graphs code path and see if it performs better.

          bryanthompson added a comment -

          I have added query hints to control the sampling limit and the default behavior when sampling is disabled for the SCAN + FILTER versus PARALLEL SUBQUERY decision for named and default graph queries. See QueryHints#ACCESS_PATH_SAMPLE_LIMIT and QueryHints#ACCESS_PATH_SCAN_AND_FILTER. These may be set either for the entire query (scope=Query) or for specific statement patterns (when the scope is more selective).

          I have not changed the default behavior. The sample limit is still 100 and the behavior if you disable sampling will default to SCAN+FILTER. I want to look at integrating the DataSetJoin into the default graph code path next (it is already used in the named graph code path to handle the PARALLEL SUBQUERY case).

          Committed revision r5697.

          bryanthompson added a comment -

          I have looked a little further into the DataSetJoin. The reason why this was not put into place for default graph "PARALLEL SUBQUERY" patterns is that the DISTINCT SPO filter winds up attached only to the join against the statement index and does not operate across all graph contexts which are bound by a given source solution flowing through the DataSetJoin. I have added comments in the AST2BOpJoins code to document this and I have DISABLED the PARALLEL SUBQUERY code path for default graph APs.
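          A tiny example (hypothetical model, not the bigdata code) of why the de-duplication has to span all of the contexts bound for a given source solution: under the RDF-merge semantics of the default graph, a triple appearing in several contexts must be reported exactly once.

          import java.util.*;

          public class DistinctSpoSketch {

              record Quad(String s, String p, String o, String c) {}
              record Triple(String s, String p, String o) {}

              public static void main(String[] args) {
                  List<Quad> index = List.of(
                          new Quad(":a", ":p", ":b", ":g1"),
                          new Quad(":a", ":p", ":b", ":g2"),   // same triple, different context
                          new Quad(":a", ":q", ":c", ":g2"));
                  Set<String> defaultGraphs = Set.of(":g1", ":g2");

                  // WRONG for a default graph AP: DISTINCT applied per context only.
                  long perContext = defaultGraphs.stream()
                          .mapToLong(g -> index.stream().filter(q -> q.c().equals(g)).distinct().count())
                          .sum();                               // 3 -- the duplicate survives

                  // RIGHT: DISTINCT on (s,p,o) across every context in the data set.
                  long acrossContexts = index.stream()
                          .filter(q -> defaultGraphs.contains(q.c()))
                          .map(q -> new Triple(q.s(), q.p(), q.o()))
                          .distinct()
                          .count();                             // 2

                  System.out.println(perContext + " vs " + acrossContexts); // prints "3 vs 2"
              }
          }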

          PARALLEL SUBQUERY is still enabled for named graph APs. In this case we are able to use the DataSetJoin and I have no evidence which suggests that doing so runs into the performance problems which were documented above for the default graph APs.

          If a performance problem is observed with the SCAN+FILTER behavior for default graph APs, then either we should debug the DGExpander or we could explore an alternative query plan structure using a sub-group within which we execute the DataSetJoin and the statement pattern join and then project just the distinct SPOs from that subgroup.

          I am done with this issue for now. I will leave it open but reduce the priority level since we currently lack an efficient means to do parallel subquery for a default graph AP.

          Committed revision r5698.

          bryanthompson added a comment -

          The DGExpander appears to be doing what it is supposed to be doing. I see it binding the context position and running the set of as-bound access paths. I also see that most of the as-bound access paths wind up empty. It looks like, when there is a hit, it is in just one of the graphs, and most of the time there is just a single hit.

          I suspect that the SCAN+FILTER default graph AP winds up being less expensive for reasons which are mostly about the shape of the data (sparsity and scale) and the query. I am going to remove the FIXME from the DGExpander code and update the comments to reflect these observations. There is a mechanism now to configure bigdata such that it never computes the cost of scan versus subquery and always does a scan. I think that we should leave this at that until we can find some counter example queries where the scan is more expensive and/or identify a better heuristic for choosing parallel subquery.

          I am leaving the PARALLEL SUBQUERY code path disabled for now. I will leave this issue open until we get more insight about when a parallel subquery plan would do better than a scan+filter plan.


            People

            • Assignee: Unassigned
            • Reporter: bryanthompson