  1. Blazegraph (by SYSTAP)
  2. BLZG-1958

Implement ChunkedMaterialization inside query to exploit parallelization

    Details

    • Type: Improvement
    • Status: Done
    • Priority: Medium
    • Resolution: Done
    • Affects Version/s: BLAZEGRAPH_2_1_1
    • Fix Version/s: BLAZEGRAPH_2_2_0
    • Component/s: None
    • Labels:
      None

      Description

      For some of the WatDiv queries we have simple join groups only, and materialization of results is done at the end, after the projection operator, through the iterator. The problem is that this iterator (although chunked) is not parallelized, so when running queries one after another with large results, the materialization phase often dominates query runtime (for one of the WatDiv queries: computing the join ~0.3s vs. projection >7s).

      As a solution to this problem, we want to add a ChunkedMaterializationOp in query planning for certain kinds of queries. The materialization could be done just before passing results to the chunked materialization op. This way, it would run in parallel (according to the current maxParallel setting). In principle, this makes sense for all queries where we output the full result (no slicing etc.); we probably need to benchmark this before changing it generally.
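
      The idea above can be sketched in plain Java. This is a toy model under stated assumptions, not the actual ChunkedMaterializationOp: the class name, the dictionary map, and the representation of a solution as an array of internal ids are all hypothetical and chosen only for illustration. It shows solutions being split into chunks and resolved by up to maxParallel workers, with the output order preserved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Toy model only: none of these names are Blazegraph classes or APIs. */
final class ChunkedMaterializationSketch {

    /** Materializes one chunk: each solution is an array of internal ids (an assumption). */
    static List<String[]> materializeChunk(Map<Long, String> dictionary, List<long[]> chunk) {
        List<String[]> out = new ArrayList<>(chunk.size());
        for (long[] solution : chunk) {
            String[] values = new String[solution.length];
            for (int i = 0; i < solution.length; i++) {
                // Hypothetical dictionary lookup: internal id -> lexical value.
                values[i] = dictionary.get(solution[i]);
            }
            out.add(values);
        }
        return out;
    }

    /** Splits the solutions into chunks and materializes up to maxParallel chunks concurrently. */
    static List<String[]> materialize(Map<Long, String> dictionary, List<long[]> solutions,
                                      int chunkSize, int maxParallel) {
        ExecutorService pool = Executors.newFixedThreadPool(maxParallel);
        try {
            List<Future<List<String[]>>> futures = new ArrayList<>();
            for (int from = 0; from < solutions.size(); from += chunkSize) {
                int to = Math.min(from + chunkSize, solutions.size());
                List<long[]> chunk = solutions.subList(from, to);
                futures.add(pool.submit(() -> materializeChunk(dictionary, chunk)));
            }
            // Collect in submission order so the result order matches the input order.
            List<String[]> result = new ArrayList<>(solutions.size());
            for (Future<List<String[]>> future : futures) {
                result.addAll(future.get());
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Map<Long, String> dictionary = new HashMap<>();
        dictionary.put(1L, "a");
        dictionary.put(2L, "b");
        List<long[]> solutions = new ArrayList<>();
        solutions.add(new long[] {1L, 2L});
        solutions.add(new long[] {2L, 2L});
        for (String[] row : materialize(dictionary, solutions, 1, 2)) {
            System.out.println(String.join(" ", row));
        }
    }
}
```

      In this sketch the join output stays as compact ids until the very end, and only the id-to-value resolution is fanned out across workers, which is the phase reported above as dominating runtime.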

      Additionally, in cases where we project on all the variables (and no intermediate variables have been introduced) we may also drop the final projection entirely.

        Activity

        michaelschmidt added a comment (edited)

        Looking at the test case failures, it looks like most of them are caused by the second optimization, namely dropping the top-level projection if the set of projection vars is a superset of the spanned variables.

        The reason is that in some cases we have variables bound that are not included in the spanned variable set. Here's one example:

        SELECT * { 
        
        { SELECT ($X AS $Y)
         {
           BIND ("y" AS $X)
         }
        }
        { SELECT ($yy as $Y)
            {
              { BIND ( "y0" as $yy ) .
                BIND( "x0" as $xx )
              } UNION {
                BIND ( "y" as $yy ) .
                BIND( "x" as $xx )
              } UNION {
                BIND ( "y1" as $yy ) .
                BIND( "x2" as $xx )
              }
            }
        }
        
        }
        

        Interestingly, in that case $X will be bound in the final result if we drop the final projection for $Y. The problem here is that the translation of the first subquery does not put a projection on top.
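
        The leak described above can be reproduced with a small toy evaluation in Java. This is a sketch under assumptions, not Blazegraph code: solutions are modeled as plain maps, the class and method names are hypothetical, and the first subquery is assumed to leave $X bound because its translation lacks a projection, as stated in this comment. The superset check reports that the top-level projection is droppable, yet $X shows up once it is actually dropped.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model only: none of these names are Blazegraph classes or APIs. */
final class ProjectionLeakSketch {

    /** The check under discussion: drop the projection if it covers all spanned variables. */
    static boolean canDropProjection(Set<String> projected, Set<String> spanned) {
        return projected.containsAll(spanned);
    }

    /** Joins two solutions if they agree on all shared variables; returns null otherwise. */
    static Map<String, String> join(Map<String, String> left, Map<String, String> right) {
        Map<String, String> merged = new LinkedHashMap<>(left);
        for (Map.Entry<String, String> e : right.entrySet()) {
            String existing = merged.putIfAbsent(e.getKey(), e.getValue());
            if (existing != null && !existing.equals(e.getValue())) {
                return null;
            }
        }
        return merged;
    }

    /** Keeps only the projected variables of a solution. */
    static Map<String, String> project(Map<String, String> solution, Set<String> vars) {
        Map<String, String> out = new LinkedHashMap<>(solution);
        out.keySet().retainAll(vars);
        return out;
    }

    /** Evaluates a toy version of the example query and returns the variables bound in the result. */
    static Set<String> boundVars(boolean keepTopLevelProjection) {
        // First subquery: BIND("y" AS $X) and ($X AS $Y), assumed to have
        // no projection on top, so $X stays bound next to $Y.
        Map<String, String> left = new LinkedHashMap<>();
        left.put("X", "y");
        left.put("Y", "y");

        // Second subquery: correctly projected onto $Y only.
        String[] secondSubquery = {"y0", "y", "y1"};

        Set<String> bound = new TreeSet<>();
        for (String yy : secondSubquery) {
            Map<String, String> joined = join(left, Collections.singletonMap("Y", yy));
            if (joined == null) {
                continue; // the two sides disagree on $Y
            }
            if (keepTopLevelProjection) {
                joined = project(joined, Collections.singleton("Y"));
            }
            bound.addAll(joined.keySet());
        }
        return bound;
    }

    public static void main(String[] args) {
        Set<String> y = Collections.singleton("Y");
        System.out.println("droppable according to check: " + canDropProjection(y, y));
        System.out.println("with projection:    " + boundVars(true));
        System.out.println("without projection: " + boundVars(false));
    }
}
```

        The check only sees the spanned variables, so it cannot notice a variable that an inner operator failed to project away; that is why the optimization is only safe if every code path hides such variables.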

        Here's another case:

        # Search query.
        PREFIX bd: <http://www.bigdata.com/rdf/search#>
        prefix xsd: <http://www.w3.org/2001/XMLSchema#>
        prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
        prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
        prefix foaf: <http://xmlns.com/foaf/0.1/>
        
        select ?subj ?label 
         where {
          ?subj rdf:type foaf:Person .
          ?subj rdfs:label ?label .  
          ?label bd:search "Mi*" . 
          ?label bd:minRelevance "0"^^xsd:double . 
        }
        

        The problem here is that the SERVICE returns, in addition to bindings for ?subj and ?label, an anonymous variable carrying the score.

        There are actually two options:
        1.) Remove the second optimization, thus leaving the top-level projection in place
        2.) Make sure that we project on all code paths on variables that should not be visible

        I think for the sake of correctness, the latter should be done anyway. The only question is whether we want to do that right now, as it might take some effort. Note that this is related to https://jira.blazegraph.com/browse/BLZG-1901, which aims at dropping unneeded variables as early as possible.

        michaelschmidt added a comment

        We agreed on disabling the optimization for now; it should be reconsidered when tackling BLZG-1901. Pushed to PR, re-running CI.

        michaelschmidt added a comment

        CI is green: https://github.com/SYSTAP/bigdata/pull/423.

        bryanthompson, please have a quick look. The only change I'm unsure about is a one-liner in SidIV.java, where I now set the value cache in the asValue method (this was causing a test case to fail, because after calling getTerms() the IV cache of a SidIV was not set). If you are okay with the change, I will merge this down and Alexandre can recompile for our WatDiv session tomorrow.

        michaelschmidt added a comment (edited)

        Merged this down. Let's quickly discuss the change mentioned above before closing (it could easily be changed afterwards).

        Alexandre Riazanov, master is now ready for our interactive WatDiv session; could you please deploy it on Bryan's workstation?

        michaelschmidt added a comment (edited)

        Code has been acknowledged by Bryan. Closing issue.

        See BLZG-1960 as a follow-up issue.


          People

          • Assignee: michaelschmidt
          • Reporter: michaelschmidt