Blazegraph (by SYSTAP) / BLZG-1902

Large CONSTRUCT query slow

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Medium
    • Resolution: Unresolved
    • Affects Version/s: BLAZEGRAPH_2_1_0
    • Fix Version/s: BLAZEGRAPH_2_X_BACKLOG
    • Component/s: None
    • Labels: None

      Description

      I gave the scenario sketched at https://gist.github.com/ktk/a04e267dd776da2511692e96fc2b5d99 a quick try. First, I executed the following two COUNT versions of the query (on a non-optimized, out-of-the-box Blazegraph master > 2.1.0, triples mode, non-analytic), both reporting ~30M triples:

      1a.) Query as is

      PREFIX blv: <http://blv.ch/>
      PREFIX schema: <http://schema.org/>
      PREFIX dc: <http://purl.org/dc/elements/1.1/>
      
      SELECT (COUNT(*) AS ?cnt) 
      WHERE {
        ?move a schema:TransferAction ;
              dc:date ?date ;
              schema:toLocation ?toFarm .

        ?othermove a schema:TransferAction ;
                   dc:date ?otherdate ;
                   schema:fromLocation ?toFarm .

        FILTER (?date <= ?otherdate)
      }
      

      The query took 4 min 45 s and produced large intermediate results (>30M) for some of the joins. Still, the query plan is fully pipelined.

      1b.) Hand-optimized version

      PREFIX blv: <http://blv.ch/>
      PREFIX schema: <http://schema.org/>
      PREFIX dc: <http://purl.org/dc/elements/1.1/>
      
      SELECT (COUNT(*) AS ?cnt) 
      WITH {
        SELECT * WHERE {
          ?move a schema:TransferAction ;
                dc:date ?date ;
                schema:toLocation ?toFarm .
        }
      } AS %s1
      WITH {
        SELECT * WHERE {
          ?othermove a schema:TransferAction ;
                     dc:date ?otherdate ;
                     schema:fromLocation ?toFarm .
        }
      } AS %s2
      WHERE {
        INCLUDE %s1
        INCLUDE %s2

        FILTER (?date <= ?otherdate)
      }
      

      This query is way faster – it's essentially forcing a bushy plan in combination with an efficient merge join. This goes through in ~45s.
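
To make the shape of that plan concrete, here is a minimal Python sketch of a sort-merge join over two independently evaluated inputs. It is not Blazegraph's operator; the function name `merge_join_count` and the row layout `(farm, date)` are assumptions made for illustration. The two inputs stand in for %s1 and %s2, the join key is the shared ?toFarm binding, and the FILTER is applied to each matching pair.

```python
def merge_join_count(inbound, outbound):
    """Count joined rows that share a farm and satisfy date <= otherdate.

    inbound:  (to_farm, date) rows, standing in for %s1 (hypothetical layout)
    outbound: (from_farm, otherdate) rows, standing in for %s2
    """
    # Each side is materialized and sorted on the join key independently.
    left = sorted(inbound)
    right = sorted(outbound)

    i = j = total = 0
    while i < len(left) and j < len(right):
        farm_l, farm_r = left[i][0], right[j][0]
        if farm_l < farm_r:
            i += 1          # no partner for this inbound row yet
        elif farm_l > farm_r:
            j += 1          # no partner for this outbound row yet
        else:
            # Find the end of the run of equal keys on both sides.
            i_end = i
            while i_end < len(left) and left[i_end][0] == farm_l:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == farm_l:
                j_end += 1
            # Apply the FILTER (?date <= ?otherdate) within the run.
            for _, date in left[i:i_end]:
                for _, otherdate in right[j:j_end]:
                    if date <= otherdate:
                        total += 1
            i, j = i_end, j_end
    return total
```

Each input is scanned once after sorting, so the only pairs ever compared are those that already agree on the farm.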

      2.) Executing the original CONSTRUCT, according to Brad, takes about 40 min (similar to the number reported on the website).

      Conclusion: query evaluation performance could definitely be improved by better planning (-> bushy plan), but it doesn't seem to be the major bottleneck here. I'm also wondering where the time goes; unnecessary materialization might be one root cause, but possibly not the only one. Assuming an insert rate of (only) 50k stmts/sec, which is what we get for loading, it should even be possible to get this down to 2-3 minutes.
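
The relationship between insert rate, statement count and write time can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the helper names are hypothetical, the 50k stmts/sec default is the loading rate quoted above, and the number of statements the CONSTRUCT actually emits is not stated here, so it stays a parameter.

```python
def insert_seconds(statements, rate_per_sec=50_000):
    """Seconds needed to write `statements` at a steady insert rate."""
    return statements / rate_per_sec


def statements_in(seconds, rate_per_sec=50_000):
    """Number of statements that fit into `seconds` at a steady insert rate."""
    return seconds * rate_per_sec
```

For example, `statements_in(120)` and `statements_in(180)` give the statement counts that a 2 or 3 minute write budget allows at that rate.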

              People

              Assignee: michaelschmidt
              Reporter: michaelschmidt
              Votes: 0
              Watchers: 5