Details

    • Type: Bug
    • Status: Done
    • Priority: Medium
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: BLAZEGRAPH_RELEASE_1_5_2
    • Component/s: None
    • Labels:
      None

      Description

      I want to dump the contents of a running Blazegraph instance (in quads mode).

      I initially tried:

      $ curl -X POST http://localhost:2333/bigdata/sparql --data-urlencode 'query=CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o }' -H 'Accept:text/plain' > /dev/null
      

      which did not work: after about 2.5 GB of data we entered GC hell and progress slowed dramatically. This is not surprising, given the implicit duplicates check.

      So I found the name of the single graph with the bulk of the data and tried:

      $ curl -X POST http://localhost:2333/bigdata/sparql --data-urlencode 'query=CONSTRUCT { ?s ?p ?o } WHERE { graph <https://test-similarpatients.syapse.com/graph/diagnostics-inc/abox> { ?s ?p ?o } }' -H 'Accept:text/plain' > /dev/null
      

      This fared better, but still entered GC hell after 4.5 GB of data. This query should simply be streaming the data, and hence should not be challenging beyond the execution time itself.

          Activity

          jjc Jeremy Carroll added a comment -

          To Brad's question:

          the work done does minimally support the data dump objective, if used with care! But the work is incomplete. Given the release schedule it may be reasonable to release, depending on other business pressure.

          beebs Brad Bebee added a comment -

          OK. Good news. Yes, we want to freeze the code on Monday for the release.

          jjc Jeremy Carroll added a comment -

          Also good, a slightly more complex query:

          CONSTRUCT { ?s ?p ?o } WHERE { graph ?g { hint:Query hint:constructDistinctSPO false . ?s ?p ?o FILTER ( ?g != ?s ) } }
          

          I let it run to completion; it downloaded 63 GB of data (330M triples) with no performance degradation.
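A minimal sketch of how the dump could be scripted, combining the curl invocation from the description with the query from this comment. The function names `dump_query` and `dump_all` and the `ENDPOINT` variable are hypothetical, introduced only for illustration; the endpoint URL, the `Accept` header and the query text are taken from this issue.

```shell
#!/bin/sh
# Sketch only: dump_query, dump_all and ENDPOINT are hypothetical names.

# Print the CONSTRUCT query from the comment above, unchanged.
# The hint turns off the DISTINCT check on the CONSTRUCT output.
dump_query() {
    printf '%s' 'CONSTRUCT { ?s ?p ?o } WHERE { graph ?g { hint:Query hint:constructDistinctSPO false . ?s ?p ?o FILTER ( ?g != ?s ) } }'
}

# POST the query and write the response to the file given as $1.
# ENDPOINT defaults to the URL used in the description.
dump_all() {
    out_file=$1
    curl -X POST "${ENDPOINT:-http://localhost:2333/bigdata/sparql}" \
        --data-urlencode "query=$(dump_query)" \
        -H 'Accept:text/plain' > "$out_file"
}
```

Calling `dump_all dump.nt` would write the result to `dump.nt` rather than discarding it to /dev/null as in the original tests.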

          jjc Jeremy Carroll added a comment -

          The following query also works, which is good because it creates an unbounded number of new IRIs as it executes.

          CONSTRUCT { ?s ?p ?o } 
          WHERE { 
             graph ?g { 
                 hint:Query hint:constructDistinctSPO false . 
                 ?ss ?pp ?o 
                 BIND(iri(substr(str(?ss), 1, strlen(str(?ss))-1)) as ?s) 
                 BIND(iri(substr(str(?pp), 1, strlen(str(?pp))-1)) as ?p) 
             }
           }
          
          bryanthompson bryanthompson added a comment -

          Ok. The goal here was to enable the specific case you identified at the top (a scalable dump) by disabling the hash indices used to impose DISTINCT on the CONSTRUCT. It turns out that for that query there were two such hash indices: one for the CONSTRUCT output, and another that was implicitly added to the query by the ASTConstructOptimizer when it turned the query into a SELECT REDUCED. Setting hint:constructDistinctSPO to false disables both of those hash indices.

          A quads mode default graph query requires a hash index. In order to scale, explicitly use the GRAPH keyword so that the query is a named graph (rather than default graph) query.

          Hash indices are used by a lot of query patterns. We have not disabled (and cannot disable) all such hash indices. To get the kind of scaling you want, you will need to stick to queries that do not use quads mode default graph access paths and that do not use any other operations that require a hash index (such as OPTIONAL, sub-SELECT, etc.).

          Thanks,
          Bryan
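The constraints above could be turned into a rough pre-flight check before launching a long dump. The sketch below is a text-matching heuristic only, not a SPARQL parser, and `check_dump_query` is a hypothetical helper name. It assumes the GRAPH term is written as a variable or a full IRI (prefixed names are not recognised), and it covers only the cases named in the comment: a missing GRAPH clause, a missing hint, OPTIONAL, and sub-SELECT.

```shell
#!/bin/sh
# Heuristic sketch: check_dump_query is a hypothetical helper, not part of Blazegraph.
# Returns 0 if none of the patterns named in the comment above are found, 1 otherwise.
check_dump_query() {
    # Flatten newlines so patterns can match across line breaks.
    flat=$(printf '%s' "$1" | tr '\n' ' ')
    status=0
    # Default graph access in quads mode needs a hash index.
    if ! printf '%s' "$flat" | grep -qi 'graph[[:space:]]*[?<]'; then
        echo 'warning: no GRAPH clause found (default graph access path)' >&2
        status=1
    fi
    # Without the hint, the DISTINCT hash indices stay enabled.
    if ! printf '%s' "$flat" | grep -qi 'hint:constructDistinctSPO[[:space:]]*false'; then
        echo 'warning: hint:constructDistinctSPO false is missing' >&2
        status=1
    fi
    # OPTIONAL requires a hash index.
    if printf '%s' "$flat" | grep -qi 'optional'; then
        echo 'warning: OPTIONAL found' >&2
        status=1
    fi
    # A SELECT directly after an opening brace is treated as a sub-SELECT.
    if printf '%s' "$flat" | grep -qi '{[[:space:]]*select'; then
        echo 'warning: sub-SELECT found' >&2
        status=1
    fi
    return $status
}
```

A query that passes this check is not guaranteed to scale; the check only catches the patterns listed, and other operators may also rely on hash indices.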


            People

            • Assignee: beebs Brad Bebee
            • Reporter: jjc Jeremy Carroll
            • Votes: 0
            • Watchers: 3