Details

      Description

      This was discussed in the Help forum:
      http://sourceforge.net/projects/bigdata/forums/forum/676946/topic/6873890

      A query that my application is using is taking a very long time (over a minute of CPU time) and eating lots of memory. The same query takes much less than a second on Fuseki, so I think there might be a problem somewhere. I'm using the stock configuration, except I've enabled quads mode and the text index (not used in the query). I'm running Bigdata inside Tomcat on Ubuntu 12.04 amd64.

      First, you can load the data I'm using into Bigdata using this SPARQL Update statement:

      LOAD <http://light.onki.fi/onki-light/rest/data/ysa> INTO GRAPH <http://www.yso.fi/onto/ysa>
      

      The dataset is a SKOS vocabulary with about 300k triples. You can get it from the above URL but I will also attach the file (ysa.rdf gzipped) to this ticket.

      Then run this query:

      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
       CONSTRUCT { 
        ?uri ?p ?o . 
        ?type rdfs:label ?typelabel . 
        ?uri skos:narrower ?n . 
        ?n a ?nt . 
        ?n skos:prefLabel ?nl . 
      } WHERE { 
        BIND ( <http://www.yso.fi/onto/ysa/Y141994> as ?uri ) 
        ?uri ?p ?o . 
        OPTIONAL { 
          ?uri a ?type . 
          ?type rdfs:label ?typelabel . 
        } 
        OPTIONAL { 
          ?n skos:broader ?uri . 
          ?n a ?nt . 
          ?n skos:prefLabel ?nl . 
        }
      }
      

      I will also attach the Explain result for the query to this ticket.

      It appears that Bigdata performs the BIND assignment very late in the query processing. Jena ARQ (in Fuseki) instead will do it first, saving lots of time.
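
      A possible workaround, sketched here as an assumption and not tested against Bigdata: because ?uri is a constant in this query, the IRI can be written directly into the patterns, so the result no longer depends on when the BIND is evaluated. The ysa: prefix is introduced only for this sketch.

      ```sparql
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
      # Hypothetical prefix, added for readability in this sketch only
      PREFIX ysa:  <http://www.yso.fi/onto/ysa/>
      CONSTRUCT {
        ysa:Y141994 ?p ?o .
        ?type rdfs:label ?typelabel .
        ysa:Y141994 skos:narrower ?n .
        ?n a ?nt .
        ?n skos:prefLabel ?nl .
      } WHERE {
        # The constant replaces ?uri everywhere, so no BIND is needed
        ysa:Y141994 ?p ?o .
        OPTIONAL {
          ysa:Y141994 a ?type .
          ?type rdfs:label ?typelabel .
        }
        OPTIONAL {
          ?n skos:broader ysa:Y141994 .
          ?n a ?nt .
          ?n skos:prefLabel ?nl .
        }
      }
      ```

      This only helps when the bound value is known when the query is written; it does not address BIND expressions computed from other variables.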

      I tested both Bigdata 1.2.2 and the SVN head RELEASE_1_2_0 of yesterday. In both cases the result was similar.

      See BLZG-876, which may have the same underlying issue.

        Activity

        oisuomin oisuomin added a comment -

        Turns out I can't attach the RDF file, which is 2MB gzipped. The file size limit for this issue tracker is 256KB. But anyway, the LOAD command above works.

        bryanthompson bryanthompson added a comment -

        Mike,

        Can you please take a look at this?

        Thanks,
        Bryan

        bryanthompson bryanthompson added a comment -

        This came up recently on the developers list. JeremyC points out that BIND, as defined at [1], is a purely syntactic shorthand which would appear to directly associate a value expression with a specific BPG.

        The BIND form allows a value to be assigned to a variable from a basic graph pattern or
        property path expression. Use of BIND ends the preceding basic graph pattern. The
        variable introduced by the BIND clause must not have been used in the group graph
        pattern up to the point of use in BIND.
        

        This is much simpler, but I have some questions. Also, it is interesting that nothing else in SPARQL has such order-dependent semantics. For example, JOINs and FILTERs may be freely reordered.

        • What happens if the BIND is the first thing in the group? I.e., no preceding BPG?
        • What if the BIND could be reordered from a place where its required variables were not bound into one where they are? Should it be run there, where it can succeed?
        • What if you want to do a BIND after a sub-SELECT? Are the variables from the sub-SELECT visible, or does that depend on where we order the sub-SELECT?
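
        To make the order dependence concrete, here is a minimal sketch of two queries that differ only in where the BIND sits. It assumes the SKOS data from the description and follows the Extend operation of the SPARQL 1.1 algebra, under which an expression that raises an error leaves the target variable unbound. The variable names are illustrative.

        ```sparql
        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

        # Query A: BIND follows the pattern that binds ?label,
        # so ?len is computed for every solution.
        SELECT ?s ?label ?len WHERE {
          ?s skos:prefLabel ?label .
          BIND ( STRLEN(?label) AS ?len )
        }
        ```

        ```sparql
        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

        # Query B: BIND is the first element of the group. It extends the
        # single empty solution, ?label is not yet bound, the expression
        # errors, and ?len stays unbound in every result row.
        SELECT ?s ?label ?len WHERE {
          BIND ( STRLEN(?label) AS ?len )
          ?s skos:prefLabel ?label .
        }
        ```

        Under that reading, moving the BIND in Query B after the triple pattern would change the results, so reordering a BIND is only safe when it does not change which variables are bound at the point of evaluation.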

        If the goal is to purely have BIND be a syntax for running a specific value expression at a specific location in the query, then it should just become an annotation of the StatementPatternNode and get applied by the PipelineJoin operator when it evaluates the join. (There is also a hash join against an access path operator that would need the same logic).

        BIND() is currently implemented as a constraint, just like a FILTER, but it is a constraint that can have a side effect on a solution; it is unique in this respect. I am not immediately clear on the as-implemented reordering rules for BIND. The normal rules for FILTERs are that they run as soon as their variables are "known bound" and before the end of the join group in any case.

        [1] http://www.w3.org/TR/sparql11-query/#bind

        jeremycarroll jeremycarroll added a comment -

        There is a Syapse internal defect, AP-1200, which is also a performance-related issue with a query involving BIND. When I get to work on that, I will first try rewriting the (complex) query using sub-SELECTs instead of BINDs. If that gives good enough performance, then one fix may be to add a new optimizer that replaces BINDs with sub-SELECTs early in the optimization process.
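
        For illustration only, such a rewrite could look roughly like the following for the WHERE clause of the query in the description. This is a hand-written sketch of the idea, not the output of any existing optimizer, and whether it actually runs faster is untested.

        ```sparql
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
        SELECT * WHERE {
          # The BIND is wrapped in a sub-SELECT that projects only ?uri
          { SELECT ?uri WHERE {
              BIND ( <http://www.yso.fi/onto/ysa/Y141994> AS ?uri )
          } }
          ?uri ?p ?o .
          OPTIONAL {
            ?n skos:broader ?uri .
            ?n a ?nt .
            ?n skos:prefLabel ?nl .
          }
        }
        ```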

        jeremycarroll jeremycarroll added a comment -

        See also trac 794

        bryanthompson bryanthompson added a comment -

        I would not rely on bigdata to order sub-selects in an optimal fashion. The easiest path is to just fix BIND to run as early as possible.

        michaelschmidt michaelschmidt added a comment -

        Implemented first version in branch slow-bind-query. There may be two problems left, which we have to think about:


        - For non-well-designed graph patterns, it may not be valid to push bindings (see also http://trac.bigdata.com/ticket/1087)
        - What are the semantics of conflicting situations, i.e. when an outer BIND clause binds the same variable? Currently we implement override semantics, which might not be what the standard says

        michaelschmidt michaelschmidt added a comment -

        Concerning the two issues above:

        • Given that the ASTBottomUpOptimizer rewrites ill-designed patterns, I'd be quite confident that there are no problems (at least no new problems introduced by the optimizer).
        • In the case of an outer VALUES clause, the optimizer now has no effect (a pre-check was added)

        Checked in changes in branch slow-bind-query and issued a pull request. Please review and merge into master if no further issues pop up. Note: as this is an edge case, I did not run CI; I'd simply check whether the number of failed test cases in the master branch has changed before closing this bug.

        michaelschmidt michaelschmidt added a comment -

        This has been resolved, but the fix is currently commented out due to ongoing problems; see ticket BLZG-1141 for follow-up actions. Closing ticket.


          People

          • Assignee:
            michaelschmidt michaelschmidt
            Reporter:
            oisuomin oisuomin