Details

    • Type: Bug
    • Status: Accepted
    • Priority: Highest
    • Resolution: Unresolved
    • Affects Version/s: BIGDATA_RELEASE_1_4_0
    • Fix Version/s: None
    • Component/s: Query Plan Generator
    • Labels:
      None

      Description

      There are still some issues related to BIND and VALUES queries.

      1.) Currently BIND expressions are always evaluated at the end of JOIN groups. This is suboptimal from a performance perspective.

      2.) There are issues related to the pushing of variables bound in BIND/VALUES clauses into subqueries. See for example ASTSubGroupJoinVarOptimizer

      3.) The computation of definiteVars for bind expressions could be improved: currently, variables bound in BIND expressions are never considered definite (as BIND in general may fail), but trivial cases such as BIND (?x as <http://constant/uri>) might be considered "safe".

      4.) The following should give the empty result, but does not. Probably, the outer VALUES clause is somehow pushed inside:

      SELECT *
      WHERE {

      { BIND ( "x" as $X ) FILTER( BOUND($Y)) }

      }
      VALUES ?Y { "y" }

      This leads to the more general question of when it is allowed to push down variables into subqueries (cf. the remark about SPARQL 1.1 semantics in ASTSubGroupJoinVarOptimizer). We should come up with a general framework to make qualified decisions about pushing in variables.

        Issue Links

          Activity

          Hide
          michaelschmidt michaelschmidt added a comment -

          As part of this bug, there are some problems (failing test cases) in combination with the ASTSimpleBindingsOptimizer. Currently, the ASTSimpleBindingsOptimizer has been outcommented, we may want to re-enable it again and fix the above mentioned (as well as other) issues.

          Show
          michaelschmidt michaelschmidt added a comment - As part of this bug, there are some problems (failing test cases) in combination with the ASTSimpleBindingsOptimizer. Currently, the ASTSimpleBindingsOptimizer has been outcommented, we may want to re-enable it again and fix the above mentioned (as well as other) issues.
          Hide
          michaelschmidt michaelschmidt added a comment -

          Fix for both #48 and #50 proposed in branch BLZG48and50. This addresses the following problems:

          • There’s a bug in method com.bigdata.rdf.sparql.ast.eval.ASTEvalHelper.mergeBindingSets(ASTContainer, IBindingSet) when merging multiple binding sets together
          • The strategy for evaluating VALUES clauses is somewhat scattered in the code (see also #1141): first of all, ASTEvalHelper collects all definitely bound bindings/values clauses and statically injects these values into the base binding set. In addition, the underlying VALUES clauses are /partially/ reconsidered when evaluating the query (AST2BOpUtitlity.addValues() translates them into operators). With partially I mean that VALUES clauses at top level are ignored (there’s no special handling for the com.bigdata.rdf.sparql.ast.QueryBase.Annotations.BINDINGS_CLAUSE annotation in convertBaseQueryWithScopedVars, but there’s a translation of them in join groups. Ingestion of VALUES into the “base” mapping might also cause problems when reasoning about the structure of queries based on variables.

          The approach started in branch BLZG48and50 is as follows:

          1.) Completely eliminate the special handling of “pre-merging” definitely bound VALUES clauses, but rather translate all of them regularly using addValues() -> this gives us correctness
          2.) As part of the planned refactoring of the ASTJoinOrderByTypeOptimizer, find the right place to do so -> this should give us performance
          3.) If still useful, we might add a regular optimizer “pulling out” nested VALUES and merging VALUES clauses, rather than having the current (distributed/non-standard) approach for handling them.

          There are two challenges that we need to overcome when implementing this approach:

          a.) The derived incoming binding set for the query is currently "used" by some of the optimizers (namely: ASTBindingAssigner, ASTSimpleBindingsOptimizer, and ASTRangeCountOptimizer.attachRangeCounts used by ASTStaticJoinOptimizer). When eliminating/static injection of the bindings we need to make sure that these optimizers have the information they need.
          b.) We must make sure that the BINDINGS/VALUES clauses are evaluated early on, to avoid performance implications.

          Show
          michaelschmidt michaelschmidt added a comment - Fix for both #48 and #50 proposed in branch BLZG48and50. This addresses the following problems: There’s a bug in method com.bigdata.rdf.sparql.ast.eval.ASTEvalHelper.mergeBindingSets(ASTContainer, IBindingSet) when merging multiple binding sets together The strategy for evaluating VALUES clauses is somewhat scattered in the code (see also #1141): first of all, ASTEvalHelper collects all definitely bound bindings/values clauses and statically injects these values into the base binding set. In addition, the underlying VALUES clauses are /partially/ reconsidered when evaluating the query (AST2BOpUtitlity.addValues() translates them into operators). With partially I mean that VALUES clauses at top level are ignored (there’s no special handling for the com.bigdata.rdf.sparql.ast.QueryBase.Annotations.BINDINGS_CLAUSE annotation in convertBaseQueryWithScopedVars, but there’s a translation of them in join groups. Ingestion of VALUES into the “base” mapping might also cause problems when reasoning about the structure of queries based on variables. The approach started in branch BLZG48and50 is as follows: 1.) Completely eliminate the special handling of “pre-merging” definitely bound VALUES clauses, but rather translate all of them regularly using addValues() -> this gives us correctness 2.) As part of the planned refactoring of the ASTJoinOrderByTypeOptimizer, find the right place to do so -> this should give us performance 3.) If still useful, we might add a regular optimizer “pulling out” nested VALUES and merging VALUES clauses, rather than having the current (distributed/non-standard) approach for handling them. There are two challenges that we need to overcome when implementing this approach: a.) The derived incoming binding set for the query is currently "used" by some of the optimizers (namely: ASTBindingAssigner, ASTSimpleBindingsOptimizer, and ASTRangeCountOptimizer.attachRangeCounts used by ASTStaticJoinOptimizer). When eliminating/static injection of the bindings we need to make sure that these optimizers have the information they need. b.) We must make sure that the BINDINGS/VALUES clauses are evaluated early on, to avoid performance implications.
          Hide
          michaelschmidt michaelschmidt added a comment -

          This ticket has been mainly resolved as part of the join refactoring. Keeping it open, as there are some subtickets left. The only point from the points above that remains open is

          1. The computation of definiteVars for bind expressions could be improved: currently, variables bound in BIND expressions are never considered definite (as BIND in general may fail), but trivial cases such as BIND (?x as <http://constant/uri>) might be considered "safe".

          This is not considered crucial though. Please find below some generic comments on the join order refactoring which addressed most of the issues mentioned in this ticket.

          Implemented new strategy for join ordering within join groups:

          1. Key changes:
            1. Implemented new optimizer (ASTJoinGroupOptimizer), which puts the constructs into the right order
              1. Addressing mainly two correctness problems
                1. Reordering of non-reorderable constructs (OPTIONAL problematics)
                2. Proper handling of nodes that require variables to be bound (FILTER NOT EXISTS, BIND, certain SERVICEs)
            2. Approaches to the two problems
              1. Join order optimization now takes “partitions” (in form of OPTIONALs into account), reordering across partitions only where valid
              2. New Interface IVariableBindingRequirements with central method getRequiredVariables(), which determines the variables that must be bound prior to executing a mode
            3. Optimizations:
              1. Proper treatment of FILTER NOT EXISTS queries, now correct + more precise placement (as for other FILTERs)
              2. Also more precise placement of FILTER expressions that cannot be attached to a single join group node, namely
                1. … as early as possible
                2. … correctly placed in case triple patterns are reordered by the static optimizer
              3. Proper treatment of complex SERVICEs requiring incoming bound variables -> placed at first possible position
              4. Proper treatment of BIND and VALUES clauses -> placed at first
            4. Reusable components for FILTER placement, extraction of interesting variable sets from join groups, etc. -> clear the way for accelerated implementation of other rewriting heuristics
            5. Two “modes” for optimizer:
              1. Assert valid order & optimize
              2. Assert valid order only
            6. Test cases for optimizer itself at query and AST level
          1. Refactoring changes
            1. Interface IVariableBindingRequirements, with the central method getRequiredBound() and associated methods at ServiceFactory (with slightly different signature for extracting this information from service nodes)
              1. Implementation of the interface in various classes based on StaticAnalysis utility methods
              2. For services, the interface implementation is (at the time being) the same everywhere (requiredBound = \emptyset, imposing no constraint), see FulltextSearchServiceFactory.getRequiredBound() for how an implementation could look like (and we may want to adjust other services to offer incoming bound variables in the future)
              3. The implementation of this interface amounts for most of the changes. Many of them invoke setting up a new empty hash set.
            2. The optimizer itself: ASTJoinGroupOrderOptimizer
              1. Closely related are ASTJoinGroupPartition(s), offering data structure and the key utility methods for placement of nodes within partitions based on heuristics and variable binding constraints
              2. IASTJoinGroupPartitionReorderer: interface for an inter-partition optimizer; currently, there’s only a simple implementation (TypeBasedASTJoinGroupPartitionReorderer, in large parts reflecting the behaviour of the old optimizer), but this is how I would envision to hook in the StaticAnalysis (the interface might need to be changed/extended, it’s a first step).
            3. Reusable utility classes (used by the optimizer)
              1. GroupNodeVarBindingInfo & GroupNodeVarBindingInfoMap -> summary container for looking up different aspects related to variables in the group node (independent from its position in a given join group), and an associated class easing construction of GroupNodeVarBindingInfo objects for a set of nodes
              2. ASTFilterPlacer: utility class that supports the precise and correct placement of FILTER expressions within a list of IGroupMemberNodes
                ASTJoinGroupFilterExistsInfo: class implementing functionality to identify and access FILTER [NOT] EXISTS nodes within a join group (which are translated as a hybrid of an ASK subquery and a FilterNode, thus requiring special handling)
              3. ASTTypeBasedNodeClassifier (&ASTTypeBasedNodeClassifierConstraint): supports, given a set of types as input, the partitioning of a node list into lists of nodes with the give type + rest, including additional constraints for membership to a certain partition
            4. Minor things
          2. ASTBottomUpOptimizer -> minor bugfix (as documented inline)
          3. ASTSparql11SubqueryOptimizer -> minor bugfix (inheritance of bindings clauses to subqueries)
          4. ASTStaticBindingsOptimizer -> minor bugfix (see in code documentation for details)
          5. DefaultOptimizerList -> hooked in new optimizer
          Show
          michaelschmidt michaelschmidt added a comment - This ticket has been mainly resolved as part of the join refactoring. Keeping it open, as there are some subtickets left. The only point from the points above that remains open is The computation of definiteVars for bind expressions could be improved: currently, variables bound in BIND expressions are never considered definite (as BIND in general may fail), but trivial cases such as BIND (?x as < http://constant/uri >) might be considered "safe". This is not considered crucial though. Please find below some generic comments on the join order refactoring which addressed most of the issues mentioned in this ticket. — Implemented new strategy for join ordering within join groups: Key changes: Implemented new optimizer (ASTJoinGroupOptimizer), which puts the constructs into the right order Addressing mainly two correctness problems Reordering of non-reorderable constructs (OPTIONAL problematics) Proper handling of nodes that require variables to be bound (FILTER NOT EXISTS, BIND, certain SERVICEs) Approaches to the two problems Join order optimization now takes “partitions” (in form of OPTIONALs into account), reordering across partitions only where valid New Interface IVariableBindingRequirements with central method getRequiredVariables(), which determines the variables that must be bound prior to executing a mode Optimizations: Proper treatment of FILTER NOT EXISTS queries, now correct + more precise placement (as for other FILTERs) Also more precise placement of FILTER expressions that cannot be attached to a single join group node, namely … as early as possible … correctly placed in case triple patterns are reordered by the static optimizer Proper treatment of complex SERVICEs requiring incoming bound variables -> placed at first possible position Proper treatment of BIND and VALUES clauses -> placed at first Reusable components for FILTER placement, extraction of interesting variable sets from join groups, etc. -> clear the way for accelerated implementation of other rewriting heuristics Two “modes” for optimizer: Assert valid order & optimize Assert valid order only Test cases for optimizer itself at query and AST level Refactoring changes Interface IVariableBindingRequirements, with the central method getRequiredBound() and associated methods at ServiceFactory (with slightly different signature for extracting this information from service nodes) Implementation of the interface in various classes based on StaticAnalysis utility methods For services, the interface implementation is (at the time being) the same everywhere (requiredBound = \emptyset, imposing no constraint), see FulltextSearchServiceFactory.getRequiredBound() for how an implementation could look like (and we may want to adjust other services to offer incoming bound variables in the future) The implementation of this interface amounts for most of the changes. Many of them invoke setting up a new empty hash set. The optimizer itself: ASTJoinGroupOrderOptimizer Closely related are ASTJoinGroupPartition(s), offering data structure and the key utility methods for placement of nodes within partitions based on heuristics and variable binding constraints IASTJoinGroupPartitionReorderer: interface for an inter-partition optimizer; currently, there’s only a simple implementation (TypeBasedASTJoinGroupPartitionReorderer, in large parts reflecting the behaviour of the old optimizer), but this is how I would envision to hook in the StaticAnalysis (the interface might need to be changed/extended, it’s a first step). Reusable utility classes (used by the optimizer) GroupNodeVarBindingInfo & GroupNodeVarBindingInfoMap -> summary container for looking up different aspects related to variables in the group node (independent from its position in a given join group), and an associated class easing construction of GroupNodeVarBindingInfo objects for a set of nodes ASTFilterPlacer: utility class that supports the precise and correct placement of FILTER expressions within a list of IGroupMemberNodes ASTJoinGroupFilterExistsInfo: class implementing functionality to identify and access FILTER [NOT] EXISTS nodes within a join group (which are translated as a hybrid of an ASK subquery and a FilterNode, thus requiring special handling) ASTTypeBasedNodeClassifier (&ASTTypeBasedNodeClassifierConstraint): supports, given a set of types as input, the partitioning of a node list into lists of nodes with the give type + rest, including additional constraints for membership to a certain partition Minor things ASTBottomUpOptimizer -> minor bugfix (as documented inline) ASTSparql11SubqueryOptimizer -> minor bugfix (inheritance of bindings clauses to subqueries) ASTStaticBindingsOptimizer -> minor bugfix (see in code documentation for details) DefaultOptimizerList -> hooked in new optimizer

            People

            • Assignee:
              michaelschmidt michaelschmidt
              Reporter:
              michaelschmidt michaelschmidt
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated: