  1. Blazegraph (by SYSTAP)
  2. BLZG-535

Optimize hash joins when there are no source solutions (or only the exogenous bindings)




      Join variables are currently set based on the incoming bound variables to the subgroup and the definitely bound variables in the subgroup. When a subgroup runs first in the query, it has no incoming bound variables, so it builds a hash index with no join variables. This leads to hash joins without join variables, which are VERY expensive.

      The reason we cannot specify join variables based on the definitely bound variables in the subgroup is that the hash code of the exogenous/empty solution will typically be undefined, since it will not have bindings for the join variables.
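
      To make the hash code problem concrete, here is a minimal sketch. It is not Blazegraph code: solutions are modeled as plain maps from variable name to value, and JoinKeySketch and joinKeyHash are hypothetical names used only for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.OptionalInt;

// Hypothetical illustration only; not part of the Blazegraph code base.
final class JoinKeySketch {

    /**
     * Computes a hash over the values bound to the join variables.
     * Returns an empty result when any join variable is unbound,
     * which is the case for the empty/exogenous source solution.
     */
    static OptionalInt joinKeyHash(Map<String, String> solution, List<String> joinVars) {
        int h = 1;
        for (String v : joinVars) {
            String val = solution.get(v);
            if (val == null) {
                return OptionalInt.empty(); // no binding, so no defined hash code
            }
            h = 31 * h + val.hashCode();
        }
        return OptionalInt.of(h);
    }

    public static void main(String[] args) {
        List<String> joinVars = List.of("x");
        // A solution from the subgroup binds ?x, so the key is defined.
        System.out.println(joinKeyHash(Map.of("x", "a", "y", "1"), joinVars).isPresent());
        // The empty source solution does not bind ?x, so the key is undefined.
        System.out.println(joinKeyHash(Map.of(), joinVars).isPresent());
    }
}
```

      Note that with an empty list of join variables every solution gets the same hash in this sketch, so all solutions land in a single bucket. That is the degenerate case described above.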

      We need to recognize this case and handle it differently. Rather than reducing the join variables to an empty set, the join variables should be all definitely bound variables and we should INCLUDE the solution set into the parent using a different operator.

      Basically, all the operator needs to do is drain the solutions from the hash index, pushing them into the pipeline (its sink). It cannot do a hash join against the source solution (either empty or containing just the exogenous bindings) because there is no guarantee that the source solution will share the join variables (in fact, it nearly always will NOT share the join variables). However, the source solution has very low cardinality (ONE).

      Therefore, for each solution in the hash index, it attempts a join with each source solution in turn. While this is conceptually a cross product unconstrained by the presence of the join variables, in fact there is only one source solution, so this amounts to a single scan of the hash index in which we possibly pick up bindings and/or filter based on the exogenous bindings.
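
      The scan described above can be sketched as follows. This is an illustration under stated assumptions, not the proposed operator: solutions are plain maps, the hash index is a map from hash code to bucket, and DrainIndexSketch, joinOrNull and drain are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the Blazegraph code base.
final class DrainIndexSketch {

    /** Merges two solutions, or returns null if they disagree on a shared variable. */
    static Map<String, String> joinOrNull(Map<String, String> left, Map<String, String> right) {
        Map<String, String> out = new HashMap<>(left);
        for (Map.Entry<String, String> e : right.entrySet()) {
            String prev = out.putIfAbsent(e.getKey(), e.getValue());
            if (prev != null && !prev.equals(e.getValue())) {
                return null; // conflicting binding: this solution is filtered out
            }
        }
        return out;
    }

    /**
     * Scans every bucket of the index once and joins each indexed solution
     * with the single source solution, ignoring the join variables.
     */
    static List<Map<String, String>> drain(
            Map<Integer, List<Map<String, String>>> hashIndex,
            Map<String, String> source) {
        List<Map<String, String>> sink = new ArrayList<>();
        for (List<Map<String, String>> bucket : hashIndex.values()) {
            for (Map<String, String> indexed : bucket) {
                Map<String, String> joined = joinOrNull(indexed, source);
                if (joined != null) {
                    sink.add(joined);
                }
            }
        }
        return sink;
    }

    public static void main(String[] args) {
        Map<Integer, List<Map<String, String>>> index = new HashMap<>();
        index.put(1, List.of(Map.of("x", "a", "y", "1")));
        index.put(2, List.of(Map.of("x", "b", "y", "2")));
        // Empty source solution: every indexed solution is passed through.
        System.out.println(drain(index, Map.of()).size());
        // Exogenous binding on ?x: acts as a filter during the scan.
        System.out.println(drain(index, Map.of("x", "a")).size());
    }
}
```

      An exogenous binding on a variable that the index does not bind is simply picked up by every output solution, while a binding on a shared variable filters the scan.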

      This operation is currently executed by a (JVM|HTree)SolutionSetHashJoin. Perhaps the easiest thing would be to add an annotation to that hash join indicating that it should IGNORE the join variables and do a full (1 x M) cross product.
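
      One way the annotation idea might look is sketched below. Everything here is an assumption made for illustration: FlaggedHashJoinSketch and the ignoreJoinVars flag are hypothetical names, not the existing operator or annotation, and the keyed path is deliberately simplified to return nothing when the source does not bind the join variables.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the actual (JVM|HTree) hash join operator.
final class FlaggedHashJoinSketch {
    private final List<String> joinVars;
    private final Map<List<String>, List<Map<String, String>>> index = new HashMap<>();

    FlaggedHashJoinSketch(List<String> joinVars) {
        this.joinVars = joinVars;
    }

    /** Key is the join variable values in order, or null if any is unbound. */
    private List<String> key(Map<String, String> solution) {
        List<String> k = new ArrayList<>();
        for (String v : joinVars) {
            String val = solution.get(v);
            if (val == null) return null;
            k.add(val);
        }
        return k;
    }

    /** Adds a subgroup solution; all join variables are definitely bound. */
    void add(Map<String, String> solution) {
        List<String> k = key(solution);
        if (k == null) throw new IllegalArgumentException("unbound join variable in " + solution);
        index.computeIfAbsent(k, unused -> new ArrayList<>()).add(solution);
    }

    /** Joins one source solution; with ignoreJoinVars the whole index is scanned (1 x M). */
    List<Map<String, String>> join(Map<String, String> source, boolean ignoreJoinVars) {
        Collection<List<Map<String, String>>> buckets = new ArrayList<>();
        if (ignoreJoinVars) {
            buckets = index.values();
        } else {
            List<String> k = key(source);
            if (k != null && index.containsKey(k)) buckets.add(index.get(k));
        }
        List<Map<String, String>> out = new ArrayList<>();
        for (List<Map<String, String>> bucket : buckets) {
            for (Map<String, String> indexed : bucket) {
                Map<String, String> merged = new HashMap<>(indexed);
                boolean compatible = true;
                for (Map.Entry<String, String> e : source.entrySet()) {
                    String prev = merged.putIfAbsent(e.getKey(), e.getValue());
                    if (prev != null && !prev.equals(e.getValue())) compatible = false;
                }
                if (compatible) out.add(merged);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        FlaggedHashJoinSketch op = new FlaggedHashJoinSketch(List.of("x"));
        op.add(Map.of("x", "a", "y", "1"));
        op.add(Map.of("x", "b", "y", "2"));
        System.out.println(op.join(Map.of(), false).size()); // keyed probe finds nothing
        System.out.println(op.join(Map.of(), true).size());  // full scan emits everything
    }
}
```

      The point of the flag in this sketch is that the index can still be built on all definitely bound variables, while the join against the single source solution falls back to a full scan.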




            bryanthompson bryanthompson