Details

      Description

      Hash join performance with the HTree has a deserialization bottleneck, as observed under a profiler. During an HTree hash join, essentially 100% of the reported CPU time is spent in default Java object deserialization of the solutions stored on the HTree.

        Activity

        bryanthompson added a comment -

        Investigation has already revealed a problem with the hash code computed by the HashJoinUtility, and a better hash function has been selected. A bad hash function can cause a hot spot, since collision buckets (solutions having the same hash code) grow larger as the hash function degrades. In the extreme case (which also occurs when there are no join variables), all solutions have the same hash code and the hash join degenerates to a full scan of the HTree for each source solution to be joined. Solutions which do not join are rejected, but this is naturally far more expensive than a join with at least one join variable whose values are well distributed over the solutions in the hash index.
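        For illustration, a minimal, self-contained sketch of hashing a solution over the declared join variables is given below. The Map-based solution, the String variable names, and the JoinVarHash/hashOf names are simplified stand-ins for this write-up, not the actual HashJoinUtility API.

        import java.util.Map;

        /**
         * Sketch only: hash a solution over its join-variable bindings.
         * The types here are simplified stand-ins, not the bigdata classes.
         */
        public class JoinVarHash {

            /**
             * Combine the hash codes of the join-variable bindings. With no join
             * variables every solution gets the same code (h == 1), so all
             * solutions fall into a single collision bucket and the join
             * degenerates to a scan, as described above.
             */
            public static int hashOf(final String[] joinVars,
                                     final Map<String, Object> solution) {
                int h = 1;
                for (String v : joinVars) {
                    final Object val = solution.get(v);
                    // The 31-multiplier mixing step (as in java.util.Arrays.hashCode)
                    // spreads the combined code even when individual codes cluster.
                    h = 31 * h + (val == null ? 0 : val.hashCode());
                }
                return h;
            }

            public static void main(String[] args) {
                System.out.println(hashOf(new String[] { "x" },
                        Map.of("x", "http://example.org/a", "y", 42)));
            }
        }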

        The HashJoinUtility has also been modified to "vector" the hash join. This involves taking a "chunk" of source solutions, computing their hash codes based on the declared join variables, and then sorting the source solutions by those hash codes. For each distinct hash code, the corresponding collision bucket in the HTree is scanned exactly once and all source solutions for that hash code are joined with all solutions in that collision bucket.
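        The loop structure is roughly as follows. This is a simplified illustration of the vectoring just described: a Map of collision buckets stands in for the HTree hash index, and the class and method names are hypothetical.

        import java.util.ArrayList;
        import java.util.Comparator;
        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;

        /**
         * Sketch only: "vectored" hash join. A Map<Integer, List<...>> of
         * collision buckets stands in for the HTree ("rightSolutions").
         */
        public class VectoredHashJoinSketch {

            static void vectoredJoin(final List<Map<String, Object>> chunk,
                    final String[] joinVars,
                    final Map<Integer, List<Map<String, Object>>> rightSolutions,
                    final List<Map<String, Object>> out) {

                // Compute each source solution's hash code over the join variables,
                // then sort the chunk so solutions with equal codes are adjacent.
                chunk.sort(Comparator.comparingInt(s -> hashOf(joinVars, s)));

                int i = 0;
                while (i < chunk.size()) {
                    final int code = hashOf(joinVars, chunk.get(i));
                    int j = i;
                    while (j < chunk.size() && hashOf(joinVars, chunk.get(j)) == code) {
                        j++;
                    }
                    // Scan the collision bucket for this distinct hash code exactly
                    // once, joining it against every source solution sharing the code.
                    for (Map<String, Object> right
                            : rightSolutions.getOrDefault(code, List.of())) {
                        for (int k = i; k < j; k++) {
                            final Map<String, Object> left = chunk.get(k);
                            if (consistent(left, right, joinVars)) {
                                final Map<String, Object> merged = new HashMap<>(right);
                                merged.putAll(left);
                                out.add(merged);
                            }
                        }
                    }
                    i = j; // advance to the next distinct hash code
                }
            }

            /** Solutions join when their join-variable bindings do not conflict. */
            static boolean consistent(final Map<String, Object> a,
                    final Map<String, Object> b, final String[] joinVars) {
                for (String v : joinVars) {
                    final Object x = a.get(v), y = b.get(v);
                    if (x != null && y != null && !x.equals(y)) return false;
                }
                return true;
            }

            /** Same join-variable hash as in the earlier sketch. */
            static int hashOf(final String[] joinVars, final Map<String, Object> s) {
                int h = 1;
                for (String v : joinVars) {
                    final Object val = s.get(v);
                    h = 31 * h + (val == null ? 0 : val.hashCode());
                }
                return h;
            }
        }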

        bryanthompson added a comment -

        Deserialization costs still strongly dominate the hash join operation. The next step is to optimize the (de-)serialization of solution sets. For this, I have the following proposal.

        1. A "schema" will be identified for the join. The schema will consist of the ordered set of variables which are either "known" bound or "maybe" bound. The join variables corresponds directly to the "known" bound variables. Any order can be used as along as it is stable. Ordering first by the variables names is an obvious choice. However, we could also take the variables in the order of their first appearance (e.g., by using a LinkedHashSet of the observed variables as the embodiment of the schema). In order to be complete, any exogenous variables must also be included in this analysis (variables for which bindings are given when the query was compiled). (The "as observed" LinkedHashSet approach would automatically handle this case.) Variables which are not bound in a given solution would be represented by a "NullIV" (TermId with a tid of 0L).

        2. Only the IVs will be stored in the existing HTree (the "rightSolutions"). IVs will be stored in the order declared by the "schema" for the hash join (a simplified encoding sketch follows this list). This encoding is known to be fast, tight, and efficient; it is the same encoding that we use in the statement indices. Rather than storing the serialized Java object solutions on raw records, we could increase the threshold before something is promoted to a raw record and easily store most solutions (encoded as IV[]s) within the HTree bucket nodes. This should produce a significantly smaller in-memory footprint for the HTree. (This is not possible with the existing Java object serialization as the serialized solutions are much too "fat" to be stored in the HTree buckets.)

        3. A custom routine will be written to "join" two solutions, one of which is represented as an IBindingSet and the other of which is represented by the schema and an encoded IV[]. We can either encode the source solution per the schema or decode the solution from the HTree.

        4. Any RDF Value objects cached on IVs WILL NOT be stored in the primary HTree ("rightSolutions"). Instead, they will be stored in a secondary HTree scoped to the specific hash join (IV2VAL) which maps IVs to BigdataValues. Note that the IV2VAL HTree only contains entries for those IVs which had cached BigdataValue objects. It is quite likely that the same IV will appear multiple times; for efficiency, its cached BigdataValue must be recorded in the HTree exactly once. The key for that HTree will be either the IV itself (the HTree can handle variable length keys) or the hash code of the IV (in which case we must store both the IV and the Value object in the byte[] value). The BigdataValue will be serialized using the same routines which we use to store BigdataValue objects in the TERM2ID or BLOBS indices.

        5. The solution join based on the source IBindingSet and the IV[] will be fast. Once we know that two solutions in fact join, the IVs in the IV[] can be resolved against the IV2VAL HTree. If necessary, this join can also be vectored. This would be done by sorting the IVs to be resolved based on their hash codes, and stepping through the collision buckets for those IVs. Since the same IV can appear many times across the source solutions, vectoring this join could substantially reduce the required effort to reunite the IVs with their cached values. Also, we do not need to perform this join for any IVs having a cached value conveyed by the source solution. That cached value will already have been attached to the output solution by the solution join (step 3 above).

        6. It may also be worthwhile to implement a HashSet- or HashMap-based hash index / hash join operation in order to compare performance tradeoffs on hash joins of different cardinality, selectivity in their hash codes, and join hit ratios. A HashSet/HashMap-based hash join should be optimally efficient since it does not require any (de-)serialization up until the point that heap pressure begins to degrade the performance of the JVM. Any such assessment of the tradeoff point should also be made in light of the concurrent query workload, as that can significantly increase the heap pressure.
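        As a simplified sketch of what (1)-(3) might look like, the code below uses plain longs as stand-ins for IVs (0L playing the role of the NullIV) and Maps as stand-ins for IBindingSets. The class and method names are hypothetical; the real implementation would use the bigdata IV hierarchy and the statement-index key encoding.

        import java.util.ArrayList;
        import java.util.LinkedHashMap;
        import java.util.LinkedHashSet;
        import java.util.List;
        import java.util.Map;

        /**
         * Sketch only: build the join "schema" in order of first appearance,
         * encode a solution as a fixed-width IV[] in schema order, and join an
         * encoded right solution against a source solution.
         */
        public class SchemaEncodingSketch {

            /** Schema = variables in order of first appearance across the solutions. */
            static List<String> schemaOf(final List<Map<String, Long>> solutions) {
                final LinkedHashSet<String> vars = new LinkedHashSet<>();
                for (Map<String, Long> s : solutions) {
                    vars.addAll(s.keySet());
                }
                return new ArrayList<>(vars);
            }

            /** Encode a solution as an IV[] in schema order; 0L marks an unbound variable. */
            static long[] encode(final List<String> schema, final Map<String, Long> solution) {
                final long[] ivs = new long[schema.size()];
                for (int i = 0; i < schema.size(); i++) {
                    final Long iv = solution.get(schema.get(i));
                    ivs[i] = (iv == null) ? 0L : iv;
                }
                return ivs;
            }

            /**
             * Join a source solution against an encoded right solution without
             * materializing the right solution: bindings must agree on every
             * variable bound on both sides. Returns null when they do not join.
             */
            static Map<String, Long> join(final List<String> schema,
                    final Map<String, Long> left, final long[] rightIVs) {
                final Map<String, Long> out = new LinkedHashMap<>(left);
                for (int i = 0; i < schema.size(); i++) {
                    if (rightIVs[i] == 0L) {
                        continue; // NullIV: unbound on the right, nothing to check
                    }
                    final Long bound = out.get(schema.get(i));
                    if (bound != null && bound != rightIVs[i]) {
                        return null; // conflicting bindings: the solutions do not join
                    }
                    out.put(schema.get(i), rightIVs[i]);
                }
                return out;
            }
        }

        In this sketch, IVs carrying cached BigdataValues would then be resolved against the side IV2VAL index after the join succeeds, per (4)-(5) above.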

        The techniques outlined above could also be used to encode chunks of solutions which are being queued in the query engine for evaluation by a downstream operator or buffered for transmission to another node in a cluster. In order to do this, we would need to annotate each operator with the complete set of known and maybe bound variables. This can be done based on a static analysis when the query is compiled. (It could also be done dynamically, by forming the order at the time that the solutions are encoded, as long as the order is transmitted along with the encoded solutions.) However, when moving solutions off the heap and into a direct buffer, we need to maintain the compactness of the buffer so it can be transmitted as a unit. Some more thought will have to go into techniques suitable for this purpose, as well as into efficient decoding of the BigdataValue objects for IVs which cache such objects.

        bryanthompson added a comment -

        Added a JVM-based variant of the HashJoinUtility and a test suite for the same. Refactored the SubqueryHashJoin to use the JVMHashJoinUtility. Added JVM variants of HashIndexOp (JVMHashIndexOp) and SolutionSetHashJoinOp (JVMSolutionSetHashJoinOp). Modified AST2BOpUtility to have conditional code paths that use the JVM-based hash joins rather than the HTree-based hash joins.
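        For illustration, the essence of a purely JVM (heap-based) hash index is sketched below. This is not the actual JVMHashJoinUtility code, just a rough rendering of the approach using a HashMap keyed by the join-variable bindings; all names here are hypothetical.

        import java.util.ArrayList;
        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;

        /**
         * Sketch only: an in-memory hash index. Solutions stay on the Java heap,
         * so no (de-)serialization occurs until heap pressure becomes the limit.
         */
        public class JvmHashIndexSketch {

            private final List<String> joinVars;
            private final Map<List<Object>, List<Map<String, Object>>> index = new HashMap<>();

            JvmHashIndexSketch(final List<String> joinVars) {
                this.joinVars = joinVars;
            }

            /** Key = the values of the join variables, in declaration order. */
            private List<Object> keyOf(final Map<String, Object> solution) {
                final List<Object> key = new ArrayList<>(joinVars.size());
                for (String v : joinVars) {
                    key.add(solution.get(v));
                }
                return key;
            }

            /** Build the hash index from the right-hand solutions. */
            void add(final Map<String, Object> solution) {
                index.computeIfAbsent(keyOf(solution), k -> new ArrayList<>()).add(solution);
            }

            /** Probe the index with a source solution; returns the joined solutions. */
            List<Map<String, Object>> probe(final Map<String, Object> left) {
                final List<Map<String, Object>> joined = new ArrayList<>();
                for (Map<String, Object> right : index.getOrDefault(keyOf(left), List.of())) {
                    final Map<String, Object> merged = new HashMap<>(right);
                    merged.putAll(left);
                    joined.add(merged);
                }
                return joined;
            }
        }

        Until the index outgrows the heap, every operation here is pointer chasing over live Java objects, which is why the (de-)serialization cost only appears once heap pressure does.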

        These changes will make it possible to test the performance of the complex optional queries using the hash join approach before we refactor the HTree hash join logic in order to remove the deserialization bottleneck.

        Committed revision r5343.

        bryanthompson added a comment -

        Added a JVM version of the NamedSubqueryOp and refactored AST2BOpUtility and AST2BOpContext to isolate the boolean switches for generating the hash-join pattern for sub-group evaluation and the hash-join pattern for sub-query evaluation. The JVM version of the NamedSubqueryOp will be used unless native hash joins are specified in AST2BOpContext.

        NamedSubqueryIncludeOp is no longer used. It has been replaced with the more general-purpose SolutionSetHashJoinOp (HTree) and JVMSolutionSetHashJoinOp. The class is still hanging around because it declares some annotations used by the other hash join operator classes, but it will eventually be removed entirely.

        Modified TestASTLiftPreFiltersOptimizer to succeed and log an error indicating that no tests have been written. We still need to write this AST optimizer; it is responsible for lifting pre-filters into the parent group.

        Committed revision r5350.

        bryanthompson added a comment -

        The query at the bottom of this comment was developed based on a long-running query identified by CSI against the govtrack data set. The query produces roughly 21M results (20,968,700). It runs quickly using "as-bound" pipeline joins, completing in only 20 seconds. However, because the query uses a high volume join, it is also suitable for performance tuning of the HTree hash join.

        The query can be run in several ways.

        A. If the "hashJoin" query hint is left out, the query runs using "as-bound" pipeline joins and completes in 20 seconds.

        B. If the "hashJoin" query hint is specified, the query runs using an HTree hash join and completes in 70 seconds (down from an initial run time of 25 minutes).

        C. If both the "hashJoin" and the "keyOrder" query hints are specified, the query runs in 40 seconds. This plan is faster because there is better locality on the PCSO index than on the POCS index for this specific query.

        While the pipeline join is faster for this specific high-volume query, there will be queries where the HTree-based hash join is faster. Whether as-bound pipeline joins or HTree hash joins will be faster depends on factors which are difficult to predict in advance, and heuristics for recognizing such join-strategy tradeoffs can only be developed with more data and queries.

        # Query producing 20968700 results.
        # 20s with pipeline join.
        # 70s with htree hash join and POCS index (default).
        # 40s with htree hash join and PCSO index.
        
        PREFIX hint: <http://www.bigdata.com/queryHints#>
        SELECT (COUNT(*) as ?count)
        WHERE {
          GRAPH ?g {
            # 315k statements, 300ms for AP scan.
            ?_var10 a <http://www.rdfabout.com/rdf/schema/vote/Option> .
            # 2M statements, 17,623ms for AP scan.
            ?_var10 <http://www.rdfabout.com/rdf/schema/vote/votedBy> ?_var3 .
            # Query hint to enable a hash join.
            hint:BGP hint:com.bigdata.rdf.sparql.ast.eval.hashJoin "true" .
            # Query hint to use a different index (IFF using a hash join).
            # Note: the default index for the hash join is POCS.
            hint:BGP hint:com.bigdata.bop.IPredicate.keyOrder "PCSO" .
          }
        }
        

          People

          • Assignee:
            bryanthompson
          • Reporter:
            bryanthompson
          • Votes:
            0
          • Watchers:
            3