Deserialization costs still strongly dominate the hash join operation. The next step is to optimize the (de-)serialization of solution sets. To that end, I propose the following.
1. A "schema" will be identified for the join. The schema will consist of the ordered set of variables which are either "known" bound or "maybe" bound. The join variables corresponds directly to the "known" bound variables. Any order can be used as along as it is stable. Ordering first by the variables names is an obvious choice. However, we could also take the variables in the order of their first appearance (e.g., by using a LinkedHashSet of the observed variables as the embodiment of the schema). In order to be complete, any exogenous variables must also be included in this analysis (variables for which bindings are given when the query was compiled). (The "as observed" LinkedHashSet approach would automatically handle this case.) Variables which are not bound in a given solution would be represented by a "NullIV" (TermId with a tid of 0L).
2. Only the IVs will be stored in the existing HTree (the "rightSolutions"). The IVs will be stored in the order declared by the "schema" for the hash join. This encoding is known to be fast and compact; it is the same encoding that we use in the statement indices. Rather than storing the serialized Java object solutions on raw records, we could increase the threshold before something is promoted to a raw record and easily store most solutions (encoded as IVs) within the HTree bucket nodes. This should produce a significantly smaller in-memory footprint for the HTree. (This is not possible with the existing Java object serialization since the serialized solutions are much too "fat" to be stored in the HTree buckets.)
3. A custom routine will be written to "join" two solutions, one of which is represented as an IBindingSet and the other of which is represented by the schema and the encoded IVs. We can either encode the source solution per the schema or decode the stored solution from the HTree. (See the join sketch following the list.)
4. Any RDF Value objects cached on IVs WILL NOT be stored in the primary HTree ("rightSolutions"). Instead, they will be stored in a secondary HTree scoped to the specific hash join (IV2VAL) which maps IVs onto BigdataValues. Note that the IV2VAL HTree only contains entries for those IVs which had cached BigdataValue objects. Since the same IV is quite likely to appear multiple times, its cached BigdataValue must, for efficiency, be recorded in the HTree exactly once. The key for that HTree will be either the IV itself (the HTree can handle variable-length keys) or the hash code of the IV (in which case we must store both the IV and the Value object in the byte[] value). The BigdataValue will be serialized using the same routines which we use to store BigdataValue objects in the TERM2ID or BLOBS indices. (See the IV2VAL sketch following the list.)
5. The solution join based on the source IBindingSet and the encoded IVs will be fast. Once we know that two solutions in fact join, the IVs in the encoded solution can be resolved against the IV2VAL HTree. If necessary, this resolution can also be vectored. This would be done by sorting the IVs to be resolved based on their hash codes and stepping through the collision buckets for those IVs. Since the same IV can appear many times across the source solutions, vectoring this join could substantially reduce the effort required to reunite the IVs with their cached values. Also, we do not need to perform this join for any IVs whose cached value is conveyed by the source solution; that cached value will already have been attached to the output solution by the solution join (step 3 above). (See the vectored-resolution sketch following the list.)
6. It may also be worthwhile to implement a HashSet- or HashMap-based hash index / hash join operation in order to compare performance tradeoffs on hash joins with different cardinalities, hash-code selectivities, and join hit ratios. A HashSet/HashMap-based hash join should be optimally efficient since it does not require any (de-)serialization up until the point that heap pressure begins to degrade the performance of the JVM. Any such assessment of the tradeoff point should also be made in light of the concurrent query workload, as that can significantly increase heap pressure. (See the collections-based sketch following the list.)
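To make items 1 and 2 concrete, here is a minimal sketch of the schema and the per-schema encoding. It uses plain Java collections as stand-ins for the Bigdata types: IVs are modeled as long values with 0L playing the role of the NullIV, and solutions are modeled as Map<String, Long> rather than IBindingSet. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;

/** Sketch: a first-appearance schema and per-schema encoding (longs stand in for IVs). */
public class SchemaSketch {

    /** Build the schema by observing variables in their order of first appearance. */
    static LinkedHashSet<String> buildSchema(Iterable<Map<String, Long>> solutions) {
        final LinkedHashSet<String> schema = new LinkedHashSet<>();
        for (Map<String, Long> s : solutions) {
            schema.addAll(s.keySet()); // LinkedHashSet keeps first-appearance order.
        }
        return schema;
    }

    /** Encode one solution as a fixed-order array of IVs per the schema. */
    static long[] encode(LinkedHashSet<String> schema, Map<String, Long> solution) {
        final long[] ivs = new long[schema.size()];
        int i = 0;
        for (String var : schema) {
            final Long iv = solution.get(var);
            ivs[i++] = (iv == null) ? 0L : iv; // Unbound => NullIV (tid of 0L).
        }
        return ivs;
    }

    public static void main(String[] args) {
        final Map<String, Long> s1 = new LinkedHashMap<>();
        s1.put("x", 7L);
        s1.put("y", 11L);
        final Map<String, Long> s2 = new LinkedHashMap<>();
        s2.put("x", 7L);
        s2.put("z", 13L);
        final LinkedHashSet<String> schema = buildSchema(Arrays.asList(s1, s2));
        System.out.println(schema);                               // [x, y, z]
        System.out.println(Arrays.toString(encode(schema, s2)));  // [7, 0, 13]
    }
}
```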
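The custom join routine of item 3 might look something like the following sketch, under the same stand-in assumptions (Map<String, Long> for IBindingSet, long for IV, 0L for NullIV). The join fails on an inconsistent binding of a shared variable and otherwise merges the two solutions.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;

/** Sketch: join a map-style solution against a schema-encoded IV array. */
public class JoinSketch {

    /** Returns the merged solution, or null if the two solutions do not join. */
    static Map<String, Long> join(final Map<String, Long> left,
            final LinkedHashSet<String> schema, final long[] rightIVs) {
        final Map<String, Long> out = new HashMap<>(left);
        int i = 0;
        for (String var : schema) {
            final long iv = rightIVs[i++];
            if (iv == 0L)
                continue;             // Unbound on the right: nothing to check.
            final Long bound = out.get(var);
            if (bound == null)
                out.put(var, iv);     // Propagate the right-hand binding.
            else if (bound != iv)
                return null;          // Inconsistent binding: no join.
        }
        return out;
    }
}
```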
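For item 4, the essential invariant is that each IV's cached value is recorded exactly once. A minimal sketch, with a HashMap standing in for the IV2VAL HTree and String standing in for BigdataValue:

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: record each IV's cached value exactly once (HashMap stands in for IV2VAL). */
public class ValueCacheSketch {

    private final Map<Long, String> iv2val = new HashMap<>();

    /** Store the cached value only on the first encounter of this IV. */
    void recordIfAbsent(final long iv, final String cachedValue) {
        if (cachedValue != null) {
            iv2val.putIfAbsent(iv, cachedValue);
        }
    }

    /** Returns the cached value, or null if this IV never had one. */
    String resolve(final long iv) {
        return iv2val.get(iv);
    }
}
```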
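Item 5's vectored resolution amounts to de-duplicating and ordering the IVs before probing, so that each distinct IV is resolved once. The sketch below sorts by natural order as a stand-in for hash-code order, since the stand-in map has no collision buckets to step through:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

/** Sketch: vectored IV -> cached-value resolution (sort + de-duplicate, probe once each). */
public class VectoredResolveSketch {

    static Map<Long, String> resolveAll(final long[] ivs, final Map<Long, String> iv2val) {
        final TreeSet<Long> distinct = new TreeSet<>();
        for (long iv : ivs) {
            if (iv != 0L)
                distinct.add(iv);      // Skip the NullIV.
        }
        final Map<Long, String> resolved = new HashMap<>();
        for (long iv : distinct) {     // One ordered probe per distinct IV.
            final String val = iv2val.get(iv);
            if (val != null)
                resolved.put(iv, val);
        }
        return resolved;
    }
}
```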
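Finally, item 6's comparison baseline could be as simple as the following JVM-collections hash join: build a HashMap from the join-variable bindings to the solutions having those bindings, then probe it with each left-hand solution. Again, Map<String, Long> stands in for IBindingSet, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Sketch: a pure JVM-collections hash join (no (de-)serialization at all). */
public class CollectionsHashJoinSketch {

    /** Build: index the right-hand solutions by the values of the join variables. */
    static Map<List<Long>, List<Map<String, Long>>> build(
            final List<Map<String, Long>> right, final List<String> joinVars) {
        final Map<List<Long>, List<Map<String, Long>>> index = new HashMap<>();
        for (Map<String, Long> s : right) {
            index.computeIfAbsent(key(s, joinVars), k -> new ArrayList<>()).add(s);
        }
        return index;
    }

    /** Probe: the right-hand solutions sharing the join key of the given left solution. */
    static List<Map<String, Long>> probe(
            final Map<List<Long>, List<Map<String, Long>>> index,
            final Map<String, Long> left, final List<String> joinVars) {
        return index.getOrDefault(key(left, joinVars), List.of());
    }

    /** The join variables are "known" bound, so a missing binding is an error here. */
    private static List<Long> key(final Map<String, Long> s, final List<String> joinVars) {
        final List<Long> key = new ArrayList<>(joinVars.size());
        for (String v : joinVars) {
            key.add(Objects.requireNonNull(s.get(v), v));
        }
        return key;
    }
}
```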
The techniques outlined above could also be used to encode chunks of solutions which are being queued in the query engine for evaluation by a downstream operator or buffered for transmission to another node in a cluster. In order to do this, we would need to annotate each operator with the complete set of known and maybe bound variables. This can be done based on a static analysis when the query is compiled. (It could also be done dynamically by forming the order at the time the solutions are encoded, as long as that order is transmitted along with the encoded solutions.) However, when moving solutions off the heap and into a direct buffer, we need to maintain the compactness of the buffer so that it can be transmitted as a unit. Some more thought will have to go into techniques suitable for this purpose, as well as into the efficient decoding of the BigdataValue objects for IVs which cache such objects. (A sketch of the compile-time annotation idea follows.)
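A rough sketch of what that compile-time annotation could look like, walking the operator pipeline and accumulating the known/maybe bound variable sets in a stable first-appearance order. The Op and Annotation types are hypothetical, not existing Bigdata classes:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch: compile-time propagation of "known" and "maybe" bound variables per operator. */
public class StaticAnalysisSketch {

    /** Hypothetical operator descriptor: the variables it definitely/possibly binds. */
    record Op(Set<String> bindsKnown, Set<String> bindsMaybe) {}

    /** Everything known/maybe bound at one operator's output, in a stable order. */
    record Annotation(LinkedHashSet<String> known, LinkedHashSet<String> maybe) {}

    static List<Annotation> annotate(final List<Op> pipeline, final Set<String> exogenous) {
        final LinkedHashSet<String> known = new LinkedHashSet<>(exogenous);
        final LinkedHashSet<String> maybe = new LinkedHashSet<>();
        final List<Annotation> out = new ArrayList<>();
        for (Op op : pipeline) {
            known.addAll(op.bindsKnown());
            maybe.addAll(op.bindsMaybe());
            maybe.removeAll(known); // Once known, a variable is no longer merely "maybe".
            out.add(new Annotation(new LinkedHashSet<>(known), new LinkedHashSet<>(maybe)));
        }
        return out;
    }
}
```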