Details
-
Type:
Improvement
-
Status: Done
-
Priority:
High
-
Resolution: Done
-
Affects Version/s: None
-
Fix Version/s: BLAZEGRAPH_2_0_0
-
Component/s: Query Plan Generator
-
Labels:None
Description
As we're making more and more use of hash joins (e.g. when projecting in DISTINCT variables into subgroups), it might be nice to have a pipelined version of the hash join operator.
Candidates might be, for instance,
- https://infosys.uni-saarland.de/datenbankenlernen/14452inverted.pdf
- http://www-users.cs.umn.edu/~mokbel/papers/hashmj-icde04.pdf
We need to have a closer look at what’s possible here and what the specific requirements are, maybe there are also alternatives here (non-hash based). Possibly a merge join would also be a better choice: when computing the DISTINCT projection we could sort right away, passing the distinct projection in sorted as well, and taking care not to destroy order.
One problem in implementing this strategy is concurrency of such collections. Obviously there need to be synchronization barriers around mutation. The HTree is only thread-safe for concurrent readers. If there is a writer, then there must not be a concurrent reader. The current solution set hash join checkpoints the HTree once all solutions have been accepted. Once the index has been checkpointed the readers obtain a read-only view from the checkpoint.
See also BLZG-657, as well as the paper on eddies at http://db.cs.berkeley.edu/papers/sigmod00-eddy.pdf.