Affects Version/s: None
Fix Version/s: None
Right now IRangeQuery.iterator() is a sequential scan. If used from the level of the AccessPath, then a simple asynchronous iterator pattern is used to setup a producer / consumer decoupling.
A true parallel scan would divide up the key-range among a number of threads and each thread would hand off blocks of tuples at a time to the consumer. This would allow us to schedule one IO per thread servicing the iterator.
I have recently implemented this pattern for the GPU data loader using a parallel scan on the SPO index. In that case this was implemented in the application code.
There are at least two reasonable ways to achieve a more general purpose parallel scan.
- Mark the iterator with IRangeQuery.PARALLEL. This indicates that the iterator may deliver results out of order (and inherently that the iterator visitation order is not stable).
- Scale-out already supports PARALLEL and interprets it as a range iterator that is parallel over the shards. However, this ticket is to allow parallel within a single index in order to increase the read pressure on the IO system.
- AccessPath has two code paths for iterators. One is synchronous. If the range count of the iterator is over a threshold then an asynchronous iterator is used. That iterator has a single thread as the producer and the iterator (like all iterators) has a single thread as the consumer. This code could be modified to use a blocking queue driven by multiple readers each of which was assigned a sub-key-range partition of the key-range specified to the rangeIterator(). The FILTER(s) (if any) associated with the AccessPath could be pushed down into the individual sub-key-range iterators to avoid bringing back tuples that do not satisfy the iterators into the queue.
- The parallel flag could be pushed down into the iterator itself. This could have several advantages:
- The parallel iterator could be aware of the structure of the BTree. Thus rather than starting N fingers that are evenly distributed across the key range, it could collaborate to strip mine the leaves of the current node.
- A fixed size thread pool could be used where the pool was sized to the IOPS capability of the IO system.
- A single thread and non-blocking IO pattern could be used.
- Note: A similar approach was tried in the past with a reader pool on the Journal, but that did not provide parallelism within the iterators.
Note: Due to the lack of a stable order, the PARALLEL scan semantics are not compatible with some high level query use cases. In particular, they can not be used for:
- Spatial joins: The spatial join operator relies on the order of the underlying index to know when it can skip over key-ranges.
- Cutoff join cardinality estimation (this requires that we produce N output tuples by feeding in input tuples and cutting off the join if we get N outputs).
- Once order has been established on a solution multi-set, the order needs to be retained for some operations. In particular, order by + limit and order by + distinct + limit.
- michaelschmidt - Can you identify some other cases where reordering is not ok?
The two general approaches (pushing down the PARALLEL flag onto the IRangeQuery.rangeIterator() and pushing down the logic to parallelize across N sub-partitions into the AccessPath) both have the advantage that they are invisible to the application (other than the unordered iterator results). In contrast, the approach that I took with the GPU data loader pushes the parallelism into the data loader code where it is intrinsically more complex to replicate such parallel code into each situation in which we want to accelerate a scan and do not require strict order for the scan.
One complexity with the ITupleIterator is that it is not currently a CloseableIterator. This means that we have no means to terminate the resources dedicated to strip mining the index pages and the iterator would race ahead until it saturated a blocking queue having some maximum capacity. The AccessPath handles the termination of the asynchronous iterator pattern using a BlockingBuffer. The BlockingBuffer is aware of the producer and is able to cancel the producer using its Future. This causes the producing scan to terminate when we the IChunkedOrderedIterator returned by the IAccessPath.iterator() method. (And of course the iterator would no longer actually be order preserving if we parallelize it.)