Status: In Progress
Affects Version/s: None
Fix Version/s: None
For query we have many more threads and thus we can drive many more IOPS. This is not true when loading data. In general we are restricted to only a few threads that actually read or write on the disk. Reads occur during the index updates. This also drives eviction from the write retention queue into the write cache. Writes occur when the write cache is full. There is only one thread that actually writes on the disk, but it can just drop 8k blocks of random writes onto the disk write queue as fast as that queue will accept them. I do not think that this thread is a bottleneck problem. I think that the problem is that we have very few threads driving reads during bulk load.
The reads are most efficient when performed with locality using ordered reads. So we have (TERM2ID + BLOBS). This is currently sequential, but those two could be parallelized. Then we have ID2TERM. This COULD be overlapped with the statement index operations, but that requires a refactor of how the statement buffer integrates with the lexicon. Scale-out actually does this already. Then we have the statement index operations and those are in parallel, so we have 3 threads driving the indices at that point (or 6 for quads). So there is not enough concurrency during read on the indices while LOADING data. (We can also increase the parallelism by overlapping more of the index write operations and that should help outside of such large bulk load scenarios. There is already a ticket for this.)
One possible "fix" for this using the DataLoader is to give it a LOT of write cache / write retention queue capacity. The write cache will be tested before we go to the disk for a read. If it is in the write cache we do not need to do anything outside of Blazegraph. I think that the SSD latency for those reads (since there are only 1-3 threads driving them) is just bogging down the reads for writes that do get evicted from the write cache. So a larger write cache would be fewer reads that have any latency and higher throughput in the writer. Note that index writes DO drive reads since we need to traverse the index, check for a key, and then conditionally do the insert. So we need to read if there is a miss in the index traversal.
Note that we can actually override the #of write cache buffers separately for the DataLoader. We just need to add the option into the set of options that the data loader can recognize on the command line using -D. This is done in main(). The place is clearly marked. This can also be changed directly in RWStore.properties, but then it effects both load and query and the additional write cache is most useful during load.
For the OSP index we can use the small slots optimization. Martyn has a branch with a proposed fix for slot waste. That could get pulled into the branch here and tested. Or we can test on other data sets first.
Most of the points discussed above are already captured by tickets under BLZG-641. However there are two other approaches that we could look at, which is the focus of this ticket.
1. One approach is to schedule the reads that we need to do to support writes in advance. This is basically a prefetch mechanism. There is a ticket for prefetch for key-range scans (BLZG-1387), but it could equally apply on an index where we have good locality and will need to visit a series of pages. (Where locality of the index updates is poor, the small slot optimization can help to improve the locality on the disk by co-locating index pages in the same 8k disk page.)
2. Another approach is to first order and then partition the writes on each index and assign a thread to each such partition. This is basically what the scale-out architecture is doing when it splits a shard. Doing this for scale-up means managing conflicting updates where the boundaries of those partitions intersection. This turns into a lock protocol problem. One nice thing about the current design is that it is basically lock free for writers and readers. The only lock is when there is a page miss, and that lock is just to load that specific page. B+Tree locking schemes can add a lot of complexity. But we might be able to define a partitioning of the updates and lock management from inside of the IIndexProcedure when we have a tighter coupling available with the B+Tree.
Given the relative complexity, I would definitely try the prefetch approach first. The trick is to find a prefetch scheme that does not read pages that we will not need. This should be possible given that we have an array of keys that will be touched by the update. Since we know the branching factor of the index, we could use keyAt(index) to jump a page at a time. But we need to stop this if we are running into a part of the index that we will not touch with the update. There are some utility methods for scale-out to split a key across a series of partitions. This is a similar problem, but the partitions are defined by the pages and we want to demand a bunch of pages at once and then service the keys for those pages as the pages arrive. So having a sorted key array gives us locality in the index, prefetch gives us a lot of scheduled IOs, and servicing the keys that fall into each page as it is read (by a single servicing thread) keeps that thread busy and allows us to avoid locking issues in the index.