It takes roughly 10x longer to evict a record to the channel than to insert a record into the write cache, which makes sense. Evicting an entire cache block to the disk is several orders of magnitude slower still, which also makes sense. Thus, if the write cache is keeping up, the index mutation and eviction rate is the bottleneck. Since index mutation is fast initially (we see high TPS throughput at the start), I am going to put my finger on the B+Tree write cache eviction logic as the probable bottleneck, though this is worth talking through in more detail. It does make sense: any latency in eviction from the write retention queue translates directly into index latency, since mutation blocks during eviction.
I already raised the notion of a parallel eviction (by B+Tree level) during an index checkpoint, which would reduce the commit latency.
The main work during eviction is serializing pages (basically a column-wise compression of the page metadata plus application-determined compression of the keys and values). The situation is a little more complex when we evict a non-leaf page, since we must also evict all dirty children before evicting the parent. We can try to figure out which of these is the bottleneck by looking, for each index, at the serialization time versus the page write time (if we have the appropriate counters, which I will check). Serialization is page by page. Write includes the recursive write of all dirty child pages, plus their serialization time. Since a "write" is just an insert into the cache, the opportunity is not to reduce the write latency (which is ~2900 ns), but to do more work in parallel during eviction.
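To make the serialization-versus-write split concrete, here is a minimal Python sketch of the recursive eviction order and the two counters. All names (`Page`, `serialize`, `write_to_cache`, `stats`) are hypothetical; real eviction would do the column-wise compression and cache insert rather than these stubs.

```python
from dataclasses import dataclass, field
from time import perf_counter_ns

# Hypothetical page model: a non-leaf page may have dirty children that
# must be evicted (serialized + written) before the parent itself.
@dataclass
class Page:
    name: str
    dirty: bool = True
    children: list = field(default_factory=list)

# Counters splitting per-page serialization time from cache-insert time.
stats = {"serialize_ns": 0, "write_ns": 0}

def serialize(page):
    t0 = perf_counter_ns()
    # ... column-wise compression of page metadata, key/value compression ...
    blob = page.name.encode()
    stats["serialize_ns"] += perf_counter_ns() - t0
    return blob

def write_to_cache(blob):
    t0 = perf_counter_ns()
    # ... insert the serialized page into the write cache (~2900 ns) ...
    stats["write_ns"] += perf_counter_ns() - t0

def evict(page, order):
    # Post-order traversal: every dirty child is evicted before the parent,
    # so the parent's total eviction cost includes the recursive
    # serialization + write of its dirty subtree.
    for child in page.children:
        if child.dirty:
            evict(child, order)
    write_to_cache(serialize(page))
    page.dirty = False
    order.append(page.name)

root = Page("root", children=[Page("leaf1"), Page("leaf2", dirty=False), Page("leaf3")])
order = []
evict(root, order)
print(order)  # dirty leaves first, then the parent: ['leaf1', 'leaf3', 'root']
```

Comparing `stats["serialize_ns"]` against `stats["write_ns"]` per index is exactly the breakdown described above.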
This suggests at least two strategies.
1. Take the dirty children and evict them in parallel by B+Tree level (similar to the checkpoint optimization).
2. When we need to evict something, scan backward in the write retention queue for a good chunk of work that we can do in parallel. This might mean evicting the 100 LRU leaves in parallel using a thread pool, or looking back over those 100 LRU entries, organizing them by B+Tree level, and then evicting each level in parallel. The idea is that the index writer must still block for a fixed latency (the time to serialize one index page), but if we have free CPU cores we can evict many pages in parallel within that same window. This could reduce the average page eviction latency by as much as 100x (assuming perfect scaling, which is not likely).
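The second strategy can be sketched as follows in Python. This is an illustration only: `retention_queue`, `evict_page`, the chunk size of 100, and the (page, level) encoding are all assumptions, and `evict_page` is a stub standing in for serialize-and-insert-into-cache. Evicting the deepest level first preserves the children-before-parent constraint from the eviction logic described earlier.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

# Hypothetical write retention queue, LRU end first; each entry is a
# (page_id, btree_level) pair, where the highest level number is a leaf.
retention_queue = deque(
    ("page%d" % i, i % 3) for i in range(300)  # levels 0 (near root) .. 2 (leaf)
)

def evict_page(page_id):
    # ... serialize the page and insert it into the write cache ...
    return page_id

def evict_chunk(queue, chunk=100, workers=8):
    """Scan back over the LRU-most `chunk` entries, group them by B+Tree
    level, and evict each level in parallel, deepest (leaf) level first so
    that no parent is evicted before its dirty children."""
    batch = [queue.popleft() for _ in range(min(chunk, len(queue)))]
    by_level = {}
    for page_id, level in batch:
        by_level.setdefault(level, []).append(page_id)
    evicted = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for level in sorted(by_level, reverse=True):  # leaves first
            evicted.extend(pool.map(evict_page, by_level[level]))
    return evicted

done = evict_chunk(retention_queue)
print(len(done), len(retention_queue))  # 100 evicted, 200 remaining
```

The index writer still pays one page-serialization latency, but the thread pool drains up to a full chunk in that window, which is where the (idealized) 100x figure comes from.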