Affects Version/s: None
Fix Version/s: None
Tests with parallel pre-fetch on various data sets demonstrate a clear bottleneck on the TERM2ID index writes. This bottleneck arises from scatter induced in the TERM2ID index by what amounts to random numbers embedded into URIs.
There are two ways of addressing this issue:
- Use an InlineURIHandler to recognize URIs that fit some pattern and inline them directly into the statement indices. This completely removes the induced burden on the TERM2ID index. Scatter is still induced on the statement indices, but the writers on those indices run in parallel, so the latency is not additive. Support for additional intrinsic datatypes was added (
- Support column-wise compression (BLZG-13) in the statement indices: use a page-local dictionary to convert URIs, Literals, etc. into page-local integers, then layer additional compression techniques on top to obtain a tightly packed page. This requires more effort, and I will add some tickets to BLZG-13 for a roadmap in this direction.
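To make the first option concrete, here is a minimal, self-contained sketch of the inlining idea. It does not use the real InlineURIHandler API. The class `InlineUriSketch`, the namespace `http://example.org/item/`, and the `inline:`/`id:` string encoding are all hypothetical and exist only for illustration. URIs whose local name is a plain decimal number are encoded from the value itself, so they never cause a write to the dictionary that stands in for TERM2ID.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

/** Illustrative only: not the Blazegraph API. All names here are hypothetical. */
final class InlineUriSketch {
    /** Hypothetical namespace whose local names are effectively random numbers. */
    static final String NAMESPACE = "http://example.org/item/";

    /** Stand-in for the TERM2ID index: one entry per non-inlined term. */
    private final Map<String, Long> term2id = new HashMap<>();
    private long nextId = 1;

    /** Returns the numeric local name if the URI fits the pattern, else empty. */
    static OptionalLong tryInline(String uri) {
        if (!uri.startsWith(NAMESPACE)) {
            return OptionalLong.empty();
        }
        String local = uri.substring(NAMESPACE.length());
        // At most 18 digits, so the value always fits in a signed 64-bit long.
        if (local.isEmpty() || local.length() > 18) {
            return OptionalLong.empty();
        }
        // Reject leading zeros so that decode(value) reproduces the URI exactly.
        if (local.length() > 1 && local.charAt(0) == '0') {
            return OptionalLong.empty();
        }
        for (int i = 0; i < local.length(); i++) {
            char c = local.charAt(i);
            if (c < '0' || c > '9') {
                return OptionalLong.empty();
            }
        }
        return OptionalLong.of(Long.parseLong(local));
    }

    /** Rebuilds the original URI from an inlined value. */
    static String decode(long value) {
        return NAMESPACE + value;
    }

    /** Encodes a URI inline when possible; otherwise assigns a dictionary id. */
    String encode(String uri) {
        OptionalLong inline = tryInline(uri);
        if (inline.isPresent()) {
            return "inline:" + inline.getAsLong(); // no dictionary write
        }
        long id = term2id.computeIfAbsent(uri, k -> nextId++);
        return "id:" + id;
    }

    /** Number of terms that had to be written to the dictionary. */
    int dictionarySize() {
        return term2id.size();
    }

    public static void main(String[] args) {
        InlineUriSketch sketch = new InlineUriSketch();
        System.out.println(sketch.encode(NAMESPACE + "8675309"));
        System.out.println(sketch.encode("http://example.org/other"));
        System.out.println("dictionary entries: " + sketch.dictionarySize());
    }
}
```

In this sketch only the second URI reaches the dictionary. The inlined value still lands wherever its key sorts in a statement index, which corresponds to the remaining scatter described in the first option.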
This ticket exists to:
- Document the TERM2ID bottleneck.
- Provide workarounds for various data sets.
- Document how to implement these workarounds on the wiki.