This is a feature request to improve the load rate into the single-machine database. Currently, a single thread drives both the parser and the index updates. The index updates themselves are executed against a thread pool, but the parser sits idle while the index updates are being performed.
There are several ways in which load performance could be improved:
- Run a separate thread for the parser and buffer the parser output such that there is always data available for index updates.
- Run concurrent parser/loader tasks against the same connection. This could be done for the DataLoader, the InsertServlet, and the LOAD update operation.
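The first option can be sketched as a producer/consumer pair over a bounded queue. This is only a minimal illustration, not the DataLoader code: the class `BufferedParserSketch`, the `load` method, and the use of `String.trim()` as a stand-in for real parsing are all hypothetical. Error propagation from the parser thread is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// Hypothetical sketch: the parser runs on its own thread and fills a bounded
// buffer, so the index writer normally finds a batch waiting for it.
final class BufferedParserSketch {
    // Sentinel batch marking end of input (compared by identity).
    private static final List<String> EOF = new ArrayList<>();

    static int load(Iterable<String> lines, int batchSize, int queueCapacity,
                    Consumer<List<String>> indexWriter) {
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(queueCapacity);

        Thread parser = new Thread(() -> {
            try {
                List<String> batch = new ArrayList<>(batchSize);
                for (String line : lines) {
                    batch.add(line.trim()); // stand-in for real parsing
                    if (batch.size() == batchSize) {
                        queue.put(batch); // blocks when the buffer is full
                        batch = new ArrayList<>(batchSize);
                    }
                }
                if (!batch.isEmpty()) {
                    queue.put(batch);
                }
                queue.put(EOF);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "parser");
        parser.setDaemon(true); // do not keep the JVM alive if the writer fails
        parser.start();

        int total = 0;
        try {
            while (true) {
                List<String> batch = queue.take(); // blocks when the buffer is empty
                if (batch == EOF) {
                    break;
                }
                indexWriter.accept(batch); // stand-in for the index updates
                total += batch.size();
            }
            parser.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while loading", e);
        }
        return total;
    }
}
```

The queue capacity bounds memory use: a fast parser blocks in `put` rather than buffering the whole file, while a fast writer blocks in `take` until the next batch is parsed.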
For line-oriented RDF formats and sparse matrix formats we can also break the file into blocks and assign the blocks to a thread pool. Each thread would find the start of the next line in its block and hand off the remainder (the partial line before that point) to the thread for the previous block. The threads could then read the data into the intermediate format more quickly, which would reduce the time spent parsing relative to the time spent writing the indexed data structures.
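A minimal sketch of the block-splitting idea, assuming the data is already in memory as a byte array and lines end in `\n`. The class `LineBlockSplitter` and its methods are hypothetical names. Instead of explicitly handing the partial line to the previous block's thread, this version moves each block boundary forward to the next line start, which has the same effect: the partial line ends up in the previous block.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of splitting line-oriented input into line-aligned blocks.
final class LineBlockSplitter {

    // Returns nblocks + 1 offsets; block i covers [b[i], b[i + 1]).
    // Every interior offset is moved forward to the start of a line.
    static int[] alignedBoundaries(byte[] data, int nblocks) {
        int[] b = new int[nblocks + 1];
        for (int i = 1; i < nblocks; i++) {
            int pos = (int) ((long) data.length * i / nblocks);
            // Advance until the previous byte is a newline (or we hit the end).
            while (pos > 0 && pos < data.length && data[pos - 1] != '\n') {
                pos++;
            }
            b[i] = pos;
        }
        b[nblocks] = data.length;
        return b;
    }

    // Stand-in for a real parser: splits one block into lines.
    static List<String> parseLines(byte[] data, int from, int to) {
        List<String> lines = new ArrayList<>();
        int start = from;
        for (int i = from; i < to; i++) {
            if (data[i] == '\n') {
                lines.add(new String(data, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        if (start < to) { // last line without a trailing newline
            lines.add(new String(data, start, to - start, StandardCharsets.UTF_8));
        }
        return lines;
    }

    // Parses all blocks on a thread pool and concatenates results in file order.
    static List<String> parseParallel(byte[] data, int nblocks) {
        int[] b = alignedBoundaries(data, nblocks);
        ExecutorService pool = Executors.newFixedThreadPool(nblocks);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int i = 0; i < nblocks; i++) {
                final int from = b[i];
                final int to = b[i + 1];
                Callable<List<String>> task = () -> parseLines(data, from, to);
                futures.add(pool.submit(task));
            }
            List<String> out = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                out.addAll(f.get());
            }
            return out;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Splitting on the `\n` byte is safe for UTF-8 input because that byte value never occurs inside a multi-byte sequence. A real loader would read blocks from the file rather than hold it all in memory.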
In addition to co-threading, we could increase the data density:
- Pack TermIds. See [1,3].
- Use namespace compression.
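As a rough illustration of namespace compression in general (not the design tracked in the linked tickets), a URI can be split into a namespace and a local name, with the namespace replaced by a small integer code from a dictionary. The class `NamespaceCompressionSketch`, the split rule (after the last `#` or `/`), and the `code:localName` text form are all assumptions made for this sketch. A real implementation would presumably use a compact binary encoding.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: dictionary-based namespace compression for URIs.
// Not thread-safe; concurrent loaders would need synchronization.
final class NamespaceCompressionSketch {
    private final Map<String, Integer> codeByNamespace = new HashMap<>();
    private final List<String> namespaceByCode = new ArrayList<>();

    // Index just after the last '#' or '/', or 0 if neither occurs.
    static int splitIndex(String uri) {
        return Math.max(uri.lastIndexOf('#'), uri.lastIndexOf('/')) + 1;
    }

    // Replaces the namespace with its code, assigning a new code if needed.
    String compress(String uri) {
        int split = splitIndex(uri);
        String namespace = uri.substring(0, split);
        Integer code = codeByNamespace.get(namespace);
        if (code == null) {
            code = namespaceByCode.size();
            codeByNamespace.put(namespace, code);
            namespaceByCode.add(namespace);
        }
        return code + ":" + uri.substring(split);
    }

    // Restores the full URI from the "code:localName" form.
    String decompress(String compact) {
        int colon = compact.indexOf(':'); // the code itself never contains ':'
        int code = Integer.parseInt(compact.substring(0, colon));
        return namespaceByCode.get(code) + compact.substring(colon + 1);
    }
}
```

URIs that share a namespace then store the long prefix once in the dictionary, and each occurrence carries only the code plus the local name.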
We should also provide for the dynamic extension of the Vocabulary as part of this effort. My notes on this follow. Simply update the Vocabulary in the GRS. The rest of the byte values, and then the rest of the short values, are all available for use. If we have a vocabulary item which is already in TERM2ID, BLOBS, or inline, then we store the TermId or BlobIV in the Vocabulary. What matters is that the TermIds, BlobIVs, and Vocabulary are consistent. It is not necessary that all Vocabulary items are represented by inline IVs. It is necessary that the IVCache is set and that the items are immediately available from the LexiconConfiguration.
See https://docs.google.com/document/d/1R8tWnAQUWcXl4tMszPztvHmfyWrnHHhsNbmvNBqIEWU/edit for some documentation on work in progress in the load-performance branch.
BLZG-629 (PartlyInlineURIs)
BLZG-654 (Pack TIDs)
BLZG-658 (Use PSOutputStream/ConstantStore for large/small blobs)
BLZG-660 (Support PSOutputStream/InputStream at IRawStore)