Blazegraph (by SYSTAP) / BLZG-641 Improve load performance / BLZG-1522

RDF Parser and index writers should overlap




      Run the parser in a separate thread and buffer its output so that data is always available for index updates. The generalized version of this is BLZG-1523, which would use a pool of parsers. Each parser would drop a StatementBuffer onto a blocking queue. The blocking queue would be drained by a single-threaded executor, which would do the index writes.
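
The producer/consumer arrangement described above can be sketched as follows. This is a minimal illustration, not Blazegraph code: OverlapSketch and load are hypothetical names, a List<String> stands in for a StatementBuffer, and counting statements stands in for the index writes. It assumes queueCapacity is at least 1, since ArrayBlockingQueue rejects a capacity of zero.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: a pool of parsers feeds one writer through a bounded queue. */
final class OverlapSketch {

    /** Sentinel batch (compared by identity) telling the writer that parsing is done. */
    private static final List<String> END = new ArrayList<>();

    /** Returns the number of statements the single writer thread consumed. */
    static int load(List<List<String>> sources, int bufferCapacity,
                    int queueCapacity, int parserThreads) {
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(queueCapacity);
        ExecutorService parsers = Executors.newFixedThreadPool(parserThreads);
        ExecutorService writer = Executors.newSingleThreadExecutor();
        try {
            // Single writer: drains batches until it takes the sentinel.
            Future<Integer> written = writer.submit(() -> {
                int n = 0;
                for (List<String> b = queue.take(); b != END; b = queue.take()) {
                    n += b.size(); // stand-in for the index writes
                }
                return n;
            });
            // Parsers: fill a batch, then block in put() while the queue is full.
            List<Future<?>> parses = new ArrayList<>();
            for (List<String> source : sources) {
                parses.add(parsers.submit(() -> {
                    List<String> batch = new ArrayList<>(bufferCapacity);
                    for (String statement : source) {
                        batch.add(statement);
                        if (batch.size() == bufferCapacity) {
                            queue.put(batch);
                            batch = new ArrayList<>(bufferCapacity);
                        }
                    }
                    if (!batch.isEmpty()) {
                        queue.put(batch); // flush the final partial batch
                    }
                    return null;
                }));
            }
            for (Future<?> f : parses) {
                f.get(); // wait for every parser and surface its errors
            }
            queue.put(END);
            return written.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("load failed", e);
        } finally {
            parsers.shutdownNow();
            writer.shutdownNow();
        }
    }
}
```

The bounded queue is what provides the overlap: parsers run ahead of the writer by at most queueCapacity batches and then block, so the writer rarely waits for data while memory use stays capped.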

      In fact, it is possible to also parallelize the writes on the reverse lexicon index (ID2TERM) and the statement indices (SPO, POS, OSP, etc.). We must first write on the TERM2ID/BLOBS index. We can then write on the remaining indices in parallel.
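
The ordering constraint above (TERM2ID/BLOBS first, everything else concurrently) can be illustrated with a small sketch. IndexWriteOrderSketch and writeBatch are hypothetical names, the Runnable arguments are placeholders for the real index writes, and the tasks run on the JVM's default asynchronous pool, which is an assumption rather than how Blazegraph schedules its writes.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch of the write ordering: forward lexicon first, then the rest in parallel. */
final class IndexWriteOrderSketch {

    /**
     * Runs the forward-lexicon write (TERM2ID/BLOBS) to completion on the calling
     * thread, then starts the remaining writes (ID2TERM, SPO, POS, OSP, ...)
     * concurrently and waits for all of them to finish.
     */
    static void writeBatch(Runnable forwardLexicon, List<Runnable> remaining) {
        // Step 1: must finish before any other index is touched.
        forwardLexicon.run();

        // Step 2: the remaining writes are independent of one another.
        CompletableFuture<?>[] pending = new CompletableFuture<?>[remaining.size()];
        for (int i = 0; i < pending.length; i++) {
            pending[i] = CompletableFuture.runAsync(remaining.get(i));
        }

        // Block until every remaining write is done; join() rethrows any failure.
        CompletableFuture.allOf(pending).join();
    }
}
```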

      The scale-out architecture already does all of this in the AsynchronousStatementBufferFactory class. However, there are some drawbacks with that class. First, it is fairly tightly coupled to the scale-out architecture. Second, it does not write on the text index and does not handle RDF*. But it is possible to extract some of the patterns and configuration parameters used to overlap the various stages of parsing, term resolution (against TERM2ID and BLOBS), reverse dictionary writes (ID2TERM), and statement index writes (SPO, POS, etc.).

      Application notes:

      • The changes to the StatementBuffer can drive the heap significantly harder. If you are already overriding either BigdataSail.Options.BUFFER_CAPACITY or DataLoader.Options.BUFFER_CAPACITY to use a larger batch size, then this change could substantially increase heap pressure.
      • The use of a non-zero queue capacity can increase the effective amount of data that is being buffered quite significantly. Caution is recommended when overriding the default buffer capacity and/or queue capacity. The best performance will probably come from small (10k - 20k) buffer capacity values combined with a queueCapacity of 5-20. Larger values will increase the GC burden and could require a larger heap, but the net throughput might also increase.
      • The default (10k) is probably a reasonable value for BigdataSail.Options.BUFFER_CAPACITY. If you use the DataLoader, then be aware that the default for DataLoader.Options.BUFFER_CAPACITY is already 100k. Values of 20k (bufferCapacity) and 10-20 (queueCapacity) may have the best performance for many use cases.
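
To get a feel for how the two settings in the notes above multiply, here is a rough back-of-the-envelope helper. It is only an estimate under a stated assumption: a single parser and a single writer, so that at most queueCapacity full buffers sit in the queue while one more is being filled and one more is being written. BufferFootprintSketch and maxBufferedStatements are hypothetical names, and the real per-statement heap cost is not modeled.

```java
/** Hypothetical estimate of how many statements can be held in memory at once. */
final class BufferFootprintSketch {

    /**
     * Upper bound on buffered statements, assuming one parser and one writer:
     * queueCapacity queued buffers, plus one being parsed into and one being written.
     */
    static long maxBufferedStatements(int bufferCapacity, int queueCapacity) {
        return (long) bufferCapacity * (queueCapacity + 2);
    }
}
```

Under this assumption, a 10k buffer with a queue of 5 holds at most 70,000 statements, a 20k buffer with a queue of 20 holds up to 440,000, and the DataLoader default of 100k with a queue of 20 could hold up to 2,200,000.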




            bryanthompson bryanthompson