  1. Blazegraph (by SYSTAP)
  2. BLZG-641

Improve load performance

    Details

      Description

      This is a feature request to improve the load rate into the single machine database. Currently, a single thread drives the parser and the index updates. The index updates themselves are executed against a thread pool, but the parser is not executing while the index updates are being performed.

      There are several ways in which load performance could be improved:

      1. Run a separate thread for the parser and buffer the parser output such that there is always data available for index updates.
      2. Run concurrent parser/loader tasks against the same connection. This could be done for the DataLoader, the InsertServlet, and the LOAD update operation.
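
      The first option can be sketched as a producer/consumer pair. This is a minimal illustration, not the DataLoader implementation: the class name `OverlappedLoad`, the batch and queue sizes, and the use of plain strings in place of parsed statements are all assumptions made for the example. A parser thread fills a bounded queue with batches, so the writer always has data available while the parser keeps running.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch: a parser thread feeds a bounded queue that the writer drains. */
final class OverlappedLoad {

    // Sentinel batch that marks the end of the input (compared by identity).
    private static final List<String> EOF = new ArrayList<>();

    /** Stand-in for the RDF parser: groups lines into batches and enqueues them. */
    static Thread startParser(List<String> lines, int batchSize,
                              BlockingQueue<List<String>> queue) {
        Thread t = new Thread(() -> {
            try {
                List<String> batch = new ArrayList<>(batchSize);
                for (String line : lines) {
                    batch.add(line);
                    if (batch.size() == batchSize) {
                        queue.put(batch); // blocks only when the writer falls behind
                        batch = new ArrayList<>(batchSize);
                    }
                }
                if (!batch.isEmpty()) queue.put(batch);
                queue.put(EOF);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "parser");
        t.start();
        return t;
    }

    /** Stand-in for the index writer: drains batches until EOF, returns the count. */
    static int load(List<String> lines, int batchSize, int queueCapacity) {
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(queueCapacity);
        Thread parser = startParser(lines, batchSize, queue);
        int written = 0;
        try {
            while (true) {
                List<String> batch = queue.take();
                if (batch == EOF) break;
                written += batch.size(); // a real loader would update the indices here
            }
            parser.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return written;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 10; i++) lines.add("<s" + i + "> <p> <o> .");
        System.out.println(load(lines, 3, 2)); // prints 10
    }
}
```

      The bounded queue caps memory use: if the index writes are the bottleneck, the parser blocks on `put` instead of buffering the whole file.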

      For line-oriented RDF formats and sparse matrix formats we can also break the file into blocks and assign the blocks to a thread pool. Each thread would find the start of the next line in its block and hand off the partial line that precedes it to the thread for the previous block. The threads could then read the data into the intermediate format more quickly, which would reduce the time spent parsing relative to the time spent writing the indexed data structures.
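
      A minimal sketch of the block splitting follows. It is a simplified variant under stated assumptions: the class `BlockSplitter` and its methods are hypothetical, the whole input is held in one byte array, and the block boundaries are aligned to line starts up front rather than negotiated between threads. The effect is the same as the hand-off described above: each partial line ends up in the block where it begins.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: split line-oriented input into line-aligned blocks. */
final class BlockSplitter {

    /** Returns nblocks+1 offsets; block i is [cuts[i], cuts[i+1]) and starts on a line. */
    static int[] lineAlignedCuts(byte[] data, int nblocks) {
        int[] cuts = new int[nblocks + 1];
        for (int i = 1; i < nblocks; i++) {
            int pos = (int) ((long) data.length * i / nblocks); // nominal boundary
            // Advance to the first byte that follows a newline (or to the end of the data).
            while (pos > 0 && pos < data.length && data[pos - 1] != '\n') pos++;
            cuts[i] = pos;
        }
        cuts[nblocks] = data.length;
        return cuts;
    }

    /** Stand-in for a parser: returns the non-empty lines in [from, to). */
    static List<String> linesOf(byte[] data, int from, int to) {
        List<String> out = new ArrayList<>();
        if (from == to) return out;
        String text = new String(data, from, to - from, StandardCharsets.UTF_8);
        for (String line : text.split("\n")) {
            if (!line.isEmpty()) out.add(line);
        }
        return out;
    }

    /** Parses each block on its own thread and concatenates the results in file order. */
    static List<String> parseInParallel(byte[] data, int nblocks) {
        int[] cuts = lineAlignedCuts(data, nblocks);
        ExecutorService pool = Executors.newFixedThreadPool(nblocks);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int i = 0; i < nblocks; i++) {
                final int from = cuts[i];
                final int to = cuts[i + 1];
                Callable<List<String>> task = () -> linesOf(data, from, to);
                futures.add(pool.submit(task));
            }
            List<String> all = new ArrayList<>();
            for (Future<List<String>> f : futures) all.addAll(f.get());
            return all;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        byte[] data = "a\nbb\nccc\ndddd\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(parseInParallel(data, 3)); // prints [a, bb, ccc, dddd]
    }
}
```

      Cutting at a newline byte is safe for UTF-8 input because the byte 0x0A never occurs inside a multi-byte sequence. This only works for formats in which a statement never spans lines.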

      In addition to co-threading, we could increase the data density:

      1. Pack TermIds. See [1,3].
      2. Use namespace compression [2].
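
      This ticket does not say how TermIds would be packed. Purely as an illustration of the idea of increasing data density, the sketch below stores a non-negative identifier as a length byte followed by only its significant bytes. The class `PackedId` and this particular layout are assumptions for the example, not the encoding used by the tickets cited above.

```java
/** Hypothetical sketch: variable-length, order-preserving packing of non-negative ids. */
final class PackedId {

    /** Encodes id as [number of value bytes][value bytes, big-endian]. */
    static byte[] pack(long id) {
        if (id < 0) throw new IllegalArgumentException("id must be non-negative");
        int nbytes = (64 - Long.numberOfLeadingZeros(id) + 7) / 8; // 0 when id == 0
        byte[] out = new byte[1 + nbytes];
        out[0] = (byte) nbytes;
        for (int i = 0; i < nbytes; i++) {
            out[1 + i] = (byte) (id >>> (8 * (nbytes - 1 - i)));
        }
        return out;
    }

    /** Decodes a value produced by pack(). */
    static long unpack(byte[] packed) {
        int nbytes = packed[0];
        long id = 0;
        for (int i = 0; i < nbytes; i++) {
            id = (id << 8) | (packed[1 + i] & 0xFF);
        }
        return id;
    }

    /** Unsigned lexicographic comparison, as a B+Tree would compare keys. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        byte[] p = pack(300L);
        System.out.println(p.length + " " + unpack(p)); // prints 3 300
    }
}
```

      Small identifiers take two or three bytes instead of a fixed eight, and because the length byte comes first, the packed form sorts in the same order as the numeric values under unsigned byte comparison.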

      We should also provide for the dynamic extension of the Vocabulary as part of this effort. My notes on this follow: Simply update the Vocabulary in the GRS. We have the rest of the byte values and then the rest of the short values all of which can be used. If we have a vocabulary item which is already in TERM2ID, BLOBS, or inline, then we store the TermId or BlobIV in the Vocabulary. What matters is that the TermIds, BlobIVs and Vocabulary are consistent. It is not necessary that all Vocabulary items are represented by inline IVs. It is necessary that the IVCache is set and that they are immediately available from the LexiconConfiguration.

      See https://docs.google.com/document/d/1R8tWnAQUWcXl4tMszPztvHmfyWrnHHhsNbmvNBqIEWU/edit for some documentation on work in progress in the load-performance branch.

      [1] BLZG-314 (TermIdEncoder)
      [2] BLZG-629 (PartlyInlineURIs)
      [3] BLZG-654 (Pack TIDs)
      [4] BLZG-658 (Use PSOutputStream/ConstantStore for large/small blobs)
      [5] BLZG-660 (Support PSOutputStream/InputStream at IRawStore)

          Issue Links

          1. Experiment with heap size, GC mode, and nursery settings (Sub-task, Open, Brad Bebee)
          2. RDF Parser and index writers should overlap (Sub-task, Done, bryanthompson)
          3. SEARCH index is written on one time too many? (Sub-task, Done, bryanthompson)
          4. Reduced TERM2ID scatter induced by UUIDs and other random things in URIs. (Sub-task, Open, bryanthompson)
          5. Expose more parallelism in lexicon index writes (Sub-task, Open, bryanthompson)
          6. Expose more parallelism by overlapping ID2TERM and SEARCH index writes with statement index writes (Sub-task, Open, bryanthompson)
          7. Run a pool of RDF parsers that target a single loader task. (Sub-task, Open, Unassigned)
          8. Pack TIDs (breaks binary compatibility) (Sub-task, Open, Unassigned)
          9. Poor person's durable queues pattern for DataLoader (Sub-task, Done, bryanthompson)
          10. Add option to make the DataLoader robust to files that cause rio to throw a fatal exception (Sub-task, Done, michaelschmidt)
          11. Add DataLoader option to run DumpJournal after each batch (Sub-task, Done, bryanthompson)
          12. Add putIfAbsent pattern for conditional insert (Sub-task, Done, bryanthompson)
          13. Schedule more IOs when loading data. (Sub-task, In Progress, bryanthompson)
          14. Improve read/replace singleton property value (Sub-task, Open, mikepersonick)
          15. StatementBuffer must flush on SP boundary for property graphs (Sub-task, Accepted, mikepersonick)
          16. Option to configure sail write connection for low level statement buffer writer behavior (Sub-task, Open, bryanthompson)
          17. PARALLEL iterator pattern in AccessPath or IRangeQuery (Sub-task, Accepted, bryanthompson)
          18. Add PREFETCH option for IRangeQuery (Sub-task, Open, martyncutcher)
          19. Implement support for DTE extension types for URIs (Sub-task, Done, mikepersonick)
          20. Examine impact of dirty list threshold vs direct buffer size on write cache performance for bulk load (Sub-task, Done, bryanthompson)
          21. Modify the default behavior for setting the clear/dirty list threshold (Sub-task, Done, michaelschmidt)
          22. Examine whether we can parallelize the warm-up procedure (Sub-task, Open, Unassigned)
          23. Examine whether we can parallelize DumpJournal -pages (Sub-task, Open, Unassigned)
          24. Update DataLoader documentation on the wiki (Sub-task, Done, maria.krokhaleva)
          25. Decrease storage overhead for small raw records (ConstantAllocator) (Sub-task, Open, martyncutcher)
          26. DataLoader.Options.FLUSH does not defer flush of StatementBuffer (Sub-task, Done, bryanthompson)
          27. Concurrent modification error in load-performance branch? (Sub-task, Done, bryanthompson)
          28. Dynamic extension of the Vocabulary (Sub-task, Open, Brad Bebee)
          29. Parallelize the VocabBuilder (Sub-task, Open, Unassigned)
          30. Review, document, and blog the VocabBuilder (Sub-task, Open, bradbebee)
          31. Merge load-performance to master (Sub-task, Done, Brad Bebee)
          32. DataLoader should sort files within each directory to establish a stable order for file loading (Sub-task, Done, michaelschmidt)
          33. Prefix and Suffix Inline URI Handler (Sub-task, Done, Brad Bebee)
          34. Use a more-intelligent Inlining Strategy by Default (Sub-task, Open, Brad Bebee)
          35. Consider strategies for allowing concurrent writers on a BTree/HTree (Sub-task, Open, bryanthompson)
          36. Add BTreeCounters for cache hit and cache miss (Sub-task, Done, bryanthompson)
          37. Add dynamic counters for the write retention queue so we can better understand the dynamics of the eviction policy (Sub-task, Done, Brad Bebee)
          38. Reduce commit latency by parallel checkpoint by level of dirty pages in an index (Sub-task, Done, Brad Bebee)
          39. Reduce commit latency by parallelizing delete block processing (Sub-task, Done, martyncutcher)
          40. Improve B+Tree/HTree write retention cache eviction throughput (Sub-task, In Progress, bryanthompson)
          41. AbstractBTree.touch() synchronization hot spot (Sub-task, Reopened, bryanthompson)
          42. Concurrent writers on B+Tree (and possibly HTree) (Sub-task, Open, bryanthompson)
          43. Growth in RWStore.alloc() cumulative time (but latency looks ok) (Sub-task, Done, martyncutcher)
          44. RWStore.showAllocators() must take the allocation lock (Sub-task, Done, martyncutcher)
          45. Remove touch() contention (Sub-task, Open, bryanthompson)
          46. Relax B+Tree underflow and overflow thresholds (Sub-task, Open, Alexandre Riazanov)
          47. Adjust B+Tree leaf split points to chose shorter separator keys. (Sub-task, Open, bryanthompson)

              People

              Assignee: bryanthompson
              Reporter: bryanthompson