  1. Blazegraph (by SYSTAP)
  2. BLZG-641

Improve load performance

    Details

      Description

      This is a feature request to improve the load rate into the single machine database. Currently, a single thread drives the parser and the index updates. The index updates themselves are executed against a thread pool, but the parser is not executing while the index updates are being performed.

      There are several ways in which load performance could be improved:

      1. Run a separate thread for the parser and buffer the parser output such that there is always data available for index updates.
      2. Run concurrent parser/loader tasks against the same connection. This could be done for the DataLoader, the InsertServlet, and the LOAD update operation.
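      Option 1 above can be illustrated with a bounded queue between a parser thread and the writer. This is only a minimal sketch under stated assumptions: the class and method names are hypothetical and are not Blazegraph APIs, "parsing" is a stand-in that trims each line, and "index updates" are a stand-in that counts statements.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: none of these names are Blazegraph APIs.
final class BufferedParserSketch {

    // Sentinel marking end of input (compared by identity, never by equals).
    private static final List<String> EOF = new ArrayList<>();

    /**
     * Parses lines on a dedicated thread and hands fixed-size chunks to the
     * calling thread, which plays the role of the index writer.
     * Returns the number of statements "written".
     */
    static int load(List<String> lines, int chunkSize, int queueCapacity) {
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(queueCapacity);

        Thread parser = new Thread(() -> {
            try {
                List<String> chunk = new ArrayList<>(chunkSize);
                for (String line : lines) {
                    chunk.add(line.trim()); // stand-in for real RDF parsing
                    if (chunk.size() == chunkSize) {
                        queue.put(chunk); // blocks when the writer falls behind
                        chunk = new ArrayList<>(chunkSize);
                    }
                }
                if (!chunk.isEmpty()) {
                    queue.put(chunk);
                }
                queue.put(EOF);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "parser");
        parser.start();

        int written = 0;
        try {
            // The writer drains chunks while the parser keeps filling the queue.
            for (List<String> chunk = queue.take(); chunk != EOF; chunk = queue.take()) {
                written += chunk.size(); // stand-in for the index updates
            }
            parser.join();
        } catch (InterruptedException e) {
            parser.interrupt();
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while loading", e);
        }
        return written;
    }
}
```

      The queue capacity bounds how far the parser may run ahead of the writer, so memory use stays fixed while the writer rarely has to wait for parsed data.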

      For line-oriented RDF formats and sparse matrix formats we can also break the file into blocks and assign the blocks to a thread pool. Each thread would find the start of the first complete line in its block and hand off the partial line that precedes it to the thread for the previous block. The threads could then read the data into the intermediate format more quickly, which would reduce the time spent parsing relative to the time spent writing the indexed data structures.
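      The block-splitting idea can be sketched on an in-memory byte array. The names below are hypothetical, and the sketch uses a common variant of the hand-off: instead of passing a leading partial line back to the previous block's thread, each task skips it and the previous task reads past its own block end to finish its last line. Every line is still processed exactly once.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: not a Blazegraph API.
final class BlockSplitSketch {

    /** Returns the offset of the first line that starts at or after 'from'. */
    static int firstLineStart(byte[] data, int from) {
        if (from == 0) {
            return 0;
        }
        int p = from;
        while (p < data.length && data[p - 1] != '\n') {
            p++;
        }
        return p;
    }

    /** Returns the lines that start in [start, end), reading past 'end' to finish the last one. */
    static List<String> linesInBlock(byte[] data, int start, int end) {
        List<String> out = new ArrayList<>();
        int pos = firstLineStart(data, start);
        while (pos < end && pos < data.length) {
            int eol = pos;
            while (eol < data.length && data[eol] != '\n') {
                eol++;
            }
            out.add(new String(data, pos, eol - pos, StandardCharsets.UTF_8));
            pos = eol + 1;
        }
        return out;
    }

    /** Splits 'data' into 'nblocks' byte ranges and extracts their lines in parallel, in file order. */
    static List<String> parallelLines(byte[] data, int nblocks) {
        ExecutorService pool = Executors.newFixedThreadPool(nblocks);
        try {
            int blockSize = (data.length + nblocks - 1) / nblocks;
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int i = 0; i < nblocks; i++) {
                final int start = Math.min(data.length, i * blockSize);
                final int end = Math.min(data.length, start + blockSize);
                futures.add(pool.submit(() -> linesInBlock(data, start, end)));
            }
            List<String> all = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                all.addAll(f.get());
            }
            return all;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

      A real loader would presumably use file channels or memory-mapped regions rather than a single array, and would feed each block's lines to a parser instead of collecting strings.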

      In addition to co-threading, we could increase the data density:

      1. Pack TermIds. See [1,3].
      2. Use namespace compression [2].
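      To make the data-density point concrete, here is a generic variable-length encoding for non-negative long identifiers. It is an assumption-laden illustration only: it is not the scheme proposed in [1] or [3], and the class name is hypothetical. Small identifiers take one or two bytes instead of a fixed eight.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of a base-128 variable-length encoding; not the
// TermIdEncoder or the packed-TID format referenced in this ticket.
final class PackedLongSketch {

    /** Encodes a non-negative long in 1 to 9 bytes, low-order 7 bits first. */
    static byte[] pack(long v) {
        if (v < 0) {
            throw new IllegalArgumentException("negative value: " + v);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream(9);
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80)); // high bit set: more bytes follow
            v >>>= 7;
        }
        out.write((int) v); // final byte has the high bit clear
        return out.toByteArray();
    }

    /** Decodes a value produced by pack(). */
    static long unpack(byte[] b) {
        long v = 0;
        int shift = 0;
        for (byte x : b) {
            v |= (long) (x & 0x7F) << shift;
            shift += 7;
            if ((x & 0x80) == 0) {
                break;
            }
        }
        return v;
    }
}
```

      Note that this low-order-first layout does not preserve numeric order under unsigned byte comparison, so a format intended for index keys would need an order-preserving variant.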

      We should also provide for the dynamic extension of the Vocabulary as part of this effort. My notes on this follow: Simply update the Vocabulary in the GRS. The remaining byte values, and then the remaining short values, are all available for new entries. If a vocabulary item is already in TERM2ID, in BLOBS, or inline, then we store its TermId or BlobIV in the Vocabulary. What matters is that the TermIds, BlobIVs, and Vocabulary are consistent. It is not necessary that all Vocabulary items are represented by inline IVs. It is necessary that the IVCache is set and that the items are immediately available from the LexiconConfiguration.
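      The "byte values first, then short values" allocation in the notes above might look roughly like the following. This is a hypothetical sketch, not the Blazegraph Vocabulary class: the code ranges (0..255 for one-byte codes, 256..65535 for two-byte codes) are assumptions, and it ignores the case where an item already has a TermId or BlobIV.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the class name and code ranges are assumptions.
final class ExtensibleVocabularySketch {

    private final Map<String, Integer> codes = new HashMap<>();
    private int nextByteCode = 0;     // assumed one-byte code space: 0..255
    private int nextShortCode = 256;  // assumed two-byte code space: 256..65535

    /** Returns the existing code for 'value', or assigns the next free one. */
    synchronized int add(String value) {
        Integer existing = codes.get(value);
        if (existing != null) {
            return existing;
        }
        int code;
        if (nextByteCode <= 0xFF) {
            code = nextByteCode++;      // use up the byte values first
        } else if (nextShortCode <= 0xFFFF) {
            code = nextShortCode++;     // then fall back to the short values
        } else {
            throw new IllegalStateException("vocabulary code space exhausted");
        }
        codes.put(value, code);
        return code;
    }

    /** Number of bytes needed to store a code under the assumed layout. */
    static int encodedLength(int code) {
        return code <= 0xFF ? 1 : 2;
    }
}
```

      Codes are only ever appended, so entries assigned earlier keep their values when the vocabulary grows, which is one way to keep previously written data consistent.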

      See https://docs.google.com/document/d/1R8tWnAQUWcXl4tMszPztvHmfyWrnHHhsNbmvNBqIEWU/edit for some documentation on work in progress in the load-performance branch.

      [1] BLZG-314 (TermIdEncoder)
      [2] BLZG-629 (PartlyInlineURIs)
      [3] BLZG-654 (Pack TIDs)
      [4] BLZG-658 (Use PSOutputStream/ConstantStore for large/small blobs)
      [5] BLZG-660 (Support PSOutputStream/InputStream at IRawStore)

          Issue Links

          1. Experiment with heap size, GC mode, and nursery settings (Sub-task, Open, Brad Bebee (Blazegraph))
          2. Reduced TERM2ID scatter induced by UUIDs and other random things in URIs. (Sub-task, Open, bryanthompson)
          3. Expose more parallelism in lexicon index writes (Sub-task, Open, bryanthompson)
          4. Expose more parallelism by overlapping ID2TERM and SEARCH index writes with statement index writes (Sub-task, Open, bryanthompson)
          5. Run a pool of RDF parsers that target a single loader task. (Sub-task, Open, Unassigned)
          6. Pack TIDs (breaks binary compatibility) (Sub-task, Open, Unassigned)
          7. Schedule more IOs when loading data. (Sub-task, In Progress, bryanthompson)
          8. Improve read/replace singleton property value (Sub-task, Open, mikepersonick)
          9. StatementBuffer must flush on SP boundary for property graphs (Sub-task, Accepted, mikepersonick)
          10. Option to configure sail write connection for low level statement buffer writer behavior (Sub-task, Open, bryanthompson)
          11. PARALLEL iterator pattern in AccessPath or IRangeQuery (Sub-task, Accepted, bryanthompson)
          12. Add PREFETCH option for IRangeQuery (Sub-task, Open, martyncutcher)
          13. Examine whether we can parallelize the warm-up procedure (Sub-task, Open, Unassigned)
          14. Examine whether we can parallelize DumpJournal -pages (Sub-task, Open, Unassigned)
          15. Dynamic extension of the Vocabulary (Sub-task, Open, Brad Bebee (Blazegraph))
          16. Parallelize the VocabBuilder (Sub-task, Open, Unassigned)
          17. Review, document, and blog the VocabBuilder (Sub-task, Open, bradbebee)
          18. Use a more-intelligent Inlining Strategy by Default (Sub-task, Open, Brad Bebee (Blazegraph))
          19. Consider strategies for allowing concurrent writers on a BTree/HTree (Sub-task, Open, bryanthompson)
          20. Improve B+Tree/HTree write retention cache eviction throughput (Sub-task, In Progress, bryanthompson)
          21. AbstractBTree.touch() synchronization hot spot (Sub-task, Reopened, bryanthompson)
          22. Concurrent writers on B+Tree (and possibly HTree) (Sub-task, Open, bryanthompson)
          23. Remove touch() contention (Sub-task, Open, bryanthompson)
          24. Relax B+Tree underflow and overflow thresholds (Sub-task, Open, Alexandre Riazanov)
          25. Adjust B+Tree leaf split points to chose shorter separator keys. (Sub-task, Open, bryanthompson)

              People

              Assignee: bryanthompson
              Reporter: bryanthompson
              Votes: 0
              Watchers: 9
