Uploaded image for project: 'Blazegraph (by SYSTAP)'
  1. Blazegraph (by SYSTAP)
  2. BLZG-369

Implement an persistence capable hash table to support analytic query

    Details

      Description

      A hash table provides O(1) lookup, but does not support range scans. However, a hash table is suitable for a number of index requirements. For example, it would be a better fit for the ID2TERM and TERM2ID indices in the triple/quad store lexicon. A hash table is also required for various kinds of analytic query operations, including hash-joins, high data volume distincts, and high data volume fact tables for aggregation queries. The java hash table implementations are suitable for transient data structures as long as the data scale is small (when compared to the scale of a database index). However, Java garbage collectors are strongly challenged when presented with lots of long lived objects such as would be generated by a hash join. Therefore, we need to implement a persistence capable hash table to support analytic query (getting the data out of the Java heap and into either direct ByteBuffers or onto the disk). The hash table can also be used to improve lexicon performance.

      There are two broad categories of hash maps: static and dynamic. All
      in all, static hashing is problematic unless the maximum #of records is known in advance. The broad class of dynamic hashing includes extendable (or extensible) hashing, which is based based on an address table) on the one hand and linear hashing and extended) spiral hashing, which do not use an address table, on the other hand. (There are also some hybrid designs. For example, in bounded disorder a B+Tree is combined with hash tables for the leaf nodes which are substantially larger than the B+Tree page size (being made up multiple pages themselves). Bounded disorder trees are an interesting hybrid which supports out of order key-range scans and has many of the benefits of a hash table (access to a record in one or two IOs assuming that the upper part of the B+Tree nodes is wired into memory). However, the index segment already possesses many of these characteristics as the nodes region is fully buffered in memory and we do one IO (guaranteed) per leaf.)

      Considering the dynamic hashing algorithms, it seems likely that the choice of the algorithm will come down to whether we have a bare file index format for hashing (extended spiral hashing might be the best choice if we develop new persistence store specifically for this purpose), whether we use RWStore for page allocation management, and whether or not the hash table is integrated with the MVCC architecture (based on copy-on-write combined with eventual release or recycling of deleted records).

        Activity

        Hide
        bryanthompson bryanthompson added a comment -

        Martyn wrote: Updates to the HTree supporting removal, use of the FrontCodedRaba for ordered key values and the lazy creation of BucketPages.

        Committed revision r5312.

        Show
        bryanthompson bryanthompson added a comment - Martyn wrote: Updates to the HTree supporting removal, use of the FrontCodedRaba for ordered key values and the lazy creation of BucketPages. Committed revision r5312.
        Hide
        bryanthompson bryanthompson added a comment -

        Martyn is going to work through the page ellision logic under removal next. He is also going to work up a demonstration program showing on you can have a very small JVM heap (32M) and a very large HTree instance (as much as you have RAM).

        Show
        bryanthompson bryanthompson added a comment - Martyn is going to work through the page ellision logic under removal next. He is also going to work up a demonstration program showing on you can have a very small JVM heap (32M) and a very large HTree instance (as much as you have RAM).
        Hide
        bryanthompson bryanthompson added a comment -

        The page elision on insert work is done. Martyn is now working on page elision on removal. He is also working on a performance comparison of the HTree versus the Java HashMap / LinkedHashMap classes so we can have some sense of what that performance tradeoff looks like.

        Show
        bryanthompson bryanthompson added a comment - The page elision on insert work is done. Martyn is now working on page elision on removal. He is also working on a performance comparison of the HTree versus the Java HashMap / LinkedHashMap classes so we can have some sense of what that performance tradeoff looks like.
        Hide
        bryanthompson bryanthompson added a comment -

        Martyn fixed two bugs pertaining to ensuring that rawRecords array is updated when by insertRawTuple. These bugs were blocking the high volume htree hash join queries.

        Show
        bryanthompson bryanthompson added a comment - Martyn fixed two bugs pertaining to ensuring that rawRecords array is updated when by insertRawTuple. These bugs were blocking the high volume htree hash join queries.
        Hide
        bryanthompson bryanthompson added a comment -

        Bug fix dealing with flush as triggered by TestReopen.

        Show
        bryanthompson bryanthompson added a comment - Bug fix dealing with flush as triggered by TestReopen.

          People

          • Assignee:
            martyncutcher martyncutcher
            Reporter:
            bryanthompson bryanthompson
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved: