Details

    • Type: Sub-task
    • Status: Reopened
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Bigdata Federation
    • Labels:

      Description

      Make the DataService highly available using HDFS. There are really only a few things that we need to handle:

      1. Where to store the root blocks. Currently the root blocks are written on the head of the file, but that does not work for HDFS. Choices include:

      - Zookeeper (they are small and 1:1 with the journal files for each DS node).
      - HDFS in a file with the same basename as the journal but using a .rbs extension.
      - An object/blob store such as S3.

      @martyncutcher suggests writing the root blocks at the end of the HDFS file. This would work as long as the file append and the file length information are updated atomically, which seems likely.
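
      As a non-authoritative sketch of the second option, the root blocks could live in a sibling HDFS file with a .rbs extension. Everything below (the RootBlockStore name, the fixed ROOT_BLOCK_BYTES size) is illustrative rather than existing code:

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Hypothetical helper that keeps the two alternating root blocks for a
 * journal in a sibling HDFS file with a .rbs extension (rootBlock0
 * followed by rootBlock1, each of an assumed fixed size).
 */
public class RootBlockStore {

    /** Assumed size of a serialized root block (illustrative). */
    static final int ROOT_BLOCK_BYTES = 340;

    private final FileSystem fs;
    private final Path rbsFile;

    public RootBlockStore(final Configuration conf, final Path journalFile)
            throws IOException {
        this.fs = FileSystem.get(conf);
        // Same basename as the journal, but with a .rbs extension.
        this.rbsFile = journalFile.suffix(".rbs");
    }

    /** Overwrite both root blocks and force them to stable storage. */
    public void write(final byte[] rootBlock0, final byte[] rootBlock1)
            throws IOException {
        // NB: a production version would write a new file and rename it
        // into place so that a crash cannot leave a torn .rbs file.
        try (FSDataOutputStream out = fs.create(rbsFile, true/* overwrite */)) {
            out.write(rootBlock0);
            out.write(rootBlock1);
            out.hsync(); // flush through to the datanodes' disks
        }
    }

    /** Read back one of the two root blocks (slot 0 or 1). */
    public byte[] read(final int slot) throws IOException {
        final byte[] b = new byte[ROOT_BLOCK_BYTES];
        try (FSDataInputStream in = fs.open(rbsFile)) {
            in.readFully((long) slot * ROOT_BLOCK_BYTES, b);
        }
        return b;
    }
}
{code}

      The end-of-file variant suggested above would replace the separate file with an append plus the same hsync(), relying on the append and the reported file length being updated atomically.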

      2. The file extension policy needs to be reviewed. We currently pre-allocate 200MB and then incrementally extend the file according to a hard-coded policy based on the initial extent (200MB) and the maximum extent (also 200MB, which is the target size for the live journal before it is closed out and a new one is opened). However, even assuming that HDFS allows us to pre-extend the file, this rapid growth could leave holes that we cannot write into. We need to take a look at the HDFS file append mechanisms and find a way to hide the fact of file extension from the WORMStrategy and the AbstractJournal (see the sketch below).
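
      One possible way to hide extension, sketched only as an illustration and assuming the HDFS-backed journal is effectively append-only: an adapter that reports a growing logical extent to the journal layer without ever pre-allocating anything physically. The class name and increment below are hypothetical.

{code:java}
/**
 * Hypothetical adapter that decouples the logical extent seen by the
 * WORMStrategy / AbstractJournal from the physical length of the
 * append-only HDFS file.  All names and sizes are illustrative.
 */
public class LogicalExtentAdapter {

    /** Assumed initial and incremental extent (200MB in the current policy). */
    private static final long EXTENT_INCREMENT = 200L * 1024 * 1024;

    /** Logical extent reported to the journal layer. */
    private long logicalExtent = EXTENT_INCREMENT;

    /** Physical bytes actually appended to the HDFS file so far. */
    private long bytesWritten = 0L;

    /** The journal asks for more room; we only bump a counter. */
    public synchronized void extend() {
        logicalExtent += EXTENT_INCREMENT;
    }

    /** Extent seen by the WORMStrategy; no physical pre-allocation occurs. */
    public synchronized long getExtent() {
        return logicalExtent;
    }

    /** Record an append of n bytes to the HDFS file. */
    public synchronized void didAppend(final long n) {
        bytesWritten += n;
        if (bytesWritten > logicalExtent)
            throw new IllegalStateException("write beyond logical extent");
    }
}
{code}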

      3. The design of Java IO is such that an interrupt during an IO transparently closes the backing FileChannel. This causes readers to observe an AsynchronousCloseException. The code base uses a re-opener pattern to transparently re-open such FileChannels in reader threads, which has very little overhead on local file systems. If HDFS also closes its file access layer transparently when there is an interrupt during an IO, then we need to make sure that the overhead associated with this re-opener pattern is also quite small. If it is not, this will have a severe adverse impact on query performance. Note that interrupts can occur in both reader and writer threads, but we only use the re-opener pattern in reader threads.
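
      For reference, a minimal sketch of the re-opener pattern against a plain FileChannel (class and method names are illustrative); the open question for HDFS is whether the equivalent of the reopen() step stays as cheap as it is on a local file system:

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/**
 * Minimal sketch of the re-opener pattern: if a concurrent interrupt
 * closes the shared FileChannel during an IO, transparently re-open the
 * channel and retry the read.  Class and method names are illustrative.
 */
public class ReopeningReader {

    private final Path file;
    private volatile FileChannel channel;

    public ReopeningReader(final Path file) throws IOException {
        this.file = file;
        this.channel = FileChannel.open(file, StandardOpenOption.READ);
    }

    /** Positioned read that survives an asynchronous channel close. */
    public void readFully(final ByteBuffer dst, long pos) throws IOException {
        while (dst.hasRemaining()) {
            try {
                final int n = channel.read(dst, pos);
                if (n < 0)
                    throw new IOException("EOF at offset " + pos);
                pos += n;
            } catch (ClosedChannelException e) {
                // Covers AsynchronousCloseException and
                // ClosedByInterruptException.
                if (Thread.currentThread().isInterrupted()) {
                    // This thread itself was interrupted; do not spin.
                    throw e;
                }
                // Another thread's interrupt closed the shared channel;
                // the cost of this re-open is what must stay small.
                reopen();
            }
        }
    }

    private synchronized void reopen() throws IOException {
        if (!channel.isOpen()) {
            channel = FileChannel.open(file, StandardOpenOption.READ);
        }
    }
}
{code}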

      4. We need to verify that HDFS permits a reader to access written data before a sync/close. If it does not, then we need to fully buffer the writes since the last commit point in memory. One possibility would be a large write cache such that writes are fully buffered up to the commit point (the WORMStrategy does support the write cache). We could let the write cache flush incrementally, but we would then have to retain the buffers on a list for read back until the commit. Alternatively, we could disallow incremental flush and require that the entire write set up to the group commit (plus any concurrent unisolated writes that would get melded into the next commit point) is buffered, in order to ensure read back without IO (and without the possibility of interrupted IO). The commit protocol (with the exclusive lock) would then evict the write cache to the HDFS file (and optionally sync the file), append the root block to the end of the file, and definitely sync the file. At that point the data would be restart safe.
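
      A minimal sketch of the fully-buffered variant, assuming reads must not touch un-synced data in the HDFS file: writes since the last commit live in an in-memory map keyed by their eventual file offset, reads check that map first, and the commit evicts everything and hsync()s. All names are illustrative:

{code:java}
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Illustrative write cache for an HDFS-backed WORM file: records are
 * buffered in memory until the commit, so read back never depends on
 * un-synced HDFS data.
 */
public class BufferedWormWriter {

    private final FileSystem fs;
    private final Path journal;
    private final FSDataOutputStream out;

    /** Records written since the last commit, keyed by their file offset. */
    private final Map<Long, byte[]> writeCache = new LinkedHashMap<>();

    /** Next write address (the offset the record will have once flushed). */
    private long nextAddr;

    public BufferedWormWriter(final FileSystem fs, final Path journal)
            throws IOException {
        this.fs = fs;
        this.journal = journal;
        this.out = fs.create(journal, false/* overwrite */);
        this.nextAddr = out.getPos();
    }

    /** Buffer a record; it is immediately readable, but not yet durable. */
    public synchronized long write(final byte[] record) {
        final long addr = nextAddr;
        writeCache.put(addr, record.clone());
        nextAddr += record.length;
        return addr;
    }

    /** Read back a record, from the cache if it has not been committed yet. */
    public synchronized byte[] read(final long addr, final int len)
            throws IOException {
        final byte[] cached = writeCache.get(addr);
        if (cached != null)
            return cached.clone();
        // Committed data: read from the (already synced) HDFS file.
        final byte[] b = new byte[len];
        try (FSDataInputStream in = fs.open(journal)) {
            in.readFully(addr, b);
        }
        return b;
    }

    /** Commit: evict the buffered write set to HDFS and sync it. */
    public synchronized void commit() throws IOException {
        for (final byte[] record : writeCache.values()) {
            out.write(record);  // LinkedHashMap preserves write order
        }
        out.hsync();            // force through to the datanodes' disks
        writeCache.clear();     // further reads hit the (now synced) file
    }
}
{code}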

      5. Allocations cannot be discarded, even on abort, once there has been a write on the HDFS file. Basically, we can only ever advance the allocator pointer. If we assume that the root block is at the end of the file, but there have been flushed writes without a new root block because of an abort(), then we can't find the old root block. Therefore the behavior really needs to be a single atomic write, and we need to do it whether we commit or abort. Abort would just write the old root block again.

      Whatever approach we use needs to correctly align the write addresses in the buffer and on the disk.
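
      Putting items 4 and 5 together, one possible shape for the commit/abort protocol with the root block at the end of the append-only file (illustrative names only). Both paths end with a root block append plus hsync(), so the last well-formed root block in the file is always the current one:

{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;

/**
 * Illustrative commit/abort protocol when the root block lives at the
 * end of the append-only HDFS file.  Both paths end with a root block
 * append plus hsync(), so a scan backwards from the end of the file
 * always finds the current root block.
 */
public class AppendOnlyCommitProtocol {

    private final FSDataOutputStream out;

    /** The last root block made durable (re-written on abort). */
    private byte[] lastRootBlock;

    public AppendOnlyCommitProtocol(final FSDataOutputStream out,
            final byte[] initialRootBlock) {
        this.out = out;
        this.lastRootBlock = initialRootBlock;
    }

    /** Evict the write set, then append and sync the new root block. */
    public synchronized void commit(final byte[] writeSet,
            final byte[] newRootBlock) throws IOException {
        out.write(writeSet);        // buffered writes since the last commit
        out.hflush();               // optional sync of the data
        out.write(newRootBlock);    // root block goes at the end
        out.hsync();                // definitely sync: now restart safe
        lastRootBlock = newRootBlock;
    }

    /**
     * Abort: allocations already flushed to HDFS cannot be reclaimed, so
     * just re-append the old root block and advance past the dead bytes.
     */
    public synchronized void abort() throws IOException {
        out.write(lastRootBlock);
        out.hsync();
    }
}
{code}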

        Activity

        bryanthompson added a comment -

        Closed. Not relevant to the new architecture.

        bryanthompson added a comment -

        As a side note, Martyn and I discussed two approaches to an RWStore for HDFS.

        1. A logical-to-physical mapping that is replaced at each commit. jdbm had this sort of architecture.
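
        A minimal sketch of approach 1 (illustrative names only): an immutable logical-to-physical snapshot that is published atomically at each commit, with uncommitted remappings kept in a pending map.

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

/**
 * Illustrative logical-to-physical translation layer that is replaced
 * wholesale at each commit (the jdbm-style approach).
 */
public class LogicalAddressMap {

    /** The committed map; readers see this snapshot. */
    private final AtomicReference<Map<Long, Long>> committed =
            new AtomicReference<>(new HashMap<>());

    /** Remappings made since the last commit. */
    private final Map<Long, Long> pending = new HashMap<>();

    /** Resolve a logical address: pending writes first, then the snapshot. */
    public synchronized long resolve(final long logicalAddr) {
        final Long phys = pending.getOrDefault(logicalAddr,
                committed.get().get(logicalAddr));
        if (phys == null)
            throw new IllegalArgumentException("unknown address: " + logicalAddr);
        return phys;
    }

    /** Record that a logical address now lives at a new physical address. */
    public synchronized void remap(final long logicalAddr, final long physicalAddr) {
        pending.put(logicalAddr, physicalAddr);
    }

    /** Publish a new immutable snapshot at the commit point. */
    public synchronized void commit() {
        final Map<Long, Long> next = new HashMap<>(committed.get());
        next.putAll(pending);
        committed.set(next);
        pending.clear();
    }

    /** Discard uncommitted remappings. */
    public synchronized void abort() {
        pending.clear();
    }
}
{code}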

        2. A segment-oriented storage model (like the MemStore) in which segments (which would be set to the block size) are recruited for reallocation once they become sufficiently sparse; otherwise writes go into a new segment. Each segment would be a file in HDFS. If a segment were recruited for reallocation due to sparsity, we would simply read in the old segment, apply the writes to the in-memory copy of the segment, and then write out the new segment. The segment numbers could be purely ascending, and the meta-allocation map could be updated when a new segment version was written such that all allocations for the old segment are now found in the new segment. After the commit we could simply delete the old segment numbers. On restart, any segments not found in the meta-allocation region would be deleted (they would be unreachable). A sketch of the compaction step follows.
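
        A sketch of the compaction step for approach 2, assuming each segment is an HDFS file named by an ascending segment number (the seg-<n> naming, class and method names are all illustrative):

{code:java}
import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Illustrative compaction of a sparse segment: read the old segment
 * file, apply the in-memory writes, and write a new segment under the
 * next ascending segment number.  The old file is deleted only after
 * the commit that publishes the updated meta-allocation map.
 */
public class SegmentCompactor {

    private final FileSystem fs;
    private final Path storeDir;

    public SegmentCompactor(final FileSystem fs, final Path storeDir) {
        this.fs = fs;
        this.storeDir = storeDir;
    }

    /**
     * @param oldSegNo the sparse segment recruited for reallocation
     * @param newSegNo the next ascending segment number
     * @param patches  offset-within-segment to replacement bytes
     * @return the path of the new segment file
     */
    public Path compact(final long oldSegNo, final long newSegNo,
            final Map<Integer, byte[]> patches) throws IOException {

        final Path oldSeg = new Path(storeDir, "seg-" + oldSegNo);
        final Path newSeg = new Path(storeDir, "seg-" + newSegNo);

        // 1. Read in the old segment (assumed sized to the HDFS block
        // size, so the length fits in an int).
        final long len = fs.getFileStatus(oldSeg).getLen();
        final byte[] buf = new byte[(int) len];
        try (FSDataInputStream in = fs.open(oldSeg)) {
            in.readFully(0L, buf);
        }

        // 2. Apply the writes to the in-memory copy of the segment.
        for (final Map.Entry<Integer, byte[]> e : patches.entrySet()) {
            System.arraycopy(e.getValue(), 0, buf, e.getKey(),
                    e.getValue().length);
        }

        // 3. Write out the new segment and sync it.
        try (FSDataOutputStream out = fs.create(newSeg, false/* overwrite */)) {
            out.write(buf);
            out.hsync();
        }

        // The meta-allocation map is then updated to point at newSegNo;
        // the old segment file is deleted only after that map commits.
        return newSeg;
    }

    /** Called after the commit: the old segment is now unreachable. */
    public void deleteOldSegment(final long oldSegNo) throws IOException {
        fs.delete(new Path(storeDir, "seg-" + oldSegNo), false/* recursive */);
    }
}
{code}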

        For both cases, we need to completely write out the meta-allocation information at each commit. This will have significant latency, but that latency can be hidden under group commit while still achieving high throughput.

        We also need to write out the root blocks. They could be placed at both the head and the tail of the meta-allocation file. A valid meta-allocation file would have the same root block region at both the head and the tail (a test for a partial write). The most current valid meta-allocation file would serve as the root block for the current version of the store. Any older meta-allocation files could then be deleted.
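
        An illustrative validity check for such a file, assuming a fixed-size root block region duplicated at the head and the tail; a mismatch (or a file too short to hold both copies) indicates a partial write and the file is ignored in favour of the previous one:

{code:java}
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Illustrative validity check for a meta-allocation file that carries
 * the same root block region at its head and its tail.
 */
public class MetaAllocationFileCheck {

    /** Assumed fixed size of the root block region (illustrative). */
    static final int ROOT_BLOCK_BYTES = 340;

    public static boolean isValid(final FileSystem fs, final Path metaFile)
            throws IOException {
        final long len = fs.getFileStatus(metaFile).getLen();
        if (len < 2L * ROOT_BLOCK_BYTES)
            return false; // cannot hold both copies: partial write
        final byte[] head = new byte[ROOT_BLOCK_BYTES];
        final byte[] tail = new byte[ROOT_BLOCK_BYTES];
        try (FSDataInputStream in = fs.open(metaFile)) {
            in.readFully(0L, head);
            in.readFully(len - ROOT_BLOCK_BYTES, tail);
        }
        // Same region at head and tail => the file was written completely.
        return Arrays.equals(head, tail);
    }
}
{code}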

        It may also make sense to have a write-once region for strings (raw record support). Such raw records are primarily used by the lexicon indices. However, they should be reclaimed if we delete a given namespace.


          People

          • Assignee:
            bryanthompson
            Reporter:
            bryanthompson
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated: