• Type: Sub-task
    • Status: Open
    • Resolution: Unresolved
    • Affects Version/s: BIGDATA_RELEASE_1_5_0
    • Fix Version/s: None
    • Component/s: Bigdata Federation
    • Labels:


      This ticket is to add HDFS support for highly available shard storage. HDFS is an append-only file system. The scale-out architecture is append-only except for the root blocks at the head of the WORM mode journal files. Those root blocks can instead be stored in a separate basename.rbs file. This makes the entire system append-only, so it is a great match with HDFS.
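The basename.rbs idea can be made concrete as an append-only root block log: each commit appends a fixed-size checksummed record, and on restart the reader scans the log and keeps the last record whose checksum validates, so a torn final write is simply ignored. This is a minimal sketch only; the record layout, sizes, and class names below are illustrative, not the actual journal root block format.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

/**
 * Sketch of an append-only root block log ("basename.rbs"). The log is
 * modeled by an in-memory buffer so the example is self-contained.
 */
public class RootBlockLog {

    static final int PAYLOAD_SIZE = 16;                  // illustrative
    static final int RECORD_SIZE = 8 + PAYLOAD_SIZE + 8; // counter + payload + crc

    private final ByteArrayOutputStream log = new ByteArrayOutputStream();

    /** Append a root block record for the given commit point. */
    public void append(long commitCounter, byte[] payload) throws IOException {
        ByteBuffer rec = ByteBuffer.allocate(RECORD_SIZE);
        rec.putLong(commitCounter);
        rec.put(payload, 0, PAYLOAD_SIZE);
        CRC32 crc = new CRC32();
        crc.update(rec.array(), 0, 8 + PAYLOAD_SIZE);
        rec.putLong(crc.getValue());
        log.write(rec.array());
    }

    /** Scan the log; return the commit counter of the last valid record, or -1. */
    public long currentCommitCounter() {
        byte[] data = log.toByteArray();
        long current = -1;
        for (int off = 0; off + RECORD_SIZE <= data.length; off += RECORD_SIZE) {
            ByteBuffer rec = ByteBuffer.wrap(data, off, RECORD_SIZE);
            long counter = rec.getLong();
            rec.position(rec.position() + PAYLOAD_SIZE); // skip payload
            long storedCrc = rec.getLong();
            CRC32 crc = new CRC32();
            crc.update(data, off, 8 + PAYLOAD_SIZE);
            if (crc.getValue() == storedCrc) {
                current = counter; // last valid record wins
            }
        }
        return current;
    }
}
```

Because records are only ever appended and the reader ignores a trailing invalid record, this layout needs no in-place update, which is the property HDFS requires.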

      See BLZG-27 (Simplify scale-out configuration and deployment)
      See HA MDS / DS (BLZG-277)
      See HA TXS (BLZG-278)

      See http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/introduction.html for top-level HDFS implementation assumptions.
      See https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html (FileSystem is the core API)
      See http://www.aosabook.org/en/hdfs.html

      There are several services with durable data. In principle the entire system is a write-once architecture, except that the current root blocks are stored at the head of the Journal backing file. Since HDFS does not support update, we need to put the current root block either into an object store (S3) or into ZooKeeper (which we are using anyway).
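If ZooKeeper holds the current root block, the update is a versioned compare-and-set, which is the semantics ZooKeeper's setData(path, data, expectedVersion) already provides. The sketch below models the znode with an in-memory cell so the logic is self-contained and testable; the class and method names are illustrative stand-ins, not Bigdata or ZooKeeper APIs.

```java
import java.util.concurrent.atomic.AtomicReference;

/**
 * Sketch of publishing the current root block through a versioned
 * compare-and-set (ZooKeeper setData semantics), modeled in memory.
 */
public class CurrentRootBlock {

    /** Immutable (version, data) pair, like a znode's stat plus data. */
    static final class Versioned {
        final int version;
        final byte[] data;
        Versioned(int version, byte[] data) { this.version = version; this.data = data; }
    }

    private final AtomicReference<Versioned> cell =
            new AtomicReference<>(new Versioned(0, new byte[0]));

    /**
     * Replace the root block only if the caller saw the latest version.
     * Returns the new version, or -1 on a lost race, in which case the
     * caller must re-read (as a ZooKeeper client does on BadVersion).
     */
    public int setData(byte[] newRootBlock, int expectedVersion) {
        Versioned cur = cell.get();
        if (cur.version != expectedVersion) return -1;
        Versioned next = new Versioned(cur.version + 1, newRootBlock);
        return cell.compareAndSet(cur, next) ? next.version : -1;
    }

    public Versioned getData() { return cell.get(); }
}
```

The versioned write is what prevents a demoted leader from silently clobbering a root block written by its successor.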

      - DataService/MetadataService. This service has three kinds of storage in scale-out:
        - The live Journal. This is a WORM mode Journal used to absorb writes.
        - Historical journals. Once a live Journal fills up, it is closed out to further writes and a new Journal is opened to buffer additional writes. The historical journals and index segments comprising shard views are eventually released once the last visible commit point is advanced beyond the last commit point stored on a given DataService.
        - IndexSegment files. These are write-once, read-many B+Tree files stored in key order on disk. The nodes of the B+Tree and the leaves of the B+Tree are each stored in their own region of the file. When an index segment is opened, the nodes region is read in a single IO into a direct memory buffer. While the leaves could also be fully buffered, the practice is to read them into memory on demand. Once a leaf is in memory, it is retained by the appropriate retention queue for the index segment. Index segments themselves are retained by the StoreManager on an LRU basis with a capacity and a cache timeout.
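The StoreManager retention policy described above (an LRU over open index segments with a hard capacity plus a cache timeout) can be sketched with an access-ordered LinkedHashMap. The names and the String stand-in for an open IndexSegment are illustrative, not the actual StoreManager API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of LRU retention with a capacity and a cache timeout. */
public class SegmentCache {

    static final class Entry {
        final String segment;  // stands in for an open IndexSegment
        long lastAccess;
        Entry(String segment, long now) { this.segment = segment; this.lastAccess = now; }
    }

    private final int capacity;
    private final long timeoutMillis;
    private final LinkedHashMap<String, Entry> lru;

    public SegmentCache(int capacity, long timeoutMillis) {
        this.capacity = capacity;
        this.timeoutMillis = timeoutMillis;
        // accessOrder=true: iteration order is least-recently-used first.
        this.lru = new LinkedHashMap<String, Entry>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Entry> eldest) {
                return size() > SegmentCache.this.capacity; // evict LRU over capacity
            }
        };
    }

    /** Return the open segment, opening it on demand; refreshes its LRU position. */
    public synchronized String get(String name, long now) {
        Entry e = lru.get(name);
        if (e == null) {
            e = new Entry(name, now); // "open" the segment on demand
            lru.put(name, e);
        }
        e.lastAccess = now;
        return e.segment;
    }

    /** Close segments idle longer than the cache timeout. */
    public synchronized void sweep(long now) {
        lru.values().removeIf(e -> now - e.lastAccess > timeoutMillis);
    }

    public synchronized int size() { return lru.size(); }
}
```

Capacity bounds the number of open file handles and buffered node regions, while the timeout lets an idle DataService release memory without waiting for capacity pressure.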

      Archival storage (http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html). It looks like we could transparently take advantage of the HDFS storage policies (Hot, policy ID 7, is the default). This would be an excellent match with an immortal store, where older journals and index segments are never released, just moved to archival storage. We would need to parallelize the file system scan during DataService startup for it to remain efficient when the number of journal and index segment files becomes large (10,000+). I do not see any other impact. This could be enabled by a simple configuration parameter on the com.bigdata.service.jini.TransactionServer component: new NV(TransactionServer.Options.MIN_RELEASE_AGE, Long.MAX_VALUE.toString())
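In context, the parameter would appear in the TransactionServer component block of the scale-out configuration file. This fragment is a sketch; the surrounding layout should be checked against the deployment's actual configuration:

```
com.bigdata.service.jini.TransactionServer {
    properties = new NV[] {
        // Never advance the release time: historical journals and index
        // segments are retained forever (immortal store), which pairs
        // with moving cold files to HDFS archival storage.
        new NV(TransactionServer.Options.MIN_RELEASE_AGE,
               Long.toString(Long.MAX_VALUE))
    };
}
```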

      Failover needs to be realized for the services. This is simply a matter of having a leader election for each active service. There is existing code for this purpose based on a ZooKeeper master election. It just needs to be tested, and appropriate deployment and management protocols need to be written up.
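The decision step of a ZooKeeper-style election over ephemeral sequential znodes is pure logic and can be sketched on its own: the member with the lowest sequence number leads, and every other member watches only its immediate predecessor, which avoids a herd effect when the leader fails. The zero-padded "member-NNNNNNNNNN" names below are the illustrative form ZooKeeper gives sequential znodes; the class is a sketch, not the existing election code.

```java
import java.util.Collections;
import java.util.List;

/** Sketch of the decision step in an ephemeral-sequential leader election. */
public class ElectionDecision {

    /** True iff {@code self} holds the lowest sequence among {@code members}. */
    public static boolean isLeader(String self, List<String> members) {
        return self.equals(Collections.min(members));
    }

    /**
     * The member this node should watch: its immediate predecessor in
     * sequence order, or null if this node is the leader. Zero-padded
     * sequence suffixes make lexicographic order equal numeric order.
     */
    public static String predecessorToWatch(String self, List<String> members) {
        String pred = null;
        for (String m : members) {
            if (m.compareTo(self) < 0 && (pred == null || m.compareTo(pred) > 0)) {
                pred = m;
            }
        }
        return pred;
    }
}
```

When the watched predecessor's ephemeral znode disappears, the watcher re-reads the member list and re-runs this decision, so exactly one member promotes itself at a time.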




            Assignee: Unassigned
            Reporter: bryanthompson