Details

    • Type: Sub-task
    • Status: Open
    • Resolution: Unresolved
    • Affects Version/s: BIGDATA_RELEASE_1_5_0
    • Fix Version/s: None
    • Component/s: Bigdata Federation
    • Labels: None

      Description

      This ticket is to add HDFS support for highly available shard storage. HDFS is an append-only file system. The scale-out architecture is append-only except for the root blocks on the WORM mode journal files. Those can be stored in a basename.rbs file instead. This makes the entire system append-only, so it is a great match for HDFS.

      See BLZG-27 (Simplify scale-out configuration and deployment)
      See HA MDS / DS (BLZG-277)
      See HA TXS (BLZG-278)

      See http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/introduction.html for top-level HDFS implementation assumptions.
      See https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html (FileSystem is the core API)
      See http://www.aosabook.org/en/hdfs.html

      There are several services with durable data. In principle the entire system is a write-once architecture, except that the current root blocks are stored at the head of the Journal backing file. Since HDFS does not support update-in-place, we need to put the current root block either into an object store (S3) or into zookeeper (which we are using anyway).
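
      A minimal sketch of the zookeeper option is below. It assumes the current root block fits comfortably in a single znode that is simply overwritten on each commit; the znode path and class name are hypothetical, and this is not the existing scale-out code.

      {{{
      import org.apache.zookeeper.CreateMode;
      import org.apache.zookeeper.KeeperException;
      import org.apache.zookeeper.ZooDefs;
      import org.apache.zookeeper.ZooKeeper;

      /**
       * Sketch: keep the current root block in a znode so the HDFS-backed
       * journal file itself never has to be updated in place.
       */
      public class RootBlockZkStore {

          private final ZooKeeper zk;
          private final String znode; // e.g. "/bigdata/ds1/journal-000017/rootBlock" (hypothetical)

          public RootBlockZkStore(final ZooKeeper zk, final String znode) {
              this.zk = zk;
              this.znode = znode;
          }

          /** Overwrite the current root block (creating the znode on the first commit). */
          public void putRootBlock(final byte[] rootBlock)
                  throws KeeperException, InterruptedException {
              if (zk.exists(znode, false) == null) {
                  zk.create(znode, rootBlock, ZooDefs.Ids.OPEN_ACL_UNSAFE,
                          CreateMode.PERSISTENT);
              } else {
                  // -1 == unconditional overwrite; a real implementation would
                  // use the expected version to detect competing writers.
                  zk.setData(znode, rootBlock, -1);
              }
          }

          /** Read back the most recent root block on (re)open. */
          public byte[] getRootBlock() throws KeeperException, InterruptedException {
              return zk.getData(znode, false, null);
          }
      }
      }}}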


      - DataService/MetadataService. This service has three kinds of storage in scale-out.
      - The live Journal. This is a WORM mode Journal used to absorb writes.
      - Historical journals. Once a live Journal fills up, it is closed out to further writes and a new Journal is opened to buffer additional writes. The historical journals and the index segments comprising shard views are eventually released once the last visible commit point is advanced beyond the last commit point stored on a given DataService.
      - IndexSegment files. These are write-once, read-many B+Tree files laid out in key order on disk. The nodes of the B+Tree and the leaves of the B+Tree are each stored in their own region of the file. When an index segment is opened, the nodes region is read in a single IO into a direct memory buffer (as sketched below). While the leaves could also be fully buffered, the practice is to read them into memory on demand. Once a leaf is in memory it is retained by the appropriate retention queue for the index segment. Index segments themselves are retained by the StoreManager on an LRU basis with a capacity and a cache timeout.
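
      For example, the single-IO read of the nodes region maps directly onto the positional-read API of the Hadoop FileSystem. The sketch below assumes the offset and extent of the nodes region are already known (e.g., from the index segment's checkpoint record); the class and method names are illustrative only.

      {{{
      import java.io.IOException;
      import java.nio.ByteBuffer;

      import org.apache.hadoop.fs.FSDataInputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      /**
       * Sketch: read the B+Tree nodes region of an index segment file on HDFS
       * into a direct memory buffer with a single positional read.
       */
      public class IndexSegmentNodesReader {

          public static ByteBuffer readNodesRegion(final FileSystem fs,
                  final Path segFile, final long nodesOffset, final int nodesExtent)
                  throws IOException {

              final byte[] tmp = new byte[nodesExtent];

              try (FSDataInputStream in = fs.open(segFile)) {
                  // Positional read: one IO, independent of the stream's own position.
                  in.readFully(nodesOffset, tmp, 0, nodesExtent);
              }

              final ByteBuffer buf = ByteBuffer.allocateDirect(nodesExtent);
              buf.put(tmp);
              buf.flip();
              return buf;
          }
      }
      }}}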

      Archival storage (http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html). It looks like we could transparently take advantage of storage policy ID 7 (HOT, which is the default). This would be an excellent match with an immortal store where older journals and index shards are never released, just moved to archival storage. We would need to parallelize the file system scan during DataService startup for it to be efficient when the number of journals and index segment files becomes large (10,000+); see the sketch below. I do not see any other impact. The immortal store behavior can be enabled by a simple configuration parameter on the com.bigdata.service.jini.TransactionServer component: new NV(TransactionServer.Options.MIN_RELEASE_AGE, Long.toString(Long.MAX_VALUE))
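
      A sketch of how that startup scan might be parallelized over HDFS is below. The directory layout (one subdirectory per store type under the service's data directory) and the thread count are assumptions; this is not the existing StoreManager code.

      {{{
      import java.io.IOException;
      import java.util.ArrayList;
      import java.util.List;
      import java.util.concurrent.ExecutionException;
      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.Future;

      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      /**
       * Sketch: list the subdirectories of a DataService data directory in
       * parallel so startup stays fast with 10,000+ journal and segment files.
       */
      public class ParallelStoreScan {

          public static List<FileStatus> scan(final FileSystem fs,
                  final Path serviceDir, final int nthreads)
                  throws IOException, InterruptedException, ExecutionException {

              final ExecutorService exec = Executors.newFixedThreadPool(nthreads);
              try {
                  // One listing task per top-level subdirectory (e.g. journals/, segments/).
                  final List<Future<FileStatus[]>> tasks = new ArrayList<>();
                  for (final FileStatus dir : fs.listStatus(serviceDir)) {
                      if (dir.isDirectory()) {
                          tasks.add(exec.submit(() -> fs.listStatus(dir.getPath())));
                      }
                  }
                  // Gather the per-directory listings into a single result.
                  final List<FileStatus> all = new ArrayList<>();
                  for (final Future<FileStatus[]> task : tasks) {
                      for (final FileStatus file : task.get()) {
                          all.add(file);
                      }
                  }
                  return all;
              } finally {
                  exec.shutdown();
              }
          }
      }
      }}}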

      Failover needs to be realized for the services. This is simply a matter of having a leader election for each active service. There is existing code for this purpose based on a zookeeper master election. It just needs to be tested, and the appropriate deployment and management protocols written up.
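
      The existing master election code is not reproduced here. Purely as an illustration, a per-service leader election can be as small as the sketch below, which uses Apache Curator's LeaderLatch recipe and a hypothetical election znode path; the existing code may be structured quite differently.

      {{{
      import org.apache.curator.framework.CuratorFramework;
      import org.apache.curator.framework.CuratorFrameworkFactory;
      import org.apache.curator.framework.recipes.leader.LeaderLatch;
      import org.apache.curator.retry.ExponentialBackoffRetry;

      /**
       * Sketch: elect one active instance per logical service. Standby
       * instances block in await() until they win the election.
       */
      public class ServiceFailoverSketch {

          public static void main(final String[] args) throws Exception {

              final CuratorFramework client = CuratorFrameworkFactory.newClient(
                      "localhost:2181", new ExponentialBackoffRetry(1000, 3));
              client.start();

              // One latch path per logical service, e.g. the DataService "ds1".
              try (LeaderLatch latch = new LeaderLatch(client, "/bigdata/election/ds1")) {
                  latch.start();
                  latch.await(); // blocks until this process becomes the leader
                  System.out.println("Now the active ds1: start serving requests.");
                  // ... run the service; losing the zookeeper session drops leadership.
              } finally {
                  client.close();
              }
          }
      }
      }}}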

        Activity

        bryanthompson added a comment -

        Brad wrote:

        https://github.com/SYSTAP/bigdata/tree/blazegraph_hdfs/blazeraph-hdfs is a branch and new Maven project to read and write from HDFS. It's just a wrapper around a sample program. You'll need to set up a local HDFS instance on your machine.
        
        Then (after updating for your local core-site.xml file):
        
        mvn package 
        
        java -Dcom.blazegraph.hdfs.core-site=/opt/hadoop-current/etc/hadoop/core-site.xml -jar target/blazegraph-hdfs-1.0-SNAPSHOT-jar-with-dependencies.jar /blazetest
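
        The wrapper program itself is not reproduced in the ticket. Functionally it amounts to something like the minimal sketch below, which assumes the program just writes a small test file under the directory named on the command line (e.g. /blazetest) and reads it back:

        {{{
        import java.io.BufferedReader;
        import java.io.InputStreamReader;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        /**
         * Sketch of the blazegraph-hdfs sample: load the core-site.xml named by
         * the com.blazegraph.hdfs.core-site system property, then write and read
         * back a small test file under the directory given as args[0].
         */
        public class HdfsSmokeTest {

            public static void main(final String[] args) throws Exception {

                final Configuration conf = new Configuration();
                conf.addResource(new Path(System.getProperty("com.blazegraph.hdfs.core-site")));

                final FileSystem fs = FileSystem.get(conf);
                final Path file = new Path(args[0], "hello.txt");

                try (FSDataOutputStream out = fs.create(file, true /* overwrite */)) {
                    out.write("hello from blazegraph-hdfs\n".getBytes("UTF-8"));
                }

                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(fs.open(file), "UTF-8"))) {
                    System.out.println(in.readLine());
                }

                fs.close();
            }
        }
        }}}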
        
        

        See http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html for the single machine cluster setup.

        Note: @bryan is using the pseudo cluster setup per http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation.

        *Make sure that zookeeper is running before starting HDFS.*

        Format the filesystem:

          $ bin/hdfs namenode -format
        

        Start NameNode daemon and DataNode daemon:

        $ sbin/start-dfs.sh
        

        The hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).

        Browse the web interface for the NameNode; by default it is available at:

        http://localhost:50070/
        

        Some notes from the link above:
        {{{
        Make the HDFS directories required to execute MapReduce jobs:

        $ bin/hdfs dfs -mkdir /user

        $ bin/hdfs dfs -mkdir /user/<username>

        Copy the input files into the distributed filesystem:

        $ bin/hdfs dfs -put etc/hadoop input

        Run some of the examples provided:

        $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep input output 'dfs[a-z.]+'

        Examine the output files:
        Copy the output files from the distributed filesystem to the local filesystem and examine them:

        $ bin/hdfs dfs -get output output

        $ cat output/*

        }}}

        View the output files on the distributed filesystem:

          $ bin/hdfs dfs -cat output/*
        

        When you're done, stop the daemons with:

          $ sbin/stop-dfs.sh
        
        martyncutcher added a comment -

        I have written a simple test with the LocalFileSystem. The FileSystem.append option is not available to the LocalFileSystem, but it is accessible if the RawLocalFileSystem is retrieved with getRaw(). This does not seem ideal, but DFS_SUPPORT_APPEND_KEY is true by default anyway, and setting it to true in the Configuration makes no difference.

        I was interested in the stream protocols for appending new data and reading it back.

        FSDataOutputStream.getPos() does not return the absolute file position, but rather the delta for this stream's own operations. Therefore, to determine the file offset it is necessary to get the file length prior to retrieving the append stream (for an existing file) and add this to the offset returned by getPos(). The length can be retrieved with fileSystem.getFileStatus(path).getLen().

        The data is immediately available to be read once fsout.hflush() is called; fsout.hsync() is not necessary.
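
        A condensed sketch of that protocol, run against HDFS proper (where append is supported) with a hypothetical path, is:

        {{{
        import java.io.IOException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        /**
         * Sketch: the absolute offset of newly appended data is
         * (file length before append) + stream.getPos(), and hflush() makes
         * the data visible to readers without requiring hsync().
         */
        public class AppendOffsetSketch {

            public static void main(final String[] args) throws IOException {

                final FileSystem fs = FileSystem.get(new Configuration());
                final Path path = new Path("/blazetest/append.dat"); // hypothetical

                // The file length before the append is the base offset of the new writes.
                final long base = fs.getFileStatus(path).getLen();

                final byte[] record = new byte[] { 1, 2, 3, 4 };
                long recordOffset = 0L;
                try (FSDataOutputStream out = fs.append(path)) {
                    recordOffset = base + out.getPos(); // absolute offset of this record
                    out.write(record);
                    out.hflush(); // readers can now see the record; hsync() is not required
                }

                // Read the record back at its absolute offset.
                final byte[] check = new byte[record.length];
                try (FSDataInputStream in = fs.open(path)) {
                    in.readFully(recordOffset, check);
                }

                fs.close();
            }
        }
        }}}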


          People

          • Assignee: Unassigned
          • Reporter: bryanthompson
          • Votes: 0
          • Watchers: 2

          Dates

          • Created:
          • Updated: