Blazegraph (by SYSTAP) › BLZG-197 BlazeGraph release 2.1 (Scale-out GA) › BLZG-3

Decouple the bulk loader configuration from the main bigdata configuration file



    • Type: Sub-task
    • Status: In Progress
    • Resolution: Unresolved
    • Affects Version/s: BLAZEGRAPH_RELEASE_1_5_1
    • Fix Version/s: None
    • Component/s: Bigdata Federation
    • Labels:


      Decouple the bulk loader configuration from the main bigdata configuration file. This will greatly simplify the main configuration file and make it possible to have sample configuration files for different bulk loader tasks.

      The specific triple or quad store instance winds up being provisioned the first time the bulk loader is run against the namespace for that triple/quad store. For purely historical reasons, the bulk loader is configured by two component sections in the bigdata configuration file:

      - lubm : This is where we are setting the properties which will govern the triple/quad store.

      - com.bigdata.rdf.load.MappedRDFDataLoadMaster : This is where we describe the bulk load job.

      The MappedRDFDataLoadMaster section also uses some back references into fields defined in the lubm section, but the entire lubm section could be folded into the MappedRDFDataLoadMaster section.

      At present, there are the following back references into the rest of the configuration file:

      bigdata.dataServiceCount : It seems that we should simply run with all logical data services found in jini/zookeeper.

      bigdata.clientServiceCount : It seems to me that the #of client services could default to all unless overridden.
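      To make the coupling concrete, a rough sketch of the current layout follows (Jini configuration syntax). Only the component names and the two back-referenced entries come from this ticket; every other entry name and value is illustrative.

```java
// Sketch of the current bigdata configuration layout. Only the
// component names (lubm, com.bigdata.rdf.load.MappedRDFDataLoadMaster)
// and the back-referenced entries (bigdata.dataServiceCount,
// bigdata.clientServiceCount) are from the ticket; the rest is
// illustrative.

bigdata {
    // Back-referenced by the bulk loader sections below.
    static dataServiceCount = 4;
    static clientServiceCount = 2;
}

lubm {
    // Properties which govern the provisioned triple/quad store.
    static namespace = "U8000"; // illustrative value
    // ... KB instance properties ...
}

com.bigdata.rdf.load.MappedRDFDataLoadMaster {
    // Describes the bulk load job; reaches back into the sections above.
    jobName = lubm.namespace;
    dataServiceCount = bigdata.dataServiceCount;
    clientServiceCount = bigdata.clientServiceCount;
}
```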

      There is also a relatively complex declaration of the services templates which describes which services must be running as a precondition for the bulk loader job. I propose that this should either be folded into the bulk loader code or these preconditions abolished, as they basically assert that the configured system must be running (see bigdataCluster.config#1848).

      awaitServicesTimeout = 10000;

      servicesTemplates = new ServicesTemplate[] {...}

      At bigdataCluster.config#1893, a template is established which says that the bulk loader will use dedicated client service nodes (rather than running the distributed bulk load job on the data service nodes, which can also host distributed job execution).

      clientsTemplate = new ServicesTemplate(...);
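      Filled out, those declarations look roughly like the following. The ServicesTemplate constructor arguments and the service class names are assumptions for illustration; only the entry names, the timeout value, and the ServicesTemplate type appear above. ServiceTemplate here is the standard net.jini.core.lookup.ServiceTemplate.

```java
// Illustrative sketch of the precondition declarations (Jini config
// syntax). ServicesTemplate constructor arguments and service class
// names are assumptions, not the actual bigdata API.

awaitServicesTimeout = 10000; // how long to wait for preconditions

// Services that must be discoverable before the bulk load job starts.
servicesTemplates = new ServicesTemplate[] {
    new ServicesTemplate(/* minMatches */ 1,
        new ServiceTemplate(null,
            new Class[] { TransactionServer.class }, // illustrative
            null),
        /* attributes */ null),
    // ... one template per required service type ...
};

// Run the distributed bulk load on dedicated client service nodes
// rather than on the data service nodes.
clientsTemplate = new ServicesTemplate(/* minMatches */ 2,
    new ServiceTemplate(null,
        new Class[] { ClientServer.class }, // illustrative
        null),
    /* attributes */ null);
```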

      I like to use distinct client service nodes because the bulk loader tends to be memory-hungry when it is buffering data for the shards in memory before writing on the data services. Running the distributed job on the data services adds to the burden of the data service nodes, and we still need to buffer the data and then scatter it to the appropriate shards. It is possible to reduce the memory demand of the bulk loader (by adjusting the queue capacity and chunk size used by the asynchronous write pipeline), and I have done this in the bigdataStandalone.config file. For this reason, the triple/quads store configurations are not "one size fits all". I will get into this in a follow-on email which addresses the performance tuning properties used in the configuration files.
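      As a hypothetical illustration of that trade-off (the real property keys are defined in the configuration files, e.g. bigdataStandalone.config; the names below are placeholders, not actual Blazegraph options):

```java
// Placeholder names only -- NOT the actual Blazegraph property keys.
// Smaller queue and chunk sizes for the asynchronous write pipeline
// trade throughput for a lower memory footprint while buffering
// shard writes on the client service nodes.
com.bigdata.rdf.load.MappedRDFDataLoadMaster {
    queueCapacity = 100;  // placeholder name
    chunkSize = 1000;     // placeholder name
}
```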

      There are also some optional properties which really should be turned off unless you are engaged in forensics:

      indexDumpDir = new File("@NAS@/" + jobName + "-indexDumps");

      indexDumpNamespace = lubm.namespace;

      Based on this, it seems that we could isolate the bulk loader configuration relatively easily into its own configuration file. That configuration file would only need to know a bare minimum of things:

      - jini groups and locators.
      - zookeeper quorum IPs and ports.
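      A standalone bulk loader configuration along these lines might then reduce to little more than discovery information. A sketch (the entry names follow common Jini and zookeeper configuration conventions and are assumptions beyond the two items listed above):

```java
// Minimal standalone bulk loader configuration (sketch). Only
// discovery information is required; everything else would live in
// the bulk loader's own configuration file. Entry names are
// conventional, not confirmed by this ticket.

jini {
    groups = new String[] { "bigdata" };       // jini groups
    locators = new LookupLocator[] {           // jini locators
        new LookupLocator("jini://host1/"),
    };
}

org.apache.zookeeper {
    // zookeeper quorum IPs and ports.
    servers = "host1:2181,host2:2181,host3:2181";
}
```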




            beebs Brad Bebee
            bryanthompson bryanthompson