  Blazegraph (by SYSTAP) / BLZG-468 Clean up and optimize blank node handling / BLZG-1613

Data files with a very large number of blank nodes cause JVM heap growth and can lead to an OutOfMemoryError


    Details

      Description

      I have been attempting to load some datasets into Blazegraph running on an EC2 i2.xlarge instance (4 vCPUs, 30 GiB RAM, 800 GB instance store SSD) and have been struggling to load relatively modest-sized datasets. The largest I have successfully loaded on this instance type was 100 million triples.

      Loads are carried out using the Bulk Loader described at https://wiki.blazegraph.com/wiki/index.php/Bulk_Data_Load with the following database properties file:

      # set the initial and maximum extent of the journal
      # Initial size of 200 MB
      com.bigdata.journal.AbstractJournal.initialExtent=209715200
      # Max size of 200 GB
      com.bigdata.journal.AbstractJournal.maximumExtent=214748364800
      
      # turn off automatic inference in the SAIL
      com.bigdata.rdf.sail.truthMaintenance=false
      
      # don't store justification chains, meaning retraction requires full manual 
      # re-closure of the database
      com.bigdata.rdf.store.AbstractTripleStore.justify=false
      
      # Don't compute the closure
      com.bigdata.rdf.store.DataLoader.closure=None
      
      # turn off the statement identifiers feature for provenance
      com.bigdata.rdf.store.AbstractTripleStore.statementIdentifiers=false
      
      # turn off the free text index
      com.bigdata.rdf.store.AbstractTripleStore.textIndex=false
      
      # RWStore (scalable single machine backend)
      com.bigdata.journal.AbstractJournal.bufferMode=DiskRW
      

      This is essentially a variant of https://github.com/earldouglas/blazegraph/blob/master/bigdata-sails/src/samples/com/bigdata/samples/fastload.properties modified to increase the maximum journal size and to disable closure computation.
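
      For completeness, the same properties file can also be driven programmatically through Blazegraph's Sesame-based SAIL API rather than the command-line bulk loader. The following is only a minimal sketch assuming the standard BigdataSail/BigdataSailRepository classes; the file names fastload.properties and data.ttl.gz are placeholders:

      import java.io.FileInputStream;
      import java.io.InputStream;
      import java.util.Properties;
      import java.util.zip.GZIPInputStream;

      import org.openrdf.repository.RepositoryConnection;
      import org.openrdf.rio.RDFFormat;

      import com.bigdata.rdf.sail.BigdataSail;
      import com.bigdata.rdf.sail.BigdataSailRepository;

      public class TurtleLoadSketch {
          public static void main(String[] args) throws Exception {
              // Read the same database properties file shown above
              Properties props = new Properties();
              try (InputStream in = new FileInputStream("fastload.properties")) {
                  props.load(in);
              }

              // Open the journal via the SAIL and wrap it in a Sesame repository
              BigdataSail sail = new BigdataSail(props);
              BigdataSailRepository repo = new BigdataSailRepository(sail);
              repo.initialize();
              try {
                  RepositoryConnection cxn = repo.getConnection();
                  try {
                      cxn.begin();
                      // Decompress the gzipped Turtle on the fly and stream it in
                      try (InputStream data = new GZIPInputStream(
                              new FileInputStream("data.ttl.gz"))) {
                          cxn.add(data, "", RDFFormat.TURTLE);
                      }
                      cxn.commit();
                  } catch (Exception e) {
                      cxn.rollback();
                      throw e;
                  } finally {
                      cxn.close();
                  }
              } finally {
                  repo.shutDown();
              }
          }
      }

      The failing loads reported below were all run through the command-line bulk loader rather than this API, but the configuration in play is identical.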

      The dataset is 250 million triples in a single gzipped Turtle file (~3.2 GB).

      As a first attempt I allocated 6 GB to the JVM heap; the loader ran for several hours before eventually dying with a "GC overhead limit exceeded" error. I then increased the JVM heap to 8 GB and it ran for about 6 hours before I gave up and killed it.

      For reference, the 100 million triple version of the data loads successfully in about 80 minutes on the same EC2 setup.

      Is there anything obviously wrong with the above setup that I should address in order to get the load to work?


              People

              Assignee:
              michaelschmidt
              Reporter:
              Rob Vesse (rvesse)
              Votes:
              0
              Watchers:
              6

                Dates

                Created:
                Updated: