I have been attempting to load some datasets into Blazegraph running on an EC2 i2.xlarge instance (4 vCPUs, 30 GiB RAM, 800 GB instance-store SSD) and have been struggling with relatively modest-sized datasets. The largest I have successfully loaded on this instance type was 100 million triples.
Loads are carried out using the Bulk Loader described at https://wiki.blazegraph.com/wiki/index.php/Bulk_Data_Load, with the following database properties file:
This is essentially a variant of https://github.com/earldouglas/blazegraph/blob/master/bigdata-sails/src/samples/com/bigdata/samples/fastload.properties, modified to increase the maximum journal size and to disable closure computation.
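For concreteness, the two modifications described above would look something like the following in the properties file (the property names are those used by the stock fastload.properties; the extent value here is an illustrative assumption, not the one actually used):

```properties
# Raise the maximum journal extent so the journal can grow well beyond
# the ~200MB default before overflow (value below is illustrative):
com.bigdata.journal.AbstractJournal.maximumExtent=209715200000

# Disable closure computation during the bulk load:
com.bigdata.rdf.store.DataLoader.closure=None
```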
The dataset is 250 million triples in a single gzipped Turtle file (~3.2 GB).
As a first attempt I allocated 6 GB to the JVM heap, and the loader ran for several hours before eventually dying with a "GC overhead limit exceeded" error. I then increased the JVM heap to 8 GB, and it ran for about 6 hours before I gave up and killed it.
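For reference, the attempts above were launched roughly along these lines (a sketch only: the jar name and file paths are assumptions, not taken from my actual setup):

```shell
#!/bin/sh
# Sketch of the bulk load invocation; adjust names to your installation.
JVM_OPTS="-server -Xmx8g"          # heap size per the second attempt above
CLASSPATH="bigdata-bundled.jar"    # assumed Blazegraph jar name
PROPERTIES="fastload.properties"   # the properties file shown above
DATA="dataset.ttl.gz"              # the ~3.2 GB gzipped Turtle file

# Printed as a dry run here; remove the leading echo to actually execute.
echo java $JVM_OPTS -cp "$CLASSPATH" com.bigdata.rdf.store.DataLoader \
  -verbose "$PROPERTIES" "$DATA"
```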
For reference, the successfully loaded 100-million-triple version of the data loads in about 80 minutes on the same EC2 setup.
Is there anything obviously wrong with the above setup that I should address in order to get the load to work?