A problem has been identified in which some client services may fail to make progress during a bulk data load. The symptom is that toldTriplesRestartSafeCount goes flat at some point for that client. The other client(s) may continue to process documents.
The stuck client will continue to make progress against some shards and will continue to parse documents, but it will be unable to reach completion on most (or all) documents because some threads are blocked awaiting the inner ReentrantLock on the documentRestartSafeLatch for some document. The stack trace passes through handleChunk() and into Latch.dec(), as can be seen below. Since the stack trace passes through handleChunk(), the client is unable to write any further data on the shard associated with that chunk until the lock is obtained. However, based on an examination of thread dumps, there is no thread holding the lock.
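The thread-dump finding (waiters on the lock, but no owner) can also be cross-checked from inside the JVM. The following is a minimal sketch, not part of bigdata: the class name ParkedThreadReport and the method describeLockWaiters() are hypothetical, and it relies only on the standard java.lang.management API. It lists each thread that is waiting on a ReentrantLock together with the owner the JVM reports for that lock, if any. It reflects only the JVM's own bookkeeping, so it is a diagnostic aid rather than proof of the cause.

```java
import java.lang.management.LockInfo;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper; not part of the bigdata code base.
public class ParkedThreadReport {

    /**
     * Returns one line per thread that is waiting on a ReentrantLock,
     * naming the thread that the JVM reports as the lock owner (if any).
     */
    public static List<String> describeLockWaiters() {
        List<String> report = new ArrayList<String>();
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isSynchronizerUsageSupported()) {
            return report; // this JVM cannot report java.util.concurrent locks
        }
        // lockedMonitors=false, lockedSynchronizers=true
        for (ThreadInfo ti : mx.dumpAllThreads(false, true)) {
            LockInfo lock = (ti == null) ? null : ti.getLockInfo();
            if (lock == null || !lock.getClassName().contains("ReentrantLock")) {
                continue; // not waiting on a ReentrantLock
            }
            String owner = ti.getLockOwnerName(); // null when no thread owns the lock
            report.add(ti.getThreadName() + " waits for " + lock
                    + " owner=" + (owner == null ? "<none>" : owner));
        }
        return report;
    }

    public static void main(String[] args) {
        for (String line : describeLockWaiters()) {
            System.out.println(line);
        }
    }
}
```

A line ending in owner=<none> for a thread that stays parked would match the symptom described above.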
The problem appears to be related to a JVM bug, http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6822370. While that bug is resolved in JDK1.6.0_18, that release has problems of its own which lead to segfaults. However, the bug report specifies a workaround, which is to pass "-XX:+UseMembar" as a JVM parameter.
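Since the workaround depends on the flag actually reaching each client JVM, it may help to verify it at startup. The sketch below is an illustration only: the class UseMembarCheck and its method hasFlag() are hypothetical names, and how JVM parameters are passed to the client services depends on the deployment configuration, which is not shown here. It uses the standard RuntimeMXBean.getInputArguments() call to inspect the arguments the running JVM was started with.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical startup check; not part of the bigdata code base.
public class UseMembarCheck {

    static final String FLAG = "-XX:+UseMembar";

    /** Returns true if the given JVM argument list contains the workaround flag. */
    public static boolean hasFlag(List<String> jvmArgs) {
        return jvmArgs.contains(FLAG);
    }

    public static void main(String[] args) {
        // Arguments passed to this JVM (excludes arguments to main()).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        if (hasFlag(jvmArgs)) {
            System.out.println(FLAG + " is set");
        } else {
            System.out.println("WARNING: " + FLAG + " is not set");
        }
    }
}
```

Running it as "java -XX:+UseMembar UseMembarCheck" on a JVM that accepts the flag reports that the flag is set.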
"com.bigdata.service.jini.JiniFederation.executorService424" daemon prio=10 tid=0x00002ab07366b800 nid=0x724d waiting on condition [0x0000000060188000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00002aaab43b4cf0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)