Blazegraph (by SYSTAP)
BLZG-444

CI hang in StressTestConcurrentUnisolatedIndices

    Details

    • Type: Bug
    • Status: Done
    • Resolution: Done
    • Affects Version/s: QUADS_QUERY_BRANCH
    • Fix Version/s: None
    • Component/s: Journal

      Description

      A hang has been observed in CI which appears to be a lost wake-up or concurrency hole pertaining to the dirty list in the write cache service.

      "com.bigdata.rwstore.RWStore$11" daemon prio=5 tid=101d34800 nid=0x12b442000 waiting on condition [12b441000]
         java.lang.Thread.State: WAITING (parking)
      	at sun.misc.Unsafe.park(Native Method)
      	- parking to wait for  <6fe84d9d0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
      	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
      	at com.bigdata.io.writecache.WriteCacheService$WriteTask.call(WriteCacheService.java:525)
      	at com.bigdata.io.writecache.WriteCacheService$WriteTask.call(WriteCacheService.java:497)
      	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
      	at java.lang.Thread.run(Thread.java:680) 
      

      I am going to file this as "critical" since we need to keep an eye on it while working to clean up CI and prepare the next release.

      Likely explanations include:

      - resource leakage under CI;

      - a lost signal in the JVM (this used to be a problem in some 1.6 releases, and who knows what's up with the OSX JVMs; the workaround was to enable the "membar" option on the JVM. This has since been fixed in standard JVM releases, since 1.6.0_18 if I recall);

      - incorrect propagation of an interrupt, a dropped exception, etc.
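      As a hypothetical illustration of the last possibility (this is not the actual Blazegraph code), a worker that swallows InterruptedException inside a condition wait simply parks again and never sees the shutdown signal, producing exactly the WAITING (parking) state in the dump above:

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class DroppedInterruptDemo {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition dirty = lock.newCondition();
    private volatile boolean open = true;

    // BROKEN: the interrupt is caught and ignored, so the loop re-enters
    // await() and the thread parks forever -- the shutdown signal is lost.
    void brokenWorkerLoop() {
        lock.lock();
        try {
            while (open) {
                try {
                    dirty.await();
                } catch (InterruptedException e) {
                    // dropped exception: interrupt status cleared and discarded
                }
            }
        } finally {
            lock.unlock();
        }
    }

    // FIXED: restore the interrupt status and exit so shutdown can proceed.
    void fixedWorkerLoop() {
        lock.lock();
        try {
            while (open) {
                try {
                    dirty.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the signal
                    return;
                }
            }
        } finally {
            lock.unlock();
        }
    }
}
```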

        Activity

        bryanthompson added a comment -

        The issue appears to be related to a shutdown of the writeCacheService. Our hypothesis is that the shutdown could have been interrupted while awaiting lock.writeLock().lock() in close(). If this were to occur, then the WriteTask would never get interrupted and thus would not terminate.

        Martyn is going to write a shutdown stress test for the WriteCacheService. I am going to rewrite close() as follows:


        - Use a CAS pattern for open rather than a volatile. This will provide atomic decision making for which thread gets to close() the WriteCacheService.


        - Modify close() to interrupt the WriteTask before gaining any locks.


        - Define an AsynchronousCloseException. Set an instance of this on firstCause and set 'halt'. This will provide a clear indication when a concurrent operation is terminated due to an asynchronous close, such as can occur when shutdownNow() is invoked on the Journal.

        I may also not take the write lock at all and simply close the various WriteCache buffers and clear 'current' without the lock. Once we have set firstCause, any exception will be reported as that firstCause rather than whatever it might have been.
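        The close() plan above can be sketched roughly as follows. This is a minimal illustration with hypothetical names (CloseSketch, writeTaskThread, the getters), not the actual WriteCacheService source, and it borrows java.nio.channels.AsynchronousCloseException where the ticket proposes defining one:

```java
import java.nio.channels.AsynchronousCloseException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

public class CloseSketch {
    private final AtomicBoolean open = new AtomicBoolean(true);
    private final AtomicReference<Throwable> firstCause = new AtomicReference<>();
    private volatile boolean halt = false;
    private volatile Thread writeTaskThread; // assumed handle on the WriteTask's thread

    public void close() {
        // CAS rather than a volatile flag: exactly one thread wins the
        // right to perform the close; everyone else returns immediately.
        if (!open.compareAndSet(true, false))
            return;

        // Record the asynchronous close as the firstCause and set halt, so
        // any concurrently failing operation is reported as this cause
        // rather than as a secondary symptom.
        firstCause.compareAndSet(null, new AsynchronousCloseException());
        halt = true;

        // Interrupt the WriteTask *before* taking any lock, so close()
        // cannot itself block behind the very task it must terminate.
        final Thread t = writeTaskThread;
        if (t != null)
            t.interrupt();

        // ... release the WriteCache buffers and clear 'current' ...
    }

    public boolean isOpen() { return open.get(); }
    public Throwable getFirstCause() { return firstCause.get(); }
}
```

        The key ordering is interrupt-then-lock: the original hang hypothesis is that close() was parked on the write lock and therefore never delivered the interrupt.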

        bryanthompson added a comment -

        Martyn, can you please close this out once you have that stress test working and can verify that the changes I made (per above) actually do resolve the problem? Thanks, Bryan

        bryanthompson added a comment -

        Martyn, I am not convinced that our last interpretation of this issue was correct. I think that we need to continue to clean up CI (open journals) before we can really tell what's going on in this thread dump.

        bryanthompson added a comment -

        Renamed the tracker issue to be more general. I am also going to attach a new thread dump.

        bryanthompson added a comment -

        We've restored some settings in the committed version of this stress test, including a timeout. Hopefully, that will prevent future CI deadlocks even though it does not really answer the question of why we are occasionally observing those deadlocks.

        Committed revision r4541.

        bryanthompson added a comment -

        This test is no longer hanging. We are occasionally observing checksum errors, but they are not related to the deadlock issue and are being pursued under a different ticket [1].

        [1] https://sourceforge.net/apps/trac/bigdata/ticket/282 (checksum errors in CI)


          People

          • Assignee:
            martyncutcher
            Reporter:
            bryanthompson
          • Votes:
            0
            Watchers:
            2

            Dates

            • Created:
              Updated:
              Resolved: