  1. Blazegraph (by SYSTAP)
  2. BLZG-778

Consensus protocol does not detect clock skew correctly.




      The code is designed to detect a situation in which timestamps do not move forward during the commit protocol. You can see the logic in Journal.BarrierState and Journal.InnerJournalTransactionService.

      Caused by: com.bigdata.util.ClocksNotSynchronizedException: service1=d7152918-a2c1-41ac-bfce-2f8c19be0dc7, serviceId2=e8a6f073-5caa-4bd4-9209-95bb61a3c6cc, skew=9783ms exceeds maximumSkew=5000ms.
      at com.bigdata.journal.AbstractJournal.assertBefore(AbstractJournal.java:1795)
      at com.bigdata.journal.Journal$BarrierState.run(Journal.java:522)
      at java.util.concurrent.CyclicBarrier.dowait(CyclicBarrier.java:213)
      at java.util.concurrent.CyclicBarrier.await(CyclicBarrier.java:355)
      at com.bigdata.journal.Journal$InnerJournalTransactionService.notifyEarliestCommitTime(Journal.java:1686)
      at com.bigdata.journal.AbstractJournal$BasicHA.notifyEarliestCommitTime(AbstractJournal.java:7288)

      The leader takes a timestamp from its timestamp factory at the start of the consensus protocol. This protocol coordinates an agreement among the services about the earliest commit point that will remain visible due to (a) the minReleaseAge; and (b) the earliestActiveTx on each service. The event sequence looks like this:

      - Leader: takes a timestamp, then issues a "gather" request to the followers.
      - Follower: notes its earliest visible commit point based on the earliest active transaction and the minReleaseAge. Takes a timestamp. Calls back to the leader with those data.
      - Leader: once all responses are received, takes 2nd timestamp.

      The leader then asserts that the first timestamp is before the timestamp on the followers. It also asserts that the timestamp on each follower is before the leader's second timestamp. If these assertions fail (by more than 5s as presently configured) then the com.bigdata.util.ClocksNotSynchronizedException is thrown.
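The ordering check described above can be sketched as follows. This is a minimal illustration, not the Blazegraph implementation: the class name, method names, and the map of follower timestamps are all hypothetical, and the sketch only assumes that each follower reports one timestamp between the leader's two.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

// Hypothetical sketch of the leader-side ordering check; not Blazegraph code.
class GatherOrderingSketch {

    /** True if 'earlier' does not come after 'later' by more than maxSkewMillis. */
    static boolean ordered(long earlier, long later, long maxSkewMillis) {
        return earlier - later <= maxSkewMillis;
    }

    /**
     * Checks leaderT1 <= followerT <= leaderT2 (within the tolerance) for every
     * follower and returns the first follower that violates it, if any.
     */
    static Optional<UUID> firstOutOfOrder(long leaderT1,
            Map<UUID, Long> followerTimes, long leaderT2, long maxSkewMillis) {
        for (Map.Entry<UUID, Long> e : followerTimes.entrySet()) {
            final long tf = e.getValue();
            if (!ordered(leaderT1, tf, maxSkewMillis)
                    || !ordered(tf, leaderT2, maxSkewMillis)) {
                return Optional.of(e.getKey());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        final Map<UUID, Long> followers = new LinkedHashMap<>();
        followers.put(UUID.randomUUID(), 1_200L);
        // Leader took t1=1000, the follower answered at 1200, leader took t2=1500.
        System.out.println(firstOutOfOrder(1_000L, followers, 1_500L, 5_000L));
    }
}
```

Returning the offending service rather than a boolean mirrors the exception message in the stack trace, which names both service UUIDs.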

      I think that the problem is in this method in AbstractJournal. As you can see, it is using the absolute value of the delta between the two timestamps. Thus, it will fail not only where there is clock skew (specifically, not only when t2 is before t1), but also where there is significant latency (e.g., due to a major GC pause on one service).

      protected void assertBefore(final UUID serviceId1, final UUID serviceId2,
              final long t1, final long t2) throws ClocksNotSynchronizedException {
          // Maximum allowed clock skew.
          final long maxSkew = getMaximumClockSkewMillis();
          // Absolute delta: does not distinguish t2 before t1 from t2 after t1.
          final long delta = Math.abs(t1 - t2);
          if (delta < maxSkew)
              return;
          throw new ClocksNotSynchronizedException("service1=" + serviceId1
                  + ", serviceId2=" + serviceId2 + ", skew=" + delta
                  + "ms exceeds maximumSkew=" + maxSkew + "ms.");
      }
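To make the difference concrete, here is a small sketch comparing the absolute-delta test with a signed test that only objects when t2 is earlier than t1. The class and method names are hypothetical and this is one possible direction for a fix, not the change that was actually committed.

```java
// Hypothetical comparison of the two checks; not Blazegraph code.
class AssertBeforeSketch {

    /** Mirrors the reported logic: fails whenever |t1 - t2| reaches maxSkew. */
    static boolean absCheckFails(long t1, long t2, long maxSkew) {
        return Math.abs(t1 - t2) >= maxSkew;
    }

    /** Signed variant: fails only when t2 is before t1 by at least maxSkew. */
    static boolean signedCheckFails(long t1, long t2, long maxSkew) {
        final long backwards = t1 - t2; // positive only if t2 is earlier than t1
        return backwards >= maxSkew;
    }

    public static void main(String[] args) {
        final long maxSkew = 5_000L;
        // t2 is 9783 ms AFTER t1, e.g. a long GC pause between the two readings.
        System.out.println("latency, abs check fails:    " + absCheckFails(0L, 9_783L, maxSkew));
        System.out.println("latency, signed check fails: " + signedCheckFails(0L, 9_783L, maxSkew));
        // t2 is 9783 ms BEFORE t1, i.e. the clocks really disagree.
        System.out.println("skew, signed check fails:    " + signedCheckFails(9_783L, 0L, maxSkew));
    }
}
```

With the signed form, a slow response is tolerated while a timestamp that moves backwards by more than the tolerance is still rejected.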

      This method on AbstractJournal provides the maximum allowed clock skew. We have not yet raised this as a configuration parameter, but we will do so. In the meantime, you could simply override the method to return a different value.

          protected long getMaximumClockSkewMillis() {
              return 5000;
          }
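As an illustration of the override workaround, the sketch below uses a stand-in base class rather than the real AbstractJournal, and reads the limit from a JVM system property. The class names and the property name `example.maximumClockSkewMillis` are assumptions made for this example; Blazegraph does not define them.

```java
// Stand-in for the journal class that owns the method; hypothetical.
class JournalLikeBase {
    protected long getMaximumClockSkewMillis() {
        return 5000;
    }
}

// Hypothetical subclass that lets the limit be raised without a code change.
class RelaxedSkewJournal extends JournalLikeBase {

    /** Assumed property name, used only in this sketch. */
    static final String PROP = "example.maximumClockSkewMillis";

    @Override
    protected long getMaximumClockSkewMillis() {
        // Long.getLong reads a system property and falls back to the default
        // when the property is unset or not a valid long.
        return Long.getLong(PROP, super.getMaximumClockSkewMillis());
    }

    public static void main(String[] args) {
        // Run with -Dexample.maximumClockSkewMillis=15000 to raise the limit.
        System.out.println(new RelaxedSkewJournal().getMaximumClockSkewMillis());
    }
}
```

Raising the limit only masks the latency case; it does not change how the two timestamps are compared.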




            bryanthompson bryanthompson