Details
- Type: Sub-task
- Status: Done
- Priority: Medium
- Resolution: Done
- Affects Version/s: BLAZEGRAPH_RELEASE_1_5_1
- Fix Version/s: BLAZEGRAPH_RELEASE_1_5_2
- Component/s: Journal
- Labels: None
Description
As described in BLZG-1236, we believe that a possible cause for some of the RWStore-related issues is a data race that arises when a mutation request is cancelled. Cancellation causes interrupts to be propagated to the various threads doing work for the task. Those interrupts are noticed eventually, but there is some possibility that an interrupt will not be noticed until after the top-level thread for the operation has invoked AbstractJournal.abort(). If this occurs, a write or free on the RWStore can be executed after the abort. That mutation would then be melded into the next commit group.
This is not a problem for reads since the task has been cancelled and the error on the failed read effectively gets buried as a knock-on error and not a first cause.
This might not always be a problem for writes. In the normal case, the allocation is probably just wasted space. But there might be code paths that do lead to problems.
For free(), this could lead to a "freeing bit not set" error or a similar kind of allocation error.
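The ordering described above can be replayed deterministically in a toy model. The sketch below is illustrative only: ToyStore, its pending/committed lists, and replay() are hypothetical names and do not correspond to the actual RWStore or AbstractJournal code. It assumes a store where abort() discards uncommitted writes and commit() publishes whatever is pending at that moment.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the race (hypothetical; NOT the Blazegraph API). */
public class LateWriteAfterAbort {

    /** Hypothetical stand-in for a store with commit groups. */
    static final class ToyStore {
        private final List<String> pending = new ArrayList<>();   // uncommitted writes
        private final List<String> committed = new ArrayList<>(); // published writes

        synchronized void write(String record) {
            pending.add(record);
        }

        /** Discards everything written since the last commit. */
        synchronized void abort() {
            pending.clear();
        }

        /** Publishes the pending writes and returns them as the commit group. */
        synchronized List<String> commit() {
            List<String> group = new ArrayList<>(pending);
            committed.addAll(group);
            pending.clear();
            return group;
        }
    }

    /** Replays the problematic ordering and returns the next commit group. */
    static List<String> replay() {
        ToyStore store = new ToyStore();
        store.write("cancelled-task: w1"); // written before the cancel
        store.abort();                     // top-level thread aborts
        store.write("cancelled-task: w2"); // worker had not yet noticed the interrupt
        store.write("next-task: w3");      // unrelated, legitimate write
        return store.commit();
    }

    public static void main(String[] args) {
        // prints [cancelled-task: w2, next-task: w3]
        System.out.println(replay());
    }
}
```

The write w2 belongs to the cancelled task but survives the abort, so it ends up in the same commit group as w3.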
This ticket is to create a branch in which we test a workaround. The workaround is just to add the following at the top of AbstractJournal.abort(). The purpose of this is to introduce enough latency into the abort() code path to significantly decrease the likelihood of this data race condition playing out.
Thread.sleep(1000/*ms*/);
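A minimal sketch of why the added latency helps, under stated assumptions: ToyJournal, lateWrites(), and the 200 ms of simulated work are all hypothetical and stand in for the real journal and task threads. The worker is interrupted while busy and performs one more write before it would notice the interrupt; abort() either discards pending writes immediately or sleeps first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

/** Toy model of the sleep-in-abort workaround (hypothetical; NOT the Blazegraph API). */
public class AbortLatencySketch {

    /** Hypothetical stand-in for the journal. */
    static final class ToyJournal {
        private final List<String> pending = new ArrayList<>();

        synchronized void write(String record) {
            pending.add(record);
        }

        synchronized int pendingCount() {
            return pending.size();
        }

        void abort(long latencyMillis) throws InterruptedException {
            // The workaround: give interrupted workers time to wind down first.
            Thread.sleep(latencyMillis);
            synchronized (this) {
                pending.clear();
            }
        }
    }

    /** Returns the number of writes that survived the abort. */
    static int lateWrites(long abortLatencyMillis) {
        ToyJournal journal = new ToyJournal();
        CountDownLatch started = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            started.countDown();
            // Simulate ~200 ms of work during which the interrupt goes unnoticed.
            long deadline = System.nanoTime() + 200_000_000L;
            while (System.nanoTime() < deadline) {
                // busy
            }
            journal.write("late write");
            // Only at this point would the worker check its interrupt status and exit.
        });
        try {
            worker.start();
            started.await();
            worker.interrupt();                // the mutation request is cancelled
            journal.abort(abortLatencyMillis); // top-level thread aborts
            worker.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return journal.pendingCount();
    }

    public static void main(String[] args) {
        System.out.println("abort without latency: " + lateWrites(0) + " late write(s)");
        System.out.println("abort with 1000 ms:    " + lateWrites(1000) + " late write(s)");
    }
}
```

The sleep only narrows the window. If a worker takes longer than the latency to notice its interrupt, the late write still lands after the abort.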
Issue Links
- relates to: BLZG-1313 alloc()/free() may be called after RWStore.reset() due to data race. (Open)