A problem has been observed where any of the doConditionalXXX() methods in QuorumActorBase, e.g., doConditionalMemberRemove(), can block forever awaiting a Condition that is never signaled. With reference to the code block below, one actor thread enters the while() loop inside the guard(). Concurrently, a different thread takes some other action that invalidates the action of the first thread. For example, it might cast a vote, which requires an addMember(). The invalidation probably occurs due to a race condition: when the Condition is signaled, one thread wakes up first and makes a modification, so the other thread does not observe its target Condition as satisfied when it wakes up. The result is not strictly a deadlock, but the two actions interfere such that one of them will never become satisfied.
One possible fix is to establish a queue and a single thread for the actor. Actions are placed onto the queue by the client, which then blocks and awaits their Future. Since only a single action executes at a time, no concurrent action can invalidate it, so its Condition should become true.
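A minimal sketch of the queue-plus-single-thread idea, assuming a plain java.util.concurrent executor is acceptable here. The class name SerialActionQueue and its methods are hypothetical and are not part of the existing quorum code.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: runs quorum actions one at a time on a single thread. */
final class SerialActionQueue {

    // A single worker thread backed by an unbounded FIFO queue, so at most
    // one action executes at any time and actions run in submission order.
    private final ExecutorService service = Executors.newSingleThreadExecutor();

    /** Queue an action. The caller blocks on the returned Future if it needs the outcome. */
    <T> Future<T> submit(final Callable<T> action) {
        return service.submit(action);
    }

    /** Stop the worker: discards queued actions and interrupts the running one. */
    void shutdownNow() {
        service.shutdownNow();
    }
}
```

A client would call submit(...) and then get() on the returned Future, which gives the same blocking behavior as today without two actions ever running concurrently.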
When the HAJournalServer enters the ErrorTask, it should clear the action queue and interrupt any running action.
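One way the clear-and-interrupt step could look, as a sketch only. ClearableActionQueue, clearAndInterrupt() and the pending set are hypothetical names; the sketch assumes each action is wrapped in a FutureTask so that it can be cancelled whether it is still queued or already running.

```java
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: a single-threaded action queue that can be cleared on error. */
final class ClearableActionQueue {

    // Exactly one worker thread, fed from an unbounded FIFO queue.
    private final ThreadPoolExecutor service = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<Runnable>());

    // Every action that has been submitted but has not yet completed.
    private final Set<Future<?>> pending = ConcurrentHashMap.newKeySet();

    <T> Future<T> submit(final Callable<T> action) {
        final FutureTask<T> ft = new FutureTask<T>(action) {
            @Override
            protected void done() {
                // Invoked on normal completion, failure, and cancellation.
                pending.remove(this);
            }
        };
        pending.add(ft);
        service.execute(ft);
        return ft;
    }

    /** What an error handler might call: cancel queued actions, interrupt the running one. */
    void clearAndInterrupt() {
        for (Future<?> f : pending) {
            f.cancel(true); // true => interrupt the worker if this action is running
        }
        service.purge(); // drop the cancelled tasks from the work queue
    }

    void shutdown() {
        service.shutdownNow();
    }
}
```

Unlike shutdownNow(), this leaves the worker thread alive, so the actor can accept new actions once the error has been handled.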
This pattern would also make it easier to write code that waits up to a timeout for a quorum action to succeed. Such code could be useful in HAJournalServer's beforeShutdownHook(), where it instructs the Quorum to enact a memberLeave(). This case is currently handled by explicit code in AbstractQuorum.terminate().
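With actions exposed as Futures, a bounded wait is a small helper. The sketch below is illustrative: TimedQuorumAction and runWithTimeout() are hypothetical names, and the action passed in stands in for something like a memberLeave() request.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch: wait up to a timeout for a queued action to finish. */
final class TimedQuorumAction {

    /**
     * Submits the action and waits at most the given time for it to complete.
     *
     * @return true if the action completed in time; false if it timed out,
     *         in which case the action is cancelled (and interrupted if running).
     */
    static boolean runWithTimeout(final ExecutorService service,
            final Callable<?> action, final long timeout, final TimeUnit unit)
            throws InterruptedException, ExecutionException {
        final Future<?> f = service.submit(action);
        try {
            f.get(timeout, unit);
            return true;
        } catch (TimeoutException e) {
            f.cancel(true); // give up on the action rather than leave it blocked
            return false;
        }
    }
}
```

A shutdown hook could call this with a short timeout and proceed with termination whether or not the action succeeded.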
To support interrupt, the code that awaits a lock in the QuorumActor (and the derived ZKQuorumImpl's actor methods) should use lock.lockInterruptibly() rather than lock.lock().
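The difference matters because a thread blocked in lock.lock() does not respond to interruption until it has acquired the lock, while lock.lockInterruptibly() throws InterruptedException. A minimal sketch of the locking pattern follows; InterruptibleActor and doAction() are hypothetical names, not the actual QuorumActor methods.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch: an actor method whose lock acquisition can be interrupted. */
final class InterruptibleActor {

    final ReentrantLock lock = new ReentrantLock();

    /**
     * Runs the body while holding the lock.
     *
     * @throws InterruptedException if the caller is interrupted before or
     *         while it waits for the lock; the body does not run in that case.
     */
    void doAction(final Runnable body) throws InterruptedException {
        lock.lockInterruptibly(); // unlike lock(), this responds to Thread.interrupt()
        try {
            body.run();
        } finally {
            lock.unlock();
        }
    }
}
```

Because the lock is acquired before the try block, the finally clause only runs when the lock is actually held, so an interrupted acquisition never triggers a spurious unlock().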
- We could remove the guard() code (no, we need this to interrupt the current action).
- Clean up AbstractQuorum.terminate()
- Always use lockInterruptibly().