As documented in BLZG-157 (Longevity Testing), there is an issue where the write replication protocol does not always resynchronize correctly in leader-fail or follower-fail scenarios.
The problem was first observed as unread data left in the socket channel, which must be cleared before the next payload can be accepted on that socket.
One approach has been to generate a marker for each payload and write that marker on the channel; the channel is then drained to the marker in order to resynchronize. However, there is some concern that this could drain indefinitely (or until the service was forced to change its upstream/downstream replication targets) if the marker or payload had been missed.
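The marker-drain idea, and the missed-marker hazard, can be sketched roughly as follows. This is an illustrative sketch only, not the actual HASendService/HAReceiveService code; the class name, method name, marker bytes, and stream-based framing are all hypothetical, and the matcher assumes a marker whose byte sequence does not overlap itself.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Hypothetical sketch of marker-based resynchronization: discard stale
 * bytes from the channel up to and including a known per-payload marker,
 * so the next read starts at a clean payload boundary.
 */
public class MarkerDrain {

    /**
     * Read until the marker byte sequence has been consumed. Returns the
     * number of stale bytes discarded before the marker, or -1 if EOF is
     * reached first (the "missed marker" case, which must abort rather
     * than drain forever). Simple matcher; assumes the marker does not
     * overlap itself (otherwise KMP-style backtracking would be needed).
     */
    static int drainToMarker(final InputStream in, final byte[] marker)
            throws IOException {
        int matched = 0; // marker bytes matched so far
        int skipped = 0; // stale bytes discarded before the marker
        int b;
        while ((b = in.read()) != -1) {
            if ((byte) b == marker[matched]) {
                matched++;
                if (matched == marker.length)
                    return skipped;
            } else {
                skipped += matched + 1; // partial match plus this byte
                matched = ((byte) b == marker[0]) ? 1 : 0;
            }
        }
        return -1; // EOF before marker: caller must fail over, not spin.
    }

    public static void main(String[] args) throws IOException {
        final byte[] marker = { (byte) 0xCA, (byte) 0xFE };
        // Three stale bytes, then the marker, then the next payload byte.
        final InputStream in = new ByteArrayInputStream(
                new byte[] { 1, 2, 3, (byte) 0xCA, (byte) 0xFE, 42 });
        System.out.println("skipped=" + drainToMarker(in, marker)); // skipped=3
        System.out.println("next=" + in.read());                    // next=42
        // Marker never written (or missed): bounded by EOF, returns -1.
        System.out.println(drainToMarker(
                new ByteArrayInputStream(new byte[] { 9, 9 }), marker)); // -1
    }
}
```

On a bounded stream the drain terminates at EOF, but on a live socket there is no EOF while the peer is up, which is exactly the infinite-drain concern noted above.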
To address this concern, we are developing a flowchart of the write replication protocol and examining the error-handling strategies in that logic more broadly. For example, another plausible resynchronization protocol would be based on simply closing the upstream/downstream socket channel and then transparently re-opening it.
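The close-and-reopen alternative can be sketched as below. Again this is only an illustration of the idea under stated assumptions, not Blazegraph code: a `java.nio.channels.Pipe` stands in for the upstream/downstream socket, and the class and method names are invented. The point is that discarding the channel discards any stale buffered bytes with it, so no marker search is needed.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;

/**
 * Hypothetical sketch of resynchronization by closing and transparently
 * re-opening the replication channel. A Pipe models the socket pair.
 */
public class ReopenResync {

    private Pipe pipe;

    ReopenResync() throws IOException {
        pipe = Pipe.open();
    }

    Pipe.SinkChannel sink() {
        return pipe.sink();
    }

    Pipe.SourceChannel source() {
        return pipe.source();
    }

    /**
     * Error path: close both ends (dropping any unread bytes) and open a
     * fresh channel, so the next read starts at a clean payload boundary.
     */
    void resync() throws IOException {
        pipe.sink().close();
        pipe.source().close();
        pipe = Pipe.open();
    }

    public static void main(String[] args) throws IOException {
        final ReopenResync link = new ReopenResync();

        // A partial/corrupt payload is left sitting unread in the channel...
        link.sink().write(ByteBuffer.wrap(new byte[] { 7, 7, 7 }));

        // ...so the error path tears the link down and re-opens it.
        link.resync();

        // The next payload is read intact, with no stale bytes ahead of it.
        link.sink().write(ByteBuffer.wrap(new byte[] { 42 }));
        final ByteBuffer dst = ByteBuffer.allocate(1);
        link.source().read(dst);
        System.out.println(dst.get(0)); // 42
    }
}
```

Unlike the marker drain, this protocol terminates in bounded time by construction, at the cost of re-establishing the connection on every resynchronization.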
Branch MGC_1_3_0 was created from r7608 of branches/BIGDATA_RELEASE_1_3_0.
This issue only occurs when there is a sure kill or sudden halt of a process; a normal shutdown will not trigger the issue documented on this ticket. As a workaround, it is possible to provoke a leader-fail event using /status?error on the quorum leader. This automatically clears the socket channels, since the quorum breaks and all HASendService and HAReceiveService instances are torn down. A new quorum will then form among the remaining services (including the service that was just forced into a temporary error state) and a new leader will be elected. Writes can then proceed against the new leader.