Develop and document a longevity and stress test protocol for HA QA.
- One tenant: load BSBM 100M, then run the UPDATE mixture on the leader with concurrent EXPLORE mixtures on the followers. The UPDATE mixture should run for at least three days. Query performance should remain flat (with only minor variance) across the entire run.
- Multiple tenants: Same, but with loads into multiple tenant namespaces and multiple UPDATE / EXPLORE mixtures running.
- Check that the Journals and HALogs are equal and consistent across the services. The HALogs can be checked while the cluster is accepting updates using the DumpLogDigests utility. The Journal and HALog digests can be compared easily via /status?digests once the #of HALogs has been reduced following a snapshot and there are no concurrent updates (see the digest-comparison sketch following this list).
- Verify that snapshots are taken per the default nightly policy.
- Verify that snapshots can be decompressed and put into use (e.g., a restore test). Run DumpJournal on the restored snapshot to look for data-level problems in the indices (see the restore sketch following this list).
- Run DumpJournal on the live journal after the tests to look for data-level problems in the indices.
- Do a range count per tenant on all instances to verify that each server reports the same statement count (see the range-count sketch following this list).
- Verify the rolling update pattern (fail over the services one after another, as if we were deploying a rolling update).
- Take down one service during a period of sustained updates. Let a large number of commits go through (hundreds), then restart the service. Verify that it resynchronizes and rejoins the quorum properly.
- Note the BSBM UPDATE + EXPLORE mixture throughput both with and without a concurrent query load on the leader and followers. This indicates the commit rate that the cluster can sustain.
- Examine the #of open files (which includes sockets), the #of threads, and the JVM heap size. These should remain bounded; a resource leak will show up in one of these metrics (see the monitoring sketch following this list).
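The digest check can be scripted. A minimal sketch, assuming each service exposes the /status?digests page mentioned above and that the digest report lines can be picked out by the substring "digest"; the host names, port, URL pattern, and response format are assumptions to be adjusted for the real deployment:

```python
#!/usr/bin/env python
"""Compare Journal/HALog digests across the HA services.

Assumes each service exposes /status?digests and that digest lines can
be identified by the substring "digest" -- both the URL pattern and the
response format are assumptions for the real deployment.
"""
import urllib.request

SERVICES = ["http://ha1:8080", "http://ha2:8080", "http://ha3:8080"]  # hypothetical hosts

def digest_lines(base_url):
    """Fetch the status page and keep only the digest report lines."""
    with urllib.request.urlopen(base_url + "/status?digests", timeout=60) as rsp:
        body = rsp.read().decode("utf-8", errors="replace")
    return sorted(line.strip() for line in body.splitlines()
                  if "digest" in line.lower())

reports = {url: digest_lines(url) for url in SERVICES}
reference = reports[SERVICES[0]]
for url, lines in reports.items():
    status = "OK" if lines == reference else "MISMATCH"
    print(f"{status}: {url} ({len(lines)} digest lines)")
```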
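For the snapshot restore test, a hedged sketch that assumes snapshots are gzip-compressed journal files; the snapshot path, classpath, and scratch location are placeholders:

```python
#!/usr/bin/env python
"""Decompress a snapshot and run DumpJournal against it.

The snapshot path and classpath are placeholders; snapshots are assumed
to be gzip-compressed journal files.
"""
import gzip, shutil, subprocess

SNAPSHOT = "/var/bigdata/snapshot/snapshot-000123.jnl.gz"  # hypothetical path
RESTORED = "/tmp/restored.jnl"
CLASSPATH = "bigdata.jar"  # adjust for the real deployment

# gunzip the snapshot to a scratch journal file.
with gzip.open(SNAPSHOT, "rb") as src, open(RESTORED, "wb") as dst:
    shutil.copyfileobj(src, dst)

# Run DumpJournal over the restored journal; fail loudly on errors.
subprocess.run(
    ["java", "-cp", CLASSPATH, "com.bigdata.journal.DumpJournal", RESTORED],
    check=True,
)
```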
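For the per-tenant range count, a sketch using a plain SPARQL COUNT against each tenant endpoint on each service; the multi-tenant endpoint layout (/namespace/{tenant}/sparql), host list, and namespace names are assumptions, and a native fast range count (if the server exposes one) would be cheaper than COUNT on large stores:

```python
#!/usr/bin/env python
"""Count statements per tenant namespace on every service and compare.

The endpoint layout, hosts, and tenant names are placeholders.
"""
import json, urllib.parse, urllib.request

SERVICES = ["http://ha1:8080/bigdata", "http://ha2:8080/bigdata",
            "http://ha3:8080/bigdata"]
TENANTS = ["tenant1", "tenant2"]  # hypothetical namespaces
QUERY = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"

def statement_count(base, tenant):
    url = (f"{base}/namespace/{tenant}/sparql?"
           + urllib.parse.urlencode({"query": QUERY}))
    req = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})
    with urllib.request.urlopen(req, timeout=300) as rsp:
        results = json.load(rsp)
    return int(results["results"]["bindings"][0]["n"]["value"])

for tenant in TENANTS:
    counts = {base: statement_count(base, tenant) for base in SERVICES}
    ok = len(set(counts.values())) == 1
    print(f"{'OK' if ok else 'MISMATCH'} {tenant}: {counts}")
```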
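For the resource checks, a sketch of a sampling loop; Linux is assumed (descriptor and thread counts come from /proc) and the heap sample is the raw `jstat -gc` output, logged for offline plotting rather than parsed:

```python
#!/usr/bin/env python
"""Periodically log open files, thread count, and heap use for a JVM.

Linux is assumed; the target pid is passed on the command line.
"""
import os, subprocess, sys, time

pid = sys.argv[1]
INTERVAL = 60  # seconds between samples

while True:
    fds = len(os.listdir(f"/proc/{pid}/fd"))  # open files, including sockets
    with open(f"/proc/{pid}/status") as f:
        threads = [l for l in f if l.startswith("Threads:")][0].split()[1]
    gc = subprocess.run(["jstat", "-gc", pid],
                        capture_output=True, text=True).stdout
    print(f"{time.strftime('%F %T')} fds={fds} threads={threads}")
    print(gc.strip().splitlines()[-1])  # raw jstat sample for offline plotting
    time.sleep(INTERVAL)
```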
Per BLZG-849, we should also issue some requests that cause abort2Phase() invocations during the longevity tests.
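One hedged way to provoke those invocations: submit SPARQL UPDATE requests that fail during execution (here, a LOAD from an unreachable URL), on the assumption that a failed update on the leader discards its write set through the 2-phase abort path; the leader endpoint URL is a placeholder:

```python
#!/usr/bin/env python
"""Fire an intentionally failing SPARQL UPDATE at the leader.

Assumption: an UPDATE that fails during execution discards its write
set via the 2-phase abort path (abort2Phase()). The URL is a placeholder.
"""
import urllib.error, urllib.parse, urllib.request

LEADER = "http://ha1:8080/bigdata/sparql"  # hypothetical leader endpoint
BAD_UPDATE = "LOAD <http://no-such-host.invalid/data.nt>"

data = urllib.parse.urlencode({"update": BAD_UPDATE}).encode("ascii")
try:
    urllib.request.urlopen(LEADER, data=data, timeout=120)
    print("unexpected success")
except urllib.error.HTTPError as e:
    # The request is *expected* to fail; the point is the server-side abort.
    print(f"update failed as intended: HTTP {e.code}")
```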
It would be useful to create a longevity testing harness that interacts with real processes running on three (or more) VMs, monitors those processes, applies an appropriate workload (e.g., BSBM EXPLORE+UPDATE on the leader and BSBM EXPLORE on the followers using some desired number of client threads), and suddenly kills (kill -9) a random service within some period of time after each time the quorum becomes fully met and all services are HAReady. This would allow us to stress the platform and its failover abilities. We could automate the verification of the HALog files (digest equality) and even occasionally restore and roll forward snapshots to verify the effective digest on the services as of a given commit point. An ACID stress test could be achieved by modifying the BSBM harness to do DELETE+INSERT or DROP ALL+LOAD tests in which the number of visible statements should never change. A sketch of the kill/restart loop at the heart of such a harness follows.
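A minimal sketch of that loop, assuming the services are reachable over ssh, that each service's /status page contains the token "HAReady" when the service is ready, and that a restart command exists on each VM; all of these, and the process name matched by pkill, are deployment-specific placeholders:

```python
#!/usr/bin/env python
"""Chaos loop: wait until the quorum is fully met and all services are
HAReady, then kill -9 a random service and restart it.

Hosts, status URL, restart command, and process pattern are placeholders.
"""
import random, subprocess, time, urllib.request

HOSTS = ["ha1", "ha2", "ha3"]
STATUS = "http://{host}:8080/status"
RESTART_CMD = "service blazegraph-ha start"  # hypothetical

def ha_ready(host):
    """True iff the service's status page reports it as HAReady."""
    try:
        with urllib.request.urlopen(STATUS.format(host=host), timeout=10) as rsp:
            return "HAReady" in rsp.read().decode("utf-8", errors="replace")
    except OSError:
        return False

while True:
    # Wait for a fully met quorum with every service HAReady.
    while not all(ha_ready(h) for h in HOSTS):
        time.sleep(10)
    # Let the workload run for a random interval, then kill one service.
    time.sleep(random.uniform(60, 600))
    victim = random.choice(HOSTS)
    subprocess.run(["ssh", victim, "pkill", "-9", "-f", "HAJournalServer"],
                   check=False)
    print(f"killed service on {victim}")
    time.sleep(30)
    subprocess.run(["ssh", victim, RESTART_CMD], check=False)
    print(f"restart issued on {victim}; waiting for resync")
```

The rest of the harness (workload driving, digest verification, snapshot roll-forward) would hang off this loop.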