Blazegraph (by SYSTAP) / BLZG-157

Longevity and stress test protocol for HA QA

    Details

      Description

      Develop and document a longevity and stress test protocol for HA QA.


      - One tenant: BSBM 100M load, then run the UPDATE mixture on the leader with a concurrent EXPLORE mixture on the followers. The UPDATE mixture should run for at least 3 days. Query performance should be flat (with minor variance) across the entire run.
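
      For example, the BSBM test driver could be pointed at the leader for the EXPLORE+UPDATE mixture and at the followers for EXPLORE only. This is only a sketch: the testdriver options, use case file paths, and update dataset below come from the BSBM distribution and should be verified against its documentation, and the /sparql endpoint URLs on the HA servers are assumptions.

      # Leader: EXPLORE + UPDATE mixture (verify options against the installed BSBM version).
      ./testdriver -runs 10000 -w 100 -mt 4 \
          -ucf usecases/exploreAndUpdate/sparql.txt \
          -udataset dataset_update.nt \
          -u http://bigdata15.bigdata.com:8090/sparql \
          http://bigdata15.bigdata.com:8090/sparql

      # Followers: EXPLORE only.
      ./testdriver -runs 10000 -w 100 -mt 4 \
          -ucf usecases/explore/sparql.txt \
          http://bigdata16.bigdata.com:8090/sparql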


      - Multiple tenants: Same, but with loads into multiple tenant namespaces and multiple UPDATE / EXPLORE mixtures running.


      - Check that the Journals and HALogs are equal and consistent. HALogs can be checked while accepting updates using the DumpLogDigests utility. Journal and HALog digests can be compared easily using /status?digests if the #of HALogs has been reduced following a snapshot and there are no concurrent updates.

      http://bigdata15.bigdata.com:8090/status?digests
      http://bigdata16.bigdata.com:8090/status?digests
      http://bigdata17.bigdata.com:8090/status?digests
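
      A minimal sketch for comparing those reports from the command line. It assumes the /status?digests pages can be fetched with curl and that the digest lines can be compared textually; in practice the pages also contain host-specific details, so it may be necessary to grep out just the digest lines before diffing.

      # Fetch the digest report from each service and diff them pairwise.
      for h in bigdata15 bigdata16 bigdata17; do
          curl -s "http://$h.bigdata.com:8090/status?digests" > /tmp/digests.$h
      done
      diff /tmp/digests.bigdata15 /tmp/digests.bigdata16
      diff /tmp/digests.bigdata15 /tmp/digests.bigdata17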
      


      - Verify that snapshots are taken per the default nightly policy.
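
      A quick check (sketch) that the nightly policy is actually producing snapshots: list the snapshot files under the service directory and confirm that one was written since the last scheduled run.

      # List snapshots with timestamps; the newest should be from the last nightly run (0200 by default).
      find benchmark/HAJournal-1/HAJournalServer/snapshot -name '*.jnl.gz' \
          -printf '%TY-%Tm-%Td %TH:%TM  %p\n' | sort

      # Count snapshots written in the last 24 hours (should be >= 1).
      find benchmark/HAJournal-1/HAJournalServer/snapshot -name '*.jnl.gz' -mtime -1 | wc -l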


      - Verify that snapshots can be decompressed and put into use (i.e., a restore test). Do a DumpJournal on the restored snapshot to look for data level problems in the indices.
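
      For example (a sketch: the snapshot path is a placeholder, the classpath must point at the installed bigdata jar(s), and the DumpJournal options should be checked against the usage message for the installed release):

      # Decompress a snapshot into a scratch journal file.
      zcat snapshot/000/.../<commitCounter>.jnl.gz > /tmp/restored.jnl

      # Dump the index structures to look for data level problems.
      java -cp bigdata.jar com.bigdata.journal.DumpJournal -indices /tmp/restored.jnl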


      - Do a DumpJournal on the live journal after the tests to look for data level problems in the indices.


      - Do a range count per tenant on all instances to verify the same statement count on each server:

      SELECT (COUNT(*) as ?count) {?a ?b ?c}
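
      A sketch for running that count against each server and comparing the results. It assumes the default NanoSparqlServer SPARQL endpoint at /sparql (for a specific tenant the namespace-specific endpoint would be used instead) and that CSV results are accepted via the Accept header.

      Q='SELECT (COUNT(*) as ?count) {?a ?b ?c}'
      for h in bigdata15 bigdata16 bigdata17; do
          echo -n "$h: "
          curl -s --data-urlencode "query=$Q" -H 'Accept: text/csv' \
              "http://$h.bigdata.com:8090/sparql" | tail -n 1
      done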
      


      - Verify rolling update pattern (failover of services one after the other as if we were deploying a rolling update).
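
      A sketch of the rolling pattern. The stop/start mechanism is deployment specific (shown here as a hypothetical init script), and the readiness check simply greps the /status page for an HAReady marker, whose exact text may differ.

      # Restart one service at a time; wait for it to rejoin before moving on.
      # Restarting the leader last forces a leader failover as part of the exercise.
      for h in bigdata17 bigdata16 bigdata15; do
          ssh root@$h.bigdata.com '/etc/init.d/bigdataHA restart'   # hypothetical service script
          until curl -s "http://$h.bigdata.com:8090/status" | grep -q 'HAReady'; do
              sleep 10
          done
      done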


      - Take down one service during a period of sustained updates. Let a large number of commits go through (100s), then restart the service. Verify that it resyncs and joins properly.
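
      A sketch of this step. The commit counter is read by grepping the leader's /status page, and the restart command is the same hypothetical service script as in the rolling update sketch above.

      # Hard-kill the HAJournalServer on one follower (no orderly shutdown).
      ssh root@bigdata17.bigdata.com 'kill -9 `jps | grep HAJournalServer | cut -d" " -f1`'

      # Note the leader's commit counter, let several hundred more commits go through, then restart.
      curl -s http://bigdata15.bigdata.com:8090/status | grep -o 'commitCounter=[0-9]*'
      sleep 3600   # or poll until the commit counter has advanced by a few hundred
      ssh root@bigdata17.bigdata.com '/etc/init.d/bigdataHA start'   # hypothetical service script

      # Then verify the service resyncs and joins (runState transitions in its log, digests agree).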


      - Note the BSBM UPDATE + EXPLORE mixture throughput both with and without a concurrent query load on the leader and followers. This is an indication of the commit rate that can be sustained by the cluster.


      - Examine the #of open files (which will include sockets), #of threads, and the JVM heap size. These should be bounded. If there is a resource leak, then it will show up in one of these areas.

      # Look for leaked threads, memory, and file handles.
      while true; do cat /proc/`jps | grep HAJournalServer | cut -d' ' -f1`/status | egrep "(Threads|VmSize)" | xargs echo; sleep 60; done
      
      # find errors in the logs.
      grep ERROR ~/workspace/BIGDATA_RELEASE_1_3_0/*.log
      
      # find runState changes in the logs.
      grep runState= ~/workspace/BIGDATA_RELEASE_1_3_0/*.log
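      
      # Count open file descriptors (which include sockets) for the HAJournalServer once a minute.
      # (A sketch; lsof -p <pid> | wc -l would also work.)
      while true; do
          pid=`jps | grep HAJournalServer | cut -d' ' -f1`
          echo "`date` open-fds=`ls /proc/$pid/fd | wc -l`"
          sleep 60
      done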
      


      - Per BLZG-849, we should issue some requests that cause abort2Phase() invocations during the longevity tests.

      ----
      It would be useful to create a longevity testing harness that interacts with real processes running on three (or more) VMs, monitors those processes, applies an appropriate workload (e.g., BSBM EXPLORE+UPDATE on the leader and BSBM EXPLORE on the followers using some desired number of client threads), and suddenly kills (kill -9) a random service within some period of time each time the quorum becomes fully met and all services are HAReady. This would allow us to stress the platform and its failover abilities. We could automate the verification of HALog files (digest equality) and even occasionally restore and roll forward snapshots to verify the effective digest on the services as of a given commit point. An ACID stress test could be achieved by modifying the BSBM harness to do DELETE+INSERT or DROP ALL+LOAD tests in which the number of visible statements should never change.
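
      A very rough sketch of the kill-on-met-quorum loop described above. It assumes the /status page can be polled for an HAReady marker on every service (the exact text to grep for may differ) and reuses the jps-based pid lookup from the monitoring commands above.

      # When all three services report HAReady, wait a random interval and then
      # hard-kill the HAJournalServer on a randomly chosen host.
      HOSTS="bigdata15 bigdata16 bigdata17"
      while true; do
          ready=0
          for h in $HOSTS; do
              curl -s "http://$h.bigdata.com:8090/status" | grep -q 'HAReady' && ready=$((ready+1))
          done
          if [ "$ready" -eq 3 ]; then
              sleep $((RANDOM % 3600))
              victim=`echo $HOSTS | tr ' ' '\n' | shuf -n 1`
              echo "`date` killing HAJournalServer on $victim"
              ssh root@$victim.bigdata.com 'kill -9 `jps | grep HAJournalServer | cut -d" " -f1`'
          fi
          sleep 60
      done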

        Activity

        bryanthompson added a comment -

        Note: For the above, the retention period is equivalent to 1 week.

        I have added an option to force a service into the ERROR state. Once in the ERROR state, it will automatically attempt to recover.

        Ah. The problem is that bigdata17 ran out of disk. That took it offline. Since it was offline, purgeLogs() was unable to actually purge anything. This is because we do not permit HALog files to be purged if the quorum is not fully met. This is done in order to prevent a situation in which the leader would not have sufficient HALog files on hand to restore the failed service. If that were to occur, then the failed service would have to undergo a disaster rebuild rather than simply resynchronizing from the leader. I have updated the code to clarify this. The wiki already documents this behavior.

        Committed Revision r7648.

        I am going to restart bigdata17. It should then sync and join. At that point the HALogs and snapshots should be purged to the minimum required by the restore policy.

        bryanthompson added a comment -

        Well, restarting bigdata17 ran into problems. Presumably this was caused by running out of disk space.

        WARNING: Exception creating service.
        java.lang.reflect.InvocationTargetException
        	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        	at com.sun.jini.start.NonActivatableServiceDescriptor.create(NonActivatableServiceDescriptor.java:674)
        	at com.sun.jini.start.ServiceStarter.create(ServiceStarter.java:287)
        	at com.sun.jini.start.ServiceStarter.processServiceDescriptors(ServiceStarter.java:445)
        	at com.sun.jini.start.ServiceStarter.main(ServiceStarter.java:476)
        Caused by: java.lang.RuntimeException: Startup failure
        	at com.bigdata.journal.jini.ha.AbstractServer.fatal(AbstractServer.java:542)
        	at com.bigdata.journal.jini.ha.AbstractServer.run(AbstractServer.java:1871)
        	at com.bigdata.journal.jini.ha.HAJournalServer.<init>(HAJournalServer.java:568)
        	... 8 more
        Caused by: net.jini.config.ConfigurationException: HAJournalClass=com.bigdata.journal.jini.ha.HAJournal; caused by:
        	java.lang.reflect.InvocationTargetException
        	at com.bigdata.journal.jini.ha.HAJournalServer.newHAJournal(HAJournalServer.java:796)
        	at com.bigdata.journal.jini.ha.HAJournalServer.newService(HAJournalServer.java:705)
        	at com.bigdata.journal.jini.ha.HAJournalServer.newService(HAJournalServer.java:129)
        	at com.bigdata.journal.jini.ha.AbstractServer.run(AbstractServer.java:1861)
        	... 9 more
        Caused by: java.lang.reflect.InvocationTargetException
        	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        	at com.bigdata.journal.jini.ha.HAJournalServer.newHAJournal(HAJournalServer.java:764)
        	... 12 more
        Caused by: com.bigdata.rwstore.PhysicalAddressResolutionException: Address did not resolve to physical address: -33087571
        	at com.bigdata.rwstore.RWStore.getData(RWStore.java:1883)
        	at com.bigdata.journal.RWStrategy.readFromLocalStore(RWStrategy.java:726)
        	at com.bigdata.journal.RWStrategy.read(RWStrategy.java:153)
        	at com.bigdata.journal.AbstractJournal._getCommitRecord(AbstractJournal.java:4300)
        	at com.bigdata.journal.AbstractJournal.<init>(AbstractJournal.java:1313)
        	at com.bigdata.journal.Journal.<init>(Journal.java:217)
        	at com.bigdata.journal.jini.ha.HAJournal.<init>(HAJournal.java:326)
        	at com.bigdata.journal.jini.ha.HAJournal.<init>(HAJournal.java:307)
        	... 17 more
        

        I am going to restore the most recent snapshot and then let it roll forward and see if it resyncs and joins properly.

        # change to the service directory.
        cd benchmark/HAJournal-1/HAJournalServer
        # rename the old journal (optional).
        mv bigdata-ha.jnl old.jnl
        # unpack the most recent snapshot.
        zcat snapshot/000/000/000/000/004/471/000000000000004471442.jnl.gz > bigdata-ha.jnl
        

        and then restart.

        Note: I had to go back a few snapshots. Evidently the system had been more or less out of disk for a while and several of the recent snapshots could not be restored properly. Clearly people need to provision enough disk and set an alarm on the disk space remaining! This would be a good thing to build into the platform (see BLZG-867): if the free space remaining on the volume for the service directory, log files, journal, HALog files, or snapshots falls below a minimum threshold, then the service should be forced into an "OPERATOR" state.

        Bigdata17 was successfully restored from a snapshot. The server wound up applying several million HALog files before resyncing and joining with the met quorum. All three servers showed an abrupt jump in free space once bigdata17 was resynchronized since they were able to release a large number of HALog files.

        The servers in this replication cluster are still holding onto a number of snapshots. The restore policy specifies 1 week of restore time (minRestoreAge=604800000 ms, i.e., 7 days). This means that the oldest snapshot (which is more than one week old) cannot be released until another snapshot is at least one week old: the oldest available snapshot defines the first available restore point, and with a one week restore policy that restore point must be at least one week old. Therefore the system should begin releasing snapshots again on December 17th, which would make the 2nd snapshot one week old and therefore allow the first snapshot to be released.

        Service: path=/root/workspace/BIGDATA_RELEASE_1_3_0/benchmark/HAJournal-1/HAJournalServer
        Service: snapshotPolicy=DefaultSnapshotPolicy{timeOfDay=0200, percentLogSize=20%}, countdown=18:16, shouldSnapshot=false
        Service: restorePolicy=DefaultRestorePolicy{minRestoreAge=604800000ms,minSnapshots=1,minRestorePoints=0}
        HAJournal: file=/root/workspace/BIGDATA_RELEASE_1_3_0/benchmark/HAJournal-1/HAJournalServer/bigdata-ha.jnl, commitCounter=5307292, nbytes=19567149056
        
        rootBlock{ rootBlock=0, challisField=5307292, version=3, nextOffset=1195349532443535, localTime=1387025032643 [Saturday, December 14, 2013 7:43:52 AM EST], firstCommitTime=1382905393987 [Sunday, October 27, 2013 4:23:13 PM EDT], lastCommitTime=1387025032631 [Saturday, December 14, 2013 7:43:52 AM EST], commitCounter=5307292, commitRecordAddr={off=NATIVE:-33361743,len=422}, commitRecordIndexAddr={off=NATIVE:-27402832,len=220}, blockSequence=1, quorumToken=67, metaBitsAddr=145157815009471, metaStartAddr=298571, storeType=RW, uuid=cc980e9e-b974-435a-b1a5-18c6ef432d66, offsetBits=42, checksum=738469401, createTime=1382905368291 [Sunday, October 27, 2013 4:22:48 PM EDT], closeTime=0}
        HALogDir: nfiles=951690, nbytes=85212874540, path=/root/workspace/BIGDATA_RELEASE_1_3_0/benchmark/HAJournal-1/HAJournalServer/HALog, compressorKey=DBS, lastHALogClosed=000000000000005307320, liveLog=000000000000005307321.ha-log
        SnapshotDir: nfiles=6, nbytes=35805592023, path=/root/workspace/BIGDATA_RELEASE_1_3_0/benchmark/HAJournal-1/HAJournalServer/snapshot
        SnapshotFile: commitTime=1386193837703 [Wednesday, December 4, 2013 4:50:37 PM EST], commitCounter=4355632, nbytes=5862449418
        SnapshotFile: commitTime=1386659917696 [Tuesday, December 10, 2013 2:18:37 AM EST], commitCounter=4471427, nbytes=5901231001
        SnapshotFile: commitTime=1386723017312 [Tuesday, December 10, 2013 7:50:17 PM EST], commitCounter=4655632, nbytes=5899833810
        SnapshotFile: commitTime=1386835324431 [Thursday, December 12, 2013 3:02:04 AM EST], commitCounter=4847422, nbytes=6025812124
        SnapshotFile: commitTime=1386923049340 [Friday, December 13, 2013 3:24:09 AM EST], commitCounter=5014517, nbytes=6051171652
        SnapshotFile: commitTime=1387010774394 [Saturday, December 14, 2013 3:46:14 AM EST], commitCounter=5264520, nbytes=6065094018
        

        In summary, all is working correctly, but we need to set up a free disk space monitor, either outside or inside of bigdata, for an HA3 cluster to watch for out-of-disk conditions. Such a condition should alert the operator and force the service into the OPERATOR state to avoid attempts to write on a service that is either low on disk or out of disk.

        bryanthompson added a comment -

        Pushed code updates to bigdata17 and then bigdata16. Both services restarted (one at a time) without causing a quorum break. Bigdata15 is still the leader. The code changes provide the ability to force a service into an ERROR state and the ability to report the disk bytes available on the various directories for the service.

        The bigdata12, 13, 14 cluster is at commit point 1,157,396 with continuing EXPLORE+UPDATE on the leader.

        The bigdata15, 16, 17 cluster is at commit point 5,365,944 with continuing EXPLORE+UPDATE on the leader.

        bryanthompson added a comment -

        Resumed the UPDATE+EXPLORE mixture on bigdata15,16,17. They have released a large number of HALogs and are no longer in a state with extremely low disk.

        bryanthompson added a comment -

        This issue is closed. In the future, it would be good to increase the automation of the longevity testing process. Any such automation and future longevity testing for QA should follow the basic procedure laid out in this ticket.


          People

          • Assignee: bryanthompson
          • Reporter: bryanthompson
          • Votes: 0
          • Watchers: 2