Blazegraph (by SYSTAP) / BLZG-14

HA doLocalAbort() should interrupt NSS requests and AbstractTasks

    Details

    • Type: Task
    • Status: In Progress
    • Priority: High
    • Resolution: Unresolved
    • Affects Version/s: BIGDATA_RELEASE_1_2_3
    • Fix Version/s: None
    • Component/s: HAJournalServer
    • Labels:
      None

      Description

      An issue has been observed where a transition to an error state in an HAJournalServer can cause a variety of exceptions to be logged by forcing a rollback to the previous commit point without first interrupting any unisolated tasks that are operating on the journal (any running transactions are interrupted, but not the unisolated operations). This does not cause data corruption, since the journal will refuse to commit. However, it does cause unsightly error messages.

      A possible fix is to hook (a) the NSS; and (b) the ConcurrencyManager or WriteExecutorService to ensure that unisolated operations are also cancelled. If we implement BLZG-670 (Concurrent unisolated operations against multiple KBs on the same Journal), then those concurrent connections should also be interrupted.

      See BLZG-1036 (Name2Addr.indexNameScan(prefix) uses scan + filter)
      See BLZG-670 (Concurrent unisolated operations against multiple KBs on the same Journal)

        Issue Links

          Activity

          bryanthompson added a comment -

          I've been looking at how to address this ticket. The main integration point currently appears to be AbstractJournal.setQuorumToken2() in the code path where a service leaves a met quorum. For this ticket, we are only interested in UNISOLATED operations and thus only in the case where it is the leader that leaves the met quorum; hence, this will also be a quorum break.

          My thoughts are to use a well-known ThreadGuard and a class that extends FutureTask to automatically do the run() within ThreadGuard.guard(), e.g., GuardedFutureTask. The AbstractJournal can own that ThreadGuard and provide an appropriate factory for the GuardedFutureTask. We need to create a pattern that can be used to run such unisolated tasks with an appropriate guard. E.g., IIndexManager.runWithUnisolatedTaskGuard(Runnable).
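The guarded-task idea above can be sketched as follows. Note that ThreadGuard and GuardedFutureTask here are minimal hypothetical stand-ins written for illustration, not the shipped classes: the guard records the running thread for the duration of run() so that a quorum-break handler can interrupt every guarded task at once.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.FutureTask;

// Hypothetical sketch: a guard that tracks the threads currently running
// guarded tasks so they can all be interrupted on a quorum break.
class ThreadGuard {
    private final List<Thread> threads = new CopyOnWriteArrayList<>();

    void guard(final Runnable r) {
        threads.add(Thread.currentThread());
        try {
            r.run();
        } finally {
            threads.remove(Thread.currentThread());
        }
    }

    /** Interrupt all currently guarded threads (e.g., on a leader fail). */
    void interruptAll() {
        for (Thread t : threads)
            t.interrupt();
    }
}

// FutureTask subclass whose run() executes inside the guard, so the task
// becomes interruptible through the guard without shutting down the executor.
class GuardedFutureTask<V> extends FutureTask<V> {
    private final ThreadGuard guard;

    GuardedFutureTask(final ThreadGuard guard, final Callable<V> task) {
        super(task);
        this.guard = guard;
    }

    @Override
    public void run() {
        guard.guard(super::run);
    }
}
```

In this model the AbstractJournal would own the ThreadGuard, hand out GuardedFutureTask instances from a factory, and call interruptAll() from the code path where the leader leaves the met quorum.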

          The following kinds of tasks would have to be wrapped up with this pattern:


          - SPARQL UPDATE. The SPARQL UPDATE execution pattern is already wrapped up as a Callable.


          - Mutation methods of the REST API. The REST API mutation methods are not currently wrapped up as Callable objects, but this encapsulation should be easy to provide. Once so wrapped, they can be "guarded". This might be easiest to achieve by submitting the task to IIndexManager.runWithUnisolatedTaskGuard(Runnable).


          - Anything submitted through the ConcurrencyManager as an UNISOLATED task (though this facility is not used right now in the HAJournalServer). The tasks on the ConcurrencyManager all extend AbstractTask.


          - Anything submitted through HAGlue.submit(IIndexManagerCallable,boolean:asyncProxy). For this case, we would be best off using either IIndexManager.runWithUnisolatedTaskGuard(Runnable) or GuardedFutureTask.

          For both AbstractTask and UpdateTask, the code may rely on the specific base class, which could preclude wrapping the objects with a GuardedFutureTask. However, we could simply use the well-known ThreadGuard (hung off of the AbstractJournal) and make sure that the "Callable" executes within a guard().

          bryanthompson added a comment -

          I am looking at how to improve encapsulation of the REST API methods. There is some tension between providing the servlet response and request objects and hiding the access to the journal and the sail connection. We want the sail connection to be obtained from within the call() method so its scope is strictly limited to the period during which the ConcurrencyManager is holding the necessary lock. This is what will give us the ability to run concurrent mutation operations with group commit. This means pushing down the "getUnisolatedConnection()" (which should use a tx anyway if the caller has specified one in the update). However, some of the REST API methods require direct access to the servlet API in order to use efficient streaming operations on the servlet request and/or servlet response.

          I have created a branch (NSS_GROUP_COMMIT) to explore this issue. Overall, interrupting NSS requests is quite viable. Supporting group commit is trickier, but looks possible. However, there are some methods where group commit would break the ACID semantics of the current implementation (this is discussed below). Those methods could be rewritten, e.g., as SPARQL UPDATE requests. There are also some complexities around the overhead of declaring the locks required to support group commit and the differences between the global index views used by the IBigdataFederation and the Journal local views (which can use the ConcurrencyManager). Details are below for all of this.

          Some points:


          - REST API mutation methods should support read/write tx if the KB is so configured. Right now they are using the unisolated connection regardless.


          - The RDFParser options are being configured from constants in the REST API code rather than using the configured properties on the KB instance. For example:

          rdfParser.setVerifyData(false);
          rdfParser.setStopAtFirstError(true);
          rdfParser.setDatatypeHandling(RDFParser.DatatypeHandling.IGNORE);
          // Note: the setPreserveBNodeIDs() method is not being invoked.
          


          - The REST API methods SHOULD be able to provide un-nested HTTP responses with simple status codes and explanations (status lines). This means that they need to either error-check the API before entering the Callable or directly operate on the HttpServletResponse and commit it before throwing an exception. Right now, the preferred pattern is that we do not throw exceptions for API errors. Instead, we commit the response and return. This really needs to be done outside of the Callable, or we need to have a well-known exception type that is used for errors that have already been handled and whose response has already been committed.


          - The IBigdataFederation cannot use the ConcurrencyManager to run the tasks in precisely the same manner. There is an IConcurrencyManager available on the IDataService, but it is used to guard the individual named index partitions rather than the global index views. Further, the NSS can run against the global index views from any process that is connected to an IBigdataFederation, and such processes will not have an IConcurrencyManager unless they are IDataService or IMetadataService instances. Thus, we need to abstract out two patterns: one that can be used when the backing database is an IJournal instance and one that can be used when the backing database is an IBigdataFederation instance. This suggests that we need a pattern using a delegation model that is either wrapped by an AbstractTask (for the Journal) or by a simpler pattern (for an IBigdataFederation).


          - The immediate concern is being able to interrupt the unisolated tasks for a leader fail scenario. A secondary concern is whether we can leverage the ConcurrencyManager to allow concurrent unisolated updates (also overriding the Journal's semaphore).


          - I have code that runs the InsertServlet tasks on the ConcurrencyManager for the journal. There is some additional overhead involved because each task must discover and then declare all of the indices associated with a given KB. However, this approach should permit concurrent operations against different KB instances if we also disable the Journal semaphore. It would be more efficient if we could do the index name declarations based on the namespace alone, but this requires an efficient prefix scan on Name2Addr and there is a known bug for that. Some of the test harness needs to be aware of and operate using the same concurrency control mechanisms, especially the code that creates the KB instance (if this is done explicitly) and the code that destroys the KB instance. The tasks are calling through to BigdataSailConnection.commit() and .rollback(). Those methods call through to the LocalTripleStore.commit() and LocalTripleStore.abort() methods. Since those calls are being passed down to the AbstractTask.IsolatedActionJournal, I have had to make the methods into NOPs on the IsolatedActionJournal (they were throwing UnsupportedOperationExceptions). The general pattern for job-based concurrency is that the commit is applied automatically when the AbstractTask finishes and the write set of the job is discarded automatically if the task fails.


          - There is a potential issue where the existing code commits the response and returns, e.g., from the InsertServlet. Any task that does not fail (throw an exception) will commit. This means that mutation operations that fail will still attempt to join a commit point. This is inappropriate and could cause resource leaks (e.g., if the operation failed after writing on the Journal). We really should throw a typed exception, but have launderThrowable() ignore that typed exception if the response has already been committed. That way the task will not join a commit point.
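The typed-exception idea can be sketched as below. ResponseCommittedException and the launderThrowable() shape are hypothetical names for illustration, not the shipped API: a failed mutation throws a marker exception so the task never joins a commit point, but the exception is swallowed at the top level because the HTTP error response was already committed.

```java
// Hypothetical marker exception for errors whose HTTP response has already
// been committed by the servlet before the Callable completed.
class ResponseCommittedException extends RuntimeException {
    ResponseCommittedException(final String msg) {
        super(msg);
    }
}

final class Laundry {
    /**
     * Returns true if the throwable was fully handled (the response was
     * already committed) and may be ignored; otherwise rethrows. Either way
     * the task has failed, so its write set is discarded and it does not
     * join a commit point.
     */
    static boolean launderThrowable(final Throwable t) {
        if (t instanceof ResponseCommittedException)
            return true; // already reported to the client; do not propagate
        if (t instanceof RuntimeException)
            throw (RuntimeException) t;
        throw new RuntimeException(t);
    }
}
```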


          - There is a flag in BigdataServlet that currently enables or disables the use of the ConcurrencyManager (i.e., enables or disables group commit on the Journal from the NSS API). A single global flag SHOULD control this; it should also disable the Journal's semaphore, disable the wrapping of BTree as an UnisolatedReadWriteIndex, and disable the calls to commit() or abort() from the LocalTripleStore to the Journal.


          - DELETE WITH QUERY, UPDATE WITH QUERY: allowing group commit for these methods on the REST API could break the ACID contract. The methods currently rely on reading from the lastCommitTime view and writing on the unisolated connection. With group commit, there can be mutations that have been checkpointed and hence are visible to the unisolated connection, but which are not yet visible as of the lastCommitTime. Thus, if another request modified the KB since the last commit point and before the DELETE WITH QUERY or UPDATE WITH QUERY is executed, then those modifications will not be observed by the reader on the lastCommitTime view.
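The visibility hazard above can be demonstrated with a toy model (plain Java sets, not Blazegraph code): the unisolated view sees checkpointed writes immediately, while the lastCommitTime view only advances at a group commit, so a DELETE WITH QUERY that queries the committed view misses statements added since the last commit point.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the two views. Real index views are B+Tree checkpoints; sets
// of statement strings are enough to show the race.
final class ViewRace {
    final Set<String> committedView = new HashSet<>();  // as of lastCommitTime
    final Set<String> unisolatedView = new HashSet<>(); // includes checkpoints

    /** A checkpointed write: visible to the unisolated view only. */
    void checkpointWrite(final String stmt) {
        unisolatedView.add(stmt);
    }

    /**
     * DELETE WITH QUERY shape: query against the committed view, delete
     * against the unisolated view. Returns the number of statements removed.
     */
    int deleteWithQuery(final String prefix) {
        int n = 0;
        for (String s : new HashSet<>(committedView)) {
            if (s.startsWith(prefix) && unisolatedView.remove(s))
                n++;
        }
        return n;
    }
}
```

A statement checkpointed since the last commit point survives a DELETE WITH QUERY that should have matched it, which is exactly the ACID violation described above.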


          - SPARQL QUERY/UPDATE: We need to refactor the code that manages the running queries in BigdataRDFServlet so we can separate out the concurrency control of the views from the control over the #of running queries and/or update requests and the metadata that we manage to track and report on those requests.


          - MULTI TENANCY: We also need to refactor the MultiTenancyServlet to push down Callable tasks that can be executed with the appropriate concurrency controls.

          See https://sourceforge.net/apps/trac/bigdata/ticket/743 (AbstractTripleStore.destroy() does not filter for correct prefix; this notes the problem with prefix scans on Name2Addr. This issue might also be noted on the feature for durable named solution sets.)

          bryanthompson added a comment -

          It might be useful to add the commit()/abort() protocol on AbstractTask for the IsolatedActionJournal. commit() would be the implicit behavior for a read/write task that ends normally. abort() would set a flag indicating that the write set should be discarded. Control must explicitly return from the AbstractTask before the task can be either aborted or committed. This would provide a control flow closer to that of a BigdataSailConnection. However, the connection semantics are still somewhat different, as a thread that does not write on the Journal will not await the next commit point when calling commit() on the connection. It will simply return immediately.
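The proposed protocol could look like the sketch below (ProtocolTask and its method names are hypothetical, not the actual AbstractTask API): the task body may flag an abort, and the runner decides after call() returns whether the write set joins the group commit.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch of the commit()/abort() protocol: commit is implicit
// for a normally completing task; abort() flags the write set for discard.
// Control must return from call() before either outcome takes effect.
abstract class ProtocolTask<T> implements Callable<T> {
    private boolean abortRequested = false;

    /** Flag the write set to be discarded once the task completes. */
    protected final void abort() {
        abortRequested = true;
    }

    /** Invoked by the task runner after call() returns normally. */
    final boolean shouldCommit() {
        return !abortRequested;
    }
}
```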

          We will need CreateKB and DestroyKB tasks designed to execute on the ConcurrencyManager. The CreateKB task only needs the GlobalRowStore lock. The DestroyKB task needs the GlobalRowStore lock and the lock for each of the named indices associated with the KB instance. These tasks will need to be used when we are running with group commit support enabled rather than in the embedded BigdataSailConnection mode. This is necessary to fix a stochastic problem with the NSS test suite when using group commit, which (informed guess) arises from a data race between the group commit point and the visibility of the newly created KB instance. Likewise, the test suite is mixing different access models when using tripleStore.destroy() in its AbstractTestNanoSparqlServerClient.dropTripleStore() method.
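The lock declarations for those two tasks could be sketched as below. The lock name constant and the namespace-qualified index names are illustrative assumptions, not the real Name2Addr entries:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the resource locks declared by the proposed
// CreateKB and DestroyKB tasks when submitted to the ConcurrencyManager.
final class KBLocks {
    // Illustrative name for the GlobalRowStore lock, not the real constant.
    static final String GLOBAL_ROW_STORE = "__globalRowStore";

    /** CreateKB only needs the GlobalRowStore lock. */
    static List<String> createKBLocks() {
        return List.of(GLOBAL_ROW_STORE);
    }

    /** DestroyKB needs the GlobalRowStore lock plus every index of the KB. */
    static List<String> destroyKBLocks(final String namespace,
            final List<String> kbIndices) {
        final List<String> locks = new ArrayList<>();
        locks.add(GLOBAL_ROW_STORE);
        for (String ndx : kbIndices)
            locks.add(namespace + "." + ndx);
        return locks;
    }
}
```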

          bryanthompson added a comment -

          Continued progress against this ticket in support of BLZG-670. I am iterating over the REST API method implementations and restructuring them to support both BLZG-670 and, by extension, this ticket as well. It is not necessary to change over to group commit in order to derive the benefit from this refactoring, but when we do change over it will be with a single boolean switch from the existing operational mode to the group commit operational mode.

          bryanthompson added a comment -

          At this point group commit (BLZG-670) is closed. An API exists that would make it easy to wrap tasks with a ThreadGuard pattern such that a leader fail event could interrupt all such running tasks without forcing the shutdown of the executor service for the Journal. This glue needs to be implemented and tested.

          bryanthompson added a comment -

          It would appear that this line in AbstractJournal.setQuorumToken2() is where we should be terminating any tasks running on the WriteExecutorService or queued on the ConcurrencyManager / LockManager.

                          ((AbstractTransactionService) getLocalTransactionManager()
                                  .getTransactionService()).abortAllTx();
          

          This method calls through to AbstractTransactionService.abortAllTx(). AbstractTransactionService is extended by AbstractHATransactionService which is extended by JournalTransactionService and the private class Journal.InnerJournalTransactionService. The InnerJournalTransactionService has a lot of the glue code for HA 2-phase commits, including the gather consensus protocol.

          Both the local transaction manager (from which we access the transaction service) and the ConcurrencyManager are initialized in the Journal's constructor. And the ConcurrencyManager actually accepts the localTransactionManager object into its constructor. This suggests that the right place to install the guard pattern is in the ConcurrencyManager.

                  localTransactionManager = newLocalTransactionManager();
          
                  concurrencyManager = new ConcurrencyManager(properties,
                          localTransactionManager, this);
          

          Looking a little further, AbstractTask<init> reaches back and obtains this same local transaction manager reference as well as the ConcurrencyManager itself.

                  // make sure we have the real ConcurrencyManager for addCounters()
                  this.concurrencyManager = (ConcurrencyManager) (concurrencyManager instanceof Journal ? ((Journal) concurrencyManager)
                          .getConcurrencyManager()
                          : concurrencyManager);
          
                  this.transactionManager = (AbstractLocalTransactionManager) concurrencyManager
                          .getTransactionManager();
          
          bryanthompson added a comment -

          I have modified the concurrency manager and task-oriented concurrency interfaces and implementations to return a FutureTask. This was easier than I had previously suspected. We can now wrap the behavior of the Runnable FutureTask, e.g., with a ThreadGuard pattern.

          I have added an abortAllTx() method on the ConcurrencyManager and modified the InnerJournalTransactionService to call through to that method. This now co-locates the logic through which we submit tasks to the WriteExecutorService (via the LockManager) and the logic to abort/discard submitted tasks when going through a leader fail or similar scenario.
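The co-located submit/abort idea can be sketched as follows (hypothetical class and method names, not the actual ConcurrencyManager): the same object that accepts tasks for execution also aborts them, so the leader-fail path has a single entry point that dequeues pending tasks and interrupts running ones.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.FutureTask;

// Sketch only: queued tasks are cancelled before they run; running tasks are
// interrupted via cancel(true). In the real code these would be the tasks
// queued on the LockManager and executing on the WriteExecutorService.
final class GuardedTaskManager {
    private final Queue<FutureTask<?>> queued = new ConcurrentLinkedQueue<>();
    private final Queue<FutureTask<?>> running = new ConcurrentLinkedQueue<>();

    <T> FutureTask<T> submit(final FutureTask<T> task) {
        queued.add(task);
        return task;
    }

    /** Called from the leader-fail code path (cf. abortAllTx()). */
    void abortAll() {
        FutureTask<?> t;
        while ((t = queued.poll()) != null)
            t.cancel(false); // dequeue pending tasks
        while ((t = running.poll()) != null)
            t.cancel(true);  // interrupt executing tasks
    }
}
```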

          Tasks can be queued on the LockManager, queued on the WriteExecutorService, executing on the WriteExecutorService, or pending a group commit on the WriteExecutorService. In all cases, a leader failure (or similar scenario) should cause the task to be dequeued (if it is pending), interrupted (if it is running), and the commit group to be failed (along with all of its members).

          I am committing the refactoring to date to CI. It has already passed part of the HA test suite, the journal test suite, and the RWStore test suite. Tomorrow I will look at how to properly discard/abort queued/pending/running tasks. One option would be to shut down the lock manager and write executor service and then restart them. This would be the least problematic from the perspective of those classes, but it might not play out so nicely with the ConcurrencyManager and the Journal. This is what I need to look at next.

          Commit 3d733845e30f72dda2eb9ddd41e29b461fb0a1fb

          bryanthompson added a comment -

          I am going to defer this task to 1.5.2. Per the issue description, this is about avoiding unsightly error messages rather than core correctness. It will also require writing / reviewing tests concerning the concurrency manager, lock manager, and write executor service.


            People

            • Assignee:
              bryanthompson
              Reporter:
              bryanthompson
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue
