Details

    • Type: Sub-task
    • Status: Done
    • Priority: High
    • Resolution: Done
    • Affects Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      We have started seeing this issue on our QA box, which runs a heavy load with five different namespaces.
      I will add comments with variants of the stack traces that we are seeing.
      I do not see much of a pattern in when these occur.

      1. go.sh (3 kB, Jeremy Carroll)
      2. go2.sh (4 kB, bryanthompson)
      3. logg (1.23 MB, Jeremy Carroll)

        Issue Links

          Activity

          bryanthompson added a comment (edited)

          Update from last sync call on this issue:

          • Move discard of resource locator cache above discard of committers. This might fix the remaining error.
          • Refine scope of the default resource locator cache clear to just the unisolated views and make sure we have a sufficient lock. This is a performance optimization since we do not need to discard read-only views.
          • Lift the SPARQL QUERY and SPARQL UPDATE parse in QueryServlet out of the code that is holding the connection object. The connection is no longer required since we do not use it to parse the query / update request. This is a performance optimization since the parser can run without the connection and thus we can overlap more work. – See BLZG-2039 (new ticket for this).
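
The third item above (parsing before the connection is held) can be sketched roughly as follows. This is only an illustration under stated assumptions: `ParseOutsideConnectionSketch`, `ParsedRequest`, and the lock that stands in for the connection are hypothetical names, not Blazegraph classes, and the sketch does not reproduce the actual QueryServlet code.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-ins only; none of these are Blazegraph classes.
final class ParseOutsideConnectionSketch {

    /** Stand-in for a parsed SPARQL QUERY or UPDATE request. */
    static final class ParsedRequest {
        final String text;
        ParsedRequest(String text) { this.text = text; }
    }

    /** Stand-in for the scarce resource (the connection object). */
    private final ReentrantLock connection = new ReentrantLock();

    /** Parsing depends only on the request string, so it needs no connection. */
    static ParsedRequest parse(String request) {
        if (request == null || request.trim().isEmpty()) {
            throw new IllegalArgumentException("empty request");
        }
        return new ParsedRequest(request.trim());
    }

    /** Parse first, then hold the connection only while evaluating. */
    String execute(String request) {
        final ParsedRequest parsed = parse(request); // no connection held here
        connection.lock();
        try {
            return "evaluated: " + parsed.text;
        } finally {
            connection.unlock();
        }
    }

    /** True while some thread holds the stand-in connection. */
    boolean connectionHeld() {
        return connection.isLocked();
    }
}
```

With this ordering a request that fails to parse never acquires the connection, and parsing for one request can overlap with evaluation of another.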
          bryanthompson added a comment

          We have figured out where the remaining issue is coming from: it is this code in AbstractApiTask. The unisolated connection does not exist until after we have resolved the view of the AbstractTripleStore using the DefaultResourceLocator. This breaks the serialization: the unisolated view can be resolved in the DefaultResourceLocator concurrently with an AbstractJournal.abort(). Hence, we can discover an unisolated view of the AbstractTripleStore whose indices are then invalidated by the abort(), and then go on to use that AbstractTripleStore in an unisolated operation.

              protected BigdataSailRepositoryConnection getConnection()
                      throws SailException, RepositoryException {

                  // resolve the default namespace.
                  final AbstractTripleStore tripleStore = (AbstractTripleStore) getIndexManager()
                          .getResourceLocator().locate(namespace, timestamp);

                  if (tripleStore == null) {

                      throw new DatasetNotFoundException("Not found: namespace="
                              + namespace);

                  }

                  // Wrap with SAIL.
                  final BigdataSail sail = new BigdataSail(tripleStore);

                  final BigdataSailRepository repo = new BigdataSailRepository(sail);

                  repo.initialize();

                  final BigdataSailRepositoryConnection conn = (BigdataSailRepositoryConnection) repo
                          .getConnection();

          We are discussing ways of closing this gap.
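
The interleaving described in this comment can be replayed deterministically in a single thread. The sketch below is a simplified model, not Blazegraph code: `View`, `Store`, and `locateThenAbortThenUse` are hypothetical names, and the abort is modeled as simply invalidating the cached view and installing a fresh one.

```java
import java.util.concurrent.atomic.AtomicReference;

// Simplified model of the gap; all names here are hypothetical.
final class LocateAbortRaceSketch {

    /** Stand-in for a cached unisolated view whose indices an abort invalidates. */
    static final class View {
        private volatile boolean valid = true;
        void invalidate() { valid = false; }
        boolean isValid() { return valid; }
    }

    /** Stand-in for the locator cache together with the journal that can abort. */
    static final class Store {
        private final AtomicReference<View> cached = new AtomicReference<>(new View());

        /** Resolve the current view; note that no lock is held. */
        View locate() {
            return cached.get();
        }

        /** Invalidate the current view and install a fresh one. */
        void abort() {
            cached.getAndSet(new View()).invalidate();
        }
    }

    /**
     * Replays the problematic ordering: locate, then abort, then use.
     * Returns whether the view obtained in step 1 is still usable in step 3.
     */
    static boolean locateThenAbortThenUse() {
        final Store store = new Store();
        final View view = store.locate(); // step 1: resolve the view
        store.abort();                    // step 2: an abort lands in the gap
        return view.isValid();            // step 3: the caller would now use a stale view
    }

    public static void main(String[] args) {
        System.out.println("view still valid after abort: " + locateThenAbortThenUse());
    }
}
```

In the real system steps 1 and 3 run on one thread and step 2 on another; the single-threaded replay only fixes the ordering so that the outcome is reproducible.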

          bryanthompson added a comment

          Here is a second version of the provided script. It triggered some additional issues: the concurrency hole between locating an unisolated view of a resource and obtaining the unisolated connection for that resource.

          bryanthompson added a comment

          We have identified a possible fix involving a new lock and an overlapping lock pattern. martyncutcher is implementing and testing a fix.
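
The ticket does not spell out the lock design, so the following is only one possible shape of an overlapping (hand-over-hand) lock pattern, not the change that was actually implemented. All names (`OverlappingLockSketch`, `locateLock`, `connectionLock`, `View`) are hypothetical. The assumption is that an abort takes both locks in the same order as a writer, which keeps the sketch deadlock-free.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of an overlapping lock pattern; not Blazegraph code.
final class OverlappingLockSketch {

    /** Stand-in for an unisolated view that an abort invalidates. */
    static final class View {
        private volatile boolean valid = true;
        boolean isValid() { return valid; }
        void invalidate() { valid = false; }
    }

    private final ReentrantLock locateLock = new ReentrantLock();     // assumed "new lock"
    private final ReentrantLock connectionLock = new ReentrantLock(); // stands in for the unisolated connection
    private View cached = new View();                                  // guarded by locateLock
    final AtomicInteger staleUses = new AtomicInteger();

    /** Locate under the first lock and take the second before releasing the first. */
    void unisolatedOperation() {
        View view;
        locateLock.lock();
        try {
            view = cached;
            connectionLock.lock(); // the two locks overlap here
        } finally {
            locateLock.unlock();
        }
        try {
            if (!view.isValid()) {
                staleUses.incrementAndGet(); // would indicate the race
            }
        } finally {
            connectionLock.unlock();
        }
    }

    /** Abort takes both locks in the same order, then swaps in a fresh view. */
    void abort() {
        locateLock.lock();
        try {
            connectionLock.lock();
            try {
                cached.invalidate();
                cached = new View();
            } finally {
                connectionLock.unlock();
            }
        } finally {
            locateLock.unlock();
        }
    }

    /** Runs writers against a concurrent aborter; returns the number of stale uses seen. */
    static int run(int writers, int opsPerWriter, int aborts) {
        final OverlappingLockSketch s = new OverlappingLockSketch();
        final Thread[] threads = new Thread[writers + 1];
        for (int i = 0; i < writers; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < opsPerWriter; j++) s.unisolatedOperation();
            });
        }
        threads[writers] = new Thread(() -> {
            for (int j = 0; j < aborts; j++) s.abort();
        });
        for (Thread t : threads) t.start();
        try {
            for (Thread t : threads) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return s.staleUses.get();
    }
}
```

Because the writer releases `locateLock` only after it holds `connectionLock`, an abort cannot slip in between the locate and the start of the unisolated work, while other threads may still locate views once the first lock is released.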

          bryanthompson added a comment (edited)

          Relevant PRs are documented at BLZG-2041, but also summarized here:

          • https://github.com/SYSTAP/bigdata/pull/465 is the main PR for this ticket.
          • https://github.com/SYSTAP/tinkerpop3/pull/13 TP3 patch.
          • https://github.com/SYSTAP/bigdata-gpu/pull/406 bigdata-gpu patch.

          Status: CI is looking pretty good. Benchmark run has been started. "go.sh" (version 2) runs clean with the main PR.

            People

            • Assignee:
              bryanthompson bryanthompson
              Reporter:
              jjc Jeremy Carroll
            • Votes:
              0
              Watchers:
              3

              Dates

              • Created:
                Updated:
                Resolved: