Blazegraph (by SYSTAP) / BLZG-1285

Sticky index error after running concurrent update

    Details

    • Type: Bug
    • Status: Done
    • Priority: High
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: BLAZEGRAPH_RELEASE_1_5_2
    • Component/s: None
    • Labels:
      None

      Description

      After running the updater for a while, I got this error:

      ERROR: AbstractTripleStore.java:1765: java.lang.IllegalStateException: Index is in error state
      java.lang.IllegalStateException: Index is in error state

      Full log attached.

      I also found this in /tmp:

      -rw-r--r-- 1 blazegraph blazegraph 1.1G May 18 14:40 bigdata3104270295584281148.tmp
      -rw-r--r-- 1 blazegraph blazegraph  33M May 18 14:40 bigdata3369438409198414383.tmp

      Attachments

      1. log.1431812556.gz (32 kB, stasmalyshev)
      2. log.1431902533.gz (518 kB, stasmalyshev)


          Activity

          stasmalyshev added a comment - edited

          Client-side log (the bigger one is client-side, the interesting part is at the end).

          bryanthompson added a comment - edited

          The initial error indicates that the BTree has a sticky (but not durable) error and that the database needs to go through an abort to restore a clean state for the index. This is the block of code that threw that exception:

            if (btree.error != null) {
                      /**
                       * This occurs if an error was detected against a mutable view of
                       * the index (the unisolated index view) and the caller has not
                       * discarded the index and caused it to be reloaded from the most
                       * recent checkpoint.
                       * 
                       * @see <a href="http://trac.blazegraph.com/ticket/1005"> Invalidate
                       *      BTree objects if error occurs during eviction </a>
                       */
                      throw new IllegalStateException(AbstractBTree.ERROR_ERROR_STATE,
                              btree.error);
                  }
          

          By itself this exception does not indicate any durable problem. I have not yet seen the root cause leading to this exception.
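
          For illustration only, a minimal sketch of what that abort-based recovery could look like for an embedded com.bigdata.journal.Journal (the helper method and write task are hypothetical; commit() and abort() are the real IAtomicStore methods):

              // Minimal illustrative sketch; not the actual NanoSparqlServer recovery path.
              import com.bigdata.journal.Journal;

              public class StickyErrorRecovery {

                  public static void writeWithRecovery(final Journal journal,
                          final Runnable writeTask) {
                      try {
                          writeTask.run();   // mutate the unisolated index views
                          journal.commit();  // make the write set durable
                      } catch (IllegalStateException ex) {
                          // "Index is in error state": abort discards the dirty index views
                          // so they are reloaded from the last checkpoint; nothing durable
                          // is lost.
                          journal.abort();
                          throw ex;          // the caller may retry against the clean view
                      }
                  }
              }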

          bryanthompson added a comment

          Ok. Here is the root cause. You can see that there is a request to create a new triple store.

          	at com.bigdata.btree.Node.insert(Node.java:922)
          	at com.bigdata.btree.AbstractBTree.insert(AbstractBTree.java:2105)
          	at com.bigdata.btree.AbstractBTree.insert(AbstractBTree.java:2049)
          	at com.bigdata.sparse.AtomicRowWriteRead.atomicWrite(AtomicRowWriteRead.java:258)
          	at com.bigdata.sparse.AtomicRowWriteRead.apply(AtomicRowWriteRead.java:181)
          	at com.bigdata.sparse.AtomicRowWriteRead.apply(AtomicRowWriteRead.java:49)
          	at com.bigdata.btree.AbstractBTree.submit(AbstractBTree.java:3184)
          	at com.bigdata.btree.UnisolatedReadWriteIndex.submit(UnisolatedReadWriteIndex.java:648)
          	at com.bigdata.sparse.SparseRowStore.write(SparseRowStore.java:795)
          	at com.bigdata.sparse.SparseRowStore.write(SparseRowStore.java:658)
          	at com.bigdata.sparse.SparseRowStore.write(SparseRowStore.java:559)
          	at com.bigdata.relation.AbstractResource.create(AbstractResource.java:664)
          	at com.bigdata.rdf.spo.SPORelation.create(SPORelation.java:411)
          	at com.bigdata.rdf.store.AbstractTripleStore.create(AbstractTripleStore.java:1645)
          	... 18 more
          

          I strongly suspect that this is exactly what is driving those stochastic errors in the DefaultResourceLocator. Please track down who is issuing those requests. How many namespaces exist? Why are they being created? I do need to understand what the application is doing to diagnose these issues (BLZG-1268).
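
          As an aside, the namespaces registered in the GRS can be listed over the REST API. A rough sketch, assuming a default NanoSparqlServer at localhost:9999 with the /bigdata context path (adjust to the actual deployment):

              import java.io.BufferedReader;
              import java.io.InputStreamReader;
              import java.net.HttpURLConnection;
              import java.net.URL;

              public class ListNamespaces {
                  public static void main(String[] args) throws Exception {
                      // Assumed default endpoint; host, port, and context path vary by deployment.
                      final URL url = new URL("http://localhost:9999/bigdata/namespace");
                      final HttpURLConnection con = (HttpURLConnection) url.openConnection();
                      // Ask for N-Triples so each namespace description prints one triple per line.
                      con.setRequestProperty("Accept", "text/plain");
                      try (BufferedReader in = new BufferedReader(
                              new InputStreamReader(con.getInputStream(), "UTF-8"))) {
                          String line;
                          while ((line = in.readLine()) != null) {
                              System.out.println(line);
                          }
                      }
                  }
              }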

          bryanthompson added a comment

          This is the same execution environment. BLZG-1285 indicates that there are in fact writes on the global row store index, which provides a good explanation of why we have been seeing concurrent modification issues in BLZG-1268.

          stasmalyshev added a comment

          Hmm... this is weird, the app is not supposed to create any new triple stores, it all should be in one namespace. Looks like it's coming from an update query, but I have no idea why it would cause a store creation. I'm pretty sure the update app was the only one accessing the store. I'll see if any requests that it makes can be interpreted this way, e.g. due to wrong URL etc.

          stasmalyshev added a comment

          Restart of the DB seems to bring it back to normal.

          stasmalyshev added a comment

          For more info about the app: the updater is pretty basic. The workflow is as follows:
          1. It downloads the list of recent changes from wikidata.org and extracts a list of entity IDs (looking like Q12345, etc.)
          2. It uses SELECT to extract the current revision number for each entity.
          3. If the number is old, it starts the update sequence, as follows:
          3.1. Extracts some data on dependent values for the entity with SELECT.
          3.2. Runs a series of DELETE queries to remove old data for the entity.
          3.3. Runs an INSERT query that inserts the new data for the entity.
          3.4. Runs another DELETE query to clean up data that may not be in use anymore, like old value/reference nodes.

          All queries use the REST API. Steps 3.2 and 3.3 are done as one big query sent to the API. Steps 3.x are done in parallel (10 threads currently), so that each entity ID is assigned to a thread that does all the operations for it. I can provide the details on the queries used if needed. The query that shows up in the backtrace leading to the store create seems to be one of the queries run in step 3.2.
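
          For illustration, a rough sketch of how such a combined update can be posted to the SPARQL endpoint over the REST API (the "kb" namespace, the entity URI, and the update text are made-up placeholders; the real queries are much larger):

              import java.io.OutputStream;
              import java.net.HttpURLConnection;
              import java.net.URL;
              import java.net.URLEncoder;

              public class PostUpdate {
                  public static void main(String[] args) throws Exception {
                      // Placeholder for the combined step 3.2 + 3.3 query.
                      final String update =
                          "DELETE WHERE { <http://www.wikidata.org/entity/Q12345> ?p ?o };\n"
                        + "INSERT DATA { <http://www.wikidata.org/entity/Q12345> "
                        + "<http://schema.org/version> 42 }";

                      // Assumed endpoint for a namespace named "kb"; adjust to the deployment.
                      final URL url = new URL("http://localhost:9999/bigdata/namespace/kb/sparql");
                      final HttpURLConnection con = (HttpURLConnection) url.openConnection();
                      con.setRequestMethod("POST");
                      con.setDoOutput(true);
                      con.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");

                      // SPARQL 1.1 Protocol: the update goes in the "update" form field.
                      final byte[] body =
                          ("update=" + URLEncoder.encode(update, "UTF-8")).getBytes("UTF-8");
                      try (OutputStream out = con.getOutputStream()) {
                          out.write(body);
                      }
                      System.out.println("HTTP " + con.getResponseCode());
                  }
              }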

          Just occurred to me - can it happen that the VALUES clause causes AbstractTripleStore.create? It seems to come from computeClosureForStatementIdentifiers, which creates some kind of temp store, but I'm not sure what that means.

          bryanthompson added a comment

          Ah. Turn off RDR. It is the truth maintenance for the statements about statements that is creating those temporary files and is probably responsible for those triple store creates against the GRS (I will need to look at how this interaction is happening - I would not have expected it to be registering those temporary triple stores in the GRS).

          com.bigdata.rdf.sail.BigdataSailHelper has some code that you can look at for how to change the properties that are in effect for a triple store. You need to change from sids => triples.

          You can also just edit the code and make it return immediately.

              public IChunkedOrderedIterator<ISPO> computeClosureForStatementIdentifiers(
                      IChunkedOrderedIterator<ISPO> src) {

                  if (!statementIdentifiers || true) { // return immediately.

                      // There will be no statement identifiers unless they were enabled.

                      return src;

                  }
          

          And change your properties file for your deployment to specify triples rather than RDR.
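
          For example, a properties file for a plain triples (no RDR) store might contain the lines below; this is a sketch assuming the stock option names, and note that such options normally apply when a namespace is created (hence the BigdataSailHelper route above for an existing store):

              # Plain triples mode: statement identifiers (SIDs/RDR) disabled.
              com.bigdata.rdf.store.AbstractTripleStore.statementIdentifiers=false
              # Quads also disabled for a plain triples store.
              com.bigdata.rdf.store.AbstractTripleStore.quads=false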

          We will track down the issue against the RDR mode and resolve it. But this will unblock you and will speed up the load significantly.

          stasmalyshev added a comment

          Looks like it is indeed connected to BLZG-1268: after patching out statementIdentifiers, the NPEs seem to be gone - or at least have become much rarer, as I haven't seen one since the last restart of the service. I'll watch for them over the next day or so, but so far it appears that removing the RDR stuff eliminated them.

          bryanthompson added a comment

          The root cause was the RDR truth maintenance logic for the statements about statements during retraction of assertions.


            People

            • Assignee: bryanthompson
            • Reporter: stasmalyshev
            • Votes: 0
            • Watchers: 4
