Uploaded image for project: 'Blazegraph (by SYSTAP)'
  1. Blazegraph (by SYSTAP)
  2. BLZG-1081

very rare NotMaterializedException: XSDBoolean(true)

    Details

      Description

      When running a soak test with characteristics to follow, I get a very hard to understand error, very rarely, and in conditions I have failed to replicate other than in my own test harness (that is testing my own code, not bigdata).

      My code interacts with bigdata only through the http interface to the NSS. Enabling logging on bigdata makes the problem vanish, but I can reproduce the problem with logging in my code. In particular enabling the ASTEvalHelper log makes the problem disappear.

      Attached are 6 logs:
      - a log of the NSS being basically the stdout
      - showing two stack traces, one at approx 14:00:29, which is the one for which I had the other logging enabled
      - five logs from my code, one for each of five different namespaces being used in the period 21:00:28 to 21:00:29 (note the 7hour time zone difference)

      The update that failed is in sparqlu2iDMc.log.part

      I note that the error concerns a boolean(true) but there are no such values in any of the logs. There were some boolean(true)s being used in earlier completed queries and updates; and I would expect some boolean(true) values to be in the triple store.

      The version of bigdata I was running is 1.3.2 + five additional commits and patches as agreed with Systap, in particular a patch fixing 1026.

      The test itself has the following characteristics.
      Every forty minutes there is a new round of tests.
      There are five concurrent parts to the test, each of which is identical.
      Each part creates a namespaces, does some operations, maybe taking 15 minutes over a 35 minute period, and then deletes the namespace.
      The namespace names are reused, not on every round, but ... I have 15 namespace names, and at any time 5 are in use. Each part logs all the queries and updates it is sending in several separate log files.

      Typically each query involves resources that start with a URI http://localsyapsehost:NNNNN/ where the number NNNNN is assigned by jenkins differently to each of the five parts, hence it is easy to tell the queries from each namespace apart.

      I have not seen the problem in the first round of testing (i.e. the first 40 minutes), and I believe it requires the reuse of namespace names to be seen; on the other hand, reusing a namespace name does not guarantee an issue. It typically takes 3 or 4 hours to get a single fault. Staggering the parts by one minute also seems to make the problem go away.
      I have seen the error report occur in SELECT queries as well as UPDATEs (this particular instance is an update).
      The error always seem to occur in only two parts of my test suite, this is one of them.

      I have longer logs but not complete logs of all the operations since the beginning of the journal file.

      1. bigdata.log
        70 kB
        jeremycarroll
      2. bigdata-extended-exception.log
        95 kB
        jeremycarroll
      3. bigdata-showing-call-stack-setting-value-to-null.log
        66 kB
        jeremycarroll
      4. sparqllkOveN.log.part
        26 kB
        jeremycarroll
      5. sparqlN5zQw4.log.part
        29 kB
        jeremycarroll
      6. sparqlu2iDMc.log.part
        28 kB
        jeremycarroll
      7. sparqlz1CfWH.log.part
        113 kB
        jeremycarroll
      8. sparqlZoZaiI.log.part
        135 kB
        jeremycarroll

        Activity

        Hide
        jeremycarroll jeremycarroll added a comment -

        I modified the bigdata code being used in the Syapse test harness that reproduces this as per
        https://github.com/jeremycarroll/bigdata/commit/456d4727437299d0075cbdbc2f2022b0a553f889

        This creates an exception at the point where a null value in the cache of a BooleanIV is made, that can later result in a non-materialized exception, and reports that exception, with its original stack trace, as the cause of the non-materialized exception.

        The log entry is attached and shows that the problem is here:

        com.bigdata.bop.rdf.join.ChunkedMaterializationOp.getBindingSet(ChunkedMaterializationOp.java:518)

        with value = null, where the iv can be cleared in error, since a LiteralIV returns false for needsMaterialization.

        I believe this code is being called as a result of a cancel request; but I am not sure.

        Hence the fix is a simple change to this method.

        Show
        jeremycarroll jeremycarroll added a comment - I modified the bigdata code being used in the Syapse test harness that reproduces this as per https://github.com/jeremycarroll/bigdata/commit/456d4727437299d0075cbdbc2f2022b0a553f889 This creates an exception at the point where a null value in the cache of a BooleanIV is made, that can later result in a non-materialized exception, and reports that exception, with its original stack trace, as the cause of the non-materialized exception. The log entry is attached and shows that the problem is here: com.bigdata.bop.rdf.join.ChunkedMaterializationOp.getBindingSet(ChunkedMaterializationOp.java:518) with value = null, where the iv can be cleared in error, since a LiteralIV returns false for needsMaterialization. I believe this code is being called as a result of a cancel request; but I am not sure. Hence the fix is a simple change to this method.
        Hide
        bryanthompson bryanthompson added a comment -

        I am rolling back this change so we can make a 1.3.3 release. Changes are at ChunkedMaterializationOp.javaBLZG-133

        New code (rolled back).

                        if (value == null && iv.needsMaterialization()) {
        
                            throw new RuntimeException("Could not resolve: iv=" + iv);
        
                        }
        
                        /*
                         * Replace the binding.
                         * 
                         * FIXME This probably needs to strip out the
                         * BigdataSail#NULL_GRAPH since that should not become bound.
                         */
                        ((IV) iv).setValue(value);
        
        

        Old code (restored):

                        if (value == null ) {
                        	
                        	if ( iv.needsMaterialization() ) {
        
                                throw new RuntimeException("Could not resolve: iv=" + iv);
                            
                        	} // else do nothing, e.g. for a literal - DO NOT remove the binding, trac1028.
        
                        } else {
        
                        	/*
                        	 * Replace the binding.
                        	 * 
                        	 * FIXME This probably needs to strip out the
                        	 * BigdataSail#NULL_GRAPH since that should not become bound.
                        	 */
                        	((IV) iv).setValue(value);
                        }
        
        

        Committed revision r8691.

        Show
        bryanthompson bryanthompson added a comment - I am rolling back this change so we can make a 1.3.3 release. Changes are at ChunkedMaterializationOp.javaBLZG-133 New code (rolled back). if (value == null && iv.needsMaterialization()) { throw new RuntimeException("Could not resolve: iv=" + iv); } /* * Replace the binding. * * FIXME This probably needs to strip out the * BigdataSail#NULL_GRAPH since that should not become bound. */ ((IV) iv).setValue(value); Old code (restored): if (value == null ) { if ( iv.needsMaterialization() ) { throw new RuntimeException("Could not resolve: iv=" + iv); } // else do nothing, e.g. for a literal - DO NOT remove the binding, trac1028. } else { /* * Replace the binding. * * FIXME This probably needs to strip out the * BigdataSail#NULL_GRAPH since that should not become bound. */ ((IV) iv).setValue(value); } Committed revision r8691.
        Hide
        bryanthompson bryanthompson added a comment -

        0. The code in the ChunkedMaterializationOp (per above) is clearly broken. In fact, there are two code paths in that method and both of them have the same error.

        1. The ChunkedMaterializationOp is being invoked with materializeInlineIVs:=true. Why?

        2. This method is also invoked from the ChunkedMaterializationIterator.

        3. The runtime error shows up when there is an attempt to access the IVCache which was incorrectly cleared in the code.

        Show
        bryanthompson bryanthompson added a comment - 0. The code in the ChunkedMaterializationOp (per above) is clearly broken. In fact, there are two code paths in that method and both of them have the same error. 1. The ChunkedMaterializationOp is being invoked with materializeInlineIVs:=true. Why? 2. This method is also invoked from the ChunkedMaterializationIterator. 3. The runtime error shows up when there is an attempt to access the IVCache which was incorrectly cleared in the code.
        Hide
        jeremycarroll jeremycarroll added a comment -

        Attachment bigdata-showing-call-stack-setting-value-to-null.log has been added with description: log where setting the cached value to null raised exception

        Show
        jeremycarroll jeremycarroll added a comment - Attachment bigdata-showing-call-stack-setting-value-to-null.log has been added with description: log where setting the cached value to null raised exception
        Hide
        jeremycarroll jeremycarroll added a comment -

        In the attachment bigdata-showing-call-stack-setting-value-to-null.log this was ane earlier trace where the code that was running throw an exception immediately whenever an BooleanIV had a setValue call with a null value. The fact this was happening persuaded me that the issue here is that for some reason we were occasionally setting the value to null, and then we could not materialize it later

        Show
        jeremycarroll jeremycarroll added a comment - In the attachment bigdata-showing-call-stack-setting-value-to-null.log this was ane earlier trace where the code that was running throw an exception immediately whenever an BooleanIV had a setValue call with a null value. The fact this was happening persuaded me that the issue here is that for some reason we were occasionally setting the value to null, and then we could not materialize it later
        Hide
        bryanthompson bryanthompson added a comment -

        commit fix to git: 6af52301c44b8b87c8a7daf69bc7250c95be0b0a

        commit fix to svn: Committed revision r8695.

        Show
        bryanthompson bryanthompson added a comment - commit fix to git: 6af52301c44b8b87c8a7daf69bc7250c95be0b0a commit fix to svn: Committed revision r8695.

          People

          • Assignee:
            thompsonbry thompsonbry
            Reporter:
            jeremycarroll jeremycarroll
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved: