Blazegraph (by SYSTAP)
BLZG-848

Stochastic results with Analytic Query Mode

    Details

      Description

      We have managed to replicate a problem with stochastic results using the analytic query mode and FOAF data. These data are in SVN under the path indicated below. The RWStore.properties file that gets deployed with the WAR was used (bigdata-war/src/resources/RWStore.properties).

      DROP ALL;
      LOAD <file:bigdata-rdf/src/resources/data/foaf/data-0.nq.gz>;
      LOAD <file:bigdata-rdf/src/resources/data/foaf/data-1.nq.gz>;
      LOAD <file:bigdata-rdf/src/resources/data/foaf/data-2.nq.gz>;
      LOAD <file:bigdata-rdf/src/resources/data/foaf/data-3.nq.gz>;
      

      Verify the number of triples:

      select (count(*) as ?c) {?a ?b ?c}
      
      ==> 352678
      

      If you select [x] Explain (to avoid overwhelming the web browser) and then execute the following query, it produces 320766 solutions. This is a stable result.

      PREFIX foaf: <http://xmlns.com/foaf/0.1/>
      SELECT ?x ?z (COUNT(?y) as ?connectionCount)
      WHERE {
         ?x foaf:knows ?y .
         ?y foaf:knows ?z .
         FILTER NOT EXISTS { ?x foaf:knows ?z }
         FILTER ( ?x != ?y )
      } GROUP BY ?x ?z
      HAVING ( ?connectionCount >= 1 )
      

      If you then also select [x] Analytic and rerun the query, it produces a different number of solutions on each run. For example:

      solutions=323105, chunks=1, subqueries=0, elapsed=6718ms.  (mini)
      solutions=323102, chunks=1, subqueries=0, elapsed=6905ms.  (mini)
      solutions=323130, chunks=1, subqueries=0, elapsed=12512ms. (airbook)
      solutions=323162, chunks=1, subqueries=0, elapsed=11220ms. (airbook)
      

      This appears to be the same issue that a customer had been reporting with a proprietary data set.

      If we reduce the query (removing all aggregation operators) then the number of solutions is still stochastic:

      PREFIX foaf: <http://xmlns.com/foaf/0.1/>
      SELECT *
      WHERE {
         ?x foaf:knows ?y .
         ?y foaf:knows ?z .
         FILTER NOT EXISTS { ?x foaf:knows ?z }
         FILTER ( ?x != ?y )
      }
      
      solutions=423302, chunks=4234, subqueries=0, elapsed=8315ms. (airbook)
      solutions=422955, chunks=4231, subqueries=0, elapsed=8338ms. (airbook)
      

      Further reducing the query (removing all filters) provides a stable result:

      PREFIX foaf: <http://xmlns.com/foaf/0.1/>
      SELECT *
      WHERE {
         ?x foaf:knows ?y .
         ?y foaf:knows ?z .
      #   FILTER NOT EXISTS { ?x foaf:knows ?z }
      #   FILTER ( ?x != ?y )
      }
      
      solutions=456525, chunks=4572, subqueries=0, elapsed=3873ms.
      solutions=456525, chunks=4572, subqueries=0, elapsed=3970ms.
      solutions=456525, chunks=4572, subqueries=0, elapsed=3854ms.
      

      If we use just the FILTER ( ?x != ?y ) then the result is stable:

      PREFIX foaf: <http://xmlns.com/foaf/0.1/>
      SELECT *
      WHERE {
         ?x foaf:knows ?y .
         ?y foaf:knows ?z .
      #   FILTER NOT EXISTS { ?x foaf:knows ?z }
         FILTER ( ?x != ?y )
      }
      
      solutions=456288, chunks=4570, subqueries=0, elapsed=4001ms.
      solutions=456288, chunks=4569, subqueries=0, elapsed=4016ms.
      solutions=456288, chunks=4569, subqueries=0, elapsed=3991ms.
      

      If we use just the FILTER NOT EXISTS then the result is stochastic:

      PREFIX foaf: <http://xmlns.com/foaf/0.1/>
      SELECT *
      WHERE {
         ?x foaf:knows ?y .
         ?y foaf:knows ?z .
         FILTER NOT EXISTS { ?x foaf:knows ?z }
      #   FILTER ( ?x != ?y )
      }
      
      solutions=423603, chunks=4237, subqueries=0, elapsed=8094ms.
      solutions=423551, chunks=4237, subqueries=0, elapsed=8121ms.
      solutions=423582, chunks=4237, subqueries=0, elapsed=8464ms.
      

      Thus, this issue appears to be with FILTER NOT EXISTS. It does not appear to be present in the simple FILTER, the simple access paths, or the aggregation code. This reduces the code that we need to review.
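      For reference, FILTER NOT EXISTS over variables shared with the outer group (?x and ?z here) is semantically a hash anti-join: an outer solution survives only if no inner match shares its key. A minimal sketch of that semantics using plain Java collections and a toy graph (not Blazegraph's operators):

```java
import java.util.*;

public class NotExistsAntiJoin {
    public static void main(String[] args) {
        // Toy foaf:knows edges: each pair is (subject, object).
        String[][] knows = { {"a","b"}, {"b","c"}, {"a","c"} };

        // Outer join: ?x knows ?y . ?y knows ?z  =>  solutions (x, y, z).
        List<String[]> outer = new ArrayList<>();
        for (String[] e1 : knows)
            for (String[] e2 : knows)
                if (e1[1].equals(e2[0]))
                    outer.add(new String[] { e1[0], e1[1], e2[1] });

        // Inner keys for NOT EXISTS { ?x knows ?z }: the set of (x, z) pairs.
        Set<String> innerKeys = new HashSet<>();
        for (String[] e : knows)
            innerKeys.add(e[0] + "|" + e[1]);

        // Anti-join: a solution survives only if its (x, z) key has no match.
        int survivors = 0;
        for (String[] sol : outer)
            if (!innerKeys.contains(sol[0] + "|" + sol[2]))
                survivors++;

        // (a, b, c) is the only outer solution; a knows c, so it is filtered out.
        System.out.println(survivors);
    }
}
```

      In a correct engine this count is fully determined by the data; any run-to-run variation points at the anti-join machinery, not at the semantics.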

      The optimized AST for this minimal query is:

      Optimized AST
      
      PREFIX foaf: <http://xmlns.com/foaf/0.1/>
      QueryType: SELECT
      includeInferred=true
      SELECT VarNode(x) VarNode(y) VarNode(z)
        JoinGroupNode {
          StatementPatternNode(VarNode(x), ConstantNode(Vocab(106)[http://xmlns.com/foaf/0.1/knows]), VarNode(y)) [scope=DEFAULT_CONTEXTS]
            com.bigdata.rdf.sparql.ast.eval.AST2BOpBase.estimatedCardinality=24382
            com.bigdata.rdf.sparql.ast.eval.AST2BOpBase.originalIndex=POCS
          StatementPatternNode(VarNode(y), ConstantNode(Vocab(106)[http://xmlns.com/foaf/0.1/knows]), VarNode(z)) [scope=DEFAULT_CONTEXTS]
            com.bigdata.rdf.sparql.ast.eval.AST2BOpBase.estimatedCardinality=24382
            com.bigdata.rdf.sparql.ast.eval.AST2BOpBase.originalIndex=POCS
          QueryType: ASK
          SELECT VarNode(x) VarNode(z) VarNode(-exists-1)[anonymous]
            JoinGroupNode {
              StatementPatternNode(VarNode(x), ConstantNode(Vocab(106)[http://xmlns.com/foaf/0.1/knows]), VarNode(z)) [scope=DEFAULT_CONTEXTS]
                com.bigdata.rdf.sparql.ast.eval.AST2BOpBase.estimatedCardinality=24382
                com.bigdata.rdf.sparql.ast.eval.AST2BOpBase.originalIndex=POCS
            }
          @askVar=-exists-1
          FILTER( com.bigdata.rdf.sparql.ast.NotExistsNode(VarNode(-exists-1))[ com.bigdata.rdf.sparql.ast.FunctionNode.scalarVals=null, com.bigdata.rdf.sparql.ast.FunctionNode.functionURI=http://www.bigdata.com/sparql-1.1-undefined-functionsnot-exists, graphPattern=JoinGroupNode, valueExpr=com.bigdata.rdf.internal.constraints.NotBOp(com.bigdata.rdf.internal.constraints.EBVBOp(-exists-1))] )
        }
      

      This gets turned into the following query plan:

      Query Plan
      
      com.bigdata.bop.solutions.ProjectionOp[17](DropOp[16])[ com.bigdata.bop.BOp.bopId=17, com.bigdata.bop.BOp.evaluationContext=CONTROLLER, com.bigdata.bop.PipelineOp.sharedState=true, com.bigdata.bop.join.JoinAnnotations.select=[y, z|x,], com.bigdata.bop.engine.QueryEngine.queryId=0e8ef44f-6a56-4d26-bf2b-4d901fb0b55e]
      com.bigdata.bop.solutions.DropOp[16](ConditionalRoutingOp[13])[ com.bigdata.bop.BOp.bopId=16, com.bigdata.bop.solutions.DropOp.dropVars=[-exists-1]]
      com.bigdata.bop.bset.ConditionalRoutingOp[13](ChunkedMaterializationOp[15])[ com.bigdata.bop.BOp.bopId=13, com.bigdata.bop.bset.ConditionalRoutingOp.condition=SPARQLConstraint, analytic=true, queryId=0e8ef44f-6a56-4d26-bf2b-4d901fb0b55e]
      com.bigdata.bop.rdf.join.ChunkedMaterializationOp[15](ConditionalRoutingOp[14])[ com.bigdata.bop.rdf.join.ChunkedMaterializationOp.vars=[-exists-1], com.bigdata.bop.IPredicate.relationName=[kb.lex], com.bigdata.bop.IPredicate.timestamp=1383677751361, com.bigdata.bop.PipelineOp.sharedState=true, com.bigdata.bop.BOp.bopId=15]
      com.bigdata.bop.bset.ConditionalRoutingOp[14](HTreeSolutionSetHashJoinOp[12])[ com.bigdata.bop.BOp.bopId=14, com.bigdata.bop.bset.ConditionalRoutingOp.condition=SPARQLConstraint, com.bigdata.bop.PipelineOp.altSinkRef=13, analytic=true, queryId=0e8ef44f-6a56-4d26-bf2b-4d901fb0b55e]
      com.bigdata.bop.join.HTreeSolutionSetHashJoinOp[12](PipelineJoin[11])[ com.bigdata.bop.BOp.bopId=12, com.bigdata.bop.BOp.evaluationContext=CONTROLLER, com.bigdata.bop.PipelineOp.maxParallel=1, com.bigdata.bop.PipelineOp.sharedState=true, class com.bigdata.bop.join.SolutionSetHashJoinOp.release=true, com.bigdata.bop.PipelineOp.lastPass=true, namedSetRef=NamedSolutionSetRef{localName=--set-7,queryId=0e8ef44f-6a56-4d26-bf2b-4d901fb0b55e,joinVars=[z|x,]}]
      com.bigdata.bop.join.PipelineJoin[11](HTreeHashIndexOp[8])[ com.bigdata.bop.BOp.bopId=11, com.bigdata.bop.join.JoinAnnotations.constraints=null, com.bigdata.bop.BOp.evaluationContext=ANY, com.bigdata.bop.join.AccessPathJoinAnnotations.predicate=SPOPredicate[9]]
      com.bigdata.bop.join.HTreeHashIndexOp[8](PipelineJoin[6])[ com.bigdata.bop.BOp.bopId=8, com.bigdata.bop.BOp.evaluationContext=CONTROLLER, com.bigdata.bop.PipelineOp.maxParallel=1, com.bigdata.bop.PipelineOp.lastPass=true, com.bigdata.bop.PipelineOp.sharedState=true, com.bigdata.bop.IPredicate.relationName=[kb.lex], com.bigdata.bop.join.JoinAnnotations.joinType=Exists, com.bigdata.bop.join.HashJoinAnnotations.joinVars=[z|x,], com.bigdata.bop.join.JoinAnnotations.constraints=null, com.bigdata.bop.join.JoinAnnotations.select=null, com.bigdata.bop.join.HashJoinAnnotations.askVar=-exists-1, namedSetRef=NamedSolutionSetRef{localName=--set-7,queryId=0e8ef44f-6a56-4d26-bf2b-4d901fb0b55e,joinVars=[z|x,]}]
      com.bigdata.bop.join.PipelineJoin[6](PipelineJoin[3])[ com.bigdata.bop.BOp.bopId=6, com.bigdata.bop.join.JoinAnnotations.constraints=null, com.bigdata.bop.BOp.evaluationContext=ANY, com.bigdata.bop.join.AccessPathJoinAnnotations.predicate=SPOPredicate[4]]
      com.bigdata.bop.join.PipelineJoin[3]()[ com.bigdata.bop.BOp.bopId=3, com.bigdata.bop.join.JoinAnnotations.constraints=null, com.bigdata.bop.BOp.evaluationContext=ANY, com.bigdata.bop.join.AccessPathJoinAnnotations.predicate=SPOPredicate[1]]

      At this point, we can examine the EXPLAIN for this query and identify the first operator that produces a stochastic number of solutions. That should be the problem operator. I have collected 3 "EXPLAIN"s of this query. Those EXPLAINs are attached to this ticket.

      A: solutions=423582, chunks=4237, subqueries=0, elapsed=8464ms.
      B: solutions=423614, chunks=4237, subqueries=0, elapsed=7917ms.
      C: solutions=423506, chunks=4236, subqueries=0, elapsed=8845ms.
      

      The first operator to show any variation in unitsOut is bopId=12. It is possible that the problem is restricted to the code path involving joinType=Exists.

      HTreeHashJoinUtility{open=false,joinType=Exists,chunkSize=1000,askVar=-exists-1,joinVars=[x, z],size=456525,considered(left=44476,right=59786,joins=507373),joinSetSize=33640},namedSet=--set-7
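      One way an EXISTS-style hash join can lose solutions is if the underlying index silently drops entries whose keys collide — which a later comment identifies as the root cause (the HTree leaf keys coder did not support duplicate keys). A toy illustration in plain Java (not the HTree code) contrasting a dropping index with one that retains all entries per key:

```java
import java.util.*;

public class DuplicateKeyDrop {
    public static void main(String[] args) {
        // Two distinct solutions that happen to share the same hash key.
        String[] solutions = { "sol-A", "sol-B" };
        int key = 42; // collision: same key for both solutions

        // Broken index: duplicate keys are dropped, only the first arrival
        // is kept — which solution survives depends on arrival order.
        Map<Integer, String> dropping = new HashMap<>();
        for (String s : solutions)
            dropping.putIfAbsent(key, s);

        // Correct index: every entry under the key is retained.
        Map<Integer, List<String>> correct = new HashMap<>();
        for (String s : solutions)
            correct.computeIfAbsent(key, k -> new ArrayList<>()).add(s);

        // The dropping index holds one entry, the correct one holds two.
        System.out.println(dropping.size() + " " + correct.get(key).size());
    }
}
```

      Because chunk boundaries and operator scheduling change arrival order between runs, a dropping index turns a deterministic join into a stochastic one — matching the varying unitsOut observed at bopId=12.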
      

      TODO This data set could be used to develop a test suite for stable results using permutations of the query above. In addition, we might want to work in a SPARQL sub-SELECT, OPTIONALs, complex OPTIONALs (more than one triple pattern, or a FILTER that depends on a variable that is only optionally bound by that OPTIONAL), MINUS, EXISTS, NOT EXISTS, and SORT so that we have full coverage of the language features in terms of stable solutions. This test suite could then be run for both the standard and analytic query modes. We do not need to use "explain" in the test suite; it is sufficient to count the solutions.
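      The counting-based stability check proposed above could be sketched as follows (the names here are hypothetical, not an existing Blazegraph test API; a real test would supply a closure that executes the query against the engine and returns the solution count):

```java
import java.util.function.Supplier;

// Sketch: run each query several times and assert the solution count does
// not vary. Counting solutions is sufficient; no EXPLAIN is required.
public class StableResultsTest {
    static void assertStable(Supplier<Long> countSolutions, int runs) {
        final long expected = countSolutions.get();
        for (int i = 1; i < runs; i++) {
            final long actual = countSolutions.get();
            if (actual != expected)
                throw new AssertionError(
                        "run " + i + ": " + actual + " != " + expected);
        }
    }

    public static void main(String[] args) {
        // Stand-in for executing a SPARQL query; the real supplier would run
        // the query in standard or analytic mode and count its solutions.
        Supplier<Long> stableQuery = () -> 456525L;
        assertStable(stableQuery, 3);
        System.out.println("stable");
    }
}
```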

      TODO Check to see if a (complex) OPTIONAL for the same graph pattern would fail in the ANALYTIC query mode. EXISTS and OPTIONAL share a lot of code. But the OPTIONAL needs to be more than one triple pattern in order to get handled by pushing down a join group. Simple optionals are handled directly by the JOIN operators.

      This ticket is causing me to look at the following tickets as well:


      - See http://sourceforge.net/apps/trac/bigdata/ticket/483 (Eliminate unnecessary dechunking and rechunking)
      - See http://sourceforge.net/apps/trac/bigdata/ticket/791 (Clean up query hints)

      These issues were dragged in by some API changes to IHashJoinUtility#hashJoin2() that were made to improve the Explain view of the query and gain more insight into how the hash join result was incorrect in the analytic query mode. I wound up cleaning up some unnecessary dechunking and rechunking in HTreeHashJoinUtility#hashJoin2(). In order to check the performance impact of that change, I needed to control the vectoring into the HTree operators from a query hint, which is when I found out that there was a problem with the propagation of the ChunkSizeHint.

        Activity

        bryanthompson bryanthompson added a comment -

        We have performed a code review of the change set for this fix. The next steps based on this code review are:

        • Reconcile TestDuplicates (see the committed version for my changes).
        • Use attached version of CustomByteArrayFrontCodedList with comments from code review.
        • Modify CustomByteArrayFrontCodedList to accept a boolean hasDuplicates in its constructors.
        • Modify the FrontCodedRabaCoder to pass the isDuplicateKeys() value into the CustomByteArrayFrontCodedList constructor. There are two different constructor invocations.
          final CustomByteArrayFrontCodedList decoder = new CustomByteArrayFrontCodedList(
                  raba.iterator(), ratio);
          
          and
          
          this.decoder = new CustomByteArrayFrontCodedList(size, ratio,
                  data.array(), data.off() + O_DATA, data.len());
          
        • Remove the search/2 method. Just consult hasDuplicates:boolean.
        • Add IRaba.isDuplicateKeys() after isKeys(). This provides self reporting. All existing implementations (other than the Htree ones) report false. No new data, so no serialization change.
        • Add FrontCodedRabaCoderWithDups in the htree.raba package space as a subclass of FrontCodedRabaCoder and make sure that isDuplicateKeys() returns true. Define an inner class named DefaultFrontCodedRabaCoderWithDups. This will be used by IndexMetadata.
        • Modify htree.NodeSerializer to verify that the leafCoder is an IRabaCoder whose isDuplicateKeys() returns true. This provides a sanity check for the htree initialization. Make this a non-conditional test (not "assert").
        • Modify IndexMetadata to explicitly test for the Htree index type. This will change the default behavior for the Htree.
          Old:
          
          final IRabaCoder leafKeysCoder = newInstance(getProperty(
                  indexManager, properties, namespace,
                  Options.LEAF_KEYS_CODER, DefaultFrontCodedRabaCoder.class
                          .getName()), IRabaCoder.class);
          
          New:
          
          final IRabaCoder leafKeysCoder = newInstance(getProperty(
                  indexManager, properties, namespace,
                  Options.LEAF_KEYS_CODER, (indexType == Htree
                          ? DefaultFrontCodedRabaCoderWithDups.class
                          : DefaultFrontCodedRabaCoder.class)
                          .getName()), IRabaCoder.class);
          
        • Modify HTreeHashJoinUtility.getIndexMetadata()
          Old:
          
          @SuppressWarnings("rawtypes")
          final ITupleSerializer<?, ?> tupleSer = new DefaultTupleSerializer(
                  new ASCIIKeyBuilderFactory(Bytes.SIZEOF_INT),
                  new FrontCodedRabaCoder(ratio), // keys : TODO Optimize for int32!
                  new SimpleRabaCoder() // vals
                  );
          
          New:
          
          @SuppressWarnings("rawtypes")
          final ITupleSerializer<?, ?> tupleSer = new DefaultTupleSerializer(
                  new ASCIIKeyBuilderFactory(Bytes.SIZEOF_INT),
                  new FrontCodedRabaCoderWithDups(ratio), // keys : TODO Optimize for int32!
                  new SimpleRabaCoder() // vals
                  );
          
        • Once this is ready, re-run bigdata-perf/CI/govtrack. There is an ant script to set up and run this. Look for any variability in the #of solutions for the govtrack queries. Each query will run 3 times. There is one query that reports ERR; this is because the HTTP GET cannot transmit the request (it is larger than the URL payload limit and would have to be a POST).
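        The shape of the proposed FrontCodedRabaCoderWithDups subclass might look like the following (simplified stand-ins, not the real Blazegraph classes; the only behavioral change over the base coder is that isDuplicateKeys() reports true):

```java
public class FrontCodedRabaCoderSketch {
    // Stand-in for com.bigdata.btree.raba.codec.FrontCodedRabaCoder.
    static class FrontCodedRabaCoder implements java.io.Serializable {
        private static final long serialVersionUID = 1L;
        protected final int ratio;
        FrontCodedRabaCoder(final int ratio) { this.ratio = ratio; }
        // B+Tree leaves never hold duplicate keys, so the default is false.
        public boolean isDuplicateKeys() { return false; }
    }

    // Hypothetical subclass for the htree.raba package: same front-coded
    // representation, but self-reports support for duplicate keys.
    static class FrontCodedRabaCoderWithDups extends FrontCodedRabaCoder {
        private static final long serialVersionUID = 1L; // explicit, not implicit
        FrontCodedRabaCoderWithDups(final int ratio) { super(ratio); }
        @Override public boolean isDuplicateKeys() { return true; }
    }

    public static void main(String[] args) {
        System.out.println(new FrontCodedRabaCoder(8).isDuplicateKeys()
                + " " + new FrontCodedRabaCoderWithDups(8).isDuplicateKeys());
    }
}
```

        With IRaba/IRabaCoder self-reporting in place, NodeSerializer can reject an HTree configured with a coder that would collapse duplicate keys, instead of failing stochastically at query time.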
        bryanthompson bryanthompson added a comment -

        This incorporates a fix for the stochastic behavior of the analytic query mode whose root cause was the failure of the leaf keys coder for the HTree to support duplicate keys. Committed to CI for feedback. I will also re-run the govtrack queries to establish a new baseline and assess the performance impact relative to the last released code.

        Committed revision r7830.

        bryanthompson bryanthompson added a comment -

        When trying to run the govtrack test suite, I observed that we have a serialization problem with the FixedLengthValueRabaCoder.

           [java] Caused by: java.io.InvalidClassException: com.bigdata.btree.raba.codec.FixedLengthValueRabaCoder; local class incompatible: stream classdesc serialVersionUID = 5549200745262968226, local class serialVersionUID = -4802583518071989794
        

        The fix assigns an explicit serialVersionUID to the FixedLengthValueRabaCoder.

        Committed revision r7831.
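        The failure mode and the fix can be illustrated with a self-contained round-trip (toy class, not the real coder): pinning an explicit serialVersionUID keeps the stream identity stable, so future changes to the class or its interfaces do not trigger InvalidClassException on deserialization.

```java
import java.io.*;

public class SerialVersionDemo {
    static class Coder implements Serializable {
        // Pinned explicitly; without this, the JVM derives the UID from the
        // class shape, and any change breaks old serialized streams.
        private static final long serialVersionUID = 1L;
        final int ratio;
        Coder(final int ratio) { this.ratio = ratio; }
    }

    public static void main(String[] args) throws Exception {
        // Serialize, then deserialize from the same bytes.
        final ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(new Coder(8));
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            final Coder c = (Coder) ois.readObject();
            System.out.println(c.ratio);
        }
    }
}
```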

        bryanthompson bryanthompson added a comment -

        A review of the IRabaCoder implementations revealed that the new FrontCodedRabaCoderDupKeys class lacked an explicit serialVersionUID field. That is fixed in this commit. This will prevent problems similar to the above if there are future changes to the IRabaCoder interface.

        Committed revision r7832.

        bryanthompson bryanthompson added a comment -

        Analytic query performance results:

        64.8958	Minutes r7675
        73.8409	Minutes r7841
        

        Note: Query is completely stable. There is some performance regression, but it may be within the variance (for example, it is the same as in 1.2.3). Also note that we are producing more solutions than we might have historically since some solutions associated with duplicate keys were being dropped.


          People

          • Assignee: bryanthompson
          • Reporter: bryanthompson
          • Votes: 0
          • Watchers: 3