Details

      Description

      The following query is slow. It should be turned into a DistinctTermAdvancer annotation on the POS access path. We have existing code that does this for inference (materializing sub property of hierarchy) but not yet for SPARQL query. This could be implemented as an AST Optimizer.

      SELECT DISTINCT ?property WHERE { ?x ?property ?y . } LIMIT 1000
      

      See BLZG-1089 (COUNT * should be optimized as fast range count)

        Activity

        Hide
        bryanthompson bryanthompson added a comment -

        I think that I may restrict my ambitions for the distinct-term-scan to the following cases (and their ramifications):

        triples:

        SELECT DISTINCT ?x {?s ?p ?o} (where ?x is either ?s or ?p or ?o)
        

        If any of ?s ?p or ?o are bound for a triples-mode query, then we DO NOT use this rewrite since (a) the pipeline join will be as efficient; and (b) the distinct-term-scan does not bind the other variables so it can not be used to filter them when they are constants.

        quads:

        SELECT DISTINCT ?x {GRAPH ?g {?s ?p ?o}} (where ?x is either ?s or ?p or ?o or ?g)
        

        The quads indices are:

        SPOC
        POCS
        OCSP
        CSPO
        PCSO
        SOPC
        

        Therefore, if the graph is bound then we can not use this without a filter except for the special case where DISTINCT ?s is projected since the CSPO index would work.

        This also implies that the rewrite can not be applied for quads if either FROM or FROM NAMED is specified since the restriction on the set of named graphs is equivalent to ?g being knownBound.

        Note: We MUST NOT use the distinct term scan if there are runtime bindings (either known or maybe bound) for the other variables appearing in the BPG or named graph pattern.

        Note: The only way to use this operator when there are known or maybe bound variables is to explicitly add a JOIN (PipelineJoin) after the operator to verify that at least one solution exists for each binding produced by the distinct term operator. Since the pipeline join will introduce duplicates, the DISTINCT can not longer be dropped from the select. At best we can turn it into

         distinct-scan(?var) => pipleline-join(pred.asBound()) => PROJECT DISTINCT ?var 
        

        This is basically using the distinct-term scan to run the pipeline join with fewer variables.

        Show
        bryanthompson bryanthompson added a comment - I think that I may restrict my ambitions for the distinct-term-scan to the following cases (and their ramifications): triples: SELECT DISTINCT ?x {?s ?p ?o} (where ?x is either ?s or ?p or ?o) If any of ?s ?p or ?o are bound for a triples-mode query, then we DO NOT use this rewrite since (a) the pipeline join will be as efficient; and (b) the distinct-term-scan does not bind the other variables so it can not be used to filter them when they are constants. quads: SELECT DISTINCT ?x {GRAPH ?g {?s ?p ?o}} (where ?x is either ?s or ?p or ?o or ?g) The quads indices are: SPOC POCS OCSP CSPO PCSO SOPC Therefore, if the graph is bound then we can not use this without a filter except for the special case where DISTINCT ?s is projected since the CSPO index would work. This also implies that the rewrite can not be applied for quads if either FROM or FROM NAMED is specified since the restriction on the set of named graphs is equivalent to ?g being knownBound. Note: We MUST NOT use the distinct term scan if there are runtime bindings (either known or maybe bound) for the other variables appearing in the BPG or named graph pattern. Note: The only way to use this operator when there are known or maybe bound variables is to explicitly add a JOIN (PipelineJoin) after the operator to verify that at least one solution exists for each binding produced by the distinct term operator. Since the pipeline join will introduce duplicates, the DISTINCT can not longer be dropped from the select. At best we can turn it into distinct-scan(?var) => pipleline-join(pred.asBound()) => PROJECT DISTINCT ?var This is basically using the distinct-term scan to run the pipeline join with fewer variables.
        Hide
        bryanthompson bryanthompson added a comment -

        Ok. This is running now. I have reasonably complete coverage at the SPARQL QUERY execution layer.

        7b02f404f15d67aa535b389ce6d4da8742d93eb7

        Show
        bryanthompson bryanthompson added a comment - Ok. This is running now. I have reasonably complete coverage at the SPARQL QUERY execution layer. 7b02f404f15d67aa535b389ce6d4da8742d93eb7
        Hide
        bryanthompson bryanthompson added a comment -

        There are 3 new test failures, two of which are duplicates:

        SPARQL test suite:

        com.bigdata.rdf.sail.tck.BigdataComplexSparqlQueryTest.testSES1970CountDistinctWildcard (from com.bigdata.rdf.sail.TestAll)
        
        junit.framework.AssertionFailedError: expected:<3> but was:<4>
        	at org.openrdf.query.parser.sparql.ComplexSPARQLQueryTest.testSES1970CountDistinctWildcard(ComplexSPARQLQueryTest.java:681)
        

        blueprints test suite (2 failures
        - embedded mode and client mode)

        com.bigdata.blueprints.TestBigdataGraphEmbedded.testGraphQueryTestSuite (from com.bigdata.blueprints.TestAll)
        
         Caused by: junit.framework.AssertionFailedError: expected:<5> but was:<3>
        	at com.tinkerpop.blueprints.GraphQueryTestSuite.testGraphQueryForHasOR(GraphQueryTestSuite.java:167)
        
        
        Show
        bryanthompson bryanthompson added a comment - There are 3 new test failures, two of which are duplicates: SPARQL test suite: com.bigdata.rdf.sail.tck.BigdataComplexSparqlQueryTest.testSES1970CountDistinctWildcard (from com.bigdata.rdf.sail.TestAll) junit.framework.AssertionFailedError: expected:<3> but was:<4> at org.openrdf.query.parser.sparql.ComplexSPARQLQueryTest.testSES1970CountDistinctWildcard(ComplexSPARQLQueryTest.java:681) blueprints test suite (2 failures - embedded mode and client mode) com.bigdata.blueprints.TestBigdataGraphEmbedded.testGraphQueryTestSuite (from com.bigdata.blueprints.TestAll) Caused by: junit.framework.AssertionFailedError: expected:<5> but was:<3> at com.tinkerpop.blueprints.GraphQueryTestSuite.testGraphQueryForHasOR(GraphQueryTestSuite.java:167)
        Hide
        bryanthompson bryanthompson added a comment -

        Actually, the BigdataComplexSparqlQueryTest is hitting a problem with the fast-range-count optimizer, not the distinct-term-scan optimizer. Here is the test. It passes if you disable the fast-range-count optimizer.

        	@Test
        	public void testSES1970CountDistinctWildcard()
        		throws Exception
        	{
        		loadTestData("/testdata-query/dataset-ses1970.trig");
        
        		String query = "SELECT (COUNT(DISTINCT *) AS ?c) {?s ?p ?o }";
        
        		TupleQuery tq = conn.prepareTupleQuery(QueryLanguage.SPARQL, query);
        
        		try {
        			TupleQueryResult result = tq.evaluate();
        			assertNotNull(result);
        
        			assertTrue(result.hasNext());
        			BindingSet s = result.next();
        			Literal count = (Literal)s.getValue("c");
        			assertNotNull(count);
        
        			assertEquals(3, count.intValue());
        
        			result.close();
        		}
        		catch (QueryEvaluationException e) {
        			e.printStackTrace();
        			fail(e.getMessage());
        		}
        	}
        
        Show
        bryanthompson bryanthompson added a comment - Actually, the BigdataComplexSparqlQueryTest is hitting a problem with the fast-range-count optimizer, not the distinct-term-scan optimizer. Here is the test. It passes if you disable the fast-range-count optimizer. @Test public void testSES1970CountDistinctWildcard() throws Exception { loadTestData("/testdata-query/dataset-ses1970.trig"); String query = "SELECT (COUNT(DISTINCT *) AS ?c) {?s ?p ?o }"; TupleQuery tq = conn.prepareTupleQuery(QueryLanguage.SPARQL, query); try { TupleQueryResult result = tq.evaluate(); assertNotNull(result); assertTrue(result.hasNext()); BindingSet s = result.next(); Literal count = (Literal)s.getValue("c"); assertNotNull(count); assertEquals(3, count.intValue()); result.close(); } catch (QueryEvaluationException e) { e.printStackTrace(); fail(e.getMessage()); } }
        Hide
        bryanthompson bryanthompson added a comment -

        The issue above has been fixed. A unit test will be written for correct rejection of this case at the AST rewrite layer.

        Show
        bryanthompson bryanthompson added a comment - The issue above has been fixed. A unit test will be written for correct rejection of this case at the AST rewrite layer.
        Hide
        bryanthompson bryanthompson added a comment -

        The blueprints test failure is different. See com.bigdata.blueprints.TestAll. I have verified that the blueprints error is NOT related to the fast-range-count optimizer (by commenting it out in the DefaultOptimizersList). Instead it is related to the distinct-term-scan optimizer.

        @artem: Can you chase down what is actually running and failing for this test?

        Show
        bryanthompson bryanthompson added a comment - The blueprints test failure is different. See com.bigdata.blueprints.TestAll. I have verified that the blueprints error is NOT related to the fast-range-count optimizer (by commenting it out in the DefaultOptimizersList). Instead it is related to the distinct-term-scan optimizer. @artem: Can you chase down what is actually running and failing for this test?
        Hide
        bryanthompson bryanthompson added a comment -

        Artem has implemented a fix for this: 4a6c710583c33e2c8f73ae08774b6330b685e577

        Show
        bryanthompson bryanthompson added a comment - Artem has implemented a fix for this: 4a6c710583c33e2c8f73ae08774b6330b685e577
        Hide
        bryanthompson bryanthompson added a comment -

        I have added system property based query hints that can be used to disable these optimizers in case we run into more edge cases. These query hints can be removed once we have more experience with these optimizers.

            /**
        	 * The name of an property that may be used to enable or disable the
        	 * {@link ASTFastRangeCountOptimizer}.
        	 * 
        	 * @see <a href="http://trac.bigdata.com/ticket/1037" > Rewrite SELECT
        	 *      COUNT(...) (DISTINCT|REDUCED) {single-triple-pattern} as ESTCARD
        	 *      </a>
        	 */
            String FAST_RANGE_COUNT_OPTIMIZER = "fastRangeCountOptimizer";
        
        	boolean DEFAULT_FAST_RANGE_COUNT_OPTIMIZER = Boolean.valueOf(System
        			.getProperty(FAST_RANGE_COUNT_OPTIMIZER, "true"));
            
            /**
        	 * The name of an property that may be used to enable or disable the
        	 * {@link ASTDistinctTermScanOptimizer}.
        	 * 
        	 * @see <a href="http://trac.bigdata.com/ticket/1035" > DISTINCT PREDICATEs
        	 *      query is slow </a>
        	 */
            String DISTINCT_TERM_SCAN_OPTIMIZER = "distinctTermScanOptimizer";
        
        	boolean DEFAULT_DISTINCT_TERM_SCAN_OPTIMIZER = Boolean.valueOf(System
        			.getProperty(DISTINCT_TERM_SCAN_OPTIMIZER, "true"));
        

        Also a bug fix for the distinct-term-scan for constants in triples modes.

        Show
        bryanthompson bryanthompson added a comment - I have added system property based query hints that can be used to disable these optimizers in case we run into more edge cases. These query hints can be removed once we have more experience with these optimizers. /** * The name of an property that may be used to enable or disable the * {@link ASTFastRangeCountOptimizer}. * * @see <a href="http://trac.bigdata.com/ticket/1037" > Rewrite SELECT * COUNT(...) (DISTINCT|REDUCED) {single-triple-pattern} as ESTCARD * </a> */ String FAST_RANGE_COUNT_OPTIMIZER = "fastRangeCountOptimizer"; boolean DEFAULT_FAST_RANGE_COUNT_OPTIMIZER = Boolean.valueOf(System .getProperty(FAST_RANGE_COUNT_OPTIMIZER, "true")); /** * The name of an property that may be used to enable or disable the * {@link ASTDistinctTermScanOptimizer}. * * @see <a href="http://trac.bigdata.com/ticket/1035" > DISTINCT PREDICATEs * query is slow </a> */ String DISTINCT_TERM_SCAN_OPTIMIZER = "distinctTermScanOptimizer"; boolean DEFAULT_DISTINCT_TERM_SCAN_OPTIMIZER = Boolean.valueOf(System .getProperty(DISTINCT_TERM_SCAN_OPTIMIZER, "true")); Also a bug fix for the distinct-term-scan for constants in triples modes.

          People

          • Assignee:
            bryanthompson bryanthompson
            Reporter:
            bryanthompson bryanthompson
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved: