  1. Blazegraph (by SYSTAP)
  2. BLZG-1346

DistinctTermScanOp is not retrieving all data

    Details

      Description

      When loading the attached TTL file (containing 6 people) into a stock Blazegraph 1.5.1 instance (standalone), a SELECT DISTINCT query fails to retrieve all distinct people. The correct results are returned when the DISTINCT keyword is removed or when a second clause is added to the query.

      # only returns 3 of the 6 people
      SELECT DISTINCT ?People WHERE { ?People a <http://semoss.org/ontologies/Person> }

      # returns the correct result
      SELECT ?People WHERE { ?People a <http://semoss.org/ontologies/Person> }

      # also returns the correct result, but includes the DISTINCT keyword
      SELECT DISTINCT ?People WHERE { ?People a <http://semoss.org/ontologies/Person> . ?People ?p ?o . }

      I've attached my solutions log as well.

      1. icebreaker.ttl
        3 kB
        Ryan Bobko
      2. solutions.csv
        12 kB
        Ryan Bobko

        Activity

        ry99 Ryan Bobko created issue -
        bryanthompson bryanthompson made changes -
        Field Original Value New Value
        Assignee bryanthompson [ bryanthompson ] michaelschmidt [ michaelschmidt ]
        bryanthompson bryanthompson added a comment -

        Ryan wrote: [snip] it seems pretty clear that the DistinctTermScanOp is missing half of the data that the PipelineJoin is getting. Adding the extra clause seems to change the DistinctTermScanOp to a PipelineJoin, which produces the correct results.

        Based on this, I would attribute the problem to the distinct term scan optimizations.

        michaelschmidt michaelschmidt added a comment -

        The problem still exists in the current RC. Interestingly, the issue only shows up in triples mode (i.e., in quads mode all three queries return six results as expected). To me, the rewriting looks good; this is a clear case where the DISTINCT term scan optimization should work. The rewritten AST is as follows:

        QueryType: SELECT
        includeInferred=true
        SELECT ( VarNode(People) AS VarNode(People) )
          JoinGroupNode {
            StatementPatternNode(VarNode(People), ConstantNode(Vocab(14)[http://www.w3.org/1999/02/22-rdf-syntax-ns#type]), ConstantNode(TermId(2U)[http://semoss.org/ontologies/Person])) [scope=DEFAULT_CONTEXTS] [distinctTermScan=VarNode(People)]
              queryHints={com.bigdata.bop.IPredicate.keyOrder=POS}
              AST2BOpBase.estimatedCardinality=0
              AST2BOpBase.originalIndex=POS
          } AST2BOpBase.estimatedCardinality=6
         

        However, for some reason the DistinctTermScanOp only returns three rather than six results, so this seems to be a problem in the operator implementation.

        I also added the test queries to CI; they are currently commented out (see the TestTickets class). They should be enabled once we tackle this issue.

        beebs Brad Bebee made changes -
        Workflow Trac Import v4 [ 15840 ] Trac Import v5 [ 16026 ]
        michaelschmidt michaelschmidt added a comment -

        The problem shows up if, in triples or quads mode, the maximum number of constants is bound, namely:

        • in triples mode we have two constants bound + the distinct variable
        • in quads mode we have three constants bound + the distinct variable

        Not sure what exactly the problem is in that case. The DistinctMultiTermAdvancer computes the (arguably) right key, but src.seek(toKey) advances the cursor to the position after the next one, so that every second result is skipped. I need more time to investigate in detail what is going on here.
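        The symptom described above can be reproduced in a toy model. The following is a minimal sketch only, not Blazegraph's code: `distinct_scan`, the integer term ids, and the `double_advance` flag are all hypothetical, and the root cause in the real operator has not been established. It models the POS index as a sorted list of integer tuples and shows how one extra cursor step after each seek would skip every second entry when each distinct value occupies exactly one key.

```python
from bisect import bisect_left


def distinct_scan(keys, prefix_len, double_advance=False):
    """Return the distinct key prefixes of length prefix_len.

    keys is a sorted list of equal-length tuples of ints (a stand-in
    for a sorted index). After emitting a prefix, the scan seeks to the
    first key that is >= the successor of that prefix.

    double_advance=True simulates a HYPOTHETICAL fault: the cursor takes
    one extra step after the seek. This is an assumption made for
    illustration, not a confirmed diagnosis.
    """
    out = []
    i = 0
    while i < len(keys):
        prefix = keys[i][:prefix_len]
        out.append(prefix)
        # Successor of the prefix: increment its last component.
        to_key = prefix[:-1] + (prefix[-1] + 1,)
        # Seek to the first key >= to_key, never moving backwards.
        i = bisect_left(keys, to_key, lo=i)
        if double_advance:
            i += 1  # simulated extra step past the sought position
    return out


# Six subjects sharing one predicate and one object (P=14, O=2 are
# made-up ids): the maximum number of constants is bound.
pos_index = [(14, 2, s) for s in range(101, 107)]

print(len(distinct_scan(pos_index, 3)))                       # 6
print(len(distinct_scan(pos_index, 3, double_advance=True)))  # 3
```

        In this toy model the extra step goes unnoticed when each distinct value spans two or more keys, because the cursor still lands inside the next group. That is at least consistent with the failure appearing only when the maximum number of constants is bound, but it remains a guess.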

        For now, the fix is to disable the distinct term scan optimization in cases where the maximum number of constants is bound. I implemented this in ASTDistinctTermScanOptimizer; see the comment in method getCandidateKeyOrders. Once the root cause is fixed, this code snippet can be removed again. I also added test cases for both triples and quads mode.
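        The condition behind this workaround can be written down as a small predicate. This is a sketch under assumptions, not the code in ASTDistinctTermScanOptimizer: the function name `allow_distinct_term_scan` and the convention of using None for a variable position are hypothetical.

```python
def allow_distinct_term_scan(pattern, quads):
    """Decide whether the distinct term scan rewrite may be applied.

    pattern is a tuple of terms in (s, p, o) or (s, p, o, c) order,
    where None marks a variable and anything else is a bound constant.
    The rewrite is refused when the maximum number of constants is
    bound, i.e. every position except the distinct variable.
    (Hypothetical helper, written only to illustrate the rule.)
    """
    arity = 4 if quads else 3
    if len(pattern) != arity:
        raise ValueError(f"expected {arity} terms, got {len(pattern)}")
    constants = sum(1 for term in pattern if term is not None)
    return constants < arity - 1


# Triples mode, two constants bound: the rewrite is skipped.
print(allow_distinct_term_scan((None, "rdf:type", "Person"), quads=False))
# Quads mode, same pattern with an unbound graph: the rewrite is allowed.
print(allow_distinct_term_scan((None, "rdf:type", "Person", None), quads=True))
```

        Under this rule the query from the description falls back to a regular join in triples mode, while the same query in quads mode (graph unbound) still qualifies for the optimization.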

        Fixed in branch blzg1346. CI is running; I will merge it down if this succeeds. If CI passes, I'll close this one and open a follow-up ticket.

        michaelschmidt michaelschmidt made changes -
        Status Open [ 1 ] Accepted [ 10101 ]
        michaelschmidt michaelschmidt made changes -
        Status Accepted [ 10101 ] In Progress [ 3 ]
        michaelschmidt michaelschmidt added a comment -

        Merged into master.

        michaelschmidt michaelschmidt made changes -
        Status In Progress [ 3 ] Resolved [ 5 ]
        michaelschmidt michaelschmidt made changes -
        Status Resolved [ 5 ] In Review [ 10100 ]
        michaelschmidt michaelschmidt made changes -
        Resolution Done [ 10000 ]
        Status In Review [ 10100 ] Done [ 10000 ]
        beebs Brad Bebee made changes -
        Workflow Trac Import v5 [ 16026 ] Trac Import v6 [ 18388 ]
        michaelschmidt michaelschmidt made changes -
        Fix Version/s BLAZEGRAPH_RELEASE_1_5_2 [ 10164 ]
        beebs Brad Bebee made changes -
        Workflow Trac Import v6 [ 18388 ] Trac Import v7 [ 19794 ]
        beebs Brad Bebee made changes -
        Workflow Trac Import v7 [ 19794 ] Trac Import v8 [ 21426 ]

          People

          • Assignee:
            michaelschmidt michaelschmidt
            Reporter:
            ry99 Ryan Bobko
          • Votes:
            0
            Watchers:
            3

            Dates

            • Created:
              Updated:
              Resolved: