Details

      Description

      The following query is doing a full scan on

      select * {<<?s ?p ?o>> ?p1 ?o1 } LIMIT 10
      

      If you look at the EXPLAIN of this query, you can see that it visited ALL statements on SPOPredicate[1](?src, ?p2, ?tgt, ?-anon-2 then did the (pipeline) join with SPOPredicate[4](?-sid-1, ?p, ?o, ?-anon-5).

      The right way to run this query is to do a scan of the SID region of the SPO index and then unpack the <<?s ?p ?o>> from the subject position of the visited triples. There is no other join.

      The RDR syntax <<x y z>> p o implies that <<x y z>> must exist (in a query) and asserts that statement (when parsing data). An interpretation that does not respect this might be the root of the problem.

        Activity

        Hide
        bryan b. thompson bryan b. thompson added a comment -

        Attachment RDR-query-explain.html has been added with description: Explain of the query for this ticket.

        Show
        bryan b. thompson bryan b. thompson added a comment - Attachment RDR-query-explain.html has been added with description: Explain of the query for this ticket.
        Hide
        bryan b. thompson bryan b. thompson added a comment -

        The problem with this specific query is that we have 5M statements in the database. 1/2 of them are link attributes. The other half is the ground triple for the link. The query is being executed with what amounts to two fully unbound triple patterns, so it begins with a full scan of 2.5M statements before it finds the first statement that can join with the 2nd triple pattern.

        There are two ways to fix this:

        1. The two statement patterns could be combined as a single statement pattern which uses a prefix constraint to only visit the SIDs portion of the SPO index. The Subject of the visited statements will be a SID. We can then decompose that SID, obtaining bindings for its embedded s, p, and o. (Actually, this is not correct since it fails to verify the existence of the ground triple and the RDR syntax requires that we do this.)

        2. The second statement pattern can have a key-range constraint that is lifted onto the access path and the optimizer can notice this key-range constraint. The constraint is that the S variable MUST be a SID, so it will have the appropriate prefix bytes. If the optimizer notices this constraint and the range counts are collected with the constraint applied, then the JOINs will be executed in the other order and the query will be fast (because each statement visited on the 1st join will have a SID for its Subject position and the nexted index join for the 2nd triple pattern will always succeed).

        In fact, both of this really rely on recognizing the key-range constraint that can be lifted onto the access path. (1) just goes a step further and eliminates one join.

        (1) is not valid. The RDR syntax for data asserts the existence of the ground triple. The RDR syntax for query requires the presence of the ground triple. Therefore we really do need to run both joins. Hence, I think that (2) is correct while (1) is not.

        We have added some unit tests that help to analyze the problem, but the basic performance issue remains.

        Show
        bryan b. thompson bryan b. thompson added a comment - The problem with this specific query is that we have 5M statements in the database. 1/2 of them are link attributes. The other half is the ground triple for the link. The query is being executed with what amounts to two fully unbound triple patterns, so it begins with a full scan of 2.5M statements before it finds the first statement that can join with the 2nd triple pattern. There are two ways to fix this: 1. The two statement patterns could be combined as a single statement pattern which uses a prefix constraint to only visit the SIDs portion of the SPO index. The Subject of the visited statements will be a SID. We can then decompose that SID, obtaining bindings for its embedded s, p, and o. (Actually, this is not correct since it fails to verify the existence of the ground triple and the RDR syntax requires that we do this.) 2. The second statement pattern can have a key-range constraint that is lifted onto the access path and the optimizer can notice this key-range constraint. The constraint is that the S variable MUST be a SID, so it will have the appropriate prefix bytes. If the optimizer notices this constraint and the range counts are collected with the constraint applied, then the JOINs will be executed in the other order and the query will be fast (because each statement visited on the 1st join will have a SID for its Subject position and the nexted index join for the 2nd triple pattern will always succeed). In fact, both of this really rely on recognizing the key-range constraint that can be lifted onto the access path. (1) just goes a step further and eliminates one join. (1) is not valid. The RDR syntax for data asserts the existence of the ground triple. The RDR syntax for query requires the presence of the ground triple. Therefore we really do need to run both joins. Hence, I think that (2) is correct while (1) is not. We have added some unit tests that help to analyze the problem, but the basic performance issue remains.
        Hide
        bryan b. thompson bryan b. thompson added a comment -
        Show
        bryan b. thompson bryan b. thompson added a comment - See http://trac.bigdata.com/ticket/526 (RDR)
        Hide
        bryanthompson bryanthompson added a comment -

        Mike and I have talked this through. It appears that we need to modify
        the SPOPredicate.asBound() method to unify the SID variable (when present)
        and the S,P,O and C (if in quads mode) positions on the SPO. Since this
        unification can fail, we have decided to modify the contract for asBound()
        to allow it to return null if the unification does not succeed. Mike is
        going to handle this.

        I am going modify the PipelineJoin code to handle the case where
        SPOPredicate.asBound() returns null. In this case, the incoming binding
        set simply fails the join. The logic needs to be updated to let that
        happen naturally.

        Show
        bryanthompson bryanthompson added a comment - Mike and I have talked this through. It appears that we need to modify the SPOPredicate.asBound() method to unify the SID variable (when present) and the S,P,O and C (if in quads mode) positions on the SPO. Since this unification can fail, we have decided to modify the contract for asBound() to allow it to return null if the unification does not succeed. Mike is going to handle this. I am going modify the PipelineJoin code to handle the case where SPOPredicate.asBound() returns null. In this case, the incoming binding set simply fails the join. The logic needs to be updated to let that happen naturally.
        Hide
        bryanthompson bryanthompson added a comment -

        IPredicate.asBound()
        - javadoc. now allows a null return.

        BOpContext.bind()
        - the bind() version associated with the IElement based PipelineJoin code path has been deprecated.

        PipelineJoin:


        - the handleJoin() method has been deprecated. It is associated with the IElement process rather than "solutions" based AP reads.


        - the code in the BindingSetConsumerTask has been modified to accomodate the possible null return from SPOPreciate.asBound() when the SPOPredicate is associated with an RDR access path (the SID variable is defined). Incoming binding sets that cause SPOPredicate.asBound() to fail will cause the join to fail for that source solution. For these cases, the join for that source solution is simply not run (we do not create the associated AccessPathTest).

        Committed revision r7868.

        Show
        bryanthompson bryanthompson added a comment - IPredicate.asBound() - javadoc. now allows a null return. BOpContext.bind() - the bind() version associated with the IElement based PipelineJoin code path has been deprecated. PipelineJoin: - the handleJoin() method has been deprecated. It is associated with the IElement process rather than "solutions" based AP reads. - the code in the BindingSetConsumerTask has been modified to accomodate the possible null return from SPOPreciate.asBound() when the SPOPredicate is associated with an RDR access path (the SID variable is defined). Incoming binding sets that cause SPOPredicate.asBound() to fail will cause the join to fail for that source solution. For these cases, the join for that source solution is simply not run (we do not create the associated AccessPathTest). Committed revision r7868.
        Hide
        mikepersonick mikepersonick added a comment -

        Fixed SPOPredicate asBound to deal with the case where the sid var is bound in the incoming solution. It will do some consistency checking (filters out solutions where the sid bound to sidVar is either not a sid at all or has values that do not match the constants in the SPOPredicate). It will then bind the values from the incoming sid onto any variables defined by the SPOPredicate before doing the asBound. This will give us the point tests to make sure the statements are ground.

        Also changed the StaticOptimizer to pick a better ordering when two statement patterns have the same cardinality but one has a sid var and the other does not.

        Show
        mikepersonick mikepersonick added a comment - Fixed SPOPredicate asBound to deal with the case where the sid var is bound in the incoming solution. It will do some consistency checking (filters out solutions where the sid bound to sidVar is either not a sid at all or has values that do not match the constants in the SPOPredicate). It will then bind the values from the incoming sid onto any variables defined by the SPOPredicate before doing the asBound. This will give us the point tests to make sure the statements are ground. Also changed the StaticOptimizer to pick a better ordering when two statement patterns have the same cardinality but one has a sid var and the other does not.
        Hide
        bryanthompson bryanthompson added a comment -

        Fix verified in r7870.

        Show
        bryanthompson bryanthompson added a comment - Fix verified in r7870.

          People

          • Assignee:
            mikepersonick mikepersonick
            Reporter:
            bryan b. thompson bryan b. thompson
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved: