Details

    • Type: New Feature
    • Status: In Progress
    • Resolution: Unresolved
    • Affects Version/s: BIGDATA_RELEASE_1_2_0
    • Fix Version/s: None
    • Component/s: Bigdata RDF Database
    • Labels:
      None

      Description

      Bigdata inlines a wide variety of datatype literals into the statement indices. When the variable in the object position of a basic graph pattern is known to obey some data type constraint, it becomes possible to rewrite the query plan such that it uses a key-range scan against the appropriate range of the OSP(C) index. If there is also a range constraint (GT, LT, GE, LE, etc) then that range constraint can be used to place bounds on the key-range scan. This can result in very efficient evaluation, especially for relatively unselective analytic queries.

      When the constraint is such that the variable can only take on values for a specific datatype, such as xsd:int, the key-range scan will be only against the corresponding part of the OSP(C) index. When the constraint allows literals of multiple datatypes to match via type promotion, then the query can be rewritten as the UNION of the key-range scans for each of the possible datatypes.

      Such datatype constraints could be imposed using a variety of SPARQL functions, including: datatype(), isNumeric(), etc. The constraint imposed by these functions (and the presence of a range constraint in the query) would have to be recognized by the query planner, which could then rewrite the query to take advantage of key-range scans.

        Activity

        Hide
        bryanthompson bryanthompson added a comment -

        Whether or not we can lift the filter into a key-range scan depends on the data type conversions for the specific value types used in the range constraint. For xsd:dateTime, there is only one declaration for the comparison operators.

        A > B xsd:dateTime xsd:dateTime op:dateTime-greater-than(A, B) xsd:boolean 
        

        However, we also have to look at subtype substitution [1] and numeric promotion [2]. These might permit filter lifting for xsd:dateTime in the example above, but that does not generalize to range constraints in some other datatypes. The explicit use of a datatype filter on the value to be used in the range test allows us to lift the range constraint, together with the datatype constraint, into a key-range scan on the OSP(C) statement index.

        See https://sourceforge.net/projects/bigdata/forums/forum/676946/topic/4366799 for a sample query using a range constraint on xsd:dateTime.

        [1] http://www.w3.org/TR/xpath20/#dt-subtype-substitution
        [2] http://www.w3.org/TR/xpath20/#promotion

        Show
        bryanthompson bryanthompson added a comment - Whether or not we can lift the filter into a key-range scan depends on the data type conversions for the specific value types used in the range constraint. For xsd:dateTime, there is only one declaration for the comparison operators. A > B xsd:dateTime xsd:dateTime op:dateTime-greater-than(A, B) xsd:boolean However, we also have to look at subtype substitution [1] and numeric promotion [2] . These might permit filter lifting for xsd:dateTime in the example above, but that does not generalize to range constraints in some other datatypes. The explicit use of a datatype filter on the value to be used in the range test allows us to lift the range constraint, together with the datatype constraint, into a key-range scan on the OSP(C) statement index. See https://sourceforge.net/projects/bigdata/forums/forum/676946/topic/4366799 for a sample query using a range constraint on xsd:dateTime. [1] http://www.w3.org/TR/xpath20/#dt-subtype-substitution [2] http://www.w3.org/TR/xpath20/#promotion
        Hide
        bryanthompson bryanthompson added a comment -
        Show
        bryanthompson bryanthompson added a comment - Assigned to MikeP to address https://sourceforge.net/apps/trac/bigdata/ticket/253#comment:5 (BSBM Q5).
        Hide
        mikepersonick mikepersonick added a comment -

        1. add isNumericResult to IValueExpression interface
        2. isNumericResult=true for MathBOp and numeric IConstants
        3. look for the circumstance where a variable is both LT and GT an isNumericResult VE
        4. create a RangeBOp that produces two IVs
        - from and to
        5. modify Predicate to work with a RangeBOp for a term (right now it only works with IVariableOrConstants
        6. swap in the RangeBOp in place of the IVariable term for predicates in the query that use the ranged variable
        7. change AbstractKeyOrder?
        - how it constructs its from and to keys for predicates
        8. RangeBOp is kinda like an IVariable in that it can be unbound (if it relies on a value expression which itself contains unbound variables
        - i.e. MathBOp). Keep an eye out for gotchas here

        Show
        mikepersonick mikepersonick added a comment - 1. add isNumericResult to IValueExpression interface 2. isNumericResult=true for MathBOp and numeric IConstants 3. look for the circumstance where a variable is both LT and GT an isNumericResult VE 4. create a RangeBOp that produces two IVs - from and to 5. modify Predicate to work with a RangeBOp for a term (right now it only works with IVariableOrConstants 6. swap in the RangeBOp in place of the IVariable term for predicates in the query that use the ranged variable 7. change AbstractKeyOrder? - how it constructs its from and to keys for predicates 8. RangeBOp is kinda like an IVariable in that it can be unbound (if it relies on a value expression which itself contains unbound variables - i.e. MathBOp). Keep an eye out for gotchas here
        Hide
        bryanthompson bryanthompson added a comment -

        Also, we should be able to lift a filter which asserts that some variable used by an access path is equal to some other variable onto the access path.

        Show
        bryanthompson bryanthompson added a comment - Also, we should be able to lift a filter which asserts that some variable used by an access path is equal to some other variable onto the access path.
        Hide
        mikepersonick mikepersonick added a comment -

        I have a first pass of this done
        - the simple case where we don't have to do anything special to capture other datatypes. This is only useful when you know you have a single datatype for a particular predicate (or subject-predicate combo). I added a query hint that allows you to mark a triple pattern as safe to use with the RangeBOp. For example, if you know you are always going to see xsd:dateTime values for a particular predicate, you can mark that predicate as safe for ranging:

        where {
        ?s <date> ?o .
        hint:Prior hint:rangeSafe "true" .
        filter (?o > "2012-01-01"^xsd:dateTime && ?o < "2012-03-01"^xsd:dateTime) .
        ...
        }

        Note that if values for ?o exist in the indices that would normally satisfy the filters based on SPARQL compare (for example ?o="2012-02-01"^^xsd:date), this query will produce incorrect answers. Thus it is only safe to use when it is known with certainty there will be a single datatype in the value space.

        The final optimization in the general case and is to use a simple union (TEE) to get all the possible datatypes covered that can be treated as comparable per the SPARQL spec.

        Show
        mikepersonick mikepersonick added a comment - I have a first pass of this done - the simple case where we don't have to do anything special to capture other datatypes. This is only useful when you know you have a single datatype for a particular predicate (or subject-predicate combo). I added a query hint that allows you to mark a triple pattern as safe to use with the RangeBOp. For example, if you know you are always going to see xsd:dateTime values for a particular predicate, you can mark that predicate as safe for ranging: where { ?s <date> ?o . hint:Prior hint:rangeSafe "true" . filter (?o > "2012-01-01"^ xsd:dateTime && ?o < "2012-03-01" ^xsd:dateTime) . ... } Note that if values for ?o exist in the indices that would normally satisfy the filters based on SPARQL compare (for example ?o="2012-02-01"^^xsd:date), this query will produce incorrect answers. Thus it is only safe to use when it is known with certainty there will be a single datatype in the value space. The final optimization in the general case and is to use a simple union (TEE) to get all the possible datatypes covered that can be treated as comparable per the SPARQL spec.

          People

          • Assignee:
            mikepersonick mikepersonick
            Reporter:
            bryanthompson bryanthompson
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated: