Details

      Description

      This is a feature request to identify options for an improved EXPLAIN. The current EXPLAIN provides a lot of low-level information about query plan performance. However, it does not do any of the following (options that have already been implemented are marked with strikethrough):

      1. Identify if the query is bad (e.g., if it requires bottom-up evaluation, has unconstrained cross products, or has ill-formed joins arising from the lack of a shared variable).
        1. Identify cases where a variable is visible in the outer scope (bindings are produced), but is not bound within the inner scope (no binding producers) and therefore is not visible when evaluating in a FILTER or similar expression in the inner scope.
        2. Identify cases where a variable is bound using BIND() but not projected into a UNION (or other nested subgroup) and hence the subgroup runs with the variable unconstrained and then applies the BIND.
      2. Identify if a better join order was available (the RTO can do this, but more work is required for quads mode RTO support and for rewriting sub-selects and aggregations such that we can apply the RTO to all queries).
      3. Identify if a query will have a long run time. If so, static analysis could spend more time on the query, we could automatically enable the analytic query mode, use the RTO, etc.
      4. Identify if there are plan alternatives that are more efficient and what query hints could be added or removed to improve the plan execution.
      5. Identify if the query plan failed to push down a filter or optimize a particular expression.
      6. Identify for a workload which queries are responsible for most of the workload of the database.
      7. Identify constructs that are not optimizable (such as FILTER ?x="some string", which is not easily inlinable in the general case), to give the user the opportunity to rewrite the query into a more precise one.
      8. Provide a graphical representation of the query plan.
      9. Mark JOINs and FILTERs that are expensive and do not change the intermediate cardinality. People often write queries that are redundant in this sense.
      10. Mark patterns where a statement pattern (or other group node) cannot be moved in front of an OPTIONAL; this might indicate an ill-designed pattern.
      11. Mark patterns where MINUS is used in an unsatisfiable or always-satisfied way; this might not be the query the author wanted to write.
      12. Mark FILTERs where the variable is out of scope (and thus will never be bound)
      13. Indicate projection variables that are not mentioned inside the query (or are not in the main scope) and thus can never be bound in the result
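One way to picture the "lack of a shared variable" check from item 1 is the rough sketch below. It is not how the EXPLAIN or the AST optimizers work; it assumes triple patterns are given as plain strings, and the helper names (`variables`, `join_components`, `has_cross_product`) are made up for illustration. Patterns that share no variable, directly or transitively, end up in separate components, which would indicate an unconstrained cross product.

```python
import re


def variables(pattern):
    """Return the SPARQL variables (?x or $x) mentioned in a pattern string.

    Simplification: a '?' inside a string literal would also match.
    """
    return set(re.findall(r"[?$](\w+)", pattern))


def join_components(patterns):
    """Group patterns into components connected through shared variables."""
    components = []  # list of (variables, patterns) pairs with disjoint variables
    for p in patterns:
        vs = variables(p)
        merged_vars, merged_pats = set(vs), [p]
        remaining = []
        for cvars, cpats in components:
            if cvars & vs:
                # The new pattern joins with this component: merge them.
                merged_vars |= cvars
                merged_pats = cpats + merged_pats
            else:
                remaining.append((cvars, cpats))
        remaining.append((merged_vars, merged_pats))
        components = remaining
    return [pats for _, pats in components]


def has_cross_product(patterns):
    """True if the patterns do not form a single variable-connected join."""
    return len(join_components(patterns)) > 1


connected = ["?s :knows ?o", "?o :name ?n"]
disconnected = ["?s :knows ?o", "?x :name ?n"]
print(has_cross_product(connected), has_cross_product(disconnected))
```

A real check would have to work on the AST and account for FILTERs, BINDs, and nested groups that constrain the join, which this sketch ignores.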

          Activity

          jeremycarroll added a comment:

          Here are two specifics:

          - It is difficult to know whether the special query hint for FILTER EXISTS would be good to use or not.
          - Where the implicit DISTINCT for the default graph in quads mode did not in fact filter anything, highlight that to suggest a query modification.

          bryanthompson added a comment:

          The first point might be better addressed by the scheduled improvements in query plan generation.

          The 2nd suggestion above is problematic. In general, the semantics of a named graph (GRAPH ?g {... }) access path where ?g is ignored are NOT the same as the semantics of a default graph access path for the same triple pattern. The default graph version always yields distinct triples. The named graph version can contain duplicate triples. This leads to different cardinality for the query.

          I find it easy to imagine that a given application may never have duplicates (in which case I would actually recommend NOT using the quads mode, but instead using multiple namespaces). However, if an application sometimes has duplicates and sometimes does not, then this "suggestion" could violate the intended semantics and lead to undesired errors in the application.
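The cardinality difference described in this comment can be illustrated with a toy model. The sketch below is not Blazegraph code; it assumes the default graph is the merge of all graphs in the dataset, and the quads and function names are invented for the example.

```python
# Toy quad store: (subject, predicate, object, graph). Data is made up.
quads = [
    ("s1", "p", "o1", "g1"),
    ("s1", "p", "o1", "g2"),  # the same triple asserted in a second graph
    ("s2", "p", "o2", "g1"),
]


def default_graph_matches(quads):
    """Default graph access path: triples are distinct across graphs."""
    return sorted({(s, p, o) for s, p, o, _ in quads})


def named_graph_matches(quads):
    """GRAPH ?g { ... } with ?g ignored: one solution per (triple, graph)."""
    return sorted((s, p, o) for s, p, o, _ in quads)


print(len(default_graph_matches(quads)))  # distinct triples only
print(len(named_graph_matches(quads)))    # duplicates retained
```

With this data the default graph view yields two solutions and the named graph view yields three, so swapping one access path for the other changes the result whenever a triple appears in more than one graph.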

          bryanthompson added a comment:

          The new (correct) join ordering semantics result in (correctly) sub-optimal query evaluation if a query uses REQUIRED - OPTIONAL - REQUIRED groups. Such queries should be brought to the user's attention in the EXPLAIN. This should also be done for ill-formed queries (which require us to rewrite them for explicit bottom-up evaluation).
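A minimal sketch of how a REQUIRED - OPTIONAL - REQUIRED sequence could be flagged follows. It assumes a group is flattened into an ordered list of `(kind, pattern)` pairs, where `kind` is `"required"` or `"optional"`; that representation and the function name are hypothetical and do not reflect the actual AST classes.

```python
def required_after_optional(group):
    """Return indices of required patterns that follow an OPTIONAL
    which itself follows at least one required pattern."""
    seen_required = False
    seen_optional_after_required = False
    flagged = []
    for i, (kind, _pattern) in enumerate(group):
        if kind == "required":
            if seen_optional_after_required:
                flagged.append(i)
            seen_required = True
        elif kind == "optional" and seen_required:
            seen_optional_after_required = True
    return flagged


group = [
    ("required", "?s :type :Person"),
    ("optional", "?s :email ?e"),
    ("required", "?s :name ?n"),  # cannot simply be moved in front of the OPTIONAL
]
print(required_after_optional(group))
```

The flagged indices are the places where an EXPLAIN annotation could be attached so that the author can consider reordering the query by hand.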

          bryanthompson added a comment (edited):

          Assigned to Michael to:

          • Document additional "gotchas" that arise from correct SPARQL evaluation semantics in 1.5.2.
          • Partition the set of bullets for this task into those that we will implement immediately and those that will be implemented later.
          • Add annotations to the AST to indicate "gotchas".

          Next to be assigned to Igor to:

          • Update the EXPLAIN to render out the additional information per this ticket.
          michaelschmidt added a comment (edited):

          For now, I'd start out with the following ones:

          1.1 Identify cases where a variable is visible in the outer scope (bindings are produced), but is not bound within the inner scope (no binding producers) and therefore is not visible when evaluating in a FILTER or similar expression in the inner scope.
          10. Mark patterns where a statement pattern (or other group node) cannot be moved in front of an OPTIONAL; this might indicate an ill-designed pattern.
          11. Mark patterns where MINUS is used in an unsatisfiable or always-satisfied way; this might not be the query the author wanted to write.
          12. Mark FILTERs where the variable is out of scope (and thus will never be bound)

          Examples for all cases and the relevant code hooks where these potential mis-specifications are observed are available at (first + second table sheet): https://docs.google.com/spreadsheets/d/1EcpsimlsObw5fe3h7oCRF79EtG9TAmdaCfZ7KotvbCQ
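Independently of the spreadsheet, items 1.1 and 12 can be sketched on a toy group tree. The `Group` class and both functions below are hypothetical stand-ins, not the real AST or its annotations, and the sketch assumes that every variable bound in a nested group is visible in its parent (which does not hold for MINUS, EXISTS, or sub-selects with a restricted projection).

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    bound: set = field(default_factory=set)       # variables bound directly in this group
    filters: list = field(default_factory=list)   # each filter: set of variables it mentions
    children: list = field(default_factory=list)  # nested groups


def in_scope(group):
    """Variables visible inside a group: its own plus those of nested groups."""
    visible = set(group.bound)
    for child in group.children:
        visible |= in_scope(child)
    return visible


def out_of_scope_filters(group, path="root"):
    """Return (group path, unbound variables) for each suspicious FILTER."""
    warnings = []
    scope = in_scope(group)
    for filter_vars in group.filters:
        missing = set(filter_vars) - scope
        if missing:
            warnings.append((path, sorted(missing)))
    for i, child in enumerate(group.children):
        warnings += out_of_scope_filters(child, f"{path}/{i}")
    return warnings


# Roughly: { ?s :p ?x . { ?s :q ?y FILTER(?x > ?y) } }
inner = Group(bound={"s", "y"}, filters=[{"x", "y"}])
outer = Group(bound={"s", "x"}, children=[inner])
print(out_of_scope_filters(outer))
```

Here `?x` is bound only in the outer group, so the FILTER in the inner group is reported: under bottom-up evaluation the inner group is evaluated without a binding for `?x`.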


            People

            • Assignee:
              igorkim
              Reporter:
              bryanthompson
            • Votes:
              0
              Watchers:
              5
