Uploaded image for project: 'Blazegraph (by SYSTAP)'
  1. Blazegraph (by SYSTAP)
  2. BLZG-667

DISTINCT ORDER BY does not preserve order

    Details

      Description

      Given:

      @prefix ex: <http://example.org/>.
      
      ex:Jim ex:hasName "Jim".
      ex:Michael ex:hasName "Michael".
      ex:Dwight ex:hasName "Dwight".
      

      and the query:

      PREFIX ex: <http://example.org/>
      
      SELECT DISTINCT ?sub WHERE {
        ?sub ex:hasName ?name.
      } order by DESC(?name)
      

      The projected solutions are incorrect if the ORDER BY attribute is not also projected.

        Activity

        Hide
        bryanthompson bryanthompson added a comment -

        The problem appears to be the order in which DISTINCT and ORDER BY are being applied in the query plan. The DISTINCT operator (of necessity) only passes along the variables which are being projected out of the query.

        1. One option when DISTINCT and ORDER BY are both specified is to annotate the ORDER BY operator to also impose DISTINCT. It can then sort on the variables in the source solutions (including the "name" attribute which is not projected in this query). During the output stage, only the DISTINCT projected variables should be reported. This can be achieved by scanning the solutions in the sorted order and only reporting the first solution having a distinct value for the projected variable(s).

        2. Another option is to simply run the DISTINCT operator after the ORDER BY operator. However, we need to be careful that the DISTINCT operator DOES NOT have any concurrency when it is evaluated after an ORDER BY operator as concurrent evaluation of the DISTINCT operator would scramble the solution order.

        I have implemented a fix for the JVM DISTINCT operator and ORDER BY. AST2BOpUtility#addDistinct() now requires an additional parameter which indicates whether the ordering of the solutions must be maintained. This corresponds to the 2nd option listed above. The ordering is preserved in this case by specifying MAX_PARALLEL := 1. This prevents concurrent evaluation of the JVM DISTINCT operator. The operator does not internally reorder solutions. Therefore, the net output of the operation preserves the input ordering, but only passes along the distinct solutions.

        The HTree version of the DISTINCT operator appears to internally reorder solutions. For the moment, I have modified AST2BOpUtility#addDistinct() to force the use of the JVM DISTINCT operator when the solution set order must be preserved.

        The easiest way to handle preserveOrder is with a custom DISTINCT operator. That operator only needs to maintain the last solution and run with MAX_PARALLEL:=1. It can simply compare the bindings for the variables to be made distinct with the last solution. If they are the same bindings, then the current solution is dropped. Since no hash index is required, this version will be faster and more scalable. I will follow up with an implementation and then hook it into the query plan.

        For the moment, a workaround exists and is present in SVN:

        Committed revision r6353.

        Show
        bryanthompson bryanthompson added a comment - The problem appears to be the order in which DISTINCT and ORDER BY are being applied in the query plan. The DISTINCT operator (of necessity) only passes along the variables which are being projected out of the query. 1. One option when DISTINCT and ORDER BY are both specified is to annotate the ORDER BY operator to also impose DISTINCT. It can then sort on the variables in the source solutions (including the "name" attribute which is not projected in this query). During the output stage, only the DISTINCT projected variables should be reported. This can be achieved by scanning the solutions in the sorted order and only reporting the first solution having a distinct value for the projected variable(s). 2. Another option is to simply run the DISTINCT operator after the ORDER BY operator. However, we need to be careful that the DISTINCT operator DOES NOT have any concurrency when it is evaluated after an ORDER BY operator as concurrent evaluation of the DISTINCT operator would scramble the solution order. I have implemented a fix for the JVM DISTINCT operator and ORDER BY. AST2BOpUtility#addDistinct() now requires an additional parameter which indicates whether the ordering of the solutions must be maintained. This corresponds to the 2nd option listed above. The ordering is preserved in this case by specifying MAX_PARALLEL := 1. This prevents concurrent evaluation of the JVM DISTINCT operator. The operator does not internally reorder solutions. Therefore, the net output of the operation preserves the input ordering, but only passes along the distinct solutions. The HTree version of the DISTINCT operator appears to internally reorder solutions. For the moment, I have modified AST2BOpUtility#addDistinct() to force the use of the JVM DISTINCT operator when the solution set order must be preserved. The easiest way to handle preserveOrder is with a custom DISTINCT operator. That operator only needs to maintain the last solution and run with MAX_PARALLEL:=1. It can simply compare the bindings for the variables to be made distinct with the last solution. If they are the same bindings, then the current solution is dropped. Since no hash index is required, this version will be faster and more scalable. I will follow up with an implementation and then hook it into the query plan. For the moment, a workaround exists and is present in SVN: Committed revision r6353.
        Hide
        bryanthompson bryanthompson added a comment -

        Actually, the concept of an optimized operator for DISTINCT over ORDERED solutions will only work when the attribute(s) on which the ordering is imposed is a subset of the attributes to be projected out of the query. Otherwise we have to build a hash index over the DISTINCT solutions anyway since they may be in an order which is different from the order of the projected variables (which requires us to maintain the set of distinct
        solutions rather than just comparing to the last output solution).

        I am therefore going to close out this issue. If we run into a problem with DISTINCT + ORDER BY for analytic query, then we can handle that as another ticket. The optimization for DISTINCT + ORDER BY when the projection includes all expressions in the ORDER BY clause could likewise be handled as another ticket.

        Show
        bryanthompson bryanthompson added a comment - Actually, the concept of an optimized operator for DISTINCT over ORDERED solutions will only work when the attribute(s) on which the ordering is imposed is a subset of the attributes to be projected out of the query. Otherwise we have to build a hash index over the DISTINCT solutions anyway since they may be in an order which is different from the order of the projected variables (which requires us to maintain the set of distinct solutions rather than just comparing to the last output solution). I am therefore going to close out this issue. If we run into a problem with DISTINCT + ORDER BY for analytic query, then we can handle that as another ticket. The optimization for DISTINCT + ORDER BY when the projection includes all expressions in the ORDER BY clause could likewise be handled as another ticket.

          People

          • Assignee:
            bryanthompson bryanthompson
            Reporter:
            bryanthompson bryanthompson
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved: