The problem appears to be the order in which DISTINCT and ORDER BY are applied in the query plan. The DISTINCT operator (of necessity) only passes along the variables that are projected out of the query, so if it runs before ORDER BY, a sort variable that is not projected would no longer be available to sort on.

1. One option when DISTINCT and ORDER BY are both specified is to annotate the ORDER BY operator to also impose DISTINCT. It can then sort on the variables in the source solutions (including the "name" attribute which is not projected in this query). During the output stage, only the DISTINCT projected variables should be reported. This can be achieved by scanning the solutions in the sorted order and only reporting the first solution having a distinct value for the projected variable(s).

2. Another option is to simply run the DISTINCT operator after the ORDER BY operator. However, we need to be careful that the DISTINCT operator is NOT evaluated concurrently when it runs after an ORDER BY operator, since concurrent evaluation of the DISTINCT operator would scramble the solution order.
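
To make the first option concrete, here is a minimal sketch in plain Java. It is not bigdata code: the class name, the method name, and the use of Map<String, String> to stand in for a solution are all assumptions made for illustration. It sorts the source solutions on a variable that need not be projected, then scans them in sorted order and reports only the first solution for each distinct set of projected bindings.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration only; not part of the bigdata code base. */
final class SortedDistinctSketch {

    /**
     * Sorts the source solutions on sortVar (which need not be projected),
     * then scans them in sorted order and reports the projected bindings of
     * the first solution seen for each distinct projected value.
     */
    static List<Map<String, String>> orderByThenDistinct(
            List<Map<String, String>> solutions,
            String sortVar,
            List<String> projectedVars) {

        // Sort a copy so the caller's list is left untouched.
        // Solutions that do not bind sortVar are placed first.
        List<Map<String, String>> sorted = new ArrayList<>(solutions);
        sorted.sort(Comparator.comparing(
                (Map<String, String> s) -> s.get(sortVar),
                Comparator.nullsFirst(Comparator.<String>naturalOrder())));

        // LinkedHashSet keeps first-seen (i.e. sorted) order and drops repeats.
        Set<Map<String, String>> distinct = new LinkedHashSet<>();
        for (Map<String, String> s : sorted) {
            Map<String, String> projected = new LinkedHashMap<>();
            for (String v : projectedVars) {
                projected.put(v, s.get(v));
            }
            distinct.add(projected);
        }
        return new ArrayList<>(distinct);
    }
}
```

The scan is single threaded, which is what keeps the sorted order intact in the output.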

I have implemented a fix for the JVM DISTINCT operator and ORDER BY. AST2BOpUtility#addDistinct() now requires an additional parameter which indicates whether the ordering of the solutions must be maintained. This corresponds to the 2nd option listed above. The ordering is preserved in this case by specifying MAX_PARALLEL := 1. This prevents concurrent evaluation of the JVM DISTINCT operator. The operator does not internally reorder solutions. Therefore, the net output of the operation preserves the input ordering, but only passes along the distinct solutions.

The HTree version of the DISTINCT operator appears to internally reorder solutions. For the moment, I have modified AST2BOpUtility#addDistinct() to force the use of the JVM DISTINCT operator when the solution set order must be preserved.

The easiest way to handle preserveOrder is with a custom DISTINCT operator. That operator only needs to maintain the last solution and run with MAX_PARALLEL := 1. It can simply compare the bindings for the variables to be made distinct with those of the last solution. If the bindings are the same, then the current solution is dropped. Since no hash index is required, this version should be faster and more scalable than the hash-based DISTINCT operators. I will follow up with an implementation and then hook it into the query plan.
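
A rough sketch of that idea in plain Java follows. The class and method names are hypothetical, and Map<String, String> again stands in for a solution; this is not the operator that will be committed. Note one assumption that the approach depends on: comparing against only the last solution removes duplicates that are adjacent in the input, so it is correct only if the sort order places solutions with identical bindings on the distinct variables next to each other.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not part of the bigdata code base. */
final class LastSolutionDistinctSketch {

    private final List<String> distinctVars;

    // Bindings of the last solution that was passed along (null at start).
    private Map<String, String> last;

    LastSolutionDistinctSketch(List<String> distinctVars) {
        this.distinctVars = distinctVars;
    }

    /**
     * Returns the bindings for the distinct variables if they differ from
     * those of the previously accepted solution, otherwise null (dropped).
     * Must be called from a single thread, in input order.
     */
    Map<String, String> accept(Map<String, String> solution) {
        Map<String, String> key = new LinkedHashMap<>();
        for (String v : distinctVars) {
            key.put(v, solution.get(v));
        }
        if (key.equals(last)) {
            return null; // same bindings as the last solution: drop it
        }
        last = key;
        return key;
    }

    /** Convenience wrapper: filters an already ordered list of solutions. */
    static List<Map<String, String>> filter(
            List<Map<String, String>> ordered, List<String> distinctVars) {
        LastSolutionDistinctSketch op = new LastSolutionDistinctSketch(distinctVars);
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> s : ordered) {
            Map<String, String> kept = op.accept(s);
            if (kept != null) {
                out.add(kept);
            }
        }
        return out;
    }
}
```

The only state is a single solution, so memory use does not grow with the number of solutions, unlike a hash index.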

For the moment, a workaround exists and is present in SVN:

Committed revision r6353.
