Details

      Description

      Integrate the native operators into the AST, including:


      - OFFSET/LIMIT (SliceOp)
      - DISTINCT (DistinctBindingSetOp)
      - Aggregation (MemoryGroupByOp and PipelinedAggregationOp)
      - ORDER BY (MemorySortOp)
      - HASH JOIN (HTreeHashJoinOp)

      See [1,2,3,4].

      [1] https://sourceforge.net/apps/trac/bigdata/ticket/58 (Scalable DISTINCT)
      [2] https://sourceforge.net/apps/trac/bigdata/ticket/48 (Support aggregation queries)
      [3] https://sourceforge.net/apps/trac/bigdata/ticket/48 (ORDER_BY)
      [4] https://sourceforge.net/apps/trac/bigdata/ticket/339 (Change dependency to sesame 2.4)

        Activity

        bryanthompson added a comment -

        Each node in a cluster needs to be able to materialize a query. This is currently done by requesting the BOp for the query from the query controller using its queryId, which is a UUID. While a subquery is part of the same BOp tree as the parent query, it will have a distinct UUID. That UUID is assigned when the subquery is evaluated. While it is possible to bind the UUID of the top-level query in advance, we can not do that with the subquery since a subquery may be issued many, many times during the evaluation of a given parent query.

        However, we should be able to use a copy-on-write modification to bind the UUID of the parent query on the subquery each time that subquery is invoked. The subquery can then discover the IRunningQuery for the parent from the QueryEngine. (While it is possible that the subquery may be running on a node on which the parent query has never started, that is not true for named subqueries as their solution sets are bound to the query controller.)

        Another way to make this work is to encode the UUID of the parent query within the name assigned to each named solution set. This requires that we pre-assign the queryId to the top-level query, but that is Ok. The named subquery include will then parse the name of the solution set and extract the queryId component (or we could just make the attribute a Serializable class so we do not have to parse anything). It can then request the IRunningQuery on which the solution set was hung from the query engine and grab the named solution set. Really, that seems to be the easiest approach.

        I've written a NamedSolutionSetRef class. It has fields for the query UUID, the name of the named set, and the join variables. The operators which produce and consume the named solution set now use an attribute whose value is this NamedSolutionSetRef. The consumer uses the query UUID to locate the IRunningQuery for the root query (this is where the named result sets were attached). It then extracts the HTree reference from an attribute whose name is made unique across the query by combining all the fields in the NamedSolutionSetRef into a String.

        Committed revision r5113.
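
        The lookup scheme described in this comment can be sketched roughly as follows. This is a minimal illustration only, not the committed code: the class names NamedSolutionSetRefSketch and QueryRegistrySketch, the attributeKey() method, and the use of plain strings for join variables are all assumptions made for the example. The registry is a stand-in for the QueryEngine and IRunningQuery attributes, whose real interfaces are not shown in this ticket.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch; not the actual NamedSolutionSetRef from r5113. */
final class NamedSolutionSetRefSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    final UUID queryId;       // UUID of the top-level (root) query
    final String namedSet;    // name of the named solution set
    final String[] joinVars;  // join variable names (plain strings in this sketch)

    NamedSolutionSetRefSketch(UUID queryId, String namedSet, String... joinVars) {
        this.queryId = queryId;
        this.namedSet = namedSet;
        this.joinVars = joinVars.clone();
    }

    /** Combines all fields into one String, used as the attribute name. */
    String attributeKey() {
        return "NamedSolutionSetRef{queryId=" + queryId
                + ",namedSet=" + namedSet
                + ",joinVars=" + Arrays.toString(joinVars) + "}";
    }
}

/** Assumed stand-in for the query engine: query UUID -> that query's attributes. */
final class QueryRegistrySketch {
    private final Map<UUID, Map<String, Object>> attributesByQuery =
            new ConcurrentHashMap<>();

    /** Producer side: attach a solution set to the root query's attributes. */
    void publish(NamedSolutionSetRefSketch ref, Object solutionSet) {
        attributesByQuery
                .computeIfAbsent(ref.queryId, id -> new ConcurrentHashMap<>())
                .put(ref.attributeKey(), solutionSet);
    }

    /** Consumer side: find the root query by UUID, then the set by its key. */
    Object resolve(NamedSolutionSetRefSketch ref) {
        Map<String, Object> attrs = attributesByQuery.get(ref.queryId);
        return attrs == null ? null : attrs.get(ref.attributeKey());
    }
}
```

        Because the reference is a Serializable value object, the consumer does not need to parse a name to recover the query UUID. It only needs a reference with the same field values as the one the producer used.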

        bryanthompson added a comment -

        Modified AST2BOpUtility to actually invoke the AST optimizers.

        Committed revision r5114.

        bryanthompson added a comment -

        Modified the named subquery operator to checkpoint the HTree containing the solution set, reload a read-only view of the HTree from that checkpoint, and then save a reference to the read-only view on the query attributes. This makes the named solution set safe for concurrent readers, which is a real possibility if the subquery is included in more than one place in the query.

        Committed revision r5116.
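
        The pattern behind this change (build privately, checkpoint, then publish only a read-only view) can be illustrated with ordinary Java collections. This is a sketch under stated assumptions, not the HTree code: a List stands in for the HTree, a defensive copy stands in for the checkpoint, and the class name NamedSubquerySketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of "build, checkpoint, publish a read-only view". */
final class NamedSubquerySketch {
    /** Assumed stand-in for the attributes shared by all operators of a query. */
    private final Map<String, List<String>> queryAttributes =
            new ConcurrentHashMap<>();

    /** Builds the solution set privately, then publishes a frozen view of it. */
    void evaluate(String attributeKey, Iterable<String> solutions) {
        List<String> mutable = new ArrayList<>();  // stands in for the writable HTree
        for (String s : solutions) {
            mutable.add(s);
        }
        // "Checkpoint": snapshot the data, then expose it only as read-only.
        List<String> readOnly =
                Collections.unmodifiableList(new ArrayList<>(mutable));
        queryAttributes.put(attributeKey, readOnly);
    }

    /** Called by each include of the named solution set, possibly concurrently. */
    List<String> lookup(String attributeKey) {
        return queryAttributes.get(attributeKey);
    }
}
```

        Readers only ever see the frozen view, so several includes of the same named solution set can scan it at the same time without coordinating with a writer.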

        bryanthompson added a comment -

        Ok. The problem is BucketPageBLZG-900. It needs to be changed from:
        {{{
        tuple.copy(nextNonEmptySlot, data);
        }}}
        to this:
        {{{
        tuple.copy(nextNonEmptySlot, BucketPage.this);
        }}}

        Committed revision r5117.

        bryanthompson added a comment -

        This is done.


          People

          • Assignee: bryanthompson
          • Reporter: bryanthompson
          • Votes: 0
          • Watchers: 3
