Uploaded image for project: 'Blazegraph (by SYSTAP)'
  1. Blazegraph (by SYSTAP)
  2. BLZG-375

AccessPath should visit binding sets rather than elements when used for high level query.

    Details

    • Type: New Feature
    • Status: Done
    • Resolution: Done
    • Affects Version/s: TERMS_REFACTOR_BRANCH
    • Fix Version/s: None
    • Component/s: Query Engine

      Description

      The access path currently visits elements, such as ISPOs, when scanning an index. This decision goes back to the physical schema design for the RDF database where the elements are very small. However, some index schemas can have large elements in which only a few values may be selected by the access path. In order to be more general, the IAccessPath interface should be extended to support index key-range scans where binding sets are visited. This should be in addition to the range iterator interface since there are use cases where visiting the elements (rows) might be a better fit.

      One inadvertant consequence of the current design is that a scan on a non-perfect index, e.g, https://sourceforge.net/apps/trac/bigdata/ticket/208, can result in a failure to filter out bindings which are not part of the key prefix.

        Activity

        Hide
        bryanthompson bryanthompson added a comment -

        A related issue is that the IPredicate and IAccessPath should be able to function with named fields on the data structure. The current IPredicate/IAccessPath mechansism operate using the ordinal index of the field to be accessed on the relation elements. This works fine for RDF since we have a fixed physical schema, but it does make it difficult to extract things such as the StatementEnum bits. Changing this to support named fields would allow us to apply the query engine to GOM (schema flexible) and to relation (fixed schema, but one schema per table). For relational, we could always compile the field name to the ordinal index of the field, but that will not work for GOM.

        Show
        bryanthompson bryanthompson added a comment - A related issue is that the IPredicate and IAccessPath should be able to function with named fields on the data structure. The current IPredicate/IAccessPath mechansism operate using the ordinal index of the field to be accessed on the relation elements. This works fine for RDF since we have a fixed physical schema, but it does make it difficult to extract things such as the StatementEnum bits. Changing this to support named fields would allow us to apply the query engine to GOM (schema flexible) and to relation (fixed schema, but one schema per table). For relational, we could always compile the field name to the ordinal index of the field, but that will not work for GOM.
        Hide
        bryanthompson bryanthompson added a comment -

        When making this change, also:


        - Change the SampleIndex operator to visit IBindingSets.


        - Change DataSetJoin to use an inline access path [1].


        - Support filtering of APs when the index is not perfect [2],

        [1] https://sourceforge.net/apps/trac/bigdata/ticket/233 (DataSetJoin)
        [2] https://sourceforge.net/apps/trac/bigdata/ticket/208 (Filter AP when not perfect index).

        Show
        bryanthompson bryanthompson added a comment - When making this change, also: - Change the SampleIndex operator to visit IBindingSets. - Change DataSetJoin to use an inline access path [1] . - Support filtering of APs when the index is not perfect [2] , [1] https://sourceforge.net/apps/trac/bigdata/ticket/233 (DataSetJoin) [2] https://sourceforge.net/apps/trac/bigdata/ticket/208 (Filter AP when not perfect index).
        Hide
        bryanthompson bryanthompson added a comment -

        As a partial step towards resolution of this ticket, I have implemented a method to convert the IElement[] chunks visited by an IAccessPath's IChunkedIterator into an Iterator<IBindingSet>. This is being done in the context of the htree based hash join operators. The existing HTreeHashJoinOp materializes all solutions from the upstream pipeline operators in an HTree, building a hash index on the join variables, and then runs the access path once, probing the hash index for each solution to find those which can join. In order to support named subquery with temporary solution sets we need to setup a hash index which is the materialized solutions of the subquery and the probe the hash index for each solution flowing through the query pipeline. In the first case we are working with an IAccessPath visiting IElements. In the second case we are working with an IAsynchronousIterator, visiting IBindingSets. The code block below is being used to align these two cases by converting IElements into IBindingSets so we can reuse the same hash join logic.

            /**
             * Convert an {@link IAccessPath#iterator()} into a stream of
             * {@link IBindingSet}s.
             * 
             * @param src
             *            The iterator draining the {@link IAccessPath}. This will visit
             *            {@link IElement}s.
             * @param pred
             *            The predicate for that {@link IAccessPath}
             * @param vars
             *            The array of distinct variables (no duplicates) to be
             *            extracted from the visited {@link IElement}s.
             * @param stats
             *            Statistics to be updated as elements and chunks are consumed
             *            (optional).
             * 
             * @return The dechunked iterator visiting the solutions. The order of the
             *         original {@link IElement}s is preserved.
             * 
             * @see https://sourceforge.net/apps/trac/bigdata/ticket/209 (AccessPath
             *      should visit binding sets rather than elements when used for high
             *      level query.)
             * @see https://sourceforge.net/apps/trac/bigdata/ticket/233 (Inline access
             *      path).
             */
            @SuppressWarnings({ "rawtypes", "unchecked" })
            static public ICloseableIterator<IBindingSet> solutions(
                    final IChunkedIterator<?> src, final IPredicate<?> pred,
                    final IVariable<?>[] vars, final BaseJoinStats stats) {
        
                // Array of the as-bound values; reused for each visited tuple.
                final IConstant<?>[] vals = new IConstant<?>[vars.length];
        
                return new CloseableIteratorWrapper(new Striterator(src).addFilter(
                        new Expander() {
        
                            private static final long serialVersionUID = 1L;
        
                            // Count access path chunks and units.
                            @Override
                            protected Iterator expand(Object obj) {
        
                                final Object[] chunk = (Object[]) obj;
        
                                stats.accessPathChunksIn.increment();
        
                                stats.accessPathUnitsIn.add(chunk.length);
        
                                return Arrays.asList(chunk).iterator();
        
                            }
        
                        }).addFilter(new Resolver() {
        
                    private static final long serialVersionUID = 1L;
        
                    // Resolve IElements to IBindingSets.
                    @Override
                    protected Object resolve(final Object obj) {
        
                        final IElement e = (IElement) obj;
        
                        // overwrite with values copied from the access path.
                        BOpContext.copyValues(e, pred, vars, vals);
        
                        return new ListBindingSet(vars, vals);
        
                    }
        
                })) {
                    
                    /**
                     * Close the real source if the caller closes the returned 
                     * iterator.
                     */
                    @Override
                    public void close() {
                        super.close();
                        src.close();
                    }
                };
        
            }
        
        Show
        bryanthompson bryanthompson added a comment - As a partial step towards resolution of this ticket, I have implemented a method to convert the IElement[] chunks visited by an IAccessPath's IChunkedIterator into an Iterator<IBindingSet>. This is being done in the context of the htree based hash join operators. The existing HTreeHashJoinOp materializes all solutions from the upstream pipeline operators in an HTree, building a hash index on the join variables, and then runs the access path once, probing the hash index for each solution to find those which can join. In order to support named subquery with temporary solution sets we need to setup a hash index which is the materialized solutions of the subquery and the probe the hash index for each solution flowing through the query pipeline. In the first case we are working with an IAccessPath visiting IElements. In the second case we are working with an IAsynchronousIterator, visiting IBindingSets. The code block below is being used to align these two cases by converting IElements into IBindingSets so we can reuse the same hash join logic. /** * Convert an {@link IAccessPath#iterator()} into a stream of * {@link IBindingSet}s. * * @param src * The iterator draining the {@link IAccessPath}. This will visit * {@link IElement}s. * @param pred * The predicate for that {@link IAccessPath} * @param vars * The array of distinct variables (no duplicates) to be * extracted from the visited {@link IElement}s. * @param stats * Statistics to be updated as elements and chunks are consumed * (optional). * * @return The dechunked iterator visiting the solutions. The order of the * original {@link IElement}s is preserved. * * @see https://sourceforge.net/apps/trac/bigdata/ticket/209 (AccessPath * should visit binding sets rather than elements when used for high * level query.) * @see https://sourceforge.net/apps/trac/bigdata/ticket/233 (Inline access * path). */ @SuppressWarnings({ "rawtypes", "unchecked" }) static public ICloseableIterator<IBindingSet> solutions( final IChunkedIterator<?> src, final IPredicate<?> pred, final IVariable<?>[] vars, final BaseJoinStats stats) { // Array of the as-bound values; reused for each visited tuple. final IConstant<?>[] vals = new IConstant<?>[vars.length]; return new CloseableIteratorWrapper(new Striterator(src).addFilter( new Expander() { private static final long serialVersionUID = 1L; // Count access path chunks and units. @Override protected Iterator expand(Object obj) { final Object[] chunk = (Object[]) obj; stats.accessPathChunksIn.increment(); stats.accessPathUnitsIn.add(chunk.length); return Arrays.asList(chunk).iterator(); } }).addFilter(new Resolver() { private static final long serialVersionUID = 1L; // Resolve IElements to IBindingSets. @Override protected Object resolve(final Object obj) { final IElement e = (IElement) obj; // overwrite with values copied from the access path. BOpContext.copyValues(e, pred, vars, vals); return new ListBindingSet(vars, vals); } })) { /** * Close the real source if the caller closes the returned * iterator. */ @Override public void close() { super.close(); src.close(); } }; }
        Hide
        bryanthompson bryanthompson added a comment -

        Refactored the IAccessPath interface into a base interface, an
        interface for access paths based no IElements, and an interface for
        IAccessPaths based on IBindingSets. The latter will support an
        "inline" access path for use with the RTO and certain named and
        default graph patterns. Modified AccessPath to implement the new
        solutions() method, wrapping the AccessPath#iterator() as an iterator
        visiting IBindingSets. Modified PipelineJoin to use the new
        IBindingSetAccessPath (when enabled in the code by an inline
        constant). Modified the IBindingSet implementations to permit
        iterator().remove() and fixed up the test suites to not fail the
        implementations (this fixes a long standing todo for IBindingSet).

        I am currently tracking down some bugs which emerge when the pipeline
        join is modified to use the IBindingSetAccessPath. TestPipelineJoin is
        clean but we still get some errors in the AST evaluation test suite,
        and presumably in the larger SPARQL TCK.

        Show
        bryanthompson bryanthompson added a comment - Refactored the IAccessPath interface into a base interface, an interface for access paths based no IElements, and an interface for IAccessPaths based on IBindingSets. The latter will support an "inline" access path for use with the RTO and certain named and default graph patterns. Modified AccessPath to implement the new solutions() method, wrapping the AccessPath#iterator() as an iterator visiting IBindingSets. Modified PipelineJoin to use the new IBindingSetAccessPath (when enabled in the code by an inline constant). Modified the IBindingSet implementations to permit iterator().remove() and fixed up the test suites to not fail the implementations (this fixes a long standing todo for IBindingSet). I am currently tracking down some bugs which emerge when the pipeline join is modified to use the IBindingSetAccessPath. TestPipelineJoin is clean but we still get some errors in the AST evaluation test suite, and presumably in the larger SPARQL TCK.
        Hide
        bryanthompson bryanthompson added a comment -

        Tracked down the problem mentioned above. It turns out the Predicate#asBound() was not associating the IVariable with the Constant on the new predicate. The new path for bind() was therefore failing to propagate bindings for the variable for the asBound predicate. This resulted in NotMaterializedExceptions because the lexicon join was failing to associate the resolved Value with the variable because its "variable" quality was lost by asBound().

        The change was to Predicate#asBound() on line 360.

        Old:

                    tmp._set(i, val.clone());
        

        New:

                    tmp._set(i, new Constant(var, val.get()));
        

        Also fixed a bug in the BOpContext#bind() variant for access paths visiting solutions where it was failing to propagate the IV's cached value. Added a test suite for this bind() method.

        The new code path in PipelineJoin to use AccessPath#solutions() rather than AccessPath#iterator() is enabled in this commit, but can be disabled by changing the following code at line 1640.

        				    if(true && accessPath instanceof IBindingSetAccessPath) {
        				        // Handle join in terms of an IBindingSet iterator.
                                handleJoin2();
        				    } else {
        				        // Handle join in terms of an IElement[] iterator.
        				        handleJoin();
        				    }
        

        Committed revision r5269.

        Show
        bryanthompson bryanthompson added a comment - Tracked down the problem mentioned above. It turns out the Predicate#asBound() was not associating the IVariable with the Constant on the new predicate. The new path for bind() was therefore failing to propagate bindings for the variable for the asBound predicate. This resulted in NotMaterializedExceptions because the lexicon join was failing to associate the resolved Value with the variable because its "variable" quality was lost by asBound(). The change was to Predicate#asBound() on line 360. Old: tmp._set(i, val.clone()); New: tmp._set(i, new Constant(var, val.get())); Also fixed a bug in the BOpContext#bind() variant for access paths visiting solutions where it was failing to propagate the IV's cached value. Added a test suite for this bind() method. The new code path in PipelineJoin to use AccessPath#solutions() rather than AccessPath#iterator() is enabled in this commit, but can be disabled by changing the following code at line 1640. if(true && accessPath instanceof IBindingSetAccessPath) { // Handle join in terms of an IBindingSet iterator. handleJoin2(); } else { // Handle join in terms of an IElement[] iterator. handleJoin(); } Committed revision r5269.
        Hide
        bryanthompson bryanthompson added a comment -

        Here is a discussion of the commit point describe above. As of this commit, joins against the lexicon or statement indices operate on IBindingSets read from the index rather than IElements. However, I am unsure that the original reason for this change remains valid.

        The motivations for visiting IBindingSets on an IAccessPath rather than IElements as as
        follows:


        - Greater API consistency.


        - Fewer JOIN operators.


        - RTO JOINs must support cutoff evaluation.

        The original reason for visiting elements rather than binding sets was that the access path as seen as index access rather than solutions access. However, the concept of solutions access is more general and, for wide tables in which we only select a few columns, could be significantly more efficient.

        But our primary use case is to join against narrow tables: (IV,Value) or (S,P,O,C). For those use cases it could be that the new code path is LESS efficient for several different reasons:


        - Binding sets are not as compact as ISPO objects (possible heap and NIO concern).


        - Predicate#asBound() can no longer overwrite the variable on the predicate with a constant, but must actually create a new Constant/2 instance in order to propagate the constant onto the variable in the binding sets visited on the access path.

        I suspect that the simplicity of the APIs and the ability to have fewer JOIN operators will win out in the long run, but we should measure the performance impact of this change set.

        Committed revision r5269.

        Show
        bryanthompson bryanthompson added a comment - Here is a discussion of the commit point describe above. As of this commit, joins against the lexicon or statement indices operate on IBindingSets read from the index rather than IElements. However, I am unsure that the original reason for this change remains valid. The motivations for visiting IBindingSets on an IAccessPath rather than IElements as as follows: - Greater API consistency. - Fewer JOIN operators. - RTO JOINs must support cutoff evaluation. The original reason for visiting elements rather than binding sets was that the access path as seen as index access rather than solutions access. However, the concept of solutions access is more general and, for wide tables in which we only select a few columns, could be significantly more efficient. But our primary use case is to join against narrow tables: (IV,Value) or (S,P,O,C). For those use cases it could be that the new code path is LESS efficient for several different reasons: - Binding sets are not as compact as ISPO objects (possible heap and NIO concern). - Predicate#asBound() can no longer overwrite the variable on the predicate with a constant, but must actually create a new Constant/2 instance in order to propagate the constant onto the variable in the binding sets visited on the access path. I suspect that the simplicity of the APIs and the ability to have fewer JOIN operators will win out in the long run, but we should measure the performance impact of this change set. Committed revision r5269.
        Hide
        bryanthompson bryanthompson added a comment -

        I am going to close this out now. I think that we have moved this issue far enough that it is no longer blocking anything. I'll leave a linked to the issue in the AccessPath code.

        Show
        bryanthompson bryanthompson added a comment - I am going to close this out now. I think that we have moved this issue far enough that it is no longer blocking anything. I'll leave a linked to the issue in the AccessPath code.

          People

          • Assignee:
            bryanthompson bryanthompson
            Reporter:
            bryanthompson bryanthompson
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved: