Details

    • Type: Bug
    • Status: In Progress
    • Resolution: Unresolved
    • Affects Version/s: BIGDATA_RELEASE_1_2_1
    • Fix Version/s: None
    • Component/s: NanoSparqlServer
    • Labels:
      None

      Description

      Add a maintained cache for the "DESCRIBE" of RDF Resources. The cache should be able to answer linked data GET queries, SPARQL DESCRIBE queries, and also allow us to resolve star-join patterns against a diplodocus [2] style linked data cache. This will also support fast parallel materialization of resources for graph API clients [1].

      Cache maintenance can be handled by registering an IChangeLog listener. We can do that by creating a CustomServiceFactory for the cache. The factory will see every connection start. The factory can also be integrated into query evaluation by rewriting the DESCRIBE to include a SERVICE call. The DESCRIBE is basically a star-join [3]. We will need to take the returned Graph and generate the appropriate bindings. By handling this as a star-join, we can leverage the DESCRIBE cache for star-patterns in queries. This is very similar to diplodocus-style materialized joins, except that the latter also use the ordered list of the templates in the dictionary to decide when clusters can join (they share the same RDF Value entry).

      [1] https://sourceforge.net/apps/trac/bigdata/ticket/560 (Graph API)
      [2] https://diuf.unifr.ch/main/xi/diplodocus (diplodocus)
      [3] https://sourceforge.net/apps/trac/bigdata/ticket/552 (Retry star-joins)
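
To make the "take the returned Graph and generate the appropriate bindings" step concrete, here is a minimal sketch of resolving a star pattern against the cached description of one resource. Everything in it (the StarJoinSketch class, the Triple type, the bind method, plain strings for RDF values) is hypothetical and for illustration only; it is not the bigdata API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; none of these types exist in bigdata. */
final class StarJoinSketch {

    /** A minimal triple with subject, predicate and object as plain strings. */
    static final class Triple {
        final String s, p, o;
        Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    }

    /**
     * Resolves a star pattern (one subject, several predicates, one variable
     * per predicate) against the cached description of that subject.
     * Returns one solution per combination of matching objects, or an empty
     * list if any predicate of the star has no match.
     */
    static List<Map<String, String>> bind(List<Triple> description,
            String subject, Map<String, String> predicateToVar) {
        List<Map<String, String>> solutions = new ArrayList<>();
        solutions.add(new LinkedHashMap<String, String>()); // start from the empty solution
        for (Map.Entry<String, String> arm : predicateToVar.entrySet()) {
            List<Map<String, String>> next = new ArrayList<>();
            for (Map<String, String> partial : solutions) {
                for (Triple t : description) {
                    // Only statements about the subject with this predicate can extend the star.
                    if (t.s.equals(subject) && t.p.equals(arm.getKey())) {
                        Map<String, String> extended = new LinkedHashMap<>(partial);
                        extended.put(arm.getValue(), t.o);
                        next.add(extended);
                    }
                }
            }
            solutions = next; // becomes empty when the predicate had no match
        }
        return solutions;
    }
}
```

The nested loop is deliberately naive. A real implementation would presumably index the description by predicate rather than scanning it once per arm of the star.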

        Activity

        bryanthompson added a comment:

        I've added a basic describe cache. A linked data GET against a URL that is an extension of the SPARQL endpoint will now hit the cache. On a cache miss, the request is turned into a SPARQL DESCRIBE, and the result of that DESCRIBE is injected into the cache. The cache must be explicitly enabled through QueryHints.DESCRIBE_CACHE (edit the default value). This is all alpha and is expected to evolve significantly.
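
The hit/miss flow described above can be sketched roughly as follows. This is a toy model under stated assumptions: LazyDescribeCacheSketch and its methods are hypothetical names, the description is modelled as a set of strings, and the describeQuery function stands in for actually running a SPARQL DESCRIBE. It is not the bigdata implementation.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical sketch of a lazily populated DESCRIBE cache. */
final class LazyDescribeCacheSketch {

    private final Map<String, Set<String>> cache = new ConcurrentHashMap<>();
    private final Function<String, Set<String>> describeQuery; // stand-in for a SPARQL DESCRIBE
    private final AtomicInteger misses = new AtomicInteger();

    LazyDescribeCacheSketch(Function<String, Set<String>> describeQuery) {
        this.describeQuery = describeQuery;
    }

    /** Returns the cached description, running the DESCRIBE on a miss. */
    Set<String> get(String resource) {
        Set<String> hit = cache.get(resource);
        if (hit != null) {
            return hit; // cache hit: no query is run
        }
        misses.incrementAndGet();
        // Assumes the query never returns null (ConcurrentHashMap rejects null values).
        Set<String> described = describeQuery.apply(resource);
        // If another thread filled the entry in the meantime, keep its value.
        Set<String> raced = cache.putIfAbsent(resource, described);
        return raced != null ? raced : described;
    }

    int missCount() {
        return misses.get();
    }
}
```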

        I had to modify the ASTDescribeOptimizer and ASTConstructOptimizer in order to pass through the original ProjectionNode for a DESCRIBE query when the DESCRIBE cache is being maintained. We need to know the original projection in order to identify the resources that are being described by the solutions to the as-run query. Those correlations are picked up by a DescribeBindingsCollector, which observes the generated solutions, and are then translated into updates for the DESCRIBE cache by a DescribeCacheUpdater.

        I broke out the DESCRIBE and CONSTRUCT unit tests from the TestBasicQuery class into their own test suites. I have modified the (new) TestDescribe suite to also verify that the cache is being maintained.

        Note that this assumes that cache entries are inserted lazily when resources are described and are invalidated when a resource is updated. If we instead work with an eagerly materialized DESCRIBE cache, then we would not need the invalidation logic: we would update the cache synchronously as part of insert/remove, and we would not need the logic that monitors the results of a DESCRIBE query in order to update the cache.

        Committed revision r6440.
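
For the "invalidated when a resource is updated" half of the lazy design, a minimal sketch might look like this. The StatementChangeListener interface is a hypothetical stand-in for a change-log callback and does not reproduce the real IChangeLog signature. Invalidating the object as well as the subject is an assumption that a description may include statements in which the resource appears as the object.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of change-driven invalidation; not the bigdata API. */
final class InvalidatingDescribeCacheSketch {

    /** Hypothetical stand-in for a change-log callback. */
    interface StatementChangeListener {
        void statementChanged(String subject, String predicate, String object);
    }

    private final Map<String, Set<String>> cache = new ConcurrentHashMap<>();

    void put(String resource, Set<String> description) {
        cache.put(resource, description);
    }

    /** Returns the cached description, or null on a miss. */
    Set<String> get(String resource) {
        return cache.get(resource);
    }

    /** Listener to register with whatever reports inserted or removed statements. */
    StatementChangeListener listener() {
        return (s, p, o) -> {
            cache.remove(s); // the subject's description is now stale
            cache.remove(o); // assumption: the object's description may mention this statement too
        };
    }
}
```

An eagerly materialized cache would replace the two remove calls with an in-place update of both descriptions.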

        bryanthompson added a comment:

        Working on MVCC view semantics for the SOLUTION SETS and DESCRIBE caches.


        - Refactored BOpContext to take the PipelineOp and the lastInvocation
          flag as constructor arguments. This required touching a lot of the
          bop test suites. The PipelineOp gives us access to the chunk
          capacity in the BOpContext and should be useful in the long term.
          At present it is being used to locate the alternate solution set
          source.

        - Encapsulated field references in NamedSolutionSetRef. Extracted an
          INamedSolutionSetRef interface. Created a utility class to
          encapsulate the creation of named solution set references. Moved
          the named solution set reference classes and unit tests into
          bigdata/com.bigdata.bop. Added the concept of the Fully Qualified
          Name (FQN) for a named solution set.

        - Renamed ISparqlCache => ISolutionSetCache and SparqlCache =>
          SolutionSetCache.

        - Temporarily disabled the solution set cache in QueryHints. The
          solution set cache MUST use Name2Addr now or it will lose access to
          the solution sets when the [cacheMap] goes out of scope.

        - Modified StaticAnalysis#getSolutionSetStats(name) to throw an
          exception if the named solution set could not be resolved (all
          callers were doing this, so I lifted the exception into the method
          that was being called).

        - Refactored the SparqlCacheFactory and related interfaces to create
          an ICacheConnection abstraction. This is part of working on MVCC
          view for the IDescribeCache and the ISparqlCache. They both need to
          be aware of the namespace and timestamp of the KB view.

        - Delegated all resolution of pre-existing named solution sets to
          static methods on NamedSolutionSetRefUtility. This fixes a problem
          where static analysis was not looking in all the right places.

        @see https://sourceforge.net/apps/trac/bigdata/ticket/531 (SPARQL Update for Solution Sets)
        @see https://sourceforge.net/apps/trac/bigdata/ticket/584 (DESCRIBE CACHE)

        Committed revision r6448.
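
As a rough illustration of why the caches need both the namespace and the timestamp of the KB view, the sketch below resolves a named entry against the most recent version committed at or before the reader's timestamp. The class and method names are hypothetical, and forming the key as namespace + "." + localName is an assumption made for this example; it is not taken from the bigdata code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of timestamp-aware (MVCC-style) cache resolution. */
final class MvccViewCacheSketch {

    // Key: assumed fully qualified name. Value: versions ordered by commit time.
    private final Map<String, TreeMap<Long, String>> versions = new HashMap<>();

    private static String fqn(String namespace, String localName) {
        return namespace + "." + localName; // assumption made for this sketch only
    }

    /** Records the value of a named entry as of the given commit time. */
    void commit(String namespace, String localName, long commitTime, String value) {
        versions.computeIfAbsent(fqn(namespace, localName), k -> new TreeMap<>())
                .put(commitTime, value);
    }

    /** Returns the value visible to a reader at readTime, or null if none. */
    String read(String namespace, String localName, long readTime) {
        TreeMap<Long, String> history = versions.get(fqn(namespace, localName));
        if (history == null) {
            return null; // unknown namespace or name
        }
        // Most recent version committed at or before the reader's timestamp.
        Map.Entry<Long, String> visible = history.floorEntry(readTime);
        return visible == null ? null : visible.getValue();
    }
}
```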

        bryanthompson added a comment:


        - Added indexNameScan(prefix:String) to IIndexStore. This allows us
          to dynamically discover all named indices spanned by some
          prefix. The IIndexStore interface is extended by the
          IBigdataFederation, so this capability also extends to scale out.
          The requirement for enumerating the named indices spanned by a
          namespace exists for the named SOLUTION SET cache. If those
          solution sets are durable, then it also exists for
          AbstractTripleStore#destroy() since there may be named indices
          that are not explicitly part of either the SPORelation or the
          LexiconRelation.

          Modified ListIndexPartitions to use indexNameScan()
          (DumpFederation).

          Modified IndexManager.listIndexPartitions() to use
          indexNameScan().

          Modified DumpJournal to use indexNameScan().

          Modified CompactTask to use indexNameScan().

        - Modified the SolutionSetCache to use resolution through Name2Addr.

        - Modified AbstractRelation#destroy() to destroy all named indices
          spanned by the namespace of the relation.

        - Modified LexiconRelation to invoke super.destroy() after it
          destroys its own indices. This change was necessary since those
          indices are now destroyed automatically by the base class when the
          relation is destroyed.

        - Exposed the readsOnCommitTime on ITx. This was already available
          on the Tx class.

        - Fixed bug where getIndexWithCheckpointAddr(long) could allow the
          unisolated view of an index to be exposed. This issue was
          specific to the RWStore and was an error in immediateFree(). See
          https://sourceforge.net/apps/trac/bigdata/ticket/586 (RWStore
          immedateFree() not removing Checkpoint addresses from the
          historical index cache.)

        - Modified some methods in AbstractJournal that returned a
          ReadOnlyIndex to internally simply return a read-only BTree
          object. Specifically, this pattern was changed in
          getName2Addr(commitTime) and getName2Addr(). The resulting BTree
          is guaranteed to be read-only by getIndexWithCheckpointAddr()
          which was already being used by both methods.

        - Deprecated AbstractJournal#registerIndex(final String name, final
          IndexMetadata metadata) in favor of register(name,metadata).

        - Moved the initialization of the DescribeServletFactory into the
          ServiceRegistry. It is still conditional on the query hint.
          However, the service is actually registered so it now notices
          updates and uses them to invalidate the cache.

        - Added unit test to verify that the DESCRIBE cache is invalidated
          based on IChangeLog notices.

        See https://sourceforge.net/apps/trac/bigdata/ticket/584 (DESCRIBE CACHE)
        See https://sourceforge.net/apps/trac/bigdata/ticket/531 (SOLUTION SET CACHE)

        Committed revision r6452.
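
The idea behind a prefix scan over index names can be sketched with an ordered set: every name sharing a prefix is contiguous in sort order, so the scan seeks to the prefix and stops at the first name that no longer matches. The class below is a hypothetical stand-in, and the index names in the usage note are only illustrative; it does not reproduce the IIndexStore signature or how Name2Addr is actually traversed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch of a prefix scan over registered index names. */
final class IndexNameScanSketch {

    private final TreeSet<String> names = new TreeSet<>(); // kept in sorted order

    void register(String name) {
        names.add(name);
    }

    /** Returns all registered names that start with the prefix, in sorted order. */
    List<String> indexNameScan(String prefix) {
        List<String> out = new ArrayList<>();
        // Seek to the first name >= prefix, then read until the prefix stops matching.
        for (String name : names.tailSet(prefix, true)) {
            if (!name.startsWith(prefix)) {
                break; // names are sorted, so no later name can match
            }
            out.add(name);
        }
        return out;
    }
}
```

With names such as "kb.spo.SPO", "kb.lex.TERM2ID" and "kb2.spo.SPO" registered, scanning for "kb." returns only the first two. Including the trailing separator in the prefix is what keeps a namespace like "kb2" out of the results.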

        bryanthompson added a comment:

        The DESCRIBE cache will never become materialized if you are only running queries (versus mutations) against a given KB instance. This is because the HTree is only created when we have a mutable view.

        I have disabled all cache features in QueryHints pending further work on this issue.


          People

          • Assignee: bryanthompson
          • Reporter: bryanthompson
          • Votes: 0
          • Watchers: 2
