Details

    • Type: New Feature
    • Status: Closed
    • Resolution: Duplicate
    • Affects Version/s: BIGDATA_RELEASE_1_1_0
    • Fix Version/s: None
    • Component/s: NanoSparqlServer
    • Labels: None

      Description

      Add a SPARQL cache. This can be used to cache RDF/GOM objects (DESCRIBE <uri>), SPARQL solution sets, and general graph query results (CONSTRUCT). An ASK cache is also easily constructed and would be useful for transparently distributed query against known data sources.
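
      A minimal sketch of the cache facade this implies. All names below are hypothetical placeholders, not bigdata APIs; they only illustrate the shape of the four cache flavors:

      {code:java}
      import java.util.List;
      import java.util.Map;
      import java.util.Optional;

      /** Hypothetical facade over the four cache flavors discussed above. */
      interface SparqlCache {

          /** Cached graph (as serialized triples) for DESCRIBE <uri>. */
          Optional<List<String>> describe(String uri);

          /** Cached solution set for a normalized query key. */
          Optional<List<Map<String, String>>> solutions(String queryKey);

          /** Cached graph for a CONSTRUCT query. */
          Optional<List<String>> construct(String queryKey);

          /** Cached boolean result for an ASK query. */
          Optional<Boolean> ask(String queryKey);
      }
      {code}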

      For DESCRIBE <uri> we can apply updates to the cache as they are processed. Each URI's result could be cached separately; when multiple URIs are given, the response could be a merge of the per-URI results.
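
      A hedged sketch of the multi-URI merge, assuming per-URI graphs are cached as sets of serialized triples (all types here are placeholders):

      {code:java}
      import java.util.Collection;
      import java.util.LinkedHashSet;
      import java.util.Map;
      import java.util.Optional;
      import java.util.Set;

      final class DescribeMerge {

          /** Returns the merged graph iff every requested URI is cached. */
          static Optional<Set<String>> merged(final Map<String, Set<String>> cache,
                  final Collection<String> uris) {
              final Set<String> merged = new LinkedHashSet<>();
              for (String uri : uris) {
                  final Set<String> graph = cache.get(uri);
                  if (graph == null)
                      return Optional.empty(); // cache miss => evaluate the query
                  merged.addAll(graph); // set union de-duplicates shared triples
              }
              return Optional.of(merged);
          }
      }
      {code}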

      For general SPARQL solution sets, we can invalidate the cache as statement patterns used within the cache are updated. It is also possible to use subsumption to identify when solution sets for queries, subqueries, or join groups can be reused to satisfy some or all of a new query. One simple case is when the query differs only in the solution modifiers (ORDER BY, OFFSET/LIMIT). For OFFSET/LIMIT, it makes sense to cache the full solution set and then report only the relevant SLICE to the application. The remaining SLICEs can then be returned trivially from the cache. For join groups, the cache should be indexed and matched based on an unordered set of statement patterns and subgroups.
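
      A sketch of the SLICE case, assuming the full solution set is already materialized in the cache:

      {code:java}
      import java.util.List;

      final class SliceFromCache {

          /**
           * Report only the requested SLICE of a fully cached solution set.
           * Subsequent OFFSET/LIMIT requests against the same query are then
           * answered trivially from the same cached list.
           */
          static <T> List<T> slice(final List<T> fullSolutions, final long offset,
                  final long limit) {
              final int from = (int) Math.min(offset, fullSolutions.size());
              final int to = (int) Math.min(from + limit, fullSolutions.size());
              return fullSolutions.subList(from, to);
          }
      }
      {code}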

      Any join group solution in the cache which is less restrictive can be used to generate the solutions for a join group having the same or more restrictive constraints. When the query has more constraints, the additional constraints are simply used to filter the cached join group solutions. However, there are some "gotchas". One problem with SPARQL queries is that exogenous solutions (as presented by the openrdf API and/or the BINDINGS clause) pose constraints on the query which are not captured solely by the join groups in the query. Thus, when caching and matching solution sets we must also consider the BINDINGS which were in effect when the solution set was cached.
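
      One way to honor that constraint is to make the exogenous bindings part of the cache key, so a hit is only reported when both the (normalized) join group and the bindings in effect match. A hedged sketch with placeholder types:

      {code:java}
      import java.util.Objects;
      import java.util.Set;

      /** Cache key = normalized join group + exogenous bindings in effect. */
      final class SolutionSetKey {

          private final Set<String> statementPatterns; // unordered, normalized
          private final Set<String> exogenousBindings; // openrdf API / BINDINGS

          SolutionSetKey(final Set<String> statementPatterns,
                  final Set<String> exogenousBindings) {
              this.statementPatterns = statementPatterns;
              this.exogenousBindings = exogenousBindings;
          }

          @Override
          public boolean equals(final Object o) {
              if (!(o instanceof SolutionSetKey))
                  return false;
              final SolutionSetKey k = (SolutionSetKey) o;
              return statementPatterns.equals(k.statementPatterns)
                      && exogenousBindings.equals(k.exogenousBindings);
          }

          @Override
          public int hashCode() {
              return Objects.hash(statementPatterns, exogenousBindings);
          }
      }
      {code}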

      Subsumption matching of join groups would appear to be fairly straightforward if one matches first on the basic (unordered) statement patterns and then verifies that the new join group is no less constrained than the original join group. However (another gotcha), bigdata does not use bottom-up evaluation. Instead, the engine vectors solutions through the query plan in a left-to-right order and hides variables which are not projected into sub-queries. If this vectored evaluation would produce different results than bottom-up evaluation, then we rewrite the query by lifting out the join group into a sub-query. Hence, caching of sub-groups or sub-selects is not possible unless the left side of the cached query plan is identical (and the same BINDINGS are given externally).
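
      The first pass of that test reduces to a set comparison; per the caveat above, it may only be applied where the evaluation context (the left side of the plan and the external BINDINGS) is known to be identical. A sketch, with patterns and constraints reduced to normalized strings:

      {code:java}
      import java.util.Set;

      final class SubsumptionMatch {

          /**
           * A cached join group may be reused when it covers the same unordered
           * statement patterns and the new group is no less constrained: the
           * new group may only add constraints, which are then applied as
           * filters over the cached solutions.
           */
          static boolean mayReuse(final Set<String> cachedPatterns,
                  final Set<String> cachedConstraints,
                  final Set<String> newPatterns,
                  final Set<String> newConstraints) {
              return cachedPatterns.equals(newPatterns)
                      && newConstraints.containsAll(cachedConstraints);
          }
      }
      {code}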

      SERVICE calls are an excellent opportunity for caching. We already use a custom cache for the bigdata internal full-text search service, and it provides a tremendous performance boost for common query patterns when populating a GUI page. Non-native SERVICEs and true remote SERVICE calls can be cached as well, with a similar benefit. However, caching for SERVICEs should be provisioned in the IServiceOptions since updates to a non-native service might not be visible to the bigdata IChangeLog. Also, since we use vectored evaluation, a cached SERVICE result is only valid when the inputs for the cached solution are the same (or a superset, which we can then filter locally based on an analysis of the SERVICE graph pattern). However, any SERVICE which declares that it should run first may be understood as being evaluated solely against the exogenous variables.
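
      A hedged sketch of a SERVICE result cache keyed on the service URI plus the exact input solutions vectored into the call. The superset-then-filter case is omitted, and none of these types correspond to IServiceOptions or other bigdata interfaces:

      {code:java}
      import java.util.ArrayList;
      import java.util.List;
      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;
      import java.util.function.Function;

      final class ServiceCallCache {

          // (serviceUri + input solutions) => cached service solutions.
          private final Map<List<Object>, List<Object>> cache =
                  new ConcurrentHashMap<>();

          List<Object> invoke(final String serviceUri,
                  final List<Object> inputSolutions,
                  final Function<List<Object>, List<Object>> remoteCall) {
              final List<Object> key = new ArrayList<>();
              key.add(serviceUri);
              key.addAll(inputSolutions);
              // Only reuse the cached result for identical inputs.
              return cache.computeIfAbsent(key, k -> remoteCall.apply(inputSolutions));
          }
      }
      {code}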

      Yet another twist is that variables may be replaced in queries with constants. Such queries are more bound than a corresponding query in the cache with a variable in the position of that constant. Again, this is very similar to the question of handling the BINDINGS clause (except that the BINDINGS clause can communicate multiple source solutions into the query).

      The MVCC architecture means that cache invalidation can only apply to changes since the commit point for which the invalidation notice was generated. Historical cache views which have been invalidated should not be flushed until the retention period for the cache view has expired since they may still be accessed by active transactions. If the history retention period is known to be significantly longer than the historical cache access patterns, then invalidated cache results could be flushed once there is no active transaction which can read from that commit point. Either way, this requires integration with the transaction service.
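
      A sketch of that release policy, with the transaction service integration reduced to a single "earliest active read time" argument (all names hypothetical):

      {code:java}
      import java.util.Map;
      import java.util.concurrent.ConcurrentSkipListMap;

      final class CommitPointCache<V> {

          // commitTime => the cached view as of that commit point.
          private final ConcurrentSkipListMap<Long, V> viewByCommitTime =
                  new ConcurrentSkipListMap<>();

          void put(final long commitTime, final V view) {
              viewByCommitTime.put(commitTime, view);
          }

          /** A transaction sees the most recent view at or before its read time. */
          V viewFor(final long txReadTime) {
              final Map.Entry<Long, V> e = viewByCommitTime.floorEntry(txReadTime);
              return e == null ? null : e.getValue();
          }

          /** Release invalidated views no active transaction can still read. */
          void releaseBefore(final long earliestActiveTxReadTime) {
              final Long keep = viewByCommitTime.floorKey(earliestActiveTxReadTime);
              if (keep != null) // retain the view the earliest reader still uses
                  viewByCommitTime.headMap(keep).clear();
          }
      }
      {code}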

      The invalidation notices should emerge from the existing IChangeLog interface. For a tightly integrated cache, they can be propagated by method calls against an object reference attached to the AST2BOpContext. For remote caches or cache fabrics, updates can be propagated using either UDP or TCP notification depending on the cache coherency requirements.
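
      A hedged sketch of the propagation path; IChangeLog is the real bigdata interface, but the hook and notice shapes below are hypothetical. For a remote fabric, the sink would wrap a UDP (fire-and-forget) or TCP (reliable) publisher instead of a direct method call:

      {code:java}
      import java.util.function.BiConsumer;

      final class InvalidationPropagator {

          // (commitTime, changed statement pattern) => invalidate cache entries.
          private final BiConsumer<Long, String> sink;

          InvalidationPropagator(final BiConsumer<Long, String> sink) {
              this.sink = sink;
          }

          /** Invoked from a change-log hook as statements are added or removed. */
          void onStatementChanged(final long commitTime,
                  final String statementPattern) {
              sink.accept(commitTime, statementPattern);
          }
      }
      {code}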

      The cache itself can be trivially derived from existing classes, including the MemoryManager (up to 4TB of main memory cache), the ConcurrentWeakValueCache, and the various cache implementations which bound the total bytes in the cache by batching updates through the cache and evicting the LRU (or LIRS) entries until there is enough room to store the new cache entry. The solutions in the cache should be stored in the same IV encoding that we use to store named solution sets, and cached solution sets could be viewed as named solution sets which are resolved based on a subsumption match and then referenced by their cache entry identifier. Assuming that we work with weak reference semantics for cache eviction (and recycling of the records backing the solution set in the memory manager), a cache entry can be pinned by holding a reference to it in the query plan. We might use a proxy pattern to pin a cached solution set within a cache fabric since it will be remote (not integrated with the query controller).
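
      The byte-bounded LRU variant is the simplest of those policies to sketch (LIRS and the MemoryManager-backed variants are more involved; this only illustrates the eviction loop):

      {code:java}
      import java.util.Iterator;
      import java.util.LinkedHashMap;
      import java.util.Map;

      final class ByteBoundedLruCache<K> extends LinkedHashMap<K, byte[]> {

          private final long maxBytes;
          private long currentBytes;

          ByteBoundedLruCache(final long maxBytes) {
              super(16, 0.75f, true/* accessOrder => LRU iteration order */);
              this.maxBytes = maxBytes;
          }

          @Override
          public byte[] put(final K key, final byte[] value) {
              final byte[] old = super.put(key, value);
              currentBytes += value.length - (old == null ? 0 : old.length);
              // Evict LRU entries until there is room for the new entry.
              final Iterator<Map.Entry<K, byte[]>> it = entrySet().iterator();
              while (currentBytes > maxBytes && it.hasNext()) {
                  final Map.Entry<K, byte[]> eldest = it.next();
                  if (eldest.getKey().equals(key))
                      continue; // never evict the entry just inserted
                  currentBytes -= eldest.getValue().length;
                  it.remove();
              }
              return old;
          }
      }
      {code}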

      The HTree is used for named solution sets and works quite well for these purposes. However, it is single threaded for writing, which could impose some overhead when running the original query into the cache. Releasing a named solution set is managed by clearing the child IMemoryManager within which it was allocated. A similar design could be used to release solution sets and graphs in the SPARQL cache. The solution cache will contain materialized RDF Values. If we would only use the cache for top-level queries, then a general purpose RDF BindingSet serialization would be appropriate. However, for caching join groups we want to preserve the IVs as well. Therefore a bigdata specific representation must be used, such as is used by the IV encoders. Another possibility is the DirectBufferPoolAllocator. This architecture was mapped out for use by the query engine and writes solution sets onto one or more direct buffers. This can provide both higher throughput than the HTree and the NIO buffers can be directly shipped on the wire. The IV encoder for the wire could be used in this case. It incrementally records the IVCache association information into the solutions and provides for incremental decode, requiring only the heap in the decoder for the IVCache associations.

      Rather than storing GRAPHs in the cache, we could also simply store solution sets and run the ASTConstructIterator to produce the statements from the CONSTRUCT template (DESCRIBE is translated into a CONSTRUCT). This would also make it easier to handle CONNEG for the response since the solutions are not yet in an external representation.
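
      A hedged sketch of that approach: cache the solution set and re-run the CONSTRUCT template over it per request. The template and solutions are reduced to strings and maps here; in bigdata the ASTConstructIterator plays this role:

      {code:java}
      import java.util.ArrayList;
      import java.util.List;
      import java.util.Map;

      final class ConstructFromSolutions {

          /** One triple pattern of the CONSTRUCT template ("?x" terms are variables). */
          record Pattern(String s, String p, String o) {}

          static List<String> construct(final List<Map<String, String>> solutions,
                  final List<Pattern> template) {
              final List<String> triples = new ArrayList<>();
              for (Map<String, String> solution : solutions) {
                  for (Pattern t : template) {
                      final String s = resolve(t.s(), solution);
                      final String p = resolve(t.p(), solution);
                      final String o = resolve(t.o(), solution);
                      if (s != null && p != null && o != null)
                          triples.add(s + " " + p + " " + o + " .");
                  }
              }
              return triples;
          }

          private static String resolve(final String term,
                  final Map<String, String> solution) {
              return term.startsWith("?") ? solution.get(term.substring(1)) : term;
          }
      }
      {code}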

      An ASK cache is of necessity different since we are only storing a boolean result.

      A SPARQL cache should significantly improve performance for benchmarks and applications by permitting pre-computed results to be reused without recomputing them.

      See "Improving the Performance of Semantic Web Applications with SPARQL Query Caching" for a discussion of some related issues and some benchmark results.

        Activity

        bryanthompson added a comment -

        Another interesting twist would be to support LOAD INTO SOLUTIONS. This could be used to join against data mapped into RDF solutions from flat files. That would begin to move bigdata into an architecture which can join against unindexed data as well as indexed data.

        I've put up a wiki page [1] to document the support for bigdata-specific extensions to SPARQL UPDATE.

        [1] https://sourceforge.net/apps/mediawiki/bigdata/index.php?title=SPARQL_Update

        bryanthompson added a comment -

        I have expanded the documentation for the SPARQL UPDATE extension and modified the SPARQL grammar. It now parses the extended SPARQL UPDATE grammar. I have worked through the AST changes to support this and the correct AST is now being built by the parser. I have not yet integrated support for the SPARQL update extensions.

        Committed revision r6204.

        bryanthompson added a comment -

        I am going to fork this issue and continue to deal with the bigdata extensions to SPARQL UPDATE for named solution sets under the new issue [1].

        [1] https://sourceforge.net/apps/trac/bigdata/ticket/531 (SPARQL UPDATE for SOLUTION SETS)

        bryanthompson added a comment -

        I have finished relayering the Checkpoint, ICheckpointProtocol, ISolutionSet, and SolutionSetStream classes. There is now a durable SolutionSetStream which can be backed by an IRWStrategy. Once we resolve [1], we will be able to generalize this to any IRawStore. The SparqlCache is still wired to a ConcurrentHashMap. I have not yet derived versions of the ISolutionSetCache for the HTree and the BTree. There is some remaining work to be done there concerning the integration with the Checkpoint record. Specifically, I need to establish a protocol to store the address of the ISolutionSetStats in the checkpoint record. Right now the ISolutionSetStats is hacked into the bloom filter addr since that is not used by the SolutionSetStream. There is plenty of room in the Checkpoint record; we just need to establish a protocol to make this work.

        There is now a top-level durable Stream. SolutionSetStream extends this. The Stream is based on the PSOutputStream mechanism but it implements the ICheckpointProtocol and hence can be stored in Name2Addr just like the BTree or HTree. This also makes it compatible with a MemStore mode Journal for the SPARQL cache.

        I still need to review the integration of the SparqlCache with the SolutionSetStream and provide appropriate layering of access (query scope => cache scope => database scope) through the resource locator protocol. There are notes on this above and also in the committed files.

        Committed revision r6372.

        [1] https://sourceforge.net/apps/trac/bigdata/ticket/555 (Support PSOutputStream/InputStream at IRawStore)

        bryanthompson added a comment -

        The DESCRIBE cache is now being handled by [1]. The SPARQL solution set cache is handled by [2].

        This ticket is closed. We can open a new ticket when we decide to take up view maintenance for named solution sets.

        [1] https://sourceforge.net/apps/trac/bigdata/ticket/584 (DESCRIBE Cache)
        [2] https://sourceforge.net/apps/trac/bigdata/ticket/531 (SPARQL UPDATE for SOLUTION SETS)


          People

          • Assignee: bryanthompson
          • Reporter: bryanthompson
          • Votes: 0
          • Watchers: 2
