Type: New Feature
Status: Closed - Won't Fix
Affects Version/s: BIGDATA_RELEASE_1_1_0
Fix Version/s: None
Add a SPARQL cache. This can be used to cache RDF/GOM objects (DESCRIBE <uri>), SPARQL solution sets, and general graph query results (CONSTRUCT). An ASK cache is also easily constructed and would be useful for transparently distributed query against known data sources.
For DESCRIBE <uri> we can apply updates to the cache as they are processed. We could cache the results for each URI separately, and the response when multiple URIs are given could be a merge of the responses for the separate URIs.
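As an illustration, a per-URI DESCRIBE cache with merge-on-read might look like the following sketch (class and method names are hypothetical, and triples are represented as plain strings for brevity; this is not the bigdata API):

```java
import java.util.*;

// Sketch: cache DESCRIBE results per URI; a multi-URI DESCRIBE is
// answered by merging the cached graphs for the individual URIs.
public class DescribeCacheSketch {

    // uri -> set of triples (represented here as strings for brevity)
    private final Map<String, Set<String>> cache = new HashMap<>();

    /** Record the DESCRIBE result for a single URI. */
    public void put(String uri, Set<String> triples) {
        cache.put(uri, new HashSet<>(triples));
    }

    /** Invalidate the entry for a URI when an update touches it. */
    public void invalidate(String uri) {
        cache.remove(uri);
    }

    /** Answer DESCRIBE uri1 uri2 ... by merging per-URI entries; null on any miss. */
    public Set<String> describe(Collection<String> uris) {
        Set<String> merged = new LinkedHashSet<>();
        for (String uri : uris) {
            Set<String> triples = cache.get(uri);
            if (triples == null) return null; // cache miss for this URI
            merged.addAll(triples);
        }
        return merged;
    }
}
```

The point of the per-URI granularity is that an update only invalidates the entries for the URIs it touches, while multi-URI requests still hit the cache.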
For general SPARQL solution sets, we can invalidate the cache as statement patterns used within the cache are updated. It is also possible to use subsumption to identify when solution sets for queries, subqueries, or join groups can be reused to satisfy some or all of a new query. One simple case is when the query differs only in the solution modifiers (ORDER BY, OFFSET/LIMIT). For OFFSET/LIMIT, it makes sense to cache the full solution set and then report only the relevant SLICE to the application. The remaining SLICEs can then be returned trivially from the cache. For join groups, the cache should be indexed and matched based on an unordered set of statement patterns and subgroups.
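The OFFSET/LIMIT case and the unordered-set cache key can be sketched together as follows (names and representations are illustrative; statement patterns and solutions are plain strings here):

```java
import java.util.*;

// Sketch: cache the full solution set under a key that is the unordered
// set of statement patterns, then answer OFFSET/LIMIT variants by slicing
// the cached list. Set equality makes the key order-independent.
public class SolutionSetCacheSketch {

    private final Map<Set<String>, List<String>> cache = new HashMap<>();

    /** Cache the full (unsliced) solution set for a join group. */
    public void put(Set<String> statementPatterns, List<String> solutions) {
        cache.put(statementPatterns, new ArrayList<>(solutions));
    }

    /** Serve a SLICE of the cached full solution set, or null on a miss. */
    public List<String> slice(Set<String> statementPatterns, int offset, int limit) {
        List<String> all = cache.get(statementPatterns);
        if (all == null) return null;
        int from = Math.min(offset, all.size());
        int to = Math.min(offset + limit, all.size());
        return all.subList(from, to);
    }
}
```

Because the key is a set, a query listing the same statement patterns in a different order matches the same cache entry.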
Any join group solution in the cache which is less restrictive can be used to generate the solutions for a join group enjoying either the same or more restrictive constraints. When the query has more constraints, the additional constraints are simply used to filter the join group solutions. However, there are some "gotchas". One problem with SPARQL queries is that exogenous solutions (as presented by the openrdf API and/or the BINDINGS clause) impose constraints on the query which are not captured solely by the join groups in the query. Thus, when caching and matching solution sets we must also consider the BINDINGS which were in effect when the solution set was cached.
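The filter-the-less-restrictive-entry step can be sketched as below (a hypothetical helper; solutions are modeled as simple variable-to-value maps, and the extra constraints as predicates):

```java
import java.util.*;
import java.util.function.Predicate;

// Sketch: reuse a cached, less restrictive solution set for a more
// constrained query by applying the extra FILTER constraints locally.
public class SubsumptionFilterSketch {

    public static List<Map<String, Integer>> applyExtraConstraints(
            List<Map<String, Integer>> cachedSolutions,
            List<Predicate<Map<String, Integer>>> extraConstraints) {
        List<Map<String, Integer>> out = new ArrayList<>();
        for (Map<String, Integer> solution : cachedSolutions) {
            boolean satisfied = true;
            for (Predicate<Map<String, Integer>> c : extraConstraints) {
                if (!c.test(solution)) { satisfied = false; break; }
            }
            if (satisfied) out.add(solution);
        }
        return out;
    }
}
```

With an empty constraint list this degenerates to returning the cached solutions unchanged, which is the exact-match case.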
Subsumption matching of join groups would appear to be fairly straightforward if one matches first on the basic (unordered) statement patterns and then verifies that the new join group is no less constrained than the original join group. However (another gotcha), bigdata does not use bottom-up evaluation. Instead, the engine vectors solutions through the query plan in a left-to-right order and hides variables which are not projected into sub-queries. If this vectored evaluation would produce different results than bottom-up evaluation, then we rewrite the query by lifting the join group out into a sub-query. Hence, caching of sub-groups or sub-selects is not possible unless the left side of the cached query plan is identical (and the same BINDINGS are given externally).
SERVICE calls are an excellent opportunity for caching. We already use a custom cache for the bigdata internal full-text search service, and it provides a tremendous performance boost for common query patterns when populating a GUI page. Non-native SERVICEs and true remote SERVICE calls can be cached as well and should see a similar benefit. However, caching for SERVICEs should be provisioned in the IServiceOptions, since updates to a non-native service might not be visible to the bigdata IChangeLog. Also, since we use vectored evaluation, a cached SERVICE result is only valid when the inputs for the cached solution are the same (or a superset, which we can then filter locally based on an analysis of the SERVICE graph pattern). However, any SERVICE which declares that it should run first may be understood as being evaluated solely against the exogenous variables.
Yet another twist is that variables may be replaced in queries with constants. Such queries are more bound than a corresponding query in the cache with a variable in the position of that constant. Again, this is very similar to the question of handling the BINDINGS clause (except that the BINDINGS clause can communicate multiple source solutions into the query).
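The constant-for-variable case reduces to the same local-filtering idea: answer the more bound query from the cached entry for the query with the variable free, keeping only the solutions where that variable equals the constant. A hypothetical sketch:

```java
import java.util.*;

// Sketch: a query binding ?x to a constant is answered from a cached
// entry for the same query with ?x free, by filtering the cached
// solutions to those where ?x takes that constant value.
public class ConstantRestrictionSketch {

    public static List<Map<String, String>> restrict(
            List<Map<String, String>> cachedSolutions,
            String var, String constant) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> solution : cachedSolutions) {
            if (constant.equals(solution.get(var))) out.add(solution);
        }
        return out;
    }
}
```

A BINDINGS clause with several source solutions would generalize this to filtering against a set of admissible values rather than a single constant.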
The MVCC architecture means that cache invalidation can only apply to changes since the commit point for which the invalidation notice was generated. Historical cache views which have been invalidated should not be flushed until the retention period for the cache view has expired since they may still be accessed by active transactions. If the history retention period is known to be significantly longer than the historical cache access patterns, then invalidated cache results could be flushed once there is no active transaction which can read from that commit point. Either way, this requires an integration with the transaction service.
The invalidation notices should emerge from the existing IChangeLog interface. For a tightly integrated cache, they can be propagated by method calls against an object reference attached to the AST2BOpContext. For remote caches or cache fabrics, updates can be propagated using either UDP or TCP notification, depending on the cache coherency requirements.
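For the tightly integrated case, the fan-out of invalidation notices by method call could look like this sketch (the listener interface and names are illustrative, not the bigdata IChangeLog API):

```java
import java.util.*;

// Sketch: propagate invalidation notices from the change log to
// registered caches by direct method call. Each notice carries the
// affected statement pattern and the commit point it was generated for,
// so MVCC-aware caches can scope the invalidation.
public class InvalidationSketch {

    /** Hypothetical listener contract for a cache. */
    public interface InvalidationListener {
        void changed(String statementPattern, long commitPoint);
    }

    private final List<InvalidationListener> listeners = new ArrayList<>();

    public void addListener(InvalidationListener l) {
        listeners.add(l);
    }

    /** Called as statements are added or removed; fans out to the caches. */
    public void notifyChange(String statementPattern, long commitPoint) {
        for (InvalidationListener l : listeners) {
            l.changed(statementPattern, commitPoint);
        }
    }
}
```

The same notice stream could be serialized onto UDP or TCP for a remote cache fabric without changing the listener contract.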
The cache itself can be trivially derived from existing classes, including the MemoryManager (up to 4TB of main memory cache), the ConcurrentWeakValueCache, and the various cache implementations which bound the total bytes in the cache by batching updates through the cache and evicting the LRU (or LIRS) entries until there is enough room in the cache to store the new cache entry. The solutions in the cache should be stored in the same IV encoding that we use to store named solution sets, and cached solution sets could be viewed as named solution sets which are resolved based on a subsumption match and then referenced by their cache entry identifier. Assuming that we work with weak reference semantics for cache eviction (and recycling of the records backing the solution set in the memory manager), a cache entry can be pinned by holding a reference to it in the query plan. We might use a proxy pattern to pin a cached solution set within a cache fabric since it will be remote (not integrated with the query controller).
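The byte-bounded LRU eviction policy mentioned above can be sketched with an access-ordered LinkedHashMap (illustrative only; the real implementations sit over the MemoryManager rather than the Java heap):

```java
import java.util.*;

// Sketch: bound the total bytes in the cache, evicting LRU entries
// until there is room for the new entry. LinkedHashMap in access order
// keeps the least recently used entry at the head of iteration.
public class ByteBoundedLruSketch {

    private final long maxBytes;
    private long currentBytes = 0;
    private final LinkedHashMap<String, byte[]> map =
            new LinkedHashMap<>(16, 0.75f, true /* access order */);

    public ByteBoundedLruSketch(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    public void put(String key, byte[] value) {
        byte[] old = map.remove(key);
        if (old != null) currentBytes -= old.length;
        // Evict LRU entries until the new entry fits (an entry larger
        // than maxBytes would still be admitted in this naive sketch).
        Iterator<Map.Entry<String, byte[]>> it = map.entrySet().iterator();
        while (currentBytes + value.length > maxBytes && it.hasNext()) {
            Map.Entry<String, byte[]> lru = it.next();
            currentBytes -= lru.getValue().length;
            it.remove();
        }
        map.put(key, value);
        currentBytes += value.length;
    }

    public byte[] get(String key) { return map.get(key); }

    public int size() { return map.size(); }
}
```

A LIRS policy would change only the victim-selection order, not the byte-bounded admission loop.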
The HTree is used for named solution sets and works quite well for these purposes. However, it is single threaded for writing, which could impose some overhead when running the original query into the cache. Releasing a named solution set is managed by clearing the child IMemoryManager within which it was allocated. A similar design could be used to release solution sets and graphs in the SPARQL cache. The solution cache will contain materialized RDF Values. If we only used the cache for top-level queries, then a general-purpose RDF BindingSet serialization would be appropriate. However, for caching join groups we want to preserve the IVs as well, so a bigdata-specific representation must be used, such as the one used by the IV encoders. Another possibility is the DirectBufferPoolAllocator. This architecture was mapped out for use by the query engine and writes solution sets onto one or more direct buffers. This can provide higher throughput than the HTree, and the NIO buffers can be shipped directly on the wire. The IV encoder for the wire could be used in this case. It incrementally records the IVCache association information into the solutions and provides for incremental decode, requiring heap in the decoder only for the IVCache associations.
Rather than storing GRAPHs in the cache, we could also simply store solution sets and run the ASTConstructIterator to produce the statements from the CONSTRUCT template (DESCRIBE is translated into a CONSTRUCT). This would also make it easier to handle CONNEG for the response since the solutions are not yet in an external representation.
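Producing the statements from cached solutions amounts to instantiating the CONSTRUCT template once per solution, which is the role the ASTConstructIterator plays over real solutions. A simplified sketch (hypothetical names; templates and values are plain strings, and tokens starting with '?' stand for variables):

```java
import java.util.*;

// Sketch: generate CONSTRUCT statements by substituting each cached
// solution into the template triples; triples with an unbound variable
// are skipped, as SPARQL CONSTRUCT requires.
public class ConstructFromSolutionsSketch {

    /** templates: triples as {s, p, o}; tokens starting with '?' are variables. */
    public static List<String> construct(List<String[]> templates,
                                         List<Map<String, String>> solutions) {
        List<String> statements = new ArrayList<>();
        for (Map<String, String> solution : solutions) {
            for (String[] t : templates) {
                String s = resolve(t[0], solution);
                String p = resolve(t[1], solution);
                String o = resolve(t[2], solution);
                if (s == null || p == null || o == null) continue; // unbound: skip
                statements.add(s + " " + p + " " + o);
            }
        }
        return statements;
    }

    private static String resolve(String token, Map<String, String> solution) {
        return token.startsWith("?") ? solution.get(token.substring(1)) : token;
    }
}
```

Since the cached artifact is the solution set rather than a serialized graph, the same entry can be rendered into whatever representation CONNEG selects at response time.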
An ASK cache is of necessity different since we are only storing a boolean result.
A SPARQL cache should significantly improve performance for benchmarks and applications by permitting pre-computed results to be reused without any computational cost.
See Improving the Performance of Semantic Web Applications with SPARQL Query Caching for a discussion of some related issues and some benchmark results.