Type: New Feature
Status: Closed - Won't Fix
Affects Version/s: BIGDATA_RELEASE_1_1_0
Fix Version/s: None
Component/s: Bigdata Federation
Range counts are attached to each predicate during the query optimization phase. On a cluster, the range counts for unselective as-given (versus as-bound) predicates can span multiple shards. The range count operation is already parallelized on the cluster. The purpose of this ticket is to further optimize such range counts.
1. The client view of the scale-out index uses a cached view of the metadata index (MDI). For any range count which fully spans a shard, the discovered range count for that shard could be recorded in the index locators in the client's MDI cache. That cache is currently organized as a B+Tree using exactly the same data structures that are used by the MDS. In order to support cached range counts, we would have to add a hash map from partitionId to the rangeCount for that partition.
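The partitionId-to-rangeCount side cache described above could be sketched as follows. This is a minimal illustration, not the actual MDI cache code; the class and method names (ShardRangeCountCache, put, get, invalidate) are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch of a client-side cache mapping partitionId to the
 * last observed range count for a fully spanned shard. The names here
 * are illustrative; they are not the actual MDI cache classes.
 */
class ShardRangeCountCache {

    private final Map<Integer, Long> countByPartitionId = new ConcurrentHashMap<>();

    /** Record the range count discovered for a fully spanned shard. */
    void put(final int partitionId, final long rangeCount) {
        countByPartitionId.put(partitionId, rangeCount);
    }

    /** Return the cached range count, or -1 if not cached. */
    long get(final int partitionId) {
        return countByPartitionId.getOrDefault(partitionId, -1L);
    }

    /** Drop the entry when the shard is split, moved, or joined. */
    void invalidate(final int partitionId) {
        countByPartitionId.remove(partitionId);
    }
}
```

The cache would have to be invalidated whenever the locator for a partition changes, since a split or move changes the tuples resolvable under that partitionId.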
2. FusedView turns rangeCount() into an actual probe against each B+Tree component in the view. It does this because a shard split operation historically reused the source index segment in both of the views created by the split. However, I am not sure that this is still true. If not, then the code might be simplified.
Regardless, we should optimize any request where the range count's fromKey/toKey span the partition locator's fromKey/toKey. The optimization is to report the sum of the Checkpoint.entryCount values. Right now, the code will always issue key probes for the fromKey/toKey. For each index segment in the view, this means that we have to load the nodes region and resolve the first/last leaves in the index segment. If we only need to rely on the Checkpoint records, then we do not need to open the IndexSegment, just the IndexSegmentStore. That operation only reads the IndexSegmentCheckpoint record and the IndexMetadata record on the IndexSegment. It is much lighter weight than an open of the IndexSegment (aka the B+Tree view onto the IndexSegmentStore).
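The fast path above amounts to: when the request spans the whole shard, sum the cheap checkpoint entry counts; otherwise fall back to key probes. A minimal sketch, assuming a hypothetical ViewComponent abstraction over the B+Tree / index segment components (these names are illustrative, not the actual FusedView API):

```java
/**
 * Hypothetical sketch of the FusedView rangeCount() fast path. The
 * ViewComponent interface and class names are illustrative only.
 */
class FusedViewRangeCount {

    interface ViewComponent {
        /** Entry count from the Checkpoint record (no B+Tree open required). */
        long checkpointEntryCount();
        /** Key probe against the opened B+Tree (expensive for index segments). */
        long probeRangeCount(byte[] fromKey, byte[] toKey);
    }

    /**
     * If the request's fromKey/toKey span the partition locator's key range,
     * sum the cheap checkpoint entry counts; otherwise issue key probes.
     */
    static long rangeCount(final ViewComponent[] view, final byte[] fromKey,
            final byte[] toKey, final boolean spansShard) {
        long n = 0L;
        for (ViewComponent c : view) {
            n += spansShard ? c.checkpointEntryCount()
                            : c.probeRangeCount(fromKey, toKey);
        }
        return n;
    }

    /** Trivial component used for illustration. */
    static class FixedComponent implements ViewComponent {
        final long count;
        FixedComponent(final long count) { this.count = count; }
        public long checkpointEntryCount() { return count; }
        public long probeRangeCount(byte[] f, byte[] t) { return count / 2; }
    }
}
```

Note that summing entry counts across the components of a fused view can over-report when tuples are overwritten or deleted across components, which is consistent with range counts being fast upper-bound estimates rather than exact counts.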
3. The rangeCount of a shard should really be stored in the Checkpoint record for the B+Tree so we can report it directly from the B+Tree on the ManagedJournal without any recourse to the index segments in the view.
4. Read-historical range counts might be handled using a UDP broadcast/multicast datagram pattern. The ClientIndexView would simply broadcast/multicast a datagram containing the index name (or UUID), the timestamp of the view, and the fromKey/toKey. Each DS node would listen to the appropriate multicast group. For each datagram received, it could consult a hash map serving as a cache for the shards resolvable on that DS node. If the shard were present, it would report the rangeCount of the shard.
There are drawbacks to this approach and I am not sure that it could be made to work. Its advantage is that UDP multicast is inherently parallel. The main drawback seems to be that there is no way to know when all DS nodes have responded. Such a technique might be more useful when a large number of index partitions are spanned by the query.
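For illustration, the datagram payload described above (index UUID, view timestamp, fromKey/toKey) could be encoded with a length-prefixed binary layout. This wire format is entirely hypothetical; the source does not specify one.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

/**
 * Hypothetical wire format for the range-count multicast datagram:
 * indexUUID (16 bytes) | timestamp (8 bytes) |
 * fromKey length + bytes | toKey length + bytes.
 */
class RangeCountDatagram {

    final UUID indexUUID;
    final long timestamp;
    final byte[] fromKey, toKey;

    RangeCountDatagram(final UUID indexUUID, final long timestamp,
            final byte[] fromKey, final byte[] toKey) {
        this.indexUUID = indexUUID;
        this.timestamp = timestamp;
        this.fromKey = fromKey;
        this.toKey = toKey;
    }

    byte[] encode() {
        final ByteBuffer b = ByteBuffer.allocate(
                16 + 8 + 4 + fromKey.length + 4 + toKey.length);
        b.putLong(indexUUID.getMostSignificantBits());
        b.putLong(indexUUID.getLeastSignificantBits());
        b.putLong(timestamp);
        b.putInt(fromKey.length);
        b.put(fromKey);
        b.putInt(toKey.length);
        b.put(toKey);
        return b.array();
    }

    static RangeCountDatagram decode(final byte[] data) {
        final ByteBuffer b = ByteBuffer.wrap(data);
        final UUID u = new UUID(b.getLong(), b.getLong());
        final long ts = b.getLong();
        final byte[] f = new byte[b.getInt()];
        b.get(f);
        final byte[] t = new byte[b.getInt()];
        b.get(t);
        return new RangeCountDatagram(u, ts, f, t);
    }
}
```

The payload would be sent via a MulticastSocket; the completion-detection problem noted above (knowing when all DS nodes have replied) is independent of the encoding.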
5. Another way to optimize the case when a large number of index partitions are spanned by the range count is to vector the range count requests. This would simply group the requests by the DS node hosting each shard and then issue one request per DS node. This approach provides the minimum number of messages and benefits from reliable delivery.
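The grouping step can be sketched as a simple map inversion: given a shard-to-node assignment, collect the shards for each DS node so that one batched request per node covers all of its shards. The method and parameter names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/**
 * Hypothetical sketch of vectoring range count requests: one request
 * per DS node, each covering all shards hosted on that node.
 */
class VectoredRangeCount {

    /**
     * Invert the shard-to-node assignment, yielding the per-node shard
     * lists that would each become a single batched rangeCount request.
     */
    static Map<String, List<Integer>> groupByDataService(
            final Map<Integer, String> shardToNode) {
        final Map<String, List<Integer>> byNode = new TreeMap<>();
        for (Map.Entry<Integer, String> e : shardToNode.entrySet()) {
            byNode.computeIfAbsent(e.getValue(), k -> new ArrayList<>())
                  .add(e.getKey());
        }
        return byNode;
    }
}
```

With N shards spread over M nodes this issues M messages rather than N, while still using reliable (RMI-style) delivery rather than UDP.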
6. With the introduction of the BLOBS index, all of the RDF indices have fairly uniform key sizes. We could estimate the average range count of a shard for a given index and then multiply the #of shards spanned by the range count operation by that average. With this approach we have imperfect information, but we only need a few data points to answer any range count which spans more than 2~5 shards.
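The estimate above is simple arithmetic: sample a few shards' actual range counts, take the average, and multiply by the number of shards spanned. A minimal sketch (the class and method names are illustrative):

```java
/**
 * Hypothetical sketch of the sampling-based range count estimate:
 * estimate = (average of a few sampled shard counts) * (#of shards spanned).
 */
class RangeCountEstimator {

    static long estimate(final long[] sampledShardCounts,
            final int shardsSpanned) {
        long sum = 0L;
        for (long c : sampledShardCounts) {
            sum += c;
        }
        final double avg = (double) sum / sampledShardCounts.length;
        return Math.round(avg * shardsSpanned);
    }
}
```

For example, samples of 100, 120, and 80 tuples per shard give an average of 100, so a range count spanning 10 shards would be estimated at 1000. Since the query optimizer only needs relative selectivity, this imprecision is acceptable once enough shards are spanned.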