This is my comment on the infinispan-dev thread concerning the BP-Wrapper algorithm to reduce lock contention for caches.
Reading the paper, it appears that my suggestion of segments is what the paper calls "distributed locks", and the authors argue against that approach:
> For example, the algorithms that need to detect sequence of accesses cannot retain their performance advantages when pages in the same sequence have been distributed into multiple partitions and cannot be identified as a sequence.
Overall, I like the approach outlined in the paper. It takes the concept of batching further than what we did, but in much the same spirit. In our implementation, we only batched evictions. They are also batching the updates to the replacement policy ordering, which is still a bottleneck for us.
Concerning the FIFO queue, the paper (page 372) considers both per-thread and global queues. They dismiss global queues for batching updates primarily because certain cache replacement policies need to detect sequences of operations. Can such sequences of operations be meaningfully detected within Infinispan? Is there any thread affinity for a transaction? If not, then segmented queues might be an alternative for Infinispan.
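To make the batching idea concrete, here is a minimal sketch (hypothetical class and method names, not Infinispan or BP-Wrapper API) of segmented FIFO buffers: each access record goes into one of N buffers chosen by thread hash, and the replacement-policy lock is only taken when a buffer fills, so the lock is acquired once per batch rather than once per access. The number of segments is independent of the number of threads.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Consumer;

// Hypothetical sketch of segmented access buffers in the spirit of
// BP-Wrapper: batch access records per segment, flush under one lock.
final class SegmentedAccessBuffer<K> {
    private final ArrayBlockingQueue<K>[] buffers;
    private final ReentrantLock policyLock = new ReentrantLock();
    private final Consumer<K> policyUpdate; // e.g. move-to-MRU in LRU/LIRS

    @SuppressWarnings("unchecked")
    SegmentedAccessBuffer(int segments, int batchSize, Consumer<K> policyUpdate) {
        this.buffers = new ArrayBlockingQueue[segments];
        for (int i = 0; i < segments; i++) {
            buffers[i] = new ArrayBlockingQueue<>(batchSize);
        }
        this.policyUpdate = policyUpdate;
    }

    /** Record a cache hit; flush the segment's batch when it is full. */
    void recordAccess(K key) {
        ArrayBlockingQueue<K> buf =
            buffers[(int) (Thread.currentThread().getId() % buffers.length)];
        while (!buf.offer(key)) { // buffer full: flush the batch, then retry
            flush(buf);
        }
    }

    private void flush(ArrayBlockingQueue<K> buf) {
        policyLock.lock(); // one lock acquisition covers the whole batch
        try {
            K key;
            while ((key = buf.poll()) != null) {
                policyUpdate.accept(key);
            }
        } finally {
            policyLock.unlock();
        }
    }
}
```

Note that within one segment the access order is preserved, which is why segmented queues (rather than one global queue) might still let a policy detect sequences, provided accesses that form a sequence land in the same segment.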
Also, for Java, if the hash map uses weak values, then any entries on the FIFO queues will remain pinned in the hash map even if they would otherwise have been evicted by the cache replacement policy, which is a nice touch. That way an object remains accessible through the cache for as long as the JVM holds it, even if the cache replacement policy would have discarded the cache entry.
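A minimal sketch of the weak-value idea (hypothetical names, not a real cache implementation): the map stores `WeakReference<V>` rather than `V`, so an entry survives exactly as long as something in the JVM still strongly references the value -- for example, a key sitting in one of the FIFO buffers above -- and `get()` can still return it.

```java
import java.lang.ref.WeakReference;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical weak-value map: entries stay retrievable while the
// value is strongly reachable anywhere in the JVM, and become
// eligible for removal only after the GC clears the reference.
final class WeakValueMap<K, V> {
    private final ConcurrentHashMap<K, WeakReference<V>> map = new ConcurrentHashMap<>();

    void put(K key, V value) {
        map.put(key, new WeakReference<>(value));
    }

    /** Returns the value if it is still reachable, else null. */
    V get(K key) {
        WeakReference<V> ref = map.get(key);
        return ref == null ? null : ref.get();
    }
}
```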
The experiment in the paper compared several cache replacement policies, but the data were all hot in cache and the cache was large enough to avoid all cache misses, so I am not clear why they bothered to compare different cache replacement policies -- none of them performed any evictions in these tests. While "prefetching" did reduce the lock contention slightly, it had no meaningful effect on either throughput or response time. The authors posit that it might have more effect on machines with many more cores, depending on the CPU architecture. It seems easy enough to ignore the prefetch dimension for now and concentrate on batching.
So, I think that the batching approach in this paper makes sense and could easily be combined with LRU or LIRS, since the cache replacement policy is encapsulated (which might not be true with prefetch). Overall it seems like a good idea and an improvement on what we were already doing. For our application, I think we would use a hash map with weak values (per above) to gain the benefit of the JVM's actual retention of the object. I would also like to explore segmented queues as well as per-thread queues, so that the #of queues in the system can be controlled separately from the #of threads. By batching updates to the access order, you are also effectively batching evictions, since evictions are driven by updates. When combined with a weak-value cache, the thread applying the updates should also clear entries from the cache whose weak references appear on the reference queue, thereby batching the removal of stale cache entries (those with cleared references) at the same time it batches updates to the access order and inserts to the cache.
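The last point -- piggybacking stale-entry cleanup on the batch flush -- can be sketched as follows (hypothetical names; a sketch of the idea, not a complete cache). Each weak reference remembers its key and is registered with a `ReferenceQueue`; the thread that flushes a batch of access-order updates also drains that queue and removes the corresponding map entries, so GC-driven cleanup rides on the same lock acquisition as the batched policy updates.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical weak-value cache that batches removal of cleared
// entries: purgeStaleEntries() is meant to be called by the same
// thread that flushes a batch of access-order updates.
final class WeakValueCache<K, V> {
    /** A weak reference that remembers its key so the map entry can be purged. */
    private static final class ValueRef<K, V> extends WeakReference<V> {
        final K key;
        ValueRef(K key, V value, ReferenceQueue<V> q) {
            super(value, q);
            this.key = key;
        }
    }

    private final Map<K, ValueRef<K, V>> map = new ConcurrentHashMap<>();
    private final ReferenceQueue<V> clearedRefs = new ReferenceQueue<>();

    void put(K key, V value) {
        map.put(key, new ValueRef<>(key, value, clearedRefs));
    }

    V get(K key) {
        ValueRef<K, V> ref = map.get(key);
        return ref == null ? null : ref.get();
    }

    /**
     * Drain the reference queue, removing every entry whose weak value
     * the GC has cleared. Returns the number of entries purged.
     */
    @SuppressWarnings("unchecked")
    int purgeStaleEntries() {
        int purged = 0;
        Reference<? extends V> ref;
        while ((ref = clearedRefs.poll()) != null) {
            ValueRef<K, V> vr = (ValueRef<K, V>) ref;
            map.remove(vr.key, vr); // remove only if still mapped to this ref
            purged++;
        }
        return purged;
    }
}
```

The two-argument `remove(key, value)` guards against removing a newer mapping that was put under the same key after the old value was cleared.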