Details

      Description

      The LRUNexus implementation should use a non-blocking design. This is especially true for joins and for scale-out, both of which have potentially higher concurrency than scale-up database writes.

      The underlying problem is that the key is a composite of the store UUID and long address of the record in that store. This leads naturally to a nested hash map design, which can also support isolation of write sets until a checkpoint or commit. However, in order for indices to compete for RAM, the LRU (or ideally LIRS) ordering must be shared across all stores, hence "GlobalLRU".

      Since evictions are driven by inserts, and an eviction can target a record held by the cache for any store, an insert into any per-store cache can cause a record held by any other cache to be evicted. What is needed is a combination of a ConcurrentHashMap for each per-store cache with a non-blocking double-linked list for the global LRU.
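
      As a rough illustration of the layout this implies (hypothetical names, not the actual LRUNexus classes), the nested design amounts to an outer ConcurrentHashMap keyed by store UUID whose values are per-store maps keyed by record address, with every node participating in a single shared access ordering:

      import java.util.UUID;
      import java.util.concurrent.ConcurrentHashMap;

      // Hypothetical skeleton of the nested design: one ConcurrentHashMap per
      // store (keyed by record address) and a single access ordering shared by
      // all stores ("GlobalLRU"). Names are illustrative only.
      final class GlobalLRUSketch<V> {

          /** One node per cached record; participates in the shared access order. */
          static final class Node<V> {
              final UUID storeUUID;   // which store the record belongs to
              final long addr;        // address of the record within that store
              final V value;          // the cached record
              Node<V> prior, next;    // links for the shared (global) ordering

              Node(final UUID storeUUID, final long addr, final V value) {
                  this.storeUUID = storeUUID;
                  this.addr = addr;
                  this.value = value;
              }
          }

          /** Outer map: store UUID -> per-store cache (record address -> node). */
          private final ConcurrentHashMap<UUID, ConcurrentHashMap<Long, Node<V>>> storeCaches =
                  new ConcurrentHashMap<>();

          /** Obtain (creating if necessary) the cache for a given store. */
          ConcurrentHashMap<Long, Node<V>> cacheFor(final UUID storeUUID) {
              return storeCaches.computeIfAbsent(storeUUID, u -> new ConcurrentHashMap<>());
          }
      }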

      Unfortunately, the non-blocking Java collection that would be the natural candidate for the double-linked list (ConcurrentLinkedQueue) does not provide the hooks we would need in order to implement the semantics of: (1) "touch" on the LRU ordering; or (2) "remove" of an element from the LRU ordering other than the element in the LRU position.

      The "touch" semantics are mandatory. We must have access to the prior/next references so that we can reorder the elements in the list. This drives us to our own implementation.

      The "remove" semantics are not required at this time, but they are a placeholder for a read/write store where the address of a record to be recycled must be cleared from the cache before it can be used to store different data.

        Activity

        bryanthompson added a comment -

        We have experimented with a non-blocking LRU cache from infinispan, but the cache currently has (severe) problems with contention when used as an LRU.

        See http://jira.jboss.org/jira/browse/ISPN-277

        bryanthompson added a comment -

        This is my comment on the infinispan-dev thread concerning the BP-Wrapper algorithm to reduce lock contention for caches.

        Reading [1], it appears that my suggestion of segments is called "distributed locks" in this paper, and the authors suggest:

        > For example, the algorithms that need to detect sequence of accesses cannot retain their performance advantages when pages in the same sequence have been distributed into multiple partitions and cannot be identified as a sequence.

        Overall, I like the approach outlined in the paper. It takes the concept of batching further than what we did, but in much the same spirit. In our implementation, we only batched evictions. They are also batching the updates to the replacement policy ordering which is still a bottleneck for us.

        Concerning the FIFO queue, the paper (page 372) considers both per-thread queues and global queues. They dismiss global queues for batching updates primarily because certain cache replacement policies need to detect sequences of operations. Can such sequences of operations be meaningfully detected within infinispan? Is there any thread-affinity for a transaction? If not, then segmented queues might be an alternative for infinispan; see the sketch below.
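
        For illustration only, a segmented queue could look something like the sketch below: a fixed number of FIFO buffers selected by hashing the accessing thread, so that the #of queues is controlled separately from the #of threads. The class and method names are assumptions, not Infinispan APIs:

        import java.util.concurrent.ArrayBlockingQueue;

        // Hedged sketch of the "segmented queues" idea: a fixed number of batching
        // buffers chosen by hashing the accessing thread, decoupling the #of queues
        // from the #of threads. Purely illustrative; not Infinispan or bigdata code.
        final class SegmentedTouchQueues<E> {

            private final ArrayBlockingQueue<E>[] segments;

            @SuppressWarnings("unchecked")
            SegmentedTouchQueues(final int nsegments, final int capacityPerSegment) {
                segments = new ArrayBlockingQueue[nsegments];
                for (int i = 0; i < nsegments; i++) {
                    segments[i] = new ArrayBlockingQueue<>(capacityPerSegment);
                }
            }

            /** Pick a segment based on the current thread, spreading contention. */
            private ArrayBlockingQueue<E> segmentForCurrentThread() {
                final int i = (int) (Thread.currentThread().getId() % segments.length);
                return segments[i];
            }

            /**
             * Record a touch. Returns true if this segment is now full and the caller
             * should batch its contents through to the replacement policy.
             */
            boolean addTouch(final E entry) {
                final ArrayBlockingQueue<E> q = segmentForCurrentThread();
                q.offer(entry);                      // if the segment is full the touch is dropped
                return q.remainingCapacity() == 0;
            }
        }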

        Also, for Java, if the hash map uses weak values, then any entries on the FIFO queues will remain pinned in the hash map even if they would otherwise have been evicted by the cache policy, which is a nice touch. That way you can continue to access an object which is still reachable in the JVM even if the cache replacement policy would have discarded the cache entry.
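
        A minimal sketch of the weak-values idea (hypothetical names, not an existing API): the map holds WeakReferences to the values, and a ReferenceQueue is used to expunge entries once the JVM has actually cleared them:

        import java.lang.ref.ReferenceQueue;
        import java.lang.ref.WeakReference;
        import java.util.concurrent.ConcurrentHashMap;

        // Minimal illustration of a weak-valued cache map. An entry survives as
        // long as the JVM still holds the value strongly (e.g. via a FIFO queue),
        // even if the replacement policy would already have evicted it.
        final class WeakValueCache<K, V> {

            private static final class Ref<K, V> extends WeakReference<V> {
                final K key; // remembered so the entry can be removed once cleared
                Ref(final K key, final V value, final ReferenceQueue<V> q) {
                    super(value, q);
                    this.key = key;
                }
            }

            private final ConcurrentHashMap<K, Ref<K, V>> map = new ConcurrentHashMap<>();
            private final ReferenceQueue<V> clearedRefs = new ReferenceQueue<>();

            V get(final K key) {
                final Ref<K, V> ref = map.get(key);
                return ref == null ? null : ref.get(); // null if the GC cleared it
            }

            void put(final K key, final V value) {
                map.put(key, new Ref<>(key, value, clearedRefs));
            }

            /** Drop map entries whose values have been cleared by the GC. */
            @SuppressWarnings("unchecked")
            void expungeStaleEntries() {
                Ref<K, V> ref;
                while ((ref = (Ref<K, V>) clearedRefs.poll()) != null) {
                    map.remove(ref.key, ref);
                }
            }
        }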

        The experiment in the paper compared several cache replacement policies, but the data were all hot in cache and the cache was large enough to avoid all cache misses, so I am not clear why they bothered to compare different cache replacement policies -- none of them did any evictions in these tests. While "prefetching" did reduce the lock contention slightly, it had no meaningful effect on either throughput or response time. The authors posit that it might have more effect on machines with many more cores, depending on the CPU architecture. It seems easy enough to ignore the prefetch dimension for now and just concentrate on batching.

        So, I think that the batching approach per this paper makes sense and could be combined with LRU or LIRS easily, since the cache replacement policy is encapsulated (which might not be true with prefetch); overall it seems like a good idea and an improvement on what we were already doing. For our application, I think that we would use a hash map with weak values (per above) to gain the benefit of the actual retention time of the object by the JVM. I would also like to explore the use of segmented queues as well as per-thread queues, to control the #of queues in the system separately from the #of threads. By batching updates to the access order, you are also effectively batching evictions, since they are driven by updates. When combined with a weak-value cache, the thread running the updates should also clear entries from the cache on the weak reference queue, thereby batching the clearing of stale cache entries (those with cleared references) at the same time it is batching updates to the access order and inserts to the cache.
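
        A hedged sketch of that combined batch step, reusing the hypothetical AccessOrder and WeakValueCache sketches above: one thread takes the single lock, applies the buffered touches to the access order, and expunges cleared weak references in the same pass:

        import java.util.List;
        import java.util.concurrent.locks.ReentrantLock;

        // Hedged sketch of the combined batch step. One thread takes the single
        // lock and, in one pass, applies buffered touches to the access order and
        // expunges entries whose weak values were cleared; evictions driven by the
        // updates would be processed here as well. Names are illustrative only.
        final class BatchProcessor<K, V> {

            private final ReentrantLock lock = new ReentrantLock();
            private final AccessOrder<V> accessOrder;   // replacement policy ordering
            private final WeakValueCache<K, V> cache;   // weak-valued hash map

            BatchProcessor(final AccessOrder<V> accessOrder, final WeakValueCache<K, V> cache) {
                this.accessOrder = accessOrder;
                this.cache = cache;
            }

            /** Called by whichever thread's touch buffer filled up first. */
            void batch(final List<AccessOrder.DLN<V>> bufferedTouches) {
                lock.lock();
                try {
                    for (AccessOrder.DLN<V> node : bufferedTouches) {
                        accessOrder.touch(node);        // batched access-order updates
                    }
                    cache.expungeStaleEntries();        // batched clearing of stale entries
                    // evictions driven by these updates would also be processed here
                } finally {
                    lock.unlock();
                }
            }
        }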

        [1] http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-09-1.pdf

        bryanthompson added a comment -

        Also, see the LIRS [1] cache replacement policy, which appears to be a good choice for reducing the cache eviction effects of key range scans.

        [1] http://www.cl.cam.ac.uk/research/srg/netos/papers/2001-caslists.pdf

        bryanthompson added a comment -

        We have made a lot of progress on this front working with the infinispan people. There is a non-blocking cache implementation using the infinispan buffered concurrent hash map with LRU and LIRS access policies. Unfortunately, this implementation does not let us manage the memory burden of the cache directly, but design notes for a non-blocking concurrent hash map based design are inline below. This design is somewhat simpler and less fraught with peril (it does not subclass ConcurrentHashMap) and would let us manage the memory burden explicitly. A rough Java sketch of the design follows the notes.

        /**
         * A mostly non-blocking cache based on a {@link ConcurrentHashMap} and batched
         * updates to its access policy. This approach encapsulates:
         * <ul>
         * <li>(a) an unmodified ConcurrentHashMap (CHM); combined with</li>
         * <li>(b) non-thread-safe thread-local buffers (TLB) for touches, managed by an
         * inner CHM<ThreadId,TLB> instance. The reason for this inner map is to allow
         * the TLB instances to be discarded by clear(). The TLBs get batched onto (c);</li>
         * <li>(c) a shared non-thread-safe access policy (LIRS, LRU) built on double-linked
         * nodes (DLN) stored in the inner CHM. Updates are deferred until holding the
         * lock (d). The DLN reference to the cached value is final. The (prior, next,
         * delete) fields are only read or written while holding the lock (d). Other
         * fields could be defined by subclassing a newDLN() method to support LIRS,
         * etc. The access policy will need [tail|head,] or similar fields, which would
         * also be guarded by the lock (d);</li>
         * <li>(d) a single lock guarding mutation on the access policy. Since there is only
         * one lock, there can be no lock ordering problems. Both batching touches onto
         * (c) and eviction (per the access policy) require access to the lock, but that
         * is the only lock. If the access policy batches evictions, then lock requests
         * will be rare and the whole cache will be non-blocking, wait-free, and not
         * spinning on CAS locks 99% of the time; and</li>
         * <li>(e) explicit management of the threads used to access the cache, e.g., by
         * queuing accepted requests and servicing them out of a thread pool, which has
         * the benefit of managing the workload imposed by the clients.</li>
         * </ul>
         * <p>
         * This should have the best possible performance and the simplest
         * implementation. (b) The TLB could be a DLN[] or another simple data structure.
         * The access policy (c) is composed by linking DLN instances together while
         * holding the lock.
         * <ul>
         * <li>A get() on the outer class looks up the DLN on the inner CHM and places it
         * into the TLB (if found).</li>
         * <li>A put() or putIfAbsent() on the outer class creates a new DLN and either
         * unconditionally or conditionally puts it into the inner CHM. The new DLN is
         * added to the TLB IFF it was added to the inner CHM. The access order is NOT
         * updated at this time.</li>
         * <li>A remove() on the outer class acquires the lock (d), looks up the DLN in the
         * cache, and synchronously unlinks the DLN if found and sets its [deleted]
         * flag. I would recommend that clients do not call remove() directly, or
         * that an outer remove() method exists which only removes the DLN from the
         * inner CHM and queues up remove requests to be processed the next time any
         * thread batches its touches through the lock. The inner remove() method would
         * synchronously update the DLNs.</li>
         * <li>A clear() clears the ConcurrentHashMap<Key,DLN<Val>> map. It would also clear
         * the inner ConcurrentHashMap<ThreadId,TLB> map, which would cause the existing
         * TLB instances to be discarded. It would have to obtain the lock in order to
         * clear the [head,tail] or related fields for the access policy.</li>
         * </ul>
         * When batching touches through the lock, only the access order is updated by
         * the appropriate updates of the DLN nodes. If the [deleted] flag is set, then
         * the DLN has been removed from the cache and its access order is NOT updated.
         * If the cache is over its defined maximums, then evictions are batched while
         * holding the lock. Evictions are only processed when batching touches through
         * the lock.
         */
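
        For concreteness, here is a rough, simplified Java sketch of the design above. It uses a ThreadLocal buffer where the notes call for an inner CHM<ThreadId,TLB>, and it omits LIRS, remove(), and clear(); all names are hypothetical rather than the actual implementation:

        import java.util.ArrayList;
        import java.util.concurrent.ConcurrentHashMap;
        import java.util.concurrent.locks.ReentrantLock;

        // Simplified sketch: an unmodified CHM of Key -> DLN, per-thread touch
        // buffers, and one lock guarding the access order and evictions.
        final class BatchedLRUCacheSketch<K, V> {

            /** Double-linked node (DLN); prior/next/deleted are guarded by the single lock. */
            static final class DLN<K, V> {
                final K key;
                final V value;            // final reference to the cached value
                DLN<K, V> prior, next;    // access-order links, guarded by `lock`
                boolean deleted;          // set when the node is evicted, guarded by `lock`
                DLN(final K key, final V value) { this.key = key; this.value = value; }
            }

            private static final int TLB_SIZE = 64;                   // touches buffered per thread

            private final int capacity;
            private final ConcurrentHashMap<K, DLN<K, V>> map = new ConcurrentHashMap<>();
            private final ThreadLocal<ArrayList<DLN<K, V>>> tlb =
                    ThreadLocal.withInitial(ArrayList::new);          // stand-in for CHM<ThreadId,TLB>
            private final ReentrantLock lock = new ReentrantLock();   // the single lock (d)
            private DLN<K, V> head, tail;                             // access order, guarded by lock
            private int linked;                                       // #of nodes in the order, guarded by lock

            BatchedLRUCacheSketch(final int capacity) { this.capacity = capacity; }

            public V get(final K key) {
                final DLN<K, V> node = map.get(key);
                if (node == null) return null;
                bufferTouch(node);
                return node.value;
            }

            public V putIfAbsent(final K key, final V value) {
                final DLN<K, V> newNode = new DLN<>(key, value);
                final DLN<K, V> existing = map.putIfAbsent(key, newNode);
                if (existing != null) { bufferTouch(existing); return existing.value; }
                bufferTouch(newNode);          // the access order is only updated when batched
                return null;
            }

            /** Touches are buffered locally; the lock is only taken when a buffer fills. */
            private void bufferTouch(final DLN<K, V> node) {
                final ArrayList<DLN<K, V>> buf = tlb.get();
                buf.add(node);
                if (buf.size() >= TLB_SIZE) {
                    batch(buf);
                    buf.clear();
                }
            }

            /** Batch the buffered touches (and any resulting evictions) through the lock. */
            private void batch(final ArrayList<DLN<K, V>> buf) {
                lock.lock();
                try {
                    for (DLN<K, V> n : buf) {
                        if (n.deleted) continue;           // removed nodes are not reordered
                        moveToHead(n);
                    }
                    while (linked > capacity) evictTail(); // evictions batched under the same lock
                } finally {
                    lock.unlock();
                }
            }

            private void moveToHead(final DLN<K, V> n) {
                if (n == head) return;
                if (n.prior != null || n.next != null || n == tail) {
                    // already linked somewhere in the order: unlink first
                    if (n.prior != null) n.prior.next = n.next;
                    if (n.next != null) n.next.prior = n.prior;
                    if (n == tail) tail = n.prior;
                } else {
                    linked++;                              // first time this node enters the order
                }
                n.prior = null;
                n.next = head;
                if (head != null) head.prior = n;
                head = n;
                if (tail == null) tail = n;
            }

            private void evictTail() {
                final DLN<K, V> n = tail;
                if (n == null) return;
                tail = n.prior;
                if (tail != null) tail.next = null; else head = null;
                n.prior = n.next = null;
                n.deleted = true;
                map.remove(n.key, n);                      // drop the evicted entry from the CHM
                linked--;
            }
        }
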
        bryanthompson added a comment -

        Resolution of this issue should also address isolation for the global LRU. In particular, the write set on the journal should not be transferred into the primary cache for the journal instance until the commit point. This will make it possible to discard the write set on the journal during an abort() without forcing us to discard the entire cache for the journal.
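
        As a minimal sketch of that isolation requirement (hypothetical names, not the journal's actual API), writes could be staged in a per-journal write-set map that is only transferred into the primary cache at the commit point and is simply discarded by abort():

        import java.util.Map;
        import java.util.concurrent.ConcurrentHashMap;

        // Minimal sketch: writes go into a per-journal write-set map and are only
        // transferred into the primary (global LRU backed) cache at the commit
        // point; abort() just drops the write set without touching the primary cache.
        final class IsolatedWriteSet<K, V> {

            private final ConcurrentHashMap<K, V> primaryCache;       // shared, LRU-managed
            private final ConcurrentHashMap<K, V> writeSet = new ConcurrentHashMap<>();

            IsolatedWriteSet(final ConcurrentHashMap<K, V> primaryCache) {
                this.primaryCache = primaryCache;
            }

            /** Uncommitted writes are only visible through the write set. */
            V read(final K key) {
                final V v = writeSet.get(key);
                return v != null ? v : primaryCache.get(key);
            }

            void write(final K key, final V value) {
                writeSet.put(key, value);          // not yet visible to the global cache
            }

            /** At the commit point the write set is transferred into the primary cache. */
            void commit() {
                for (Map.Entry<K, V> e : writeSet.entrySet()) {
                    primaryCache.put(e.getKey(), e.getValue());
                }
                writeSet.clear();
            }

            /** Abort discards the write set without invalidating the primary cache. */
            void abort() {
                writeSet.clear();
            }
        }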

        bryanthompson added a comment -

        I have implemented BCHMGlobalLRU2, which has very good performance in both synthetic microbenchmarks and SPARQL performance tests. As a configuration option, the cache can use either striped locks or per-thread buffers.

        The LIRS access policy has been sketched out, but is not finished.

        There is currently a memory leak in the LRU access policy which needs to be tracked down.

        bryanthompson added a comment -

        The "memory" leak is in fact an interaction with the garbage collector. The non-blocking concurrent LRU runs just fine with the parallel old generation garbage collector.

        However, it appears that the LRUNexus does not actually improve performance now that we are using an extension lock in the DiskRW and DiskWORM journal persistence store modes to allow concurrent read/write IOs. Performance is just as good when we allow the IO requests to pass through to the OS file system cache and disk cache layers. (That is, having NO application-level disk cache performs as well as a high-concurrency non-blocking cache.)

        The non-blocking LIRS cache is still not finished, but work on this is being deferred for now. The main point of LIRS is to correct for some edge conditions in the LRU cache design, but due to the interactions with the garbage collectors it seems better to leave the LRUNexus disabled for now.


          People

          • Assignee: bryanthompson
          • Reporter: bryanthompson
