The purpose of this ticket is to improve our understanding of the small slot optimization and its interaction with group commit, and to make the small slot optimization policy less susceptible to misconfiguration.
Martyn is going to:
- Run a 2x2 experimental design crossing the small slot and group commit options.
- Extract the dumpJournal allocators and page histogram for each configuration.
- Validate the reported allocations in use against the allocated bits in the allocators.
- Verify that we have a good explanation of the allocation statistics that we are observing.
- Consider a separate allocation policy for blob headers (essentially bypassing the small slot optimization for blob headers). However, the maximum waste policy might be sufficient without requiring this tighter coupling.
- Consider a maximum waste policy. The current small slot policy does not return an allocator to the free list unless it is sufficiently sparse, and if no allocator for a given slot size is on the free list, a new one is allocated. The modified policy would additionally track the amount of "waste" (unused storage) across allocators of that slot size. If the waste exceeds a threshold (expressed as a percentage of all space allocated for that slot size), the policy would scan the allocators for that slot size that are not on the free list and put the one with the most free space back onto the free list. Further, if there is enough waste across the small slot allocators, we could simply fill the next free slot rather than searching for a set of slots with good locality (e.g., a page, 1/2 page, etc.). This trades waste (store size on disk) against locality.
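The maximum waste policy described above could be sketched roughly as follows. This is an illustrative sketch only: the class and method names (Allocator, selectForRecycling, etc.) are hypothetical and do not correspond to the actual Blazegraph/RWStore classes, and the threshold value is an assumed parameter.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/**
 * Hypothetical sketch of the proposed "maximum waste" policy for one
 * small slot size. Names are illustrative, not actual Blazegraph APIs.
 */
public class WastePolicySketch {

    /** Minimal stand-in for a fixed-slot allocator of a single slot size. */
    static class Allocator {
        final int totalSlots;
        int usedSlots;
        boolean onFreeList;

        Allocator(int totalSlots, int usedSlots) {
            this.totalSlots = totalSlots;
            this.usedSlots = usedSlots;
        }

        int freeSlots() { return totalSlots - usedSlots; }
    }

    /**
     * If the aggregate waste (unused slots) for this slot size exceeds the
     * threshold fraction of all space allocated for that size, return the
     * off-free-list allocator with the most free space so the caller can
     * recycle it instead of allocating a new allocator. Returns null when
     * waste is acceptable (a new allocator may then be created).
     */
    static Allocator selectForRecycling(List<Allocator> allocators,
                                        double wasteThreshold) {
        long total = 0, used = 0;
        for (Allocator a : allocators) {
            total += a.totalSlots;
            used += a.usedSlots;
        }
        if (total == 0) return null;
        final double wasteFraction = (double) (total - used) / total;
        if (wasteFraction <= wasteThreshold) return null;
        // Scan allocators not on the free list; pick the sparsest one.
        return allocators.stream()
                .filter(a -> !a.onFreeList)
                .max(Comparator.comparingInt(Allocator::freeSlots))
                .orElse(null);
    }

    public static void main(String[] args) {
        List<Allocator> allocs = new ArrayList<>();
        allocs.add(new Allocator(64, 60)); // nearly full
        allocs.add(new Allocator(64, 10)); // sparse: 54 free slots
        Allocator picked = selectForRecycling(allocs, 0.25);
        System.out.println(picked == null
                ? "allocate new"
                : "recycle allocator with " + picked.freeSlots() + " free slots");
        // Prints: recycle allocator with 54 free slots
    }
}
```

Under this sketch, a nearly full allocator is never disturbed; only when aggregate waste for the slot size crosses the threshold is the sparsest in-use allocator returned for reuse, which bounds on-disk growth at some cost in locality.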
See https://docs.google.com/a/systap.com/spreadsheets/d/1AANi3aCQIOcx2nMoerecnKOgl7gtZDNLJ7fBEQfCZ2o/edit?usp=sharing for the data from our discussion of this ticket.
The original ticket description follows.
I recently updated to the current revision (f4c63e5) of Blazegraph from Git and tried to load a dataset into the updated webapp. With Bigdata 1.4.0 this resulted in a journal of ~18 GB. Now the load was cancelled because the disk was full: the journal had grown beyond 50 GB for the same file with the same settings. The only change was that I activated group commit.
The dataset can be downloaded here:
Please find the settings used to load the file below.
Do I have a misconfiguration, or is there a bug eating all my disk space?
curl -H "Accept: text/plain" http://localhost:8080/bigdata/namespace/gnd/properties
#Wed Apr 22 11:35:31 CEST 2015