Make the DataService highly available using HDFS. There are really only a few things that we need to handle:
1. Where to store the root blocks. Currently the root blocks are written on the head of the file, but that does not work for HDFS. Choices include:
- Zookeeper (they are small and 1:1 with the journal files for each DS node).
- HDFS in a file with the same basename as the journal but using a .rbs extension.
- An object/blob store such as S3.
@martyncutcher suggests writing the root blocks on the end of the HDFS file. This would work as long as the file append and file length information are atomically updated, which seems likely.
2. The file extension policy needs to be reviewed. We currently pre-allocate 200MB and then incrementally extend according to a hard-coded policy based on the initial (200MB) and maximum extent (also 200MB
- this is the target size for the live journal before it is closed out and a new one opened). However, even assuming that HDFS allows us to pre-extend the file, this rapid growth could leave holes on which we can not write. We need to take a look at the HDFS file append mechanisms and find a way to hide the fact of file extension from the WORMStrategy and the AbstractJournal.
3. The design of Java IO is such that an interrupt during an IO transparently closes the backing FileChannel. This causes readers to observe an AsynchronousFileCloseException. The code base uses a re-opener pattern to transparently re-open such FileChannels in reader threads. This has very little overhead on local file systems. If HDFS also adheres to this transparent close of the file access layer when there is an interrupt during an IO then we need to make sure that the overhead associated with this re-opener pattern is also quite small. If it is not, then this will have an extreme adverse impact on query. Note that the interrupts can occur in both reader and writer threads, but we only use the re-opener pattern in reader threads.
4. We need to verify that HDFS permits a reader to access written data without a sync/close. If not, then we need to fully buffer the writes since the last commit point in memory. One possibility would be a large write cache such that the writes were fully buffered up to the commit point (WORMStrategy does support the write cache). We could let the write cache incrementally flush, but we would have to retain the buffers on a list for read back until the commit. Or we could disallow incremental flush, require that the entire write set up to the group commit (plus any concurrent unisolated writes that would get melded into the next commit point) are buffered in order to ensure read back without IO (and without the possibility of interrupted IO). The commit protocol (with the exclusive lock) would then evict the write cache to the HDFS file (and optionally sync the file) and then append the root block to the end of the file and definitely sync the file. At this point the data would be restart safe.
5. Allocations can not be discarded, even on abort, once there has been a write on the HDFS file. Basically we need to only advance the allocator pointer. If we assume that the root block is at the end of the file, but there have been flushed writes without a new root block because of an abort() then we can?t find the old root block. Therefore the behavior really needs to be a single atomic write. And we need to do it whether we commit or abort. Abort would just write the old root block again.
Whatever approach we use needs to correctly align the write addresses in the buffer and on the disk.