Blazegraph (by SYSTAP) / BLZG-378

Records must store the as-written length for HA failover reads to be successful.

    Details

    • Type: Bug
    • Status: Open
    • Resolution: Unresolved
    • Affects Version/s: JOURNAL_HA_BRANCH
    • Fix Version/s: None
    • Component/s: RWStore
    • Labels:
      None

      Description

      HA failover reads depend on an unambiguous signal indicating that the data read from the media has a checksum error. In that case, the read operation is attempted against a peer in the same quorum. However, we need both the length of the data (as written into the record) and the checksum over those length bytes in the record in order to differentiate between a media problem (bit rot, a bad write, etc.) and a problem where the storage address has since been recycled and a record with a different length is now stored at that address.

      The problem might be restricted to the RWStore, since that is the only architecture where the contents on the disk for a given storage address may be overwritten by a subsequent write (when the record at that address has been deleted and the address recycled, at which point it holds different data). In these situations, the checksum for the new data at the given storage address is typically correct, so a failover read would serve no purpose. It is therefore important to distinguish between the recycling of the storage address and an internal inconsistency within the data stored at that address. In order to distinguish between these two cases we need to store the length of the data as part of the raw record format.
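
      To make the distinction concrete, here is a minimal sketch of a record layout that carries the as-written length plus a checksum over the length and payload. It is illustrative only; RecordCodec and its method names are hypothetical, not the RWStore's actual on-disk format.

          import java.nio.ByteBuffer;
          import java.util.zip.CRC32;

          // Hypothetical layout: [int length][payload][int checksum over length + payload].
          final class RecordCodec {

              static byte[] writeRecord(byte[] payload) {
                  ByteBuffer buf = ByteBuffer.allocate(4 + payload.length + 4);
                  buf.putInt(payload.length);           // the as-written length
                  buf.put(payload);
                  CRC32 crc = new CRC32();
                  crc.update(buf.array(), 0, 4 + payload.length);
                  buf.putInt((int) crc.getValue());     // checksum covers the length bytes too
                  return buf.array();
              }

              enum ReadStatus { OK, ADDRESS_RECYCLED, MEDIA_ERROR }

              static ReadStatus readRecord(byte[] raw, int expectedLength) {
                  ByteBuffer buf = ByteBuffer.wrap(raw);
                  int storedLength = buf.getInt();
                  if (storedLength < 0 || 4 + storedLength + 4 > raw.length) {
                      return ReadStatus.MEDIA_ERROR;    // the length field itself is implausible
                  }
                  CRC32 crc = new CRC32();
                  crc.update(raw, 0, 4 + storedLength);
                  if ((int) crc.getValue() != buf.getInt(4 + storedLength)) {
                      return ReadStatus.MEDIA_ERROR;    // internally inconsistent: bit rot or a bad write; try a quorum peer
                  }
                  if (storedLength != expectedLength) {
                      return ReadStatus.ADDRESS_RECYCLED; // a valid record of a different length now lives here; failover is pointless
                  }
                  return ReadStatus.OK;
              }
          }

      With this framing, a checksum failure signals a media problem worth a failover read, while a valid record of an unexpected length signals that the address was recycled.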

      This is related to designs for record-level compression and encryption, which require a means to indicate the run length of an uncompressed header (for the nextAddr and priorAddr fields of leaves), the compression type, and the run length of the compressed data. See https://sourceforge.net/apps/trac/bigdata/ticket/43 (I have some notes on schemes to represent this information which need to be attached to that ticket).
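
      One possible framing for such a record, purely illustrative (the field widths and the CompressedRecordHeader name are assumptions, not the scheme from the linked ticket):

          import java.nio.ByteBuffer;

          // Hypothetical framing:
          // [byte compressionType][short uncompressedHeaderLength][int compressedDataLength][header][compressed payload]
          final class CompressedRecordHeader {
              static ByteBuffer encode(byte compressionType, byte[] uncompressedHeader, byte[] compressedData) {
                  ByteBuffer buf = ByteBuffer.allocate(1 + 2 + 4 + uncompressedHeader.length + compressedData.length);
                  buf.put(compressionType);                        // e.g. 0 = none, 1 = deflate
                  buf.putShort((short) uncompressedHeader.length); // run length of the uncompressed header (nextAddr/priorAddr)
                  buf.putInt(compressedData.length);               // run length of the compressed data
                  buf.put(uncompressedHeader);
                  buf.put(compressedData);
                  buf.flip();
                  return buf;
              }
          }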

        Activity

        bryanthompson added a comment -

        When addressing this, also update MemStore#size() and RWStrategy#size(), both of which currently report the slot bytes rather than the user bytes because the memory manager and RWStore do not accurately track user bytes.
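
        A minimal sketch of the accounting this implies, assuming hypothetical counters rather than the actual MemStore/RWStore fields:

            import java.util.concurrent.atomic.AtomicLong;

            // Hypothetical accounting: track user bytes separately from allocated slot
            // bytes, so size() reports what callers wrote, not what the allocator reserved.
            final class ByteAccounting {
                private final AtomicLong slotBytes = new AtomicLong(); // reserved by the allocator
                private final AtomicLong userBytes = new AtomicLong(); // actually written by callers

                void onAllocate(int userLength, int slotLength) {
                    userBytes.addAndGet(userLength);
                    slotBytes.addAndGet(slotLength);
                }

                void onDelete(int userLength, int slotLength) {
                    userBytes.addAndGet(-userLength);
                    slotBytes.addAndGet(-slotLength);
                }

                long size() {
                    return userBytes.get(); // what MemStore#size()/RWStrategy#size() should report
                }
            }

        Note that onDelete() needs the user length of the record being freed, which is exactly the as-written length this ticket proposes to store.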

        bryanthompson added a comment -

        This ticket has not yet been addressed. Concerning this, Martyn wrote:

        ...the original store prefixed data with its length to support streams, but we removed
        this and now rely on the external long address encoding the data length, which we use
        to determine checksum boundaries.
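
        For context, a sketch of the kind of external address encoding described here; the 32/32-bit split is an assumption, not necessarily the store's actual layout:

            // Hypothetical packing: high 32 bits = file offset, low 32 bits = data length.
            final class Addr {
                static long toAddr(long offset, int nbytes) {
                    return (offset << 32) | (nbytes & 0xFFFFFFFFL);
                }
                static long getOffset(long addr) {
                    return addr >>> 32;
                }
                static int getByteCount(long addr) {
                    return (int) (addr & 0xFFFFFFFFL);
                }
            }

        Because the length lives only in the address, nothing on disk lets a reader detect that the slot now holds a record of a different length.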
        
        We've gone around this issue a few times with respect to encryption/compression
        as well as HA, but have not quite got to prioritizing the mods required.
        
        Aside from HA, I think there remains some tension around buffer ownership (scalable
        protocols are somewhat at odds with optimal buffer usage).
        
        My medium/long term preference would be to return to a general stream protocol.  If 
        the RWStore used NIO buffers for its streamed data it could still pretty efficiently transfer
        to client buffers.
        

        These changes would break binary compatibility.

        Currently the stores support:
        
        - IPSOutputStream getOutputStream();
        - InputStream getInputStream(long addr);
        
        The stream format was modified to be compatible with the blob representation.
        
        IPSOutputStream.save() returns the long address that can be used to retrieve the InputStream.
        
        The store currently does not prefix the data size, so the long address is always required
        to read the stream. This is true even for blobs.
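
        A usage sketch of this protocol; the IStreamStore parameter and the roundTrip helper are assumptions based on the methods named above:

            import java.io.IOException;
            import java.io.InputStream;

            final class StreamUsage {
                static byte[] roundTrip(IStreamStore store, byte[] payload) throws IOException {
                    IPSOutputStream out = store.getOutputStream();
                    out.write(payload);
                    long addr = out.save();       // the long address is the only handle to the data
                    try (InputStream in = store.getInputStream(addr)) {
                        return in.readAllBytes(); // must read to EOF; no size prefix is stored
                    }
                }
            }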
        
        If we did prefix the data size, then a 32-bit address would be sufficient to retrieve a
        stream, and since the PSInputStream would know the size of the buffer, we could provide
        more support for direct buffer transfers on input.
        
        We could use a pooled PSOutput/PSInputStream to reduce object churn - which would be
        required if we were to use NIO backing buffers - although finalizers could provide error
        recovery (ensure release of the buffers to the buffer pool).
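
        A sketch of what such a pool might look like (StreamBufferPool is a hypothetical name; pooled streams would acquire a buffer on open and release it on close, with a finalizer, or a java.lang.ref.Cleaner in current Java, as the error-recovery backstop):

            import java.nio.ByteBuffer;
            import java.util.concurrent.ArrayBlockingQueue;
            import java.util.concurrent.BlockingQueue;

            // Hypothetical NIO buffer pool backing pooled PSOutput/PSInputStream instances.
            final class StreamBufferPool {
                private final BlockingQueue<ByteBuffer> pool;

                StreamBufferPool(int nBuffers, int capacity) {
                    pool = new ArrayBlockingQueue<>(nBuffers);
                    for (int i = 0; i < nBuffers; i++) {
                        pool.add(ByteBuffer.allocateDirect(capacity));
                    }
                }

                ByteBuffer acquire() throws InterruptedException {
                    return pool.take();  // block rather than allocate new direct buffers
                }

                void release(ByteBuffer buf) {
                    buf.clear();
                    pool.offer(buf);     // return the buffer for reuse, bounding object churn
                }
            }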
        

          People

          • Assignee:
            martyncutcher
            Reporter:
            bryanthompson
          • Votes:
            0
            Watchers:
            2

            Dates

            • Created:
              Updated: