Details

      Description

      Implement a SERVICE that indexes the change log for a KB and exposes it to high-level queries using SPARQL.

      The service should index triples or quads depending on the database mode. The index key should be:

      [revisionTimestamp, p, o, s, [c,] add|remove]
      

      Where revisionTimestamp is probably (lastCommitTime+1).

      The concept of a revisionTimestamp is already used in the indices to support the MVCC architecture. The proposed revision timestamp is a timestamp known to be strictly greater than the last commit time. (It might make more logical sense to index the commit time associated with the commit point, but that would limit scalability, since the commit time is not available until after the commit, implying that we would have to buffer everything. It would also imply two commit points for each commit, which would limit throughput.) See AbstractBTree#getRevisionTimestamp().
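
      As a rough illustration of the proposed key layout, the sketch below encodes [revisionTimestamp, p, o, s, [c,] add|remove] as a byte[] whose unsigned byte order matches the intended index order. It is not the bigdata IKeyBuilder API: the class name HistoryKeySketch, the ADD/REMOVE byte codes, and the modeling of term identifiers as non-negative longs are all assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical sketch of a history-index key; not the bigdata IKeyBuilder API. */
final class HistoryKeySketch {

    // Assumed action codes; the real encoding may differ.
    static final byte ADD = 0;
    static final byte REMOVE = 1;

    /** Revision timestamp proposed in the ticket: strictly greater than the last commit time. */
    static long revisionTimestamp(long lastCommitTime) {
        return lastCommitTime + 1;
    }

    /**
     * Encodes [revisionTimestamp, p, o, s, [c,] action] as a byte[] key.
     * Pass c == null in triples mode. Term identifiers are modeled as
     * non-negative longs (an assumption) so big-endian bytes sort correctly.
     */
    static byte[] encode(long revisionTimestamp, long p, long o, long s, Long c, byte action) {
        int components = 4 + (c != null ? 1 : 0);
        // ByteBuffer is big-endian by default.
        ByteBuffer buf = ByteBuffer.allocate(components * Long.BYTES + 1);
        buf.putLong(revisionTimestamp).putLong(p).putLong(o).putLong(s);
        if (c != null) {
            buf.putLong(c);
        }
        buf.put(action);
        return buf.array();
    }

    /** Unsigned lexicographic comparison, the order an index over byte[] keys would typically use. */
    static int compare(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }
}
```

      Because the revision timestamp is the leading key component, all entries written under one revision sort together, and entries from an earlier revision always sort before entries from a later one.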

      It should be possible to prune the index at each commit, leaving no more than a configured amount of data in the index.
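
      One way to picture the pruning step is shown below, with a TreeMap standing in for the history index. This is only a sketch under stated assumptions: the map is keyed by revision time alone for brevity, and the names HistoryPruneSketch, prune, and minReleaseAge are hypothetical.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical sketch: prune the head of a history index at commit time. */
final class HistoryPruneSketch {

    /**
     * Removes every entry whose revision time is older than
     * (commitTime - minReleaseAge) and returns the number removed.
     * Passing Long.MAX_VALUE as minReleaseAge models "retain everything".
     */
    static int prune(TreeMap<Long, String> history, long commitTime, long minReleaseAge) {
        long releaseTime = commitTime - minReleaseAge;
        // headMap is a live view; clearing it deletes from the backing map.
        NavigableMap<Long, String> head = history.headMap(releaseTime, false);
        int removed = head.size();
        head.clear();
        return removed;
    }
}
```

      Because the revision time leads the key, the entries to discard form one contiguous range at the head of the index, so pruning does not need to touch the retained entries.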

        Activity

        bryanthompson bryanthompson added a comment -

        I have modified the presumed index order to be POS(C). This will cluster statements for the same predicate within an update in the index. That will make it possible to use an advancer pattern to more efficiently visit only those statements having a specific predicate within a time range.
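
        The advancer idea can be sketched with a sorted set standing in for the index: rather than visiting every key in the time range, the scan seeks directly to the next position where the target predicate could appear. This is an illustrative sketch, not the bigdata advancer implementation; AdvancerSketch, Key, and scan are hypothetical names, and term identifiers are modeled as non-negative longs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableSet;

/** Hypothetical sketch of an advancer-style scan over [revisionTime, p, o, s] keys. */
final class AdvancerSketch {

    /** Simplified history key (no context position, no action flag). */
    static final class Key {
        final long t, p, o, s;
        Key(long t, long p, long o, long s) { this.t = t; this.p = p; this.o = o; this.s = s; }
    }

    /** Revision time first, then POS order. */
    static final Comparator<Key> ORDER = Comparator
            .<Key>comparingLong(k -> k.t)
            .thenComparingLong(k -> k.p)
            .thenComparingLong(k -> k.o)
            .thenComparingLong(k -> k.s);

    /** Visits keys with predicate p and revision time in [fromTime, toTime). The index must be sorted by ORDER. */
    static List<Key> scan(NavigableSet<Key> index, long p, long fromTime, long toTime) {
        List<Key> out = new ArrayList<>();
        Key k = index.ceiling(new Key(fromTime, p, 0, 0));
        while (k != null && k.t < toTime) {
            if (k.p == p) {
                out.add(k);
                k = index.higher(k); // next key in index order
            } else if (k.p < p) {
                // Skip forward to p within this revision.
                k = index.ceiling(new Key(k.t, p, 0, 0));
            } else {
                // p is exhausted in this revision; skip to the next one.
                k = index.ceiling(new Key(k.t + 1, p, 0, 0));
            }
        }
        return out;
    }
}
```

        With POS(C) order inside each revision, the statements for one predicate are contiguous, so each seek skips an entire run of unrelated predicates.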

        bryanthompson bryanthompson added a comment -

        I have implemented a history service. It can be enabled through AbstractTripleStore.Options.HISTORY_SERVICE. There is also an option to prune the head of the history index maintained by that service. It indexes a revision time, the (s,p,o[,c]), and the associated ChangeAction (INSERTED, REMOVED, or UPDATED), and the various metadata bits that are associated with an ISPO in the indices (statement type, and some flags).

        I have not done an integration with SPARQL yet through the SERVICE keyword. I am not convinced yet whether this facility would be used from SPARQL or from code. We should talk about this. You can use code similar to that found in the HistoryServiceFactory class to access the data in the history index from code. Integrating this into SPARQL is more like making it a feature rather than a demonstration concept / prototype.

        The revision time for the entries in the history index is currently lastCommitTime+1 (for an unisolated connection). The issue with revision time is that we have to record things in the history index incrementally, so it cannot be a commit time because we do not have that yet. lastCommitTime+1 will always be strictly greater than the previous commit point. When you scan the history index, you can then use fromKey=firstCommitTime for the first commit time to visit (or null for the head of the index) and toKey=firstCommitTime for the first commit time to exclude (or null for the tail of the index). Those semantics are pretty much what people would expect for the scan. However, the reported revision times are not going to correspond to the commit points. When reporting this data, we could resolve (and cache) the first commit time greater than that commit point, but only if we still have access to that commit point (the resolution would be against the commit record index, and commit points are pruned from that index when they are recycled).
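
        The fromKey/toKey semantics described above amount to a half-open range scan, which can be sketched against a TreeMap keyed by revision time. This is a simplified stand-in for the history index, not the code in HistoryServiceFactory; HistoryScanSketch and scan are hypothetical names.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical sketch of a half-open [fromKey, toKey) scan over revision times. */
final class HistoryScanSketch {

    /**
     * Returns a view of the entries with fromKey <= revisionTime < toKey.
     * A null fromKey starts at the head of the index; a null toKey runs to the tail.
     */
    static NavigableMap<Long, String> scan(TreeMap<Long, String> history, Long fromKey, Long toKey) {
        if (fromKey == null && toKey == null) {
            return history;
        }
        if (fromKey == null) {
            return history.headMap(toKey, false);
        }
        if (toKey == null) {
            return history.tailMap(fromKey, true);
        }
        return history.subMap(fromKey, true, toKey, false);
    }
}
```

        Note that the keys returned are revision times (lastCommitTime+1), not commit times, so a caller that wants commit times would still need the resolution step discussed above.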

        The revision time for a full read/write tx is less well defined. We can use the same value (lastCommitTime+1). However, it is a little more ambiguous in this case because you can have concurrent transactions so there could be multiple transactions that wind up with the same revisionTime in the history index.

        We could also use the actual timestamp when we touched the index for a given ISPO, but there are issues with that as well. For example, on a long running data load we would have a bunch of different revision times for the same commit point and the actual order of those revision times within the index is not material since all changes are associated with the same commit point.


        - Refactored IChangeRecord.ChangeAction into its own file.
        - Added transactionBegin() and transactionPrepare() methods to IChangeLog.
        - Integrated a history index into SPORelation.
        - Added a HistoryServiceFactory that will index the data reported by the IChangeLog.
        - Added option to enable the history service to AbstractTripleStore.
        - Added option to specify the min release time for the history index (defaults to infinite).
        - Added SPOKeyOrder.appendKey(...) to append the key component for an ISPO without invoking reset() on the IKeyBuilder or extracting and returning the byte[] key. This makes it possible to reuse the SPOKeyOrder for the history index.
        - Added hashCode() to ChangeRecord (based on the ISPO hashCode).
        - Wrote unit tests of the history index.
        - TODO This feature is just a prototype right now. We have to work through the use cases (including the SPARQL SERVICE use cases) and read/write tx support before we can support this. Once supported, document this on wiki (HISTORY_SERVICE, HISTORY_SERVICE_MIN_RELEASE_TIME, how to access from code, how to access from query).
        - TODO Unit tests of the history index at the SPARQL layer. This requires a SERVICE translation. The <<>> syntax might be a nice way to express access to the statements in the history index.

        See http://sourceforge.net/apps/trac/bigdata/ticket/607 (History Service)

        Committed revision r6640.
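
        The motivation for an appendKey(...) that neither resets the builder nor extracts the byte[] can be sketched as follows: the caller first appends the revision time, then lets the key order append the statement components, then appends the action. This is a toy model, not the bigdata IKeyBuilder or SPOKeyOrder classes; KeyBuilderSketch, Order, and historyKey are hypothetical names, and the action is modeled as a single byte.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

/** Hypothetical sketch of a reusable key builder with composable appends. */
final class KeyBuilderSketch {

    private final ByteArrayOutputStream buf = new ByteArrayOutputStream();

    KeyBuilderSketch reset() { buf.reset(); return this; }

    KeyBuilderSketch append(long v) {
        byte[] b = ByteBuffer.allocate(Long.BYTES).putLong(v).array(); // big-endian
        buf.write(b, 0, b.length);
        return this;
    }

    KeyBuilderSketch append(byte v) { buf.write(v); return this; }

    byte[] getKey() { return buf.toByteArray(); }

    /** Stand-in for a key order: appends (s,p,o) in its own order without reset() or getKey(). */
    enum Order {
        SPO { void appendKey(KeyBuilderSketch kb, long s, long p, long o) { kb.append(s).append(p).append(o); } },
        POS { void appendKey(KeyBuilderSketch kb, long s, long p, long o) { kb.append(p).append(o).append(s); } };

        abstract void appendKey(KeyBuilderSketch kb, long s, long p, long o);
    }

    /** Builds [revisionTime, <statement in the given order>, action], reusing the same builder. */
    static byte[] historyKey(KeyBuilderSketch kb, Order order,
                             long revisionTime, long s, long p, long o, byte action) {
        kb.reset().append(revisionTime);
        order.appendKey(kb, s, p, o); // leaves the prefix in place
        return kb.append(action).getKey();
    }
}
```

        If appendKey reset the builder, the revision-time prefix would be lost, which is why a non-resetting variant is needed to reuse an existing key order inside a longer composite key.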

        bryanthompson bryanthompson added a comment -

        The history index was using the incorrect SPOKeyOrder for triples/sids/quads modes. This showed up in CI as an error in the SIDS mode.

        Committed revision r6651.


          People

          • Assignee:
            bryanthompson bryanthompson
            Reporter:
            bryanthompson bryanthompson