Details

    • Type: New Feature
    • Status: Reopened
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Bigdata RDF Database
    • Labels: None

      Description

      While many uses of bigdata involve large static datasets, dynamic datasets (where tuples may be removed/unasserted) may leave unreferenced objects in the lexicon. This will eventually be a space issue.
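
       As a toy illustration (plain Java, not the Bigdata API or its physical schema), a dictionary-encoded store removes only the tuple from its statement index when a statement is retracted; the dictionary ("lexicon") entries for its terms are left behind, which is the space leak described here.

       import java.util.*;

       // Toy dictionary-encoded triple store (illustrative only, not Bigdata's physical schema).
       class ToyStore {
           private final Map<String, Long> term2id = new HashMap<>(); // the "lexicon"
           private final Set<List<Long>> spo = new HashSet<>();       // a "statement index"
           private long nextId = 1;

           private long id(String term) {
               return term2id.computeIfAbsent(term, t -> nextId++);
           }

           void add(String s, String p, String o) {
               spo.add(Arrays.asList(id(s), id(p), id(o)));
           }

           void remove(String s, String p, String o) {
               // Only the tuple is removed; the lexicon entries for s, p and o remain.
               spo.remove(Arrays.asList(id(s), id(p), id(o)));
           }

           public static void main(String[] args) {
               ToyStore db = new ToyStore();
               db.add(":alice", ":knows", ":bob");
               db.remove(":alice", ":knows", ":bob");
               // Prints statements=0, lexicon=3 -- three now-unreferenced lexicon entries.
               System.out.println("statements=" + db.spo.size() + ", lexicon=" + db.term2id.size());
           }
       }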

        Activity

        bryanthompson added a comment -

        Bigdata is designed for fast access to historical writes and allows selective retention of "history" within a federation. This means that we can not "GC" lexicon entries if they are in use in any historical commit point which is being retained. There are several approaches to this problem:

        One approach is to maintain reference counters on the lexicon entries. However, this approach would drive IO tremendously, since both inserts and updates of tuples are logged on the Journal and migrated onto index segments (this is necessary in order to have fast access to historical commit points and is part of the general write-once contract for bigdata).
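
        As a minimal sketch of the reference-counting idea (hypothetical in-memory counters, not Bigdata code): every statement insert increments a counter for each of its term ids and every delete decrements it, with entries reaching zero becoming GC candidates. In a write-once store each such counter update is itself another logged write, which is the IO amplification noted above.

        import java.util.*;

        // Hypothetical reference counters keyed by term id (not how the lexicon indices work).
        class LexiconRefCounts {
            private final Map<Long, Long> refCount = new HashMap<>();

            // Called once per term occurrence when a statement is inserted.
            // In a write-once store this would log a new tuple version on the Journal.
            void onInsert(long termId) {
                refCount.merge(termId, 1L, Long::sum);
            }

            // Called once per term occurrence when a statement is deleted.
            // Returns true when the term has become unreferenced (a GC candidate).
            boolean onDelete(long termId) {
                long n = refCount.merge(termId, -1L, Long::sum);
                if (n <= 0) {
                    refCount.remove(termId);
                    return true;
                }
                return false;
            }
        }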

        Another approach would be to do a parallel scan over one of the statement indices using a read-only operation and collect all term identifiers which are not in use within that index. However, this would have to be done for all commit points since the last time a GC pass was performed. The only way to approach this is by obtaining a read-only transaction corresponding to the last GC point and then doing a full scan on all journals and shards having data for that commit point or any subsequent commit point.

        Given a parallel scan, we need to identify those term identifiers which are no longer in use. This could be done using a left outer join against a read-historical view of the ID2TERM index. We would then need to have an eventually consistent delete which guaranteed that the terms were assigned either their old termIds or a new termId consistently if there was a concurrent insert of a statement using a given term.
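
        A sketch of that scan-and-compare pass under strong simplifications (in-memory collections standing in for the distributed journals and shards; the eventually consistent delete and the concurrent-insert guarantee are only noted in comments): collect every term id referenced by the retained statement views, then treat ids present in ID2TERM but absent from that set as unreferenced.

        import java.util.*;
        import java.util.stream.*;

        // Illustrative only: identify lexicon ids not referenced by any retained statement view.
        class LexiconSweep {

            // Term ids in use across the statement tuples visible since the last GC point.
            static Set<Long> termsInUse(Collection<long[]> spoTuples) {
                return spoTuples.parallelStream()                  // the "parallel scan"
                        .flatMap(t -> Stream.of(t[0], t[1], t[2]))
                        .collect(Collectors.toSet());
            }

            // The "left outer join" against a read-historical ID2TERM view reduces to a set
            // difference here: ids with no match on the statement side are unreferenced.
            // Deleting them safely still requires handling a concurrent insert of the same
            // term (old termId vs. new termId), which this sketch does not attempt.
            static Set<Long> unreferenced(Set<Long> id2termIds, Set<Long> inUse) {
                Set<Long> candidates = new HashSet<>(id2termIds);
                candidates.removeAll(inUse);
                return candidates;
            }
        }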

        All of these are quite complex operations. The simplest approach is just to bulk export / import into a new namespace. This could be done as a streaming operation. The old namespace could then be dropped and its history purged using administrative actions.
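
        A rough sketch of the export/import workaround, assuming the Sesame/openrdf Repository API that the bigdata SAIL exposes and that Repository handles for the source and target namespaces have already been opened (those details, and the administrative drop of the old namespace, are outside this sketch):

        import org.openrdf.repository.Repository;
        import org.openrdf.repository.RepositoryConnection;
        import org.openrdf.repository.util.RDFInserter;

        // Streaming copy of all statements from one namespace into another (sketch).
        class NamespaceCopy {

            static void copy(Repository source, Repository target) throws Exception {
                RepositoryConnection src = source.getConnection();
                RepositoryConnection dst = target.getConnection();
                try {
                    // export(...) pushes statements through an RDFHandler; RDFInserter writes
                    // each one into the target connection, so nothing is materialized in memory.
                    // Transaction demarcation (autocommit vs. explicit commit) depends on the
                    // Sesame version and connection settings and is omitted here.
                    src.export(new RDFInserter(dst));
                } finally {
                    src.close();
                    dst.close();
                }
            }
            // Once the copy is verified, the old namespace can be dropped and its history
            // purged using administrative actions, as described above.
        }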

        bryanthompson added a comment -

        This is a known limitation of the broad WORM contract of the bigdata architecture and the physical RDF schema. Per above, export/import is a workaround.

        fkoliver added a comment -

        Import and export of data with timestamps may be necessary as part of a backup strategy, but the scheduled maintenance period required for an export/import will not meet high availability constraints.

        Another approach may be to consider the lexicon as having both a new and an old version. Each read goes to the new version first, and if the term is not found there it goes to the old version. Terms found in the old version are moved to the new version. Each write goes only to the new version. In normal operation the new version contains all the lexicon data while the old version is empty or absent, so there is no performance loss.

        During GC, an empty new version is created, and the previous new version becomes the old version. As terms are used they get moved from the old to the new version. A background task can find and exercise all terms in the tuple indices, moving all referenced terms to the new version. After the background task completes, any terms remaining in the old version of the lexicon are unreferenced (and a verification step is possible) and can be destroyed.
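
        A compact sketch of this two-version scheme (plain in-memory maps standing in for the lexicon indices; persistence, sharding, and concurrency are ignored): reads fall back from the new version to the old one and promote what they find, writes go only to the new version, and a GC cycle swaps the versions, sweeps the terms reachable from the statement indices, and then discards whatever remains in the old version.

        import java.util.*;

        // Sketch of a generational ("new"/"old") lexicon; maps stand in for the index shards.
        class TwoVersionLexicon {
            private Map<String, Long> newVersion = new HashMap<>();
            private Map<String, Long> oldVersion = new HashMap<>(); // empty outside of a GC cycle
            private long nextId = 1;

            // Reads check the new version first, then the old one, promoting on a hit.
            Long lookup(String term) {
                Long id = newVersion.get(term);
                if (id == null) {
                    id = oldVersion.remove(term);
                    if (id != null) {
                        newVersion.put(term, id); // move the referenced term forward
                    }
                }
                return id;
            }

            // Writes go only to the new version.
            long addOrGet(String term) {
                Long id = lookup(term);
                if (id == null) {
                    id = nextId++;
                    newVersion.put(term, id);
                }
                return id;
            }

            // GC start: the previous "new" version becomes "old" and an empty "new" is created.
            void startGc() {
                oldVersion = newVersion;
                newVersion = new HashMap<>();
            }

            // Background task: exercise every term reachable from the statement indices (in the
            // real system the task would resolve term ids via ID2TERM first). Referenced terms
            // are promoted; whatever remains in the old version afterwards is unreferenced and,
            // after an optional verification step, can be destroyed.
            void finishGc(Iterable<String> termsReferencedByStatements) {
                for (String term : termsReferencedByStatements) {
                    lookup(term);
                }
                oldVersion.clear();
            }
        }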

        bryanthompson added a comment -

        Fred,

        I could see how that might work, so let's talk about it further.

        Having two versions of the lexicon introduces complexity because we have no means right now to guarantee that the shards of each version are located on the same node. However, we could do pretty much the same thing if GC defined a "marker" timestamp on the lexicon which separates the "old" and "new" lexicon revisions. If the marker exists, then on insert we rewrite tuples whose revision timestamps are older than the marker such that they get new revision timestamps and appear on the mutable B+Tree on the journal and migrate into new index segments. Thus they are carried forward into the "new" lexicon. Your background task would then read one of the statement indices, resolving term identifiers against the ID2TERM index to RDF Values and then RDF Values to the TERM2ID index. Tuples older than the GC revision marker would be re-inserted and would receive new revision timestamps. At some point you would simply throw out (delete from the disk) the old TERM2ID and ID2TERM index shards since all data for the current state of the statement indices would be on the current state of the lexicon indices.
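
        A sketch of this marker variant under the same simplifications (one in-memory TERM2ID-like map, no shards or concurrency): each lexicon entry carries a revision timestamp, inserts rewrite entries older than the GC marker so they receive a fresh timestamp, and the background pass re-inserts every term still reachable from a statement index so that shards holding only pre-marker tuples can eventually be deleted from disk.

        import java.util.*;

        // Sketch of lexicon GC via a revision-timestamp "marker" (illustrative only).
        class MarkerGcLexicon {

            static final class Entry {
                final long termId;
                long revisionTimestamp;
                Entry(long termId, long revisionTimestamp) {
                    this.termId = termId;
                    this.revisionTimestamp = revisionTimestamp;
                }
            }

            private final Map<String, Entry> term2id = new HashMap<>();
            private long gcMarker = Long.MIN_VALUE; // no GC in progress until a marker is set

            void setGcMarker(long timestamp) {
                this.gcMarker = timestamp;
            }

            // On insert, entries older than the marker are rewritten so they get a new revision
            // timestamp and are carried forward into the "new" lexicon.
            long insert(String term, long now) {
                Entry e = term2id.get(term);
                if (e == null) {
                    e = new Entry(term2id.size() + 1, now);
                    term2id.put(term, e);
                } else if (e.revisionTimestamp < gcMarker) {
                    e.revisionTimestamp = now; // rewrite with a fresh revision timestamp
                }
                return e.termId;
            }

            // Background task: walk one statement index, resolve term ids back to terms (via
            // ID2TERM in the real system) and re-insert them. Afterwards, index shards holding
            // only pre-marker tuples contain no data needed by the current statement indices
            // and can be deleted from disk.
            void carryForward(Iterable<String> termsReferencedByStatements, long now) {
                for (String term : termsReferencedByStatements) {
                    insert(term, now);
                }
            }
        }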

        I think that the challenge here would be to ensure that you could continue to read on historical commit points for the RDF database. If lexicon reads were always against a timestamp no earlier than the last GC point for the lexicon then that might be consistent. However, if there is an attempt to resolve a lexicon shard which had been "GC'd" then that would clearly fail.

        Presumably we would like to also clean up the shard locators for the GC'd lexicon index shards. I think that the best way to approach that problem would be to decentralize the metadata service, basing that service on a P2P protocol provided by the data services based on the shards for which they have responsibility (in a shared nothing architecture) or the shards for which they have affinity (in a shared disk architecture).

        I've reopened this issue for further discussion.


          People

          • Assignee: Unassigned
          • Reporter: fkoliver
          • Votes: 0
          • Watchers: 0

            Dates

            • Created:
            • Updated: