Bigdata is designed for fast access to historical writes and allows selective retention of "history" within a federation. This means that we cannot "GC" lexicon entries if they are in use in any historical commit point which is being retained. There are several approaches to this problem:
One approach is to maintain reference counters on the lexicon entries. However, this approach would drive IO tremendously, since both inserts and updates of tuples are logged on the Journal and migrated onto index segments (this is necessary in order to have fast access to historical commit points and is part of the general write-once contract for bigdata).
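A minimal sketch of why reference counting is expensive follows. The types and method names here are hypothetical, not the Bigdata codebase: the point is that every statement insert or delete implies a read-modify-write of a counter per term, and on a write-once store each such counter update is itself another logged tuple.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch of a reference-counted lexicon. In a real system the
 * counters would live in an index, not an in-memory map, so every update
 * below would translate into additional Journal writes and index-segment
 * migration -- the IO amplification the text warns about.
 */
public class RefCountingLexicon {

    /** termId -> reference count. */
    private final Map<Long, Long> refCounts = new ConcurrentHashMap<>();

    /** Called for each statement insert: bump the count of each term used. */
    public void onStatementInsert(long s, long p, long o) {
        for (long termId : new long[] { s, p, o }) {
            refCounts.merge(termId, 1L, Long::sum); // read-modify-write per term
        }
    }

    /** Called for each statement delete: decrement the count of each term. */
    public void onStatementDelete(long s, long p, long o) {
        for (long termId : new long[] { s, p, o }) {
            refCounts.merge(termId, -1L, Long::sum);
        }
    }

    /** A zero count marks a term as a GC candidate (ignoring history retention). */
    public boolean isCollectable(long termId) {
        return refCounts.getOrDefault(termId, 0L) <= 0L;
    }
}
```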
Another approach would do a parallel scan over one of the statement indices using a read-only operation and collect all term identifiers which are in use within that index. However, this would have to be done for all commit points since the last time a GC pass was performed. The only way to approach this is by obtaining a read-only transaction corresponding to the last GC point and then doing a full scan on all journals and shards having data for that commit point or any subsequent commit point.
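The following sketch illustrates the scan step under assumed interfaces (StatementShard and Stmt are stand-ins, not Bigdata's actual API): a read-only view pinned at the last GC commit point is scanned shard by shard in parallel, and every term identifier appearing in any statement is recorded as in use.

```java
import java.util.Iterator;
import java.util.Set;
import java.util.concurrent.*;

/** Hypothetical parallel scan collecting the set of in-use term identifiers. */
public class InUseTermScan {

    /** A statement as three term identifiers (s, p, o). */
    public record Stmt(long s, long p, long o) {}

    /** Abstraction over one shard of a statement index at a fixed commit time. */
    public interface StatementShard {
        Iterator<Stmt> scan(); // read-only iterator over all tuples in the shard
    }

    /** Scan all shards in parallel and return the set of term ids in use. */
    public static Set<Long> collectInUse(Iterable<StatementShard> shards)
            throws InterruptedException {
        Set<Long> inUse = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newWorkStealingPool();
        try {
            for (StatementShard shard : shards) {
                pool.submit(() -> {
                    for (Iterator<Stmt> it = shard.scan(); it.hasNext(); ) {
                        Stmt st = it.next();
                        inUse.add(st.s());
                        inUse.add(st.p());
                        inUse.add(st.o());
                    }
                });
            }
        } finally {
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
        return inUse;
    }
}
```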
Given a parallel scan, we need to identify those term identifiers which are no longer in use. This could be done using a left outer join against a read-historical view of the ID2TERM index. We would then need an eventually consistent delete which guaranteed that terms were assigned either their old termIds or a new termId consistently if there was a concurrent insert of a statement using a given term.
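A sketch of the join step is below. It assumes both inputs arrive in ascending termId order, as they would when read from B+Tree indices: the keys of the historical ID2TERM view form the left side, and any key with no match in the in-use stream is an unreferenced term and a deletion candidate. The eventually consistent delete described above would still have to reconcile this candidate list against concurrent inserts before actually removing entries.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Merge-style left outer join emitting ID2TERM keys with no in-use match. */
public class UnusedTermJoin {

    public static List<Long> findUnused(Iterator<Long> id2TermKeys,
                                        Iterator<Long> inUseIds) {
        List<Long> unused = new ArrayList<>();
        Long inUse = inUseIds.hasNext() ? inUseIds.next() : null;
        while (id2TermKeys.hasNext()) {
            long termId = id2TermKeys.next();
            // Advance the in-use stream until it catches up with termId.
            while (inUse != null && inUse < termId) {
                inUse = inUseIds.hasNext() ? inUseIds.next() : null;
            }
            if (inUse == null || inUse > termId) {
                unused.add(termId); // no match on the right: term is unreferenced
            }
        }
        return unused;
    }
}
```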
All of these are quite complex operations. The simplest approach is just to bulk export / import into a new namespace. This could be done as a streaming operation. The old namespace could then be dropped and its history purged using administrative actions.
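The export/import approach might look like the following sketch. The Namespace interface is hypothetical, not Bigdata's API: statements stream from the old namespace into the new one, so the copy never materializes the full dataset, and the new namespace assigns fresh lexicon entries only for terms actually in use, which is what reclaims the garbage.

```java
import java.util.Iterator;

/** Hypothetical streaming copy of one namespace into a fresh one. */
public class NamespaceCopy {

    public interface Namespace {
        Iterator<String[]> exportStatements();            // streams (s, p, o) triples
        void importStatements(Iterator<String[]> stmts);  // bulk loads in chunks
        void drop();                                      // admin action: purge history
    }

    public static void rebuild(Namespace oldNs, Namespace newNs) {
        newNs.importStatements(oldNs.exportStatements()); // streaming copy
        oldNs.drop(); // old namespace dropped, its history purged
    }
}
```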