It seems that there are two ways to go here, and they trade off correctness against scalability.
The term frequency data is stored in the full text index:
sortKey(token):docId => termFreq, termWeight
where docId is the literal's identifier;
where termFreq is the number of occurrences of that token in that literal; and
where termWeight is the normalized termFreq value.
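As a concrete sketch of that layout, the following models the index as a map from (token, docId) keys to (termFreq, termWeight) values. The tokenizer and the cosine-style normalization are assumptions for illustration; the actual indexing code is not shown here.

```python
# Hypothetical sketch of the index tuple layout described above.
# Keys sort by (token, docId); values carry the term frequency and
# its normalized weight for that literal.
from collections import Counter
import math

def index_literal(index, doc_id, text):
    """Index one literal: compute per-token termFreq and a normalized
    termWeight (cosine normalization assumed; the real code may differ)."""
    freqs = Counter(text.lower().split())
    norm = math.sqrt(sum(f * f for f in freqs.values()))
    for token, tf in freqs.items():
        index[(token, doc_id)] = (tf, tf / norm)
    return index

idx = index_literal({}, "lit1", "cat cat dog")
```

Note that the weights are normalized per literal, which is exactly why the "correct" approach below has to look at all of the tuples before it can rank them.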
The "correct" approach respects the normalized term frequency data for each indexed literal (the relative frequency of each token within that literal). That means scanning the key range for "c*" and then sorting all of the entries by termWeight. This is a problem because there are so many entries for "c*": we do too much IO reading them and then swamp the heap sorting them.
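The "correct" approach can be sketched like this, over the same hypothetical (token, docId) => (termFreq, termWeight) map. The cost is that every matching tuple must be buffered and sorted before any result can be returned.

```python
def prefix_scan_sorted(index, prefix):
    """The "correct" approach: scan the whole key range for the prefix,
    then sort by termWeight. Costly when the prefix matches many tuples."""
    hits = [(token, doc_id, tf, tw)
            for (token, doc_id), (tf, tw) in index.items()
            if token.startswith(prefix)]
    # All matching tuples must be materialized before the sort.
    hits.sort(key=lambda h: h[3], reverse=True)
    return hits
```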
The alternative is much faster. We simply take the first N tuples from the full text index that start with "c". If more results need to be materialized, we skip past the first minRank tuples and then accept the next N tuples (up to maxRank).
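The alternative can be sketched as a single pass in key order with rank-based slicing. The min_rank/max_rank parameters mirror the minRank/maxRank described above; the index layout is the hypothetical one from the earlier sketch.

```python
def prefix_scan_sliced(index, prefix, min_rank, max_rank):
    """Fast alternative: accept tuples in key (hence docId) order,
    skipping the first min_rank and stopping at max_rank. No global sort."""
    out = []
    rank = 0
    for (token, doc_id), (tf, tw) in sorted(index.items()):  # key order
        if not token.startswith(prefix):
            continue
        if rank >= min_rank:
            out.append((token, doc_id, tf, tw))
        rank += 1
        if rank >= max_rank:
            break
    return out
```

A real B+Tree would position the scan directly on the first key >= the prefix rather than filtering every tuple, so the work done is proportional to maxRank, not to the size of the "c*" key range.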
The only drawback of the alternative is that the results are not in decreasing termWeight order. Instead they come back in a somewhat arbitrary order (docId order). However, for what you are doing I think this is perfectly OK. When you ask for "c*" there is really no point in ordering the hits by termWeight. If the document has a token starting with "c" then you should get it, and the order hardly matters in my opinion.
There is one other twist to all of this. The computed relevance scores are specific to a literal: the relative term frequency and the resulting cosine (aka relevance) scores apply to an RDF Literal, not to a complex "source document" containing that literal. It's possible that we might model this within the full text index as well by changing the index tuples to:
sortKey(token):sourceId:predId:literalId => termFreq, termWeight
where sourceId corresponds more or less to a Lucene document with a bunch of fields;
where predId is the predicate linking the literal to that snippet within which it was indexed; and
where literalId is the RDF Literal (what I call the docId above, which is what it is called in the code).
An alternative encoding would be:
sortKey(token):sourceId:literalId => termFreq, termWeight, predId
The difference between the two encodings comes down to whether the intention is to search within named "fields" (specific relationships, as modeled by the predicate) or to have the relevance normalized across the different ways in which the token appears within the various "fields" (predId).
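The contrast between the two encodings can be sketched as follows. The tuple components (sourceId, predId, literalId) follow the text; the byte-level sort keys and the predicate names are placeholders, not the actual code.

```python
# Encoding 1: predId inside the key. Tuples for one token are grouped
# by field, so a field-scoped ("named field") search can restrict
# matches to a single predId.
index1 = {
    ("cat", "s1", "p:label",   "l1"): (2, 0.9),
    ("cat", "s1", "p:comment", "l2"): (1, 0.4),
}

def field_scoped_hits(index, token, pred_id):
    """Return keys matching the token within one named field."""
    return [k for k in index if k[0] == token and k[2] == pred_id]

# Encoding 2: predId in the value. One tuple per (source, literal),
# which leaves room to normalize the weight across all of the fields
# in which the token appears for that source.
index2 = {
    ("cat", "s1", "l1"): (2, 0.9, "p:label"),
    ("cat", "s1", "l2"): (1, 0.4, "p:comment"),
}
```

With encoding 1 the field is part of the key's sort order; with encoding 2 it is payload, so field filtering requires inspecting values, but cross-field normalization for a (token, source) pair becomes natural.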