Blazegraph (by SYSTAP) / BLZG-464

Optimize prefix query when sliced by rank

    Details

      Description

      A prefix query such as "c*" can be optimized when it is sliced by rank. Such a query turns into a key-range scan on the full text index, so we can optimize it with an iterator that visits only the requested slice of that key range.

        Activity

        bryanthompson added a comment:

        See also https://sourceforge.net/apps/trac/bigdata/ticket/318 (prefix search broken in r4553)

        bryanthompson added a comment:

        "c*" needs to be optimized differently. This is basically a key-range scan on the full text index. The index looks, more or less, like this:

         
        [ca...][docId]:termFreq,termWeight
        [cb...][docId]:termFreq,termWeight
        ...
        [cz...][docId]:termFreq,termWeight
        

        When you do a min/max rank slice of "c*", we need to slice the iterator, not the hit list. To make that work, you would have to find the tuple offset in the index of the first result for "c*" (the first key GTE "c").

        fromKey = keyBuilder.appendText("c*", true/*unicode*/, false/*successor*/);

        firstIndex = ndx.indexOf(fromKey); // index of the first key GTE "c".

        startIndex = firstIndex + minRank; // add in minRank to skip over the tuples we have already visited.

        fromKey = ndx.keyAt(startIndex); // update fromKey to advance beyond the first minRank tuples.

        That optimization could happen inside of the full text search() method when we recognize the wildcard query.

        Note: This assumes that the query is "c*" and not "AND(c*, d*)". For "c*", it is true that we can simply advance the starting point of the iterator. However, for "AND(c*,d*)" we would still have to compute a hit list which means scanning ALL of "c*" and ALL of "d*".
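
        For illustration, here is a minimal, self-contained sketch of that iterator-level slicing for the simple "c*" case. A TreeMap stands in for the sorted full text index, and the key layout, the successor of the prefix, and all of the names here are simplifications for the sketch, not the actual bigdata APIs:

        import java.util.Map;
        import java.util.NavigableMap;
        import java.util.TreeMap;

        // Illustrative sketch only: a TreeMap stands in for the sorted full text
        // index so the rank-slice idea can be run end to end. The real index is a
        // B+Tree keyed by [sortKey(token)][docId], not a map of Strings.
        public class PrefixRankSliceSketch {

            public static void main(String[] args) {

                final NavigableMap<String, String> ndx = new TreeMap<>();
                ndx.put("ca|doc1", "termFreq=2,termWeight=0.40");
                ndx.put("cb|doc2", "termFreq=1,termWeight=0.20");
                ndx.put("cc|doc3", "termFreq=3,termWeight=0.90");
                ndx.put("cd|doc4", "termFreq=1,termWeight=0.10");
                ndx.put("da|doc5", "termFreq=5,termWeight=0.70");

                final int minRank = 1; // skip this many tuples in the "c*" key range.
                final int maxRank = 3; // stop once this many tuples have been visited.

                // Key range for the prefix "c": [fromKey, toKey). Here "d" plays the
                // role of successor("c"); the real code derives it from the sort key.
                final NavigableMap<String, String> range = ndx.subMap("c", true, "d", false);

                // Slice the iterator, not the hit list: skip the first minRank
                // tuples, then accept tuples until maxRank is reached.
                int rank = 0;
                for (Map.Entry<String, String> e : range.entrySet()) {
                    if (rank >= maxRank) {
                        break;
                    }
                    if (rank >= minRank) {
                        System.out.println(e.getKey() + " => " + e.getValue());
                    }
                    rank++;
                }
            }
        }

        Run as written, this prints the tuples for doc2 and doc3 (ranks 1 and 2 of the "c*" range) without ever building or sorting a hit list.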

        bryanthompson added a comment:

        This optimization might also be applicable to single token queries. However, we do need to visit everything spanned by the full text query in order to factor in the within-document term normalization.

        bryanthompson added a comment:

        It seems that there are two ways to go. They differ in their correctness and scalability.

        The term frequency data is stored in the full text index:

        sortKey(token):docId => termFreq, termWeight
        

        where docId is the literal's identifier;
        where termFreq is the #of occurrences of that token in that literal; and
        where termWeight is the normalized termFreq value.

        The "correct" approach respects the normalized term frequency data for each indexed literal (which is the relative frequency of each token in that literal). This means that we have to scan the key range for "c*" and then sort the entries. This is a problem because there are so many entries for "c*". We do too much IO to read those entries and then we spam the heap to sort them.

        The alternative is much faster. We simply take the first N tuples from the full text index that start with "c". If more results need to be materialized, we skip past the first minRank tuples and then accept the next N tuples (up to maxRank).

        The only drawback with the alternative is that the results are not in decreasing termWeight order. Instead they are in a somewhat arbitrary order (docId order). However, for what you are doing I think that this is perfectly OK. When you ask for "C*" there is really no point in ordering the hits based on termWeight. If the document has a token starting with "C" then you should get it, and the order hardly matters in my opinion.

        There is one other twist on all of this. The computed relevance scores are specific to a literal: the relative term frequency and the resulting cosine (aka relevance) scores apply to an RDF Literal, not to a complex "source document" containing that literal. We might model this within the full text index as well by changing the index tuples to:

        sortKey(token):sourceId:predId:literalId => termFreq, termWeight
        

        where sourceId corresponds more or less to a Lucene document with a bunch of fields;
        where predId is the predicate linking the literal to that snippet within which it was indexed; and
        where literalId is the RDF Literal (what I call the docId above, which is what it is called in the code).

        An alternative encoding would be:

        sortKey(token):sourceId:literalId => termFreq, termWeight, predId[]
        

        The difference between the two encodings comes down to whether the intention is to search within named "fields" (specific relationships, as modeled by the predicate) or to have the relevance normalized across the different ways in which the token appears within the various "fields" (predId[]).
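
        As a hedged sketch of that structural difference, here are the two layouts written out as plain value classes; the record shapes and the use of long identifiers are assumptions for illustration, not the actual key/value encodings:

        // Illustrative only: the two candidate tuple layouts for a field-aware
        // full text index, written out as plain value classes.
        public class FieldedTupleSketch {

            // Encoding 1: one tuple per (token, source, predicate, literal). A
            // key-range scan can be restricted to a single predId, i.e. to one "field".
            record Key1(byte[] tokenSortKey, long sourceId, long predId, long literalId) {}
            record Val1(int termFreq, double termWeight) {}

            // Encoding 2: one tuple per (token, source, literal). The predicates are
            // folded into the value, so the relevance is normalized across all of the
            // "fields" in which the token appears.
            record Key2(byte[] tokenSortKey, long sourceId, long literalId) {}
            record Val2(int termFreq, double termWeight, long[] predIds) {}
        }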

        bryanthompson added a comment:

        We will be exploring a subject centric search index instead of pursuing this issue.


          People

          • Assignee: bryanthompson
          • Reporter: bryanthompson
          • Votes: 0
          • Watchers: 2
