  1. Blazegraph (by SYSTAP)
  2. BLZG-889

Wildcard search in bigdata for type suggestions

    Details

    • Type: New Feature
    • Status: Done
    • Resolution: Done
    • Affects Version/s: BIGDATA_RELEASE_1_3_0
    • Fix Version/s: None
    • Component/s: Bigdata SAIL

      Description

      The most typical use case for full-text search is type suggestions for terms, since the user usually does not remember the terms by heart and also wants the UI to be responsive.
      At the moment, the default configuration of Bigdata search works only with whole words. There are some docs about configuration, but they do not explain how to configure wildcard search (if it is possible at all). It would be great to see either this feature implemented or (if it already exists) a few additional sentences in the docs about configuring it.

      P.S. I know about the possibility of integrating with USeekM, but setting up Elastic Search (and dealing with the poorly documented USeekM) is unnecessary overhead for users who just want to have type suggestions working.

        Activity

        bryanthompson bryanthompson added a comment -

        This was just addressed in [1]. Have you consulted the wiki page for the full text search feature? Are there things that are missing from the documentation on that page?

        Thanks,
        Bryan

        [1] https://sourceforge.net/apps/trac/bigdata/ticket/803 (prefixMatch does not work in full text search)

        antonkulaga antonkulaga added a comment -

        What I want to know is the following. I want to do a case-insensitive search for part of a word (quite common for type suggestions). For example, I want to find everything that contains "stability" as part of a word (so, both "Instability" and "Stability").

        When I do:

        PREFIX bds: <http://www.bigdata.com/rdf/search#>

        SELECT ?object
        WHERE {
          ?object bds:search "*stability" .
        }
        LIMIT 50

        I do not get "Instability" in results.
        Another use case: some scientific terms, especially in biology and medicine, have typical endings. I want to do a wildcard search for words with the suffixes and endings that I want.
        It would be nice to have it explained in fulltext search docs.

        bryanthompson bryanthompson added a comment -

        You can configure the full text index to handle case-insensitive matching. You can also install custom tokenizers. By default, it uses the language code associated with the literal.

        The index only supports prefix matches. To handle the more general case of a wildcard at the head of the search term, you have to either also carry a reverse string index (an index over the tokens in reverse order) or use a different kind of index entirely (e.g., one based on a finite automaton).
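
        The reverse string index idea can be sketched as follows. This is a minimal, self-contained illustration in Python and not Blazegraph code: the class name SuffixCapableIndex and the helper prefix_scan are hypothetical, and a sorted in-memory list stands in for the real index. Each token is stored twice, once as-is and once with its characters reversed, so a search with a leading wildcard becomes an ordinary prefix scan over the reversed entries.

```python
from bisect import bisect_left


def prefix_scan(sorted_keys, prefix):
    """Return all keys in a sorted list that start with `prefix`."""
    # Keys sharing a prefix are contiguous in sorted order, so we can
    # jump to the first candidate and stop at the first non-match.
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches


class SuffixCapableIndex:
    """Toy token index (hypothetical, for illustration only).

    Keeps a forward list for prefix search and a list of reversed
    tokens for suffix (leading-wildcard) search.
    """

    def __init__(self, tokens):
        # Lower-casing here stands in for case-insensitive matching.
        lowered = {t.lower() for t in tokens}
        self.forward = sorted(lowered)
        self.reverse = sorted(t[::-1] for t in lowered)

    def starts_with(self, prefix):
        return prefix_scan(self.forward, prefix.lower())

    def ends_with(self, suffix):
        # A suffix of the token is a prefix of the reversed token.
        hits = prefix_scan(self.reverse, suffix.lower()[::-1])
        return sorted(h[::-1] for h in hits)


index = SuffixCapableIndex(["Stability", "Instability", "Stable", "Ability"])
print(index.starts_with("stab"))       # prefix search
print(index.ends_with("stability"))    # equivalent of "*stability"
```

        The cost of this design is roughly double the index storage, which is why a prefix-only index is the cheaper default.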

        People do use the free text search index for low-latency type-ahead applications.

        See FullTextIndex.Options [1] for more information on the configuration options for this index.

        [1] http://www.bigdata.com/docs/api/com/bigdata/search/FullTextIndex.Options.html

        bryanthompson bryanthompson added a comment -

        Another possible approach is to integrate a stemmer into the full text index. This would just be a modified tokenizer. That would automatically change the search into a search of word stems. This might be suitable for your application. You could also break each token into any recognizable stems when it is indexed. For example, biomedical could be broken into bio- and medic-.
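
        As a rough sketch of breaking tokens into recognizable stems at indexing time: the stem list KNOWN_STEMS and the functions stems_for and build_stem_index below are assumptions made up for illustration, not part of the FullTextIndex API. A real setup would plug a proper stemmer or a domain vocabulary into a custom tokenizer.

```python
# Hypothetical stem list, purely for illustration; a real deployment would
# use a proper stemmer or a curated domain vocabulary instead.
KNOWN_STEMS = ("bio", "medic", "cardio", "neuro", "stabil")


def stems_for(token, stems=KNOWN_STEMS):
    """Return the known stems found in `token`, in order of appearance."""
    lowered = token.lower()
    found = []
    for stem in stems:
        pos = lowered.find(stem)
        if pos != -1:
            found.append((pos, stem))
    # Sort by position so "biomedical" yields bio before medic.
    return [stem for _, stem in sorted(found)]


def build_stem_index(tokens):
    """Map each stem to the set of original tokens that contain it."""
    index = {}
    for token in tokens:
        for stem in stems_for(token):
            index.setdefault(stem, set()).add(token)
    return index


stem_index = build_stem_index(
    ["biomedical", "Instability", "Stability", "neurology"]
)
print(stems_for("biomedical"))
print(sorted(stem_index["stabil"]))
```

        With the stems indexed as separate entries, a search for "stabil" finds both "Instability" and "Stability" without needing a leading wildcard.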

        Note that you can do exact Unicode prefix searches of non-inlined, non-blob literals or URIs using the TERM2ID index.

        Note that you can do exact prefix searches of inline RDF Values using the OSP(C) index.

        I have captured some of this documentation at [1].

        I am going to close this ticket out as "won't fix." You can use the mechanisms described here to implement a stem search or prefix search. That should provide low-latency prefix or stem prefix search suitable for your application.

        If you want to add some documentation onto this ticket about the FullTextIndex.Options or a recipe to configure the FullTextIndex for your application, I am happy to post that to the wiki page for the free text search.

        [1] https://sourceforge.net/apps/mediawiki/bigdata/index.php?title=FullTextSearch

        antonkulaga antonkulaga added a comment -

        > If you want to add some documentation onto this ticket about the FullTextIndex.Options or a recipe to configure the FullTextIndex for your application, I am happy to post that to the wiki page for the free text search.

        Thank you for the clarifications. I will probably work on the text indexes at the beginning of next week.


          People

          • Assignee:
            mikepersonick mikepersonick
            Reporter:
            antonkulaga antonkulaga
          • Votes:
            0
            Watchers:
            0
