  1. Blazegraph (by SYSTAP)
  2. BLZG-986

Provide a configurable IAnalyzerFactory

    Details

    • Type: New Feature
    • Status: Done
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Other

      Description

      Currently, full-text search can be configured by providing a custom implementation of IAnalyzerFactory.

      The purpose of this work item is to provide a new general-purpose implementation as an alternative to DefaultAnalyzerFactory: one that is configurable, with roughly the same functionality as DefaultAnalyzerFactory.

      Here is a sample bigdata.properties file section setting up the new class:

      # use the new class
      com.bigdata.search.FullTextIndex.analyzerFactoryClass=com.bigdata.search.ConfigurableAnalyzerFactory
      
      # set up the US English analyzer; note that en-us is a language range (RFC 4647) that matches the language tag en-us as well as en-us-x-eubonics
      com.bigdata.search.ConfigurableAnalyzerFactory.analyzers.en-us.analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
      com.bigdata.search.ConfigurableAnalyzerFactory.analyzers.en-us.stopwords=default
      
      # set up the default analyzer: note * is a language range by RFC 4647
      com.bigdata.search.ConfigurableAnalyzerFactory.analyzers.*.analyzer=org.apache.lucene.analysis.WhitespaceAnalyzer
      
      # set up a PatternAnalyzer for private use language tag x-gene
      com.bigdata.search.ConfigurableAnalyzerFactory.analyzers.x-gene.analyzer=org.apache.lucene.analysis.miscellaneous.PatternAnalyzer
      com.bigdata.search.ConfigurableAnalyzerFactory.analyzers.x-gene.pattern="\\W+"
      com.bigdata.search.ConfigurableAnalyzerFactory.analyzers.x-gene.stopwords=none
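
      To illustrate how keys of this shape could be read, here is a minimal sketch that groups the properties by language range. The class name AnalyzerConfigParser and its parse method are hypothetical and are not part of Blazegraph; the sketch assumes the key layout shown in the sample above (prefix, then language range, then a setting name after the last dot) and that ranges are compared case-insensitively.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper, for illustration only; not the Blazegraph implementation.
final class AnalyzerConfigParser {

    // Key prefix taken from the sample configuration in this issue.
    static final String PREFIX =
            "com.bigdata.search.ConfigurableAnalyzerFactory.analyzers.";

    /** Groups matching properties as: language range -> (setting name -> value). */
    static Map<String, Map<String, String>> parse(Properties props) {
        Map<String, Map<String, String>> byRange = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (!key.startsWith(PREFIX)) {
                continue; // not an analyzer setting
            }
            String rest = key.substring(PREFIX.length());
            // The setting name follows the last dot; language ranges contain no dots.
            int dot = rest.lastIndexOf('.');
            if (dot <= 0) {
                continue; // no language range before the setting name
            }
            // Assumption: ranges are normalised to lower case for later matching.
            String range = rest.substring(0, dot).toLowerCase(Locale.ROOT);
            String setting = rest.substring(dot + 1);
            byRange.computeIfAbsent(range, r -> new TreeMap<>())
                   .put(setting, props.getProperty(key));
        }
        return byRange;
    }
}
```

      One detail worth checking when loading the sample with java.util.Properties: the loader turns \\W+ into \W+ but keeps surrounding double quotes as part of the value, so whether the pattern value should be quoted depends on how the implementation treats it.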
      

      The goal is to support each of the language-specific Analyzers from Lucene, as well as PatternAnalyzer (which needs an additional pattern property), WhitespaceAnalyzer, SimpleAnalyzer, KeywordAnalyzer and StopAnalyzer.

      Some analyzers support stop words and some do not. The stopwords property can take the value none or, where the analyzer supports stop words, default.
      For PatternAnalyzer, which supports stop words but has no default list, the stopwords property can be none or the name of a class that does have a default stop word list.

      An additional analyzer, com.bigdata.search.EmptyAnalyzer, may also be provided. It always returns an EmptyTokenStream, which allows bds search to be turned off for certain language tags, or turned off by default and on only for specified tags.

      Language tags should be looked up against the configured language ranges using the rules of RFC 4647.
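
      As a rough illustration of such a lookup, the sketch below picks the most specific configured language range for a given tag. It applies the prefix rule of RFC 4647 basic filtering (a range matches a tag that equals it or that starts with it followed by "-") and falls back to the wildcard range "*". The class and method names are hypothetical, the issue does not say which matching scheme or tie-break is intended, and the sketch assumes the configured ranges are stored in lower case.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, for illustration only; not the Blazegraph implementation.
final class LanguageRangeLookup {

    /**
     * Returns the longest configured range that matches the tag, "*" if only
     * the wildcard matches, or null if nothing matches.
     */
    static String bestRange(String languageTag, Set<String> configuredRanges) {
        // Language tags are case-insensitive, so compare in lower case.
        String candidate =
                languageTag == null ? "" : languageTag.toLowerCase(Locale.ROOT);
        while (!candidate.isEmpty()) {
            if (configuredRanges.contains(candidate)) {
                return candidate;
            }
            // Drop the last subtag and try the shorter prefix.
            int dash = candidate.lastIndexOf('-');
            candidate = dash < 0 ? "" : candidate.substring(0, dash);
        }
        // Untagged literals and unmatched tags fall through to the wildcard.
        return configuredRanges.contains("*") ? "*" : null;
    }
}
```

      With the ranges from the sample configuration (en-us, x-gene and *), the tag en-US-x-eubonics would resolve to en-us, x-gene to x-gene, and en-GB to the default range *.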

            People

            Assignee:
            jeremycarroll jeremycarroll
            Reporter:
            jeremycarroll jeremycarroll