Text Indexing Configuration Options

From Blazegraph
Revision as of 10:28, 19 January 2016 by Maria Krokhaleva (Talk | contribs)

The ConfigurableAnalyzerFactory class can be used with the Blazegraph properties file to specify which Analyzers are used for which languages. Languages are identified by the language tags on RDF literals, which conform to RFC 5646. Within Blazegraph, plain literals are assigned the language of the default locale.

The Blazegraph properties map language ranges, as specified by RFC 4647, to classes that extend Analyzer. Supported classes include all the natural-language-specific classes from Lucene, and also:

More generally, any subclass of Analyzer that has at least one constructor matching:

is usable. If the class has a static method named getDefaultStopSet(), that method is assumed to return the default stop word set. Some of the Lucene analyzers store their default stop words elsewhere, and such stop words are also usable by this class. If no stop word set can be found, and the class has both a constructor without stop words and a constructor with stop words, then the former is assumed to use a default stop word set.

Configuration is by means of the Blazegraph properties file. All relevant properties start with com.bigdata.search.ConfigurableAnalyzerFactory, which we abbreviate to c.b.s.C in this documentation. Properties from Options apply to the factory. Other properties, from AnalyzerOptions, start with c.b.s.C.analyzer.language-range, where language-range conforms to the extended language range construct from RFC 4647 section 2.2. Because bigdata does not allow '*' in property names, the character '_' is substituted for '*' in extended language ranges in property names. These properties specify an analyzer for the given language range. If no analyzer is specified for the language range *, then the StandardAnalyzer is used.

Given any specific language, getAnalyzer(String, boolean) returns the analyzer matching the longest configured language range, measured in number of subtags. In the event of a tie, the alphabetically first language range is used. The algorithm used to find a match is "Extended Filtering", as defined in section 3.3.2 of RFC 4647.

Some useful analyzers are as follows:

KeywordAnalyzer This treats every lexical value as a single search token
WhitespaceAnalyzer This uses whitespace to tokenize
PatternAnalyzer This uses a regular expression to tokenize
TermCompletionAnalyzer This uses up to three regular expressions to specify multiple tokens for each word, to address term completion use cases.
EmptyAnalyzer This suppresses the functionality, by treating every expression as a stop word.
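
As a rough illustration of the matching rule and of the analyzers listed above, a properties fragment might look like the following. This is a sketch under assumptions, not a tested configuration: the language ranges de-ch, x-keyword and x-skip are hypothetical, the package name for KeywordAnalyzer is its location in the Lucene 3.x line, and EmptyAnalyzer is assumed to live in the same package as ConfigurableAnalyzerFactory. Both package names should be checked against the jars actually deployed.

```properties
# Route text indexing through the configurable factory.
com.bigdata.search.FullTextIndex.analyzerFactoryClass=com.bigdata.search.ConfigurableAnalyzerFactory

# Range "de" (one subtag): German analyzer with its default stop words.
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.de.analyzerClass=org.apache.lucene.analysis.de.GermanAnalyzer

# Hypothetical range "de-ch" (two subtags): same analyzer, but without stop words.
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.de-ch.analyzerClass=org.apache.lucene.analysis.de.GermanAnalyzer
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.de-ch.stopwords=none

# Hypothetical private-use range: index each lexical value as a single token.
# Assumed package name (Lucene 3.x); later Lucene releases moved this class.
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.x-keyword.analyzerClass=org.apache.lucene.analysis.KeywordAnalyzer

# Hypothetical private-use range: suppress text indexing for these literals.
# Assumed package name; verify against the Blazegraph version in use.
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.x-skip.analyzerClass=com.bigdata.search.EmptyAnalyzer
```

Under the rule described above, and assuming subtags are compared case-insensitively as RFC 4647 requires, a literal tagged de-CH would match both de and de-ch, and the two-subtag range de-ch should win. A literal tagged de-AT would match de but not de-ch. A tag that matches none of these ranges falls back to the range *, which uses StandardAnalyzer when nothing else is configured for it.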

In addition, there are language-specific analyzers, which are included by using the option Options#NATURAL_LANGUAGE_SUPPORT. When this option is set to true, all the known Lucene Analyzers for natural languages are used for a range of language tags. These settings may then be overridden by the user's settings. Specifically, the following properties are loaded prior to loading the user's specification (with c.b.s.C expanding to com.bigdata.search.ConfigurableAnalyzerFactory):

c.b.s.C.analyzer._.like=eng
c.b.s.C.analyzer.por.analyzerClass=org.apache.lucene.analysis.br.BrazilianAnalyzer
c.b.s.C.analyzer.pt.like=por
c.b.s.C.analyzer.zho.analyzerClass=org.apache.lucene.analysis.cn.ChineseAnalyzer
c.b.s.C.analyzer.chi.like=zho
c.b.s.C.analyzer.zh.like=zho
c.b.s.C.analyzer.jpn.analyzerClass=org.apache.lucene.analysis.cjk.CJKAnalyzer
c.b.s.C.analyzer.ja.like=jpn
c.b.s.C.analyzer.kor.like=jpn
c.b.s.C.analyzer.ko.like=kor
c.b.s.C.analyzer.ces.analyzerClass=org.apache.lucene.analysis.cz.CzechAnalyzer
c.b.s.C.analyzer.cze.like=ces
c.b.s.C.analyzer.cs.like=ces
c.b.s.C.analyzer.dut.analyzerClass=org.apache.lucene.analysis.nl.DutchAnalyzer
c.b.s.C.analyzer.nld.like=dut
c.b.s.C.analyzer.nl.like=dut
c.b.s.C.analyzer.deu.analyzerClass=org.apache.lucene.analysis.de.GermanAnalyzer
c.b.s.C.analyzer.ger.like=deu
c.b.s.C.analyzer.de.like=deu
c.b.s.C.analyzer.gre.analyzerClass=org.apache.lucene.analysis.el.GreekAnalyzer
c.b.s.C.analyzer.ell.like=gre
c.b.s.C.analyzer.el.like=gre
c.b.s.C.analyzer.rus.analyzerClass=org.apache.lucene.analysis.ru.RussianAnalyzer
c.b.s.C.analyzer.ru.like=rus
c.b.s.C.analyzer.tha.analyzerClass=org.apache.lucene.analysis.th.ThaiAnalyzer
c.b.s.C.analyzer.th.like=tha
c.b.s.C.analyzer.eng.analyzerClass=org.apache.lucene.analysis.standard.StandardAnalyzer
c.b.s.C.analyzer.en.like=eng
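
To illustrate how user settings might sit on top of these defaults when natural language support is enabled, the fragment below is a minimal sketch. It assumes that a user property with the same key replaces the default value, and that an extra option can be added to a range whose analyzerClass comes from the defaults. Neither assumption is confirmed on this page, so it is worth testing before relying on it.

```properties
# Replaces the default "_.like=eng": languages with no more specific
# match are now treated like the range "deu" (GermanAnalyzer).
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer._.like=deu

# Keeps the default RussianAnalyzer for "rus" but drops its stop words.
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.rus.stopwords=none
```

The defaults configure the range _ through the like option, so the sketch overrides that same key. Setting analyzerClass for _ instead would presumably leave like and another option on the same range, a combination that the like option does not permit.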

List of Options

analyzerClass If specified this is the fully qualified name of a subclass of Analyzer that has appropriate constructors. This is set implicitly if some of the options below are selected (for example PATTERN). For each configured language range, if it is not set, either explicitly or implicitly, then LIKE must be specified.
like The value of this property is a language range, for which an analyzer is defined. Treat this language range in the same way as the specified language range. Loops are not permitted. If this option is specified for a language range, then no other option is permitted.
stopwords The value of this property is one of:
  • none - This analyzer is used without stop words.
  • default - Use the default setting for stopwords for this analyzer. It is an error to set this value on some analyzers such as SimpleAnalyzer that do not support stop words.
  • A fully qualified class name ... of a subclass of Analyzer which has a static method getDefaultStopSet(), in which case, the returned set of stop words is used.
  If the analyzerClass does not support stop words then any value other than STOPWORDS_VALUE_NONE is an error. If the analyzerClass does support stop words then the default value is STOPWORDS_VALUE_DEFAULT.
pattern The value of the pattern parameter to PatternAnalyzer(Version, Pattern, boolean, Set) (Note the Pattern#UNICODE_CHARACTER_CLASS flag is enabled). It is an error if a different analyzer class is specified.
wordBoundary The value of the wordBoundary parameter to TermCompletionAnalyzer#TermCompletionAnalyzer(Pattern, Pattern, Pattern, boolean) (Note the Pattern#UNICODE_CHARACTER_CLASS flag is enabled). It is an error if a different analyzer class is specified.
subWordBoundary The value of the subWordBoundary parameter to TermCompletionAnalyzer#TermCompletionAnalyzer(Pattern, Pattern, Pattern, boolean) (Note the Pattern#UNICODE_CHARACTER_CLASS flag is enabled). It is an error if a different analyzer class is specified. The default sub-word boundary is Pattern.compile("(?!)"), a pattern that never matches, i.e. there are no sub-word boundaries.
softHyphens The value of the softHyphens parameter to TermCompletionAnalyzer#TermCompletionAnalyzer(Pattern, Pattern, Pattern, boolean) (Note the Pattern#UNICODE_CHARACTER_CLASS flag is enabled). It is an error if a different analyzer class is specified.
alwaysRemoveSoftHyphens The value of the alwaysRemoveSoftHypens parameter to TermCompletionAnalyzer#TermCompletionAnalyzer(Pattern, Pattern, Pattern, boolean) (Note the Pattern#UNICODE_CHARACTER_CLASS flag is enabled). It is an error if a different analyzer class is specified. The default value is false.
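
The pattern and term completion options could be combined in a properties file along the following lines. This is only a sketch: x-terms and x-hyphen are hypothetical private-use language ranges, the regular expressions are illustrative, and it assumes that wordBoundary, like pattern, selects its analyzer class implicitly. If it does not, analyzerClass would need to name TermCompletionAnalyzer explicitly. Backslashes are doubled because a single backslash is an escape character in Java properties files.

```properties
# Hypothetical range "x-terms": the pattern option implies PatternAnalyzer.
# The regular expression \W+ splits on runs of non-word characters.
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.x-terms.pattern=\\W+

# Hypothetical range "x-hyphen": term completion settings.
# Words are separated by whitespace (\s+) and "-" is treated as a soft hyphen.
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.x-hyphen.wordBoundary=\\s+
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.x-hyphen.softHyphens=-
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.x-hyphen.alwaysRemoveSoftHyphens=true
```

Neither range sets analyzerClass, because the option descriptions above make it an error to combine these options with a different analyzer class. The subWordBoundary option is left out, so its never-matching default applies.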

Disable Stopwords

Use the following parameters in the blazegraph properties file to completely disable stopwords:

com.bigdata.search.FullTextIndex.analyzerFactoryClass=com.bigdata.search.ConfigurableAnalyzerFactory
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.eng.analyzerClass=org.apache.lucene.analysis.standard.StandardAnalyzer
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.eng.stopwords=none
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer._.like=eng