Solr Analyzer

An Analyzer consists of a Tokenizer and one or more TokenFilters. A Tokenizer is responsible for producing Tokens, which in most cases correspond to words to be indexed. A TokenFilter takes in Tokens from the Tokenizer and can modify or remove a Token before indexing. For instance, Solr's WhitespaceTokenizer breaks words on whitespace, and its StopFilter removes common words from the token stream so they are neither indexed nor matched. Other types of analysis include stemming, synonym expansion, and case folding. Chances are, if you need analysis done in a particular way for your application, Solr has one or more Tokenizers and TokenFilters to meet your needs.
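To make the pipeline concrete, here is a minimal sketch in plain Java of how a whitespace tokenizer followed by a lowercase filter and a stop filter transforms a field value. This is not the Lucene API; the class name and stop-word list are invented for illustration only.

```java
import java.util.*;
import java.util.stream.*;

// Conceptual sketch of an analysis chain: Tokenizer -> TokenFilters.
// (Plain Java, not Lucene; names here are illustrative.)
public class AnalysisPipeline {
    // A tiny, made-up stop-word list for the example.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "and", "of");

    public static List<String> analyze(String text) {
        return Arrays.stream(text.split("\\s+"))      // Tokenizer: break on whitespace
                .map(String::toLowerCase)             // TokenFilter: case folding
                .filter(t -> !STOP_WORDS.contains(t)) // TokenFilter: drop stop words
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "The Quick Brown Fox" -> [quick, brown, fox]
        System.out.println(analyze("The Quick Brown Fox"));
    }
}
```

Each stage consumes the previous stage's output, which is exactly how Solr chains a Tokenizer and its TokenFilters.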

You can also apply analysis to a query during a search operation. As a general rule, you should run the same analysis on a query as on the document to be indexed. Users new to these concepts commonly make the mistake of stemming document tokens but not query tokens, which often results in zero search matches.
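The mismatch described above can be seen with a hypothetical one-rule stemmer (stripping a trailing "ing" — a stand-in for a real Porter stemmer, invented here for illustration):

```java
// Sketch of the index/query stemming mismatch. The "stemmer" below is a
// made-up one-rule example, not a real stemming algorithm.
public class StemMismatch {
    public static String stem(String t) {
        return t.endsWith("ing") ? t.substring(0, t.length() - 3) : t;
    }

    public static void main(String[] args) {
        String indexedToken = stem("running"); // document side is stemmed
        String queryToken   = "running";       // query side is NOT stemmed

        // The stemmed index term no longer equals the raw query term,
        // so the search finds nothing.
        System.out.println(indexedToken.equals(queryToken));       // mismatch
        // Stemming both sides restores the match.
        System.out.println(indexedToken.equals(stem("running")));  // match
    }
}
```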

Analyzers are components that pre-process input text at index time and/or at search time. It is important that text be processed in a compatible manner at index time and at query time, which usually means using the same or similar analyzers for both. For example, if the index-time analyzer lowercases words, then the query-time analyzer should do the same so that the indexed words can be found.

On wildcard and fuzzy searches, no text analysis is performed on the search word.

The Analyzer class is an abstract class, but Lucene comes with a few concrete Analyzers that pre-process their input in different ways. If you need to pre-process input text and queries in a way that is not provided by any of Lucene's built-in Analyzers, you will need to specify a custom Analyzer in the Solr schema.

Tokens and Token Filters:

An analyzer splits up a text field into the tokens that the field is indexed by. An Analyzer is normally implemented by creating a Tokenizer that splits up a stream (normally a single field value) into a series of tokens. These tokens are then passed through a series of TokenFilters that add, change, or remove tokens. The field is then indexed by the resulting token stream.

Specifying an Analyzer in the schema:

A Solr schema.xml file allows two methods for specifying the way a text field is analyzed. (Normally only field types of solr.TextField will have Analyzers explicitly specified in the schema):

1. Specifying the class name of an Analyzer — anything extending org.apache.lucene.analysis.Analyzer.

<fieldtype name="nametext" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
</fieldtype>

2. Specifying a TokenizerFactory followed by a list of optional TokenFilterFactories that are applied in the listed order. Factories are used instead of the Tokenizer and TokenFilter classes themselves so that configuration can be prepared once, avoiding the per-instantiation overhead of creation via reflection.

<fieldtype name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldtype>
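When index-time and query-time analysis need to differ (for example, applying synonym expansion only at query time), the schema also allows separate analyzer sections distinguished by a type attribute. A sketch (field type name and synonyms file are illustrative):

```xml
<fieldtype name="text_syn" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
```

Both chains still process text compatibly (same tokenizer, same case folding), which keeps indexed and query terms aligned as discussed above.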

Any Analyzer, TokenizerFactory, or TokenFilterFactory may be specified using its full class name with package — just make sure they are in Solr's classpath when you start your appserver. Classes in the org.apache.solr.analysis.* package can be referenced using the short alias solr.*.

If you want to use custom Tokenizers or TokenFilters, you'll need to write a very simple factory that subclasses BaseTokenizerFactory or BaseTokenFilterFactory, something like this…

public class MyCustomFilterFactory extends BaseTokenFilterFactory {
  public TokenStream create(TokenStream input) {
    return new MyCustomFilter(input);
  }
}
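The custom filter itself is not shown above. As a conceptual sketch of what a TokenFilter does — here modeled with plain Java iterators standing in for Lucene's TokenStream, and a made-up hyphen-stripping rule — a filter is simply a decorator over an upstream token source:

```java
import java.util.Iterator;

// Conceptual sketch only: plain-Java stand-in for a Lucene TokenFilter.
// It wraps an upstream token source and transforms each token it emits.
public class MyCustomFilter implements Iterator<String> {
    private final Iterator<String> input;

    public MyCustomFilter(Iterator<String> input) {
        this.input = input;
    }

    public boolean hasNext() {
        return input.hasNext();
    }

    public String next() {
        // Example transformation (invented): strip hyphens from each token.
        return input.next().replace("-", "");
    }
}
```

A real Lucene TokenFilter follows the same wrap-and-transform pattern, but operates on the TokenStream API rather than on iterators.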

Char Filters:

A CharFilter is a component that pre-processes input characters. CharFilters can be chained, like TokenFilters, and are placed in front of a Tokenizer. CharFilters can add, change, or remove characters.
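In schema.xml, a CharFilter is declared inside the analyzer element, before the tokenizer. For example, stripping HTML markup before tokenization (field type name is illustrative):

```xml
<fieldtype name="text_html" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
```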

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License