Solr CommonGrams Filter

The CommonGrams filter is designed to work only on phrase queries. It addresses the problem of slow phrase queries on phrases containing common words, for cases where you don't want to use stop words. It does not make sense for Boolean queries, which are passed through unchanged.

For background on the CommonGramsFilter please see: http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2

There are two filters: CommonGramsFilter and CommonGramsQueryFilter.

Use CommonGramsFilter at index time and CommonGramsQueryFilter at query time.

CommonGramsFilter outputs both CommonGrams and Unigrams so that Boolean queries (i.e. non-phrase queries) will work. For example "the rain" would produce 3 tokens:

the (position 1)
rain (position 2)
the-rain (position 1)

When you have a phrase query, you want Solr to search for the token "the-rain", so you don't want the unigrams. That is why CommonGramsQueryFilter, rather than CommonGramsFilter, is used at query time.

When you have a Boolean query, the CommonGramsQueryFilter gets only one token at a time as input and simply outputs it.
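As a rough illustration of the index-time/query-time split described above, a field type might be configured along these lines. This is only a sketch: the field type name, the tokenizer, the lowercase filter, and the `commonwords.txt` file name are assumptions chosen for the example, not the configuration used in the blog posts.

```xml
<!-- Hypothetical field type: name, tokenizer and words file are assumptions -->
<fieldType name="text_commongrams" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index time: emits the unigrams plus the common-word bigrams -->
    <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Query time: keeps the bigrams and drops the unigrams they cover -->
    <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

Both analyzers should point at the same list of common words, otherwise the bigrams produced at query time will not match the ones in the index.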

For background on the problem with "l'art" please see: http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance

We used a custom filter to change all punctuation to spaces. You could probably use one of the other filters to do this. (See the comments from David Smiley at the end of the blog post regarding possible approaches.) At the time, I just couldn't get WordDelimiterFilter to behave as documented with various combinations of parameters, and I was not aware of the other filters David mentions.
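One possible way to turn punctuation into spaces without a custom filter is a pattern-replace char filter placed ahead of the tokenizer. This is a sketch of that idea only, not the custom filter mentioned above and not necessarily one of the approaches David Smiley suggests. The tokenizer and lowercase filter are assumptions, and whether the char filter is available depends on your Solr version.

```xml
<analyzer>
  <!-- Replace ASCII punctuation with a space before tokenizing.
       \p{Punct} does not cover typographic quotes such as U+2019. -->
  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\p{Punct}" replacement=" "/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

With a chain like this, "l'art" becomes the two tokens "l" and "art", which is exactly the situation where the query parser may decide to build a phrase query.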

The problem with "l'art" is actually due to a bug or feature in the QueryParser. Currently the QueryParser interacts with the token chain and decides whether the tokens coming back from a token filter should be treated as a phrase query, based on whether more than one non-synonym token comes back from the token stream for a single 'queryparser token'.

It also splits on whitespace, which causes all CJK queries to be treated as phrase queries regardless of the CJK tokenizer you use. This is a contentious issue. See https://issues.apache.org/jira/browse/LUCENE-2458. There is a semi-workaround using PositionFilter, but it has many undesirable side effects. I believe Robert Muir, who is an expert on the various problems involved and opened LUCENE-2458, is working on a better fix.

http://www.hathitrust.org/blogs/large-scale-search

If you do not want all queries to be phrase queries, you should use:

<fieldType name="text" class="solr.TextField" autoGeneratePhraseQueries="false">

Then the lack of whitespace between words will not cause phrase queries. With this option, a phrase query is generated only when the user explicitly puts terms in double quotes.
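For reference, a complete, well-formed field type using this option might look like the following. The analyzer chain here is an assumption added only to make the fragment complete; substitute whatever tokenizer and filters you actually use.

```xml
<!-- Sketch only: the analyzer contents are placeholders -->
<fieldType name="text" class="solr.TextField" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```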

This is in reference to Tom's comment on his "l'art" problem (http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance).

There are two problems:

  1. That the queryparser "pre-tokenizes" on whitespace at all.
  2. That the queryparser forms a phrase query, if the analyzer returns more than one position back from a "queryparser token" (whitespace).

Turning off autoGeneratePhraseQueries solves only problem #2 (automatic phrase generation is not appropriate for many languages). Before this option existed (e.g. Solr 1.4.x), you had to use the PositionFilter to work around this problem. But PositionFilter simply "flattens/stacks" the positions (makes it seem as if they are all synonyms). With PositionFilter you couldn't have phrase queries at all, and you don't get a BooleanQuery coordination factor.
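For historical context, the PositionFilter workaround was typically applied to the query analyzer only, roughly as sketched here. The tokenizer is an assumption, and PositionFilterFactory may not be available in newer Solr releases, so check your version before relying on it.

```xml
<!-- Historical sketch for Solr 1.4.x-era setups; tokenizer choice is an assumption -->
<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Stacks all tokens at one position so they look like synonyms,
       which keeps the query parser from building a phrase query -->
  <filter class="solr.PositionFilterFactory"/>
</analyzer>
```

Because every token ends up at the same position, explicitly quoted phrases stop working as phrases too, which is the main side effect described above.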

With autoGeneratePhraseQueries=false, you won't get a phrase query unless it was in double quotes. It's just that simple.

Fixing problem #1 altogether is the way to go, because then the tokenization would be left to the analyzer completely, and you would have a lot more flexibility: https://issues.apache.org/jira/browse/LUCENE-2605

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License