Full Text Search

Synopsis:

ALTER TABLE news ADD FULLTEXT(headline, story);
ALTER TABLE products ADD FULLTEXT ft_index_name (pname,psku, pdescription);
CREATE FULLTEXT INDEX ft_index_name ON products (pname, psku, pdescription); 

SELECT headline, story FROM news WHERE MATCH (headline,story) AGAINST ('Hurricane');
SELECT headline, story FROM news WHERE MATCH (headline,story) AGAINST ('+Hurricane -Katrina' IN BOOLEAN MODE);
SELECT headline, story FROM news WHERE MATCH (headline,story) AGAINST ('Katrina' WITH QUERY EXPANSION);

SELECT * FROM fulltext_sample WHERE MATCH(colname) AGAINST('good,alert'); // query for two words
SELECT copy,MATCH(copy) AGAINST('good,alert') AS relevance FROM fulltext_sample WHERE MATCH(copy) AGAINST('good,alert');
SELECT * FROM stuff WHERE MATCH(secret_string) AGAINST(+"car rental" IN BOOLEAN MODE) ORDER BY FREQ ASC;
SELECT id,title,MATCH (title,body) AGAINST ('cvs') AS score FROM articles WHERE MATCH (title,body) AGAINST ('cvs') ORDER by score DESC

ALTER TABLE foo ENGINE=MyISAM; // Re-index
REPAIR TABLE foo QUICK; // Re-index

Full-text searching is performed using MATCH() … AGAINST syntax. MATCH() takes a comma-separated list that names the columns to be searched. AGAINST takes a string to search for, and an optional modifier that indicates what type of search to perform. The search string must be a literal string, not a variable or a column name.

The column name passed to Match() must be the same as the one used in FULLTEXT() definition. If multiple columns are specified, all of them must be listed, and in the correct order.

The result of this query is automatically sorted by relevancy.

The MATCH function is used to specify the column names that identify your FULLTEXT collection. The column list inside the MATCH function must exactly match that of the FULLTEXT index definition, unless your search in boolean mode (see below).

The AGAINST function is where your full text search query goes. Besides the default natural language search mode, you can perform boolean mode searches, and use query expansion.

Why do we sometimes put the MATCH(…) AGAINST(…) before the WHERE clause?

To obtain the relevance. When the IN BOOLEAN MODE is used, the relevance may not be calculated correctly. This is because the algorithm that is used to calculate relevance ranking is simple when IN BOOLEAN MODE is used, thus lending itself poorly for providing relevant documents. If this is what you experience, you may want to use this query:

SELECT MATCH(content) AGAINST('keyword1 keyword2') AS relevance FROM table1 WHERE MATCH(content) AGAINST('+keyword1 +keyword2' IN BOOLEAN MODE) HAVING relevance > 0.2 ORDER BY relevance DESC

In the above query, we are using MATCH(…) AGAINST(…) before the WHERE clause to obtain the relevance. And after the WHERE clause, we use MATCH(…) AGAINST(… IN BOOLEAN MODE). The MATCH(…) AGAINST(…) clauses that come before and after 'WHERE' do not have to match. The HAVING clause is used to further restrict the result of the search.

It appears that there was no performance loss due to the fact that we have two MATCH(…) AGAINST(…) portions, but you may want to measure this on your own systems and datasets.

Details:

Why should we use fulltext search?

Wildcard and regular expression matching requires that MySQL try and match each and every row in a table, and table indexes are rarely used. As such, these searches can be very time consuming as the number of rows to be searched grows. When full-text searching is used, MySQL does not need to look at each row individually, analyzing and processing each word individually. Rather, an index of words (in specified columns) is created by MySQL, and searches can be made against those words. MySQL can thus quickly and efficiently determine which words match (which rows contains them), which don't, how often they match, and so on.

What storage engine support fulltext search?

Currently only MyISAM supports fulltext search. Fulltext search using InnoDB is coming in version 5.7

By default, fulltext search is not case sensitive, how can we get around it?

Full-text search are not case sensitive, unless BINARY mode is used.

How to include related search terms?

Use the WITH QUERY EXPANSION clause:

SELECT note_text FROM productnotes WHERE Match(note_text) Against('anvils' WITH QUERY EXPANSION);

The Blind Query Expansion (or automatic relevance feedback) feature can be used to expand the results of the search. This often includes much more noise, and makes for a very fuzzy search. In most cases you would use this operation if the users query returned just a few results, you try it again WITH QUERY EXPANSION and it will add words that are commonly found with the words in the query.

SELECT headline, story FROM news WHERE MATCH (headline,story) AGAINST ('Katrina' WITH QUERY EXPANSION);

The above query might return all news stories about hurricanes, not just ones containing Katrina.

What are the three types of fulltext search supported by MyISAM?

There are three types of full-text searches:

  • Boolean search interprets the search string using the rules of a special query language. The string contains the words to search for. It can also contain operators that specify requirements such that a word must be present or absent in matching rows, or that it should be weighted higher or lower than usual. Common words such as “some” or “then” are stopwords and do not match if present in the search string. The IN BOOLEAN MODE modifier specifies a boolean search. For more information, see Section 11.9.2, “Boolean Full-Text Searches”.
  • Natural language search interprets the search string as a phrase in natural human language (a phrase in free text). There are no special operators. The stopword list applies. In addition, words that are present in 50% or more of the rows are considered common and do not match. Full-text searches are natural language searches if no modifier is given.
  • Query expansion search is a modification of a natural language search. The search string is used to perform a natural language search. Then words from the most relevant rows returned by the search are added to the search string and the search is done again. The query returns the rows from the second search. The WITH QUERY EXPANSION modifier specifies a query expansion search. For more information, see Section 11.9.3, “Full-Text Searches with Query Expansion”.

What is a boolean search?

A boolean search interprets the search string using the rules of a special query language. The string contains the words to search for. It can also contain operators that specify requirements such that a word must be present or absent in matching rows, or that it should be weighted higher or lower than usual. Common words such as “some” or “then” are stopwords and do not match if present in the search string. The IN BOOLEAN MODE modifier specifies a boolean search.

For example, a search string that contains plus and minus signs is a boolean search.

What is a natural language search?

Naturual language search interprets the search string as a phrase in natural human language (a phrase in free text). There are no special operators. The stopword list applies. In addition, words that are present in 50% or more of the rows are considered common and do not match. Full-text searches are natural language searches if no modifier is given.

What is a query expansion search?

Query expansion search is a modification of a natural language search. The search string is used to perform a natural language search. Then words from the most relevant rows returned by the search are added to the search string and the search is done again. The query returns the rows from the second search. The WITH QUERY EXPANSION modifier specifies a query expansion search.

How to filter out results that contain certain word?

SELECT headline, story FROM news WHERE MATCH (headline,story) AGAINST ('+Hurricane -Katrina' IN BOOLEAN MODE);

The above statement would match news stories about hurricanes but not those that mention hurricane katrina.

Can a fulltext index contains multiple columns?

Yes.

How does MySQL fulltext calculate relevance?

The relevance calculation is fairly complex and is based upon the number of words in the index, the number of unique words in that row, the total number of words in both the index and the result, as well as the weight of the word (for the average English sentence, the word 'cool' will be weighted less, and therefore have a lower relevance, than the word 'dandy', though trends can change!). The figures may differ in your version of MySQL, as MySQL does occasionally tweak the calculation algorithms.

How to alter the minimum word length?

By default, the indexes you create will only include words of between four and twenty characters in length. This is something that you can adjust if you need to. You should try to think of the search terms that your visitors are likely to use and adjust this setting accordingly. The minimum word length of four-letters would not allow visitors to search for terms such as "SSH" or "CVS" which may be a problem for your site. To alter the minimum word length, you should modify the [mysqld] section and the [myisamchk] section to include:

# Full Text Minimum Word Length
ft_min_word_len=3
ft_stopword_file="C:\\MySQL\\stop.txt"

How to disable stopwords?

You can disable stopwords by setting the variable ft_stopword_file to an empty string.

How to make sure that the results match some term but does not match another term?

Use BOOLEAN search mode:

-term // Make sure that the term mentioned does not appear in the result
+term // Make sure that the term mentioned does appear in the result

SELECT id, title FROM articles WHERE MATCH(title, body) AGAINST ("+cvs -ssh" IN BOOLEAN MODE);

How to get around the 50% threshold?

Use IN BOOLEAN MODE:

SELECT id, title FROM articles WHERE MATCH(title, body) AGAINST ("ssh" IN BOOLEAN MODE);

If a word is present in more than 50% of the rows it will have a weight of zero. This has advantages on large datasets, but can make testing difficult on small ones. To get around this problem, use IN BOOLEAN MODE.

What are the characteristics of boolean search?

  • They does not use the 50% threshold
  • They do not automatically sort rows in order of decreasing relevance. To sort by relevance, you must use the double MATCH(…) AGAINST(…) construct (one before WHERE and one after WHERE) as mentioned elsewhere on this page.

What is the meaning of the plus and minus signs?

  • A leading plus sign indicates that this word must be present in each row of the result
  • A leading minus sign indicates that this word must not be present in any row of the result. NOTE: The minus operator acts only to exclude rows that are otherwise matched by other search terms. Thus, a boolean mode search that contains only terms preceded by minus signs returns an empty result. It does not return "all rows except those containing the excluded terms."

When there is no operator specified for a IN BOOLEAN MODE query, what does it mean?

When no operator is specified, the word is optional, but the rows that contains it are rated higher.

What is the meaning of the > operator?

The > operator increase the contribution to the relevance (give this word more weight)

What is the meaning of the < operator?

The < operator decrease contribution to the relevance (give this word less weight)

What is the meaning of parentheses operator?

Groups words into sub-expressions. Parenthesized groups can be nested.

What is the meaning of the tilde operator?

A leading tilde (~) acts as a negation operator, causing the word's contribution to the row's relevance to be negative. This is useful for marking "noise" words. A row containing such a word is rated lower than others, but is not excluded.

What is the meaning of the asterisk (*) operator?

The asterisk serves as a wild-card operator. Unlike other operator, this operator should be appended to the word to have its effect. Words match if they begin with the word preceding the * operator. If a word is specified with the * operator, it is not stripped from a boolean query even if it is too short or a stop-word. This occurs because the word is not seen as too short or a stop-word, but as a prefix that must be present in the document in the form of a word that begin with the prefix.

What is the meaning of "double quotes" operator?

A phrase that is enclosed within double-quote characters matches only rows that contain the phrase literally, as it was typed. The full-text engine splits the phrase into words and performs a search in FULLTEXT index for the words. Nonword characters need to be matched exactly. Phrase searching requires only that matches contains exactly the same words as the phrase and in the same order. For example, "test phrase" matches "test, phrase". "some words" matches "some words of wisdom" but not "some noise words". NOTE: the double quote characters that enclose the phrase are operator characters that delimit the phrase. They are not the quotation marks that enclose the search string itself. In other words, the double-quotes characters must already be part of the search string.

How to tune a fulltext index?

  • Reduce the size of the index by increasing the ft_min_word_len setting.
  • Do not index column of type TEXT. Pre-process the text, and put the result into a separate column / table, and create the fulltext index on the new column / table.
  • Create the stopword files for your language
  • Use "REPAIR TABLE xxx QUICK" to rebuild the index.
  • Make note of the size of the index file (xxx.MYI) before and after you make any changes
  • Use ft_dump utility inspect the fulltext index
  • Limit the number of results by applying a LIMIT clause
  • Follow the normal procedure for tuning databases / queries
  • Use faster CPUs
  • Add more memory
  • Use faster hard drive / RAID controller
  • key_buffer_size
  • If fulltext have performance impact on other part of your application, you may want to put fulltext search on a separate database server, or use another application such as Apache Solr that is dedicated for fulltext search. Taking search out of database also give you capability to customize stemmer and query parser.

Miscellaneous:

For large data sets, it is much faster to load your data into a table that has no FULLTEXT index and then create the index after that, than to load data into a table that has an existing FULLTEXT index.

A couple points about Full-Text searching in MySQL:

  • Searches are not case sensitive
  • Short words are ignored, the default minimum length is 4 characters. You can change the min and max word length with the variables ft_min_word_len and ft_max_word_len
  • Words called stopwords are ignored, you can specify your own stopwords, but default words include the, have, some - see default stopwords list.
  • You can disable stopwords by setting the variable ft_stopword_file to an empty string.
  • Full Text searching is only supported by the MyISAM storage engine.
  • If a word is present in more than 50% of the rows it will have a weight of zero. This has advantages on large datasets, but can make testing difficult on small ones.
  • MySQL does not index any words less than or equal to 3 characters in length, nor does it index any words that appear in more than 50% of the rows. This means that if your table contains 2 or less rows, a search on a FULLTEXT index will never return anything. In future, MySQL will make this behavior more flexible, but for now it should suit most natural language uses. If most fields in your database contain the word 'music', you probably don't want these records returned, You can use the IN BOOLEAN MODE modifier to get around the 50% threshold
  • Results are returned in order of relevance, from highest to lowest.

A list of the main features of a standard FULLTEXT search follows:

  • Excludes partial words
  • Excludes words less than 4 characters in length (3 or less)
  • Excludes words that appear in more than half the rows (meaning at least 3 rows are required)
  • Hyphenated words are treated as two words
  • Rows are returned in order of relevance, descending
  • Words in the stopword list (common words) are also excluded from the search results. The stopword list is based upon common English words, so if your data is used for a different purpose, you'll probably want to change the list. Unfortunately, doing so at present is not easy. You'll need to edit the file myisam/ft_static.c. recompile MySQL, and rebuild the indexes!

MySQL does not index words that contains a hyphen. The hyphen split the word into two separate words. If you search for each individual words, MySQL returns the results. But if you search for the whole word, MySQL returns no result. The same is true for any of the special text search modifiers such as the plus sign, the tilde.

Fulltext search is slow when there are a lot of rows in the table. The nature of fulltext indexing is such that the more words that appear in the columns you index, the size of the index will be bigger, and the time it takes to search that index will be longer. You can try to reduce the size of the index by increasing the ft_min_word_len setting or only index a portion of a column. For example, only index the first 200 characters or 100 words. Do not index the whole column of type TEXT. We may need to extract, pre-process the text, put the pre-processed text into a separate column or table, and create a fulltext index on this column / table.

To index Chinese text, we may need to set ft_min_word_len=1.

To index Japanese text, make sure that words from your Japanese text are separated by the ASCII space character, not Japanese UTF-8 (or other) spacing characters.

When using phpMyAdmin to manage data or write SQL query, you must switch away from your Japanese IME to insert a space character.

One way to improve the performance of fulltext search when it need to be done on multiple tables is to create a temporary table:

CREATE TEMPORARY TABLE xxx SELECT id, name, MATCH(name) AGAINST ('search_string') AS relevancy FROM TABLE 1;
INSERT INTO xxx SELECT id, name, MATCH(name) AGAINST('search_string') AS relevancy FROM table2;

SELECT id, name FROM xxx ORDER BY relevancy DESC

It is LEGAL to use two different arguments with the "double match construction":

SELECT *, MATCH(artist,album,title) AGAINST('blues in orbit') FROM musicfiles WHERE MATCH(artist,album,title) AGAINST('ellington');

This will find all records with 'ellington' as a substring of artist, album, or title, but will RATE them based on 'blues in orbit'. You can even ORDER BY or GROUP BY MATCH(kunstner, albumtitel, titel) AGAINST ('prelude to a kiss') or against anything else.

One solution to find a word with dashes or hyphens is to use IN BOOLEAN MODE and enclose the word in double quotes.

MATCH(…) AGAINST(…) is case-insensitive. It is also accent-insensitive. In other words, if you have words that include accents, you have no choice but to use IN BOOLEAN MODE with the double quote operator. Each word that includes accent must be inside a pair of double-quote characters by itself.

We can use multiple MATCH(…) AGAINST(…) constructs (each with its own set of columns) before the WHERE clause to give different weights for different columns. See http://dev.mysql.com/doc/mysql/en/fulltext-boolean.html

If we have certain words that are too short, and we want to be able to index those words, without lowering the ft_min_word_len setting, we can preprocess the text:

  1. Create a new column (just for indexing and searching purpose)
  2. Read value from the original column. Search through the value and replace relevant short word by prepending some constant string to those words (force those short words to meet the minimum length requirement), and store the result in the new column
  3. When the user searches, preprocess the query in the same way.

http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
http://dev.mysql.com/doc/mysql/en/fulltext-boolean.html
http://www.databasejournal.com/sqletc/article.php/1578331/Using-Fulltext-Indexes-in-MySQL---Part-1.htm
http://jeremy.zawodny.com/blog/archives/000576.html
http://www.debian-administration.org/article/215/Searching_MySQL_databases_with_fulltext_indexes

http://www.drdobbs.com/database/full-text-search-with-innodb/231902587
http://blogs.innodb.com/wp/2011/12/innodb-full-text-search-in-mysql-5-6-4/
http://www.skysql.com/blogs/kolbe/initial-impressions-innodb-fulltext
http://www.mysqlperformanceblog.com/2009/09/10/what-to-do-with-mysql-full-text-search-while-migrating-to-innodb/
http://blogs.innodb.com/wp/2011/07/innodb-full-text-search-tutorial/

http://blog.dev001.net/2009/01/implementing-an-autosuggest-feature-using-mysql-fulltext-indices/
http://jacksonkr.com/content/searching-mysql-database-exact-word

http://ftp.nchu.edu.tw/MySQL/tech-resources/articles/full-text-revealed.html
http://azgtech.wordpress.com/2011/05/27/mysql-fulltext-secret-bug-i-mean-feature/
http://perplexed.co.uk/234_full_text_fine_tunning.htm
http://stackoverflow.com/questions/796688/how-to-reset-stop-words-in-mysql

http://www.mysqlperformanceblog.com/2007/08/28/to-sql_calc_found_rows-or-not-to-sql_calc_found_rows/
http://dev.mysql.com/doc/refman/5.0/en/information-functions.html
http://www.michikono.com/2007/08/07/the-secret-of-sql_calc_found_rows/
http://stackoverflow.com/questions/186588/which-is-fastest-select-sql-calc-found-rows-from-table-or-select-count
http://wordpress.org/support/topic/slow-queries-sql_calc_found_rows-bringing-down-site
http://cherry.world.edoors.com/C8viW3s_pBA0

Use another application that is dedicated for fulltext searching

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License