Solr Combined Versus Single Index

One combined index is often easier to use than a separate index for each entity type, because there is only one configuration to manage. A key deciding factor is whether you need to query across entity types, which is an easy task with one index. Solr can also search across multiple indices (a so-called "distributed search"); however, it is a relatively new capability that has limitations.
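As a minimal sketch of querying a combined index, the Python snippet below builds a select URL that searches every entity type, and a second one that narrows the results to a single type with a filter query (Solr's fq parameter). The base URL, the field name "type", and the value "Artist" are assumptions made for illustration; they are not taken from the example schema.

```python
from urllib.parse import urlencode

# Hypothetical URL of the select handler for one combined index.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_query_url(user_query, entity_type=None):
    """Build a select URL, optionally restricted to one entity type."""
    params = [("q", user_query)]
    if entity_type is not None:
        # Assumes the schema has a field named "type" that stores the
        # entity type of each document.
        params.append(("fq", f"type:{entity_type}"))
    return SOLR_SELECT_URL + "?" + urlencode(params)


# Search across all entity types in the combined index.
all_types_url = build_query_url("name:smashing")

# Same query, narrowed to artists with a filter query.
artists_only_url = build_query_url("name:smashing", "Artist")
```

With separate indices, the second request would instead go to the artist index directly, and the first would need a distributed search.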

"Distributed search" is mainly used for scaling a large index by breaking it up. See Chapter 9 for more information. However, large or not, it can also be used to search across heterogeneous indices.

The approaches can be mixed and matched for different entities if that suits you.

Start with one index before complicating your indexing strategy with more. If some of the issues listed next apply to you, then expand to multiple indices. In the example configuration using MusicBrainz, each entity type gets its own index, though one schema is used to ease configuration.

Problems with using a single combined index:

  • There may be namespace collision problems unless you prefix the field names by type, as in artist_startDate and track_PUID. In the example that we just saw, most entity types have a name, so it's straightforward for all of them to share this common field. If the field types were different, then you would be forced to name the fields differently.
  • If you share the same field for different things (like the name field in the example that we have just seen), then the following problems can occur when you use that field in a query that also filters documents by document type. These caveats do not apply when searching across all documents.
      ◦ You will get scores that are of lesser quality. The explanation for this is a little complicated, and you may need to read up on Lucene scoring to understand it better. One component of a score is the IDF (Inverse Document Frequency) of a term in the search. In other words, documents matching rare words get scored higher. IDF is based on all of the values of the field in the index, no matter what type of document they come from. If you put different types of things into the same field, then a word that is rare among track names might not be rare among artist names. Therefore, a search restricted to only tracks or only artists would not use good IDF factors when computing the score.
      ◦ Prefix, wildcard, and fuzzy queries will take longer and will be more likely to reach internal scalability thresholds. These query types require scanning all of the indexed terms in a field to see whether they match the queried term. If you share a field among different types of documents, then there are more terms to scan, which takes longer. The query will also match more terms than it would otherwise, and may expand into a query that exceeds the maxBooleanClauses threshold (configurable in solrconfig.xml).
  • With or without sharing field names, the IDF component of the score calculation is diluted, because the ratio is based on the total number of documents indexed, not just those of the type being queried.
  • If you do not share field names and instead prefix them with a short type identifier (track_name instead of just name), then the prefixed names may be inconvenient or awkward if users are exposed to them.
  • For a large number of documents, a strategy using multiple indices will prove to be more scalable. Only testing will indicate what "large" is for your data and your queries, but an index of fewer than a million documents is unlikely to benefit from being split. Ten million documents has been suggested as a reasonable maximum for a single index. There are seven million tracks in MusicBrainz, so we'll definitely have to put tracks in their own index.
  • Committing changes to a Solr index invalidates the caches used to speed up querying. If this happens often, and the changes are usually to one type of entity in the index, then you will get better query performance by using separate indices.
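The IDF effects described in the list above can be illustrated with a little arithmetic. The sketch below uses the formula from Lucene's classic TF-IDF similarity, idf = 1 + ln(numDocs / (docFreq + 1)); other Lucene versions and similarity implementations differ in detail. Apart from the seven million tracks mentioned above, all of the counts are invented for illustration and are not MusicBrainz statistics.

```python
import math


def classic_idf(doc_freq, num_docs):
    """IDF as in Lucene's classic TF-IDF similarity (assumed formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical counts, except for the track total.
num_tracks = 7_000_000
num_artists = 400_000
track_doc_freq = 50       # track names containing the term
artist_doc_freq = 20_000  # artist names containing the same term

# Separate track index: the term is rare, so its IDF is high.
idf_tracks_only = classic_idf(track_doc_freq, num_tracks)

# Combined index with a shared "name" field: the document frequency
# now also counts artist names, so the term no longer looks rare.
idf_shared_field = classic_idf(
    track_doc_freq + artist_doc_freq, num_tracks + num_artists
)

# Combined index with a prefixed field (e.g. artist_name): docFreq is
# per field, but numDocs still counts every document in the index.
idf_artists_only = classic_idf(artist_doc_freq, num_artists)
idf_prefixed_field = classic_idf(artist_doc_freq, num_tracks + num_artists)

print(f"tracks only:    {idf_tracks_only:.2f}")
print(f"shared field:   {idf_shared_field:.2f}")
print(f"artists only:   {idf_artists_only:.2f}")
print(f"prefixed field: {idf_prefixed_field:.2f}")
```

Under these assumed counts, the shared field lowers the IDF of a term that is rare among tracks, and the prefixed field still shifts the artist IDF away from its single-index value because the total document count includes the tracks.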