schema.xml

Do we have to specify a unique ID field?

Although it is not required, we should define a unique ID field. A unique ID allows specific documents to be updated or deleted, and several other Solr features depend on it. If your source data does not have an ID field that you can propagate, Solr can generate one: declare a field whose field type has the class solr.UUIDField.
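
As a rough sketch of what such a declaration might look like: the type name uuid and the field name id below are illustrative assumptions, and the default="NEW" setting is how older Solr releases asked for a generated value. Newer releases may need a different mechanism, so check the documentation for your version.

```xml
<types>
    <!-- Assumed type name "uuid"; solr.UUIDField is the class named in the text -->
    <fieldtype name="uuid" class="solr.UUIDField" indexed="true"/>
</types>

<fields>
    <!-- default="NEW" requests a freshly generated UUID when the input has no id
         (behavior of older Solr releases; treat as an assumption for yours) -->
    <field name="id" type="uuid" indexed="true" stored="true" default="NEW"/>
</fields>

<!-- Tells Solr which field uniquely identifies each document -->
<uniqueKey>id</uniqueKey>
```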


The schema file (schema.xml) is organized in three sections:

  • Types
  • Fields
  • Other declarations

In the <types> section are common, reusable definitions of how fields should be processed by Solr. The field types declared at the top of the <types> section, like sint and boolean, are used to store primitive types in Solr. For the most part, Lucene only deals with strings, so integers, floats, dates, and doubles require special handling to be searchable.

Using these field types tells Solr how to treat the content to be indexed, so the appropriate special handling is applied with no intervention on your part.
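
For illustration, primitive types such as sint and boolean were declared roughly like this in the example schema shipped with older Solr releases. The attribute values shown are assumptions taken from that example rather than requirements.

```xml
<types>
    <!-- Boolean values -->
    <fieldtype name="boolean" class="solr.BoolField"
               sortMissingLast="true" omitNorms="true"/>
    <!-- Integers encoded as strings that sort and range-query correctly -->
    <fieldtype name="sint" class="solr.SortableIntField"
               sortMissingLast="true" omitNorms="true"/>
</types>
```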

In many instances in the Solr schema, the class attribute is abbreviated to something like solr.TextField. This is simply shorthand for org.apache.solr.schema.TextField. Any valid class in the classpath that extends the org.apache.solr.schema.FieldType class may be used.

<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="0"
                catenateNumbers="0" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
</fieldtype>

In the above listing, we declared two different analyzers (one for indexing and one for query analysis). The two definitions are not identical: the query analyzer adds synonym expansion, and its WordDelimiterFilterFactory does not catenate words or numbers. Stemming, stopword removal, and similar operations are still applied to the tokens in both cases, so indexing and searching produce the same kinds of tokens.

We declared the tokenizer first and then any filters used.

The configuration above works as follows:

  • Tokenize on whitespace, then remove any common words (StopFilterFactory)
  • Handle special cases with dashes, case transitions, etc. (WordDelimiterFilterFactory)
  • Lowercase all terms (LowerCaseFilterFactory)
  • Stem using the Porter Stemming algorithm (EnglishPorterFilterFactory)
  • Remove any duplicates (RemoveDuplicatesTokenFilterFactory)

The schema file (schema.xml) contains field type definitions (defined within the <types> tag) and lists the fields that make up your schema (within the <fields> tag), each of which references a type. The schema contains other information too, such as the unique key (the field that uniquely identifies each document, a constraint that Solr enforces) and the default search field.
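
Put together, the overall shape of the file can be sketched as follows. The schema name, version, and the field names id and text are placeholders chosen for illustration, not values taken from any particular installation.

```xml
<schema name="example" version="1.1">
    <types>
        <!-- Reusable field type definitions -->
        <fieldtype name="string" class="solr.StrField"/>
        <fieldtype name="text" class="solr.TextField" positionIncrementGap="100"/>
    </types>

    <fields>
        <!-- Each field references one of the types above -->
        <field name="id" type="string" indexed="true" stored="true" required="true"/>
        <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
    </fields>

    <!-- Other declarations -->
    <uniqueKey>id</uniqueKey>
    <defaultSearchField>text</defaultSearchField>
</schema>
```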

It is common for the schema to call for certain fields to be copied to other fields, particularly to fields that are not in the input documents. So, even though the input documents don't have a field named text, there are <copyField> tags in the schema that call for the fields named cat, name, manu, features, and includes to be copied to text. This is a popular technique to speed up queries, so that queries can search over a small number of fields rather than a long list of them. Fields used this way are usually indexed but rarely stored, as they are needed only for querying.
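
The copy rules described above would look something like this. The source and destination field names come from the text; the attributes on the text field are an assumption that reflects the indexed-but-not-stored usage.

```xml
<!-- Catch-all search field: indexed for querying, not stored for display -->
<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>

<!-- Copy each source field into the catch-all field at index time -->
<copyField source="cat" dest="text"/>
<copyField source="name" dest="text"/>
<copyField source="manu" dest="text"/>
<copyField source="features" dest="text"/>
<copyField source="includes" dest="text"/>
```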

Solr schema strategy is driven by how the index is queried and not by a standard third normal form decomposition of the data. This isn't to say that all databases are pure third normal form and that they aren't influenced by queries. But in an index, the queries you need to support completely drive the schema design. This is necessary, as you can't perform relational queries on an index. Consequently, all the data needed to match a document must be in the document matched. To satisfy that requirement, data that would otherwise exist in one place (like an artist's name in MusicBrainz, for example) is inlined into related entities that need it to support a search. This may feel dirty, but I'll just say "get over it". Besides, your data's gold source most likely is not in Solr.

Multi-valued fields maintain ordering, so if related data such as artist IDs and artist names is kept in two separate multi-valued fields, the two fields have corresponding values at a given position.

What you should not do is try to shove different types of data into the same field, for example by putting both the artist IDs and names into one field. It could introduce text analysis problems, as the field would have to satisfy both types, and it would require the client to parse out the pieces. The exception is a field that you are not indexing: if you are merely storing it for display, then you can store whatever you want in it.
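
A minimal sketch of the separate-fields approach follows. The field names artist_id and artist_name are hypothetical, and the string and text types are assumed to be declared in the <types> section.

```xml
<!-- Hypothetical names: one multi-valued field per kind of data -->
<!-- IDs are kept as untokenized strings -->
<field name="artist_id" type="string" indexed="true" stored="true" multiValued="true"/>
<!-- Names go through text analysis so they can be searched by word -->
<field name="artist_name" type="text" indexed="true" stored="true" multiValued="true"/>
```

Because ordering is preserved, a client reading a stored document can pair the first artist_id with the first artist_name, the second with the second, and so on.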

It is impossible to query Solr for releases that have an event in the UK that was over a year ago. The issue is that the criteria for this hypothetical search involve multi-valued fields, where the position of a value matching one criterion needs to correspond to the same position in another multi-valued field of the same document. You can't do that. But let's say that this crazy search example was important to your application, and you had to support it somehow. In that case, note that there is exactly one release for each event, and a query matching an event shouldn't match any other events for that release. So you could make event documents in the index, and then searching the events would yield the releases that interest you. This scenario had a somewhat easy way out. However, there is no general step-by-step guide. There are scenarios that will have no solution, and you may have to compromise. Frankly, Solr (like most technologies) has its limitations. Solr is not a general replacement for relational databases.
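
One way the event-document idea might be expressed in the schema is sketched below. Every field name here is hypothetical, and the string and date types are assumed to be declared in the <types> section.

```xml
<!-- Hypothetical fields for one document per release event -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<!-- The release this event belongs to; returned so the client can find the release -->
<field name="release_id" type="string" indexed="true" stored="true"/>
<!-- Single-valued, so country and date always describe the same event -->
<field name="event_country" type="string" indexed="true" stored="true"/>
<field name="event_date" type="date" indexed="true" stored="true"/>
```

With documents shaped like this, a query along the lines of event_country:UK AND event_date:[* TO NOW-1YEAR] (assuming the country is indexed as the literal value UK) would match only events that satisfy both criteria, and the release_id of each hit identifies the release.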
