Solr Practical usage and requirements

Sorting on a field requires that an array of the corresponding type be
constructed for that field; the size of the array is maxDoc
(i.e. the number of documents in your index, including deleted documents).

If you are using TrieInts, and have an index with no deletions, sorting
~14.7 million docs on 1000 different int fields will take up about ~55GB.

That's a minimum just for the sorting of those int fields (SortableIntField,
which keeps a string version of the field value, will be significantly
bigger) and doesn't take into consideration any other data structures used
for searching.
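The arithmetic behind that ~55GB figure can be sketched as one int[] of length maxDoc per sorted field, at 4 bytes per entry (the class and method names below are illustrative, not Solr API):

```java
// Rough sketch of the field-cache sizing claim above: sorting on an int
// field builds one int[] of length maxDoc, i.e. 4 bytes per doc per field.
public class SortMemoryEstimate {
    static long sortCacheBytes(long maxDoc, int numIntSortFields) {
        return maxDoc * 4L * numIntSortFields; // 4 bytes per int entry
    }

    public static void main(String[] args) {
        long bytes = sortCacheBytes(14_700_000L, 1000);
        // 58.8 billion bytes, i.e. roughly 55 GiB
        System.out.printf("~%.1f GiB%n", bytes / (1024.0 * 1024 * 1024));
    }
}
```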

I'm not a GC expert, but based on my limited understanding your graph
actually seems fine to me, particularly the part where it says
you've configured a max heap of ~122GB of RAM, and it's
never spent any time doing ConcurrentMarkSweep. My uneducated
understanding of those two numbers is that you've told the JVM it can use
an ungodly amount of RAM, so it is. It's done some basic cleanup of the
young generation (ParNew), but because the heap size has never gone above 50GB,
it hasn't found any reason to actually start a CMS GC to look for dead
objects in the old generation that it can clean up.

You can also sort on a field by using a function query instead of the
"sort=field+desc" parameter. This will not eat up memory, but it will be
slower. In short, it is a classic speed vs. space trade-off.

You'll have to benchmark and decide which you want, and maybe some
fields need the fast sort and some can get away with the slow one.
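The two approaches might look like this as request parameters ("popularity" is an illustrative field name; the function-query form puts the field value into the score and sorts by that, per the trade-off described above):

```
# Field sort: fast, but builds an in-memory array of size maxDoc
q=*:*&sort=popularity+desc

# Function-query sort: the field value becomes the score, then sort by score
q={!func}popularity&sort=score+desc
```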


DIH - deleting documents, high performance (delta) imports, and passing parameters

Experience with large merge factors:

At some point we will need to re-build an index that totals about 3 terabytes in size (split over 12 shards). At our current indexing speed we estimate that this will take about 4 weeks. We would like to reduce that time. It appears that our main bottleneck is disk I/O during index merging.

Each index is somewhere between 250 and 350GB. We are currently using a mergeFactor of 10 and a ramBufferSizeMB of 32MB, which means we get merges at approximately every 320MB, 3.2GB, and 32GB. We are doing this offline and will run an optimize at the end. What we would like to do is reduce the number of intermediate merges. We thought about just using a NoMergePolicy and then optimizing at the end, but suspect we would run out of file handles, and that merging 10,000 segments during an optimize might not be efficient.

We would like to find some optimum mergeFactor somewhere between 0 (noMerge merge policy) and 1,000. (We are also planning to raise the ramBufferSizeMB significantly).
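The merge cascade above can be sketched numerically: with a mergeFactor of segments per level, a level-N merge produces a segment of roughly ramBufferSizeMB x mergeFactor^N (a back-of-the-envelope model, not Lucene API):

```java
// Sketch of the merge cascade: each level's merge combines mergeFactor
// segments from the level below, so level-N segments are roughly
// ramBufferSizeMB * mergeFactor^N in size.
public class MergeLevels {
    static double levelSizeMB(double ramBufferMB, int mergeFactor, int level) {
        return ramBufferMB * Math.pow(mergeFactor, level);
    }

    public static void main(String[] args) {
        double buf = 32.0; // ramBufferSizeMB from the discussion above
        int mf = 10;       // mergeFactor from the discussion above
        for (int level = 1; level <= 3; level++) {
            System.out.printf("level %d merge: ~%.0f MB%n",
                              level, levelSizeMB(buf, mf, level));
        }
        // level 1: ~320 MB, level 2: ~3200 MB (3.2GB), level 3: ~32000 MB (32GB)
    }
}
```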

What experience do others have using a large mergeFactor?

4 weeks is a depressingly long time to re-index!

Do you use multiple threads for indexing? A large RAM buffer size is
also good, but I think performance peaks out maybe around 512 MB (at least
based on past tests).

Believe it or not, merging is typically compute bound. It's costly to
decode & re-encode all the vInts.

A larger merge factor is good because it means the postings are copied
fewer times, but it's bad because you risk running out of file
descriptors, and, if the OS doesn't have enough RAM, you'll start to
thin out the readahead that the OS can do (which makes the merge less
efficient, since the disk heads are seeking more).

Cutting over to SSDs would also be a good idea, but, kinda pricey
still ;)

Do you do any deleting?

Do you use stored fields and/or term vectors? If so, try to make
your docs "uniform" if possible, ie add the same fields in the same
order. This enables Lucene to use bulk byte-copy merging under the hood.

I wouldn't set such a huge merge factor that you effectively disable
all merging until the end… because, you want to take advantage of
the concurrency while you're indexing docs to get any/all merging done
that you can. To wait and do all merging in the end means you
serialize (unnecessarily) indexing & merging…

You could do periodic small optimizes. The optimize command now
includes 'maxSegments' which limits the target number of segments.
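As an update-message sketch, a partial optimize can be expressed like this (the target count of 20 is illustrative):

```
<!-- partial optimize: merge down to at most 20 segments instead of 1 -->
<optimize maxSegments="20"/>
```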

It is possible to write a Lucene program that collects a bunch of
segments and anoints them as an index. This gives you a way to collect
segments after you write them with NoMergePolicy. As long as you
are strict about not writing duplicate records, you can shovel
segments here and there and collect them into the real index as you
please. Ugly? Yes.

We are using Solr, I'm not sure if Solr uses multiple threads for indexing. We have 30 "producers" each sending documents to 1 of 12 Solr shards on a round robin basis. So each shard will get multiple requests.

Sounds like we need to do some monitoring during merging to see what the CPU use is, and also the I/O wait during large merges.

Is there a way to estimate the amount of RAM for the readahead? Once we start the re-indexing we will be running 12 shards on a 16 processor box with 144 GB of memory.

Deletes would happen as a byproduct of updating a record. This shouldn't happen too frequently during re-indexing, but we update records when a document gets re-scanned and re-OCR'd. This would probably amount to a few thousand.

We use 4 or 5 stored fields. They are very small compared to our huge OCR field. Since we construct our Solr documents programmatically, I'm fairly certain that they are always in the same order. I'll have to look at the code when I get back to make sure.

We aren't using term vectors now, but we plan to add them as well as a number of fields based on MARC (cataloging) metadata in the future.

Wild card search:

Wildcard searches are not analyzed. For case-insensitive search you can lowercase query terms on the client side (assuming a LowerCaseFilter is used at index time), e.g. Mail* => mail*
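A minimal sketch of that client-side normalization (the class and method names are illustrative):

```java
import java.util.Locale;

// Wildcard query terms bypass analysis, so if the index is lowercased,
// the client must lowercase the term itself before sending the query.
public class WildcardNormalizer {
    static String normalize(String term) {
        return term.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalize("Mail*")); // mail*
    }
}
```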

Wildcard search works with the q parameter, if that's what you are asking: &q=mail*

It is normal for leading wildcard searches to be slow. Using ReversedWildcardFilterFactory at index time can speed them up.
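A schema sketch of that index-time setup (the fieldType name is illustrative, and the tuning attributes shown are one possible configuration):

```
<!-- Sketch: index tokens both forward and reversed so leading-wildcard
     queries can be rewritten against the reversed form -->
<fieldType name="text_rev" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```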

But it is unusual to use both leading and trailing * operator. Why are you doing this?


ReversedWildcardFilter will help a leading wildcard, but will not help a query with BOTH leading and trailing wildcards; it'll still be slow. Solr/Lucene isn't good at that; I didn't even know Solr would do it at all, in fact.

If you really needed to do that, the way to play to Solr/Lucene's way of doing things would be to have a field where you actually index each _character_ as a separate token. Then leading and trailing wildcard search is basically reduced to a "phrase search", but where the 'words' are actually characters. But then you're going to get an index where pretty much every token belongs to every document, which Solr isn't that great at either, though you can apply "commongram" stuff on top to help that out a lot too. Not quite sure what the end result would be; I've never tried it. I'd only use that weird special "char as token" field for queries that actually required leading and trailing wildcards.

Figuring out how to set up your analyzers, and what (if anything) you're going to have to do client-app-side to transform the user's query into something that'll end up searching like a "phrase search" where each 'word' is a character, is left as an exercise for the reader. :)
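As the poster says, this is untested; but one way to sketch the "char as token" field is a 1-gram tokenizer, so that indexing "mail" produces the tokens m, a, i, l and *ail* becomes roughly the phrase query "a i l" (the fieldType name is illustrative):

```
<!-- Untested sketch: every character becomes its own token, turning a
     double-wildcard search into a phrase search over characters -->
<fieldType name="text_chars" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```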

Prioritizing adjectives in Solr search

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License