I’ve recently begun diving into Search Engines in general and Solr in particular. This is my understanding of it so far.
It isn’t really feasible to execute blazing-fast search queries on very big SQL databases, for two reasons. The first is that SQL databases favor lack of redundancy over performance: data is normalized across tables, so you’d need to use JOIN in your SELECT. The second is the nature of the data in documents: it’s essentially unstructured plain text, so that SELECT would need LIKE. Both joins and likes are performance killers, so this approach is a no-go for real-life search engines.
Therefore, most of them propose a way to look at data that is very different from SQL: the inverted index. This kind of data structure is a glorified dictionary where:
- keys are individual terms
- values are lists of the documents that match the term
Nothing fancy, but this view of the data makes for very fast searches in very high-volume databases. Note that the term 'document' is used very loosely: it should be a field-structured view of the initial document (see below).
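To make the idea concrete, here is a minimal Python sketch of such a dictionary. It is only an illustration, not how Solr stores its index: the sample documents and the `build_inverted_index` helper are made up, and term extraction is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Naive analysis (an assumption): lowercase, then split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


# Hypothetical sample data: document id -> raw text.
documents = {
    1: "the old man and the sea",
    2: "the sea wolf",
    3: "an old friend",
}

index = build_inverted_index(documents)
print(sorted(index["sea"]))  # [1, 2]
print(sorted(index["old"]))  # [1, 3]
```

Finding the documents that contain a term is then a single dictionary lookup, whatever the number of documents.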
Though Solr belongs to the NoSQL database family, it is not schemaless. Schema configuration takes place in a dedicated schema.xml file: individual fields must be defined, each with its type. Different document types may differ in structure and have few (no?) fields in common. In this case, each document type may be given its own index with its own schema.
Predefined types like strings, integers and dates are available out-of-the-box. Fields can be declared searchable (called "indexed") and/or stored (returned in queries). For example, books could (would?) include not only their content, but also author(s), publisher(s), date of publishing, etc.
There are two available interfaces to index documents in Solr: a REST API and a full Java interface named SolrJ.
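As a rough sketch of the HTTP route, the Python snippet below builds a request that posts a JSON document to Solr's update handler. The host and port (`localhost:8983`), the `books` core and the document fields are all assumptions for illustration, and the request is only built, not sent, since sending needs a running Solr instance.

```python
import json
import urllib.request


def build_update_request(base_url, core, docs):
    """Build (but do not send) a POST request adding docs to a Solr core."""
    # commit=true asks Solr to make the documents searchable right away.
    url = f"{base_url}/solr/{core}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical core name and document, for illustration only.
request = build_update_request(
    "http://localhost:8983",
    "books",
    [{"id": "1", "title": "The Old Man and the Sea", "author": "Ernest Hemingway"}],
)
print(request.full_url)

# Sending it requires a running Solr instance:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```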
To build the inverted index, documents have to be parsed into individual terms. For search to be user-friendly, you have to be able to query regardless of case, hyphens and irrelevant words, called stop words (these include 'a' and 'the' in English). It would also be great to treat as equal terms that share a common meaningful root, such as 'fish', 'fishing' and 'fisherman' (this is called stemming), and to offer a dictionary of synonyms.
Solr applies a tokenizer processing chain to each received document: each step in the chain has a single responsibility, which is to remove, add or replace a term token. These steps are referred to as filters. For example, one filter removes stop words, one lowercases terms (a replacement) and one adds synonym terms.
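The chain can be pictured with a few lines of Python. This is a toy model under my own assumptions, not Solr's implementation: the stop word list, the synonym dictionary and the function names are made up, and stemming is left out.

```python
import re

# Hypothetical configuration, for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "and"}
SYNONYMS = {"sea": ["ocean"]}


def tokenize(text):
    # Split on anything that is not a letter or a digit, so hyphens disappear.
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]


def lowercase_filter(tokens):
    # Replaces each token with its lowercase form.
    return [t.lower() for t in tokens]


def stop_filter(tokens):
    # Removes stop words.
    return [t for t in tokens if t not in STOP_WORDS]


def synonym_filter(tokens):
    # Adds the synonyms of a token right after it.
    out = []
    for t in tokens:
        out.append(t)
        out.extend(SYNONYMS.get(t, []))
    return out


def analyze(text, filters):
    """Tokenize the text, then apply each filter in order."""
    tokens = tokenize(text)
    for apply_filter in filters:
        tokens = apply_filter(tokens)
    return tokens


INDEX_CHAIN = [lowercase_filter, stop_filter, synonym_filter]
print(analyze("The Old-Man and the Sea", INDEX_CHAIN))
# ['old', 'man', 'sea', 'ocean']
```

Order matters: the stop word and synonym filters only see lowercase tokens because the lowercase filter runs first.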
Queries also have to be made of terms. Those terms can be combined with boolean operators (such as AND and OR), and individual terms can be boosted.
Queries are parsed into tokens through a process similar to the one applied to documents. Of course, some filters make sense there while others do not: the lowercase filter belongs to the former category, the synonym filter to the latter.
Parsed queries are compared to indexed terms using set theory to determine matching results.
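With an inverted index, that set theory maps directly onto set operations. The sketch below uses a small hand-written index (made-up data) and Python sets. It only illustrates the principle and says nothing about how Solr evaluates or scores queries internally.

```python
# Hypothetical inverted index: term -> ids of the documents containing it.
index = {
    "old": {1, 3},
    "sea": {1, 2},
    "wolf": {2},
}


def postings(term):
    """Return the ids of the documents matching a query term."""
    # The query term is lowercased, mirroring the index-time filter.
    return index.get(term.lower(), set())


print(sorted(postings("Old") & postings("sea")))   # old AND sea  -> [1]
print(sorted(postings("old") | postings("wolf")))  # old OR wolf  -> [1, 2, 3]
print(sorted(postings("sea") - postings("wolf")))  # sea NOT wolf -> [1]
```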
Search results are paged and ordered so that the documents most relevant to users are presented first. To provide the best user experience, a middle ground has to be found between:
- correctness - only relevant results are returned
- thoroughness - all relevant results must be returned
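In information retrieval, these two criteria are usually called precision and recall. As a small sketch, assuming we somehow know which documents are truly relevant to a query (the ids below are made up), both can be computed like this:

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) for one query.

    precision: share of the returned documents that are relevant.
    recall: share of the relevant documents that were returned.
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 3 documents actually relevant.
precision, recall = precision_recall(returned=[1, 2, 3, 4], relevant=[2, 4, 5])
print(precision)  # 0.5
print(recall)
```

Returning every document in the index gives perfect recall but poor precision, while returning a single sure hit does the opposite, hence the need for a middle ground.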
Results can be grouped using one or more fields. Grouping depends on the field type and can be customized: for example, book results can be grouped per author or per first letter of the author’s name, depending on the number of books in the whole index.
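Here is a short Python illustration of the two grouping strategies from the book example. The result list and the `group_by` helper are hypothetical. Solr does this on the server side, so the sketch only shows what the grouped output could look like.

```python
from collections import defaultdict


def group_by(results, key):
    """Group result titles by the value returned by key(document)."""
    groups = defaultdict(list)
    for doc in results:
        groups[key(doc)].append(doc["title"])
    return dict(groups)


# Hypothetical search results.
results = [
    {"title": "The Old Man and the Sea", "author": "Hemingway"},
    {"title": "The Sea-Wolf", "author": "London"},
    {"title": "White Fang", "author": "London"},
    {"title": "Hard Times", "author": "Dickens"},
]

by_author = group_by(results, lambda d: d["author"])
by_letter = group_by(results, lambda d: d["author"][0])

print(by_author["London"])  # ['The Sea-Wolf', 'White Fang']
print(sorted(by_letter))    # ['D', 'H', 'L']
```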