Solr (pronounced: "solar") is an open source search platform based on the free Apache software Lucene. Solr builds on Lucene Core and is written in Java. As a search platform, Apache Solr is one of the most popular tools for integrating vertical search engines. Among Solr's advantages are its wide range of functions (which includes faceting search results, for example) and accelerated indexing. It also runs in servlet containers such as Apache Tomcat. First, we reveal how Apache Solr works, and then we explain in a Solr tutorial what to consider when using the software for the first time.

The origins of Apache Solr

The search platform Solr was built based on Lucene. Apache Lucene Core was developed by software designer Doug Cutting in 1997. At first, he offered it via the file hosting service SourceForge. In 1999, the Apache Software Foundation launched the Jakarta project to support and drive the development of free Java software. In 2001, Lucene also became part of this project – it was also written in Java. Since 2005, it has been one of Apache's main projects and runs under a free Apache license. Lucene gave rise to several sub-projects such as Lucy (Lucene written in C) and Lucene.NET (Lucene in C#). The popular search platform Elasticsearch is also based on Lucene, just like Solr.

Solr was also created in 2004 and is based on Lucene: at that time, however, the servlet was still called Solar and was distributed by CNET Networks. "Solar" stood for "Search on Lucene and Resin."

In 2006, CNET handed the project over to the Apache Foundation, where it initially went through another development period. When Solr was released to the public as a separate project in 2007, it quickly attracted community attention. In 2010, the Apache community integrated the servlet into the Lucene project. This joint development guarantees good compatibility. The package is rounded out by SolrCloud and the parser Tika.

Definition

Apache Solr is a platform-independent search platform for Java-based projects. The open source project is based on the Java library Lucene. It integrates documents automatically in near real time and forms dynamic clusters. Solr is compatible with PHP, Python, XML, and JSON. The servlet has a web user interface, and commands are exchanged via HTTP. Solr provides users with a differentiated full-text search for rich-text documents. It is particularly suitable for vertical search engines on static websites. The extension SolrCloud allows additional cores and an extended fragment classification.

Introduction to Solr: explanation of the basic terms

Apache Solr is integrated into Lucene as a servlet. Since it complements the Lucene software library, we will briefly explain how Lucene works. In addition, many websites use Solr as the basis for their vertical search engine (Netflix and eBay are well-known examples). We explain what a vertical search engine is in the following section.

What is Apache Lucene?

The free software Lucene is an open source Java library that you can use on any platform. Lucene is known as a scalable and powerful NoSQL library. The archive software is particularly suitable for internet search engines – both for searches on the entire internet and for domain-wide searches and local queries.

Fact

Lucene Core is a software library for the programming language Java. Libraries serve as ordered collections of subprograms. Developers use these collections to link programs to helper modules via an interface. While a program is running, it can access the required components in the library.

Since the library divides documents into text fields and classifies them logically, Lucene's full-text search works very precisely. Lucene will also find relevant hits for similar texts/documents. This means that the library is also suitable for rating websites such as Yelp. As long as it recognizes text, it doesn't matter which format (plain text, PDF, HTML, or others) is used. Instead of indexing files, Lucene works with text and metadata. Nevertheless, the files must first be read in for the library.

This is why the Lucene team developed the now independent Apache project Tika. Apache Tika is a practical tool for text analysis, translation, and indexing. The tool reads text and metadata from over a thousand file types. It then extracts the text and makes it available for further processing. Tika consists of a parser and a detector. The parser analyzes texts and structures the content in an ordered hierarchy. The detector typifies content; for example, it recognizes file types as well as the type of content from the metadata.

Lucene’s most important functions and features:

  • Fast indexing, both step-by-step and in batches (up to 150 GB per hour, according to the specifications)
  • Eco­nom­i­cal RAM use
  • Written in Java throughout, therefore cross-platform (variants in alternative programming languages are Apache Lucy and Lucene.NET)
  • Interface for plugins
  • Search for text fields (categories such as content, title, author, keyword) – even able to search several at the same time
  • Sorting by text fields
  • Listing search results by similarity/relevance

Lucene divides documents into text fields such as title, author, and text body. The software uses the query parser to search within these fields. This is considered a particularly efficient tool for search queries with manual text entry. The simple syntax consists of a search term and a modifier. Search terms can be individual words or groups of words. You adjust these with a modifier, or you link several terms with Boolean operators into a complex query. See the Apache query parser syntax for the exact commands.
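As an illustration, a few typical queries in the Lucene query parser syntax might look like this (the field names and terms are hypothetical examples):

title:"Alice in Wonderland" AND author:carroll
content:(chicken OR hen) -content:recipe
title:chicken^2 content:chicken

The first query combines two field searches with the Boolean operator AND; the second groups alternatives and excludes a term with "-"; the third boosts hits in the title field with "^2" so they rank higher in the results.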

Lucene also supports fuzzy searches based on the Levenshtein distance. The latter records the number of character changes (i.e., replacements, insertions, or deletions) needed to get from one meaningful character string to another. Here's an example: "beer" to "bear" (replace e with a) has a distance of 1, because only one step was needed.

You can define a value that determines how large the deviations from the original search term may be for a term to still be included in the search hits. This value lies between 0 and 1 – the closer it is to 1, the more similar the search hit must be to the original word. If you do not enter a value, it defaults to 0.5, and the corresponding command looks like this:

beer~

If you want to set a specific value (0.9 in the example), enter the following command:

beer~0.9

The proximity search works similarly: you search for groups of words and can also determine how far apart the search terms may be in the text and still count as relevant. For example, if you search for the phrase "Alice in Wonderland", you can specify that the words "alice" and "wonderland" must lie within a three-word radius:

"alice wonderland"~3

What is a vertical search engine?

Lucene enables you to search the internet as well as within domains. Search engines that cover a wide range of pages are called horizontal search engines. These include the well-known providers Google, Bing, Yahoo, DuckDuckGo, and Startpage. A vertical search engine, on the other hand, is limited to a domain, a specific topic, or a target group. A domain-specific search engine helps visitors to your website find specific texts or offers. Examples of thematically oriented search engines are recommendation portals such as TripAdvisor or Yelp, but also job search engines. Target-group-specific search engines are aimed, for example, at children and young people, or at scientists looking for sources.

Focused crawlers (instead of general web crawlers) provide more accurate results for vertical search engines. A library like Lucene, which divides its index into classes using taxonomy principles and connects them logically using ontology, makes this exact full-text search possible. Vertical search engines also use thematically matching filters that limit the number of results.

Fact

Ontology and taxonomy are two principles within computer science that are important for proper archiving. Taxonomy deals with dividing terms into classes, which are arranged into a hierarchy similar to a tree diagram. Ontology goes one step further and places terms in logical relation to each other: groups of terms gather in clusters that signal a close relationship, and related groups of terms are connected with each other, creating a network of relationships.

Lucene's index is a practical, scalable archive for quick searches. However, there are some essential steps that you must repeat frequently, since they do not take place automatically. After all, you need a widely branched index for the vertical search. That's where Apache Solr comes in. The search platform expands the library's functions, and Solr can be set up quickly and easily with the right commands – even by Java beginners. The servlet offers you many practical tools with which you can set up a vertical search engine for your internet presence in a short time and adapt it to the needs of your visitors.

What is Solr? How the search platform works

Since you now have basic information about the Lucene foundation and how Solr can be used, we explain below how the search platform works, how it extends Lucene's functions, and how you work with it.

Solr: the basic elements

Solr is written in Java, meaning you can use the servlet platform-independently. Commands are usually written in HTTP (Hypertext Transfer Protocol), and files to be saved use XML (Extensible Markup Language). Apache Solr also offers Python and Ruby developers their familiar programming language via an API (Application Programming Interface). Those who normally work with JavaScript Object Notation (JSON) will find an optimal environment in Elasticsearch; Solr can also work with this format via an API.

Although the search platform is based on Lucene and fits seamlessly into its architecture, Solr can also work on its own. It is compatible with servlet containers like Apache Tomcat.

Indexing for accurate search results – in fractions of a second

Structurally, the servlet is based on an inverted index, for which Solr uses Lucene's library. Inverted files are a subtype of database index and are designed to speed up the retrieval of information. The index stores content within the library; this can be words or numbers. If a user searches for specific content on a website, the person usually enters one or two topic-relevant search terms. Instead of crawling the entire site for these words, Solr uses the library.

This indexes all important keywords almost in real time and connects them to the documents on the website in which the searched words appear. The search simply runs through the index to find a term. The results list displays all documents that, according to the index, contain this word at least once. This type of search is equivalent to looking something up in a textbook: if you search for a keyword in the index at the end of the book, you find out on which pages the term appears in the text, and simply turn to them. Correspondingly, the vertical web search shows a list of results with links to the respective documents.

For this process to work smoothly, it would theoretically be necessary to enter all keywords and metadata (for example, author or year of publication) into the library every time a new document is added to the website's portfolio. That's why working in the backend with Lucene alone can be a bit tedious. Solr, however, can automate these steps.
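Schematically, an inverted index is simply a mapping from each term to the documents that contain it (the document names here are hypothetical):

chicken  → doc1, doc4, doc7
feeding  → doc4
breeding → doc1, doc7

A query for "chicken AND breeding" then only has to intersect two of these lists (doc1, doc7) instead of scanning every document in full.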

Relevance and filter

Apache Solr uses Lucene's ontology and taxonomy to output highly accurate search results. Boolean operators and truncation also help here. Solr adds a higher-level cache to Lucene's cache, meaning that the servlet remembers frequently asked search queries even if they consist of complex variables. This optimizes the search speed.

If you want to keep users on your website, you should offer them a good user experience. In particular, this includes making the right offers. For example, if your visitors are looking for information about raising chickens, texts about breeding and feeding habits should appear as the first search results at the top of the list. Recipes containing chicken or even films about chickens should not appear in the search results, or should at least appear much further down the page.

Regardless of whether users are searching for a certain term or should be shown interesting topic suggestions with internal links at the end of an exciting article: in both cases, the relevance of the results is essential. Solr uses the tf-idf measure to ensure that only search results relevant to the searcher are actually displayed.

Fact

The term "term frequency-inverse document frequency" (tf-idf) stands for a numerical statistic. The search-word density in a document (i.e., the number of times a single term occurs in the text) is compared with the number of documents in the entire search pool that contain the term. This shows whether a search term really occurs more often in the context of one document than in the entirety of the texts.
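In one common textbook form (Lucene's actual scoring formula differs in detail), the statistic is calculated as:

tf-idf(t, d) = tf(t, d) × log(N / df(t))

For example, if a term occurs 5 times in a document (tf = 5), the pool contains 1,000 documents (N), and 10 of them contain the term (df), then tf-idf = 5 × log10(1000/10) = 5 × 2 = 10. A term that appears in almost every document gets an idf close to zero and thus contributes little to relevance.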

Solr: the most important functions

Apache Solr collects and indexes data in near real time, supported by Lucene Core. Here, data means documents: both in the search and in the index, the document is the decisive unit of measurement. The index consists of several documents, which in turn consist of several text fields. In database terms, a document corresponds to a table row and a field to a table column.
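For illustration, a single document in Solr's XML update format could look like this (the field names are hypothetical and must match your schema):

<add>
  <doc>
    <field name="id">book-001</field>
    <field name="title">Alice in Wonderland</field>
    <field name="author">Lewis Carroll</field>
  </doc>
</add>

Each <field> element corresponds to one column in the table analogy; the document as a whole is the row.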

Coupled via an API with Apache ZooKeeper, Solr has a single point of contact that provides synchronization, name registers, and configuration distribution. This includes, for example, a ring algorithm that assigns a coordinator (also called a leader) to processes within a distributed system. The tried-and-tested troubleshooter ZooKeeper also restarts processes when tokens are lost and finds nodes (computers in the system) using node discovery. All these functions ensure that your project always remains freely scalable.

This also means that the search engine works even under the toughest conditions. As already mentioned, traffic-intensive websites that store and manage huge amounts of data on a daily basis also use Apache Solr. If a single Solr server is not enough, simply connect multiple servers via SolrCloud. Then you can fragment your data sets horizontally – also called sharding. To do this, you divide your library into logically linked fragments. This allows you to expand your library beyond the otherwise available storage space. Apache also recommends uploading multiple copies of your library to different servers. This increases your replication factor: if many requests come in at the same time, they are distributed across the different servers.
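In SolrCloud mode, sharding and replication factor can be set when a collection is created. A sketch (the collection name is hypothetical):

bin/solr create -c mycollection -shards 2 -replicationFactor 2

This would split the index into two shards and keep two copies of each, distributed across the available nodes.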

Solr expands the full-text search already offered by Lucene with additional functions. These search features include, but are not limited to:

  • Adapting terms, including for groups of words: the system detects spelling errors in the search input and provides results for a corrected alternative.
     
  • Joins: a mixture of the Cartesian product (several terms are considered in any order during the search) and selection (only terms that fulfill a certain prerequisite are displayed) makes for a complex Boolean query syntax.
     
  • Grouping of thematically related terms.
     
  • Facet classification: the system classifies each information item according to several dimensions. For example, it links a text to keywords such as author name, language, and text length. In addition, there are the topics the text deals with, as well as a chronological classification. The facet search allows the user to apply several filters to obtain an individual list of results.
     
  • Wildcard search: if a character stands for an undefined element or several similar elements in a character string, use "?" for one character and "*" for several. For example, you can enter a word fragment plus the placeholder (for example: teach*). The list of results then includes all terms with this root word (for example: teacher, teach, teaching). This way, users receive hits on this subject area. The necessary relevance results from the topic restriction of your library or further search restrictions. For example, if users search for "b?nd" they receive results such as band, bond, and bind. Words such as "binding" or "bonding" are not included in the search, as "?" replaces only one letter.
     
  • Recognizes text in many formats, from Microsoft Word and plain-text editors to PDF and indexed rich content.
     
  • Recognizes different languages.

In addition, the servlet can integrate several cores, each consisting of a Lucene index. A core collects all the information in one library; its configuration files and schemas are also found there, and together they define the behavior of Apache Solr. If you want to use extensions, simply integrate your own scripts or community plugins via the configuration file.
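On disk, a core typically looks something like this (a sketch of the usual Solr 7 layout; names can vary depending on how the core was created):

server/solr/<core_name>/
    conf/
        solrconfig.xml
        managed-schema
    data/

conf/ holds the configuration and schema files mentioned above, while data/ contains the Lucene index itself.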

Advantages | Disadvantages
Adds practical features to Lucene | Less suitable for dynamic data and objects
Automatic real-time indexing | Adding cores and splitting fragments can only be done manually
Full-text search | Global cache can cost time and storage space compared to a segmented cache
Faceting and grouping of keywords |
Full control over fragments |
Facilitates horizontal scaling of search servers |
Easy to integrate into your own website |

Tutorial: downloading and setting up Apache Solr

The system requirements for Solr are not particularly high: all you need is a Java SE Runtime Environment from version 1.8.0 onwards. The developers tested the servlet on Linux/Unix, macOS, and Windows in different versions. Simply download the appropriate installation package and extract the .zip file (Windows package) or the .tgz file (Unix, Linux, and macOS package) to a directory of your choice.

Solr tutorial: step 1 – download and start

  1. Visit the Solr project page of the Apache Lucene main project. The menu bar appears at the top of the window. Under "Features", Apache briefly informs you about the Solr functions. Under "Resources" you will find tutorials and documentation. Under "Community", Solr fans will help you with any questions you may have. In this area you can also submit your own builds.
     
  2. Click on the download button for the installation. This will take you to the download page with a list of mirror downloads. The current Solr version (7.3, as of May 2018) from a certified provider should be at the top. Alternatively, choose from HTTP links and an FTP download. Click on a link to go to the mirror site of the respective provider.
     The download page lists the various packages available, in this case for Solr version 7.3:
  • solr-7.3.0-src.tgz is the package for developers. It contains the source code so you can work on it outside the GitHub community.
  • solr-7.3.0.tgz is the version for Mac, Linux, and Unix users.
  • solr-7.3.0.zip contains the Windows-compatible Solr package.
  • In the changes/ folder, you will find the documentation for the corresponding version.

After you have selected the optimal version for your requirements by clicking on it, a download window appears. Save the file; once the download is complete, click on the download button in your browser or open your download folder.

  3. Unpack the .zip or .tgz file. If you want to familiarize yourself with Solr first, choose any directory to store the unzipped files. If you already know how you want to use Solr, select the appropriate server for this purpose, or build a clustered SolrCloud environment if you want to scale up (more about the cloud in the next chapter).
Note

Theoretically, a single Lucene library can index approximately 2.14 billion documents. In practice, however, this number is not usually reached before the number of documents affects performance. With a correspondingly high number of documents, it is therefore advisable to plan with a SolrCloud from the beginning.

  4. In our Linux example, we are working with Solr 7.3.0. The code in this tutorial was tested in Ubuntu. You can also use the examples for macOS. In principle, the commands also work on Windows, but with backslashes instead of forward slashes.

Enter "cd /[source path]" in the command line to open the Solr directory and start the program. In our example it looks like this:

cd /home/test/Solr/solr-7.3.0
bin/solr start

The Solr server is now running on port 8983 and your firewall may ask you to allow this. Confirm it.

If you want to stop Solr, enter the following command:

bin/solr stop -all

Once Solr is working, familiarize yourself with the software. With the Solr demos, you can alternatively start the program in one of four modes:

  • Solr cloud (command: cloud)
  • Data import handler (command: dih)
  • Without schema (command: schemaless)
  • Detailed example with KitchenSink (command: techproducts)

The examples each have a schema adapted to the use case. You edit this using the schema interface. To do this, enter this command (the placeholder [example] stands for one of the keywords mentioned above):

bin/solr -e [example]
  5. Solr then runs in the corresponding mode. If you want to be sure, check the status report:
bin/solr status
Found 1 Solr nodes:
Solr process xxxxx running on port 8983
  6. The examples contain preconfigured basic settings. If you start without an example, you must define the schema and core yourself. The core stores your data; without it, you cannot index or search files. To create a core, enter the following command:
bin/solr create -c <name_of_core>
  7. Apache Solr has a web-based user interface. If you have started the program successfully, you can find the Solr admin web app in your browser at "http://localhost:8983/solr/".
  8. Lastly, stop Solr with this command:
bin/solr stop -all

Solr tutorial: part 2, first steps

Solr provides you with a simple command-line tool: with the so-called post tool you can upload content to your server, whether documents for the index or schema configurations. The tool accesses your collection to do this, so you must always specify the core or collection before you start to work with it.

In the following code example, the first line shows the general form: replace <collection> with the name of your core/collection, which the option "-c" specifies as the target. After it, you define additional options or execute commands. For example, select a port with "-p", and with "*.xml" or "*.csv" you upload all files of the respective format to your collection (lines two and three). With "-d" you pass data directly on the command line, here an XML command that deletes the document with ID 42 from your collection (line four).

bin/post -c <collection> [options] <files|collections|URLs>
bin/post -c <collection> -p 8983 *.xml
bin/post -c <collection> *.csv
bin/post -c <collection> -d '<delete><id>42</id></delete>'

Now you know some of the basic commands for Solr. The “KitchenSink” demo version shows you exactly how to set up Apache Solr.

  1. Start Solr with the demo version. For the KitchenSink demo, use the command techproducts. Enter the following in the terminal:
bin/solr -e techproducts

Solr starts on port 8983 by default. The terminal reveals that it is creating a new core for your collection and indexing some sample files for your catalog. In the KitchenSink demo you should see the following information:

Creating Solr home directory /tmp/solrt/solr-7.3.1/example/techproducts/solr
Starting up Solr on port 8983 using command:
bin/solr start -p 8983 -s "example/techproducts/solr"
Waiting up to 30 seconds to see Solr running on port 8983 [/]
Started Solr server on port 8983 (pid=12281). Happy searching!
Setup new core instance directory:
/tmp/solrt/solr-7.3.1/example/techproducts/solr/techproducts
Creating new core 'techproducts' using command:
http://localhost:8983/solr/admin/cores?action=CREATE&name=techproducts&instanceDir=techproducts
{"responseHeader":
{"status":0,
"QTime":2060},
"core":"techproducts"}
Indexing tech product example docs from /tmp/solrt/solr-7.3.1/example/exampledocs
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/techproducts/update…
using content-type application/xml...
POSTing file money.xml to [base]
POSTing file manufacturers.xml to [base]
POSTing file hd.xml to [base]
POSTing file sd500.xml to [base]
POSTing file solr.xml to [base]
POSTing file utf8-example.xml to [base]
POSTing file mp500.xml to [base]
POSTing file monitor2.xml to [base]
POSTing file vidcard.xml to [base]
POSTing file ipod_video.xml to [base]
POSTing file monitor.xml to [base]
POSTing file mem.xml to [base]
POSTing file ipod_other.xml to [base]
POSTing file gb18030-example.xml to [base]
14 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/techproducts/update...
Time spent: 0:00:00.486
Solr techproducts example launched successfully. Direct your Web browser to
http://localhost:8983/solr to visit the Solr Admin UI
  2. Solr is now running and has already loaded some XML files into the index, which you can work with later. In the next steps, try feeding some files into the index yourself. This is quite easy using the Solr admin user interface. Access the Solr server in your browser; in our techproducts demo, Solr already specifies the server and the port. Enter the following address in your browser: "http://localhost:8983/solr/".

    If you have already defined a server name and a port yourself, use the following form and enter the server name and the port number in the appropriate place: "http://[server name]:[port number]/solr/".

    Now switch to the folder example/exampledocs. It contains sample files and the post.jar file. Select a file you want to add to the catalog and use post.jar to add it. For our example, we use more_books.jsonl.

    To do this, enter the following in the terminal:
cd example/exampledocs
java -Dc=techproducts -jar post.jar more_books.jsonl

If Solr has suc­cess­ful­ly loaded your file into the index, you will receive this message:

SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/techproducts/update 
POSTing file more_books.jsonl to [base]
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/techproducts/update...
Time spent: 0:00:00.162
  3. When setting up the Apache Solr search platform, you should include the configuration files and the schema directly. These are provided in the demo examples. If you are working on a new server, you must define the config set and the schema yourself.

    The schema (schema.xml) defines the number, type, and structure of the fields. As already mentioned, a document in Lucene consists of fields, and this subdivision facilitates the targeted full-text search. Solr works with these fields. A particular field type only accepts certain content (for example, a date field only recognizes dates in year-month-day-time form). You use the schema to determine which field types the index recognizes later and how it assigns them. If you do not, the documents fed in will determine the field types. This is practical in the test phase, as you can simply start filling the catalog. However, such an approach can lead to problems later on.

    These are some basic Solr field types:
  • DateRangeField (indexes times and points in time, down to milliseconds)

  • ExternalFileField (pulls values from an external folder)

  • TextField (general field for text input)

  • BinaryField (for binary data)

  • CurrencyField (stores a numerical value (for example, 4.50) and a currency (for example, $) in separate index fields, but displays them to the end user as one value ($4.50))

  • StrField (a UTF-8 or Unicode string in a small field that is not analyzed or tokenized)

A detailed list of Solr field types and other commands for schema settings can be found in the Solr wiki.

To specify field types, call up the schema at "http://localhost:8983/solr/techproducts/schema". Techproducts has already defined field types. A line in the schema file describes the properties of a field in more detail using attributes. According to the documentation, Apache Solr allows the following attributes for a field:

  • name (Must not be empty. Contains the name of the field.)

  • type (Enter a valid field type here. Must not be empty.)

  • indexed (Stands for "entered in the index". If the value is "true", you can search for the field or sort by it.)

  • stored (Describes whether a field is stored. If the value is "true", the field can be accessed.)

  • multiValued (If a field can contain several values for a document, enter the value "true" here.)

  • default (Enter a default value that is used if no value is specified for a new document.)

  • compressed (Rarely used, since it only applies to gzip-compressible fields; set to "false" by default. Must be set to "true" to compress.)

  • omitNorms (Set to "true" by default. Omits the norms for a field and thus saves memory.)

  • termOffsets (Requires more memory; stores vectors together with offset information, i.e., additions to memory addresses.)

  • termPositions (Requires more memory because it stores the positions of terms together with the vector.)

  • termVectors (Set to "false" by default; stores term vectors if "true".)

You either enter field properties directly in the schema.xml file, or you use a command in the terminal. This is a simple field definition in the schema.xml file:

<fields>
<field name="name" type="text_general" indexed="true" multiValued="false" stored="true" />
</fields>

Alternatively, you can use the terminal again. There you simply enter a curl command, set the field properties, and send it via the schema interface by specifying the file address:

curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": {"name":"name", "type":"text_general", "multiValued":false, "stored":true}}' http://localhost:8983/solr/techproducts/schema
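To check that the field was accepted, you can query the schema interface for it (a sketch, assuming the techproducts example from above):

curl http://localhost:8983/solr/techproducts/schema/fields/name

Solr answers with a JSON description of the field and its attributes.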
  4. After you have adapted the schema, it is the Solr configuration's turn. This is used to define the search settings. These are important components:
  • Query cache parameters
  • Request handlers
  • Location of the data directory
  • Search components

The query cache parameters enable three types of caching: LRUCache, LFUCache, and FastLRUCache. LRUCache uses a linked hash map, and FastLRUCache collects data via a concurrent hash map, which processes requests simultaneously. This way the search server produces answers faster when many search queries arrive in parallel. FastLRUCache reads data faster than LRUCache but inserts at a slower pace.
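In solrconfig.xml, such a cache is declared with a line like the following (the sizes are illustrative values, not recommendations):

<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>

size limits the number of cached entries, initialSize presizes the map, and autowarmCount controls how many entries are copied over when a new searcher is opened.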

Fact

A hash map assigns values to keys. Each key is unique: there is only one value per key. A key can be any object; it is used to calculate the hash value, which is practically the "address", i.e., the exact position in the index. Using it, you can find the key's value within a table.

The request handler processes requests. It reads the HTTP protocol, searches the index, and outputs the answers. The "techproducts" example configuration includes the standard handler for Solr. The search components are listed in the request handler; these elements perform the search. The handler contains the following search components by default:

  • query (request)
  • facet (faceting)
  • mlt (More Like This)
  • highlight (highlighting)
  • stats (statistics)
  • debug
  • expand (expand search)

For the search component More Like This (mlt), for example, the entry looks like this:

<searchComponent name="mlt" class="org.apache.solr.handler.component.MoreLikeThisComponent" />

More Like This finds documents that are similar in content and structure. It is implemented as a class within Lucene. The query finds related content for your website visitors by comparing terms from a document with the indexed fields.
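With the component registered in a search handler, similar documents can be requested through query parameters. A sketch assuming the running "techproducts" example (SP2514N is one of its sample document IDs; mlt.fl names the fields to compare):

```
curl 'http://localhost:8983/solr/techproducts/select?q=id:SP2514N&mlt=true&mlt.fl=manu,cat&mlt.mintf=1&mlt.mindf=1'
```

mlt.mintf and mlt.mindf set the minimum term and document frequencies a term must reach before it is considered for the similarity comparison.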

To configure the list, first open the request handler:

<requestHandler name="standard" class="solr.SearchHandler" default="true">
<lst name="defaults">
<str name="echoParams">explicit</str>
<!--
<int name="rows">10</int>
<str name="fl">*</str>
<str name="version">2.1</str>
-->
</lst>
</requestHandler>

Add your own components to the list in the request handler or change existing search components. These components perform the search when a website visitor enters a search query on your domain. The following entry inserts a self-made component before the standard components:

<arr name="first-components">
<str>NameOfTheOwnComponent</str>
</arr>

To insert the component after the standard com­po­nents:

<arr name="last-components">
<str>NameOfTheOwnComponent</str>
</arr>

To declare the list of components yourself, replacing the defaults entirely:

<arr name="components">
<str>facet</str>
<str>NameOfTheOwnComponent</str>
</arr>

The default data directory can be found in the core instance directory "instanceDir" under the name "/data". If you want to use another directory, change the location via solrconfig.xml. To do this, specify a fixed path or bind the directory name to the core (SolrCore) or to instanceDir. To bind it to the core, write:

<dataDir>/solr/data/${solr.core.name}</dataDir>

Solr tutorial part 3: build a Solr cloud cluster

Apache Solr provides a cloud demo to explain how to set up a cloud cluster. Of course, you can also go through the example yourself.

  1. First start the command line interface. To start Solr in cloud mode, enter the following in the tool:
bin/solr -e cloud

The demo will start.

  2. Specify how many servers (here: nodes) should be connected via the cloud. The number can be between [1] and [4] (in the example it is [2]). The demo runs on one machine but uses a different port for each server, which you specify in the next step (the demo suggests the port numbers).
Welcome to the SolrCloud example!
This interactive session will help you launch a SolrCloud cluster on your local workstation.
To begin, how many Solr nodes would you like to run in your local cluster? (specify 1-4 nodes) [2]
Please enter the port for node1 [8983]
Please enter the port for node2 [7574]
bin/solr start -cloud -s example/cloud/node1/solr -p 8983
bin/solr start -cloud -s example/cloud/node2/solr -p 7574

Once you have assigned all the ports, the script starts the servers and shows you (as above) the commands used to launch them.

  3. Once all servers are running, choose a name for your data collection (square brackets indicate placeholders and do not appear in the code).
Please provide a name for your new collection: [insert_name_here]
  4. Use CREATESHARD to add a fragment (shard) to this collection. You can later split it into partitions again. This speeds up the search when several queries arrive at the same time.
http://localhost:8983/solr/admin/collections?action=CREATESHARD&shard=[NewFragment]&collection=[NameOfCollection]

After you have created fragments, you can distribute your data using a router. Solr integrates the compositeId router (router.key=compositeId) by default.

Fact

A router determines how data is distributed to the fragments and how many bits the router key uses. If 2 bits are used, for example, the router spreads the data for a key over a quarter of the fragments. This prevents large records on a single fragment from occupying the entire memory, which would slow down the search. To use the router, enter a router value (for example, a user name such as JohnDoe1), the bit number, and the document identification in this form: [Username]/[BitNumber]![DocumentIdentification] (for example: JohnDoe1/2!1234)
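As a sketch, a document carrying such a composite ID could be indexed via the update interface (the collection name and field values are placeholders):

```
curl -X POST -H 'Content-type:application/json' \
  'http://localhost:8983/solr/[NameOfCollection]/update?commit=true' \
  --data-binary '[{"id":"JohnDoe1/2!1234","name":"example document"}]'
```

The router reads the prefix "JohnDoe1/2!" from the ID and picks the fragment accordingly; the part after the "!" is the ordinary document identification.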

Use the interface to divide a fragment into two parts. Together, the two partitions contain the data of the original fragment; the index divides it along the newly created sub-ranges.

/admin/collections?action=SPLITSHARD&collection=[NameOfCollection]&shard=[FragmentNumber]
  5. For the last step, you need to define the name of your configuration directory. The templates sample-techproducts-configs and _default are available. The latter does not specify a schema, so you can still customize your own schema. With the following command, you can switch off the schemaless function of _default for the Solr Cloud interface:
curl http://localhost:8983/api/collections/[_name_of_collection]/config -d '{"set-user-property": {"update.autoCreateFields":"false"}}'

This prevents fields from being created automatically with guessed schemas that are incompatible with the rest of the files. Since you need the HTTP method POST for this setting, you cannot simply use the browser address bar. localhost:8983 stands for the first server; if you have selected a different port number, you must insert it there. Replace [_name_of_collection] with the name you selected.

You have now set up the Solr cloud. To see if your new col­lec­tion is displayed correctly, check the status again:

bin/solr status

For a more detailed overview of the distribution of your fragments, see the admin interface. The address consists of your server name with port number and the connection to the Solr cloud in this form: "http://servername:portnumber/solr/#/~cloud"

Expanding Apache Solr with plugins

Apache Solr already comes with some extensions, the so-called handlers. We have already introduced the request handler. Lucene (and thus Solr) also supports some practical native classes such as the Analyzer class and the Similarity class. You integrate plugins into Solr via a JAR file. If you build your own plugins and interact with Lucene interfaces, you should add the lucene-*.jars from your library (solr/lib/) to the classpath you use to compile your plugin source code.

This method works if you only use one core. If you use the Solr cloud, create a shared library for the JAR files instead: create a directory referenced by the "sharedLib" attribute in the solr.xml file on your servlet. For individual cores, plugins are easy to load like this:

If you have built your own core, create a directory for the library with the command "mkdir" (under Windows: "md") in this form:

mkdir solr/[example]/solr/CollectionA/lib

If you are getting acquainted with Solr through one of the included demos instead, go to "example/solr/lib". In both cases, you are now in the library directory of your instance directory. This is where you save your plugin JAR files.

Alternatively, use the old method from earlier Solr versions if, for example, the first variant does not work on your servlet container.

  • To do this, unpack the solr.war file.
  • Then add the JAR file with your self-built classes to the WEB-INF/lib directory. You can find the directory via the web app on this path: server/solr-webapp/webapp/WEB-INF/lib.
  • Compress the modified WAR file again.
  • Use your tailor-made solr.war.

If you add a "dir" option to the lib directive, it adds all files within the directory to the classpath. Use "regex=" to exclude files that do not match the "regex" requirements.

<lib dir="${solr.install.dir:../../../}/contrib/../lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../..}/dist/" regex="plugin_name-\d.*\.jar" />

If you build your own plugin script, we recommend the Lisp dialect Clojure for the Java runtime. This programming language supports interactive program development. Other languages integrate their native properties; Clojure makes them available through its library. This is a good way to use the Solr servlet.

The programming and scripting language Groovy supports dynamic and static typing on the Java virtual machine. The language borrows from Ruby and Python and is compiled to Java byte code; Groovy code can also be run as a script. Groovy has some features that extend the capabilities of Java. For example, Groovy integrates a simple template engine that you can use to create code in SQL or HTML. Out of the box, Groovy syntax also provides some common expressions and data fields for lists. If you process JSON or XML for your Solr search platform, Groovy can help to keep the syntax clean and understandable.

Solr vs. Elas­tic­search

When it comes to open source search engines, Solr and Elasticsearch are always at the forefront of tests and surveys, and both search platforms are based on the Apache Java library Lucene. Lucene is obviously a stable foundation: the library indexes information flexibly and provides fast answers to complex search queries. On this basis, both search engines perform relatively well. Each project is also supported by an active community.

Elasticsearch's development team works with GitHub, while Solr is based at the Apache Foundation. In comparison, the Apache project has the longer history, and its lively community has been documenting all changes, features, and bugs since 2007. Elasticsearch's documentation is not as comprehensive, which is one point of criticism. However, Elasticsearch is not necessarily behind Apache Solr in terms of usability.

Elasticsearch enables you to build your library in a few steps. For additional features, you need premium plugins, which let you manage security settings, monitor the search platform, or analyze metrics. The search platform comes with a well-matched product family. Under the labels Elastic Stack and X-Pack, you get some basic functions for free. However, the premium packages are only available with a monthly subscription, with one license per node. Solr, on the other hand, is always free, including extensions like Tika and ZooKeeper.

The two search engines differ most in their focus. Both Solr and Elasticsearch can be used for small data sets as well as for big data spread across multiple environments. But Solr focuses on text search, while the concept of Elasticsearch combines search with analysis. The servlet processes metrics and logs right from the start, and Elasticsearch easily handles the corresponding amounts of data: the server has dynamically integrated cores and fragments since its first version.

Elasticsearch was once ahead of its competitor here, but for some years now the Solr cloud has also made faceted classification possible. Elasticsearch is still slightly ahead when it comes to dynamic data; Solr, in turn, scores points with static data: it outputs targeted results for full-text search and calculates data precisely.

The different basic concepts are also reflected in the caching. Both providers allow request caching. If a query uses complex Boolean variables, both store the retrieved index elements in segments, which can merge into larger segments. However, if only one segment changes, Apache Solr must invalidate and reload the entire global cache, while Elasticsearch limits this process to the affected sub-segment. This saves storage space and time.

If you work regularly with XML, HTTP, and Ruby, you will get used to Solr without any problems. JSON, on the other hand, was added later via an interface, so the language and the servlet do not yet fit together perfectly. Elasticsearch communicates natively via JSON, and other languages such as Python, Java, .NET, Ruby, and PHP bind to the search platform via a REST-like interface.

Summary

Apache's Solr and Elastic's Elasticsearch are two powerful search platforms that we can fully recommend. Those who place more emphasis on data analysis and operate a dynamic website will get on well with Elasticsearch. You benefit more from Solr if you need a precise full-text search for your domain. With complex variables and customizable filters, you can tailor your vertical search engine exactly to your needs.

Solr and Elasticsearch at a glance:

Type
  • Solr: free open source search platform
  • Elasticsearch: free open source search platform with proprietary versions (free and subscription)

Supported languages
  • Solr: native: Java, XML, HTTP; API: JSON, PHP, Ruby, Groovy, Clojure
  • Elasticsearch: native: JSON; API: Java, .NET, Python, Ruby, PHP

Database
  • Solr: Java libraries, NoSQL, with ontology and taxonomy, especially Lucene
  • Elasticsearch: Java libraries, NoSQL, especially Lucene as well as Hadoop

Nodes & fragment classification
  • Solr: rather static; nodes (with the Solr cloud) and fragments (with SPLITSHARD) must be added manually; from Solr 7 onwards, automatic scaling via interface; control over fragments via ring algorithm; node discovery with the ZooKeeper API
  • Elasticsearch: dynamic; adds nodes and fragments via an internal tool, with less control over leaders; node discovery with the integrated Zen tool

Cache
  • Solr: global cache (applies to all sub-segments in a segment)
  • Elasticsearch: segmented cache

Full-text search
  • Solr: many functions included in the source code, including Lucene functions; request parser; suggestion applications; similarity search; spell check in different languages; compatible with many rich text formats; highlighting
  • Elasticsearch: search is mainly based on Lucene functions; the interface for search suggestions clears up the search mask for end users but is less customizable; spell check and matching via API; less customizable highlighting