Those who need a powerful full-text search usually choose Apache Solr. While this project is still a good choice, there has been an interesting alternative on offer since 2010: Elasticsearch. Just like Solr, Elasticsearch is based on Apache Lucene, but offers other features. Here’s an overview of the features of the search server and an Elasticsearch tutorial on how to implement the full text search in your own project.

What is Elas­tic­search?

Elasticsearch is one of the most important full-text search engines on the internet. Large companies use the software, too: Facebook, for example, has been working successfully with Elasticsearch for several years, and GitHub, Netflix, and SoundCloud also rely on the search engine.

Given the amount of in­for­ma­tion on some websites, a high level of user-friend­li­ness can only be guar­an­teed if a func­tion­al full-text search is used. If you don't want to use Google or Bing to offer your website visitors a search function, you’ll have to embed your own search function instead. This is possible with Elas­tic­search. The open source project is based on the freely available Apache Lucene.

Elasticsearch benefits from having a stable predecessor and extends it with further features. Just as with Lucene, the search works via an index. Instead of examining all documents during a search query, the program checks a previously created index of the documents, in which all contents are stored in a prepared form. This process takes much less time than searching all documents.

While Lucene gives you a com­plete­ly free hand in terms of where and how to use the full text search, this also means that you have to start from scratch. Elas­tic­search, on the other hand, enables you to get started faster. With Elas­tic­search, it is possible to build up a stable search server in a short time, which can also easily be dis­trib­uted to several machines.

Several nodes (different servers) merge to form a cluster. The so-called “sharding” method is used. Elas­tic­search breaks up the index and dis­trib­utes the in­di­vid­ual parts (shards) to several nodes. This also dis­trib­utes the cal­cu­la­tion load – for large projects, the full-text search is much more stable.
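The routing idea behind sharding can be sketched in a few lines of Python. This is only an illustration: the shard count, the node names, and the use of CRC32 as the hash function are assumptions made for this example and do not reproduce the routing that Elasticsearch implements internally.

```python
import zlib

NUM_SHARDS = 5                           # assumed shard count for this example
NODES = ["node-a", "node-b", "node-c"]   # hypothetical node names


def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Map a document ID to a shard number with a stable hash."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def node_for(shard, nodes=NODES):
    """Assign shards to nodes round-robin."""
    return nodes[shard % len(nodes)]


if __name__ == "__main__":
    for doc_id in ["1", "2", "3"]:
        shard = shard_for(doc_id)
        print(doc_id, "-> shard", shard, "on", node_for(shard))
```

Because the hash is stable, the same ID always lands on the same shard, so a later lookup knows which node to ask.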

Elasticsearch is, just like Lucene, written in the object-oriented programming language Java. The search engine outputs search results in JSON format and delivers them via a REST web service. The API makes it very easy to integrate the search function into a website.

In addition, together with Kibana, Beats, and Logstash (collectively known as the Elastic Stack), Elasticsearch offers practical additional services with which you can analyze the full text search. Elastic, the company behind the development of Elasticsearch, also offers paid services such as cloud hosting.

What does Elas­tic­search offer that Google et al. do not?

Google is already a suc­cess­ful option – and you can easily integrate its popular search function into your own website. So why choose the extra work of building your own search engine with Elas­tic­search? One possible answer is that with Google Custom Search (GCS) you are dependent on Google, and have to allow ad­ver­tis­ing (at least in the free version) in the search results. With Elas­tic­search, however, the code is open source, and if you have set up a full text search function, it is yours – you are not dependent on anyone.

You can also customize Elas­tic­search to suit your needs. For example, if you run your own online platform or e-shop, you can set up the search function so that you can also search the profiles of reg­is­tered users using Elas­tic­search. GCS is a bit stretched when used for this purpose.

Elas­tic­search vs. Apache Solr: What are the key dif­fer­ences?

Both Elas­tic­search and Apache Solr are based on Lucene, but have been developed into in­de­pen­dent programs. However, many users wonder which project they should choose. Although Elas­tic­search is a little younger and, unlike Solr, is not supported by the ex­pe­ri­enced Apache community, it has become highly popular. The main reason for this is that it is simpler to implement. Fur­ther­more, Elas­tic­search is par­tic­u­lar­ly popular because of the way it handles dynamic data. Through a special caching procedure, Elas­tic­search ensures that changes do not have to be entered in the entire global cache. Instead, it is enough to change a small segment. This makes Elas­tic­search more flexible.

Ul­ti­mate­ly, however, the decision often only has to do with the different ap­proach­es regarding open source. Solr is fully committed to the Apache Software Foun­da­tion’s “Community over Code” idea. Every con­tri­bu­tion to the code is taken seriously, and the community decides together which additions and en­hance­ments make it into the final code. This is different with the further de­vel­op­ment of Elas­tic­search. This project is also open source and is offered under a free Apache license – however, only the Elastic team de­ter­mines which changes make it into the code. Some de­vel­op­ers do not like this, and so choose Solr.

Elas­tic­search tutorial

When you start working with Elasticsearch, it helps to know a few basic terms first. The way information is structured is especially important:

  • Index: A search query in Elasticsearch never runs against the content itself, but always against the index. The contents of all documents are stored there in a prepared form, so the search takes only a short time. It is an inverted index: for each search term, it records where the term can be found.

  • Document: The index is built from documents that contain the data. They don’t necessarily have to be complete texts (for example blog articles); plain records with information are sufficient.

  • Field: A document consists of several fields. In addition to the actual content field, a document also carries metadata. With Elasticsearch you can search specifically for metadata such as the author or the time of creation.

When we talk about data processing, we’re primarily talking about tokenizing. An algorithm creates individual terms from a complete text. For the machine, there is no such thing as words: a text is one long string, and a letter has the same value for the computer as a space. For a text to be processed meaningfully, it must first be broken down into tokens. Whitespace and hyphens, for example, are treated as word boundaries. In addition, the text must be standardized: words are converted to uniform lowercase, and punctuation marks are ignored. Elasticsearch adopted all of these methods from Apache Lucene.
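To make tokenizing and the inverted index more tangible, here is a minimal Python sketch. It is a simplified model, not the Lucene implementation: the regular expression used for splitting and the sample documents (titles taken from the examples in this tutorial) are assumptions for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and return its runs of word characters as tokens."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(documents):
    """Map every token to the set of document IDs in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


if __name__ == "__main__":
    titles = {
        "1": "La casa de los espíritus",
        "2": "Il nome della rosa",
        "3": "À la recherche du temps perdu",
    }
    index = build_inverted_index(titles)
    print(sorted(index["la"]))   # IDs of all titles containing the token "la"
```

A lookup in `index` touches only one dictionary entry, no matter how many documents were indexed. That is the reason why searching the index is so much faster than scanning every text.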

Note

In the following tutorial we will work with version 6.3.0 of Elas­tic­search. If you are using a different version, some code examples or in­struc­tions might work dif­fer­ent­ly.

In­stal­la­tion

The files needed for Elas­tic­search are freely available on the official Elastic website. The files are offered there as ZIP or tar.gz packages, which is why they can be easily installed under Linux and Mac using a console.

Note

Elastic offers Elasticsearch in two different packages. The standard version also includes paid features that you can try out for a limited trial period. Packages marked with OSS (open source software) contain only free components released under the Apache 2.0 license.

For ZIP:

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-6.3.0.zip
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-6.3.0.zip.sha512
shasum -a 512 -c elasticsearch-oss-6.3.0.zip.sha512 
unzip elasticsearch-oss-6.3.0.zip
cd elasticsearch-6.3.0

For tar.gz:

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-6.3.0.tar.gz
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-6.3.0.tar.gz.sha512
shasum -a 512 -c elasticsearch-oss-6.3.0.tar.gz.sha512 
tar -xzf elasticsearch-oss-6.3.0.tar.gz
cd elasticsearch-6.3.0

First download the package and its SHA-512 checksum file, then verify the download against the checksum in the third step. After that, unpack the package and switch to the corresponding folder.
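If shasum is not available on your system, the same check can be sketched in Python. The function names are our own, and we assume that the .sha512 file starts with the hexadecimal hash followed by the file name, which is the layout that shasum -c expects.

```python
import hashlib


def sha512_of(path, chunk_size=1 << 20):
    """Compute the SHA-512 hex digest of a file without loading it all at once."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum_file(path, checksum_path):
    """Compare the file's digest with the first field of the checksum file."""
    with open(checksum_path, "r", encoding="utf-8") as handle:
        expected = handle.read().split()[0].lower()
    return sha512_of(path) == expected
```

A call such as `matches_checksum_file("elasticsearch-oss-6.3.0.zip", "elasticsearch-oss-6.3.0.zip.sha512")` should return True for an intact download.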

The ZIP archive can also be down­loaded and used for in­stal­la­tion in Windows, because the package contains a batch file that you can run. Al­ter­na­tive­ly, Elastic now also provides an MSI installer, but this is still in its beta phase. The in­stal­la­tion file of the latter contains a graphical interface that guides you precisely through the in­stal­la­tion.

Note

Because Elas­tic­search is based on Java, this pro­gram­ming language must also be installed on your system. Download the Java De­vel­op­ment Kit (JDK) free of charge from the official website.

Now run Elas­tic­search from the console by nav­i­gat­ing to the ap­pro­pri­ate bin folder and typing “elas­tic­search” – whether you are running Linux, Mac, or Windows. Then open a browser of your choice and call the following localhost port: http://localhost:9200/. If you have installed Elas­tic­search correctly and Java is set correctly, you should now be able to access the full text search.

You communicate with Elasticsearch via the REST API, for which you need an appropriate client. Kibana is a good choice (also a free open source offering from Elastic). This program enables you to use Elasticsearch directly in your browser. To do this, simply go to http://localhost:5601/ to access a graphical user interface. How to install and set up Kibana correctly is explained in our Kibana tutorial. In Kibana and any other client, you can use the HTTP methods PUT, GET, POST, and DELETE to send commands to your full-text search.

Index

The first step is creating your index and feeding it with data. You can use two different HTTP methods for this: POST and PUT. You use PUT if you want to specify a specific ID for the entry. With POST, Elas­tic­search creates its own ID. In our example, we would like to set up a bib­li­og­ra­phy. Each entry should contain the name of the author, the title of the work, and the year of pub­li­ca­tion.

POST bibliography/novels
{
"author": "Isabel Allende",
"title": "La casa de los espíritus",
"year": "1982"
}

If you want to use the input like this, you must use the Kibana console. However, if you do not want to use this software, you can also use cURL as an al­ter­na­tive. Instead of the previous command, you must then enter the following in the command line:

curl -XPOST http://localhost:9200/bibliography/novels -H "Content-Type: application/json" -d '{"author": "Isabel Allende", "title": "La casa de los espíritus", "year": "1982"}'

In the following we’ll show you the code for Kibana only – but this can be easily trans­ferred into cURL syntax.
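The translation from Kibana console syntax to cURL follows a fixed pattern, which the following Python sketch makes explicit. The helper name to_curl is our own, and the default host http://localhost:9200 is taken from the installation section above; the function only builds the command line and does not send anything.

```python
import json
import shlex


def to_curl(method, path, body=None, host="http://localhost:9200"):
    """Translate a Kibana console request into an equivalent cURL command line."""
    url = f"{host}/{path.lstrip('/')}"
    parts = ["curl", "-X", method, shlex.quote(url)]
    if body is not None:
        # Keep non-ASCII characters readable and quote the JSON for the shell.
        payload = json.dumps(body, ensure_ascii=False)
        parts += ["-H", shlex.quote("Content-Type: application/json")]
        parts += ["-d", shlex.quote(payload)]
    return " ".join(parts)


if __name__ == "__main__":
    print(to_curl("POST", "bibliography/novels", {
        "author": "Isabel Allende",
        "title": "La casa de los espíritus",
        "year": "1982",
    }))
```

The first line of a Kibana request (method and path) becomes the -X option and the URL, and the JSON body moves into the -d option together with the Content-Type header.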

Elas­tic­search should, if you have entered every­thing correctly, return the following in­for­ma­tion at the beginning of the message:

{
    "_index": "bibliography",
    "_type": "novels",
    "_id": "AKKKIWQBZat9Vd0ET6N1",
    "_version": 1,
    "result": "created"
}

At this point it becomes clear that Elasticsearch now has an index with the name “bibliography” and the type “novels.” Since we used the POST method, Elasticsearch automatically generated a unique ID for our entry. The entry is currently in its first version and has just been created.

Note

A type (_type) used to be a kind of subcategory in Elasticsearch. It was possible to gather several types under one index. However, this led to various problems, so Elastic plans to phase types out. In version 6.x, _type is still included, but it is no longer possible to combine several types under one index. As of version 7.0, the plan is to remove types completely, as the developers explain in their blog.

You can also use PUT to give your entry a very specific ID. This is defined in the first line of the code. You also need PUT if you want to change an existing entry.

PUT bibliography/novels/1
{
"author": "William Gibson",
"title": "Neuromancer",
"year": "1984"
}

The following output is very similar to the one we get with the POST method, but Elas­tic­search gives us the ID we gave to the entry in the first line. The order of spec­i­fi­ca­tion is always _index/_type/_id.

{
    "_index": "bibliography",
    "_type": "novels",
    "_id": "1",
    "_version": 1,
    "result": "created"
}

With the PUT command and the unique ID we can also modify entries:

PUT bibliography/novels/1
{
"author": "William Gibson",
"title": "Count Zero",
"year": "1986"
}

Since the entry with ID 1 already exists, Elas­tic­search only changes it instead of creating a new one. This is also reflected in the output:

{
    "_index": "bibliography",
    "_type": "novels",
    "_id": "1",
    "_version": 2,
    "result": "updated"
}

The version number has increased to 2, and as a result we get “updated” instead of “created.” Of course you can do the same with an ID created by Elas­tic­search at random – but due to the length and order of the char­ac­ters the work would take a lot more effort. The procedure also means that Elas­tic­search simply over­writes an entry if, for example, you get confused with the ID numbers. To avoid ac­ci­den­tal over­writ­ing, you can use the _create endpoint:

PUT bibliography/novels/1/_create
{
"author": "Mary Shelley",
"title": "Frankenstein; or, The Modern Prometheus",
"year": "1818"
}

Since an entry with ID 1 already exists in the index, an error message will be displayed.

If you make changes to an entry as described, you create a completely new entry and must therefore supply all fields again in full. Alternatively, you can merge only the changes into the existing entry. To do this, use the _update endpoint:

POST bibliography/novels/1/_update
{
    "doc": {
        "author": "Franz Kafka",
        "genre": "Horror"
    }
}

Now we have added an additional field to the entry and changed an existing field without deleting the others. This is only how it looks from the outside, however: in the background, Elasticsearch still creates a complete new entry and fills in the existing content itself.
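The difference between a plain PUT, the _create endpoint, and the _update endpoint can be modeled with a small in-memory class. This is a toy model of the behavior described above, written for illustration; the class ToyIndex and its method names are our own, and it says nothing about how Elasticsearch stores data internally.

```python
class ToyIndex:
    """In-memory stand-in for one index (illustration only)."""

    def __init__(self):
        self.docs = {}   # id -> {"_version": int, "_source": dict}

    def index(self, doc_id, source):
        """Like PUT index/type/id: create the entry or replace it completely."""
        existed = doc_id in self.docs
        version = self.docs[doc_id]["_version"] + 1 if existed else 1
        self.docs[doc_id] = {"_version": version, "_source": dict(source)}
        result = "updated" if existed else "created"
        return {"_id": doc_id, "_version": version, "result": result}

    def create(self, doc_id, source):
        """Like PUT index/type/id/_create: refuse to overwrite an entry."""
        if doc_id in self.docs:
            raise KeyError(f"document {doc_id} already exists")
        return self.index(doc_id, source)

    def update(self, doc_id, partial):
        """Like POST index/type/id/_update: merge, then store a full new version."""
        merged = {**self.docs[doc_id]["_source"], **partial}
        return self.index(doc_id, merged)
```

Note that update() ends in the same index() call as a full replacement: the caller sends only the changed fields, but a complete new version of the entry is written.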

In principle, we have so far simply written an entry into a database, and can therefore retrieve it directly. For this we use the GET method.

GET bibliography/novels/1

You can also simply display the entry in the browser:

http://localhost:9200/bibliography/novels/1

In the output Elas­tic­search shows us all details of our entry:

{
    "_index": "bibliography",
    "_type": "novels",
    "_id": "1",
    "_version": 2,
    "found": true,
    "_source": {
        "author": "William Gibson",
        "title": "Count Zero",
        "year": "1986"
    }
}

In addition to the metadata, you will find the fields of the document under _source. Elasticsearch also tells us that an entry has been found. If you try to load a non-existent entry, no error message appears. Instead, Elasticsearch reports “found”: false, and there are no entries under _source.

You also have the option to extract only certain in­for­ma­tion from the database. Let’s assume that you have not only stored bib­li­o­graph­ic data in your index, but also the complete text of each recorded novel. This would also be displayed with a simple GET request. Let's assume you are currently only in­ter­est­ed in the name of the author and the title of the work – then you can only specif­i­cal­ly ask for it:

GET bibliography/novels/1?_source=author,title

If you are not in­ter­est­ed in the metadata of an entry, it is also possible to display only the content:

GET bibliography/novels/1/_source

Let's assume you want to retrieve not just one single entry from your index, but several at the same time. For this, Elasticsearch provides the _mget endpoint (for multi-get). When you use it, specify an array of multiple IDs:

GET bibliography/novels/_mget
{
    "ids": ["1", "2", "3"]
}

Even if one of the entries does not exist, the query as a whole does not fail. All existing entries are displayed. For the non-existent ones, Elasticsearch reports that it cannot find them.
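In client code you will usually want to separate the entries that were found from the IDs that are missing. A minimal sketch in Python, assuming the reply contains a "docs" array whose elements carry "_id", "found", and (if found) "_source"; the sample reply below is hand-written and trimmed to these fields.

```python
def split_mget_response(response):
    """Separate found documents from missing IDs in an _mget-style reply."""
    found, missing = {}, []
    for doc in response.get("docs", []):
        if doc.get("found"):
            found[doc["_id"]] = doc.get("_source", {})
        else:
            missing.append(doc["_id"])
    return found, missing


# Hand-written sample shaped like a multi-get reply (metadata trimmed).
sample = {
    "docs": [
        {"_id": "1", "found": True,
         "_source": {"author": "William Gibson", "title": "Count Zero"}},
        {"_id": "2", "found": False},
    ]
}

if __name__ == "__main__":
    documents, not_found = split_mget_response(sample)
    print(documents, not_found)
```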

Deleting an entry works much like loading it. However, you work with DELETE instead of GET:

DELETE /bibliography/novels/1

In the following output, Elas­tic­search tells you that it found the entry under the specified ID:

{
    "_index": "bibliography",
    "_type": "novels",
    "_id": "1",
    "_version": 5,
    "result": "deleted"
}

The program also increases the version number by one. There are two reasons for this:

  1. Elas­tic­search only marks the entry as deleted and does not remove it directly from the hard disk. The entry will only disappear in the further course of indexing.
     
  2. When working with dis­trib­uted indexes on multiple nodes, detailed version control is extremely important. That's why Elas­tic­search marks every change as a new version – including the deletion request.

You can also restrict a change to a specific version number that you know. If the cluster already holds a newer version than the one you specified, the change attempt results in an error message.

PUT bibliography/novels/1?version=3
{
"author": "Marcel Proust",
"title": "À la recherche du temps perdu",
"year": "1927"
}
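The principle behind this version check (often called optimistic concurrency control) can be sketched with a plain dictionary. The exception class, the function name, and the layout of the store are our own assumptions for illustration, not part of the Elasticsearch API.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is out of date."""


def put_with_version(store, doc_id, source, expected_version):
    """Replace store[doc_id] only if its current version matches."""
    current = store[doc_id]["_version"]
    if current != expected_version:
        raise VersionConflict(
            f"current version is {current}, expected {expected_version}"
        )
    store[doc_id] = {"_version": current + 1, "_source": dict(source)}
    return store[doc_id]["_version"]
```

If two clients both read version 3 and both try to write, only the first write succeeds. The second client gets an error and has to load the current state again before it retries.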

You can not only call several entries at once, but you can also create or delete several entries with _bulk commands. Elas­tic­search uses a slightly modified syntax for this.

POST bibliography/novels/_bulk
{"delete": {"_id": "1"}}
{"create": {"_id": "1"}}
{"author": "Johann Wolfgang von Goethe", "title": "Die Leiden des jungen Werther", "year": "1774"}
{"create": {"_id": "2"}}
{"author": "Umberto Eco", "title": "Il nome della rosa", "year": "1980"}
{"create": {"_id": "3"}}
{"author": "Margaret Atwood", "title": "The Handmaid’s Tale", "year": "1985"}

Each command has its own line. First you specify which action should be performed (create, index, update, delete). You also specify which entry the action applies to and where. It is also possible to work in several indexes with one bulk statement. To do this, you leave the index and type out of the POST path and assign a separate path to each action. When creating entries, you must also specify a request body in a new line. This contains the content of the entry. The delete action does not require a request body, since the complete entry is deleted.
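Since the bulk syntax is line-based, it is convenient to generate the request body programmatically. The following Python sketch assumes that operations are passed as (action, metadata, source) tuples; this tuple layout and the function name are our own choices.

```python
import json


def bulk_body(operations):
    """Serialize (action, metadata, source) tuples into newline-delimited JSON."""
    lines = []
    for action, meta, source in operations:
        lines.append(json.dumps({action: meta}))
        if action != "delete":   # a delete action has no request body line
            lines.append(json.dumps(source, ensure_ascii=False))
    # The body must end with a newline character.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(bulk_body([
        ("delete", {"_id": "1"}, None),
        ("create", {"_id": "2"}, {"author": "Umberto Eco",
                                  "title": "Il nome della rosa",
                                  "year": "1980"}),
    ]), end="")
```

Each JSON object is written on exactly one line, which is why pretty-printed JSON does not work in a bulk request.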

Up to now, we have always entered the content in the same way in the examples, no matter which field it was: Elasticsearch interpreted all information as a coherent character string. But this is not appropriate for every field. That is why Elasticsearch offers mapping. It determines how the algorithms have to interpret an input. You can use the following code to display which mapping is currently used in your index:

GET bibliography/novels/_mapping

All fields are assigned to the types text and keyword. Elas­tic­search knows 6 core datatypes and even more special fields. The 6 main types are partly sub­di­vid­ed into further sub­cat­e­gories:

  • string: This includes both text and keyword. While keywords are con­sid­ered exact matches, Elas­tic­search assumes that a text must be analyzed before it can be used.
     
  • numeric: Elasticsearch knows different numerical types, which differ above all in their range. For example, while the type byte can take values between -128 and 127, long has a range of -2^63 to 2^63-1.

  • date: A date can be specified either to the day or with a time. You can also specify a date in the form of Unix time: seconds or milliseconds since January 1, 1970.

  • boolean: Fields formatted as boolean can have either a true or false value.
     
  • binary: You can store binary data in these fields. To transmit them, use Base64 encoding.
     
  • range: Used to specify a range. This can be between two numerical values, two dates, or even two IP addresses.
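The value ranges of the numeric types and the meaning of Unix time can be checked with a few lines of Python. The helper names are our own; the bit widths (8 for byte, 64 for long) follow from the ranges given above.

```python
from datetime import datetime, timezone


def signed_range(bits):
    """Smallest and largest value of a signed integer with the given width."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def from_epoch_millis(millis):
    """Convert milliseconds since 1970-01-01T00:00:00Z into a UTC datetime."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


if __name__ == "__main__":
    print("byte:", signed_range(8))
    print("long:", signed_range(64))
    print("epoch start:", from_epoch_millis(0).isoformat())
```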

These are only the main cat­e­gories that you will probably use most often. For more types, see the Elas­tic­search doc­u­men­ta­tion. The main dif­fer­ence between the in­di­vid­ual types is that they are either exact-value or full-text. Elas­tic­search un­der­stands the content of the field as either an exact input or a content that has to be processed first. Analyzing is also important in the course of mapping: analyzing a content is divided into to­k­eniz­ing and nor­mal­iz­ing:

  • To­k­eniz­ing: Single tokens are made from a text. These can include in­di­vid­ual words, but also fixed terms from several words.
     
  • Nor­mal­iz­ing: The tokens are nor­mal­ized by lower-casing them all and reducing them to their root forms.

If you have included a document in the index and performed your mapping correctly, all of the content will be included in the inverted index correctly. To define your own mapping, you need to create a completely new index. Changing the mapping of existing fields is not possible.

PUT bibliography
{
    "mappings": {
        "novels": {
            "properties": {
                "author": {
                    "type": "text",
                    "analyzer": "simple"
                },
                "title": {
                    "type": "text",
                    "analyzer": "standard"
                },
                "year": {
                    "type": "date",
                    "format": "year"
                }
            }
        }
    }
}

Here we have defined the two fields author and title as text, that is, as full-text. Therefore, they also need a suitable analyzer. While we give the field for the title of the novel the standard analyzer, we choose the less complex simple analyzer for the author name. We define the year of publication as date, which is an exact value. Since the standard format in Elasticsearch is a date with year, month, and day, we change the format, because in this instance we want to limit the specification to the year.

Search

In the previous section, we used Elasticsearch and its index mainly as a database. However, the real benefit of Elasticsearch is its full-text search. This means that instead of entering the ID of a document to find the entry, we now use Elasticsearch to search for specific content. For this, the program provides the _search endpoint. You can use it in combination with the GET method to display all entries, for example:

GET bibliography/novels/_search

Fact

For more complex search queries, _search uses a request body in curly brackets. However, not every HTTP client supports a body with the GET method. Therefore, the developers have decided that these kinds of requests also work as POST requests.

You can also leave the path blank to search through all existing indexes. In the output you will find the in­ter­est­ing in­for­ma­tion under hits:

"hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
        {
            "_index": "bibliography",
            "_type": "novels",
            "_id": "2",
            "_score": 1,
            "_source": {
                "author": "Umberto Eco",
                "title": "Il nome della rosa",
                "year": "1980"
            }
        }
    ]
}

All other entries in our index are also listed in the output (and only omitted here for the sake of clarity). The response from Elasticsearch provides us not only with the actual content but also with two additional pieces of information that help us to understand the full text search:

  • hits: Every entry that meets the search criteria is considered a hit by Elasticsearch. The program also displays the number of hits. Since there are 3 entries in the index in our example and all of them match, “total” is 3.

  • score: In the form of a score, Elasticsearch indicates how relevant the entry is in relation to our search query. Since we simply searched for all entries in our example, they all have the same score of 1. In the search results, the entries are sorted in descending order of relevance.

In addition, Elas­tic­search provides in­for­ma­tion on how many shards are involved in the search results, how many mil­lisec­onds the search took and whether a timeout has occurred.

If you have already seen the first 10 search results and now want to display only the following 15, you must combine two parameters:

  • size: How many results should Elas­tic­search display?
     
  • from: How many entries should the program skip before dis­play­ing them?

GET bibliography/novels/_search?size=15&from=10
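For classic pagination with pages of equal length, the two parameters can be derived from a page number. The following Python sketch shows the arithmetic; the function names and the 1-based page numbering are our own assumptions.

```python
from urllib.parse import urlencode


def page_params(page, per_page):
    """Return the size/from pair for a 1-based page number."""
    return {"size": per_page, "from": (page - 1) * per_page}


def search_path(index, doc_type, params):
    """Build the request path including the encoded query string."""
    return f"{index}/{doc_type}/_search?{urlencode(params)}"


if __name__ == "__main__":
    # Third page with 10 results per page: skip the first 20 entries.
    print(search_path("bibliography", "novels", page_params(3, 10)))
```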

Elasticsearch distinguishes between two different search types. On the one hand there is a “lite” version, and on the other hand a more complex variant that works with the Query DSL, a specific search language. In the lite version, you enter your search query directly into the query string of the request as a simple string:

GET bibliography/novels/_search?q=atwood

You can also restrict the search to a specific field:

GET bibliography/novels/_search?q=author:atwood

Fact

In fact, in the first example, you also search in a particular field without having to specify it: the _all field. When you insert the contents of a document into the index sorted in fields, Elasticsearch creates an additional field in the background. In it, all contents from the other fields are additionally stored to make this kind of search across all fields possible. Note that the _all field is deprecated as of version 6.0, so the exact behavior may depend on your version and index settings.

If you want to combine several search criteria, use the plus sign (+). With the minus sign (-), you exclude certain criteria. However, if you use these operators, you must apply percent-encoding in the search query:

GET bibliography/novels/_search?q=%2Bauthor%3Aatwood+%2Btitle%3Ahandmaid
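You do not have to do the percent-encoding by hand. As a small sketch, Python's standard library can produce the encoded query string; the helper name lite_query is our own.

```python
from urllib.parse import quote_plus


def lite_query(*clauses):
    """Join query-string clauses with spaces and percent-encode the result."""
    return "q=" + quote_plus(" ".join(clauses))


if __name__ == "__main__":
    print(lite_query("+author:atwood", "+title:handmaid"))
    # -> q=%2Bauthor%3Aatwood+%2Btitle%3Ahandmaid
```

The plus sign of the operator becomes %2B and the colon becomes %3A, while the space between the two clauses is encoded as a literal plus sign.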

Elasticsearch's query string syntax offers even more subtle options to customize your search. In the documentation, the Elastic developers have summarized the essentials you need, which also gives you the opportunity to expand your knowledge.

This search variant is well suited for simple queries, but for more complex tasks the “lite” version won’t be able to cope as easily. The risk of entering an error in the long string is too high. For this, Elas­tic­search offers a more con­ve­nient method in the form of Query DSL. The starting point of this kind of search is the query parameter in com­bi­na­tion with a match query:

GET bibliography/novels/_search
{
    "query": {
        "match": {
            "author": "allende"
        }
    }
}

The result shows us all entries that contain the term “allende” in the author field. It also tells us that the analysis defined in the mapping worked, because Elasticsearch ignores upper and lower case. To display all entries, you can use match_all as an alternative to the simple variant already presented:

GET bibliography/novels/_search
{
    "query": {
        "match_all": {}
    }
}

The opposite of this search is match_none. Elas­tic­search also gives you the pos­si­bil­i­ty to search in several fields with one search term:

GET bibliography/novels/_search
{
    "query": {
        "multi_match": {
            "query": "la",
            "fields": ["author", "title"]
        }
    }
}

To enable complex search requests, you can combine several search terms and weight them differently. Three clause types are available to you:

  • must: The search term must be in the result.
  • must_not: The search term may not be in the result.
  • should: If this term appears, the relevance in the search results is increased.

In practice, you will combine this with a Boolean query:

GET bibliography/novels/_search
{
    "query": {
        "bool": {
            "must": {
                "match": {
                    "title": "la"
                }
            },
            "must_not": {
                "match": {
                    "title": "rabbit"
                }
            },
            "should": {
                "match": {
                    "author": "allende"
                }
            }
        }
    }
}
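How the three clause types interact can be illustrated with a toy evaluator in Python. It is a deliberately simplified model: matching is a plain word lookup, and the ranking only counts the should clauses that match, which is an assumption for illustration and not the relevance scoring that Elasticsearch actually uses.

```python
import re


def matches(doc, field, term):
    """True if the lowercased term occurs as a word in the given field."""
    return term.lower() in re.findall(r"\w+", str(doc.get(field, "")).lower())


def bool_search(docs, must=(), must_not=(), should=()):
    """Filter docs by must/must_not and rank them by matching should clauses."""
    results = []
    for doc in docs:
        if not all(matches(doc, f, t) for f, t in must):
            continue   # a must clause is missing
        if any(matches(doc, f, t) for f, t in must_not):
            continue   # a must_not clause is present
        bonus = sum(matches(doc, f, t) for f, t in should)
        results.append((bonus, doc))
    results.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in results]
```

Called with must=[("title", "la")], must_not=[("title", "rabbit")] and should=[("author", "allende")], the function keeps every title containing "la" and moves the entry by Allende to the top.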

You can also expand your search by adding a filter. This allows you to specify criteria that restrict the search results:

GET bibliography/novels/_search
{
    "query": {
        "bool": {
            "must": {
                "match": {
                    "title": "la"
                }
            },
            "filter": {
                "range": {
                    "year": {
                        "gte": "1950",
                        "lt": "2000"
                    }
                }
            }
        }
    }
}

In the previous example, we tied the filter to a range: only entries published from 1950 onward and before 2000 are displayed.
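If you generate such requests from application code, it is less error-prone to build the body as a data structure and serialize it than to assemble a string. A minimal Python sketch; the function names are our own, and the field name year is taken from the example index.

```python
import json


def filtered_match(field, term, year_from, year_before):
    """Build a bool query with one match clause and a year range filter."""
    return {
        "query": {
            "bool": {
                "must": {"match": {field: term}},
                "filter": {
                    "range": {
                        "year": {"gte": str(year_from), "lt": str(year_before)}
                    }
                },
            }
        }
    }


def year_passes(year, year_from, year_before):
    """Mirror the filter: gte includes the lower bound, lt excludes the upper."""
    return year_from <= int(year) < year_before


if __name__ == "__main__":
    print(json.dumps(filtered_match("title", "la", 1950, 2000), indent=4))
```

With these bounds, a novel from 1950 passes the filter, while one from 2000 does not.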

Summary

With this tool you have every­thing you need to implement the full text search for your project. However, Elas­tic­search also offers other methods to refine your search and make it more complex. You can find more in­for­ma­tion on the official website of Elastic. If you want to extend the full text search, you can also create your own scripts with other languages like Groovy and Clojure.

Pros and cons of Elas­tic­search

Elas­tic­search can be a powerful full text search, with the minor setback that the open source idea is now somewhat re­strict­ed. Otherwise the full text search offers numerous ad­van­tages – also compared to the direct com­peti­tor Apache Solr.

Pros:

  • Open source
  • Quick and stable
  • Scalable
  • Many ready-made program modules (Analyzer, full text search...)
  • Easy implementation thanks to Java, JSON, and REST-API
  • Flexible and dynamic

Cons:

  • Elastic as the gatekeeper