Lucene is a program library published by the Apache Software Foundation. It is open source and free for everyone to use and modify. Originally, Lucene was written entirely in Java, but there are now also ports to other programming languages. Apache Solr and Elasticsearch are powerful search platforms built on top of Lucene that extend its search capabilities even further.
Lucene is a full-featured text search engine library. Put simply, a program searches a collection of text documents for one or more terms that the user has specified. This shows that Lucene is not used solely in the context of the World Wide Web, even though that is where search is most commonly encountered. Lucene can also be used for archives, libraries, or even on your home desktop PC. It not only searches HTML documents, but also works with e-mail and PDF files.
An index – the heart of Lucene – is decisive for the search, since all terms of all documents are stored here. In principle, an inverted index is simply a table: for each term, it records the positions at which that term occurs. In order to build an index, the terms first have to be extracted – all terms must be taken from all the documents and stored in the index. Lucene gives users the ability to configure this extraction individually: during configuration, developers decide which fields they want to include in the index. To understand this, you have to take a step back.
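As a purely illustrative sketch (the terms, document IDs, and positions below are made up, and Lucene's actual index format is far more compact), an inverted index can be thought of as a map from each term to the places where it occurs:

```java
import java.util.List;
import java.util.Map;

public class InvertedIndexSketch {
    public static void main(String[] args) {
        // Toy inverted index: each term points to the documents and positions it occurs in.
        Map<String, List<String>> index = Map.of(
                "lucene", List.of("doc1:0", "doc3:7"),
                "search", List.of("doc1:2", "doc2:1", "doc3:8"),
                "index",  List.of("doc2:4"));

        // Looking up a term is a single lookup instead of scanning every document.
        System.out.println(index.get("search")); // [doc1:2, doc2:1, doc3:8]
    }
}
```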
The objects that Lucene works with are documents of every kind. From Lucene's point of view, each document consists of fields. These fields contain, for example, the name of the author, the title of the document, or the file name. Each field has a name and a value. For example, the field with the name title can have the value "Instructions for use for Apache Lucene." So when creating the index, you can decide which metadata you want to include.
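As a rough sketch of what this looks like in code, the following example adds one document with a title, author, and body field to an index using Lucene's IndexWriter. The field names, sample values, and the index directory path are placeholders, and details may vary slightly between Lucene versions.

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class IndexDemo {
    public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get("lucene-index"));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            // Each field has a name and a value; the field type decides whether the
            // value is tokenized for the inverted index and/or stored for retrieval.
            doc.add(new TextField("title", "Instructions for use for Apache Lucene", Field.Store.YES));
            doc.add(new StringField("author", "Jane Doe", Field.Store.YES));
            doc.add(new TextField("body", "Lucene stores all terms in an inverted index.", Field.Store.NO));
            writer.addDocument(doc);
        }
    }
}
```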
When documents are indexed, tokenization also takes place. For a machine, a document is at first just a collection of data. Even if you leave the level of bits behind and look at content that humans can read, a document is still a series of characters: letters, punctuation marks, spaces.
Tokenization turns this mass of data into tokens – the units (mostly single words) that can later be searched for. The simplest tokenization strategy is based on whitespace: a term ends wherever a space occurs. However, this does not work for fixed expressions that consist of several words, such as "Christmas Eve." Additional dictionaries, which can also be plugged into the Lucene code, are used to handle such cases.
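A minimal sketch of whitespace tokenization with Lucene's WhitespaceAnalyzer might look like this; the field name and the sample text are arbitrary. Note how "Christmas Eve" is split into two separate tokens, which illustrates the limitation mentioned above.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizeDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new WhitespaceAnalyzer();
        // Split the text into tokens wherever whitespace occurs.
        try (TokenStream stream = analyzer.tokenStream("body", "We wish you a merry Christmas Eve")) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term); // "Christmas" and "Eve" come out as separate tokens
            }
            stream.end();
        }
        analyzer.close();
    }
}
```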
Normalization is another part of the analysis that Lucene performs alongside tokenization. It means that terms are written in a standardized form, e.g. all uppercase letters are converted to lowercase. Lucene also ranks the results, using various algorithms such as TF-IDF. As a user, you probably want to get the most relevant or most recent results first – the search engine's ranking algorithms make this possible.
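One way to get such normalization is to build an analyzer whose chain adds a lowercase filter after the tokenizer. The sketch below uses Lucene's CustomAnalyzer with the registered component names "standard" and "lowercase"; exact package names can differ between Lucene versions.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NormalizeDemo {
    public static void main(String[] args) throws Exception {
        // Tokenize with the standard tokenizer, then write every token in lower case.
        Analyzer analyzer = CustomAnalyzer.builder()
                .withTokenizer("standard")
                .addTokenFilter("lowercase")
                .build();
        try (TokenStream stream = analyzer.tokenStream("title", "Instructions for use for Apache Lucene")) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term); // e.g. "apache", "lucene" – all lower case
            }
            stream.end();
        }
        analyzer.close();
    }
}
```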
For users to find anything at all, they must enter a search term in a text field. In the Lucene context, the term or terms entered are called a query. The word "query" indicates that the input need not consist only of one or more words, but can also contain modifiers such as AND, OR, or + and –, as well as wildcards. The QueryParser – a class within the program library – translates the input into a concrete search request for the search engine. Developers can also configure the QueryParser: the parser can be set up so that it is tailored exactly to users' needs.
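A small sketch of how the QueryParser can be used; the default field name "body" and the query string are only examples (the class lives in the separate lucene-queryparser module):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class QueryDemo {
    public static void main(String[] args) throws Exception {
        // "body" is the default field searched when the user does not name one;
        // the analyzer should match the one used while indexing.
        QueryParser parser = new QueryParser("body", new StandardAnalyzer());
        // Modifiers (AND, OR, -) and wildcards (*) are translated into a structured query.
        Query query = parser.parse("lucene AND (index OR search) -tutorial titl*");
        System.out.println(query);
    }
}
```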
One thing Lucene introduced that was genuinely new is incremental indexing. Before Lucene, only batch indexing was possible, with which you could only build complete indexes from scratch. Incremental indexing, by contrast, lets you update an existing index: individual entries can be added or removed.
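For example, assuming an index like the one sketched earlier and a unique "id" field of your own choosing, individual entries could be added, replaced, or removed roughly like this:

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

public class IncrementalUpdateDemo {
    public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get("lucene-index"));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("id", "42", Field.Store.YES));
            doc.add(new TextField("title", "Instructions for use for Apache Lucene, 2nd edition", Field.Store.YES));

            // Replace the entry whose "id" is 42 (or add it if it does not exist yet).
            writer.updateDocument(new Term("id", "42"), doc);
            // Remove all entries whose "id" is 17.
            writer.deleteDocuments(new Term("id", "17"));
            // Make the incremental changes visible without rebuilding the whole index.
            writer.commit();
        }
    }
}
```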