In our modern information society, data, facts, and knowledge have a much higher priority than they did half a century ago. Thanks to the internet, information is more and more accessible. Most of the information we access is retrieved from online sources, which is where search engines come in. How do they find the information they provide us? The answer is called "Information Retrieval". Gathering information – more precisely, information retrieval – is a discipline within computer science and information science and, above all, of great importance for search engines. Using complex information retrieval systems, they recognize the intentions behind specific search terms and find data relevant to search queries.

The History of Gathering Information

Information retrieval is about making existing knowledge accessible. This has been the case since long before the dawn of the digital age. Vannevar Bush, one of the first people who seriously considered how humanity could make its collected knowledge more accessible in the face of an ever-changing world, published a groundbreaking article at the dawn of the information age (1945) titled "As We May Think". The article presented a vision for the future of information gathering and organization.

Bush saw the following problem in the sciences: experts are specializing more and more, and need more information – but because of the differentiation caused by this extreme specialization, the information is increasingly hard to find. Of course, this was at a time when libraries were still organized with analog paper boxes and large catalogs. A keyword search was only possible if a diligent librarian had already bothered to manually index all works. Bush saw a way to make information more accessible using the technical developments available at the time, such as microfilm. His vision was the Memex, a desk-sized machine that would serve as a vault for knowledge and become a serious piece of research equipment. The Memex was never built, but its core idea (users jumping from one article to the next) can be seen as a forerunner of hypertext.

In the 1950s, the computer scientist Hans Peter Luhn dealt specifically with the task of obtaining information and developed techniques that are still relevant today: full-text processing, auto-indexing, and selective dissemination of information (SDI) have their roots in his research. These methods were very important for the development of the internet, as information retrieval systems are essential for navigating the oceans of information available online. Without them, you would never be able to find the answers you are looking for.

Information Retrieval – a Definition

The aim of information retrieval (IR) is to make machine-stored data discoverable: unlike data mining, which extracts structures from existing records, IR is concerned with filtering specific information from a set of data. The typical application is an internet search engine. Information retrieval systems solve two central issues:

  1. Vagueness: User inquiries are often inaccurate. The search terms entered by a user often leave a lot of room for interpretation. For example, those who search for the term "bank" may be looking for general banking information or may require directions to the nearest financial institution. The problem is compounded when users themselves are not sure what information they are looking for.

  2. Uncertainty: Contents of stored information are sometimes unknown to the system, which leads to users being presented with the wrong results. This happens, for example, with homonyms – words that have multiple meanings. The user might not be looking for a financial institution but for information on a geographical feature relating to rivers.

In addition, the information retrieval system should also evaluate information in order to present users with a ranked sequence of results. The first result should ideally provide the best answer to the user's question.

Different Models

There are various information retrieval models that are not necessarily mutually exclusive and can be combined with each other. Some of these models only differ slightly in details. However, they can all still be roughly categorized into three groups:

  • Set Theory Models: Similarity relationships are determined by set operations (Boolean model).
  • Algebraic Models: Similarities are determined in pairs: documents and search queries can be represented as vectors, matrices, or tuples (vector space model).
  • Probabilistic Models: These models establish similarity by viewing the data sets as multi-stage random experiments.

Below, we will introduce the three archetypal models for these categories. Many existing models are hybrids of the three types. For example, the extended Boolean model has properties of set-theoretic as well as algebraic models.

Boolean Model

The most popular search engines on the web are based on the Boolean principle. These are logical operators that help users refine and pinpoint a search. With AND, OR, and NOT, or the corresponding symbols ∧, ∨, and ¬, a query can specify, for example, that both terms should appear in a result, or that content with a certain term should be hidden. These operators also work on Google, according to the same principle. The disadvantage of this system is that it doesn't contain any ranking system for its results.
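The set operations behind Boolean retrieval can be sketched in a few lines; the corpus and inverted index below are hypothetical, chosen only to illustrate the principle:

```python
# A tiny Boolean retrieval sketch: documents are indexed into term -> doc-ID
# sets, and queries are answered with set operations (no ranking).
docs = {
    "doc1": "the bank approved the loan",
    "doc2": "we walked along the river bank",
    "doc3": "the bank raised its interest rates",
}

# Build an inverted index: each term maps to the set of documents containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def matches(term):
    """Return the set of documents that contain the term (1 = occurs, 0 = not)."""
    return index.get(term, set())

print(matches("bank") & matches("loan"))   # AND: both terms must occur
print(matches("bank") - matches("river"))  # NOT: hide river-related content
print(matches("loan") | matches("rates"))  # OR: either term suffices
```

Note that each answer is just an unordered set: the model says only whether a document matches, not how well.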

Vector Space Model

In a mathematical approach, content can also be represented as vectors. In the vector space model, terms are mapped as coordinate axes. Both documents and search queries receive specific values related to each term and can be represented as points or vectors within a vector space. Subsequently, the two vectors are compared to each other. The vector (or content) closest to that of the query should appear first in the result rankings. The disadvantage here is that without Boolean operators, no terms can be excluded.
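As a sketch, this comparison can be done with cosine similarity between term-count vectors; the vocabulary, documents, and query below are hypothetical:

```python
import math

# Vector space model sketch: each axis is a term, each document and the
# query become count vectors, and cosine similarity measures closeness.
vocab = ["elephant", "giraffe", "zoo", "circus"]

def vectorize(text):
    """Map a text to a vector of term counts along the vocabulary axes."""
    words = text.split()
    return [words.count(term) for term in vocab]

def cosine(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = {
    "A": "elephant giraffe zoo elephant",
    "B": "circus elephant circus",
}
query = vectorize("elephant giraffe")

# Rank documents by how close their vector is to the query vector.
ranked = sorted(docs, key=lambda d: cosine(vectorize(docs[d]), query), reverse=True)
print(ranked)
```

Document A, which mentions both query terms, lands ahead of document B, which shares only one term with the query.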

Probabilistic Model

The probabilistic model makes use of probability theory. Each document is assigned a probability value, and the results are then sorted according to the probability with which they fit the search. How high the chances are that certain content corresponds to a user's wishes is determined by so-called "relevance feedback". For example, users may be asked to rate the results manually. At the next identical search, the model will show a different (perhaps better) result list. A disadvantage of this procedure is that it rests on two assumptions, neither of which is guaranteed. On one hand, the model presupposes that users are willing to participate in the system by giving feedback. On the other hand, the theory also assumes that users view the results independently of one another, evaluating the content of each source as though it were the first they had read in the search. In practice, users always value information based on previously viewed content or held knowledge.
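One way such relevance feedback might feed a ranking can be sketched as follows. The past user ratings are hypothetical, and the smoothed-counting estimate is an illustrative assumption on our part, not the method of any particular production system:

```python
# Relevance feedback sketch: each document's score is an estimate of the
# probability that it is relevant, based on past user ratings (1 = rated
# relevant, 0 = rated irrelevant). Laplace smoothing (+1/+1) keeps unrated
# documents at a neutral 0.5 rather than zero.
feedback = {
    "doc1": [1, 1, 0],  # mostly rated relevant in past identical searches
    "doc2": [0, 0],     # rated irrelevant so far
    "doc3": [],         # no feedback yet
}

def relevance_probability(ratings, alpha=1, beta=1):
    """Smoothed estimate: (relevant votes + alpha) / (votes + alpha + beta)."""
    return (sum(ratings) + alpha) / (len(ratings) + alpha + beta)

# Re-rank: documents with better feedback rise; unrated ones sit in between.
ranking = sorted(feedback, key=lambda d: relevance_probability(feedback[d]), reverse=True)
print(ranking)
```

After each round of ratings the probabilities shift, so the next identical search can produce a different result list, exactly as described above.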

Operating Principles of Information Gathering

With information retrieval, different methods and techniques are used, independently of the models. The goal is always to simplify information searches for the user and to deliver more relevant results.

Term Frequency, Inverse Document Frequency

The importance of a term for a search query is calculated by combining the term frequency (how often a term occurs) with the inverse document frequency. The combined value is abbreviated as tf-idf.

  • Term Frequency: Search word density indicates how often a term appears in a document. However, raw frequency alone cannot indicate how relevant a text is, as a long text may contain a word many times simply because of its length, not because of the relevance of its content. Therefore, the frequency should be calculated relative to the document's length: the number of occurrences of the search term is divided by the number of occurrences of the most frequent word in the document (e.g. "and").
  • Inverse Document Frequency: For IDF, the complete body of texts is considered, rather than just a single document. Words that are found in only a few documents have a higher relevance than terms that appear in almost all texts. For example, the term "inverse document frequency" has a much higher value than "and".

By combining the two measures, information retrieval systems can provide better results than either could alone: if only the term frequency mattered, then the search query "The TV show with the mouse" would prioritize content in which the words "the" and "with" appear. That would obviously be unhelpful. In contrast, once the inverse document frequency is factored in, "TV show" and "mouse" become much more important for the search and are recognized as the actual search terms.
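The calculation described above can be sketched in a few lines; the corpus is hypothetical, and tf is normalized by the most frequent word in each document, as described:

```python
import math

# tf-idf sketch over a tiny hypothetical corpus.
docs = [
    "the tv show with the mouse",
    "the mouse ran into the house",
    "the tv broke",
]
tokenized = [d.split() for d in docs]

def tf(term, words):
    """Term frequency, normalized by the document's most frequent word."""
    max_count = max(words.count(w) for w in words)
    return words.count(term) / max_count

def idf(term):
    """Inverse document frequency: rare terms score high, ubiquitous ones low."""
    containing = sum(1 for words in tokenized if term in words)
    return math.log(len(tokenized) / containing) if containing else 0.0

def tf_idf(term, words):
    return tf(term, words) * idf(term)

# "the" appears in every document, so its idf (and therefore tf-idf) is zero,
# while the rarer "mouse" gets a positive score.
print(tf_idf("the", tokenized[0]), tf_idf("mouse", tokenized[0]))
```

This is why "mouse" outranks "the" as a search term even though "the" occurs more often in the first document.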

Query Modification

A major problem in information gathering is the behavior of users themselves: vague or inaccurate requests bring up incorrect or inadequate information. To avoid this, information scientists have introduced query modification, a system that automatically changes the entered search query. That means, for example, that synonyms are substituted where they provide better results. The system makes use of thesauri and user feedback to find these synonyms. To avoid depending on user cooperation, systems can use so-called "pseudo feedback": the system reads related terms from the best search results and rates them as relevant to the search. Queries can be extended or improved using the following techniques:

  • Stop Word Elimination: Stop words are expressions that contribute only insignificantly to the content of a text. It makes sense not to treat words like "and" or articles such as "the" as representative of a document's content.
  • Multi-Word Group Identification: Groupings of words must be recognized as such. This identification ensures that the search engine also considers parts of compound terms to be relevant.
  • Root and Root Form Reduction (Stemming): To search more effectively, words are reduced to their root forms. Otherwise, inflectional forms of a word would not appear correctly in the search results.
  • Thesaurus: In addition to the terms used in the relevant document, an information retrieval system should also consider synonyms of the word as relevant. This is the only way to ensure that users find what they are looking for.
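The techniques above can be combined in a single pipeline. The stop word list, the crude suffix stripper, and the two-entry thesaurus below are purely illustrative assumptions, not a real stemmer (such as Porter's) or a real thesaurus:

```python
# Query modification sketch: stop word elimination, toy stemming, and
# thesaurus-based synonym expansion. All word lists are illustrative.
STOP_WORDS = {"the", "a", "an", "and", "with", "of"}
THESAURUS = {"elephant": ["pachyderm"], "tv": ["television"]}

def stem(word):
    """Toy root-form reduction: strip a few common English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def modify_query(query):
    """Drop stop words, reduce terms to roots, and expand with synonyms."""
    terms = [w for w in query.lower().split() if w not in STOP_WORDS]
    expanded = []
    for term in terms:
        stemmed = stem(term)
        expanded.append(stemmed)
        expanded.extend(stem(s) for s in THESAURUS.get(stemmed, []))
    return expanded

print(modify_query("The TV show with the mouse"))
```

The modified query keeps only the meaningful roots and their synonyms, so the system searches for what the user most likely meant rather than for "the" and "with".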

Recall & Precision

The effectiveness of an information retrieval system is commonly measured using recall and precision. Both are represented as quotients.

  • Recall: How complete are the search results? To determine this, the number of relevant documents found is compared with the total number of relevant documents, found and not found. The quotient indicates how probable it is that a relevant document will be found: Recall = (relevant documents found) / (all relevant documents).
  • Precision: How precise are the search results? To determine this, the number of relevant documents found is compared with the total number of documents found, relevant and irrelevant. The quotient indicates how probable it is that a found document is relevant: Precision = (relevant documents found) / (all documents found).

Both values lie between 0 and 1, where 1 would be a perfect value. In practice, however, perfect results for both quotients are mutually exclusive: increasing the completeness of the search results comes at the expense of precision, and vice versa. Additionally, the fallout (i.e., the false positive rate) can be calculated as a third value: this quotient is determined by the ratio of irrelevant documents found to all irrelevant documents in the collection. Recall and precision can be represented in a diagram in which each of the two values occupies one axis.
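For a single hypothetical search, the three quotients can be computed directly from the document sets:

```python
# Recall, precision, and fallout for one (hypothetical) search result.
relevant = {"d1", "d2", "d3", "d4"}               # all relevant documents
retrieved = {"d1", "d2", "d5"}                    # what the system returned
all_docs = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}

true_positives = relevant & retrieved             # found AND relevant

recall = len(true_positives) / len(relevant)                    # found / all relevant
precision = len(true_positives) / len(retrieved)                # relevant / all found
fallout = len(retrieved - relevant) / len(all_docs - relevant)  # false positive rate

print(recall, precision, fallout)
```

Here the system finds only half of the relevant documents (recall 0.5), two thirds of what it returns is relevant (precision about 0.67), and a quarter of the irrelevant collection slips into the results (fallout 0.25).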

Information Retrieval: Example of a Search

Every internet search engine is based on information retrieval. Google, Bing, and Yahoo are prominent examples of computer-aided information gathering. However, to show how IR works in practice, it makes sense to take a simpler example of our own: a search in a (very small) children's library. Animals appear in all of the books, but we only want to find books that feature elephants and giraffes, and no crocodiles. A search with the Boolean method would look like this: elephant AND giraffe NOT crocodile. The result for each term can only ever be 1 or 0: does the term occur, or not?

The result of the search would be "Tim & Olli at the Zoo" and "Michael and the Crazy Circus". However, the Boolean search does not weight the results. Which book features elephants and giraffes more prominently? To determine this, the system can calculate the term frequency and the inverse document frequency.

"Tim & Olli at the Zoo" is probably the better answer when looking for a text about giraffes and elephants rather than "Michael and the Crazy Circus", and should come first in the search results. The method we used here only works if the search terms are fixed (controlled indexing). This could happen, for example, in specialized databases where users are trained in how to use a search mask. In our example, a query modification would make sense: aside from "elephant", a search for "pachyderm" as well as grammatical variants of these words would also yield positive results.

Tip

There are a lot more search engines on the World Wide Web than just Google. Alternatives to Google, for example, often pay more attention to user privacy.
