Index management with robots.txt files

Professional website operators generally aim to make their sites more visible to search engines. One requirement for this is making sure that all relevant URLs can be read by search bots and then properly indexed. While this may sound like a simple task, search engines only rarely crawl websites in full. Even Google’s capacities for gathering and storing website content are limited. Instead, every domain is allotted a certain crawl budget, which determines how many URLs are read and, where appropriate, indexed. Operators of large websites are advised to approach this topic strategically by signaling to search bots which areas of a site should be crawled and which should be ignored. Important tools for index management include robots directives in meta tags, canonical tags, redirects, and the robots.txt file, which is the subject of this tutorial.

What is robots.txt?

robots.txt is a text file that’s stored in the root directory of a domain. By blocking some or all search robots from selected parts of a site, it allows website operators to control search engines’ access to the site. The information in the robots.txt file applies to the entire directory tree. This sets it apart from robots meta tags and redirects, which only apply to individual HTML documents. The word ‘block’ deserves special attention in this context: search engines treat the robots.txt file merely as a guideline, so it can’t force any specific crawling behavior upon them. Google and other large search engines state that they heed these instructions. However, the only way to reliably prevent unwanted access is to implement strong password protection.

Creating a robots.txt

In order to give search bots access to individual crawling guidelines, a plain text file has to be named ‘robots.txt’ and stored in the domain’s root directory. If, for example, crawling guidelines are to be defined for the domain example.com, then the robots.txt needs to be stored in the root directory that is served as www.example.com. When accessed over the internet, the file can then be found at www.example.com/robots.txt. If the hosting model for the website doesn’t offer access to the server’s root directory, and instead only to a subfolder (e.g. www.example.com/user/), then index management with a robots.txt file isn’t possible. Website operators setting up a robots.txt should use a plain text editor, like vi (Linux) or notepad.exe (Windows); when transferring the file via FTP, it’s also important to make sure that it’s transferred in ASCII mode. Alternatively, the file can be created online with a robots.txt generator. Given that syntax errors can have devastating effects on a web project’s indexing, it’s recommended to test the text file prior to uploading it. Google’s Search Console offers a tool for this.
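Simple syntax slips can also be caught locally before the file is uploaded. The following is a minimal Python sketch of such a check, not a replacement for a search engine’s own testing tool. The function name lint_robots_txt and the set of accepted keywords are assumptions made for illustration; the set only covers the keywords used in this tutorial.

```python
# Assumption: only the keywords covered in this tutorial count as known.
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap"}


def lint_robots_txt(text):
    """Return a list of (line_number, message) pairs for suspicious lines."""
    problems = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        field, separator, _value = line.partition(":")
        field = field.strip().lower()
        if not separator:
            problems.append((number, "missing ':' separator"))
        elif field not in KNOWN_FIELDS:
            problems.append((number, f"unknown field '{field}'"))
        elif field == "user-agent":
            seen_user_agent = True
        elif field in ("allow", "disallow") and not seen_user_agent:
            problems.append((number, f"'{field}' appears before any user-agent line"))
    return problems


# Hypothetical draft with a typical slip: 'user agent' without the hyphen.
draft = """\
# draft robots.txt
user agent: Googlebot
disallow: /temp/
"""

for number, message in lint_robots_txt(draft):
    print(f"line {number}: {message}")
```

In this draft the misspelled keyword is reported on line 2, and the disallow rule on line 3 is flagged as well because no valid user-agent line precedes it.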

robots.txt structure

Every robots.txt file contains records that are composed of two parts. The first part is introduced with the keyword user-agent and addresses a search bot, which is then given instructions in the second part. These instructions are crawling bans: introduced by the keyword disallow, each one names a directory or a file. The result is the following basic structure:

user-agent: Googlebot
disallow: /temp/ 
disallow: /news.html
disallow: /print 

The robots.txt in the example above only applies to the web crawler named ‘Googlebot’ and ‘prohibits’ it from reading the directory /temp/ and the file news.html. Additionally, all files and directories with paths beginning with /print are blocked. Notice that disallow: /temp/ and disallow: /print differ syntactically only in the trailing slash (/); this small difference changes the meaning considerably: /temp/ blocks only the contents of that directory, while /print blocks every path that starts with those characters.
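The effect of the trailing slash can be checked with Python’s standard library module urllib.robotparser, which matches rules by path prefix. This is only a sketch: the tested paths (such as /temp/cache.html and /print.html) are hypothetical and chosen purely for illustration.

```python
from urllib.robotparser import RobotFileParser

rules = """\
user-agent: Googlebot
disallow: /temp/
disallow: /news.html
disallow: /print
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def allowed(path, agent="Googlebot"):
    """Return True if the agent may fetch the given path on www.example.com."""
    return parser.can_fetch(agent, "http://www.example.com" + path)


# Hypothetical paths used only to illustrate prefix matching.
for path in ["/temp/cache.html", "/temp.html", "/print/page.html",
             "/print.html", "/news.html"]:
    print(path, "->", "allowed" if allowed(path) else "blocked")
```

Here /temp.html stays crawlable because it does not start with /temp/, whereas both /print/page.html and /print.html are blocked by the shorter prefix /print.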

Inserting comments

robots.txt files can be supplemented with comments, if needed. These are introduced with a preceding hash sign (#).

# robots.txt for http://www.example.com

user-agent: Googlebot
disallow: /temp/ # directory contains temporary data 
disallow: /print/ # directory contains print pages
disallow: /news.html # file changes daily 

Addressing multiple user agents

Should multiple user agents be addressed, the robots.txt can contain any number of blocks written in accordance with this structure, each separated by a blank line:

# robots.txt for http://www.example.com

user-agent: Googlebot
disallow: /temp/

user-agent: Bingbot
disallow: /print/

While Google’s web crawler is prohibited from crawling the directory /temp/, the Bing bot is prevented from crawling /print/.
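That each block only affects the crawler it names can be verified with a short Python sketch based on urllib.robotparser. The page name a.html is a hypothetical placeholder.

```python
from urllib.robotparser import RobotFileParser

rules = """\
user-agent: Googlebot
disallow: /temp/

user-agent: Bingbot
disallow: /print/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

base = "http://www.example.com"
# a.html is a hypothetical page name used for illustration only.
results = {
    (agent, path): parser.can_fetch(agent, base + path)
    for agent in ("Googlebot", "Bingbot")
    for path in ("/temp/a.html", "/print/a.html")
}

for (agent, path), ok in results.items():
    print(f"{agent:10} {path:15} {'allowed' if ok else 'blocked'}")
```

Each crawler is blocked only from the directory listed in its own block and may still fetch the other one.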

Addressing all user agents

Should a certain directory or file need to be blocked for all web crawlers, an asterisk (*) is used as a wildcard for all user agents.

# robots.txt for http://www.beispiel.de

user-agent: *
disallow: /temp/
disallow: /print/
disallow: /pictures/

The robots.txt file blocks the directories /temp/, /print/ and /pictures/ for all web crawlers.

Excluding all directories from indexing

Should a website need to completely block all user agents, then just a slash after the keyword disallow is needed.

# robots.txt for http://www.beispiel.de

user-agent: *
disallow: /

All web crawlers are instructed to ignore the entire website. Such robots.txt files can be used, for example, for web projects that are still in their test phase.

Allowing indexing for all directories

Website operators can allow search bots to crawl and index an entire website by using the keyword disallow without a slash:

# robots.txt for http://www.example.com

user-agent: Googlebot
disallow: 

If the robots.txt file contains a disallow without a slash, then the entire website is freely available to the web crawlers defined under user-agent.
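The difference between disallow: / and an empty disallow comes down to a single character, so a quick comparison can be useful. The sketch below uses urllib.robotparser and, as an assumption for brevity, addresses all crawlers with the asterisk in both files; the helper name build, the crawler name AnyBot, and the tested URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def build(rules):
    """Parse robots.txt text and return the ready-to-query parser."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser


block_all = build("user-agent: *\ndisallow: /\n")
allow_all = build("user-agent: *\ndisallow:\n")

# Hypothetical crawler name and URL, used for illustration only.
url = "http://www.example.com/any/page.html"
print("disallow: /  ->", block_all.can_fetch("AnyBot", url))
print("disallow:    ->", allow_all.can_fetch("AnyBot", url))
```

With the slash the page is reported as blocked; with the empty value it is reported as crawlable.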

Table 1: Basic robots.txt functions
Command | Example | Function
user-agent: | user-agent: Googlebot | Addresses a specific web crawler
 | user-agent: * | Addresses all web crawlers
disallow: | disallow: | The entire website can be crawled
 | disallow: / | The entire website is blocked
 | disallow: /directory/ | A specific directory is blocked
 | disallow: /file.html | A specific file is blocked
 | disallow: /example | All directories and files with paths beginning with /example are blocked

Further functions

In addition to the de facto standard functions listed above, search engines also support some additional parameters that can be specified in the robots.txt.

The following functions are described in Google’s support section. They are based on an agreement made with Microsoft and Yahoo!.

Defining exceptions

In addition to disallow, Google also supports allow, a further keyword in the robots.txt, which enables exceptions for blocked directories to be defined.

# robots.txt for http://www.example.com

user-agent: Googlebot
disallow: /news/ 
allow: /news/index.html 

The keyword allow enables the file http://www.example.com/news/index.html to be read by Googlebot despite the fact that the parent directory, /news/, is blocked.
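How an exception can override a broader ban is easiest to see in code. The sketch below assumes the precedence that Google documents for its crawler: the most specific (longest) matching rule wins, and allow wins if two matching rules are equally long. The function is_allowed is a hypothetical helper written for this illustration, it handles plain path prefixes only, and the paths /news/archive.html and /about.html are made up. Python’s urllib.robotparser is not used here because it applies the first matching rule instead.

```python
def is_allowed(path, rules):
    """Decide whether a path may be crawled.

    rules is a list of (directive, prefix) pairs, e.g. ("disallow", "/news/").
    Assumed precedence: the longest matching prefix wins; allow wins ties.
    """
    best_length = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            is_allow = directive == "allow"
            longer = len(prefix) > best_length
            tie_for_allow = len(prefix) == best_length and is_allow
            if longer or tie_for_allow:
                best_length = len(prefix)
                allowed = is_allow
    return allowed


rules = [("disallow", "/news/"), ("allow", "/news/index.html")]

# /news/archive.html and /about.html are hypothetical example paths.
for path in ["/news/index.html", "/news/archive.html", "/about.html"]:
    print(path, "->", "allowed" if is_allowed(path, rules) else "blocked")
```

Only /news/index.html escapes the ban on /news/, because the allow rule is longer and therefore more specific than the disallow rule.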

Blocking files with specific endings

Website operators wishing to prevent Googlebot from reading files with specific endings can use records like the following example:

# robots.txt for http://www.example.com

user-agent: Googlebot
disallow: /*.pdf$

The keyword disallow refers to all files ending in .pdf and blocks Googlebot from reading them. The asterisk (*) functions as a wildcard for any sequence of characters in the path, such as the file name. The entry is completed with a dollar sign ($), which anchors the pattern to the end of the URL.
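One way to make the meaning of the asterisk and the dollar sign concrete is to translate such a pattern into a regular expression. The following Python sketch does this under the interpretation described above; pattern_to_regex is a hypothetical helper, not part of any library, and the tested paths are invented for illustration.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression.

    '*' matches any sequence of characters; a trailing '$' anchors the
    pattern to the end of the URL. Everything else is matched literally.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


pdf_rule = pattern_to_regex("/*.pdf$")

# Hypothetical paths used to illustrate which URLs the rule covers.
for path in ["/docs/manual.pdf", "/manual.pdf",
             "/docs/manual.pdf?page=2", "/docs/manual.pdfx"]:
    print(path, "->", "blocked" if pdf_rule.match(path) else "not matched")
```

Because of the end anchor, a URL such as /docs/manual.pdf?page=2 is not covered by the rule; without the dollar sign it would be.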

Referring web crawlers to sitemaps

In addition to controlling crawling behavior, robots.txt files also allow search bots to be referred to a website’s sitemap. A robots.txt with a sitemap reference can be set up as follows:

# robots.txt for http://www.example.com

user-agent: *
disallow: /temp/

sitemap: http://www.example.com/sitemap.xml

Table 2: Extended robots.txt functions
Command | Example | Function
allow: | allow: /example.html | The specified file or directory can be crawled
disallow: /*…$ | disallow: /*.jpg$ | Files with certain endings are blocked
sitemap: | sitemap: http://www.example.com/sitemap.xml | The XML sitemap is found at the specified address
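A sitemap reference in a robots.txt can be read back programmatically, which is a simple way to confirm that the line is recognized. The sketch below uses the site_maps() method of urllib.robotparser, which is available from Python 3.8 onward; the robots.txt content mirrors the sitemap example from this tutorial.

```python
from urllib.robotparser import RobotFileParser

rules = """\
user-agent: *
disallow: /temp/

sitemap: http://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# site_maps() returns a list of sitemap URLs, or None if none are listed.
sitemaps = parser.site_maps()
print(sitemaps)
```

The sitemap line is independent of the user-agent blocks, so it is returned regardless of which crawler the preceding rules address.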