What is web scraping?

Search engines like Google have long used so-called crawlers. Crawlers are special types of bots that visit website after website, indexing and categorizing the content so that it can be matched to search queries. The first crawler was released in 1993, the year in which JumpStation, an early crawler-based search engine, was launched.

Web scraping or web harvesting is a crawling technique. We explain how it works, why it’s used, and how it can be blocked if necessary.

Web scraping: a definition

During web scraping, data is extracted from websites and stored so that it can be analyzed or otherwise used. Many different types of information are collected, for instance contact data such as email addresses or telephone numbers, individual search terms, or URLs. This data is then stored in local databases or tables.

Definition

During web scraping, texts are read from websites in order to obtain and store information. This is comparable to an automatic copy-and-paste process. For image searches, this technique is referred to as image scraping.
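
The extract-and-store step described above can be sketched in a few lines of Python. This is a minimal illustration rather than a complete scraper: the page text, the simplified email pattern, and the table name `contacts` are all assumptions made for this example, and a real scraper would first download the text from a website.

```python
import re
import sqlite3

# Hypothetical page text; a real scraper would download this from a website.
PAGE_TEXT = """
Contact our sales team at sales@example.com or call +1 555 0100.
Support is available at support@example.com.
"""

# Simplified pattern for illustration; it does not cover every valid address.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return the unique email addresses found in text, in order of appearance."""
    seen = []
    for match in EMAIL_PATTERN.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


def store_emails(connection, emails):
    """Store the addresses in a local table (created if it does not exist)."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS contacts (email TEXT PRIMARY KEY)"
    )
    connection.executemany(
        "INSERT OR IGNORE INTO contacts (email) VALUES (?)",
        [(email,) for email in emails],
    )
    connection.commit()


# An in-memory database keeps the sketch self-contained.
connection = sqlite3.connect(":memory:")
store_emails(connection, extract_emails(PAGE_TEXT))
stored = [
    row[0]
    for row in connection.execute("SELECT email FROM contacts ORDER BY email")
]
print(stored)
```

Replacing `":memory:"` with a file name would keep the collected data between runs.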

How web scraping works

There are different approaches to scraping, but a distinction is generally made between manual and automatic scraping. Manual scraping refers to the manual copying and pasting of information and data. This is rather like cutting and collecting newspaper articles. Manual scraping is only performed when certain pieces of information are to be obtained and stored. It’s a highly effort-intensive process that is rarely used for large quantities of data.

Automatic scraping is when software or an algorithm is used to search through multiple websites and extract information. Depending on the type of website and content, special software is available for this purpose. A number of approaches exist for automatic scraping:

  • Parsers: A parser is used to convert a text to a new structure. In HTML parsing, for example, the software reads an HTML document and stores the information. DOM parsing uses the content as it is rendered in the client’s browser to extract data.
  • Bots: A bot is computer software that is dedicated to performing certain tasks automatically. Bots can be used in web harvesting to automatically search through websites and collect data.
  • Text: Anyone comfortable with the command line can use the Unix grep command, or equivalent pattern matching in Python or Perl, to comb web pages for certain terms. This is a really simple method for scraping data, but it takes more work than using dedicated software.
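
As an illustration of the parser approach from the list above, the following sketch uses the `HTMLParser` class from Python’s standard library to read an HTML document and store its title and link targets. The sample HTML and the class name `LinkCollector` are assumptions made for this example; a real scraper would fetch the page first.

```python
from html.parser import HTMLParser

# Hypothetical HTML; a real scraper would fetch this from a website.
SAMPLE_HTML = """
<html>
  <head><title>Example shop</title></head>
  <body>
    <a href="/products">Products</a>
    <a href="https://example.com/contact">Contact</a>
    <a>No link target</a>
  </body>
</html>
"""


class LinkCollector(HTMLParser):
    """Collects the page title and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            # attrs is a list of (name, value) pairs.
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text inside <title> is kept; all other text is ignored.
        if self._in_title:
            self.title += data


collector = LinkCollector()
collector.feed(SAMPLE_HTML)
print(collector.title)
print(collector.links)
```

Anchors without an `href` attribute are skipped, so only real link targets end up in the result.
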
Note

In this tutorial, we show you what to keep in mind when web scraping with Python. Selenium WebDriver can be easily integrated into this process to collect data.

Why is web scraping used?

Web scraping is used for a range of tasks. For example, it allows contact details or other specific information to be collected quickly. Scraping is common in a professional context as a way to gain an advantage over competitors. Data harvesting enables a company to view all of a competitor’s products and compare them with its own. Web scraping can also be helpful with financial data. The information is read from an external website, placed in a tabular format and then analyzed or further processed.
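
The sequence of reading information, placing it in a tabular format and analyzing it might look like the following sketch. All product names and prices are made up, and the helper class `TableReader` is a hypothetical name chosen for this example; it simply turns the cells of an HTML table into rows that can then be compared with a company’s own prices.

```python
from html.parser import HTMLParser

# Hypothetical competitor price list; product names and prices are made up.
PRICE_TABLE_HTML = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Basic plan</td><td>4.99</td></tr>
  <tr><td>Pro plan</td><td>12.50</td></tr>
</table>
"""


class TableReader(HTMLParser):
    """Turns the <tr>/<td>/<th> cells of an HTML table into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = ""

    def handle_data(self, data):
        if self._cell is not None:
            self._cell += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append(self._cell.strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


reader = TableReader()
reader.feed(PRICE_TABLE_HTML)
header, *records = reader.rows

# Our own prices for the same products (also made up for this sketch).
OWN_PRICES = {"Basic plan": 5.49, "Pro plan": 11.99}

# Positive values mean our price is higher than the competitor's.
comparison = {
    name: round(OWN_PRICES[name] - float(price), 2)
    for name, price in records
}
print(header)
print(comparison)
```

In practice the rows would more likely be written to a spreadsheet or database before analysis; the dictionary here only keeps the example short.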

A good example of web scraping is Google. The search engine uses the technology to display weather information or price comparisons for hotels and flights. Many common price comparison portals also practice scraping to show information from many different websites and providers.

Is web scraping legal?

Scraping is not always legal, and scrapers must first consider a website’s copyright. For some web shops and providers, web scraping can have negative consequences, for example if their page ranking suffers as a result of aggregators. From time to time, companies sue comparison portals to compel them to stop scraping. In one such case, however, the U.S. Ninth Circuit Court of Appeals ruled that scraping was not illegal and did not violate anti-hacking laws where the information was freely accessible. Companies are nonetheless at liberty to install technical measures to prevent scraping.

In other words, scraping is legal when the extracted data is freely accessible to third parties on the internet. To stay on the right side of the law, it’s important to consider the following points when web scraping:

  • Consider and observe copyright. If data is copyright-protected, it may not be published elsewhere.
  • Website operators have a right to install technical measures to prevent web scraping. They must not be circumvented.
  • If access to data is tied to a user registration or usage agreement, this data may not be scraped.
  • Scraping technology is not allowed to hide advertising, general terms of use or disclaimers.

Although scraping is permitted in many cases, it can have destructive consequences or even be misused for illegal purposes. It is often used for spam, for example: spammers scrape email addresses and then send spam emails to those recipients.

How to block web scraping

To prevent web scraping, website operators can take a range of measures. The robots.txt file, for example, tells search engine bots which pages they may not visit. Scraping bots that honor the file are kept out as well, although compliance is voluntary. IP addresses belonging to bots can also be blocked. Contact data and personal information can be concealed, and sensitive data such as telephone numbers can be embedded as an image or rendered via CSS, which reduces the effectiveness of data scraping. Moreover, many providers of anti-bot services can set up a firewall for a fee. Google Search Console can also be used to configure notifications that inform website operators if their data has been scraped.
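
To illustrate how robots.txt rules are interpreted, the sketch below feeds a hypothetical robots.txt to `RobotFileParser` from Python’s standard library and asks which URLs a given bot may fetch. The paths and the bot name `ExampleScraper` are made up for this example. Note that this only shows how a well-behaved bot reads the rules; robots.txt cannot technically stop a bot that ignores it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths and the bot name are made up.
ROBOTS_TXT = """
User-agent: *
Disallow: /private/

User-agent: ExampleScraper
Disallow: /
"""

parser = RobotFileParser()
# parse() accepts the file's lines directly, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())

# A general bot may fetch public pages but not the /private/ area.
public_ok = parser.can_fetch("Googlebot", "https://example.com/index.html")
private_ok = parser.can_fetch("Googlebot", "https://example.com/private/data.html")

# The bot named ExampleScraper is excluded from the whole site.
scraper_ok = parser.can_fetch("ExampleScraper", "https://example.com/index.html")

print(public_ok, private_ok, scraper_ok)
```

A scraper that wants to respect a site’s wishes could run a check like this before every request.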
