The World Wide Web is made up of billions of interlinked documents, more commonly known as web pages. The source code for websites is written in Hypertext Markup Language (HTML). HTML source code is a mixture of human-readable information and machine-readable code known as tags. A web browser (e.g. Chrome, Firefox, Safari or Edge) processes the source code, interprets the tags, and displays the information contained to the user.
Special software is used to extract only the information in the source code that is useful to people. These programs are referred to as web scrapers, spiders, or bots. They search the source code of a web page using predefined patterns and extract the information it contains. The information obtained through web scraping is summarized, combined, analyzed, and stored for further use.
In the following, we explain why Python is particularly well-suited for creating web scrapers and provide you with an introduction to the topic and a tutorial.
- Why should you use Python for web scraping?
- Web scraping overview
- Python web scraping tools
- Tutorial on web scraping with Python and BeautifulSoup
Why should you use Python for web scraping?
The popular programming language Python is a great tool for creating web scraping software. Since websites are constantly being modified, web content changes over time. For example, the website’s design may be modified or new page components may be added. A web scraper is programmed according to the specific structure of a page. If the structure of the page is changed, the scraper must be updated. This is particularly easy to do with Python.
Python is also effective for text processing and web resource retrieval, both of which form the technical foundations for web scraping. Furthermore, Python is an established standard for data analysis and processing. In addition to its general suitability, Python has a thriving programming ecosystem. This ecosystem includes libraries, open-source projects, documentation, and language references as well as forum posts, bug reports, and blog articles.
There are multiple sophisticated tools for performing web scraping with Python. Here we will introduce you to three popular tools: Scrapy, Selenium, and BeautifulSoup. For some hands-on experience, you can use our tutorial on web scraping with Python based on BeautifulSoup. This will allow you to directly familiarize yourself with the scraping process.
Web scraping overview
The basic procedure for the scraping process is easy to explain. First, the scraper developer analyzes the HTML source code of the page in question. Usually, there are unique patterns that can be used to extract the desired information. The scraper is programmed using these patterns. The rest of the work is done automatically by the scraper:
- It requests the website via the URL address.
- It automatically extracts the structured data corresponding to the patterns.
- The extracted data is summarized, stored, analyzed, combined, etc.
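The three steps above can be sketched in a few lines using only the Python standard library. This is a minimal illustration, not a production scraper: the page is a hard-coded string standing in for the HTTP request so that the sketch runs offline, and the names `PAGE`, `FIELD_PATTERN`, `extract_fields`, and `to_csv` are made up for this example. A regular expression serves as the "pattern" here; it is fragile against real-world HTML, which is why the tutorial further down uses an HTML parser instead.

```python
import csv
import io
import re

# Step 1 (request): replaced by a hard-coded page so the sketch runs offline.
PAGE = """
<li class="car-listing">
  <span class="car-make">Volkswagen</span>
  <span class="car-model">Beetle</span>
  <span class="car-build">1973</span>
</li>
"""

# Step 2 (extract): a pattern describing where the desired data sits.
FIELD_PATTERN = re.compile(r'<span class="car-(make|model|build)">([^<]*)</span>')


def extract_fields(page):
    """Return a dict mapping field names to the text found in the page."""
    return {name: value.strip() for name, value in FIELD_PATTERN.findall(page)}


# Step 3 (store): write the extracted record as CSV text.
def to_csv(record):
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=sorted(record))
    writer.writeheader()
    writer.writerow(record)
    return buffer.getvalue()


car = extract_fields(PAGE)
print(car)
print(to_csv(car))
```

If the page structure changes, for example if the class names are renamed, only `FIELD_PATTERN` needs to be updated.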
Applications for web scraping
Web scraping is highly versatile. In addition to search engine indexing, web scraping is used for a range of other purposes including:
- creating contact databases;
- monitoring and comparing the prices of online offers;
- merging data from different online sources;
- tracking online presence and reputation;
- collecting financial, weather and other data;
- monitoring web content for changes;
- collecting data for research purposes; and
- data mining.
Demonstrating web scraping through an example
Let us consider a website for selling used cars. When you navigate to the website in your browser, you will be shown a list of cars. Below we will examine an example of source code for a car listing:
raw_html = """
<h1>Used cars for sale</h1>
<ul class="cars-listing">
  <li class="car-listing">
    <div class="car-title">
      Volkswagen Beetle
    </div>
    <div class="car-description">
      <span class="car-make">Volkswagen</span>
      <span class="car-model">Beetle</span>
      <span class="car-build">1973</span>
    </div>
    <div class="sales-price">
      € <span class="car-price">14,998.—</span>
    </div>
  </li>
</ul>
"""
A web scraper can search through the available online listings of used cars. The scraper will search for a specific model in accordance with what the developer intended. In our example, the model is a Volkswagen Beetle. In the source code, the information for the make and model of the car is tagged with the CSS classes 'car-make' and 'car-model'. By using these class names, the desired information can be easily scraped. Here is an example using BeautifulSoup:
from bs4 import BeautifulSoup

# parse the HTML source code stored in raw_html
html = BeautifulSoup(raw_html, 'html.parser')
# find the tag with the class 'car-title'
car_title = html.find(class_='car-title')
# if this car is a Volkswagen Beetle
if car_title.text.strip() == 'Volkswagen Beetle':
    # jump up from the car title to the wrapping <li> tag
    car_listing = car_title.find_parent('li')
    # find the car price within this listing
    car_price = car_listing.find(class_='sales-price').text.strip()
    # output the car price
    print(car_price)
Legal risks of web scraping
The automated retrieval, storage, and analysis of data published on a website may constitute a violation of copyright law. If the scraped data contains personally identifiable information, storing and analyzing it without the consent of the person concerned might violate current data protection regulations, e.g. GDPR or CCPA. For example, it is prohibited to scrape Facebook profiles to collect personal information.
Violating privacy and copyright laws may result in severe penalties. You should ensure that you do not break any laws if you intend to use web scraping. Under absolutely no circumstance should you circumvent existing access restrictions.
Technical limitations of web scraping
It is often in the interest of website operators to limit the automated scraping of their online offers. Firstly, if large numbers of scrapers access the website, this can negatively affect its performance. Secondly, there are often internal areas of a website that should not appear in search results.
The robots.txt standard was established to limit scrapers’ access to websites. To do so, the website operator places a text file called robots.txt in the root directory of the website. In this file, there are specific entries that define which scrapers or bots are allowed to access which areas of the website. The entries in the robots.txt file always apply to the entire domain.
The following is an example of a robots.txt file that disallows scraping by any bot across the entire website:
# Any bot
User-agent: *
# Disallow for the entire root directory
Disallow: /
Adhering to the restrictions laid out in the robots.txt file is completely voluntary. The bots are supposed to comply with the specifications, but technically, this cannot be enforced. Therefore, to effectively regulate web scrapers’ access to their websites, website operators also use more aggressive techniques. These techniques include restricting their access by limiting throughput and blocking their IP addresses if they repeatedly access the site ignoring the specifications.
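A well-behaved scraper can check these rules itself before requesting a page. The following sketch uses the `urllib.robotparser` module from the Python standard library. To keep it runnable offline, the robots.txt contents are passed in as strings rather than downloaded; the bot name `ExampleScraper`, the `example.com` URLs, and the second rule set are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Create a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# The example from above: no bot may access anything.
BLOCK_ALL = """
# Any bot
User-agent: *
# Disallow for the entire root directory
Disallow: /
"""

# Hypothetical variant: only one directory is off limits.
BLOCK_PRIVATE = """
User-agent: *
Disallow: /private/
"""

strict = build_parser(BLOCK_ALL)
partial = build_parser(BLOCK_PRIVATE)

print(strict.can_fetch("ExampleScraper", "https://example.com/cars"))           # False
print(partial.can_fetch("ExampleScraper", "https://example.com/cars"))          # True
print(partial.can_fetch("ExampleScraper", "https://example.com/private/stats")) # False
```

In a real project, the parser's `set_url()` and `read()` methods can fetch the live robots.txt file of the target domain instead.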
APIs as an alternative to web scraping
While web scraping can be useful, it is not the preferred method for obtaining data from websites. There is often a better way to get this done. Many website operators present their data in a structured, machine-readable format. This data is accessed via special programming interfaces called application programming interfaces (APIs).
There are significant advantages to using an API:
- The API is explicitly made available by the provider for the purpose of accessing the data: There are fewer legal risks, and it is easier for the provider to control access to the data. For example, an API key may be required to access the data. The provider can also limit throughput more precisely.
- The API delivers the data directly in a machine-readable format: This eliminates the need to tediously extract the data from the source code. In addition, the data structure is separate from its graphical presentation. The structure therefore remains the same even if the website design is changed.
If there is an API available that provides access to all the data, this is the preferred way to access it. However, scraping can in principle be used to retrieve all text presented in a human-readable format on web pages.
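To illustrate the difference, here is a sketch of how the used car data from the earlier example might look when delivered by an API. The JSON payload, its field names (`cars`, `price_eur`), and the helper `find_price` are hypothetical; a real API defines its own format. Since the data already arrives structured, no pattern matching against HTML is needed.

```python
import json

# Hypothetical API response; a real provider defines its own structure.
API_RESPONSE = """
{
  "cars": [
    {"make": "Volkswagen", "model": "Beetle", "build": 1973, "price_eur": 14998}
  ]
}
"""


def find_price(payload, make, model):
    """Return the price of the first matching car, or None if there is none."""
    for car in json.loads(payload)["cars"]:
        if car["make"] == make and car["model"] == model:
            return car["price_eur"]
    return None


print(find_price(API_RESPONSE, "Volkswagen", "Beetle"))  # 14998
```

The lookup depends only on the data fields, so a redesign of the website would not affect it.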
Python web scraping tools
In the Python ecosystem, there are several well-established tools for executing a web scraping project:
- Scrapy
- Selenium
- BeautifulSoup
In the following, we will go over the advantages and disadvantages of each of these three tools.
Web scraping with Scrapy
The Python web scraping tool Scrapy uses an HTML parser to extract information from the HTML source code of a page. This results in the following schema illustrating web scraping with Scrapy:
URL → HTTP request → HTML → Scrapy
The core concept for scraper development with Scrapy is the web spider. Spiders are small programs based on Scrapy. Each spider is programmed to scrape a specific website and crawls across the web from page to page as a spider is wont to do. Object-oriented programming is used for this purpose. Each spider is its own Python class.
In addition to the core Python package, the Scrapy installation comes with a command-line tool. The spiders are controlled using this tool. In addition, existing spiders can be uploaded to Scrapy Cloud. There the spiders can be run on a schedule. As a result, even large websites can be scraped without having to use your own computer and home internet connection. Alternatively, you can set up your own web scraping server using the open-source software Scrapyd.
Scrapy is a sophisticated platform for performing web scraping with Python. The architecture of the tool is designed to meet the needs of professional projects. For example, Scrapy contains an integrated pipeline for processing scraped data. Page retrieval in Scrapy is asynchronous which means that multiple pages can be downloaded at the same time. This makes Scrapy well suited for scraping projects in which a high volume of pages needs to be processed.
Web scraping with Selenium
The free-to-use software Selenium is a framework for automated software testing for web applications. While it was originally developed to test websites and web apps, the Selenium WebDriver with Python can also be used to scrape websites. Despite the fact that Selenium itself is not written in Python, the software’s functions can be accessed using Python.
Unlike Scrapy or BeautifulSoup, Selenium does not use the page’s HTML source code. Instead, the page is loaded in a browser without a user interface. The browser interprets the page’s source code and generates a Document Object Model (DOM). This standardized interface makes it possible to test user interactions. For example, clicks can be simulated and forms can be filled out automatically. The resulting changes to the page are reflected in the DOM. This results in the following schema illustrating web scraping with Selenium:
URL → HTTP request → HTML → Selenium → DOM
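Because the DOM is generated dynamically, Selenium can also scrape pages whose content is created by JavaScript. The rendered DOM can be converted back to HTML and handed to Scrapy or BeautifulSoup for parsing:
URL → HTTP request → HTML → Selenium → DOM → HTML → Scrapy/BeautifulSoup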
Web scraping with BeautifulSoup
BeautifulSoup is the oldest of the Python web scraping tools presented here. Like Scrapy, it is also an HTML parser. This results in the following schema illustrating web scraping with BeautifulSoup:
URL → HTTP request → HTML → BeautifulSoup
Unlike Scrapy, developing scrapers with BeautifulSoup does not require object-oriented programming. Instead, scrapers are coded as a simple script. Using BeautifulSoup is thus probably the easiest way to fish specific information out of the “tag soup”.
Comparison of Python web scraping tools
Each of the three tools we have covered has its advantages and disadvantages. In the table below, you will find an overview summarizing them:
| |Scrapy|Selenium|BeautifulSoup|
|---|---|---|---|
|Easy to learn|++|+|+++|
|Accesses dynamic content|++|+++|+|
|Creates complex applications|+++|+|++|
|Able to cope with HTML errors|++|+|+++|
|Optimized for scraping performance|+++|+|+|
So, which tool should you use for your project? To put it briefly, if you want the development process to go quickly or if you want to familiarize yourself with Python and web scraping first, you should use BeautifulSoup. Now, if you want to develop sophisticated web scraping applications in Python and have the necessary know-how to do so, you should opt for Scrapy. However, if your primary goal is to scrape dynamic content with Python, you should go for Selenium.
Tutorial on web scraping with Python and BeautifulSoup
Here we will show you how to extract data from a website with BeautifulSoup. First, you will need to install Python and a few tools. The following is required:
- Python version 3.4 or higher,
- the Python package manager pip, and
- the venv module.
Please follow the installation instructions found on the Python installation page.
Alternatively, if you have the free package manager Homebrew installed on your system, you can install Python with the following command:
brew install python
The code and explanations below were written for Python 3 on macOS. In principle, the code should run on other operating systems. However, you may have to make some modifications, especially if you are using Windows.
Setting up a Python web scraping project on your own device
Here, we are going to create the project folder web Scraper for the Python tutorial on the desktop. Open the command-line terminal (e.g. Terminal.app on Mac). Then, copy the following lines of code into the terminal and execute them.
# Switch to the desktop folder
cd ~/Desktop/
# Create project directory (quoted because the name contains a space)
mkdir "./web Scraper/" && cd "./web Scraper/"
# Create virtual environment
# Ensures for instance that pip3 is used later
python3 -m venv ./env
# Activate virtual environment
source ./env/bin/activate
# Install packages
pip install requests
pip install beautifulsoup4
Scraping quotes and authors using Python and BeautifulSoup
Let us begin. Open the command-line terminal (e.g. Terminal.app on Mac) and launch the Python interpreter from your Python project folder web Scraper. Copy the following lines of code into the terminal and execute them:
# Switch to the project directory (quoted because the name contains a space)
cd ~/Desktop/"web Scraper"/
# Activate virtual environment
source ./env/bin/activate
# Launch the Python interpreter
# Since we are in the virtual environment, Python 3 will be used
python
Now, copy the following code into the Python interpreter in the command-line terminal. Then press Enter (several times if necessary) to execute the code. You can also save the code as a file called scrape_quotes.py in your project folder web Scraper. In this case, you can run the Python script using the command python scrape_quotes.py.
Executing the code should result in a file called quotes.csv being created in your Python project folder web Scraper. This will be a table containing the quotes and authors. You can open this file with any spreadsheet program.
# Import modules
import requests
import csv
from bs4 import BeautifulSoup

# Website address
url = "http://quotes.toscrape.com/"
# Execute GET request
response = requests.get(url)
# Parse the HTML document from the source code using BeautifulSoup
html = BeautifulSoup(response.text, 'html.parser')
# Extract all quotes and authors from the HTML document
quotes_html = html.find_all('span', class_="text")
authors_html = html.find_all('small', class_="author")
# Create a list of the quotes
quotes = list()
for quote in quotes_html:
    quotes.append(quote.text)
# Create a list of the authors
authors = list()
for author in authors_html:
    authors.append(author.text)
# For testing: combine the entries from both lists and output them
for t in zip(quotes, authors):
    print(t)
# Save the quotes and authors in a CSV file in the current directory
# Open the file using Excel, LibreOffice, etc.
# newline='' prevents blank lines between rows on some systems
with open('./quotes.csv', 'w', newline='', encoding='utf-8') as csv_file:
    csv_writer = csv.writer(csv_file, dialect='excel')
    csv_writer.writerows(zip(quotes, authors))
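The script above collects quotes and authors in two separate lists and pairs them with zip, which only works as long as both lists have the same length and order. A somewhat more robust variant, sketched below, walks through each quote container and reads both values from it. The sample HTML is a simplified, made-up stand-in for the real page, and the assumption that every quote is wrapped in a `div` with the class `quote` should be checked against the page's source code; `extract_quotes` is a name chosen for this example.

```python
from bs4 import BeautifulSoup

# Simplified, made-up stand-in for the page; the second quote has no author.
SAMPLE_HTML = """
<div class="quote">
  <span class="text">"First quote."</span>
  <span>by <small class="author">Author One</small></span>
</div>
<div class="quote">
  <span class="text">"Second quote."</span>
</div>
"""


def extract_quotes(page_html):
    """Return a list of (quote, author) tuples, one per quote container."""
    soup = BeautifulSoup(page_html, "html.parser")
    rows = []
    for block in soup.select("div.quote"):
        text = block.select_one("span.text")
        author = block.select_one("small.author")
        # A missing element yields an empty string instead of shifting the pairs
        rows.append((
            text.get_text(strip=True) if text else "",
            author.get_text(strip=True) if author else "",
        ))
    return rows


for row in extract_quotes(SAMPLE_HTML):
    print(row)
```

The returned rows can be passed to `csv_writer.writerows()` in the same way as the zipped lists.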
Using Python packages for web scraping
Every web scraping project is different. Sometimes, you just want to check the website for any changes. Other times, you are looking to perform complex analyses. With Python, you have a wide selection of packages at your disposal.
- Use the following code in the command-line terminal to install packages with pip3.
pip3 install <package>
- Integrate modules in the Python script with import.
from <package> import <module>
The following packages are often used in web scraping projects:
|Package|Use|
|---|---|
|venv|Manage a virtual environment for the project|
|lxml|Use alternative parsers for HTML and XML|
|csv|Read and write spreadsheet data in CSV format|
|pandas|Process and analyze data|
|selenium|Use Selenium WebDriver|
Use the Python Package Index (PyPI) for an overview of available Python packages.
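As a small example of a follow-up analysis, the sketch below counts how many quotes each author contributed, using only the csv module and `collections.Counter` from the standard library. The CSV text is invented sample data in the same two-column layout as quotes.csv, and `quotes_per_author` is a name made up for this example.

```python
import csv
import io
from collections import Counter

# Invented sample data in the same layout as quotes.csv: quote, author
CSV_TEXT = (
    '"Quote A",Author One\n'
    '"Quote B",Author Two\n'
    '"Quote C",Author One\n'
)


def quotes_per_author(csv_file):
    """Count the rows per author in a two-column CSV file object."""
    reader = csv.reader(csv_file)
    return Counter(author for _quote, author in reader)


counts = quotes_per_author(io.StringIO(CSV_TEXT))
print(counts.most_common(1))  # [('Author One', 2)]
```

To analyze the real file, pass `open('quotes.csv', newline='', encoding='utf-8')` instead of the `io.StringIO` object.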