Thanks to advances in AI-based text analysis, known as natural language processing (NLP), there is a growing need to extract large quantities of text from websites for analytical purposes.
Good sources for this are newspapers, news aggregators, RSS feeds and press review sites. These provide valuable information for analyzing trends.
For our demonstration, we chose the news review site newstral.com. We simply want to extract a list of all headlines. We can do this as follows:
- Find the content in the source code of the web page
In the browser's developer tools (F12 key, or right-click and select “Inspect”), we first select a headline and look at how it is structured. In this case, the structure is simple: the headline is the anchor text of the link (bottom line).
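To make the idea of "the headline is the anchor text of the link" concrete, here is a minimal sketch that collects the text of every link in a piece of HTML, using only Python's standard library. The sample markup, the class name `AnchorTextParser` and the helper `extract_headlines` are illustrative assumptions; the real structure of newstral.com may differ, and downloading the page is left out here.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collect the visible text of every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._in_link = False  # True while we are inside an <a href=...>
        self._parts = []       # text fragments of the current link

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href"):
            self._in_link = True
            self._parts = []

    def handle_data(self, data):
        if self._in_link:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            # Join the fragments and collapse runs of whitespace.
            text = " ".join("".join(self._parts).split())
            if text:
                self.headlines.append(text)
            self._in_link = False


def extract_headlines(html):
    """Return the anchor texts found in the given HTML string."""
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return parser.headlines


# Hypothetical markup, standing in for what the developer tools might show.
SAMPLE_HTML = """
<div class="headline">
  <a href="/article/1">First example headline</a>
</div>
<div class="headline">
  <a href="/article/2">Second <em>example</em> headline</a>
</div>
"""

print(extract_headlines(SAMPLE_HTML))
# ['First example headline', 'Second example headline']
```

On a real page this would also pick up navigation and footer links, so in practice the selection would need to be narrowed to the element that wraps the headlines.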