Objective
Learning how to extract information out of an html page using python and the Beautiful Soup library.
Requirements
- Understanding of the basics of python and object oriented programming
Conventions
- # – requires given linux command to be executed with root privileges either
directly as a root user or by use ofsudo
command - $ – given linux command to be executed as a regular non-privileged user
Introduction
Web scraping is a technique which consist in the extraction of data from a web site through the use of dedicated software. In this tutorial we will see how to perform a basic web scraping using python and the Beautiful Soup library. We will use python3
targeting the homepage of Rotten Tomatoes, the famous aggregator of reviews and news for films and tv shows, as a source of information for our exercise.
Installation of the Beautiful Soup library
To perform our scraping we will make use of the Beautiful Soup python library, therefore the first thing we need to do is to install it. The library is available in the repositories of all the major GNU\Linux distributions, therefore we can install it using our favorite package manager, or by using pip
, the python native way for installing packages.
If the use of the distribution package manager is preferred and we are using Fedora:
$ sudo dnf install python3-beautifulsoup4
On Debian and its derivatives the package is called beautifulsoup4:
$ sudo apt-get install beautifulsoup4
On Archilinux we can install it via pacman:
$ sudo pacman -S python-beatufilusoup4
If we want to use pip
, instead, we can just run:
$ pip3 install --user BeautifulSoup4
By running the command above with the --user
flag, we will install the latest version of the Beautiful Soup library only for our user, therefore no root permissions needed. Of course you can decide to use pip to install the package globally, but personally I tend to prefer per-user installations when not using the distribution package manager.
The BeautifulSoup object
Let’s begin: the first thing we want to do is to create a BeautifulSoup object. The BeautifulSoup constructor accepts either a string
or a file handle as its first argument. The latter is what interests us: we have the url of the page we want to scrape, therefore we will use the urlopen
method of the urllib.request
library (installed by default): this method returns a file-like object:
from bs4 import BeautifulSoup
from urllib.request import urlopen
with urlopen('http://www.rottentomatoes.com') as homepage:
soup = BeautifulSoup(homepage)
At this point, our soup it’s ready: the soup
object represents the document in its entirety. We can begin navigating it and extracting the data we want using the built-in methods and properties. For example, say we want to extract all the links contained in the page: we know that links are represented by the a
tag in html and the actual link is contained in the href
attribute of the tag, so we can use the find_all
method of the object we just built to accomplish our task:
for link in soup.find_all('a'):
print(link.get('href'))
By using the find_all
method and specifying a
as the first argument, which is the name of the tag, we searched for all links in the page. For each link we then retrieved and printed the value of the href
attribute. In BeautifulSoup the attributes of an element are stored into a dictionary, therefore retrieving them is very easy. In this case we used the get
method, but we could have accessed the value of the href attribute even with the following syntax: link['href']
. The complete attributes dictionary itself is contained in the attrs
property of the element. The code above will produce the following result:
[...] https://editorial.rottentomatoes.com/ https://editorial.rottentomatoes.com/24-frames/ https://editorial.rottentomatoes.com/binge-guide/ https://editorial.rottentomatoes.com/box-office-guru/ https://editorial.rottentomatoes.com/critics-consensus/ https://editorial.rottentomatoes.com/five-favorite-films/ https://editorial.rottentomatoes.com/now-streaming/ https://editorial.rottentomatoes.com/parental-guidance/ https://editorial.rottentomatoes.com/red-carpet-roundup/ https://editorial.rottentomatoes.com/rt-on-dvd/ https://editorial.rottentomatoes.com/the-simpsons-decade/ https://editorial.rottentomatoes.com/sub-cult/ https://editorial.rottentomatoes.com/tech-talk/ https://editorial.rottentomatoes.com/total-recall/ [...]
The list is much longer: the above is just an extract of the output, but gives you an idea. The find_all
method returns all Tag
objects that matches the specified filter. In our case we just specified the name of the tag which should be matched, and no other criteria, so all links are returned: we will see in a moment how to further restrict our search.
A test case: retrieving all “Top box office” titles
Let’s perform a more restricted scraping. Say we want to retrieve all the titles of the movies which appear in the “Top Box Office” section of Rotten Tomatoes homepage. The first thing we want to do is to analyze the page html for that section: doing so, we can observe that the element we need are all contained inside a table
element with the “Top-Box-Office” id
:
We can also observe that each row of the table holds information about a movie: the title’s scores are contained as text inside a span
element with class “tMeterScore” inside the first cell of the row, while the string representing the title of the movie is contained in the second cell, as the text of the a
tag. Finally, the last cell contains a link with the text that represents the box office results of the film. With those references, we can easily retrieve all the data we want:
from bs4 import BeautifulSoup
from urllib.request import urlopen
with urlopen('https://www.rottentomatoes.com') as homepage:
soup = BeautifulSoup(homepage.read(), 'html.parser')
# first we use the find method to retrieve the table with 'Top-Box-Office' id
top_box_office_table = soup.find('table', {'id': 'Top-Box-Office'})
# than we iterate over each row and extract movies information
for row in top_box_office_table.find_all('tr'):
cells = row.find_all('td')
title = cells[1].find('a').get_text()
money = cells[2].find('a').get_text()
score = row.find('span', {'class': 'tMeterScore'}).get_text()
print('{0} -- {1} (TomatoMeter: {2})'.format(title, money, score))
The code above will produce the following result:
Crazy Rich Asians -- $24.9M (TomatoMeter: 93%) The Meg -- $12.9M (TomatoMeter: 46%) The Happytime Murders -- \$9.6M (TomatoMeter: 22%) Mission: Impossible - Fallout -- $8.2M (TomatoMeter: 97%) Mile 22 -- $6.5M (TomatoMeter: 20%) Christopher Robin -- $6.4M (TomatoMeter: 70%) Alpha -- $6.1M (TomatoMeter: 83%) BlacKkKlansman -- $5.2M (TomatoMeter: 95%) Slender Man -- $2.9M (TomatoMeter: 7%) A.X.L. -- $2.8M (TomatoMeter: 29%)
We introduced few new elements, let’s see them. The first thing we have done, is to retrieve the table
with ‘Top-Box-Office’ id, using the find
method. This method works similarly to find_all
, but while the latter returns a list which contains the matches found, or is empty if there are no correspondence, the former returns always the first result or None
if an element with the specified criteria is not found.
The first element provided to the find
method is the name of the tag to be considered in the search, in this case table
. As a second argument we passed a dictionary in which each key represents an attribute of the tag with its corresponding value. The key-value pairs provided in the dictionary represents the criteria that must be satisfied for our search to produce a match. In this case we searched for the id
attribute with “Top-Box-Office” value. Notice that since each id
must be unique in an html page, we could just have omitted the tag name and use this alternative syntax:
top_box_office_table = soup.find(id='Top-Box-Office')
Once we retrieved our table Tag
object, we used the find_all
method to find all the rows, and iterate over them. To retrieve the other elements, we used the same principles. We also used a new method, get_text
: it returns just the text part contained in a tag, or if none is specified, in the entire page. For example, knowing that the movie score percentage are represented by the text contained in the span
element with the tMeterScore
class, we used the get_text
method on the element to retrieve it.
In this example we just displayed the retrieved data with a very simple formatting, but in a real-world scenario, we might have wanted to perform further manipulations, or store it in a database.
Conclusions
In this tutorial we just scratched the surface of what we can do using python and Beautiful Soup library to perform web scraping. The library contains a lot of methods you can use for a more refined search or to better navigate the page: for this I strongly recommend to consult the very well written official docs.