Introduction to Python web scraping and the Beautiful Soup library

Date: 2018-09-03

Understanding of the basics of Python and object-oriented programming

# - requires given Linux command to be executed with root privileges, either directly as the root user or by use of the sudo command

$ - requires given Linux command to be executed as a regular non-privileged user

python3

Installation of the Beautiful Soup library

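Beautiful Soup can be installed either with the distribution's package manager or with pip. With the native package manager, on Fedora:

$ sudo dnf install python3-beautifulsoup4

On Debian and its derivatives:

$ sudo apt-get install python3-bs4

On Arch Linux:

$ sudo pacman -S python-beautifulsoup4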

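Alternatively, the library can be installed with pip:

$ pip3 install --user BeautifulSoup4

The --user option installs the package only for the current user, so no root privileges are needed.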
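As a quick check that the installation worked, the package can be imported from the Python interpreter. This is only a minimal sketch: it assumes the package is importable under the name bs4, the module name used in the rest of this tutorial, and the variable name version is chosen here for illustration.

```python
import bs4

# If the import succeeds, the library is installed for this interpreter.
version = bs4.__version__
print("Beautiful Soup version:", version)
```

If the import fails with ModuleNotFoundError, the package was most likely installed for a different Python interpreter than the one being run.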

The BeautifulSoup object

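A BeautifulSoup object is built from the markup to parse, passed either as a string or as a file-like object. Here the page is fetched with urlopen from the urllib.request module, and the response is handed to the BeautifulSoup constructor:

from bs4 import BeautifulSoup
from urllib.request import urlopen

with urlopen('http://www.rottentomatoes.com') as homepage:
    # naming the parser avoids a warning about a guessed parser
    soup = BeautifulSoup(homepage, 'html.parser')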
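The two ways of feeding markup to the constructor can be tried without network access. The sketch below is an illustration, not part of the original example: the markup string is invented, and io.BytesIO stands in for the file-like response that urlopen would return.

```python
import io
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
markup = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# 1. Build the object from a plain string.
soup_from_string = BeautifulSoup(markup, "html.parser")

# 2. Build the object from a file-like object; BytesIO stands in
#    for the HTTP response returned by urlopen.
fake_response = io.BytesIO(markup.encode("utf-8"))
soup_from_file = BeautifulSoup(fake_response, "html.parser")

print(soup_from_string.title.string)
print(soup_from_file.p.get_text())
```

Both objects describe the same document, so it does not matter for the rest of the code which of the two forms was used.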

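With the soup object in place we can, for example, extract every a tag in the page and print its href attribute. The find_all method does the searching:

for link in soup.find_all('a'):
    print(link.get('href'))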

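find_all returns every a tag found in the document. For each of them we read the href attribute with the get method, which returns None when the attribute is missing. The same value can be read as link['href'], which raises a KeyError when the attribute is missing, and all of a tag's attributes are available as a dictionary through attrs.

The output of the loop looks like this: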

[...]
https://editorial.rottentomatoes.com/
https://editorial.rottentomatoes.com/24-frames/
https://editorial.rottentomatoes.com/binge-guide/
https://editorial.rottentomatoes.com/box-office-guru/
https://editorial.rottentomatoes.com/critics-consensus/
https://editorial.rottentomatoes.com/five-favorite-films/
https://editorial.rottentomatoes.com/now-streaming/
https://editorial.rottentomatoes.com/parental-guidance/
https://editorial.rottentomatoes.com/red-carpet-roundup/
https://editorial.rottentomatoes.com/rt-on-dvd/
https://editorial.rottentomatoes.com/the-simpsons-decade/
https://editorial.rottentomatoes.com/sub-cult/
https://editorial.rottentomatoes.com/tech-talk/
https://editorial.rottentomatoes.com/total-recall/
[...]
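The three ways of reading an attribute (get, square-bracket indexing and attrs) differ mainly in how they treat a missing attribute. The following sketch shows the difference on an invented two-link snippet; the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second link has no href attribute.
html = '<a href="/one">One</a><a id="anchor">No link</a>'
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("a")

href_get = first.get("href")        # "/one"
href_index = first["href"]          # "/one"
missing = second.get("href")        # None, no exception
fallback = second.get("href", "#")  # "#", an explicit default
attributes = second.attrs           # {"id": "anchor"}

# Indexing a missing attribute raises KeyError instead of returning None.
try:
    second["href"]
    raised = False
except KeyError:
    raised = True

print(href_get, href_index, missing, fallback, attributes, raised)
```

When scraping real pages, where some links may lack an href, get is usually the safer choice inside a loop.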

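Each element returned by find_all is a Tag object, which can itself be searched and inspected in the same way as the soup object.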
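Because every result is a Tag, searches can be nested: a tag found by one call can be the starting point of the next. A minimal sketch on invented markup:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup, used only for illustration.
html = '<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>'
soup = BeautifulSoup(html, "html.parser")

items = soup.find_all("li")

# Every element of the result is a Tag...
all_tags = all(isinstance(item, Tag) for item in items)
# ...which exposes its own name...
tag_names = [item.name for item in items]
# ...and can be searched again, here for the link it contains.
hrefs = [item.find("a")["href"] for item in items]

print(all_tags, tag_names, hrefs)
```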

A test case: retrieving all "Top box office" titles

The "Top Box Office" titles are listed in a table element that carries its own id. Within each row the score is held in a span element, while the title and the earnings are each wrapped in an a element:

from bs4 import BeautifulSoup
from urllib.request import urlopen

with urlopen('https://www.rottentomatoes.com') as homepage:
    soup = BeautifulSoup(homepage.read(), 'html.parser')

# first we use the find method to retrieve the table with 'Top-Box-Office' id
top_box_office_table = soup.find('table', {'id': 'Top-Box-Office'})

# then we iterate over each row and extract the movie information
for row in top_box_office_table.find_all('tr'):
    cells = row.find_all('td')
    title = cells[1].find('a').get_text()
    money = cells[2].find('a').get_text()
    score = row.find('span', {'class': 'tMeterScore'}).get_text()
    print('{0} -- {1} (TomatoMeter: {2})'.format(title, money, score))

Crazy Rich Asians -- .9M (TomatoMeter: 93%)
The Meg -- .9M (TomatoMeter: 46%)
The Happytime Murders -- .6M (TomatoMeter: 22%)
Mission: Impossible - Fallout -- .2M (TomatoMeter: 97%)
Mile 22 -- .5M (TomatoMeter: 20%)
Christopher Robin -- .4M (TomatoMeter: 70%)
Alpha -- .1M (TomatoMeter: 83%)
BlacKkKlansman -- .2M (TomatoMeter: 95%)
Slender Man -- .9M (TomatoMeter: 7%)
A.X.L. -- .8M (TomatoMeter: 29%)
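Since the live page can change at any time, the same extraction logic can be exercised offline. The sketch below is an illustration under stated assumptions: the sample table only mimics the row layout the script relies on (score in a span, title and earnings in the second and third cells), the movie names and figures are invented, and parse_top_box_office is a hypothetical helper, not a Beautiful Soup function.

```python
from bs4 import BeautifulSoup

# Invented sample data that mimics the layout used by the script.
SAMPLE_HTML = """
<table id="Top-Box-Office">
  <tr><th>Score</th><th>Title</th><th>Gross</th></tr>
  <tr>
    <td><span class="tMeterScore">90%</span></td>
    <td><a href="/m/example_one">Example One</a></td>
    <td><a href="/m/example_one">$1.0M</a></td>
  </tr>
  <tr>
    <td><span class="tMeterScore">45%</span></td>
    <td><a href="/m/example_two">Example Two</a></td>
    <td><a href="/m/example_two">$0.5M</a></td>
  </tr>
</table>
"""


def parse_top_box_office(html):
    """Return a list of (title, money, score) tuples; empty if no table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find(id="Top-Box-Office")
    if table is None:
        return []
    results = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        score_tag = row.find("span", class_="tMeterScore")
        # Skip rows that do not have the expected shape (e.g. a header row).
        if len(cells) < 3 or score_tag is None:
            continue
        title = cells[1].get_text(strip=True)
        money = cells[2].get_text(strip=True)
        results.append((title, money, score_tag.get_text(strip=True)))
    return results


for title, money, score in parse_top_box_office(SAMPLE_HTML):
    print('{0} -- {1} (TomatoMeter: {2})'.format(title, money, score))
```

Returning a list instead of printing inside the loop keeps the parsing separate from the presentation, and the checks for a missing table or a malformed row avoid the AttributeError and IndexError the direct version would raise if the page layout changed.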

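To get the table we used the find method. It works like find_all, but instead of a list of all matches it returns only the first one, or None when nothing matches. We called find with the tag name, table, and a dictionary holding the id to match. Since an id is meant to be unique within a page, the tag name can be omitted and the id passed as a keyword argument:

top_box_office_table = soup.find(id='Top-Box-Office')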
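The difference between find and find_all, and the None returned on a failed lookup, can be seen in a small sketch. The markup and the Does-Not-Exist id are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup containing a single table with an id.
html = '<table id="Top-Box-Office"><tr><td>row</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

# Both lookups return the very same element of the parsed tree.
by_tag_and_id = soup.find("table", {"id": "Top-Box-Office"})
by_id_only = soup.find(id="Top-Box-Office")
same_table = by_tag_and_id is by_id_only

# A failed find returns None; a failed find_all returns an empty list.
missing_one = soup.find(id="Does-Not-Exist")
missing_all = soup.find_all(id="Does-Not-Exist")

# Check for None before calling methods on the result.
if missing_one is None:
    print("table not found")

print(same_table, len(missing_all))
```

Calling a method such as find_all directly on a None result raises AttributeError, so a check like the one above is worth adding whenever the page structure is not under our control.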

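The Tag object returned by find represents the table, and calling find_all on it yields its rows. In each row the title and the earnings are read with get_text, which returns the text contained in a tag. The score lives in the span with the tMeterScore class, and get_text is called on it as well.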
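Text scraped from real pages often carries surrounding whitespace. The sketch below, on an invented cell, shows get_text with and without its strip argument, together with the class_ keyword, which is an alternative to the {'class': ...} dictionary for matching a CSS class.

```python
from bs4 import BeautifulSoup

# Hypothetical cell with padding around the score.
html = '<td><span class="tMeterScore"> 93% </span></td>'
soup = BeautifulSoup(html, "html.parser")

# Two equivalent ways of matching on the class attribute.
by_dict = soup.find("span", {"class": "tMeterScore"})
by_keyword = soup.find("span", class_="tMeterScore")

raw_text = by_dict.get_text()               # whitespace preserved
clean_text = by_keyword.get_text(strip=True)  # whitespace removed

print(repr(raw_text), repr(clean_text))
```

The keyword is spelled class_ with a trailing underscore because class is a reserved word in Python.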

