HTML Scraping 101

More often than not, easy access to data via an API is not possible. Scraping the webpage might then be the only practical way to get at the data. Doing this in Python is fairly straightforward, with the help of some libraries, and a basic understanding of HTML.

This is a super basic tutorial, but I am just writing these points down to remind myself.

We first import the following libraries -

Beautiful Soup (bs4), to help us tease the good stuff out from HTML and XML
request, to help us make HTTP requests to specific webpages
slugify, to generate unique and readable filenames
os, to help us interface with the operating system (be it Windows, Mac or Linux)

import bs4 
import requests 
from slugify import slugify 
import os 

Note: Remember to install awesome-slugify (pip install awesome-slugify) instead of slugify.

Next, specify the webpage which you would like to scrape data off. Say the list of visual art topics on Wikipedia.

websource = ['https://en.wikipedia.org/wiki/Category:Lists_of_visual_art_topics']

Some string manipulation first. I need the wikipedia address later. Simple split the string at ‘wiki/‘ and get the first item that is returned.

domain = websource[0].split("/wiki")[0]

You will get this.

'https://en.wikipedia.org'

Next, we get the content of the page in websource.

html = requests.get(websource[0]).content

Then we parse it with BeautifulSoup. This then allows us to get at all the links using the findAll function.

soup = bs4.BeautifulSoup(html, 'html5lib')
links = set(soup.findAll('a', href=True))

Next, we use what we have to locate a link with ‘mathematical’ inside to find the page with the list of mathematical artists.

for link in links:
    if 'mathematical' in link['href']:
        page = requests.get(domain+link['href']).content
        clean_page = bs4.BeautifulSoup(page, 'html5lib')

The Jupyter notebook with the code is here

HTML Scraping 101

Articles