quaintitative

I write about my quantitative explorations in visualisation, data science, machine and deep learning here, as well as other random musings.

For more about me and my other interests, visit playgrd or socials below



HTML Scraping 101

More often than not, easy access to data via an API is not possible. Scraping the webpage might then be the only practical way to get at the data. Doing this in Python is fairly straightforward, with the help of some libraries, and a basic understanding of HTML.

This is a super basic tutorial, but I am just writing these points down to remind myself.

We first import the following libraries.

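import os                    # standard library
import bs4                   # BeautifulSoup, for parsing HTML
import requests              # for fetching pages over HTTP
from slugify import slugify  # provided by awesome-slugify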

Note: The slugify import comes from the awesome-slugify package, so remember to install that (pip install awesome-slugify) rather than the package named slugify.

Next, specify the webpage you would like to scrape data from, say the Wikipedia category page listing visual art topics.

websource = ['https://en.wikipedia.org/wiki/Category:Lists_of_visual_art_topics']

Some string manipulation first. I need the Wikipedia domain later, so simply split the string at '/wiki' and take the first item that is returned.

domain = websource[0].split("/wiki")[0]

You will get this.

'https://en.wikipedia.org'
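
Splitting on a literal substring works here, but the same base address can also be derived with the standard library's urllib.parse, which does not depend on '/wiki' appearing in the URL. This is only a minimal sketch and not what the notebook does; the names `parts` and `base` are my own.

```python
from urllib.parse import urlsplit

websource = ['https://en.wikipedia.org/wiki/Category:Lists_of_visual_art_topics']

# Break the URL into its components (scheme, netloc, path, ...)
parts = urlsplit(websource[0])

# Rebuild just the scheme and host, dropping the path
base = f"{parts.scheme}://{parts.netloc}"
print(base)
```

Either way, the result is the part of the address that relative links on the page need to be prefixed with.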

Next, we get the content of the page in websource.

html = requests.get(websource[0]).content

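Then we parse it with BeautifulSoup, using the html5lib parser (a separate install: pip install html5lib). This allows us to get at all the links with the find_all method (findAll is its older alias). Passing href=True keeps only the a tags that actually have an href attribute.

soup = bs4.BeautifulSoup(html, 'html5lib')
links = set(soup.find_all('a', href=True))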
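
To see what collecting the links amounts to without touching the network, here is a small sketch that uses Python's built-in html.parser instead of BeautifulSoup. It is a stand-in for illustration, not the method used in this post: the class name `LinkCollector` and the sample HTML are made up.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = set()  # a set drops repeated links automatically

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.add(value)


# Made-up HTML standing in for a downloaded page
sample_html = """
<ul>
  <li><a href="/wiki/List_of_mathematical_artists">Mathematical artists</a></li>
  <li><a href="/wiki/List_of_sculptors">Sculptors</a></li>
  <li><a href="/wiki/List_of_sculptors">Sculptors (repeated)</a></li>
  <li><a name="no-href">Anchor without a link</a></li>
</ul>
"""

collector = LinkCollector()
collector.feed(sample_html)
print(sorted(collector.hrefs))
```

The repeated link is stored once and the anchor without an href is skipped, which is roughly the effect of wrapping the href=True search in set() above.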

Next, we loop over the links to locate one with 'mathematical' in its href, which leads to the page with the list of mathematical artists.

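for link in links:
    # Match on the link target rather than the link text
    if 'mathematical' in link['href']:
        # The href is relative (/wiki/...), so prepend the domain
        page = requests.get(domain + link['href']).content
        clean_page = bs4.BeautifulSoup(page, 'html5lib')
        break  # stop at the first match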
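
The matching step can be tried out on its own, without fetching anything. The sketch below assumes a hypothetical list of hrefs in place of the real links, and uses urljoin from the standard library rather than string concatenation, since it handles both relative and absolute hrefs.

```python
from urllib.parse import urljoin

domain = 'https://en.wikipedia.org'

# Hypothetical hrefs, standing in for the link['href'] values from the page
hrefs = [
    '/wiki/List_of_sculptors',
    '/wiki/List_of_mathematical_artists',
    'https://example.org/about',
]

# Take the first href containing the keyword, or None if nothing matches
match = next((h for h in hrefs if 'mathematical' in h), None)

if match is not None:
    # A relative href is resolved against the domain
    target = urljoin(domain, match)
    print(target)

# An absolute href is returned unchanged by urljoin
external = urljoin(domain, hrefs[2])
print(external)
```

Note that the check is case-sensitive, so an href containing 'Mathematical' with a capital M would not match unless it is lowercased first.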

The Jupyter notebook with the code is here

