quaintitative

I write about my quantitative explorations in visualisation, data science, machine and deep learning here, as well as other random musings.

For more about me and my other interests, visit playgrd or socials below


Categories
Subscribe

Topic Modelling

When faced with a huge body of text, sometimes we may just want to see if there are common threads running through the text.

Topic modelling is a good way to do that. Basically, the algorithms search of clusters of words that may(or may not) be logical common threads (topics) in the text.

Latent Dirichlet Allocation (LDA) is one of the most commonly used methods for extracting groups of topics from text. Latent Semantic Indexing is another possibility. There are good posts on LDA and LSI on the interwebs, so I shall not go into the math or theory of these methods. Rather, I shall just demonstrate what the LDA and LSI is good for - discovering groups of topics in body of text (or what most term as the corpus).

First, we need to find something interesting to run LDA/LSI on. Ideally, the larger the more interesting. But as this is just a demo, we will use ‘Ulysses’ from the Project Gutenberg site.

source = 'https://www.gutenberg.org/files/4300/4300-0.txt'

Download the data with the ‘requests’ library as per what we have done for the other scripts

import requests
import re
corpus = requests.get(source)

Using a combination of string functions and the ‘re’ library, we strip out some of the special characters (e.g. newlines)

corpus_text = corpus.text.strip('\ufeff')
regex = re.compile(r'[\n\r\t]')
corpus_text=regex.sub(' ', corpus_text)
corpus_text = corpus_text.replace('\\', '')
corpus_text[:1000] # First 1000 characters

For this exercise, we will use gensim to do LDA and LSI, so let’s import the libraries first

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

The following function is just a simple one to process the text we downloaded, and strip out the stopwords (things like is, are …)

def get_tokens(text):
    return [t for t in simple_preprocess(text) if t not in STOPWORDS]

tokens = get_tokens(corpus_text)

We then convert into a basket of words - basically pre-preparing the text so that we can feed it into the the LDA/LSI model.

dictionary = gensim.corpora.Dictionary([tokens])
corpus = [dictionary.doc2bow(token) for token in [tokens]]

Now to build the LDA model for this text. The main parameters are:

lda_model.show_topics(5)


And the LSI equivalent

lsi_model = gensim.models.LsiModel(corpus, id2word=dictionary)

lsi_model.print_topics() ```

The topics generated off such a small text body is obviously not satisfactory, but the simple script shows how simple it would be to do this for any other collection of text.

We must however understand that these methods can only show us the collection of words that are close to each other and can be grouped as topics. We will still have to eyeball to see if they make sense, and how we would link the words in each of these clusters.

The Jupyter notebook with the code is here


Articles

AI and UIs
Listing NFTs
Extracting and Processing Wikidata datasets
Extracting and Processing Google Trends data
Extracting and Processing Reddit datasets from PushShift
Extracting and Processing GDELT GKG datasets from BigQuery
Some notes relating to Machine Learning
Some notes relating to Python
Using CCapture.js library with p5.js and three.js
Introduction to PoseNet with three.js
Topic Modelling
Three.js Series - Manipulating vertices in three.js
Three.js Series - Music and three.js
Three.js Series - Simple primer on three.js
HTML Scraping 101
(Almost) The Simplest Server Ever
Tweening in p5.js
Logistic Regression Classification in plain ole Javascript
Introduction to Machine Learning Right Inside the Browser
Nature and Math - Particle Swarm Optimisation
Growing a network garden in D3
Data Analytics with Blender
The Nature of Code Ported to Three.js
Primer on Generative Art in Blender
How normal are you? Checking distributional assumptions.
Monte Carlo Simulation of Value at Risk in Python
Measuring Expected Shortfall in Python
Style Transfer X Generative Art
Measuring Market Risk in Python
Simple charts | crossfilter.js and dc.js
d3.js vs. p5.js for visualisation
Portfolio Optimisation with Tensorflow and D3 Dashboard
Setting Up a Data Lab Environment - Part 6
Setting Up a Data Lab Environment - Part 5
Setting Up a Data Lab Environment - Part 4
Setting Up a Data Lab Environment - Part 3
Setting Up a Data Lab Environment - Part 2
Setting Up a Data Lab Environment - Part 1
Generating a Strange Attractor in three.js
(Almost) All the Most Common Machine Learning Algorithms in Javascript
3 Days of Hand Coding Visualisations - Day 3
3 Days of Hand Coding Visualisations - Day 2
3 Days of Hand Coding Visualisations - Day 1
3 Days of Hand Coding Visualisations - Introduction