quaintitative

I write about my quantitative explorations in visualisation, data science, machine and deep learning here, as well as other random musings.


Extracting and Processing Reddit datasets from PushShift

There are many ways to access the rich data available on Reddit. You could scrape it yourself, or you could use the data that Pushshift has kindly made available. The paper here provides more detail on the background of this data.

To get the data from Pushshift, first install two libraries, praw and pmaw. The snippets below use pmaw's PushshiftAPI class, together with requests and the standard datetime module.

pip install praw
pip install pmaw

The steps thereafter are simple, but note that each subreddit has submissions as well as the comments associated with them. You cannot fetch both at the same time. Instead, first get all the submissions in a subreddit for a date range, then use each submission id to look up its comment ids and fetch the comments. So there are two steps, and we start with the first.

First, initialize the API object, then set the date range we want to fetch submissions for.

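import datetime as dt
from pmaw import PushshiftAPI

api = PushshiftAPI()

subreddit = 'wallstreetbets'
limit = 1000000  # upper bound on the number of submissions to fetch
comment_threshold = 10
start_date = dt.datetime(2020, 1, 1)  # year, month, day
end_date = dt.datetime(2020, 1, 5)  # year, month, day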

# Use this if you want to count from today and get say the last 30 days of submissions
# end_date = dt.datetime.today() 
# timespan = dt.timedelta(days=30)
# start_date = end_date - timespan
# print(start_date, '|', end_date)

after = int(start_date.timestamp()) # subs after this date, i.e. start
before = int(end_date.timestamp()) # subs before this date, i.e. end

# print(after, before)
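One caveat with the conversion above: calling .timestamp() on a naive datetime interprets it in the machine's local timezone, so the same dates can produce different epochs on different machines. Here is a minimal sketch, assuming you want the window expressed in UTC. The helper name to_epoch_utc is hypothetical and not part of pmaw.

```python
import datetime as dt


def to_epoch_utc(year, month, day):
    """Return the Unix epoch (in seconds) for midnight UTC on the given date."""
    aware = dt.datetime(year, month, day, tzinfo=dt.timezone.utc)
    return int(aware.timestamp())


after = to_epoch_utc(2020, 1, 1)   # start of the window
before = to_epoch_utc(2020, 1, 5)  # end of the window
print(after, before)
```

The resulting integers can be passed as the after and before arguments in the same way as the values computed above.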

Then we can call the API's search_submissions method to get the submissions.

submissions = api.search_submissions(subreddit=subreddit, limit=limit, before=before, after=after) # get subs
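The comment_threshold variable is defined earlier but not used in the snippets shown here. One plausible use, and this is an assumption on my part, is to keep only the submissions with enough comments to be worth fetching. The sketch below assumes each submission is a dict with id and num_comments fields, and uses made-up sample records in place of real API results. The helper name filter_by_comments is hypothetical.

```python
comment_threshold = 10

# Hypothetical sample records standing in for real API results.
sample_submissions = [
    {"id": "abc123", "title": "Daily thread", "num_comments": 250},
    {"id": "def456", "title": "Quiet post", "num_comments": 3},
    {"id": "ghi789", "title": "Borderline post", "num_comments": 10},
]


def filter_by_comments(submissions, threshold):
    """Keep submissions that have at least `threshold` comments."""
    return [s for s in submissions if s.get("num_comments", 0) >= threshold]


busy = filter_by_comments(sample_submissions, comment_threshold)
submission_ids = [s["id"] for s in busy]
print(submission_ids)
```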

The second set of steps is a bit trickier. To get the comment ids for a particular submission id, you need to call the Pushshift REST endpoint directly instead.

import requests

# submission_id: the id of one submission fetched in the first step
api_endpt = f'https://api.pushshift.io/reddit/submission/comment_ids/{submission_id}'
response = requests.get(api_endpt)
comment_ids = list(response.json()['data'])

Then you can use these ids to fetch the comments with the API's search_comments method.

comments = api.search_comments(ids=comment_ids)
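To tie the two steps together, you could loop over the submission ids, look up the comment ids for each, and then fetch the comments. The sketch below is one way to structure that loop, not the notebook's actual code. The function collect_comments and the two fetcher arguments are hypothetical names, and the stand-in fetchers return made-up data so the example runs offline.

```python
def collect_comments(submission_ids, fetch_ids, fetch_comments):
    """Map each submission id to the list of its comments.

    fetch_ids(submission_id) -> list of comment ids
    fetch_comments(comment_ids) -> iterable of comment records
    """
    comments_by_submission = {}
    for submission_id in submission_ids:
        comment_ids = fetch_ids(submission_id)
        if not comment_ids:
            # Nothing to fetch, so skip the second call entirely.
            comments_by_submission[submission_id] = []
            continue
        comments_by_submission[submission_id] = list(fetch_comments(comment_ids))
    return comments_by_submission


# Stand-in fetchers with made-up data (assumptions, not real API output).
fake_index = {"abc123": ["c1", "c2"], "def456": []}


def fake_fetch_ids(submission_id):
    return fake_index.get(submission_id, [])


def fake_fetch_comments(comment_ids):
    return [{"id": cid, "body": f"comment {cid}"} for cid in comment_ids]


result = collect_comments(["abc123", "def456"], fake_fetch_ids, fake_fetch_comments)
print({sid: len(found) for sid, found in result.items()})
```

With the real services, fetch_ids would wrap the requests.get call to the comment_ids endpoint shown earlier, and fetch_comments would be a thin wrapper around api.search_comments(ids=...).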

The notebook here has the full code.

And that’s it. Happy data explorations!

