quaintitative

I write about my quantitative explorations in visualisation, data science, machine and deep learning here, as well as other random musings.

For more about me and my other interests, visit playgrd or socials below


Categories
Subscribe

Extracting and Processing GDELT GKG datasets from BigQuery

Pre-requisites:

Extract GDELT GKG data from BigQuery

There are many ways to access the rich data available in the GDELT Project. One of the fastest ways is to leverage on the data that they have uploaded and made available on BigQuery. In this article, I will share two notebooks that you can use to download GDELT Global Knowledge Graph (GKG) data from BigQuery.

Other than the usual libraries, you will also need to install the following library.

pip install --upgrade google-cloud-bigquery

The first notebook is here.

We first set the locations from which we access the Google Cloud json key file, and save the data downloaded.

home_dir = Path(os.getcwd())
key_dir = #<location of your Google Cloud json key>
data_dir = home_dir/'data'

We then use the credentials in the key file.

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = str(key_dir/'gdelt-key.json')

We then generate the list of dates to download data from.

start_ = '2017-12-03'
end_ = '2020-12-31'
final_start_time = datetime.strptime(start_, '%Y-%m-%d')
final_end_time = datetime.strptime(end_, '%Y-%m-%d') 
num_days = (final_end_time - final_start_time).days
time_list = []
start_time_ = datetime.strptime(start_, '%Y-%m-%d')
for d in range(0,num_days,1):
    end_time_ = start_time_ + rd.relativedelta(days=+1)
    time_list.append((start_time_.date(), end_time_.date()))
    start_time_ = start_time_ + rd.relativedelta(days=+1)

Next, set up the connection. The authentication is handled automatically as you already have the credentials in the environment.

bqclient = bigquery.Client()

And then we just need to craft the SQL query and send it on its merry way to Google Cloud, and then save the data that is returned.

for d in range(len(time_list)):
    start_date, end_date = time_list[d]
    print(f'Downloading for {start_date}...')
    query_string = f"""
        SELECT DATE, SourceCollectionIdentifier, SourceCommonName, DocumentIdentifier, V2Themes, V2Locations, V2Persons, V2Organizations, V2Tone, SharingImage, RelatedImages, Quotations, AllNames, Amounts
        FROM `gdelt-bq.gdeltv2.gkg_partitioned` 
        WHERE DATE(_PARTITIONTIME) >= "{start_date}" and DATE(_PARTITIONTIME) < "{end_date}"
        LIMIT 5000000
    """
    # print(start_date, end_date)
    # # Note that each day gives around 300-400k rows

    dataframe = (
        bqclient.query(query_string)
        .result()
        .to_dataframe(
            create_bqstorage_client=False,
        )
    )

    dataframe['date'] = pd.to_datetime(dataframe.DATE, format='%Y%m%d%H%M%S')
    print(f'On {start_date}, there are {len(dataframe)} rows from {dataframe["date"].min()} to {dataframe["date"].max()}')

    filename = str(start_date) + '.csv'
    dataframe.to_csv(data_dir/filename, index=False)

A few points to note:

Refer to the sample.csv file in here to see what can be retrieved.

Process GDELT GKG data

Now that we have our dataset, what can we do with it. Remember that these are just rows and rows of attributes of articles, such as locations, persons, organizations that appeared in an article on that day, themes of the articles, links to the articles, tone of the articles etc. So we could use these as tabular data for say sentiment analysis.

One thing we may need to do first is to tag each of these rows though. Say for example, we may need to tag it by the name of a company. But there are many ways in which the name of a company may appear. One easy and quick way is to try to do a fuzzy matching of the company names.

We first install the swifter library (to help with multi-processing), and the rapidfuzz library (for fuzzy matching with Levenstein distance).

pip install swifter
pip install rapidfuzz

The second notebook is here.

I have provided a master list of companies with their ticker symbols and common names, that can be used to create two lists.

master_ticker_pairs = pd.read_csv('master_ticker_name.csv')
master_ticker_pairs['COMNAM'] = master_ticker_pairs['COMNAM'].str.lower()
master_ticker_pairs['combined'] = master_ticker_pairs['COMNAM'].str.replace(' ', '')
master_ticker_pairs = master_ticker_pairs.fillna('')
master_ticker_pairs['combined'] = master_ticker_pairs['combined'].apply(replace_short)

namelist = list(master_ticker_pairs.combined)
tickerlist = list(master_ticker_pairs.TICKER)

Next, we define a function to process the dataset by ‘fuzzy’ matching names in the main V2Organizations field against the namelist and tickerlist. The threshold refers to the confidence level, and can be found by trial and error.

threshold = 75
def map_to_comnam(row):
    mapped = []
    for name in row:
        ticker_result, ticker_score, _ = process.extractOne(name, tickerlist, scorer=fuzz.QRatio)      
        if ticker_score > threshold:
            found_ticker = master_ticker_pairs[master_ticker_pairs.TICKER == ticker_result].TICKER.values[0]
            if (found_ticker not in mapped):
                mapped.append(found_ticker)

        name_result, name_score, _ = process.extractOne(name, namelist, scorer=fuzz.QRatio)
        if name_score > threshold:
            found_ticker = master_ticker_pairs[master_ticker_pairs.combined == name_result].TICKER.values[0]
            if (found_ticker not in mapped):
                mapped.append(found_ticker)
    if len(mapped) == 0:
        return None
    else:
        return mapped

Now, we use a sample of the data downloaded from BigQuery (based on the first part of this article), and then apply this function.

gkg_df = pd.read_csv('sample.csv')
gkg_df_ = gkg_df[~gkg_df.V2Organizations.isnull()].reset_index(drop=True)\

print(f'Length before removing nulls: {len(gkg_df)}, after removing nulls: {len(gkg_df_)}')

gkg_df_['orgs'] = gkg_df_.V2Organizations.apply(extract_orgs)
gkg_df_['tags'] = gkg_df_.loc[:,'orgs'].swifter.progress_bar(False).apply(map_to_comnam)

filtered_gkg = gkg_df_[~gkg_df_.tags.isnull()]
print(f'Number of articles with tags: {len(filtered_gkg)}')

The code can be found in this repository.

And that’s it. Happy data explorations!


Articles

AI and UIs
Listing NFTs
Extracting and Processing Wikidata datasets
Extracting and Processing Google Trends data
Extracting and Processing Reddit datasets from PushShift
Extracting and Processing GDELT GKG datasets from BigQuery
Some notes relating to Machine Learning
Some notes relating to Python
Using CCapture.js library with p5.js and three.js
Introduction to PoseNet with three.js
Topic Modelling
Three.js Series - Manipulating vertices in three.js
Three.js Series - Music and three.js
Three.js Series - Simple primer on three.js
HTML Scraping 101
(Almost) The Simplest Server Ever
Tweening in p5.js
Logistic Regression Classification in plain ole Javascript
Introduction to Machine Learning Right Inside the Browser
Nature and Math - Particle Swarm Optimisation
Growing a network garden in D3
Data Analytics with Blender
The Nature of Code Ported to Three.js
Primer on Generative Art in Blender
How normal are you? Checking distributional assumptions.
Monte Carlo Simulation of Value at Risk in Python
Measuring Expected Shortfall in Python
Style Transfer X Generative Art
Measuring Market Risk in Python
Simple charts | crossfilter.js and dc.js
d3.js vs. p5.js for visualisation
Portfolio Optimisation with Tensorflow and D3 Dashboard
Setting Up a Data Lab Environment - Part 6
Setting Up a Data Lab Environment - Part 5
Setting Up a Data Lab Environment - Part 4
Setting Up a Data Lab Environment - Part 3
Setting Up a Data Lab Environment - Part 2
Setting Up a Data Lab Environment - Part 1
Generating a Strange Attractor in three.js
(Almost) All the Most Common Machine Learning Algorithms in Javascript
3 Days of Hand Coding Visualisations - Day 3
3 Days of Hand Coding Visualisations - Day 2
3 Days of Hand Coding Visualisations - Day 1
3 Days of Hand Coding Visualisations - Introduction