U.S. Shooting Incident: Building a News Scraper, Part 1

Web Scraping – Towards AI — The Best of Tech, Science, and Engineering

Last night I was startled to learn that 10 people had been killed in yet another shooting in the USA. It was the second deadly shooting within a week. The world had hardly moved on from the brutal killing of eight people, most of them Asian women, in Atlanta. I opened Twitter to read about it.

I was completely perplexed. Who killed whom, and why? The internet was filled with assumptions, conspiracy theories, and fake or hateful stories. So I decided not to read any news on Twitter and instead to scrape news articles using NewsAPI and analyze them to extract something useful.

News Scraper

import requests
import json
import pandas as pd

The requests library lets you send HTTP requests. The API returns data in JSON format, so we also import the json library to work with it. pandas will hold the articles as a table.

APIKEY = {'Authorization': '*****227112c4fdd8a95499343b54826'}

Before fetching any data, you need an API key from the service you are querying. Register on the NewsAPI website (newsapi.org), enter your details, and you will receive a key. The key is sent in the Authorization header of every HTTP request.
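Since the key is a secret, it may be safer not to paste it into the script at all. Here is a minimal sketch that reads it from an environment variable instead; the variable name NEWSAPI_KEY and the helper build_auth_headers are my own assumptions, not something NewsAPI prescribes.

```python
import os


def build_auth_headers(env_var="NEWSAPI_KEY"):
    """Build the Authorization header from an environment variable.

    NEWSAPI_KEY is a hypothetical variable name chosen for this sketch.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable to your NewsAPI key")
    # Same shape as the APIKEY dictionary used in this article
    return {"Authorization": api_key}
```

With the variable set in your shell, APIKEY = build_auth_headers() can stand in for the hard-coded dictionary.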


top_headlines_url = 'https://newsapi.org/v2/top-headlines'

everything_news_url = 'https://newsapi.org/v2/everything'

sources_url = 'https://newsapi.org/v2/sources'
headlines = {'category': 'entertainment', 'country': 'us'}
everything = {'q': 'Boulder', 'language': 'en', 'sortBy': 'popularity'}
sources = {'category': 'general', 'language': 'en', 'country': 'us'}

You can use different parameters to configure your search results. “q” is the keyword or phrase to search for. I am using the keyword Boulder because the shooting took place in Boulder, Colorado, USA.

The NewsAPI documentation lists the supported country codes; you can scrape news articles for all 56 countries listed there. It also describes every parameter that can be passed, so you can tweak the request to your requirements.
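To make the parameter tweaking less error-prone, the query dictionary can be built by a small function. This is only a sketch: build_everything_params is a hypothetical helper, and the sortBy values and the from/to date parameters are written as I understand them from the NewsAPI documentation, so please check them against the current docs.

```python
from datetime import date

# sortBy values as I understand them from the NewsAPI docs (an assumption to verify)
ALLOWED_SORT_BY = {"relevancy", "popularity", "publishedAt"}


def build_everything_params(query, language="en", sort_by="popularity",
                            from_date=None, to_date=None):
    """Build the params dictionary for the /v2/everything endpoint."""
    if not query:
        raise ValueError("query must be a non-empty keyword or phrase")
    if sort_by not in ALLOWED_SORT_BY:
        raise ValueError(f"sort_by must be one of {sorted(ALLOWED_SORT_BY)}")
    params = {"q": query, "language": language, "sortBy": sort_by}
    # Optional date window, sent as ISO dates such as 2021-03-22
    if from_date is not None:
        params["from"] = from_date.isoformat()
    if to_date is not None:
        params["to"] = to_date.isoformat()
    return params


# Reproduces the dictionary used in this article
everything = build_everything_params("Boulder")
```

Passing from_date=date(2021, 3, 22) would restrict the search to articles published on or after that day.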

List of Countries
# response1 = requests.get(url=top_headlines_url, headers=APIKEY, params=headlines)
response = requests.get(url=everything_news_url, headers=APIKEY, params=everything)
# response2 = requests.get(url=sources_url, headers=APIKEY, params=sources)

requests.get() fetches the data. Each of the three endpoints returns its own JSON object; I only use the response from the everything endpoint, which contains the news articles, so the other two calls are commented out.
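Before using the response, it helps to check that the request actually succeeded. The sketch below assumes the response body carries a "status" field that is "ok" on success, plus "code" and "message" fields on failure, which is how I understand the NewsAPI response format; extract_articles is a hypothetical helper name.

```python
def extract_articles(payload):
    """Return the list of articles from a parsed NewsAPI response body.

    Assumes the body has a "status" field ("ok" or "error") and, on
    errors, "code" and "message" fields.
    """
    if payload.get("status") != "ok":
        raise RuntimeError(
            f"NewsAPI error {payload.get('code')}: {payload.get('message')}"
        )
    return payload.get("articles", [])


# Usage with the request from this article (not executed here):
# response = requests.get(url=everything_news_url, headers=APIKEY,
#                         params=everything, timeout=10)
# articles = extract_articles(response.json())
```

A wrong or missing API key then fails with a readable message instead of a KeyError further down the script.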


response_json_string = json.dumps(response.json())


response_python_dict = json.loads(response_json_string)

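response.json() parses the response body into a Python dictionary. json.dumps() takes that dictionary and returns a JSON-formatted string, and json.loads() parses the string back into a Python dictionary. A JSON object maps naturally onto a Python dictionary, so the conversion is straightforward; strictly speaking, the round trip is optional, since response.json() already returns a dictionary.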
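The two functions are easy to try on a small hand-made dictionary. The sample data below is invented for illustration and is not real API output.

```python
import json

# Invented sample shaped loosely like an API response
sample = {
    "status": "ok",
    "totalResults": 1,
    "articles": [{"title": "Example headline", "author": None}],
}

as_string = json.dumps(sample)      # dictionary -> JSON-formatted string
round_trip = json.loads(as_string)  # JSON-formatted string -> dictionary

print(type(as_string).__name__)     # str
print(round_trip == sample)         # True
```

Note that Python's None is written as null in the JSON string and comes back as None after json.loads().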

articles_list = response_python_dict['articles']
dataframe = pd.DataFrame.from_dict(articles_list)

The articles are now in a pandas DataFrame. You can analyze the data in pandas or export it to a CSV file and read the stories there. I have exported the DataFrame to a CSV file.

# Keep the article text handy for later analysis
content = dataframe["content"]
# to_csv() writes the file and returns None, so do not reassign its result
dataframe.to_csv('Shooting_Stories.csv', index=False)
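One detail to watch when exporting: as far as I can tell, each article's "source" field is itself a small dictionary with "id" and "name" keys, which ends up in the CSV as a stringified dictionary. A minimal sketch for flattening it first, using a hypothetical helper flatten_article and invented sample data:

```python
def flatten_article(article):
    """Copy an article, replacing the nested "source" dict with two flat fields."""
    flat = {key: value for key, value in article.items() if key != "source"}
    source = article.get("source") or {}
    flat["source_id"] = source.get("id")
    flat["source_name"] = source.get("name")
    return flat


# Invented sample, not real API output
sample_articles = [
    {
        "source": {"id": None, "name": "Example News"},
        "author": "A. Writer",
        "title": "Example headline",
        "content": "Example body text",
    }
]

flat_rows = [flatten_article(article) for article in sample_articles]
# flat_rows can be passed to pd.DataFrame(...) in place of articles_list
```

After flattening, every CSV column holds plain text, which is easier to filter in Excel.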

Here is what the CSV file looks like when opened in Excel. We can use Excel functions to read and understand the crux of each story.

news=pd.read_csv("Shooting_Stories.csv")
news[['author', 'title', 'content']]
news_imp=news[['author', 'title', 'content']]
news_imp.tail()

Conclusion

We can build on this: process the news text and extract a lot of meaningful information from the scraped articles. You can reuse the code as it is, with your own API key, to scrape news articles on other topics.

Geopolitics and Data Science enthusiast.
