A Simple Web Scraper Tutorial - A Useful Method In NLP

 



Twitter, along with other social media platforms, represents an immense and largely untapped resource for data. It remains the most popular platform for research, as it still provides its data via a number of APIs, or Application Programming Interfaces, and is an invaluable source of user information and public, real-time data on almost every topic in today's world.

What is Natural Language Processing, or NLP?

Natural language processing (NLP) refers to the branch of computer science -- and more specifically, the branch of artificial intelligence or AI -- concerned with giving computers the ability to understand text and spoken words in much the same way human beings can. NLP has become an essential business tool for uncovering hidden data insights from social media channels. Sentiment analysis can analyze language used in social media posts, responses, reviews, and more to extract attitudes and emotions in response to products, promotions, and events -- information companies can use in product designs, advertising campaigns, and more.

A number of tools can make use of the data that Twitter provides, including but not limited to Natural Language Understanding (NLU), which analyzes text in unstructured data formats. This tutorial will show how to scrape tweets related to COVID-19 from Twitter using Twitter APIs.

To begin, I will demonstrate how to create an API Key, API Key Secret, Access Token, and Access Token Secret, and how to store them securely for use in scraping tweets.

Setting Up the Twitter API Keys

In order to follow along and build your own scraper, you must set up your personal API keys. You can do so using the following steps:

1- Create a text (.txt) file and rename it. This file will hold our keys and must be kept secure, much like a password. These keys are unique to you and are how Twitter keeps track of developers, so they should not be shared with anyone. In this example, I've used Notepad and named my file "keys.txt".

2- Set up your file in the required format. For API keys, there will need to be at least one header, followed by the keys stored in variables, or 'names'. Refer to the picture below for the format:
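The layout is standard INI, as read by Python's configparser. A sketch with placeholder values (the key names here are illustrative choices; use any names you like, as long as they match what you later pass to .get()):

```ini
[keys]
api_key = paste_your_api_key_here
api_key_secret = paste_your_api_key_secret_here
access_token = paste_your_access_token_here
access_token_secret = paste_your_access_token_secret_here
```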



Notice I've stored the keys under the [keys] header. Take note of the headers in your .txt file so you can correctly access the keys when using the .get() method.

3- Next, sign up for or sign in to twitter.com and be sure to add your phone number to your account before proceeding!

4- Navigate to https://dev.twitter.com/apps/new and click Create a new app. Add a name for your app and click Next.



5- Copy your API Key:



6- Paste it into your .txt file. Following the format I've displayed above, simply replace "paste_your_api_key_here" with your key.

7- Repeat steps 5 and 6 for your API Key Secret.

8- Click on App Settings.



9- Under the Access Token and Secret section, click the Generate button to access your keys.



10- Repeat steps 5 and 6 for the Access Token and Access Token Secret.

11- Save your file and add it to your repository.

IMPORTANT: Add your .txt file to your .gitignore file -- API keys and tokens are unique to each developer and should not be shared, since Twitter uses them to identify each individual.

Building the Scraper

To begin, I will first import the packages needed to build this simple scraper. For this project, we need two: configparser, for securely reading our API keys, and tweepy, the library used to access the Twitter API.
Note: If simply using !pip install tweepy doesn't work, you should be able to install directly from the library's GitHub using the command below:
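For example, assuming tweepy's official GitHub repository at github.com/tweepy/tweepy, a pip VCS install looks like:

```shell
pip install git+https://github.com/tweepy/tweepy.git
```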



Next, we will "build" the scraper in three steps:
1- Create an object to securely read the keys (basically, to access the .txt file we've created earlier)
2- Use the .get() method to access the keys from the object we've created
3- And lastly, authenticate using OAuthHandler

Step 1: Creating an object

I will create my configuration object, here named "config", to securely access my API keys and access tokens. (A crucial step!) Here, you'd substitute 'keys.txt' with the name of your own file.
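A minimal sketch of this step, assuming the file is named keys.txt:

```python
import configparser

# Create a configuration object and read in the key file created earlier.
# Substitute 'keys.txt' with the name of your own file.
config = configparser.ConfigParser()
config.read('keys.txt')
```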


Step 2: Using .get() to access the keys
This method returns the value stored under the specified key, so I will use it to access each key from my config object and assign each to a variable. Note the format below, and replace the second argument of each .get() call with the corresponding names of the keys stored in your .txt file:
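A sketch of this step. To keep the snippet self-contained and runnable, it first writes a keys.txt with placeholder values; in practice the file already exists with your real credentials, and the key names below are illustrative:

```python
import configparser

# For demonstration only: create a keys.txt with placeholder values.
# In practice, this file already exists and holds your real credentials.
with open('keys.txt', 'w') as f:
    f.write(
        '[keys]\n'
        'api_key = paste_your_api_key_here\n'
        'api_key_secret = paste_your_api_key_secret_here\n'
        'access_token = paste_your_access_token_here\n'
        'access_token_secret = paste_your_access_token_secret_here\n'
    )

config = configparser.ConfigParser()
config.read('keys.txt')

# .get() returns the value stored under the given name in the [keys] section.
api_key = config.get('keys', 'api_key')
api_key_secret = config.get('keys', 'api_key_secret')
access_token = config.get('keys', 'access_token')
access_token_secret = config.get('keys', 'access_token_secret')
```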


Step 3: Authentication
For authentication and authorization, I will use OAuthHandler to access Twitter's API. OAuth is an authorization framework that describes how unrelated services can safely grant access to resources without actually sharing the initial credentials.

I will first define a variable for our authorization (here it is 'auth'), and then authenticate by calling OAuthHandler from the tweepy library (imported as tw, as seen below) using our API Key and API Key Secret.

Next, I will apply the access token to our authorization using our tokens, and then define how I'd like to handle the download limit by setting the wait_on_rate_limit parameter to True. What this does is force the program to wait and attempt to reconnect when the download limit is reached.




And now we have successfully connected to Twitter through its API and essentially built our scraper!

For the full repository and an example on using our scraper, click here!
