A Simple Web Scraper Tutorial - A Useful Method In NLP
Twitter, along with other social media platforms, represents an immense and largely untapped resource for data. It remains the most popular platform for research because it still provides its data through a number of APIs (Application Programming Interfaces), and it is an invaluable source of user information and public, real-time data on almost every topic in today's world.
What is Natural Language Processing, or NLP?
Natural language processing (NLP) refers to the branch of computer science—and more specifically, the branch of artificial intelligence or AI—concerned with giving computers the ability to understand text and spoken words in much the same way human beings can. NLP has become an essential business tool for uncovering hidden data insights from social media channels. Sentiment analysis can analyze language used in social media posts, responses, reviews, and more to extract attitudes and emotions in response to products, promotions, and events–information companies can use in product designs, advertising campaigns, and more.
A number of tools can make use of the data that Twitter provides, including but not limited to Natural Language Understanding (NLU), which analyzes text in unstructured data formats. This tutorial will show how to scrape tweets related to COVID-19 from Twitter using the Twitter APIs.
To begin, I will demonstrate how to create an API Key, API Key Secret, Access Token, and Access Token Secret, and how to store them securely for use when scraping tweets.
Setting Up the Twitter API Keys
In order to follow along and build your own scraper, you must set up your personal API keys. You can do so using the following steps:
1- Create a text (.txt) file and rename it. This file will hold our keys and must be kept secure, much like a password. These keys are unique to you and are how Twitter keeps track of developers, so they should not be shared with anyone. In this example, I've used Notepad and named my file "keys.txt".
2- Set up your file in the required format. For API keys, there needs to be at least one header, followed by the keys stored in variables, or 'names'. Refer to the picture below for the format:
Notice I've stored the keys under the [keys] header. Keep note of your headers in your .txt file so you may correctly access the keys when using the .get() method.
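As a rough illustration of the format and of the .get() lookup, here is a minimal sketch that assumes the file is read with Python's standard configparser module. The four option names (api_key, api_key_secret, access_token, access_token_secret) and the load_keys helper are hypothetical names chosen for this example; only the [keys] header and the "paste_your_api_key_here" placeholder come from the steps in this tutorial.

```python
import configparser

# Assumed layout of keys.txt. The option names on the left are illustrative;
# use whatever names you like, as long as they match what you pass to .get():
#
#   [keys]
#   api_key = paste_your_api_key_here
#   api_key_secret = paste_your_api_key_secret_here
#   access_token = paste_your_access_token_here
#   access_token_secret = paste_your_access_token_secret_here


def load_keys(path="keys.txt", section="keys"):
    """Read the four credentials from an INI-style text file into a dict."""
    config = configparser.ConfigParser()
    # read() returns the list of files it parsed; an empty list means
    # the file was missing or unreadable.
    if not config.read(path):
        raise FileNotFoundError(f"Could not read key file: {path}")
    names = ("api_key", "api_key_secret", "access_token", "access_token_secret")
    # config.get(section, option) looks up one value under the [keys] header.
    return {name: config.get(section, name) for name in names}
```

Calling load_keys() with no arguments would look for "keys.txt" in the current working directory. If the header or one of the names does not match the file, configparser raises an error rather than returning an empty value, which makes typos in the file easier to spot.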
3- Next, sign up or sign in to twitter.com, and be sure to add your phone number to your account before proceeding!
4- Navigate to https://dev.twitter.com/apps/new and click to create a new app. Add a name for your app and click Next.
5- Copy your API key.
6- Paste it into your .txt file. Following the format I've displayed above, simply replace "paste_your_api_key_here" with your key.
7- Repeat steps 5 and 6 for your API Key Secret.
8- Click on App Settings.
9- Under the Access Token and Secret section, click the Generate button to access your keys.
10- Repeat steps 5 and 6 for the Access Token and Access Token Secret.
11- Save your file and add it to your repository folder.
IMPORTANT: Add your .txt file to your .gitignore file. API keys and tokens are unique to each developer and should not be shared, since they are how Twitter identifies each individual developer.
Building the Scraper
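With the keys in place, a scraper for COVID-19 tweets could look roughly like the sketch below. It assumes the third-party tweepy library (version 4.x, installed with pip install tweepy) and a developer account with access to the standard search endpoint. The function names, the default query string, and the keys dictionary layout (api_key, api_key_secret, access_token, access_token_secret) are hypothetical choices for this example, not a fixed part of the Twitter API.

```python
def build_api(keys):
    """Authenticate with OAuth and return a tweepy API client.

    `keys` is assumed to be a dict holding the four credentials.
    """
    # Imported here so tweet_text() below can be used without tweepy installed.
    import tweepy

    auth = tweepy.OAuthHandler(keys["api_key"], keys["api_key_secret"])
    auth.set_access_token(keys["access_token"], keys["access_token_secret"])
    # wait_on_rate_limit=True makes tweepy sleep instead of failing
    # when the rate limit is reached.
    return tweepy.API(auth, wait_on_rate_limit=True)


def tweet_text(tweet):
    """Return the full text when the tweet has one, else the short text."""
    return getattr(tweet, "full_text", None) or tweet.text


def scrape(api, query="covid19 -filter:retweets", limit=100):
    """Collect the text of up to `limit` English tweets matching `query`."""
    import tweepy

    # Cursor is an iterator helper that handles paging through results.
    cursor = tweepy.Cursor(
        api.search_tweets, q=query, lang="en", tweet_mode="extended"
    )
    return [tweet_text(tweet) for tweet in cursor.items(limit)]
```

To try it, pass a dictionary of your four keys to build_api and hand the resulting client to scrape. The "-filter:retweets" part of the query is a search operator that leaves out retweets, which keeps duplicate text out of the results.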