Text_Mining_Twitter

In summer 2017, I made a presentation to the Global Citizenship Program (GCP) Collaboratory. This presentation prompted the GCP leaders to contact me for assistance with the restructuring of KEYS 4014 Social Movements and the Impact of Technologies. They wanted to include more data analysis content into the course and thought that I could help with data from Twitter and various analytical tools from R.

We agreed that I would provide Twitter data and consultation support for KEYS 4014.01 in fall 2017. The data support would consist of two collections of datasets. The first collection would focus on the topic of hate group tweets. This collection would introduce the class to R, Twitter analysis, and technology integration with the course topic. Each search term data collection included: a dataset of up to 3000 tweets based on the chosen search term, datasets for both most favorited and least favorited tweets, datasets for both most retweeted and least retweeted tweets, and charts for favorite count frequencies, retweet frequencies, tweet word frequencies, and a wordcloud based on tweet word frequencies. These datasets were the result of five search terms: 14words, alt right, Identity Europa, MAGA, and White Nationalism.

I helped facilitate the introduction of these datasets with a class presentation on R, RStudio, and text mining of Twitter data. This presentation included a brief introduction to R, a demonstration of the capabilities of R including 2D and 3D plotting, and an exercise of Twitter analysis using R on the search phrase “my professor just.” This presentation prepared the class for their work analyzing the initial collection of datasets.

After the fall semester break, I prepared the second collection of Twitter data for this course. This collection would focus on the topic of tweets associated with the Jason Stockley trial and verdict. This collection would consist of only a dataset of up to 3000 tweets for each search term between the dates Friday, 15 September 2017 until Sunday, 15 October 2017. These datasets were the result of eleven search terms: #blacklivesmatter, #BLM, black lives matter, #policebrutality, police brutality, #stockleyverdict, Jason Stockley, #STLProtest, #STLVerdict, Anthony Lamar Smith, and #Ferguson

Based on feedback from the course instructor this data had a positive impact on the course outcomes for the students.

This short example will help you experiment with mining Twitter data and text processing.

The first step is to install the necessary packages used in this example. These packages are necessary for this example project:

[for Twitter mining] twitteR
[for text mining and analysis] dplyr, tidytext
[for data visualization] ggplot2, wordcloud

Once you have all of the necessary packages, this web resource is a good place to start learning about the twitteR package and Twitter user credentials is here: https://www.r-bloggers.com/setting-up-the-twitter-r-package-for-text-analytics/ You will need your Twitter user credentials to connect to Twitter and download tweet data.

This R script will help run an initial experiment to mine Twitter content and run text analysis on your data using the search term [my professor just]:

text mining Twitter example.R

Now, we can step through a brief description of the example R script above.

The first set of commands load the twitteR package, assign your user credentials to objects [paste your user credentials between the quotation marks], and opens a Twitter access session with setup_twitter_oauth( ). The example code now stores the text string [phrase] to search for in an object. The string "my professor just" is simply an example, you can use any other string or hashtag you want. The function searchTwitter( ) will request your search from Twitter and return the results that are stored in an object. The n = 3000 parameter limits the search to the first 3000 tweets that contain the search terms [the entire phrase or any single word or combination of words]. Another example of the searchTwitter( ) command is shown as a comment. This example shows you how to designate a time window with since/until dates. Notice that the date format used is YYYY-MM-DD. The search and download may take a while. The object containing the downloaded tweets is converted to a data frame so it is easier to work with.

The dim( ) function verifies the number of rows and columns in the data frame.The code then saves the Twitter data as both a .csv and an RData file. You can edit this section based on your data storage preferences.

The next section of code will process the tweets. Before beginning, we will load the necessary packages [dplyr, tidytext, ggplot2, and wordcloud].

Before beginning the analysis you can view the first few tweets with the head( ) function. This example will analyze several of the tweet attributes: created [the creation datetime of this tweet], favoriteCount [the number of times this tweet was favorited], retweetCount [the number of times this tweet was retweeted], and text [the text content of this tweet].

The first section of analysis determines the time span of the data. We sort the tweets and look at the oldest [using head( )] and newest [using tail( )] creation datatime stamps. Once we are done exploring the data time span, we clean up the R environment with the rm( ) function.

The next section of analysis will explore the favorited tweets. These are tweets that someone liked enough to designate them as a "favorite". We start this analysis by tabulating the values in the favoriteCount attribute of the Twitter data. The bar plot of the tabulation results has a distinctive shape. It reveals that many tweets were not favorited or favorited few times. The number of tweets that were highly favorited are small, hence the long small tail to the right of the chart. We will see this shape in several of our other analyses. After plotting the favoriteCount frequency tabulation, the analysis creates two data sets, one with the highest favoriteCount and another with 0 favoriteCount. You can set the count threshold for the high favoriteCount data based on the results of the tabulation and the number of tweets you want in this data set. The 0 favoriteCount dataset is trimmed to 30 rows. If you want more rows, edit the command. The example code saves the two data sets as .csv files for later analysis. Once the data is saved into files, we clean up the R environment with the rm( ) function.

The following section of analysis will explore the retweeted tweets. These are tweets that someone liked enough to share it as a retweet. We start this analysis by tabulating the values in the retweetCount attribute of the Twitter data. The bar plot of the tabulation results has a distinctive shape. It reveals that many tweets were not retweeted or retweeted few times. The number of tweets that were retweeted many times are small, hence the long small tail to the right of the chart. After plotting the retweetCount frequency tabulation, the analysis creates two data sets, one with the highest retweetCount and another with 0 retweetCount. You can set the count threshold for the high retweetCount data based on the results of the tabulation and the number of tweets you want in this data set. The 0 retweetCount dataset is trimmed to 30 rows. If you want more rows, edit the command. The example code saves the two data sets as .csv files for later analysis. Once the data is saved into files, we clean up the R environment with the rm( ) function.

The final section provides an example of text mining the Twitter data set. The first step of processing extracts the text content and converts it to a data frame of individual words. Next the data frame of words is converted to individual tokens [words reduced to common roots, for example: dog and dogs will both become dog]. Stop words [common words that do not convey meaning, like 'the', 'a', 'and', 'but', etc.] are removed leaving us the principal words in the tweet text. The stop word processing can be modified to include other words or terms of little interest such as 'professor' in our example search phrase. You can do this by add your list of words to the object stop_words using the c( ) R function.

The example code now creates a bar chart of the most frequent word tokens in the tweet text. You can adjust how many tokens are shown in this chart by changing the filter( ) value. Notice that this chart has the same shape as our previous frequency charts.

Finally, the code creates a wordcloud of the 150 most frequent word tokens in the tweet text. You can modify the number of words shown in the worcloud to achieve a suitable image.

Project deliverable files

Here are dataset results [as .csv files] from the KEYS 4014 project, grouped by search term:

14words charts [as .png files]: favoriteCount frequency, retweetCount frequency, tweet word frequency, wordcloud