Understanding Effects of Covid-19 Through Sentiment Analysis
Research | Sentiment Analysis | Data Science
Industry
Public Health
Year
2021
Role
Applied Behavioral Scientist
Overview
By better understanding the general sentiment of a given population, policy leaders can govern more effectively during times of crisis. For instance, if feelings of fear are high, politicians can offer words of reassurance to instill feelings of calmness and ease. Or, if feelings of trust are low, politicians can attempt to mend the public trust by strengthening accountability and transparency within the government. Essentially any policy decision can be better informed by knowing how the general populace feels about the issue at hand.
The Problem
COVID-19 has wreaked unprecedented havoc around the world. From a data analytics perspective, never before has a pandemic occurred at a time when almost anyone can publicly share their thoughts on a global platform. More specifically, Twitter offers real-time insight into the attitudes, beliefs, and general moods of a populace.
The Solution
We compared the sentiments of COVID-19 related tweets at the beginning of the pandemic on March 30, 2020, to the sentiments exactly one year later on March 30, 2021. We selected this time frame to capture two of the key events during this pandemic: the onset of stay-at-home orders and vaccine availability. Compared with other data collection methods, such as interviews and surveys (and the numerous response biases that come along with them), sentiment analysis through Twitter is better able to capture the raw and unfiltered emotions of people who feel the need to express their views.
My Role
I coded exploratory visualizations in R, including word frequencies and sentiment analyses, and wrote the in-depth interpretations of the extracted data and results. I also assisted with the data retrieval and cleaning.
The rest of my team did the topic modeling, newsmaps, exploratory/predictive analysis of virality, tweet length analyses, topic frequency bar graphs, and hashtag counts (not shown on this page).
Goal
Our research questions for this analysis are:
Are there differences in sentiments between the two sample periods—both in original tweets and in retweets?
Which topics attract the most original tweets, and which are more often the subject of retweets?
Is there a difference in geographic attention between the two dates? For example, was China being discussed more in 2020 or 2021?
Which of these features are most predictive of how many retweets a tweet gets?
Are there certain sentiments or topic-specific words that are most likely to attract retweets?
Methodology
After hydrating the tweets using Python (retrieving the full tweet data from a list of tweet IDs), we separated each period’s tweets into original and retweet sets and prepared R-ready CSV files. Then, using R, we determined the frequency of words in the tweets, tweet lengths, common hashtags, and mentions. Many of these steps required our data to be in tidytext format, with one token per row.
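The word-frequency step can be illustrated with a minimal sketch. The project itself did this in R; the Python below is only an illustration, and the sample tweets, the stop-word list, and the function names are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "for", "at", "we", "are"}

def tidy_tokens(tweets):
    """Return (tweet_id, word) rows: one lowercased token per row."""
    rows = []
    for tweet_id, text in tweets:
        # Drop URLs first so link fragments are not counted as words.
        text = re.sub(r"https?://\S+", " ", text.lower())
        for word in re.findall(r"[a-z']+", text):
            if word not in STOP_WORDS:
                rows.append((tweet_id, word))
    return rows

def word_frequencies(rows):
    """Count how often each word appears across all rows."""
    return Counter(word for _, word in rows)

# Made-up tweets, for illustration only.
sample_tweets = [
    (1, "Stay home and stay safe https://example.com/x"),
    (2, "The pandemic is scary, stay safe"),
]
rows = tidy_tokens(sample_tweets)
freq = word_frequencies(rows)
print(freq.most_common(3))
```

Keeping one token per row is what makes the later lexicon joins simple: each row can be matched against a lexicon by its word.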
Then, we conducted sentiment analysis to further characterize the two periods. We joined the NRC Word-Emotion Association Lexicon to our data. Doing so allowed us to tag the words with eight basic emotions and two sentiments. We also joined the AFINN lexicon to rate the emotional valence (positive or negative) of each word.
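As a rough sketch of what joining an emotion lexicon involves: each word in the data is matched against the lexicon, and one row is produced per word-label pair. The lexicon entries below are illustrative assumptions in the shape of the NRC lexicon, not copied from it, and the function names are hypothetical. The original work was done in R.

```python
from collections import Counter

# Hypothetical excerpt shaped like the NRC lexicon: word -> set of labels.
# These entries are assumptions for illustration, not the real lexicon.
TOY_NRC = {
    "hope": {"anticipation", "joy", "positive", "trust"},
    "fear": {"fear", "negative"},
    "death": {"fear", "sadness", "negative"},
}

def tag_emotions(rows, lexicon):
    """Inner-join (tweet_id, word) rows with the lexicon: one output row per label."""
    tagged = []
    for tweet_id, word in rows:
        # Words missing from the lexicon produce no rows, as in an inner join.
        for label in sorted(lexicon.get(word, ())):
            tagged.append((tweet_id, word, label))
    return tagged

def emotion_shares(tagged):
    """Share of all tagged rows that carry each label."""
    counts = Counter(label for _, _, label in tagged)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}

# Made-up token rows, for illustration only.
rows = [(1, "hope"), (1, "vaccine"), (2, "fear"), (2, "death")]
tagged = tag_emotions(rows, TOY_NRC)
shares = emotion_shares(tagged)
print(shares)
```

Comparing shares rather than raw counts keeps two sample periods comparable even when they contain different numbers of tweets.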
Next, we used the quanteda package for Latent Dirichlet Allocation (LDA) topic modeling. This algorithm finds clusters of words that are likely to co-occur, and each cluster defines a topic. Topic modeling helps to illuminate the different conversations surrounding COVID-19 in each period.
We also sought to identify the features that predict a tweet’s virality, defined as the ratio of retweets to followers of the original account. For example, a user with two followers who gets eight retweets on a given tweet would receive a virality score of four. We used two modeling methods: random forests and backward stepwise regression optimized using the Akaike Information Criterion (AIC). Both took the sentiments, mean AFINN score, and topic-specific words as inputs.
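The virality score is simple enough to state in a few lines. This is a minimal sketch; the function name is hypothetical, and the handling of accounts with zero followers is an assumption, since the write-up does not say how those were treated.

```python
def virality(retweets, followers):
    """Ratio of retweets to the original account's follower count."""
    if followers <= 0:
        # Assumption: accounts with no followers get no score instead of dividing by zero.
        return None
    return retweets / followers

# The example from the text: eight retweets, two followers.
print(virality(8, 2))  # 4.0
```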
We used random forests because they are an ensemble method that runs well on large datasets and is relatively robust to overfitting. We used backward stepwise regression so that we could start with a complete model containing all of our selected variables and remove those that did not improve the model's AIC.
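To make the backward stepwise idea concrete, here is a small self-contained sketch, not the R code used in the project. It assumes an ordinary least-squares model with an intercept and uses the common Gaussian form of AIC up to a constant, n * ln(RSS / n) + 2k. The feature names and data are made up, and the solver assumes the features are not perfectly collinear.

```python
import math

def ols_rss(columns, y):
    """Fit y on an intercept plus the given columns; return the residual sum of squares."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[pivot] = A[pivot], A[k]
        for r in range(p):
            if r != k:
                factor = A[r][k] / A[k][k]
                A[r] = [a - factor * b for a, b in zip(A[r], A[k])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2 for i in range(n))

def aic(columns, y):
    """Gaussian AIC up to an additive constant."""
    n, k = len(y), len(columns) + 1  # +1 for the intercept
    return n * math.log(ols_rss(columns, y) / n) + 2 * k

def backward_stepwise(features, y):
    """Start from all features; repeatedly drop the one whose removal lowers AIC most."""
    kept = list(features)
    best = aic([features[f] for f in kept], y)
    while kept:
        trials = [(aic([features[f] for f in kept if f != drop], y), drop) for drop in kept]
        trial_aic, drop = min(trials)
        if trial_aic >= best:
            break  # no removal improves AIC, so stop
        kept.remove(drop)
        best = trial_aic
    return kept

# Made-up data: y depends on x_signal only; x_noise is unrelated to y.
features = {
    "x_signal": [1, 2, 3, 4, 5, 6, 7, 8],
    "x_noise": [1, -1, -1, 1, 1, -1, -1, 1],
}
y = [2.5, 3.5, 6.5, 7.5, 9.5, 12.5, 13.5, 16.5]
selected = backward_stepwise(features, y)
print(selected)
```

Because AIC charges a penalty of 2 for every extra parameter, a variable that does not reduce the residual error enough to offset that penalty is removed.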
Discovery
What differences do we observe in sentiments between the two periods?
We joined the NRC Word-Emotion Association Lexicon to our data, which allowed us to identify words associated with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive).
I produced visualizations comparing the sentiments expressed in each sample period. Compared to our 2020 tweets, the 2021 tweets express less trust, less surprise, less joy, less disgust, less anticipation, and less anger, but more sadness, more fear, and more positivity.
★ Insights
Looking at the specific words underlying the 2020 and 2021 sentiments, we can see that “pandemic” was the most-used word in both sample periods, though with a different frequency in each. In 2021, other negatively valenced words such as “bad” and “sh*t” became more common, as did positively valenced words such as “hope” and “love”. This is interesting because it suggests that after a year, people had become more expressive, likely because of the fallout and exhaustion of the previous year.
How positive vs negative are the tweets from each year?
Next, we joined the AFINN sentiment lexicon, a list of English terms manually rated for valence with an integer between -5 (negative) and +5 (positive) by Finn Årup Nielsen between 2009 and 2011. We used this lexicon to compute mean positivity scores for all words tweeted in each sample year.
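A minimal sketch of the mean-score computation follows, assuming a word-to-integer lexicon in the AFINN style. The valence values and word lists below are illustrative assumptions, not project data or results, and the function name is hypothetical.

```python
# Hypothetical excerpt shaped like AFINN: word -> integer valence in [-5, 5].
# Values are assumptions for illustration only.
TOY_AFINN = {"support": 2, "hope": 2, "love": 3, "stop": -1, "bad": -3}

def mean_valence(words, lexicon):
    """Average valence over the words found in the lexicon; None if none match."""
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else None

# Made-up word lists standing in for the two sample periods.
words_a = ["support", "hope", "pandemic", "bad"]
words_b = ["stop", "love", "hope", "pandemic"]
print(mean_valence(words_a, TOY_AFINN))
print(mean_valence(words_b, TOY_AFINN))
```

Words absent from the lexicon, such as “pandemic” here, are skipped, so the mean reflects only the rated vocabulary.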
★ Insights
The tweets from 2021 are slightly more positive, but the difference appears negligible.
In 2020, the word “support” (positively valenced) was the most frequently appearing word from the lexicon, whereas in 2021, the word “stop” (negatively valenced) appeared the most frequently. Note that “support” and “stop” point in roughly opposite directions. Perhaps, early on, there were certain efforts people wanted to promote to mitigate the effects of the pandemic. A year later, people may have grown exhausted by the pandemic and become more opposed to certain phenomena than supportive of others.
Conclusion
By examining tweets from these two dates, we were able to uncover a trove of insights about how COVID-19 communications have changed, both in sentiments and in content. We see that focus on China has abated slightly. Similarly, the topics that people are discussing have changed. Conversations about lockdown are less prevalent, and political discussions have changed as well. The kinds of tweets going viral (getting heavily retweeted) differ substantially between the two years. In fact, the average number of retweets has itself shifted dramatically, with far fewer retweets overall in 2021. While the presence of particular words can’t by itself predict how viral a tweet is, there are clear patterns in which words attracted attention during each period.
These findings may give clues about how policymakers and influencers can craft their messages to reach more eyes during similar public health crises. Because determinants of virality are ever-changing, influencers should keep their finger on the pulse using methods similar to the ones we used here.