
The prevalence of anxiety and depression in social media.

I wanted to know how the use of distressing language is being perceived and maybe internalised by the readers of headlines.


Using a module called snscrape, I collected data about the frequency of posts containing words like 'depressed' and 'anxious' on Christmas Day each year from 2010 to 2021. This is the function that I used to scrape tweets. It takes a term to search for and a date to search on, searches for the term from that date up to one day past it, and returns a pandas dataframe with the tweets.

import datetime

import pandas as pd
import snscrape.modules.twitter as sntwitter


def get_tweets(term, day, month, year):
    # Build the one-day search window; timedelta rolls over month and year ends
    since = datetime.date(int(year), int(month), int(day))
    until = since + datetime.timedelta(days=1)
    query = '{} since:{} until:{}'.format(
        term, since.isoformat(), until.isoformat())

    # Creating list to append tweet data to
    tweets_list2 = []

    # Using TwitterSearchScraper to scrape data and append tweets to list
    for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
        if i > 10000:
            break
        tweets_list2.append(
            [tweet.date, tweet.id, tweet.content, tweet.user.username,
             tweet.replyCount, tweet.retweetCount, tweet.likeCount])

    # Creating a dataframe from the tweets list above
    df = pd.DataFrame(tweets_list2,
                      columns=['datetime', 'tweet Id', 'text', 'username',
                               'replies', 'retweets', 'likes'])
    return df

I then used my new function to scrape the terms 'depressed' and 'anxious', and found the count for each term and year.

datasets_anxiety = {}
datasets_depression = {}

for i in range(12):  # Christmas Day for each year from 2010 to 2021
    year = "20" + str(i + 10)
    datasets_anxiety[str(i + 10)] = get_tweets("anxious", "25", "12", year)
    datasets_depression[str(i + 10)] = get_tweets("depressed", "25", "12", year)


count_per_year_anxiety = []
count_per_year_depression = []
for i in datasets_anxiety:
    count_per_year_anxiety.append(datasets_anxiety[i].count()["text"])
    count_per_year_depression.append(datasets_depression[i].count()["text"])
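
One thing worth checking at this point: get_tweets stops once the loop index passes 10000, so no dataframe can hold more than 10,001 rows, and a count sitting at that ceiling may understate the true number of tweets for that day. A minimal sketch of such a check follows. The helper name capped_years and the example counts are made up for illustration and are not figures from this project.

```python
MAX_ROWS = 10001  # indices 0..10000 are appended before the loop breaks


def capped_years(counts_by_year, max_rows=MAX_ROWS):
    """Return the years whose count reached the scraper's ceiling."""
    return [year for year, count in counts_by_year.items() if count >= max_rows]


# Made-up counts purely for illustration
example_counts = {"2010": 4200, "2011": 10001, "2012": 9800}
print(capped_years(example_counts))  # ['2011']
```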

Plotting these against an array of years with the following code gave this graph (note that this is an interactive copy and you can use the control icons in the bottom left of each graph to navigate them):

import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 1, sharex='col', sharey='row')

# Top panel: raw tweet counts per year
years = ["20" + i for i in datasets_anxiety.keys()]
ax[0].plot(years, count_per_year_anxiety, label="anxious", color="red")
ax[0].plot(years, count_per_year_depression, label="depressed", color="blue")
# Call legend and labels on the axes directly so they attach to the intended panel
ax[0].legend()
ax[0].set_ylabel("No. of tweets")
ax[1].set_xlabel("Year")
plt.show()

These lines would be helpful indicators if Twitter's monthly active users (MAU) figure were stable across these measurements, but in reality this number has changed dramatically and needs to be taken into account in order to get useful and accurate results. To do this, I found an online record of Twitter MAU since 2007 and recorded this data in an array. The data was recorded in millions, which is orders of magnitude bigger than our other results, so I multiplied each value by 100 to get the MAU in 10,000s.

twitter_MAU = [54,117,185,241,288,305,318,330,321,330,353,396.5]
mau = []
for i in range(len(datasets_anxiety.keys())):#shortens data to num. of years needed
    mau.append(twitter_MAU[i] * 100)
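
As a quick check on the units, here is a minimal sketch of that conversion for a single value. The function name millions_to_ten_thousands is made up for illustration; the value 54 is the first entry of twitter_MAU above.

```python
def millions_to_ten_thousands(mau_millions):
    """Convert a user count in millions to units of 10,000 users.

    One million is 100 lots of 10,000, hence the factor of 100.
    """
    return mau_millions * 100


first_value = 54  # first entry of twitter_MAU, in millions
in_ten_thousands = millions_to_ten_thousands(first_value)
print(in_ten_thousands)           # 5400
print(in_ten_thousands * 10_000)  # 54000000
```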

This is definitely an improvement, although not as readable as it could be. To make it easier to understand at a glance, I normalised the tweet counts by dividing each one by its corresponding year's MAU, which should give a more accurate representation of these terms in this population. Note that from here onwards, the "tweet count" y-axis refers to the number of daily tweets per 10,000 monthly active users, because the MAU values are stored in units of 10,000. That is to say, a value of 1 would represent one tweet per 10,000 monthly active users on a particular day.

normalised_count_anxiety = []
normalised_count_depression = []

for i in range(len(count_per_year_anxiety)):
    normalised_count_anxiety.append(count_per_year_anxiety[i] / mau[i])
    normalised_count_depression.append(count_per_year_depression[i] / mau[i])

# Bottom panel: counts normalised to MAU
ax[1].plot(["20" + i for i in datasets_anxiety.keys()],normalised_count_anxiety,
    label="anxious normalised to user count", color="red")
ax[1].plot(["20" + i for i in datasets_anxiety.keys()],normalised_count_depression,
    label="depressed normalised to user count", color="blue")
ax[1].legend()  # without this call the labels above are never displayed
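
The same normalisation can be written more compactly with zip. This is a minimal sketch rather than the code used for the graphs; the function name normalise and the example counts are made up for illustration. Because the MAU list holds users in units of 10,000, each result reads as tweets per 10,000 monthly active users.

```python
def normalise(counts, mau_units):
    """Divide each year's tweet count by that year's MAU (in units of 10,000 users)."""
    if len(counts) != len(mau_units):
        raise ValueError("counts and mau_units must cover the same years")
    return [count / users for count, users in zip(counts, mau_units)]


# Made-up counts; the MAU values are 54 and 117 million expressed in 10,000s
example_counts = [5400, 2925]
example_mau = [5400, 11700]
print(normalise(example_counts, example_mau))  # [1.0, 0.25]
```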

This graph appears to give an accurate representation of societal mental health over the years, but the years before 2015 could be anomalies due to the lower MAU of those years compared to the relatively stable values for every year since. Because of this, I re-ran the code to only show years with similar maximum MAUs. I also ran my original functions again for new terms that could correlate with or cause the anxiety and depression trends. These trends are plotted in the graph below. From here onwards, all data in graphs will be normalised to MAU but not labelled as such for conciseness.
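
Restricting the data to the later years could be sketched as follows. This is an assumption-laden illustration, not the code that was re-run: the cutoff of 2015 is inferred from the paragraph above, the helper name from_year is made up, and the first MAU value is taken to belong to 2010, matching the indexing used earlier.

```python
FIRST_YEAR = 2010   # assumed year of the first twitter_MAU entry
CUTOFF_YEAR = 2015  # assumed cutoff, inferred from the text

twitter_MAU = [54, 117, 185, 241, 288, 305, 318, 330, 321, 330, 353, 396.5]


def from_year(values, first_year, cutoff_year):
    """Return (year, value) pairs for years at or after cutoff_year."""
    return [(first_year + i, value)
            for i, value in enumerate(values)
            if first_year + i >= cutoff_year]


stable = from_year(twitter_MAU, FIRST_YEAR, CUTOFF_YEAR)
print(stable[0])    # (2015, 305)
print(len(stable))  # 7
```

The same helper could be applied to the tweet count lists so that counts and MAU stay aligned by year.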

Conclusions and speculations.

The scientific validity of this method can definitely be called into question despite my efforts to follow some vaguely scientific method, so take these results with that in mind.

From the final graph, I made the observation that mentions of depression, war, anxiety and coronavirus all followed similar trends over the years, while mentions of global warming seem to take a completely independent path. One way to explain this is that society may see things like war and coronavirus as absolute and unsolvable, while people may feel as if they have time to help suppress global warming and that it is less of an imposing risk to their way of life compared to the other two. This could also be down to the season that the results were taken from, since the effects of global warming are less evident to many in the winter. The similar growth of the other terms could indicate that as war and coronavirus are mentioned more often, more people feel anxious or depressed in general. Alternatively, the causation may run the other way: inflated cases of depression and anxiety may have caused people to speculate negatively more often about disasters and global events. Another theory that could explain the increase in the mentions of anxiety and depression is that people are becoming more comfortable about expressing mental health issues publicly, and this happened to coincide with a few global events.
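
One way to put a number on "followed similar trends" would be a Pearson correlation coefficient between two normalised series. The sketch below was not part of the original analysis; the function name pearson is made up and the series are placeholder values, not results from this project. A value near 1 means two series rise and fall together, a value near -1 means they move in opposite directions, and neither says anything about which one causes the other.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Placeholder series purely for illustration
series_a = [1.0, 2.0, 3.0, 4.0]
series_b = [2.0, 4.0, 6.0, 8.0]  # rises in step with series_a
series_c = [4.0, 3.0, 2.0, 1.0]  # falls as series_a rises

print(pearson(series_a, series_b))
print(pearson(series_a, series_c))
```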


Reflection.

While researching this project I foresaw a few flaws in the data collection, and in the conclusions drawn from the data, that would be useful to acknowledge here. Firstly, the terms that I chose to scrape for could be damaging to the results: I saw that some of the data included tweets using "anxious" as a way to express excitement rather than genuine anxiety. There is a much lower chance of the word "depressed" being used in a positive way, but that does not mean it could not be used in dry humour or to exaggerate feelings about something, in tweets that wouldn't make the list in this project if it were validated by a human. Another thing to note is that since each tweet count is found only for Christmas Day, the data might represent only a period of seasonal depression rather than accurately reporting society's morale. There is also the possibility that things like global warming are worried about less in winter and therefore contribute less; perhaps if the data had been taken from summer it would have played a larger part. The only thing stopping further dates being taken for a larger data set was the time that it took to run the scraper: just this data took my computer over a full day and night to collect.

All code was written in Python 3.9.