Text mining the American The Office series, part II


In the first part of our two-part The Office blog series we explored the main characters’ vocabularies, most frequent phrases and most personal words. Now we turn to analyzing their sentiments and finding people who sound alike.

Sentiment analysis is one of the most interesting, useful, and commonly applied techniques in text mining. It lets us decide whether a text expresses positive, neutral or negative emotions. There are several ways to run it: (1) by-word categorization, either into positive and negative or into emotion classes such as joy, anger or trust; (2) by-word scoring, where each word gets a number on a given scale (AFINN, for example, uses -5 to +5); or (3) scoring phrases and sentences instead of single words, and scaling the overall sentiment of the text between -1 (all negative) and +1 (all positive). For the results shown in the following paragraphs, we used several of these techniques to visualize relationships between people and the overall attitudes of characters (are they nice or mean).

Overall sentiments of characters

The first way we’ll extract each person’s overall sentiment is the following: we take the most frequent words that actually carry a sentiment value and show the most common positive and negative terms by character. Even this simple analysis already shows that some people, like Angela and Oscar, use their most negative words almost as frequently as their most positive ones. That’s the first hint at people’s overall attitudes.
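The counting step described above can be sketched in a few lines of Python. This is only a minimal illustration, not the code behind our charts: the lexicon, the sample lines and the function names are all made up for the example, and a real run would load a published positive/negative word list and the full transcripts.

```python
from collections import Counter, defaultdict

# Tiny hand-made lexicon for illustration only (an assumption, not a real list).
LEXICON = {"love": "positive", "great": "positive", "good": "positive",
           "hate": "negative", "stupid": "negative", "bad": "negative"}

# Made-up (speaker, line) pairs, not real transcript data.
LINES = [
    ("Michael", "I love this great great office"),
    ("Michael", "That is a bad idea"),
    ("Angela", "I hate this stupid party"),
    ("Angela", "Stupid cats? No, good cats"),
]

def tokenize(text):
    """Lowercase a line and strip surrounding punctuation from each token."""
    return [w.strip(".,!?\"'").lower() for w in text.split()]

def top_sentiment_words(lines, lexicon, n=3):
    """Return {speaker: {"positive": [(word, count)], "negative": [...]}}."""
    counts = defaultdict(lambda: {"positive": Counter(), "negative": Counter()})
    for speaker, text in lines:
        for word in tokenize(text):
            label = lexicon.get(word)
            if label is not None:  # skip words that carry no sentiment label
                counts[speaker][label][word] += 1
    return {speaker: {label: c.most_common(n) for label, c in by_label.items()}
            for speaker, by_label in counts.items()}

result = top_sentiment_words(LINES, LEXICON)
```

Comparing the counts of a character’s top positive and top negative words gives the same kind of first hint about overall attitude that we read off the chart.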

A much more appropriate method is to score each word with the AFINN lexicon (integer scores from -5 to +5), multiply each word’s frequency by its sentiment score and see which words contribute the most negativity and positivity to people’s speech.

This way it is much clearer that Angela, Darryl and Dwight bring the most negativity to their speech: around 50% of their most commonly used sentiment-bearing words are negative. This brings us closer to reality, as Pam’s, Jim’s and Michael’s word lists are far more positive than negative. With this method we can sort people into ‘nicer’ and ‘meaner’ classes.
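The frequency-times-score weighting described above is easy to sketch. The scores below are illustrative AFINN-style integers and the word list is invented, so treat this as a hedged outline of the idea rather than our actual pipeline; `negative_share` is a helper we introduce here, and it measures the share of total weight, which is related to but not the same as the share of words.

```python
from collections import Counter

# Illustrative AFINN-style integer scores (assumed values for the example).
SCORES = {"love": 3, "hate": -3, "stupid": -2}

# Made-up words spoken by one character.
WORDS = ["love", "love", "hate", "stupid", "stupid", "stupid", "office"]

def word_contributions(words, scores):
    """Map each scored word to its frequency multiplied by its score."""
    freq = Counter(w for w in words if w in scores)
    return {word: count * scores[word] for word, count in freq.items()}

def negative_share(contributions):
    """Fraction of the total absolute contribution that comes from negative words."""
    total = sum(abs(v) for v in contributions.values())
    negative = sum(-v for v in contributions.values() if v < 0)
    return negative / total if total else 0.0

contrib = word_contributions(WORDS, SCORES)
share = negative_share(contrib)
```

Here "stupid" is milder than "hate" word for word, but because it is used three times it contributes more negativity overall, which is exactly what the weighting is meant to surface.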

Sentiments between people

After looking at overall sentiments by character, it’s interesting to see how the words used can model relationships between people. We can take the sentiments of the words people say to one another and see how well the results represent the actual attitudes between certain characters. We’ll show the results as a network graph to better visualize the relationships. Unfortunately, we cannot paste the interactive network here due to its long HTML code, but we created a GIF from it so our readers can see what they’re missing out on. The direction of each arrow shows who is speaking to whom, and its color shows the sentiment (green for positive, red for negative).

[GIF: sentiment network of the characters]

A few conclusions can be drawn from the GIF above. Everyone is nice to Darryl, as all the arrows pointing to him are green. Jim is nice to everyone, as all the arrows coming from him are green. Dwight, however, uses rather negative words when talking to Jim, so the arrow pointing from Dwight to Jim is red. Most red arrows originate from Angela and Oscar, which means that on average they’re the least nice, possibly the meanest, people in the series.
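To make the arrows concrete, here is a rough sketch of how directed edges like the ones in the GIF could be computed. It assumes each line of dialogue is already tagged with the person being addressed, which is an assumption of this example; the scores, the sample lines and `build_edges` are likewise hypothetical.

```python
from collections import defaultdict

# Illustrative AFINN-style scores (assumed values for the example).
SCORES = {"love": 3, "great": 3, "thanks": 2, "hate": -3, "stupid": -2, "idiot": -3}

# Made-up (speaker, addressee, line) triples, not real transcript data.
UTTERANCES = [
    ("Jim", "Pam", "I love it, thanks"),
    ("Dwight", "Jim", "You idiot, that is stupid"),
    ("Jim", "Dwight", "Great idea"),
]

def build_edges(utterances, scores):
    """Return {(speaker, addressee): {"mean": float, "color": str}}."""
    totals = defaultdict(lambda: [0, 0])  # edge -> [score sum, scored-word count]
    for speaker, addressee, text in utterances:
        for raw in text.split():
            word = raw.strip(".,!?\"'").lower()
            if word in scores:
                totals[(speaker, addressee)][0] += scores[word]
                totals[(speaker, addressee)][1] += 1
    edges = {}
    for pair, (total, count) in totals.items():
        mean = total / count  # count >= 1 for every edge stored in totals
        edges[pair] = {"mean": mean, "color": "green" if mean >= 0 else "red"}
    return edges

edges = build_edges(UTTERANCES, SCORES)
```

Each key is one arrow: the first name is where it starts, the second is where it points, and the color follows the sign of the mean score.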

It’s a good idea to visualize only the ‘strongest’ (most positive or most negative) relationships. We can do that by keeping only the arrows in the network that have an extremely high or low sentiment value.

Doing just that shows that Jim and Pam’s relationship is the most positive of all, while the tone between Angela and her colleagues is strongly negative in both directions. Earlier we saw that most red arrows originate from Angela and Oscar. Now, when only the stronger, more meaningful relationships are taken into account, the arrows between Oscar and Angela are red in both directions. So while Jim and Pam have the most positive coworker relationship in The Office, Oscar and Angela have the most negative one.
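Filtering for the strongest relationships boils down to a threshold on the absolute edge value, plus a check for pairs whose arrows agree in both directions. The sketch below uses invented edge values and an arbitrary cutoff of 1.5, neither of which comes from our analysis; the function names are ours as well.

```python
# Made-up mean sentiment per directed edge; NOT results from the analysis.
EDGE_MEANS = {
    ("Jim", "Pam"): 2.4, ("Pam", "Jim"): 2.1,
    ("Oscar", "Angela"): -1.8, ("Angela", "Oscar"): -2.0,
    ("Jim", "Dwight"): 0.3, ("Dwight", "Jim"): -0.4,
}

def strongest_edges(edge_means, threshold):
    """Keep only edges whose absolute sentiment reaches the threshold."""
    return {pair: m for pair, m in edge_means.items() if abs(m) >= threshold}

def mutual_pairs(edge_means, sign):
    """Unordered pairs whose edges in both directions have the given sign (+1 or -1)."""
    pairs = set()
    for (a, b), value in edge_means.items():
        back = edge_means.get((b, a))
        if back is not None and value * sign > 0 and back * sign > 0:
            pairs.add(frozenset((a, b)))
    return pairs

strong = strongest_edges(EDGE_MEANS, threshold=1.5)  # cutoff chosen arbitrarily
mutually_negative = mutual_pairs(strong, sign=-1)
mutually_positive = mutual_pairs(strong, sign=+1)
```

The choice of threshold is a judgment call: too low and the network stays cluttered, too high and only one or two arrows survive.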

Evolution of Jim and Dwight’s relationship in terms of sentiments

If we want to analyze how sentiments change over time, it’s wise to choose people for whom we actually expect a trend, because we know their relationship evolved. For this quick task we selected Jim and Dwight, the show’s rivals turned friends.


We ran the trend-sentiment analysis with two different methods and got similar results. Dwight has always been meaner to Jim than the other way around. In the final season, however, Dwight ends on a very high note compared to his average over the previous 8 seasons, which supports, to some extent, the idea that the two ended the series as very good friends.
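One simple way to get such a trend is to average line-level sentiment by season for each direction of the relationship. This is just one plausible approach, shown as a sketch; it is not necessarily either of the two methods we ran, and the records and `seasonal_means` are made up for illustration.

```python
from collections import defaultdict

# Made-up (season, speaker, addressee, line score) records, not real data.
RECORDS = [
    (1, "Dwight", "Jim", -2), (1, "Dwight", "Jim", -1), (1, "Jim", "Dwight", 0),
    (9, "Dwight", "Jim", 2), (9, "Dwight", "Jim", 3), (9, "Jim", "Dwight", 1),
]

def seasonal_means(records, speaker, addressee):
    """Mean line score per season for lines from speaker to addressee."""
    buckets = defaultdict(list)
    for season, who, to, score in records:
        if who == speaker and to == addressee:
            buckets[season].append(score)
    # Sort by season so the result reads as a time series.
    return {season: sum(v) / len(v) for season, v in sorted(buckets.items())}

dwight_to_jim = seasonal_means(RECORDS, "Dwight", "Jim")
jim_to_dwight = seasonal_means(RECORDS, "Jim", "Dwight")
```

Plotting the two series side by side makes it easy to compare both the level (who is meaner) and the direction of change over the seasons.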

Quick intro to topic analysis: finding people who speak alike

Another interesting, though possibly less popular, area of text mining is topic analysis or topic modeling, which uses algorithms to cluster a text based on its words. It focuses on extracting words that are found ‘close to each other’ in the text, in other words, words that are used together in the same context. A typical use case is clustering a news article’s words into two categories, for example financial and technological terms.

In our case we can use topic modeling to find people who belong to the same word clusters, that is, people who use similar vocabularies and sound somewhat alike. For this, we simply set the number of clusters to 12, as we’re working with 12 people, and see who ends up in the same groups.

Looking at the above chart, we can see that Oscar and Angela are in the same group, which means they use similar words. That is not a surprise, as both are accountants and sit next to each other. Something else is interesting too: Jim and Pam each have their own cluster, but they also share a joint group (cluster #8), which could hold the ‘topics’ of their private lives, such as talking to their daughter, to their families or simply to one another about non-office-related things.
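Reading a chart like this amounts to asking, for each topic, which characters devote a meaningful share of their words to it. The sketch below assumes a topic model has already been fitted and has produced per-character topic proportions; it does not fit the model itself. The proportions (only 4 topics instead of 12), the 0.2 cutoff and `characters_by_topic` are all invented for the example.

```python
from collections import defaultdict

# Made-up per-character topic proportions (each row sums to 1); NOT model output.
PROPORTIONS = {
    "Jim":    [0.60, 0.00, 0.40, 0.00],
    "Pam":    [0.00, 0.55, 0.45, 0.00],
    "Oscar":  [0.00, 0.00, 0.10, 0.90],
    "Angela": [0.05, 0.00, 0.00, 0.95],
}

def characters_by_topic(proportions, min_share=0.2):
    """Group characters under every topic that takes at least min_share of their text."""
    groups = defaultdict(list)
    for character, shares in proportions.items():
        for topic, share in enumerate(shares, start=1):  # topics numbered from 1
            if share >= min_share:
                groups[topic].append(character)
    return {topic: sorted(names) for topic, names in sorted(groups.items())}

groups = characters_by_topic(PROPORTIONS)
```

In this toy setup two characters each get a topic of their own and also share a third one, while the other two land together in a single topic, which mirrors the kind of pattern described above.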

In the first part of our The Office blog series, we explored word and phrase usage and found the most personal words. In this second part, we focused on analyzing sentiments and clustering characters who sound alike. We hope to have shown some of the interesting things Natural Language Processing tools can do, whether for a fun project like this or for business purposes such as analyzing customer feedback, scoring product reviews by sentiment, clustering articles by topic or simply extracting the words that certain companies like to use on social media.

Those interested in technical details can read the original English text on Towards Data Science.


Author:
Kristóf Rábay – Data Scientist
