Visualizing and Analyzing the Post Show Recaps Universe


Well, I’m back again with a follow-up project to my original Seinfeld Recap Podcast work that I posted about a few weeks ago (you can hear me talk about that project here).

The genesis of my latest project came from a suggestion by Mike Bloom.

He had contacted me and suggested that it might be interesting to look at how different podcasts and their hosts use language. I started thinking about this and decided to bake some text analysis into a visualization of the podcasts on Post Show Recaps, as well as see what answers, if any, we can mine from the audio transcripts of these podcasts.

In what follows, I discuss three different interactive visualizations that I created based on the TV podcasts on Post Show Recaps: how to interpret and use them, and what they tell us about those shows. I also discuss the tag cloud representations based on the most frequently spoken unique terms from all 17 TV podcasts. Then I dig into how similar and dissimilar the podcasts are, applying information retrieval techniques to calculate the similarity between each pair of podcasts based on their audio transcripts, and dive into what that might say about those shows and their hosts. Finally, I perform a sentiment analysis of the audio transcripts to measure how positive, negative and neutral the language of each podcast is.

Read on to find out about the Wigler Effect, which shows are most similar, which are most unique, which podcast uses the most positive language and which show is the most negative.

Creating the Visualizations

Starting with the current TV shows that are being covered by Post Show Recaps, I created a graph where each podcast host and show is a node. If the podcaster hosts a podcast about that show, they have a relationship represented by an edge in the graph. In the visualization you can select any node and highlight all the relationships and explore who is covering what, who podcasts the most and who hosts shows together.

I created the same kind of visualization for the past TV shows as well. An example of one of these graphs is shown below.

Then, getting back to Mike Bloom’s original idea, I wanted to look at what language is used to describe each show. To do this, I first wrote a program to download all the podcasts available for every show (it’s like 13 gigabytes of files!). I then wanted to convert these audio files into text. To shortcut the process a bit, I took a random selection of 10 podcasts from each show and converted them from audio to text rather than using every available episode.

Using the text files, I wrote a script to calculate the frequency of each term (ignoring stop words) and write the top 1,000 most frequently spoken terms from each show to a database. In the visualization, the tag clouds are composed of the most frequently spoken terms unique to that show. That is, each term does not appear in the 1,000 most frequent terms of any other show.
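The counting step is simple enough to sketch. The snippet below is a simplified version of that script, with a toy stop-word list standing in for the much larger one I actually used:

```python
import re
from collections import Counter

# A tiny stand-in stop-word list; the real script used a much larger one.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is", "it", "in", "that"}

def top_terms(transcript, n=1000):
    """Tokenize a transcript, drop stop words, and return the n most frequent terms."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)

print(top_terms("The show is great and the episode is great", n=3))
```

Running this per show and writing each `top_terms` result to the database gives the raw material for the clouds.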

I used Jason Davies’s Word Cloud JavaScript library to create a cool animation effect for laying out and rendering the tag clouds and incorporated that into the original graphs I constructed. These clouds are rendered when you click on a show.

I also thought it might be cool to look at how a podcast’s language evolves over time. Since I already had a transcript of about 90 Seinfeld Recap Podcasts, I decided to create an animation to show the evolution of the term usage throughout these recaps.

Starting with Episode 1 of the podcast, I show the 100 or so most frequently spoken terms, then every couple of seconds, I transition the tag cloud to display the change in the most frequently spoken terms. For example, Episode 2’s cloud shows the most frequently spoken terms through Episodes 1 and 2, and Episode 80’s cloud shows the most frequently spoken terms across all 80 episodes.
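The cumulative counting behind the animation can be sketched like this (the episode counts here are toy data, not real Seinfeld numbers):

```python
from collections import Counter

def cumulative_top_terms(episode_counts, n=100):
    """Yield the top-n terms after each episode, accumulating counts as we go."""
    running = Counter()
    for counts in episode_counts:
        running += counts  # fold this episode's term counts into the running total
        yield [term for term, _ in running.most_common(n)]

episodes = [Counter({"jerry": 5, "kramer": 2}), Counter({"kramer": 4, "soup": 3})]
for cloud in cumulative_top_terms(episodes, n=2):
    print(cloud)
# ['jerry', 'kramer']
# ['kramer', 'jerry']
```

Each yielded list becomes one frame of the time-lapse, which is why the clouds drift rather than jump: only a few terms change rank between consecutive frames.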

The image below shows the tag cloud for Episode 1 through 9 from the Seinfeld Podcast.

You can explore all these visualizations here.

Term Frequency Observations

Using all 17 shows covered by Post Show Recaps, I originally started exploring tag clouds composed of the 100 most frequently spoken terms for each podcast. However, you end up with a ton of overlap between these terms. For example, 23 of the 100 most frequent terms appear in all 17 tag clouds.

The following 23 words appear in all 17 tag clouds:

  • point
  • sort
  • being
  • thought
  • think
  • other
  • first
  • probably
  • kind
  • great
  • time
  • little
  • feel
  • down
  • pretty
  • people
  • episode
  • show
  • good
  • talk
  • getting
  • doing
  • mean
 

These 23 terms are not too surprising. Most are very common English words, and the others are terms you would expect to show up consistently in these podcasts, like episode or talk.

To make the visualization and analysis more meaningful, I then changed the tag clouds to instead show the most frequently spoken terms unique to a given podcast. So each term displayed in a cloud is among the 1,000 most frequent terms for that particular podcast but does not appear in the 1,000 most frequent terms of any other podcast.
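The uniqueness filter is just set arithmetic over each show's top-1,000 list. A small sketch, with toy counts standing in for the real transcript data:

```python
from collections import Counter

def unique_top_terms(show_counts, n=1000):
    """For each show, keep only the top-n terms that appear in no other show's top n."""
    top = {show: {t for t, _ in c.most_common(n)} for show, c in show_counts.items()}
    return {show: terms - set().union(*(t for s, t in top.items() if s != show))
            for show, terms in top.items()}

counts = {
    "Game of Thrones": Counter({"dragons": 9, "episode": 8}),
    "Seinfeld": Counter({"jerry": 7, "episode": 6}),
}
print(unique_top_terms(counts, n=2))
```

Here the shared term episode drops out of both clouds, leaving only the terms that genuinely distinguish each show.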

Without this change, each show has on average only 10-12 unique terms in its top 100 most frequently spoken terms. This change makes the visualization far more impactful. Each cloud becomes a true representation of the main concepts from that show. For example, consider the tag cloud below. It’s pretty easy to see that this is from Game of Thrones. What other show would talk about dragons, lords, and wildlings so prominently?

Turning our attention to the Seinfeld Podcast time-lapse visualization, I did not restrict it to purely unique terms. Instead, I show the top 100 most frequently spoken terms.

The clouds start to stabilize with time. For example, the only difference between the last two tag clouds in the animation is that money appears in the final cloud while story appears in the one prior.

Fifty-seven terms are in every Seinfeld Podcast tag cloud starting at Episode 1 through to Episode 86.

Individual Tag Clouds

Below is every show’s tag cloud term representation.

 

The individual terms for each show are really interesting. One trend you’ll notice is that the main characters of a show are often amongst the unique terms. Also, unique central concepts, like transgender or girlfriend for Orange is the New Black or vampires for The Strain, help define what those shows are about.

Current TV

Game of Thrones

The Walking Dead

Saturday Night Live

Better Call Saul

Justified

The Leftovers

House of Cards

Daredevil

The Strain

Orange is the New Black

Once Upon a Time

Orphan Black

Past TV

Boardwalk Empire

Sons of Anarchy

Seinfeld

Lost

24

Computing the Similarity Between Podcasts

Finally, I wanted to see if I could answer the following questions:

  • How similar are the audio transcripts from all of these podcasts?
  • What podcasts are most similar to each other?
  • Does the same host influence this similarity or does it have more to do with the show that is being discussed?

To tackle these questions, I needed a way to compare the text of the audio transcripts. Luckily, thanks to the research behind Internet web search, there are tons of established techniques for comparing documents.

I applied a technique from information retrieval called term frequency-inverse document frequency, or TF-IDF, to convert the audio transcripts for each show into a representation that lets me compute similarities between the podcasts based on the language used to discuss each show.

Essentially, TF-IDF turns each document into a vector, where each entry in the vector corresponds to a term found in the document. Then, to compute the similarity between two documents, we just need to determine how similar the two vectors are. If we imagine these vectors as arrows in space, similarity can be interpreted as the angle between them: two identical vectors have a zero-degree angle between them, while two very different vectors point in completely different directions.
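Here's a bare-bones sketch of that pipeline. It's not my exact code (real TF-IDF implementations add smoothing and other tweaks), but it shows the idea: weight each term by how rare it is across the collection, then compare documents by the cosine of the angle between their vectors:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn each tokenized document into a sparse TF-IDF vector (term -> weight)."""
    n = len(docs)
    # Document frequency: how many transcripts does each term appear in?
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = lambda vec: math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return dot / (norm(u) * norm(v))

transcripts = [
    "dragons winter dragons throne".split(),
    "dragons winter wall".split(),
    "jerry kramer soup".split(),
]
vecs = tfidf_vectors(transcripts)
# The two fantasy "transcripts" score closer to each other than to the sitcom one.
print(cosine_similarity(vecs[0], vecs[1]), cosine_similarity(vecs[0], vecs[2]))
```

Run over the real transcripts, this produces a similarity score for every pair of podcasts, and each show's "most similar" list is just its three highest-scoring neighbors.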

Using this approach, I took each podcast and computed the similarity between it and every other podcast. The results from this experiment are below:

Results from a Similarity Test between Podcasts

Game of Thrones
Hosts: Rob Cesternino & Josh Wigler
Most Similar Podcasts: The Leftovers, Lost and The Walking Dead

 

The Walking Dead
Hosts: Rob Cesternino & Josh Wigler
Most Similar Podcasts: The Leftovers, Lost and Game of Thrones

Saturday Night Live
Hosts: Rob Cesternino and Rich Tackenberg
Most Similar Podcasts: The Leftovers, Lost and Orphan Black

Better Call Saul
Hosts: Rob Cesternino and Antonio Mazzaro
Most Similar Podcasts: The Leftovers, Orange is the New Black and Lost

Justified
Hosts: Josh Wigler and Antonio Mazzaro
Most Similar Podcasts: The Leftovers, Lost and Orange is the New Black

The Leftovers
Hosts: Josh Wigler and Antonio Mazzaro
Most Similar Podcasts: Lost, Orange is the New Black and The Strain

House of Cards
Hosts: Rob Cesternino and Zach Brooks
Most Similar Podcasts: Lost, The Leftovers and Orphan Black

Daredevil
Hosts: Josh Wigler and Kevin Mahadeo
Most Similar Podcasts: The Leftovers, Lost and Orphan Black

 

The Strain
Hosts: Josh Wigler and Antonio Mazzaro
Most Similar Podcasts: The Leftovers, Lost and Orange is the New Black

Once Upon a Time
Hosts: Mike Bloom and Curt Clarke
Most Similar Podcasts: Orphan Black, Lost and The Leftovers

Orphan Black
Hosts: Mike Bloom and Jessica Liese
Most Similar Podcasts: Lost, Orange is the New Black and The Leftovers

Orange is the New Black
Hosts: Taylor Cotter and Jessica Liese
Most Similar Podcasts: The Leftovers, Lost and Orphan Black

Boardwalk Empire
Hosts: Antonio Mazzaro and Jeremiah Panhorst
Most Similar Podcasts: The Leftovers, Orange is the New Black, and Lost

Sons of Anarchy
Hosts: Rob Cesternino and Josh Wigler
Most Similar Podcasts: Lost, The Leftovers, and Game of Thrones

Seinfeld
Hosts: Rob Cesternino and Akiva Wienerkur
Most Similar Podcasts: Daredevil, The Leftovers, and Orphan Black

Lost
Hosts: Josh Wigler and Mike Bloom
Most Similar Podcasts: The Leftovers, Orange is the New Black and Orphan Black

24
Hosts: Rob Cesternino and Josh Wigler
Most Similar Podcasts: Lost, Game of Thrones and Orange is the New Black

The overall variance between the computed similarities for all shows is pretty small, which makes sense: a lot of the hosts overlap, they are all talking about television shows, and they are all podcasts.

What is interesting, and also kind of crazy, about the most similar shows is that The Leftovers and Lost each show up in 15 of these lists of most similar podcasts! The only podcast that doesn’t have The Leftovers in its top three most similar is 24, and for Lost, Seinfeld is the only podcast without it.

I’m not sure how to interpret this. I’ve never seen The Leftovers or listened to the podcast, so I have no insights with respect to that show. As for Lost, perhaps since it’s a very iconic episodic drama that set the stage for many of these shows, it stands to reason that all these shows would warrant similar language usage.

Another factor could be what I am calling “The Wigler Effect”. Josh Wigler hosts a total of 9 shows, more than anyone else. It could be that his talking points overlap and come to dominate the language used across most of these podcasts.

The most consistently dissimilar shows from all other podcasts are:

  • Boardwalk Empire
  • House of Cards
  • Justified
  • Seinfeld
  • Saturday Night Live

I’m not familiar with Justified, but based on my knowledge of the other shows, I think this makes sense. There’s common language used to describe some of these shows that you just wouldn’t find anywhere else, like sketch, jokes and monologue from the SNL podcast.

Sentiment Analysis

In this last experiment, I wanted to perform sentiment analysis to see whether there’s a difference in the overall positive or negative language used in each podcast. Sentiment analysis is a computational method for categorizing the mood or opinions expressed in a piece of text. This is a hot area of research and development for things like brand awareness. If a big company wants to know how people are reacting to their products online, they could analyze the overall sentiment from something like people’s tweets. Companies like BrandWatch specialize in this type of analysis.
 
The product of a sentiment analysis on a piece of text is a classification of positive, negative or neutral. There are a variety of approaches. The simplest, often used in conjunction with more advanced techniques, is to have a predefined set of positive and negative words and count how many of each appear in the text. More advanced techniques apply machine learning approaches like Naive Bayes classifiers, trained on a large collection of human-categorized text. The closer the test set is to the training set, the more accurate the result.
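The word-counting approach is easy to sketch. The lexicons below are toy stand-ins; real sentiment lexicons contain thousands of entries:

```python
# Toy lexicons for illustration; real sentiment lexicons are much larger.
POSITIVE = {"great", "love", "awesome", "funny", "best"}
NEGATIVE = {"hate", "terrible", "boring", "worst", "awful"}

def lexicon_sentiment(text):
    """Classify text as positive, negative, or neutral by counting lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(lexicon_sentiment("I love this show it is the best"))  # positive
print(lexicon_sentiment("I hate this boring episode"))       # negative
```

The obvious weakness is context: "not great" counts as positive here, which is part of why the more advanced classifier-based techniques exist.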
 
General sentiment analysis is only about 60-70% accurate, and for the most part it’s used on smaller snippets of text like tweets or text messages. In an ideal world, I’d separate the audio by host and also train the classifiers on a training set more relevant to the transcripts, but that would be a ton of work. Just the same, I was still curious to see what we could learn, if anything, from these podcast transcripts using a general sentiment analysis approach.
 
Python has NLP libraries that support sentiment analysis, so I didn’t need to write my own algorithms from scratch. Using my random sample of 10 episodes per podcast, I calculated the sentiment of the introduction, middle and end of each episode, then averaged these to get an overall sentiment per podcast.
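The sectioning-and-averaging step looks roughly like this. The `score` function here is a toy stand-in for the library's classifier, which I'm not reproducing:

```python
def section_sentiments(transcript, score):
    """Score the intro, middle, and end thirds of a transcript separately.

    `score` is any function mapping text to a number (higher = more positive);
    it stands in for the NLP library's real sentiment classifier."""
    words = transcript.split()
    third = max(len(words) // 3, 1)
    sections = (words[:third], words[third:2 * third], words[2 * third:])
    return [score(" ".join(s)) for s in sections]

def average_sentiment(transcripts, score):
    """Average the intro/middle/end scores across a sample of episodes."""
    per_episode = [section_sentiments(t, score) for t in transcripts]
    return [sum(col) / len(col) for col in zip(*per_episode)]

# Toy scorer for demonstration only; the real analysis used a trained classifier.
score = lambda t: t.count("good") - t.count("bad")
print(average_sentiment(["good good good bad bad bad good good good"], score))
# [3.0, -3.0, 3.0]
```

Averaging per section rather than per episode is what makes it possible to say things like "Justified starts negative but ends neutral" in the table that follows.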
 

Across the 10 episodes, most shows came out with a mixture of some negative, some neutral and a little positive, but the overall average sentiment for almost every podcast, across all three sections, was negative.

Below is a summary of results from the podcasts where something interesting actually happened.

Show              Intro     Middle    Ending
Justified         negative  negative  neutral
The Leftovers     neutral   negative  negative
Once Upon a Time  neutral   negative  negative
24                neutral   negative  negative
Lost              neutral   positive  positive
Seinfeld          negative  negative  negative

Surprisingly, Lost is the only show with an overall positive sentiment. Josh loves this show, so maybe his praise carries forward throughout the podcast :-).

The other really interesting result is Seinfeld. As mentioned, most shows had some mixture of sentiment, but the overall average was negative. What’s interesting about Seinfeld is that for every episode and segment I looked at, the result was definitively negative. Perhaps Akiva’s complaints about chocolate (23 times and counting) and other foods carry the day on this podcast.

Final Remarks

Wow, that was a lot of work and I probably know more about the podcasts on Post Show Recaps than any human should! Hopefully you enjoyed this deep dive into those podcasts. If you notice anything interesting or have any questions, I’d love to hear from you on Twitter or in the comments. And if you haven’t looked at the visualization yet, here’s the link one more time.

I’d like to thank Rob Cesternino, Antonio Mazzaro, Josh Wigler, Kevin Mahadeo, Zach Brooks, Mike Bloom, Curt Clarke, Jessica Liese, Taylor Cotter, Rich Tackenberg, Jeremiah Panhorst, and Akiva Wienerkur for continuing to make awesome content that’s fun to listen to and analyze :-).

About the author

Sean Falconer

I write about programming, developer relations, technology, startup life, occasionally Survivor, and really anything that interests me.