Well, I’m back again with a follow-up project to my original Seinfeld Recap Podcast work that I posted about a few weeks ago (you can hear me talk about this project here).
The genesis of my latest project came from a suggestion by Mike Bloom.
He had contacted me and suggested that it might be interesting to look at how different podcasts and their hosts use language. I started thinking about this and thought it might be interesting to bake some text analysis into a visualization of the podcasts on Post Show Recaps, as well as see what answers, if any, we can mine from the audio transcripts of these podcasts.
In what follows, I discuss three different interactive visualizations that I created based on the TV podcasts on Post Show Recaps. I discuss how to interpret and use them and what they tell us about those shows. I also discuss the tag cloud representations based on the most frequently spoken unique terms from all 17 TV podcasts. Then I dig into how similar and dissimilar the podcasts are by applying information retrieval techniques to the audio transcripts to calculate the similarity between each pair of podcasts, and dive into what that might say about those shows and their hosts. Finally, I perform a sentiment analysis of the audio transcripts to measure how positive, negative or neutral the language of the podcasts is.
Read on to find out about the Wigler Effect, what shows are most similar, which are most unique, what podcast uses the most positive language and what show is the most negative.
Creating the Visualizations
Starting with the current TV shows that are being covered by Post Show Recaps, I created a graph where each podcast host and show is a node. If the podcaster hosts a podcast about that show, they have a relationship represented by an edge in the graph. In the visualization you can select any node and highlight all the relationships and explore who is covering what, who podcasts the most and who hosts shows together.
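To make the graph structure concrete, here is a minimal Python sketch. It is not the code behind the actual visualization; the function names `build_graph` and `co_hosts` are made up for illustration, and the host/show pairs are just a small subset taken from the lists later in this post.

```python
from collections import defaultdict

# A small subset of host/show pairs taken from the lists later in this post.
HOSTING = [
    ("Rob Cesternino", "The Walking Dead"),
    ("Josh Wigler", "The Walking Dead"),
    ("Josh Wigler", "Lost"),
    ("Mike Bloom", "Lost"),
    ("Mike Bloom", "Orphan Black"),
    ("Jessica Liese", "Orphan Black"),
]


def build_graph(pairs):
    """Return an undirected adjacency map: node -> set of neighbouring nodes."""
    graph = defaultdict(set)
    for host, show in pairs:
        graph[host].add(show)  # edge from host to show
        graph[show].add(host)  # and back, since the graph is undirected
    return graph


def co_hosts(graph, host):
    """Hosts who share at least one show with `host`."""
    partners = set()
    for show in graph[host]:
        partners |= graph[show]
    partners.discard(host)
    return partners


graph = build_graph(HOSTING)
print(sorted(graph["Josh Wigler"]))            # shows this host covers
print(sorted(co_hosts(graph, "Josh Wigler")))  # people he hosts with
```

Selecting a node in the visualization corresponds to looking up its neighbours in this map, and "who podcasts the most" is simply the host with the largest neighbour set.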
I created the same kind of visualization for the past TV shows as well. An example of one of these graphs is shown below.
Then, getting back to Mike Bloom’s original idea, I wanted to look at what language is used to describe each show. To do this, I first wrote a program to download all the podcasts available for every show (it’s like 13 gigabytes of files!). I then wanted to convert these audio files into text. To shortcut the process a bit, I took a random selection of 10 podcasts from each show and converted them from audio to text rather than using every available episode.
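The sampling step can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the helper name `sample_episodes`, the fixed seed, and the file names are all hypothetical, and the sketch says nothing about how the audio is downloaded or transcribed.

```python
import random


def sample_episodes(episodes_by_show, k=10, seed=42):
    """Pick up to k random episodes per show; the seed makes the pick repeatable."""
    rng = random.Random(seed)
    return {
        show: rng.sample(files, min(k, len(files)))
        for show, files in episodes_by_show.items()
    }


# Hypothetical file names, for illustration only.
episodes_by_show = {
    "Seinfeld": [f"seinfeld_{i:03d}.mp3" for i in range(1, 87)],
    "Lost": [f"lost_{i:03d}.mp3" for i in range(1, 8)],
}
sampled = sample_episodes(episodes_by_show)
print({show: len(files) for show, files in sampled.items()})
```

Capping the sample at `len(files)` keeps the sketch from failing on a show with fewer than 10 episodes available.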
Using the text files, I wrote a script to calculate the frequency of each term (ignoring stop words) and write the 1,000 most frequently spoken terms from each show to a database. In the visualization, the tag clouds are made up of the most frequently spoken terms that are unique to that show. That is, each term does not appear in the 1,000 most frequent terms of any other show.
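Here is a rough Python sketch of that counting and filtering logic. It is a simplified stand-in rather than the original script: the stop word list is a tiny illustrative one, the two transcripts are invented one-liners, and the names `top_terms` and `unique_terms` are assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "and", "of", "to", "is", "in", "it", "that", "i", "you"}


def top_terms(text, n=1000):
    """Return the n most frequent non-stop-word terms in a transcript."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [term for term, _ in counts.most_common(n)]


def unique_terms(top_by_show):
    """Keep only the terms that are in no other show's top list."""
    result = {}
    for show, terms in top_by_show.items():
        others = set()
        for other_show, other_terms in top_by_show.items():
            if other_show != show:
                others.update(other_terms)
        result[show] = [t for t in terms if t not in others]
    return result


# Invented one-line "transcripts", for illustration only.
transcripts = {
    "Game of Thrones": "the dragons and the lords talk to the wildlings in the episode",
    "Saturday Night Live": "the sketch and the monologue is in the episode and the jokes",
}
top_by_show = {show: top_terms(text) for show, text in transcripts.items()}
unique = unique_terms(top_by_show)
print(unique)
```

In this toy input, "episode" shows up in both lists, so it is dropped from both clouds, which is exactly the effect the uniqueness filter is after.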
I used Jason Davies’s Word Cloud Javascript library to create a cool animation effect for laying out and rendering the tag clouds and incorporated that into the original graphs I constructed. These clouds are rendered when you click on a show.
I also thought it might be cool to look at how a podcast’s language evolves over time. Since I already had transcripts of about 90 Seinfeld Recap Podcasts, I decided to create an animation to show the evolution of the term usage throughout these recaps.
Starting with Episode 1 of the podcast, I show the 100 or so most frequently spoken terms. Then, every couple of seconds, I transition the tag cloud to display the change in the most frequently spoken terms. For example, the cloud for Episode 2 shows the most frequently spoken terms across Episodes 1 and 2 combined, and the cloud for Episode 80 shows the most frequently spoken terms across the first 80 episodes.
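The cumulative counting behind the time-lapse can be sketched as a running counter. This is a minimal illustration, assuming each episode has already been reduced to a list of terms; the function name `cumulative_top_terms` and the toy episodes are made up.

```python
from collections import Counter


def cumulative_top_terms(episode_term_lists, n=100):
    """For each episode, return the top-n terms over all episodes so far."""
    running = Counter()
    frames = []
    for terms in episode_term_lists:
        running.update(terms)  # add this episode's counts to the running total
        frames.append([term for term, _ in running.most_common(n)])
    return frames


# Toy data: two very short "episodes".
episodes = [
    ["jerry", "george", "jerry"],
    ["elaine", "george", "george", "george"],
]
frames = cumulative_top_terms(episodes, n=2)
print(frames)
```

Each entry in `frames` is one frame of the animation. Because the counts only ever accumulate, later frames change less and less from one episode to the next.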
The image below shows the tag cloud for Episode 1 through 9 from the Seinfeld Podcast.
You can explore all these visualizations here.
Term Frequency Observations
Using all 17 shows covered by Post Show Recaps, I originally started exploring tag clouds made up of the 100 most frequently spoken terms for each podcast. However, you end up with a ton of overlap between these terms. For example, 23 of the 100 most frequent terms appear in all 17 tag clouds.
The following 23 words appear in all 17 tag clouds:
- point
- sort
- being
- thought
- think
- other
- first
- probably
- kind
- great
- time
- little
- feel
- down
- pretty
- people
- episode
- show
- good
- talk
- getting
- doing
- mean
These 23 terms are not too surprising. Most are very common English language terms, and the others are terms you would expect to show up consistently in these podcasts, like episode or talk.
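Finding the terms shared by every cloud is a plain set intersection. A minimal sketch, assuming a mapping from show name to its list of top terms (the name `common_terms` and the toy term lists are hypothetical):

```python
def common_terms(top_by_show):
    """Return the terms that appear in every show's top-term list."""
    term_sets = [set(terms) for terms in top_by_show.values()]
    return set.intersection(*term_sets) if term_sets else set()


# Toy term lists, for illustration only.
toy_top_terms = {
    "Show A": ["episode", "talk", "dragons"],
    "Show B": ["episode", "talk", "sketch"],
    "Show C": ["talk", "episode", "island"],
}
shared = common_terms(toy_top_terms)
print(sorted(shared))
```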
To make the visualization and analysis more meaningful, I then changed the tag clouds to show the most frequently spoken terms that are unique to a given podcast. So each term displayed in the clouds is amongst the 1,000 most frequent terms for that particular podcast but does not appear in the 1,000 most frequent terms of any other podcast.
Without this change, each show has on average only 10-12 unique terms in its 100 most frequently spoken terms. This change makes the visualization far more impactful. Each cloud becomes a true representation of the main concepts from that show. For example, consider the tag cloud below. It’s pretty easy to see that this is from Game of Thrones. For what other show would you talk about dragons, lords, and wildlings so prominently?
Turning our attention to the Seinfeld Podcast time-lapse visualization, I did not restrict these clouds to purely unique terms. Instead, I show the 100 most frequently spoken terms.
The clouds start to stabilize with time. For example, the only difference between the last two tag clouds in the animation is that money appears in the final cloud while story appears in the one prior.
Fifty-seven terms are in every Seinfeld Podcast tag cloud starting at Episode 1 through to Episode 86.
Individual Tag Clouds
The individual terms for each show are really interesting. One trend you’ll notice is that the main characters of a show are often amongst the unique terms. Also, central concepts that are unique to a show, like transgender or girlfriend for Orange is the New Black or vampires for The Strain, help define what those shows are about.
Current TV
Past TV
Computing the Similarity Between Podcasts
Finally, I wanted to see if I could answer the following questions:
- How similar are the audio transcripts from all of these podcasts?
- What podcasts are most similar to each other?
- Does the same host influence this similarity or does it have more to do with the show that is being discussed?
To tackle these questions, I needed a way to compare the text from the audio transcripts. Luckily, thanks to the invention of web search, there are tons of research and techniques for comparing documents.
I applied a technique from information retrieval called term frequency-inverse document frequency, or TF-IDF, to convert the audio transcripts for each show into a representation that would allow me to compute similarities between the podcasts based on the language used to discuss each show.
Essentially, TF-IDF turns each document into a vector, where each entry in the vector corresponds to a term and is weighted by how often that term appears in the document and how rare it is across all the documents. Then, to compute the similarity between two documents, we just need to determine how similar the two vectors are. Imagining these two vectors as lines in space, the similarity can be interpreted as the angle between them. Two identical vectors have a zero degree angle between them, while two very different vectors point in very different directions.
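To make the vector idea concrete, here is a small from-scratch Python sketch of TF-IDF weighting and cosine similarity. It uses one common TF-IDF variant (relative term frequency times the log of the inverse document frequency), which may differ from the exact weighting used for the results in this post. The function names and the three toy "documents" are assumptions for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: name -> list of terms. Returns name -> {term: tf-idf weight}."""
    n_docs = len(docs)
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))  # count each term once per document
    vectors = {}
    for name, terms in docs.items():
        counts = Counter(terms)
        total = len(terms)
        vectors[name] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy documents, for illustration only.
docs = {
    "A": ["dragons", "lords", "episode"],
    "B": ["dragons", "wildlings", "episode"],
    "C": ["sketch", "jokes", "episode"],
}
vectors = tfidf_vectors(docs)
print(cosine_similarity(vectors["A"], vectors["B"]))
print(cosine_similarity(vectors["A"], vectors["C"]))
```

Note that "episode" appears in all three toy documents, so its weight is zero everywhere. That is how TF-IDF discounts words that every podcast uses and lets the distinctive terms drive the similarity score.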
Using this approach, I took each podcast and computed its similarity to all the other podcasts. The results from this experiment are below:
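Once pairwise similarity scores exist, picking the three most similar podcasts for each show is a sort. The sketch below assumes the scores are stored in a dictionary keyed by unordered pairs. The show names and scores are invented placeholders, not results from this analysis, and `most_similar` is a hypothetical helper.

```python
def most_similar(similarity, k=3):
    """similarity: {(a, b): score} for unordered pairs. Returns name -> top-k names."""
    names = sorted({name for pair in similarity for name in pair})

    def score(a, b):
        # Pairs are unordered, so look the score up in either direction.
        return similarity.get((a, b), similarity.get((b, a), 0.0))

    return {
        a: sorted((b for b in names if b != a), key=lambda b: score(a, b), reverse=True)[:k]
        for a in names
    }


# Invented placeholder scores, for illustration only.
toy_scores = {
    ("Show A", "Show B"): 0.9,
    ("Show A", "Show C"): 0.2,
    ("Show B", "Show C"): 0.5,
}
nearest = most_similar(toy_scores, k=1)
print(nearest)
```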
Results from a Similarity Test between Podcasts
Game of Thrones
Hosts: Rob Cesternino & Josh Wigler
Most Similar Podcasts: The Leftovers, Lost and The Walking Dead
The Walking Dead
Hosts: Rob Cesternino & Josh Wigler
Most Similar Podcasts: The Leftovers, Lost and Game of Thrones
Saturday Night Live
Hosts: Rob Cesternino and Rich Tackenberg
Most Similar Podcasts: The Leftovers, Lost and Orphan Black
Better Call Saul
Hosts: Rob Cesternino and Antonio Mazzaro
Most Similar Podcasts: The Leftovers, Orange is the New Black and Lost
Justified
Hosts: Josh Wigler and Antonio Mazzaro
Most Similar Podcasts: The Leftovers, Lost and Orange is the New Black
The Leftovers
Hosts: Josh Wigler and Antonio Mazzaro
Most Similar Podcasts: Lost, Orange is the New Black and The Strain
House of Cards
Hosts: Rob Cesternino and Zach Brooks
Most Similar Podcasts: Lost, The Leftovers and Orphan Black
Daredevil
Hosts: Josh Wigler and Kevin Mahadeo
Most Similar Podcasts: The Leftovers, Lost and Orphan Black
The Strain
Hosts: Josh Wigler and Antonio Mazzaro
Most Similar Podcasts: The Leftovers, Lost and Orange is the New Black
Once Upon a Time
Hosts: Mike Bloom and Curt Clarke
Most Similar Podcasts: Orphan Black, Lost and The Leftovers
Orphan Black
Hosts: Mike Bloom and Jessica Liese
Most Similar Podcasts: Lost, Orange is the New Black and The Leftovers
Orange is the New Black
Hosts: Taylor Cotter and Jessica Liese
Most Similar Podcasts: The Leftovers, Lost and Orphan Black
Boardwalk Empire
Hosts: Antonio Mazzaro and Jeremiah Panhorst
Most Similar Podcasts: The Leftovers, Orange is the New Black, and Lost
Sons of Anarchy
Hosts: Rob Cesternino and Josh Wigler
Most Similar Podcasts: Lost, The Leftovers, and Game of Thrones
Seinfeld
Hosts: Rob Cesternino and Akiva Wienerkur
Most Similar Podcasts: Daredevil, The Leftovers, and Orphan Black
Lost
Hosts: Josh Wigler and Mike Bloom
Most Similar Podcasts: The Leftovers, Orange is the New Black and Orphan Black
24
Hosts: Rob Cesternino and Josh Wigler
Most Similar Podcasts: Lost, Game of Thrones and Orange is the New Black
The overall variance between the computed similarity for all shows is pretty small, which makes sense given that a lot of the hosts overlap, they are talking about television shows and they are all podcasts.
What is interesting and also kind of crazy about the most similar shows is that The Leftovers and Lost each show up in 15 of these lists of most similar podcasts! The only other podcast that doesn’t have The Leftovers in its top three most similar is 24, and for Lost, Seinfeld is the only other podcast without it.
I’m not sure how to interpret this. I’ve never seen The Leftovers or listened to the podcast, so I have no insights with respect to that show. As for Lost, perhaps since it’s a very iconic episodic drama that set the stage for many of these shows, it stands to reason that all these shows would warrant similar language usage.
Another factor could be what I am calling “The Wigler Effect”. Josh Wigler hosts a total of 9 shows, more than anyone else. It could be that his talking points overlap and come to dominate the language used across most of these podcasts.
The most consistently dissimilar shows from all other podcasts are:
- Boardwalk Empire
- House of Cards
- Justified
- Seinfeld
- Saturday Night Live
I’m not familiar with Justified, but based on my knowledge of these other shows, I think this seems to make sense. There’s common language used to describe some of these shows that you just wouldn’t find anywhere else, like sketch, jokes and monologue from the SNL podcast.
Sentiment Analysis
Across the 10 sampled episodes, most of the shows came out with a mixture of some negative, some neutral and a little positive sentiment, but the overall average sentiment for almost every podcast, across all three segments (intro, middle and ending), was negative.
Below is a summary of results from the podcasts where something interesting actually happened.
Show | Intro | Middle | Ending |
---|---|---|---|
Justified | negative | negative | neutral |
The Leftovers | neutral | negative | negative |
Once Upon a Time | neutral | negative | negative |
24 | neutral | negative | negative |
Lost | neutral | positive | positive |
Seinfeld | negative | negative | negative |
Surprisingly, Lost is the only show with an overall positive sentiment. Josh loves this show, so maybe his praises carry forward throughout the show :-).
The other really interesting result is Seinfeld. As mentioned, most of the shows had some mixture of sentiment but the overall average was negative. What’s interesting about Seinfeld is that for every episode and segment I looked at, the result was definitively negative. Perhaps Akiva’s complaints about his hatred of chocolate (23 times and counting) and other foods carry the day on this podcast.
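This post doesn’t spell out how the sentiment scores were computed, so the following is only a generic sketch of one simple way to label the intro, middle and ending of a transcript. It assumes a word-list (lexicon) approach and an even three-way split. The tiny word lists, the function names and the sample sentence are all made up, and a real analysis would more likely use a trained sentiment model or a much larger lexicon.

```python
# Tiny made-up sentiment lexicons, for illustration only.
POSITIVE = {"great", "good", "love", "fun"}
NEGATIVE = {"hate", "bad", "boring", "worst"}


def label_segment(words):
    """Label a list of words by counting positive minus negative hits."""
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def segment_sentiment(words):
    """Split a transcript into thirds and label each part."""
    third = len(words) // 3
    return {
        "intro": label_segment(words[:third]),
        "middle": label_segment(words[third:2 * third]),
        "ending": label_segment(words[2 * third:]),
    }


# Made-up sample text, not taken from any transcript.
sample = "this is a great show but i hate chocolate and it was bad".split()
labels = segment_sentiment(sample)
print(labels)
```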
Final Remarks
Wow, that was a lot of work and I probably know more about the podcasts on Post Show Recaps than any human should! Hopefully you enjoyed this deep dive into those podcasts. If you notice anything interesting or have any questions, I’d love to hear from you on Twitter or in the comments. And if you haven’t looked at the visualization yet, here’s the link one more time.
I’d like to thank Rob Cesternino, Antonio Mazzaro, Josh Wigler, Kevin Mahadeo, Zach Brooks, Mike Bloom, Curt Clarke, Jessica Liese, Taylor Cotter, Rich Tackenberg, Jeremiah Panhorst, and Akiva Wienerkur for continuing to make awesome content that’s fun to listen to and analyze :-).