Digging for Questions: Audio to Text Analysis of The Seinfeld Podcast


It’s been almost three years, but I’m back again with another super nerdy blog post! I’ve posted a lot in the past about computer programs and stats projects I’ve written, usually related to programming competitions, ontologies, and CrossFit, but today, something a little different.

I have been a big fan of Rob Has a Podcast and PostShowRecaps almost since they first launched. One of my favorite podcasts to listen to is The Seinfeld Podcast on PostShowRecaps. On this weekly podcast, hosts Rob Cesternino and Akiva Wienerkur recap an episode of Seinfeld. They started at the beginning and are about halfway through.

One of the running gags of the podcast is that whenever some unanswered Seinfeld question or plot hole comes up, Rob and Akiva joke about how they will ask Jerry when they have him on the podcast. These include silly inconsistencies like how Jerry’s apartment is on the third floor in the early days of the show but later moves to the fifth floor.

I personally love stuff like this. I often pester my girlfriend with questions like: why doesn’t Dumbledore just keep the Philosopher’s Stone in his robes rather than building the wizard mousetrap obstacle course that 10-year-olds were able to circumvent at the end of the first book? And don’t even get me started on time turners!

The more I love something, the more time I seem to spend over-analyzing and tearing it apart :-).

Anyway, in Episode 91, The Couch, Akiva said he wished he had the list of all the hypothetical questions they have talked about asking Jerry throughout the history of the podcast. He said that if he saw someone starting the podcast at episode 1, he would tweet at them and have them write down all the questions.

I decided to solve this problem for Akiva, but I didn’t want to re-listen to all 91 episodes.

So how does a computer scientist solve this problem? By writing a program, of course. Actually, I wrote a couple of programs.

There are three sub-problems I needed to solve:

  1. I needed to get copies of all the podcast audio files.
  2. I needed to figure out a way to convert them from sound to text.
  3. I needed to analyze the text and pull out the questions.

To address the first issue, I wrote a script to parse the RSS feed for the podcast and download all the MP3 files to my computer.

For the second problem, I searched and found VoiceBase, a tool and API for converting audio files to text. It’s hardly perfect, but for my purposes, it more than sufficed.
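A minimal sketch of how a download script for the first step might look in Python, assuming a standard RSS 2.0 feed where each episode's MP3 is listed in an `<enclosure>` tag. The function names and the `episodes` output folder are hypothetical and are not taken from the original script.

```python
import urllib.request
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_mp3_urls(rss_xml):
    """Return the enclosure URL of every <item> in an RSS 2.0 feed (str or bytes)."""
    root = ET.fromstring(rss_xml)
    urls = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is not None and enclosure.get("url"):
            urls.append(enclosure.get("url"))
    return urls


def download_all(feed_url, out_dir="episodes"):
    """Fetch the feed and save each enclosure file into out_dir."""
    Path(out_dir).mkdir(exist_ok=True)
    with urllib.request.urlopen(feed_url) as response:
        rss_xml = response.read()
    for url in extract_mp3_urls(rss_xml):
        # Use the last path segment as the file name, dropping any query string.
        filename = url.split("/")[-1].split("?")[0]
        urllib.request.urlretrieve(url, str(Path(out_dir) / filename))
```

Calling `download_all` with the podcast's feed URL would then fill the folder with one audio file per episode, ready to be sent to a transcription service.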

Now the last part.

If I were actually attempting to pull the exact questions from the audio/text files where they pose a hypothetical question to Jerry, this would be a very difficult problem to solve. Or if we were talking about millions of files, we’d have to use more advanced natural language processing techniques like Latent Semantic Analysis. However, all I really need to do is come up with a way to shortcut the manual task of checking every episode and every piece of audio.

If you listen to or look at the conversations involving these Jerry questions, you’ll notice a few different patterns. The first obvious thing is that Rob or Akiva always says certain words or phrases, like “question”, “Jerry”, “first question”, “hang up”, etc. These are all hints about what is actually being talked about.

Using hints like this, I wrote another script to pull out chunks of text from the audio conversions that were likely candidates for discussed Jerry Seinfeld interview questions.

 
My algorithm was pretty simple. I took all the text files and looked for anywhere the word “question” appears. Using that location in the text as an anchor, I analyzed the text near the word and looked for other high value terms like “jerry”. If “question” and “jerry” appeared fairly close together, then I added that episode and chunk of text to my list of probable discussions involving interview questions for Jerry.
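The anchor-and-hint idea can be sketched roughly as follows. This is an illustration under assumptions, not the original script: the function name `find_candidates`, the character-based `window`, and its default size are all made up for the example.

```python
import re


def find_candidates(text, anchor="question", hints=("jerry",), window=200):
    """Return chunks of text where any hint appears within `window` characters of the anchor."""
    lowered = text.lower()
    chunks = []
    last_end = 0  # end of the last emitted chunk, used to skip overlapping anchors
    for match in re.finditer(re.escape(anchor), lowered):
        if match.start() < last_end:
            continue  # this anchor is already inside a chunk we kept
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        if any(hint in lowered[start:end] for hint in hints):
            chunks.append(text[start:end])
            last_end = end
    return chunks
```

Running it once per transcript with different `hints` tuples, for example `("jerry",)` and then `("hang up",)`, gives a short list of chunks to read by hand. A wider window catches more real discussions at the cost of more false positives.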

 

I repeated this process for other combinations of my hints, with the word “question” always being essential to the conversation chunk. Of course, you end up with some false positives, but the number of chunks left to check manually is reasonably small.

Armed with my smallish set of text chunks, I hand-checked them and pulled out the ones where an actual interview question was discussed. This yielded 20 interview questions for team Rob and Akiva to use when they have their Jerry Seinfeld interview on the podcast (never gonna happen).

The list of questions by episode (oldest to newest) is below. Some may not make total sense if you are not a huge Seinfeld fan. It’s also possible I missed some, but I think it’s pretty comprehensive.

 

The Opera – Posted May 2nd, 2015

  • In the episode The Opera, why are there four thousand plot holes?
  • Was Crazy Joe Devola a friend of Elaine’s therapist or did Elaine end up meeting him through some other guy?
  • Is Uncle Leo still married later on in the series?
  • Why does Jerry have a brother in season two episode three and it’s never mentioned again?
  • How come in the earlier episodes Jerry’s apartment was on the third floor but then it’s on the fifth floor later?

The Airport – Posted May 20th, 2015

  • In the episode The Airport, there’s a scene where there’s this criminal who wants the last copy of Time magazine. But George actually wants it because there is a blurb about him. But the guy also says that he is a fan of Time magazine in addition to being the person that’s on the cover of Time magazine. Was this something that ever came out as being too much during the writing of this episode? Shouldn’t he either be on the cover or be a big fan?

The Outing – Posted June 24th, 2015

  • In the episode The Outing, why did the reporter for N.Y.U. want to interview Jerry so badly if she had never met him and doesn’t really know anything about Jerry?

The Mango – Posted August 12th, 2015

  • Is there some sort of temporal anomaly in Jerry’s apartment? Does time in Jerry’s apartment move faster than it actually occurs in the real world?

The Glasses – Posted August 27th, 2015

  • In the episode The Glasses, there’s a scene where the girlfriend Amy is making out with cousin Jeffrey in a deleted scene. Is that canon or is that part of a dream sequence that never made it to air?

The Sniffing Accountant – Posted September 3rd, 2015

  • Did it bother you when Kramer started getting such large ovations from the audience during the show? Did it take away from the scene?

The Lip Reader – Posted September 16th, 2015

  • Is the actual reason Gwen broke up with George because she saw him pigging out at the U.S. Open on daytime T.V.?
  • In the episode The Lip Reader, you reference Monica Seles’s return, although she didn’t actually return to tennis until 1995. Why not use Steffi Graf?

The Stall – Posted October 28th, 2015

  • After Tony falls, how do George and Kramer get off the mountain?

The Pie – Posted November 18th, 2015

  • In the episode The Pie, why doesn’t Audrey eat the apple pie?

The Stand-In – Posted November 24th, 2015

  • In the episode The Stand-In, when does he take it out and what is he wearing? Is he wearing pants or perhaps a kilt or basketball shorts or sweatpants? Is he driving with it out?
  • Also, it is out, but is it up?

The Hamptons – Posted December 24th, 2015

  • We know you dated Carol Leifer, is it Lifer or Leifer?
  • Were you the guy who took it out with Carol Leifer?

The Opposite – Posted December 30th, 2015

  • In The Opposite, when Kramer spits out the coffee on Regis and Kathie Lee, is the coffee hot or cold? Why does he spit out the coffee? Is it just bad coffee or is it perhaps surprise bourbon?

The Couch – Posted January 27th, 2016

  • In the episode The Couch, was Poppy originally supposed to poop on the couch instead of pee?

I also did a couple of other fun things. I created a tag cloud of the 50 most frequently discussed terms. I discounted stop words. Apparently Rob and Akiva ask each other what they “think” a lot :-).
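The term counts behind a tag cloud like this can be sketched with a simple word counter. This is an assumed approach, not the original code: the `STOP_WORDS` set is a tiny illustrative sample (a real run would use a much longer list), and `top_terms` is a hypothetical name.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be far longer.
STOP_WORDS = {
    "a", "an", "and", "are", "but", "do", "for", "i", "in", "is", "it",
    "of", "on", "so", "that", "the", "to", "was", "we", "what", "you",
}


def top_terms(text, n=50, stop_words=STOP_WORDS):
    """Return the n most frequent words in text as (word, count) pairs, skipping stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if len(w) > 1 and w not in stop_words)
    return counts.most_common(n)
```

The resulting pairs can be passed to any tag cloud renderer, with the count controlling the font size of each word.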

 

I also looked at how frequently a few different topics come up. I generated the topics based on my knowledge of the podcast and Seinfeld. The list with frequencies is below:

  • E.S.P.N – 26 times
  • True Crime – 5 times
  • Phone Booth – 4 times
  • Voicemail – 45 times
  • Phone Message – 56 times
  • Dry Cleaning – 42 times
  • Coffee Shop – 112 times

Finally, I looked at how many times each of the four principal characters is mentioned (I combined mentions of the character’s name and the actor’s real name):
  • Jerry – 11,468 times
  • George – 8,893 times
  • Kramer – 5,359 times
  • Elaine – 4,208 times
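Counts like the topic and character tallies above can be sketched with one small helper that sums case-insensitive, whole-word matches over a list of aliases. The helper name `count_mentions` and the alias lists in the usage note are assumptions for illustration, so the exact terms matched in the original analysis may differ.

```python
import re


def count_mentions(text, aliases):
    """Count whole-word, case-insensitive occurrences of any alias in text."""
    if not aliases:
        return 0
    # Try longer aliases first so a full name is matched before a shorter alias.
    ordered = sorted(aliases, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(a) for a in ordered) + r")\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))
```

For a topic, the alias list is just the phrase itself, such as `["coffee shop"]`. For a character, it would combine the character and actor names, such as `["George", "Jason Alexander"]`.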

Well, that’s all I got for now. There are a lot of interesting things I could potentially do with this data set. For example, we could also pull out questions for Larry David, which come up from time to time. It might also be interesting to look at how the hosts’ language changes relative to the ranking they give an episode. Do they tend to use certain words or phrases when they really like an episode? Also, how has their language changed as the podcast has evolved? The shows have certainly gotten longer :-).
 
If you have ideas, let me know in the comments.

About the author

Sean Falconer

I write about programming, developer relations, technology, startup life, occasionally Survivor, and really anything that interests me.