Crowdsourcing a Behavioral Model for Survivor

A little over a year ago, I was contacted by Angie Caunce about participating in a data project related to Survivor. If you are a fan of RHAP, you’ll likely know who Angie is. She is the creator of The System, also known as the Angie Caunce Character Types, which I have written about before. I also built a website for this system, CaunceTypes, where you can explore every player of Survivor based on her character type system.

Angie created her system by re-watching seasons of Survivor and identifying different types of players and then grouping players based on the characteristics of that group. For example, the “True Grit” character type is typically a soldier/retired professional athlete/cop/fireman age 35+. You can see a full list here.

Starting in season 30, Worlds Apart, her character types have become a part of the RHAP pre-season podcast ramp-up. She joins Rob each season to work through the cast, assigning each player a character type based solely on their CBS bio. She then attempts to predict how that person will perform in the game based on the historical performance of that character type. A very tough job indeed.

The original development of her system was based on a qualitative study in which she re-watched episodes, defined character types, and grouped players into those types. It was the work of a single person, i.e. Angie. Although you could argue that Angie is a Survivor expert, she comes to the game with her own biases and limitations that could impact the study. Further, trying to make predictions based only on the bios and the historical performance of the group comes with its own limitations and problems. Angie is keenly aware of these issues and wanted to do more.

Angie had contacted me because she was gathering a group of RHAP fans who have an interest and background in statistics, data analysis, math, psychology, and computer science to help her come up with a new system for characterizing the players of Survivor.

I was onboard immediately.

Of course I was interested in combining my favorite TV show with my passion for data-related projects that few people will ever care about. I mean, look no further than my work on calculating the statistical probability of becoming a professional quidditch player after graduating from Hogwarts.

This Survivor project Angie had in mind is kind of a crazy thing to dedicate your time to. We all have jobs, some of us have kids, yet we wanted to combine our education and career experience with our passion for this silly TV show to somehow create a deep, data-driven analysis of different players, their styles, what makes them successful or unsuccessful, and a multitude of other facets of the game.

It’s ridiculous, ambitious, and also kind of awesome.

Fast-forward to today: we’ve been slowly gathering data and talking about ideas and ways of analyzing this show for over a year, and although we still have a long way to go, we have finally reached a point where we have some results to share. Today’s blog post is dedicated to sharing those results.

But first, some background about what this project is, how we gathered the data, and performed the analysis. If you don’t care about all this data nerd stuff, you can skip to the results by clicking here.

WARNING: If you haven’t seen every season of Survivor, there are spoilers in this post about winners and other players, so proceed at your own risk.


The Birth of a Behavioral Representation of Survivor Game Play

From the beginning, Angie really wanted to focus on developing a system based on actual in-game behaviors. If people exhibit similar behaviors, surely they are similar players? If certain behaviors led to success for one player, wouldn’t those same types of behaviors be successful for someone else?

These ideas and questions were our starting point.

Amanda Rabinowitz, a huge RHAP and Survivor fan, a psychologist, and now a Research Assistant Professor of Rehabilitation Medicine at Thomas Jefferson University, was a huge help in transforming some loose ideas into a real system for logging players’ in-game behaviors.

Our original list was quite large, and we spent considerable time narrowing down the set. We wanted our ultimate set to contain a parsimonious list of distinct behaviors that 1) are relevant to success in the game, and 2) can be observed and objectively coded from watching the episode. We also wanted a balance between the number of positive and negative traits.

We refined our list with multiple rounds of pilot testing, wherein a group of us watched an episode and coded the behaviors for each player. We examined how consistent our ratings were for each type of behavior, and when agreement was poor, we either tweaked that behavior or dropped it entirely.
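
We won’t go into the details of our agreement analysis here, but to give a flavor: one standard statistic for quantifying agreement between two raters’ yes/no codings is Cohen’s kappa. Here’s a minimal sketch with made-up ratings (this is an illustration of the idea, not our exact pilot-testing procedure):

```python
# A minimal sketch of checking inter-rater agreement on a single behavior.
# Cohen's kappa is one standard statistic for two raters' yes/no codings.
from sklearn.metrics import cohen_kappa_score

# Hypothetical codings: 1 = behavior observed, 0 = not observed,
# one entry per (player, episode) pair rated by both coders.
rater_a = [1, 0, 1, 1, 0, 0, 1, 0]
rater_b = [1, 0, 1, 0, 0, 0, 1, 1]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values near 1.0 indicate strong agreement
```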

We eventually settled on 22 micro-behaviors that a player may exhibit during the course of a single episode. See below for the full list.

  • Analytical
    Shows skill at understanding the numbers, applying strategy, can see a few moves ahead, excels at the mental challenges (puzzles, memory, etc.)
  • Diplomatic
    Shows ability to compromise, repair relationships, is polite and emotionally intelligent
  • Empathic
    Is caring of other players, worries about people’s feelings, nurtures others, is sensitive
  • Perceptive
    Shows ability to tell when people are lying to them, hiding something, strategizing against them, or working with each other
  • Charming
    Shows likable qualities, other players talk about enjoying their company, makes friends easily
  • Funny
    Makes people laugh and has funny confessionals
  • Flirtatious
    Uses sexual attraction to influence other players (not always successfully)
  • Leadership
    Shows leadership within the tribe or an alliance, takes initiative and has ideas, is highly competitive, shows admirable and heroic qualities
  • Hardworking
    Shows high energy around camp and at challenges, contributes and works hard, never sits down
  • Industrious
    Shows ability to problem-solve, find idols, eavesdrop, or build spy shacks
  • Deceptive
    Shows skill at successfully deceiving others, creates stories on the spot, can quickly deflect suspicion, is good at persuading players to follow their plan
  • Moral
    Uses moral judgments to justify behavior, boasts about loyalty or integrity, makes decisions based on faith or adherence to a certain code of conduct (religious, regional, cultural, e.g. God, Texas, military)
  • Abrasive
    Rubs people the wrong way, gets under people’s skin, is annoying, shows any behavior other players find hard to be around
  • Athletic
    Shows notable athleticism in challenges or around camp, such that others comment on it or it is highlighted in the edit
  • Weak
    Does not perform well in physical challenges. Not useful physically around camp.
  • Aggressive
    Shows cutthroat behavior, is antagonizing, plays hard, intimidates others, is bossy and argumentative
  • Temper
    Shows flare-ups of anger, shouts at others, is easily irritated, is insulting and abusive to others
  • Egotistical
    Brags, thinks a lot of themselves and their opinions, thinks people should agree with them and do what they say
  • Emotional
    Shows episodes of crying, feeling sorry for themselves, extreme homesickness, feeling picked on or left out, being insecure and anxious
  • Minion
    Shows little independent thought, does what they are told or simply what everyone else is doing, is simply a number for their alliance
  • Lazy
    Shows unwillingness to work around camp or try hard in challenges
  • Naive
    Shows tendency to be foolish, is manipulated by other players, believes people who are lying to them, makes silly game decisions

After a few updates and clarifications, we felt ready to start the real work. To collect the sets of behaviors, we would need to re-watch every season and record the behaviors of each player, episode by episode.

Crowdsourcing Survivor Behaviors

We are currently on season 35 of Survivor. With the average episode at 42 minutes and the average season at 14 episodes, that works out to roughly 20,000 minutes, about 333 hours or 14 full days of TV watching, to gather a single set of behaviors for every player across every season and every episode. Ideally, to reduce human error, we would want multiple sets of data for every season.

There’s no way our small data team could possibly do this. Enter crowdsourcing and the awesome RHAP community.

Angie put the word out about this data project we were working on and that we were looking for volunteers to help re-watch seasons of Survivor and code the episodes based on our set of behaviors. We created spreadsheet templates (see screenshot below) for every season to support this, and the volunteers started answering the call. Within a few weeks, completed spreadsheets started showing up in my inbox and even now, a year later, once every few weeks I get an email with someone’s submission.

We do not yet have codings for every season, and only 10 seasons have more than one submission, but at the time of this writing, we have 42 complete submissions covering 25 seasons.

Now, that may seem impressive or interesting or maybe just weird and insane, but what can we do with this data?

Enter machine learning.

Converting Behavioral Codings into Points in Space

My goal with these coded seasons was to apply machine learning and clustering to the data. To do that, I needed to convert this spreadsheet data into something usable for these types of algorithms.

What I needed were “feature vectors” that could act as the representation of each player. A vector is simply a quantity with a direction and a magnitude. A point in 2D space is a vector with two elements, the x position and the y position; the point defines the end point of the vector.

Now, a feature vector is just a special type of vector often used in machine learning and AI to represent an object. It is an n-dimensional vector of numerical features. For example, if we wanted to represent someone’s face, we could use features like the width and height of the face, the color of the eyes, the distance between the eyes, the color of the skin, etc. These features become the values for the dimensions of the vector. Then, to compare whether two faces are similar, we can compare how close the two vectors are in space. Two vectors that are very close in space have similar features, and hence represent similar faces.
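
As a toy illustration of that comparison (all numbers made up), we can measure similarity as the Euclidean distance between feature vectors:

```python
import numpy as np

# Hypothetical face feature vectors: [face width, face height, eye distance],
# with made-up values, purely to illustrate the idea.
face_a = np.array([14.2, 20.1, 6.3])
face_b = np.array([14.0, 19.8, 6.4])
face_c = np.array([11.5, 16.0, 5.1])

# Euclidean distance: smaller distance means more similar feature vectors.
print(np.linalg.norm(face_a - face_b))  # small -> similar faces
print(np.linalg.norm(face_a - face_c))  # larger -> less similar faces
```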

These are the basics of how many machine learning, information retrieval, and pattern recognition applications work.

In our case, we have spreadsheets where each tab is an episode, and within each tab each player is coded Yes or No for our set of behaviors. To turn this into something usable, we first treat these columns of Yes/No behaviors as vectors, where a Yes is a 1 and a No is a 0. Each behavior is a feature, and the combined set of behaviors is our feature vector, i.e. our point in space.

Below is an example transformation for Denise Stapley’s first episode in Survivor Philippines.
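
In code, that transformation is a simple mapping. Here’s a minimal sketch (the behaviors shown and the Yes/No values are illustrative, not the actual coding):

```python
# Sketch of turning one player's Yes/No episode coding into a 0/1 feature vector.
# Only 5 of the 22 behaviors are shown, and the values are made up.
BEHAVIORS = ["Analytical", "Diplomatic", "Empathic", "Perceptive", "Charming"]

episode_coding = {"Analytical": "Yes", "Diplomatic": "Yes", "Empathic": "No",
                  "Perceptive": "Yes", "Charming": "No"}

feature_vector = [1 if episode_coding[b] == "Yes" else 0 for b in BEHAVIORS]
print(feature_vector)  # [1, 1, 0, 1, 0]
```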

We have one of these feature vectors for each player from each episode they were on. For some players, this may be a full season of 14 or so episodes, so they will have 14 such vectors, while the unfortunate first boots of Survivor will only have a single episode vector.

Now what I needed to do was combine all the episode vectors of a player into a single feature vector that would encapsulate that player’s behaviors across the entire season. To do this, for each feature, I summed the number of times that feature was observed. You can see an example below.
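
Here’s a minimal sketch of that aggregation, assuming each episode vector is already in 0/1 form (three episodes and five behaviors, for brevity):

```python
import numpy as np

# Each row is one episode's 0/1 behavior vector for a single player
# (3 episodes x 5 behaviors here; the real data is n episodes x 22 behaviors).
episode_vectors = np.array([
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [0, 1, 0, 1, 1],
])

# Element-wise sum: how many times each behavior was observed across the season.
season_vector = episode_vectors.sum(axis=0)
print(season_vector)  # [2 2 0 3 1]
```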

We now have a single vector characterizing a player’s behaviors, but we have to normalize the magnitudes of our summed behaviors; otherwise, anyone who stays in the game longer will have unfairly large numbers in their vector compared to someone who went out early.

We could normalize by taking the sum and dividing by how many episodes that player appeared in. The problem is that someone who was only in a few episodes can end up with an artificially large behavioral representation compared to someone who was in the game for a long time.

For example, if you’re the first boot and your coding said you were naive, then normalizing your vector this way gives your naive feature a value of 1.0. However, a player who was in the game for 10 episodes and was recorded as naive 8 times would only have a 0.8 (8/10). The first boot’s sample size is so small that they end up with an exaggerated set of behaviors under this kind of normalization.

Well, statistics to the rescue.

We can use statistics to model the probability distribution of behaviors given the number of episodes each player took part in. This gives us a standard deviation and an expected behavioral probability for the codings. Someone who is in the game longer will have less variance in our expectations about their behaviors, while the behaviors of someone who goes out quickly have high variance. We can use this information to normalize the results and smooth our features, making comparisons between players more robust.
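
The exact formula is beyond the scope of this post, but here’s a sketch of one normalization in this spirit: treat each behavior count as a binomial proportion and standardize it against an expected rate using the binomial standard error (the expected rate p0 below is a made-up placeholder, not the project’s actual parameter):

```python
import numpy as np

# Sketch of one normalization consistent with the idea above: treat a behavior
# observed k times in n episodes as a binomial proportion, and standardize the
# observed rate against an expected probability p0 using the binomial standard
# error. With small n the standard error is large, which damps the exaggerated
# features of early boots; with large n it shrinks, sharpening the estimate.

def normalized_feature(k, n, p0=0.25):
    """z-like score of observing a behavior k times in n episodes,
    relative to a hypothetical expected rate p0."""
    p_hat = k / n
    std_err = np.sqrt(p0 * (1 - p0) / n)
    return (p_hat - p0) / std_err

print(normalized_feature(1, 1))   # first boot coded naive once: modest evidence
print(normalized_feature(8, 10))  # naive in 8 of 10 episodes: stronger evidence
```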

Analyzing the Data

The ultimate goal for this project is to have 4 or 5 complete codings for each season; then we can apply techniques like clustering to put players into different buckets based on the behaviors they exhibit in the game. This would allow us to design a new set of character types based on actual observed micro-behaviors that players exhibit in the game.

Now, we aren’t quite ready for that, as our data is still a bit thin, but there are still some cool things we can do.

Clustering Winners

One of the first experiments I ran was to take the seasons we have recorded data for and apply K-Means Clustering to group winners based on their similarity. I won’t go into detail here about how this algorithm works or how I ended up with 5 groups, but I’m happy to chat about it if you contact me on Twitter.
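
For the curious, the clustering step itself is only a few lines with scikit-learn. Here’s a minimal sketch with stand-in data; the real input is one normalized 22-dimensional behavior vector per winner:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in data: one behavior vector per winner (rows = winners,
# columns = the 22 behaviors). Random values purely for illustration;
# the real experiment uses the normalized vectors described above.
rng = np.random.default_rng(0)
winner_vectors = rng.random((25, 22))

# k=5 matches the five clusters described below; in practice k is chosen
# by inspecting the data (e.g. elbow method or silhouette scores).
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(winner_vectors)
print(labels)  # cluster assignment for each winner
```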

Below you can see all the winners grouped together based on their similarities. I’ve tried to explain which behaviors were particularly strong or weak for each cluster. This is by no means perfect as we still need more data to do this properly, but we are starting to get some interesting groupings.

Cluster 1

Winners: Earl Cole, J.T. Thomas, Boston Rob, Kim Spradlin, Denise Stapley

These winners are analytical, perceptive, diplomatic leaders who never lose their temper and are not egotistical, lazy, or naive.

Cluster 2

Winners: Brian Heidik, Amber Brkich (Mariano), Bob Crowley, Fabio, Sophie Clarke, John Cochran, Tyson Apostol, Natalie Anderson, Adam Klein

These winners are jacks of all trades. Their strongest behaviors are analysis, perception, and charm, but they exhibit most behaviors besides ego and naivety.

Cluster 3

Winners: Tina Wesson, Ethan Zohn, Vecepia Towery, Yul Kwon, Mike Holloway

These winners have a strong moral compass and are decently analytical, but they are not flirtatious or aggressive. They never lose their temper and have little ego. Quiet but still strategic.

Cluster 4

Winners: Richard Hatch, Tom Westman, Aras Baskauskas, Todd Herzog, Tony Vlachos

These winners are analytical, somewhat aggressive leaders. They are never weak and never minions.

Cluster 5

Winners: Sandra Diaz-Twine (Heroes vs Villains edition)

She has one of the thinnest behavioral vectors of any winner. She really could be an outlier, or we may just need more data.

Predicting Winners

In my next experiment, I wanted to see if we could train a model to predict the winner of a season.

I took the 10 seasons of data where we have multiple submissions and used them as a training set for a Naive Bayes Classifier. I tagged every player (about 170) from these 10 seasons as either a winner or a non-winner. Using this data, I trained the classifier to calculate the probability that a given input (i.e. a player) was a winner or a non-winner.

Next, I took the other 15 seasons of data that we had (not part of the training set) and used it as a testing set. I wrote a program to calculate the vector representation of a player after each episode and then ran that vector against my classifier to calculate the probability that the player will be the winner.

The hope was that after a certain episode, my classifier would figure out that the eventual winner of the season did indeed have the highest probability of winning the game. For example, if my model was working and I wanted to predict Season 14, Survivor Fiji, then after episode 1 I could take everyone’s vector representation and, for each player, ask my classifier: is this the winner? It’s unlikely to figure this out from episode 1 data alone, but what about after episode 2? Or episode 3?
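
Here’s a minimal sketch of what that train-and-query loop can look like, using scikit-learn’s Gaussian Naive Bayes on stand-in data (the arrays and class balance below are illustrative, and the exact Naive Bayes variant is an implementation detail):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Stand-in training data: one behavior vector per player from the ten
# multiply-coded seasons, labeled 1 for winners and 0 for everyone else.
rng = np.random.default_rng(1)
X_train = rng.random((170, 22))          # 170 players x 22 behaviors
y_train = np.zeros(170, dtype=int)
y_train[rng.choice(170, size=10, replace=False)] = 1  # ten winners

clf = GaussianNB().fit(X_train, y_train)

# Query: each test-season player's vector as of episode k; the season's
# pick is whoever has the highest probability for the "winner" class.
players_after_ep_k = rng.random((18, 22))
p_winner = clf.predict_proba(players_after_ep_k)[:, 1]
print(f"pick: player {p_winner.argmax()} (p = {p_winner.max():.3f})")
```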

It turns out that for most seasons, I was able to predict the winner by episode 8, about 57% of the way through the season, right around merge time. In some seasons it took a bit longer; in season 27 it found Tyson at episode 10. In others, like Survivor Fiji, my classifier figured out that Earl was going to win starting with episode 1! That’s pretty amazing.

Most of the seasons where my classifier was able to identify a winner early in the game were with dominant winners: Earl Cole, J.T., Kim, and Boston Rob.

There were certainly seasons where I failed to predict the winner: Africa, Gabon, Heroes vs Villains, and Millennials vs. Gen X.

It could be that we don’t have enough data yet, it could be that the submitted behaviors are inconsistent for that season, or perhaps that winner is indeed a unique snowflake and doesn’t look like any of the winners in my training set. As we gather more data, we should be able to figure this out.

Analyzing the Final 5 Players

In this experiment, I took the feature vectors for the final 5 players from each season and looked at the outliers for each behavior. That is, who had the largest representation for a given behavior and who had the smallest.

There are some pretty interesting results. I’ve included some of my favorites below; a sketch of how these outliers can be computed follows the list.

Analytical

Most: Stephen Fishbach
Least: Dan Lembo

Diplomatic

Most: Denise Stapley
Least: Big Tom

Perceptive

Most: Sophie Clarke
Least: Big Tom

Funny

Most: John Cochran
Least: Sash Lenahan

Empathic

Most: Kim Spradlin
Least: Ozzy Lusth

Flirtatious

Most: Amber Brkich (All Stars)
Least: Boston Rob (Redemption Island)

Deceptive

Most: Tony Vlachos
Least: Big Tom

Emotional

Most: Sugar Kiper
Least: Sash Lenahan

Lazy

Most: Clay Jordan
Least: Boston Rob (Redemption Island)
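
As promised above, finding these outliers is just a max/min scan over each behavior column. Here’s a minimal sketch with stand-in names and data:

```python
import numpy as np

# Stand-in data: one normalized behavior vector per final-five player
# (rows = players, columns = behaviors); names and values are made up.
BEHAVIORS = ["Analytical", "Diplomatic", "Perceptive"]  # the full list has 22
players = ["Player A", "Player B", "Player C", "Player D"]
rng = np.random.default_rng(2)
vectors = rng.random((len(players), len(BEHAVIORS)))

# For each behavior, report the players with the highest and lowest scores.
for j, behavior in enumerate(BEHAVIORS):
    column = vectors[:, j]
    print(f"{behavior}: most = {players[column.argmax()]}, "
          f"least = {players[column.argmin()]}")
```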
 

Future of the Project

This is really just the beginning of the project, and there’s a lot more data we need to gather. There’s also a lot more we can do with the data in terms of analysis. We could look at predicting whether a player will make the merge, apply clustering to all players to come up with a new set of character types, look at which behaviors winners exhibit, look at which behaviors lead to an early exit, and explore many more ideas.

We can also expand our feature representation beyond in-game recorded behaviors. We could factor in age, sex, geographic region, jobs, education, challenge wins, and many other things.

Incorporating these other features would allow us to do other types of analysis, like correlating behavior and sex. Does an aggressive female have a higher likelihood of being eliminated than an aggressive male contestant?

Hopefully, this is just the beginning.

Final Thoughts

This is one of the most ridiculous and ambitious projects I’ve been involved in, so thanks to Angie for bringing me in; it’s been a ton of fun. I hope we can continue to gather volunteers to participate, and I’d love to hear other ideas for analysis or questions you’d like to see answered.

I’d also like to address a couple of things.

I’m sure some people will ask: since the behaviors we are observing are subject to editing, and we aren’t seeing everything, doesn’t that impact the results?

Editing certainly impacts what we can record. We can only record what we see, so I’m sure there are behaviors that are lost on us as an audience. We may not have an Analytical coding for a first boot, even though that player may have tried to form an alliance that didn’t make air. Unfortunately, we can only record what is shown; that’s the best we can do. Regardless, with enough seasons and episodes, the consistent behaviors that lead to success should become apparent, and the similarity between players’ styles should still be calculable.

You may also wonder, how is this different from edgic?

I do not pretend to be an expert on edgic, but from what I do know, edgic directly codes a player based on their edit in order to predict a winner. Each player is scored on tone, visibility, and a personality-type rating. Editing perhaps impacts what we can record, but we are recording the small micro-behaviors a player performs; the tone and personality of the edit are not factors.

I believe we can do much richer analysis with the data we are collecting beyond predicting the winner of a season. I think edgic and our behavioral representations are separate and can co-exist.

Acknowledgements

I would really like to thank all the coding volunteers for taking the time to re-watch seasons and record players’ behaviors. This whole project would still just be an enthusiastic Facebook chat without you guys.

Also, I’d like to thank Angie Caunce, Tovy Paull, and Amanda Rabinowitz for their review of this post, helpful ideas and enthusiasm for this project.

About the author

Sean Falconer

I write about programming, developer relations, technology, startup life, occasionally Survivor, and really anything that interests me.