Don’t care about my post and just want the good stuff? Follow this link: CrossFit Data Explorer
My last blog post was quite easily my most popular post to date (although my Searchy-type problems post did receive a fair amount of attention). One former student of mine explained that my entry had “the most practical findings for average people. Most of your posts even I have trouble understanding … you need a PhD.”
Ok, so my blog isn’t always for the faint of heart :-). Ah well. However, this time I’m going mainstream again. Well, not quite mainstream, but as mainstream as CrossFit is :-).
Due in part to the level of interest my last post generated as well as my own curiosity, I spent some more time looking at the CrossFit Games data. I’ve had some requests to analyze the women’s data from the Canada Regional in the same way I analyzed the men’s data. I haven’t done that yet, but I think what I have done is pretty cool. And, more importantly, anyone can now interact and play with the data to discover their own interesting trends and results!
I decided to build a visual tool for interacting with the various data available on the CrossFit Games website. I started out building it as a standalone Java application, but that seemed soooo 1998. So, partway through development I abandoned the Java application in favor of a Web 2.0-style application, equipped with all the latest buzzword technologies (e.g., AJAX, jQuery, JSON).
I’ve made the application available online here, so feel free to play with it.
There are three visualizations available that let you compare athletes across different event modalities as well as athlete rankings across various events. The application should work with any URL that resolves to an overall results page (example page: Men’s Canadian Regional). I’ve pre-populated a drop-down with all the regional results, so you can select items from there or paste in an appropriate URL. The Men’s Canadian Regional data is loaded by default. The application may not work with Internet Explorer, so I suggest Firefox, Safari or Chrome.
In the Compare events tab, you can set the graph axes to show results from different events. For example, using the default data, we can set the x-axis to be the first event, the 6.7 KM run, and set the y-axis to be the second event, the snatch complex. Comparing athlete values across these two events allows you to explore an athlete’s strength versus endurance. If you look at the screenshot below, the tooltip shows that Garth Prouse was 1st in the run, but 30th in the snatch.
One interesting thing we can see in this view is how widely distributed the athletes are for various events. Consider the screenshot below where I plotted the overall placement versus the time for the double-under/burpee workout. Everyone is pretty closely clustered, but I’ve marked two distinct outliers. I’m guessing these two individuals must have struggled with double-under technique. Even though these two scores appear to be outliers, if you read my last post, there was little statistically significant difference between these athletes on this particular workout. In contrast, there is a lot of variance in the distribution for this workout when you inspect the women’s data (second screenshot below).
In the Compare athlete rankings tab, you can inspect an athlete’s ranking in each event as well as their overall ranking. For example, in the screenshot below I’ve selected only the top 6 athletes from the default data. We can see that three of these athletes (Erik, Nate, and Dan), for the most part, were pretty consistent across all events. On the other hand, DJ Wickham has an outlier on the run, while Garth and Michael have an outlier on the snatch complex event.
Finally, in the Event rank comparison tab, you can compare an athlete’s ranking in a specific event versus their overall placement. This lets you see how closely an athlete’s ranking in one event tracks how they did after completing all events. The screenshot below shows all athletes and their rank in the run versus their overall ranking. We see that Cam’s ranking went from 25th in the run to 34th overall, while Jason Fleming went from 10th to 50th and, in the other direction, DJ Wickham went from 38th to 6th overall.
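If you want a single number to go with that picture, one option is a rank correlation between an event’s placements and the overall placements. The sketch below is not part of the application; it is plain Python, and the function name `rank_correlation` is made up for illustration. Because the inputs are already ranks, computing Pearson’s correlation on them gives a Spearman-style rank correlation.

```python
from math import sqrt

def rank_correlation(event_ranks, overall_ranks):
    """Pearson correlation of two equal-length lists of ranks."""
    if len(event_ranks) != len(overall_ranks) or len(event_ranks) < 2:
        raise ValueError("need two equal-length lists with at least 2 ranks")
    n = len(event_ranks)
    mean_x = sum(event_ranks) / n
    mean_y = sum(overall_ranks) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(event_ranks, overall_ranks))
    var_x = sum((x - mean_x) ** 2 for x in event_ranks)
    var_y = sum((y - mean_y) ** 2 for y in overall_ranks)
    if var_x == 0 or var_y == 0:
        raise ValueError("ranks must not all be identical")
    return cov / sqrt(var_x * var_y)

# Invented ranks for four athletes: same order, then exactly reversed order.
same_order = rank_correlation([1, 2, 3, 4], [1, 2, 3, 4])
reversed_order = rank_correlation([1, 2, 3, 4], [4, 3, 2, 1])
```

A value near 1 would mean the event’s ranking closely follows the overall ranking, a value near 0 would mean the two are largely unrelated, and a negative value would mean athletes who did well in the event tended to place worse overall.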
For those with technical expertise or those just curious, here is how the application works. The client sends the URL for the overall results of a sectional or regional competition to the server-side code. The server requests whatever URL is provided, gets the HTML contents, and parses out the values of the overall results table using the PHP Simple HTML DOM Parser. I load this into a simple data structure (a hashtable of hashtables), which describes the athletes, the events, and all the various results. This information gets encoded as JSON and sent back to the client (front-end).
On the client, I convert the JSON text into a JavaScript object/associative array. Then, based on whatever tab is selected, the data is processed into a data series for each athlete. I use Flot to handle the rendering of the graphs. All the interactive behavior and some of the UI is built using jQuery.
The data is split into results for each event and overall placement. For each of these, the results are split into rankings and actual scores. For example, in the default dataset, athlete “Rogers, Dan” has both an overall rank of first and an overall placement score of 37. I do some simple things like recognize times, which for plotting purposes are converted into seconds.
Please post any questions, suggestions, or general comments. Please let me know if you discover anything interesting :-).
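The application itself does this step in PHP; what follows is only a rough Python sketch of the same idea using the standard library. The table layout, the sample rows, and the names `TableParser` and `parse_results` are assumptions for illustration, not the real CrossFit Games markup or my actual code. To keep it self-contained, it parses an inline HTML string instead of fetching a URL.

```python
import json
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def parse_results(html):
    """Build {"events": [...], "athletes": {name: {event: cell text}}}."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    events = header[1:]  # assumes the first column holds the athlete name
    athletes = {row[0]: dict(zip(events, row[1:])) for row in body}
    return {"events": events, "athletes": athletes}

# Invented sample table; real results pages will look different.
SAMPLE_HTML = """
<table>
  <tr><th>Athlete</th><th>Run</th><th>Snatch</th></tr>
  <tr><td>Athlete A</td><td>1 (27:15)</td><td>30 (155)</td></tr>
  <tr><td>Athlete B</td><td>2 (28:02)</td><td>4 (205)</td></tr>
</table>
"""

results = parse_results(SAMPLE_HTML)
payload = json.dumps(results)  # JSON text that would go back to the client
```

The nested dictionary plays the role of the hashtable of hashtables: the outer key is the athlete and the inner key is the event.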
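To make the "data series for each athlete" step concrete, here is a small sketch of it for the Compare events tab. It is written in Python rather than the JavaScript the application actually uses, and the payload shape, event keys, and function name are assumptions. Flot accepts series objects with a `label` and a `data` list of `[x, y]` points, so the sketch produces dictionaries in that shape.

```python
import json

# Hypothetical payload: athlete -> event -> rank.
# Garth's ranks are the ones mentioned in the post; the second athlete is invented.
PAYLOAD = json.dumps({
    "athletes": {
        "Prouse, Garth": {"run": 1, "snatch": 30},
        "Example, Athlete": {"run": 12},
    }
})

def compare_events_series(payload, x_event, y_event):
    """Return one Flot-style series per athlete who has both events."""
    data = json.loads(payload)
    series = []
    for name, ranks in data["athletes"].items():
        if x_event in ranks and y_event in ranks:
            point = [ranks[x_event], ranks[y_event]]
            series.append({"label": name, "data": [point]})
    return series

series = compare_events_series(PAYLOAD, "run", "snatch")
```

Athletes missing either event are skipped here, which is one simple way to handle incomplete results; the application may handle them differently.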
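The time conversion is simple enough to show in a few lines. This is a Python sketch under my own assumptions about the input (scores arrive as strings, and times look like `mm:ss` or `hh:mm:ss`); the function name `score_to_number` is hypothetical.

```python
def score_to_number(score):
    """Convert 'mm:ss' or 'hh:mm:ss' to seconds; parse anything else as a number."""
    text = score.strip()
    if ":" in text:
        seconds = 0
        for part in text.split(":"):
            # Each further field is worth 1/60 of the previous one.
            seconds = seconds * 60 + int(part)
        return seconds
    return float(text)

run_seconds = score_to_number("27:15")       # an invented run time
long_seconds = score_to_number("1:02:03")    # an invented hh:mm:ss time
placement_score = score_to_number("37")      # Dan Rogers' overall placement score
```

Once every result is a plain number, times, weights and placement scores can all be plotted on a numeric axis.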
Very cool! This would give event organizers an objective way to look at which events or types of events were worth having. Some events have almost no correlation to overall rankings. This of course doesn't mean the event was unnecessary; it may be necessary to test whether or not an athlete is strong, even if it doesn't correlate well with the overall rankings. If nothing else, it's very fun and very easy to play with the data. Thanks a bunch!
Hi Cole,
Thanks for the comment.
Yes, that is an interesting thing to consider. It might be interesting to actually look at all the regional games data and see what workouts created the largest fluctuation in results or what workouts had the least or greatest correlation with the final result. If I have some time, I might dig into that.
I think the more narrowly focused a particular workout is, the less its result distribution will correlate with the final result distribution. This is because with a single exercise, you will probably get specialists that will dominate. For example, if we took any real runner and put them into a run event in a CrossFit regional, they would destroy everyone. Same thing with a real Olympic lifter being put into a clean and jerk event. However, both of those specialists would most likely really struggle with Nancy.
I think having a mixture of specialist events with combination events is a good way to find the fittest athletes. A good runner has to be good enough at lifting to not be taken out completely by a lifting event, and vice versa for a good lifter.
One really important thing for event organizers to realize, though, is that these specialist events have to be balanced. If you only have a lifting specialist event, then you will unfairly bias your overall standings; the same goes for only having a run.
Hey Sean,
All the stuff you are doing is exceptional data organizing and I love the information that you're coming up with. I'm not going to pretend that I know what half of it means or how it works, but it seems pretty cool nonetheless.
Do you have a thought as to what a good scoring system would be? I don't know if you remember me, but I am an owner of CrossFit Taranis and we hosted the "Taranis Winter Challenge" last year, which I'm sure you remember. You had some good constructive criticism after our event last year and even offered some assistance (i.e., up-to-the-minute scoring and rankings).
We aren't totally sure on our format for this year yet, but we would like to have a more effective, fair and accurate scoring system that is also user-friendly and can provide up-to-the-minute rankings.
Any thoughts/advice would be greatly appreciated. Do you think you'll be back up for it in November to compete? It would be good to have you back.
Hi Reed,
Yes, I remember you and I remember the Winter Challenge very well, especially the first workout. My screw ups there cost me the competition. I beat myself up about this on a weekly basis :-).
It's interesting that you bring up the issue of scoring because I have been participating in a conversation about this on the Games website: http://games2010.crossfit.com/blog/2010/06/scoring-crossfit-competitions/
I think the key thing about scoring that I've got from both the article and the comments is that no scoring system is perfect; they all have flaws. I think regardless of the scoring system, good programming makes or breaks the competition. With good programming, it shouldn't really matter too much how you score the athletes; you'll still find the fittest and most rounded.
That being said, unless someone proposes a good alternative, I am in favor of the simple placement-based scoring system. It's easy to understand and with enough events, I think the results stabilize.
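For anyone who hasn't seen it spelled out, placement-based scoring just adds up each athlete's placement in every event, and the lowest total wins. Here is a minimal Python sketch of that idea; the athletes and ranks are invented, the function name is my own, and ties are left in whatever order the sort produces, which a real scoring sheet would need to handle explicitly.

```python
def overall_standings(event_ranks):
    """Sum each athlete's event placements and sort by lowest total."""
    totals = {name: sum(ranks) for name, ranks in event_ranks.items()}
    return sorted(totals.items(), key=lambda item: item[1])

# Invented placements across three events.
ranks = {
    "Athlete A": [1, 30, 5],   # wins one event, bombs another
    "Athlete B": [10, 4, 6],   # solid everywhere
    "Athlete C": [3, 3, 20],
}

standings = overall_standings(ranks)
```

In this made-up example the athlete who is solid everywhere finishes ahead of the one who won an event outright, which is the kind of balance the system is meant to reward.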
With regard to programming, both more events and balance between events are necessary. With only 3 or 4 events, it's easy for the results to be biased. For example, Emily Beers beat Alicia at sectionals, and I'm sure Emily is a phenomenal athlete, but she is not yet a complete athlete, as shown by the run and double-under workout at regionals.
In CrossFit, particularly in the Games, we talk about finding the fittest male/female alive, but it's more than just fitness; it's also completeness and balance. With this in mind, pure strength workouts have to be balanced by pure endurance workouts. If you only have one, then you favor the specialist.
Of course, to have more events, the Winter Challenge would have to be run over more than one day. If that's not possible, then careful programming will really be essential.
I think one direction we may see the Games go in the future is to have weight classes. I know this goes against some of the CrossFit philosophy, but I think, like old-school UFC, CrossFit athletes are evolving. When UFC started, there were no weight classes. Large fighters were usually slower and less skilled than smaller ones, so things could be balanced. However, as the sport evolved, the larger fighters got faster and more skilled. Now, a light heavyweight would most likely destroy any lightweight.
I think we're seeing a similar evolution with CrossFit athletes. Big guys are getting faster and they are super strong. Consider Paul "Kong" Smith: http://www.sicfit.com/blog/post/show/id/117-Paul-Kong-Smith-Signs-that-the-Game-has-Changed
This is only the beginning of the evolution. I'll probably write an expanded post about this, summarizing some of the discussion from the Games site and incorporating some more data analysis.
My offer still stands about helping out technically with an up-to-the-minute scoring and ranking system. Maybe send me an email and we can discuss this further (falconer dot sean at gmail dot com).
One thing we could experiment with is to take the data from last year's challenge, apply different scoring systems to it, and see how the results change.
I hope to come back in November, maybe drag some other people from California with me. It will depend on work and flight costs.
Please pass on my congratulations to Alicia and the Taranis affiliate team. I'll be cheering for them at the Games.
Hi Reed,
Just a quick follow-up. Kody King posted some scoring suggestions over on the games site and I think one of them is potentially interesting. It's a hybrid system that combines athlete ranks and standard deviation from the mean. Might be something to consider.
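I won't try to reproduce Kody's exact proposal here, but to give a feel for the "standard deviation from the mean" half of such a system, here is a rough Python sketch that turns raw event results into z-scores. Everything in it (the function name, the sample weights, the choice of population standard deviation) is my own assumption, not his method.

```python
from statistics import mean, pstdev

def z_scores(results, lower_is_better=False):
    """Map each athlete's raw result to standard deviations from the event mean."""
    values = list(results.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation of this event
    if sigma == 0:
        # Everyone scored the same, so nobody gains or loses ground.
        return {name: 0.0 for name in results}
    # Flip the sign for timed events so that a positive score is always good.
    sign = -1.0 if lower_is_better else 1.0
    return {name: sign * (value - mu) / sigma
            for name, value in results.items()}

# Invented lift results in pounds (higher is better).
lift = z_scores({"Athlete A": 100, "Athlete B": 200, "Athlete C": 300})

# Invented run times in seconds (lower is better).
run = z_scores({"Athlete A": 1500, "Athlete B": 1800}, lower_is_better=True)
```

Unlike pure placement scoring, this rewards the size of a win: finishing far ahead of the field earns more than edging out second place by a second.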
http://games2010.crossfit.com/blog/2010/06/scoring-crossfit-competitions/
Thanks Sean,
I like the input and I especially like the insights into programming. Programming and scheduling are definitely the most challenging aspects of a CrossFit competition. It is extremely hard to find balance without programming biases when the event is held over a short time period.
I am bouncing around a couple of ideas along those lines right now and I hope that I can come up with something fair.
I will send you an email regarding the scoring and rankings. Unfortunately, I spent so much time last year on the scheduling and planning of WODs that my data collection and cataloguing of last year's challenge were... poor. I am hoping to do better this year, which is why I would like to work with you on this.