Monthly Archives: May 2014



A number of my friends are spending a chunk of their summer in Europe. Now I’ve never seen conclusive evidence that Europe exists, so all I know for certain is that a number of my friends need some time away from me. It’s kind of like when your parents would spell a word to each other, or when you spell out W-A-L-K in front of your dog. They’re in “Europe”. So how am I supposed to know they’re not gone forever?

Assuming they intend on coming back, the main thing that could prevent them is kidnapping. Kidnapping statistics are hard to come by for adults, so these numbers are about half a decade outdated and don’t include all the countries my friends are visiting, but you’re reading ShittyData and the sky is blue. Numbers are per 100,000 citizens.


It just so happens that the two travelers I’m most concerned for are both female, and females are kidnapped in foreign countries significantly more frequently than males: 65% of kidnappings are of females. Adjusting for gender changes each country’s kidnapping rate.


Divide by 100,000 to find the chance of being kidnapped per person by country. Graphed because decimals are hard and I didn’t line them up well.
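For anyone who wants the arithmetic spelled out, here is a minimal sketch of the two steps. The 65% figure is from above; the 50/50 population split and the example rate are my assumptions for illustration, not numbers from the charts, and the exact adjustment used for the charts may differ.

```python
# Hypothetical sketch of the gender adjustment and the per-person conversion.
FEMALE_SHARE_OF_KIDNAPPINGS = 0.65  # figure quoted in the post
FEMALE_SHARE_OF_POPULATION = 0.50   # assumption: an even population split


def female_rate_per_100k(overall_rate_per_100k):
    """Scale an overall rate to a female-only rate under the assumed split."""
    return (overall_rate_per_100k
            * FEMALE_SHARE_OF_KIDNAPPINGS
            / FEMALE_SHARE_OF_POPULATION)


def chance_per_person(rate_per_100k):
    """Convert a rate per 100,000 people into a per-person probability."""
    return rate_per_100k / 100_000


example_rate = 2.0  # made-up rate per 100,000, not from any country in the post
adjusted = female_rate_per_100k(example_rate)
print(adjusted, chance_per_person(adjusted))
```

Under these assumptions the adjustment is just a 1.3x multiplier on every country's rate, so it changes the size of each bar but not the ordering.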



But what do these numbers mean? They look small, but are they? We need to contextualize this using the odds of being dealt a Straight Flush in poker, winning $1,000,000 and $10,000 in the Powerball lottery, and being in a car crash before they get back.


One possible conclusion is that I am better off betting the field that none of these things will happen, but I should probably still drive carefully (1 in 10,000!).

Want my data? God help your soul

“I finished an Ironman”


Once all of the mountains have been climbed, the jungles mapped, and the oceans crossed there’s not a lot for modern man to conquer. To compensate, we create ridiculous challenges for ourselves that we pretend are a test of the human spirit. Take the marathon: tens of thousands of people each year pay enormous fees to run a distance that was once run by a person who died after completing it. Or, if that doesn’t satisfy your deep dissatisfaction with your almost meaningless existence, you can pay even more to run through mud and wait in line behind college besties getting their new profile pic for an “obstacle” that probably involves using a basic skill that is considered significant only because so many have let things like “jumping” and “crawling” atrophy into the warm, painless company of a couch and a television. And for that you get a t-shirt that proves you’re a Warrior. I have some personal experience with this.


For some even these aren’t enough and they move on to even more difficult races and challenges such as the Ironman Triathlon. A day-long race consisting of a 2.4-mile swim, a 112-mile bike, and a 26.2-mile run, an Ironman is just about the pinnacle of physical and mental endurance or stupidity, whichever you prefer. Fittingly, they take a long ass time to finish, and as I did during my marathon, I wonder if participants think of all the things they could be doing instead, probably somewhere around mile 60 of the bike. It’s simply opportunity cost, the value of the next best option you gave up.


For this particular exercise imagine Good X is how much of an Ironman you complete and Good Y will be how many Ironman movies you watch. The red lines are theoretical revealed preference and are almost unknowable, but c’mon, they’re good movies. Could you watch every Ironman movie in the time it takes to run an Ironman?

RunTri did an analysis of 41,000 finishers of 25 Ironman triathlons and found that the average finish time is 12:35:00, or 755 minutes. Below is a table of the recent Ironman trilogy.


In the time that it takes to run an Ironman on average you could watch this whole trilogy almost twice, and I don’t think the Ironman has above a 7.4 on IMDb. That might get a little monotonous, kind of like swimming for 2.4 miles, so let’s see how many Ironman-affiliated films you could fit in.
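Here is a rough sketch of the "almost twice" math. The 755 minutes is the RunTri average quoted above; the runtimes are commonly listed figures that I am assuming for illustration, so they may differ by a minute or two from the table.

```python
# Sketch: how many trilogy viewings fit into an average Ironman finish time.
AVERAGE_FINISH_MINUTES = 755  # 12:35:00, the RunTri average quoted in the post

# Assumed runtimes in minutes (commonly listed values, not copied from the table).
TRILOGY_RUNTIMES = {
    "Iron Man": 126,
    "Iron Man 2": 124,
    "Iron Man 3": 130,
}


def viewings_per_race(runtimes, race_minutes):
    """Return how many back-to-back viewings of the whole set fit in the race."""
    return race_minutes / sum(runtimes.values())


trilogy_minutes = sum(TRILOGY_RUNTIMES.values())
viewings = viewings_per_race(TRILOGY_RUNTIMES, AVERAGE_FINISH_MINUTES)
print(f"{trilogy_minutes} minutes per trilogy, {viewings:.2f} viewings per race")
```

Swapping a different set of films into the dictionary gives the same kind of answer for any marathon you would rather watch than run.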


That’s the critically acclaimed Avengers and three excellent animated features, all coming in 5 minutes under the theoretical Ironman finishing time. Just enough time for piss breaks from all that Mountain Dew you’ve been drinking. Don’t you feel more fulfilled already?


NFL Arrests pt. 2


Arrests in the off-season of the NFL make for entertaining news. Aaron Hernandez comes to mind recently of course (how many more of them could be murderers!?). But as a sample of a larger population, do NFL players actually commit more crimes than a random sample would, or is our perception clouded by media over-coverage? For that and more I consult UT San Diego’s NFL Arrest Database and make my own dataset from it to do some analysis on NFL player crimes since 2000 (excluding the current off-season).

Here are some totals on crimes committed

DUIs account for nearly a full third of crimes committed by NFL players with assault/battery combining to 15% of all crimes. This is pretty consistent with FBI crime data from a 2012 report which found DUI to be the second leading cause for arrest and for assault/battery to be the most common violent crime. The most apparent difference between FBI crime data and this sample is theft, which is easily explained by wealth effects.

Time to divide this by team, sorted by most to least crimes

This gives an okay idea of what teams may have something within their culture that needs addressing, but it’s without a doubt misleading. I would be comfortable saying the St. Louis Rams are a “less criminally inclined” organization than the Minnesota Vikings, but when you try to compare Minnesota to Tennessee, or even a team farther down the list, these claims get murky. This is because not all crime is created equal. Take a look at pie graphs for the top four cities.


Denver has a lot of Assault and Domestic Violence. How can you compare that to Minnesota, where they drink and drive and sag their pants too low? You can’t, it’s imperfect, so those rankings should be taken with a grain of salt.

There are still a couple things I can do to approximate how much crime there would be if these players were a pure random sample of the population. First I calculated a 13-year average of the crime rates of the nearest metro area of each NFL team, ranked teams using this, and then compared that ranking to their original ranking for total crimes committed by players.


There’s a lot going on here, so I highlighted the important stuff. The numbers highlighted in red are teams whose number of crimes committed by players ranks way higher than the city they are from. Green teams are teams who are ranked much lower relative to other teams given the crime rate of the city they are located in, with a gold star going to the St. Louis Rams.

Another way to look at this is to imagine the city’s crime rate applied to a sample the same size as a team. That means I need to scale down from crimes per 100,000 to crimes per 1,170, or 90 preseason players across 13 years. That calculation is simply (Crimes(City)*13/100,000)*(x/1170)
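A minimal sketch of one way to read that scaling, assuming the city figure is an annual crime rate per 100,000 residents. This is my interpretation rather than a transcription of the formula above (the x in it is not defined), and the example numbers are made up.

```python
# Sketch: scale a city's annual crime rate down to a team-sized sample.
ROSTER_SIZE = 90   # preseason players, from the post
YEARS = 13         # seasons covered, from the post


def expected_crimes(rate_per_100k, roster_size=ROSTER_SIZE, years=YEARS):
    """Crimes expected from a city sample the size of one team over the period."""
    player_years = roster_size * years  # 90 * 13 = 1,170
    return rate_per_100k * player_years / 100_000


def team_to_city_ratio(team_crimes, rate_per_100k):
    """Team crimes as a fraction of what an equal-sized city sample would commit."""
    return team_crimes / expected_crimes(rate_per_100k)


# Hypothetical inputs, not taken from the arrest database or any metro area.
example_city_rate = 4_000   # crimes per 100,000 residents per year
example_team_crimes = 28    # arrests for one team over the 13 years
print(expected_crimes(example_city_rate))
print(team_to_city_ratio(example_team_crimes, example_city_rate))
```

A ratio below 1 means the team committed fewer crimes than a same-sized slice of its metro area would be expected to.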


The highest ratio seen here is 1:0.6, meaning that, at most, an NFL team commits 60% as many crimes as a random sample of the same size from its metro area’s population. So while they may appear on the news more frequently, NFL players are significantly less criminal than the rest of the population. Refining this to take into account income bracket would be interesting though.


Some things to keep in mind: (1) Players don’t necessarily live in the cities they play football in, (2) so these crimes were not necessarily committed in the city a player plays in, and (3) this does not reflect how much time a player has spent in any one area. But that’s fine, because life is imperfect. Someday I’ll rebrand this blog as a monument to Wabi-sabi and sell out.

The bar graph, much more so than anything else in this analysis, predicates itself on the idea that more crime in any metro area will mean NFL players of that metro area will commit more crime. However, I found no evidence for this. There’s very little correlation between how much crime there is in a city and how much crime is committed by that city’s NFL team. Here’s what that looks like:
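For reference, "correlation" here presumably means the usual Pearson coefficient. A minimal sketch of that calculation follows; the function name and the sample numbers are hypothetical and are not the city or team figures from this post.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length, at least 2 long")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: this divides by zero if either sequence is constant.
    return covariance / math.sqrt(spread_x * spread_y)


# Made-up city crime rates and team arrest counts, for illustration only.
city_rates = [3200, 4100, 2800, 5000, 3900]
team_arrests = [22, 18, 30, 21, 25]
print(pearson_r(city_rates, team_arrests))
```

Values near 0 mean knowing the city's crime rate tells you very little about the team's arrest count, which is the situation described above.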

This means that the two parts of this analysis that hold the most weight are the chart that shows total crimes committed by teams and the chart showing the difference. It’s clear that in the last decade there have been significant issues in Minnesota, Cincinnati, Denver, and San Diego NFL programs in either creating enough accountability for players, or drafting and signing irresponsibly. They are, of course, not the only culprits. A+ joke right there.

Oh, you want more? “Vikings” is okay, but “Bengals”? More like the Cincinnati “Batteries” or the Denver “Irreconcilable Differences”. Mile High Stadium should be renamed “Divorce Court”, ha HA got em.

Want my data? God help your soul

Musical Industrial Complex


I have paid to see Bob Dylan twice and I have left Bob Dylan concerts early twice. If Bob Dylan is in a city near me again I will buy another ticket and I will again leave early. He really should stop touring, but he hasn’t, and he won’t, and I’m glad. Two options for why he will not stop touring: (1) he has nothing better to do, or (2) he’s getting paid $225,000 on average for each performance. There are a lot of things I would do on stage in every metropolitan area if you paid me $225,000 each time.

I got that number, and many more, from a leaked list from Degy Entertainment that shows the price for booking an artist. Degy Entertainment acts as a middle man between artist managers and booking agents for venues. They are the definition of a middle man. Anyway, from their information I made a dataset that contains all artists that command at least $100,000 to perform. From there I divide by genre (leaning towards how the artist is marketed) to examine the relative popularity of popular musical genres as live acts.

Here’s all of my data in an unwieldy bar graph

Grouping my data by genre is interesting because genres are super subjective and are blended really closely in popular music. If I were to only take song structure then almost every genre that is popular enough to command $100,000+ per performance would be the exact same genre. Life’s a rant.

Summary by genre (Total and Average are in 1,000s)

The problem is that any one of these columns does not convey the whole story. The large variance of sample size in the count of artists complicates both the total commanded and the average. The count itself does a decent job, but how can you know that a large number of artists is not just a bunch of artists hovering around the 100,000s, while a smaller sample could have a couple artists in the exclusive 1,000,000+ sphere?

Unfortunately I don’t have a test for this. But by using a test for correlation (.58) and doing a Student’s t-test (p value < .05) on count and average I know that a genre having more artists goes along with each artist being paid more, which is possibly the most interesting outcome. Having a greater supply should drive price down, but apparently we have a half-decent argument for Say’s law, which in intro Micro says “supply creates its own demand”. I’m not getting too excited though. Look what blog this is on and please lower your expectations accordingly: by cutting my sample off at $100,000 there’s plenty of stuff going on I don’t account for.
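One way to sanity-check a correlation like this is the standard t-statistic for a Pearson coefficient. The sketch below assumes n = 14, one data point per genre in the ranking, and a two-sided 5% critical value of about 2.179 for 12 degrees of freedom. Treat it as one possible reading of the test, not necessarily the one that was actually run.

```python
import math


def correlation_t_statistic(r, n):
    """t-statistic for testing whether a Pearson r from n pairs differs from zero."""
    if n < 3 or abs(r) >= 1:
        raise ValueError("need n >= 3 and -1 < r < 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


R_COUNT_VS_AVERAGE = 0.58   # correlation reported in the post
N_GENRES = 14               # assumption: one observation per genre
T_CRITICAL_DF12 = 2.179     # two-sided 5% critical value for 12 degrees of freedom

t_stat = correlation_t_statistic(R_COUNT_VS_AVERAGE, N_GENRES)
print(f"t = {t_stat:.2f}, significant at 5%: {abs(t_stat) > T_CRITICAL_DF12}")
```

With only 14 points the result clears the bar, but not by much, which fits the lowered expectations requested above.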

Anyway, here it is, as best I can determine, relative popularity of musical genres

For clarity’s sake
1. Pop
2. Singer/Songwriter
3. Rock
4. Singer
5. Alternative Rock
6. Country
7. Hip Hop
8. R&B
9. Indie
10. Electronic
11. Hispanic
12. Soft Rock
13. Entertainment
14. Reggae

This list could of course be expanded if I went below $100,000, but I’m happy with a 150ish row dataset when I’m the one making it from scratch. I really need to see about setting up a Shitty Data intern for work-study.

Want my data? God help your soul

990 Problems


[Have intern research a killer accountant joke]. That joke aside, accountants are great because they file 990 forms for organizations that are publicly available. 990s have a huge amount of information on the financial status of universities, for instance. However, “publicly available” is one of those things in life that can be technically correct but misleading. Recent 990s are a pain in the ass to access let alone read. Some organizations make it easy on you to find them, keeping them on their website, usually on a page dedicated to the Controller. Xavier University is one of those organizations.


Do you see what that says? Xavier has a bank account in the Cayman Islands.

Here are the top salaries at Xavier


And compared

Not all universities are as forthcoming with information to this level of detail. In fact, the only school in The Big East other than Xavier that provides detailed access beyond what is required by the IRS is Marquette. Using the salary of their (former) head basketball coach and the total compensation to employees of both universities we can compare how the two universities value their coaches, and by extrapolating, their basketball program.


Good thing Mack got a raise this week. Marquette has a lot more employees than Xavier, so either Mack is underpaid or Buzz was overpaid for programs with relatively similar recent success. To all of my future accountant readers, which I am going to generously estimate at 1.5, make these things available because this would have been hella cool if I had recent 990s for all Big East schools.


Want my data? God help your soul

Do Dads Love Santana?


What’s not to like about Carlos Santana? I know my dentist appointments would be WAY different without him. The hypothesis was recently brought to me that everyone’s dad likes his song ‘Smooth’ featuring teen heartthrob Rob Thomas, which does not sound false to me. It won a Grammy and is the only song to be in the top 5 of the Hot 100 Songs across two centuries (file under: statistics can be misleading). However, data on song listenership is nearly entirely controlled by the music branch of Nielsen Company, which has proprietary rights to information that goes into making the Billboard Hot 100.

To understand the connection between dads and ‘Smooth’ I’ll have to be a little more vague in my approach. I have Facebook demographic data (sort of) for 2014 that tells me that our dads account for about 12.7 million users or about 15.6% of all users of the site. I’ll take a random sample from the Santana Facebook fan page and see what percentage of that sample could be our dads. There’s more bias and sampling error here than I have time to list, so just bear with me.

Apparently I only have three Facebook friends that like his page. I need to reevaluate some things.


Apparently Facebook changed fan pages a few months ago so I can only see if my friends like a certain page. So instead of taking a random sample of fans, which already would have been #shittydata, I will now sample comments from the most recent posts by the page.

Of the 68 comments I “randomly” sampled, 17 of them could have been my dad, except that most of them are Hispanic. They are also mostly IN Spanish so they could be hate messages to Santana for all I know. That means that 25% of my sample fall in the ‘dad’ category, much higher than the percentage of dads on Facebook. This means that dads on Facebook are more likely than other demographics to like/contribute to the Santana fan page and if you’re a Santana fan you HAVE to be a ‘Smooth’ fan, though I’m sure these guys could recommend some tasty deep cuts.
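If you wanted to put a number on "much higher", a one-sample proportion z-test is one option. This is a sketch only: it treats the 68 comments as a true random sample, which they are not, and it was not part of the original analysis.

```python
import math


def proportion_z_score(successes, sample_size, baseline_share):
    """z-score for a sample proportion against a known baseline proportion."""
    sample_share = successes / sample_size
    standard_error = math.sqrt(baseline_share * (1 - baseline_share) / sample_size)
    return (sample_share - baseline_share) / standard_error


DAD_COMMENTS = 17            # comments that could have been a dad, from the post
TOTAL_COMMENTS = 68          # comments sampled, from the post
DAD_SHARE_OF_USERS = 0.156   # share of Facebook users who are dads, from the post

z = proportion_z_score(DAD_COMMENTS, TOTAL_COMMENTS, DAD_SHARE_OF_USERS)
# 1.96 is the usual two-sided 5% cutoff for a z-score.
print(f"sample share = {DAD_COMMENTS / TOTAL_COMMENTS:.0%}, z = {z:.2f}")
```

Anything past roughly 1.96 would count as significant at the usual 5% level, with all the caveats about how these comments were picked.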

Stay tuned in a few weeks when I start my internship at Nielsen solely to obtain information on Santana’s ‘Smooth’ listener demographics.

Pride in Nationality


I recently came across the graphic above that shows responses in the World Values Survey with respect to how much each nation takes pride in their nationality. I wanted to see if the respondents to the survey are simply a bunch of fair-weather fans, whose loyalty is determined by their nation’s success. One key determinant of success for nations is GDP, so I took single year 2013 GDP data for each country and checked the correlation between pride and GDP. Find that AND MORE after the jump.


Turns out there’s barely any correlation at all (.01) between pride and GDP. Even worse, if you calculate the p-value at a 95% confidence level, GDP is not even close to being a statistically significant predictor of how much pride citizens have in their nation.

Thinking on it some more, single year GDP does not make a lot of sense here. We, the United States, could have no variation in GDP since my generation was born and I’m sure there are people who wouldn’t feel any kind of way. So next I found the previous year GDP data for each country and calculated the percentage change between the year of the survey and the previous year, attempting to reflect whether the country is going through an economic boom or bust.
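The percentage change is the usual year-over-year calculation. A minimal sketch, with made-up GDP figures rather than the data behind the graphs:

```python
def percent_change(previous, current):
    """Year-over-year change as a percentage of the previous year's value."""
    if previous == 0:
        raise ValueError("previous-year value must be non-zero")
    return (current - previous) / previous * 100


# Hypothetical GDP figures in billions, keyed by a placeholder country name.
gdp_by_country = {
    "Country A": (1_000.0, 1_050.0),   # (previous year, survey year)
    "Country B": (500.0, 480.0),
}

for country, (previous, current) in gdp_by_country.items():
    print(country, f"{percent_change(previous, current):+.1f}%")
```

A positive number stands in for a boom and a negative one for a bust, and that column is what gets correlated against pride.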


The correlation here comes out slightly stronger (.04), but nothing to get excited about, down boy. Once again, the p-value for this one is well above .05, so change in GDP is not statistically significant either.

Here’s a graph of the first analysis

And the second

What did we learn here, gang? The world is complicated. People feel all kinds of ways about all kinds of things for all kinds of reasons and sometimes you’ve spent the better part of three hours trying to analyze why they feel that way when you would’ve been better off letting it alone. The world is big, we are small, and I’ll never tell where I hid the plane.

Want my data? God help your soul