Musical Industrial Complex


I have paid to see Bob Dylan twice and I have left Bob Dylan concerts early twice. If Bob Dylan is in a city near me again, I will buy another ticket and I will again leave early. He really should stop touring, but he hasn’t, and he won’t, and I’m glad. There are two possible reasons he will not stop touring: (1) he has nothing better to do, or (2) he’s getting paid $225,000 on average for each performance. There are a lot of things I would do on stage in every metropolitan area if you paid me $225,000 each time.

I got that number, and many more, from a leaked list from Degy Entertainment that shows the price for booking an artist. Degy Entertainment acts as a middle man between artist managers and booking agents for venues. They are the definition of a middle man. Anyway, from their information I made a dataset that contains all artists that command at least $100,000 to perform. From there I divide by genre (leaning towards how the artist is marketed) to examine the relative popularity of popular musical genres as live acts.

Here’s all of my data in an unwieldy bar graph

Grouping my data by genre is interesting because genres are super subjective and are blended really closely in popular music. If I went only by song structure, then almost every genre that is popular enough to command $100,000+ per performance would be the exact same genre. Life’s a rant.

Summary by genre (Total and Average are in 1,000s)

The problem is that any one of these columns does not convey the whole story. The large variance of sample size in the count of artists complicates both the total commanded and the average. The count itself does a decent job, but how can you know that a large number of artists is not just a bunch of artists hovering around the 100,000s, while a smaller sample could have a couple artists in the exclusive 1,000,000+ sphere?

Unfortunately I don’t have a test for this. But by running a correlation (.58) and a Student’s t-test (p-value < .05) on count and average, I find that genres with more artists also tend to pay each artist more, which is possibly the most interesting outcome. Having a greater supply should drive price down, but apparently we have a half-decent argument for Say’s law, which in intro Micro says “supply creates its own demand”. I’m not getting too excited though. Look what blog this is on and please lower your expectations accordingly: by cutting my sample off at $100,000 there’s plenty of stuff going on I don’t account for.
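A rough sketch of how a correlation like .58 can be checked with a t-test. It assumes the test asks whether the correlation differs from zero, and it assumes the sample is the 14 genres ranked in this post. Neither detail is spelled out above, and the function name is made up for illustration.

```python
import math

def t_stat_for_correlation(r: float, n: int) -> float:
    """t statistic for H0: the true correlation is zero (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1 - r * r))

R = 0.58             # correlation between artist count and average fee (from the post)
N_GENRES = 14        # ASSUMPTION: one observation per genre
T_CRIT_DF12 = 2.179  # two-sided 5% critical value for 12 degrees of freedom (t table)

t = t_stat_for_correlation(R, N_GENRES)
print(f"t = {t:.2f}, significant at the 5% level: {abs(t) > T_CRIT_DF12}")
```

Under those assumptions the t statistic clears the 5% cutoff, which lines up with the reported p-value. A significant correlation still only says the two columns move together, not that one drives the other.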

Anyway, here it is, as best I can determine, relative popularity of musical genres

For clarity’s sake
1. Pop
2. Singer/Songwriter
3. Rock
4. Singer
5. Alternative Rock
6. Country
7. Hip Hop
8. R&B
9. Indie
10. Electronic
11. Hispanic
12. Soft Rock
13. Entertainment
14. Reggae

This list could of course be expanded if I went below $100,000, but I’m happy with a 150ish row dataset when I’m the one making it from scratch. I really need to see about setting up a Shitty Data intern for work-study.

Want my data? God help your soul

990 Problems


[Have intern research a killer accountant joke]. That joke aside, accountants are great because they file 990 forms for organizations, and those forms are publicly available. 990s have a huge amount of information on the financial status of universities, for instance. However, “publicly available” is one of those things in life that can be technically correct but misleading. Recent 990s are a pain in the ass to access, let alone read. Some organizations make them easy to find, keeping them on their website, usually on a page dedicated to the Controller. Xavier University is one of those organizations.


Do you see what that says? Xavier has a bank account in the Cayman Islands.

Here are the top salaries at Xavier


And compared

Not all universities are as forthcoming with information to this level of detail. In fact, the only school in The Big East other than Xavier that provides detailed access beyond what is required by the IRS is Marquette. Using the salary of their (former) head basketball coach and the total compensation to employees of both universities, we can compare how the two universities value their coaches and, by extrapolation, their basketball programs.


Good thing Mack got a raise this week. Marquette has a lot more employees than Xavier, so either Mack is underpaid or Buzz was overpaid for programs with relatively similar recent success. To all of my future accountant readers, which I am going to generously estimate at 1.5: make these things available, because this would have been hella cool if I had recent 990s for all Big East schools.


Want my data? God help your soul

Do Dads Love Santana?


What’s not to like about Carlos Santana? I know my dentist appointments would be WAY different without him. The hypothesis was recently brought to me that everyone’s dad likes his song ‘Smooth’ featuring teen heartthrob Rob Thomas, which does not sound false to me. It won a Grammy and is the only song to be in the top 5 of the Hot 100 Songs across two centuries (file under: statistics can be misleading). However, data on song listenership is nearly entirely controlled by the music branch of Nielsen Company, which has proprietary rights to the information that goes into making the Billboard Hot 100.

To understand the connection between dads and ‘Smooth’ I’ll have to be a little more vague in my approach. I have Facebook demographic data (sort of) for 2014 that tells me that our dads account for about 12.7 million users, or about 15.6% of all users of the site. I’ll take a random sample from the Santana Facebook fan page and see what percentage of that sample could be our dads. There’s more bias and sampling error here than I have time to list, so just bear with me.

Apparently I only have three Facebook friends that like his page. I need to reevaluate some things.


Apparently Facebook changed fan pages a few months ago, so I can only see if my friends like a certain page. So instead of taking a random sample of fans, which already would have been #shittydata, I will now sample comments from the most recent posts by the page.

Of the 68 comments I “randomly” sampled, 17 of them could have been my dad, except that most of them are Hispanic. They are also mostly IN Spanish, so they could be hate messages to Santana for all I know. That means that 25% of my sample falls in the ‘dad’ category, much higher than the percentage of dads on Facebook. This suggests that dads on Facebook are more likely than other demographics to like/contribute to the Santana fan page, and if you’re a Santana fan you HAVE to be a ‘Smooth’ fan, though I’m sure these guys could recommend some tasty deep cuts.
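For anyone who wants to put a number on “much higher”, here is a minimal sketch of an exact one-sided binomial test using the figures above. It treats the 68 comments as independent draws with a 15.6% chance of being a dad, which this sample almost certainly is not, so read it as an illustration only. The function name is made up.

```python
from math import comb

def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) when X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

SAMPLE_SIZE = 68   # comments sampled (from the post)
DAD_COMMENTS = 17  # comments that could have been a dad (from the post)
DAD_SHARE = 0.156  # share of Facebook users who are dads (from the post)

observed_share = DAD_COMMENTS / SAMPLE_SIZE
p_value = binomial_upper_tail(DAD_COMMENTS, SAMPLE_SIZE, DAD_SHARE)
print(f"observed share = {observed_share:.0%}, one-sided p = {p_value:.3f}")
```

A small p-value here would only say that 17 out of 68 is unlikely if commenters looked like Facebook as a whole. It says nothing about whether the comments were sampled fairly.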

Stay tuned in a few weeks when I start my internship at Nielsen solely to obtain information on Santana’s ‘Smooth’ listener demographics.

Pride in Nationality


I recently came across the graphic above that shows responses in the World Value Survey with respect to how much each nation takes pride in their nationality. I wanted to see if the respondents to the survey are simply a bunch of fair-weather fans, whose loyalty is determined by their nation’s success. One key determinant of success for nations is GDP, so I took single-year 2013 GDP data for each country and checked the correlation between pride and GDP. Find that AND MORE after the jump.


Turns out there’s barely any correlation at all (.01) between pride and GDP. Even worse, if you test at a 95% confidence level, GDP is not even close to being a statistically significant predictor of how much pride citizens have in their nation.

Thinking on it some more, single-year GDP does not make a lot of sense here. We, the United States, could have no variation in GDP since my generation was born and I’m sure there are people who wouldn’t feel any kind of way. So next I found the previous year’s GDP data for each country and calculated the percentage change between the previous year and the year of the survey, attempting to reflect whether the country is going through an economic boom or bust.


The correlation here comes out slightly stronger (.04), but nothing to get excited about, down boy. Once again, the p-value for this one is well above .05, so change in GDP is not statistically significant either.
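The two steps in this second pass, percentage change and then correlation, can be sketched like so. The rows below are placeholder numbers, NOT the real survey or GDP figures, and the function names are made up for illustration.

```python
import math

def pct_change(previous: float, current: float) -> float:
    """Percentage change going from previous to current."""
    return (current - previous) / previous * 100

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# PLACEHOLDER rows: (GDP previous year, GDP survey year, share of proud respondents)
rows = [
    (100.0, 103.0, 0.56),
    (250.0, 245.0, 0.71),
    (80.0, 88.0, 0.40),
    (500.0, 510.0, 0.62),
]

growth = [pct_change(prev, curr) for prev, curr, _ in rows]
pride = [share for _, _, share in rows]
print(f"r = {pearson_r(growth, pride):.2f}")
```

Swapping the placeholder rows for one row per surveyed country would reproduce the kind of number quoted above.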

Here’s a graph of the first analysis

And the second

What did we learn here, gang? The world is complicated. People feel all kinds of ways about all kinds of things for all kinds of reasons and sometimes you’ve spent the better part of three hours trying to analyze why they feel that way when you would’ve been better off letting it alone. The world is big, we are small, and I’ll never tell where I hid the plane.

Want my data? God help your soul

Alcohol By Volume Variance


Here at Shitty Data we only tackle issues of the highest gravity… hmmm?




Using reports from the Alcohol and Tobacco Tax and Trade Bureau, which checks the label accuracy of alcohol for tax purposes, I compare the accuracy of reported alcohol in liquor, malt beverages, and wine. My data only covers 3 years, which sucks. The Bureau started in 2003, didn’t get its shit together on reporting until 2008, and until 2011 those reports were comically bad. Look at this thing, this report wouldn’t even go on its author’s mother’s fridge. You are a government agency, you work for the people, i.e. ME, you should be making it easier for me to do shitty data analysis than this. Come on.

Here are the basic numbers
“Over” means there is more alcohol than the label claims

I calculate accuracy, the percentage of samples with more alcohol than their label says, and the percentage of samples with less alcohol than their label says. Averaging these across the years creates the following table
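A minimal sketch of that arithmetic, assuming every tested sample lands in exactly one of three buckets (accurate, over, under). The counts are placeholders, NOT the Bureau’s numbers, and the function names are made up.

```python
def label_shares(accurate: int, over: int, under: int) -> dict[str, float]:
    """Percentage of samples that were accurate, over, or under their label."""
    total = accurate + over + under
    return {
        "accurate": accurate / total * 100,
        "over": over / total * 100,
        "under": under / total * 100,
    }

def average_shares(yearly: list[dict[str, float]]) -> dict[str, float]:
    """Simple (unweighted) average of each percentage across years."""
    return {key: sum(year[key] for year in yearly) / len(yearly) for key in yearly[0]}

# PLACEHOLDER counts for one beverage category over three years
yearly = [
    label_shares(accurate=80, over=15, under=5),
    label_shares(accurate=70, over=20, under=10),
    label_shares(accurate=90, over=5, under=5),
]
print(average_shares(yearly))
```

An unweighted average treats every year equally no matter how many samples were tested that year. Pooling the raw counts first would be the alternative.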

From here we can draw some conclusions

– Wine is by far the most accurate in its claimed alcohol by volume. Also, wine is as likely to overestimate ABV as it is to underestimate it, not the case for malt beverages and liquor at all
– Malt beverages are more likely to have more alcohol than their labels suggest. Good news for Colt 45 fans, because I can’t imagine anyone’s drinking that for the taste. Bad news for college freshmen skipping orientation to drink Smirnoff Ices: your crazy high tolerance that you built up sneaking Burnett’s into your friend’s parents’ basement may get put to the test
– Speaking of Burnett’s, liquor bottles might as well just be making up how much alcohol is in them. Really, I think there should be a system for liquor that just shows the position you’re likely to fall asleep in based on the drink. That’s not statistically accurate, but this shows that liquor labels most frequently underestimate, by a good portion, how many bad decisions you’re going to be making


Where is this inaccuracy coming from? I’m kicking around two main possibilities, though I can’t do shitty analysis on either, so who cares. One is that there is a significant tax increase above a certain ABV for liquor that does not exist for wine but may exist to a lesser degree for malt beverages. The other is that the chemical process of making liquor is itself less precise and harder to measure accurately than that of wine, with malt beverages falling between the two.

Want my data? God help your soul

NFL Arrests pt. 1

I started this blog with a promise to myself that I would not do data analysis on sports. I love sports and I love sports analysis, but there is just too goddamn much of it. If there’s any bubble in data analysis it is definitely in sports. But here I am. I found a pretty incredible database, which unfortunately did not offer its source file, but looked easy enough to parse. Using the NFL Arrests Database I attempt to determine which NFL teams are the most likely to have criminal players. I will then divide that question into more specific insights about types of crime and player positions. However, that’s hard as hell, so consider this a preview of a larger post to come.


To start off small I look at the Cleveland Browns, my team. They are terrible and I love them. Born and bred a Browns fan, I’ve never questioned why we were so bad. I think when I was younger I just assumed we did not go to the playoffs by choice. After a questionable draft, Browns fans have had two very Browns kicks in the stomach: (1) Josh Gordon, our star receiver, is suspended for testing positive for marijuana for the second time and (2) Nate Silver took away our ability to blame a Higher Power for our lack of success. It sucks to be me.

For just this post I will compare the crime statistics of the Browns to the city of Cleveland. There are two reasons I do not plan to do this for every team: it takes a really really long time and it is not a statistically sound comparison. A team is more able to get rid of a player who has committed a crime than a city is to get rid of a resident. Further, not all crimes included in this analysis come with jail time, and jail sentences may be shorter than the scope of this analysis (13 years). Are you picking up that this is a little more complicated than what I’ve done previously?

Here are totals for each category of crime

Now compare that to the city of Cleveland overall, calculating rates using respective population totals (90 preseason players) across 13 years, in categories that I can find reliable crime data on since 2000
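One way to put a 90-man roster and a whole city on the same scale is incidents per 1,000 person-years. This is a sketch of that idea, not necessarily the exact rate behind the table. The incident counts and the city population below are placeholders, NOT real data, and the function name is made up.

```python
def incidents_per_1000_person_years(incidents: int, population: int, years: int) -> float:
    """Incident rate scaled to 1,000 people observed for one year each."""
    return incidents / (population * years) * 1000

ROSTER_SIZE = 90  # preseason players (from the post)
YEARS = 13        # span of the analysis (from the post)

# PLACEHOLDER inputs, for illustration only
team_incidents = 10
city_incidents = 50_000
city_population = 400_000

team_rate = incidents_per_1000_person_years(team_incidents, ROSTER_SIZE, YEARS)
city_rate = incidents_per_1000_person_years(city_incidents, city_population, YEARS)
print(f"team: {team_rate:.1f}, city: {city_rate:.1f} per 1,000 person-years")
```

Note that the team side only ever has 90 × 13 = 1,170 person-years behind it, so a handful of arrests either way moves its rate a lot.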

This is a significant sample size so I am confident in drawing the conclusion that Browns players on average commit fewer crimes than residents of Cleveland. Even taking into account that many Browns players are residents there is enough distance between the two that the Cleveland Browns can be called in the last 13 years a statistically less criminal group than the entire population of Cleveland.


This may seem obvious given the income effect on reducing crime rates, but based on media coverage that acts as an echo chamber of limited sports news, conversations with extended family members, and people I grew up with in Small Town, Ohio, the NFL is filled with criminals. This shows, WITH NUMBERS, that that is wrong.

Want my data? Wait for ‘NFL Arrests pt. 2’ or bring my girlfriend back from Europe.

Pokemon Thoughts pt. 2

Some Pokemon


Look a lot


like humans


some of them are so anthropomorphised that they have even taken on thumbs. There are a lot of questionable morals that go into fighting Pokemon and they do not get any easier to swallow when the Pokemon basically looks like its trainer painted up for their high school’s homecoming game.

I assume these are electric Pokemon?

I decided to find an approximation of how many chromosomes away our subservient pocket monsters are from their trainers by creating my own dataset. I look at the incidence of opposable thumbs in Pokemon across type, further divided into dominance of type (whether it is their only type, the primary of their two types, or the secondary). I also look at numbers of each type and sub-type of Pokemon so that comparisons can be done using proportions, as well as creating a dataset that can be used for a very exciting future post. All data is obtained from Bulbapedia and determination of an opposable thumb comes from my own eye test that errs on the conservative side (thank you to KG and Connor for help).

Counts on the first 151 Pokemon, the only Pokemon that matter, are below


I use these numbers to compute basic percentages and then run correlations for all columns

There are some interesting takeaways from this information on the original Pokemon.
– Fighting Pokemon have far and away the most thumbs. 75% of them could easily hold and aim the gun their trainer keeps locked away, over 30 percentage points more than the next most human type of Pokemon. This is unsurprising considering the nature of their type
– Slightly more surprising is that 40% of all psychic-type Pokemon have thumbs, a sign of conflation of humanity and intelligence. Forget Team Rocket, Alakazam can open doors with its mind AND hands, get on it Ash!
– There’s a 7-way tie for the least human type. ‘Normal’ types are the most egregious in lacking thumbs because they have the largest sample size of these 7 types. Second is Flying type, which is unsurprising given the nature of their type.
– None of the correlations are particularly compelling, meaning I don’t see much reason to do more testing on those variables. There does seem to be a generally strong correlation between how many Pokemon are purely of one type and how many within that type have thumbs, but this is likely skewed by Fighting type Pokemon who are all pure Fighting type among the original 151.

So nothing especially interesting so far. This forces me to acknowledge the other generations of Pokemon like a dying bigoted father hesitantly welcoming his gay son back to the family in order to expand my dataset


Water handily overtakes Poison as the largest type by total number, but if you calculate percentage change Water actually increased the least, with Ghost becoming the type with the largest percentage change (excludes Dark and Fairy).

One of the most interesting things is how inconsistent Pokemon designers have been with their anthropomorphism. The first generation of Pokemon had far more thumbed Pokemon than any other generation and almost as many as all other generations combined. I do not know why the designers became so disenchanted with thumbs. All I know is that the change in thumb frequency is easily statistically significant.


Finally, the correlations as shown for the first 151 Pokemon

This is a larger dataset, so this should be more accurate. However, based on the previous table showing the inconsistency of Pokemon designers, I highly suspect that this sample of all Pokemon violates the assumption of homoscedasticity: the spread of the data is not constant across generations, so a random chunk of the same size drawn from somewhere else would likely not give me the same results, and this sample is not representative of a trend. For that reason take the rest of this with a grain of salt.

– The variable ‘Primary’, meaning the number of Pokemon whose primary of its two types is x, seems to be significant. It decreases the incidence of thumbs and increases the incidence of pure-type Pokemon. The number of primary thumbs also seems to decrease the incidence of secondary thumbs
– There is a similar trend among pure-type Pokemon and frequency of thumbs as was seen in the original sample, and the degree to which Fighting type Pokemon is an outlier has been significantly tempered, so I am cautiously optimistic

Shit’s getting a little too wordy and number-y for me so here’s a picture of a black labrador retriever contemplating Pokemon statistics

I want to explore these two trends more so I did some regression/ANOVA analysis on the two variables to see if they were statistically significant

Using a confidence level of 95% and isolating each variable by the outcome it most closely predicts (correlation closest to 1), I find that the number of primary-type Pokemon is not statistically significant, but the incidence of pure-type Pokemon with thumbs is. This means that the best variable I have to predict how anthropomorphically inclined a type is is how many Pokemon within that type only have that one type and also have an opposable thumb. Good enough for government work.

great pokemon

Want my data? God help your soul