Tuesday, September 25, 2012

All the Contestants on "The Bachelor" Are Either Completely Disingenuous or Utterly Irrational - And Here's Some Quantitative Proof


Hello, everyone! This week, I take a brief detour from my typical topics of interest to respond to a question posed by a fellow Busenbarkemetrics reader on one of the social networks through which this blog is broadcast. The question reads as follows: “What is the probability that I will find love on the Bachelor?” Obviously, this question was not derived from my own bank of questions, for two reasons: (1) I would be a contestant on the Bachelorette, not the Bachelor, and (2) I have literally no interest in finding love on this show. Either way, this detour will be kept to a minimum; the post will be brief, but it will answer the question at hand!

While I'm at it, not only will I answer this question, but I will also explain why all contestants who have appeared on the Bachelor are either completely disingenuous in their search for love or completely irrational in the methods they've chosen to find it. To substantiate this claim, I present a decision tree that graphically diagrams the potential outcomes of attempting to find love on the Bachelor(ette) and compare the probability of finding love on the show to that of another popular method of locating a soul mate.

First, let’s take a look at the decision tree posted below. As we can see, there are two potential outcomes that result in finding love and four that don’t. Moreover, the cumulative probability of finding love through one of the two successful outcomes is approximately 0.0055%, which makes the probability of not finding love approximately 99.9945%. This translates to a 1/18158 chance of finding love on the Bachelor from the beginning of the process to the end. That’s fairly low, if you ask me. But, as we all know, these figures mean absolutely nothing without some context. After all, maybe the Bachelor features the greatest probability of finding love of any method out there.

As it turns out, the probability of finding love on the Bachelor is, indeed, extraordinarily low. Comparing the figures from above to another popular method of finding love (Match.com), we notice that the probability of finding love on the Bachelor is substantially lower than that of finding love on Match.com. In fact, it’s lower by a factor of approximately 13.26. Where the probability of finding love on the Bachelor is 1/18158, the probability of finding love on Match.com is 1/1369 (0.073%), according to figures published by Match.com.
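For those who want to check my arithmetic, here's the comparison in Python, using the two figures quoted above:

```python
# Comparing the two probabilities from the post.
# The 1/18158 and 1/1369 figures are taken directly from the text above.
p_bachelor = 1 / 18158   # cumulative probability of finding love on the Bachelor
p_match = 1 / 1369       # Match.com figure, per Match.com's published numbers

print(f"Bachelor:  {p_bachelor:.4%}")          # ~0.0055%
print(f"Match.com: {p_match:.4%}")             # ~0.0730%
print(f"Ratio: {p_match / p_bachelor:.2f}x")   # ~13.26x
```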

So what does this all mean? To me, it means all contestants who have appeared on the Bachelor are either disingenuous or irrational. If they were rational, they wouldn’t seek love using a method with such a low success rate; instead, they’d spend the time it takes to apply and participate on registering for a Match.com account. Thus, the rational ones must have an ulterior motive for appearing on the show; that is, they are disingenuous. And those who are truly seeking love on the Bachelor are irrational: if they really desire love, they should yield to a much more successful avenue, such as Match.com.

For all of you who watch the show, you should spend your future viewing sessions trying to determine into which of the categories the contestants fall. Are they completely disingenuous or utterly irrational? You decide!

** In order to complete the above analysis, I defined “finding love” as locating an individual with whom one would spend the remainder of his/her life. That is, all of the figures are based on the number of individuals who were engaged to be married and remain together today. While the Match.com figures are based on those subscribers who are actually married, I made the assumption that Bachelor contestants who are engaged and remain together are effectively the same as those who are married. Further, I assumed the divorce rates among Match.com subscribers and Bachelor contestants are effectively the same, and thus, no incremental adjustment should be applied to the analysis. I also estimated a 25% probability of finding a mate after being cast on “Bachelor Pad.” This is probably a vast overestimation, but it’s meant to also include the possibility of meeting another castmate at ancillary functions. **

Tuesday, September 18, 2012

How Many Games Will the Indianapolis Colts Win this Season?: Using Historical Data to Predict Future Outcomes


How many games will the Indianapolis Colts win this year? Well, that’s certainly the question, isn’t it? Seven; with 66% confidence, it’ll be somewhere between 5 and 9. But let’s not talk about confidence intervals with those in the sports world (don’t get me started on the “analysts” who are, for whatever reason, accepted as the foremost authorities on sports; here’s looking at you, Trent Dilfer). I thought I’d provide the answer right off the bat, before getting into the popular methodologies used to predict future outcomes and the pitfalls of using past data to predict future behavior.

xkcd.com/904/
Before we proceed any further, I think I should take a brief second to caveat my prediction for the number of games the Colts will win this year. First, that number was, indeed, derived from a statistical algorithm I created (it wasn’t just thrown out to you willy-nilly). Second, don’t make any bets on that information. Below, I’ll discuss the difference between my methods and those used in Vegas. My calculations were performed more as an exercise in using historical data to develop predictions about future outcomes, which, needless to say, isn’t the best basis on which to place your bets.

While there are thousands of ways to create prediction models, algorithms, and formulas to predict future behavior, there are generally two universally accepted methods used by people in the actual industry of sports: (1) the Vegas method, and (2) the ESPN analysts’ method. My model uses neither of these because I’m an incredible genius and an innovator in the world of statistics (calm down, everyone. I’m just kidding. The degree of difficulty of mine falls somewhere between the two).

The Vegas Method:
The first, and most accurate, method for predicting very near-term games is the one used in Vegas to calculate the line for bets placed on the upcoming week’s games. The way they do this is by running regressions to determine the coefficients that most influence scores and wins. They then integrate these coefficients into models that use “shock” variables to interact with the previous coefficients. For those of you who are interested in statistics, they likely use a GLS model to estimate scores, a logistic model to estimate the effect of predictive variables on wins, and a Tobit model to integrate the two into a single complete predictive algorithm. The “shock” variables influence the model based on current events, like the fact that Darrelle Revis is injured or that nobody on the Bears currently likes Jay Cutler. Either way, they employ these models on a daily basis, keeping them dynamic and weighting current coefficients more heavily. All this information is then put into a final model, which is most likely a weighted OLS regression.
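To make the logistic step above concrete, here is a toy sketch in Python. This is purely illustrative; the actual Vegas models and coefficient values aren't public, so the function name, the "shock" term, and every coefficient below are made up for demonstration:

```python
import math

def win_probability(point_diff, injury_shock=0.0, b0=0.0, b1=0.15, b2=-0.5):
    """Toy logistic win probability from an expected point differential,
    plus a 'shock' term for current events (e.g., a key injury).
    Coefficients b0, b1, b2 are invented placeholders, not estimates."""
    z = b0 + b1 * point_diff + b2 * injury_shock
    return 1 / (1 + math.exp(-z))  # logistic (sigmoid) function

print(win_probability(3.0))        # team favored by 3, healthy: ~0.61
print(win_probability(3.0, 1.0))   # same team with a key injury: ~0.49
```

The point is just the shape of the model: a linear combination of predictors passed through a logistic function, so the output is always a probability between 0 and 1.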

The ESPN Analysts’ Method:
The second, and BY FAR least predictive, model is the one used over at big-box TV networks like ESPN. Effectively, the analysts assign subjective values to different variables and use the power of their brains to evaluate the teams and predict who will win. All the research (not surprisingly) shows this is ineffective.

So how did I determine how many games the Indianapolis Colts will win this season? It actually wasn’t that easy. After collecting data from the Colts and their opponents over each of the last three seasons, I estimated means and metrics of dispersion for the overall offense and defense for each of these teams. I chose three seasons because I think that represents a pretty robust picture of each team. Some teams are effectively the same, while others have transformed quite a bit.

After collecting and analyzing the initial data, I “simulated” each game of the year by creating a triangular distribution model that truncates at a minimum of zero points scored or allowed, and then randomizes the points scored and points allowed based on that distribution and its parameters (the parameters being the mean and the level of dispersion). I then took the difference between the points the Colts scored and the points they allowed to their opponent. If positive, the Colts “won” the simulated game; if negative, the Colts “lost.” Each result was then converted into a dummy variable of 1 or 0, indicating a win or a loss, respectively. I simulated each game 2000 times to control for any potential outliers or mistakes in the parameters.
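Here's a minimal sketch of that simulation step in Python. The structure follows the description above (truncated triangular draws, a dummy variable per simulated game, 2000 runs), but the team parameters passed in are hypothetical placeholders, not the figures I actually collected:

```python
import random

def simulate_game(colts_mean, colts_spread, opp_mean, opp_spread, n=2000):
    """Simulate one matchup n times and return the Colts' estimated win
    probability. Draws come from symmetric triangular distributions
    truncated below at zero, as described in the post."""
    wins = 0
    for _ in range(n):
        # Points scored and points allowed, each a truncated triangular draw
        colts = max(0.0, random.triangular(colts_mean - colts_spread,
                                           colts_mean + colts_spread,
                                           colts_mean))
        opp = max(0.0, random.triangular(opp_mean - opp_spread,
                                         opp_mean + opp_spread,
                                         opp_mean))
        # Dummy variable: 1 if the point differential is positive, else 0
        wins += 1 if colts - opp > 0 else 0
    return wins / n

# Hypothetical parameters for a single opponent (means and spreads invented):
print(simulate_game(21, 14, 24, 14))
```

Summing these per-game win probabilities across a 16-game schedule yields the expected season win total.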

After doing all this work, I determined the Colts will win precisely 7.0455 games, with a standard deviation of 1.8358 games, over the 2000 simulations.

At the end of the season, I’ll revisit this analysis to see how accurately my model predicted the actual number of games the Colts won. I’m hoping to engage in some friendly discussion about the pitfalls of using historical data to predict future outcomes. Maybe, however, I’ll be exactly right; after all, even if I drew the “winner” of each game out of a hat, giving me a 50% chance of picking the Colts each time, there would still be a 17.45% chance I’d have estimated 7 wins for the year, and a 73.43% chance I’d have estimated between 5 and 9 wins.
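Those hat-drawing figures can be verified with the binomial distribution (16 games, 50% chance of picking the Colts each time):

```python
from math import comb

n = 16   # regular-season games
p = 0.5  # coin-flip chance of drawing the Colts as the winner of each game

def binom_pmf(k):
    """Probability of exactly k wins out of n coin flips."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(f"P(exactly 7 wins) = {binom_pmf(7):.4f}")                             # 0.1746
print(f"P(5 to 9 wins)    = {sum(binom_pmf(k) for k in range(5, 10)):.4f}")  # 0.7343
```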

I’ll leave you with this, though. In general, using historic data to predict future outcomes, and making financial decisions relying on those outcomes, can be catastrophic. If I mortgaged my home to bet that the Colts will win 7 games, I probably wouldn’t be too happy come December. In fact, the entire banking crisis of 2008 likely wouldn’t have happened had quantitative modelers heeded my warning. The statistics guys in the big banks downtown inserted an assumption into their models that real estate values generally experience a logarithmic increase from year to year. Since their models relied on that assumption, their trading platforms kept purchasing CDOs and artificially escalating the value of the debt instruments that ultimately required the banks to be bailed out. After the fact, some professors went in, noticed this assumption, and changed it. After doing so, the algorithm priced the CDOs correctly to within less than a penny.

Be careful with historic data, my friends!

Tuesday, September 11, 2012

Expected Value and Risk-Aversion Follow Up

p.s. the previous post went through on the first try. I'm not surprised, though - read below to find out why!

Expected Value and Risk-Aversion

Hello, everyone, and welcome to the wonderful world of my mind blog!

Here you will find a collection of thoughts, perceptions, and world-mechanics as computed by the most powerful engine of them all - the brain. In this circumstance, it just so happens that the brain-in-question is mine, and thus, we begin Busenbarkemetrics.

"What is Busenbarkemetrics", you ask? Let me fill you in. Busenbarkemetrics encompasses perceptions of the world and potential decisions as the way I see them - a rational set of utility functions. It is also a curious discipline that wonders how estimates, projections, and uncertainty can be perceived pragmatically, through a rational set of quantitative metrics. As such, let me proceed with my first post:

As I'm beginning this new first-time blogging experience, I'm obviously using new software as a medium to do so. In the process, I'm sitting here wondering about something I've often found myself doing when I upload something via technology, and maybe you do, too. Often, I now realize, I naturally submit a "test" upload before taking the time to compose what I actually want to say. You know what I mean: a brief post that just says "testing" or something before you embark on the long journey of writing what you actually want to say. This way, if you realize there is some error in the software, you haven't just wasted your valuable time typing something you'll now have to re-input.

As I was just about to engage in this practice, I thought maybe I should actually test whether or not it is rational. Let's do some quick, fun math to find how often the software would have to fail in order to justify issuing a "test" post:

This entire post took me about 23 minutes to write (and yes, I came back at the end and updated the figure to reflect how long it actually took me). If I were to first compose a "test" post, let's just say my total time would have been about 24 minutes - 1 minute to write the test post, check that it worked, and start a new post, and then 23 minutes to write this one. Now, let's look at the expected value, which we'll solve backwards to get the probability of failure, to see what makes this rational. The model is:
                                                                 23x + 47(1 - x) = 24
                                                                          x = 0.95833
where 23 is the number of minutes it takes to write the post, x is the fraction of the time the software works correctly, 1 - x is the fraction of the time it doesn't, and 47 is how long it takes if I have to write the post twice (23 + 23, plus the extra minute for the test post).
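Solving the break-even equation 23x + 47(1 - x) = 24 is a one-liner; here's a quick check in Python:

```python
# Break-even calculation: at what reliability x is skipping the test post
# exactly as good (in expected minutes) as making one?
t_post = 23       # minutes to write the post once
t_retry = 47      # minutes if it fails and must be rewritten (incl. 1-min test)
t_with_test = 24  # minutes if a "test" post is made first

# 23x + 47(1 - x) = 24  =>  x = (47 - 24) / (47 - 23)
x = (t_retry - t_with_test) / (t_retry - t_post)
print(f"x = {x:.5f}")                            # 0.95833
print(f"break-even failure rate = {1 - x:.2%}")  # 4.17%
```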

The calculation of expected value above shows that, in order for it to be rational to post a "test" blog, the software has to fail more than 4.17% of the time. That is an EXTRAORDINARILY high percentage. If I had to estimate, I'd say the software fails maybe 0.75% of the time, or something close to it. There's no way the software fails over 4.17% of the time; so why do we do these test posts?

It's because people are generally more risk-averse than they are rational. Think about it. Why do you wear a seat belt on the airplane, when you know the probability of it crashing is infinitesimal (less than .00001%)? All that time strapping on the seat belt, being uncomfortable, and generally having less fun is surely worth more than the expected value of crashing. Even so, the probability of surviving a crash with the seat belt on is essentially zero. Still, we all wear the seat belt (some of us because we're forced to do so and not because we want to) because we're risk-averse people.

What's the moral of this incredibly long story? Next time you're about to do something just because conventional wisdom, authority, or your risk-aversion tells you to do so, think twice about what you're losing and the expected value!

Stay tuned in the weeks to come for some awesome quantitative modeling on how many games the Colts will win this year if historical data means anything, and for some insight as to why college football fans are mathematically and economically far superior to NFL fans....among much more!