Tuesday, September 18, 2012

How Many Games Will the Indianapolis Colts Win this Season?: Using Historical Data to Predict Future Outcomes


How many games will the Indianapolis Colts win this year? Well, that’s certainly the question, isn’t it? 7… with 66% confidence it’ll be somewhere between 5 and 9, but let’s not talk about confidence intervals for those in the sports world (don’t get me started on the “analysts” who are, for whatever reason, accepted as the foremost authorities on sports – here’s looking at you, Trent Dilfer). I thought I’d just provide the answer right off the bat, before getting into the popular methodologies used to predict future outcomes and the pitfalls of using past data to predict future behavior.

(Comic: xkcd.com/904/)

Before we proceed any further, I should take a brief second to caveat my prediction for the number of games the Colts will win this year. First, that number was, indeed, derived from a statistical algorithm I created (it wasn’t just thrown out to you willy-nilly). Second, don’t make any bets on that information. Below, I’ll discuss the difference between my methods and those used in Vegas. My calculations, however, were done more as an exercise in using historical data to develop predictions about future outcomes – and historical data, needless to say, isn’t the best basis on which to place your bets.

While there are thousands of ways to create prediction models, algorithms, and formulas to predict future behavior, there are two widely accepted methods that people in the actual industry of sports use: (1) the Vegas method, and (2) the ESPN analysts’ method. My model uses neither of these because I’m an incredible genius and an innovator in the world of statistics (calm down, everyone. I’m just kidding. The difficulty of my approach falls somewhere between the two).

The Vegas Method:
The first, and most accurate, method for predicting very near-term games is the one used in Vegas to calculate the line for bets placed on the upcoming week’s games. They do this by running regressions to estimate coefficients for the variables that most influence scores and winning games. They then integrate these coefficients into models that use “shock” variables to interact with the previous coefficients. For those of you who are interested in statistics, they likely use a GLS model to estimate scores, a logistic model to estimate the effect of predictor variables on wins, and a Tobit model to integrate the two into a single complete predictive algorithm. The “shock” variables influence the model based on current events – like the fact that Darrelle Revis is injured or that nobody at the Bears currently likes Jay Cutler. Either way, they run these models on a daily basis, keeping them dynamic and weighting current coefficients higher. All this information is then put into a final model, which is most likely a weighted OLS regression.

The ESPN Analysts’ Method:
The second, and BY FAR the least predictive, model is the one used over at big-box TV networks like ESPN. Effectively, they apply subjective judgment to the value of different variables and use the power of their brains to evaluate the teams and who will win. All the research (not surprisingly) shows this is ineffective.

So how did I determine how many games the Indianapolis Colts will win this season? It actually wasn’t that easy. After collecting data on the Colts and their opponents for each of the last three seasons, I estimated means and measures of dispersion for the overall offense and defense of each of these teams. I chose three seasons because I think that gives a pretty robust picture of each team. Some teams are effectively the same, while others have transformed quite a bit.

After collecting and analyzing the initial data, I “simulated” each game for the year by creating a Triangular Distribution model that truncates at a minimum of zero points scored or allowed, and then randomizes the points scored and points allowed based on that distribution and its parameters (the parameters being the mean and the level of dispersion). I then subtracted the points the Colts allowed to their opponent from the points the Colts scored. If the difference was positive, the Colts “won” the simulated game. If negative, the Colts “lost” the simulated game. These results were then converted into dummy variables coded as 1 or 0, indicating won or lost, respectively. I simulated each game 2000 times to smooth out any potential outliers or mistakes in the parameters.
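
For anyone who wants to tinker, here is a minimal Python sketch of the kind of simulation described above. It is not the actual model behind my number. The function names, the sample inputs, and the choice of a symmetric triangle whose half-width is the standard deviation times the square root of 6 are all assumptions made for illustration.

```python
import math
import random


def draw_points(mean, sd, rng):
    """Draw one score from a triangular distribution, floored at zero."""
    # Assumption: a symmetric triangle, for which sd = half_width / sqrt(6).
    half_width = sd * math.sqrt(6)
    low = max(0.0, mean - half_width)  # truncate: no negative point totals
    high = mean + half_width
    return rng.triangular(low, high, mean)  # the mode sits at the mean


def simulate_game(scored, allowed, n_sims=2000, seed=0):
    """Return the share of simulated games won.

    scored and allowed are (mean, sd) pairs for points scored and allowed.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_sims):
        margin = draw_points(*scored, rng) - draw_points(*allowed, rng)
        wins += 1 if margin > 0 else 0  # dummy variable: 1 = win, 0 = loss
    return wins / n_sims


# Hypothetical inputs, not the real Colts numbers.
print(simulate_game(scored=(21.0, 8.0), allowed=(24.0, 9.0)))
```

The share of wins it prints can be read as a rough win probability for that one game. Fixing the seed just makes a run repeatable.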

After doing all this work, I determined the Colts will win precisely 7.0455 games, with a standard deviation of 1.8358 games, over the 2000 simulations.
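
Taken at face value, 7.0455 plus or minus one standard deviation of 1.8358 runs from about 5.2 to 8.9 wins, which is where the “between 5 and 9” range comes from. The sketch below shows one way a season total and its spread could be tallied over 2000 simulated seasons. The flat 0.44 win probability for every game is a made-up placeholder, not an output of my model, and `simulate_seasons` is a hypothetical helper name.

```python
import random
import statistics


def simulate_seasons(win_probs, n_seasons=2000, seed=0):
    """Simulate whole seasons and return (mean wins, sample sd of wins)."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_seasons):
        # Each game is a 1/0 dummy: a win if the random draw falls below p.
        wins = sum(1 for p in win_probs if rng.random() < p)
        totals.append(wins)
    return statistics.mean(totals), statistics.stdev(totals)


# Placeholder: the same win probability for all 16 regular-season games.
win_probs = [0.44] * 16
mean_wins, sd_wins = simulate_seasons(win_probs)
print(f"mean wins: {mean_wins:.4f}, sd: {sd_wins:.4f}")
print(f"one-sd range: {mean_wins - sd_wins:.2f} to {mean_wins + sd_wins:.2f}")
```

With real per-game probabilities that differ from week to week, the spread would come out somewhat narrower than it does with one flat probability.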

At the end of the season, I’ll revisit this analysis to see how accurately my model predicted the actual number of games the Colts won. I’m hoping to engage in some friendly discussion about the pitfalls of using historical data to predict future outcomes. Maybe, however, I’ll be exactly right – after all, even if I drew the “winner” of each of the 16 games out of a hat, giving me a 50% chance to draw the Colts each time, there is still about a 17.46% chance I’d have estimated 7 wins for the year, and a 73.43% chance I’d have estimated between 5 and 9 wins.
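
The hat-drawing odds are plain binomial probabilities, and a few lines of Python should reproduce them up to rounding. This sketch assumes a 16-game regular season and a 50% chance per game; `binom_pmf` is just a helper name chosen here.

```python
from math import comb


def binom_pmf(k, n=16, p=0.5):
    """Probability of exactly k wins in n independent games."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


p_seven = binom_pmf(7)
p_five_to_nine = sum(binom_pmf(k) for k in range(5, 10))  # 5 through 9 wins

print(f"exactly 7 wins: {p_seven:.4f}")        # 0.1746
print(f"5 to 9 wins:    {p_five_to_nine:.4f}")  # 0.7343
```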

I’ll leave you with this, though. In general, using historical data to predict future outcomes, and then making financial decisions that rely on those predictions, can be catastrophic. If I mortgaged my home to bet the Colts will win 7 games, I probably wouldn’t be too happy come December. In fact, the entire banking crisis of 2008 likely wouldn’t have happened had quantitative modelers heeded my warning. Those statistics guys in the big banks downtown built an assumption into their models that real estate values generally experience a logarithmic increase from year to year. Since their models relied on that assumption, their trading platforms kept purchasing CDOs and artificially escalating the value of the debt instruments that ultimately required the banks to be bailed out. After the fact, some professors went in, noticed this assumption, and changed it. After they did so, the algorithm priced the CDOs correctly to within less than a penny.

Be careful with historical data, my friends!

1 comment:

  1. Nice piece! I liked the part where you mentioned nobody liking Jay Cutler :]
