How many games will the
Indianapolis Colts win this year? Well, that’s certainly the question, isn’t
it? 7… with 66% confidence it’ll be somewhere between 5 and 9, but let’s not
talk about confidence intervals for those in the sports world (don’t get me
started on the “analysts” who are, for whatever reason, accepted as the
foremost authority on sports – here’s looking at you, Trent Dilfer). I thought I’d just provide the answer right off the bat,
before getting into the popular methodologies
used to predict future outcomes and the pitfalls of using past data to predict
future behavior.
(Comic: xkcd.com/904/)
Before we proceed any further,
I think I should take a brief second to caveat my prediction for the number of
games the Colts will win this year. First, that number was, indeed, derived
from a statistical algorithm I created (it wasn’t just thrown out to you willy-nilly).
Second, don’t make any bets on that information. Below, I’ll discuss the
difference between my methods and those used in Vegas. However, my calculations
were done more as an exercise in using historical data to develop predictions
about future outcomes – and historical data, needless to say, isn’t the best
basis on which to place your bets.
While there are thousands of
ways to build prediction models, algorithms, and formulas, there are two
widely accepted methods that people in
the actual industry of sports use: (1) the Vegas method, and (2) the ESPN
analysts’ method. My model uses neither
of these because I’m an incredible genius and an innovator in the world of
statistics (calm down, everyone. I’m just kidding. The difficulty of
mine falls somewhere between the two).
The Vegas Method:
The first, and most accurate, method
for predicting very near-term games is the method used in Vegas to calculate
the line for bets placed on the games for the upcoming week. The way they do
this is by running regressions to determine which variables most
influence scores and wins. They then feed the estimated coefficients into
models in which “shock” variables interact with them. For
those of you who are interested in statistics, they likely use a GLS model to
estimate scores, a logistic model to estimate the effect of predictive variables on wins,
and a Tobit model to integrate the two into a single complete predictive
algorithm. The “shock” variables adjust the model based on current events –
like the fact Darrelle Revis is injured or nobody at the Bears currently likes
Jay Cutler. Either way, they rerun these models on a daily basis, keeping them
dynamic and weighting current information more heavily. All this information is then
put into a final model, which is most likely a weighted OLS regression.
The ESPN Analysts’ Method:
The second, and BY FAR least
predictive, model is the one used over at big-box TV networks like ESPN.
Effectively, they apply subjective judgment to the value of different
variables and use the power of their brains to evaluate the teams and who will
win. All the research (not surprisingly) shows this is ineffective.
So how did I determine how
many games the Indianapolis Colts will win this season? It actually wasn’t that
easy. After collecting data from the Colts and their opponents over each of the
last three seasons, I estimated means and measures of dispersion for the overall
offense and defense for each of these teams. I chose three seasons because I
think that represents a pretty robust picture of each team. Some teams are
effectively the same, while others have transformed quite a bit.
After collecting and analyzing
the initial data, I “simulated” each game for the year by creating a Triangular
Distribution model that truncates at a minimum of zero points scored or
allowed, and then randomizes the points scored and points allowed based on that
distribution and its parameters (parameters being mean and level of
dispersion). I then subtracted the points the Colts allowed to their opponent
from the points the Colts scored. If positive, the Colts “won”
the simulated game. If negative, the Colts “lost” the simulated game. Each
result was then coded as a dummy variable: 1 for a win, 0 for a loss.
I then simulated each game 2000 times so that
no single unlucky random draw would drive the result.
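For anyone who wants to tinker, here is a rough sketch of that kind of simulation in Python. It is not my actual spreadsheet: the function names, the per-game numbers, and the 16-game schedule are all made up for illustration, and the way the mean and standard deviation are turned into a triangle (a symmetric triangle with its lower end floored at zero) is just one plausible assumption.

```python
import math
import random
import statistics


def draw_points(mean, sd, rng):
    """Draw one score from a triangular distribution floored at zero.

    Assumption: the triangle is symmetric around the mean. A symmetric
    triangular distribution with half-width w has variance w**2 / 6,
    so w = sd * sqrt(6). Flooring the lower end at zero makes the
    triangle lopsided whenever mean - w would be negative.
    """
    half_width = sd * math.sqrt(6)
    low = max(0.0, mean - half_width)
    high = mean + half_width
    return rng.triangular(low, high, mean)


def simulate_season(games, rng):
    """Simulate every game once and return the number of wins (1 per win)."""
    wins = 0
    for scored_mean, scored_sd, allowed_mean, allowed_sd in games:
        scored = draw_points(scored_mean, scored_sd, rng)
        allowed = draw_points(allowed_mean, allowed_sd, rng)
        wins += 1 if scored - allowed > 0 else 0
    return wins


def simulate_many(games, n_sims=2000, seed=0):
    """Repeat the season n_sims times; return mean and std dev of wins."""
    rng = random.Random(seed)
    totals = [simulate_season(games, rng) for _ in range(n_sims)]
    return statistics.mean(totals), statistics.stdev(totals)


# Invented placeholder parameters, NOT real Colts data:
# (points scored mean, sd, points allowed mean, sd) for each of 16 games.
example_games = [(21.0, 9.0, 24.0, 10.0)] * 16

mean_wins, sd_wins = simulate_many(example_games)
print(f"average wins: {mean_wins:.4f}, standard deviation: {sd_wins:.4f}")
```

Swap in a different tuple for each opponent and the same loop produces a season-level average and spread like the ones I report.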
After doing all this work, I
determined the Colts will win precisely 7.0455 games, with a standard deviation
of 1.8358 games, over the 2000 simulations.
At the end of the season, I’ll
revisit this analysis to see how accurately my model predicted the actual
number of games the Colts won. I’m hoping to engage in some friendly discussion
about the pitfalls of using historical data to predict future outcomes. Maybe,
however, I’ll be exactly right – after all, even if I drew the “winner” of each
of the 16 games out of a hat, giving me a 50% chance to draw the Colts each time,
there is still a 17.46% chance I’d have estimated 7 wins for the year, and a
73.43% chance to have estimated between 5 and 9 wins.
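Those coin-flip odds are just binomial probabilities, and they are easy to check. The sketch below assumes a 16-game regular season and a 50% chance per game; the function name is mine.

```python
from math import comb


def binom_pmf(k, n=16, p=0.5):
    """Probability of exactly k wins in n independent games, each won with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


p_exactly_7 = binom_pmf(7)
p_5_to_9 = sum(binom_pmf(k) for k in range(5, 10))

print(f"P(exactly 7 wins) = {p_exactly_7:.4f}")  # 0.1746
print(f"P(5 to 9 wins)    = {p_5_to_9:.4f}")     # 0.7343
```

Change `p` to see how quickly the odds move once the hat is no longer fair.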
I’ll leave you with this,
though. In general, using historical data to predict future outcomes and making
financial decisions relying on those outcomes can be catastrophic. If I
mortgaged my home to bet the Colts will win 7 games, I probably wouldn’t be too
happy come December. In fact, the entire banking crisis of 2008 likely wouldn’t
have happened had quantitative modelers heeded my warning. Those statistics
guys in the big banks downtown inserted an assumption in their models that real
estate values generally experience a logarithmic increase from year to year.
Since their models relied on that assumption, their trading platforms kept
purchasing CDOs and artificially escalating the value of the debt instruments
that ultimately required the banks to be bailed out. After the fact, some
professors went in, noticed this assumption, and changed it. After doing so,
the algorithm had the valuation of the CDOs priced correctly within a range of
less than a penny.
Be careful with historical data,
my friends!
Nice piece! I liked the part where you mentioned nobody liking Jay Cutler :]