How to predict this year’s playoffs? An interesting question with many potential answers. We didn’t get an opportunity to explain how we came up with our numbers in the SI article itself, so we’ll have to do that here. Hopefully I can explain our methodology clearly even for those without a statistics background (and not too loosely for those who know their stats!). I should also mention that my colleague, Mikal Skuterud, really deserves a lot of credit for the analysis.
The first issue to sort out is which outcome you’re trying to explain. At some level, it’s very simple: who wins. Winning is a binary variable – either you win or you don’t. When trying to explain or predict a binary variable, probit regression analysis is incredibly useful. We’ve used it before, when looking at some interesting issues concerning playoff success last year. Some examples can be found here, here, here, and here. In a nutshell, a probit regression uses the data to assign a “score” to a matchup. This score is used in conjunction with the normal distribution to determine how much “luck” is required for a given team to win that series. Here, “luck” is the error term, which includes the effects of missing variables as well as plain random chance, and which is assumed to be drawn from the normal distribution.
While we have found this to be a useful approach for some questions in the past, it doesn’t make as efficient use of the data as possible. Some series are very closely contested, and “puck luck” plays a large role, while others are quite lopsided, and the amount of “puck luck” would have to be extreme to change the outcome. The problem with a probit regression is that it lumps all series wins into the same category. We actually have some information on how lopsided a series is by looking at the number of games it went. It’s not perfect by any means, but it does contain information. As such, one can think of a team’s result in a series as having 8 possible outcomes: get swept, lose in 5, lose in 6, lose in 7, win in 7, win in 6, win in 5, and sweep. Note that these outcomes are written in order from worst to best. In other words, we can order the outcomes. When we can do this, one possible method of analysis is an ordered probit regression. It is similar to a probit, but allows us to exploit additional information about the closeness of a series (and the role of luck) in order to get better predictions.
An ordered probit still constructs a “score” for a series, as the probit regression does, but then uses that score to establish 8 regions of “luck” that would correspond to the 8 possible outcomes. From this, probabilities can be constructed for each of the 8 possible outcomes. The probability that a team wins, then, is simply the sum of the probabilities associated with the four winning outcomes (sweep, win in 5, win in 6, and win in 7).
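To make the "8 regions of luck" idea concrete, here is a minimal sketch of how an ordered probit turns a score into outcome probabilities. It assumes the usual textbook setup: a latent value equal to the score plus a standard normal luck term, and 7 increasing cutpoints separating the 8 outcomes. The function names and the cutpoint values in the usage note are hypothetical, not our estimated ones.

```python
from math import erf, inf, sqrt

# The 8 outcomes, ordered from worst to best for the team in question.
OUTCOMES = ["swept", "lose in 5", "lose in 6", "lose in 7",
            "win in 7", "win in 6", "win in 5", "sweep"]


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def outcome_probabilities(score, cutpoints):
    """Probability of each of the 8 ordered outcomes for a given score.

    Assumption: latent value = score + luck, luck ~ N(0, 1), and outcome k
    occurs when the latent value lands between cutpoints k-1 and k.
    """
    if len(cutpoints) != 7 or list(cutpoints) != sorted(cutpoints):
        raise ValueError("need 7 cutpoints in increasing order")
    edges = [-inf] + list(cutpoints) + [inf]
    cdf = [normal_cdf(edge - score) for edge in edges]
    return [cdf[k + 1] - cdf[k] for k in range(8)]


def win_probability(score, cutpoints):
    """Sum of the four winning outcomes (win in 7, 6, 5, and sweep)."""
    return sum(outcome_probabilities(score, cutpoints)[4:])
```

With hypothetical symmetric cutpoints such as `[-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]`, a score of 0 gives a win probability of exactly one half, and a higher score shifts probability toward the winning outcomes.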
Once the strategy for determining the probabilities that a given team wins a playoff series against a specific opponent has been established, the next order of business is to figure out what variables should be used to create this “score” used in the ordered probit regression. When doing predictive analysis, you ideally want everything that contains information about how a team will perform in the playoffs. Clearly, how they did in the regular season (i.e. their points) has some value, but what else is there? We went and gathered as much information as we could on a whole mess of variables, with the idea that our estimation strategy would help us figure out what was important. We collected data on regular season points, points in the last half of the season, points in the last 10 games, Corsi, Score Adjusted Corsi, penalty kill, power play, save percentage, shooting percentage, and more. There was one interesting variable that was created by Ian. It has been well-recognized in the analytics community that “puck luck” is a real thing, and that it can make the standings a poor representation of a team’s ability. One way that puck luck manifests itself is in the outcome of one-goal games – particularly overtime games and shootouts. These games are often determined by odd events, and occasionally a team gets the puck to bounce their way a disproportionate number of times. So, what Ian did was construct a variable that compared a team’s winning percentage in one-goal games to their winning percentage in other games. If they were doing much better in one-goal games, then it is possible that their regular season record is predicated not on ability but on puck luck. More on this variable later.
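As a rough illustration of the luck variable, the sketch below compares a team's winning percentage in one-goal games to its winning percentage in all other games. The exact form Ian used is not spelled out here, so taking the simple difference is an assumption, and the function names are hypothetical.

```python
def win_pct(wins, games):
    """Winning percentage as a fraction; 0.0 if no games were played."""
    return wins / games if games else 0.0


def one_goal_luck(one_goal_wins, one_goal_games, other_wins, other_games):
    """Difference between one-goal-game and other-game winning percentages.

    Assumption: a simple difference. A large positive value suggests the
    team's record may owe more to bounces than to ability.
    """
    return win_pct(one_goal_wins, one_goal_games) - win_pct(other_wins, other_games)
```

For example, a hypothetical team that went 20-10 in one-goal games but 25-27 in all others would get a clearly positive value, flagging its regular season record as possibly inflated.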
So, having collected all these historical data where we know the actual outcome of each series, now it’s time to plug them into our ordered probit regression and see what predictors are good at predicting the outcomes that actually happened, right? Unfortunately, it’s not quite that simple. Given that some of these variables were only available going back to the 2008 playoffs (in particular, we pulled the Score Adjusted Corsi from puckon.net, which only has that going back to the 2007-08 season), we were left with 105 observations on playoff series. With so many variables, and the fact that these variables are actually quite correlated with each other, using everything doesn’t actually yield anything with any statistical power.
Things are further complicated by the fact that, since 2008, playoff teams don’t really look all that different from each other in terms of these variables. We’ve entered an age of parity, and this is not good for statistical analysis. Regression analysis is based on seeing how differences in certain variables (team characteristics) lead to differences in outcomes (winning versus losing a series). If there aren’t many differences in team characteristics, then it gets hard to explain or predict the difference between who wins and who loses a series.
One solution to this problem is factor analysis. Factor analysis takes the variables that you have, and combines them into a single number, called a factor, in a way designed to make teams look as different as possible according to that factor. You would then use that factor in your regression analysis. You can run regressions using a single factor, or you can create multiple factors. The key is that the number of factors you create and use in the regression is less than the number of variables you began with.
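For readers who want to see the flavour of this, here is a small sketch that collapses several correlated variables into one number per team. To be clear, it is not factor analysis proper: it computes the first principal component of the standardized variables by power iteration, a closely related and simpler technique swapped in for illustration. All names are hypothetical, and it assumes every variable actually varies across teams and that the starting weights are not orthogonal to the leading component.

```python
from math import sqrt


def standardize(rows):
    """Rescale each column to mean 0 and sample standard deviation 1."""
    n, m = len(rows), len(rows[0])
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [sqrt(sum((x - mu) ** 2 for x in c) / (n - 1))
           for c, mu in zip(cols, means)]
    return [[(row[j] - means[j]) / sds[j] for j in range(m)] for row in rows]


def first_component(rows, iterations=200):
    """Return (weights, scores) for the first principal component.

    rows: one list of variable values per team.
    """
    z = standardize(rows)
    n, m = len(z), len(z[0])
    # Correlation matrix of the standardized variables.
    corr = [[sum(z[i][a] * z[i][b] for i in range(n)) / (n - 1)
             for b in range(m)] for a in range(m)]
    # Power iteration: repeatedly apply the matrix and renormalize.
    w = [1.0] * m
    for _ in range(iterations):
        w = [sum(corr[a][b] * w[b] for b in range(m)) for a in range(m)]
        norm = sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
    scores = [sum(zi[j] * w[j] for j in range(m)) for zi in z]
    return w, scores
```

The scores play the role of the single combined number described above: one value per team that could then be used in a regression in place of the original variables.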
So, our choices were to use factor analysis, with the number of factors to be determined, or to use a smaller set of variables in our regressions, that smaller set also to be determined. We wanted to use the best model possible, but which one would that be? What should the criterion be to discriminate between what is a good model and what is a bad model?
In our case, the “best” model is the one that has the most predictive power. This is not (necessarily) the same as the model that fits the data the best. When you run probit regressions, you can see how many of the series you would have got right if you had used that model to make your picks. Unfortunately, this is rather backwards looking, as the model is created using the data on who won. In other words, the model that fits the data the best is the one that has the most explanatory power, which is quite different from predictive power.
In order to establish predictive power, you need to see how well the model does in predicting the outcomes of series that weren’t used in the generation of the model. The way to do this is to run the regression using all the data you have except for one year. Then, use the resulting model to predict the year that wasn’t used, and compare your predictions to the actual results. Repeat this for each year in turn. This is a form of “leave-one-out cross-validation,” where the unit left out is a full playoff year. So, we tried this with the factor analysis, using several different numbers of factors, as well as several different sets of variables. What we found was that the factor analysis with 5 factors had the most predictive power, predicting on average 10.7 series correctly per year (out of the 15 series played each year). This was a fair bit better than looking at any single variable by itself, although the one that came closest was Ian’s luck variable. As it turns out, the luck variable was heavily weighted in the construction of the factors, so it turned out to be an important innovation! At some point, we’ll have to look into this more closely.
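The hold-out-one-year procedure can be sketched as a short loop. This is a generic outline rather than our actual code: `fit` and `predict_winner` are hypothetical stand-ins for whatever model is being evaluated, and the dictionary keys are assumptions made for the example.

```python
def leave_one_year_out(series, fit, predict_winner):
    """Average number of correctly predicted series per held-out year.

    series: list of dicts with keys 'year', 'features', and 'won'
            (True if the team of interest won the series).
    fit: function taking the training series and returning a model.
    predict_winner: function taking (model, features) and returning
            True if the team of interest is predicted to win.
    """
    years = sorted({s["year"] for s in series})
    correct_by_year = {}
    for held_out in years:
        # Fit on every year except the one being predicted.
        train = [s for s in series if s["year"] != held_out]
        test = [s for s in series if s["year"] == held_out]
        model = fit(train)
        correct_by_year[held_out] = sum(
            predict_winner(model, s["features"]) == s["won"] for s in test
        )
    average = sum(correct_by_year.values()) / len(years)
    return correct_by_year, average
```

Comparing the returned average across candidate models (different numbers of factors, different variable sets) is what lets you rank them by predictive rather than explanatory power.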
Now that the model had been established, it was time to generate some results. First off, we used the model to generate probabilities of each of the 8 outcomes for each of the first round series. As mentioned before, the probability that a team wins a series is the sum of the probabilities of its four winning outcomes. The first round predicted outcomes are as follows:
Note that this model predicts that the Senators will beat the Canadiens, but if you look at the single most likely outcome, it’s that the Habs win in 7. This is also true for the Flames/Canucks series: the Canucks are predicted to win, but the most likely outcome (of the 8) is that the Flames win in 7. At some level, this is telling us how close these series will be.
From here, we generated probabilities for every possible second round matchup, along with the probabilities of each team winning those series. We then did the same for all the possible third round matchups, and so on right to the Finals. Using this method, we generated probabilities for each team to win the Stanley Cup. It is worth noting that the most likely Stanley Cup Final (as predicted by this model) is the Blackhawks versus the Lightning. If you were to fill out a bracket by taking the team most likely to win each series into the next round, this is what you would get as well.
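The round-by-round bookkeeping can be sketched as follows. This is a simplified outline under stated assumptions: a fixed bracket whose size is a power of two, and a hypothetical function `p_win(a, b)` giving the probability that team `a` beats team `b` in a series (in our case that number would come from the ordered probit model).

```python
def championship_probabilities(bracket, p_win):
    """Probability that each team wins the whole bracket.

    bracket: teams listed in bracket order, so that neighbours meet in
             the first round (length must be a power of two).
    p_win:   function (a, b) -> probability that a beats b in a series.
    """
    # Each slot maps team -> probability that the team occupies that slot.
    slots = [{team: 1.0} for team in bracket]
    while len(slots) > 1:
        next_slots = []
        for i in range(0, len(slots), 2):
            left, right = slots[i], slots[i + 1]
            merged = {}
            for a, prob_a in left.items():
                for b, prob_b in right.items():
                    # Chance this exact matchup occurs; the two teams come
                    # from disjoint halves, so their paths are independent.
                    meet = prob_a * prob_b
                    merged[a] = merged.get(a, 0.0) + meet * p_win(a, b)
                    merged[b] = merged.get(b, 0.0) + meet * (1.0 - p_win(a, b))
            next_slots.append(merged)
        slots = next_slots
    return slots[0]
```

Every possible matchup in a later round is weighted by the chance that it actually occurs, which is why a team's Cup probability depends on the whole bracket and not just on its first round opponent.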
Finally, looking at the probabilities generated for the first round, we can see that there are going to be some tightly contested series. The model predicts that each series will go at least 6 games, and the predicted winners generally have less than a 60% chance of winning. We’ll check back after the first round to see how things are going.