To me, the important issue raised by the piece in The Star is that the probability of winning a single game is determined by more than each team’s expected performance. Consider the following stylized version of a competition.
Suppose that IJay and I sit down to have a suicide wing eating competition. In such a situation, I would clearly have the advantage, given that I am not only bigger but actually crave spicy food. However, even iron-clad stomachs can be finicky, and subject to randomness. So, let’s suppose that with probability 1/2, I can eat 16 suicide wings before being unable to continue, with probability 1/4 I eat 12 wings, and with the remaining ¼ in probability, my stomach betrays me after only 8 wings. IJay, on the other hand, can down 15 wings with probability 1/3, 9 wings with probability 1/3, or 6 wings with the remaining 1/3 in probability.
Given these probability distributions, if we engaged in these eating competitions on a regular and prolonged basis, I would average 13 wings a night and IJay would average 10. In addition, we can see that IJay would win 25% of the time. With probability 1/3 he eats 15 wings. For him to win, I would have to be having one of those nights where I can only manage 8 or 12 wings, which happens with probability 1/2. Again with probability 1/3, IJay eats 9 wings. For him to win then, I would have to stop at 8, which occurs with probability 1/4. Finally, if IJay eats only 6 suicide wings, he cannot beat me. The total probability associated with IJay’s victories, therefore, is 1/3*1/2 + 1/3*1/4 = 1/4.
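For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The two distributions come straight from the story; the function names are my own, and the calculation assumes that my night and IJay's night are independent of each other.

```python
from fractions import Fraction as F
from itertools import product

# Distributions from the story: wings eaten -> probability
me = {16: F(1, 2), 12: F(1, 4), 8: F(1, 4)}
ijay = {15: F(1, 3), 9: F(1, 3), 6: F(1, 3)}

def mean(dist):
    """Expected number of wings under a distribution."""
    return sum(wings * p for wings, p in dist.items())

def win_prob(a, b):
    """Probability that a strictly out-eats b (assumes independence)."""
    return sum(
        pa * pb
        for (wa, pa), (wb, pb) in product(a.items(), b.items())
        if wa > wb
    )

print(mean(me), mean(ijay))  # 13 10
print(win_prob(ijay, me))    # 1/4
```

Exact fractions are used rather than floats so the result matches the hand calculation term for term.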
Now suppose that IJay, inspired by Homer Simpson, discovers that if he coats his mouth with wax before the competition (note: we really do not recommend or advocate this), then the probability distribution associated with his performance changes. Suppose that after effectively ingesting a candle, IJay can eat 17 suicide wings with probability ½, but with probability ½ the effect of the wax is such that he can’t actually get anything into his mouth at all. Note that, by emulating Homer, IJay has decreased his average consumption from 10 wings to 8.5, but has increased his probability of winning from ¼ to ½.
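The wax scenario can be checked the same way. This is a sketch under the same independence assumption, with hypothetical names (`ijay_plain`, `ijay_wax`) for the two versions of IJay; the variance is included only to show how much wider the waxed distribution is.

```python
from fractions import Fraction as F

# Distributions from the story: wings eaten -> probability
me = {16: F(1, 2), 12: F(1, 4), 8: F(1, 4)}
ijay_plain = {15: F(1, 3), 9: F(1, 3), 6: F(1, 3)}
ijay_wax = {17: F(1, 2), 0: F(1, 2)}

def mean(dist):
    return sum(w * p for w, p in dist.items())

def variance(dist):
    m = mean(dist)
    return sum(p * (w - m) ** 2 for w, p in dist.items())

def win_prob(a, b):
    """Probability that a strictly out-eats b (assumes independence)."""
    return sum(pa * pb
               for wa, pa in a.items()
               for wb, pb in b.items()
               if wa > wb)

for name, dist in [("plain", ijay_plain), ("wax", ijay_wax)]:
    print(name, mean(dist), variance(dist), win_prob(dist, me))
# plain 10 14 1/4
# wax 17/2 289/4 1/2
```

The mean falls from 10 to 8.5, the variance jumps from 14 to 72.25, and the probability of winning doubles.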
It is certainly possible to construct examples where an increase in the variance of IJay’s tolerance for spicy foods does not come with an increase in his probability of winning, and I’m sure some empirical observations to this effect can be found around town on a nightly basis. But the point remains that the probability of winning is determined by the entire distribution over possible outcomes and not just the mean.
Which leads us to hockey. The appropriate comparison to the story above would be to say that, if you’re the underdog as IJay was, which probably means you should expect to get outplayed and outshot, then your goalie must be better than the other team’s if you’re going to win. And given the goalies at Sochi, you probably can’t count on a poor performance from the opposition in every game in the medal rounds. So, you need a goalie who has a high probability of stealing the game for you.
Empirically, it’s not obvious what the appropriate measure of such performance is. Nor is it obvious that, given the appropriate measure, goalies vary enough in this manner to make this a worthwhile discussion to have. But you don’t know until you go and look, and so that’s what we did.
We obtained game-by-game data from NHL.com on the 5 main candidates for the 3 US goalie spots. While more data is better, goalies’ performances do change over time, and so we looked at just the last 4 years. For Miller, Quick and Howard, this constitutes a workload in the vicinity of 200 games each, and for Bishop and Schneider the sample contains just over 100 games each. (Note: the column in The Star contained data from a preliminary investigation that only included 2 years’ worth of data.)
As mentioned above, it wasn’t (and still isn’t) obvious to us how to appropriately measure the kind of “big game ability” that can cause a goalie to win a game pretty much single-handedly. So, we looked at the frequencies of a few things that would seem indicative of a goalie having had a good game. First, we looked at shutouts and then games in which our goalies gave up one or fewer goals. We also considered having a save percentage above a certain threshold, and considered two possibilities, .950 and .940. It is well known among the stats community, however, that the best indicator of a goalie’s performance is at even strength. Save percentage on the penalty kill can be very much driven by the players in front of the goalie, and there is huge volatility in PK save percentage, even for a single goalie over a short period of time. Unfortunately, while the NHL.com data does give the number of even strength goals, it does not give the number of even strength shots or saves, and so we were not able to look at 5v5 save percentage. The best we could do in this regard was look at games with 1 or fewer even strength goals (which again didn’t make it into The Star). The following table summarizes our findings.
|Player||<=1 GA%||<=1 EV GA%||>.940 Sv%||>.950 Sv%|
(If anyone out there is aware of a site that has 5v5 save percentage at the game level, please let me know. I know it can be constructed out of the NHL’s detailed game logs, and hopefully we’ll get that done soon, but it would still be a useful resource if it were out there.)
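To make the measures concrete, here is a rough sketch of how each frequency could be computed from game-level data. The game log below is entirely made up for illustration (it is not any goalie's real record), the tuple layout is my own assumption rather than the NHL.com format, and every game is assumed to have at least one shot against.

```python
# HYPOTHETICAL game log, for illustration only:
# (shots_against, goals_against, even_strength_goals_against)
games = [
    (30, 0, 0),
    (25, 1, 0),
    (28, 3, 2),
    (35, 2, 1),
    (20, 4, 3),
]

def save_pct(game):
    shots, goals, _ = game
    return (shots - goals) / shots  # assumes shots > 0

def frequency(games, predicate):
    """Share of games in which the predicate holds."""
    return sum(1 for g in games if predicate(g)) / len(games)

# One predicate per measure discussed in the post
measures = {
    "shutouts":  lambda g: g[1] == 0,
    "<=1 GA":    lambda g: g[1] <= 1,
    "<=1 EV GA": lambda g: g[2] <= 1,
    ">.940 Sv%": lambda g: save_pct(g) > 0.940,
    ">.950 Sv%": lambda g: save_pct(g) > 0.950,
}

results = {name: frequency(games, pred) for name, pred in measures.items()}
for name, freq in results.items():
    print(f"{name:>10}: {freq:.1%}")
```

Note that the save percentage here is for all situations, since even strength shots are not in the data.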
To be honest, I was not expecting to find much appreciable difference between the goalies on these 5 measures, and to the extent that there was a difference, I certainly did not expect there to be any clear pattern to the results. But a pattern there is.
Bishop and Schneider perform better in almost every category than the 3 who made the team (the one exception being shutouts, where Quick outperformed Bishop). Interesting, but is this difference driven by something fundamental, or could it just be the product of chance? If you had 5 people flip a coin 10 times, they’re not all going to get the same number of heads. It would be a bit strange to infer, however, that the person who happened to flip the most heads was somehow flipping a different coin from the rest. However, if you had these same 5 people continue to flip their coins, and you noticed that one of the coin flippers was getting, say, double the number of heads as the others, then at some point you would have to question whether this was just luck.
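The coin-flipping intuition can be put in numbers. The sketch below computes the exact probability that a fair coin gives at least a given number of heads; the specific cutoffs (8 of 10 versus 80 of 100) are my own illustrative choices, not figures from our data.

```python
from math import comb

def tail_prob(n, k, p=0.5):
    """Exact P(at least k heads in n flips) for a coin with P(heads) = p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Same 80% heads rate, very different sample sizes
print(tail_prob(10, 8))    # plausible luck with a fair coin
print(tail_prob(100, 80))  # essentially impossible with a fair coin
```

Eight heads in ten flips happens to a fair coin about 5% of the time, while the same 80% rate sustained over a hundred flips is vanishingly unlikely, which is why sample size matters when comparing goalies.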
The question, then, is how likely it is that Bishop’s observed frequency of shutouts (or of any other measure we consider) and Howard’s observed frequency of shutouts could have been the outcome of the same underlying process. This is a hypothesis for which statisticians have developed a simple test. So, we looked to see if Bishop’s performance in our measures was statistically different from that of the other 3, and then did the same for Schneider.
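One common test for this kind of question is the two-proportion z-test, sketched below. I am using it here purely as an illustration of the idea, not as a record of the exact procedure we ran, and the counts in the example are hypothetical rather than any goalie's actual numbers.

```python
from math import sqrt, erfc

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test of H0: both samples share one underlying rate.

    x1, x2 are counts of "good games"; n1, n2 are games played.
    Returns (z statistic, two-sided p-value).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)               # rate under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))             # equals 2 * (1 - Phi(|z|))
    return z, p_value

# HYPOTHETICAL counts: goalie A has 40 good games in 100,
# goalie B has 50 good games in 200.
z, p = two_proportion_z_test(40, 100, 50, 200)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value says a gap this large would rarely arise if both goalies were drawing from the same underlying process. The normal approximation behind this test is less reliable for rare events such as shutouts in small samples.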
What we found is that, with the exception of shutouts, Bishop’s and Schneider’s performance in our measures was different enough from the other 3 that we could be fairly certain that we’re not just observing luck. The following tables give the degree of confidence (a statistical term) we have that the differences we observe between the various goalies are not just random luck (using fairly standard categories for confidence).
|Bishop vs||<=1 GA%||<=1 EV GA%||>.940 Sv%||>.950 Sv%|
|Schneider vs||<=1 GA%||<=1 EV GA%||>.940 Sv%||>.950 Sv%|
Again, I feel I should emphasize that it is very much an open question as to how to appropriately measure whatever it is that influences the probability of winning a given game. Chris Boyle’s Shot Quality Project over at Sportsnet seems like just the kind of data one would want to establish an appropriate measure. In my opinion, it’s an incredibly worthwhile project that I follow very closely. There are many other projects going on around the blogosphere that I think are really pushing advanced stats forward, and I’m sure I’ll get around to giving them their due at some point.
To conclude, I think the consideration of specific aspects of the distribution over performances makes for some (delicious pub-style) food for thought, especially when you note that both Bishop’s and Schneider’s average performances (straight save percentage) were better over this time period than all three goalies the US ultimately went with.