There were a couple of requests to see more about the regressions that were run. So here it is.

First, I should mention that the kind of regression that is run is called a probit regression. The variable we’re trying to explain is wins, so it can only take on two values - 1 if a team wins the game, or 0 if it doesn’t. (Note that we are not differentiating between different kinds of wins and losses, such as overtime and shootout. That would be interesting to do at some point, though.) A probit regression is used to estimate the probability that such a dependent variable takes on the value of 1 as a function of some observable characteristics.

The characteristics we used for the league-wide regressions were hit differential (hits for minus hits against), a measure of possession (as mentioned previously, we ran one regression with Fenwick % and another with Corsi %), a dummy variable for home-ice advantage (equal to 1 when the team of interest is the home team, and 0 when it is the visitor), faceoff %, penalty % (the number of penalties the other team took divided by the total penalties in the game), and a variable for the year, just to see whether this relationship changed over time, perhaps due to rule changes.

As previously mentioned, we had these characteristics for each game from the 2007-08 to the 2012-13 seasons. We then had to randomly pick one of the teams in the game to be the team of interest. This procedure wouldn’t work if we always picked the winning team, for example, so we needed to make sure that we had a good mix of 1’s and 0’s in the sample.
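A minimal sketch of that randomization step, assuming each game record stores home and visitor totals (field names like `home_hits` are my own for illustration, not from the actual data set):

```python
import random

def pick_team_of_interest(game, rng):
    """Randomly designate one side of the game as the 'team of interest',
    and express the win indicator and hit differential from its side."""
    home_is_subject = rng.random() < 0.5
    if home_is_subject:
        win = game["home_win"]
        hit_diff = game["home_hits"] - game["visitor_hits"]
    else:
        win = 1 - game["home_win"]
        hit_diff = game["visitor_hits"] - game["home_hits"]
    return {"win": win, "hit_diff": hit_diff, "home": int(home_is_subject)}

game = {"home_win": 1, "home_hits": 25, "visitor_hits": 18}
row = pick_team_of_interest(game, random.Random(0))
```

Because the side is chosen by a fair coin flip, roughly half the rows in the sample end up with a 1 for the win and half with a 0, which is exactly the mix the regression needs.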

From there, these game characteristics are used to generate a “score” for the game, with the idea that the higher the score, the greater the probability of winning. The score function is a standard linear equation:

*Score = a + b·(hit diff) + c·(poss) + d·(home) + e·(faceoff) + f·(penalty) + g·(year)*

This score is translated into a probability of winning via a function that will spit out a number between 0 and 1 for each score, with higher scores receiving higher numbers. That is, we have

*Y = Pr[win=1] = F(score)*

Those familiar with probability theory will recognize that this function *F* is a cumulative distribution function. It is, in fact, the cdf of the standard normal distribution.
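Putting the two pieces together, here is the whole mapping from game characteristics to a win probability in code. The coefficients below are made up purely for illustration; the real estimates are in the linked results.

```python
import math

def normal_cdf(x):
    """F in the text: the standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def win_probability(hit_diff, poss, home, faceoff, penalty, year, coef):
    """Probit: a linear 'score' in the characteristics, pushed through
    the normal CDF to get a number between 0 and 1."""
    a, b, c, d, e, f, g = coef
    score = (a + b * hit_diff + c * poss + d * home
             + e * faceoff + f * penalty + g * year)
    return normal_cdf(score)

# Made-up coefficients (a, b, c, d, e, f, g), not the estimated ones:
coef = (-1.4, 0.02, 1.5, 0.1, 0.6, 0.6, 0.0)
p = win_probability(hit_diff=3, poss=0.52, home=1, faceoff=0.50,
                    penalty=0.55, year=0, coef=coef)
```

Whatever the score, the CDF squashes it into [0, 1], and a higher score always means a higher win probability.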

Whew! So, once this regression has been run, you get estimates for the coefficients above. The problem is, what do these coefficients mean? Well, each one tells you how its variable affects this “score” of the game. But that really doesn’t mean anything in and of itself. In order to be useful, you need to figure out how this score then affects the probability of winning. Unfortunately, that depends on what the score already is. So, you need to evaluate the effect of a change in one of these variables holding the value of the other variables fixed. What we did was look at the “score” when each of the variables was at its average value, and then consider the effect of a change in each one separately. The effect of a change in the hit differential, therefore, is given by

*∂Y/∂(hit diff) = b·f(score)*

where *f* is the derivative of the function *F*. This is the *marginal* effect of hits, and it tells you (approximately) the increase in the probability of winning when the hit differential increases by one (either because the team delivered one more hit or took one fewer).
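In code, the marginal-effect calculation is just the coefficient times the normal density evaluated at the score. Both numbers below are hypothetical, chosen only to show the mechanics:

```python
import math

def normal_pdf(x):
    """f in the text: the standard normal density, the derivative of F."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

# Hypothetical values: b is the estimated coefficient on the hit
# differential, and score_at_means is the score with every variable
# held at its sample average.
b = 0.02
score_at_means = 0.1
marginal_effect = b * normal_pdf(score_at_means)
# With these numbers, a one-hit improvement in the differential raises
# the win probability by b * f(score): a fraction of a percentage point.
```

Because *f* peaks at zero and dies off in the tails, the same coefficient implies a bigger marginal effect for evenly matched games (score near zero) than for lopsided ones.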

For the individual team regressions, we used the same “score” function, but then just looked at the games played by that particular team.

The results of all these regressions can be found here.

At this point, I’d like to discuss a comment made on Twitter by Bower Power. He asked if we had accounted for “home scorer’s bias”. What he’s referring to is the fact that each arena has a person responsible for tracking things like hits and shots. The problem is that it has been well-established that these stats have some subjectivity to them and that these people don’t all record things the same way. In some arenas, fewer hits are recorded than in others, even though the same number of hits may have happened, in an objective sense. This introduces measurement error, which means the reported results may not be entirely accurate.

Exactly what effect this has on the regression results depends on the nature of the measurement error.

So I’d like to discuss the different kinds of measurement errors and their implications. First, I should mention that if official scorers are simply inconsistent in the ways that they record hits - that is, if the errors are essentially random noise - then the true effect of hitting will be **greater** than what we find in our estimates, because this kind of classical measurement error biases the estimated coefficient toward zero.

But there may be systematic errors in how hits are reported. Suppose that an arena simply underreports hits by both teams by a fixed amount. If the official scorer systematically doesn’t report 5 hits for the home team and 5 hits for the visiting team, then there’s no issue. The hit differential remains the same as the true hit differential. Of course, this is highly unlikely to be the case.
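A quick check of that claim, with made-up hit counts: subtracting the same fixed amount from both teams leaves the differential untouched.

```python
home_hits, visitor_hits = 25, 18   # hypothetical true totals
missed = 5                         # scorer fails to record 5 hits per side

true_diff = home_hits - visitor_hits
reported_diff = (home_hits - missed) - (visitor_hits - missed)
# The fixed amount cancels, so reported_diff equals true_diff.
```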

Another possibility is that official scorers systematically give more hits to the home team than the visiting team. In that case the reported hit differential will always be more favourable to the home team than is actually the case, but this kind of measurement error will be soaked up by our home-ice advantage variable, so the effect of hitting found by our regression is still correct.

If arenas’ official scorers differ in what they view as a hit, and the difference gets applied to both teams equally, then the number of hits for each team will be scaled up or down proportionately. This means that the hit differential will be smaller in arenas where scorers are stingier. We did actually do something to correct for this, albeit unwittingly. If scorers magnify or shade down hits proportionately, then looking at the proportion of total hits that were delivered by the team in question solves this problem, because the scale factor cancels out of the ratio. The first regressions that we ran actually used this measure, and we still had statistically significant results saying that hits are positively correlated with wins.
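To see why the percentage measure is immune to this kind of error while the differential is not, here is the arithmetic with hypothetical numbers:

```python
home_hits, visitor_hits = 25, 18   # hypothetical true totals
stinginess = 0.6                   # this scorer records only 60% of true hits

true_pct = home_hits / (home_hits + visitor_hits)
reported_pct = (stinginess * home_hits) / (stinginess * (home_hits + visitor_hits))
# The scale factor cancels in the ratio, so the percentages agree...

true_diff = home_hits - visitor_hits
reported_diff = stinginess * home_hits - stinginess * visitor_hits
# ...but the reported differential is shrunk by the same factor.
```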

All these types of measurement error simply don’t matter, for any of the regressions. There is one type of measurement error that would matter, but figuring out whether it exists and what its impact would be is very difficult.

It could be the case that some arenas systematically misreport hits in a way that distorts both of the measures we looked at and that is correlated with the probability of winning through a channel separate from the other variables in the regression. If that’s the case, then there will be a bias in the estimate. Note, however, that this is not as simple as saying that Chicago, for example, distorts its recording of hits in a way that makes the hit differential look smaller than it really is, and therefore we are overestimating the effect of hits. A really big reason that Chicago is so good is that they have the puck a lot, and possession is already one of the variables in our equation. So, that alone is not a problem.

Furthermore, looking at road games only, a common thing to do, doesn’t necessarily solve any problem. In fact, it could make things worse.

At the end of the day, all results need to be taken as evidence of something, and not proof. Other regressions controlling for different things could wipe out any effect we have found. However, given the well-known relationship between hitting and (not) having the puck, this is the only logical next step in the analysis.

Am I reading this wrong, or does your regression indicate that more penalties and fewer faceoff wins increase the probability of winning?

You're right about face-offs. This is something that others have noted as well. It's definitely an open question as to what's going on there.

The penalties measure is actually penalties drawn. So there is a positive effect of the other team taking more penalties. I probably should have emphasized that a bit more, as I think there probably is something more natural about thinking in terms of penalties taken.

Can you rerun it controlling for fenwick / corsi close? Score effects might be confounding the relationship.

That's a good idea. We'd have to get all the other stats when the score is close as well. I can't promise we'll get to it soon, but I'll see what we can do.

They correlate very closely come to think of it, so it probably won't have that much of an impact. Although to truly correct for scoring effects you would probably need a categorical variable with the exact time spent tied, with a 1-goal lead, down 1 goal, etc.

Er not a categorical variable

Hey Phil, former student here. I have an applied econometrics exam in a couple days and it's great to see a probit model applied to something I can make sense of. The analysis of the marginal effect here really cleared a lot of things up for me. Any other blogs you know of with heavy econometrics influence?

BTW, I have never seen something control for possession before - very cool stuff.

Any reason why you switched from hit percentage to hit differential between this article and the last one? I know it should end up yielding the same result, but I'm wondering what the cause was. Easier to interpret perhaps?

Hey Jordano, it's good to hear from you! I hope all is going well - and good luck on your exam.

The reason for the switch from hit percentage to hit differential is exactly what you said - I think it's easier to interpret the effect of a change in the hit differential than in the percentage. We originally did everything with percentages, but then reran them with the differential for that reason. I actually didn't realize that some of the results posted here were from the percentage regressions.

As for other blogs that use econometrics, the only other one I am aware of is http://rinkstats.blogspot.ca/

It's definitely worth checking out. There are almost certainly others, so if you come across them, let me know!

Thanks, we'll see how she goes. And I'll check it out. Can't wait to see some more results.