Expected Goals (xG) Limitations – A Deep Dive

Expected Goals (xG) are a great tool for analyzing goaltender and skater performance, but it’s important to discuss limitations and opportunities for improvement.

It’s Tough to Make Predictions, Especially About the Future

When I projected goaltender performance for the 2017-18 season, Cam Talbot looked likely to have an above-average season, possibly even a top-10 performance. Measured by his ability to stop the puck relative to what we might expect of an average peer (using expected goals), his track record had been good to very good prior to this season, within a certain degree of certainty. Circumstances changed of course, as he went from backing up one of the best ever to an overworked starter in 2 years.

For other goalie-season performance summaries, see: http://crowdscoutsports.com/hockey_g_compare.php

Edmonton played him 86 games last season and though I don’t know his exact physiological profile, it stands to reason it may have been too many games. Could this burnout be impacting his performance this season? Impossible to say, but his performance has been down (albeit with a smaller sample size, represented by the wider error bars above) and he has battled some injury issues. Regardless, this prediction has been wrong so far.

All Models Are Wrong, But Some Are Useful

My projection method is built on goals prevented over expected (I say goals prevented rather than saved because I think it’s important to remove credit for saves made on preventable rebounds), which measures the difference between the actual goals allowed by a goalie and the number an average goalie would allow on the same shots. In theory, if we can control for the number and quality of shots faced, we should be able to isolate and identify puck-stopping ‘skill’: the marginal difference between the best and the rest would manifest itself over an adequate number of shots.[1]

This means it falls on the analyst to properly adjust for shot quantity and quality. I mention quantity because in theory goaltenders can discourage shots against and encourage shooters to miss the net through excellent positioning. However, I’m not fully confident there’s a convincing way to separate the effect of a goalie forcing misses from the effect of playing on a team where the scorer might be more or less likely to record a missed shot. These effects don’t necessarily persist season-to-season, so I’m still using all non-blocked attempts in my analyses, but it’s important to acknowledge that what should be a simple counting statistic is much more complex beneath the surface.

A more popular contention is shot quality, where the analyst tries to weight each shot by its probability of being a goal, holding constant the circumstances outside the goalie’s control that can be quantified using publicly available NHL play-by-play data – things like shot location, team strength, shot type, prior events (turnover, faceoff, etc) and their locations. But these factors, as skeptics will point out, currently don’t include shot placement, velocity, net-front traffic, or pre-shot puck movement involving passes (though we can calculate pre-shot movement or angular velocity on rebounds from shot 1 to shot 2, or the location and angular velocity of other events immediately preceding a shot; Moneypuck explains and diagrams this well in their Shot Prediction Expected Goals Model explainer).

Shot quality is also heavily dependent on shot location, which is subject to recorder bias – some rink scorers systematically record shots closer to or further from the net than we see when that team plays in other rinks. Adjustment methods help standardize these locations (basically aligning the home rink distribution of x,y coordinates with what is observed in away games involving the team in question), but there is no way to verify how much closer this gets us to the ‘true’ shot locations without personally verifying many, many shots. In any event, it invites differences in measured performance between ‘objective’ metrics developed by different analysts.

All of this is to say the ability to create a ‘true,’ comprehensive, and accurate xG model is limited by the nature of the public data available to analysts. Any evaluation and subsequent projection is best viewed through a more sceptical lens.

Discussing Deltas

With this in mind, back to Cam Talbot. Talbot has underachieved this season for some combination of the following:

  1. His ‘true’ ability or performance has been worse. This is probably the case directionally, but it’s important to explore by how much worse he has been. Is this an age-related decline that can help inform future projections? Is the decline small enough to be considered luck? Do we expect a bounce back? and so on.
  2. The random nature of outcomes has made him look worse this season (or alternatively, really good in prior years). Using the beta distribution, we can calculate the standard deviation we might expect for each sample size. More shots mean more certainty in the outcome. Simulating seasons and treating each expected goal as a weighted coin flip accomplishes something similar. Understanding and quantifying this uncertainty is an important aspect of any analysis. As you can see below, there is a (small) chance Talbot is a completely average goalie who concedes as many goals as an average goalie would (his ‘true’ talent puts him along the grey vertical line) but just had some improbable results.

    Flipping weighted coins (pucks) many, many times
  3. Edmonton is giving up tougher chances against this year that go undetected by an xG model. They might be allowing more cross-ice passes, moving net-front players out of the front of the crease, or employing unfavourable strategies. How this might manifest itself at the team, goalie, or shooter level is discussed below.
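
The second point can be sketched in a few lines of Python. This is a minimal illustration rather than the code behind the charts: the uniform Beta(1, 1) prior, the function names, and the 1,500-shot season at a flat 0.06 xG per shot are all assumptions made for the example.

```python
import math
import random


def beta_sd(saves, goals):
    """Standard deviation of a Beta(saves + 1, goals + 1) distribution.

    The uniform Beta(1, 1) prior is an assumption made for this sketch.
    """
    a, b = saves + 1, goals + 1
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))


def simulate_goals(shot_xg, n_sims, rng):
    """Goals allowed by an average goalie, one weighted coin flip per shot."""
    return [sum(1 for p in shot_xg if rng.random() < p) for _ in range(n_sims)]


def share_at_least(simulated_goals, actual_goals):
    """Share of simulated seasons with at least `actual_goals` goals allowed."""
    hits = sum(1 for g in simulated_goals if g >= actual_goals)
    return hits / len(simulated_goals)


# Hypothetical season: 1,500 unblocked shots at 0.06 xG each (90 expected goals).
rng = random.Random(2018)
season_xg = [0.06] * 1500
sims = simulate_goals(season_xg, 1000, rng)
print("SD of save rate after 1,500 shots:", round(beta_sd(1410, 90), 4))
print("Share of average-goalie seasons with 100+ goals:", share_at_least(sims, 100))
```

The share printed on the last line is the simulated chance that a truly average goalie allows 100 or more goals on these shots, which is the kind of tail probability the chart conveys.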

One Night in Edmonton, A Deep-Dive

While east coasters, like I currently am, were out for New Year’s Eve 2017, Edmonton was hosting Winnipeg. NYE is overrated, but it had to be better than watching the Jets win 5-0. Talbot was in for 5 goals against, not even getting a mercy pull in the 3rd period to start planning the night.

Giving up 5 goals is almost certainly going to look bad next to the expected goals, and this game is no exception. Winnipeg had 52 unblocked shot attempts that totalled 3.3 xG. Removing the xG from 2 Talbot rebounds and replacing it with expected rebounds drops that to about 3 xG. However, this game demonstrates specific weaknesses of the model, and can also show how the model compensates for that lack of information.

The goal highlights are pretty much a list of things an expected goals model can miss:

  1. Goal 1: A 3-on-1 off the rush, with a pass from the middle of the ice to a man wide open at the side of the net. Unfortunately, play-by-play picks up no prior events and the xG model scores it as a seemingly low 0.1 xG, recognizing only a wrist shot from 9 ft away at a sharp angle. Frankly, this was a slam dunk: even if Talbot adjusted his depth to play the pass, the shooter had time and space in the low slot.
  2. Goal 2: A turnover turns into a cross-ice pass that is immediately deposited for a goal. Fortunately, the turnover is recorded, so the angular velocity can be calculated. This is scored as 0.19 xG, which seems low watching the highlight, but it has to be representative of all chances under these circumstances, and most aren’t executed so cleanly. Still probably low.
  3. Goal 3: A powerplay point shot is deflected. Both of these factors put this shot at 0.17 xG. This is decently accurate: the deflection had to skip off the ice and miss Talbot’s block, and if that point shot is taken 100 times, scoring on that deflection about 20 times sounds about right. Also note there’s some traffic, which in a perfect world would be quantified properly and incorporated in the xG model.
  4. Goal 4: A pass from the half wall to a man wide open in front of the net. This is scored as 0.16 xG, a deflection tight to the net after an offensive zone turnover. However, this is more of a pass than a true deflection; neither passer nor shooter was contested, and I would think that they would convert more than 16 times if given 100 opportunities.
  5. Goal 5: A rush play results in a cross-ice pass and a shot from the hashmarks. Talbot makes the original save, but the rebound is immediately deposited into the net. The rebound shot, changing angle rapidly, is scored as 0.43 xG. However, that rebound is conditional on Talbot not deflecting the puck into the corner, so the play is scored from the original shot: 0.15 xG on the original shot plus 0.02 xG on a potential rebound – a 6% chance of a rebound times the observed goal probability of a rebound, 27%.

A spreadsheet containing each shot against Talbot from this game and the relevant variables discussed above can be found here, or click the image below. The mean non-rebound shot is 0.057 xG, so the actual goals are scored as 2 to 7 times more likely to be goals than average shots – not bad.

Shot by shot xG

Backing the Bus Up

On the surface, the expected goals assigned to each of these goals are low. A lot of this is confirmation bias – we see the goal, so the probability of a goal must have been higher, but in reality, pucks bounce, goalies make great saves, and shots miss the net.

However, we’ve identified additional factors that, if properly quantified, would increase the probability of a goal for those specific shots. There are extremely exciting attempts to gather this data (specifically tapetotapetracker.com, released at the beginning of 2018, which if properly supported could reach the necessary scale, but still might require some untangling of biases, much like shot location[2]), though they are still in their infancy and certainly not real-time. Advanced video tracking software would accomplish the same thing, but is not necessarily public or comprehensive.

So what does the model do without that helpful information? In the specific goals discussed, the model is conservative, under-estimating the ‘true’ probability of a goal, but across all shots, the number of expected goals roughly equals the number of goals.[3] This means that impactful, but latent events, like cross-ice passes and screens, are effectively dispersed over all shots.

We assume an unscreened shot from the blueline will be saved 999 times out of 1000 (implying 0.001 xG), but the model, not knowing if there’s a screen or not, might assign an xG of 0.03, an average over all scenarios where there may or may not be a screen. This might increase to 0.05 xG if the shooting team is on the powerplay, or 0.08 xG if there’s a 2-person advantage, since the probability of a sufficient screen increases. Note that these model adjustments for powerplay shots are applied evenly to all shots on the powerplay, since the model is unable to determine which specific shots are more dangerous, though a powerplay indicator may be interacted with specific discrete factors (i.e. PP+1 – slap shot, PP+1 – wrist shot, etc).
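
The averaging described above can be written out directly. In this sketch only the 0.001 xG for an unscreened shot comes from the text; the 0.10 xG value for a screened point shot and the screen probabilities are hypothetical numbers chosen so the blended values land near the 0.03, 0.05 and 0.08 mentioned.

```python
def blended_xg(p_screen, xg_screened, xg_unscreened):
    """xG a model effectively assigns when it cannot observe the screen."""
    return p_screen * xg_screened + (1 - p_screen) * xg_unscreened


# Hypothetical inputs: a screened point shot worth 0.10 xG, an unscreened one 0.001 xG.
for label, p_screen in [("even strength", 0.3), ("5v4", 0.5), ("5v3", 0.8)]:
    print(label, round(blended_xg(p_screen, 0.10, 0.001), 3))
```

The model never sees the screen itself, so any team that screens more or less often than the assumed rate will be systematically mis-scored on these shots.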

This is fine if the probability of a screen (or other latent factors) is about even across all teams. However, it doesn’t take much video analysis to know that this is almost certainly not the case. All teams gameplan to generate traffic and cross-ice passing plays, but some have the personnel and talent to execute better than others, while some teams have the ability to counter those strategies better. Some teams will over or underperform their expected goals partially due to these latent variables.

Unfortunately, there isn’t necessarily the data available to quantify how much of that effect is repeatable at a team or player level. Some xG models do factor in individual player shooting talent. In my model, borrowing from Dawson Sprigings, shooting talent is represented by regressed shooting percentage indexed to player position. A shot from Erik Karlsson, who is more adept at controlling the puck and at setting up and shooting through screens, would have a higher xG than a similar shot from another defender. Shots from Shayne Gostisbehere might receive a similar (though smaller) adjustment if consistently aided by the excellent net-front play of Wayne Simmonds.
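
One common way to build a regressed, position-indexed shooting percentage is to shrink each player toward the positional average by blending in a fixed number of average shots. The sketch below shows that idea only; it is not necessarily the exact method used here or by Sprigings, and the `prior_shots` weight and the player totals are hypothetical.

```python
def regressed_shooting_index(goals, shots, position_avg, prior_shots):
    """Shooting percentage shrunk toward the positional average, indexed to it.

    `prior_shots` is a hypothetical shrinkage weight: the number of
    position-average shots blended in before trusting the player's own record.
    """
    regressed = (goals + prior_shots * position_avg) / (shots + prior_shots)
    return regressed / position_avg


# Hypothetical defender: 40 goals on 600 shots, positional average of 4%.
print(round(regressed_shooting_index(40, 600, 0.04, prior_shots=300), 2))
```

An index above 1 would nudge that shooter's xG up relative to a positional peer, and a player with no shots sits exactly at 1.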

So while a well-designed xG model can implicitly capture some effect of pre-shot movement and other latent factors, it’s safe to say team-level expected goal biases exist. By how much and how persistently is ultimately up for discussion.

Talbot’s 99 Problems

Focusing on Talbot again, the big thing holding him back is his performance while Edmonton is shorthanded. While his save percentage is about what we’d expect given his shot quality at even-strength, it’s an abysmal 5% below expected on the penalty-kill.

What kind of shots is Edmonton giving up? On the penalty kill, non-rebound shots are about 1% less likely to be saved than shots conceded by other penalty-kills (rebound shots are about 8% more dangerous, but some of that may be a function of Talbot himself, so let’s focus on non-rebound shots).

Distribution of Expected Goals

So it seems like Talbot is just poor while down a man. Let’s compare his penalty-kill numbers side by side with his past performance and that of his recently replaced backup, Laurent Brossoit.

For other goalie-season performance breakouts by strength/score/venue/danger, see: http://crowdscoutsports.com/hockey_g_compare.php

I’m assuming most readers see the same thing as me? Both goalies, after 7 collective seasons of being at least average on the penalty kill within a decent margin of error, aren’t even close this season. While there is a chance that this is a coincidence, there’s a better chance that systemic failings on the Edmonton penalty kill are leading to more dangerous chances than the expected goal model reflects.

None of this is shocking at this point. To some it reinforces the desire to focus on 5v5 goaltender save percentage: it’s easier to imagine team-level systemic shortcomings persisting on the penalty-kill, while at 5v5 they are more evenly distributed. The noise that 5v4 adds to our measures eventually weakens predictivity; however, over a short season the inclusion of that extra data offsets any troublesome team-level bias it may carry.

Much like team shot metrics wisely moved from ‘score-close’ to score-adjusted measures, it’s always desirable to use the entirety of data available. About a quarter of goals (23.5%) are scored on the man advantage (only 17% of the shots), and when we talk about measuring goaltender performance and finding advantages on the very slim margins, including that data becomes very important.

However, we’re still fairly certain there are latent team-level biases in our expected goal model, and Edmonton’s penalty-kill hints at that. But we don’t know to what extent, if at all.

Fixing the Penalty Kill

One approach to quantifying the relative impact of coaching and tactics is a fixed-effects model, basically holding constant the effect of a Todd McLellan penalty-kill or Jon Cooper powerplay as they relate to the probability of a goal being scored on any given shot. (Note: I begin to use team-level xG bias and coaching impact a little interchangeably here. I’m interested in team-level bias, but trying to get there by using coaches as a factor.)

This method is rather crude (and obviously not perfect) and wouldn’t necessarily differentiate between marginal goals that are the result of net-front play and those from effective pre-shot passing, though an inquiring team might break out discrete sequences they’re interested in, like how their penalty-kill performs on shots from the point or against 4 Forward-1 Defender powerplays. However, quantifying specific strategic factors like this is not necessarily easy to do, since both teams are simultaneously optimizing and countering each other’s tactics, so for simplicity, it’s probably preferable to consider the holistic penalty-kill or powerplay for now.

Creating fixed effects for a Todd McLellan penalty-kill will attempt to capture goal probability relative to other teams’ penalty-kills, presumably capturing some of the effect of pre-shot passes or screens. Because Cam Talbot has played the majority of the games for Edmonton recently, the current model is goaltender-agnostic to avoid potential multicollinearity, rather than assigning a numerical value to the quality of the goaltender playing. Hopefully this properly apportions systemic strengths or weaknesses to the penalty-kill variable rather than to Talbot or Brossoit specifically.

One limitation of this approach is the effect will be fixed for whatever time period we select, unable to capture tactical changes that may or may not have worked or changes in personnel available to the coach.

It’s also important to reiterate attempts are made to adjust for home rink scorer bias, but any of this uncaptured systemic bias would possibly be wrongfully attributed to coaching (on home-ice at least).
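
To make the fixed-effects idea concrete, here is a stripped-down sketch. It is not the model used for the results that follow: it assumes each shot already has a base xG, treats that xG as a fixed offset on the log-odds scale, and fits a single log-odds shift per coach. With only an offset and coach dummies, each coach's shift can be estimated separately. The coach labels and shot counts are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def coach_effect(shots, iterations=25):
    """Fit one log-odds shift for a coach, with base xG as a fixed offset.

    `shots` is a list of (base_xg, is_goal) pairs. Newton's method solves
    sum(is_goal - sigmoid(logit(base_xg) + beta)) = 0 for beta.
    """
    beta = 0.0
    for _ in range(iterations):
        preds = [sigmoid(logit(xg) + beta) for xg, _ in shots]
        gradient = sum(goal - p for (_, goal), p in zip(shots, preds))
        curvature = sum(p * (1 - p) for p in preds)
        beta += gradient / curvature
    return beta


def relative_danger(shots, beta):
    """Predicted goals with the coach effect, relative to base expected goals."""
    with_effect = sum(sigmoid(logit(xg) + beta) for xg, _ in shots)
    return with_effect / sum(xg for xg, _ in shots)


# Hypothetical penalty-kill samples: 1,000 shots at 0.10 base xG per coach.
shots_by_coach = {
    "coach_a": [(0.10, 1)] * 100 + [(0.10, 0)] * 900,  # goals match base xG
    "coach_b": [(0.10, 1)] * 120 + [(0.10, 0)] * 880,  # more goals than base xG
}
for coach, shots in shots_by_coach.items():
    beta = coach_effect(shots)
    print(coach, round(beta, 3), round(relative_danger(shots, beta), 3))
```

A relative danger above 1 means shots against that coach's penalty-kill go in more often than the base xG expects, which is the quantity that gets indexed to the league average.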

Model Results

Running the model on the 2016-17 and 2017-18 seasons, the coefficients can be transformed to represent the relative probability of a goal, then indexed to the league average. McLellan’s penalty-kill appears to concede shots about 2.4% more dangerous than the basic expected goal model might expect, ranking him 32nd out of 38 coaches over the 1.5-season period. Using just the most recent season, McLellan finds himself last, alongside Dave Hakstol, for conceding the toughest shots in the league while down a player. Cam Talbot’s save percentage on the penalty-kill is about 5% below his expected save percentage, but it appears about half of that (2.4%) might be attributable to McLellan’s tactics – which is a nice, clean compromise.

Preventative Measures

More important than an indictment of specific coaches (this just reflects shot quality; it doesn’t factor in shot quantity or counter-attack measures on the penalty-kill), this might provide a reasonable team-level margin of error for expected goal models. A distribution of results from the last 4 seasons suggests that coaches’ effect on even-strength shot quality is smaller per shot than on special teams by a factor of about 2 to 3. Some of this is definitely the result of smaller sample sizes, but it makes some intuitive sense – special teams are more reliant on tactics than relatively free-flowing even-strength play.

Distribution of impacts of coaches with 3 seasons coached in last 4 years

This provides a decent rule of thumb: an expected goal percentage might vary by about 0.2% at even strength and about 0.6% on special teams due to coaching and tactics. This, of course, is over a few seasons; over a shorter time frame that number might fluctuate more.

State            Absolute Mean    Standard Deviation
Even Strength    0.20%            0.30%
Special Teams    0.60%            0.80%

Final Thoughts

Model testing suggests that even a good projection of goaltender performance would have a healthy margin of error. Misses will happen, but they should be as small as possible and be accompanied by a margin of error.

Of course, misses can be a helpful guide to improvement, looking at things with a different perspective. Results like Edmonton’s penalty-kill can elicit a deep-dive.

None of this is meant to totally reallocate blame from Talbot to McLellan’s penalty-kill, but rather to explore how you might quantify that reallocation, and how that ties into a more general discussion of the limitations of shot quality metrics.

‘True’ shot quality will likely never be completely captured, but there will be incremental improvements, feeding incremental improvements in predicting future goaltender performance and other useful insights.

But the job of the analyst will be the same: miss as small as possible and quantify uncertainties. With a complex game like hockey and limited available data those uncertainties can be daunting, but hopefully this has illuminated some of the weaknesses of current xG models (and maybe revealed a few strengths) and how much of a concern those limitations are. Expected goals are a helpful tool and are part of the incremental improvement hockey analytics needs, but acknowledging their limitations will ultimately make them more powerful.

Thanks for reading! Goalie-season xG data is updated daily and can be downloaded or viewed in a goalie compare app. For any custom requests, ping me at @crowdscoutsprts or cole92anderson@gmail.com.

Code for this analysis was built off a scraper built by @36Hobbit which can be found at github.com/HarryShomer/Hockey-Scraper.

I also implement shot location adjustment outlined by Schuckers and Curro and adapted by @OilersNerdAlert. Any implementation issues are my fault.

[1] In my opinion, the marginal difference between goaltender skill is very tight at the NHL-level, making identifying and predicting performance very difficult, some might argue pointless. But important trends do manifest themselves over time, even if there are aberrations in smaller, but notable periods (like a season or playoff series).

[2] One of my concerns is trackers being biased toward certain players or teams, or worse yet, being more likely to record passing series that lead to goals. Imagine out-of-sample model scoring being ‘tipped off’ to predict a goal whenever a passing series exists for that shot; that data is not truly ‘unseen.’

[3] Note that sometimes I use expected goals and probabilities a little interchangeably. A 0.5 xG can be interpreted as a 50% chance of a goal for that particular shot, thanks to the properties of logistic regression. No shot is worth greater than 1 xG, much like no shot is certain to go in.

Goalie Points Above Expected (PAX)

Pictured: Dominik Hasek, who made 70 saves in a 1994 playoff game, beating the New Jersey Devils 1-0 in the 4th overtime. Hasek didn’t receive goal support for the equivalent of 2 full regulation games, but he won anyway. What is the probability of Hasek winning this game and what does it tell us about his contribution to winning?

A Chance to Win

I was lucky enough to attend (and later work at) the summer camps of Ian Clark, who went on to coach Luongo in Vancouver and most recently Bobrovsky in Columbus. Part of the instruction included diving into the mental side of the game. A simple motto that stuck with me was: “just give your team a chance to win.” You couldn’t do it all, and certainly couldn’t do it all at once, so it was helpful to focus on the task at hand.

You might give up a bad goal, have a bad period, or two or three, but if you can make the next save to keep things close, a win would absolve all transgressions. Conversely, you might play well, receive no goal support, and lose. Being a goalie leaves little in your control. The goal support a goalie receives is (largely[1]) independent of their ability and, outside of rebounds, so are most chances they face[2]. Pucks take improbable bounces (for and against) and 60 minutes is a very short referendum on who deserves to win or lose.

Think of being a hitter in baseball and seeing some mix of fastballs down the middle and absolute junk and the chance to demonstrate marginal ability relative to peers on every 20th pitch.

Smart analysis largely throws away what’s out of the goalie’s control, focusing on their ability to make saves. This casts wins, whatever they are worth, as only a team stat.

Taking a step back, there are two problems with this:

  • A central purpose of hockey analytics is to figure out and quantify what drives winning, and removing wins from the equation to focus on save efficiency feels like cruising through your math test and handing it in, only to realize you missed the last page. So close, yet so far.
  • Goalies, coaches, and fans primarily care about winning, so it’s illuminating to create a metric that reflects that. Aligning what’s measured and what matters can be helpful and interesting, and at the very least deserves some more advanced exploration.

What Matters

Analysis is at its strongest when we can isolate what is in the goaltender’s control, holding external factors constant the best we can. For example, some teams may give up more dangerous chances than others, so it is beneficial to adjust goaltender save metrics by something resembling aggregate shot quality, such as expected goals. Building on this we can evaluate a goaltender’s ability to win games as a function of the quality of chances they face and the goal support they receive.

To do this we can calculate the expected points based on the number of goals a team scores and the number of chances they give up. Because goalies are partially responsible for rebounds, we can strip out rebounds and replace them with less chaotic, more stable expected rebounds. The result is weighing every initial shot as a probability of a goal plus a probability of a rebound, converting expected rebounds to expected goals by using the historical shooting % on rebounds, 27%.

\(\text{Expected Goals Against}_{n} = \sum\limits_{i=1}^{n} \left[ P(\text{Goal})_{i} + 0.27 \times P(\text{Rebound})_{i} \right]\)
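
As a minimal sketch, the formula translates into a few lines of Python. The function name and the (P(goal), P(rebound)) pair layout are assumptions for illustration; the 0.27 rebound conversion rate is the figure quoted above.

```python
REBOUND_GOAL_RATE = 0.27  # historical shooting % on rebounds, from the text


def expected_goals_against(initial_shots, rebound_goal_rate=REBOUND_GOAL_RATE):
    """Sum P(goal) + rebound_goal_rate * P(rebound) over non-rebound shots.

    `initial_shots` is a list of (p_goal, p_rebound) pairs; observed rebound
    shots are assumed to have been dropped before this call.
    """
    return sum(p_goal + rebound_goal_rate * p_rebound
               for p_goal, p_rebound in initial_shots)


# A single shot with a 15% chance of a goal and a 6% chance of a rebound.
print(round(expected_goals_against([(0.15, 0.06)]), 4))
```

The example shot uses the same 15% goal and 6% rebound probabilities as the fifth Winnipeg goal discussed earlier, where the expected rebound added roughly 0.02 xG to the original shot.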

A visual representation of the interaction between these factors supports the expectation – scoring more goals and limiting chances (expected goals) against increases expected points gained. Summed to team-level this information could be used to create a Wins Threshold metric, identifying which goalies need to stand on their heads regularly to win games.

Novel concept

Goalie Points Above Expected Metric (PAX Goaltendana)

The expected points gained based on goal support and chances against will be used to compare to the actual points gained in games started by a goaltender. How does this look in practice? Earlier this season, on November 4th, Corey Crawford faced non-rebound shots that totaled 2.4 expected goals against, while Chicago only scored 1 goal in regulation. Simulating this scenario 1,000 times suggests that with an average goaltending performance Chicago could expect about 0.5 points (the average of all simulations, see below). However, Crawford pitched a shutout and Chicago won in regulation, earning Chicago 2 points. This suggests that Crawford’s performance was worth about 1.5 points to Chicago, or 1.5 Points Above Expected (PAX).
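
The Crawford example can be approximated with a short sketch. The shot-by-shot xG values are not given here, so the 30 shots at 0.08 xG each are a hypothetical mix that sums to 2.4, and the function names are invented for the example. Regulation ties are scored as 1.5 points, following the ground rules described later in this post. The exact calculation and the 1,000-game simulation should both land near 0.5 points.

```python
import random


def goals_against_distribution(shot_xg):
    """Exact distribution of goals against, one independent coin flip per shot."""
    dist = [1.0]  # dist[k] = probability of exactly k goals so far
    for p in shot_xg:
        new = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            new[k] += prob * (1 - p)
            new[k + 1] += prob * p
        dist = new
    return dist


def points(goals_for, goals_against, tie_points=1.5):
    """Regulation win = 2 points, regulation tie = 1.5 (assumed), loss = 0."""
    if goals_for > goals_against:
        return 2.0
    return tie_points if goals_for == goals_against else 0.0


def expected_points(shot_xg, goals_for):
    """Exact expected points given shot-level xG against and goal support."""
    dist = goals_against_distribution(shot_xg)
    return sum(prob * points(goals_for, ga) for ga, prob in enumerate(dist))


def simulated_points(shot_xg, goals_for, n_sims, rng):
    """Monte Carlo version of expected_points, averaged over simulated games."""
    total = 0.0
    for _ in range(n_sims):
        goals_against = sum(1 for p in shot_xg if rng.random() < p)
        total += points(goals_for, goals_against)
    return total / n_sims


# Hypothetical shot mix: 30 non-rebound shots at 0.08 xG each (2.4 xG in total).
crawford_shots = [0.08] * 30
exp_pts = expected_points(crawford_shots, goals_for=1)
sim_pts = simulated_points(crawford_shots, 1, 1000, random.Random(4))
print("expected points:", round(exp_pts, 2), "simulated:", round(sim_pts, 2))
print("PAX for a regulation win:", round(2 - exp_pts, 2))
```

PAX for a game is then simply the actual points earned minus the expected points.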

Recipe for success?

Tracking each of Crawford’s starts (ignoring relief efforts) game-by-game shows he’s delivered a few wins against the odds (dark green), while really only costing Chicago one game, against New Jersey (dark red).

Crawford’s War March

The biggest steal of the 2017-18 season so far using this framework? Curtis McElhinney on December 10th faced Edmonton shots worth about 5 expected goals (!) and received 1 goal in support. A team might expect about 0.06 points under usual circumstances, but McElhinney pitched a shutout and Toronto got the 2 points.

The Art of the Steal

Other notable performances this season are a mixed bag of big names and backups.

Goalie             Date      Opponent  Expected GA  Goal Support  Expected Points  Actual GA  Actual Points  PAX
CURTIS MCELHINNEY  12/10/17  EDM       5.07         1             0.06             0          2              1.94
CORY SCHNEIDER     11/1/17   VAN       3.78         1             0.17             0          2              1.83
AARON DELL         11/11/17  VAN       3.18         1             0.27             0          2              1.73
TRISTAN JARRY      11/2/17   CGY       2.93         1             0.33             0          2              1.67
ANTON KHUDOBIN     10/26/17  S.J       4.05         2             0.37             1          2              1.64
CAREY PRICE        11/27/17  CBJ       4.12         2             0.37             1          2              1.64
MICHAL NEUVIRTH    11/2/17   STL       2.66         1             0.38             0          2              1.62
SERGEI BOBROVSKY   12/9/17   ARI       2.72         1             0.39             0          2              1.61
PEKKA RINNE        12/16/17  CGY       3.72         2             0.42             0          2              1.58
ROBERTO LUONGO     11/16/17  S.J       2.49         1             0.42             0          2              1.58

Summing to a season level reveals which goalies have won more than expected. Goalies above the diagonal line (where points gained = points expected) have delivered positive PAX; goalies below the line have negative PAX.

The Name of the Game

Ground Rules

For simplicity, games that go to overtime will be treated as worth 1.5 points to each team, reflecting the less certain nature of the short 3-on-3 overtime and shootout. This removes the higher probability of a goal and quality chances against associated with overtime, which is slightly confounding[3], bringing the focus to regulation-time goal support.

This brings up an assumption the analysis originally builds on – that goal support is independent of goaltender performance. We know that score effects suggest a team that is trailing will likely generate more shots and as a result is slightly more likely to score. A bad goal against might create a knock-on effect where the goaltender receives additional goal support. While it is possible that goaltender performance and goal support aren’t completely independent (as we might expect in a complex system like hockey), the effect is likely very marginal. But in this scenario a win would be considered more probable, further discrediting any potential win or loss. Generally, the relationship between goaltender performance and goal support is weak to non-existent.

No gifts under the Christmas tree

However, great puckhandling goalies might directly or indirectly aid their own goal support by helping the transition out of their zone, keeping their defensemen from extra contact, and other actions largely uncaptured by publicly available data. Piecemeal analysis suggests goalies have little ability to help create offense, but absence of evidence does not equal evidence of absence. This will have to be an assumption the analysis lives with[4]; any boost to goal support would likely be very small.

Taking the Leap – Icarus?

The goal here is to measure what matters: direct contributions to winning. This framework ties together the accepted notion that the best way for a goaltender to help his team win is to make more saves than expected with the contested idea that some are more likely to make those saves in high-leverage situations than others, albeit in an indirect way. To most analysts, being clutch or being a choker are just random processes with some narrative applied.

However, once again, absence of evidence does not equal evidence of absence[5]. I imagine advanced biometrics might reveal that some players experience a sharper rise in stress hormones, which might affect performance (positively or negatively), during a tie game than if down by a handful of goals. I know I felt it at times, but would have difficulty quantifying its marginal effect on performance, if any. A macro study across all goalies would likely be inconclusive as well. Remember NHL goalies are a sample of the best in the world; those wired weakly very likely didn’t make it (like me).

But winning is important, so it is worth making the jump from puck-stopping ability to game-winning ability. The tradeoff (there’s always tradeoffs) is we lose sample size by a factor of about 30, since the unit of measure is now a game, rather than a shot. This invites less stable results if a game or two have lucky or improbable outcomes. On the other hand, it builds in the possibility some guys are able to raise their level of play based on the situation, rewarding a relatively small number of timely saves, while ignoring goals against when the game was all but decided. I can think of a few games that got out of control where the ‘normal circumstances’ an expected goals model assumes begin to break down.

Low leverage game situation, high leverage franchise situation

Winning DNA?

All hockey followers know goalies can go into brick-wall mode and win games by themselves. The best goalies do it more often, but is it a more distinguishable skill than the raw ability to prevent goals? Remember, we are chasing the enigmatic concept of clutch-ness or ability to win at the expense of sample size, threatening statistically significant measures that give analysis legs.

To test this we can split each goalie season into random halves and calculate PAX in each random split, looking at the correlation between each split. For example, goalie A might have 20 of their games with a total PAX of 5 end up in ‘split 1’ and their other 20 games with a PAX of 3 in ‘split 2.’ Doing this for each goalie season we can look at the correlations between the 2 splits.[6]

Using goalie games from 2009 – 2017 we randomly split each goalie season 1,000 times at minimum game cutoffs ranging from 20 to 50,[7] checking the Pearson correlation between each random split. Correlations consistently above 0 suggest the metric has some stability and contains a non-random signal. As a baseline we can compare to the intra-season correlation of a save efficiency metric, goals prevented over expected, which has the advantage of being a shot-level split.
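
A rough sketch of this split-half test is below. It is a simplified stand-in for the actual analysis: the function names are invented, the synthetic data (40 goalie seasons of 40 games, a small skill term plus game-level noise) is hypothetical, and PAX is totalled within each half as in the goalie A example.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def split_half_correlation(game_pax_by_season, min_games, rng):
    """Randomly halve each goalie season and correlate total PAX across halves."""
    first, second = [], []
    for game_pax in game_pax_by_season.values():
        if len(game_pax) < min_games:
            continue  # skip seasons below the games cutoff
        shuffled = list(game_pax)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        first.append(sum(shuffled[:half]))
        second.append(sum(shuffled[half:]))
    return pearson(first, second)


def mean_split_half_correlation(game_pax_by_season, min_games, n_splits, seed=0):
    """Average the split-half correlation over many random splits."""
    rng = random.Random(seed)
    total = sum(split_half_correlation(game_pax_by_season, min_games, rng)
                for _ in range(n_splits))
    return total / n_splits


# Hypothetical data: 40 goalie seasons, 40 games each, small skill plus game noise.
data_rng = random.Random(7)
seasons = {}
for i in range(40):
    skill = data_rng.gauss(0, 0.1)
    seasons[f"goalie_{i}"] = [skill + data_rng.gauss(0, 0.7) for _ in range(40)]
print(round(mean_split_half_correlation(seasons, 20, 1000), 3))
```

An average correlation that stays above 0 across many random splits is what would indicate a repeatable signal; pure noise would hover around 0.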

The test reveals that goals prevented per shot carries relatively more signal, which was expected. However, the wins metric also contains stability, losing relative power as sample size drops.

Winning on the Reg

Goalies that contribute points above expected in a random handful of games in any given season are more likely to do the same in their other games. Not only does a wins based metric make sense to the soul, statistical testing suggests it carries some repeatable skill.

Final Buzzer

Goalie wins as an absolute number are a fairly weak measure of talent, but they do contain valuable information. Like most analyses, if we can provide the necessary context (goal support and chances against) and apply fair statistical testing, we can begin to learn more about what drives wins. While the measure isn’t vastly superior to save efficiency, it does contain some decent signal.

Exploring goaltender win contributions with more advanced methods is important. Wins are the bottom line, they drive franchise decisions, and frame the narrative around teams and athletes. Smart deep dives may be able to identify cases in which poor win-loss records are bad luck and cases which have more serious underlying causes.

A quick look at season-level total goals prevented and PAX (the metrics we compared above) shows an additional goal prevented is worth about 0.37 points in the standings, which is supported by the 3-1-1 rule of thumb, or more precisely, 2.73 goals per point calculated in Vollman’s Hockey Abstract. Goal prevention explains about 0.69 of the variance in PAX, so the other 0.31 of the variance may include randomness and (in)ability to win. Saves are still the best way to deliver wins, but there’s more to the story.
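As a quick arithmetic check on those figures, here is a small sketch. The constant and function names are made up for illustration; the only inputs are the numbers quoted above.

```python
GOALS_PER_POINT = 2.73          # Vollman's figure quoted above
POINTS_PER_GOAL_ESTIMATE = 0.37 # season-level estimate quoted above

# Inverting goals-per-point gives points-per-goal, about 0.366.
points_per_goal_from_vollman = 1 / GOALS_PER_POINT

def points_from_goals_prevented(goals_prevented, points_per_goal=POINTS_PER_GOAL_ESTIMATE):
    """Rough standings points implied by a number of goals prevented."""
    return goals_prevented * points_per_goal

# Share of PAX variance not explained by goal prevention.
unexplained_share = 1 - 0.69

print(round(points_per_goal_from_vollman, 3))
print(points_from_goals_prevented(10))
```

So a goalie who prevents 10 goals over a season would be expected to add somewhere around 3.7 points, with the caveat that roughly a third of the variance in PAX is left unexplained by goal prevention alone.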

Save saves for when they matter?


When I was a goalie, it was helpful to constantly reaffirm my job: give my team a chance to win. I couldn’t score goals, I couldn’t force teams to take shots favorable to me, so removing that big W from the equation helped me focus on what I could control: maximizing the probability of winning regardless of the circumstances.

This is what matters to goalies: their contribution to wins. Saves are great, but a lot of them could be made by a floating chest protector. While the current iteration of the ‘Goalie Points Above Expected’ metric isn’t perfect, hopefully it is enlightening. Goalies flip game probabilities on their head all the time, and creating a metric to capture that information is an important step in figuring out what drives those wins.

Thanks for reading! I hope to make data publicly available and/or host an app for reference.  Any custom requests ping me at @crowdscoutsprts or cole92anderson@gmail.com.

Code for this analysis was built off a scraper built by @36Hobbit which can be found at github.com/HarryShomer/Hockey-Scraper.

I also implement shot location adjustment outlined by Schuckers and Curro and adapted by @OilersNerdAlert. Any implementation issues are my fault.

My code for this and other analyses can be found on my Github, including the feature generation and modeling of current xG and xRebound models and PAX calculations.


[1] I personally averaged 1 point/season, so this assumption doesn’t always hold.

[2] Adequately screaming at defensemen to cover the slot or third forwards to stay high in the offensive zone is also assumed.

[3] If a goalie makes a huge save late in a tie game and subsequently wins in overtime, the overtime goal was conditional on the play of the goalie, making the win (with an extra goal in support) look easier than it would have otherwise.

[4] Despite it partially delegitimizing my offensive production in college.

[5] Hockey analysts can look to baseball for how advanced analysis aided by more granular data can begin to lend credence to concepts that had been dismissed as an intangible or randomness explained by a narrative.

[6] Note that the split of PAX is at the game-level, which makes it kind of clunky. Splitting randomly will mean some splits will have more or fewer games, possibly making it tougher to find a significant correlation. This isn’t really a concern with thousands of shots.

[7] The ugly truth is that an analyst with a point to prove could easily show a strong result for their metric by finding a friendly combination of random split and minimum games threshold. So let’s test and report all combinations.

Rebound Control

Pictured: Marc-Andre Fleury makes an amazing save at the end of game 7 to win the 2009 Stanley Cup, moments after giving up a rebound. Did he need to make this dramatic save? Should he be credited for it? Looking at the probability of a rebound on the original shot can help lend context.

A few years ago I was a seasoned collegiate goaltender and a raw undergrad Economics major. This was a dangerous combination. When my save percentage fell from something that was frankly pretty good to below average, I turned to an overly theoretical model to help explain this slip in measured performance, for my own peace of mind and general curiosity. The goal was to measure goaltending performance by controlling for the things out of the goalie’s control, like team defense. Specifically, this framework would properly account for shot quality (of course) and adjust for rebounds, by not giving goalies credit for saves made on preventable rebounds. The former considers things out of the goalie’s control, the latter considers what is actually in the goalie’s control. Discussing the model with my professor, it was soon clear that I had included a lot of components that didn’t have available data, such as pre-shot puck movement and/or some sort of traffic index. However, this hasn’t stopped analysts, including myself, from creating expected goals models with the data available publicly. But a public and comprehensive expected goal model remains elusive.

Despite its imperfections, measuring goaltender performance with expected goals is an improvement over raw save percentage and is gaining some traction. However, rebounds as they relate to a comprehensive goaltending metric have garnered less research. Prior rebound work by Robb Pettapiece, Matt Cane, and Michael Schuckers suggests preventing rebounds is not a highly repeatable skill, though focusing on pucks frozen might contain more signal. Building on some of these concepts I hope to give rebound rates some more context by attempting to predict them and explore their effect on a comprehensive goaltending metric consistent with my 2017 RITHAC presentation.[1]

Unfortunately there is nothing to tell us whether a rebound is a “bad” or preventable rebound, so my solution was to create an expected rebounds model using the same framework used to develop an expected goals model. The goal is the same: compare observed goals and rebounds to what we would expect a league average goaltender to surrender, controlling for as much as we can.

Defense Independent Performance

One of the first iterations and applications of an expected goals model was Michael Schuckers’ Defense Independent Goalie Rating (DIGR). This framework has been borrowed by other analysts, myself included. The idea is that the shots goalies face are largely out of their control: they can’t help it if they face 3 breakaways in a period or Ovechkin one-timers from the slot. However, goalies can assert some control over rebounds. How much, and whether this makes a difference, is something we will explore.

Regardless of the outcome of the analysis, logic would suggest we discount credit we give goaltenders for facing shots that they could have or should have prevented. Bad rebounds that turn into great saves should be evaluated from the original shot, rather than taking any follow-up shots as a given.

Luoooooong rebound

Rebounds Carry Weight

It’s important to note that rebound shots result in a higher observed probability of a goal, which makes sense, and expected goal models generally reflect this. However, this disproportionate share of expected goals can be confounding when ‘crediting’ a goalie for a rebound opportunity against when it could have been prevented. Looking at my own expected goal model, rebounds account for about 3.2% of all shots, but 13% of total expected goals. This ratio of rebounds being about 4 times as dangerous is supported by observed data as well. Shooting percentage on rebounds is about 27%, while it is 5.8% on original shots.

In the clip above and using hypothetical numbers, Luongo (one of my favorite goalies, so not picking on him here) gives up a bad rebound on a wrist shot from just inside the blueline, with an expected goal (xG) value of ~3%, but the rebound shot, due to the calculated angular velocity of the puck results in a goal historically ~30% of the time. Should this play be scored as Luongo preventing about 1/3 of a goal (~3% + ~30%)[2]?

What if I told you the original shot resulted in a rebound ~2% of the time and that the average rebound is converted to a goal ~25% of the time? Wouldn’t it make more sense to ignore the theatrical rebound save and focus in on the original shot? That’s why I’d rather calculate that Luongo faced a 3.5% chance of a goal, rather than ~33% chance of goal. An xG of 3.5% is based on the  3% of the original shot going in PLUS 0.5% chance of a rebound going in (2% chance of rebound times ~25% chance of goal conditional on rebound), and no goal was scored.

Method              xG Saved/Prevented  Goals Given Up  Total xG Faced  xG 1st Shot  xG 2nd Shot  Calculation Method
Raw xG Calculation  33.0%               0               33.0%           3%           30%          Historical probability of goal *given* rebound occurred
Rebound Adjusted    3.5%                0               3.5%            3%           0.5%         P(Goal) = P(Goal | Rebound) * P(Rebound)

0.5% = 25% * 2%
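The two rows of the table can be reproduced with a few lines of arithmetic. This is only a sketch using the hypothetical Luongo numbers from the text; the function names are invented for illustration.

```python
def raw_play_xg(xg_first_shot, xg_rebound_shot):
    """'Raw' approach: add the xG of the rebound shot as if it were a given."""
    return xg_first_shot + xg_rebound_shot

def rebound_adjusted_xg(xg_first_shot, p_rebound, p_goal_given_rebound):
    """Adjusted approach: original shot xG plus the chance that the shot
    produces a rebound AND that rebound is converted."""
    return xg_first_shot + p_rebound * p_goal_given_rebound

raw = raw_play_xg(0.03, 0.30)                      # 3% + 30%
adjusted = rebound_adjusted_xg(0.03, 0.02, 0.25)   # 3% + (2% * 25%)

print(round(raw, 3), round(adjusted, 3))
```

Under the raw calculation the goalie is credited with facing about a third of a goal on the play; under the adjusted calculation the same play is worth 3.5% of a goal, because the rebound is treated as something an average goalie gives up only about 2% of the time on that kind of shot.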

Removing Credit Where Credit Isn’t Due

So as not to give goaltenders credit for saves made on ‘bad’ rebound shots, we can do the following:

  1. Strip out all xG on shots immediately after a rebound (acknowledging the actual goals that occur on any rebounds, of course)
  2. Assign a probability of a rebound to each shot
  3. Convert the probability of a rebound to a probability of a goal (xG) by multiplying the expected rebound (xRebound) by the probability of a goal on rebound shots, about 27%. This punishes ‘bad’ or preventable rebounds more than shots more likely to result in rebounds. Using similar logic to an expected goals model, some goalies might face shots more likely to become rebounds than others. By converting expected rebounds (xRebounds) to xG, we still expect the total number of expected goals to equal the total number of actual goals scored even after removing xG from rebounds.
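The three steps above might look something like this when applied to a table of shots. It is a minimal sketch: the dictionary keys (`xg`, `x_rebound`, `is_rebound`, `goal`) and the toy values are assumptions for illustration, not the fields or numbers in my data.

```python
REBOUND_GOAL_RATE = 0.27  # observed goal rate on rebound shots, from the text

def adjusted_xg(shot):
    """Rebound-adjusted xG for a single shot.

    Step 1: shots that are themselves rebounds carry no xG credit.
    Steps 2-3: other shots get their own xG plus xRebound * 27%.
    """
    if shot["is_rebound"]:
        return 0.0
    return shot["xg"] + shot["x_rebound"] * REBOUND_GOAL_RATE

def goals_prevented(shots):
    """Adjusted expected goals minus actual goals (rebound goals still count)."""
    expected = sum(adjusted_xg(s) for s in shots)
    actual = sum(s["goal"] for s in shots)
    return expected - actual

# Hypothetical shots faced by one goalie.
toy_shots = [
    {"xg": 0.03, "x_rebound": 0.02, "is_rebound": False, "goal": 0},
    {"xg": 0.30, "x_rebound": 0.05, "is_rebound": True,  "goal": 0},
    {"xg": 0.10, "x_rebound": 0.04, "is_rebound": False, "goal": 1},
]

toy_expected = sum(adjusted_xg(s) for s in toy_shots)
toy_prevented = goals_prevented(toy_shots)
print(round(toy_expected, 4), round(toy_prevented, 4))
```

Note that the second shot, a rebound, contributes nothing to expected goals here, but a goal scored on it would still be counted against the goalie.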

To do this we can create a rebound probability model using logistic regression and a similar set of features as an xG model. My most recent model has an out-of-sample area under the ROC curve of 0.68, where 0.50 is random guessing (or assuming every shot has a 3.2% chance of rebound, which is the historical rate). Compare this to the current xG model’s out-of-sample ROC AUC of 0.78, suggesting rebounds are tougher to reliably predict than goals (and we’re not sure there either). A weak rebound model is fine, reflecting the idea that any given shot has some probability of turning into a dangerous rebound, maybe through a bad bounce or goaltender mishap or fortunate forward; we just have a tough time knowing when.
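To make the 0.50 baseline concrete, here is a small pure-Python sketch of ROC AUC computed by comparing every rebound shot with every non-rebound shot. The function name and the toy labels and scores are hypothetical; in practice a library routine would be used on the real out-of-sample predictions.

```python
def roc_auc(labels, scores):
    """Probability that a randomly chosen positive gets a higher score
    than a randomly chosen negative (ties count as half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# 1 = shot produced a rebound, 0 = it did not (toy data).
toy_labels = [0, 0, 1, 0, 1]

# Giving every shot the historical 3.2% rate cannot rank shots at all.
constant_auc = roc_auc(toy_labels, [0.032] * 5)

# A model that ranks most, but not all, rebound shots above the rest.
model_auc = roc_auc(toy_labels, [0.01, 0.04, 0.05, 0.02, 0.03])

print(constant_auc, round(model_auc, 3))
```

The constant prediction scores exactly 0.5 because every comparison is a tie, which is why a flat 3.2% assumption and random guessing are equivalent by this measure.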

This does make some sense though. Unlike goals, where the target is very clear (put the puck in the net), rebounds are less straightforward: they require the puck to hit the goalie and find an opposing player’s stick before the defense can knock it away. Some defensemen might be able to generate rebounds from point shots more than random, but despite what they might tell you after the fact, players are generally trying to score on the original shot, not create a rebound specifically.

It is also true that goals are targeted, defined events (the game stops, lights go on, goalie feels shame, and the score keeper records it), whereas rebounds escape an obvious definition. Hockey analytics have generally used shots <= 2 seconds from the shot prior, so let’s explore the data behind that reasoning now.

Quickly: What is a rebound?

It’s important to go back and establish what a rebound actually is, without the benefit of watching every shot from every game. We would expect the average shot off of a rebound to have a higher chance of being a goal than a non-rebound shot (all else being equal) since we know the goalie has less time to be able to get set for the shot. And just hypothesizing, it probably takes the goalie and defenders a couple seconds to recover from a rebound. To test the ‘time since last shot’ hypothesis, we can look in the data to see where the observed probability of a goal begins to normalize.

Strike while the iron is hot

Shots within 2 seconds of the original shot are considerably more likely to result in goals than shots otherwise. There is some effect at a 3 second lag, and certainly some slow-fingered shot recorders around the league might miss a ‘real’ rebound here and there, but the naive classifier of 0-2 seconds between shots is probably the best we can do with limited public data. At 3 seconds, we have lost about half of the effect.
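The naive 0-2 second classifier is simple enough to sketch. The function name and input format below are assumptions: it takes the game-clock times (in seconds) of shots against one goalie within a single period, already sorted, and ignores details a real implementation would need to handle, such as stoppages between shots.

```python
REBOUND_WINDOW_SECONDS = 2

def flag_rebounds(shot_times, window=REBOUND_WINDOW_SECONDS):
    """Flag each shot as a rebound if it comes within `window` seconds
    of the previous shot. The first shot can never be a rebound."""
    flags = []
    previous = None
    for t in shot_times:
        flags.append(previous is not None and 0 <= t - previous <= window)
        previous = t
    return flags

# Toy shot times in seconds: gaps of 1, 3, 146, and 2 seconds.
toy_flags = flag_rebounds([100, 101, 104, 250, 252])
print(toy_flags)
```

Changing `window` to 3 would also capture the third shot in this toy sequence, which is the tradeoff discussed above: a wider window catches more slowly recorded rebounds but dilutes the effect.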

Model Results

Can your favorite goalie prevent rebounds compared to what would be expected? If so, great: they will be credited with excess xG (xRebounds multiplied by the observed probability of rebound goals, 27%) without having to face a bunch of chaotic and dangerous rebound shots. If they give up more rebounds than average, their xG won’t be inflated by a bunch of juicy rebounds, but rather replaced by a more modest xG amount indicative of league average goaltending considering what we know about the shots they’re facing.

Which goalies are best at consistently preventing rebounds according to the model? Looking at expected rebound rates compared to actual rebound rates (below) suggests maybe Pekka Rinne, Petr Mrazek, and Tuukka Rask have a claim at consistently being able to prevent rebounds. Rinne has been well documented to have standout rebound control, so we are at least directionally reaching the same conclusions as prior analyses and observations. However, adding error bars consistent with +/- 2 standard deviations dulls this claim a little.

Rebounds happen to everyone…

Generally, the number of rebounds given up by a goalie over the season loosely reflects what the model predicts. The ends of the spectrum are Rinne with great rebound control in 2011-12 and Marc-Andre Fleury giving up almost 40 more rebounds than expected in 2016-17. Interestingly, Pittsburgh had some of the worst xGA/60 metrics in the league that year and ended up winning the Cup anyway. High rebound rates by both goalies (Murray’s rebound rate was about 1% higher than expected himself) definitely contributed to the high xGA/60 number, perhaps making their defense look worse than it was.

..some more than others

Goal Probability Assumptions

I’ll admit we’re making a pretty big assumption that if an errant puck is controlled and a rebound shot is taken, the probability of a goal will be 27%. Maybe some goalies are better at consistently making rebound saves than other goalies, either through skill or ability to put rebounds in relatively low danger areas. The plot below shows observed goal % (1 – save %) on rebound shots, with +/- standard deviation error bars, for goalies with at least 5 seasons since 2010-11.

Good Luck

Devan Dubnyk and Carey Price have been consistent in conceding fewer than 27% (the average for the entire sample) of rebound shots as goals. However, considering the standard deviation we can expect from this distribution given the sample size, this may not be ‘skill.’ It’s also important to explore if their rebound shots are less dangerous than average, whether due to skill, luck, or team defensive structure. This appears to be the case, when adjusted for the xG model, they perform about as well as the model predicts in some seasons, and exceed it in others. Certainly not by enough to suggest their rebounds should be treated any differently going forward.
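To get a feel for how wide that expected spread is, here is a sketch using the usual binomial standard deviation of a proportion. The 100 rebound shots per season is a made-up round number for illustration, not a figure from the data.

```python
import math

def proportion_std(p, n):
    """Standard deviation of an observed proportion over n independent
    trials when the true rate is p."""
    return math.sqrt(p * (1 - p) / n)

LEAGUE_REBOUND_GOAL_RATE = 0.27
hypothetical_rebound_shots = 100  # assumption for illustration

std = proportion_std(LEAGUE_REBOUND_GOAL_RATE, hypothetical_rebound_shots)
low = LEAGUE_REBOUND_GOAL_RATE - 2 * std
high = LEAGUE_REBOUND_GOAL_RATE + 2 * std

print(round(std, 3), round(low, 3), round(high, 3))
```

With a sample that small, a perfectly average goalie could plausibly post a rebound goal rate anywhere from roughly 18% to 36% in a season, which is why a few seasons under 27% is not strong evidence of skill on its own.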

Looking at intra-goalie performance correlation supports the idea that making saves on rebounds is a less repeatable skill than making saves on original shots. From 2014-2017, splitting each goalie’s shots faced into random halves, the correlation between split 1 performance and split 2 performance is about 0.43. On rebound shots, this correlation falls to 0.24, suggesting that there is considerably less signal. While there is some repeatable skill, it’s not enough to treat any goalies differently in our model post-rebound due to remarkable ability (or inability) to make saves on rebounds.

Turn Up the Luck

Controlling Rebounds, Summary

To reiterate, the problem:

  • Expected goal models are valuable in measuring goaltending performance, but rebounds are responsible for a disproportionate share of expected goals, which the goalie has some control over.

My solution:

  • Remove all expected goals credited to the goalie on rebound shots.
  • Develop a logistic regression model predicting rebounds, the output of which can be interpreted as each shot’s probability of a rebound.
  • Explore goalie-level ability to make saves on rebound shots, to support the assumption that 27% of rebound shots will result in a goal, regardless of goalie.
  • Replace ‘raw’ expected goals with an expected goal amount based on the probability of goal PLUS probability of a rebound shot multiplied by the historical observed goal % on rebound shots (27%), considering initial, non-rebound shots only.

Finally, it’s important to ask: does this framework help predict future performance? Or is it just extra work for nothing?

The answer appears to be yes, it helps. My RITHAC work attempted to project future goaltender performance by testing different combinations of metrics (xG raw, xG adjusted for rebounds, xG with a Bayesian application, raw save %) and parameters (age regressors, Bayesian priors, lookback seasons). Back testing past seasons, the metrics adjusted for rebounds performed better than the same metrics using a raw expected goal metric as their foundation.

Corralling Rebounds

This supports the idea that rebounds, particularly in expected goals models, can confound goaltender analysis by crediting goaltenders disproportionately for chances that they have some control over. In order to reward goalies for controlling rebounds and limiting subsequent chances, goalies can be measured against the amount of goals AND rebounds a league average goalie would concede – which is truer to the goal of creating a metric that controls for team defense and focuses on goaltender performance independent of team quality. Layering in this rebound adjustment increases the predictive power of expected goal metrics.

The limitations of this analysis include the unsatisfactory definition of a rebound and the need for an expected rebound model (alternatively, a naive assumption that 3.2% of shot attempts result in rebounds can be used). Another layer of complexity might lose some fans and fanalysts. But initial testing suggests that the rebound adjustment adds enough incremental predictive power to justify its inclusion in advanced goaltending analysis, where the goal is to measure goaltender performance independent of team defense with the publicly available data.

But ask yourself, your coach, your goalie, whoever: should a goalie get credit for a save he makes on a rebound, if he should have controlled it? Probably not.

Thanks for reading! Goalie-season xRebound/Rebound data is updated often and can be downloaded. Any custom requests ping me at @crowdscoutsprts or cole92anderson@gmail.com.

Code for this analysis was built off a scraper built by @36Hobbit which can be found at github.com/HarryShomer/Hockey-Scraper.

I also implement shot location adjustment outlined by Schuckers and Curro and adapted by @OilersNerdAlert. Any implementation issues are my fault.

My code for this and other analyses can be found on my Github, including the feature generation and modeling of current xG and xRebound models.


[1] Pettapiece converted rebounds prevented to goals prevented, but with respect to rebound rate only and to my knowledge did not expand to build into a comprehensive performance metric. (http://nhlnumbers.com/2013/7/15/can-goalies-control-the-number-of-rebounds-they-allow)

[2] Rebound xG actually can’t be added to the original shot like this since we are basically saying the original shot has a 3% chance of going in, so the rebound will only happen 97% of the time. The probability of the rebound goal in this case is 97% * 30%, or 29.1%. But for simplicity I’ll consider the entire play to be a goal 33% of the time. The original work and explainer by Danny Page: (https://medium.com/@dannypage/expected-goals-just-don-t-add-up-they-also-multiply-1dfd9b52c7d0)

Advanced Goaltending Metrics

Preamble: The following is a paper I wrote while in college about 6 years ago. It is very theoretical, without understanding the realities of data quality in the real world. However, it still reflects my general attitude toward how goaltending performance should be measured, manifesting itself in my current Expected Goals model.


How new metrics concerning hockey’s most important position can offer critical insights into goaltender performance, development, and value.



During the last 20 years, the goaltending position has changed more than any other position in hockey. Advances in equipment and training have raised the benchmark for expected goaltender performance. Teams promptly began investing in the position in the mid-90’s as a new breed of goaltender found success in the NHL. From 1994-2006 an average of almost 3 goaltenders were selected in the 1st round. Of these 37 highly touted goaltenders, none had won a Vezina trophy as of 2011. With this surprising lack of success, teams began to avoid using high draft picks on goaltenders—from 2007-2011 less than 1 goaltender was drafted in the 1st round annually.

Teams will continue to invest less in the goaltending position for a number of reasons. First, it is a matter of economics: the supply of good goaltenders has increased, decreasing their value. Initially, the demand for goaltenders drove their stock up, but teams eventually realized that they struggle to correctly value goaltending prospects. Subsequently, many of the league’s most successful goaltenders during this period were late round picks. Outside of the legendary Martin Brodeur, the last 3 Vezina trophy winners were drafted in the 5th, 5th, and 9th rounds. In fact, in the last decade the only goaltenders to make the NHL 1st or 2nd All-Star teams that were drafted in the 1st round were Roberto Luongo and, of course, Martin Brodeur. Lastly, goaltenders appear to mature later, which means teams want to invest less in them, especially considering the new Collective Bargaining Agreement allows players to become free agents earlier. In summary, there are more good goaltenders, they are generally incorrectly valued, and teams are hesitant to develop goaltenders through the draft, preferring high-priced, experienced goaltenders.

These factors create a unique opportunity for teams that can properly value goaltenders. Goaltending is still a critical part of any team, but it can be acquired without giving up valuable assets. Goaltenders are generally selected later in the draft, exchanged for less than their intrinsic value via trade, or require no assets to acquire through free agency and from waivers. Solid NHL goaltending should ideally come at a friendly cap hit, since the premium for the highest paid goaltenders is diminishing. Another trend is evident: some of the most successful teams are using strong backups throughout the regular season to complement their starters and gain a post-season advantage. Since the 2005 lockout the average Stanley Cup winning goalie has played fewer than 50 regular season games. Teams can no longer hope to find a franchise goaltender and maintain elite performance by locking them up to a rich, long-term contract without possessing the option of cheaper alternatives. The inability of teams to objectively understand the difference in performance between a goaltender with a $5 million salary and a $1.5 million salary is curious: goaltending is the only position in hockey where performance could be measured in a largely empirical way, analogous to how baseball has managed to successfully employ advanced metrics to better measure player performance. Teams that could use goaltending metrics that more accurately evaluate goaltenders would have an enormous advantage in acquiring and retaining elite level goaltending at an economical price.

The Estimated Save Percentage Index Model

The most common metric used to measure goaltending performance is save percentage, the number of saves as a percentage of total shots on goal. This metric is fundamentally flawed. To more accurately understand the quality of a particular goaltender, save percentage must be more sophisticated. This is possible because the goaltending position has two important prerequisites that make performance the most quantifiable in hockey. First, the result is absolute: any shot on goal is either stopped or results in a goal. Second, the position is passive: the difficulty to the goaltender is generally dictated by the game in front of him, except for rebound control and puck handling, which can be addressed later in the model.

The Expected Save Percentage (ES% Index) is a predictor of a goaltender’s success based on a number of inputs that assign a difficulty to each individual shot the goaltender faces. The inputs used in the model are shot location, puck visibility, and the rate at which the puck changes angle before or during the shot. The model assumes the goaltender has NHL quality blocking-width, positioning, lateral movement, and reflexes. Then, through an array of formulas, the model determines the expected save percentage for each shot on goal given the inputs. Once these expected save percentages are aggregated over a game, or over a season, we can see how the goaltender’s actual save percentage compares with the expected save percentage and compare them to their peers. The best goaltenders will consistently exceed the predicted save percentage whether they are facing 20 high quality shots or 40 lower quality shots. The Expected Save Percentage Index, the difference between real save percentage and expected save percentage, will measure the proficiency of the goaltender. The index can be tracked game-by-game and season-by-season. Since we are removing much of the fluctuation in team performance we will have a much better idea of a goaltender’s consistency, an attribute critical to NHL success that can be lost in the potentially misleading statistics that are currently employed.
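As a rough sketch of the aggregation step, assuming each shot already has an expected save probability attached (the per-shot formulas themselves are not shown, and the function name and toy numbers are hypothetical):

```python
def es_index(shots):
    """ES% Index: actual save percentage minus expected save percentage.

    shots: list of (expected_save_probability, was_saved) pairs.
    """
    if not shots:
        raise ValueError("need at least one shot")
    n = len(shots)
    actual_save_pct = sum(1 for _, saved in shots if saved) / n
    expected_save_pct = sum(p for p, _ in shots) / n
    return actual_save_pct - expected_save_pct

# Toy game: four shots, one goal allowed on the most difficult shot.
toy_game = [(0.95, True), (0.90, True), (0.80, False), (0.97, True)]
toy_index = es_index(toy_game)
print(round(toy_index, 3))
```

A positive index means the goaltender stopped more than the model expected given the shots faced; a negative one means fewer. Four shots is far too few to mean anything, which is why the index is meant to be tracked over games and seasons.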

The inputs have been selected for simplicity and versatility. The most obvious is shot location: the closer the shot, the more likely it will be a goal. Assuming the average NHL shot is about 90 miles/hour and an NHL goaltender has a reaction time of 0.11 seconds, the Expected Save Percentage increases greatly once the shot is from a distance of greater than 15 feet. Inside of 15 feet it assumes the goaltender can cover around 70%-80% of the net through size and positioning, and the distance model reflects this assumption. Location can also allow the model to determine the shot angle and net available to the shooter, two other factors that are automatically worked into the model. If applicable, visibility is a binary input determining whether the goaltender has a chance to see the puck. Again, since we are assuming NHL quality goaltending, there is no ‘half-screen’ or ‘distraction.’ If the goaltender has an opportunity to see the puck, they are expected to gain a sightline to the puck. If they are completely screened, the expected save percentage is lowered as a function of the net available when the shot is taken: the better the angle, the more dangerous the screen. Lastly, the model factors in the rate of the change in the angle of the puck when the shot is taken, if applicable. This way we can discount the expected save percentage if the shot is a one-timer, deke, passing play, or even a deflection to better reflect the difficulty of a shot against. The model assumes NHL quality lateral movement, edge control, and post save recovery. At lower levels, where puck movement is slower, goaltenders will have to put up higher real save percentages to maintain an ES% Index that predicts NHL skills.
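One way to see why a threshold near 15 feet falls out of those two assumptions is to work out how far the puck travels during the goaltender's reaction time. This is a back-of-the-envelope sketch with a hypothetical function name; it ignores the puck slowing down in flight.

```python
FEET_PER_MILE = 5280
SECONDS_PER_HOUR = 3600

def reaction_distance_feet(shot_speed_mph, reaction_time_s):
    """Distance the puck covers before the goaltender can begin to react."""
    feet_per_second = shot_speed_mph * FEET_PER_MILE / SECONDS_PER_HOUR
    return feet_per_second * reaction_time_s

# 90 mph is 132 feet per second, so 0.11 seconds is about 14.5 feet.
distance = reaction_distance_feet(90, 0.11)
print(round(distance, 2))
```

Inside roughly that distance the puck arrives before a reaction is possible, so the save depends on blocking and positioning rather than reflexes.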

These inputs create an admittedly arbitrary, yet sophisticated, expected save percentage. The formulas can be retrofitted as more data is collected to move closer to a universally accurate expected save percentage; ideally the median ES% Index would be 0. The data can then be broken up into three categories: shots with no screen or movement, shots that are screened, and shots where the puck is moving laterally as it is released. Breaking each shot into individual components will make it possible to track and eventually acquire objective data, replacing the placeholder formulas with actual NHL results. However, as it stands now, the expected save percentage is a benchmark, and it is the discrepancy between the realized and expected save percentage that will be the true measure of individual performance. Shot placement may seem like a troublesome omission from the model; however, since the model is built on aggregated averages we can account for the complete distribution of shots put on net. NHL quality defense generally takes away time and space from shooters, limiting their ability to place the puck wherever they desire. Teams are not necessarily inclined to give up shots in a particular place in the net, but weaker teams are prone to giving up shots from more dangerous locations on the ice. In this way shot placement is indirectly built into the expected save percentage: from 10 feet out, the shooter has a much greater chance of hitting a target, say high glove, than from 20 feet.

Win Contribution

The ES% Index measures goaltender performance in a vacuum, comparing actual performance to how we would expect him to perform in a given situation. However, the goaltender can influence the number of shots they face through rebound control and effective puck handling. Tracking these occurrences will allow the model to adjust the expected save percentage further. Easier than average shots that result in a rebound will lead to the successive shot not being factored into the model. This is analogous to saying the resulting shot should not have happened. Difficult shots that result in rebounds will take into consideration the difficulty of both shots when assigning expected save percentage to the potentially ‘preventable’ rebound shot. Whenever a goaltender handles the puck and it results in the puck directly clearing the zone, it will be assumed the goaltender prevented a shot a certain percentage of the time. By adding the potential shots and removing preventable shots to the actual shot total we will have a good idea of how the goaltender is helping their team and influencing the game.

With the expected save percentage and expected shots against, we can manufacture an expected goals against for each game. We can compare expected goals against to the goal support the goaltender received and determine whether or not the goaltender should have won the game. If the game should have been won based on the actual goals for and expected goals against, but was not, this will be a contributed loss. Conversely, if it was predicted the team should have lost, yet won, this will be a contributed win. So that we can remove the bias toward goaltenders on bad teams (who have more opportunity to register contributed wins), we can measure the number of potential contributed wins and losses and compare them to the actual contributed wins and losses.
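A minimal sketch of that bookkeeping follows. The function names, the tuple format, and the handling of the edge case where goals for exactly equals expected goals against (treated here as a predicted loss) are all assumptions; the paper does not specify them.

```python
def classify_game(goals_for, expected_goals_against, won):
    """Label a game as a contributed win, contributed loss, or as expected."""
    should_have_won = goals_for > expected_goals_against
    if should_have_won and not won:
        return "contributed loss"
    if not should_have_won and won:
        return "contributed win"
    return "as expected"

def contribution_rates(games):
    """Contributed wins and losses as a share of the opportunities for each.

    games: list of (goals_for, expected_goals_against, won) tuples.
    Returns (contributed_win_rate, contributed_loss_rate); a rate is None
    when the goalie had no opportunities of that kind.
    """
    labels = [classify_game(*g) for g in games]
    potential_wins = sum(1 for gf, xga, _ in games if not gf > xga)
    potential_losses = len(games) - potential_wins
    win_rate = labels.count("contributed win") / potential_wins if potential_wins else None
    loss_rate = labels.count("contributed loss") / potential_losses if potential_losses else None
    return win_rate, loss_rate

# Toy schedule: (goals for, expected goals against, won?)
toy_games = [(3, 2.1, True), (1, 2.8, True), (2, 3.0, False), (4, 2.5, False)]
toy_labels = [classify_game(*g) for g in toy_games]
toy_rates = contribution_rates(toy_games)
print(toy_labels, toy_rates)
```

Dividing by opportunities rather than games played is what removes the bias: a goalie on a weak team gets many more chances at a contributed win, so only the rate is comparable across teams.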

How does this model predict future goaltending performance?

This analysis allows an NHL team to gain a concise, quantified measurement of goaltending performance across leagues and time. It will more accurately identify goaltending proficiency and consistency. It can be adjusted from league to league as the goaltender advances and will better predict future success as the database grows. The model automatically assumes each goaltender has NHL size, speed, and positioning, so if the goaltender can consistently perform better than his peers, then they will likely continue to outperform them at higher levels. This can apply to a late round pick playing on a weak team in Europe or a college goaltender discredited for being on a strong defensive team. Since the ES% Index can be broken into components—stationary shots, screened shots, and moving shots—it will be easy to identify weaknesses that may be hidden by a specific team. For example, a goalie with poor lateral movement on a team that limits puck movement might perform well by traditional standards, but if the ES% Index on shots with puck movement is below average, chances are they will be exposed at the next level. There is a very real advantage to employing increasingly accurate goaltending metrics that other teams are not using to value goaltenders. It can also be broken up into individual components lending itself to the in-depth analysis of goaltending prospects, opposition goaltenders, and even the performance of other players on the ice. While the ES% Index will likely have limitations, predicting the development and value of goaltenders has not improved during an era when the quality of goaltending has increased dramatically. Therefore, a more accurate metric will almost certainly improve the valuation of each goaltender and offer critical insights into their development.

Other Considerations

While advanced goaltending metrics can aid management decisions, they can also lend coaches a helpful perspective when preparing for games. The objective ES% Index will help explain some of the volatility in goaltender performance. Coaches do not always understand the subtleties of the position; their only concern lies in the proficiency of the goaltender in preventing goals, which is exactly the intent of the ES% Index. It can also be used as a pre-scout for opposing goaltenders. Situational success rates for each NHL goalie are tracked through the season, offering a strategic advantage to the coaching staff and players. If an otherwise successful goaltender is performing below the norm on shots with puck movement, then this is a clear indication to move the puck before shooting. Ability can be judged based on data from an entire season rather than anecdotal observations. This is advantageous because the goaltending position is inconsistent by nature: one bad bounce or mental lapse can be the difference between a good game and a bad game. Watching a select few games of a goaltender will make it difficult to judge their true ability, no doubt part of the reason teams struggle to value goaltenders at the draft. It can also complement scouting reports. If a scout sees a particular trend or weakness in a goaltender’s game, there will be data available which can be used to verify or contradict the scout’s claims.

Additionally, goaltender performance can influence the statistics of players at other positions. Both a defenseman playing in front of poor goaltending and a goal scorer who faced an unlikely sequence of superb goaltending are going to have their statistics skewed. Adjusting these statistics for goaltending performance will give management a clearer idea of why a certain player’s statistics might be deviating from their expectations. For example, the model can be expanded to measure the difference between even-strength expected goals for and expected goals against for each player over the course of the game based on the data already being recorded. This type of analysis is separate from the ES% Index; however, having more accurate goaltending statistics would provide an organization another tool to properly evaluate players and put the absolute best product on the ice.


No statistical analysis can replace comprehensive subjective evaluation that is performed by the most experienced hockey minds in the world. However, it can offer a fresh perspective and lend objective analysis to a position where contrarians can often be the most successful. The unorthodox goaltending styles of Tim Thomas and Dominik Hasek have remarkably won 8 out of the last 17 Vezina trophies awarded. Not only were they drafted in the 9th and 10th rounds, respectively, they did not even become starting goaltenders until ages 32 and 29 despite their success outside of the NHL. Very few understood how they stopped the puck, but both men clearly prevented goals. It is my hope that employing more advanced goaltending metrics can remove the biases that exist and pinpoint goal prevention, the sole objective of a goaltender. Due to my extensive knowledge of the position as both a student and a coach, the model has been constructed to reflect the complex simplicity of the position (Where is the shot from? Can I see it? Can I reach my optimal position?) while deducing the existence of attributes that are critical to NHL success: size, speed, positioning, lateral movement, and consistency. For these reasons, Expected Save Percentage Index and Win Contribution analysis manage to combine the qualitative and quantitative factors that are necessary to properly evaluate goaltenders, benefiting any team that employs these advanced metrics.

Expected Goals (xG), Uncertainty, and Bayesian Goalies

All xG model code can be found on GitHub.

Expected Goals (xG) Recipe

If you’re reading this, you’re likely familiar with the idea behind expected goals (xG), whether from soccer analytics, early work done by Alan Ryder and Brian MacDonald, or current models by DTMAboutHeart and Asmean, Corsica, Moneypuck, or things I’ve put up on Twitter. Each model attempts to create a probability of each shot being a goal (xG) given the shot’s attributes like shot location, strength, shot type, preceding events, shooter skill, etc. There are also private companies supplementing these features with additional data (most importantly pre-shot puck movement on non-rebound shots and some sort of traffic/sight-line metric), but this is not public or generated in real time so will not be discussed here.[1]

To assign a probability (between 0% and 100%) to each shot, most xG models likely use logistic regression – a workhorse in many industry response models. As you can imagine the critical aspect of an xG model, and any model, becomes feature generation – the practice of turning raw, unstructured data into useful explanatory variables. NHL play-by-play data requires plenty of preparation to properly train an xG model. I have made the following adjustments to date:

  • Adjust for recorded shot distance bias in each rink. This is done by building a cumulative distribution function for shots taken in games where the team is away and applying that distribution function to the home rink in case their home scorer is biased. For example (with totally made up numbers), when Boston is on the road their games see 10% of shots within 5 feet of the goal, 20% of shots within 10 feet of the goal, etc. We can adjust the shot distance in their home rink to be the same since the biases of 29 data-recorders should be less than a single Boston data-recorder. If at home in Boston, 10% of the shots were within 10 feet of the goal, we might suspect that the scorer in Boston is systematically recording shots further away from the net than other rinks. We assume games with that team result in similar event coordinates both home and away and we can transform the home distribution to match the away distribution. The figure below demonstrates how distributions can differ between home and away games, highlighting the probable bias of the Boston and NY Rangers scorers that season, which was adjusted for. Note we also don’t necessarily want to transform by an average, since the bias is not necessarily uniform throughout the spectrum of shot distances.
home rink bias
No Place Like Home
  • Figure out what events lead up to the shot, what zone they took place in, and the time lapsed between these events and the eventual shot while ensuring stoppages in play are caught.
  • Limit to just shots on goal. Misses include information, but like shot distance contain scorer bias. Some scorers are more likely to record a missed shot than others. Unlike shots where we have a recorded event, and it’s just biased, adjusting for misses would require ‘inventing’ occurrences in order to adjust biases in certain rinks, which seems dangerous. It’s best to ignore misses for now, particularly because the majority of my analysis focuses on goalies. Splitting the difference between misses caused by the goalie (perhaps through excellent positioning and reputation for not giving up pucks through the body) and those caused by recorder bias seems like a very difficult task. Shots on goal test the goalie directly hence will be the focus for now.
  • Clean goalie and player names. Annoying but necessary – both James and Jimmy Howard make appearances in the data, and they are the same guy.
  • Determine the strength of each team (powerplay for or against or if the goaltender is pulled for an extra attacker). There is a tradeoff here. The coefficients for the interaction of states (i.e. 5v4, 6v5, 4v3 modeled separately) pick up interesting interactions, but show significant instability from season to season. For example, 3v3 went from a penalty-box filled improbability to a common occurrence to finish overtime games. Alternatively, shooter strength and goalie strength can be modeled separately; this is more stable but less interesting.
  • Determine the goaltender and shooter handedness and position from look-up tables.
  • Determine which end of the ice and what coordinates (positive or negative) the home team is based, using recordings in any given period and rink-adjusting coordinates accordingly.
  • Calculate shot distance and shot angle. Determine what side of the ice the shot is from, and whether or not it is the shooter’s off-wing based on handedness.
  • Tag shots as rushes or rebound, and if a rebound how far the puck travelled and the angular velocity of the puck from shot 1 to shot 2.
  • Calculate ‘shooting talent’ – a regressed version of shooting percentage using the Kuder-Richardson Formula 21, employed the same way as in DTMAboutHeart and Asmean’s xG model.
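
To make the first adjustment in the list concrete, here is a minimal sketch of matching a home-rink distance distribution to the away distribution through their empirical cumulative distribution functions. It is not the code from the GitHub repository: the function name and the toy distances are hypothetical, and it assumes a simple nearest-rank quantile lookup.

```python
from bisect import bisect_right


def adjust_home_distance(distance, home_sorted, away_sorted):
    """Map a recorded home-rink distance onto the away-game distribution.

    Step 1: find the share of home shots recorded at or inside this distance.
    Step 2: return the away-game distance that sits at the same share.
    Both inputs must be sorted lists of recorded shot distances.
    """
    count = bisect_right(home_sorted, distance)  # home shots <= distance
    # Integer ceiling of count * n_away / n_home, to avoid float rounding.
    rank = -(-count * len(away_sorted) // len(home_sorted))
    index = min(max(rank - 1, 0), len(away_sorted) - 1)
    return away_sorted[index]


# Toy data: the home scorer records every shot 5 feet too far from the net.
away = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
home = [d + 5 for d in away]

adjusted = [adjust_home_distance(d, home, away) for d in home]
```

Because the mapping works quantile by quantile rather than subtracting an average, it can also correct a bias that is larger at some distances than at others, which is the concern raised in the bullet.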

All of this is to say there is a lot going on under the hood, and the results are reliant on the data being recorded, processed, adjusted, and calculated properly. Importantly, the cleaning and adjustments to the data will never be complete; there will only be issues that haven’t been discovered or adjusted for yet. There is no perfect xG model, nor is it possible to create one from the publicly available data, so it is important to concede that there will be some errors, but the goal is to prevent systemic errors that might bias the model. Still, these models add useful information that regular shot attempt models cannot, creating results that are more robust and useful, as we will see.

Current xG Model

The current xG model does not use all developed features. Some didn’t contain enough unique information, perhaps over-shadowed by other explanatory variables. Some might have been generated on sparse or inconsistent data. Hopefully, current features can be improved or new features created.

While the xG model will continue to be optimized to better maximize out of sample performance, the discussion below captures a snapshot of the model. All cleanly recorded shots from 2007 to present are included, randomly split into 10 folds. Each of the 10 folds was then used as a testing dataset (checking to see if the model correctly predicted a goal or not by comparing it to actual goals) while the other 9 folds were used to train the model. In this way, all reported performance metrics consist of comparing model predictions on the unseen data in the testing dataset to what actually happened. This is known as k-fold cross-validation and is fairly common practice in data science.
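
The fold mechanics can be sketched with the standard library alone. The generator below is an illustration, not the model's actual code: its name, the seed, and the row count in the example are assumptions.

```python
import random


def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train, test) row-index lists for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)        # random assignment to folds
    folds = [indices[i::k] for i in range(k)]   # k roughly equal folds
    for i, test in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, test


# Each shot lands in the testing dataset exactly once across the 10 splits.
splits = list(k_fold_indices(103, k=10))
```

In practice a model would be fit on `train` and scored on `test` inside the loop, so every reported metric comes from shots the model never saw during training.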

When we rank-order the predicted xG from highest to lowest probability we can compare the share of goals that occur to shots ordered randomly. This gives us a gains chart, a graphic representation of how good the model is at finding actual goals relative to selecting shots randomly. We can also calculate the Area Under the Curve (AUC), where 1 is a perfect model and 0.5 is a random model. Think of the random model in this case as shot attempt measurement, treating all shots as equally likely to be a goal. The xG model has an AUC of about 0.75, which is good, and safely in between perfect and random. The most dangerous 25% of shots as selected by the model make up about 60% of actual goals. While there’s irreducible error and model limitations, in practice it is an improvement over unweighted shot attempts and accumulates meaningful sample size quicker than goals for and against.
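
Both measures are simple enough to compute by hand, which helps show what they mean. The sketch below uses eight made-up shots and hypothetical function names. The pairwise AUC is quadratic in the number of shots, so it is meant for illustration rather than for a decade of play-by-play data.

```python
def auc(probs, goals):
    """Chance a random goal was given a higher xG than a random non-goal.

    Ties count as half a win. 1.0 is a perfect ranking, 0.5 is random.
    """
    scored = [p for p, g in zip(probs, goals) if g == 1]
    saved = [p for p, g in zip(probs, goals) if g == 0]
    wins = sum((s > v) + 0.5 * (s == v) for s in scored for v in saved)
    return wins / (len(scored) * len(saved))


def gains(probs, goals, top_share):
    """Share of all goals found in the top `top_share` of shots ranked by xG."""
    ranked = sorted(zip(probs, goals), key=lambda pair: pair[0], reverse=True)
    n_top = round(top_share * len(ranked))
    return sum(g for _, g in ranked[:n_top]) / sum(goals)


# Made-up predictions and outcomes for eight shots (1 = goal, 0 = save).
xg = [0.40, 0.30, 0.20, 0.10, 0.08, 0.05, 0.04, 0.03]
outcome = [1, 0, 1, 0, 0, 0, 0, 0]

model_auc = auc(xg, outcome)
top_quarter = gains(xg, outcome, 0.25)

# Treating every shot as equally dangerous reproduces the random baseline.
flat_auc = auc([0.1] * 8, outcome)
```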

gains chart
Gains, better than random

Hockey is also a zero-sum game. Goals (and expected goals) only matter relative to league average. Original iterations of the expected goal model built on a decade of data show that goals were becoming dearer compared to what was expected. Perhaps goaltenders were getting better, or league data-scorers were recording events to make things look harder than they were, or defensive structures were impacting the latent factors in the model or some combination of these explanations.

Without the means to properly separate these effects, each season receives its own weights for each factor. John McCool had originally discussed season-to-season instability of xG coefficients. Certainly this model contains some coefficient instability, particularly in the shot type variables. But overall these magnitudes adjust to equate each season’s xG to actual goals. Predicting a 2017-18 goal would require additional analysis and smartly weighting past models.

Coefficient Stability
Less volatile than goalies?

xG in Action

Every shot has a chance of going in, ranging from next to zero to close to certainty. Each shot in the sample is there because the shooter believed there was some sort of benefit to shooting, rather than passing or dumping the puck, so we don’t see a bunch of shots from the far end of the rink, for example. xG then assigns a probability to each shot of being a goal, based on the explanatory variables generated from the NHL data (shot distance, shot angle, whether the shot is a rebound, and the others listed above).

Modeling each season separately, total season xG will be very close to actual goals. This also grades goaltenders on a curve against other goaltenders each season. If you are stopping 92% of shots, but others are stopping 93% of shots (assuming the same quality of shots) then you are on average costing your team a goal every 100 shots. This works out to about 7 points in the standings assuming a 2100 shot season workload and that an extra 3 goals against will cost a team 1 point in the standings. Using xG to measure goaltending performance makes sense because it puts each goalie on equal footing as far as what is expected, based on the information that is available.
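
The standings arithmetic above is short enough to write out. In this sketch the function name is hypothetical, and the defaults are the 2,100-shot workload and 3-goals-per-point conversion assumed in the paragraph.

```python
def points_cost(save_pct, peer_save_pct, shots=2100, goals_per_point=3.0):
    """Standings points lost by stopping fewer shots than peers on equal shot quality."""
    extra_goals_against = (peer_save_pct - save_pct) * shots
    return extra_goals_against / goals_per_point


# 92% against a 93% peer group: 1 goal per 100 shots, 21 goals, about 7 points.
cost = points_cost(0.92, 0.93)
```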

We can normalize the number of goals prevented by the number of shots against to create a metric, Quality Rules Everything Around Me (QREAM), Expected Goals – Actual Goals per 100 Shots. Splitting each goalie season into random halves allows us to look at the correlation between the two halves. A metric that captures 100% skill would have a correlation of 1. If a goaltender prevented 1 goal every 100 shots, we would expect to see that hold up in each random split. A completely useless metric would have an intra-season correlation of 0, picking numbers out of a hat would re-create that result. With that frame of reference, intra-season correlations for QREAM are about 0.4 compared to about 0.3 for raw save percentage. Pucks bounce so we would never expect to see a correlation of 1, so this lift is considered to be useful and significant.[2]
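
A rough sketch of the metric and the split-half check follows. The function names and the shot representation, a list of (xG, goal) pairs per goalie season, are assumptions made for illustration and are not taken from the model's repository.

```python
import random


def qream(shots):
    """Goals prevented per 100 shots: (expected goals - actual goals) / shots * 100."""
    expected = sum(xg for xg, _ in shots)
    actual = sum(goal for _, goal in shots)
    return 100.0 * (expected - actual) / len(shots)


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def split_half_correlation(goalie_seasons, seed=0):
    """Split each goalie season into random halves and correlate QREAM across them."""
    rng = random.Random(seed)
    first, second = [], []
    for shots in goalie_seasons:
        shuffled = list(shots)
        rng.shuffle(shuffled)
        mid = len(shuffled) // 2
        first.append(qream(shuffled[:mid]))
        second.append(qream(shuffled[mid:]))
    return pearson(first, second)


# A goalie facing 100 shots worth 0.1 xG each and allowing 9 goals
# prevented about 1 goal per 100 shots.
example = [(0.1, 1)] * 9 + [(0.1, 0)] * 91
example_qream = qream(example)
```

Running `split_half_correlation` on real goalie seasons is what produces the intra-season figure quoted above; with a different random seed the halves change, so the estimate moves around a little from run to run.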

intra-season correlations
Goalies doing the splits

Crudely, each goal prevented is worth about 1/3 of a point in the standings. Knowing how many goals a goalie prevents compared to average allows us to compute how many points a goalie might create for or cost their team. However, a more sophisticated analysis might compare goal support the goalie receives to the expected goals faced (a bucketed version of that analysis can be found here). Using a win probability model, the impact the goalie had on winning or losing can be framed as actual wins versus expected.


xG’s also are important because they begin to frame the uncertainty that goes along with goals, chance, and performance. What does the probability of a goal represent? Think of an expected goal as a coin weighted to represent the chance that shot is a goal. Historically, a shot from the blueline might end up a goal only 5% of the time. After 100 shots (or coin flips) will there be exactly 5 goals? Maybe, but maybe not. Same with a rebound from in tight to the net that has a probability of a goal equal to 50%. After 10 shots, we might not see 5 goals scored, like ‘expected.’ 5 goals is the most likely outcome, but anywhere from 0 to 10 is possible on only 10 shots (or coin flips).

We can see how actual goals and expected goals might deviate in small sample sizes, from game to game and even season to season. Luckily, we can use programs like R, Python, or Excel to simulate coin flips or expected goals. A goalie might face 1,000 shots in a season, giving up 90 goals. With historical data, each of those shots can be assigned a probability of being a goal. If the average probability of a goal is 10%, we expect the goalie to give up 100 goals. But using xG, there are other possible outcomes. Simulating 1 season based on expected goals might result in 105 goals against. Another simulation might be 88 goals against. We can simulate these same shots 1,000 or 10,000 times to get a distribution of outcomes based on expected goals and compare it to the actual goals.

In our example, the goalie possibly prevented 10 goals on 1,000 shots (100 xGA – 90 actual GA). But they also may have prevented 20 or prevented 0. With expected goals and simulations, we can begin to visualize this uncertainty. As the sample size increases, the uncertainty decreases but never evaporates. Goaltending is a simple position, but the range of outcomes, particularly in small samples, can vary due to random chance regardless of performance. Results can vary due to performance (of the goalie, teammates, or opposition) as well, and since we only have one season that actually exists, separating the two is painful. Embracing the variance is helpful and expected goals help create that framework.
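
The simulation in this example takes only a few lines of Python. This is a minimal sketch: the function name and seed are arbitrary, and every shot is given the same 10% chance purely to mirror the example, whereas a real season would use each shot's own xG.

```python
import random


def simulate_goals_against(shot_probs, n_sims=1000, seed=0):
    """Replay the same shots n_sims times, flipping one weighted coin per shot."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for p in shot_probs) for _ in range(n_sims)]


# 1,000 shots at a 10% goal probability each: 100 expected goals against.
shot_probs = [0.10] * 1000
sims = simulate_goals_against(shot_probs)

actual_goals = 90
mean_goals = sum(sims) / len(sims)
# Share of simulated seasons in which an average goalie allowed more than 90.
share_worse = sum(s > actual_goals for s in sims) / len(sims)
```

The spread of `sims` is the uncertainty being discussed: the goalie who allowed 90 goals beat expectation, and `share_worse` describes how often chance alone would leave an average goalie behind that mark.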

It is important to acknowledge that results do not necessarily reflect talent or future or past results. So it is important to incorporate uncertainty into how we think about measuring performance. Expected goal models and simulations can help.

simulated seasons
Hackey statistics

Bayesian Analysis

Luckily, Bayesian analysis can also deal with weighting uncertainty and evidence. First, we set a prior – a probability distribution of expected outcomes. Brian MacDonald used mean Even Strength Save Percentage as a prior, the distribution of ESSV% of NHL goalies. We can do the same thing with Expected Save Percentage ((shots – xG) / shots), creating a unique prior distribution of outcomes for each goalie season depending on the quality of shots faced and the sample size we’d like to see. Once the prior is set, evidence (saves in our case) is layered on to the prior, creating a posterior outcome.

Imagine a goalie facing 100 shots to start their career and, remarkably, making 100 saves. They face 8 total xG against, so we can set the Prior Expected Save% as a distribution centered around 92%. The current evidence at this point is 100 saves on 100 shots, and Bayesian Analysis will combine this information to create a Posterior distribution.

Goaltending is a binary job (save/goal) so we can use a beta distribution to create a distribution of the goaltender’s expected (prior) and actual (evidence) save percentage between 0 and 1, like a baseball player’s batting average will fall between 0 and 1. We also have to set the strength of the prior – how robust the prior is to the new evidence coming in (the shots and saves of the goalie in question). A weak prior would concede to evidence quickly; a hot streak to start a season or career may lead the model to think this goalie may be a Hart candidate or future Hall-of-Famer! A strong prior would assume every goalie is average and require prolonged over or under achieving to convince the model otherwise. Possibly fair, but not revealing any useful information until it has been common knowledge for a while.

bayesian goalie
Priors plus Evidence

More research is required, but I have set the default prior strength equivalent to 1,000 shots. Teams give up about 2,500 shots a season, so a 1A/1B type goalie would exceed this threshold in most seasons. In my goalie compare app, the prior can be adjusted up or down as a matter of taste or curiosity. Research topics would investigate what prior shot count minimizes season to season performance variability.
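
For the 100-saves-on-100-shots goalie described earlier, the update can be sketched as below. It assumes the prior is expressed as pseudo-counts, Beta(mean × prior shots, (1 − mean) × prior shots), which is a common parameterization but not necessarily the exact one used in the app. The function names are hypothetical.

```python
def beta_posterior(expected_save_pct, prior_shots, saves, shots):
    """Combine a Beta prior, expressed as pseudo-shots, with observed saves and goals."""
    alpha = expected_save_pct * prior_shots + saves                # prior saves + saves
    beta = (1.0 - expected_save_pct) * prior_shots + (shots - saves)  # prior goals + goals
    return alpha, beta


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def beta_sd(alpha, beta):
    """Standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return (alpha * beta / (total ** 2 * (total + 1))) ** 0.5


# Prior centered on 92% (8 xG on 100 shots), evidence of 100 saves on 100 shots.
strong = beta_posterior(0.92, prior_shots=1000, saves=100, shots=100)
weak = beta_posterior(0.92, prior_shots=100, saves=100, shots=100)

strong_mean = beta_mean(*strong)   # 1020 / 1100, barely moved from 92%
weak_mean = beta_mean(*weak)       # 192 / 200, pulled much closer to the hot streak
```

With the 1,000-shot prior the perfect start nudges the estimate to roughly 92.7%, while a 100-shot prior jumps to 96%, which is the weak-versus-strong prior tradeoff in miniature.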

Every time a reported result activates your small sample size spidey senses, remember Bayesian analysis is thoroughly unimpressed, dutifully collecting evidence, one shot at a time.


Perfect is often the enemy of the good. Expected goal models fail to completely capture the complex networks and inputs that create goals, but they do improve on current results-based metrics such as shot attempts by a considerable amount.  Their outputs can be conceptualized by fans and players alike, everybody understands a breakaway has a better chance of being a goal than a point shot.

The math behind the model is less accessible, but people, particularly the young, are becoming more comfortable with prediction algorithms in their daily life, from Spotify generating playlists to Amazon recommender systems. Coaches, players, and fans on some level understand not all grade A chances will result in a goal. So while out-chancing the other team in the short term is no guarantee of victory, doing it over the long term is a recipe for success. Removing some of the noise that goals contain and the conceptual flaws of raw shot attempts helps smooth the short-term disconnect between performance and results.

My current case study using expected goals is to measure goaltending performance since it’s the simplest position – we don’t need to try to split credit between linemates. Looking at xGA – GA per shot captures more goalie specific skill than save percentage and lends itself to outlining the uncertainty those results contain. Expected goals also allow us to create an informed prior that can be used in a Bayesian hierarchical model. This can quantify the interaction between evidence, sample size, and uncertainty.

Further research topics include predicting goalie season performance using expected goals and posterior predictive distributions.


[1] Without private data or comprehensive tracking technology, analysts are only able to observe outcomes of plays – most importantly goals and shots – but not really what created those results. A great analogy came from football (soccer) analyst Marek Kwiatkowski:

Almost the entire conceptual arsenal that we use today to describe and study football consists of on-the-ball event types, that is to say it maps directly to raw data. We speak of “tackles” and “aerial duels” and “big chances” without pausing to consider whether they are the appropriate unit of analysis. I believe that they are not. That is not to say that the events are not real; but they are merely side effects of a complex and fluid process that is football, and in isolation carry little information about its true nature. To focus on them then is to watch the train passing by looking at the sparks it sets off on the rails.

Armed with only ‘outcome data’ rather than comprehensive ‘input data,’ most analysts’ models will be best served by logistic regression. Logistic regression often bests complex models, often generalizing better than machine learning procedures. However, it will become important to lean on machine learning models as reliable ‘input’ data becomes available in order to capture the deep networks of effects that lead to goal creation and prevention. Right now we only capture snapshots, thus logistic regression should perform fine in most cases.

[2] Most people readily acknowledge some share of results in hockey are luck. Is the number closer to 60% (given the repeatable skill in my model is about 40%), or can it be reduced to 0% because my model is quite weak? The current model can be improved with more diligent feature generation and adding key features like pre-shot puck movement and some sort of traffic metric. This is interesting because traditionally logistic regression models see diminishing marginal returns from adding more variables, so while I am missing 2 big factors in predicting goals, the intra-seasonal correlation might only go from 40% to 50%. However, deep learning networks that can capture deeper interactions between variables might see an overweight benefit from these additional ‘input’ variables (possibly capturing deeper networks of effects), pushing the correlation and skill capture much higher. I have not attempted to predict goals using deep learning methods to date.

Goaltending—Game Theory, the Contrarian Position, and the Possibility of the Extreme

Preamble: The following is a paper I wrote while in college about 6 years ago. It takes a slightly different approach and uses worse logic than I employ now, likely reflecting my attitude at the time – a collegiate goaltender with the illusion of control (hence goals were likely unpredictable events, else I would have stopped them). I have softened on this thinking, but still think the recommendation holds: goaltenders can outperform the average by mixing strategies and adding an element of unpredictability to their game.


How goaltender strategy and understanding randomness in hockey can lend insight into the success of truly elite goaltenders.


This paper outlines general strategies and philosophies behind goaltending, focusing on what makes great goaltenders great. Philosophy and goaltending make interesting partners—few athletic positions are continuously branded with a ‘style.’ Since such subjective labels are the norm for this position, then I feel quite comfortable using the terms rather broadly in a philosophical analysis. I will use loose generalisations to formulate a big-picture view of the position—how it has evolved, the type of goaltender that has consistently risen above their peers during this evolution, and why. Using game theory and attempting to clearly label player strategies is, at times, clumsy. Addressing the impact of unquantifiable randomness in hockey does not provide much comfort either. However, the purpose is to encourage further thought on the subject, and not provide a numerical, concise answer. It is a question that deserves more thought, at both the professional (evaluation and scouting) and grass-root (development and training) level. The question: what makes a consistently great goaltender?

Game Theory—The Evolution of Goaltending Strategy

Passive ‘blocking’ tactics have become prevalent among goaltenders at all levels. It is simple, statistically successful, and passive. There are tradeoffs like any strategy—the goaltender forfeits aggressiveness in order to force the shooter to make perfect shots to beat them. This ‘fated’ strategy exposes the goaltender to the extreme—most goals allowed are classified as ‘great plays’ or ‘lucky,’ certainly not the fault of the goaltender. However, there are other considerations. Shooters, no doubt, have adjusted their strategy based on this approach, further compromising the passive approach to goaltending. This means a disproportionate number of shooters will look to make ‘perfect’ shots—high and tight to the post against a blocking goaltender—despite the risk of missing the net entirely.

Historically, goaltenders did not have the luxury of light, protective equipment that is designed specifically to seal off any holes while in a butterfly position. Equipment lacking proper protection and effectiveness required goaltenders to spend the majority of the time on their feet while facing shots.

Player/Goaltender Interactions Then and Now

Game theory applications allow a crude analysis of the evolution of strategies between players and goaltenders. The numbers I use are arbitrary, however, they demonstrate an important strategic shift in goaltending tactics. First, let us assume that players have to decide whether to shoot high or low and always try to shoot for the posts. Simultaneously, goaltenders must choose to block or react.

In the age of primitive equipment, goaltenders were required to stand up most of the time to make saves. From here we can make three assumptions in this ‘game’ or ‘shot’: 1) While blocking, the goaltender’s expected success rate was the same if the shooter shot high or low. Since the ‘blocking’ tactic was simply standing up and challenging excessively when possible, it would not matter if the player shot high or low; the goaltender was simply covering the middle of the net. 2) While reacting, high shots were easier saves than low shots. Goaltenders generally stood up, which made reaching pucks with the hands easy and reaching pucks with the feet hard. 3) Goaltenders were still better reacting than blocking on low shots, since players will always shoot for the posts.

We can then use the iterated elimination of dominated strategies technique to find a dominant strategy for each player. In this scenario, goaltenders are always more successful, on average, reacting than blocking. Since goaltenders will always react, shooters acknowledge they are generally better off shooting low than high (while this is just a fabricated example, the fact goaltenders survived without helmets might prove this). Regardless, the point of this exercise demonstrates that goaltenders needed to have the ability to react to shots during this time. These strategies and the expected save percentages are displayed in the matrix below (Figure 1). Remember goaltenders want the highest save percentage strategy, while shooters want to find the lowest.
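
Since the figure is not reproduced here, the logic can be checked with a small Python sketch. The save percentages below are invented placeholders chosen only to satisfy the three assumptions above. They are not the values from Figure 1, and the function names are hypothetical.

```python
GOALIE_STRATEGIES = ("block", "react")
SHOOTER_STRATEGIES = ("high", "low")

# Hypothetical expected save percentages for the stand-up era.
# Blocking: identical high or low. Reacting: high shots easier than low,
# and still better than blocking on low shots.
SAVE_PCT = {
    ("block", "high"): 0.80, ("block", "low"): 0.80,
    ("react", "high"): 0.95, ("react", "low"): 0.85,
}


def dominant_goalie_strategy(save_pct):
    """Return the strategy that beats every alternative against every shot, if any."""
    for strategy in GOALIE_STRATEGIES:
        others = [s for s in GOALIE_STRATEGIES if s != strategy]
        if all(save_pct[(strategy, shot)] > save_pct[(other, shot)]
               for other in others for shot in SHOOTER_STRATEGIES):
            return strategy
    return None


def shooter_best_response(save_pct, goalie_strategy):
    """Shooters want the lowest save percentage given what the goalie does."""
    return min(SHOOTER_STRATEGIES, key=lambda shot: save_pct[(goalie_strategy, shot)])


goalie_choice = dominant_goalie_strategy(SAVE_PCT)               # blocking is eliminated
shooter_choice = shooter_best_response(SAVE_PCT, goalie_choice)  # then the shooter picks
```

With these placeholder payoffs the elimination runs exactly as the paragraph describes: blocking is dominated, so the goaltender reacts, and a shooter who knows this shoots low.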

However, the game of hockey is not as simple as the pure simultaneous-move game we have set up. Offensive players are not shooting in a vacuum. They are often facing defensive pressure or limited to long distance shots, and both circumstances limit the ability of offensive players to accurately shoot the puck. If the goaltender believes his team will be able to limit the frequency of high shots to less than 50%, then the goaltender’s expected save percentage while blocking is greater than their expected save percentage while reacting.

Advances in equipment then allowed the adoption of a new blocking tactic: the butterfly. By dropping to their knees and flaring out their legs, goaltenders were maximising their blocking surface area, particularly along the ice. Equipment was lighter, bigger, and increasingly conducive to the butterfly style, allowing goaltenders to perform at higher levels. Now the same simultaneous-move game described above began to increasingly favour the goaltender. Not only did the butterfly change the way goaltenders blocked, it changed the way they reacted. Goaltenders now tended to react from a butterfly base, dropping down to their knees at the onset of the shot and reacting as they dropped. The effectiveness of the down game now meant shooters were always better off shooting high. In a pure game theory sense, this would suggest players would always shoot high, so goaltenders should still always react. These strategies and the new payoffs are displayed in Figure 2.

This suggests that goaltenders with a good defence, good blocking technique, and modern goaltending equipment are better off blocking. When a goaltender is said to be ‘playing the percentages,’ this suggests the goaltender routinely blocks the majority of the net and forces the shooter to make a perfect shot. This strategy has raised the average performance of goaltenders. However, in a zero-sum game such as hockey, simply maintaining a level of adequate performance will not increase the goaltender’s absolute success, measured in wins and losses. The only way for a goaltender to positively impact their team is to exceed the average, which—as we will see—can be accomplished by defying the norm.
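The break-even argument above (blocking pays once high shots are held below roughly half of all attempts) can be sketched numerically. The payoffs below are hypothetical values chosen only so that the crossover lands at 50%; they are not the numbers from the article's figures, and the function names are illustrative.

```python
# Hypothetical expected save percentages for each (goalie strategy, shot height).
# These are assumptions for illustration, NOT the values from Figure 1 or Figure 2.
SAVE_PCT = {
    ("block", "high"): 0.80,
    ("block", "low"): 0.98,
    ("react", "high"): 0.90,
    ("react", "low"): 0.88,
}


def expected_save_pct(goalie_strategy, p_high):
    """Expected save percentage when a share p_high of shots go high."""
    return (p_high * SAVE_PCT[(goalie_strategy, "high")]
            + (1 - p_high) * SAVE_PCT[(goalie_strategy, "low")])


def best_response(p_high):
    """Goaltender strategy with the higher expected save percentage."""
    return max(("block", "react"), key=lambda s: expected_save_pct(s, p_high))


for p_high in (0.3, 0.7):
    print(p_high,
          round(expected_save_pct("block", p_high), 3),
          round(expected_save_pct("react", p_high), 3),
          best_response(p_high))
```

With these made-up payoffs, blocking is the better response when the defence holds high shots to 30% of attempts, and reacting is better when 70% of shots go high.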

In conclusion, these strategic interactions did not create hard rules for goaltenders or shooters. However, the permeation of advanced tactics has heavily skewed the payoffs toward the goaltender. Goaltenders block more, and shooters shoot high as much as possible. An unspoken equilibrium has been created and maintained at all levels of hockey—thus altering the instinctive strategies employed by both groups.

The ‘Average’ Position

Goaltenders could now simplify their approach to their position, while simultaneously out-performing their historical predecessors. The average NHL save percentage rose from 87.6% in 1982 to 91.6% in 2011.* This rise in success rate would give any goaltender little incentive to break the norm. Imagine an ‘average’ goaltender, posting a save percentage equivalent to the NHL average save percentage each year. The ‘average’ goaltender would put up better numbers each successive year. While they would be perceived to be more valuable—higher personal statistics means a bigger contract, more starts, and a greater reputation—it is entirely conceivable that, despite their statistical improvement, they would not contribute to any more victories. If the goaltender at the other end of the ice is performing just as well as you (on average, of course) then the ‘average’ goaltender will not contribute any extra wins to his team compared to the year before. However, this effect would be difficult to observe over the course of a goaltender’s career, and coaches and managers would become enamoured with ‘average’ goaltending, comparing it favourably to the recent past. The ‘success of mediocrity’ encouraged a simplified, safe, and ‘high-percentage’ approach to the position. If you looked like other goaltenders, played like other goaltenders, and performed like other goaltenders, there was little reason to worry about job security. In short, through the evolution of goaltending, goaltenders generally have had very little to gain from breaking the idyllic norm of how a goaltender should look or play. The implicit equilibrium between shooters and goaltenders has persisted across different eras—most recently centring around a ‘big butterfly, blocking’ game, resulting in historically superior statistics for the ‘average’ goaltender.

The Limits of Success

There is no doubt that the craft of goaltending is now significantly superior to the efforts that preceded it. Goaltenders today are bigger, faster, more athletic, and more advanced technically. However, the quest to fulfil the requirement of ‘average’ will be an empty pursuit in absolute terms (wins and losses) to any goaltender. In order to avoid becoming ‘average’ the goaltender must deviate from the strategic equilibrium that primarily consists of large goaltenders simply ‘playing the percentages.’ While goaltenders can exceed the average by simply being even bigger, faster, and more athletic than their peers, this is becoming increasingly difficult. Not only will teams continue to draft goalies for these attributes, there are natural limits to how tall, fast, and coordinated a human being can be. Shooters will also continue to adjust. An extra 2” in height does not necessarily prevent a perfectly placed shot over or under the glove. Recall the oversimplified instantaneous move game: shooters will always be better off shooting high and to the posts—when they have time. High-level shooters have evolved to target very specific areas of the net, preying on the predictability of the modern butterfly goalie. However, the shooter will not always have time to attempt the perfect shot, which means the goaltender can revert to primarily blocking and mediocrity without being exposed.



The Contrarian Position

While goaltenders cannot change their physiology in order to exceed the average, they can (slowly) alter their approach to the game. Remember, the strategic interaction between the goaltender and shooter has become predictable. The goaltender will fill up as much net as possible, forcing the shooter to manufacture a perfect shot, while the shooter will attempt to comply. If a goaltender were to begin to mix strategies effectively and react some percentage of the time, they would be better off. The shooter has been trained to shoot high (that is their dominant strategy), and goaltenders are better off reacting to high shots than blocking and leaving their arms pinned to their sides. Essentially, by mixing strategies when it is wise (when the simple block-react instantaneous move model applies), the goaltender can increase their expected save percentage—and exceed the average.
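To make the idea of mixing concrete, here is a minimal sketch of the mixed-strategy equilibrium of a two-by-two zero-sum block/react game. The payoffs are hypothetical (they are not taken from the article's figures), the function name is illustrative, and the formula only applies when neither side has a dominant strategy.

```python
def mixed_equilibrium(block_high, block_low, react_high, react_low):
    """Mixed equilibrium of a 2x2 zero-sum game given four save percentages.

    Returns (share of shots the goalie reacts to,
             share of shots the shooter sends high,
             resulting expected save percentage).
    Assumes an interior solution, i.e. no dominant strategy for either side.
    """
    denom = (react_high - block_high) + (block_low - react_low)
    # Goalie reacts often enough that the shooter is indifferent between high and low.
    react_share = (block_low - block_high) / denom
    # Shooter goes high often enough that the goalie is indifferent between block and react.
    high_share = (block_low - react_low) / denom
    # Expected save percentage against a high shot (equal to that against a low shot).
    save_pct = react_share * react_high + (1 - react_share) * block_high
    return react_share, high_share, save_pct


# Hypothetical payoffs: blocking is strong low and weak high, reacting is balanced.
react_share, high_share, save_pct = mixed_equilibrium(0.80, 0.98, 0.90, 0.88)
print(round(react_share, 3), round(high_share, 3), round(save_pct, 3))
```

Under these assumed payoffs the goaltender reacts on most shots, which is what it takes to stop the shooter from profiting by always shooting high.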

To demonstrate this point we must move away from the abstract and the general, focusing on specific examples. A disproportionate amount of statistical success throughout the ‘butterfly’ era has been the work of unorthodox goaltenders. While an ‘unorthodox’ style has had a negative connotation in the conventional world of goaltending, it is the defectors that have broken through the limits reached by the big, butterfly goaltender. Sub-six-foot Tim Thomas recently broke the modern NHL save percentage record by willing himself to saves and largely defying the established goaltending practice. The save percentage record previously belonged to Dominik Hasek. Like Thomas, Hasek was less than six feet tall and would consistently move toward the puck like no other goaltender in the game. To shooters that have very clear, habitual objectives (shoot high glove or low blocker just over the pad or through his legs if he is sliding, etc.) facing these contrarians led to a historically low shooter success rate. These athletes effectively mixed their strategies between blocking and reacting (their own versions of these strategies, mind you) to keep shooters guessing. Their contrarian approach has been remarkably sustainable as well—Hasek and Thomas have combined to win 8 out of the last 17 Vezina Trophies, despite their NHL careers only overlapping 3 years. By moving further away from the archetypical goaltender, both Thomas and Hasek exceeded the average considerably. It is exceeding the average that causes goaltenders to contribute to victories, the absolute measurement of success for any goaltender.

Consider the correlation between a unique approach and sustained success when assessing the careers of four Calder Trophy winning goaltenders: Ed Belfour, Martin Brodeur, Andrew Raycroft, and Steve Mason. Each began their NHL career in impressive fashion; however, two went on to become generational goaltenders, while the other two will struggle to equal their initial success. This may seem like an unfair comparison, but it is important to understand why it is unfair. Both Brodeur and Belfour maintained an elite level of play because they generally defied convention throughout their careers. Both played unique styles and were excellent puck handlers. When Belfour entered the league at the very start of the 1990s his combination of athleticism, intensity, and an advanced understanding of positional play made him formidable. He mastered the butterfly before it was the standard—you could argue the success of Patrick Roy and Belfour helped create the current generation of ‘big, butterfly’ goaltenders. Brodeur has always been different—there has been no comparable goaltender to him throughout his career, just like Thomas or Hasek. He has been the most consistent and celebrated goaltender in NHL history without utilising the most common save tactic employed by his peers—he rarely drops into a true butterfly. Counter-intuitively, despite lacking a standard, universal save movement, he has also been remarkably consistent. Martin Brodeur has mixed his save selection strategies magnificently, preying on shooters programmed to shoot against predictable butterfly practitioners.

Now consider the other rookie standouts: Raycroft and Mason. It is difficult to distinguish their approach to the game from the approach of other ‘average’ professionals. Mason is taller than average and catches right, but he does not present a unique challenge to shooters. They are goaltenders with an average, ‘percentage-based’ approach to goaltending. There is nothing noteworthy about the way they play the position. Why the initial success? Both goaltenders likely overachieved (positive deviation from the average) due to a favourable situation and the vague element of surprise. Shooters would soon adjust to the subtleties in the young goaltender’s game.* Personal weaknesses would become exploited and their performance regressed towards the mean. Their rookie years could have been duplicated by a number of other rookie goaltenders, with similar skill and luck. Their ‘average’ size, skill set, and approach to the game have manifested themselves in an ‘average’ NHL career. An impressive beginning was nothing more than favourable luck and circumstance—their careers diverged significantly from other Calder-winning goaltenders. Goaltenders that went throughout their career masterfully mixing save selection strategies, by contrast, set the standard for consistency, longevity, and performance.

In conclusion, the modern equilibrium between goaltenders and shooters has been successfully disrupted by contrarians like Dominik Hasek, Tim Thomas, and Martin Brodeur. The rest have enjoyed the benefits of the ‘big, butterfly goaltender’ doctrine—stopping more pucks on average—but have gained little ground on other ‘average’ goaltenders. These goaltenders are playing a strategy that contributes little to their team because they are more susceptible to the extreme.


The Possibility of the Extreme—The Black Swan Save 

If contrarians exceed the average, it is important to understand how they can do it with remarkable consistency. I believe their unconventional style and willingness to react to shots leaves them better prepared to handle the possibility of the statistically unique shot—which I will call a ‘Black Swan’ opportunity.§ They can always use the butterfly tactic in situations that call for it, while the butterfly-reliant goaltenders struggle to improvise like contrarians. The ‘reaction’ strategy leaves them free to make the unconventional saves necessary to prevent Black Swans from becoming goals.

The position relies on instinct and split second decisions. Reactions and responses to defined situations are drilled into goalies from an increasingly young age. Long before these goaltenders are capable of playing in the NHL, they have generally mastered technical responses to certain, finite situations. Goaltenders may be trained very well to react predictably in trained circumstances, but this leaves the goaltender susceptible to the extreme—breeding mediocrity. In this case, the extreme or Black Swan shot, is the result of 10 position players on the ice, moving at speeds up to 30 miles per hour, chasing an object that can move close to 100 miles per hour. Despite the simple objective and the definitive results of the goaltending position, every shot against them has the potential to create an infinite amount of complexities and permutations. A one-dimensional approach—where the goaltender determines they are better off ‘playing the percentages’—to the position offers the goaltender the opportunity to make a large number of saves, but it does not prepare the goaltender to react favourably to a Black Swan. The problem, then, is not maintaining a predictable level of performance—making the saves ‘you should make’—it is the ability to adjust to the unpredictable and the extreme in order to make a critical save. This is accomplished by reacting to shots a healthy percent of the time.

The real objective of the goaltender is to give up fewer goals than the opposing goaltender. In a low scoring game such as hockey, it is likely one goal against will determine the outcome of any given game. Passively leaving the outcome up to chance is a mistake in my opinion. Aggressiveness and assertiveness are competitive qualities that are compromised by a predominantly butterfly style. By dropping in the butterfly the goaltender is surrendering to whatever unlikely or unlucky shot that may occur. A great play, a seeing-eye shot, or unlikely bounce—the ‘unlikely, undrilled’ occurrences that have the potential to win or lose games—happen randomly. The goaltender must be aggressive and decisive in order to adjust to these situations. These are the shots that cannot be replicated in repetitive drills; they require the creativity and instinctual reaction of an instinctual contrarian.

Goaltending—A Lesson in Randomness

The frequency of the Black Swan shot or goal against is erratic. They can happen at any time. There is little correlation between shots against and goals against on a game-by-game basis. If we assume the number of Black Swans a goaltender faces is roughly proportional to the number of goals given up*—generally the more improbable shots faced, the more goals against—we counter-intuitively observe that the ‘Black Swans’ and the goals they caused occur randomly in a hockey game, largely independent of the number of shots against the goaltender. Taking the 10 busiest goaltenders of the 2010-2011 season, we see that their save percentage generally goes up as they receive more shots against. It does not matter whether the team gives up 20 shots or 40 shots, the random Black Swan occurrences that result in goals will happen just as frequently, regardless of the shots against. In outings where those goaltenders faced more than 40 shots, the average save percentage and shots against were 94.63% and 43.51, respectively. This implies these goaltenders gave up, on average, 2.33 goals per game when facing more than 40 shots. When these same goaltenders faced fewer than 20 shots, their save percentage was a paltry 82.17% on an average of 14.85 shots. This implies 2.64 goals against per outing where the goaltender faced fewer than 20 shots.§ Counter-intuitively they fared worse while facing less than half of the shots.
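The implied goals-against figures follow directly from shots faced multiplied by one minus the save percentage. A quick sketch of that arithmetic, using only the rounded averages quoted above (the function name is illustrative; the rounded inputs land within about 0.01 of the quoted 2.33 and 2.64):

```python
def implied_goals_against(avg_shots, save_pct):
    """Goals allowed per outing implied by average shots faced and save percentage."""
    return avg_shots * (1 - save_pct)


busy_nights = implied_goals_against(43.51, 0.9463)   # outings with more than 40 shots
quiet_nights = implied_goals_against(14.85, 0.8217)  # outings with fewer than 20 shots

print(round(busy_nights, 2), round(quiet_nights, 2))
```

The point survives the rounding: on these averages, roughly a third as many shots produced slightly more goals against.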

The frequency of the ‘Black Swan’ occurrences that led to goals appears to be largely independent of shots on goal. ‘Playing the percentages’ leaves every goaltender hopelessly exposed to random chance throughout the game. Goaltenders in the world’s best league do no better in absolute terms when they face 20 shots than 40 shots. They are the same goaltenders, they just fall victim to circumstance and luck.

Simply ‘playing the percentages,’ with an emphasis on blocking from the butterfly, leaves the goaltender’s fate up to pure chance. No goaltender can attempt to consistently out-perform their peers by playing the percentages—at least, not with certainty. Hoping to block 90% of the net while relying on your team to limit quality opportunities will result in mediocrity. The Black Swan events that lead to goals occur randomly and just as frequently facing 15 shots as 50 shots. This has manifested itself in ‘average’ goaltenders’ performances fluctuating unpredictably from game to game and from season to season. In a game where random luck is prevalent, employing a strategy that struggles to adjust to the complexities of a game as dynamic as hockey will result in erratic and unexplainable outcomes.

The Challenge to the Contrarian

This creates a counter-intuitive result: the prototypical, ‘by the book’ goaltender will likely be subjected to greater fluctuations in performance, despite having the technical mastery of the position that suggests a level of control. Instead, it is the contrarian, with no attachment to the ‘proper’ way to make the save, that will achieve more consistent results. The improvisational nature of a Tim Thomas stick save may appear out of control, but his approach to the game will yield more consistent results. The aggressiveness and assertiveness will allow the contrarian to make saves when there is no technical road map to reach the proper position on a Black Swan shot. Consider the attributes necessary to make an incredible save. Physical attributes vary among NHL goaltenders, but not by much. Height, agility, reflexes, and other critical skills for any professional goaltender will cluster around a certain standard. On the other hand, the mental approach to the game can vary between goaltenders by magnitudes. Goaltenders can become robust against the effects of Black Swans by having the creativity to reach pucks ‘technicians’ could not and having the courage to abandon the perceived safety of the butterfly. Decreasing the effects of Black Swans would be huge, and there are no theoretical limitations (unlike physical limitations) that exist. In a game containing the possibility of the extreme, it is the contrarian goaltender that will best be able to prevent goals against.

Leaving the safety of the ‘butterfly style’ can be dangerous for a goaltender. Coaches, managers, analysts, and peers will be quick to realise when a goal could have been stopped by a goaltender passively waiting in their butterfly. These ‘evaluators’ and ‘experts’ have subscribed to the ‘average’ goaltender paradigm for over a decade. After game 5 of the 2011 Stanley Cup Final, Roberto Luongo suggested that the only goal of the game against Tim Thomas would have been “an easy save for (him).” Proactively mixing save strategies does leave the contrarian potentially exposed to the unconventional goal against. Improbable, unconventional saves are great, but coaches and managers really only care about goals against. They can handle them if it was not the fault of the goalie—the perfect shot or improbable bounce that preys upon the passive butterfly goaltender. Just don’t pass up the opportunity to make an easy save and get scored on, contend the experts (luckily, Thomas was able to put together the greatest season of any goaltender in the modern game, so he got a pass). Playing the game freed from the ‘butterfly-first’ doctrine is a leap of faith, but it gives the goaltender the opportunity to contribute something positive to their team: wins.

Consider the great Martin Brodeur—the winningest goaltender in NHL history has often been discredited for playing behind strong defensive clubs while winning games and championships. However, random Black Swan chances have little regard for the number of shots against, as we have seen. So why does Martin Brodeur have the most victories of any goaltender in NHL history? I would give a large amount of credit to his ability to make the ‘key save’ on the unlikely chance against. These saves would not necessarily manifest themselves noticeably at the end of the game or in any statistically significant way—rather they are randomly distributed throughout the game, like Black Swans are. Remember that, while New Jersey has been traditionally strong defensively, they have averaged 16th in the league in scoring during Brodeur’s tenure. With this inconsistent (and at times lethargic) goal support, Brodeur’s win totals remained remarkably consistent. During his prime he recorded at least 37 victories in 11 consecutive seasons. The low scoring years required extreme focus and competency. Where the game could hinge on one great play or bad bounce, Brodeur preserved victory more than any contemporary by being vigilant against the Black Swan chances. You can make the argument the low shot totals (and the subsequent merely ‘good’ save percentage) led to him being overrated considering his absolute success. However, Black Swans are somewhat independent of shots against, and until his detractors understand how three ‘Brodeur-only saves’ were the difference in a 3-2 win in a game where New Jersey gave up only 23 shots, the winningest goaltender of all-time will continue to be regrettably underrated, except for where it counts. No statistical analysis can measure the increased importance of a save to preserve victory compared to a save without that pressure.


I felt it was important to actively think about the strategies that have permeated the goaltending position and the impact they have had on goaltending performance. It was also important to liberate my thinking from too much quantitative analysis, rather focusing on the qualitative relationships between goaltender strategy, the random nature of the position, the goaltenders that consistently exceed the norm, and the goaltenders that will always be products of circumstance. None of this could be done with traditional goaltender metrics; they do not begin to even consider the possibility of the Black Swan opportunity against. Traditional statistics can be manipulated to underrate the winningest goaltender of all-time. Winning is sport’s sole objective and the goaltender always has some influence on winning, so goaltender wins are important. Traditional statistics lead to complacency with ‘average’ goaltending, which is goaltending that adds nothing to the bottom-line—winning. Leaving these statistical constraints behind can help clarify the connection between strategy and the contrarian, then between the contrarian and success.

Based on this philosophical analysis, I believe goaltenders should unsubscribe from the conventional goaltending handbook and aggressively mix their save selection, helping them remain robust against the inevitable Black Swan opportunities against. This will allow them to exceed the ‘expected’ performance, and ultimately win more games.


* A 4 percentage point increase in save percentage is significant; this is analogous to saying goaltenders gave up 48% more goals on the same number of shots in 1982 than in 2011.

* While the butterfly style may be generic, each goaltender has relative strengths and weaknesses. NHL shooters will eventually expose these weaknesses unless the goaltenders can successfully vary their strategy (remain unpredictable).

In the ‘modern’ game-theory example, the goaltender would have to react the vast majority of the time to force the shooter to mix between shooting high or low (which is ideal for the goaltender). By doing so the goaltenders can exert their influence on the shooter, opposed to simply accepting that a great shot or lucky bounce will beat them.

  • A term borrowed from Nassim Nicholas Taleb and his book The Black Swan: The Impact of the Highly Improbable. Black Swans, named after the rare bird, represent the improbable and random occurrences in hockey and in life. Just because we cannot conceive of a particular challenge, or have not prepared for it, does not mean it will not happen. ‘Black Swans’ are unpredictable, can have a large impact (a goal), and are the result of an ecosystem that is far too complex to predict (10 players, a puck, and physics create infinite possibilities). Events are weakly explained after the fact (you held your glove too high) but in reality the causes are much deeper and impossible to predict.

* While I would argue some goaltenders are better equipped to handle ‘Black Swan’ opportunities against them, these difficult, unforeseen events will still be approximately proportionate to the number of goals they give up. NB: Tim Thomas is not included in this list.

This ‘extreme’ case happened 47 times out of the 677 games collectively played.

  • Many of these games saw the goaltender pulled, so the goals against is ‘per appearance’ rather than ‘per game.’ While it may be argued that these goaltenders just ‘didn’t have it’ in these games, I would argue that more often they faced a cluster of bad luck and improbable chances against them. The total sample size is 60 games.

This attitude may explain the regression in Luongo’s game over the last couple of seasons. He once was a 6’3 goaltender with freakishly long limbs that would reach pucks in unconventional and spectacular ways. Now he views himself as a pure positional goaltender that is better off on the goal line than aggressively attacking shots against him. Apparently it is better to look ‘good’ getting scored on multiple times than look ‘bad’ getting scored on once.

The standard deviation is 10 places, basically all over the place, including both leading the league in goals for and finishing last in goals for.

Hockey Analytics, Strategy, & Game Theory

Strategic Snapshot: Isolating QREAM

I’ve recently attempted to measure goaltending performance by looking at the number of expected goals a goaltender faces compared to the goals they actually allow. Expected goals are ‘probabilistic goals’ based on what we have data for (which isn’t everything): if that shot were taken 1,000 times on the average goalie that made the NHL, how often would it be a goal? Looking at one shot there is variance, the puck either goes in or doesn’t, but over the course of a season summing the expected goals gives a little better idea of how the goaltender is performing because we can adjust for the quality of shots they face, helping isolate their ‘skill’ in making saves. The metric, which I’ll refer to as QREAM (Quality Rules Everything Around Me), reflects goaltender puck-saving skill more than raw save percentage, showing more stability within a goalie season.
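As a rough illustration of the bookkeeping behind a metric like this, the sketch below sums per-shot goal probabilities and compares them with goals actually allowed, overall and by a split. It is one plausible reading of the description above, not the actual QREAM implementation; the `Shot` fields, the `side` label, and the sample values are all made up.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Shot:
    xg: float    # model's goal probability for this shot (made-up values below)
    goal: bool   # whether the shot actually went in
    side: str    # hypothetical split label, e.g. "left" or "right"


def goals_prevented(shots):
    """Expected goals faced minus actual goals allowed (positive is good)."""
    expected = sum(s.xg for s in shots)
    allowed = sum(1 for s in shots if s.goal)
    return expected - allowed


def goals_prevented_by_split(shots):
    """The same measure, broken down by the split label."""
    groups = defaultdict(list)
    for s in shots:
        groups[s.side].append(s)
    return {side: goals_prevented(group) for side, group in groups.items()}


sample = [
    Shot(0.05, False, "left"),
    Shot(0.30, True, "left"),
    Shot(0.10, False, "right"),
    Shot(0.25, False, "right"),
]
print(round(goals_prevented(sample), 2))
print({side: round(v, 2) for side, v in goals_prevented_by_split(sample).items()})
```

On four shots the numbers are pure noise, which is the variance point made above: the sums only become informative over a season's worth of shots.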
Goalies doing the splits
Good stuff. We can then use QREAM to break down goalie performance by situations, tactical or circumstantial, to reveal actionable trends. Is goalie A better on shots from the left side or right side? Left shooters or right shooters? Wrist shots, deflections, etc? Powerplay? Powerplay, left or right side? etc. We can even visualise it, and create a unique descriptive look at how each goaltender or team performed.

This is a great start. The next step in confirming the validity of a statistic is looking at how it holds up over time. Is goalie B consistently weak on powerplay shots from the left side? Is that something that can be exploited by looking at the data? Predictivity is important to validate a metric, showing that it can be acted upon and some sort of result can be expected. Unfortunately, year over year trends by goalie don’t hold up in an actionable way. There might be a few persistent trends below, but nothing systemic that would be more prevalent than just luck. Why?

Game Theory (time for some)

In the QREAM example, predictivity is elusive because hockey is not static and all players and coaches in question are optimizers trying their best to generate or prevent goals at any time. Both teams are constantly making adjustments, sometimes strategically and sometimes unconsciously. As a data scientist, when I analyse 750,000 shots over 10 seasons, I only see what happened, not what didn’t happen. If in one season, goalie A underperformed the average on shots from left shooters from the left side of the ice, that would show up in the data, but it would be noticed by players and coaches quicker and in a much more meaningful and actionable way (maybe it was the result of hand placement, lack of squareness, cheating to the middle, defenders who let up cross-ice passes from right to left more often than expected, etc.). The goalie and defensive team would also pick up on these trends and understandably compensate, maybe even slightly over-compensate, which would open up other options for attempting to score, which the goalie would adjust to, and so on until the game reaches some sort of multi-dimensional equilibrium (actual game theory). If a systemic trend did continue then there’s a good chance that that goalie will be out of the league. Either way, trying to capture a meaningful actionable insight from the analysis is much like trying to capture lightning in a bottle. In both cases, finding a reliable pattern in a game where both sides are constantly adjusting and counter-adjusting is very difficult.

This isn’t to say the analysis can’t be improved. My expected goal model has weaknesses and will always have limitations due to data and user error. That said, I would expect the insights of even a perfect model to be arbitraged away. More shockingly (since I haven’t looked at this in-depth, at all), I would expect the recent trend of NBA teams fading the use of mid-range shots to reverse in time as more teams counter that with personnel and tactics; then a smart team could probably exploit that set-up by employing slightly more mid-range shots, and so on, until a new equilibrium is reached. See you all at Sloan 2020.

Data On Ice

The role of analytics is to provide a new lens to look at problems and make better-informed decisions. There are plenty of examples of applications at the hockey management level to support this: data analytics have aided draft strategy and roster composition. But bringing advanced analytics to on-ice strategy will likely continue to chase adjustments players and coaches are constantly making already. Even macro-analysis can be difficult once the underlying inputs are considered.
An analyst might look at strategies to enter the offensive zone, where you can either forfeit control (dump it in) or attempt to maintain control (carry or pass it in). If you watched a sizable sample of games across all teams and a few different seasons, you would probably find that you were more likely to score a goal if you tried to pass or carry the puck into the offensive zone than if you dumped it. Actionable insight! However, none of these plays occurs in a vacuum – a true A/B test would have the offensive players randomise between dumping it in and carrying it. But the offensive player doesn’t randomise; they are making what they believe to be the right play at that time considering things like offensive support, defensive pressure, and their own and their teammates’ shift length. In general, when they dump the puck, they are probably trying to make a poor position slightly less bad and get off the ice. A randomised attempted carry-in might be stopped and result in a transition play against. So, the insight of not dumping the puck should be changed to ‘have the 5-player unit be in a position to carry the puck into the offensive zone,’ which encompasses more than a dump/carry strategy. In that case, this isn’t really an actionable, data-driven strategy, rather an observation. A player who dumps the puck more often likely does so because they struggle to generate speed and possession from the defensive zone, something that would probably be reflected in other macro-stats (i.e. the share of shots or goals they are on the ice for). The real insight is the player probably has some deficiencies in their game. And this is where the underlying complexity of hockey begins to grate at macro-measures of hockey analysis: there are many little games within the game, player-level optimisation, and second-order effects that make capturing true actionable, data-driven insight difficult.[1]
It can be done, though in a round-about way. Like many, I support the idea of using (more specifically, testing) 4 or even 5 forwards on the powerplay. However, it’s important to remember that analysis showing a 4F powerplay to be more effective is more of a representation of the personnel of the teams that elect to use that strategy, rather than the effectiveness of that particular strategy in a vacuum. And teams will work to counter by maximising their chance of getting the puck and attacking the forward on defence by increasing aggressiveness, which may be countered by a second defenseman, and so forth.
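The selection effect in the zone-entry example can be made concrete with a small deterministic calculation. Every number below is hypothetical: two game situations, assumed goal probabilities for each entry type, and an assumed rate at which players choose to carry in each situation.

```python
# All values are assumptions for illustration, not measured NHL rates.
SITUATION_SHARE = {"good": 0.5, "poor": 0.5}   # how often each situation arises
CARRY_RATE = {"good": 0.9, "poor": 0.2}        # how often players choose to carry
GOAL_PROB = {
    ("good", "carry"): 0.030, ("good", "dump"): 0.015,
    ("poor", "carry"): 0.005, ("poor", "dump"): 0.008,
}


def observed_goal_rate(entry_type):
    """Goal rate an analyst would see for one entry type in self-selected data."""
    goals = attempts = 0.0
    for situation, share in SITUATION_SHARE.items():
        rate = CARRY_RATE[situation] if entry_type == "carry" else 1 - CARRY_RATE[situation]
        attempts += share * rate
        goals += share * rate * GOAL_PROB[(situation, entry_type)]
    return goals / attempts


def policy_goal_rate(carry_rate):
    """Goals per entry if players carried at the given rate in each situation."""
    return sum(
        share * (carry_rate[s] * GOAL_PROB[(s, "carry")]
                 + (1 - carry_rate[s]) * GOAL_PROB[(s, "dump")])
        for s, share in SITUATION_SHARE.items()
    )


print(round(observed_goal_rate("carry"), 4), round(observed_goal_rate("dump"), 4))
print(round(policy_goal_rate(CARRY_RATE), 5),
      round(policy_goal_rate({"good": 1.0, "poor": 1.0}), 5))
```

In this toy setup carries look almost three times as productive as dumps in the observed data, yet forcing every entry to be a carry scores slightly less than letting players choose, because the raw comparison mostly reflects the situations in which each play is selected.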

Game Theory (revisited & evolved)

Where analytics looks to build strategic insights on a foundation of shifting sand, there’s an equally interesting force at work – evolutionary game theory. Let’s go back to the example of the number of forwards employed on the powerplay, teams can use 3, 4, or 5 forwards. In game theory, we look for a dominant strategy first. While self-selected 4 forward powerplays are more effective a team shouldn’t necessarily employ it if up by 2 goals in the 3rd period, since a marginal goal for is worth less than a marginal goal against. And because 4 forward powerplays, intuitively, are more likely to concede chances and goals against than 3F-2D, it’s not a dominant strategy. Neither are 3F-2D or 5F-0D.
Thought experiment. Imagine in the first season, every team employed 3F-2D. In season 2, one team employs a 4F-1D powerplay 70% of the time. They would have some marginal success because the rest of the league is configured to oppose 3F-2D. In season 3 this strategy replicates, and more teams run a 4F-1D, in line with evolutionary game theory. Eventually, say in season 10, more teams might run a 4F-1D powerplay than 3F-2D, and some even 5F-0D. However, penalty kills will also adjust to counter-balance and the game will continue. There may or may not be an evolutionarily stable strategy where teams are best served mixing strategies, like you would playing rock-paper-scissors.[2] I imagine the proper strategy would depend on score state (primarily), and respective personnel.
You can imagine a similar game representing the function of the first forward in on the forecheck. They can go for the puck or hit the defenseman – always going for the puck would let the defenseman become too comfortable, letting them make more effective plays, while always hitting would take the forechecker out of the play too often, conceding too much ice after a simple pass. The optimal strategy is likely randomising, say, hitting 20% of the time, factoring in gap, score, personnel, etc.
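The forecheck game above can be sketched as a 2x2 zero-sum game and solved for the mix that leaves the opponent indifferent. This is only an illustration: the payoff numbers are invented (they are not estimated from any data), the defenseman's two options are a simplification, and `mixed_strategy_2x2` is a hypothetical helper name.

```python
from fractions import Fraction

def mixed_strategy_2x2(a, b, c, d):
    """Equilibrium mix for a 2x2 zero-sum game without a pure-strategy saddle point.

    Payoffs are to the row player:
                 col 0   col 1
        row 0      a       b
        row 1      c       d
    Returns (p, q, value): p = P(row 0), q = P(col 0), value = expected payoff.
    """
    a, b, c, d = (Fraction(x) for x in (a, b, c, d))
    # If maximin equals minimax there is a saddle point and no need to mix.
    maximin = max(min(a, b), min(c, d))
    minimax = min(max(a, c), max(b, d))
    if maximin == minimax:
        raise ValueError("game has a pure-strategy saddle point")
    denom = a - b - c + d
    p = (d - c) / denom          # makes the column player indifferent
    q = (d - b) / denom          # makes the row player indifferent
    value = p * a + (1 - p) * c  # row player's expected payoff at equilibrium
    return p, q, value

# Invented payoffs to the forechecker.
# Rows: hit the defenseman, go for the puck.
# Columns: defenseman moves the puck quickly, defenseman holds it to make a play.
p_hit, q_quick, value = mixed_strategy_2x2(-2, 2, 1, 0)
print(float(p_hit), float(q_quick), float(value))
```

With these made-up payoffs the forechecker hits one time in five. Change the payoffs (for score state, gap, or personnel) and the mix changes with them, which is the point of the paragraph above.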

A More Robust (& Strategic) Approach

Even if it seems a purely analytics-driven strategy is difficult to conceive, there is an opportunity to take advantage of this knowledge. Time is a more robust test of on-ice strategies than p-values. Good strategies will survive and replicate, poor ones will (eventually and painfully) die off. Innovative ideas can be sourced from anywhere and employed in minor-pro affiliates where the strategies’ effects can be quantified in a more controlled environment. Each organisation has hundreds of games a year in their control and can observe many more. Understanding that building an analytical case for a strategy may be difficult (coaches are normally sceptical of data, maybe intuitively for the reasons above), analysts can sell the merit of experimenting and measuring, giving the coach major ownership of what is tested. After all, it pays to be first in a dynamic game such as hockey. Bobby Orr changed the way blueliners played. New blocking tactics (and equipment) led to improved goaltending. Hall-of-Fame forward Sergei Fedorov was a terrific defenseman on some of the best teams of the modern era.[3] Teams will benefit from being the first to employ (good) strategies that other teams don’t see consistently and don’t devote considerable time preparing for.
The game can also improve using this framework. If leagues want to encourage goal scoring, they should encourage new tactics by incentivising goals. I would argue that the best and most sustainable way to increase goal scoring would be to award AHL teams 3 points for scoring 5 goals in a win. This would encourage offensive innovation and heuristics that would eventually filter up to the NHL level. Smaller equipment or bigger nets are susceptible to second-order effects. For example, good teams may slow down the game when leading (since a marginal goal for is now worth less than a marginal goal against), making the on-ice product even less exciting. Incentives and innovation work better than micro-managing.

In Sum

The primary role of analytics in sport and business is to deliver actionable insights using the tools at their disposal, whether that is statistics, math, logic, or whatever. With current data, it is easier for analysts to observe results than to formulate superior on-ice strategies. Instead of struggling to capture the effect of strategy in biased data, they should use this to their advantage and look at these opportunities through the prism of game theory: testing, measuring, and letting the best strategies bubble to the top. Even the best analysis might fail to pick up on some second-order effect, but thousands of shifts are less likely to be fooled. The data is too limited in many ways to paint the complete picture. A great analogy came from football (soccer) analyst Marek Kwiatkowski:

Almost the entire conceptual arsenal that we use today to describe and study football consists of on-the-ball event types, that is to say it maps directly to raw data. We speak of “tackles” and “aerial duels” and “big chances” without pausing to consider whether they are the appropriate unit of analysis. I believe that they are not. That is not to say that the events are not real; but they are merely side effects of a complex and fluid process that is football, and in isolation carry little information about its true nature. To focus on them then is to watch the train passing by looking at the sparks it sets off on the rails.

Hopefully, there will soon be a time where every event is recorded, and in-depth analysis can capture everything necessary to isolate things like specific goalie weaknesses, optimal powerplay strategy, or best practices on the forecheck. Until then there are underlying forces at work that will escape detection. But it’s not all bad news: the best strategy is to innovate and measure. This may not be groundbreaking to the many innovative hockey coaches out there, but it can help focus the smart analyst on delivering something actionable.



[1] Is hockey a simple or complex system? When I think about hockey and how to best measure it, this is a troubling question I keep coming back to. A simple system has a modest amount of interacting components and they have clear relationships to other components: say, when you are trailing in a game, you are more likely to out-shoot the other team than you would otherwise. A complex system has a large number of interacting pieces that may combine to make these relationships non-linear and difficult to model or quantify. Say, when you are trailing the pressure you generate will be a function of time left in the game, respective coaching strategies, respective talent gaps, whether the home team is line matching (presumably to their favor), in-game injuries or penalties (permanent or temporary), whether one or both teams are playing on short rest, cumulative impact of physical play against each team, ice conditions, and so on.

Fortunately, statistics are such a powerful tool because a lot of these micro-variables even out over the course of the season, or possibly the game, to become net neutral. Students learning about gravitational force don’t need to worry about molecular forces within an object; the system (e.g. a block sliding down an inclined slope) can be separated from the complex and simplified. Making the right simplifying assumptions, we can do the same in hockey, but we do so at the risk of losing important information. More convincingly, we can also attempt to build out the entire state-space (e.g. different combinations of players on the ice) and use machine learning to find patterns linking the features to winning hockey games. This is likely being leveraged internally by teams (who can generate additional data) and/or professional gamblers. However, even with machine learning techniques applied, there appeared to be a theoretical upper bound on single-game prediction accuracy of only about 62%. The rest, presumably, is luck. Even if this upper bound softens with more data, such as biometrics and player tracking, prediction in hockey will still be difficult.

It seems to me that hockey is suspended somewhere between the simple and the complex. On the surface, there’s a veneer of simplicity and familiarity, but perhaps there’s much going on underneath the surface that is important but can’t be quantified properly. On a scale from simple to complex, I think hockey is closer to complex than simple, but not as complex as the stock market, for example, where upside and downside are theoretically unlimited and not bound by the rules of a game or a set amount of time. A hockey game may be 60 on a scale of 0 (simple) to 100 (complex).

[2] Spoiler alert: if you perform the same thought experiment with rock-paper-scissors you arrive at the right answer – randomise between all 3, each 1/3 of the time – unless you are a master of psychology and can read those around you. This obviously has a closed form solution, but I like visuals better:
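A quick numeric check of that answer, as a minimal sketch: against a uniform 1/3 mix every pure strategy earns exactly zero, so there is nothing to exploit, while any lopsided mix hands the opponent a profitable reply. The function names and the "rock-heavy" mix are illustrative only.

```python
from fractions import Fraction

# Payoff to the row player; rows and columns are rock, paper, scissors.
PAYOFF = [
    [0, -1, 1],   # rock: loses to paper, beats scissors
    [1, 0, -1],   # paper: beats rock, loses to scissors
    [-1, 1, 0],   # scissors: loses to rock, beats paper
]

def payoffs_against(mix):
    """Expected payoff of each pure strategy against an opponent's mix."""
    return [sum(PAYOFF[i][j] * mix[j] for j in range(3)) for i in range(3)]

def best_response_gain(mix):
    """The most an opponent can win per round by exploiting `mix`."""
    return max(payoffs_against(mix))

uniform = [Fraction(1, 3)] * 3
rock_heavy = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]

print(payoffs_against(uniform))        # every reply earns zero
print(best_response_gain(rock_heavy))  # paper profits against a rock-heavy mix
```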

[3] This likely speaks more to personnel than tactics; Fedorov could have been peerless. However, I think of football, where position changes are more common, e.g. a forgettable college receiver at Stanford switched to defence halfway through his college career and became a top player in the NFL: Richard Sherman. Julian Edelman was a college quarterback and is now a top receiver on the Super Bowl champions. Test and measure.

CrowdScout Score and Salary – A Study in Market Value

It’s All Relative

In a salary cap league, how teams spend their finite budget has become very important to any present or future success.[1] The relative value of a contract is often more important than the absolute value of the contract. Within a very strict set of contract rules, teams will devote a share of their allotted cap space to a player at a price dependent on a number of market forces. The goal of this study is to determine what that price should be considering some of those market forces to compare to the actual salary.

So, how do we go about determining the market rate?[2] First, it helps to make some simplifying assumptions – we expect the cap-hit or AAV (Annual Average Value of the contract) to probably be a function of:

  • Position – different positions are valued slightly differently. Any contract negotiation anchor would consist of comparables playing the same position.
  • Age – the NHL’s not-so-free labor market puts significant restrictions and limitations on young players’ earnings. Thus, any analysis looking at market rate should factor in age.
  • Skill / ability / comprehensive contribution to winning – the player’s perceived ability will determine market value. Unlike age and position, skill is extremely difficult to accurately gauge and forecast (since many deals are multi-year). This will pose the biggest obstacle to a clean quantitative analysis. Across all sports, teams consistently misvalue player ability, most notoriously over-valuing their ability and overpaying them.
  • Contract Length (Term) – There are different interactions between age, term, and AAV. A short contract length might signal less money (a ‘show me’ bridge contract) for a young RFA or more money (player trading longer term for higher AAV) for an older UFA. Data courtesy of generalfanager.com.
  • Projected Salary Cap at Contract Date – A $5M AAV contract signed in the summer of 2009 is not the same as a contract signed in the summer of 2016. Managers are forward-looking allocating a set percentage of their expected salary cap to a player rather than an absolute amount. Data courtesy of generalfanager.com.

Finding Value

To determine how each player’s cap-hit stacks up against what we would expect, we must create a formula or algorithm to return each player’s expected AAV. Finding the difference between the expected AAV and actual AAV – or residual – would signal the relative value of their cap-hit. Spending a million less than market forces would expect (or, more specifically, our model would predict) allows the team to either save that money or invest it elsewhere.
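As a small sketch of that residual, using the sign convention adopted later in this study (expected AAV less actual AAV, so a positive number is surplus value to the team). The player names and dollar figures are invented, and `surplus_value` is a hypothetical helper.

```python
def surplus_value(expected_aav, actual_aav):
    """Residual in $M: positive means the player costs less than the model expects."""
    return expected_aav - actual_aav

# Invented contracts, in $M.
contracts = {
    "Player A": {"expected": 5.0, "actual": 4.5},
    "Player B": {"expected": 2.0, "actual": 3.25},
}

residuals = {
    name: surplus_value(c["expected"], c["actual"])
    for name, c in contracts.items()
}
# Summing player residuals gives a rough team-level view of spending efficiency.
team_surplus = sum(residuals.values())
print(residuals, team_surplus)
```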

A model can be built using the features discussed above, predicting AAV as a function of age, position, and ability – the catch-all for talent or skill or whatever. But how do we comprehensively quantify ability, the age-old question?

One Feature to Rule Them All

My baseline method will be to use GAR (Goals Above Replacement) from war-on-ice.com to help predict salary. GAR is a notable attempt to assign numerical credit to players based on their team winning, which proves a decent proxy for ability. However, GAR or any ‘be all, end all’ stat has limitations – injuries interrupt accumulation of goals above replacement and defensive contributions are very difficult to quantify, among other things. No algorithm is omnipotent, but GAR is a very helpful attempt at answering this question.

In addition to GAR, I will use data collected from my project, CrowdScout Sports, designed to smartly aggregate user judgment. It has been in beta over the course of the 2015-16 season with over 100 users making over 32,000 judgments on players relative to each other. With advanced metrics provided, a diversity of users, and the best forecasters gaining influence, I hope the data provides an increasingly reliable, comprehensive player rating metric. The rating is intended to answer the question posed to the user as they are prompted to rank two randomly chosen players – if the season started today, which player would you choose if the goal were to win a championship.[3]

Both metrics will be used as a proxy for ability when trying to explain AAV (data courtesy of generalfanager.com). Both metrics are designed not to be influenced by cap-hit, a necessity for the model to properly explain cap-hit.

GAR Linear Model

First, let’s explore the relationship between AAV and term, salary cap expectations, position, age, and ability using the GAR metric. Using 2014-2015 data[4] from war-on-ice.com and their GAR model, a dataset containing player features at the onset of the 2014-15 season was assembled. The AAV of the upcoming 2015-16 season (where the player was signed prior to the season) was targeted. Any incomplete records were removed. The age variable was transformed into a bucketed variable since there isn’t a linear relationship between age and AAV, rather different levels of pay by age. The natural age buckets in relation to cap-hit are:

  • 18-21 – Entry Level Contract (ELC) players
  • 22-24 – A mix of ELCs, bridge contracts, and a few high fliers who get paid
  • 25-27 – RFA controlled, second contract players in their early prime
  • 28-31 – UFA contract years (likely higher cap-hit) but players likely to still be in their prime
  • 32-35 – UFA contract years with some expected decline in ability
  • Over 35 – Declining ability compounded with specific contract rules for 35 plus players
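The bucketing above can be sketched as a lookup plus the 1/0 indicators a linear model needs. The labels follow the list above (the variable names in the actual model output differ slightly), and `age_bucket` and `one_hot` are hypothetical helper names.

```python
from bisect import bisect_left

# Upper age (inclusive) of each bucket in the list above, and a label per bucket.
UPPER_BOUNDS = [21, 24, 27, 31, 35]
LABELS = ["18-21", "22-24", "25-27", "28-31", "32-35", "over 35"]

def age_bucket(age):
    """Map an integer age to its bucket label."""
    if age < 18:
        raise ValueError("ages under 18 fall outside the buckets")
    # bisect_left returns the index of the first upper bound >= age.
    return LABELS[bisect_left(UPPER_BOUNDS, age)]

def one_hot(age):
    """1/0 indicator per bucket, the form used when scoring a linear model."""
    label = age_bucket(age)
    return {name: int(name == label) for name in LABELS}

print(age_bucket(24), age_bucket(28), age_bucket(36))
print(one_hot(30))
```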

The 924 remaining players were then split into 10 folds to cross-validate the Generalized Linear Model (GLM) – iteratively training on 90% of the data and testing out of sample on the remaining unseen 10% of data, then combining the 10 models. The cross-validated model is then used to score the original dataset – the coefficients from the GLM are multiplied by each player’s individual variables – age (1/0 for each bucket), position (1/0 for each position), contract length, projected cap, and GAR. The outcome is the expected AAV.
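The cross-validation loop described above can be sketched in a few lines. This is a simplified stand-in, not the model used here: it fits a one-predictor least-squares line rather than the full GLM, runs on synthetic data generated inside the snippet, and pools the held-out errors into one RMSE. All names are illustrative.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def k_fold_rmse(xs, ys, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, and pool the errors."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k near-equal, disjoint folds
    squared_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        intercept, slope = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        # Score only the unseen fold with the model trained on the rest.
        squared_errors += [(ys[i] - (intercept + slope * xs[i])) ** 2
                           for i in held_out]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Synthetic data only: AAV ($M) rising with contract length, plus noise.
rng = random.Random(42)
terms = [rng.randint(1, 8) for _ in range(924)]
aavs = [0.5 + 0.67 * t + rng.gauss(0, 1.0) for t in terms]
cv_rmse = k_fold_rmse(terms, aavs)
print(round(cv_rmse, 3))
```

Because every prediction is made on a fold the model did not see, the resulting RMSE is an out-of-sample estimate rather than a measure of fit to the training data.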

Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 836, 837, 838, 838, 838, 836, ...
Resampling results:

RMSE       Rsquared
1.10514    0.7112609

                   Estimate    Std. Error   t value   Pr(>|t|)
(Intercept)        -3.109087   0.838919     -3.706    0.0002 ***
GAR                 0.066895   0.005919     11.301    < 2e-16 ***
`age_group21-24`   -0.022097   0.165447     -0.134    0.8938
`age_group24-28`    0.473283   0.163387      2.897    0.0039 **
`age_group28-31`    0.812176   0.17113       4.746    0 ***
`age_group31-35`    1.078278   0.179754      5.999    0 ***
age_groupgt35       1.819195   0.21776       8.354    0 ***
PosD                0.129242   0.099992      1.293    0.1965
PosG                0.218529   0.140888      1.551    0.1212
PosW               -0.112796   0.095704     -1.179    0.2389
Contract Length     0.673488   0.021353     31.541    < 2e-16 ***
Projected.Cap       0.041061   0.011842      3.467    0.0006 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Our simple GLM explains about 71% of the variance in cap-hit. GAR, Contract Length, and Projected Cap are all strong positive predictors. Each successive age bucket is generally paid more. Of note, the 22-24 age bucket is the weakest age coefficient since at that age some players are on their ELC while others have earned legitimate star contracts. In this model, position wasn’t a significant predictor, although it signals defensemen and goaltenders probably go at a premium to centers, while wingers take a discount.
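To make the scoring step concrete, here is a sketch that applies the coefficients reported above to one hypothetical player. Several things are assumptions on my part: that AAV and the projected cap are in $M, that the link is the identity (a plain linear predictor), and that the youngest age group and centers are the baseline levels absorbed by the intercept. The example player is invented.

```python
# Coefficients copied from the GLM output above (AAV assumed to be in $M).
COEF = {
    "intercept": -3.109087,
    "GAR": 0.066895,
    "age_group": {"21-24": -0.022097, "24-28": 0.473283, "28-31": 0.812176,
                  "31-35": 1.078278, "gt35": 1.819195},
    "pos": {"D": 0.129242, "G": 0.218529, "W": -0.112796},
    "contract_length": 0.673488,
    "projected_cap": 0.041061,
}

def expected_aav(gar, age_group, pos, contract_length, projected_cap):
    """Linear predictor: intercept plus each coefficient times its variable.

    Levels without a coefficient (assumed here to be the youngest age group
    and centers) are baselines and add nothing beyond the intercept.
    """
    return (COEF["intercept"]
            + COEF["GAR"] * gar
            + COEF["age_group"].get(age_group, 0.0)
            + COEF["pos"].get(pos, 0.0)
            + COEF["contract_length"] * contract_length
            + COEF["projected_cap"] * projected_cap)

# Invented player: a winger in the 28-31 group with a GAR of 10 on a 4-year
# deal, signed against a projected cap of 71.4 (units assumed to be $M).
example = expected_aav(gar=10, age_group="28-31", pos="W",
                       contract_length=4, projected_cap=71.4)
print(round(example, 2))
```

Subtracting the player's actual AAV from this figure gives the residual used in the rest of the analysis.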

The player-level residuals (expected AAV less actual AAV, a positive value representing surplus value to the team) are plotted below. The model would be stronger but for some significant outliers – Jonathan Toews, Patrick Kane, Thomas Vanek, and Tyler Myers were all paid about $4M more than the model expected. Conversely, Duncan Keith, Roberto Luongo, and Marian Hossa were all underpaid by at least an expected $4M. Like most linear models, it had trouble predicting a non-normal target. That is, the distribution of AAV values had a skew to the right, where the model struggled to pick up ‘extreme’ values. Transforming AAV into a log of AAV did not increase predictive power.


Crowd Wisdom

The next iteration of the GLM was run using the CrowdScout score as a proxy for ability. A few notes on the inclusion of this data:

  • What is this metric? It represents the relative strength of that player’s Elo rating compared to the entire population at the time of analysis. The Elo rating is the cumulative result of over 100 scouts selecting between two randomly generated (but generally similar) players some 32,000 times. Each of these selections feeds into an algorithm that adjusts each player’s score based on the prior probability of the match-up and the k-factor given to the user – the more active and accurate that user has been historically, the greater their influence.
  • I think skepticism should be applied to any analysis performed on data acquired through some level of effort of the owner. That said, the CrowdScout data is the result of my own engineering project and is intended to aid (fantasy) managerial decision-making, rather than provide advanced analytical insight. Any clean, methodologically tight analysis would be a bonus.
  • There is a concern of collinearity in this analysis – since it is possible a subset of users associated higher salary with better ability, opposed to the reverse. Conversely, an obviously overpaid player can be under-rated due to an emotional discounting of their ability. For the purpose of this analysis, we will assume the effects neutralize each other and in aggregate AAV did not significantly impact the CrowdScout score.[5] There will obviously be a correlation between player score and AAV, but that does not imply causation.
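The Elo update described in the first note above can be sketched with the textbook formula. The 400-point scale, the starting ratings, and the k-factor values are conventional chess-style placeholders, not the settings the CrowdScout platform actually uses, and the function names are illustrative.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Standard Elo prior probability that player A is picked over player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))

def elo_update(rating_a, rating_b, a_chosen, k_user):
    """Return both ratings after one user picks between A and B.

    k_user is the k-factor assigned to the user: a larger value gives that
    user's pick more influence on the ratings.
    """
    prior_a = expected_score(rating_a, rating_b)
    outcome_a = 1.0 if a_chosen else 0.0
    delta = k_user * (outcome_a - prior_a)
    return rating_a + delta, rating_b - delta

# An even match-up judged by a high-influence and a low-influence user.
trusted = elo_update(1500, 1500, a_chosen=True, k_user=32)
casual = elo_update(1500, 1500, a_chosen=True, k_user=8)
# Picking the lower-rated player is a surprise, so the ratings move further.
upset = elo_update(1400, 1600, a_chosen=True, k_user=32)
print(trusted, casual, upset)
```

The update is zero-sum between the two players, and its size scales with both how surprising the pick was and how much weight the user carries.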

With the CrowdScout data, I kept all players from the 2015-16 season who had been judged at least 70 times, effectively dropping players who did not spend a significant amount of time on an NHL roster or didn’t receive many implied ratings from a diverse set of users. A dataset containing position, age bucket (same buckets as GAR Linear Model) as of 10/1/2015[6], and CrowdScout score as of 5/25/2016 was constructed for 548 players. A model was then built cross-validating 10 folds from the data, testing each model on unseen, out of sample subsets.

Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 494, 494, 493, 494, 492, 492, ...
Resampling results:

RMSE       Rsquared
1.039983   0.7632717

                       Estimate    Std. Error   t value   Pr(>|t|)
(Intercept)            -4.059971   0.99779      -4.069    0.0001 ***
CrowdScout Score        0.040156   0.002344     17.129    < 2e-16 ***
age_group21-24         -0.048386   0.387105     -0.125    0.90057
age_group24-28          0.716755   0.378693      1.893    0.05894 .
age_group28-31          1.148196   0.386094      2.974    0.00307 **
age_group31-35          1.530753   0.389492      3.93     0.000096 ***
age_groupgt35           2.2832     0.422219      5.408    0.0000000963 ***
PosD                   -0.151021   0.123711     -1.221    0.22272
PosG                   -0.050527   0.171222     -0.295    0.76803
PosW                   -0.126127   0.122439     -1.03     0.30342
Term                    0.474544   0.027242     17.419    < 2e-16 ***
Projected.Cap.K.Date    0.042666   0.013781      3.096    0.00206 **

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The same model methodology using CrowdScout score as a proxy for ability explains about three-quarters of AAV.  Like the GAR model, ‘ability’ has a strong positive relationship with AAV. Pay increases with significant jumps in expected pay from 21-24 to 24-28 and then again as players hit unrestricted free agency at around 28. Goaltenders and wingers are likely expected to have their AAV discounted, all else equal, although the relationship isn’t significant.

Using CrowdScout as a proxy for ability creates a better fitting model compared to using GAR. This is consistent with what we would expect to see since CrowdScout data doesn’t have to worry about players missing games due to injury. This is a study into what we would expect players to be paid – rather than what players should be paid – therefore the CrowdScout score is very likely baking in some reputational assessments, leading to a stronger relationship with cap-hit. It’s also possible that crowd wisdom is able to determine the impact defensive prowess has on comprehensive ability better than most public data.

Player-Level Residuals:


Team-Level Residuals:


This analysis also measures spending efficiency based on the 2015-16 AAV and ability, because the CrowdScout Score was not available at the start of the season. However, we can create a predicted CrowdScout Score from the 2014-15 season to hold up against 2015-16 AAV, since teams can only act on past performance and project out.

Paid Against the Machine

The original goal of the analysis was to compare player cap-hit to the expected cap-hit. A simple linear model explaining AAV as a function of age, position, term, projected cap at the time of the deal, and CrowdScout score does a good job predicting cap-hit. However, we can also explore additional modeling methods, increasing the depth of interactions between variables (e.g. age and draft year) and strengthening the predictive power. I will make an adjustment to the CrowdScout Score and use a machine learning model which will be able to handle the additional interactions between features:

  • Predicted CrowdScout Score – As outlined elsewhere, the CrowdScout Score can be reliably predicted using on-ice metrics. I will score each player’s 2014-15 statistics from puckalytics.com with the GLM and Random Forest models and take the average of the predicted scores. This will replace the actual CrowdScout Score in the model, which can be biased.
  • Age (as of season start, 10/1/2015) – Move from a strictly bucketed age to a continuous age variable, to help aid the different interactions. This would not work in a linear model, Jagr would mess everything up.
  • Contract Length – Length has proved to be a key explanatory variable. Data courtesy of generalfanager.com.
  • Projected Salary Cap at Contract Date – Also a key explanatory variable. Data courtesy of generalfanager.com.
  • Drafted Boolean – The interaction between whether the player was drafted or not, term, and age should help the model work out whether the player is on an ELC, 2nd contract, or UFA contract.

In order to handle interactions between the new variables in the model, an ensemble of regression trees will be used – known as the Random Forest algorithm. A Random Forest is an ensemble model, creating decision trees from randomized subsets of variables and observations; each ‘tree’ is then considered when scoring or predicting an observation. The advantage of this algorithm is that it is extremely powerful. The disadvantage is that it is basically a black box: there are no clean, interpretable parameters to say ‘when all else is equal we expect a player moving from the 31-35 age group to over 35 to be paid about $500k more’ like in a GLM.

A 500-tree model was able to bring the RMSE under 0.5, with an R² of close to 0.95.

Despite the lack of coefficients, we can also take a peek under the hood to check how important each variable is in the algorithm’s decision-making.


The CrowdScout Score and Term variables are the most important variables in the Random Forest model when explaining AAV. That is, when they are used to create a ‘tree’ or decision, they cumulatively reduce the sum of squared residuals more than the other variables. Age, which should work in tandem with draft history and term, was also important. Projected Cap had some influence, Draft History even less so. Team salary and position (consistent with the linear models) were the least important, having no influence in the enhanced model, and were dropped.
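The idea behind that importance measure can be shown for a single split: how much does the sum of squared residuals fall when the data is divided on one variable? A forest accumulates such reductions over many splits and trees; this sketch computes just one, on made-up toy data, with illustrative function names.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split_reduction(feature, target):
    """Largest drop in SSE from splitting `target` at one `feature` threshold."""
    parent = sse(target)
    best = 0.0
    # Try every threshold that leaves at least one observation on each side.
    for threshold in sorted(set(feature))[:-1]:
        left = [t for f, t in zip(feature, target) if f <= threshold]
        right = [t for f, t in zip(feature, target) if f > threshold]
        best = max(best, parent - sse(left) - sse(right))
    return best

# Made-up toy data: AAV ($M) tracks term closely and a drafted flag barely at all.
aav = [1.0, 1.0, 2.0, 5.0, 6.0, 6.0]
term = [1, 1, 2, 5, 6, 8]
drafted = [1, 0, 1, 0, 1, 1]

reductions = {
    "term": best_split_reduction(term, aav),
    "drafted": best_split_reduction(drafted, aav),
}
print(reductions)
```

In this toy set, splitting on term removes most of the squared error while splitting on the drafted flag removes very little, which is the sense in which one variable is "more important" than another.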

Note, when the 2014-15 GAR was added to a dataset of non-rookie players and added to the Random Forest model, the importance of GAR was around that of age and did not increase the performance of the model.[7]

The Random Forest model still has trouble predicting very high cap-hits. For example, Patrick Kane and Jonathan Toews and their AAV of $10.5M are considered to be overpaid by over $1M when compared to market value, Toews slightly more with a 78 predicted CrowdScout Score compared to Kane’s 86. With a predicted CrowdScout score of 88, Alex Ovechkin makes $1.2M more than the model would predict. On the flip side, Justin Abdelkader was underpaid by about $2M in the Random Forest model last season. Interestingly, this summer he received a raise of almost the same amount. Patrick Eaves was also underpaid last year by over a million. He was notably underpaid in both GLM models, using Elo and GAR – sporting a healthy predicted CrowdScout score of 58 and 2014-15 GAR of 13.8, he was a 31-year-old winger paid a paltry $1.15M. Other players making about a million less than predicted during the 2015-16 season were Morgan Rielly, Mattias Ekholm, and Kyle Okposo – all of whom received healthy raises this summer.


At a team level, the Islanders, Hurricanes, and Predators led the way in contracting players for less than market value last year. The Islanders received strong value from pending free agents Nielsen and Okposo. The Hurricanes had positive value across the board less Skinner. The Predators are frugal by design, extracting value from their young defense. Note that this analysis fails to include goaltending, where Rinne and Ward would move each team down.

The Avalanche, Flames, and Rangers had the worst value from their contracts. Colorado has very few good contracts when compared to the market. The Flames had a few bad contracts on defense and did not receive any sort of bonus from having top players on ELCs. The Rangers were also pulled down by an overpaid defense.

Also note that the error terms here are small and it wouldn’t take much to move a team up or down the rankings. It also demonstrates that the future is tough to predict and few managers can avoid making salary allocation errors every now and then.



It is critical that NHL franchises effectively manage their salary cap in order to be viable. It appears a model can explain about 95% of the market for NHL talent. This feels about right: some deals are visibly off from the start, some valuations will change with time, but most of the time teams and agents are in line with what the market would expect as a function of the player’s age, draft year, position, term, team salary, and ability. In this study, it appears holding up data from the CrowdScout project to objective on-ice features provided a good proxy for ability.

The Random Forest model is quite strong, with 5% of contracts left unexplained. Some share of this is mis-valuation of the player and market, some of it is inaccuracies of the CrowdScout rating and modeling, some of it might be unexplainable (discount to stay close to family, injury or character concerns, etc.). We are specifically interested in quantifying the first term – how teams might misvalue certain players. With a relatively small error term, it is possible the majority of these residuals are made up of the unquantifiable and the majority of team-level differences is noise. Eye-balling teams in the top 5 and bottom 5 by spending efficiency passed the sniff test, but most managers and agents settle on deals that are in line with the league market.

Finally, it’s important to remember this is a study in what we expect a player’s cap-hit to be given market conditions, rather than what they should make in a free-market NHL. Players on ELCs often provide teams very good value relative to their contract, but in this analysis there is no bonus for production from ELCs since the player age and contract length often signaled when players are likely to be on an ELC. The expected AAV is also calculated with perfect information at the start of the 2015-16 season, where deals have to project out future performance during contract discussions. This alternative analysis might be looked at in the near future, expecting considerably larger error terms – longer timelines introduce more uncertainty.

It’s also important to remember that this analysis leans on ever-maturing data from the CrowdScout project. As expected, it contains enough reputational information to help build a stronger model than using GAR from war-on-ice.com as a proxy for ability. It is possible that this data contains systemic bias – if a higher salary caused the CrowdScout Score to be higher, rather than them simply being correlated. A simple plot (below) suggests that the CrowdScout Score often differs from AAV, which is encouraging. Given that, I hope this unique dataset and model will prove helpful in evaluating contracts and cap management in the future.

Huge thanks to asmean for contributing to this study, specifically advising on machine learning methods.



[1] If a team can consistently acquire and retain talented players who consistently play above their expected contract, they will be operating with a significant advantage. If your 24-year old top 4 defenseman is signed at $4.5M AAV and most comparable players are averaging over $5M AAV, more depth or quality can be acquired elsewhere. If your mid-range starting goalie makes $6M and the goaltending market falls out and sees comparables average less than $5M, you are at a disadvantage. Easy enough.

[2] In absolute terms, that’s a very tough question. The NHL labor market is a long way from the economic-textbook-supply-meets-demand-free-efficient-market. There are salary floors, ceilings, team floors, team ceilings, bonuses, rules regarding age and accrued seasons. Deals are often made with little certainty of future performance (read: teams are poor at forecasting individual player career arcs), and often see a trade-off in salary and duration. An efficient market this is not.

[3] A model is only as good as its target variable, and I believe any comprehensive analysis of ability should attempt to answer that question or one similar to it. Hockey is a goal-scoring contest first and foremost, but the ultimate goal (winning the championship) resembles a marathon of hockey games. This is a tricky distinction since it invites past winners to be overrated, when in alternative histories they did not win, thanks to luck. This is certainly a deeper philosophical question, but an analysis in market value should only care about results.

[4] 2015-2016 GAR has not been, and will not be, posted.

[5] As opposed to simply over-rating a player due to reputation and other biases. The system is designed to reward those users who have the foresight to forecast declining ability of a player getting by on reputation alone. Some reputational bias will be present until the time a sizeable crowd of excellent forecasters exists.

[6] Presumably when most players were under contract for the 2015-16 season.

[7] Figure: variable importance (varimp) plot.

The Path to WAR*

*Wins-Above-Replacement-Like Algorithm-Based Rating

Dream On

The single metric dream has existed in hockey analytics for some time now. The most relevant metric, WAR or Wins Above Replacement, represents an individual player’s contribution to the success of their team by attempting to quantify the number of goals they add over a ‘replacement-level’ player. More widely known in baseball, WAR in hockey is much tougher to delineate, but has been attempted, most notably at the excellent, but now defunct, war-on-ice.com. The pursuit of a single, comprehensive metric has been attempted by Ryder, Awad, Macdonald, Schuckers and Curro, and Gramacy, Taddy, and Jensen.

Their desire and effort are justified: a single metric, when properly used, can inform analysis of salaries, trades, roster composition, draft strategy, etc. It should be noted, though, that WAR, or any single-number rating, is not a magic elixir, since it can fail to pick up important differences in skill sets or role, particularly in hockey. There is also a risk that it is used as a crutch, which may be the case with any metric.

Targeting the Head

Prior explorations into answering the question have been detailed and involved, and rightfully so, aggregating and adjusting an incredible amount of data to create a single player-season value.[1] However, I will attempt to reverse engineer a single metric based on in-season data from a project.

For the 2015-16 season, the CrowdScout project aggregated the opinions of individual users. The platform uses the Elo formula, a memoryless algorithm that constantly adjusts each player’s score with new information. In this case, the information is the user’s opinion, which is hopefully guided by the relevant on-ice metrics (provided to the user, see below). Hopefully, the validity of this project is closer to Superforecasting than the NHL awards, and it should be: the ‘best’ users or scouts are given increasingly more influence over the ratings, while the worst are marginalized.[2]
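To make the Elo mechanics concrete, here is a minimal Python sketch of a single pairwise update. The K-factor of 32 and the 400-point scale are the usual chess conventions and are assumptions here, and `scout_weight` is a hypothetical stand-in for the extra influence given to better scouts. None of this is taken from the CrowdScout code itself.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Chance that player A is picked over player B under the Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def elo_update(winner, loser, k=32.0, scout_weight=1.0):
    """Return new (winner, loser) ratings after one scout picks `winner`.

    k and scout_weight are illustrative assumptions, not CrowdScout's values.
    """
    # The less expected the pick was, the larger the adjustment.
    surprise = 1.0 - expected_score(winner, loser)
    delta = k * scout_weight * surprise
    # Whatever the winner gains, the loser gives up.
    return winner + delta, loser - delta


# Two evenly rated players: the pick moves each rating by half of k.
print(elo_update(1500.0, 1500.0))
```

Only the current ratings go into the update, which is what makes the algorithm memoryless: an upset pick moves both players further than an expected one, and nothing about earlier picks needs to be stored.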

The CrowdScout platform ran throughout the season with over 100 users making over 32,000 judgments on players, creating a population of player ratings ranging from Sidney Crosby to Tanner Glass. The system has largely worked as intended, but needs to continue to acquire an active, smart, and diverse user base – this will always be the case when trying to harness the ‘wisdom of the crowd.’ Hopefully, as more users sign up and smarter algorithms emphasize the opinions of the best, the Elo rating will come closer to answering the question posed to scouts as they are prompted to rank two players: if the season started today, which player would you choose if the goal were to win a championship?

Let’s Put Our Heads Together

Each player’s Elo is adjusted by the range of ratings within the population. The result, ranging from 0 to 100, generally passes the sniff test, at times missing on players due to too few or poor ratings. However, this player-level rating provides something more interesting – a target variable from which to build an empirical model. Whereas in theory WAR is a cumulative metric representing incremental wins added by a player, the CrowdScout Score, in theory, represents a player’s value to a team pursuing a championship. Both are desirable outcomes, and neither will work perfectly in practice, but this is hockey analytics: we can’t let perfect get in the way of good.
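One plausible reading of "adjusted by the range of ratings" is a min-max rescaling, sketched below in Python. The function name and the ratings are made up for illustration, and the platform’s exact adjustment may differ.

```python
def scale_to_score(elo_by_player):
    """Rescale raw Elo ratings onto 0-100 using the population's range."""
    lowest = min(elo_by_player.values())
    highest = max(elo_by_player.values())
    spread = highest - lowest
    if spread == 0:
        # Everyone has the same rating, so put them all in the middle.
        return {name: 50.0 for name in elo_by_player}
    return {
        name: 100.0 * (rating - lowest) / spread
        for name, rating in elo_by_player.items()
    }


# Hypothetical ratings, not real CrowdScout data.
print(scale_to_score({"top": 1800.0, "middle": 1500.0, "bottom": 1200.0}))
```

Under this assumption the top-rated player always lands at 100 and the bottom-rated at 0, so a score only means something relative to the population it was scaled against.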

Why is this analysis useful or interesting?

  1. Improve the CrowdScout Score – a predicted CrowdScout Score based on on-ice data could help identify misvalued players and reinforce properly valued players. In sum, a proper model would be superior to the rankings sourced from the inaugural season with a small group of scouts.
  2. Validate the CrowdScout Score – Is there a proper relationship between CrowdScout Score and on-ice metrics? How large are the residuals between the predicted score and actual score? Can the CrowdScout Score or predicted score be reliably used in other advanced analyses? A properly constructed model that reveals a solid relationship between crowdsourced ratings and on-ice metrics would help validate the project. Can we go back in time to create a predicted score for past player seasons?
  3. Evaluate Scouts – The ability to reliably predict the CrowdScout Score based on on-ice metrics can be used to measure the accuracy of the scout’s ratings in real-time. The current algorithm can only infer correctness in the future – time needs to pass to determine whether the scout has chosen a player preferred by the rest of the crowd. This could be the most powerful result, constantly increasing the influence of users whose ratings agree with the on-ice results. This, in turn, would increase the accuracy of the CrowdScout Score, leading to a stronger model and continuing a virtuous circle.
  4. Fun – Every sports fan likes a good top 10 list or something you can argue over.

Reverse Engineering the Crowd

We are lucky enough to have a shortcut to a desirable target variable, the end of season CrowdScout Score for each NHL player. We can then merge on over 100 player-level micro stats and rate metrics for the 2015-16 season, courtesy of puckalytics.com. There are 539 skaters that have at least 50 CrowdScout games and complete metrics. This dataset can then be used to fit a model that uses on-ice data to explain CrowdScout Score; the fitted model then predicts the CrowdScout Score from the same player-level on-ice data. Where the crowd may have failed to accurately gauge a player’s contribution to winning, the model can use additional information to create a better prediction.

The strength of any model is proper feature selection and prevention of overfitting. Hell, with over 100 variables and over 500 players, you could explain the number of playoff beard follicles with spurious statistical significance. To prevent this, I performed a couple of operations using the caret package in R.

  1. Find Linear Combination of Variables – using the findLinearCombos function in caret, variables that were mathematically identical to a linear combination of another set of variables were dropped. For example, you don’t need to include goals, assists, and points, since points are simply assists plus goals.
  2. Recursive Feature Elimination – using the rfe function in caret and a 10-fold cross-validation control (10 subsets of data were considered when making the decision; all decisions were made on the model’s performance on unseen, or holdout, data) the remaining 80-some skater variables were considered from most powerful to least powerful. The RFE plot below shows a maximum strength of model at 46 features, but most of the gains are achieved by about the 8 to 11 most important variables.
  3. Correlation Matrix – create a matrix to identify and remove features that are highly correlated with each other. The final model had 11 variables, listed below.

[Figures: RFE plot; correlation matrix]
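The correlation-matrix step above can be sketched in a few lines of Python. This is a simple greedy filter under an assumed 0.9 cutoff, not the caret workflow I actually ran in R, and the feature names and values are invented for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def drop_correlated(features, threshold=0.9):
    """Keep a feature only if it is not highly correlated with one already kept.

    `features` maps a name to its per-player values; order decides which
    member of a correlated pair survives. The threshold is an assumption.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Made-up data: points tracks goals almost perfectly, hits does not.
toy = {
    "goals": [1, 2, 3, 4, 5],
    "points": [2, 4, 6, 8, 11],
    "hits": [5, 1, 4, 2, 3],
}
print(drop_correlated(toy))
```

Because the filter is greedy, the order of the features matters: whichever member of a correlated pair appears first is the one that stays.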

The remaining variables were placed into a Random Forest model targeting each skater’s CrowdScout Score. Random Forest is a popular ensemble model[3]: it randomly subsets variables and observations (random) and creates many decision-trees to explain the target variable (forest). Each observation or player is assigned a predicted score based on the aggregate results of the many decision-trees.

Using the caret package in R, I created a Random Forest model controlled by a 10-fold cross-validation, not necessarily to prevent overfitting, which is not a large concern with Random Forest, but to cycle through all data and create predicted scores for each player. I gave the model the flexibility to try 5 different tuning combinations, allowing it to test the ideal number of variables randomly sampled at each split and number of trees to use. The result was a very good fitting model, explaining over 95% of the CrowdScout Score out of sample. Note that the variation explained, rather than the variance explained, was closer to 70%.
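The idea of cycling through all the data so every player gets a prediction from a model that never saw them can be sketched as follows. To stay self-contained, this Python sketch swaps the Random Forest for a one-feature least-squares line, so it illustrates the cross-validation loop only; the function names and toy data are assumptions, not the caret code.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return slope, my - slope * mx


def out_of_fold_predictions(xs, ys, k=10):
    """Predict each observation from a model fit on the other k-1 folds (k >= 2)."""
    n = len(xs)
    preds = [0.0] * n
    for fold_id in range(k):
        held_out = set(range(fold_id, n, k))  # every k-th row forms one fold
        train = [i for i in range(n) if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            preds[i] = slope * xs[i] + intercept
    return preds


def r_squared(actual, predicted):
    """Share of variation in `actual` explained by `predicted`."""
    centre = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - centre) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Toy data with an exact linear relationship, purely for illustration.
xs = [float(i) for i in range(30)]
ys = [2.0 * x + 1.0 for x in xs]
print(r_squared(ys, out_of_fold_predictions(xs, ys)))
```

Scoring the held-out predictions, rather than the in-sample fit, is what makes an "out of sample" figure meaningful.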


Note the slope of the best-fit relationship between actual and predicted scores is a little less than 1. The model doesn’t want to credit the best players too much for their on-ice metrics, or penalize the worst players too much, but otherwise does a very good job.


Capped Flexibility

Let’s return to the original intent of the analysis. We can predict about 95% of CrowdScout Score using vetted on-ice metrics. This suggests the score is reliable, but that doesn’t necessarily mean the CrowdScout Score is right. In fact, we can assume that the actual score is often wrong. How does a simpler model do? A Generalized Linear Model (GLM) using the same on-ice metrics performs fairly well out of sample, explaining about 70% of the variation. The larger error terms of the GLM represent larger deviations of the predicted score from the actual. While these larger deviations result in a poorer-fitting model, they may also contain some truth. The worse-fitting linear model has more flexibility to be wrong, perhaps allowing a more accurate prediction.



Note the potential interaction between TOI.GM and position

Residual Compare

How do the player-level residuals between the two models compare? They are largely the same directionally, but the GLM residuals are about double in magnitude. So, for example, the Random Forest model predicts Sean Monahan’s CrowdScout Score to be 64 instead of his current 60, giving a residual of +4 (residual = predicted – actual). Not to be outdone, the Generalized Linear Model doubles that residual predicting a 68 score (+8 residual). It appears that both models generally agree, with the GLM being more likely to make a bold correction to the actual score.
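The Monahan arithmetic, written out as a small Python sketch. The three scores come from the paragraph above; the helper names are just for illustration.

```python
def residual(predicted, actual):
    """Residual as defined in the text: predicted minus actual."""
    return predicted - actual


def same_direction(res_a, res_b):
    """True when two residuals agree on over- versus under-rating."""
    return (res_a > 0) == (res_b > 0) and (res_a < 0) == (res_b < 0)


actual_score = 60  # Monahan's current CrowdScout Score
rf_residual = residual(64, actual_score)   # Random Forest prediction
glm_residual = residual(68, actual_score)  # GLM prediction

print(rf_residual, glm_residual, same_direction(rf_residual, glm_residual))
```

A positive residual means the model thinks the crowd is underrating the player. Here both models agree on direction, with the GLM asking for twice the correction.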



The development of an accurate single comprehensive metric to measure player impact will be an iterative process. However, it seems the framework exists to fuse human input and on-ice performance into something that can lend itself to more complex analysis. Our target variable was not perfect, but it provided a solid baseline for this analysis and will be improved. To recap the original intent of the analysis:

  1. Both models generally agree when a player is being overrated or underrated by the crowd, though by different magnitudes. In either case, the predicted score is directionally likely to be more accurate than the current score. This makes sense since we have more information (on-ice data). If it wasn’t obvious, it appears on-ice metrics can help improve the CrowdScout Score.
  2. Fortunate, because our models fail to explain between 5% and 30% of the score and vary more from the true ability. Some of the error will be justified, but often it will signal that the CrowdScout Score needs to adjust. Conversely, a beta project with relatively few users was able to create a comprehensive metric that can be mostly engineered and validated using on-ice metrics.
  3. Being able to calculate a predicted CrowdScout Score more accurate than the actual score gives the platform an enhanced ability to evaluate scouting performance in real-time. This will strengthen the virtuous circle of giving the best scouts more influence over Elo ratings, which will help create a better prediction model.
  4. Your opinion will now be held up against people, models, and your own human biases. Fun.


Huge thanks to asmean for contributing to this study, specifically advising on machine learning methods.

[1] The Wins Above Replacement problem is not unlike the attribution problem my Data Science marketing colleagues deal with. We know there was a positive event (a win or conversion), but how do we attribute that event to the input actions of hockey players or marketing channels? It’s definitely a problem I would love to circle back to.

[2] What determines the ‘best’ scout? Activity is one component, but picking players that continue to ascend is another. I actually have plans to make this algorithm ‘smarter,’ and a fuller explanation is long overdue on my end.

[3] The CrowdScout platform and ensemble models have similar philosophies – they synthesize the results of models or opinions of users into a single score in order to improve their accuracy.

Goaltending and Hockey Analytics – Linked by a Paradox?

There may be an interesting paradox developing within hockey. The working theory is that as advanced analysis and data-driven decision-making continue to gain traction within professional team operations and management, the effect of what can be measured as repeatable skill may be shrinking. The Paradox of Skill suggests as absolute skill levels rise, results become more dependent on luck than skill. As team analysts continue (begin) to optimize player deployment, development, and management there should theoretically be fewer inefficiencies and asymmetries within the market. In a hypothetical league of more equitable talent distribution, near perfect information and use of optimal strategies, team results would be driven more by luck than superior management.
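A toy illustration of the paradox, as a hedged Python sketch: treat each result as skill plus luck, both drawn from normal distributions, and ask what share of the spread in results comes from skill. The additive model, the standard deviations, and the function names are assumptions made for illustration, not estimates from NHL data.

```python
import random
from statistics import pvariance


def skill_share(skill_sd, luck_sd):
    """Theoretical share of result variance due to skill when result = skill + luck."""
    return skill_sd ** 2 / (skill_sd ** 2 + luck_sd ** 2)


def simulated_skill_share(skill_sd, luck_sd, n_players=5000, seed=0):
    """Estimate the same share by simulating one result per player."""
    rng = random.Random(seed)
    skills = [rng.gauss(0.0, skill_sd) for _ in range(n_players)]
    results = [s + rng.gauss(0.0, luck_sd) for s in skills]
    return pvariance(skills) / pvariance(results)


# Same luck, shrinking talent gap: skill explains less and less of the results.
for sd in (2.0, 1.0, 0.5):
    print(sd, round(skill_share(sd, 1.0), 2), round(simulated_skill_share(sd, 1.0), 2))
```

Nothing in the sketch lowers anyone’s absolute skill. Only the spread between players shrinks while luck stays put, and that alone is enough to make results look more random.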

Goaltenders Raising the Bar

Certainly forecasting anything, let alone still-evolving hockey analytics, is often a fool’s errand – so why discuss? Well, I believe that the paradox of skill has already manifested itself in hockey and actually provides a loose framework of how advanced analysis will become integrated into the professional game. Consider the rise of modern goaltending.

Absolute NHL goaltender ability has continually increased for the last 30 years. However, differential ability between goaltenders has tightened. It has become increasingly difficult to distinguish long-term, sustainable goaltender ability while variations in results are increasingly owed to random chance. Goalies appear ‘voodoo’ when attempting to measure results (read: ability + luck) using the data currently available – much like the paradox of skill would predict.[1] More advanced ways of measuring goaltending performance will be developed (say, controlling for traffic and angular velocity prior to release), but that will just further isolate and highlight the effect of luck.[2]

[Figure: Spot the Trend. Data courtesy of hockey-reference.com]

Will well-managed teams create a similar paradox amongst competing professional teams in the future? Maybe. Consider that such a team would maximize the expected value of talent acquired, employ optimal on-ice strategies, and employ tactics to improve player development. Successful strategies could be reverse engineered and replicated, cascading throughout the league – in theory. Professional sports leagues are ‘copycat’ leagues and there is too much at stake not to adopt a superior strategy, despite a perceived coolness to new and challenging ideas.

Dominant Strategies

“I don’t care what you do, just stop the puck”

How did goaltending evolve to dominate the game of hockey? And what parallel pathways need to exist in hockey analytics to do the same?

  1. Advances in technology – equipment became lighter and more protective.[3] This allowed goaltenders to move better, develop superior blocking tactics (standing up vs butterfly), cover more net, and worry less about catching a painful shot. The growth of hockey analytics has been dependent on web scraping, automation, and increasing processing power and will soon come to rely on data derived from motion-tracking cameras. Barriers to entry and cost of resources are negligible, lending all fanalysts the opportunity to contribute to the game.
  2. Contributions from independent practitioners – The ubiquitous goaltending coach position is a relatively new one compared to most professional leagues. In the early 2000s, I was lucky enough to cross paths with innovative goaltending instructors who made new tactics, strategies, and training methods available to young goaltenders. Between their travel, camps, and clinics (and later their own development centers) they diffused innovative approaches to the position, setting the bar higher and higher for students. A few of these coaches went on to become NHL goalie coaches – effectively capturing a position that didn’t exist 30 years prior. Now the goalie coach position has cascaded down to all levels of competitive hockey.[4] Similarly, the most powerful contributions to the hockey analytics movement have been by bright individuals exposing their ideas and studies to the judicious public. The best ideas were built upon and the rest (generally) discarded. Will hockey analytics evolve (read: become accepted widely among executives) faster than goaltending? I don’t know – a goaltending career takes well over a decade to mature, but goaltenders play many games, providing feedback on new strategies rather quickly.[5] Comparatively, ideas develop quicker but might take longer to demonstrate their value – not only are humans hard-wired to reject new ideas, but there are fewer managerial opportunities to prove a heavy data-driven approach to be a dominant strategy.
  3. Existence of a naïve acceptance – The art (and science) of goaltending is not especially well understood among many coaches, particularly with relative skill levels converging. However, managers and coaches do understand results. Early in my career, I had a coach who was only comfortable with stand-up goaltenders, his own formative experiences occurring when goaltenders predominantly remained upright (in order to keep their poorly padded torso and head from constant danger). However, he saw a dominant strategy (more net coverage) and placed faith in my ability without a comprehensive understanding of, or comfort with, modern goaltending. Analytics will have to be accepted the same way – gradual but built on demonstrated effectiveness. Not everyone is comfortable with statistics and probabilities, but like goaltenders, the job of analysts is to produce results. That means rigorous and actionable work that offers a superior strategy to the status quo. This will earn the buy-in from owners and senior management who understand that they can’t be at a competitive disadvantage.

Forecasting Futility

Clearly the arc of the analytics evolution will differ from the goaltender evolution, primary reasons being:

  • Any sweeping categorization of a two-decade-plus ‘movement’ is prone to simplification and revisionist history.
  • While goaltending as a whole has improved substantially, incremental differences in ability still obviously exist between goaltenders. In the same way, not all analysts or teams of analysts will be created equal. A non-zero advantage in managerial ability may compound over time. However, the signal will likely be less significant than variation in luck over that extended timeframe. In both disciplines, that rising ability may give way to a paradox of not being able to decipher their respective skills, muddying the waters around results.
  • Goaltending results occur immediately and visibly. Fair or not, an outlier goaltender can be judged after a quarter of a season, while managerial results will take longer to come to fruition. Not only that, we only observe one of many alternative histories for the manager, while we get to observe thousands of shots against a goaltender. Managerial decisions will almost always operate under a fog of uncertainty.

Alternatively, it is important to consider the distribution of athlete talent against that of those in the knowledge economy. Goaltenders are bound by normally distributed deviations of size, speed, and strength. Those limitations don’t exist for engineers and analysts, but they do operate in a more complex system, leaving most decisions subject to randomness. This luck is compounded by the negative feedback loops of the draft and salary cap. It is unlikely a masterfully designed team would permanently dominate, but this suggests some teams will hold an analytical advantage and the league won’t turn into some efficient-market-hypothesis-all-teams-50%-corsi-50%-goals-coin-flip game. But if a superstar analyst team could consistently and handily beat a market of 29 other very good analyst teams in a complex system, they should probably take their skills to another more profitable or impactful industry.


Other Paradoxes of Analytics

Because these are confusing times we live in, I’d be remiss if I didn’t mention two other paradoxes of hockey analytics.

  • Thorough, rigorous work is often difficult for senior decision-makers to understand. This is a problem in many data-intensive industries – analytical tools outpace the general understanding of how they work. It seems that (much like the goaltending framework available to us) once data-driven strategies are employed and succeed, all teams will be forced to buy in and trust that they have hired competent analysts that can deliver actionable insights from a complex question. Hopefully.

  • With more and more teams buying into analytics, some of the best work is taken private. It is taken in-house seemingly overnight, sometimes burying a lot of foundational work and data. That said, these issues are widely understood and there is a noble and concerted effort to maintain transparency and openness. We can only hope that these efforts are appreciated, supported, and replicated.


Final Thoughts

The best hockey analysis has borrowed empiricism and data-driven decision-making from the scientific method, creating an expectation that as hockey analytics gain influence at the highest levels, we (collectively) will know more about the game.[7] However, assuming the best hockey analysts end up influencing team behavior, it is possible much of the variation between NHL teams[8] will be random chance – making future predictive discoveries less likely and weakening the relationship of current discoveries.

Additionally, when it feels like the analytical approach to hockey is receiving unjustified push back or skepticism, it is important to remember that the goaltender evolution, initiated by fortuitous circumstance, eventually forced buy-ins from traditionalists by offering a superior approach and results. However, increasing absolute skill in a field can have unintended consequences – relative differences in skill will decrease, possibly causing results to become more dependent on luck than skill. Something to consider next time you try to make sense of the goaltender position.


[1] This is not to say all goalies in 2016 are of equal skill levels, but they are absolutely more talented than their ancestors and fall within a smaller range of abilities. That said, outside of a top 2 or 3 guys, the top 5-10 list of goalies is a game of musical chairs, quarter to quarter, season to season.

[2] Goaltenders don’t get a chance to ‘drive the play,’ so it is very important to control for external factors. This can’t be done comprehensively with current data. Even with complete data, it may be futile.

[3] And cooler, possibly attracting better athletes to the position, your author notwithstanding.

[4] Another feature of the paradox of rising skill levels: to fail to improve is the same as getting worse. Hence, employing a goalie coach is necessary in order to prevent a loss of competitiveness. The result: plenty of goalie coaches of varying ability, but likely without a strong effect on their goaltender’s performance. This likely causes some skepticism toward their necessity. This is probably a result of their own success: they are indirectly represented by an individual whose immediate results might owe more to luck than to incremental skill aided by the goalie coach.

[5] For example, a strategy devised at 6 years old of lying across the goal line forcing other 6 year-olds to lift the puck proved to be inferior and was consequently dropped from my repertoire.

[7] Maybe even understanding the link between shot attempts and goals (you can read this sarcastically if you like).

[8] And other leagues that are able to track and provide accurate and useful data.