Expected Goals (xG) Limitations – A Deep Dive

Expected Goals (xG) are a great tool for analyzing goaltender and skater performance, but it’s important to discuss their limitations and opportunities for improvement.

It’s Tough to Make Predictions, Especially About the Future

When I projected goaltender performance for the 2017-18 season, Cam Talbot looked likely to have an above-average season, possibly even a top-10 performance. Comparing his ability to stop the puck to what we might expect from an average peer (using expected goals), his track record prior to this season had been good to very good, within a certain degree of certainty. Circumstances changed, of course, as he went from backing up one of the best goalies ever to an overworked starter in the span of two years.

For other goalie-season performance summaries, see: http://crowdscoutsports.com/hockey_g_compare.php

Edmonton played him 86 games last season, and though I don’t know his exact physiological profile, it stands to reason that may have been too many games. Could this burnout be impacting his performance this season? It's impossible to say, but his performance has been down (albeit over a smaller sample, represented by the wider error bars above) and he has battled some injury issues. Regardless, the prediction has been wrong so far.

All Models Are Wrong, But Some Are Useful

My projection method is built on goals prevented over expected (I say goals prevented, rather than saved, because I think it’s important to remove credit for saves made on preventable rebounds), which measures the difference between the actual goals allowed by a goalie and the number an average goalie would allow. In theory, if we can control for the number and quality of shots faced, we should be able to isolate and identify puck-stopping ‘skill’; the marginal difference between the best and the rest would manifest itself over an adequate number of shots.[1]
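
To make the metric concrete, here is a minimal sketch of goals prevented over expected, assuming a shot-level table where every unblocked attempt faced carries an xG value, a goal flag, a rebound flag, and an expected-rebound value attached to the originating shot. All column names are hypothetical, not the ones used in my actual pipeline.

```python
import pandas as pd

def goals_prevented_over_expected(shots: pd.DataFrame) -> float:
    """Goals an average goalie would concede minus the goals actually allowed.

    Expects one row per unblocked attempt faced, with hypothetical columns:
    'xg' (shot's expected goal value), 'goal' (1/0), 'is_rebound' (1/0), and
    'xg_rebound' (expected value of a potential rebound off that shot).
    """
    # Remove credit for saves on preventable rebounds: the rebound shot's own
    # xG is dropped and replaced by the expected-rebound value carried on the
    # original (non-rebound) shot.
    non_rebound = shots[shots["is_rebound"] == 0]
    expected_against = (non_rebound["xg"] + non_rebound["xg_rebound"]).sum()
    actual_against = shots["goal"].sum()
    return expected_against - actual_against
```

A positive value means the goalie allowed fewer goals than an average goalie facing the same workload would have.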

This means it falls on the analyst to properly adjust for shot quantity and quality. I mention quantity because, in theory, goaltenders can discourage shots against and encourage shooters to miss the net through excellent positioning. However, I’m not fully confident there’s a convincing way to separate the effect of a goalie forcing misses from the effect of playing on a team where the scorer is more or less likely to record a missed shot. These effects don’t necessarily persist season-to-season, so I’m still using all non-blocked attempts in my analyses, but it’s important to acknowledge that what should be a simple counting statistic is much more complex beneath the surface.

A more popular contention is shot quality, where the analyst tries to weight each shot by its probability of becoming a goal, holding constant the circumstances outside the goalie's control that can be quantified using publicly available NHL play-by-play data – things like shot location, team strength, shot type, and prior events (turnover, faceoff, etc.) and their locations. But these factors, as skeptics will point out, currently don't include shot placement, velocity, net-front traffic, or pre-shot puck movement involving passes (though we can calculate pre-shot movement, or angular velocity, on rebounds from shot 1 to shot 2, or from the location of other events immediately preceding a shot; Moneypuck explains and diagrams how this is done well in their Shot Prediction Expected Goals Model explainer).
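
As a rough illustration of that pre-shot angle calculation (not Moneypuck's exact implementation), the angular velocity of a chance can be approximated from the x,y coordinates and timestamps of the prior event and the shot. The function name, the net location, and the coordinate conventions below are assumptions based on the usual NHL play-by-play layout.

```python
import math

def angular_velocity(prev_xy, shot_xy, seconds_between, net_xy=(89.0, 0.0)):
    """Approximate change in shooting angle (degrees per second) between a
    prior event (first shot of a rebound, turnover, faceoff) and the shot,
    measured from the attacking net. Coordinates are in feet with centre ice
    at (0, 0) and the net near x = 89; these conventions are assumptions."""
    def angle_from_net(x, y):
        return math.degrees(math.atan2(y - net_xy[1], net_xy[0] - x))

    delta_angle = abs(angle_from_net(*shot_xy) - angle_from_net(*prev_xy))
    # Guard against two events stamped with the same second.
    return delta_angle / max(seconds_between, 1.0)

# A rebound that swings from one side of the slot to the other in 2 seconds
print(angular_velocity(prev_xy=(80, -10), shot_xy=(80, 10), seconds_between=2))
```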

Shot quality is also heavily dependent on shot location, which has recorder bias of its own – some rink scorers systematically record shots closer to or further from the net than we see when that team plays in other rinks. Adjustment methods help standardize these locations (basically aligning the home-rink distribution of x,y coordinates with what is observed in away games involving the same team), but there's no way to verify how much closer this gets us to the ‘true’ shot locations without personally verifying many, many shots. In any event, it invites differences in measured performance between ‘objective’ metrics developed by different analysts.
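
The Schuckers and Curro adjustment referenced at the end of this post is built on matching distributions between home and away rinks. A simplified sketch of that quantile-matching idea (not a faithful reproduction of their method) looks something like this:

```python
import numpy as np

def rink_adjust(home_values, away_values, recorded_value):
    """Map a value (e.g. shot distance) recorded in a team's home rink onto
    the distribution observed when that same team plays in away rinks.

    home_values / away_values: arrays of recorded values from the team's home
    and away games. This is only the CDF-matching idea behind the Schuckers &
    Curro adjustment, not their exact implementation.
    """
    home_values = np.asarray(home_values)
    # Percentile of this shot within the home-rink distribution...
    pct = (home_values <= recorded_value).mean()
    # ...re-expressed as the same percentile of the away-rink distribution.
    return np.quantile(np.asarray(away_values), pct)
```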

All of this is to say the ability to create a ‘true,’ comprehensive, and accurate xG model is limited by the nature of the public data available to analysts. Any evaluation, and any subsequent projection, is best viewed through a skeptical lens.

Discussing Deltas

With this in mind, back to Cam Talbot. Talbot has underachieved this season for some combination of the following reasons:

  1. His ‘true’ ability or performance has been worse. This is probably the case directionally, but it’s important to explore how much worse he has been. Is this an age-related decline that can help inform future projections? Is the decline small enough to be chalked up to luck? Do we expect a bounce back? And so on.
  2. The random nature of outcomes has made him look worse this season (or, alternatively, really good in prior years). Using the beta distribution, we can calculate the standard deviation we might expect at each sample size; more shots mean more certainty in the outcome. Simulating seasons by treating each expected goal as a weighted coin flip accomplishes something similar (see the sketch after this list). Understanding and quantifying this uncertainty is an important aspect of any analysis. As you can see below, there is a (small) chance Talbot is a completely average goalie who concedes as many goals as an average goalie would (his ‘true’ talent sitting along the grey vertical line) but just had some improbable results.

    Flipping weighted coins (pucks) many, many times
  3. Edmonton is giving up tougher chances against this year that go undetected by an xG model. They might be allowing more cross-ice passes, moving net-front players out of the front of the crease, or employing unfavourable strategies. How this might manifest itself at the team, goalie, or shooter level is discussed below.
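
Here is the simulation idea from point 2 as a minimal sketch: treat every shot's xG as a weighted coin flip and simulate many seasons for a perfectly average goalie to see how wide the 'luck' band is at a given shot count. The shot count and xG value below are illustrative, not Talbot's actual workload.

```python
import numpy as np

def simulate_seasons(shot_xg, n_sims=10_000, seed=0):
    """Distribution of goals allowed by an exactly-average goalie facing the
    given shots, treating each shot's xG as the probability of a goal."""
    rng = np.random.default_rng(seed)
    shot_xg = np.asarray(shot_xg)
    # One Bernoulli draw per shot, per simulated season.
    goals = rng.random((n_sims, shot_xg.size)) < shot_xg
    return goals.sum(axis=1)

# e.g. a season of 1,500 unblocked attempts at 0.06 xG apiece
sims = simulate_seasons(np.full(1500, 0.06))
print(sims.mean(), sims.std())  # ~90 goals expected, standard deviation ~9
```

The spread of that simulated distribution is the same uncertainty the beta/binomial calculation gives analytically, and it shrinks as the shot count grows, which is why the error bars tighten with larger samples.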

One Night in Edmonton, A Deep-Dive

While east coasters like me were out for New Year’s Eve 2017, Edmonton was hosting Winnipeg. NYE is overrated, but it had to be better than watching the Jets win 5-0. Talbot was in for all 5 goals against, not even getting a mercy pull in the 3rd period to let him start planning his night.

Giving up 5 goals is almost certainly going to look bad next to the expected goals, and this game is no exception. Winnipeg had 52 unblocked shot attempts that totalled 3.3 xG; removing the xG from 2 Talbot rebounds and replacing it with expected rebounds drops that to about 3 xG. This game demonstrates specific weaknesses of the model, but it can also show how the model compensates for that lack of information.

The goal highlights are pretty much a list of things an expected goal model can miss:

  1. Goal 1: A 3-on-1 off the rush, a pass from the middle of the ice to a man wide open at the side of the net. Unfortunately, the play-by-play picks up no prior events and the xG model scores it as a seemingly low 0.1 xG, recognizing only a wrist shot from 9 feet away at a sharp angle. Frankly, this was a slam dunk; even if Talbot adjusted his depth to play the pass, the shooter had time and space in the low slot.
  2. Goal 2: A turnover turns into a cross-ice pass that is immediately deposited for a goal. Fortunately, the turnover is recorded and the angular velocity can be calculated. This is scored as 0.19 xG, which seems low watching the highlight, but the value has to be representative of all chances under these circumstances, and most aren’t executed so cleanly. Still probably low.
  3. Goal 3: A powerplay point shot is deflected. Both of these factors put this shot at 0.17 xG. This is decently accurate: the deflection had to skip off the ice and miss Talbot’s block, and if that point shot were taken 100 times, scoring on the deflection about 20 of them sounds about right. Also note there’s some traffic, which in a perfect world would be quantified properly and incorporated into the xG model.
  4. Goal 4: A pass from the half wall to a man wide open in front of the net. This is scored as 0.16 xG, a deflection tight to the net after an offensive zone turnover. However, this is more of a pass than a true deflection; neither passer nor shooter was contested, and I would think they would convert more than 16 times if given 100 opportunities.
  5. Goal 5: A rush play results in a cross-ice pass and a shot from the hashmarks. Talbot makes the original save, but the rebound is immediately deposited into the net. The rebound shot, changing angle rapidly, is scored as 0.43 xG. However, that rebound is conditional on Talbot not deflecting the puck into the corner, so the play is scored from the original shot: 0.15 xG on the original shot plus 0.02 xG for a potential rebound – a 6% chance of a rebound times the observed goal probability of a rebound, about 27% (a quick calculation follows this list).

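The Goal 5 bookkeeping works out like this (numbers rounded from the text, purely illustrative):

```python
def shot_value_with_expected_rebound(xg_shot, p_rebound, p_goal_given_rebound):
    """Value credited to the original shot: the shot itself plus the chance
    it generates a rebound that then gets converted."""
    return xg_shot + p_rebound * p_goal_given_rebound

# 0.15 xG on the original shot, ~6% rebound chance, ~27% conversion on rebounds
print(shot_value_with_expected_rebound(0.15, 0.06, 0.27))  # ≈ 0.17 total
```
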
A spreadsheet containing each shot against Talbot from this game and the relevant variables discussed above can be found here, or click the image below. The mean non-rebound shot is 0.057 xG, so the actual goals were scored as 2 to 7 times more likely to be goals than the average shot – not bad.

Shot by shot xG

Backing the Bus Up

On the surface, the expected goals assigned to each of these goals are low. A lot of this is confirmation bias: we see the goal, so we assume the probability of a goal must have been higher, when in reality pucks bounce, goalies make great saves, and shots miss the net.

However, we’ve identified additional factors that, if properly quantified, would increase the probability of a goal for those specific shots. There are extremely exciting attempts to gather this data (notably tapetotapetracker.com, released at the beginning of 2018, which if properly supported could reach the necessary scale, though it might still require some untangling of biases much like shot location[2]), but they are still in their infancy and certainly not real-time. Advanced video tracking software would also accomplish the same thing, but it is not necessarily public or comprehensive.

So what does the model do without that helpful information? For the specific goals discussed, the model is conservative, under-estimating the ‘true’ probability of a goal, but across all shots the number of expected goals roughly equals the number of goals.[3] This means that impactful but latent events, like cross-ice passes and screens, are effectively dispersed over all shots.

We assume an unscreened shot from the blueline will be saved 999 times out of 1,000 (implying 0.001 xG), but the model, not knowing whether there’s a screen, might assign an xG of 0.03, an average over all scenarios where there may or may not be a screen. This might increase to 0.05 xG if the shooter is on the powerplay, or 0.08 xG with a two-man advantage, since the probability of a sufficient screen increases. Note that these adjustments for powerplay shots are applied evenly to all shots on the powerplay; the model can't determine which shots are specifically more dangerous on the powerplay, though a powerplay indicator may be interacted with specific discrete factors (e.g. PP+1 – slap shot, PP+1 – wrist shot, etc.).
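
In other words, the model is implicitly averaging over the unseen screen, which a tiny sketch makes explicit. All probabilities below are the illustrative numbers from the paragraph above, not model outputs.

```python
def xg_marginalized_over_screen(p_screen, xg_if_screened, xg_if_clear):
    """Goal probability a screen-blind model effectively assigns: the average
    over whether or not a screen was actually present."""
    return p_screen * xg_if_screened + (1 - p_screen) * xg_if_clear

# Even strength: screens on blueline shots are rare, so the shot scores low...
print(xg_marginalized_over_screen(0.10, 0.25, 0.001))  # ~0.03 xG
# ...on the powerplay a screen is more likely, nudging the same shot upward.
print(xg_marginalized_over_screen(0.20, 0.25, 0.001))  # ~0.05 xG
```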

This is fine if the probability of a screen (or other latent factor) is about even across all teams. However, it doesn’t take much video analysis to know that this is almost certainly not the case. All teams gameplan to generate traffic and cross-ice passing plays, but some have the personnel and talent to execute better than others, while other teams are better at countering those strategies. Some teams will over- or under-perform their expected goals partially due to these latent variables.

Unfortunately, there isn’t necessarily the data available to quantify how much of that effect is repeatable at a team or player level. Some xG models do factor in individual shooting talent. In my model, borrowing from Dawson Sprigings, shooting talent is represented by regressed shooting percentage indexed to player position. A shot from Erik Karlsson, who is more adept at controlling the puck and setting up and shooting through screens, would carry a higher xG than a similar shot from another defender. Shots from Shayne Gostisbehere might receive a similar (though smaller) adjustment if consistently aided by the excellent net-front play of Wayne Simmonds.
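
A minimal sketch of what "regressed shooting percentage indexed to position" can look like, using simple regression toward the positional mean. The stabilization constant and the sample numbers below are made up for illustration and are not the values used in my model (or in Sprigings').

```python
def regressed_shooting_pct(goals, shots, position_mean, stabilizer=375):
    """Shrink observed shooting percentage toward the positional mean, with
    less shrinkage as the shooter's sample grows. 'stabilizer' is the number
    of league-average shots blended in (a made-up constant here)."""
    return (goals + stabilizer * position_mean) / (shots + stabilizer)

# A defenceman hot over 30 shots stays close to the defencemen's baseline...
print(regressed_shooting_pct(goals=5, shots=30, position_mean=0.045))    # ~0.054
# ...while a forward with 900 shots keeps most of his observed percentage.
print(regressed_shooting_pct(goals=120, shots=900, position_mean=0.09))  # ~0.121
```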

So while a well-designed xG model can implicitly capture some effect of pre-shot movement and other latent factors, it’s safe to say team-level expected goal biases exist. By how much and how persistently is ultimately up for discussion.

Talbot’s 99 Problems

Focusing on Talbot again, the big thing holding him back is his performance while Edmonton is shorthanded. While his save percentage is about what we’d expect given his shot quality at even-strength, it is an abysmal 5% below expected on the penalty-kill.

What kind of shots is Edmonton giving up? On the penalty kill, non-rebound shots are about 1% less likely to be saved than shots conceded by other penalty-kills (rebound shots are about 8% more dangerous, but some of that may be a function of Talbot himself, so let’s focus on non-rebound shots).

Distribution of Expected Goals

So it seems like Talbot is simply struggling while down a man. Let’s compare his penalty-kill numbers side-by-side with his past performance and with his recently replaced backup, Laurent Brossoit.

For other goalie-season performance breakouts by strength/score/venue/danger, see: http://crowdscoutsports.com/hockey_g_compare.php

I’m assuming most readers see the same thing I do? Both goalies, after 7 collective seasons of being at least average on the penalty kill within a decent margin of error, aren’t even close this season. There is a chance this is a coincidence, but there’s a better chance there are systemic failings on the Edmonton penalty kill that are leading to chances more dangerous than the expected goal model reflects.

None of this is shocking at this point. To some, it reinforces the desire to focus on 5v5 goaltender save percentage: it’s easier to imagine team-level systemic shortcomings persisting on the penalty-kill, while at 5v5 they are more evenly distributed. The noise that 5v4 play adds to our measures eventually weakens predictivity, but over a short season the inclusion of that extra data offsets any troublesome team-level bias it may carry.

Much like team shot metrics wisely moved from ‘score-close’ to score-adjusted measures, it’s always desirable to use the entirety of the data available. About a quarter of goals (23.5%) are scored on the man advantage (on only 17% of the shots), and when we are measuring goaltender performance and looking for advantages on very slim margins, including that data becomes very important.

However, we’re still fairly certain there are latent team-level biases in our expected goal model, and Edmonton’s penalty-kill hints at that. But we don’t know to what extent, if at all.

Fixing the Penalty Kill

One approach to quantifying the relative impact of coaching and tactics is a fixed-effects model, basically holding constant the effect of a Todd McLellan penalty-kill or a Jon Cooper powerplay as they relate to the probability of a goal being scored on any given shot. (Note: I begin to use team-level xG bias and coaching impact a little interchangeably here. I’m interested in team-level bias, but trying to get there by using coaches as a factor.)

This method is rather crude (and obviously not perfect) and wouldn’t necessarily differentiate between marginal goals that are the result of net-front play versus effective pre-shot passing, though an inquiring team might break out discrete sequences they’re interested in, like how their penalty-kill performs on shots from the point or against 4-forward, 1-defender powerplays. However, quantifying specific strategic factors like this is not necessarily easy, since both teams are simultaneously optimizing and countering each other’s tactics, so for simplicity it’s probably preferable to consider the holistic penalty-kill or powerplay for now.

Creating a fixed effect for a Todd McLellan penalty-kill attempts to capture goal probability relative to other teams’ penalty-kills, presumably capturing some of the effect of pre-shot passes or screens. Because Cam Talbot has played the majority of Edmonton's games recently, the current model avoids goaltender indicator variables (to sidestep potential multi-collinearity with the team variables) and instead assigns a numerical value to the quality of the goaltender playing, hopefully apportioning systemic strengths or weaknesses to the penalty-kill variable rather than to Talbot or Brossoit specifically.
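
A minimal sketch of that setup, assuming a shot-level frame with the base xG features collapsed into a single probability, a continuous goalie-skill value, and coach-situation labels for the defending and shooting teams. The column names, the file name, and the collapsed 'xg' term are simplifications of the model described here and in the replies below, not my exact specification.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical shot-level file: goal (0/1), xg (base model probability),
# goalie_skill (continuous), def_coach_situation (e.g. 'McLellan_PK'),
# off_coach_situation (e.g. 'Cooper_PP').
shots = pd.read_csv("shots_with_coaches.csv")

model = smf.logit(
    "goal ~ xg + goalie_skill"
    " + C(def_coach_situation) + C(off_coach_situation)",
    data=shots,
).fit()

# A positive coefficient on the McLellan penalty-kill dummy implies shots
# against that unit are more likely to become goals than the base expected
# goal features alone would suggest.
print(model.params.filter(like="McLellan"))
```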

One limitation of this approach is the effect will be fixed for whatever time period we select, unable to capture tactical changes that may or may not have worked or changes in personnel available to the coach.

It’s also important to reiterate attempts are made to adjust for home rink scorer bias, but any of this uncaptured systemic bias would possibly be wrongfully attributed to coaching (on home-ice at least).

Model Results

Running the model on the 2016-17 and 2017-18 seasons, the coefficients can be transformed to represent the relative probability of a goal and then indexed to the league average. McLellan’s penalty-kill appears to concede shots about 2.4% more dangerous than the basic expected goal model would expect, putting him 32nd out of 38 coaches over the 1.5-season period. Using just the most recent season, McLellan finds himself at the bottom alongside Dave Hakstol for conceding the toughest shots in the league while down a player. Cam Talbot’s save percentage on the penalty-kill is about 5% below his expected save percentage, but it appears about half of that (2.4%) might be attributable to McLellan’s tactics, which is a nice, clean compromise.
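
For reference, the transformation from a fitted coach coefficient (log-odds) back to a change in per-shot goal probability is just the inverse-logit applied around a baseline. The 0.057 baseline and the 0.45 coefficient below are purely illustrative (the baseline echoes the mean non-rebound shot from the Winnipeg game above), not fitted values.

```python
import math

def coach_effect_on_goal_prob(coef, baseline=0.057):
    """Change in goal probability on a baseline shot implied by a coach
    fixed-effect expressed in log-odds."""
    base_odds = baseline / (1 - baseline)
    adjusted = base_odds * math.exp(coef) / (1 + base_odds * math.exp(coef))
    return adjusted - baseline

print(coach_effect_on_goal_prob(0.45))  # ≈ +0.03 on an average shot
```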

Preventative Measures

More important than an indictment of specific coaches (this only reflects shot quality; it doesn’t factor in shot quantity or counter-attack opportunities on the penalty-kill), this might provide a reasonable team-level margin of error for expected goal models. A distribution of results from the last 4 seasons suggests that coaches' effect on even-strength shot quality is smaller, per shot, than their effect on special teams by a factor of about 2 to 3. Some of this is surely the result of smaller sample sizes, but it makes some intuitive sense: special teams are more reliant on tactics than relatively free-flowing even-strength play.

Distribution of impacts of coaches with 3 seasons coached in last 4 years

This provides a decent rule of thumb: the expected goal percentage used might vary by about 0.2% at even strength and about 0.6% on special teams due to coaching and tactics. This, of course, is over a few seasons; over a shorter time frame that number might fluctuate more.

State            Absolute Mean    Standard Deviation
Even Strength    0.20%            0.30%
Special Teams    0.60%            0.80%

Final Thoughts

Model testing suggests that even a good projection of goaltender performance will carry a healthy margin of error. Misses will happen, but they should be as small as possible and always reported alongside that uncertainty.

Of course, misses can be a helpful guide to improvement, prompting us to look at things from a different perspective. Results like Edmonton’s penalty-kill can elicit a deep-dive.

None of this is meant to totally reallocate blame from Talbot to McLellan’s penalty-kill, but rather to explore how you might quantify that re-allocation, and how that ties into a more general discussion of the limitations of shot quality metrics.

‘True’ shot quality will likely never be completely captured, but there will be incremental improvements, feeding incremental improvements in predicting the future performance of goaltenders, along with other useful insights into performance.

But the job of the analyst will be the same: miss as small as possible and quantify the uncertainties. With a game as complex as hockey and limited available data, those uncertainties can be daunting, but hopefully this has illuminated some of the weaknesses of current xG models (and maybe revealed a few strengths) and how much of a concern those limitations are. Expected goals are a helpful tool and are part of the incremental improvement hockey analytics needs, but acknowledging their limitations will ultimately make them more powerful.

Thanks for reading! Goalie-season xG data is updated daily and can be downloaded or viewed in a goalie compare app. Any custom requests ping me at @crowdscoutsprts or cole92anderson@gmail.com.

Code for this analysis was built on top of the scraper written by @36Hobbit, which can be found at github.com/HarryShomer/Hockey-Scraper.

I also implement the shot location adjustment outlined by Schuckers and Curro and adapted by @OilersNerdAlert. Any implementation issues are my fault.

[1] In my opinion, the marginal difference in goaltender skill is very tight at the NHL level, making identifying and predicting performance very difficult; some might argue pointless. But important trends do manifest themselves over time, even if there are aberrations over smaller, but notable, periods (like a season or playoff series).

[2] One of my concerns is trackers being biased toward certain players or teams, or, worse yet, being more likely to record passing series that lead to goals. Imagine out-of-sample model scoring being ‘tipped off’ to predict a goal whenever a passing series exists for that shot; that data is not truly ‘unseen.’

[3] Note that I sometimes use expected goals and probabilities a little interchangeably. A 0.5 xG can be interpreted as a 50% chance of a goal for that particular shot, thanks to the properties of logistic regression. No shot is worth more than 1 xG, much like no shot is certain to go in.

4 Replies to “Expected Goals (xG) Limitations – A Deep Dive”

  1. Interesting stuff. I do have a couple of questions though.

    First, before I say anything I may as well ask if I’m getting how the model works correctly. So you are essentially holding the coach constant, controlling for the expected goal probability for that shot, and trying to predict whether or not the shot was a goal. So the coefficient can help tell us how much more (or less) than expected goes in for that coach. And this model doesn’t control for the goalie (I think?). I think that’s what you mean (maybe you could put your code up on Github). I guess an easier (yet cruder) way of thinking about this is just doing Sv%/xSv% (and possibly indexing to league average…I think that’s what you are doing). This will exaggerate the effects as a model isn’t fit but I guess it’s close enough.

    Nevertheless, if I’m misunderstanding that’s my fault and the rest of what I say here may just be meaningless.

    I guess my issue here (assuming that the above is correct) is that I’m not sure how this exactly captures a coach effect. I have two reasons:

    1. This is a smaller concern but the effect very well may be just a product of the players themselves. Maybe the players are just shitty so they let more goals in than expected. There isn’t really a causal link here from tactics -> score more goals than expected (not that it’s your fault). There are a lot of steps in between them so I’m not so confident putting all the blame (or credit) on the coach.

    2. The other (and much more important) issue I have essentially boils down to – “How do I know this isn’t just random?” What I mean by that is you are essentially observing that teams coached by coach “X” tend to give up more goals than expected on the penalty kill. But how do we know that this isn’t what we would expect from randomness alone? We obviously expect some spread here just from randomness (which kind of goes without saying) and I really don’t see how your model accounts for that (again, unless I’m missing something which I very well may be).

    For example, looking at the Oilers the past two years they were perfectly fine last year in regards to Sv% vs. xSv%. So why are they so bad this year? Did Todd McLellan just forget what he’s doing?

    I guess my point is that unless I see some actual predictivity I have a hard time believing the effects (and I very well could be wrong here). I think some out of sample testing to see how persistent these effects are for coaches would be important to look at. This is just a gut feeling (and I kind of hope I’m wrong) but I imagine if you build the model on more years (instead of the past 1.5) you would see a much smaller effect for coaches.

    1. Ok, good stuff.
      Regarding the model, I have all the same features as my expected goals model predicting a goal, plus I control for a) shooting team coach-situation (e.g. McLellan EV, McLellan PP), b) defending team coach-situation (e.g. McLellan EV, McLellan PK), and c) goalie skill (a continuous variable, double dipping a little in the xG model; things like games and shots factor in too). With this setup the idea isn’t so much to predict goals, but to see how the coach-situation indicators impact that probability. And my more macro goal wasn’t to definitively say coach X is good or bad, rather to see what kind of observable effect we might find at the team/coach level, so when we see results like Talbot’s on the PK we can make a guess as to how much harder those shots might be due to persistent, latent variables over the course of the season. I always mention to people that my xG measurements have team-level bias, but by how much? 1%? 2%? 5%? This was my attempt to answer that.
      But to your point on the coaching effect (#1), I agree. The plan was to capture team-level xG biases, and creating coaching indicators was my way to do it, but coaching ARI and NSH are two different things, so I might need to go back and clarify that. Thinking about EDM’s PK, McLellan may not be adjusting his strategy to the different personnel available to him. Like you said, he didn’t forget how to coach, and of course his overall results aren’t that poor, but they must be doing something wrong right now, though it probably results in ~2% harder shots on the PK, not the full 5% difference.
      Which ties into #2 (randomness), I tried to create error bars from the CIs and most coaches/teams don’t seem to have a large systemic impact, which is actually good! We kinda want things like cross-ice passes and screens and other latent variables to be fairly evenly distributed across teams (as they relate to the shot quality we can ‘see’). But some of the indicators were strong, so we can infer that those shots were probably more dangerous (controlling for goalie quality) than the xG model would let on.
      The predictivity thing would be useful/necessary if we were firing coaches based on this, but I think as a 100% descriptive model it’s fine. Like I said, I wanted to put a number on the variance you might see around an xSv%. Like the game vs WPG, Talbot’s xSv% might have been 5% lower than my model scored it, but over a season, the max that difference might be is ~0.2% (according to this; there could be better ways to explore that). Then it becomes a matter of reattributing deltas between ‘skill’ and luck. Does that make more sense?

      1. Thanks for the reply and for clearing a couple things up (also sorry for taking so long to respond).

        I’ll be honest though, I don’t really get your answer to my 2nd point.

        I guess the best way to explain my problem would be to take it to an extreme. Imagine we are able to measure expected goals as accurately as possible. So we don’t have to worry about not knowing about team effects, cross-ice passes…etc (for the sake of it let’s just imagine nothing exists outside what we can currently measure). In that league xGoals still is not equal to Goals because of random chance. So any deviance from the expected Sv% (I guess besides for goalies) is completely random. In that situation, would your model still pick something up? I’ll be honest and say I don’t see how it still wouldn’t pick *something* up as some teams are still going to have a disparity between their xSv% and their SV%.

        I very well could be completely misunderstanding something but I don’t know. All I see the coach (or team) variable doing is telling us whether or not a team did better or worse than expected. Which could imply it’s the coach’s/team’s fault…or it could mean nothing. And that’s really my whole point with checking the predictivity. If this isn’t something that is actually a “talent” for coaches (or whoever) then I don’t really see how we can say that McLellan’s tactics are what is causing Talbot to do worse. For example the distribution you observe may be exactly what we expect from chance alone (which I doubt).

        But I’m not sure. I don’t think we are on the same page here and I think I’m missing something. Maybe I’ll think about it some more another time (either way at the end of the day the effects are small).

        1. Sorry I didn’t see this (I disabled alerts to comments after getting spammed). So there’s no way to be 100% sure what you describe isn’t happening, but keep in mind this is done at the shot level and coaches sometimes span multiple teams. To go hypothetical, if I just assigned one of 40 coaches randomly to each shot and ran the same model, I would expect that the goal probability would be almost entirely explained by the other factors in the model and the coach variables wouldn’t carry any significance; their p-values would be close to one. Alternatively, if we had a perfect model I wouldn’t expect coach variables to explain anything. I’m guessing that’s where we differ?

          I guess it’s more of a question of how much we expect luck to range over a few seasons, goalies (and possibly teams), and a few thousand shots. If it were greater than or equal to the distribution of ‘coach effects’, your concerns would be correct. So would you be more comfortable with the ranges I listed as ‘team + luck effects’? Luck is something I kind of avoided discussing, but maybe a possible test is running the model with randomized coaches to compare the results.
