Statistics And Metrics: What Makes A Good Evaluation Metric


Jun 12, 2013; Renton, WA, USA; Seattle Seahawks quarterback Russell Wilson rushes during minicamp practice at the Virginia Mason Athletic Center. Mandatory Credit: Joe Nicholson-USA TODAY Sports

After my recent articles on judging QBs and how the “sophomore slump” is a myth, I’ve been getting a lot of questions about evaluation metrics. Which ones do I like? What about ? etc. The conversations have been thought-provoking, and have challenged me to reconsider my opinion on a number of metrics that are out there right now.

One of the things I thought I’d do is share with you some of my thoughts on what makes a good metric. Good metrics are hard to come by. There are plenty out there that sound like good evaluation tools in theory, but in practice don’t get the job done.

I’ve come up with 3 key tests for evaluating metrics that I’d like to share. I’m sure there are more that simply didn’t come to mind while I was writing this, so if you have some that you think should be added, be sure to include them in the comments section.

A good metric isolates offense, defense, and special teams

If you’re trying to measure the quality of an offense, then you should only be measuring the offense. Field position, which is mostly the result of special teams play, has to be removed from the data if possible. If it can’t be removed, then the metric is fairly pointless.

For example, the problem with using total points as a measure of offense or defense is that the number of points scored correlates with a number of variables that have nothing to do with the quality of the offense or defense. It relies heavily on variables like average starting field position, which have nothing to do with what you’re attempting to measure.

A good metric measures skill, and thus the result is reproducible on the field

Recently, I’ve seen people touting points per play as a way of measuring offensive or defensive play. At first it seemed like a quality metric, but upon closer look, there’s a serious problem.

It’s easy to see the problem when you compare it to a simpler statistic, like a QB’s completion percentage. QBs have good days and bad days for completion percentage, but all of their games will fall within a nice normal distribution around their average. Any group of games will not be distinguishable from any other group of games.

Statistically, the tool I use is called a t-test. If you select any 8 games from a season and compare them to the other 8 games, the t-test should fail to reject the null hypothesis, which is that the two groups are not distinguishable from each other.
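To make that split-season comparison concrete, here is a minimal sketch in Python. It assumes a standard two-sample Student's t-test with pooled variance, which may not be the exact variant I run in practice. The completion percentages are made up for illustration, and the function names are hypothetical. With 8 games in each group there are 14 degrees of freedom, so the two-sided 5% cutoff for the t statistic is about 2.145.

```python
import math
import statistics


def pooled_t_statistic(group_a, group_b):
    """Two-sample Student's t statistic using a pooled variance estimate."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err


# Two-sided 5% critical value of the t distribution with 8 + 8 - 2 = 14
# degrees of freedom. Only valid for two groups of 8 games each.
T_CRITICAL_DF14 = 2.145


def halves_distinguishable(group_a, group_b, critical=T_CRITICAL_DF14):
    """True when the test rejects the null that both groups share a mean."""
    return abs(pooled_t_statistic(group_a, group_b)) > critical


# Made-up completion percentages for two groups of 8 games (illustration only).
first_group = [61.5, 64.0, 58.3, 66.7, 62.1, 59.8, 65.2, 63.4]
second_group = [60.9, 63.7, 62.5, 57.9, 66.0, 64.4, 61.2, 63.0]

print(halves_distinguishable(first_group, second_group))
```

A metric that measures skill should come back as not distinguishable for most ways of splitting the season. Keep in mind that a 5% cutoff will still flag roughly one split in twenty by chance alone, so a single rejection isn't damning by itself.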

Points per play doesn’t pass this test. There is way too much variability from one game to another. It appears to be determined largely by things like turnovers and special teams, and not by the overall skill of the unit being evaluated.

A good metric isn’t an overly complicated meta-stat

There are plenty of these around sports right now. Total QBR comes to mind as probably the best example in football. I’m talking about metrics that combine many different stats and metrics into one single number. They can be fun, but ultimately they will always be flawed in ways that make them fairly pointless.

The problem lies with the weighting of the variables. With QBR, should TDs carry more weight in creating the final number, or yards? Perhaps interceptions? What about sacks? Running yards? First downs?
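Here is a small sketch of how much the weighting matters. Everything in it is hypothetical: the two stat lines, the weights, and the `composite_score` function are invented for illustration and have nothing to do with how QBR is actually calculated. The point is only that the same two performances can swap places depending on which weights the creator happens to pick.

```python
def composite_score(stats, weights):
    """Weighted sum of a stat line: the basic shape of many meta-stats."""
    return sum(weights[name] * value for name, value in stats.items())


# Hypothetical single-game stat lines (not real players or real games).
qb_a = {"pass_yards": 300, "touchdowns": 1, "interceptions": 0, "sacks": 4}
qb_b = {"pass_yards": 220, "touchdowns": 3, "interceptions": 1, "sacks": 1}

# Two equally defensible-looking sets of weights (both invented).
yards_heavy = {"pass_yards": 0.10, "touchdowns": 3.0, "interceptions": -4.0, "sacks": -1.0}
td_heavy = {"pass_yards": 0.04, "touchdowns": 6.0, "interceptions": -4.0, "sacks": -3.0}

for label, weights in [("yards-heavy", yards_heavy), ("TD-heavy", td_heavy)]:
    score_a = composite_score(qb_a, weights)
    score_b = composite_score(qb_b, weights)
    better = "QB A" if score_a > score_b else "QB B"
    print(f"{label}: {better} rates higher")
```

Under the yards-heavy weights QB A comes out ahead, and under the TD-heavy weights QB B does. Nothing about the play on the field changed, only the opinion baked into the formula.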

The creators of these metrics have their own biases, which show up in the final results. QBR puts way too much emphasis on sacks in my opinion, and not nearly enough on completing passes that gain first-down yardage. Completing a 4-yard pass on 3rd and 6 should be a negative and not a positive, or at least that’s how I see it.

I’ve been one to dabble in meta-stats, so don’t think I’m pointing my finger at anyone. My mathematical power rankings are based entirely on a meta-stat that I created, and am constantly improving. It’s interesting (at least to me), but until the results correlate very highly with winning, they aren’t as meaningful as they might seem.

Metrics must be constructed in a way that eliminates as much of this variable-weighting problem as possible. Otherwise, the validity of the metric will always be questionable.