Statistics in sport?

Chad is bemoaning the increase of "stat-geekery" in sports:

I'll admit that I'm somewhat torn about this. I am, after all, a professional nerd, and enjoy working with numbers, so I can see the appeal of quantitative data. And a lot of the regular statistics used in basketball are pretty crude measures, so I can understand trying to develop better statistics.

Very, very crude. And that is where my beef comes from. Can you think of a sports statistic that includes a measure of error?

Statistics as a field studies the distribution of random variables, which means considering both a measure of location (like an average) and a measure of dispersion (like a standard error). I cannot think of a single statistic in common use in sports which involves an estimate of error.

The use of "statistics" in sports has long been an argument for the secret numeracy of Americans. After all, if people can memorize the batting averages for the entire 1993 Yankee organization, surely they have the ability to balance a checkbook, or to appreciate Fermat's Last Theorem.

The problem is that memorizing a bunch of averages tells you nothing without some measure of variability. Knowing that one pitcher's ERA is higher than another's is only interesting if the range of variation intrinsic in pitching is smaller than the difference between ERAs.
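To put a rough number on that, here is a minimal sketch that treats each at-bat as an independent trial with a fixed hit probability. That binomial model is a simplifying assumption, and the function name and the sample season are made up for illustration.

```python
import math

def batting_average_interval(hits, at_bats, z=1.96):
    """Return (average, standard error, low, high) under a binomial model."""
    avg = hits / at_bats
    se = math.sqrt(avg * (1 - avg) / at_bats)
    return avg, se, avg - z * se, avg + z * se

# Hypothetical season: 150 hits in 500 at-bats
avg, se, low, high = batting_average_interval(150, 500)
print(f"{avg:.3f} +/- {se:.3f} (approx. 95% interval {low:.3f} to {high:.3f})")
```

Under these assumptions, a .300 average over 500 at-bats carries a standard error of about 20 points, so two hitters separated by 10 or 15 points are hard to tell apart on one season of data.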

Even worse than that casual misuse of statistics is the bizarre use of conditional probability without justification. "Joe Schmoe has a .342 batting average against left-handed pitchers over 200 pounds with runners on second." Is there any statistical basis for thinking that the player bats differently depending on the pitcher's weight, where the runners are, or even whether a runner is on base? If not, there's no statistical reason for slicing up the probabilities like that.

Statistics, and the measurement of error, are an important part of life in the 21st century. Presenting statistics without any assessment of error limits their utility, and reinforces unhelpful ways of thinking about measurement with error.

Statistics make my eyes bleed, but there's a guy in Winchester, Kansas who crunches baseball numbers and writes beautifully about his observations.

Dunno if Bill James' work addresses your questions, but it's something to look into.

http://en.wikipedia.org/wiki/Bill_James

By MonkeyHawk on 28 Feb 2007

In baseball, statistics are life. Sadly, too many players and managers and executives are so blinded by "tradition" it has made them slow to realize the value of advanced statistical measures.

Many of the traditional stats in baseball are seriously flawed. It isn't because of a margin of error; it's because they measure external factors rather than an individual's actual performance. A batter can only score a run (R) if he hits a HR (semi-common), steals home (rare), or is batted in by another player (most common). So a player's R total is almost wholly dependent on others. The same goes for Runs Batted In (RBI): a batter can only rack up RBIs if he hits a HR or has teammates on base. Batting Average is also a huge problem: how much control does a player have over his hitting? Can he really "hit 'em where they ain't"? Or are other measures more reliably correlated with actual run production and performance? (Yes: On Base Percentage.)

Pitching stats are similarly flawed. Earned Run Average is basically half dependent on the defense behind a pitcher. All that a pitcher can really control is strikes, balls and not giving up home runs.

Bill James pioneered the new stats, often called sabermetrics. There are a lot of them, and many, many sites devoted to analyzing the game and coming up with new ways of looking at it (baseballprospectus.com and beyondtheboxscore.com are two of my favorites).

Some of those goofy stats in baseball DO matter though. Players do hit differently vs. right-handed or left-handed pitchers. Players DO hit differently depending on having men on base, and sometimes (because of fielding shifts) on which base teammates are on. So yes, there is a very solid statistical basis for many of those sliced and diced stats.

So really, it's not the measure of error that you should be worried about with the traditional stats. It's that you're using the wrong stats. Many of the new stats totally remove any worry about that. Take WARP - Wins Above Replacement Player. If we assume an average player will create X runs in a season, and contribute Y wins above average to a team, then how many more will the hometown star contribute? 5 wins? 2 wins? 8 wins? You don't need to know about variation here - it's simply that this guy is worth 8 wins more per season. That's huge. You don't need a measurement of error.

You do need an error if you are trying to say that player A is better than player B. If A has WARP 8 and B has WARP 7, the variability matters. If it's large, the two players are the same. If it's small, prefer A.

Amen Josh. When they said during the Super Bowl that those were the two lightest teams in the league, I shrugged "So what?" if the difference between #1 and #32 is 5% of the total average. Pretty much all ranked sports stats have this weakness. And doesn't it make one pause when looking at Barry Sanders' high career average per carry when one discovers that 1/3 of his carries were for losses? Were I a football coach, I'd be a lot more interested in median yards per carry than mean.
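As a quick illustration of the mean-versus-median point, here is a sketch using an invented set of carries (these are not any real player's numbers): a few losses, a few short gains, and one long breakaway run.

```python
from statistics import mean, median

# Hypothetical carries (yards gained); invented for illustration only
carries = [-3, -2, -1, 0, 1, 2, 2, 3, 4, 45]

print(f"mean:   {mean(carries):.1f} yards per carry")
print(f"median: {median(carries):.1f} yards per carry")
```

With this made-up sample the single 45-yard run pulls the mean up to 5.1 yards, while the median carry gains only 1.5.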

Generally sports people don't understand jack about statistical analysis. In my entire life, I've only ever heard one sportscaster use the term "law of averages" correctly. Often the basic stat itself doesn't make any sense. Here are a few of my favorites:

The NBA ranks offenses and defenses by average points per game. That measures tempo, not offensive talent. The stat that matters is points per possession, and all it would take is one kid tallying possessions to do it right.

In the NFL if you punt from the 50 yard line, and the ball is not fielded by the opposition, and marked at the 20 yard line, how long is the punt? Well, by NFL illogic, if it went out of bounds on the 20, it's a 30 yard punt, but if it went out of the end zone for a touchback, it's a 50 yard punt! Way to reward the good coffin corner kickers!

And finally, everyone's favorite, the NFL quarterback rating. I got the real formula and algebraically simplified it: (yards passing + 20 X completions + 80 X TDs - 100 X ints) / attempts, and then believe it or not, add .5 and multiply that entire sum by 25 / 6. Sigh. Just like in basketball, the stat that matters is yards per pass play, with an adjustment for turnovers and TDs.
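For anyone who wants to check the algebra, here is a sketch comparing the standard four-component passer rating with the simplified form given above. The function names and the stat line are hypothetical. Note one assumption: the official version clamps each component to the range 0 to 2.375, and the simplified form ignores that, so the two only agree when no component hits a limit.

```python
def passer_rating_official(att, comp, yds, td, ints):
    """NFL passer rating: four components, each clamped to [0, 2.375]."""
    def clamp(x):
        return max(0.0, min(2.375, x))
    a = clamp((comp / att - 0.3) * 5)       # completion percentage
    b = clamp((yds / att - 3) * 0.25)       # yards per attempt
    c = clamp(td / att * 20)                # touchdown rate
    d = clamp(2.375 - ints / att * 25)      # interception rate
    return (a + b + c + d) / 6 * 100

def passer_rating_simplified(att, comp, yds, td, ints):
    """The simplified form from the comment above (ignores clamping)."""
    per_attempt = (yds + 20 * comp + 80 * td - 100 * ints) / att
    return (per_attempt + 0.5) * 25 / 6

# Hypothetical stat line, chosen so that no component is clamped
line = dict(att=30, comp=20, yds=250, td=2, ints=1)
official = passer_rating_official(**line)
simplified = passer_rating_simplified(**line)
print(f"official {official:.1f}, simplified {simplified:.1f}")
```

Both functions give the same rating for this line, which is what you would expect if the simplification is right.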

And don't get me started on the BCS...

Yes... sport statisticians must not really know how to work with numbers. How else can you explain all the money the Vegas Sportsbooks give away each day?
Seriously, you need to think your comment through again... You probably don't even know who Roxy Roxborough is. He knows plenty about statistics, with all the error bounds, randomness, dispersion, and any other factor you care to toss at him. They don't set the betting line in Vegas by just using averages. If they did they'd be broke in a month.

Nice try at a red herring. I never said the people in Vegas don't understand statistics, although their success is far more due to the statistical ignorance of their patrons than their brilliance.

However, I don't need to be able to name drop to recognize baloney statistics for what they are, and mainstream sports are full of them. They make for great fodder for discussion among my fellow actuaries.

I don't see where Vegas entered the discussion. The issue is the statistics regularly cited in TV or newspaper sports commentary, and by even meticulous sports fans.

I for one would like to see error bounds around measures of average yards rushed, ERAs, etc. I'd like to see p-values attached to a claim that this player is on a streak, or that he bats better against righties than lefties.

This isn't about gambling, it's about the use and misuse of statistics in our daily conversation. Whatever the Vegas bookies do doesn't enter into that, and is truly a red herring.

The Red Herring here is that you say sports statistics are baloney. What do you need to see? Some report of standard deviations along with error bounds for yards rushed per game or shooting percentages? It is true that Vegas profits from the statistical ignorance of the masses, but you are ignoring the people that do know stats. The best handicappers work the books hard based on a thorough statistical analysis that considers far more stats than what you see in the back of the sports page. Vegas doesn't like to give away chunks of money to anybody. The fodder you discuss with your actuary friends is precisely the red herring the bookies love people to see.
If you want better sports stats, you have to dig them for yourself.
If you're such a hot shot stat guy... play against the line.
Make the dough and prove it.

The Red Herring here is that you say sports statistics are baloney. What do you need to see? Some report of standard deviations along with error bounds for yards rushed per game or shooting percentages?

Yes, for starters. That is very relevant information, since there is a cost to variation of results. There also needs to be more thought behind the figures, e.g. basketball rankings by points scored. They act as if this has more bearing on offensive skill than it does, for reasons already discussed.

The fodder you discuss with your actuary friends is precisely the red herring the bookies love people to see.

Hardly, since we are trained statisticians and they often aren't. Those things known as "gambler's fallacies", like the due theory, were promoted by people who think like bookies, not people who think like actuaries. There's a reason you don't see many Statistics PhDs sitting at the roulette table.

If you're such a hot shot stat guy... play against the line. Make the dough and prove it.

Non sequitur. My claim, and Josh's as well, is that the sports statistics presented by many sports leagues and obsessed over by fans aren't nearly as meaningful as they think they are. Gambling against the spread is your little red herring, because it's totally irrelevant to what we are talking about.

Oh good grief... Go calc the error bounds and p-values if you really need them when you figure a batting average or free throw percentage. The NFL, NBA, and MLB all do a great job of providing the raw data. Most statistics used in daily conversation are misused, and sports are no exception. You can make the same statement about what is reported in the business pages and what is reported for economics and social sciences in the press. Why? Because most people don't know the difference between the median and the mean.
The point about the bookies in Vegas is that they actually do take the time to evaluate athletic performance statistics in a very rigorous fashion. They don't report the p-values because they don't want to advertise that a team just might be in a slump!

Most statistics used in daily conversation are misused, and sports are no exception. You can make the same statement about what is reported in the business pages and what is reported for economics and social sciences in the press. Why? Because most people don't know the difference between the median and the mean.

Yes, and most people also don't seem to understand a lot of things about statistics. So what's your point, we shouldn't point out these things in sports because they happen elsewhere?

The remarkable thing about statistics in sports, and somewhat the point here, is that one would think the people involved would understand them better. For example, the Law of Averages (or large numbers) is not all that complicated, and yet every sportscaster I've ever heard talk about it (with one exception) got it wrong, making the "gambler's fallacy" mistake of thinking a miss is "due" if there have been several hits. Of course, sportscasters far too often can't be bothered to know the names and positions of the players, and sometimes not even the rules of the games they announce, so I guess I shouldn't be shocked that they don't bother to research a statistical term.

Just study the history of the NBA lottery and the BCS, and all the changes they made over the years, basically because they couldn't seem to understand the basic statistics behind what they were doing. One would think a league making so many millions of dollars would hire PhDs in statistics to keep them from looking like such fools, but they don't.

As one last example, take "the zone", the idea that players get into a premo psychological state that enhances their performance to the extent where they almost can't fail. We hear sports people talk about this all the time. Players talk about it as a fact. Teams have hired psychologists to try to tap into it. And yet a statistical study in the NBA blogged on here recently showed there was no zone. Try telling the coaches and players that.

The point about the bookies in Vegas is that they actually do take the time to evaluate athletic performance statistics in a very rigorous fashion.

First off, bookies and what they do had nothing to do with the point of the original post.

But since you are obsessed with this subject, fine, I'll bite: Prove it, show me some documentation. I don't buy it, because all a bookie has to do to make money on a game is get action roughly 50/50 for the two teams, thus guaranteeing him a profit no matter who wins, because of the "juice", the 10% or so extra they get from the losers. It doesn't take any kind of sophisticated analysis to do that, you just set a line and shift it if the action gets too lopsided. You do notice Vegas often shifts the point spreads over time. They aren't changing based on someone's statistical analysis, they are doing it because they are getting a 55/45 bet split or worse.

But hey, I'd love for you to prove me wrong, it'd give me some hope for the universe that good statistical analysis was being treated as the serious subject it is by people it affects so strongly. This again, is sort of why I, and I suspect Josh as well, brought the subject up in the first place. It's not that these sports leagues are making tiny nitpicky errors that only Geek God cares about. It's that they are making the kinds of mistakes that anyone with any clue statistically wouldn't make, nor would someone who just took the subject seriously.

Daprez, my point in the original post was that the general public misunderstands statistics because the major place they encounter them, sports commentary, leaves off estimates of error. I don't know how bookies set the line, and for these purposes, I don't care.

My issue is the misrepresentation of statistical thinking to the general public, in ways which are ultimately harmful to public discussion of more critical issues.

The fodder you discuss with your actuary friends is precisely the red herring the bookies love people to see.

How would you know? You think you and whatever bookies you know understand statistics better than we do? Thanks for the belly laugh.

If you want better sports stats, you have to dig them for yourself.

No shit, that's part of the point of this post.

If you're such a hot shot stat guy... play against the line. Make the dough and prove it.

Non sequitur. That's like responding to my claim that pro wrestling is fake by challenging me to beat the champ. If you gave me a ranking of Top NFL Running Backs that had them rated by height, I need hardly be able to produce winning bets on games to rightly point out that this statistic is crapola. Likewise for some of the examples I gave earlier.

And since you raise the issue of performance, I fill my pockets at the poker tables with substantive regularity, and a lot of my victims say the same sorts of things you have.

I think MarkP's description about how the Vegas line is set is correct. Pick a number that's close to the feel on the street, then tweak it to make sure that the bets are roughly evenly matched. Bookies don't need to have the right numbers, they just need to have numbers that match the mood of the betting public.

This is less sophisticated than what you'd really like, which would be something closer to the options or futures market. In that situation, each bettor could pick his or her own spread, and you would see the real distribution of estimates. People too far from the mean would never get their bets taken, and the market's estimate of the real spread would be an aggregation of all the estimates by individuals. The interesting games would be the ones with high variability in the marketplace, a statistic that isn't available to anyone right now.

You guys want the sports page to read like a doctoral thesis for particle physics. OK... start publishing one. The public is clamoring for it. Seriously, think about all the fantasy leaguers that need more and better stats.
The point of the post is a lame-ass whine about the public being dumb, and sport statistics are used as the strawman to elevate yourselves as superior know-betters because you've studied statistics. "Look at the imbeciles as they misuse our holy objects!"
The Vegas line is set in theory to get the money equal on both sides, and then the book gets to keep the vigorish. But this isn't what Vegas does in practice. If you care to do the analysis, adding all proper statistical rigor, you will find that less than 15% of lined events have books balanced within 10% from ideal (50/50 split). 60% of the events fall within the 61/39 to 79/21 range, and the remaining events are greater than 80/20 split range. Why? Vegas wants people to jump all over the favorite and then collect big when the favorite fails to cover.

I should say Vegas wants people to bet heavily on one side, not on the favorite. The money from the vig is puny compared to what they get from collecting from the heavy side of a bet.

I never said that the public was stupid. I think there's a missed opportunity. Sports pages have lines and lines of statistics, but the stats listed are mostly bogus, lacking the measures of error that make statistics a meaningful discipline and that make comparison of numbers useful. My beef is that "statistics" in sports misinform the public about what statistics can do for people. I don't blame the public, I blame the people who put together sports pages and sports commentary.

I should say Vegas wants people to bet heavily on one side, not on the favorite. The money from the vig is puny compared to what they get from collecting from the heavy side of a bet.

That makes no sense. Having a skewed pattern like that only increases the coefficient of variation of their results, which is more of a threat to their bank. It wouldn't increase their expected value one whit.
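A minimal sketch of that argument, under two stated assumptions that are mine and not from any sportsbook: every bettor risks 11 to win 10, and each side covers the spread with probability 0.5 (that is, the line is fair). The function and parameter names are hypothetical.

```python
def book_profit_stats(frac_on_a, handle=1.0, p_a_covers=0.5):
    """Expected profit and standard deviation for the book.

    Assumes every bettor risks 11 to win 10 and that side A covers with
    probability p_a_covers (0.5 means the line is fair).
    """
    payout = 10 / 11  # winnings paid per unit staked on the winning side
    on_a = frac_on_a * handle
    on_b = handle - on_a
    profit_if_a = on_b - payout * on_a  # keep B's stakes, pay A's winnings
    profit_if_b = on_a - payout * on_b  # keep A's stakes, pay B's winnings
    expected = p_a_covers * profit_if_a + (1 - p_a_covers) * profit_if_b
    variance = (p_a_covers * (profit_if_a - expected) ** 2
                + (1 - p_a_covers) * (profit_if_b - expected) ** 2)
    return expected, variance ** 0.5

for frac in (0.5, 0.6, 0.8):
    ev, sd = book_profit_stats(frac)
    print(f"{frac:.0%} on one side: expected {ev:.4f}, std dev {sd:.4f}")
```

With a fair line the expected profit is the same 1/22 of the handle however the action splits, while the standard deviation grows with the imbalance. If the book really does know that the lighter side is more likely to cover, p_a_covers moves away from 0.5 and the expected value changes too, which is exactly the claim in dispute.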

Where are you getting your information on what amount of lined wagers produce what spreads of bets? And what makes you think simply adding up historical events (if that is indeed what you did) constitutes "statistical rigor"? Do you even know what any of these terms mean? At the moment it appears you don't, and are merely tossing around jargon to cover your lack of understanding. Didn't work.

I googled Roxy Roxborough, landing me here:

http://www.thegreek.com/2006/nfl-betting.asp

where I found these pearls of wisdom:

"If I can break down a typical 14 or 15-game NFL weekend, there are usually four or five games where there's not much of a decision. For some reason, those game aren't particularly attractive, be it the matchup or that people don't see any edge in the number. Those games don't move. Then there are four or five games where the NFL betting action is split, where there's good two-way action. And then there are a handful of games - let's say four - where most of the action is on one team, where the betting is very one-sided. It all comes down to those one-sided games. If the bookmaker splits them, because of the vigorish, he does fine. If he goes 3-1, he does great; if he goes 4-0, he does absolutely fantastic; if he goes 1-3, that's not so good; if he goes 0-4, that's a disaster."

Rigorous statistical analysis? Yeah, for 4th graders. OK, that's not fair, because these guys are high school graduates. I know because I looked at several such sites, which made sure to trumpet that accomplishment. They also touted Mr. Roxborough's business savvy, and his foresight in using computers and other modern technology, all of which he no doubt deserves. But none mentioned standard deviation, or even something as basic as a mean, even once. They are all full of standard hot-shot gambler's jargon, the above being typical, showing no evidence of any statistical understanding worthy of even an introductory class.

Clearly even he admits as much above, once you get down to the heart of what he is actually saying. Bookies don't win because of superior statistical analysis. They win because of the vig.

As Wikipedia points out:

Because of the vigorish concept, bookmakers should not have an interest in either side winning in a given sporting event. They are interested, however, in getting equal action on each side of the event. In this way, the bookmaker minimizes their risk and always collects a small commission from the vigorish. The bookmaker will normally adjust the odds (or line) to attract equal action on each side of an event.

Moving the line to bias the action would mean that you get fewer bets, reducing the income for the bookie.

Josh,
I can only imagine what the sports page would look like if they started adding error bounds and p-values. I am wondering just how it would look and how you would measure it. Are you going to show how far the batter misses the ball when he swings? How badly the free throw was missed? It just doesn't make sense because there are no measurements to use.
And like I said before Josh, we all know what the bookies should be doing in theory, no need for the wiki. The issue here is what they are really doing. They are manipulating the line to cater to fan-based loyalties to add bigger profits for the bookies.
MarkP, the information about how much is bet on one side of the line is available from several online sportsbooks. As much as you may think it is a skewed pattern it is not. The bookmakers are manipulating the public with the lines to get a heavy side of the bet for games which there is good statistical information that shows the other side of the bet is actually at an advantage with the spread. Remember, the books are not interested in who wins the game. They are interested in the points scored. No jargon tossing here, just looking at the numbers as reported by the books that share them.
Now maybe they are lying... but if they are I'm still making great money from the information they provide.
$500 bucks of the KU-Texas game. 23% of the betting public took Texas at +8 or +7. Wow... the books sure soaked the KU faithful with that one.

I'd like to see sports writers and commentators saying "X has an ERA that is above Y's, but they aren't statistically different." Or "Foo has a batting average of .244 when a man's on 2nd, which isn't statistically different from his overall average." And if they'd cut the meaningless "streak" column from the standings, they could put a confidence interval on some team averages.
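Here is a rough sketch of the kind of check behind such a statement, using a pooled two-proportion z-test on at-bats with and without a man on 2nd. The counts are invented, the function name is hypothetical, and treating at-bats as independent trials is a simplifying assumption.

```python
import math

def two_proportion_p_value(hits1, n1, hits2, n2):
    """Return (z, two-sided p-value) for a pooled two-proportion z-test."""
    p1, p2 = hits1 / n1, hits2 / n2
    pooled = (hits1 + hits2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided normal tail probability via the complementary error function
    return z, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical splits: 22 for 90 (.244) with a man on 2nd, 128 for 460 otherwise
z, p = two_proportion_p_value(22, 90, 128, 460)
print(f"z = {z:.2f}, p = {p:.2f}")
```

With these made-up counts the p-value comes out around 0.5, so a .244 split over 90 at-bats would not be distinguishable from the rest of the season.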

MarkP, the information about how much is bet on one side of the line is available from several online sportsbooks. As much as you may think it is a skewed pattern it is not.

It most certainly is, perhaps you should look up the definition. "Bet heavily on one side" as you put it, is basically what skewed means. If you wonder why I have such low confidence in the statistical capabilities of bookies and their fans, it's because of shit like this. If you can't even get the basics right, why should I believe you can do, or judge, anything actually sophisticated?

The bookmakers are manipulating the public with the lines to get a heavy side of the bet for games which there is good statistical information that shows the other side of the bet is actually at an advantage with the spread.

So you assert. I'm still waiting for evidence one that this is the case. I gave your guy Roxborough a fair shot, but his quotes were all the same kind of shit, either totally obvious, per his quote I referenced in post #20, or just meaningless or misused jargon, like referring to basketball, but not football, as "linear". This is the typical kind of ignorant BS I've come to expect from gamblers and bookies, and you guys have not disappointed.

Remember, the books are not interested in who wins the game. They are interested in the points scored.

Gee, you think? Are you going to tell me now that going 4-0 is fantastic, but going 0-4 is a disaster? Words of wisdom...

Tell me, what is the breakeven winning percentage a gambler needs against a 10% vig? I can figure it out, and will do so in my next post. Can you? I am not optimistic. And if you can't manage that, then sorry, this conversation is clearly over your head.
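For reference, a small sketch of that calculation, assuming a "10% vig" means the usual arrangement of risking 11 to win 10 (that reading is an assumption, and the function name is hypothetical):

```python
def breakeven_win_rate(risk, win):
    """Win probability at which a bet risking `risk` to win `win` breaks even."""
    # p * win - (1 - p) * risk = 0  =>  p = risk / (risk + win)
    return risk / (risk + win)

rate = breakeven_win_rate(11, 10)
print(f"{rate:.4f}")  # 11/21
```

Under that assumption a bettor has to win 11 of every 21 bets, a bit over 52%, just to break even.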

No jargon tossing here, just looking at the numbers as reported by the books that share them.
Now maybe they are lying... but if they are I'm still making great money from the information they provide.
$500 bucks of the KU-Texas game. 23% of the betting public took Texas at +8 or +7. Wow... the books sure soaked the KU faithful with that one.

Dude, if you were trying to persuade me that you weren't just another ignorant gambler, this was the last thing you should have done. Every loser idiot tries to defend his "system" by referencing some big win he just had. What matters is long term return, not anecdotes.

And hey, I'll stand corrected if I'm wrong. Show me some data. Link me to a site where I can look at their spread of action before the games occur, or one that describes what they are doing in a statistically rigorous way. Do it and I'll eat my words. Roxborough's quotes were a joke, only further convincing me that these guys can't find their statistical ass with both hands, just like the con artists that hawk their "money management" winning poker and craps strategies. They only make money over the long haul because of the vig, even Roxy admitted as much in the #20 quote. It also doesn't hurt that most of the people they deal with are complete idiots. Vegas is Idiocracy incarnate.

I await, with great eagerness, evidence to the contrary.