Super Bowl Coin Toss, Mathematically

Every year there's a Super Bowl, and every year the whole shebang gets started by a famous person tossing a coin into the air. The team that wins the toss gets to decide whether to begin the game on offense or defense. Theoretically this choice might produce an advantage. If so, it would be interesting to know how much. The same question comes up in physics: just how much signal is hidden in the random noise of an experimental apparatus?

Let's take a look at the numbers and try to see what kind of advantage the toss-winning team has. The data is pretty straightforward - in 43 coin tosses, the winner has gone on to win 20 games and lose 23 games. As such it would seem that winning the toss is a disadvantage. But is it, or is the seeming disadvantage just a matter of the random fluctuations inherent in a small sample size? Keep in mind that the fairness of the toss itself isn't really the issue. If they all came up heads that wouldn't necessarily have any bearing on whether winning that biased toss was actually an advantage in winning the game. We just want to know if a toss win has any bearing on game outcome.

If there were no relationship between winning the toss and winning the game, we would expect the probability of a game win given a toss win to be 0.5. Now it's time to break out some math: with sample size n = 43 and probability p = 0.5, the normal distribution with mean np and variance np(1-p) is a good approximation for the distribution of game wins that the toss winner ought to experience.


Probability distribution of wins with n = 43, p = 0.5

The standard deviation is the square root of the variance; in this case it's 3.28. In a normal distribution, the actual number of successes falls within one standard deviation of the mean about 68% of the time. Here the mean is 21.5 game wins, so there's a substantial chance (~32%) that statistical fluctuations alone will put the actual number of game wins outside that range without the toss conferring any actual advantage or disadvantage. Thus if the number of game wins is inside the range of roughly 18-25, we have no grounds for concluding that victory in the toss affects victory in the game. Since the actual number of game wins is 20, well within that range, you probably don't need to worry if your team loses the toss. If the actual number of game wins were outside the one-standard-deviation interval but within the two-standard-deviation interval (about 14-28), we might raise an eyebrow, but we still wouldn't be on very solid ground in assuming a relationship. Outside the two-standard-deviation interval we might be justified in suspecting a relationship between toss wins and game wins. A larger n would help clarify the issue, since it shrinks the standard deviation relative to the expected number of wins.
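As a quick sanity check, here's a minimal sketch (not from the original post) of the normal-approximation arithmetic in Python:

```python
# Normal approximation to the binomial: n = 43 Super Bowls, p = 0.5 under
# the null hypothesis that winning the toss confers no game advantage.
import math

n, p = 43, 0.5
mean = n * p                         # expected game wins: 21.5
sigma = math.sqrt(n * p * (1 - p))   # standard deviation: about 3.28

# Roughly 68% of outcomes land within one sigma of the mean,
# and about 95% within two sigma.
one_sigma = (mean - sigma, mean + sigma)          # about 18.2 to 24.8
two_sigma = (mean - 2 * sigma, mean + 2 * sigma)  # about 14.9 to 28.1

print(f"mean = {mean}, sigma = {sigma:.2f}")
print(f"1-sigma: {one_sigma[0]:.1f} to {one_sigma[1]:.1f}")
print(f"2-sigma: {two_sigma[0]:.1f} to {two_sigma[1]:.1f}")
```

The observed 20 wins sits comfortably inside the one-sigma band, which is the whole argument in a nutshell.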

The coin toss that determines possession in overtime is another story. There the number of game wins compared to toss wins is wildly out of proportion to the p = 0.5 hypothesis, and the winner does in fact have an important advantage. But that's a story for another time.

[Statisticians will justly note that what I have not strictly done is to calculate estimated probability of a game win given a toss win. This is true, but is also more complicated and doesn't accomplish much that we didn't already learn by working in terms of deviation from an assumed p = 0.5.]

More like this

Aren't you forgetting that the winner of the coin toss is given a choice of choices: kick/receive and which end to defend? Can you be lucky in coin toss calls and unlucky in choosing which choice to make?

In this case the teams have made the question redundant: in all 43 cases the coin-toss-winning team elected to defend.

The coin toss that determines possession in overtime is another story.

I assume here that you are looking at all overtime games (or at least all overtime games over some period of time), since the number of Super Bowls that have gone to overtime is very small indeed.

Of course there is an obvious reason why the coin toss should matter in overtime. Under NFL rules the first team to score in overtime wins. Therefore if you win the toss you want to receive, in the hope that you score before the opposing offense even gets to take the field.

By Eric Lund (not verified) on 03 Feb 2010 #permalink

Exactly, and that's why (almost) no one likes the NFL's overtime rules. Adopting the NCAA's overtime rules would probably fix the issue.

I assume you meant to say (comment #2) that all the teams elected to receive.

I don't see anything in the rules about it, but I've heard a number of announcements this season that the toss winner elected to "defer," which has the practical result of the winner kicking off in the first half and receiving in the second.

I think this says more about the sport of American Football than it does about coin tossing and statistics. To me, it suggests that much of the 'bias', such as wind advantage, has been taken out of the game by stadiums with roofs etc.

It would be interesting to apply a similar analysis to a game like cricket (where there is a HUGE advantage in winning the toss), or to other football codes like soccer or rugby.

Matt, I'm teaching physics to ITT Technical Institute students. They have to write up a physics paper, so I started typing up papers as examples. One that got a little out of control is Niven Ring Gravitational Stability. Anyway, I thought that if you run out of topics you might consider the Ringworld stability problem. You can do it more elegantly than my version, which avoids calculus. Also, I left the .tex source in the same directory, as ringworld.tex and ringbib.bib.

By Carl Brannen (not verified) on 03 Feb 2010 #permalink

Matt, that graph would be more instructive if you put a thin line down the middle at 21.5, shaded the +/- 1 sigma region, and added a heavy line at 20. That would clearly show where the current situation is relative to random.

The commenter @1 is forgetting that only two Super Bowls have been played since the NFL allowed teams to defer their choice to the second half. Besides, what difference does it make? If there is an advantage from deferring the choice, that would still be expected to show up in the results.

The commenter @6 may not realize that both teams get to choose in each half (the option is between kick-receive and direction-of-play) and that the other team gets the first choice in the second half. And many Super Bowls (like this one) are played outside, although 40 mph winds are rare in Miami at this time of year.

BTW, I saw a college team captain make the mistake of choosing to defend, so the other team got the ball AND took the wind at their back when receiving the kick! (The coach must have wanted him to take the wind and kick off, but he tried to pick both and you can't do that. So, when finally forced to pick one or the other, he chose to kick off.)

By CCPhysicist (not verified) on 03 Feb 2010 #permalink

The above is based on the assumption that both teams are of equal strength, which is not always true.
It is possible, for example, that of the 20 wins, many were achieved by the team that had performed worse in the run-up to the Super Bowl. More victories by weaker teams helped by a favorable toss would still amount to an advantage.

@9: The weaker team has the same probability as the stronger team of winning the coin toss, assuming a fair (p=0.5) coin. So any effect due to uneven team strength should average out over the long run. Furthermore, as Matt showed in the post, the data do not support the existence of bias: the observed result is within the expected deviation from equal winning probabilities.

By Eric Lund (not verified) on 04 Feb 2010 #permalink

Bayesian shakes head and sputters:

Aren't the weasel words of the last paragraph really just saying that "since the evidence is so weak, let's just not bother thinking very hard about how strong it really is"?
An example where the toss winner won, say, 29 of 43 games might make you have to think a bit harder about it.

"Thus if the number of game wins is inside the range 18-25, we have no grounds for concluding that victory in the toss affects victory in the game." That suffers similarly. I hope you are not saying that 25 of 43 does not affect your opinion at all ("no grounds"), but suddenly, almost magically, 26 of 43 will.

I would very much appreciate knowing why you did not use a chi-square. I ask as one who does not know all that much about statistics.

By Jim Thomerson (not verified) on 04 Feb 2010 #permalink

I would very much appreciate knowing why you did not use a trombone. I ask as one who does not know all that much about brass wind instruments.

By Anonymous Coward (not verified) on 04 Feb 2010 #permalink

Jim #12: Chi-square is the wrong tool for this job. You use chi-square to determine the coefficients of whatever function you are fitting to a set of data points, and then to evaluate how good that fit is compared to some other function that might also fit the data. Obviously you need more data points than model parameters, as otherwise you can choose parameters which exactly fit your data. Here we have a single data point: the winner of the coin toss at the beginning of the Super Bowl has won M out of N games, and the question Matt is asking is whether M/N is consistent, within statistical uncertainties, with the expected value of 1/2 (which turns out to be true). Anonymous Coward's reply may have been excessively snarky, but he has a point.

By Eric Lund (not verified) on 05 Feb 2010 #permalink

Hmm, I thought coward was 100% snark, perhaps clueless snark even. That could be wrong, and instead it is just terminology that is the problem - there are so many "chi-square tests" that one can easily confuse them.

I think what Jim means is this one: we observe 20 wins and 23 losses, and expect 21.5 wins and 21.5 losses (under the null hypothesis that the chance of winning is .5).
The observed-minus-expected values are -1.5 and 1.5, so Pearson's chi-square test statistic, which is the sum of the squares of those divided by the expected values, is
1.5*1.5/21.5 + 1.5*1.5/21.5 = .209.
Looking at how much area is under a chi-square distribution (with one degree of freedom) to the right of .209, I get .647. That is, the (two-sided) p-value is .647.

Now let's do the normal approximation test.
I get np = 21.5 of course, and sqrt(np(1-p)) = 3.279, as in the post. We actually observed 23 losses, which is 1.5 from the expected mean, so our data is
1.5/3.279 = .4575 standard deviations from the expected mean. Looking at the area under a gaussian (normal) curve to the right of .4575, I get .324, which is the p-value for the one-sided test; twice that is .647, the p-value of the two-sided test (which asks whether we saw either too many or too few wins).
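For the curious, the normal-test arithmetic above can be checked with a few lines of Python (a sketch, using the standard library's error function for the normal CDF):

```python
# Normal-approximation test: how many standard deviations is the observed
# count from the expected 21.5, and what p-value does that imply?
import math

def normal_cdf(z):
    # CDF of the standard normal, written in terms of the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, p = 43, 0.5
sigma = math.sqrt(n * p * (1 - p))   # about 3.279
z = 1.5 / sigma                      # about .4575

p_one_sided = 1 - normal_cdf(z)      # about .324
p_two_sided = 2 * p_one_sided        # about .647
print(f"z = {z:.4f}, two-sided p = {p_two_sided:.3f}")
```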

If you can convince yourself (or just trust me, though I consider faith-based math to be an abomination) that the square of a gaussian random variable has a chi-square distribution, you can show that the p-values obtained are actually identical, using just algebra.

The first (Pearson) method may seem rather "magical" though.

(Third method: You could actually think of binomial distributions and not have to use approximations. )

Missionary work: This does not tell you how you want to alter your wagering after seeing the result of the coin toss. If you have to (or want to) put your money where your math is, you might want to consider Bayesian methods.

Oh, hell, I show the direct method, for the zero people interested in this engrossing subject. Also how a bayesian would work it, briefly.

If the chance of winning is p = .5 (null hypothesis), then the chance of winning exactly x times in 43 trials has a binomial distribution, which is hard to write in plain text:
P(x) = power(.5,43) * 43!/(x!*(43-x)!)
where power(.5,43) is .5 to the 43rd power, a tiny number,
and z! is "z factorial" and means z*(z-1)*(z-2)*...*1.
That 43!/(x!*(43-x)!) crap is really just counting how many different ways I can get x wins out of 43 tries, and is a huge number.

The chance of the coin-toss-advantaged team winning 23 times or more, or 20 times or less, is one minus the chance of winning exactly 21 or 22 times. The chances of winning exactly 21 and exactly 22 times are both about .1196, it turns out. So the sum of those is about .239, and one minus that is about .760792, which is the (two-sided) p-value.

So our approximate methods rather suck, eh? They got .647.
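Here's a sketch of the exact binomial calculation above, using Python's math.comb (nothing here is from the original comment beyond the numbers it quotes):

```python
# Exact binomial: under p = .5, the two-sided p-value is one minus the
# probability of exactly 21 or exactly 22 wins in 43 games.
from math import comb

n = 43
def pmf(k):
    # P(exactly k wins) with p = .5 is C(43, k) * .5^43
    return comb(n, k) * 0.5**n

p21, p22 = pmf(21), pmf(22)      # each about .1196, by symmetry identical
p_exact = 1 - (p21 + p22)        # about .7608
print(f"P(21) = {p21:.4f}, P(22) = {p22:.4f}, exact two-sided p = {p_exact:.4f}")
```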

A better approximation is to use a "continuity correction" in the normal or chi-square tests above. For the chi-square we could use the so-called "Yates correction" which says to reduce the observed-expected differences by .5, and we'd get
1*1/21.5 + 1*1/21.5 = .093
and area right of that under chi-square distribution turns out to be about .760368. Not bad!

It's not-so-hard to motivate that correction if we look at the normal approximation test - we think that some of the area between 22 and 23 should go with "23 or more" and some with "22 or less" to make it more accurate. After all it was not really possible to win exactly 22.7 times.
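The Yates-corrected version works out in the same style (again a sketch, with the chi-square(1) tail area computed from the normal CDF):

```python
# Yates correction: shrink each |observed - expected| difference by .5
# before squaring, then look up the chi-square(1) tail area.
import math

expected = 21.5
diff = 1.5 - 0.5                  # continuity-corrected difference
stat = 2 * diff**2 / expected     # about .093

# For one degree of freedom, P(chi2 > s) = 2 * (1 - Phi(sqrt(s)))
phi = 0.5 * (1 + math.erf(math.sqrt(stat) / math.sqrt(2)))
p_yates = 2 * (1 - phi)           # about .7604, much closer to the exact .7608
print(f"stat = {stat:.3f}, corrected p = {p_yates:.4f}")
```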

Bayesian learning:
for this I consider my personal knowledge, and rather think that winning the toss would give a very slight advantage, perhaps p=.52, and my certainty about that is perhaps equivalent to data where I saw 50 games and observed 26 wins (26/50). (You will not agree with that exactly.) But crap, the data is pointing the other way, 20/43. After seeing that I now think the chances of winning are (20+26)/(43+50) = 46/93 = .4946.
That's right, the data HAVE actually altered my opinion slightly - like they should, no? It's called learning, and it's a good thing. Yes, more data is better, but that's always true, and doesn't help distinguish good from stupid methods. Note it gets trickier if you have allot of faith that p actually is exactly .5, but I see little reason for that in our example.
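The pseudo-count update described above amounts to a one-liner (a sketch; the prior of 26 wins in 50 imagined games is the commenter's stated belief, not data):

```python
# Bayesian pseudo-count update: combine the prior "pseudo-data"
# (26 wins in 50 imagined games) with the observed 20 wins in 43.
prior_wins, prior_games = 26, 50   # the commenter's prior belief about toss advantage
obs_wins, obs_games = 20, 43       # the actual Super Bowl record

posterior = (prior_wins + obs_wins) / (prior_games + obs_games)  # 46/93
print(f"posterior estimate of p: {posterior:.4f}")               # about .4946
```

Formally, this is the mean of a Beta posterior with the prior counts treated as pseudo-observations.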

Football? No thanks, I'm a steelhead fan. Does the orange yarn fly outfish the dark hex nymph on sunny days in Feb is the burning statistical question.

Aren't the weasel words of the last paragraph really just saying that "since the evidence is so weak, let's just not bother thinking very hard about how strong it really is"?

Nah, they're really just saying "Rigorous statistics is very tedious and I don't want to bother given that it won't actually tell us anything new."

Fortunately I have some excellent commenters, and in fact Rork has worked out the problem in much more detail. He finds, of course, that the sample doesn't tell us much.

The who-cares attitude works fine if one is just doing arm-chair work I suppose.
I'd like to gamble on some coin tosses with you some time though.

While I can't be as snarky as the trombonist, I'll do my best:

I am amused to discover, by reading this in-depth statistical analysis, that there have only been 43 coin tosses at the start of a game in the last 40-odd years of American Football. Perhaps the other 3000 in the last 12 years alone are, ahem, journalistically insignificant?

Exactly, and that's why (almost) no one likes the NFL's overtime rules. Adopting the NCAA's overtime rules would probably fix the issue.

Or, just call it a draw!

By Donalbain (not verified) on 07 Feb 2010 #permalink

Just a couple of quick comments from me. First, thanks to Matt for going to the trouble. Thanks also to rork for a different perspective. But come on, sporting enthusiasts shouldn't belittle enthusiasts of other sports...unless maybe it's WWF. How do we say? It's not cricket? But statistical prowess (interesting to those of us who have long since forgotten it from college days) can quickly turn to statistical snobbery. As such, may I observe that one's English and spelling often says "allot" more about him than his math skills.

Also, I can't help but wonder if Anonymous Coward is still waiting for someone to call his hand on the "brass wind instruments." If so, duly observed.

Thanks again to all. This has been most entertaining. Hopefully it will not be more entertaining than the game itself.

What about another Super Bowl coin toss anomaly? I understand the NFC has won the coin toss 11 years in a row. If that is true (and I don't have the data to support it), does the calculated probability of that occurrence raise eyebrows? One calculation is .5^11 = .00049. That probability falls at 3.297 standard deviations. I did a RAND() trial of 1000 columns of 11 games and got four 10's and one 11, or one in 1000. In general, is the extreme unlikelihood of an event reason to question the randomness of the result, or does there also have to be a hypothesis that proposes the result to be non-random? The calculated probabilities are extremely low, but I am willing to shrug it off as random good fortune because I have no other hypothesis for the effect. But suppose I did come up with a hypothesis, like "The NFC has learned to identify the sound the coin makes spinning as a forecast of the outcome of the coin toss!" Does this low calculated probability give reason to consider the hypothesis possible?
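Here's a hypothetical reconstruction of the commenter's RAND() experiment in Python (the seed and structure are my own assumptions, not the original spreadsheet):

```python
# Simulate 1000 "columns" of 11 fair coin tosses and count how often one
# side wins 10 or more of the 11, mirroring the spreadsheet trial above.
import random

random.seed(0)   # fixed seed so the run is reproducible
trials = 1000
extreme = sum(
    1 for _ in range(trials)
    if sum(random.random() < 0.5 for _ in range(11)) >= 10
)
print(f"columns with 10+ wins out of 11: {extreme} of {trials}")

# Exact figures for comparison:
p_eleven = 0.5**11          # all 11 tosses: about .00049
p_ten_plus = 12 * 0.5**11   # C(11,10) + C(11,11) = 11 + 1 = 12 ways: about .0059
```

With p roughly .006 per column, you'd expect around six such columns per 1000, so the commenter's five-in-1000 result is unremarkable.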

@23 - there's a good probability that something (anything!) extremely unusual will happen to you today. That's what makes life interesting :-)

Yes there does need to be some sort of testable hypothesis before we can assign any significance to a particular event.

For a very similar discussion of a totally different topic, see
http://www.universetoday.com/2010/02/09/seven-year-wmap-results-no-they…

A map of the temperature of the Universe shows Stephen Hawking's initials ("SH") conspicuously in an otherwise random pattern. Meaningful? What if we saw "NFC" instead? :-)