Ok, so yesterday wasn't quite as basic as I planned on shooting for in this week or two of working on non-mathematical concepts. But the idea was too cool to resist. This isn't exactly a mathematically elementary subject either, but the concept can be grasped without needing to see the actual functions involved.
This is a random sample of a large number of women, arranged by height:
There are many people of average height in the world, and a smaller number of very tall and very short people. The more extreme the height, the rarer the people with that height.
On the other hand, we could imagine a species in which there was a certain average height and every added inch dropped the probability by a constant amount. The way the heights were distributed wouldn't be a smooth curve (which happens to be called a Gaussian distribution, or a bell curve, or a normal distribution) as it is in the picture, but instead sort of a pyramid. But we never see that.
This isn't confined to biology either. Everything from the frequencies of photons emitted by a laser to the velocity components of a gas molecule does the same thing. That same smooth bell curve shows up throughout the sciences. It's inescapable. Why?
The answer is a mathematical fact called the central limit theorem. In slightly imprecise nonmathematical language it says the following: any time you have a quantity which is bumped around by a large number of random processes, you end up with a bell curve distribution for that quantity. And it really doesn't matter what those random processes are. They themselves don't have to follow the Gaussian distribution. So long as there's lots of them and they're small, the overall effect is Gaussian.
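As a rough illustration of the claim above, here is a minimal sketch that adds up many small, decidedly non-Gaussian random bumps and checks how Gaussian the total looks. The choice of exponential bumps, the function names, and the sample sizes are all assumptions made for this example, not anything from the original post.

```python
import random

def standardized_sums(n_terms, n_trials, seed=0):
    """Return n_trials standardized sums of n_terms Exponential(1) draws."""
    rng = random.Random(seed)
    # Exponential(1) has mean 1 and variance 1, so a sum of n_terms draws
    # has mean n_terms and standard deviation sqrt(n_terms).
    scale = n_terms ** 0.5
    return [
        (sum(rng.expovariate(1.0) for _ in range(n_terms)) - n_terms) / scale
        for _ in range(n_trials)
    ]

def fraction_within(values, k):
    """Fraction of values lying within k standard units of zero."""
    return sum(abs(v) <= k for v in values) / len(values)

sums = standardized_sums(n_terms=100, n_trials=4000)
within_one_sigma = fraction_within(sums, 1.0)
within_two_sigma = fraction_within(sums, 2.0)
# A true Gaussian would give roughly 0.683 and 0.954.
print(f"within 1 sigma: {within_one_sigma:.3f}, within 2 sigma: {within_two_sigma:.3f}")
```

A single exponential draw is strongly lopsided, yet the sum of a hundred of them should land close to the Gaussian fractions.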
This is of dramatic importance in the sciences, and aside from that it happens to be a good thing to watch for - you'll start seeing it everywhere.
The central limit theorem is the most underdiscussed aspect of the standard normal distribution and all work that follows on from it. Many will apply the assumption of a normal distribution when absolutely no justification for it exists. Case in point: the stock market and economics. Quants based, and still base, their predictions on the assumption that money trading of all kinds has some kind of bell curve distribution, meaning it's the result of random variables. As the recent global synchronised credit freeze showed, this assumption is not justified by evidence.
"Bell curve" doesn't necessarily mean "Gaussian". There are "bell-curve" distributions that have very different properties from the Gaussian, such as, say, undefined mean and infinite variance. The Cauchy distribution is a famous example.
Mandelbrot studied stock price variation back in the 60s (!) and found that the distribution was strongly "fat-tailed", apparently with a diverging variance. Anyone modelling stock price variation as Gaussian in this day and age is an idiot.
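To see what an undefined mean does in practice, here is a small sketch comparing averages of Cauchy draws with averages of Gaussian draws. The helper names, the seed, and the sample sizes are assumptions for illustration. The spread of the averages is measured by the interquartile range, because the variance of a Cauchy average does not exist.

```python
import math
import random
import statistics

def cauchy(rng):
    """Draw one standard Cauchy value by inverting its CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))

def gaussian(rng):
    """Draw one standard normal value."""
    return rng.gauss(0.0, 1.0)

def iqr_of_means(draw, n_terms, n_trials, rng):
    """Interquartile range of the mean of n_terms draws, over n_trials repeats."""
    means = [
        sum(draw(rng) for _ in range(n_terms)) / n_terms for _ in range(n_trials)
    ]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1

rng = random.Random(1)
cauchy_iqr_1 = iqr_of_means(cauchy, 1, 4000, rng)
cauchy_iqr_100 = iqr_of_means(cauchy, 100, 4000, rng)
gauss_iqr_1 = iqr_of_means(gaussian, 1, 4000, rng)
gauss_iqr_100 = iqr_of_means(gaussian, 100, 4000, rng)

print(f"Cauchy   IQR of means: n=1 {cauchy_iqr_1:.2f}, n=100 {cauchy_iqr_100:.2f}")
print(f"Gaussian IQR of means: n=1 {gauss_iqr_1:.2f}, n=100 {gauss_iqr_100:.2f}")
```

Averaging a hundred Gaussian draws should shrink the spread by about a factor of ten, while the average of a hundred Cauchy draws is exactly as spread out as a single draw.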
I can visualize the summation of probabilities even more easily when I recall those science museum exhibits of the dropping steel balls through a matrix of pins with bins underneath and a normal curve drawn on the surface of the glass covering the pin matrix. (Think of a pachinko machine.)
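That exhibit (usually called a Galton board) is easy to imitate. The sketch below assumes each ball bounces left or right with equal probability at every row of pins, and that the bin it lands in is simply the number of rightward bounces. The row count, ball count, and function name are made up for this example.

```python
import random
from collections import Counter

def galton_board(n_rows, n_balls, seed=0):
    """Drop n_balls through n_rows of pins; return the ball count in each bin."""
    rng = random.Random(seed)
    # Each ball's bin index is the number of times it bounced right.
    counts = Counter(
        sum(rng.random() < 0.5 for _ in range(n_rows)) for _ in range(n_balls)
    )
    return [counts.get(b, 0) for b in range(n_rows + 1)]

bins = galton_board(n_rows=12, n_balls=2000)
mean_bin = sum(i * c for i, c in enumerate(bins)) / sum(bins)

# Crude text histogram: one '#' per ten balls.
for index, count in enumerate(bins):
    print(f"{index:2d} {'#' * (count // 10)}")
```

The printed bars should pile up around the middle bins and thin out toward both edges, much like the curve drawn on the glass.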
Very nice, Matt. You've taken a very complicated statistical topic and started it off with a nicely intuitive demonstration. This is just the sort of short story that might pique the interest of a young student introduced to statistics for the very first time.
I must say, though, that from my googling of "central limit theorem" we get into the math deep end quite rapidly. If you had a link to share for further reading in the shallow end of the pool, that would be appreciated. This looks like a good way to introduce statistics.
For the CLT to hold, the underlying distributions do not have to be Gaussian, but they do have to have finite variance. This is a problem for data that exhibit power-law tails, such as securities returns. There is a generalized version for stable distributions (from wiki):
The central limit theorem states that the sum of a number of random variables with finite variances will tend to a normal distribution as the number of variables grows. A generalization due to Gnedenko and Kolmogorov states that the sum of a number of random variables with power-law tail distributions decreasing as 1/|x|^(α+1) (and therefore having infinite variance) will tend to a stable Levy distribution f(x; α, 0, c, 0) as the number of variables grows. (Voit 2003, § 5.4.3)
When writing this I came across a Wikipedia article with a very good illustration of the theorem. It starts with a single random variable with a very wonky-looking distribution, and then calculates the distribution for the sum of two of those random numbers. Then three, etc. The more terms, the more the distribution for their sum looks like a normal distribution.
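The same kind of calculation can be sketched in a few lines. The starting distribution below is an arbitrary lopsided one on the values 0 to 5, invented for this example and not the one from the Wikipedia article. The distribution of a sum of independent copies is obtained by repeated convolution and compared with the normal curve that has the same mean and variance.

```python
import math

def convolve(p, q):
    """Distribution of X + Y for independent X ~ p and Y ~ q on 0, 1, 2, ..."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pw in enumerate(p):
        for j, qw in enumerate(q):
            out[i + j] += pw * qw
    return out

def max_gap_to_normal(pmf):
    """Largest gap between pmf and the normal density with equal mean and variance."""
    mean = sum(k * w for k, w in enumerate(pmf))
    var = sum((k - mean) ** 2 * w for k, w in enumerate(pmf))
    norm = 1.0 / math.sqrt(2.0 * math.pi * var)
    return max(
        abs(w - norm * math.exp(-((k - mean) ** 2) / (2.0 * var)))
        for k, w in enumerate(pmf)
    )

base = [0.5, 0.0, 0.1, 0.0, 0.0, 0.4]  # hypothetical "wonky" starting distribution
gaps = {}
current = base
for n in range(1, 31):
    if n > 1:
        current = convolve(current, base)  # distribution of the sum of n terms
    gaps[n] = max_gap_to_normal(current)

for n in (1, 2, 5, 10, 30):
    print(f"{n:2d} terms: largest gap to the normal curve = {gaps[n]:.4f}")
```

The gap should shrink as more terms are added, which is the convergence the illustration shows graphically.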
One of the points I wanted to make is that variance of the underlying distributions has to be finite; that has been well covered above. The other point I want to make is that, for a finite number of observations, the tails converge much more slowly than the center does; if you're looking at tens of thousands of samples (with a probability, say, proportional to 1/(x^4+1)), you're liable to find a curve that looks very Gaussian within two or three sigma but still has tails that look like the original distribution (in this case, 1/x^4).
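Here is a rough sketch of the slow-tails point. Sampling directly from a density proportional to 1/(x^4+1) is awkward, so this example swaps in a Student-t distribution with 3 degrees of freedom, whose density also falls off like 1/x^4 and whose variance (3) is finite. That substitution, the term count, and the trial count are assumptions of this example.

```python
import math
import random

def student_t3(rng):
    """One Student-t draw with 3 degrees of freedom (density tails ~ 1/x^4)."""
    chi2 = rng.gammavariate(1.5, 2.0)  # chi-square with 3 degrees of freedom
    return rng.gauss(0.0, 1.0) / math.sqrt(chi2 / 3.0)

def standardized_sums(n_terms, n_trials, seed=0):
    """Sums of n_terms t(3) draws, divided by the standard deviation of the sum."""
    rng = random.Random(seed)
    scale = math.sqrt(3.0 * n_terms)  # each t(3) term has variance 3
    return [
        sum(student_t3(rng) for _ in range(n_terms)) / scale
        for _ in range(n_trials)
    ]

def fraction_beyond(values, k):
    """Fraction of values more than k standard units from zero."""
    return sum(abs(v) > k for v in values) / len(values)

sums = standardized_sums(n_terms=10, n_trials=40000)
tail_fraction = fraction_beyond(sums, 4.0)
gaussian_tail = math.erfc(4.0 / math.sqrt(2.0))  # P(|Z| > 4) for a true Gaussian

print(f"beyond 4 sigma: observed {tail_fraction:.5f}, Gaussian {gaussian_tail:.5f}")
```

Even after summing ten terms, far more of the samples should lie beyond four sigma than a Gaussian would allow, because a single large term can dominate the sum.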
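Re: some earlier comments: even the distribution of the speed of gas molecules isn't Gaussian. The speed follows the Maxwell-Boltzmann distribution; it is each individual velocity component that is Gaussian.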
Another key assumption in the Central Limit Theorem is that there are lots of random processes contributing to the overall distribution. There are applications where this assumption is false. For example, astrophysical plasmas are often collisionless, by which I mean that the mean time between a given particle's interactions with other particles is large compared with other time scales of interest (most often the cyclotron period). This is why cosmic ray fluxes obey a power law over so many orders of magnitude, for instance, and we often see power law distributions in velocity space for magnetospheric and heliospheric plasmas as well.
Another example may well be the stock market. If the number of random interactions in the stock market is sufficiently large, it doesn't matter that the prices of particular stocks vary in a non-Gaussian fashion. But not all interactions are random, as Wall Street quants are learning the hard way. One such nonrandom process: Forced unwinding of a leveraged position, especially a large one, will tend to reduce the value of the position being unwound. That's one of the things that happened in 1929: people forced to sell stock to meet margin calls pushed down the value of the stock being sold, causing other people to get margin calls, until essentially everybody who had stocks bought on margin was wiped out (at that time retail investors could have 10:1 leverage in stocks, vs. the 2:1 limit that has been in effect since the 1930s).
And, as a result, the most glorious tool of political science: random sampling and measurable error.
I'm having trouble understanding that there isn't a manmade element here.
The women in the picture aren't arranged by height. Arbitrarily-sized bins of women are arranged in height order. Looks like about 14 of them. If the bin size were much larger or much smaller, then the bell curve wouldn't be as obvious.
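The effect of bin size is easy to try out. This sketch uses 150 simulated "heights" in standardized units (mean 0, standard deviation 1), which are made-up numbers and not measurements from the photograph, and bins them at three arbitrary widths.

```python
import math
import random
from collections import Counter

def histogram(values, bin_width):
    """Count values per bin; bin i covers [i * bin_width, (i + 1) * bin_width)."""
    return Counter(math.floor(v / bin_width) for v in values)

rng = random.Random(3)
heights = [rng.gauss(0.0, 1.0) for _ in range(150)]  # simulated, standardized units
histograms = {width: histogram(heights, width) for width in (0.01, 0.5, 8.0)}

for width, counts in histograms.items():
    print(
        f"bin width {width}: {len(counts)} occupied bins, "
        f"tallest bin holds {max(counts.values())}"
    )
```

With very narrow bins almost every person stands alone and no shape is visible; with very wide bins everyone falls into one or two columns. Only the middle width gives something like the picture.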
The photograph seems to illustrate a combination of the CLT and some truism about sampling.
While random in nature many distributions will NOT be Gaussian. In engineering many of the distributions are Weibull distributions.
These are the foundation of reliability and wear-out calculations.
Matt, you might be interested in an article I wrote last night that equates the distribution of life on the planet to the partial pressure of gases dissolved in a liquid. I've labeled it the partial pressure of life (pL) and, among other things, show that geographically, from pole to pole, there appears to be a Gaussian distribution.
Dude, if this is the beginning of a series of posts on the mathematical basis of statistics, then "w00t!!!!"
I wholeheartedly disagree with Bell Curve theory being not applicable to markets. What you see is oscillation being deviated over a center line.
The market's run (DOW) from 1k (1983) to 14k (2008) was the peak of accumulation by a large percentage of the population investing for retirement. Now, on the distributive side, divestiture is occurring, although at a faster rate, in a flight to safety. This is due to the "investor IQ" behind the decision to put one's earnings into stocks and real estate. As this "IQ" is diminished by the rupture of the bubble, fewer and fewer people will be inclined to continue. Hence the Bell Curve shape of monetary amounts invested by population and percentage. (Fewer people involved, less money invested.)
Unfortunately, this interpretation of the Central Limit Theorem is WRONG. The Central Limit Theorem says nothing about individual heights (or weights, or lifetimes). What it says is that AVERAGES of many heights will have a Normal distribution as the sample size gets large. The fact that women's heights have a Normal distribution is not a consequence of the CLT. The fact that they are "bumped around by a large number of random processes" isn't enough. Lifetimes are bumped around by just as many, but they clearly aren't Normally distributed.
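The distinction between individual values and averages can be checked with a quick sketch. Lifetimes are modelled here as exponential draws purely as a stand-in (an assumption of this example, not a claim about real lifetimes), and skewness is used as a simple measure of lopsidedness; a Normal distribution has skewness 0.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: the mean cubed z-score (0 for a symmetric distribution)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

rng = random.Random(2)
# Individual "lifetimes": strongly right-skewed.
lifetimes = [rng.expovariate(1.0) for _ in range(5000)]
# Averages of 50 lifetimes each: what the CLT actually speaks about.
mean_lifetimes = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(50)])
    for _ in range(5000)
]

individual_skew = skewness(lifetimes)
mean_skew = skewness(mean_lifetimes)
print(f"skewness of individual lifetimes: {individual_skew:.2f}")
print(f"skewness of 50-lifetime averages: {mean_skew:.2f}")
```

The individual values should stay far from symmetric no matter how many are collected, while the averages should come out much closer to a Normal shape.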