statistics
Suppose you've got a bunch of data. You believe that there's a linear
relationship between two of the values in that data, and you want to
find out whether that relationship really exists, and if so, what the properties
of that relationship are.
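One common way to check for a linear relationship is an ordinary least-squares fit together with a correlation coefficient. The excerpt does not say which method the post goes on to use, so treat the following as a minimal sketch under that assumption. The function name `fit_line` and the sample numbers are made up for illustration.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept, plus Pearson's r.

    Hypothetical helper: assumes xs and ys have equal length (at least 2)
    and that neither list is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross deviations from the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # close to +1 or -1 means strongly linear
    return slope, intercept, r


# Made-up measurements, roughly following y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept, r = fit_line(xs, ys)
print(f"slope={slope:.3f} intercept={intercept:.3f} r={r:.3f}")
```

The slope and intercept describe the line, while `r` gives a rough sense of how well a straight line describes the data at all.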
Once again, I'll use an example based on the first example that my
father showed me. He was working on semiconductor manufacturing. One of the
tests they did was to expose an integrated circuit to radiation, to determine
how much radiation you could expect it to be exposed to before it failed. (These were circuits for satellites, which are exposed…
Several people have asked me to write a few basic posts on statistics. I've
written a few basic posts on the subject - like, for example, this post on mean, median and mode. But I've never really started from the beginning, for people
who really don't understand statistics at all.
To begin with: statistics is the mathematical analysis of aggregates. That is, it's a set of tools for looking at a large quantity of data about a population, and finding ways to measure, analyze, describe, and understand the information about the population.
There are two main kinds of statistics: sampled…
Earlier this week, I attended the International Human Microbiome Consortium Meeting (the human microbiome consists of the organisms that live on and in us). I'm not sure what to make of the whole microbiome initiative, but one thing is clear to me: this is being driven by the wrong group of scientists.
Instead of being directed by biologists (medical primarily) who have devised a set of important questions, and want to use the power of high throughput genomics, including metagenomics, which sequences all the DNA in a specimen--bacteria, viruses, fungi, protozoa, and, yes, human (which raises…
OK, so my mind isn't as great as Stephen J. Gould's was, but when The Bell Curve was first published, I remember looking at the data appendices, and thinking, "These data are crap." A few years later, I found an essay by Gould in The Bell Curve Wars that made the same point, albeit more eloquently. So why bring this up?
I've discussed the recent resurgence of idiotic statements about IQ and genetics, but something Atrios wrote about Saletan's recent missive bugged me (italics mine):
You know, when Saletan went down his courageous racist road it at first didn't even occur to me to bother to…
Last week, the Washington Post took Rudy Giuliani to task for an ad where he claims that his chances of surviving prostate cancer -- which he had about 6 years ago -- were much higher in the US than in the UK. The ad is meant to indict those who wish to modify the health care system.
He says:
"I had prostate cancer, five, six years ago. My chances of surviving prostate cancer and thank God I was cured of it, in the United States, 82 percent. My chances of surviving prostate cancer in England, only 44 percent under socialized medicine."
Here is the ad itself:
Since the ad, a flurry of…
Mark Liberman has an excellent post examining the general public's understanding of basic statistical concepts such as means, variances, and distributions. Here's a taste:
Until about a hundred years ago, our language and culture lacked the words and ideas needed to deal with the evaluation and comparison of sampled properties of groups. Even today, only a minuscule proportion of the U.S. population understands even the simplest form of these concepts and terms. Out of the roughly 300 million Americans, I doubt that as many as 500 thousand grasp these ideas to any practical extent, and 50,000…
Over at Hullabaloo, Tristero is confused by Michael Gordon's claim that the surge has lowered the violence in Iraq:
What I'd like to know, instead, is whether the conclusions he draws in the paragraph quoted are reasonable ones based upon the evidence he presents. Likewise, I'm aware of the "correlation does not necessarily equal causation" fallacy which Gordon flirts with. In this case, tho, I think it is very reasonable to assume that additional military power might "cause" decreased numbers of attacks. I just don't see them; the improvements, except as noted above, seem to be mostly…
General Petraeus is bringing new meaning to the phrase 'head count':
Intelligence analysts computing aggregate levels of violence against civilians for the NIE puzzled over how the military designated attacks as combat, sectarian or criminal, according to one senior intelligence official in Washington. "If a bullet went through the back of the head, it's sectarian," the official said. "If it went through the front, it's criminal."
Which led to this assessment:
"Depending on which numbers you pick," he said, "you get a different outcome."
Gee, do ya think?
So let's think about this 'metric'…
125 taxa. This analysis is never going to end (stupid GTR models):
I'm getting annoyed....
Update: It took five days to run this thing. And, yes, that was after using ModelTest.
By the way, if you have no idea what I'm talking about, here's a post (without technical jargon) on likelihood and phylogenetics to introduce you to the basics.
I hadn't really ever thought about it, but surveys consistently report that heterosexual men have a larger number of sexual partners on average than heterosexual women. However, that really isn't logically possible, is it? I mean, last time I checked it took two to tango. Mathematician David Gale demonstrates why these results cannot be right in the NYTimes:
One survey, recently reported by the federal government, concluded that men had a median of seven female sex partners. Women had a median of four male sex partners. Another study, by British researchers, stated that men had 12.7…
Over the weekend, there was a lot of discussion of those ridiculous conservative faithtank graphs that were rerun in the Wall Street Journal. Several of my fellow ScienceBloglings have debunked the analysis that claims these data support the Laffer curve, although my favorite criticism is by Brad DeLong, who points out that to prove something the editorial writers like (the Laffer curve), the Wall Street Journal editors use the Norwegian data, and to weaken something they don't (increased corporate taxes lead to increased tax revenue), they remove the same data.
Is there any question how…
So the last post was pretty dense, and I haven't used an example since the first post, so I thought I'd throw one out there that you can play with. In what follows, I pretend to use the equations, but I'm actually doing all this in Excel. If you've got Excel, here are some helpful functions. AVERAGE gives you the mean of a range of numbers, VAR gives you the variance, and STDEV gives you the standard deviation. Note that VAR and STDEV give you the variance and standard deviation for a sample (i.e., using n-1 instead of n). If you want population variance and standard deviation, use VARP and…
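For readers without Excel, the same sample-versus-population distinction can be sketched with Python's standard `statistics` module. The mapping to the Excel functions named above is my own rough analogy, and the data are made up for illustration.

```python
import statistics

# Made-up data, purely for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)            # comparable to AVERAGE

# Sample statistics divide by n - 1 (comparable to VAR and STDEV).
sample_var = statistics.variance(data)
sample_sd = statistics.stdev(data)

# Population statistics divide by n.
pop_var = statistics.pvariance(data)
pop_sd = statistics.pstdev(data)

print("mean:", mean)
print("sample variance:", sample_var, "sample sd:", sample_sd)
print("population variance:", pop_var, "population sd:", pop_sd)
```

Because the sample versions divide by n-1, they always come out slightly larger than the population versions on the same numbers, and the gap shrinks as n grows.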
So far we've been talking about different distributions and their parameters. If we're looking at a population with known parameters, then we're going to be dealing with either a normal distribution or a standardized normal distribution (Post I and II). If we're dealing with samples, we're going to use either the sampling distribution of means, if the population parameters are known, or more often, the t-distribution if they're not (post III). Normal and standardized distributions allow us to determine the probability associated with a particular value of a variable in a population, and thus…
Before we start in on new stuff, let's recap what we've covered so far. We started with the Central Limit Theorem, which tells us that if a bunch of random variables go into determining the values of yet another variable, then the values of that variable will approximate a normal distribution. The normal distribution is great because the measures of central tendency -- the mean, median, and mode -- converge, and because the measures of spread (variance and standard deviation) can be associated with specific probabilities (derived from the area under the curve in the distribution).
Then we…
So in the last post, we talked about the normal distribution, and at the very end, discussed that if you knew the mean and standard deviation of a population for a particular variable, then you can compute the probabilities associated with a particular value of that variable within that population. The problem is, to do so, you have to use a really long equation that involves math and stuff, and if you're reading this, chances are you're not a big fan of math. I know I'm not. What we need, then, is a simpler way to get those probabilities. And it turns out there is just such a way: a…
So here's the first post on statistics. If you know the basics, and I suspect most of you do, then you can just ignore these posts (unless you want to check to make sure I'm getting it right). If you don't know the basics, then hopefully you will when I'm done. Even for those of you who've never taken a stats class, much of this will probably be familiar, but I'm going to start from the assumption that I'm writing for someone who has no knowledge of statistics whatsoever, so bear with me. Alright, let's begin.
The Normal Distribution
In cognitive psychology, two related types of statistics…
For some reason, John Hawks thinks my disc flipping calculations have something to do with population genetics. He extends it to FST, which is just plain ridiculous. There is nothing about binomial sampling that can be related to popgen theory. Nothing.
In yesterday's post, I argued that, when flipping two unfair discs (or coins), there is a greater chance that both discs will land with the same side up than different sides up. As pointed out in the comments, I was assuming that the probability of heads is equal for both discs:
Aren't you assuming that p (and q=1-p) are the same for both discs? But isn't it more reasonable to assume that, while no disc has a perfect p=0.5 probability of landing 'heads', the p's of no two discs are likely to be the same? (Assume, perhaps, that each disc's p is drawn independently from some kind of larger…
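The commenter's point can be checked with a few lines of arithmetic. The sketch below assumes the two flips are independent and that each disc has its own fixed probability of landing heads; the function names are hypothetical and not taken from the post.

```python
def prob_same(p1, p2):
    """P(both discs show the same side), given each disc's P(heads).

    Assumes the two flips are independent.
    """
    return p1 * p2 + (1 - p1) * (1 - p2)


def prob_different(p1, p2):
    """P(the discs show different sides), under the same assumption."""
    return p1 * (1 - p2) + (1 - p1) * p2


# Equal bias: "same" is favoured whenever p != 0.5.
print(prob_same(0.6, 0.6), prob_different(0.6, 0.6))

# Biases in opposite directions: "different" is favoured instead.
print(prob_same(0.6, 0.4), prob_different(0.6, 0.4))
```

Under these assumptions the difference between the two probabilities works out to (2·p1 − 1)(2·p2 − 1), so "same" is the better call only when both discs lean toward the same side.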
The beginning of many Ultimate (née Frisbee) games is marked by flipping discs to decide which team must pull (kick off) and which goal each team will defend at the start of the game. This is sort of like the coin flip before an American Football game. Two players -- one from each team -- flip a disc in the air. A third player -- a representative from one of the teams -- calls "same" or "different", referring to whether both discs land with the same side (top/bottom or heads/tails) facing up or different sides facing up. If he guesses right, his team gets to choose whether they want to pull…
Keith Robison, at Omics! Omics!, asks and answers the question, "What math courses should a biologist take in college?" His answer: a good statistics course is a must (one where you learn about experimental design and Bayesian statistics), and a survey course that covers topics like graph theory and matrix math would provide a nice introduction to important topics (that course probably doesn't exist at most colleges). He also advocates taking a programming class and turning math education into something more stimulating rather than rote drilling (easier said than done).
This being a blog, I,…