Casual Fridays: Almost significant

Nonsignificance is the bane of every researcher. They know they've got an effect, but those darned statistics prove otherwise. In cognitive psychology, the standard for significance is p < .05, which means, essentially, that there's a 5 percent chance that the results are simply due to chance, instead of revealing a bona-fide phenomenon. For this week's Casual Friday, try as I may, I just couldn't find a significant effect.

The idea was straightforward enough. We based our design on a really cool study from Gillian Rhodes' lab, which found that by repeatedly exposing viewers to fatter-than-normal or thinner-than-normal faces, viewers' perceptions of "normal" and "beautiful" faces were distorted. If you saw lots of fat faces, then you thought fatter faces were more attractive, even after just a few minutes.

We wondered whether a face was even required to generate the effect. Instead of faces, we exposed our viewers to different shapes: a fat oval, a thin oval, or a circle:

[Figure: the three shapes shown to viewers -- a fat oval, a thin oval, and a circle]

The task you might have thought was the primary task was simply a distractor: viewers were asked to compare the colors of 10 different pairs of circles, fat ovals, or thin ovals. The real task we were interested in came at the end, when all participants were shown a woman's face and asked to rate it on two scales: from thinner than average to fatter than average, and from less attractive than average to more attractive than average.

If Rhodes' procedure worked for shapes as well as faces, then viewers who'd seen the fatter ovals should have rated the face as thinner than average, while viewers who'd seen the thinner ovals should have rated the face as fatter than average. Here are our results:

[Figure: mean ratings of the face for each shape condition]

It looks like we might have an effect. Indeed, there is a bit of a trend, but only for circles compared to thin ovals, and it's definitely not significant, no matter how I break down the numbers. Even with 500 participants, we don't have an effect.

That said, I'm pretty sure we've got something -- the numbers are quite close to significance. Maybe with some intermediate shape, or if viewers were exposed to a variety of different fat and thin shapes, or even with just a little more time with the shapes we used, the results might be significant. I'd say this would be a nice project for a graduate student -- if you're interested in using this study as a jumping-off point, you're welcome to it!


A question for those who participated:

Were you surprised by the face-rating at the end? Or were you suspicious that the study wasn't really about color differences from the start?

Not from the start, necessarily. But twelve questions into the "11 question" study, with all the colors looking pretty similar, my guess was that the link may have been incorrect.

I think my standard of beauty has now been set permanently by the third oval of question 17.

 

You should only have been asked 11 questions (other than the one which divided you into groups). I suppose technically you could count the two categorizations of the face as two different questions, so that would get you to 13.

The first question divided people into groups, so you may have been "fast-forwarded" to a certain question, but then you should have fast-forwarded to the end, for a total of 11 questions.

Based on the data, it appears that was the experience of all users except for one, who answered every single question -- so the test may have malfunctioned in your case. If that's what happened to you, please accept my apologies.

Yup, that was me, in a group of one. The face was question 31. Let's think of it as a separate little pilot experiment. Hypothesis: after viewing 30 nearly identical ovals, pretty much any other image offers considerable appeal.

  ;)

Cheers

Yeah, I suspected pretty early that the color was a red herring. Part of the problem with giving this kind of test on this kind of blog is that people here are likely to be familiar with psychological experiments and the kinds of methods used, and the test that isn't about what you think it's about is one of the standard tricks.

I didn't know what it was actually going to be about, though.

Part of the problem may be the mental shifting gears from looking at abstract shapes and thinking about their color to looking at a face and thinking about its shape. IIRC I took a few seconds to mentally re-adjust when I saw the face question - I don't know if looking at and thinking about faces uses a different part of the brain than looking at and thinking about other things, but I wouldn't bet against it.

"In cognitive psychology, the standard for significance is p < .05, which means, essentially, that there's a 5 percent chance that the results are simply due to chance, instead of revealing a bona-fide phenomenon."
No it doesn't. It's one of those myths about significance tests that seem impossible to dispel. What the test result means is that if the null hypothesis is true, and if the distributional assumptions are correct, and if you were to run this experiment a very large number of times, then the statistic you computed (a difference of means, for example) would fall in an interval that includes the most extreme 5% of the results.
Good luck using that in a scientifically meaningful way. The sad fact is that when an effect is obvious then significance tests will not give you any useful information, and, when it's not, they've a good chance of being misleading. It's a crying shame to see how significance testing has become so central in cog psych.
In your case anyway it's hard to say because there's no indication of the variability in your data on the graph. But if you do get an effect in the same direction for every subject, then it's likely that the effect exists but is small. Running a few extra subjects should do the trick.
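
To see what that long-run statement amounts to, here's a minimal simulation sketch -- assuming a simple two-group comparison of means, normally distributed data, and a true null -- showing that p falls below .05 on roughly 5% of repetitions, by construction:

# Minimal sketch: the long-run meaning of "p < .05" when the null is true.
# Assumes a two-group comparison of means with normally distributed data;
# both groups are drawn from the same distribution, so any p < .05 here is
# a false alarm.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_reps, n_per_group = 10_000, 30
false_alarms = 0
for _ in range(n_reps):
    a = rng.normal(0, 1, n_per_group)   # group 1: no real effect
    b = rng.normal(0, 1, n_per_group)   # group 2: same distribution
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_alarms += 1

print(f"Fraction of null experiments with p < .05: {false_alarms / n_reps:.3f}")
# Prints a value close to 0.05.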

I had no idea what the survey was about. I assumed it was just color judgments and the fact that you falsely judge colors a certain way when they're on different backgrounds.

Even when I got to the face I thought that it would be a color issue that was influencing our opinions of attractiveness. I really had no idea the ovals had to do with it.

By Katherine Moore on 13 Jan 2007

"It's one of those myths about significance tests"

I'm no statistical whiz, but I don't really see the difference between my explanation and yours. Yours is more comprehensive, and your point about the general problem with significance testing is well-taken, but I think we're saying essentially the same thing.

"Running a few extra subjects should do the trick."

By my calculations, assuming the trend holds up, it would take about 200 more participants, and that's just in the circle and thin-oval groups. Using our three-group design we would have had to run about 800 people in total to get the effect, since about 100 of the additional participants would have landed in the fat-oval group. Again, assuming the trend holds up, we could have done it with 500 if we had eliminated the fat ovals and tested only circles and thin ovals.

But a better design probably would have yielded results with even fewer participants. Unfortunately, we needed to run the experiment first to know what a better design would be.
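
If you want to play with that kind of arithmetic yourself, here's a rough back-of-envelope sketch using the standard normal-approximation formula for a two-group comparison; the effect sizes below are placeholders, not estimates from our data:

# Rough sample-size sketch using the normal-approximation formula:
#   n per group ~ 2 * (z_alpha/2 + z_beta)^2 / d^2
# where d is Cohen's d. The effect sizes tried below are placeholders,
# not estimates from our data.
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate participants needed per group for a two-sample comparison."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return 2 * (z_alpha + z_beta) ** 2 / d ** 2

for d in (0.2, 0.3, 0.5):
    print(f"Cohen's d = {d}: about {n_per_group(d):.0f} per group")
# Small effects (d around 0.2) need several hundred people per group,
# which is why dropping the fat-oval group helps so much.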

Sorry if this seems like quibbling: our two definitions are not equivalent. It took me a while to wrap my brain around this, but here's the idea. The problem with "there's a 5 percent chance that the results are simply due to chance" is that it's ambiguous. It's got two possible meanings, and neither of them applies to the p-value.

Meaning number 1 of "5% chance of the results being due to chance" refers to the likelihood of the data given the null hypothesis.

The probability of getting your data given some hypothesis (e.g., we're taking random samples from a single Gaussian with mean m and sd s, or drawing from some kind of infinite urn with f% black balls, etc.) -- the likelihood -- is the product of the probabilities of all the different observations your data is made up of. This assumes that the data are independently and identically distributed (IID).
Let's suppose we're measuring performance for one subject across two levels of a conditioning factor (say this is a detection task, and in a few blocks we run the subject blindfolded). Performance is rate of correct detection. Our null hypothesis is that the rate is the same in both conditions, and that it's at chance level.
The subject gets it right 10 times out of 20 with blindfold, and 15 out of 20 without.
In this case, the probability of the data given the null hypothesis (written p(D|H)) is the product of two binomial probabilities:
p(D|H) = nchoosek(20,10)*(1/2)^20 * nchoosek(20,15)*(1/2)^20 = something really small (about 0.003).
That's the general case for the likelihood of the data: since we're multiplying values smaller than 1, it tends to 0 quite quickly (which is why we use log-likelihoods, BTW).
So the probability of observing the data given the null hypothesis is most definitely not 5%. In the case of continuous data, it's exactly zero: the probability of observing *exactly* a mean of 1.0004, for instance, is 0.
The p-value is computed based on what the sampling distribution function would be like for a large number of samples. For large N, the binomial used here would approximate a gaussian distribution, so based on that approximation we compute the probability of observing our data or higher, and we report that value. That's Fisher's original method, and in no case does it reflect the probability of getting the data given the null hypothesis. The .05 threshold comes from another brand of statistical testing, one that's based on decision theory, and the idea of having a fixed threshold is a complete misunderstanding.
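
To put numbers on that distinction, here's a small sketch of the blindfold example in plain Python; the tail-area computation at the end is just one illustrative choice of test statistic (total correct out of 40):

# Numeric sketch of the blindfold example: the probability of the data
# under the null is not the p-value. Standard-library Python only.
from math import comb

def binom_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Data: 10/20 correct with blindfold, 15/20 without.
# Null hypothesis: both conditions at chance (p = 0.5).
likelihood = binom_pmf(10, 20) * binom_pmf(15, 20)
print(f"p(D|H0) = {likelihood:.4f}")   # about 0.0026 -- "something really small"

# One illustrative p-value: the chance of 25 or more total correct out of 40
# under the null. It's a tail area, not the probability of the data itself.
p_value = sum(binom_pmf(k, 40) for k in range(25, 41))
print(f"tail probability (one-tailed) = {p_value:.3f}")   # about 0.08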

Meaning number two would be "given that I observed data d, the probability of there being no difference across factor levels is more or less 5%". That's a more useful quantity, known as the posterior probability of the null hypothesis, and to compute it you have to go Bayesian and define prior probabilities for your hypotheses. A test won't give you the posterior probability, and for good reasons: classical statistical testing was explicitly designed to be non-Bayesian.
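
For the curious, here's a minimal sketch of that kind of Bayesian calculation for the same blindfold example; the uniform priors on the detection rates and the 50/50 prior odds are assumptions chosen purely for illustration:

# Minimal Bayesian sketch for the blindfold example. Assumptions, for
# illustration only: under H1 each condition's detection rate gets an
# independent uniform prior; prior odds on H0 vs H1 are 50/50.
from math import comb

def binom_pmf(k, n, p=0.5):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Data: 10/20 correct with blindfold, 15/20 without.
# Marginal likelihood under H0 (both rates fixed at 0.5):
m_h0 = binom_pmf(10, 20) * binom_pmf(15, 20)

# Marginal likelihood under H1: integrating a binomial likelihood against a
# uniform prior on the rate gives exactly 1/(n+1) for each condition.
m_h1 = (1 / 21) * (1 / 21)

bayes_factor = m_h0 / m_h1                        # evidence for H0 over H1
posterior_h0 = bayes_factor / (1 + bayes_factor)  # with 50/50 prior odds
print(f"Bayes factor (H0 vs H1): {bayes_factor:.2f}")      # about 1.1
print(f"posterior probability of H0: {posterior_h0:.2f}")  # about 0.53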

There are a hundred problems with statistical testing as practiced in psychology (and biology as well). For the straight dope check out James Berger's 1983 volume on Decision Theory and Statistics, or E.T. Jaynes' 'Probability: The Logic of Science'. For a somewhat scary and less mathematical summary, see this article by Gerd Gigerenzer:
http://www.mpib-berlin.mpg.de/en/institut/dok/full/gg/mindless/index.htm

I think the main reason statistical tests survive to this day is that they make articles in psychology journals look more complicated than they really are. It's unfortunate, but they're a sign of psychology's low status among the sciences. But by perpetuating their use we're doing nothing to improve that situation.