Understanding cocktail-party conversation: Why do we look where we do?

When we are trying to understand what someone is saying, we rely heavily on the movement of their face: the way it moves informs our interpretation of what we hear. The classic example is the McGurk effect, where the same sound accompanied by different facial movements is interpreted differently.

Take a look at this short video clip (QuickTime required) of me talking, with my voice muffled by what sounds like cocktail party conversation:

Can you understand what I'm saying? What about after I stop moving? Can you understand me in the second part of that clip? Go ahead and replay the video to see if you can hear it the second time through.

That's right, I said two three-word phrases, not just one. If you're like me, you only heard background noise during the second part of the clip. In fact, I'm curious as to whether anyone can understand me at all. Let's make this one a poll:

I'll play the video with me actually moving at the end of the post, and we'll see if the results change.

Since the discovery of the McGurk effect, researchers have studied precisely where we look when we watch someone speak, and they have found that we're not always looking at the mouth. In fact, we look at speakers' eyes more often than their mouths. Even more striking, we tend to look disproportionately at the right side of a speaker's face. Why the right side? Several studies have found that the right side of most speakers' faces is more expressive than the left, so we appear to be focusing on the side of the face that offers the most information.

But what if the left side of a particular face was actually offering more information? Would we switch our focus to that side? A team led by Ian T. Everdell showed 28 college students a series of videos similar to the one I presented above. The students' eye movements were monitored with a tracking device. Speakers uttered one of six phrases, and, as above, sometimes their faces were static and sometimes they were moving. In addition, some of the time the faces were flipped, so what appeared to be the right side of the face was actually the left side in the original.
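To make the bookkeeping concrete, here is a minimal sketch of how fixation data from an eye tracker might be tallied into left-half versus right-half of the displayed face for each viewing condition. This is illustrative Python only, not the authors' actual analysis code; the data format, field names, and coordinates are all assumptions made for this example.

from collections import defaultdict

def fixation_proportions(fixations, face_midline_x):
    # Proportion of fixation time spent on each half of the displayed face,
    # broken down by condition (e.g. moving vs. static, original vs. mirrored).
    # Each fixation is assumed to be a dict like:
    #   {"condition": "moving_mirrored", "x": 412.0, "duration_ms": 180}
    # The half at smaller x is the viewer's left, which on an un-mirrored video
    # corresponds to the speaker's right cheek.
    time_by_side = defaultdict(lambda: defaultdict(float))
    for fix in fixations:
        side = "left_half" if fix["x"] < face_midline_x else "right_half"
        time_by_side[fix["condition"]][side] += fix["duration_ms"]
    proportions = {}
    for condition, sides in time_by_side.items():
        total = sum(sides.values())
        proportions[condition] = {side: t / total for side, t in sides.items()}
    return proportions

# Example with made-up numbers:
demo_fixations = [
    {"condition": "moving_original", "x": 300, "duration_ms": 200},
    {"condition": "moving_original", "x": 500, "duration_ms": 100},
    {"condition": "moving_mirrored", "x": 310, "duration_ms": 150},
]
print(fixation_proportions(demo_fixations, face_midline_x=400))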

As expected, viewers understood the phrases more often when the faces were moving (90 percent of the time) than when they were static (60 percent of the time). Also as expected, for the non-flipped faces, viewers spent more of their time focused on the right side of the speakers' faces. This picture shows the results for one typical viewer:

[Figure: fixation plot for one typical viewer]

Some viewers did focus more on the left than the right, but the vast majority of viewers focused on the right side of the speaker's face. So what about when the faces are flipped?

[Figure: fixation results for original and mirrored faces]

As you can see, there's practically no difference in the results. Whether the faces were presented in their original or mirrored form, nearly all viewers focused on the right side of the face. Viewers were also consistent in their preference: the few left-focusers looked at the left side for both normal and mirrored faces.

Everdell's team argues that while we may focus on the right side of the face because it typically offers more information to help us understand speech, we're not able to adapt very quickly to different speakers. When we're confronted with someone who's more expressive on the left side of their face, we can't instantly shift our focus to that side.

Oh, one last thing. Were you wondering what I said in the second half of that clip? Here's the unaltered original video:

Record your answers below. Let's see if we get a different result now.

Everdell, I.T., Marsh, H., Yurick, M.D., Munhall, K.G., Paré, M. (2007). Gaze behaviour in audiovisual speech perception: Asymmetrical distribution of face-directed fixations. Perception, 36(10), 1535-1545. DOI: 10.1068/p5852

Comments

I did not hear either well enough to understand them, but I managed to catch a syllable or two; just enough to rule out most of the options in the multiple-choice.

I just guessed based on the first syllable I heard vaguely.

In the first video I heard the word quickly at the end and whaddya know, it was right? Didn't hear the rest of it though. I didn't even hear the first phrase, I guess I'm not a skilled face reader.

I believe people have different ways of understanding people - people like me rely entirely on their hearing in such situations.

It is very difficult even to lipread you. Below is a transcript of my inner voice.

Sometimes I feel that we ascribe too much to humans. My domain is finance and you don't want to know how people fly in the face of logic. Yeah, those rational agents.
Let's be careful about the sides here: I'll adopt the first-person view. The crux becomes "that discrimination of upright faces in sheep preferentially engages the right temporal cortex, as it does in humans" (Mimmack et al., 2000). The point of the sheep is to show evolutionary constancy, though conceivably you could transfer the research.
Fact is, the right hemisphere steers my left side, including the left side of my face. So the few people contemplating someone's left-hand side of the face are the enlightened ones. The bulk of people looking left merely access their own right brain hemisphere.
Proposal for research: Pease (2006) makes the point that lies distort facial symmetry towards somebody's left side of the face (see above). Abstracting: when inverting the images, does recognition of emotions and detection of lies increase, since people get to read the richer side of the face while still working the right brain hemisphere?

You needed to offer an additional option in your polls: "I couldn't hear it at all". If such an option had been present I would have selected it for both polls: I couldn't even tell if you were speaking, much less tell which words you said.

In fact, I'll be completely unsurprised if you come back next week and tell us that you were conducting some experiment on your readers, that there were actually no words whatsoever in the audio, and that you were really testing to see which text people choose when there's no reason to choose one over another.

For all the complaints about the demo, it's interesting to note that it's been one of the most dramatic demonstrations we've ever posted. Only about 20 percent of respondents got it right when they couldn't see my face moving, and about 90 percent responded correctly when they could.

In the actual study, respondents were about 60 percent accurate for the static images, so my demo is quite a bit more difficult, but still it's interesting to see how many people managed to respond correctly when they could see my face moving.

I couldn't understand the second phrase in the first video at all, but otherwise had no trouble making out what you were saying--but for a very particular reason, I suspect. I read through all the choices beforehand. Then it was easy to tell what was being said. The interesting thing is that even now, watching the first video, I effectively don't hear anything at all for the second phrase, even knowing that it's there, even knowing what's being said.

By Robert Rushing on 07 Dec 2007

All I hear is the sound of a marketplace. And some faint voices that seem to say "bones play sleep" or something like that.

I'm so relieved to see that most other people couldn't distinguish it! I couldn't either, but I attributed it to my growing problem recognizing speech. My hearing is quite good, but when my wife and I watch a movie at home, I frequently turn to her and ask, "What'd he say?" Indeed, I can still recall watching that nomination speech by candidate Bush in 1988, where he said "mumble mumble mumble, No New Taxes!" and never figuring out what he said. ;-)

By Chris Crawford on 07 Dec 2007

You picked phrases where only one could be correct if you can lipread at all -- the first had to start with an f or v, the last with a p or b, so that left only one option when you put up the multiple choice. If you'd had other phrases that started with the same point of articulation, you might have had dramatically different results. I wouldn't have gotten either without the list of phrases -- all I could get was the basic phrase contours, although I did hear both the first time.

Ditto to #15. All I could tell for the first one was that it started with "F", and that there was an "L" sound with a hard sound (b or p) right before it. And I didn't catch the second phrase at all on that one.

What fun! I love these kinds of tests. I heard "quickly" at the end of the first video and was able to choose the right sentence. I think answering was greatly aided by the multiple choice format. I would not have been able to answer correctly for the first video if I didn't have a script to choose from. Now if only the next cocktail party I went to had the same assistance... ;)

By braingirl on 07 Dec 2007

I couldn't understand a word you said in the first part of the first video. Needless to say, I didn't understand what you said in the second part of it either. Both of these statements held true even when I put my ear right up to the speaker and didn't watch the video.

For the second video, my answer was based on reading your lips.

I'm guessing none of the tests were done with well-endowed women?

By TheOldMole on 09 Dec 2007

I also could not hear anything in the first clip and would have chosen that as an answer if it were an option. I only got the second one right because I could tell you were making a "Q" sound with your lips; I couldn't hear the "Q" sound, though.

Interesting, though: I have often thought that I was below average at understanding what someone is saying in the presence of similar ambient noises.

I was unable to discern what you were saying until I fixed my gaze on your lips. It seems that the answer I chose was the most-voted one!

I'm slightly confused about the results. Do they mean that the subjects looked at the actual right side of the face, or at the perceived right side of the face? Right now I'm not sure how to interpret the graph and this result.

I think I'd learned to filter out some of the cocktail party noise in the second video, but nevertheless, I relied on lip-reading more than hearing, to determine what you'd said. I've worked with individuals with head injuries and cerebral palsy through a therapeutic horseback riding program, and since I'm almost always a horse-handler, I have the best view of the rider's face when we stop (I'm required to stand directly in front of the horse at that point, facing the rider). Invariably, I can understand what the rider is saying, even if the therapist and other volunteers cannot. After reading this post, I think that my ability to understand a speaker with upper motor neuron deficits has more to do with watching facial movements carefully, than with my hyperacusis (which is actually a detriment in noisy situations).

I'd be very interested to know about related research that might help therapists and caretakers better understand the speech of individuals with upper motor neuron lesions.

Heh, took me a couple listens to understand the first part of the first video. ^_^ I *definitely* couldn't tell what the second part of it was.

I think part of the dramatic difference might be that a lot of listeners simply marked down what they heard in the first section. Even though I had to guess on the first question, it was quite obvious to me that you weren't saying the same thing.

Once you moved with the audio, though, it was easy to hear.

By Xanthir, FCD on 14 Dec 2007
