 The easiest way to talk to someone else is face-to-face. If you can see the movements of a person's lips and facial muscles, you can more easily work out what they're saying, a fact made obvious if you're trying to have a conversation in a noisy environment. These visual cues clue our brains in on how best to interpret the signals coming from our ears.
But what happens when that's not possible, like when you're chatting on the phone or listening to a recorded message? New research suggests that if you've spoken to someone before, your brain uses memories of their face to help decode what they're saying when they're not in front of you. Based on previous experience, it runs a simulation of the speaker's face to fill in any information missing from the sound stream alone.
These results contradict a classical theory about hearing - the "auditory-only model" - which suggests that the brain deciphers the spoken word using only the signals it receives from the ears. The model has been challenged before, by earlier studies which found that people are better at identifying a speaker by voice if they have briefly seen that person speaking before. Katharina von Kriegstein from University College London extended these discoveries by showing that previous experience also helps us to work out what's being said, as well as who said it.
Face-offs
She trained 34 volunteers to identify six male speakers by voice and name. The volunteers saw videos of three of the speakers as they talked, but the other three remained faceless, represented only by a drawing of their occupation. As a further catch, half of the volunteers had a condition called prosopagnosia, or face blindness, which prevents them from recognising faces but has no effect on their ability to recognise objects in general.
After the training, Kriegstein tested the volunteers while they lay inside a magnetic resonance imaging (MRI) scanner. They listened to short recordings of one of the six speakers and had to either work out who was speaking ("speaker recognition") or what they were saying ("speech recognition").
Kriegstein found that both prosopagnosics and controls were slightly better at recognising speech when they had seen the speaker's face before. The improvement was small - between 1 and 2% - but that is still significant given that the typical success rate for this task is greater than 90%. But of the two groups, only the controls were better at recognising speakers after seeing videos of them beforehand. They were 5% more accurate, while the prosopagnosics didn't benefit at all.
Facial simulations
Using the fMRI scanner, Kriegstein found that these 'face benefits' were reflected by the strength of neural activity in two parts of the brain. The first, the superior temporal sulcus (STS), detects facial movements (among other biological motion) of the kind that we use to help us make out the words of a person speaking in front of us. The stronger their activity in the STS, the more benefit the volunteers gained in the speech recognition task from having seen videos of the speakers.
The second area, the fusiform face area (FFA), specialises in recognising faces and is often damaged in prosopagnosics. Unlike the STS, it played more of a role in the speaker recognition task, but only the controls were more accurate at identifying speakers if they had strong activity in the FFA. So two separate networks that are involved in facial processing are active even when there are no faces to process.
Kriegstein concluded that people pick up key visual elements of a stranger's speech after less than two minutes of watching them talk, and use these to store 'facial signatures' of new speakers. The brain effectively uses these signatures to run 'talking face' simulations, to better decipher any voice it hears. It's one of the reasons why phone conversations are easier if you've previously met the person at the other end of the line in the flesh.
Reference: 10.1073/pnas.0710826105
Image: by Xenia
 
"It runs a simulation of the speaker's face to fill in any information missing from the sound stream alone."
Maybe I've read too much BF Skinner, but I suspect all these results are better explained in terms of attention and learning, and that the activity seen on the MRIs is better described as a correlate of recognition than a cause.
People may generally process the nuances of speakers they've previously seen in person much better than those they've never seen, but this may only reflect a greater ability to learn from actual speakers than from disembodied voices. Ultimately, the distinctions learned can still be strictly auditory.
People may also show a greater tendency to visualize the speakers they've seen and comprehend well. But it doesn't follow that the visualization is integral to the comprehension. And when you really think about it, how could it be? The visibility of an actual speaker supplies me with additional stimulation relevant to what he or she is saying. My own visualization of the speaker cannot do the same. I can only imagine the speaker accurately if I hear him or her correctly in the first place, in which case the mental simulation would be superfluous.
I've noticed this a lot when participating in teleconferences, particularly when the participants are scattered across two or three continents, and some of them are not native English speakers. It's much easier to understand people whom I've met at some point than those who are only voices on the phone. So a non-native speaker of English whom I've actually met is easier to understand than the native speaker who I only know as a voice.
(I once sat in a weekly meeting where the project manager was a Hindi speaker, who spoke extraordinarily rapidly just as a personal idiolect, and the code developers were Quebecois French speakers, and I'm a midwestern anglo-USian. Most of the meeting consisted of "ImsorryIcouldnotmakeoutthatlastbitplease," and "Ah don' unnerstan' what you say, eh?" This went on forever. But I have no problem understanding either Hindi or Quebec French speakers when we're all in the same room.)
To follow on TE's comment, I'm not sure if the study conclusively shows that this is a facial-image simulation in the brain or not, but the phenomenon of having trouble following purely audio cues in ordinary speech is real. As a possible data point, I listen to a lot of old radio dramas. And they're generally easy (and enjoyable) to follow. But if you break them down and look at the dialog objectively, it's all highly artificial, and designed to be easily understood without visual cues. Regular folks talking on the phone, especially when dealing with language, dialect, and idiolect -- not so much.
I'm convinced that teleconferences are a sick prank played by more senior people than me, for precisely those reasons.
TE, not sure if this satisfies your objections, but the authors did mention the possibility that the subjects were just paying greater attention to the voices that came with a matching video. They claim that their data rules this out. If the benefits were due to attention rather than some property of the faces themselves, then you would expect both controls and prosopagnosics to do better in both speech recognition and speaker recognition tasks. That wasn't the case - the prosopagnosics gained no advantage in the speaker recognition task.