There was plenty of interest in yesterday’s audio-visual illusion. In case you missed it, I’ll post it again here:
Play the movie with the sound turned up. If the illusion works, you’ll see a dot flash twice, accompanied by two beeps. In fact, the dot flashes only once. Unfortunately, we’ve had a hard time getting viewers to see the illusion (as of this writing, just over a quarter of those viewing the video claim to see two or more flashes). I have a couple of ideas about why people don’t seem to see it reliably here on Cognitive Daily, and I’m going to investigate them further on Casual Friday. But for those who do see the illusion, I thought I’d offer a little more explanation about how it works.
What we’re talking about here is a cross-modal effect: one perceptual system (in this case, vision) is affected by another (hearing). Cross-modal effects are more common than you might think. The visual system, for example, relies on the sensorimotor system to a very large extent. Think about it this way: when you take a video with a camcorder, every little motion of the camera gets translated into a big jitter on your final video. Videos taken from a car window on a bumpy road are practically unwatchable. Yet sitting in the car, you have no problem seeing all the scenery, with no jitters. That’s because your visual system takes into account the sensed motion of your body and compensates for it.
The visual system shares information with the auditory system as well. Ventriloquists rely on this phenomenon all the time: Since we see the dummy’s mouth moving, we perceive the ventriloquist’s voice as coming from the dummy.
Why would we have evolved to disregard the true source of a sound and perceive it as coming from somewhere else? I suspect it’s because in nature, sound can be displaced by echoes or other phenomena, so the visual input synchronized with the sound is usually the more reliable guide to where it came from.
Shams’ illusion, however, is the reverse of the ventriloquism effect: a sound is inducing a visual perception. As Shams et al. point out, the illusion doesn’t work in reverse: if we see three flashes accompanied by one beep, we are never fooled into believing we saw only one flash. In a separate experiment, Shams’ team found that people watching a video of a cello being plucked, accompanied by the sound of a cello being bowed, often reported hearing the plucking sound. Why didn’t they report seeing bowing instead? When does the visual system take precedence, and when does the auditory system?
You might think it relates primarily to context: whichever mode is most appropriate “wins.” But in the case of the Shams illusion, the visual mode is always what observers are paying attention to, and sometimes the audio stimulus affects it and sometimes it doesn’t. More likely, Shams et al. argue, the direction of the effect depends on the particular stimuli themselves: a discontinuous stimulus (beeping, plucking) is more likely to alter the perception of a continuous stimulus than the other way around.
Clearly, however, there are limits to this effect. In the context of Cognitive Daily, most viewers don’t see it. As Mick Grierson noted in the comments yesterday, the fluidity of the effect can have interesting implications in the arts. I’d be very interested to see a work of art that explores the boundaries of this illusion: when it works, and when it doesn’t.