How do we recognize scenes?

Take a look at this movie (you'll need a video player like QuickTime or Windows Media Player installed in your browser to see it). You'll see four different outdoor scenes flash by, one at a time. The scene itself will only be displayed for a fraction of a second, followed immediately by a distraction pattern designed to mask any image left over in your visual system. Your job is to spot any desert or mountain scene. Watch carefully!

Did you spot them? What cued you in to the idea of a "desert" or a "mountain" scene? Was it a specific object in the picture (a mesa or a snowfield)? Was it a color? Perception research has historically focused more on the idea of objects or parts of objects (borders, curves) than entire scenes. But is that the way our visual system actually works? What if people are actually taking in the whole scene rather than (or in addition to) focusing in on individual objects?

Michelle Greene and Aude Oliva had 55 viewers rank hundreds of scenes on seven more general properties: Concealment (C), Transience (Tr), Navigability (N), Temperature (Te), Openness (O), Expansion (E), and Mean Depth (Md). The pictures were presented on a 30-inch color monitor in groups of 100. So if a rater ranked pictures for Navigability, she would drag half the pictures (the least navigable) to the left of the screen, and the other half (the most navigable) to the right. Then these groups were each divided in half two more times, to create a spectrum of eight groupings, from least to most navigable. The least navigable pictures might be a dense forest or a steep cliff, while the most navigable might be an open field or a road. Not every viewer rated every picture or property, but at least ten viewers rated each picture for each property. Here's how the ratings broke down for four types of scenes:


The boxes correspond to 50 percent of the rankings, so as you can see, for Navigability, nearly all field scenes were ranked high, and most mountain scenes were ranked low. Deserts were ranked high for Temperature, while mountains were ranked low.
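To make the sorting procedure concrete, here is a minimal Python sketch of the repeated halving. It assumes the pictures have already been put in order from least to most navigable, which is a simplification: in the study, raters made each split by hand. The picture IDs and the function name `halve_into_bins` are hypothetical.

```python
def halve_into_bins(ordered_items, rounds=3):
    """Split an already-ordered list in half repeatedly; 3 rounds gives 8 bins."""
    bins = [list(ordered_items)]
    for _ in range(rounds):
        next_bins = []
        for group in bins:
            mid = len(group) // 2
            next_bins.append(group[:mid])   # lower half of this group
            next_bins.append(group[mid:])   # upper half of this group
        bins = next_bins
    return bins

# Hypothetical stand-in: 100 picture IDs, assumed to be ordered
# from least to most navigable (not real data from the study).
pictures = [f"pic{i:03d}" for i in range(100)]
bins = halve_into_bins(pictures)

# Each picture's rank is the number (1-8) of the bin it landed in.
rank_of = {pic: rank for rank, group in enumerate(bins, start=1) for pic in group}
```

Three rounds of halving turn one group of 100 into eight bins of 12 or 13 pictures each, so every picture ends up with a rank from 1 to 8 on that property.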

Next, a new set of 73 viewers watched hundreds of movies like the four that I showed you above -- only the scenes were flashed for an even shorter time (30 milliseconds, difficult to duplicate online). They saw movies in groups of 50. So, for example, during the first 50 movies they might be asked to identify whether or not a lake scene had flashed by. Then for the next 50 they would identify mountain scenes, and so on. This graph shows how they did:


This graph shows accuracy in rejecting scenes that weren't of the desired category, as a function of how far each scene's property rankings fall from those of the typical (prototypical) scene in that category. So, for example, if a viewer was looking for forest scenes, the typical forest would rank very low on openness. Mountains rank lower on openness than deserts, so mountain scenes are closer to the prototypical forest than desert scenes are. As you can see, accuracy was lower for scenes that are closer to the prototype: viewers looking for forests made more mistakes when presented with mountain scenes than with desert scenes. The results in this graph are averaged over all seven properties and all eight scene types, and the pattern still holds.
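The idea of "distance to the prototype" can be illustrated with a small Python sketch. The openness ranks below are invented for illustration and are not data from the study. Treating a category's mean rank as its prototype is also an assumption made here for simplicity.

```python
from statistics import mean

# Made-up openness ranks on the 1-8 scale; NOT data from the study.
openness = {
    "forest":   [1, 2, 2, 1, 3],
    "mountain": [4, 5, 3, 4, 4],
    "desert":   [7, 8, 6, 7, 8],
}

def prototype(category):
    """Assumption: the prototype is the mean rank of the category."""
    return mean(openness[category])

def distance_to_prototype(rank, target):
    """How far one scene's rank is from the target category's prototype."""
    return abs(rank - prototype(target))

# Average distance of each distractor type from the prototypical forest.
mountain_dist = mean(distance_to_prototype(r, "forest") for r in openness["mountain"])
desert_dist = mean(distance_to_prototype(r, "forest") for r in openness["desert"])
```

With these invented numbers, mountains sit much closer to the prototypical forest than deserts do. That matches the pattern described above, where mountains were the distractors viewers were more likely to confuse with forests.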

But perhaps viewers aren't really classifying the scenes based on these general properties -- couldn't it be true that mountains and forests just tend to have similar objects compared to deserts?

To test this concept, Greene and Oliva developed Bayesian classifiers using a mathematical model. One classifier was trained to classify images based only on the properties of each image as rated by the humans at the start of the study. The other was trained to classify the images based on the physical objects in the scene: trees, water, rock, flowers, and so on. The simulated results of the property-classifier matched the human results nearly exactly, while the object-classifier was much different from the human results. When the property-classifier made a mistake, it was similar to the mistakes the humans made, like mistaking a waterfall for a river. When the object-classifier made an error, it was different from the humans, like mistaking a desert for a field.
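For readers curious what a property-based classifier might look like, here is a rough Python sketch. The post says only that the classifiers were Bayesian, so the Gaussian naive Bayes approach below is an assumed stand-in, not the authors' actual model. The training numbers, the choice of three properties, and the function names are all invented for illustration.

```python
import math
from statistics import mean, pvariance

# Made-up (openness, navigability, temperature) ranks on a 1-8 scale.
# These are NOT ratings from the study.
training = {
    "forest":   [(1, 2, 4), (2, 3, 5), (2, 2, 4), (1, 3, 5)],
    "mountain": [(4, 2, 2), (5, 1, 1), (4, 3, 2), (3, 2, 1)],
    "desert":   [(7, 6, 8), (8, 5, 7), (7, 6, 7), (8, 7, 8)],
}

def fit(data, min_var=0.25):
    """Per-category mean and variance for each property (variance floored to avoid zero)."""
    model = {}
    for label, rows in data.items():
        columns = list(zip(*rows))
        model[label] = [(mean(col), max(pvariance(col), min_var)) for col in columns]
    return model

def log_likelihood(x, mu, var):
    """Log of the normal density with mean mu and variance var, evaluated at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def classify(model, scene):
    """Pick the category with the highest summed log-likelihood (equal priors assumed)."""
    scores = {
        label: sum(log_likelihood(x, mu, var) for x, (mu, var) in zip(scene, params))
        for label, params in model.items()
    }
    return max(scores, key=scores.get)

model = fit(training)
print(classify(model, (2, 2, 5)))  # a closed-in, fairly warm scene
print(classify(model, (7, 6, 7)))  # an open, navigable, hot scene
```

With these invented numbers, the first scene is classified as a forest and the second as a desert. The classifier never looks at trees, rocks, or water, only at the property ranks.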

Greene and Oliva are careful to say that the properties of a scene may not be the only way we identify scenes, but it does seem clear from these results that these properties are a very important part of how we initially identify a scene.

Greene, M. R., & Oliva, A. (2009). Recognition of natural scenes from global properties: Seeing the forest without representing the trees. Cognitive Psychology, 58(2), 137-176. PMID: 18762289

Curious whether this is influenced by real-life experience. Perhaps someone who sees hills or mountains every day when they look out their window would find it easier to recognise a mountain than someone who has only seen hills and mountains in books or on the internet.

Using Quicktime Alternative 2.9.0, no video here either. Poking around, it seems the QT Alternative plugin will not be called for files with the extension of .mp4, which is what Dave is using now. Not sure if there is a way to force or add the extension to QT Alternative's settings (as it can indeed open MPEG-4 compressed videos), but I'll look into it...