How do we remember scenes?

Take a look at this quick movie. What you'll see is two sets of three views of the same scene (our living room). For each group of three views, your job is to decide whether the third view is taken from the same angle as one of the two previous views. After the first two views flash by, the text "Same?" will appear, and that's your cue to decide whether the third view is the same as EITHER of the first two. Give it a shot:

Do you think you got the correct answer? Was either of the two sequences more difficult than the other?

The movie was inspired by a recent experiment led by Monica Castelhano. Castelhano's team was interested in how we remember scenes. While there has been a fair amount of work on how we remember objects, the work on scenes is inconclusive. Do we store a holistic, three-dimensional representation of a scene (defined as a coherent, visible part of the world) in our heads, allowing us to consider it from any angle? Or do we remember scenes as separate snapshots, each one processed independently from the next?

For objects, researchers have found that an object (like, say, one of several different toy horses) is more readily recognizable from a different viewpoint if that viewpoint is between two viewpoints we've seen before. If you've seen a horse from the front and the side, you're better able to recognize it from a 3/4 view halfway between the front and side views than from a 3/4 view halfway between the side and back views. This suggests that we have at least a partial three-dimensional representation of the object in our memory; we're not just recalling the particular views. Does the same principle hold for scenes?

Castelhano's team showed 36 college students movies similar to the ones in my video, but the scenes they saw were created using 3-D rendering software so that the camera angle and lighting could be more carefully controlled. Sometimes the final picture was the same as one of the first two in the sequence, and sometimes it was different in one of two ways -- either it was between the first two views, or outside of the first two views. This diagram shows how it worked:


The first and second camera views were 40° apart, and the third view was always 20° away from one of the first two views -- sometimes in between them (B), and sometimes outside them (A or C). As you did, the students had to say whether the third view was new or old -- whether or not it was the same as one of the previous two views.
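The geometry of the three conditions can be sketched in a few lines of Python. This is just an illustration of the description above (the specific angle values and variable names are my own, not from the paper):

```python
# Sketch of the trial geometry: View 1 and View 2 are 40° apart,
# and View 3 is always 20° from one of them.
VIEW1, VIEW2 = 0, 40  # studied camera angles, in degrees

# The three possible positions for the third (test) view:
conditions = {
    "A (outside, near view 1)": VIEW1 - 20,  # -20°
    "B (between the views)":    VIEW1 + 20,  #  20°, midway between
    "C (outside, near view 2)": VIEW2 + 20,  #  60°
}

for label, angle in conditions.items():
    between = min(VIEW1, VIEW2) < angle < max(VIEW1, VIEW2)
    print(f"{label}: camera at {angle}°, between studied views: {between}")
```

Only condition B falls inside the arc spanned by the two studied views, which is exactly the case where viewers' accuracy collapsed.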

Here are the results:


This graph shows accuracy identifying "new" views -- in fact in this case all views are new; the third picture is shifted 20° from one or both of the previous views. As you can see, viewers were significantly more accurate identifying the new views when they were outside of the previous views. When view 3 fell between views 1 and 2 (case B), accuracy sank below 50 percent.

The researchers say this demonstrates that viewers aren't simply remembering static pictures and then comparing them to the final picture -- if that were the case, then accuracy would have been similar for all three cases -- and arguably it should have been best for case B, since it was near both views 1 and 2 and therefore easy to compare.

They suggest that at least part of our visual representation of a scene is holistic -- we don't remember individual views, but the layout of the entire scene. Since our memories of the middle part (between the first and second views) are blended together, it's more difficult for us to realize that the third view is actually different.


Castelhano, M., Pollatsek, A., & Rayner, K. (2009). Integration of multiple views of scenes. Attention, Perception & Psychophysics, 71(3), 490-502. DOI: 10.3758/APP.71.3.490


The first was a definite no, the second I think was yes. The second was more difficult because the control scenes were more similar to each other than in the first experiment.

First: no - I just looked at the position of the windows. But in the second set, I see only two photos.

By patientia (not verified) on 28 Jul 2009

The three photos in the second set were the same as the three photos in the first set, but in a different order. There were no repeat photos within sets.

I recalled an impression of the scenes with reference to the lightest and darkest areas, and it was easy to see that these didn't match up. Neither set was more difficult as such, but the second set threw me because I wasn't expecting it to re-use the photos from the first.


This experiment is inconclusive. They should have tested recognition of 2D scenes as well. The reason accuracy was greatest with A is that it's easy to spot the difference between two consecutive images (3D or not). The reason the middle image was hardest is that it had the least retinal difference from the average of the other views.

They need to eliminate the possibility of plain 2D comparison, especially since the views were shown on a flat screen -- this is not how we perceive the world around us.

The rant in the beginning about objects being easier to recognize if presented from an angle halfway between previously shown views does not, in any way, suggest that we have "at least a partial three-dimensional representation".

Of course you have a greater chance of recognizing an object if its image is a blend of two previously seen images. Then you have two sets of cues to help you remember it. Rather, if anything, it suggests that we remember views as snapshots.

I recognised the first set as a "no", because of the difference in lighting.

For the second set, the video ends before the third picture appears.

When I was looking at the first two photos in each set, I was trying to remember clues that would help me determine if the third photo matched. If I didn't know that I was taking part in a memory exercise, I probably would have looked at the first two scenes in a more holistic way.

It would be interesting to know whether the in-between shot B, which is closest to the average of the stored information about the scene, does look more familiar than even the original shots 1 and 2. I think an effect like this was found in the (false) recognition of faces.

A rather theoretical view of the Castelhano et al. study.
It would be useful to reconstruct this hypothesis starting from Marvin Minsky's 'visual frame' hypothesis (and then linking it with pattern recognition), and also from a 'classical' hypothesis about recognition of physical-world geometry in Bertrand Russell's 1904 "Philosophy of Geometry".
Any suggestions?
Thanks for your patience,