Recent work has leveraged increasingly sophisticated computational models of neural processing as a way of predicting the BOLD response on a trial-by-trial basis. The core idea behind much of this work is that reinforcement learning is a good model for the way the brain learns about its environment; the specific idea is that expectations are compared with outcomes so that a “prediction error” can be calculated and minimized through reshaping expectations and behavior. This simple idea leads to exceedingly powerful insights into the way the brain works, with numerous applications to improving learning in artificial agents, to understanding the role of exploration in behavior and development, and to understanding how the brain exerts adaptive control over behavior.
So far, however, neuroimaging and electrophysiology suggest that these prediction error signals can be found throughout much of the brain, including large swaths of parietal, frontal, and striatal areas, which makes it hard to say which regions compute what.
This is where a 2010 Neuron paper by Gläscher, Daw, Dayan & O’Doherty comes to the rescue. Traditionally, reinforcement learning has been viewed as a somewhat monolithic entity, such that expected rewards are compared with reward outcomes to generate a “reward prediction error.” It’s easy to imagine that most of the brain might light up in response to rewards. But Gläscher et al. take this a step further, and dissociate two flavors of reinforcement learning (RL):
Model-based RL: learns about the associations of states with one another by building an internal model of state transitions, without respect to reward
Model-free RL: learns about the direct associations of states with rewards
This distinction is important: model-free RL can learn about the average reward expected from a 2nd order stimulus (state…reward), but may not condition that expectation on the actions available in that state, some of which may allow the agent to enter a new state where reward expectation is higher. In contrast, model-based RL can learn about that kind of state-action-state transition.
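To make the distinction concrete, here is a minimal sketch of the two learning signals. It is my own illustration, not the authors' code: I am assuming a SARSA-style update for the model-free learner and a simple delta-rule update of a transition table for the model-based learner, and the function names, array shapes, and learning rates are all hypothetical.

```python
import numpy as np

def model_free_update(Q, s, a, r, s_next, a_next, alpha=0.2, gamma=1.0):
    """SARSA-style update (assumed form) driven by a reward prediction error."""
    rpe = r + gamma * Q[s_next, a_next] - Q[s, a]
    Q[s, a] += alpha * rpe
    return rpe

def model_based_update(T, s, a, s_next, eta=0.2):
    """Transition-table update (assumed form) driven by a state prediction error."""
    spe = 1.0 - T[s, a, s_next]   # how surprising was the observed successor?
    T[s, a, :] *= (1.0 - eta)     # shrink the probability of every successor
    T[s, a, s_next] += eta        # net effect on the observed one: T += eta * spe
    return spe

# Toy setup (hypothetical): 3 states, 2 actions, uniform prior over successors.
Q = np.zeros((3, 2))
T = np.full((3, 2, 3), 1.0 / 3.0)

# One unrewarded transition: state 0, action 1, landing in state 2.
spe = model_based_update(T, s=0, a=1, s_next=2)
rpe = model_free_update(Q, s=0, a=1, r=0.0, s_next=2, a_next=0)
```

With no reward delivered, the reward prediction error is zero and Q does not move, while the state prediction error is large and the transition table still learns. Each row of T also stays a valid probability distribution after the update.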
To assess whether these conceptually different flavors of RL have any neural basis, Gläscher et al. used fMRI to scan subjects performing a “sequential two-choice Markov decision task.” It should come as no surprise that this has many similarities to the tasks used in the hierarchical control literature, which I’ve briefly covered over the last week.
The task is devised as follows: subjects first observe a series of 3 stimuli on the screen, each one appearing after the subject has responded to the previous one; they’re told how to respond to each, and no reward is provided at the end of a series. This phase allows the model-based system to learn which stimuli tend to follow which others, and with what probability (i.e., state transitions), but provides no information for the model-free system, because no reward is provided and reward prediction error is the core calculation performed by that system.
In a second phase of the experiment, subjects were explicitly trained on the reward outcomes associated with the stimuli that could occur last in the series of 3 stimuli. In the third and final phase, they completed the same serial choice task as in phase 1, except that they weren’t told how to respond to each stimulus, and reward was provided according to the contingencies they had learned in phase 2. Thus, subjects had to find their way “through the decision tree” to acquire maximal reward, putatively by integrating a reward prediction error now calculated by the model-free system with the “state” prediction error learned in the phase where no rewards were provided.
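Here is a rough sketch of what that integration could look like at the start of phase 3. Everything in it is an assumption for illustration: the toy tree, the transition probabilities, the reward vector, the fixed mixing weight w, and the softmax temperature are hypothetical and are not taken from the paper.

```python
import numpy as np

def model_based_values(T, R, n_sweeps=3):
    """Plan through the tree: Q(s,a) = sum_s' T[s,a,s'] * (R[s'] + max_a' Q(s',a'))."""
    n_states, n_actions, _ = T.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_sweeps):
        V = Q.max(axis=1)   # value of each state under its best action
        Q = T @ (R + V)     # expected reward plus value of the successor
    return Q

def softmax(q, beta=5.0):
    """Turn action values into choice probabilities."""
    z = beta * (q - q.max())   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical tree: state 0 is the root, 1 and 2 are second-stage
# states, 3 and 4 are outcome states (terminal, so their rows stay zero).
T = np.zeros((5, 2, 5))
T[0, 0, [1, 2]] = [0.7, 0.3]
T[0, 1, [1, 2]] = [0.3, 0.7]
T[1, 0, [3, 4]] = [0.7, 0.3]
T[1, 1, [3, 4]] = [0.3, 0.7]
T[2, :, 3] = 1.0

# Outcome values as if instructed in phase 2: only state 4 pays off.
R = np.array([0.0, 0.0, 0.0, 0.0, 1.0])

Q_mb = model_based_values(T, R)
Q_mf = np.zeros((5, 2))   # the model-free learner has seen no rewarded choices yet
w = 0.8                   # assumed weight on the model-based values
Q_hyb = w * Q_mb + (1 - w) * Q_mf
p_root = softmax(Q_hyb[0])
```

Even with the model-free values still at zero, the hybrid agent prefers the root action that tends to lead toward the rewarded outcome. That is the kind of above-chance first-trial behavior a purely model-free learner could not produce.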
Indeed, subjects seemed to perform this kind of integration across RL systems, as indicated by significantly above-chance performance on their first trial in the third phase (p < .05, one-tailed). The authors then modeled individual subjects’ choices in the third phase at the trial-by-trial level, to see if these were best captured by a combination of model-free and model-based learning, or by either one alone. The combined model fit best, suggesting that subjects were integrating these two RL systems during behavior. Although the task was probabilistic, with the same probabilities used across subjects, different subjects experienced different sequences of state transitions in phase 1; incorporating those individual experiences into the models gave a significantly better fit to subjects’ trial-by-trial behavior.
fMRI demonstrated that across phases 1 and 3, estimates of prediction error from the model-based system (aka “state prediction error”) predicted neural activity in the lateral prefrontal cortex (dorsal bank of posterior IFG) and in the posterior intraparietal sulcus. ROI analyses indicated that these areas also showed significant effects in phase 1 alone, consistent with the idea that these areas implement a model-based RL system even in the absence of rewards. Estimates of prediction error from the model-free system (aka “reward prediction error”) in the 3rd phase showed no consistent modulation in cortex, but rather only in the ventral striatum, an area long implicated in classical, model-free reinforcement learning. One area is not like the others, however: only the posterior parietal cortex showed significantly greater correspondence to the model-based estimates of prediction error than to those based on a model-free RL system.
And only activity in this same region significantly correlated with optimal behavior in the 3rd phase, suggesting parietal cortex is critical for the kind of model-based prediction error investigated here. What’s surprising here is that the same thing cannot be said of lateral prefrontal cortex, which many would have believed to be involved in model-based learning. The authors are a little more willing than I am to interpret their lateral prefrontal results, which are uncorrected for multiple comparisons. Under my more skeptical read, the data may suggest that lateral prefrontal cortex plays a different or more integrative role across both model-free and model-based learning. I think this conclusion is largely consistent with the authors’ read, and consistent with some modeling work emphasizing the abstraction of prefrontal representations, but it also puts prefrontal cortex at an uncomfortable distance from the “selection for action” control representations typically ascribed to it.