Very early in the history of artificial intelligence research, it was apparent that cognitive agents needed to be able to maximize reward by changing their behavior. But this leads to a “credit-assignment” problem: how does the agent know which of its actions led to the reward? An early solution was to select the behavior with the highest predicted reward, and to later adjust the likelihood of that behavior according to whether it ultimately led to the anticipated reward. Learning from these “temporal-difference” errors in reward prediction was first implemented in a 1950s checkers-playing program, before exploding in popularity some 30 years later.
This repopularization seemed to originate from a tantalizing discovery: the brain’s most ancient structures were releasing dopamine in exactly the way predicted by temporal-difference learning algorithms. First, dopamine release in the ventral tegmental area (VTA) decreases in response to stimuli that are repeatedly presented without a reward – as though dopamine levels “dip” to signal the overprediction (and under-delivery) of a reward. Second, dopamine release abruptly spikes in response to stimuli that are suddenly paired with a reward – as though dopamine is signaling the underprediction (and over-delivery) of a reward. Finally, when a previously rewarded stimulus is no longer rewarded, dopamine levels dip, again suggesting overprediction and under-delivery of reward.
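To make the "spike" and "dip" pattern concrete, here is a minimal sketch of a temporal-difference prediction error on a toy cue-then-reward trial. It is only an illustration under my own assumptions, not the model used in any of the studies discussed: the three-step trial, the learning rate `alpha`, the discount `gamma`, and the name `run_trial` are all hypothetical choices.

```python
def run_trial(values, reward, alpha=0.2, gamma=1.0):
    """Run one cue-then-reward trial and return the TD error at each step.

    values: learned reward predictions, one per time step after cue onset.
    Index 0 of the result is cue onset; the last index is the reward time.
    """
    n = len(values)
    # Assumption: the cue arrives unannounced, so the state before it
    # predicts nothing (value fixed at 0.0 and never updated).
    deltas = [gamma * values[0] - 0.0]
    for t in range(n):
        r = reward if t == n - 1 else 0.0             # reward only at the final step
        next_v = values[t + 1] if t + 1 < n else 0.0  # the trial ends after the reward
        delta = r + gamma * next_v - values[t]        # TD prediction error
        values[t] += alpha * delta                    # nudge the prediction toward the outcome
        deltas.append(delta)
    return deltas


values = [0.0, 0.0, 0.0]                 # hypothetical 3-step trial, nothing learned yet
first = run_trial(values, reward=1.0)    # reward is a complete surprise
for _ in range(500):
    trained = run_trial(values, reward=1.0)
omitted = run_trial(values, reward=0.0)  # the expected reward is withheld

print("first rewarded trial:", [round(d, 3) for d in first])
print("after training:      ", [round(d, 3) for d in trained])
print("reward omitted:      ", [round(d, 3) for d in omitted])
```

In this toy setup the error is large at the reward on the first trial (underprediction), moves to cue onset once the reward is well predicted, and turns negative at the expected reward time when the reward is withheld (overprediction).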
Thus, a beautiful computational theory was garnering support from some unusually beautiful data in neuroscience. Dopamine appeared to rise for items that predicted a reward, to drop for items that predicted an absence of reward, and to show no response to neutral stimuli. But as noted by Thomas Huxley, in science “many a beautiful theory has been destroyed by an ugly fact.”
These ugly facts are presented in Redgrave and Gurney’s new NRN article, which is circulating in the field of computational neuroscience. Among the ugliest:
1) Dopamine spikes in response to novel items which have never been paired with reward, and thus have no predictive value.
2) The latency and duration of dopamine spikes are constant across species, experiments, stimulus modality and stimulus complexity. In contrast, reward prediction should take longer to establish in some situations than others – for example, reward prediction may be slower for more complex stimuli.
3) The dopamine signal actually occurs before animals have even been able to fixate on a stimulus – this calls into question the extent to which this signal is mechanistically capable of the “reward prediction error” function.
4) VTA dopamine neurons fire at the same time as (and possibly even before) object recognition is completed in the inferotemporal cortex, and simultaneously with visual responses in striatum and subthalamic nucleus. It seems unlikely that VTA can perform both object recognition and reward prediction error.
5) The most likely visual signal to these VTA neurons may originate from superior colliculus, a region that is sensitive to spatial changes but not those that would be involved in object processing per se.
6) Many of the experiments showing the apparent dopaminergic coding of reward prediction error had stimuli that differed not only in reward value but also in spatial location. Therefore, data in support of reward prediction error are confounded with hypotheses involving spatial selectivity.
Redgrave & Gurney suggest that VTA dopamine neurons fire too quickly and with too little detailed visual input to actually accomplish the calculation of errors in reward prediction. They advocate an alternative theory in which temporal prediction is still key, but instead of encoding reward prediction, dopamine neurons are actually signalling the “reinforcement of actions/movements that immediately precede a biologically salient event.”
To understand this claim, consider Redgrave & Gurney’s point that “most temporally unexpected transient events in nature are also spatially unpredictable.” The theory is basically that a system notes its own uncertainty, via the spatial reorientation in the superior colliculus, and attempts to reduce that uncertainty by pairing a running record of previous movements with the unexpected event.
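One way to picture "pairing a running record of previous movements with the unexpected event" is a recency-weighted credit rule, sketched below. This is my own illustrative assumption rather than Redgrave & Gurney's model: the exponential `decay`, the learning rate `lr`, the function name `reinforce_recent_actions`, and the example actions are all hypothetical.

```python
def reinforce_recent_actions(action_log, event_time, weights, decay=0.5, lr=0.1):
    """Strengthen actions that preceded a salient event, favoring recent ones.

    action_log: list of (time, action) pairs, the running record of movements.
    weights: dict mapping action -> tendency to repeat it (updated in place).
    """
    for t, action in action_log:
        if t < event_time:
            # Assumption: credit fades exponentially with the delay to the event.
            trace = decay ** (event_time - t)
            weights[action] = weights.get(action, 0.0) + lr * trace
    return weights


weights = {}
# Two hypothetical episodes; in both, a lever press immediately precedes
# an unexpected salient event at time 5.
reinforce_recent_actions([(0, "groom"), (4, "press_lever")], 5, weights)
reinforce_recent_actions([(1, "sniff"), (4, "press_lever")], 5, weights)

for action, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(action, round(w, 4))
```

Note that nothing in this rule refers to reward value: the action that reliably comes just before the event accumulates the most weight, whatever the event turns out to be.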
Although this alternative theory is intriguing, there is not an abundance of evidence supporting it: it seems to me more like a pastiche of fragments from the apparently broken “reward prediction error” hypothesis.
We should also be cautious in discarding any theory as powerful as the reward prediction error hypothesis on the basis of null evidence: in this case, we simply don’t know how reward prediction error could be calculated so quickly. This kind of theoretical arrogance (“we don’t know how it could be calculated, so it isn’t calculated”) is particularly dangerous in computational neuroscience – the whole point of which is to identify possible mechanisms of neural information processing.
Of course, this article may ultimately be seen as the obituary of yet another beautiful theory killed by science. What’s your prediction?