Whereas yesteryear’s artificial neural network models were focused on achieving basic biological plausibility, today’s cutting-edge networks are modeling cognitive phenomena at the level of neurotransmitters. In a great example of this development, McClure, Gilzenrat & Cohen have an article in Advances in Neural Information Processing Systems where they propose a role for both dopamine and norepinephrine in switching behavior between modes of “exploration” and “exploitation.”
First, a little background. In artificial intelligence circles, the “temporal difference” algorithm has long been a well-known method for simulating reinforcement learning. Exciting advances in our understanding of the midbrain dopaminergic nuclei have demonstrated that something very similar is actually computed by the brain. As McClure et al. note, dopamine seems to be released as a function of how wrong the “predicted reward” of a given stimulus was: if you had vastly underestimated the reward you later receive, dopamine is released in larger quantities; conversely, if you had overestimated the reward you would later receive, dopamine release dips below its usual level.
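This reward-prediction-error idea is easy to see in a few lines of code. Below is a minimal sketch of a TD(0) update; the function name, parameter values, and scenario are illustrative choices of mine, not taken from the McClure et al. model. The prediction error `delta` plays the role of the phasic dopamine signal.

```python
def td_update(value, reward, next_value, alpha=0.1, gamma=0.9):
    """One TD(0) update: `delta` is the reward prediction error,
    positive when the outcome beats the prediction, negative when
    it falls short (analogous to a dip in dopamine release)."""
    delta = reward + gamma * next_value - value
    return value + alpha * delta, delta

# A stimulus predicted to be worthless yields a surprise reward:
v, delta = td_update(value=0.0, reward=1.0, next_value=0.0)
print(delta)  # 1.0 -- large positive error, i.e. extra dopamine
print(v)      # prediction nudged upward toward the true reward
```

Run this update repeatedly and the prediction converges on the true reward, at which point `delta` (and the dopamine analog) returns to baseline.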
Unfortunately, a system that relies only on reinforcement learning is purely exploitative. In other words, as soon as it finds something rewarding, it will continue to seek out that rewarding stimulus to the exclusion of all other novel things (some of which could be even more rewarding!). To solve this dilemma, McClure et al. propose that tonically higher levels of norepinephrine (i.e., noradrenaline) may encourage more exploratory behavior.
The proposed mechanism rests on the fact that norepinephrine release has two modes: phasic and tonic. The phasic mode involves transient increases in norepinephrine, which facilitate processing. In tonic mode, however, overall levels of norepinephrine are higher, which results in more “unpredictable” (i.e., exploratory) behavior.
What causes the “switch” between these two modes of norepinephrine release? McClure et al. suggest that the anterior cingulate cortex (ACC) may direct noradrenaline release by the locus coeruleus (LC). The ACC is sensitive to conflict (i.e., when there are multiple competing stimuli or responses), and when active, it will nudge the LC into tonic mode. Once a reward has been achieved, dopamine-related reinforcement learning processes (such as temporal difference learning) will tend to strengthen the rewarded response, thereby decreasing the conflict between this response and other possible but unrewarded responses. This drop in conflict decreases activity in the ACC, which in turn allows the LC to return to its default phasic mode.
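This feedback loop can be sketched schematically. In the toy code below, tonic norepinephrine is modeled as the “temperature” of a softmax choice rule (high temperature = flat, exploratory responding), and the ACC conflict signal is approximated by the entropy of the response competition. These equations are illustrative stand-ins of my own, not the actual McClure et al. network.

```python
import math

def response_probabilities(values, temperature):
    """Softmax over response values; higher temperature flattens the
    distribution, making behavior more exploratory."""
    exps = [math.exp(v / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def conflict(probs):
    """Entropy-like conflict signal, scaled to [0, 1]: maximal when
    responses compete equally, near zero once one dominates."""
    return -sum(p * math.log(p) for p in probs if p > 0) / math.log(len(probs))

def lc_temperature(conflict_level, phasic=0.1, tonic=1.0):
    """ACC conflict nudges the LC from phasic (low temperature)
    toward tonic (high temperature) mode."""
    return phasic + (tonic - phasic) * conflict_level

# Early on, the two responses are evenly matched: conflict is maximal
# and the LC sits in its exploratory tonic mode.
probs = response_probabilities([0.5, 0.5], temperature=0.5)
print(lc_temperature(conflict(probs)))  # 1.0: fully tonic

# After reinforcement strengthens one response, conflict collapses
# and the temperature falls back toward the phasic level.
probs = response_probabilities([2.0, 0.0], temperature=0.5)
print(lc_temperature(conflict(probs)))  # well below 1.0
```

The closed loop is exactly the one described above: exploration raises the chance of finding the rewarded response, learning then separates the response values, and the shrinking conflict signal shuts exploration back off.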
The authors implemented their hypothesis in a neural network model, fitted to data from monkeys on a simple task. Monkeys were rewarded for responding to one of two stimuli, and punished for responding to the other; this stimulus-reward mapping was sometimes reversed, after which LC neurons initially elevated their firing rate (i.e., transitioned to the tonic mode) and then eventually returned to a lower firing rate with transient bursts in activity (i.e., transitioned back to the phasic mode).
McClure et al. suggest that this model begins to solve the “exploration-exploitation” dilemma of intelligent agents: how do you know when to continue with your current behaviors, and when to seek out other possibilities? The fact that this solution involves norepinephrine is interesting, insofar as a similar model of dopamine release (also by Jon Cohen, summarized here) is claimed to solve the “flexibility-stability” dilemma.
The “stability-flexibility” dilemma refers to the fact that it is efficient to be able to limit your focus and actively maintain only currently-relevant stimuli – called “stability” because you are unlikely to be distracted. But this has a risk: when you need to switch tasks, this “attentional inertia” incurs a cost in terms of flexibility. Phasic and tonic dopamine release is thought to mitigate this dilemma, in that tonic dopamine release is associated with increased maintenance, whereas phasic bursts in dopamine release are associated with “updating” new information into that otherwise stable active maintenance system.
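To make the gating idea concrete, here is a toy sketch of a dopamine-gated working memory store; the class name, threshold, and signal values are all hypothetical illustrations rather than anything from the Cohen model. Below-threshold (tonic) dopamine protects the current contents from distractors, while a phasic burst opens the gate and swaps new information in.

```python
class GatedMemory:
    """Toy active-maintenance store with a dopamine-controlled gate."""

    def __init__(self):
        self.contents = None

    def step(self, new_input, phasic_dopamine, gate_threshold=0.5):
        # Tonic dopamine (below threshold): maintain contents (stability).
        # A phasic burst (above threshold): update with new input (flexibility).
        if phasic_dopamine > gate_threshold:
            self.contents = new_input
        return self.contents

wm = GatedMemory()
wm.step("task A", phasic_dopamine=0.9)      # burst: "task A" is gated in
wm.step("distractor", phasic_dopamine=0.1)  # no burst: contents protected
print(wm.contents)  # still "task A"
```

The dilemma shows up in the threshold: set it too high and the system never updates (pure stability); set it too low and every distractor overwrites the goal (pure flexibility).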
In summary, the temporal difference algorithm had been recognized as an efficient method of reinforcement learning, but it was always associated with a cost: once a rewarding stimulus is found, it becomes the focus of behavior at the expense of more exploratory behavior. Recent work in cognitive neuroscience has demonstrated how the temporal difference algorithm may be neurally implemented by dopamine fluctuations, and the McClure, Gilzenrat & Cohen paper reviewed in this post describes how a different neurotransmitter system may be used to solve the exploration-exploitation dilemma inherent in temporal difference learning.