Starting Small, All Over Again: Shaping Neural Networks in the 12AX-CPT

By developinginte… on January 9, 2009.

There's little evidence that "staging" the training of neural networks on language-like input - feeding them part of the problem space initially, and scaling that up as they learn - confers any consistent benefit in terms of their long-term learning (as reviewed yesterday).

To summarize that post, early computational demonstrations of the importance of starting small were subsequently cast into doubt by numerous replication failures, with one exception: the importance of starting small is replicable when the training data lack temporal correlations. This leads to slow learning of the importance of longer timescales by simple recurrent networks, but learning can be sped up by "shaping" the type of input the network sees early in its learning.

Typically, these investigations have focused on grammar-like input data, in which a recursive phrase can be embedded between a noun and verb (e.g., the cat [who ate the bird [who had chicklings [who ate a worm [who...]]]] chased the dog.) A new paper from Krueger & Dayan uses training data from a very different context - the 12AX-CPT - and illustrates that, as in cases with uncorrelated embeddings, networks learning the 12AX-CPT benefit from developmental staging, or shaping, of the input data. Interestingly, however, the 12AX-CPT does involve correlations between the embeddings, and so does not fit the conditions (as laid out by Rohde and Plaut) that should generate an advantage from "starting small."

The 12AX-CPT is a fairly complex task. Here are the rules, to give you a flavor: if the last number you saw was a "1", respond with your left hand to an X that immediately follows an A, and respond with your right hand to anything else you see; but if the last number you saw was a "2", respond with your left hand to a Y that immediately follows a B, and with your right hand to anything else you see. In this task, the numbers 1 and 2 make up the "outer loop", and each can be followed by 1 to 4 "inner loop" pairs. The target sequence for the current outer loop always has a higher probability of occurrence than the nontarget sequences, yielding errors or reaction times indicative of anticipation (AY in outer loop 1, BX in outer loop 2) among subjects who use this probability information and are capable of maintaining previously-observed stimuli.
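To make the rules concrete, here is a minimal sketch of a function that labels each symbol in a 12AX-CPT stream with the correct response. It assumes the usual reading of the task, in which the target is an X immediately after an A (outer loop 1) or a Y immediately after a B (outer loop 2); the function name and the "L"/"R" codes for left and right hand are my own illustrative choices, not anything taken from the paper.

```python
def label_12ax(sequence):
    """Return 'L' (target) or 'R' (nontarget) for each symbol in a 12AX-CPT stream."""
    context = None   # last digit seen: "1" or "2" (the outer loop)
    previous = None  # symbol immediately before the current one
    responses = []
    for symbol in sequence:
        if symbol in ("1", "2"):
            context = symbol  # a digit opens a new outer loop
        # Assumption: a target is A-then-X under "1", or B-then-Y under "2".
        is_target = (
            (context == "1" and previous == "A" and symbol == "X")
            or (context == "2" and previous == "B" and symbol == "Y")
        )
        responses.append("L" if is_target else "R")
        previous = symbol
    return responses


if __name__ == "__main__":
    stream = list("1AXBY2BYAX")
    print(list(zip(stream, label_12ax(stream))))
```

Note that the correct response to an X or a Y depends on a digit that may have appeared several symbols earlier, which is exactly the long-range dependency that makes the task hard for a simple recurrent network.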

Krueger & Dayan trained a gated working memory network (like an SRN that learns when to maintain, forget, and output information from the recurrent layer) on the 12AX-CPT. Interestingly, their architecture includes three distinct gates for each "slot" in working memory (a gate for updating, for forgetting, and for output) trained with backpropagation through time. Certain networks were "shaped" as described below.
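For readers who haven't met gated memory before, the following is a rough sketch of how a single slot with update, forget, and output gates might behave on one time step. This is a generic LSTM-style illustration, not the paper's equations: the function names, the scalar memory, and the convention that the forget gate gives the fraction erased are all assumptions made for the example.

```python
import math


def sigmoid(x):
    """Squash a gate's input drive into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def step_slot(memory, candidate, update_drive, forget_drive, output_drive):
    """Advance one hypothetical memory slot by one time step.

    Assumed conventions (not from the paper):
      - the update gate scales how much of the new candidate is written,
      - the forget gate gives the fraction of the old content erased,
      - the output gate scales how much of the slot is exposed downstream.
    """
    update = sigmoid(update_drive)
    forget = sigmoid(forget_drive)
    output = sigmoid(output_drive)
    new_memory = (1.0 - forget) * memory + update * candidate
    exposed = output * new_memory
    return new_memory, exposed


if __name__ == "__main__":
    # Gates closed for update and forget: the slot simply maintains its content.
    print(step_slot(0.8, 0.3, -50.0, -50.0, 50.0))
```

In a trained network the three drives would be learned functions of the current input and recurrent state; here they are passed in by hand so the effect of each gate is easy to see.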

Shaped networks were first trained to provide one response to any stimulus following the number 1, and a different response to any stimulus following the number 2, and these numbers were presented with decreasing frequency over 4 successive stages of shaping. This most crucial stage of shaping directly targets the difficulties SRNs have in learning long-range sequential structure. This was followed by a stage in which the net provided a target response to any stimulus following an A that was last preceded by a 1, with a nontarget response to all other stimuli. The final stage required a target response to any stimulus following a B that was last preceded by a 2, and nontarget responses to all other stimuli.
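The shaping stages described above can be sketched as different target functions over the same stimulus stream. This is only my reading of the description: the stage names, the "T"/"N" codes, and the choice to use the digit itself as the response in the first stage are hypothetical, and the sketch leaves out the gradual reduction in how often the digits appear.

```python
def shaping_targets(sequence, stage):
    """Return the desired response to each symbol under one shaping stage.

    Stage names are hypothetical labels for the stages described in the text:
      "context"      - report which digit (1 or 2) was seen most recently
      "after_A_in_1" - target for any stimulus right after an A, under a 1
      "after_B_in_2" - target for any stimulus right after a B, under a 2
    """
    context = None   # last digit seen
    previous = None  # symbol immediately before the current one
    targets = []
    for symbol in sequence:
        if symbol in ("1", "2"):
            context = symbol
        if stage == "context":
            target = context  # None until the first digit appears
        elif stage == "after_A_in_1":
            target = "T" if (context == "1" and previous == "A") else "N"
        elif stage == "after_B_in_2":
            target = "T" if (context == "2" and previous == "B") else "N"
        else:
            raise ValueError(f"unknown stage: {stage}")
        targets.append(target)
        previous = symbol
    return targets


if __name__ == "__main__":
    stream = list("1AX2BY")
    for stage in ("context", "after_A_in_1", "after_B_in_2"):
        print(stage, shaping_targets(stream, stage))
```

Each stage is a strictly easier problem than the full task, and the first one isolates the part that requires holding the digit across many intervening symbols.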

In their first set of simulations, these phases were manually "hacked" by the authors to utilize separate slots in working memory; in my experience, this aspect of learning in gated networks (i.e., learning to use discrete slots for different items) is one of the largest determinants of training time.

To illustrate this, I've included the learning curve from a hobby network training on a 2-item serial recall task. The long plateau in sum squared error (on the y-axis) reflects the network's difficulty in assigning certain slots in working memory to particular serial orders, and as you can see, it makes up the majority of the training time!

[Figure: SerialRecallNetSSE.jpg - sum squared error over training for the 2-item serial recall network]

Krueger & Dayan next compare their shaped network (with hacked memory slots) to an unshaped network (where it has to figure out which slots should code for what). I don't think this is a fair comparison, so I'll skip the summary here.

However, in the following section they come full circle. They specify (and implement) a way for the network to intrinsically allocate its experiences to particular slots in working memory, by expanding the number of memory slots when errors occur.
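As a very loose illustration of the idea of growing memory in response to errors, the sketch below adds one slot whenever recent error stays above a threshold. Everything here is hypothetical (the function name, the threshold, the cap on slots, and the use of mean error as the trigger); the paper's actual allocation rule may well differ.

```python
def maybe_add_slot(n_slots, recent_errors, threshold=0.1, max_slots=8):
    """Return the new slot count after checking recent training error.

    Hypothetical rule: if the mean of the recent errors exceeds `threshold`
    and the cap `max_slots` has not been reached, allocate one more slot.
    """
    if not recent_errors:
        return n_slots  # nothing observed yet, so do not grow
    mean_error = sum(recent_errors) / len(recent_errors)
    if mean_error > threshold and n_slots < max_slots:
        return n_slots + 1
    return n_slots


if __name__ == "__main__":
    slots = 1
    for window in ([0.4, 0.5], [0.3, 0.2], [0.05, 0.02]):
        slots = maybe_add_slot(slots, window)
    print(slots)
```

A rule like this also makes the generalization worry easy to see: if slots are cheap, the network can keep adding them rather than reusing existing ones for new inputs.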

Somewhat disappointingly, this yields an improvement relative to unshaped networks that is significant only at the 1-tailed level, and raises questions about generalization - networks that can freely create new units at will may in fact be memorizing those inputs in a 1-to-1 fashion rather than extracting the underlying rules. The generalization results presented for the algorithmically allocated networks (in the form of reversal learning) do not clarify this state of affairs, because the nets with the algorithmic allocation of memory slots were exposed to those tasks during shaping (and thus could have assigned memory slots specific to those tasks)!

In summary, these results don't resolve the issue of external shaping, for three reasons: the 12AX-CPT clearly falls under the same rubric as those grammars shown by Rohde & Plaut to generate no benefit from shaping; the allocation of memory slots was hacked when shaping appeared to show the most benefit; and algorithmic memory slot allocation in combination with shaping yielded only marginal improvements over unshaped networks (with the aforementioned caveats about generalization).

However, they do shed some unexpected light on internal or self-shaping. That is, Krueger & Dayan's algorithmic implementation of the slot assignment hack actually allows the network to expand its own memory capacity as it learns the task in a way that minimizes error. This is very similar to a cascade correlation network. But again, the effects of this are somewhat unimpressive: in combination with external shaping, this yields only a marginal improvement (at the 2-tailed level) over unrestricted networks, and with questionable effects on generalization.

Krueger KA, Dayan P. (2009). Flexible shaping: How learning in small steps helps. Cognition
