There’s little evidence that “staging” the training of neural networks on language-like input – feeding them part of the problem space initially, and scaling that up as they learn – confers any consistent benefit for their long-term learning (as reviewed yesterday).
To summarize that post, early computational demonstrations of the importance of starting small were subsequently cast into doubt by numerous replication failures, with one exception: the importance of starting small is replicable when the training data lack temporal correlations. In that case, simple recurrent networks are slow to learn the importance of longer timescales, but learning can be sped up by “shaping” the type of input the network sees early in its learning.
Typically, these investigations have focused on grammar-like input data, in which a recursive phrase can be embedded between a noun and verb (e.g., the cat [who ate the bird [who had chicklings [who ate a worm [who…]]]] chased the dog). A new paper from Krueger & Dayan uses training data from a very different context – the 12AX-CPT – and illustrates that, as in cases with uncorrelated embeddings, networks learning the 12AX-CPT benefit from developmental staging, or shaping, of the input data. Interestingly, however, the 12AX-CPT does involve correlations between the embeddings, and so does not fit the conditions (as laid out by Rohde and Plaut) that should generate an advantage from “starting small.”
The 12AX-CPT is a fairly complex task. Here are the rules, to give you a flavor: if the last number you saw was a “1”, respond with your left hand to an X that follows an A, and respond with your right hand to anything else you see; but if the last number you saw was a “2”, respond with your left hand to a Y that follows a B, and with your right hand to anything else you see. In this task, the numbers 1 and 2 make up the “outer loop”, and each can be followed by 1 to 4 “inner loop” pairs. The target sequence for the current outer loop always has a higher probability of occurrence than the nontarget sequences, yielding errors or reaction times indicative of anticipation (AY in outer loop 1, BX in outer loop 2) among subjects who use this probability information and are capable of maintaining previously-observed stimuli.
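To make the rules concrete, here is a minimal sketch of the correct response for each symbol in a stream. It assumes the usual reading of the task, in which the target is an X that comes right after an A (when the last digit was 1) or a Y that comes right after a B (when the last digit was 2). The function name and the "L"/"R" response labels are my own, not anything from the paper.

```python
def twelve_ax_responses(stream):
    """Return 'L' (target) or 'R' (nontarget) for each symbol in the stream.

    Assumed rule: after a 1, an X right after an A is a target;
    after a 2, a Y right after a B is a target. Everything else is nontarget.
    """
    context = None    # last digit seen ("1" or "2"): the outer loop
    previous = None   # symbol seen immediately before the current one
    responses = []
    for symbol in stream:
        if symbol in ("1", "2"):
            context = symbol
        is_target = (
            (context == "1" and previous == "A" and symbol == "X")
            or (context == "2" and previous == "B" and symbol == "Y")
        )
        responses.append("L" if is_target else "R")
        previous = symbol
    return responses


print("".join(twelve_ax_responses("1AXBY2BYAX")))  # prints RRLRRRRLRR
```

Note that the same pair (A then X) is a target in one outer loop and a nontarget in the other, so the digit has to be held in memory across any number of intervening inner-loop pairs.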
Krueger & Dayan trained a gated working memory network (like an SRN that learns when to maintain, forget, and output information from the recurrent layer) on the 12AX-CPT. Interestingly, their architecture includes three distinct gates for each “slot” in working memory (a gate for updating, for forgetting, and for output) trained with backpropagation through time. Certain networks were “shaped” as described below.
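For readers who have not met gated memory before, the following is a rough, LSTM-like sketch of what a single slot with update, forget, and output gates might compute on one time step. It is not taken from the paper: the exact equations, the scalar memory, and all of the names here are assumptions made for illustration.

```python
import math


def sigmoid(x):
    """Squash a gate's net input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_slot_step(memory, candidate, update_in, forget_in, output_in):
    """One time step of a single hypothetical working-memory slot.

    memory    : value currently held in the slot
    candidate : new value the slot could store
    *_in      : net inputs to the three gates (learned in a real network)
    """
    update = sigmoid(update_in)   # how much of the candidate to write
    forget = sigmoid(forget_in)   # how much of the old content to erase
    output = sigmoid(output_in)   # how much of the slot to expose downstream
    new_memory = (1.0 - forget) * memory + update * candidate
    return new_memory, output * new_memory


# Gates closed for writing and erasing: the slot maintains its content.
kept, _ = gated_slot_step(0.8, 0.3, update_in=-50, forget_in=-50, output_in=50)
# Gates open for writing and erasing: the slot is overwritten.
replaced, _ = gated_slot_step(0.8, 0.3, update_in=50, forget_in=50, output_in=50)
print(round(kept, 3), round(replaced, 3))  # prints 0.8 0.3
```

In a trained network the gate inputs would be functions of the current stimulus and the recurrent state, which is what lets the net learn when to hold on to a 1 or a 2.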
Shaped networks were first trained to provide one response to any stimulus following the number 1, and a different response to any stimulus following the number 2, and these numbers were presented with decreasing frequency over 4 successive stages of shaping. This most crucial stage of shaping directly targets the difficulties SRNs have in learning long-range sequential structure. This was followed by a stage in which the net provided a target response to any stimulus following an A that was last preceded by a 1, with a nontarget response to all other stimuli. The final stage required a target response to any stimulus following a B that was last preceded by a 2, and nontarget responses to all other stimuli.
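The shaping stages can be sketched as three different target functions over the same kind of input stream. This is only my reading of the description above: the stage names, the choice not to score the digits themselves, and the string labels are all assumptions rather than details from the paper.

```python
def stage_targets(stream, stage):
    """Return the desired response per symbol for one hypothetical shaping stage.

    stage "context": respond according to the last digit seen.
    stage "after_A": target for any stimulus right after an A, if the last digit was 1.
    stage "after_B": target for any stimulus right after a B, if the last digit was 2.
    """
    if stage not in ("context", "after_A", "after_B"):
        raise ValueError(f"unknown stage: {stage}")
    context, previous, targets = None, None, []
    for symbol in stream:
        if symbol in ("1", "2"):
            context = symbol
            targets.append(None)  # assumption: no response is scored on the digit
        elif stage == "context":
            targets.append(context)
        elif stage == "after_A":
            hit = context == "1" and previous == "A"
            targets.append("target" if hit else "nontarget")
        else:  # stage == "after_B"
            hit = context == "2" and previous == "B"
            targets.append("target" if hit else "nontarget")
        previous = symbol
    return targets


print(stage_targets("1AC2BC", "context"))
# prints [None, '1', '1', None, '2', '2']
```

Presenting the digits with decreasing frequency would then amount to lengthening the run of letters between digits from one stage to the next, so that the context has to be maintained for longer.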
In their first set of simulations, these phases were manually “hacked” by the authors to utilize separate slots in working memory; in my experience, this aspect of learning in gated networks (i.e., learning to use discrete slots for different items) is one of the largest determinants of training time.
To illustrate this, I’ve included the learning curve from a hobby network training on a 2-item serial recall task. The long plateau in sum squared error (on the y-axis) reflects the network’s difficulty in assigning certain slots in working memory to particular serial orders, and as you can see, it makes up the majority of the training time!
Krueger & Dayan next compare their shaped network (with hacked memory slots) to an unshaped network (where it has to figure out which slots should code for what). I don’t think this is a fair comparison, so I’ll skip the summary here.
However, in the following section they come full circle. They specify (and implement) a way for the network to intrinsically allocate its experiences to particular slots in working memory, by expanding the number of memory slots when errors occur.
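As a toy illustration of the general idea of growing memory in response to error, consider the sketch below. It is emphatically not the authors' algorithm: the error threshold, the cap on slots, and the class itself are hypothetical, and a real implementation would also have to decide how new slots get wired into the network.

```python
class GrowingMemory:
    """Hypothetical slot pool that adds a slot whenever error is too high."""

    def __init__(self, error_threshold=0.5, max_slots=8):
        self.error_threshold = error_threshold  # assumed trigger for growth
        self.max_slots = max_slots              # assumed cap on capacity
        self.slots = [0.0]                      # start with a single empty slot

    def observe_error(self, error):
        """Add a slot if the error exceeds the threshold; report whether it grew."""
        if error > self.error_threshold and len(self.slots) < self.max_slots:
            self.slots.append(0.0)
            return True
        return False


memory = GrowingMemory(error_threshold=0.5, max_slots=3)
grew = [memory.observe_error(e) for e in (0.1, 0.9, 0.7, 0.8)]
print(grew, len(memory.slots))  # prints [False, True, True, False] 3
```

The cap matters for the argument that follows: without some limit on growth, a network could in principle add a slot per input rather than learn the rule.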
Somewhat disappointingly, this yields an improvement relative to unshaped networks that is significant only at the 1-tailed level, and raises questions about generalization – networks that can freely create new units at will may in fact be memorizing those inputs in a 1-to-1 fashion rather than extracting the underlying rules. The generalization results presented for the algorithmically allocated networks (in the form of reversal learning) do not clarify this state of affairs, because the nets with the algorithmic allocation of memory slots were exposed to those tasks during shaping (and thus could have assigned memory slots specific to those tasks)!
In summary, these results don’t resolve the issue of external shaping, for three reasons: the 12AX clearly falls under the same rubric as those grammars shown by Rohde & Plaut to generate no benefit from shaping; the allocation of memory slots was hacked in the simulations where shaping appeared to show the most benefit; and algorithmic memory slot allocation in combination with shaping yielded only marginal improvements over unshaped networks (with the aforementioned caveats about generalization).
However, they do shed some unexpected light on internal or self-shaping. That is, Krueger & Dayan’s algorithmic implementation of the slot assignment hack actually allows the network to expand its own memory capacity as it learns the task in a way that minimizes error. This is very similar to a cascade correlation network. But again, the effects of this are somewhat unimpressive: in combination with external shaping, this yields only a marginal improvement (at the 2-tailed level) over unrestricted networks, and with questionable effects on generalization.
Krueger KA, Dayan P. (2009). Flexible shaping: How learning in small steps helps. Cognition