An early classic in computational neuroscience was a 1993 paper by Elman called “The Importance of Starting Small.” The paper describes how initial limitations in a network’s memory capacity could actually be beneficial to its learning of complex sentences, relative to networks that were “adult-like” from the start. This still seems like a beautiful idea – the cognitive limitations of children may somehow be adaptive for the learning they have yet to do.
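To make "initial limitations in memory capacity" concrete, here is a minimal sketch of one way to impose them on a simple recurrent network: wipe the context (hidden) state every few words, and lengthen that window as training proceeds. The function names, the reset value, and the particular window sizes are assumptions for illustration, not details taken from this post, and the sketch runs only the forward pass, with no training.

```python
import numpy as np

def srn_hidden_states(inputs, w_in, w_rec, window=None, reset_value=0.5):
    """Run a simple recurrent network, wiping its context every `window` steps.

    window=None means the context is never wiped (the "adult-like" network).
    reset_value is an illustrative choice.
    """
    n_hidden = w_rec.shape[0]
    context = np.full(n_hidden, reset_value)
    states = []
    for t, x in enumerate(inputs):
        if window is not None and t > 0 and t % window == 0:
            context = np.full(n_hidden, reset_value)  # forget everything so far
        # logistic hidden layer driven by the current word and the context
        context = 1.0 / (1.0 + np.exp(-(w_in @ x + w_rec @ context)))
        states.append(context)
    return np.array(states)

def memory_schedule(n_phases, start_window=3):
    """Hypothetical schedule: the window grows by one word per phase, then opens fully."""
    return [start_window + i for i in range(n_phases - 1)] + [None]

rng = np.random.default_rng(0)
w_in = rng.normal(size=(4, 3))
w_rec = rng.normal(size=(4, 4))
sentence = rng.normal(size=(8, 3))  # 8 "words", each a 3-dimensional vector

for window in memory_schedule(n_phases=4):
    states = srn_hidden_states(sentence, w_in, w_rec, window=window)
    # a real experiment would train the network during each phase
```

With `window=1` the network has no memory at all, so its state depends only on the current word; with `window=None` it is the unrestricted network from the start.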
Elman is not alone in proposing that early limitations can help; a number of other researchers have argued that a lack of cognitive control or working memory capacity could actually be beneficial. Unfortunately, there is very little behavioral data that supports this idea.
It’s worth pausing to consider how large the gulf is between theory and data on this point. Newport, Braver, Thompson-Schill, Dayan, and undoubtedly others have all suggested the same general idea (with varying rationales): somehow, cognitive limitations must be advantageous. Otherwise, the cost of these limitations would surely eliminate them, perhaps evolutionarily (for example, among children/teenagers who do something stupid and accidentally kill themselves).
In 1999, Rohde & Plaut appeared to deal the death blow to this idea. They showed that Elman’s result is true only for a very particular type of sequential input stimulus: those where the information intervening between items with a long-range dependence is uncorrelated with those items.
In fact, when Rohde & Plaut used input stimuli which did contain correlations between the intervening information and the items with long-range dependence, they actually observed an advantage (or at least no disadvantage) for starting “big.” The message was clear: something peculiar about the training data or parameters used by Elman must have driven the results.
Rohde & Plaut argue that connectionist networks intrinsically extract more basic covariations in training data before extracting more complex ones, which would make an externally imposed “small” start unnecessary. Subsequent work by Conway et al. has demonstrated that staged input can improve language-like learning in some cases, but the case for a benefit of initial limitations in memory capacity stands where it did after Rohde & Plaut’s paper.
A related issue concerns the cascade-correlation algorithm for changing the topology of neural networks. Briefly, the concept is that the network can spontaneously generate new units for processing once its learning appears to stagnate. Some claim these networks can learn up to 1000-5000% faster than those using the more standard backpropagation algorithm with a pre-specified architecture, but I can’t find a citation to back this claim, and I can’t check it in Emergent (it doesn’t include a cascade-correlation algorithm). Nonetheless, cascade-correlation is the only implemented algorithmic “self-shaping” mechanism I know of (please see comments section for important corrections – apparently there are many forms of this, including one described in this followup post).
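The "grow a unit when learning stagnates" idea can be sketched in a few lines. This is a deliberately simplified illustration, not the cascade-correlation algorithm itself: here a new unit gets random, frozen incoming weights, whereas the real algorithm trains candidate units against the network's remaining error. The function names, the patience and threshold values, and the toy sine-wave task are all assumptions made for the example.

```python
import numpy as np

def has_stagnated(losses, patience=5, min_improvement=1e-3):
    """True when loss improved by less than min_improvement over the last `patience` epochs."""
    if len(losses) <= patience:
        return False
    return losses[-patience - 1] - losses[-1] < min_improvement

def grow_network(x, y, max_units=5, epochs=300, lr=0.1, seed=0):
    """Fit y with a network that adds a hidden unit whenever learning stalls."""
    rng = np.random.default_rng(seed)
    features = np.ones((x.shape[0], 1))  # start with a bias term only
    w_out = np.zeros(1)
    history, recent, n_units = [], [], 0
    for _ in range(epochs):
        error = features @ w_out - y
        loss = float(np.mean(error ** 2))
        history.append(loss)
        recent.append(loss)
        w_out -= lr * features.T @ error / len(y)  # gradient step on output weights
        if n_units < max_units and has_stagnated(recent):
            # The new unit sees the inputs and every earlier unit (the "cascade").
            # Simplification: its incoming weights are random and never trained.
            unit_in = np.hstack([x, features])
            new_unit = np.tanh(unit_in @ rng.normal(size=unit_in.shape[1]))
            features = np.hstack([features, new_unit[:, None]])
            w_out = np.append(w_out, 0.0)  # zero weight: predictions unchanged at first
            n_units += 1
            recent = []  # restart the stagnation check for the larger network
    return n_units, history

x = np.linspace(-3, 3, 50)[:, None]
y = np.sin(x[:, 0])
n_units, history = grow_network(x, y)
print(f"units added: {n_units}, first loss: {history[0]:.3f}, last loss: {history[-1]:.3f}")
```

The point of the design is that the architecture is not fixed in advance: the network starts as small as possible and only pays for extra units when the existing ones have stopped making progress.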
That was going to be the end of this post. In a funny case of synchronicity, I discovered after writing it that Krueger & Dayan have a new paper – available online only as of New Year’s – demonstrating a new case in which the Elman result holds. I’ll discuss that in my next post.