Staging, Self-Shaping, Starting Small: Not Important?

An early classic in computational neuroscience was a 1993 paper by Elman called "The Importance of Starting Small." The paper describes how initial limitations in a network's memory capacity could actually be beneficial to its learning of complex sentences, relative to networks that were "adult-like" from the start. This still seems like a beautiful idea - the cognitive limitations of children may somehow be adaptive for the learning they have yet to do.
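To make the manipulation concrete before going further, here is a minimal sketch (a toy of my own in numpy, not Elman's actual simulations) of the "limited memory that matures" idea: a simple recurrent network whose context layer is wiped every few steps, with the reset interval growing as training proceeds. The task and the one-step learning rule are deliberately crude; the only point is the schedule on the context resets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for Elman's prediction task: the target at each step is the
# symbol seen LAG steps earlier, so the network needs some memory to succeed.
VOCAB, LAG, T = 4, 3, 20000
xs = rng.integers(0, VOCAB, size=T)
ys = np.roll(xs, LAG)                       # ys[t] == xs[t - LAG]

H = 20                                      # hidden/context units
Wxh = rng.normal(0, 0.1, (H, VOCAB))
Whh = rng.normal(0, 0.1, (H, H))
Why = rng.normal(0, 0.1, (VOCAB, H))
bh, by = np.zeros(H), np.zeros(VOCAB)

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

lr = 0.1
h = np.zeros(H)
for t in range(T):
    # "Starting small": early in training the context layer is erased every
    # 2 steps, later every 4, and finally not at all ("adult-like" memory).
    window = 2 if t < T // 3 else (4 if t < 2 * T // 3 else T)
    if t % window == 0:
        h = np.zeros(H)

    x = one_hot(xs[t], VOCAB)
    h_prev = h
    h = np.tanh(Wxh @ x + Whh @ h_prev + bh)
    logits = Why @ h + by
    p = np.exp(logits - logits.max())
    p /= p.sum()

    # One-step gradient (no backprop through time, for brevity).
    dlogits = p - one_hot(ys[t], VOCAB)
    dh = Why.T @ dlogits
    dpre = dh * (1 - h ** 2)
    Why -= lr * np.outer(dlogits, h)
    by  -= lr * dlogits
    Wxh -= lr * np.outer(dpre, x)
    Whh -= lr * np.outer(dpre, h_prev)
    bh  -= lr * dpre
```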

And Elman is not alone in proposing it; a number of other researchers have suggested that limited cognitive control or working memory capacity could actually be beneficial. Unfortunately, there is very little behavioral data that supports this idea.

In 1999, Rohde & Plaut appeared to deal the death blow to this idea. They showed that Elman's result holds only for a very particular type of sequential input: sequences in which the information intervening between long-range dependent items is uncorrelated with those items. It's worth pausing to consider how large the gulf is between theory and data on this point.

Newport, Braver, Thompson-Schill, Dayan, and undoubtedly others have all suggested the same general idea (with varying rationales): somehow, cognitive limitations must be advantageous. Otherwise, the costs of these limitations would surely have led to their elimination, perhaps evolutionarily (for example, through children and teenagers who do something stupid and accidentally kill themselves).

In fact, when Rohde & Plaut used input stimuli in which the intervening information was correlated with the long-range dependent items (for example, semantic constraints that make an embedded clause informative about the noun and verb it separates), they actually observed an advantage (or at least no disadvantage) for starting "big." The message was clear: something peculiar about the training data or parameters used by Elman must have driven the results.

Rohde & Plaut argue that connectionist networks intrinsically extract the more basic covariations in training data before extracting more complex ones. Subsequent work by Conway et al. has demonstrated that staged input can improve language-like learning in some cases, but the case for a benefit of initial limitations in memory capacity remains where Rohde & Plaut left it.
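For comparison with the memory-limitation sketch above, "staging" the input itself can be as simple as a data schedule that admits only short items at first and gradually lets the full corpus in. The snippet below is my own toy illustration of that kind of schedule (not Conway et al.'s procedure); plug any training step into the loop at the bottom.

```python
import numpy as np

rng = np.random.default_rng(1)

# A toy corpus of "sentences" (lists of word ids) with lengths from 3 to 14.
corpus = [rng.integers(0, 50, size=rng.integers(3, 15)).tolist()
          for _ in range(1000)]

def staged_batches(corpus, n_epochs=30, batch_size=32):
    """Yield (epoch, batch) pairs, exposing only short sentences early on and
    gradually admitting the whole corpus -- one way to 'stage' the input."""
    for epoch in range(n_epochs):
        max_len = 4 + (epoch * 12) // n_epochs     # grows from 4 to 15
        stage = [s for s in corpus if len(s) <= max_len]
        order = rng.permutation(len(stage))
        for i in range(0, len(stage), batch_size):
            yield epoch, [stage[j] for j in order[i:i + batch_size]]

for epoch, batch in staged_batches(corpus):
    pass  # a real train_step(model, batch) would go here
```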

A related issue concerns the cascade-correlation algorithm for changing the topology of neural networks. Briefly, the concept is that the network can spontaneously generate new processing units once its learning appears to stagnate. Some claim these networks can learn up to 1000-5000% faster than those using the more standard backpropagation algorithm with a pre-specified architecture, but I can't find a citation to back this claim, and I can't check it in Emergent (which doesn't include a cascade-correlation algorithm). Nonetheless, cascade-correlation is the only implemented algorithmic "self-shaping" mechanism I know of (please see the comments section for important corrections - apparently there are many forms of this, including one described in this follow-up post).
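Since I'm leaning on a verbal description here, a stripped-down sketch of the cascade-correlation loop may help (my own simplification, in numpy, for a toy regression problem; Fahlman & Lebiere's algorithm trains a pool of candidate units with quickprop and cascades each new unit's output into later units, none of which is reproduced here). The core idea survives: train the output weights, and when the error has hit its floor for the current architecture, recruit a new hidden unit whose incoming weights are tuned to correlate with the residual error and then frozen.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy regression problem that a network with no hidden units cannot solve.
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sin(3 * X[:, 0]) * X[:, 1]

def fit_output(F, y):
    """Exact least-squares output weights on the current features
    (raw inputs plus the outputs of any frozen hidden units)."""
    Fb = np.hstack([F, np.ones((len(F), 1))])          # bias column
    w, *_ = np.linalg.lstsq(Fb, y, rcond=None)
    return Fb @ w

def recruit_unit(F, resid, steps=2000, lr=0.05):
    """Train one candidate tanh unit to maximize the covariance between its
    activation and the residual error, then return its (frozen) activations."""
    Fb = np.hstack([F, np.ones((len(F), 1))])
    w = rng.normal(0, 0.5, Fb.shape[1])
    for _ in range(steps):
        h = np.tanh(Fb @ w)
        cov = np.sum((h - h.mean()) * (resid - resid.mean()))
        grad = np.sign(cov) * Fb.T @ ((resid - resid.mean()) * (1 - h ** 2))
        w += lr * grad / len(F)                        # gradient ascent on |cov|
    return np.tanh(Fb @ w)

F = X.copy()                     # feature pool: inputs + recruited hidden units
for n_hidden in range(6):
    pred = fit_output(F, y)
    err = np.mean((y - pred) ** 2)
    print(f"{n_hidden} hidden units, MSE = {err:.4f}")
    if err < 1e-3:
        break
    # Output learning has stagnated for this architecture, so grow the network:
    # recruit a unit, correlate it with the residual error, and freeze it.
    F = np.hstack([F, recruit_unit(F, y - pred)[:, None]])
```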

That was going to be the end of this post. In a funny case of synchronicity, I discovered after writing it that Krueger & Dayan have a new paper - available online only as of New Year's - demonstrating a new case in which the Elman result holds. I'll discuss that in my next post.




Cascade-Correlation Algorithm input frozen levels == Lakoff Embodied Metaphors??

I'm sure that's already been suggested before...

sorry, i don't follow - could you spell it out for me?

For other networks that attempt to add complexity, look at RBF networks, GRNN (general regression neural networks), FANRE, and I think pi-sigma, and other kernel-based neural networks.

hi ds - thanks for the pointers, but i don't see how RBF nets increase their processing capacity as a function of their own learning in a way that standard backprop wouldn't. can you clarify?

There are a number of algorithms that evolve the topology along with the weights of neural networks, which is very much the "self-shaping" mechanism you're talking about. The one I'm most familiar with is NEAT (NeuroEvolution of Augmenting Topologies), developed by Kenneth Stanley:

http://en.wikipedia.org/wiki/NeuroEvolution_of_Augmented_Topologies

One of its defining aspects is "starting small", reducing the dimensionality of the search for an optimal topology by starting with minimal architectures and incrementally adding nodes and connections. As in Elman's work, starting small yields better performance than starting large.
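For anyone who hasn't seen it, the "add node" mutation is the heart of that incremental growth: a new node splits an existing connection, the old connection is disabled, the incoming weight is set to 1.0, and the outgoing weight inherits the old value, so the network's behavior is initially close to unchanged. Below is a toy illustration of just that mutation (my own simplified genome, not Stanley's implementation, which also tracks innovation numbers for crossover and uses speciation):

```python
import random
from dataclasses import dataclass, field

@dataclass
class Conn:
    src: int
    dst: int
    weight: float
    enabled: bool = True

@dataclass
class Genome:
    n_nodes: int                              # node ids are 0 .. n_nodes - 1
    conns: list = field(default_factory=list)

def add_node_mutation(g: Genome, rng: random.Random) -> None:
    """Split a randomly chosen enabled connection with a new node."""
    enabled = [c for c in g.conns if c.enabled]
    if not enabled:
        return
    old = rng.choice(enabled)
    old.enabled = False                       # the old connection is disabled
    new_id = g.n_nodes
    g.n_nodes += 1
    g.conns.append(Conn(old.src, new_id, 1.0))         # incoming weight = 1.0
    g.conns.append(Conn(new_id, old.dst, old.weight))  # outgoing keeps old weight

# "Starting small": a minimal genome, just two inputs wired to one output.
rng = random.Random(0)
g = Genome(n_nodes=3, conns=[Conn(0, 2, 0.5), Conn(1, 2, -0.3)])
add_node_mutation(g, rng)
print(g)
```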

Derek - it's always a pleasure - this is great! I am unfamiliar with this work. The Krueger & Dayan paper I'll discuss tomorrow changes its own topology as well, but with unclear effects on generalization. Maybe I'll post on NEAT soon as well.

CHCH
I haven't implemented an RBF network yet (although I'm getting ready to), but my understanding is that additional hidden nodes are added to the network if the existing kernel centers are too far away from the current instance. By adding additional hidden nodes, an RBF network accumulates additional knowledge.

With standard backprop, it is tricky to adjust the weights on the fly, because the network was probably trained over multiple epochs. So the question becomes: what weight do you give new information, so that it is balanced against the already-established behavior?
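A minimal sketch of that growth rule, for what it's worth (my own toy, in the spirit of growing-RBF schemes such as Platt's resource-allocating network, which additionally gates growth on the size of the error): allocate a new Gaussian center whenever the current instance is far from every existing center, otherwise just nudge the output weights.

```python
import numpy as np

def rbf(x, centers, width=0.5):
    """Gaussian activations of input x at each stored center."""
    d2 = ((centers - x) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * width ** 2))

rng = np.random.default_rng(0)
centers = np.empty((0, 2))     # the hidden layer grows as knowledge accumulates
weights = np.empty(0)
THRESH, LR = 0.6, 0.2

for _ in range(500):
    x = rng.uniform(-1, 1, 2)
    target = np.sin(3 * x[0]) * x[1]               # some toy function to learn

    far = len(centers) == 0 or \
        np.min(((centers - x) ** 2).sum(axis=1)) > THRESH ** 2
    if far:
        # No existing kernel center is near this instance: add a hidden node
        # and set its weight so the new unit absorbs the current error.
        pred = rbf(x, centers) @ weights if len(weights) else 0.0
        centers = np.vstack([centers, x])
        weights = np.append(weights, target - pred)
    else:
        # Otherwise, just nudge the existing output weights (an LMS step).
        phi = rbf(x, centers)
        weights += LR * (target - phi @ weights) * phi

print(f"ended with {len(centers)} hidden units")
```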