Staging, Self-Shaping, Starting Small: Not Important?

An early classic in computational neuroscience was a 1993 paper by Elman called "The Importance of Starting Small." The paper describes how initial limitations in a network's memory capacity could actually be beneficial to its learning of complex sentences, relative to networks that were "adult-like" from the start. This still seems like a beautiful idea - the cognitive limitations of children may somehow be adaptive for the learning they have yet to do.
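
To make the manipulation concrete, here is a toy sketch of the idea - not Elman's actual simulation; the network sizes, random "corpus," and reset schedule below are invented - in which a simple recurrent network's context layer is flushed every few words early in training, with the window widening across later phases:

```python
import numpy as np

# Toy sketch only: the point is the "incremental memory" manipulation, where
# the recurrent context is wiped every few words early in training and the
# window grows in later phases.

rng = np.random.default_rng(0)
VOCAB, HIDDEN = 20, 30
W_in  = rng.normal(0, 0.1, (HIDDEN, VOCAB))   # input -> hidden
W_ctx = rng.normal(0, 0.1, (HIDDEN, HIDDEN))  # context -> hidden
W_out = rng.normal(0, 0.1, (VOCAB, HIDDEN))   # hidden -> next-word prediction

def one_hot(i):
    v = np.zeros(VOCAB)
    v[i] = 1.0
    return v

def train_phase(tokens, window, lr=0.1):
    """One pass over a token stream, flushing the context every `window` steps."""
    global W_in, W_ctx, W_out
    context = np.zeros(HIDDEN)
    for t in range(len(tokens) - 1):
        if t % window == 0:
            context = np.zeros(HIDDEN)        # the "limited memory" manipulation
        x, target = one_hot(tokens[t]), one_hot(tokens[t + 1])
        hidden = np.tanh(W_in @ x + W_ctx @ context)
        logits = W_out @ hidden
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        # One-step (truncated) gradients of the cross-entropy loss, for brevity.
        d_out = probs - target
        d_hid = (W_out.T @ d_out) * (1 - hidden ** 2)
        W_out -= lr * np.outer(d_out, hidden)
        W_in  -= lr * np.outer(d_hid, x)
        W_ctx -= lr * np.outer(d_hid, context)
        context = hidden

# "Starting small": early phases get a short memory window, later phases longer ones.
toy_corpus = rng.integers(0, VOCAB, size=2000)  # stand-in for a structured grammar
for window in (3, 4, 6, 1000):
    train_phase(toy_corpus, window)
```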

And Elman is not alone: a number of other researchers have proposed that a lack of cognitive control or working memory capacity could actually be beneficial. Unfortunately, there is very little behavioral data supporting this idea.

In 1999, Rohde & Plaut appeared to deal the death blow to this idea. They showed that Elman's result holds only for a very particular type of sequential input: stimuli in which the information intervening between long-range dependencies is uncorrelated with the items showing that dependence. It's worth pausing to consider how large the gulf is between theory and data on this point.

Newport, Braver, Thompson-Schill, Dayan, and undoubtedly others have all suggested the same general idea (with varying rationales): somehow, cognitive limitations must be advantageous. Otherwise, the cost of these limitations would surely eliminate them, perhaps evolutionarily (for example, among children or teenagers who do something stupid and accidentally kill themselves).

In fact, when Rohde & Plaut used input stimuli in which the intervening information was correlated with the items showing long-range dependence, they actually observed an advantage (or at least no disadvantage) for starting "big." The message was clear: something peculiar about the training data or parameters used by Elman must have driven the results.

Rohde & Plaut argue that connectionist networks intrinsically extract more basic covariations in the training data before extracting more complex ones. Subsequent work by Conway et al has demonstrated that staged input can improve language-like learning in some cases, but the potential benefit of initial limitations in memory capacity remains where Rohde & Plaut's paper left it.

A related issue concerns the cascade-correlation algorithm for changing the topology of neural networks. Briefly, the concept is that the network can spontaneously generate new processing units once its learning appears to stagnate. Some claim these networks can learn up to 1000-5000% faster than networks using the more standard backpropagation algorithm with a pre-specified architecture, but I can't find a citation to back this claim, and I can't check it in Emergent (which doesn't include a cascade-correlation algorithm). Nonetheless, cascade-correlation is the only implemented algorithmic "self-shaping" mechanism I know of (please see the comments section for important corrections - apparently there are many forms of this, including one described in this followup post).
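
Since the algorithm may be unfamiliar, here is a heavily simplified sketch of the recruit-on-stagnation loop. This is my own toy code, not Fahlman & Lebiere's procedure: their candidate units are trained by gradient ascent on their correlation with the residual error and also receive input from earlier hidden units, whereas this sketch just picks the best of a random candidate pool, and the regression problem is invented.

```python
import numpy as np

# Heavily simplified cascade-correlation-style loop: fit output weights, and
# whenever the error stagnates, recruit (and freeze) a new hidden unit whose
# activation correlates with the residual error.

rng = np.random.default_rng(1)

# Hypothetical toy regression problem, just to have something to fit.
X = rng.uniform(-1, 1, (200, 2))
y = np.sin(3 * X[:, 0]) * np.cos(2 * X[:, 1])
Xb = np.column_stack([X, np.ones(len(X))])      # inputs plus a bias column

def design_matrix(hidden_units):
    """Raw inputs, bias, and the activations of all (frozen) hidden units."""
    cols = [Xb] + [np.tanh(Xb @ w)[:, None] for w in hidden_units]
    return np.column_stack(cols)

hidden_units, prev_err = [], np.inf
for round_ in range(10):
    F = design_matrix(hidden_units)
    w_out, *_ = np.linalg.lstsq(F, y, rcond=None)   # "output training" phase
    err = np.mean((F @ w_out - y) ** 2)
    print(f"round {round_}: {len(hidden_units)} hidden units, MSE {err:.4f}")
    if prev_err - err < 1e-4:
        # Learning has stagnated: recruit the candidate unit whose activation
        # correlates best with the residual error, then freeze its input weights.
        residual = y - F @ w_out
        candidates = [rng.normal(0, 1, Xb.shape[1]) for _ in range(20)]
        best = max(candidates,
                   key=lambda w: abs(np.corrcoef(np.tanh(Xb @ w), residual)[0, 1]))
        hidden_units.append(best)
    prev_err = err
```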

That was going to be the end of this post. In a funny case of synchronicity, I discovered after writing it that Krueger & Dayan have a new paper - available online only as of New Year's - demonstrating a new case in which the Elman result holds. I'll discuss that in my next post.

Comments

Cascade-Correlation Algorithm input frozen levels == Lakoff Embodied Metaphors??

I'm sure that's already been suggested before...

sorry, i don't follow - could you spell it out for me?

For other networks that attempt to add complexity, look at RBF networks, GRNN (general regression neural networks), FANRE, and I think pi-sigma, and other kernel-based neural networks.

hi ds - thanks for the pointers, but i don't see how RBF nets increase their processing capacity as a function of their own learning in a way that standard backprop wouldn't. can you clarify?

There are a number of algorithms that evolve the topology along with the weights of neural networks, which is very much the "self-shaping" mechanism you're talking about. The one I'm most familiar with is NEAT (NeuroEvolution of Augmenting Topologies), developed by Kenneth Stanley:

http://en.wikipedia.org/wiki/NeuroEvolution_of_Augmented_Topologies

One of its defining aspects is "starting small", reducing the dimensionality of the search for an optimal topology by starting with minimal architectures and incrementally adding nodes and connections. As in Elman's work, starting small yields better performance than starting large.
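
For a flavor of what "starting minimal and growing" looks like in code, here is a bare-bones sketch - my own toy hill climber, emphatically not NEAT itself, since it omits innovation numbers, crossover, and speciation - in which mutations add nodes and connections to an initially minimal network fit to XOR:

```python
import math, random

# Bare-bones "start minimal, grow the topology" illustration (not real NEAT):
# a (1+1) hill climber whose mutations can add nodes and connections.

random.seed(0)
XOR = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
IN_A, IN_B, BIAS, OUT = 0, 1, 2, 3             # fixed node ids

def minimal_genome():
    """Smallest possible topology: each input wired straight to the output."""
    return {"next_id": 4,
            "conns": [[src, OUT, random.uniform(-1, 1)] for src in (IN_A, IN_B, BIAS)]}

def activate(genome, inputs):
    """Evaluate the DAG: a node fires once all of its source nodes have values."""
    values = {IN_A: inputs[0], IN_B: inputs[1], BIAS: 1.0}
    pending = {dst for _, dst, _ in genome["conns"]}
    while pending:
        ready = [n for n in pending
                 if all(src in values for src, dst, _ in genome["conns"] if dst == n)]
        if not ready:
            break                               # cannot happen for these mutations
        for n in ready:
            total = sum(w * values[src] for src, dst, w in genome["conns"] if dst == n)
            values[n] = math.tanh(total)
        pending -= set(ready)
    return values.get(OUT, 0.0)

def fitness(genome):
    return -sum((activate(genome, x) - target) ** 2 for x, target in XOR)

def mutate(genome):
    child = {"next_id": genome["next_id"], "conns": [c[:] for c in genome["conns"]]}
    roll = random.random()
    if roll < 0.8:                              # usually: perturb one weight
        random.choice(child["conns"])[2] += random.gauss(0, 0.5)
    elif roll < 0.9:                            # sometimes: insert a node into a connection
        src, dst, w = child["conns"].pop(random.randrange(len(child["conns"])))
        new = child["next_id"]; child["next_id"] += 1
        child["conns"] += [[src, new, 1.0], [new, dst, w]]
    else:                                       # sometimes: add a connection from an input
        dst = random.choice(sorted({d for _, d, _ in child["conns"]}))
        child["conns"].append([random.choice((IN_A, IN_B, BIAS)), dst, random.uniform(-1, 1)])
    return child

best = minimal_genome()
for _ in range(3000):
    challenger = mutate(best)
    if fitness(challenger) >= fitness(best):
        best = challenger
print(f"{len(best['conns'])} connections, {best['next_id'] - 4} hidden nodes added, "
      f"fitness {fitness(best):.3f}")
```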

Derek - it's always a pleasure - this is great! I am unfamiliar with this work. The Krueger & Dayan paper I'll discuss tomorrow changes its own topology as well, but with unclear effects on generalization. Maybe I'll post on NEAT soon as well.

CHCH
I haven't implemented an RBF network yet (although I'm getting ready to), but my understanding is that additional hidden nodes are added to the network if the existing kernel centers are too far away from the current instance. By adding additional hidden nodes, an RBF network accumulates additional knowledge.

With standard backprop, it is tricky to adjust the weights on the fly, because the network was probably trained over multiple epochs. So the question becomes: how much significance do you give new information, so that you balance it against the already-established behaviors?
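
For concreteness, here is a rough sketch of the growth rule described above, in the spirit of Platt's resource-allocating network; the class, thresholds, and update rule are my own invention rather than a standard implementation:

```python
import numpy as np

class GrowingRBF:
    """Toy RBF network that recruits a new hidden unit (kernel center) whenever
    the current example is too far from every existing center."""

    def __init__(self, distance_threshold=0.5, width=0.5, lr=0.2):
        self.centers = []                 # one center per hidden unit
        self.weights = np.zeros(0)        # one output weight per hidden unit
        self.distance_threshold, self.width, self.lr = distance_threshold, width, lr

    def _activations(self, x):
        return np.array([np.exp(-np.sum((x - c) ** 2) / (2 * self.width ** 2))
                         for c in self.centers])

    def predict(self, x):
        return float(self.weights @ self._activations(x)) if self.centers else 0.0

    def observe(self, x, y):
        """Online update from a single (x, y) example."""
        x = np.asarray(x, dtype=float)
        too_far = (not self.centers or
                   min(np.linalg.norm(x - c) for c in self.centers) > self.distance_threshold)
        if too_far:
            # Recruit a new hidden unit centered on this example, with an output
            # weight that cancels the current prediction error at this point.
            residual = y - self.predict(x)
            self.centers.append(x)
            self.weights = np.append(self.weights, residual)
        else:
            # Otherwise take a small LMS step on the output weights only, so new
            # information nudges rather than overwrites what is already learned.
            error = y - self.predict(x)
            self.weights = self.weights + self.lr * error * self._activations(x)

# Usage: learn a 1-D function online and watch the hidden layer grow.
net = GrowingRBF()
rng = np.random.default_rng(2)
for x in rng.uniform(-3, 3, 300):
    net.observe([x], float(np.sin(x)))
print(len(net.centers), "hidden units recruited")
```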