Developing Intelligence

An early classic in computational neuroscience was a 1993 paper by Elman called “The Importance of Starting Small.” The paper describes how initial limitations in a network’s memory capacity could actually be beneficial to its learning of complex sentences, relative to networks that were “adult-like” from the start. This still seems like a beautiful idea – the cognitive limitations of children may somehow be adaptive for the learning they have yet to do.

And Elman is not alone in proposing it; a number of other researchers have suggested that a lack of cognitive control or working memory capacity could actually be beneficial. Unfortunately, there is very little behavioral data that supports this idea.

In 1999, Rohde & Plaut appeared to deal the death blow to this idea. They showed that Elman’s result holds only for a very particular type of sequential input: stimuli in which the information intervening in a long-range dependency is uncorrelated with the items showing that dependence. It’s worth pausing to consider how large the gulf is between theory and data on this point.

Newport, Braver, Thompson-Schill, Dayan, and undoubtedly others have all suggested the same general idea (with varying rationales): somehow, cognitive limitations must be advantageous. Otherwise, the cost of these limitations would surely have eliminated them, perhaps evolutionarily (for example, through children and teenagers who do something stupid and accidentally kill themselves).

In fact, when Rohde & Plaut used input stimuli in which the intervening information was correlated with the items showing long-range dependence, they actually observed an advantage (or at least no disadvantage) for starting “big.” The message was clear: something peculiar about the training data or parameters used by Elman must have driven the results.

Rohde & Plaut argue that connectionist networks intrinsically extract the more basic covariations in training data before extracting more complex ones. Subsequent work by Conway et al. has demonstrated that staged input can improve language-like learning in some cases, but the question of whether initial limitations in memory capacity can be beneficial remains where it stood as of Rohde & Plaut’s paper.

A related issue concerns the cascade-correlation algorithm for changing the topology of neural networks. Briefly, the idea is that the network spontaneously generates new processing units once its learning appears to stagnate. Some claim these networks can learn up to 1000-5000% faster than those using the more standard backpropagation algorithm with a pre-specified architecture, but I can’t find a citation to back this claim, and I can’t check it in Emergent (it doesn’t include a cascade-correlation algorithm). Nonetheless, cascade-correlation is the only implemented algorithmic “self-shaping” mechanism I know of (please see the comments section for important corrections – apparently there are many forms of this, including one described in this followup post).
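To make the mechanism concrete, here is a toy sketch of the grow-when-stuck control flow. This is my own simplification, not Fahlman & Lebiere’s full algorithm: candidate units are drawn from a random pool rather than trained, and the output layer is refit in closed form, so “stagnation” is implicit (the output fit can improve no further without new units).

```python
import math
import random

random.seed(0)

# Toy task: fit y = x^2 on a 1-D input.
xs = [i / 10.0 for i in range(-10, 11)]
ys = [x * x for x in xs]

def features(x, hidden):
    # Bias, raw input, and each frozen hidden unit's tanh output.
    return [1.0, x] + [math.tanh(w * x + b) for (w, b) in hidden]

def fit_output(hidden):
    """Retrain only the output weights (closed-form ridge regression).
    Hidden units' input weights stay frozen, as in cascade-correlation."""
    A = [features(x, hidden) for x in xs]
    n = len(A[0])
    # Normal equations with a tiny ridge term for numerical stability.
    M = [[sum(r[i] * r[j] for r in A) + (1e-6 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    v = [sum(r[i] * t for r, t in zip(A, ys)) for i in range(n)]
    # Gaussian elimination (no pivoting needed: M is positive definite).
    for i in range(n):
        for j in range(i + 1, n):
            f = M[j][i] / M[i][i]
            for k in range(i, n):
                M[j][k] -= f * M[i][k]
            v[j] -= f * v[i]
    w = [0.0] * n
    for i in range(n - 1, -1, -1):
        w[i] = (v[i] - sum(M[i][k] * w[k] for k in range(i + 1, n))) / M[i][i]
    preds = [sum(wi * ri for wi, ri in zip(w, r)) for r in A]
    mse = sum((p - t) ** 2 for p, t in zip(preds, ys)) / len(ys)
    return mse, preds

hidden = []                       # (w, b) pairs, frozen once recruited
err0, preds = fit_output(hidden)
err = err0
while err > 1e-3 and len(hidden) < 8:
    # Output training can improve no further, so recruit the candidate
    # whose activation correlates best with the residual error, then
    # freeze its input weights and retrain the output layer.
    resid = [t - p for t, p in zip(ys, preds)]
    cands = [(random.uniform(-4, 4), random.uniform(-2, 2))
             for _ in range(100)]
    def corr(c):
        h = [math.tanh(c[0] * x + c[1]) for x in xs]
        hbar = sum(h) / len(h)
        return abs(sum((hi - hbar) * ri for hi, ri in zip(h, resid)))
    hidden.append(max(cands, key=corr))
    err, preds = fit_output(hidden)

print(f"units recruited: {len(hidden)}, MSE: {err0:.4f} -> {err:.5f}")
```

The key cascade-correlation ingredients survive even in this caricature: growth is triggered by the learner’s own failure to improve, and each recruited unit’s input weights are frozen, so it becomes a permanent feature detector that later units can build on.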

That was going to be the end of this post. In a funny case of synchronicity, I discovered after writing it that Krueger & Dayan have a new paper – available online only as of New Year’s – demonstrating a new case in which the Elman result holds. I’ll discuss that in my next post.


  1. #1 Jes
    January 8, 2009

Cascade-Correlation Algorithm’s frozen input levels == Lakoff’s Embodied Metaphors??

    I’m sure that’s already been suggested before…

  2. #2 CHCH
    January 8, 2009

    sorry, i don’t follow – could you spell it out for me?

  3. #3 ds
    January 8, 2009

For other networks that attempt to add complexity, look at RBF networks, GRNN (general regression neural networks), FANRE, pi-sigma (I think), and other kernel-based neural networks.

  4. #4 CHCH
    January 8, 2009

    hi ds – thanks for the pointers, but i don’t see how RBF nets increase their processing capacity as a function of their own learning in a way that standard backprop wouldn’t. can you clarify?

  5. #5 Derek James
    January 8, 2009

There are a number of algorithms that evolve the topology along with the weights of neural networks, which is very much the “self-shaping” mechanism you’re talking about. The one I’m most familiar with is NEAT (NeuroEvolution of Augmenting Topologies), developed by Kenneth Stanley.

    One of its defining aspects is “starting small”, reducing the dimensionality of the search for an optimal topology by starting with minimal architectures and incrementally adding nodes and connections. As in Elman’s work, starting small yields better performance than starting large.
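To give a flavor of how such a genome grows, here is a toy sketch of NEAT’s “add node” structural mutation. This is my own stripped-down version: innovation numbers are tracked, but fitness evaluation, crossover, and speciation are all omitted.

```python
import random

random.seed(1)
innovation_counter = 0

def next_innovation():
    """Each structural change gets a globally unique innovation number,
    which NEAT later uses to align genomes during crossover."""
    global innovation_counter
    innovation_counter += 1
    return innovation_counter

# A genome starts minimal: two inputs wired directly to one output.
nodes = {0: "input", 1: "input", 2: "output"}
conns = [{"src": s, "dst": 2, "w": random.uniform(-1, 1),
          "enabled": True, "innov": next_innovation()} for s in (0, 1)]

def mutate_add_node(conns, nodes):
    """Split an enabled connection: disable it, insert a hidden node,
    and bridge the gap with two new connections. The incoming edge gets
    weight 1.0 and the outgoing edge inherits the old weight, so the
    mutation initially leaves the network's function nearly unchanged."""
    old = random.choice([c for c in conns if c["enabled"]])
    old["enabled"] = False
    new_id = max(nodes) + 1
    nodes[new_id] = "hidden"
    conns.append({"src": old["src"], "dst": new_id, "w": 1.0,
                  "enabled": True, "innov": next_innovation()})
    conns.append({"src": new_id, "dst": old["dst"], "w": old["w"],
                  "enabled": True, "innov": next_innovation()})

mutate_add_node(conns, nodes)
print(len(nodes), len(conns))  # 4 4 (one connection now disabled)
```

The “start small” logic is visible in the initial genome: search begins in the lowest-dimensional space (direct input-output connections) and complexity is added only through mutations like the one above.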

  6. #6 CHCH
    January 8, 2009

    Derek – it’s always a pleasure – this is great! I am unfamiliar with this work. The Krueger & Dayan paper I’ll discuss tomorrow changes its own topology as well, but with unclear effects on generalization. Maybe I’ll post on NEAT soon as well.

  7. #7 ds
    January 9, 2009

    I haven’t implemented an RBF network yet (although I’m getting ready to), but my understanding is that additional hidden nodes are added to the network if the existing kernel centers are too far away from the current instance. By adding additional hidden nodes, an RBF network accumulates additional knowledge.

With standard backprop, it is tricky to adjust weights on the fly, because the network was probably trained over multiple epochs. So the question becomes: how much weight do you give the new information, so that you balance it against the already-established behaviors?
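In code, the growth rule ds describes might look something like this. It is only a sketch (the class name, novelty threshold, and learning rate are my own choices, and the input is 1-D for brevity), close in spirit to a resource-allocating network:

```python
import math

class GrowingRBF:
    """Sketch of the scheme described above: recruit a new Gaussian
    kernel whenever the nearest existing center is farther than a
    novelty threshold from the incoming instance."""

    def __init__(self, novelty=0.5, width=1.0, lr=0.2):
        self.centers, self.weights = [], []
        self.novelty, self.width, self.lr = novelty, width, lr

    def _phi(self, x):
        return [math.exp(-((x - c) ** 2) / (2 * self.width ** 2))
                for c in self.centers]

    def predict(self, x):
        return sum(w * p for w, p in zip(self.weights, self._phi(x)))

    def observe(self, x, y):
        if not self.centers or min(abs(x - c) for c in self.centers) > self.novelty:
            # Novel region: add a hidden node centered on this instance,
            # with an output weight that absorbs the current residual.
            self.centers.append(x)
            self.weights.append(y - self.predict(x))
        else:
            # Familiar region: nudge only the output weights, scaled by
            # each kernel's activation -- distant kernels barely move,
            # so established knowledge is largely preserved (unlike
            # retraining a backprop net, where every weight shifts).
            err = y - self.predict(x)
            for i, p in enumerate(self._phi(x)):
                self.weights[i] += self.lr * err * p

net = GrowingRBF(novelty=0.5)
for x, y in [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0), (0.1, 0.01)]:
    net.observe(x, y)
print(len(net.centers))  # the last point falls near an existing center -> 3
```

Because each kernel responds only locally, new knowledge is stored mostly in new units, which is exactly why the interference problem ds raises for backprop is less severe here.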

  8. #8 chat
    January 11, 2009

    One of the things that i love about science is that no one is above reproach.
