There's little evidence that "staging" the training of neural networks on language-like input (feeding them part of the problem space initially, and scaling that up as they learn) confers any consistent benefit for their long-term learning (as reviewed yesterday).
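To make the idea concrete, here is a minimal sketch of what such "staging" looks like in practice: a small recurrent network is trained first on short sequences only, with the admissible length widened at each stage. The toy copy-style task, model, and stage schedule below are illustrative assumptions of mine, not the setup used in the studies reviewed in the earlier post.

```python
# A minimal sketch of "starting small": train on short sequences first,
# then progressively admit longer ones. All names and the toy task are
# illustrative assumptions, not the reviewed studies' actual setup.
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, HIDDEN, PAD = 12, 32, 0

class TinyLM(nn.Module):
    """A small LSTM next-token predictor."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN, padding_idx=PAD)
        self.rnn = nn.LSTM(HIDDEN, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))
        return self.out(h)

def toy_batch(batch_size, max_len):
    """Random token sequences up to max_len; target is the next token."""
    lengths = torch.randint(2, max_len + 1, (batch_size,)).tolist()
    seqs = torch.full((batch_size, max_len + 1), PAD, dtype=torch.long)
    for i, length in enumerate(lengths):
        seqs[i, : length + 1] = torch.randint(1, VOCAB, (length + 1,))
    return seqs[:, :-1], seqs[:, 1:]

model = TinyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD)

# "Staged" curriculum: each stage widens the range of sequence lengths
# the network is exposed to, starting small and scaling up.
for stage, max_len in enumerate([3, 6, 12], start=1):
    for step in range(200):
        x, y = toy_batch(32, max_len)
        logits = model(x)
        loss = loss_fn(logits.reshape(-1, VOCAB), y.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"stage {stage} (max_len={max_len}): loss={loss.item():.3f}")
```

The un-staged control would simply train on the full length range (here, max_len=12) from the start; the replication literature summarized below compares exactly these two regimes.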
To summarize that post, early computational demonstrations of the importance of starting small were subsequently cast into doubt by numerous replication failures, with one exception: the benefit of starting small is replicable when the training data lack temporal correlations. This leads to slow learning of the…