Counting Chinese Words

It has been said that "word frequency" is the most important variable in language research, despite the belief by many that it can't be used as a variable because no one really knows what a word is. (see: Minifalsehood: We can't tell what a word is!?!? and A run in my stocking ...)

A recent study in PLoS ONE looks at a heretofore under-investigated area: word and character use in Chinese.

Following recent work by New, Brysbaert, and colleagues in English, French and Dutch, we assembled a database of word and character frequencies based on a corpus of film and television subtitles (46.8 million characters, 33.5 million words). In line with what has been found in the other languages, the new word and character frequencies explain significantly more of the variance in Chinese word naming and lexical decision performance than measures based on written texts.

So, the leading edge is where the mixing happens. The study concludes that subtitle-based word frequencies do a good job of estimating daily language exposure and capture the patterns of variance in word processing. Furthermore, this work generated a database that ...

... is the first to include information about the contextual diversity of the words and to provide good frequency estimates for multi-character words and the different syntactic roles in which the words are used. The word frequencies are freely available for research purposes.
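To make the two measures mentioned above a little more concrete, here is a minimal Python sketch of how a word frequency and a contextual diversity score might be computed from subtitles. It is not the authors' pipeline. It assumes the subtitles have already been segmented into words (a real step of its own for Chinese, skipped here), and it assumes contextual diversity means the share of films in which a word appears. The function name `subtitle_stats` and the toy data are made up for illustration.

```python
from collections import Counter


def subtitle_stats(films):
    """Compute per-word statistics from a toy subtitle corpus.

    films: dict mapping a film id to a list of word tokens.
    Tokens are assumed to be pre-segmented (hypothetical input format).
    """
    word_counts = Counter()  # total occurrences of each word
    film_counts = Counter()  # number of films containing each word

    for tokens in films.values():
        word_counts.update(tokens)
        film_counts.update(set(tokens))  # count each word once per film

    total_tokens = sum(word_counts.values())
    n_films = len(films)

    return {
        word: {
            "count": count,
            # Frequency normalized to occurrences per million words.
            "per_million": count * 1_000_000 / total_tokens,
            # Assumed definition: fraction of films that contain the word.
            "contextual_diversity": film_counts[word] / n_films,
        }
        for word, count in word_counts.items()
    }


# Toy corpus of three "films" with three tokens each.
films = {
    "film_a": ["我", "喜欢", "电影"],
    "film_b": ["我", "喜欢", "你"],
    "film_c": ["我", "走", "了"],
}
stats = subtitle_stats(films)
print(stats["我"]["count"], stats["我"]["contextual_diversity"])
```

In this toy corpus, a word that shows up in every film gets a contextual diversity of 1.0 no matter how often it is repeated within a single film, which is the sense in which the measure differs from a raw count.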

The most surprising result of this study is probably the degree to which the word frequency data did NOT represent a biased or strange subset of the language. Because movies treat certain situations more frequently than others, tend to be thematically "American," and carry subtitles that are not exactly what is being said on screen, among other reasons, it was thought that this would be an interesting but complementary (or at least different) word set. But ...

It was only when we saw how well these word frequencies were doing to predict word processing times for thousands of words ... that we started to appreciate their potential. Despite their shortcomings, subtitle frequencies are a very good indication of how long participants need to recognize words. They also better predict which words will be known to the participants and which not.

If you are inclined, you can read this study (in English) at PLoS.

Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese Word and Character Frequencies Based on Film Subtitles. PLoS ONE, 5(6). DOI: 10.1371/journal.pone.0010729
