Counting Chinese Words

By gregladen on June 9, 2010.

It has been said that "word frequency" is the most important variable in language research, despite the belief by many that it can't be used as a variable because no one really knows what a word is. (see: Minifalsehood: We can't tell what a word is!?!? and A run in my stocking ...)

A recent study in PLoS ONE looks at a heretofore underinvestigated area: word and character use in Chinese.

Following recent work by New, Brysbaert, and colleagues in English, French and Dutch, we assembled a database of word and character frequencies based on a corpus of film and television subtitles (46.8 million characters, 33.5 million words). In line with what has been found in the other languages, the new word and character frequencies explain significantly more of the variance in Chinese word naming and lexical decision performance than measures based on written texts.

So, the leading edge is where the mixing happens. The study concludes that subtitle-based word frequencies do a good job of estimating daily language exposure and of capturing the patterns of variance in word processing. Furthermore, this work generated a database that ...

... is the first to include information about the contextual diversity of the words and to provide good frequency estimates for multi-character words and the different syntactic roles in which the words are used. The word frequencies are freely available for research purposes.
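To make those two measures concrete, here is a minimal Python sketch of how a word's frequency and its contextual diversity could be tallied from a subtitle corpus. This is not the authors' code. It assumes the subtitles have already been split into words (for Chinese, a non-trivial step of its own), it takes "contextual diversity" to mean the number of films in which a word appears, and the function name and toy data are made up for illustration.

```python
from collections import Counter


def frequency_table(films):
    """Tally word counts and contextual diversity.

    films: dict mapping a film title to its list of already-segmented words
           (hypothetical input format, chosen for this sketch).
    """
    counts = Counter()     # total occurrences of each word across all films
    diversity = Counter()  # number of films in which each word appears
    for words in films.values():
        counts.update(words)
        diversity.update(set(words))  # each film counts a word at most once
    total = sum(counts.values())
    return {
        word: {
            "count": n,
            "per_million": n * 1_000_000 / total,
            "films": diversity[word],
        }
        for word, n in counts.items()
    }


# Toy, invented data: three "films" with a handful of words each.
films = {
    "film_a": ["我", "知道", "你", "知道"],
    "film_b": ["你", "好"],
    "film_c": ["我", "好"],
}
table = frequency_table(films)
print(table["知道"])
```

In the toy data, 知道 and 我 each occur twice, but 知道 appears in only one film while 我 appears in two. That difference is what a contextual diversity measure is meant to pick up.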

The most surprising result of this study is probably the degree to which the word frequency data did NOT represent a biased or strange subset of the language. Because movies treat certain situations more frequently than others and tend to be thematically "American," because subtitles are not exactly what is being said on screen, and for other reasons, it was thought that this would be an interesting but complementary (or at least different) word set. But ...

It was only when we saw how well these word frequencies were doing to predict word processing times for thousands of words ... that we started to appreciate their potential. Despite their shortcomings, subtitle frequencies are a very good indication of how long participants need to recognize words. They also better predict which words will be known to the participants and which not.

If you are inclined, you can read this study (in English) at PLoS.

Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese Word and Character Frequencies Based on Film Subtitles. PLoS ONE, 5(6). DOI: 10.1371/journal.pone.0010729
