Does anyone around here know of a program or programs that can do the following things with text:
- Frequency counts for parts of speech (nouns, verbs, adjectives, etc.).
- Sort or score words/phrases based on how abstract or concrete they are.
UPDATE: Thank you everyone for the suggestions and tips. I'll try them out tomorrow when I get in the lab.
Since I asked without giving you any details, let me give you a brief, though vague description of the project. A few years ago, another psychologist and I wrote a review/theory paper about a particular type of category that we thought sounded plausible, and could have important implications for concept research. We tried a bunch of different ways to test for the existence of these categories empirically after we published the paper, but it proved difficult, mostly due to my own lack of creativity, and ultimately the research program stalled. However, this spring, I sat down with another colleague who'd been doing research that was related, though not directly linked to the paper. In one lunch (well, I just had coffee), he and I came up with a bunch of possible empirical routes, one of which involved the typical/ideal distinction that the concepts folks out there might recognize from Larry Barsalou's work on ad hoc categories from the 80s and Doug Medin's work on concepts and expertise. Basically, we wondered if the prototypical members of our type of category might be ideals, rather than central tendencies, much as the prototypical members of ad hoc categories, and the categories of some experts, are ideals. If that was the case, then we'd have a pretty good way of determining whether a particular category was one of ours or not.
To make a long story short, we had participants list characteristics for and examples of typical and ideal members of various natural categories. We weren't expecting to find anything in this particular task (it's meant to serve as a comparison for another task), but in entering all the characteristics people listed, I began to notice some things, like possible differences in the word types (e.g., adjectives vs. nouns) used to describe different categories, and in the abstractness of the characteristics (not surprising, since adjectives tend to be more abstract). After the three of us working on the project talked it over, we decided there might be something interesting in there, but we weren't sure exactly how to measure those sorts of things.
The MRC Psycholinguistic Database will have those parameters (concreteness and imageability ratings, plus part of speech) for individual words.
It's also pretty easy to load it into a database and query the DB using Perl, so you can read a file and go word by word. Imageability of phrases might be trickier.
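For what it's worth, here is a rough sketch of that load-and-query idea. It uses Python and SQLite rather than Perl, and everything specific in it is an assumption: the table layout, the column names (`conc`, `imag`), and the three toy rows are made up for illustration and are not real MRC ratings. You would replace `TOY_NORMS` with rows parsed from your own copy of the database.

```python
import re
import sqlite3

# Toy rows: (word, concreteness, imageability). Placeholder values for
# illustration only, NOT real MRC ratings.
TOY_NORMS = [
    ("dog", 600, 620),
    ("truth", 250, 300),
    ("table", 590, 580),
]


def build_db(rows):
    """Load (word, concreteness, imageability) rows into an in-memory SQLite DB."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE norms (word TEXT PRIMARY KEY, conc INTEGER, imag INTEGER)"
    )
    conn.executemany("INSERT INTO norms VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def rate_text(conn, text):
    """Go word by word through `text`; unknown words get (None, None)."""
    results = []
    for word in re.findall(r"[a-z']+", text.lower()):
        row = conn.execute(
            "SELECT conc, imag FROM norms WHERE word = ?", (word,)
        ).fetchone()
        results.append((word, *(row if row else (None, None))))
    return results
```

Calling `rate_text(build_db(TOY_NORMS), "The dog knows truth")` returns one tuple per word, with `None` ratings for "the" and "knows" because they are not in the toy table. Counting those misses is a quick way to see how much of your participants' vocabulary the norms actually cover.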
The most accurate way to determine word type is to use a part-of-speech (POS) tagger. There are several open source POS taggers available, with reported per-token accuracies of around 96 to 97% on newswire text. My favorite is the one by Adwait Ratnaparkhi, from the University of Pennsylvania, written in Java.
You'll need to write a little code to run it and use the output ...
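As a minimal sketch of the "use the output" part: the snippet below assumes the tagger writes plain text with each token in `word_TAG` form and Penn Treebank tags. That output format is an assumption, so check what your tagger actually emits. The `COARSE` mapping and the function name `pos_counts` are my own, not part of any tagger.

```python
from collections import Counter

# Map the first two letters of a Penn Treebank tag to a coarse word class.
# NN, NNS, NNP, NNPS -> noun; VB, VBD, VBZ, ... -> verb; JJ, JJR, JJS ->
# adjective; RB, RBR, RBS -> adverb. Everything else is lumped as "other".
COARSE = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}


def pos_counts(tagged_text):
    """Count coarse word classes in whitespace-separated word_TAG tokens."""
    counts = Counter()
    for token in tagged_text.split():
        # Split on the LAST underscore so words containing "_" still parse.
        word, sep, tag = token.rpartition("_")
        if not sep:
            continue  # token has no tag attached; skip it
        counts[COARSE.get(tag[:2], "other")] += 1
    return counts
```

For example, `pos_counts("The_DT quick_JJ dog_NN runs_VBZ fast_RB ._.")` gives one each of noun, verb, adjective, and adverb, plus two "other" (the determiner and the period). Run it once per category's feature list and you have the frequency counts to compare.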
Abstract vs concrete is a little harder. How do you define abstract vs concrete? One way is to use the WordNet database (open source) and look at the hypernym/hyponym relationship - see
or read the excellent book on WordNet.
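To make the hypernym idea concrete, here is a toy sketch: count how many "is a kind of" links separate a word from the top of the hierarchy, and treat words nearer the top as more general. The `HYPERNYM` table is hand-written for illustration and is not WordNet data. Real WordNet entries have several senses and sometimes several hypernyms per sense, so you would have to pick a sense first. Depth is also only a rough proxy for specificity, which is not quite the same thing as concreteness.

```python
# Toy child -> parent (hypernym) links, hand-written for illustration.
# This is NOT WordNet data; a real analysis would read the links from WordNet.
HYPERNYM = {
    "poodle": "dog",
    "dog": "animal",
    "animal": "organism",
    "organism": "entity",
    "idea": "entity",
}


def depth(word, hypernym=HYPERNYM):
    """Number of hypernym links from `word` up to the root; None if unknown."""
    if word not in hypernym and word not in hypernym.values():
        return None
    steps = 0
    seen = set()
    while word in hypernym:
        if word in seen:
            raise ValueError("cycle in hypernym links")
        seen.add(word)
        word = hypernym[word]
        steps += 1
    return steps
```

With this toy table, `depth("poodle")` is 4 and `depth("idea")` is 1, so "poodle" scores as far more specific than "idea". Whether that ordering tracks what your raters would call abstract vs concrete is something to check against a sample of hand-coded features.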
Good luck. Sounds like a fun little project.
You might try the University of South Florida norms set up by Douglas Nelson. I believe there are imageability and concreteness ratings for many English words. You'll have to set something up to parse the files, though.
Have a look at LIWC: "Linguistic Inquiry and Word Count (LIWC) is a text analysis software program designed by James W. Pennebaker, Roger J. Booth, and Martha E. Francis."
I'm about to use it in some current research. Not sure whether it does abstract/concrete; I haven't had a close look at it yet.
The MRC database is quite good for a lot of things but uses the ancient (40-year-old) Kucera & Francis written word frequency norms. You might also try CELEX and/or the Linguistic Data Consortium. [The inclusion of links seemed to toss the comment into the trash bin, so they're omitted.]
The Natural Language Toolkit (http://nltk.sourceforge.net/index.php/Main_Page) should work for you. It can do a lot of useful things.
1. Doable, with some good suggestions above. I would recommend CELEX.
2. I don't think what you want is possible. The established norms for concreteness ratings aren't large enough: they cover fewer than a few thousand words, with no coverage of phrases. You would have to come up with your own concreteness rating norms, or a coding scheme for the features people list.