There’s been a lot of energy expended blogging and writing about the LA Times’s investigation of teacher performance in Los Angeles, using “Value Added Modeling,” which basically looks at how much a student’s scores improved during a year with a given teacher. Slate rounds up a lot of reactions, in a slightly snarky form, and Kevin Drum has some reactions of his own, along with links to two posts from Kevin Carey, who blogs about this stuff regularly. Finally, Crooked Timber has a post about a recent study showing that value-added models aren’t that great (as CT is one of the few political blogs whose comments aren’t a complete sewer, it’s worth reading the ensuing discussion as well).
Given all that, there’s not a whole lot left to say, but since I have strong opinions on the subject, I feel like I ought to say something. First and foremost, I really like Kevin Drum’s summary of the problem:
But the problem with teachers is that assessing their performance isn’t just hard, it’s even harder than any of those other professions. Product managers interact closely with a huge number of people who can all provide input about how good they are. CEOs have to produce sales and earnings. Magazine editors and bloggers need readers.
But teachers, by definition, work alone in a classroom, and they’re usually observed only briefly and by one person. And their output — well-educated students — is almost impossible to measure. If I had to invent a profession where performance would be hard to measure with any accuracy or reliability, it would end up looking a lot like teaching.
This is basically what I’ve said dozens of times before. Evaluating teachers is really difficult, and the report linked by Crooked Timber gives one really nice demonstration of just how bad even the value-added method (described by Kevin Carey as “the worst form of teacher evaluation but it’s better than everything else”) can be:
A study designed to test this question used VAM methods to assign effects to teachers after controlling for other factors, but applied the model backwards to see if credible results were obtained. Surprisingly, it found that students’ fifth grade teachers were good predictors of their fourth grade test scores. Inasmuch as a student’s later fifth grade teacher cannot possibly have influenced that student’s fourth grade performance, this curious result can only mean that VAM results are based on factors other than teachers’ actual effectiveness.
This is a major, major problem for any attempt to use this as an evaluation scheme.
That said, I think discussion of and research into these questions is ultimately a good thing.
That doesn’t mean I really approve of the LA Times’s grandstanding, which seems to be more about making a splash and boosting readership than any sincere desire to get to the bottom of this issue. But if that’s what it takes to get public officials to start collecting the data you would need to really study this problem, then it’s probably to the good.
There are severe problems with even VAM evaluations, which are subject to very large fluctuations:
One study found that across five large urban districts, among teachers who were ranked in the top 20% of effectiveness in the first year, fewer than a third were in that top group the next year, and another third moved all the way down to the bottom 40%. Another found that teachers’ effectiveness ratings in one year could only predict from 4% to 16% of the variation in such ratings in the following year. Thus, a teacher who appears to be very ineffective in one year might have a dramatically different result the following year. The same dramatic fluctuations were found for teachers ranked at the bottom in the first year of analysis.
There might, however, be ways to tease something useful out of the data. Year-by-year fluctuations may be very large, but does a three-year rolling average, for example, give you more consistent results? Are there factors that haven’t been controlled for that might be taken into account in a new study?
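To make the rolling-average question concrete, here is a toy simulation. It is only a sketch under assumptions of my own choosing, not an analysis of any real data: each simulated teacher gets a fixed "true" effect plus independent noise each year, with the noise level (`noise_sd=1.5`, a made-up value) picked so that single-year ratings correlate at roughly 0.3 from one year to the next. That corresponds to about 9% of variance explained, inside the 4% to 16% range quoted above. The function names and parameters are all hypothetical.

```python
import random
import statistics


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate(n_teachers=5000, n_years=6, true_sd=1.0, noise_sd=1.5, seed=0):
    """Return one list of yearly ratings per teacher.

    Assumption: rating = fixed true effect + independent yearly noise.
    Real ratings could easily violate this (drifting skill, correlated noise).
    """
    rng = random.Random(seed)
    ratings = []
    for _ in range(n_teachers):
        true_effect = rng.gauss(0, true_sd)
        ratings.append(
            [true_effect + rng.gauss(0, noise_sd) for _ in range(n_years)]
        )
    return ratings


ratings = simulate()

# Consistency of single-year ratings: year 1 versus year 2.
year1 = [r[0] for r in ratings]
year2 = [r[1] for r in ratings]
r_single = pearson(year1, year2)

# Consistency of three-year averages: years 1-3 versus years 4-6.
# The windows do not overlap, so shared years cannot inflate the correlation.
avg_first3 = [statistics.fmean(r[0:3]) for r in ratings]
avg_last3 = [statistics.fmean(r[3:6]) for r in ratings]
r_avg = pearson(avg_first3, avg_last3)

print(f"single-year correlation:        {r_single:.2f}")
print(f"three-year-average correlation: {r_avg:.2f}")
```

In this idealized setup, averaging three years cuts the noise variance to a third, so the expected correlation rises from about 0.31 to about 0.57. That is better, but it still leaves most of the variation unexplained, and it only holds if the noise really is independent from year to year, which is exactly the kind of thing better data would let you check.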
The research clearly seems to indicate that an annual evaluation based on test scores, even value-added test scores, is next to useless. And the strong correlations between test scores and socioeconomic factors mean that these should absolutely not be used for any kind of state-wide or national merit evaluations. But that doesn’t mean that there isn’t anything to be gained by studying the question, and collecting lots of data is a good place to start.
I haven’t had time to go through the EPI report in detail (I had vain hopes of doing so, which is why this is two days later than all the other posts on the topic), but I did want to pull out one other tidbit that struck me as interesting:
A second reason to be wary of evaluating teachers by their students’ test scores is that so much of the promotion of such approaches is based on a faulty analogy–the notion that this is how the private sector evaluates professional employees. In truth, although payment for professional employees in the private sector is sometimes related to various aspects of their performance, the measurement of this performance almost never depends on narrow quantitative measures analogous to test scores in education.
Rather, private-sector managers almost always evaluate their professional and lower-management employees based on qualitative reviews by supervisors; quantitative indicators are used sparingly and in tandem with other evidence. Management experts warn against significant use of quantitative measures for making salary or bonus decisions.
There’s even a scholarly citation, to pp. 93-96 of this book. Add to that the fact that obvious incompetents somehow hang on to private-sector jobs far longer than the arguments made for various teacher evaluation schemes would have you believe (insert your favorite bad customer service story here), and you have something to keep in mind the next time the subject comes up.