Teacher Evaluation and Test Scores, aleph-nought in a series

By drorzel on September 1, 2010.

There's been a lot of energy expended blogging and writing about the LA Times's investigation of teacher performance in Los Angeles, using "Value Added Modeling," which basically looks at how much a student's scores improved during a year with a given teacher. Slate rounds up a lot of reactions, in a slightly snarky form, and Kevin Drum has some reactions of his own, along with links to two posts from Kevin Carey, who blogs about this stuff regularly. Finally, Crooked Timber has a post about a recent study showing that value-added models aren't that great (as CT is one of the few political blogs whose comments aren't a complete sewer, it's worth reading the ensuing discussion as well).

Given all that, there's not a whole lot left to say, but since I have strong opinions on the subject, I feel like I ought to say something. First and foremost, I really like Kevin Drum's summary of the problem:

But the problem with teachers is that assessing their performance isn't just hard, it's even harder than any of those other professions. Product managers interact closely with a huge number of people who can all provide input about how good they are. CEOs have to produce sales and earnings. Magazine editors and bloggers need readers.

But teachers, by definition, work alone in a classroom, and they're usually observed only briefly and by one person. And their output -- well-educated students -- is almost impossible to measure. If I had to invent a profession where performance would be hard to measure with any accuracy or reliability, it would end up looking a lot like teaching.

This is basically what I've said dozens of times before. Evaluating teachers is really difficult, and the report linked by Crooked Timber gives one really nice demonstration of just how bad even the value-added method (described by Kevin Carey as "the worst form of teacher evaluation but it's better than everything else") can be:

A study designed to test this question used VAM methods to assign effects to teachers after controlling for other factors, but applied the model backwards to see if credible results were obtained. Surprisingly, it found that students' fifth grade teachers were good predictors of their fourth grade test scores. Inasmuch as a student's later fifth grade teacher cannot possibly have influenced that student's fourth grade performance, this curious result can only mean that VAM results are based on factors other than teachers' actual effectiveness.

This is a major, major problem for any attempt to use this as an evaluation scheme.

That said, I think discussion of and research into these questions is ultimately a good thing.

That doesn't mean I really approve of the LA Times's grandstanding, which seems to be more about making a splash and boosting readership than any sincere desire to get to the bottom of this issue. But if that's what it takes to get public officials to start collecting the data you would need to really study this problem, then it's probably to the good.

There are severe problems with even VAM evaluations, which are subject to very large fluctuations:

One study found that across five large urban districts, among teachers who were ranked in the top 20% of effectiveness in the first year, fewer than a third were in that top group the next year, and another third moved all the way down to the bottom 40%. Another found that teachers' effectiveness ratings in one year could only predict from 4% to 16% of the variation in such ratings in the following year. Thus, a teacher who appears to be very ineffective in one year might have a dramatically different result the following year. The same dramatic fluctuations were found for teachers ranked at the bottom in the first year of analysis.

There might, however, be ways to tease something useful out of the data. Year-by-year fluctuations may be very large, but does a three-year rolling average, for example, give you more consistent results? Are there factors that haven't been controlled for that might be taken into account in a new study?
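The rolling-average question can at least be explored with a toy simulation. The sketch below is not drawn from any of the studies cited here; it assumes a hypothetical model in which each teacher has a fixed "true" effect and each year's rating adds independent noise. The noise level (`noise_sd=2.0` against `skill_sd=1.0`) is an assumption chosen so that one year's rating explains about 4% of the variation in the next, the low end of the range quoted above. Under that model the expected year-to-year correlation is 1/(1+4) = 0.2, while for non-overlapping three-year averages it is 1/(1+4/3), or about 0.43.

```python
import random
from statistics import fmean


def simulate_ratings(n_teachers=5000, n_years=6, skill_sd=1.0, noise_sd=2.0, seed=0):
    """Hypothetical model: rating = fixed teacher effect + independent yearly noise."""
    rng = random.Random(seed)
    ratings = []
    for _ in range(n_teachers):
        skill = rng.gauss(0.0, skill_sd)
        ratings.append([skill + rng.gauss(0.0, noise_sd) for _ in range(n_years)])
    return ratings


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


ratings = simulate_ratings()

# One year's rating compared with the following year's rating.
single_r = pearson_r([r[2] for r in ratings], [r[3] for r in ratings])

# Average of years 1-3 compared with the average of years 4-6
# (non-overlapping windows, so the two averages share no noise).
rolling_r = pearson_r(
    [fmean(r[0:3]) for r in ratings],
    [fmean(r[3:6]) for r in ratings],
)

print(f"single year vs. next year:          r^2 = {single_r ** 2:.2f}")
print(f"3-year average vs. next 3-year avg: r^2 = {rolling_r ** 2:.2f}")
```

In this toy model averaging helps, but even three years of data leave most of the variation unexplained. Real ratings need not behave this way: if the noise is correlated from year to year (the same disruptive students, the same non-random sorting), averaging buys less than the simulation suggests.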

The research clearly seems to indicate that an annual evaluation based on test scores, even value-added test scores, is next to useless. And the strong correlations between test scores and socioeconomic factors mean that these should absolutely not be used for any kind of state-wide or national merit evaluations. But that doesn't mean that there isn't anything to be gained by studying the question, and collecting lots of data is a good place to start.

I haven't had time to go through the EPI report in detail (I had vain hopes of doing so, which is why this is two days later than all the other posts on the topic), but I did want to pull out one other tidbit that struck me as interesting:

A second reason to be wary of evaluating teachers by their students' test scores is that so much of the promotion of such approaches is based on a faulty analogy--the notion that this is how the private sector evaluates professional employees. In truth, although payment for professional employees in the private sector is sometimes related to various aspects of their performance, the measurement of this performance almost never depends on narrow quantitative measures analogous to test scores in education.
Rather, private-sector managers almost always evaluate their professional and lower-management employees based on qualitative reviews by supervisors; quantitative indicators are used sparingly and in tandem with other evidence. Management experts warn against significant use of quantitative measures for making salary or bonus decisions.

There's even a scholarly citation, to pp. 93-96 of this book. Add to that the fact that obvious incompetents somehow hang onto private-sector jobs far longer than many of the assertions made in favor of various teacher evaluation schemes would have you believe (insert your favorite bad customer service story here), and you have something to keep in mind the next time the subject comes up.

The probability of a simple quantitative measure proving useful for promoting or paying teachers is about nil. The best you can hope for is a system - probably not based on a quantitative pseudo-science - for sacking the real stinkers.

Even observing teacher performance is an iffy process because the addition of an observer, just as in physics, can alter the classroom dynamic. And then all of us have been crucified by some student in evaluations, only to have them realize some years later that we did them a great favor.

Surprisingly, it found that students' fifth grade teachers were good predictors of their fourth grade test scores.

the study:
http://www.nber.org/papers/w14442

The long and the short of it is that a lot of variance in VAM models is due to factors assumed not to exist in the models, such as non-random sorting of students. For example, think of the different learning environments in classrooms that have either 0, 1, or 2 extremely disruptive students.

VAM (as implemented by school districts) might be too noisy for evaluation of teachers, but it would probably still work well for evaluating schools, where some of the noise would cancel out.
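The "noise would cancel out" argument can be made concrete with a back-of-the-envelope formula. This is only a sketch under assumptions that are not in the comment above: each teacher's estimate is written as a true effect plus noise, \hat\theta_i = \theta_i + \varepsilon_i, the noise terms are independent with common variance \sigma_\varepsilon^2, and a school has k teachers.

```latex
\bar{\hat\theta}_{\text{school}}
  = \frac{1}{k}\sum_{i=1}^{k}\hat\theta_i
  = \frac{1}{k}\sum_{i=1}^{k}\theta_i + \frac{1}{k}\sum_{i=1}^{k}\varepsilon_i,
\qquad
\operatorname{Var}\!\left(\frac{1}{k}\sum_{i=1}^{k}\varepsilon_i\right)
  = \frac{\sigma_\varepsilon^2}{k}
```

With k = 25 teachers, for example, the noise standard deviation in the school average is a fifth of what it is for a single teacher. Only independent noise shrinks this way. Anything shared across a whole school, such as non-random sorting of students into schools, does not average out.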

One study found that across five large urban districts, among teachers who were ranked in the top 20% of effectiveness in the first year, fewer than a third were in that top group the next year, and another third moved all the way down to the bottom 40%.

This is the well-known phenomenon of reversion to the mean. Most of what it takes to reach the top 20% is simply luck of the draw. There might be a small contribution from actual skill, but without knowing details (what the exact percentages were and how many were in the sample) I can't even be sure of that. (There is also the possibility, depending on test design, that some teachers may be cheating. See Freakonomics for an example from Chicago, which also demonstrates a method for identifying truly effective teachers.) It also tells us why it's so hard to weed out the handful of truly incompetent teachers: they are hard to identify among the false positives in these tests.
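How far reversion to the mean alone can push teachers out of the top group is easy to check with a toy simulation. The sketch below is an illustration, not a reanalysis of the study quoted above: the model (rating = fixed teacher effect + independent yearly noise) and all the parameter values are assumptions, and `year_pair` and `top_group_fate` are hypothetical helper names.

```python
import random


def year_pair(n_teachers, skill_sd, noise_sd, seed):
    """Return (year-1 rating, year-2 rating) per teacher under the assumed model."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_teachers):
        skill = rng.gauss(0.0, skill_sd)
        pairs.append((skill + rng.gauss(0.0, noise_sd),
                      skill + rng.gauss(0.0, noise_sd)))
    return pairs


def top_group_fate(pairs, top=0.20, bottom=0.40):
    """Of the year-1 top group, the fractions that stay on top / fall to the bottom in year 2."""
    n = len(pairs)
    n_top = round(top * n)
    n_bottom = round(bottom * n)
    by_year1 = sorted(range(n), key=lambda i: pairs[i][0], reverse=True)
    by_year2 = sorted(range(n), key=lambda i: pairs[i][1], reverse=True)
    rank_year2 = {teacher: rank for rank, teacher in enumerate(by_year2)}
    top_year1 = by_year1[:n_top]
    stayed = sum(rank_year2[t] < n_top for t in top_year1)
    fell = sum(rank_year2[t] >= n - n_bottom for t in top_year1)
    return stayed / n_top, fell / n_top


# Assumed case 1: a small real skill signal buried in larger noise.
stay_skill, fall_skill = top_group_fate(year_pair(20000, 1.0, 2.0, seed=1))

# Assumed case 2: no skill differences at all, ratings are pure luck.
stay_luck, fall_luck = top_group_fate(year_pair(20000, 0.0, 2.0, seed=2))

print(f"some skill: {stay_skill:.0%} stay in top 20%, {fall_skill:.0%} fall to bottom 40%")
print(f"pure luck:  {stay_luck:.0%} stay in top 20%, {fall_luck:.0%} fall to bottom 40%")
```

With pure luck, about 20% of the top group should stay on top and about 40% should land in the bottom 40%, simply because those are the sizes of the groups. The quoted figures ("fewer than a third" and "another third") sit between that baseline and a world with perfectly stable ratings, which is what a modest skill signal plus a lot of noise would look like.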

I agree that evaluating teachers is really difficult. Certainly standardized test scores and sporadic classroom visits are poor tools at best, but I disagree that they are "next to useless". If you properly control for student demographics, year-to-year fluctuations etc., it's a useful tool, and many of the objections in the EPI Briefing Paper go away.

I'm not familiar with the LA stuff, but I think what's been going on in the Washington DC school system is a move in the right direction.

If you have two teachers teaching the same grade level in the same school (with students randomly assigned between classrooms) and one teacher consistently manages to increase the "reading-level" of her students on the tests (with respect to the district-wide average), while the other teacher's students consistently fall further behind on the assessments, I don't think it's too crazy to conclude that the first teacher is more effective, and I don't think it's at all crazy to act on this data. In DC, they were finding circumstances like what I described above (I have lost the reference to the original article) where individual teachers in adjacent classrooms were having drastically different outcomes as per the testing. And I think Teach For America has done some pretty good (preliminary) research determining the teacher qualities that are responsible for these differences.

The assumption seems to be that the issue is defining a metric for teachers that correlates with some performance measure related to student learning. What the data may be saying is that teacher performance is not a strong determinant of how students do. So what are the dominant factors? Oh, I don't know, maybe PARENTS and socio-economic conditions and PARENTS and racial bias and PARENTS.

Easy. This is a solved problem in college teaching. We just need to hand out course evaluation forms to 4th graders at the end of every semester.
