More Evidence of How Value-Added Testing Fails at Teacher Evaluation

Last week, E.D. Kain took Megan McArdle to task for promoting the use of student testing as a means to evaluate teachers. This, to me, was the key point:

....nobody is arguing against tests as a way to measure outcomes. Anti-standardized-test advocates are arguing against the way tests are being used, and the prioritization of tests. If you really, truly want to measure outcomes, you should not create a system that incentivizes teaching to a test. Teaching to a test not only narrows the curriculum, it means that teachers prepare students specifically for the test. This skews the outcomes of test scores enormously. Testing should be done outside of normal instruction so that each teacher, school, and student can be fairly measured.

Tests are a good, if not absolutely perfect, way of assessing how well students have learned (provided the tests are well designed). If you're trying to assess how a particular change in teaching works (e.g., a new math curriculum), you do need some way to measure performance.

But where 'reformers' go off the rails is in their persistent belief that testing is a good way to evaluate how well a teacher has taught* (this belief also seems to imply that many teachers aren't performing up to snuff, but I'll let that slide...).

First, value-added testing, the best (or least worst) method, rests on methodological assumptions, such as random** assignment of students to classes, that are usually violated. In one study, these violations led to fifth grade teachers appearing to affect their students' fourth grade performance nearly as much as the fourth grade teachers themselves did. Yes, you read that last sentence correctly. Either there are problems with the method (likely), or else this school system routinely violates our current assumptions about space-time (not so likely).

Second, the precision in figuring out how well a teacher taught is, to be charitable, non-existent. When a teacher's estimate can range from abysmal to 'middle of the pack--let's give her tenure', this isn't a very precise measure. An evaluation scheme this capricious can best be described as 'demotivational.'

A 2010 study from the Annenberg Institute for School Reform, authored by Sean Corcoran, describes just how imprecise the estimates of teacher performance are. First, consider how Houston, TX teachers would be assessed using two different tests, the Stanford Achievement Test and the TAKS. In the figure below, teachers are assigned to quintiles based on the TAKS reading exam and then compared to their quintile placement on the Stanford reading exam.

[Figure: Houston teachers' quintile placement on the TAKS reading exam compared to their quintile placement on the Stanford reading exam]

In every case, one quarter or more of teachers placed two or more quintiles away on the Stanford exam from where their TAKS score put them (using a two-quintile difference is very conservative, as a teacher at the 19th percentile on one exam and the 21st on the other would already land in different quintiles). One out of six teachers who placed in the highest TAKS quintile fell into the bottom two Stanford exam quintiles, and vice versa. Believe it or not, this is the 'least worst' evidence for the imprecision of value-added estimates of teacher ability.
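
To get a feel for how easily quintile placement can shuffle when two tests measure the same underlying teacher effect with noise, here is a minimal simulation in Python. This is my own sketch, not the study's method; the reliability value, the number of teachers, and the "TAKS-like"/"Stanford-like" labels are assumptions chosen purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
n_teachers = 10_000
reliability = 0.5  # assumed share of each score's variance that is true teacher effect

true_effect = rng.normal(size=n_teachers)
noise_sd = np.sqrt((1 - reliability) / reliability)
score_a = true_effect + rng.normal(scale=noise_sd, size=n_teachers)  # "TAKS-like" score
score_b = true_effect + rng.normal(scale=noise_sd, size=n_teachers)  # "Stanford-like" score

def quintile(x):
    # Rank teachers and bin the ranks into five equal groups (0 = bottom, 4 = top)
    ranks = x.argsort().argsort()
    return ranks * 5 // len(x)

qa, qb = quintile(score_a), quintile(score_b)
moved_two_plus = np.mean(np.abs(qa - qb) >= 2)
top_to_bottom_two = np.mean(qb[qa == 4] <= 1)

print(f"Placed two or more quintiles apart:  {moved_two_plus:.0%}")
print(f"Top quintile on A, bottom two on B:  {top_to_bottom_two:.0%}")

With these assumptions the two scores correlate at only about 0.5, and a sizable fraction of teachers jump quintiles from noise alone, which is the same basic pattern the Houston comparison shows.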

Consider this range of variation in New York City's Teacher Data Reports (italics mine):

As expected, the level of uncertainty is higher when only one year of test results are used (the 2007-2008 bars) as against three years of data (all other bars). But in both cases, the average range of value-added estimates is very wide. For example, for all teachers of math, and using all years of available data, which provides the most precise measures possible, the average confidence interval width is about 34 points (i.e., from the 46th to 80th percentile). When looking at only one year of math results, the average width increases to 61 percentile points. That is to say, the average teacher had a range of value-added estimates that might extend from, for example, the 30th to the 91st percentile. The average level of uncertainty is higher still in ELA. For all teachers and years, the average confidence interval width is 44 points. With one year of data, this rises to 66 points.
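
As a rough illustration of why pooling more years narrows those ranges, here is a back-of-the-envelope sketch. The spread of teacher effects and the single-year standard error below are numbers I made up to show the mechanism; they are not taken from the NYC model, so the printed widths will not match the report's 61 and 34 points.

from statistics import NormalDist

norm = NormalDist()
teacher_sd = 1.0  # assumed spread of true teacher effects, in score units
yearly_se = 0.6   # assumed standard error of a single year's value-added estimate

def percentile_range_width(n_years, level=0.95):
    # Averaging n years of data shrinks the standard error by sqrt(n)
    se = yearly_se / n_years ** 0.5
    z = norm.inv_cdf(0.5 + level / 2)
    lo, hi = -z * se, z * se  # score interval around a teacher at the median
    # Translate the score interval into percentiles of the teacher distribution
    return 100 * (norm.cdf(hi / teacher_sd) - norm.cdf(lo / teacher_sd))

for years in (1, 3):
    width = percentile_range_width(years)
    print(f"{years} year(s) of data: interval spans roughly {width:.0f} percentile points")

Even with three years of data, the interval stays wide; averaging helps, but it does not rescue the precision.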

I think a Magic Eight Ball would be more reliable. And since this method is supposed to be able to identify good teachers, how does it perform at that task? Not well (italics mine):

Given the level of uncertainty reported in the data reports, half of teachers in grades three to eight who taught math have wide enough performance ranges that they cannot be statistically distinguished from 60 percent or more of all other teachers of math in the same grade. One in four teachers cannot be distinguished from 72 percent or more of all teachers. These comparisons are even starker for ELA, as seen in Figure 8. In this case, three out of four teachers cannot be statistically distinguished from 63 percent or more of all other teachers. Only a tiny proportion of teachers - about 5 percent in math and less than 3 percent in ELA - received precise enough percentile ranges to be distinguished from 20 percent or fewer other teachers.

Not working well at all.
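
One way to read those figures: if a teacher's value-added interval runs from, say, the 30th to the 91st percentile, then roughly every teacher whose point estimate falls inside that span cannot be told apart from her. A back-of-the-envelope calculation (mine, not Corcoran's) using the two average math intervals quoted above:

def share_indistinguishable(low_pct, high_pct):
    # Rough share of teachers whose point estimates land inside this teacher's interval
    return (high_pct - low_pct) / 100

# Average math intervals from the NYC Teacher Data Reports quoted above
examples = {"one year of data (30th-91st)": (30, 91),
            "three years of data (46th-80th)": (46, 80)}
for label, (lo, hi) in examples.items():
    print(f"{label}: ~{share_indistinguishable(lo, hi):.0%} of teachers indistinguishable")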

Again, the issue is the misuse of tests: testing is a good way to determine if a particular intervention works, or to get a handle on the relative importance of various demographic variables when looking at a large number of students. But as a method of measuring teacher performance, value-added testing--which is the 'best' method--stinks. As Corcoran notes:

Persistently exceptional or failing teachers - say, those in the top or bottom 5 percent - may be successfully identified through value-added scores, but it seems unlikely that school leaders would not already be aware of these teachers' persistent successes or failures....

But teachers, policymakers, and school leaders should not be seduced by the elegant simplicity of "value added."

Indeed.

In light of our inability to make meaningful statements about teachers, maybe combating poverty doesn't look so intractable....

*The phrase commonly used is 'teacher performance', but "how well a teacher taught" seems to be a more accurate description of what they're purporting to measure.

**With respect to the variables of interest.


"but it seems unlikely that school leaders would not already be aware of these teachers' persistent successes or failures...."

It also seems unlikely that they would have a reliable, impartial method for identifying the bottom 5%, without the value-added testing.

By Adam McCann (not verified) on 10 May 2011

Mike,

This is another of our few areas of agreement. ;) As a public school teacher, I find your arguments and references in this area to be excellent. One concern I have, though, is with this:

Testing should be done outside of normal instruction so that each teacher, school, and student can be fairly measured.

Ideally, that would be true. However, student motivation to do well on a test is also a factor that can skew results (as can anxiety in the case of high-stakes testing). The problem I've seen at the high school level is that some students are unwilling to take a test seriously if it doesn't affect them fairly directly in some way. A significant number of students ended up "Christmas treeing" (randomly filling in bubbles on) our state science test, for instance, since it has no impact at all for high school students.

Also, teacher attitudes can play a part. We heard second-hand from a student that the teacher proctors in one room actually told the students that the science test didn't count for anything and that they didn't know why they had to take it. The student witness said that most of that room's students either filled in a few bubbles before putting their heads down or simply didn't bother to open up their books.

This is probably less of a problem with elementary grades, but for adolescent students, motivation is a significant issue. I'm not sure how to separate testing from having a real consequence for the student and still maintain validity.

I've struggled with the issue of testing for years. I cannot abide standardized tests. I dislike the very process of grading. And I understand the problems of assessing teachers on the basis of student test scores.

At the same time, I want to know, and others legitimately want to know, what students are learning and whether a teacher is actually having an impact. Hence the many varieties of performance assessment and the power of more sophisticated approaches to teaching based on projects and the like. But I simply would not want to hire or support a teacher if I had no idea how he or she was affecting the cognitive, emotional, and social life and development of students.

What is your preferred approach to assessing teachers (and administrators)?

Until students and parents share equally in the responsibility and consequences of testing, teachers cannot, in any fair manner, be held completely responsible.

By thomas sones (not verified) on 12 May 2011

I won't tackle the issue of evaluating administrators, but teachers should be evaluated on their teaching. We do know what constitutes best practice in education. In Finland, for example, teachers routinely observe and critique each other based on best practice. This peer review identifies areas where a particular teacher needs to improve, and it identifies teachers who are having problems. Of course, the entire Finnish system is designed to make better teachers, not to punish teachers and students.

We need to use the tools we have (not to say they can't be improved, but they are based on sound teaching in most cases). If administrators can't or won't visit teachers' classrooms regularly, they will never identify the poor teachers, nor will they help any teacher improve. Using testing as a proxy for good administration will never work: it does not identify good practices, it does not identify poor teachers, and it certainly doesn't help anyone improve.