Problems With Teacher Evaluation: The Value Added Testing Edition

One of the supposed key innovations in educational 'reform' is the adoption of value added testing. Basically, students are tested at the start of the school year (or at the end of the previous year) and then at the end of the year. The improvement in scores is supposed to reflect the effect of the teacher on student learning*. I've discussed some of the methodological problems with value added testing before, and the Economic Policy Institute has a good overview of the subject. But what I want to discuss is a very serious flaw--what I would call fatal--with value added testing that stems from a paper by Jesse Rothstein (pdf).

Before we look at the abstract of the paper, we need to be very clear about what we're measuring. We are taking the difference in test scores--the gain--of students, assigning each student to a teacher, and then asking if we can determine an effect of teachers on the variation of gains (the difference in year-to-year test scores). This is not the same as correlations between annual scores (e.g., high scores in third grade mean high scores in fourth grade). A teacher whose class is full of students who score 80 out of 100 could have a class that does well at the end of the year (average ~80) and thus shows little gain, but the teacher who starts with a class average of 50 and pulls it up to 70 has done well--that gain is what is being assessed.
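
To make the level-versus-gain distinction concrete, here's a toy sketch in Python; the numbers are invented for illustration and don't come from any real data:

```python
# Toy illustration of levels vs. gains (made-up numbers, not real data).
# Teacher A's class starts high and stays high; Teacher B's class starts low and gains.

teacher_a_fall   = [78, 82, 80, 81]   # pre-test scores (out of 100)
teacher_a_spring = [80, 81, 82, 79]   # post-test scores

teacher_b_fall   = [48, 52, 50, 51]
teacher_b_spring = [68, 72, 71, 69]

def mean(xs):
    return sum(xs) / len(xs)

def mean_gain(pre, post):
    """Average per-student gain: the quantity value-added models try to attribute to the teacher."""
    return mean([s - p for p, s in zip(pre, post)])

print(mean(teacher_a_spring), mean_gain(teacher_a_fall, teacher_a_spring))
# -> about 80.5 and 0.25: high scores, essentially no gain
print(mean(teacher_b_spring), mean_gain(teacher_b_fall, teacher_b_spring))
# -> about 70.0 and 19.75: lower scores, but a large gain
```

On levels alone, Teacher A looks far better; on gains, Teacher B does--and gains are what value added testing scores.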

On to the abstract (italics and boldface mine):

Growing concerns over the inadequate achievement of U.S. students have led to proposals to reward good teachers and penalize (or fire) bad ones. The leading method for assessing teacher quality is "value added" modeling (VAM), which decomposes students' test scores into components attributed to student heterogeneity and to teacher quality. Implicit in the VAM approach are strong assumptions about the nature of the educational production function and the assignment of students to classrooms. In this paper, I develop falsification tests for three widely used VAM specifications, based on the idea that future teachers cannot influence students' past achievement. In data from North Carolina, each of the VAMs' exclusion restrictions are dramatically violated. In particular, these models indicate large "effects" of 5th grade teachers on 4th grade test score gains. I also find that conventional measures of individual teachers' value added fade out very quickly and are at best weakly related to long-run effects. I discuss implications for the use of VAMs as personnel tools.

If the sentence in boldface seems problematic, you're right: it is.

There is no known way a fifth-grade teacher--when students are supposedly shuffled among classrooms between grades, an assumption the method requires**--could possibly affect fourth-grade improvement. There are two explanations here:

1) Value added testing has serious methodological issues (the technical phrase is "fucking bullshit").

2) The North Carolina primary school system (where the study was conducted) routinely violates space-time. If this is in fact happening, we have far more important things than student achievement to be worrying about.

I'm going with option #1. So here's the methodological problem:

Panel data allows flexible controls for individual heterogeneity, but even panel data models can identify treatment effects only if assignment to treatment satisfies strong exclusion restrictions. This has long been recognized in the literature on program evaluation, but has received relatively little attention in the literature on the estimation of teachers' effects on student achievement. In this paper, I have shown how the availability of lagged outcome measures can be used to evaluate common value added specifications.

The results presented here show that the assumptions underlying common VAMs are substantially incorrect, at least in North Carolina. Classroom assignments are not exogenous conditional on the typical controls, and estimates of teachers' effects based on these models cannot be interpreted as causal. Clear evidence of this is that each VAM indicates that 5th grade teachers have quantitatively important "effects" on students' 4th grade learning.
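
To see what a falsification test of this sort looks like in practice, here is a minimal sketch. It is not Rothstein's code, the file and column names are hypothetical, and his actual specifications include many more controls; the idea is simply to regress last year's gains on this year's teacher assignments and ask whether the "effect" is jointly zero, as a causal reading requires:

```python
# Sketch of a Rothstein-style falsification test (my own illustration, not the
# paper's code).  Regress 4th-grade gains on the 5th-grade teacher each student
# was later assigned to: future teachers should not "explain" past gains.
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical input: one row per student, with columns named here for the sake
# of the example -- gain_grade4 (4th-grade score gain) and teacher_grade5.
df = pd.read_csv("students.csv")

model = smf.ols("gain_grade4 ~ C(teacher_grade5)", data=df).fit()

# Because the 5th-grade teacher dummies are the only regressors, the overall
# regression F-test is the joint test that none of them predicts last year's
# gains.  A small p-value means the exclusion restriction is violated.
print(model.fvalue, model.f_pvalue)
```

Rothstein runs versions of this test for three widely used VAM specifications, and all three are, in his words, dramatically violated.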

One key point Rothstein makes is that principals don't randomly assign students to classes. Instead, they typically take previous student performance and perceived (or misperceived) teacher quality into account. Some might place poorly-performing students with the 'best' teachers in order to pull those students up (which will make 'good' teachers look worse than they are). Other principals might place the 'best' students with the 'best' teachers. And in other cases, some teachers might have a reputation for performing well with either poorly-performing or high-performing students. Unless student assignment is random, the models break down***. This problem is only magnified when comparing students across different schools, where one can't even attempt randomization ("We would like to improve teacher evaluation, so, thanks to the luck of the draw, your child will be bused to another school an hour away this year." That'll work...).
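
For a sense of why tracking alone is enough to break these models, here is a toy simulation. It is my own illustration, not anything from the paper, and the setup is deliberately crude:

```python
# Toy simulation (not from the paper) of how non-random assignment alone can
# manufacture "effects" of 5th-grade teachers on 4th-grade gains.
import numpy as np

rng = np.random.default_rng(0)
n = 3000

ability = rng.normal(0, 1, n)            # persistent student ability
score3  = ability + rng.normal(0, 1, n)  # end-of-3rd-grade score
score4  = ability + rng.normal(0, 1, n)  # end-of-4th-grade score
gain4   = score4 - score3                # 4th-grade gain; no teacher effect built in

# The principal tracks: the top half of 4th-grade scorers go to teacher "A",
# the bottom half to teacher "B".  The 5th-grade teachers do nothing at all here.
teacher5 = np.where(score4 >= np.median(score4), "A", "B")

print("mean 4th-grade gain, future teacher A:", gain4[teacher5 == "A"].mean())
print("mean 4th-grade gain, future teacher B:", gain4[teacher5 == "B"].mean())
# The two means differ noticeably even though neither teacher has touched these
# students yet -- a naive VAM would read that gap as a teacher "effect".
```

Neither simulated teacher has lifted a finger, yet the "A" group shows a visibly higher fourth-grade gain simply because the principal sorted on fourth-grade scores--exactly the kind of spurious "effect" Rothstein's falsification test picks up.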

I realize this seems pretty technical (and if you think I'm bad, read the paper), but if education 'reformers' want to claim that their methods are rigorous, they have to get the methods right--that's how science works. If the methods fail, nobody cares about your results, and any discussion of those results is moot. You can't violate the assumptions of your methods.

Or the space-time continuum.

Cited article: Rothstein, J. 2010. Teacher Quality in Educational Production: Tracking, Decay, and Student Achievement. Quarterly Journal of Economics 125(1): 175-214. doi:10.1162/qjec.2010.125.1.175

*This method is actually taken from studies that look at how different firms pay their employees. Intelligent Designer save us from the economists....

**Whether this is good for the students overall is a separate question--should evaluation trump classroom coherency?

***Given correlations with the previous year in terms of absolute scores (and, interestingly, weak negative correlations with regard to gains), you need to ensure students are randomized, or else the gain might be a cohort effect and not a teacher effect.


Lovely argument. Could you do it again in English?

@Clam

Which makes more sense to you, that a student's 5th grade teacher will affect the student's test score gains in the 4th grade, or that the student's test score gains will affect which teacher she gets in the 5th grade?

Now, if you want to test the performance of two 5th grade teachers against each other, and one teacher gets more of the 4th grade high performers than the other, is that a fair test?

"One key point Rothstein makes is that principals don't randomly assign students to classes. ..."

All good comments. In addition there is the possibility of parents making specific requests for certain teachers. It's allowable, although not encouraged, here in MI, and only adds to the problem of teacher evaluation.

I tried to correct my first submission as follows:
"My bad, I meant the incomprehensible blockquotes."
but was told "Rejected - Too many submissions in a short space of time".
Too many? Two? How long do I have to wait before the robot lets me join in? WTF?

Ok, I understand. Either is lousy logic. Part of the problem is that someone's trying to apply accounting principles to human beings and the other part is that someone's trying to apply accountancy to human beings.
In Europe we suffer from the dreaded "Value Added Tax" whereby if I buy two widgets at $1.00 each, screw them together and sell them at $2.50, I am charged VAT on the difference ($0.50) at, say, 15%. There it is clear(-ish). But to take a pupil who has been failed by his teacher, or has failed himself, or is just dumb, and then to assess his subsequent teacher by "value added" is crazy. If the child can't read, how can he learn? If you teach him to read, where's the time for this year's curriculum? Potty.

Fair enough, but I'm tired of people saying measurement cannot work just because there are examples (real or imagined) of imperfect models or methods.
For example, I wouldn't be shocked if improvement in 4th grade were of some use in predicting improvement in 5th grade. We could model that.


In English, the Rothstein paper makes these points:

1. Measuring teachers' skills through VAM modeling is inappropriate because key assumptions in the VAM modeling are violated (randomization).

2. Attempting to adjust (fudge factors) for these violations does not improve the results.

3. There may be good reasons not to organize a school according to rules that would allow for VAM models (certain teachers may handle certain students better, and it makes sense to make such assignments accordingly).

4. Therefore, using VAM modeling systems to judge teacher performance will fail to reward/punish teachers appropriately.

5. In the longer run, purely econ-based models may not be appropriate for judging teachers' contributions to student performance (which, in a broader sense, aren't, and maybe shouldn't be, concrete ideas).

The sooner we stop looking at schools as learning factories or cheap day care centers, the sooner we'll actually see some real improvements. I recently watched the wealthiest school districts in my state reorganize, and that wasn't very pretty, especially considering the vast resources per student they possess. So I'm not holding my breath for any real improvements elsewhere in the state or in the country.

By Always Curious (not verified) on 07 Mar 2011 #permalink

Exactly!

My sister, who is a grade school teacher, has a reputation as the "high structure/high nurturing" teacher, and is routinely assigned most, if not all, of the children in her grade who fall on the autism spectrum (assigning children with similar special needs to the same class also saves money on in-classroom aides). She also tends to be assigned children who don't have a specific diagnosis yet but are very atypical learners.

This can have serious effects on her test results, though she has not yet been disciplined at the schools where she has worked.

There is a huge problem with value-added assessment at the upper end of the academic scale. If a child is performing "A" level work (90%+), there is very little room to "add" to that performance. Bringing a failing child up to satisfactory performance would show significant "value added," but sustaining a gifted child at the upper end of performance would not.

By Dr. Shrinker (not verified) on 08 Mar 2011 #permalink

This article made me chuckle! It is so true.

They do value added a little differently in Florida, though it still seems quite shady.

The way they calculate VAM in Florida is that, instead of measuring growth based only on the individual student's scores, they take a bunch of students who are supposedly "just like" the student being measured, calculate that group's average, and see whether the student scored within that range. For example, a Latino female who is on free lunch gets compared with 1,000 other Latino females on free lunch. If her score meets or exceeds the average of the group she's compared with, that student has shown a gain.

If you ask me...it's all so confusing and just doesn't seem accurate. As a teacher, I think I should be able to understand how I am evaluated. I can honestly tell you that not one other teacher, let alone administrator, can explain how these evaluation scores are calculated.