Tuesday, I criticized the LA Times’ use of the ‘value-added’ approach for teacher evaluation. There were many good comments, which I’ll get to tomorrow, but Jason Felch of the LA Times pointed me to the paper describing the methodology. I’m not happy with the method used.
First, I was right to have concerns about the linearity of test scores. Consider the mean score for each quartile:
highest = 852
second highest = 768
third highest = 730
fourth highest = 682
What this means is that an increase from 40th percentile to 50th is not the same as an increase from 50th to 60th. Now, as far as I can tell, the authors in the paper are using the raw scores, but the model they are using assumes linearity. In light of this, using something like proficiency (a gross cutoff) would seem to be a more accurate, if less precise, measure to use (i.e., something like net percent increase/decrease in proficiency per class).
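To make the unequal spacing concrete, here is a quick sketch in Python that uses only the four quartile means quoted above. The variable names are mine, not the paper's, and this is just arithmetic on the summary numbers, not a reanalysis of the data.

```python
# Mean scores by quartile, as quoted above (highest to lowest).
quartile_means = [852, 768, 730, 682]

# Gap in mean score between each pair of adjacent quartiles.
gaps = [upper - lower for upper, lower in zip(quartile_means, quartile_means[1:])]

print(gaps)  # [84, 38, 48]
```

If equal steps in percentile rank corresponded to equal steps in score, these gaps would be roughly the same size. Instead, the step from the second to the top quartile is about twice the other two.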
Second, the authors include lots of effects in their model (and determining the significance of these effects isn’t trivial), but there’s one glaring omission:
The model was simplified by assuming that the student heterogeneity term (αi) was zero.
In other words, intrinsic student differences are removed from the model. The authors claim that is warranted:
This assumption was consistent with initial data runs that indicated that student heterogeneity was statistically insignificant after controlling for prior year test score and observed student characteristics. More importantly, recent research has shown that this type of model performs well in predicting teacher performance from year to year in both experimental and non-experimental settings (Kane and Staiger, 2008; McCaffrey et al., 2009).
Oddly enough, the Kane and Staiger paper claims that teacher effects disappear after two years, so, well, I’m not sure what the fuss is about. But the larger issue is that this is a really screwy population of students. Here’s how the percentage of students who qualify for free lunch (an indicator of poverty) breaks down:
highest = 55
second highest = 89
third highest = 94
fourth highest = 97
This is an incredibly monomorphic population. To give you some idea of what that means, if a class has 25 students, about half of the classes in the schools belonging to the lowest quartile will have every student qualify for a free lunch (I’m assuming students are distributed equally, which at 97% is probably a reasonable, if not entirely accurate, approximation). It is difficult to tease out the effects of poverty because so many students are poor. The reason student variation can be ignored, and has little effect in the analysis, is that the environment of the students is rather invariable, albeit for a shameful reason. In other words, this study primarily deals with a population that is homogeneous for poverty. Thus, we can’t say very much about how poverty affects scores in general. It also means that teacher effects will be magnified relative to other student populations. Related to this, Matthew Yglesias, looking at LA’s NAEP scores, concludes:
We see that LA’s black kids do worse than the average big city black kid. LA’s Latino kids do worse than the average big city Latino kid. And LA’s poor kids do worse than the average big city poor kid. LA’s non-poor kids, its white kids, and its Asian kids are average for kids in big city public school systems. Relative to the national average LA’s 8th grade math scores are below average for blacks, Hispanics, and Asians. They’re below average for poor kids and they’re below average for non-poor kids. But LA’s non-Hispanic whites do right in line with the national average.
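Going back to the free-lunch arithmetic for a moment, the "half of the classes" figure can be checked in a couple of lines of Python. This is a sketch under my own simplifying assumptions: 25 students per class, and each student independently having a 97% chance of qualifying, which is what "distributed equally" amounts to here.

```python
# Assumptions (mine, for illustration): 25 students per class, each
# independently qualifying for free lunch with probability 0.97.
p_qualify = 0.97
class_size = 25

# Probability that every student in a given class qualifies.
p_all_qualify = p_qualify ** class_size

print(round(p_all_qualify, 3))  # 0.467
```

So a bit under half of such classes would consist entirely of students who qualify for a free lunch.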
Finally, according to their analysis, teacher quality accounts for 19% in English and 27% in math (for the stats mavens, the effect sizes are 0.19 and 0.27 respectively). It should be noted that the correlation between years is 0.87, so the greatest contribution to test scores is what the student walked into the classroom with (i.e., students who did well last year will do well this year). Even if we take the effect sizes at face value (and I think there are other methodological issues, along with what I’ve raised here, that make that a dubious assumption), we’re still saying that ~75% of the effect is not due to teacher quality.
I’ll have some final thoughts and discuss reader comments tomorrow.