Faculty Evaluation Is Really Complicated

By drorzel on June 17, 2010.

There's a paper in the Journal of Political Economy that has sparked a bunch of discussion. The article, bearing the snappy title "Does Professor Quality Matter? Evidence from Random Assignment of Students to Professors," looks at the scores of over 10,000 students at the US Air Force Academy over a period of several years, and finds a small negative correlation between the faculty effect on performance in an introductory course and performance in a follow-on course. In other words, as they explain in the Introduction,

[O]ur results indicate that professors who excel at promoting contemporaneous student achievement, on average, harm the subsequent performance of their students in more advanced classes. Academic rank, teaching experience, and terminal degree status of professors are negatively correlated with contemporaneous value-added but positively correlated with follow-on course value-added. Hence, students of less experienced instructors who do not possess a doctorate perform significantly better in the contemporaneous course but perform worse in the follow-on related curriculum.

I realize this is almost a week old, and thus ancient history in normal bloggy terms. I've been struggling to think of what to say about it, though. The general reaction that I've seen is "Hunh. That's... interesting," with the exception of the Dean Dad, who goes off into a rant that is only vaguely connected to the actual paper. The one-sentence summary of the result is in the title to this post, and beyond that, I don't have a whole lot of commentary. Since I read the whole thing, though, I might as well write it up as a ResearchBlogging post, which will have the added benefit of taking long enough that I won't have time to say anything about the latest round of a different, utterly pointless, argument.

The really interesting thing about this paper is that the authors have stumbled onto a really awesome dataset at the US Air Force Academy. The USAFA curriculum requires all students, regardless of major, to take a certain set of core courses, including introductory calculus and a number of classes that have calculus as a prerequisite. More importantly, students are assigned to sections of the core courses using a random algorithm, so there are no self-selection biases. The core courses are taught in multiple sections by different instructors, but all sections follow the same, fixed syllabus, and all exams are graded in common, with no one instructor responsible for the grade of any given student.

It's basically an education researcher's dream sample. All of the usual confounding factors-- the tendency of students to sort themselves into sections taught by particular professors, or into groups of similar academic ability-- are removed. The problem of professors potentially grading their own students more leniently is taken care of by the common exams and pooled grading. Any differences they see can be attributed to the teaching in a cleaner way than in any previous study of these issues.

This seems almost too good to be true, so they spend a good chunk of the paper demonstrating that the dataset really is as good as they claim. They do a bunch of statistical tests to show that the students really are randomly assigned to sections, including looking for reverse correlations-- an apparent effect of future professors on past grades. They find nothing that suggests anything other than a random distribution.
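To give a flavor of what a randomization check can look like, here's a minimal sketch of a permutation test. This is not the test the authors ran; the data are made up, and the function names (`spread_of_section_means`, `permutation_p_value`) are hypothetical. The idea is that if section assignment is random, the gap between the best and worst section averages on *prior* grades shouldn't be unusually large compared to what you get by reshuffling the section labels.

```python
import random
from statistics import mean

def spread_of_section_means(scores, sections):
    """Return the gap between the highest and lowest per-section mean score."""
    by_section = {}
    for score, sec in zip(scores, sections):
        by_section.setdefault(sec, []).append(score)
    means = [mean(v) for v in by_section.values()]
    return max(means) - min(means)

def permutation_p_value(scores, sections, n_shuffles=2000, seed=0):
    """Fraction of random relabelings whose spread is at least the observed one."""
    rng = random.Random(seed)
    observed = spread_of_section_means(scores, sections)
    shuffled = list(sections)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # keeps section sizes fixed, scrambles who is where
        if spread_of_section_means(scores, shuffled) >= observed:
            hits += 1
    return hits / n_shuffles

# Made-up data: prior grades for 200 students, assigned at random to 10 sections.
rng = random.Random(1)
prior_scores = [rng.gauss(75, 12) for _ in range(200)]
sections = [i % 10 for i in range(200)]
rng.shuffle(sections)
p_random = permutation_p_value(prior_scores, sections)

# Contrast case: students sorted into sections by prior grade (extreme self-selection).
sorted_scores = sorted(prior_scores)
sorted_sections = [i // 20 for i in range(200)]
p_sorted = permutation_p_value(sorted_scores, sorted_sections)

print(f"random assignment: p = {p_random:.3f}")
print(f"sorted assignment: p = {p_sorted:.3f}")
```

With sorted sections the observed spread is as large as it can possibly be, so essentially no reshuffle matches it and the p-value comes out tiny. With random assignment the p-value is just some unremarkable number between 0 and 1.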

Having established that the data are sound, they then set up a regression model to look for effects attributable to faculty, and find that there are consistent and repeatable differences between faculty members in terms of student performance (as measured by the common exams, which determine the entire grade). Some faculty consistently have higher grades than others in introductory calculus, and the difference is statistically significant. It's not a large difference-- 5% of a standard deviation for a one-sigma increase in the instructor, or about 0.6% of the final grade-- but it's there, and attributable to something done by the instructor.
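As a quick sanity check on those two numbers, you can back out the spread in final grades they imply. This is just arithmetic on the figures quoted above, assuming "0.6% of the final grade" means 0.6 percentage points; the function name is made up for illustration.

```python
def implied_grade_sd(effect_in_sd_units, effect_in_grade_points):
    """If an effect is X standard deviations and also Y grade points, one SD is Y / X."""
    return effect_in_grade_points / effect_in_sd_units

# 5% of a standard deviation is quoted as about 0.6 points of the final grade.
sd_points = implied_grade_sd(0.05, 0.6)
print(f"Implied spread of final grades: about {sd_points:.0f} percentage points")
```

So the standard deviation of final grades would be around 12 percentage points, which is a reasonable-sounding number for a common-exam course, and makes clear just how small a 0.6-point instructor effect is by comparison.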

Since they also have data for follow-on courses that require introductory calculus as a pre-requisite-- Calculus II, Statistics, a variety of Physics and Engineering courses-- they also look for a correlation between the instructor in Calculus I and the performance in subsequent courses. And this is where they find the interesting result-- students of instructors whose classes are better than average in the intro course do slightly worse in the follow-on courses. Again, this effect is small-- about the same 5% of a standard deviation change-- but statistically significant. They do a bunch of cross-checking to see whether this can be attributed to some sort of numerical goof, but it looks robust.

They then use this model to try to tease out the important characteristics of the faculty, and find that student performance in the introductory class gets better when the instructor has less experience and rank, while performance in the subsequent classes is better for more experienced tenure-track faculty. Again, these are small effects, but they have enough data to work with that the effects are statistically significant, and appear mathematically robust.

Finally, they note that student course evaluations are correlated with grades-- students who are getting better grades in a class rate the instructor more highly. Given the above results, though, this means that evaluations for introductory professors are negatively correlated with performance in future courses. The better the evaluations given to the faculty in the first course, the worse the performance in the second course.

So, what's the cause of this? Nice as their dataset is, they don't have any way to really work that out, but they offer three possible explanations: First, that less experienced instructors are more likely to adhere strictly to the common curriculum, and are thus effectively "teaching to the test," while their more experienced colleagues teach a broader range of stuff that leads to better understanding and thus better performance in future classes. The second possibility is that students whose introductory course instructors are "teaching to the test" pick up bad study habits, which come back and bite them in later courses. The final explanation, which even the authors characterize as "cynical," is that students who get low-value-added professors in the introductory course put out more effort in the subsequent courses, in order to pick up their GPA.

They are admirably cautious in drawing conclusions based on all this. About the strongest statement they make in the paper is the concluding paragraph:

Regardless of how these effects may operate, our results show that student evaluations reward professors who increase achievement in the contemporaneous course being taught, not those who increase deep learning. Using our various measures of teacher quality to rank-order teachers leads to profoundly different results. Since many U.S. colleges and universities use student evaluations as a measurement of teaching quality for academic promotion and tenure decisions, this finding draws into question the value and accuracy of this practice.

This (or, rather, a Washington Post reporter's transcription of this) appears to be what set the Dean Dad off. It strikes me as pretty weak tea, though-- all this really does is confirm what everybody in academia already knows-- that student course evaluations are a questionable way of measuring faculty performance. The data they have to work with give about the cleanest possible demonstration of this, but it's not really news. And, again, the effect they see is really small, and it's only because of the size and quality of the dataset that they can see anything at all.
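To put a rough number on why the sample size matters, here's a back-of-the-envelope sketch. It assumes independent observations and uses the textbook approximation that the standard error of a small standardized effect is about 1/sqrt(n). That's a simplification of mine, not the paper's analysis: the real data are clustered by section and instructor, so the actual uncertainties are larger than this suggests.

```python
import math

def approx_standard_error(n_students):
    """Rough standard error of a small standardized effect from n independent observations."""
    return 1.0 / math.sqrt(n_students)

effect = 0.05  # the roughly 5%-of-a-standard-deviation effect discussed above

for n in (100, 1_000, 10_000):
    se = approx_standard_error(n)
    print(f"n = {n:>6}: standard error ~ {se:.3f}, effect / SE ~ {effect / se:.1f}")
```

With a hundred students, an effect this size is about half the size of the noise, and you'd never see it. With ten thousand, it's several times the noise, at least in this idealized version.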

So, that's pretty much it. They find a small negative correlation between the value added by faculty in an introductory course and student performance in subsequent courses. Which, in the end, adds up to, well, the post title: evaluating the performance of faculty is really difficult. Which is pretty much what most academics have been saying all along.

I don't think these results demand a wholesale revision of academic hiring and promotion practices-- the one good point the Dean Dad makes is that the system is clearly working, in that the professors who produce better results in follow-on courses have been kept on long enough to gain more experience than their younger colleagues whose students get better intro-course grades. At the same time, though, they throw a bit of cold water on a lot of "merit pay" schemes, which rely on things like grades and evaluations in this year's classes to determine next year's salary. If this year's test scores are weakly or negatively correlated with next year's performance, then it's hard to justify using them as a basis for pay and promotion.

(One tangential comment: It's always a little odd to read social-science papers, because they have a tendency to use summary paragraphs in the individual sections that are nearly word-for-word identical to paragraphs in the introduction. That's not as common in physics papers, at least not in the top journals, because the page limits set by PRL and others discourage that kind of repetition. So it's always a little jarring to me when I hit the concluding paragraph, and think "Wait, didn't I read this already?")

Carrell, S., & West, J. (2010). Does Professor Quality Matter? Evidence from Random Assignment of Students to Professors. Journal of Political Economy, 118 (3), 409-432. DOI: 10.1086/653808


So, the young faculty teach for good evaluations, while the experienced guys teach so they don't have to waste 2 weeks at the 400 level on remedial math. The only thing that surprises me is that you can actually show that with statistical significance.

In my department, we have lots of pretty good teachers (some fabulous) and one pretty bad teacher. When students complain to me about that teacher (and it's too late in the term to transfer or withdraw), I tell them that if they make it through, they'll be better students, because they will have learned how to study math on their own, a wonderful skill to have. I don't know if that sort of thing plays in here. (It seems related to #2 in the list of possible causes.)

Regardless of how these effects may operate, our results show that student evaluations reward professors who increase achievement in the contemporaneous course being taught, not those who increase deep learning.

How could it possibly be otherwise? On what basis are students supposed to evaluate the "depth" of their knowledge independently of their grades on tests and homework assignments?

Ping!

There is another explanation, not addressed by the authors:

Less educated instructors mentally approach a topic (especially math) in a fashion much closer to that of a non-expert (on that topic). Conversely, concepts important to an expert are often alien to a non-expert, and this is a major hurdle when communicating with a 'novice'.

The upside is that an expert has a better understanding of which concepts are most important to the field as a whole -- concepts not necessarily important at the time, but needed for more advanced concepts (~later course work).

This phenomenon is familiar to any applied scientist who has worked with mathematicians (almost to the point of a cliche). For example, an evolutionary biologist asks a mathematician colleague a question -- and receives a half-hour-long answer which is "precisely correct, and absolutely useless", and generally almost incomprehensible.

But the complication with this cliche is that the mathematician is generally providing a thorough answer ... just in terms the applied scientist is unfamiliar with, and addressing potential snags and complications they are unlikely to have ever heard of.

As students are (usually) not experts in any field, they experience this very acutely.

Then it seems Caltech is doing it right after all! We have an applied math professor who teaches an extremely difficult course that's required for most majors, and does it very well. He explains the material so clearly that he is able to cover it in much more depth than when anyone else teaches it; he also consistently gets very good student evaluations. After he won a teaching award several times in a row, the school paper included a quote to the effect of "good thing he's not up for tenure for a few more years; if he's teaching this well they'll assume he must be neglecting his research." And there I thought it was a *problem* that many of our professors can't be bothered to write legibly on the board, speak comprehensibly, define variables before using them, or proofread their problem sets.

Perhaps the study just shows that the tests used in introductory courses are not properly assessing the important knowledge and skills required for medium-term success in the program...

The negative correlation between Calc 1 and subsequent marks may also be evidence of regression to the mean.
