Grenade Tossing About Grade Inflation

Via Thoreau, a paper from a physicist in Oregon that's pretty much a grenade lobbed into the always-explosive grade inflation discussion:

We use four years of introductory astronomy scores to analyze the ability of the current population to perform college level work and measure the amount of grade inflation across various majors. Using an objective grading scale, one that is independent of grading curves, we find that 29% of intro astronomy students fail to meet minimal standards for college level work. Of the remaining students, 41% achieve satisfactory work, 30% achieve mastery of the topics.

Intro astronomy scores correlate with SAT and college GPA. Sequential mapping of the objective grade scheme onto GPA finds that college grades are inflated by 0.2 for natural sciences majors, 0.3 for social sciences, professional schools and undeclared majors, 0.5 for humanities majors. It is unclear from the data whether grade inflation is due to easier grading curves or depression of course material. Experiments with student motivation tools indicate that poor student performance is due to deficiency in student abilities rather than social factors (such as study time or decreased interest in academics), i.e., more stringent admission standards would resolve grade inflation.

Yeah, that won't be controversial at all. The seriousness of this contribution is accentuated by the submission note on the arxiv, "not to be submitted to any journal." While this is most likely due to the use of student grades as a dataset, and the difficulty of getting permission to use those data (nothing in the paper allows you to identify specific students, but as I understand it, unless you have approval in advance, you can't usually use class assignments for research purposes), it makes it look like the author is more interested in scoring cheap rhetorical points than making a serious contribution to an intellectual debate by sending this out for peer review.

Appearances aside, what does the paper say? Well, pretty much what's in the abstract: the author taught introductory astronomy for four years running, using exactly the same quiz and exam questions, and piled up a big data set. He then looks at correlations between the grades students received in his class and their SAT scores and overall GPA. The claim of grade inflation is based on the fact that students who performed at a C level in these classes tended to have GPAs slightly higher than a straight C (averages range from 2.2-2.6 for different majors), and so on up the grade ladder.
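To spell out the arithmetic behind that claim, here's a minimal sketch of how the inflation numbers appear to be computed; this is my reading of the summary above, with illustrative names and values, not a definition taken from the paper.

```python
# A hedged illustration of the inflation measure as described above: compare
# the overall GPA of students who earned a given grade on the "objective"
# astronomy scale to that grade's nominal point value. Names and numbers are
# illustrative, not taken from the paper.
grade_points = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

def inflation(objective_grade, mean_gpa):
    """Mean overall GPA of students at this objective grade, minus the grade's value."""
    return mean_gpa - grade_points[objective_grade]

# e.g., C-level astronomy students whose overall GPA averages 2.5
print(inflation("C", 2.5))  # 0.5 -- the size of the claimed inflation
```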

Does he have a valid point? Really, the entire thing hinges on the definition of the "objective grading scale" used for this study:

Knowledge testing was the domain of the multiple choice exams. Three exams are taken each term (covering only the previous 1/3 of the course material). Each exam is composed of 100 multiple choice questions. Each exam is divided into three types of questions. The first type tests knowledge of factual information (e.g., what is the color of Mars?). The second type of questions addresses the student's ability to understand the underlying principles presented in the course (e.g., why is Mars red?). The third type of questions examines the ability of the student to process and connect various ideas in the course (e.g., why is the soil of Mars rich in heavy elements such as iron?).

Excellence in answering questions of the first type represents satisfactory grasp of the course's objectives, i.e. a 'C' grade. High performance on questions of the second type demonstrates good mastery of the course material, i.e. a 'B' grade. Quality performance on the top tier of questions would signify superior work, i.e. an 'A' grade. While the design of the various questions may not, on an individual basis, exactly follow an objective standard for 'A', 'B' or 'C' work, taken as a whole this method represents a fairly good model for distinguishing a student's score within what most universities consider a standard grading scheme. Certainly, this was the original intent of the ABCDF grading scheme, not to assign a grade based on class rank or percentage, but to reflect the student's actual understanding of the core material.

The A/B/C lines are set by assuming students answer all of the appropriate category or categories of questions correctly, then guess at the answers for the others. So, the minimum score to get a C is set at the total number of points for answering all the factual questions, plus 1/5 of the points for the other two categories (each question has five answer choices, so random guessing nets 1/5 of the remaining points).
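To make that cutoff rule concrete, here's a minimal worked sketch. The exact breakdown of the 100 questions by type isn't quoted above, so the even 34/33/33 split is purely an assumption for illustration.

```python
# Sketch of the grade-cutoff rule described above, under an assumed 34/33/33
# split of the 100 questions among the three question types.
n_factual, n_conceptual, n_synthesis = 34, 33, 33
guess_rate = 1 / 5  # five answer choices per question

# C line: all factual questions right, random guessing on everything else
c_cutoff = n_factual + guess_rate * (n_conceptual + n_synthesis)

# B line: the same rule one tier up -- factual and conceptual right, guess the rest
b_cutoff = n_factual + n_conceptual + guess_rate * n_synthesis

print(c_cutoff, b_cutoff)  # 47.2 and 73.6 out of 100 under these assumptions
```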

If you think that this is a good way to assign grades, then there's probably something to this-- the correlation between class grade and GPA is excellent as far as this sort of study goes. If this sounds like a ridiculous way to assess students' knowledge of introductory astronomy, then the rest of the argument probably falls apart. I'm kind of lukewarm on the whole thing, to be honest-- the slightly combative approach he takes with this puts me off a bit, but the correlations he sees are surprisingly good.

Anyway, I'm sure there are people out there who will have strong opinions on this subject, so I'm passing it along for comment. While I'm tempted to forward it to the local faculty who bang on about grade inflation all the time, I think discretion is the better part of collegiality this early in the year. There's no reason to start an all-faculty listserv shitstorm in August.


Huh? The way you present this, it seems as though all that has been demonstrated is that humanities majors perform better in their non-astronomy classes than they do in the astronomy class.

The real test would be a similar analysis in the two other major disciplinary categories. If top-performing physics-major undergrads flailed at Postmodern Literary Criticism 101, then this would be evidence for inflation in the natural sciences, right?

The claim the author is making is that students who are getting low scores in the introductory astronomy classes-- that is, who can't even answer the questions that are basically memorization of trivia ("What is the color of Mars?")-- are simply incapable of college-level work. Higher grades in other areas must be the result of inflation.

That's what I mean when I say you have to buy his grading scheme for this to make any sense. If you don't think that he's accurately characterizing the state of students who get C's or below in his class, then everything else falls apart.

Mat's interpretation sounds reasonable to me, too.

I'm not going to waste any more of my time reading that steaming pile of crap, but I did get far enough to have the following concerns:

1. I was thinking that there might be some year-to-year increase in scores, given that he's using identical exams. To my surprise, he does check that.

2. He says that the classes' Math SAT scores are higher than the University average, and suggests that this is due to self-selection of students towards science classes. But these students are fulfilling a science requirement with what I suspect is viewed as the easy science class. So I would expect self-selection to lower the Math SAT scores for the class. More discussion of this would help.

3. There is not a single citation. I got one sentence into the main body before I found a claim that could have used a citation.

4. This paper is possibly an ethics violation. [Disclaimer: this is not my field] You're usually exempt from IRB if you're using work that is a natural part of teaching the course. My guess is that he would get an easy IRB exemption for this. But you actually have to go to the IRB and request that exemption. The term IRB doesn't show up anywhere in the paper, nor does the term ethics. A real journal would reject a paper outright if IRB was not handled properly. The arxiv may or may not be considered publication for the purposes of IRB. If it is, the resulting ethics violation might be considered grounds for termination.

5. There's an interesting statistical question regarding the appropriate way to handle drops. "Interesting" meaning "I don't know the answer." My wife does, though; this is a routine problem in epidemiology. I'm disinclined to give the author the benefit of the doubt, though, and go look for it.

In conclusion, no, this is not a serious paper. Which is a real shame, because I like the idea of trying to establish an objective grading scheme, and I largely agree with his scheme. I might quibble with some of the details of his implementation, but the problems are way too large for me to waste any more time taking him seriously in this area.

Waitaminute, this guy thinks that multiple choice tests are a suitable methodology for gauging whether students who are already in college are prepared to do college level work? [facepalm] If I'm reading the quoted passage right, he's saying that he evaluates students by using three multiple choice exams of 100 questions each. I don't know the details of scheduling, but that's a lot of questions to answer in an hour, unless they are at a highly superficial level. Perhaps some of the alleged student deficiencies are simply not having enough time to answer all of the questions. Not to mention that he completely ignores all sorts of research about different learning styles.

If he were pointing to a bunch of essays or term papers which are barely readable due to endemic spelling/grammar mistakes as well as mangling basic concepts that students are supposed to learn in the class, then he might have a case. Mixing some multiple choice questions with short-answer questions or simple problems would be understandable. But the only predominantly multiple choice test I encountered as an undergrad was the GRE. College work should involve students thinking about what they are learning, not just regurgitating answers on an exam, which is what most multiple choice tests end up doing.

@1: I don't see where Chad is endorsing the idea of greater grade inflation in humanities majors; he's just quoting the abstract, which makes this claim. The claim is plausible (it's easier to fake your way through a humanities course than to do so in a science course), but as you say, not proven.

By Eric Lund on 27 Aug 2010

Let me at least partially retract my concern 4: it doesn't look like any of the other papers under ed-ph mention IRB directly. I agree with Chad that that is the most likely cause for the "not to be submitted to any journal" disclaimer. And I remain concerned about the lack of citations as well. It really does come off to me as the work of a crank (vis-a-vis the field, of course; he does appear to be a serious scientist in his usual field).

At faculty orientation at a science-friendly school, the academic dean showed us a presentation that included the bell curves of grades in freshman/intro math, psychology, physics, etc. These were all nicely centered in the high-C/low-B area. Then they came to writing classes (my specialty), and the bell curve we had been seeing lurched well into the B/A area. The scientists laughed, it was so noticeable. Grade inflation?

I don't think so, necessarily. Most of the people in my department allow revisions. In fact, we encourage kids to think of writing as a process. If we are doing our jobs right, at the end of the semester we should probably expect students who have invested in revising and resubmitting based on our comments to get higher grades on the whole. Those grades reflect the fruits of a LOT of revision and are likely the inevitable result of differing pedagogies.

HJ

Check Steve Hsu's blog for some work that Jim Schombert worked on with Steve:

http://infoproc.blogspot.com/2010/05/psychometric-thresholds-for-physic…

http://infoproc.blogspot.com/2010/04/dating-mining-university.html

etc

RE: the facepalming and whatever about methodology and pedagogy, Intro Astro is The Most Popular Science Course at Oregon. Do you want to give essay exams to four 150-student sections full of freshman English majors fulfilling requirements? The attitude of a lot of these comments seems to be a little 'privileged' from a teacher/student ratio perspective.

Do you want to give essay exams to four 150-student sections full of freshman English majors fulfilling requirements? The attitude of a lot of these comments seems to be a little 'privileged' from a teacher/student ratio perspective.

I agree, and I'm also surprised about the view that multiple choice questions are somehow intrinsically less rigorous than other forms of questions. Eric Lund brings up the GRE as an example, but I don't recall "regurgitating answers on an exam" when taking the GRE.

I agree that there's nothing inherently wrong with multiple choice, if the questions are done well. I've gotten to really enjoy coming up with conceptual multiple choice questions for our intro exams, and they can be really revealing-- students who can plug numbers into formulae with the best of them will flail spectacularly on well-designed multiple choice sections. It really helps separate the A students from the B students.

He used the same quiz and exam questions each year and didn't assume all the students already had the answers?!

I tend to think that most concerns about re-using questions are overblown. As long as you take reasonable precautions to prevent "cheat sheets," and do things like permuting the answers (order them by increasing magnitude one year, decreasing the next), you can prevent students from getting trivial perfect scores. And if they memorize the answers well enough to be able to defeat permutations and so on, that's all but indistinguishable from actually learning how to do the problems.
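The bookkeeping for that kind of reshuffling is easy to automate. Here's an illustrative sketch only; the function and the sample question are hypothetical, not anything from the paper or the course.

```python
import random

def permute_options(options, correct_index, seed=None):
    """Shuffle the answer options for one question and return the new ordering
    along with the index of the correct answer in that ordering."""
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    new_options = [options[i] for i in order]
    new_correct = order.index(correct_index)
    return new_options, new_correct

# The same question, reshuffled for a new term by changing the seed
opts, ans = permute_options(
    ["Blue", "Red", "Green", "White", "Yellow"],  # "What is the color of Mars?"
    correct_index=1,
    seed=2010,
)
print(opts, "correct:", opts[ans])
```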

Whatever the quality of the examination procedure, I think there's something to the point that students who can't tell you the color of Mars after taking introductory astronomy have a problem. I wonder what the other questions on the 'C' test were like.

Did he check the scores against actual attendance/signs of being awake while attending?

I did waste more of my time reading the rest of that steaming pile of crap. It's pretty clear the guy had an agenda or two going into the study, and, surprise, he found what he expected. (E.g. "Grades are inflated." or "We scientists do not inflate grades as much as other disciplines.")

There are plenty of explanations for his results. How about maybe other disciplines are better at thinking about how much they can actually teach students in a single course, for one? But a key flaw is assuming that student performance in one specific course maps in any meaningful way onto other courses or other disciplines.

Oh, and one other thing that jumped out at me:
As a quick test of the hypothesis on low student motivation, we introduced the use of clickers in the 2009 academic year. Clickers are the pedagogical toy-du-jour.

A physicist over the age of 50 expressing skepticism at modern pedagogical practices? Quelle surprise. He might be interested in the literature showing that you can't just throw these sorts of things (including tutorials and the like) into a course and expect student learning to magically increase. You actually have to, you know, do it in the right way.

Folks have rightly pointed out that this paper borders on the crankish. James' actual science has been listing that way as well. A quote from a recent paper he put on astro-ph:

"This version of our work released on astro-ph deviates slightly from the version that will be published. This is due to, what we believe, is a repressive editorial policy that allows an anonymous referee to replace our conclusions to match their personal views. Thus, we have restored several sections of text to re-enforce, more strongly, our results based on our data. Since no one ever references our work, we find the possible confusion in bibliography's to be moot."
http://lanl.arxiv.org/abs/0905.0410

Imagine how good his class would be if he had spent his time figuring out ways to be a better teacher rather than blaming his students.

Does Oregon put student course evaluations online? http://www.ratemyprofessors.com has him at a 2.7/5--for someone who teaches intro astronomy that's pretty shabby.

Most of the high-profile education researchers in physics swear up and down that well-designed multiple-choice questions are every bit as good as any other sort of question for gauging learning. "Well-designed" can mean anything from "subjected to years and years of study and revision" to "the ones in the teaching methods book that I'm selling with my face on the cover" to "the ones in the test bank for my hip, progressive new textbook that your students should spend $100+ on."

Of course, most of the high profile education researchers in physics also act like religious missionaries. If you don't want to make the baby Jesus cry, you'll use the clicker.

"...we find that 29% of intro astronomy students fail to meet minimal standards for college level work. Of the remaining students, 41% achieve satisfactory work, 30% achieve mastery of the topics."

Pot... kettle... black? 41% + 30% "of the remaining students" would still leave 29% unaccounted for. A simple mistake, but it seems especially ironic, given the topic of the article--is the author himself ready for scholarship at the graduate level?

There's the meat of a really interesting paper in there.

Unfortunately, the author didn't use Rasch models to examine the data, which would have allowed him to talk intelligently about his question.

Perhaps an argument to leave this sort of data to those who investigate it routinely (i.e., social scientists).

41% + 30% "of the remaining students" would still leave 29% unaccounted for.

LOLS, basic math fail.

41+30+29=100. It's pretty clear that the 29% you're missing are the 29% that failed.

By Kris Rhodes on 01 Sep 2010

Greetings, blogosphere, I'm the author of the paper being dissected.

A couple of corrections: I'm an astronomer, not a physicist, but I work in a Physics Dept.

The data used contains no student ID information, just distributions and means. Therefore, I do not believe I need permission to publish the reduced data (certainly means and GPAs are available at the University website).

The "objective" grading scheme (I use quotes here, not in
the paper) is by no means perfect. And this kind of analysis
should be done in other studies. The results here are
(as stated in the paper) restricted to *one* type of
course at *one* type of University by *one* type of teacher,
clearly not a statistically significant sample. However,
in my defense, no one else seems to be doing this, I
would like to see more studies like mine.

Multiple choice exams suck. Of course, in comparing to SAT scores, we are comparing the same kinds of tests. In smaller classes I use essay exams, which always have the noble goal of getting students to write.

To Jeff: I had no expectations-- or can I only publish if I find results that do not match expectations? There are many possible explanations; start analyzing/writing, dude! My age excludes me from having opinions on education? Damn.

The dry writing style of science papers makes us sound cranky. Learn to deal with it (oops, that was cranky).

Yes, I ignored all the literature and social science analysis (Rasch models?). This was a data dump; I leave interpretation to others.

And, yes, there are typos in the paper (and this post)-- damn grammar nazis.

Oh, and to I.P. Freeley:

Seriously, you use RMP to evaluate professors? I use math in my classes; therefore, I am well hated by a majority of the students. Next term I'm going to show Star Wars films and give out all 'A's. Imagine my new RMP ratings!

@Kris - I think the issue is with the "of the remaining students" qualifier. As written, the sentence seems to refer to 41% and 30% of the remaining students, i.e., those who do not fail to meet minimum standards. It's likely that the author meant these statistics as percentages of the total sample, but writing it this way makes it unclear.
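For what it's worth, a quick arithmetic check (mine, not the paper's) shows why the whole-class reading is almost certainly the intended one:

```python
# Two readings of the abstract's 29/41/30 split (illustrative arithmetic only).
fail = 0.29

# Reading 1: 41% and 30% are fractions of the whole class.
print(round(fail + 0.41 + 0.30, 2))  # 1.0 -- everyone is accounted for

# Reading 2: 41% and 30% are fractions of the remaining 71%.
remaining = 1 - fail
print(round(fail + (0.41 + 0.30) * remaining, 2))  # 0.79 -- about 21% of the class unaccounted for
```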

Hi Jim, thanks for answering so many questions people had.

I'm not a huge fan of RMP; it's just easier than hunting around for UofO evaluations that may or may not be online.

Looking at the evaluations of astronomy professors teaching intro courses at the large public school where I got my PhD, they have a mean of 4.2/5, median 4.5/5. Having served as a TA for many of them, I can safely say there was more math than Star Wars. Indeed, the prof who was lightest on the math got the lowest score by far (2.8/5).

While there are plenty of flaws in student evaluations, I find that I tended to agree with the relative rankings students gave the profs.

It's a pet peeve of mine when professors ignore student evaluations. Your students are giving you a very clear message (they don't like you and don't feel like they are learning the material), but you feel like it's OK to ignore them because you "use math". Total cop-out.

Funny, it's one of my pet peeves when people use words such as "like" and "feel" rather than "performance" and "educate". I really don't care if my students like me; I care that they learn. So I ignore student evaluations and pay attention to test scores. As long as I am satisfied with their performance as measured by exams and homework, I consider the course to be successful. If my students want to "feel" good, they should go buy a kitten.

I believe that your view is more suited for skill-building learning, similar to high school. Universities are knowledge-building environments and should hold students to higher expectations. Of course, I recognize that my opinion is in the minority at most colleges.

I too teach introductory astronomy courses at a public university and have done so for the last 10 years. It's very likely that my student pool is similar to the one that the University of Oregon study is based on.

My experience suggests that the UO study is somewhat superficial, and, as others have pointed out, is a largely circular argument that is motivated by the particular grading scheme.

There is another dimension of student performance in a class that is missing in the study and missing mostly in the comments about the study. That dimension is student "work ethic".

In my introductory astronomy classes there is a fairly large amount of homework given out. This homework is not busy work, but each assignment is a mini-research assignment and some involve virtual apparati to measure real astrophysical data (stellar brightness, stellar spectra, galaxy color, etc). I can do this because I have a large amount of knowledgeable TA support for these classes because my University has invested resources here.

The homework constitutes 50% of the course grade. What I consistently find is the following:

a) Students with good GPAs do all of the homework and submit it on time. Their performance on the homework is variable, but generally good. These students have a work ethic.

b) Students with lower GPAs don't do all of the homework and are generally late with it. These students do not have much of a work ethic. Consequently, their course grade is lower and this keeps their GPA lower.

c) Actual exam performance (my exams are a mixture of MC questions and short answer/calculation questions) does not depend very much on GPA, although those students in the bottom 20% of the GPA distribution generally do poorly on the exam (those are the ones that probably never come to class either).

So the final course grade, from my method, represents some weighted combination of student work ethic and student ability.
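To make the weighting concrete, here is a tiny sketch. The 50% homework share is stated above; treating exams as the other 50% is my assumption.

```python
# Hypothetical course-grade weighting: homework is stated to be 50% of the
# grade; the assumption that exams make up the other 50% is mine.
def course_score(homework, exams):
    return 0.5 * homework + 0.5 * exams

# A diligent student with middling exams vs. a strong test-taker who skips homework
print(course_score(95, 70))  # 82.5
print(course_score(40, 85))  # 62.5
```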