Janet Stemwedel brings her expertise in science and science ethics to bear on the contents of the emails stolen from the University of East Anglia. As with all of her work, the whole thing is worth reading, but I want to pick up on this claim:
If you don't thoroughly document your code, no one but you will have a clear understanding of what it's supposed to do. Actually, if you don't thoroughly document your code, you yourself, at a later moment in time, might not have a clear understanding of what it's supposed to do. (Then there's the question of whether, when executed, it actually does what it's supposed to do, but as far as I can tell, that's not a central issue in the discussions of ClimateGate.)
Sadly, no. First, the criticism of the code focuses exclusively on the comments, not on what the code actually does. As Tim Lambert observes, the supposedly damning comments about artificial corrections generally attach to code which is itself commented out. So no harm, no foul.
But even Janet's premise here is false. It's certainly the case that young programmers are told to comment copiously, just as young scientists are taught that there's no human element to the scientific process. These are the little white lies we tell in hopes of papering over the messy truth of how these fields actually operate.
In reality, "real programmers don't comment code. If it was hard to write, it should be hard to understand." Or as new Scibling Andrew Gelman (a statistician and statistical programmer) puts it: Don't comment code ("I'd heard this before, but good advice is typically worth repeating").
That link goes to a programmer in language processing, who insists that commenting code is futile. "Professional coders don’t comment their own code much," he explains, "and never trust the comments of others they find in code. Instead, we try to learn to read code and write more readable code." The author continues:
Comments Lie
The reason to be very suspicious of code comments is that they can lie. The code is what’s executed, so it can’t lie. …
I don’t mean little white lies, I mean big lies that’ll mess up your code if you believe them. I mean comments like “verifies the integrity of the object before returning”, when it really doesn’t. …
Another common reason is that the code author didn’t actually understand what the code was doing, so wrote comments that were wrong.…
Most Comments Considered Useless
The worst offenses in the useless category are things that simply repeat what the code says.…
Eliminate, don’t Comment Out, Dead Code
…
This isn't to say comments have no value. In my dissertation work, I'd leave comments in my code reminding me when I'd done something non-idiomatic in my programming language, or where I kept trying to optimize a given code block in the same way, always to have it fail. "Don't do that stupid thing you always try to do here" is not canonically perfect code documentation, but it works.
The file from which this material was all drawn is a log of a programmer brought in to sort out software written years earlier by someone else. It was crufty and odd and confusing, and his comments reflect that confusion. I'd wager dollars to donuts that the comments of everything from Linux to Windows to Word and Photoshop are filled with equally frustrated comments as people try to understand why feature X causes feature Y to crash. It's a natural part of programming, and the idea of well-commented code is a fiction, like the idea that science is an enterprise where data are shared freely and widely in a world of internal harmony devoid of personality clashes and grudges. It's a story we tell to students to get them to see the big picture, and something they need to unlearn when they need to do science (or programming) right.
Not commenting is bad. Commenting profusely is bad. They are extremes and extremes are bad.
There are things to comment and there are things to not comment.
One thing *NOT* to comment is the i++ when enumerating the indexes in an array. One thing *TO* comment is doing i+=3 when enumerating the indexes in an array.
One thing *NOT* to comment is an array of the months of the year. One thing *TO* comment is "random" constants.
Comments are for understanding the code they are commenting. Sometimes it's all you have (e.g., libraries).
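To make the i++ versus i+=3 distinction concrete, here is a minimal sketch in Python. The flat (x, y, z) layout and the function names are assumptions invented for illustration; nothing here comes from the code under discussion.

```python
def x_values(flat_points):
    """Return the x coordinate of every point in a flat [x, y, z, x, y, z, ...] list."""
    xs = []
    i = 0
    while i < len(flat_points):
        xs.append(flat_points[i])
        # Step by 3, not 1: each point occupies three consecutive slots (x, y, z).
        i += 3
    return xs


def total(values):
    """Sum a list the long way, stepping through every index."""
    result = 0
    i = 0
    while i < len(values):
        result += values[i]
        i += 1
    return result
```

The unit step in total() carries no comment because a comment could only restate the code. The stride of 3 gets one because the reason for it lives in the data layout, not in the loop.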
Some programmers are idiots. Some of those idiot programmers have blogs. Quoting one such doesn't make your case, and if we're going to make arguments to authority I'll gladly put my 20+ years of experience writing the most difficult kinds of code up against "lingpipe" any day of the week. Another thing that doesn't make your case is using examples of poor commenting style, or failure to maintain comments along with the code. Just because something can be done poorly doesn't mean it shouldn't be done at all. Such cherry-picking, from a scientist, in this of all contexts? For shame.
As one of the commenters in the thread you linked to points out, "just read the code" doesn't work because sometimes the code is misleading too. Besides the obvious case of function or variable names that no longer accurately describe the code, it might fail to represent the programmer's intent in any number of ways. Just in the past week I've flagged such failures due to misunderstandings about byte order, integer size, signed/unsigned integers, volatile variables, and non-local exits. I've also seen apparent errors in each of these categories which, upon closer examination, turned out to be non-issues or even intentional behavior. A few comments might have helped to disambiguate these, and would have saved me time.
In any language, there are assumptions and expectations that cannot be expressed in the language itself (even as asserts or similar) but that can be critically important. Fixing bugs in somebody else's code is difficult, and even if a comment is out of date the mismatch between it and the code it supposedly describes can itself be quite informative. Any decently experienced programmer will have encountered the scenario of fixing a bug only to find that they've created an even worse one because their original fix violated the intent (i.e. embodied assumptions) of the code where it was made. This often leads to my favorite kind of comment, which is the "I already tried XXX and it didn't work because YYY" kind. Unlike most kinds of comments, these retain their usefulness even as the code around them changes. What didn't work then probably won't work now either, and saving your successors the trouble of finding that out the hard way is worth it.
You might think that the sorts of information I'm talking about should be in specs instead of code. Sometimes that's true. On the other hand, there's a very high degree of overlap between those who scorn comments and those who scorn specs as well so it's not much of an argument. The important thing is to describe your intent *somewhere* in English. Every decent programmer knows that such expressions of intent must be read with a grain of salt because the passage of time might have made them inapplicable, but history still has value in computing as much as elsewhere.
Leaving out information because it might be misinterpreted is exactly the wrong thing to do, whether it's information in code or climate data. The problem was not with the notion of including comments, but - as in the emails - the level of professionalism in these particular examples.
I've been writing software since 1966. Adding comments to code is a very good idea.
But...
When code is maintained and enhancements are added, the comments are often left unmodified and over long periods of time the comments for actively maintained code tend to not resemble what the software actually does.
Code inspections help.
Colin: Why would you comment "i+=3;"? It's pretty obvious what it does, and if someone reading the code can't figure out why you're incrementing by three, they should think seriously about why they're messing with the code. Comments are, of course, invaluable when you start programming like this: http://www.cs.utah.edu/~elb/folklore/mel.html
But there's no excuse for doing that.
Jeff Darcy: My point is less the argument of authority (though I'll cop to having made at least a feint in that direction) than describing the reality of programming, and the fact that whatever they should do, most programmers comment poorly, leave poor documentation, and often don't program terribly clearly. It's the worst of all worlds, and doubly so in science where projects are often programmed by a single researcher who is probably assuming that the published papers on a software system will count as the documentation, and commenting will just distract from the important research. Possibly foolhardy, but hardly an ethical or scientific lapse. If it's a programming lapse, it's one that nearly all programs and programmers are prone to, and again, hardly grounds for targeting climate science.
As to professionalism, this gets back to a fuzzy area. There's an ideal of science, in which everyone sits around in tweed and bowties and treats even their mortal rivals as princes among men. The reality is that scientists can be asses to one another, just as in any other human endeavor. Indeed, some of the most brilliant scientists have the fewest people skills, and are the most prone to generally dickish behavior. The measure of their worth isn't how they behave in private email or at the bar during a conference, but how their papers look, and whether their work holds up to scrutiny.
Josh, i+=3 is obvious what it does but why 3 instead of 4 or -237? Arguing that the reader has to digest 1000 lines of code (bad form goes with use of unexplained, arbitrary numbers like 3) to understand "why 3?" is absurd and naive.
Arguing that you have to understand the sum of the whole to work on a part of it flies contrary to the point of abstraction. Abstraction being the point of writing a programming language. (I'm just positive that computational biologists would much rather have a ticker tape and two stacks to fold proteins.)
Why bother with the Dewey Decimal System when you can just wander the entire library? The books explain themselves. Why bother with maps when you can just drive around for days? The roads explain themselves. If a librarian mis-shelves a book or the cartographer gets a street or two named wrong...well, I guess just forget the whole thing, eh?
The best code I have ever read is well/properly-documented. Not because the comments were mind blowing but because the act of explaining the code well goes with writing good code and having good organization. It is truly awe-inspiring to find such a rare gem.
I'm with Josh. Comments generally suck. I've worked on a wide variety of projects, from just my own stuff to teams of a dozen or so people, to monstrously huge projects (linux kernel, openoffice, etc.). Comments have screwed me over more times than I can count, and I can't remember more than one or two times they have ever told me anything interesting.
Code gets executed. Comments either re-state what the code already says, or say something different, and you don't know which without reading the code. In the former case, you already read the code, so now reading the comment is redundant. In the latter, you now know the programmer (or commenter) was confused, but nothing more.
And I agree with Josh that one reasonable exception is just hints and pointers to save time. Like "this is unusual, but go look in this other function and it will make sense".
The unit tests are the documentation.
Ok, not always. Professional software engineers have a lot of ideas, techniques, and tools for supporting and maintaining documentation. You can have automatic checks that all public methods (I'm thinking Java here, but ..) have documentation of all parameters and exceptions. You can check code coverage of unit tests. You can have a QA process that monitors these and assigns resources to fixing problems. And these things will work, at least up to a point.
The problem is that most scientific software is written by scientists. I know, I used to be one, and I cringe at the thought of the monstrosities I used to produce.
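As a rough illustration of "the unit tests are the documentation", here is a small Python sketch. The parse_version function and its tests are hypothetical examples, not taken from any project mentioned here; a runner such as pytest would collect the test_ functions.

```python
def parse_version(text):
    """Split a dotted version string such as "1.2.3" into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def test_dotted_string_becomes_tuple_of_ints():
    # The test name and body state the intended behaviour in executable form.
    assert parse_version("1.2.3") == (1, 2, 3)


def test_single_component_is_allowed():
    assert parse_version("7") == (7,)


def test_non_numeric_component_is_rejected():
    # int("x") raises ValueError, and the test records that this is intended.
    try:
        parse_version("1.x")
    except ValueError:
        return
    raise AssertionError("expected ValueError for a non-numeric component")
```

Unlike a comment, each of these statements fails loudly when the code stops matching it, which is the sense in which tests can stand in for documentation.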
Josh:
I'm not saying that scientists or academics should pretend to be polite or anything. As someone who works often in the Linux kernel, I see a lot ruder behavior every day than has been exhibited in this affair. Demeanour counts for nothing in software. However, I still think it's unprofessional to pollute a technical information source with irrelevant or subjective commentary, whether it's in email or code. People should stick to the relevant facts in either case.
Kevin:
Usually you're looking at a piece of code for a reason, such as to fix a bug. The code in front of you is unlikely to be the only thing that's relevant - there's probably other code as well, comments in either, design documents, bug reports, patch/checkin trails, etc. Establishing the right context, sorting through the information available to determine the parts that are relevant and consistent, is an essential part of the programming process. Anybody who can't do that shouldn't be drawing pay as a programmer, and that includes anybody who can't deal with bad comments. The fact that you admit "one reasonable exception" - which actually covers a great swath of potential comment scenarios - shows that you probably are capable of making distinctions between good and bad comments. If you don't like standardized file/function comment headers, fine. I don't either. However, going from there to condemn comments in general (except your own, I'm sure) is throwing the baby out with the bathwater.
My view is that interfaces should be documented with comments so that the semantics of the interfaces are clear. The implementation should have few comments, but the implementor needs to invest enough care to make the code both readable and self-verifiable. Assertions are extremely helpful, since they are executable and provide self-verification, and often provide to someone reading the code more information than a comment would.
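A minimal sketch of that split, in Python: the interface carries the documentation and the implementation carries assertions instead of comments. The clamp function is a hypothetical example chosen only because its contract is easy to state.

```python
def clamp(value, low, high):
    """Return value limited to the closed interval [low, high].

    Callers must pass low <= high.
    """
    # Precondition from the docstring, checked every time the function runs.
    assert low <= high, "clamp() called with low > high"
    result = max(low, min(value, high))
    # Postcondition: the result always lies inside the interval.
    assert low <= result <= high
    return result
```

The docstring tells a caller what the function promises; the assertions tell a maintainer the same thing in a form that cannot silently drift away from the code.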
My $0.03 --
Most of the coders I work with/respect do minimal commenting, and only to point out something non-intuitive in a code block -- if it's intuitive it's self documenting.
Comments in code don't usually add much value; DIFFs and comments on Source Control check-ins are usually more valuable -- I changed this code to fix this problem.
Agree with deleting dead code rather than commenting out - if you need the dead code, grab it from a previous rev of source control, although if it's dead, it's probably dead for a reason.
Hot button.
1: code is injunctive. Not descriptive. Writing "self-documenting" code has been a chimera since the first stored program computers. At best it is only descriptive about the 'what' it is doing. Except for trivial stand-alone functions, the 'why' for most code can only be understood in the context of both the reason for the program in the first place and the architecture chosen to achieve it. Code expresses neither, and the scope of a screen of code is typically only .01% of the entire code body (50 lines in a 500,000 line program!).
2: Yes, comments lag code. Requirement specs lag comments lag code. Help/man documentation lags them all unless something like Doxygen is used to extract man from source. But trying to spelunk through ten years of accumulated cruft without comments is a f**g nightmare. Even if they are wrong, they at least reveal some historical strata. At least, comment the API if not how it is achieved. Man page minimal.
That being said, one can help one's successors without being overly wordy:
1: pay attention to partitioning. Re-examine it and refactor whenever it gets murky. Pays for itself in the long run. Hierarchically chunked plan, not flat.
2: If a unit of work does not fit in one screen, it is probably not just one unit. The middle bits probably need their own life.
3: (related) Resource allocation and de-allocation must fit in the one screen so you can see it is right and not forgotten. Preferably in the same function, or at least constructor and destructor code are together.
4: Comment at least the API contract for every function. Use man style as minimal.
5: Do NOT be clever. It is almost always the w0rng thing to do on many levels.
- you know it if it takes more comments than the code to explain
- even you in a different mind-set six months later won't understand it
- in practice the system may not even perform better even though you were locally clever so it would. Clever code has hidden overheads.
6: Along with not being clever, avoid hidden dependencies and side-effects. Do not take advantage of happenstances.
7: Do the comments first. Outside, man style. Inside, pseudo-code.
8: try, try, try, to keep the comments in sync with the code.
9: pixels. Even more pixels. Get a wall of them if you can.
wrt 5, I have two primo existence proofs for what I said, both having to do with packet buffer handling in severely resource-restricted embedded datacomms boxes. In fact, both had to do with the same issue - keeping the oldest packet buffer full by moving data from later packets forward if there is room. Seems like a good reason, eh? Well,
In the first instance, the reason was to minimize the number of packets in a dispatch queue so we did not run out of buffers (4K RAM footprint!). I used 'clever' folded self-modifying assembly code that exploited multi-byte instructions that used immediate operands only. The follow-up team was not able to keep it up with packet buffer format evolution so scrapped it and wrote one they could understand. Yes, it was bigger, and yes, it did not compact the queue - in its peephole. But every other way it was a win. Faster, buffers got dispatched quicker so the queue length did not grow, maintainable, etc.
In the second, it was to maximize the number of messages per packet on the wire to keep the bandwidth utilization up. I worked for weeks trying to get the throughput of the system to match the requirements. No luck. Then I scrapped the whole idea and just sent what was already waiting (had to deliver something). Yes, lower wire utilization, but system throughput went up because the cpu had much less work to do. Reliability went up because task level stopped messing with interrupt level constructs (interlocks went along with this so IRQ response time also improved dramatically). And the comments got way easier to be kept correct and understandable. This one never needed follow-up maintenance, it just worked from then on.
KISS really works. Even when you don't think it will.
Colin: I understand that you're arguing that the use of 3 needs a comment, but if the goal is abstraction then you shouldn't comment three, but make a variable "static const int matrix_rows = 3;" or whatever, thus making the code more abstract, and obviating the need for a comment. I doubt that a reader needs to understand every part of every library to see why an index variable needs to be incremented by an unusual number, and if they do, it speaks poorly of the program's overall design.
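The named-constant idea might look like this in Python. MATRIX_ROWS mirrors the matrix_rows name suggested above; the column-by-column storage and the first_row function are assumptions made up for the example.

```python
# Hypothetical layout: the matrix is stored column by column in one flat list,
# so consecutive columns start MATRIX_ROWS entries apart.
MATRIX_ROWS = 3


def first_row(flat_matrix):
    """Return the first-row entry of each column in the flat list."""
    return [flat_matrix[i] for i in range(0, len(flat_matrix), MATRIX_ROWS)]
```

The stride no longer needs a comment at the point of use, though the constant's definition still explains the layout once, so the explanation has moved more than it has disappeared.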
Jeff Darcy: I don't mean to suggest that comments are uniformly bad, but I think it's totally backward to suggest that subjective commentary in code is inappropriate. Especially on large, multi-programmer projects, comments in the code create a certain culture, diverting people from programming styles which might work but which would not fit the preferred approach of the manager. And reminders not to do that dumb thing people always want to do with a certain algorithm.