Publishing code

Nick Barnes has an excellent opinion piece in Nature. And the comments are good too. There is a comment-on-the-piece by Anthony Fejes which I think is less good: too much like the kind of people who put you off cycling by insisting you have to wear a cycle helmet or walk. And you should read Nick's follow-up at CCC.

I've decided that I agree with Nick's overall argument: yes you should publish your code. Which means, everything that is yours, including the little fiddly bits. Even if no-one will understand them. Even if people will deliberately misunderstand them.

I have a number of quibbles and differences of emphasis, though. Some of them are taken up by the comments on the Nature piece (speaking of which I've just noticed CCC has an advisory committee featuring James "Come on if you think you're hard enough" Annan - just look at the picture. I've met him, you know. He has "Love" and "Hate" tattooed across his knuckles).

Mine are:

Publish it all, but only what is yours

Yes, you should publish everything, but only what's yours. I worked with HadCM3, for example, which isn't open, and the Met Office gets pretty picky about that kind of thing. So I would have to publish my "modsets" (a bizarre Cray-ism, a sort of reverse diff), for example:

*D CNTL089
     FOR I=1,IML-1


is an attempt at a sort of example. This one, presumably, re-writes the loop limit on a for loop. Perhaps it corrects an error, who knows. As a fragment, it would be nearly useless to anyone but the author. But publish it anyway, otherwise you have to think about what to publish and what not.
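For anyone who has never met a modset, the idea can be sketched in a few lines of Python. This is a toy under loud assumptions: it treats `*D LABEL` as "delete the line labelled LABEL and put the lines that follow the directive in its place", the function name `apply_modset` is made up for illustration, and the surrounding source lines (CNTL088, CNTL090) are invented. It is not the real Cray tool, nor a claim about how the Met Office applies its modsets.

```python
def apply_modset(labelled_source, modset_lines):
    """Apply toy '*D LABEL' directives to labelled source lines.

    Assumption for illustration only: '*D LABEL' deletes the line with
    that label and substitutes the lines that follow the directive.
    """
    replacements = {}
    current = None
    for line in modset_lines:
        if line.startswith("*D "):
            current = line[3:].strip()
            replacements[current] = []
        elif current is not None:
            replacements[current].append(line)

    patched = []
    for label, text in labelled_source:
        if label in replacements:
            patched.extend(replacements[label])  # swap in the new lines
        else:
            patched.append(text)                 # keep the original line
    return patched


# Hypothetical labelled source; only CNTL089 comes from the example above.
source = [
    ("CNTL088", "      X=0"),
    ("CNTL089", "      FOR I=1,IML"),
    ("CNTL090", "      X=X+A(I)"),
]
modset = ["*D CNTL089", "      FOR I=1,IML-1"]
patched = apply_modset(source, modset)
```

The point of the sketch is only that the modset makes no sense without the labelled source it is applied to, which is exactly why the fragment alone is nearly useless to anyone else.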

Ignore your institution

Your institution probably has a policy in place saying they retain IP rights to your stuff and you can't publish without permission. Ignore them. They will never notice, and anyway this is like the situation when PDFs first came out - initially, formally, the journals wouldn't let you put them on your own website. But everyone did anyway and now the journals don't care.

This is all independent of all that has gone before

Ignore all the stuff about demands for openness and transparency and stuff. Some of them are valid and some are well meaning and some are merely disguised attacks. It doesn't matter. Publish the code anyway.

But what matters is the process

One of the comments on Nick's piece notes that individual blocks of code, without any hint as to what the overall process is, aren't very useful. This is a good point, but it is far more a pointer to a hole in the science process than a criticism of Nick's ideas. Somewhere - and it was a post by Joel Spolsky but I can't find it - there was a list of 10 things that you really ought to have if you were a competent software company [Thanks to AW: there were 12 -W]. One of those was source control, without which you're a joke / doomed, but another was one-step release: you should be able to press one button / type one command and that would start a process that would create a clean copy of your released code, whilst wrapping up an archive of everything that went into it (CSR does this, of course). The same should be true, in slightly modified form, for science: if you're writing a paper (your release) there should, if at all practicable, be a one-step process that generates all the results and draws all the figures. OK, this is the ideal. And if part of your paper is about counting nematodes, clearly the process won't do that for you. But it *will* automatically draw graphs based on the data files you've carefully archived.
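A minimal sketch of what "one step" might look like for a paper, assuming a layout that is entirely hypothetical: a directory of archived CSV files, each with a header row and a single numeric column called `value`. The names `build_paper`, `results.json` and `manifest.json` are made up for illustration. Running the script recomputes every result and records a hash of every input, so the archive of "everything that went into it" comes for free. Drawing the figures would be one more call inside the same function; it is left out here only to keep the example to the standard library.

```python
import csv
import hashlib
import json
import statistics
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, for the input manifest."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_paper(data_dir, out_dir):
    """Regenerate all results from the archived data in a single call.

    Hypothetical layout: every *.csv in data_dir has a 'value' column.
    Writes results.json (summary statistics per file) and manifest.json
    (a hash of each input file) into out_dir.
    """
    data_dir, out_dir = Path(data_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results, manifest = {}, {}
    for csv_path in sorted(data_dir.glob("*.csv")):
        with csv_path.open(newline="") as handle:
            values = [float(row["value"]) for row in csv.DictReader(handle)]
        results[csv_path.name] = {
            "n": len(values),
            "mean": statistics.mean(values),
        }
        manifest[csv_path.name] = sha256_of(csv_path)
    (out_dir / "results.json").write_text(json.dumps(results, indent=2))
    (out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return results


# Demonstration on throwaway data: the counts themselves still have to be
# made by hand, but everything downstream of the data file is one command.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data"
    data.mkdir()
    (data / "nematodes.csv").write_text("value\n3\n5\n4\n")
    demo_results = build_paper(data, Path(tmp) / "out")
    demo_outputs = sorted(p.name for p in (Path(tmp) / "out").iterdir())
```

The design choice worth noting is that the script takes only the data directory as input: if a number in the paper cannot be regenerated from that directory alone, the process has a hole in it.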

But the key point remains

I'll end quoting Nick (because you all have the attention spans of mayflies, of course, so naturally I'm assuming that by the time you've read down this far you've forgotten what he said):

I am a professional software engineer and I want to share a trade secret with scientists: most professional computer software isn't very good. The code inside your laptop, television, phone or car is often badly documented, inconsistent and poorly tested.

Why does this matter to science? Because to turn raw data into published research papers often requires a little programming, which means that most scientists write software. And you scientists generally think the code you write is poor. It doesn't contain good comments, have sensible variable names or proper indentation. It breaks if you introduce badly formatted data, and you need to edit the output by hand to get the columns to line up. It includes a routine written by a graduate student which you never completely understood, and so on. Sound familiar? Well, those things don't matter.

The important point is not to suffer from the assumption that your code is too crap to publish. It isn't.

On ignoring the institution: The institution might not notice it but the sceptics would. Imagine Michael Mann does so and the grant was a federal grant. Cuccinelli*/Barton/Whitfield/Inhofe/Limbaugh, get wind of it...

They would use any reason to accuse malfeasance or some such because they just aren't interested in the science, or how it's done, or openness, or anything good, but want to stop or hinder the science in any way they can. Breaking such rules would just give them an opportunity.

IMHO.

* Especially Cuccinelli - Imagine if he were, by then, in Congress or the Senate.

[Hmm, that is a possible point. It is hard to see how they would twist openness, but I grant you they would try -W]

I like the Joel test. I've read his blog occasionally in the past and appreciate his clear writing and insights into the development process.

I took the test quickly for my one man operation and only got to nine. We fell down on the dedicated testers (hey, it's only me) but prior to releasing major functionality I do request that the people who will be using it test it. We don't have a pushbutton release process, mainly because I don't know Powershell/msdeploy well enough to write scripts to handle this and have no budget for the books/training to learn it. The schedule is a problem also, but mostly because of shifting requirements (this is mainly because the people who dictate this stuff don't really know what they want...).

But most organizations I've worked for run an 11 on his scale and the ones that didn't got better while I was there because of my efforts (introducing SCM/CM processes, bug databases, improved build processes, etc.). It was interesting to look at reactions from other engineers, especially in the '80s and early '90s, to the introduction of an actual process to the build/release cycle. The initial reactions to the proposals were mildly negative but turned positive afterwards because it actually made the job easier.

I did post a comment on the CCF blog linked above in which I made the point that scientists do not have to become software engineers to do their job, but a basic knowledge of and use of SCM and CM is a good thing. Hopefully scientists will take Nick's comments to heart.

By Rattus Norvegicus (not verified) on 16 Oct 2010 #permalink

The rumors about Windows code pre Vista were horrifying. Hence the complete rewrite.

One of the curious issues with very expensive commercial software is that it is sold to you as is, and they require you to then take out a very expensive maintenance contract that may, not will, just may, fix up the bugs you paid for in the first place.

Hi William,

Thanks for the mention! I have to say, though, that your analogy of my comment being too militant "much like the kind of people who put you off cycling by insisting you have to wear a cycle helmet" had me laughing. You see, in Vancouver, it is illegal to cycle without a helmet.

Anyhow, I'm not here to debate the merits of wearing a helmet, but rather to comment on your comment.

First, I agree that my tone was a bit strident. I debated not publishing the post for that reason alone. However, I don't think that is really a flaw with the argument I presented. There is often a valid reason for shouting about things that need to be changed.

Second, I completely disagree that you should ignore your institution's IP policies. You're far better off agitating for change - particularly if you're in a position where the university may come after you to recoup their losses. I'm not aware of that occurring off-hand, but biting the hand that feeds you is rarely good advice (unless that hand also is the hand that beats you down...)

[I think we're already into cycle-helmet wearing territory. If before you post your code you even need to *find out* what your institute's policy is, that is enough of a bar for most people. Discovering that your institute appears to prohibit disclosing code, based on some vague notion of IP, and then being obliged to agitate for change before you can do anything - this becomes a complete non-starter -W]

Third, your comment on openness and transparency being a mix of valid, "well meaning" criticism and "well disguised attacks" is completely off the mark - and rather shortsighted. What exactly is openness and transparency in this case if not the demand for code to be released in an organized manner?

[I don't think you understand what is going on. A bare minimum is accurate quoting: I said "Some of them are valid and some are well meaning and some are merely disguised attacks", not "well disguised attacks". It is certainly true that some people are, in good faith, calling for publication of code (NB is one obvious example). It is also true that a number of people do so in bad faith (McI, perhaps; many others on the septic side). So I'm puzzled by your assertion that this is "off the mark". Similarly, I don't understand your drift from openness to organised. Certainly, you can have openness without clear organisation. And you can have clearly organised opaqueness. So the two concepts have little connection, and certainly cannot be equated, as you appear to be doing -W]

Consider for a minute what chaos would ensue if every genome science centre in the world suddenly dumped every script they've ever written to the web on a random Tuesday morning. Would we really be any better off than we were the day before? Now imagine if every lab in the world that has ever published a paper using some piece of code did the same. How would you find the piece of code you need? How would you know which piece did what? Surely documentation is what makes code useful - and surely you're not advocating that documentation is unnecessary.

[I'm not advocating that documentation is unnecessary, but nor is it essential. Over time, standards will develop. Trying to impose them in advance would strangle openness. If people had mandated cycle helmets for all when the bicycle was first developed, it would never have developed -W]

So, while we both agree that widespread code release is good (and the vast majority of code IS bad, so we should do so without regard for quality), and as a big fan of Joel Spolsky, I agree with the vast majority of what he advocates, why is it that you feel that making our code public AND usable by everyone a bad thing?

[I don't agree that making it public and usable is bad. And you know full well that I don't. Your logic is poor. And, you need to learn not to put words into people's mouths: it is very impolite -W]

I got home from Brussels to find the CACM on my mat has an article saying the same thing....

[How ironic - the article itself is behind a paywall. Clearly the CACM don't believe in openness. Go on, do what I advocate - ignore policy, and post the text in a comment here. Think of it as a test of what you're advocating for researchers :-) -W]

Now, now, you're the one advocating ignoring policy. I'm advocating changing policy. Anyway, I haven't read it properly yet. Maybe tomorrow I will read it and summarize.

I opine that doing so helps defeat the purpose behind scientific replication. As experiments contain more computation, maybe just computational experiments on some big model, the replicators (if any) need to develop their own code (on another model, I hope).

Otherwise errors are simply duplicated elsewhere; doesn't help and rather hinders the advancement of the science.

By David B. Benson (not verified) on 17 Oct 2010 #permalink

I'm not used to having my comments edited to include a reply. I have to admit, I'm not a fan.

[It is the house style here. I find it convenient -W]

I also didn't mean to put words into your mouth. I simply followed what I believed to be your argument to its conclusion, and stated that I believed that to be your position.

[Nonetheless you erred in your "logic" and were offensive in the process. As must have been obvious to you at the time. You do not, I think, believe that "you feel that making our code public AND usable by everyone [is] a bad thing" could possibly describe my position. If you've arrived at what is clearly an error in your logic, it is better to step back and try and work out where you have gone wrong, rather than simply write down rubbish -W]

At any rate, I can accept if you don't like my tone, but I fail to see your point.

[I'm not sure I can state this any more clearly: raising the bar to release of code will lead to less code being released (the cycle helmets analogy again) -W]

Opaqueness and Openness both share much in common when it comes to code, but the point of my blog post was that openness is insufficient because you can still have opaqueness as a problem if you stop there. It almost seems as though you're arguing my point for me in your reply - so I'm still waiting for some clarity. If you don't disagree, why was my post less good?

[See above. "Openness is insufficient" is arguable if unexciting; arguing that openness isn't good enough, and that you need to document the code, places too large a burden on people. They will say, "OK, I can just about be bothered to release the code, but oh look, I get no credit for that, there is this bloke who says I need to document it too! And not only that, he still isn't happy with the documented code, I need to release updates too. Oh well, probably I won't bother then, I'll just get on with some work I actually get credit for" -W]

Just to make one point clear, we are clearly involved in different styles of institutions - there is a whole staff of people at my university's intellectual licensing office whose job it is to help staff and students find out the answers to IP related questions. If it's too much effort to send them an email about open source code, the bar is already set too low. I assume your institution makes it much harder for you to get that information.

[I work for CSR. We don't release our code, since it is commercial in confidence. But the point you are missing is that when I worked at, say, BAS, the problem was making requests that would disappear into some opaque management / bureaucracy layer. Who would want to bother, for no gain? -W]

I won't touch your comments on standards or on bicycle helmets, as I doubt we'll come to an agreement there.

Hi, I noticed your blog in my Google Alerts. I work at Stack Overflow Careers. We're actually compiling results to The Joel Test and analyzing based on Industry and Development Team Size. We'll release the results within the next couple of months. If you'd like to participate and see the results, here is the link:
https://www.surveymonkey.com/s/TheJoelTest

[Thanks. I tried to answer this, but got stuck at "5. Do you fix bugs before writing new code?" The answer is yes and no. No, we don't stop all development to fix old bugs. Yes we do have a backlog of stuff to fix. I tried to just skip the question but it wouldn't let me -W]

> most professional computer software isn't very good.

Hallelujah! I wish more people understood this point, then we wouldn't have to suffer the likes of the ClimateGate alphas judging people against standards of the Comp Sci academy let alone what actually happens in the real world.

@12: Damn straight! All through that whole brouhaha, I was constantly thinking to myself, "Have any of you folks complaining about commenting standards ever written so much as a single line of code in the real world?"

[The one I liked best - and someone actually said this with a straight face, or at least non-ironic mouse, was "you should write your code according to the rigorous processes adopted by major public-sector projects" -W]

"you should write your code according to the rigorous processes adopted by major public-sector projects"

!?

Why stop there? Leave everything in a taxi as well.

Nice to see 'the team' walking back their defense of not releasing code. Was really silly seeing the various comments on Tamino's site about why code shouldn't be released.

this is like the situation when PDFs first came out - initially, formally, the journals wouldn't let you put them on your own website. But everyone did anyway and now the journals don't care.

Yup.

And you can't use my copywritten font. Even in fingerquotes ...

Nick Barnes does a very effective job of politely knocking down all the commonly heard reasons for not publishing code and scripts. He's a very good and refreshingly honest writer. Mr. Fejes's piece is a bit worry-wartish and Zeno-ish, as in I shan't take a single step toward the bread box until I figure out how I can walk 1,000 miles.

The one I liked best - and someone actually said this with a straight face, or at least non-ironic mouse, was "you should write your code according to the rigorous processes adopted by major public-sector projects" -W

As someone with 10+ years of experience of working on major public-sector IT projects (from both inside and outside), I have to say that that's one of the funniest things I've ever heard.

Not that the private sector is any better, mind you... They're just better at keeping their incompetence secret.

The piece of code I most want to see released is the work the Muir Russell panel commissioned to reimplement CRUTEM3. :-)

[Didn't they just rip off yours? -W]

Well, the fact that MR bothered sending an email to a complete stranger about work related stuff while on holiday probably means he's still on holiday, or only just got back and is catching up ;)

That's interesting. I emailed the inquiry on the day of publication, asking for the code, and got no reply. I emailed again a month later and got a bounce message. Credit to JGC for putting the time in to chase it up via a review committee member. We'll have to see what happens.

Of course there's nothing questionable going on here: they got the same result as everyone else who has actually tried to get a result (including ourselves). It just seemed to be a good example of not-joined-up thinking.

A bit late but I was recently reminded of this:

"In the good old days physicists repeated each others experiments, just to be sure. Today, they stick to FORTRAN, so that they can share each others programs. And bugs." - Edsger Dijkstra