Nick Barnes has an excellent opinion piece in Nature. And the comments are good too. There is a comment-on-the-piece by Anthony Fejes which I think is less good: too much like the kind of people who put you off cycling by insisting you have to wear a cycle helmet or walk. And you should read Nick’s follow-up at CCC.
I’ve decided that I agree with Nick’s overall argument: yes you should publish your code. Which means, everything that is yours, including the little fiddly bits. Even if no-one will understand them. Even if people will deliberately misunderstand them.
I have a number of quibbles and differences of emphasis, though. Some of them are taken up by the comments on the Nature piece (speaking of which I’ve just noticed CCC has an advisory committee featuring James “Come on if you think you’re hard enough” Annan – just look at the picture. I’ve met him, you know. He has “Love” and “Hate” tattooed across his knuckles).
Mine are:
Publish it all, but only what is yours
Yes, you should publish everything, but only what’s yours. I worked with HadCM3, for example, which isn’t open, and the Met Office gets pretty picky about that kind of thing. So I would have to publish my “modsets” (a bizarre Cray-ism, a sort of reverse diff), for example:
*D CNTL089
FOR I=1,IML-1
is an attempt at a sort of example. This one, presumably, re-writes the loop limit on a for loop. Perhaps it corrects an error, who knows. As a fragment, it would be nearly useless to anyone but the author. But publish it anyway, otherwise you have to think about what to publish and what not.
Ignore your institution
Your institution probably has a policy in place saying they retain IP rights to your stuff and you can’t publish without permission. Ignore them. They will never notice, and anyway this is like the situation when PDFs first came out – initially, formally, the journals wouldn’t let you put them on your own website. But everyone did anyway and now the journals don’t care.
This is all independent of all that has gone before
Ignore all the stuff about demands for openness and transparency and stuff. Some of them are valid and some are well meaning and some are merely disguised attacks. It doesn’t matter. Publish the code anyway.
But what matters is the process
One of the comments on Nick’s piece notes that individual blocks of code, without any hint as to what the overall process is, aren’t very useful. This is a good point, but it is far more a pointer to a hole in the science process than a criticism of Nick’s ideas. Somewhere – and it was a post by Joel Spolsky but I can’t find it – there was a list of 10 things that you really ought to have if you were a competent software company [Thanks to AW: there were 12 -W]. One of those was source control, without which you’re a joke / doomed, but another was one-step release: you should be able to press one button / type one command and that would start a process that would create a clean copy of your released code, whilst wrapping up an archive of everything that went into it (CSR does this, of course). The same should be true, in slightly modified form, for science: if you’re writing a paper (your release) there should, if at all practicable, be a one-step process that generates all the results and draws all the figures. OK, this is the ideal. And if part of your paper is about counting nematodes, clearly the process won’t do that for you. But it *will* automatically draw graphs based on the data files you’ve carefully archived.
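For the sake of argument, here is roughly what I mean by one step, as a minimal Python sketch and nothing more. Everything in it is made up for illustration: the function names (`read_series`, `draw_svg`, `build_paper`), the idea that your archived data are two-column CSV files with a header row, and the choice of hand-rolled SVG so that it runs without any plotting library. It redraws one figure per data file and writes a manifest recording a checksum of each input, so you can later tell which data produced which graph.

```python
import csv
import hashlib
import json
from pathlib import Path


def read_series(csv_path):
    """Read (x, y) pairs from a two-column CSV with a header row (assumed layout)."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        return [(float(x), float(y)) for x, y in reader]


def draw_svg(points, svg_path, width=400, height=300):
    """Draw a bare line plot as SVG, scaled to fill the canvas."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_span = (max(xs) - min(xs)) or 1.0  # avoid dividing by zero on flat data
    y_span = (max(ys) - min(ys)) or 1.0
    coords = " ".join(
        f"{(x - min(xs)) / x_span * width:.1f},"
        f"{height - (y - min(ys)) / y_span * height:.1f}"
        for x, y in points
    )
    svg_path.write_text(
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        f'<polyline fill="none" stroke="black" points="{coords}"/></svg>\n'
    )


def build_paper(data_dir, out_dir):
    """The one step: redraw every figure and record what went into it."""
    data_dir, out_dir = Path(data_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = {}
    for csv_path in sorted(data_dir.glob("*.csv")):
        points = read_series(csv_path)
        if not points:
            continue  # nothing to plot in an empty file
        draw_svg(points, out_dir / (csv_path.stem + ".svg"))
        manifest[csv_path.name] = {
            "sha256": hashlib.sha256(csv_path.read_bytes()).hexdigest(),
            "n_points": len(points),
            "mean_y": sum(y for _, y in points) / len(points),
        }
    (out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

Calling `build_paper("data", "figures")` (both directory names are placeholders) is then the whole release process. A real paper would need rather more than a mean and a wiggly line, but the shape is the point: one entry point, no hand editing in between.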
But the key point remains
I’ll end by quoting Nick (because you all have the attention spans of mayflies, of course, so naturally I’m assuming that by the time you’ve read down this far you’ve forgotten what he said):
I am a professional software engineer and I want to share a trade secret with scientists: most professional computer software isn’t very good. The code inside your laptop, television, phone or car is often badly documented, inconsistent and poorly tested.
Why does this matter to science? Because to turn raw data into published research papers often requires a little programming, which means that most scientists write software. And you scientists generally think the code you write is poor. It doesn’t contain good comments, have sensible variable names or proper indentation. It breaks if you introduce badly formatted data, and you need to edit the output by hand to get the columns to line up. It includes a routine written by a graduate student which you never completely understood, and so on. Sound familiar? Well, those things don’t matter.
The important point is not to suffer from the assumption that your code is too crap to publish. It isn’t.