A post about “Engineering the Software for Understanding Climate Change” by Steve M. Easterbrook and Timbo “Not the Dark Lord” Johns (thanks Eli). For the sake of a pic to make things more interesting, here is one:
It is their fig 2, except I’ve annotated it a bit. Can you tell where? Yes, that’s right: I added the red bits. I’ve circled vn4.5, as that was the version I mostly used (a big step up from vn4.0, which was horrible. Anecdote: it was portablised Cray Fortran, which had automatic arrays, but real Fortran didn’t. So there was an auto-generated C wrapper around each subroutine that was passed such things, which did the required malloc. Ugh). vn4.5 was, sort of, HadCM3, though the versioning didn’t really work like that. Although that pic dates vn4.5 to 1999, that is misleading: it was widely used both within and outside the Met Office for a long time; outside, it was still being used when I left in 2007, partly because HadGEM (which as I recall was vn6.0/1, though I could be wrong) was much harder to use. Also the “new dynamics” of vn5.0, although in theory deeply desirable, took a long time to bed in.
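For anyone who hasn’t met them: an automatic array is a local scratch array whose size comes from the subroutine’s arguments at call time. It is standard in Fortran 90, but in the Fortran 77 era it was only a Cray extension, hence the generated C wrappers doing the malloc for the portable build. A minimal sketch (mine, not UM code):

```fortran
program demo
  implicit none
  real :: f(5) = (/ 1.0, 2.0, 4.0, 2.0, 1.0 /)
  call smooth(f, 5)
  print *, f
end program demo

subroutine smooth(field, n)
  implicit none
  integer, intent(in) :: n
  real, intent(inout) :: field(n)
  real :: work(n)   ! automatic array: storage appears on entry, vanishes on exit
  integer :: i
  work = field
  do i = 2, n - 1
    field(i) = 0.25*work(i-1) + 0.5*work(i) + 0.25*work(i+1)
  end do
end subroutine smooth
```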
Note: you should also read Amateurish Supercomputing Codes? and the interesting comments therein.
Anyway *this* post is to read their paper and point out the bits I disagree with, as well as any interesting bits I do agree with. You can assume I agree, weakly or strongly, with the rest. [Actually this post seems to have rambled. Never mind. Also I seem to have ended up quoting so much of the paper that you might just as well read it yourself :-)].
Scientists have additional requirements for managing scientific code: they need to keep track of exactly which version of the code was used in a particular experiment, they need to re-run experiments with precisely repeatable results, and they need to build alternative versions of the software for different kinds of experiments. For all these reasons, scientific teams tend to develop their own tools in-house, rather than relying on external providers.
None of this makes sense. Needing to keep track of what code was used for a given purpose is commonplace. Ditto re-running experiments (but see below). Building alternative versions: commonplace. Developing a climate model in house is obvious, because you have to (unless you use someone else’s. In fact more centres should do this; there are too many climate models in the world). Developing the tools to work with it… is less clear.
Computational scientists generally adopt an agile philosophy, because the requirements are generally not known up front, but they do not use standard agile process models. Such projects focus on scientific goals rather than software quality goals, and so use measures of scientific progress rather than code metrics to manage their projects.
As long as you read “agile” rather than “Agile”, this is fair enough; I think it would be fairer to say that they adopt no philosophy at all. But “the requirements are generally not known up front” is twaddle. HadCM3, for example, had a very well known requirement up front: to be a global climate model.
Prior to this study, we had investigated the Met Office’s code management practices. Their use of state-of-the-art configuration management and issue tracking indicated we would be able to focus on the core scientific practices, and avoid the accidental complexity that arises from poor code management.
This doesn’t sound right. They used to use (at least in vn4.5, probably in 6.0) the bizarre Cray “modset” method for code configuration, which was arcane. Apparently they are now on Subversion, which is OK, but isn’t s-o-t-a. As for issue tracking: this brings up one of the issues I was going to raise: a proper bug database referenced back to code changes. One thing you can’t do at all easily in HadCM3 is find out who actually wrote each line of code, and why. Later on they say “The current release of the UM is about 830,000 lines of Fortran source code. The code was maintained using CVS for a long time, but two years ago [i.e., in 2006] the Met Office adopted a new code management system, FCM, based on the open source tools Subversion and Trac”. This is news to me. Perhaps they did somewhere internal, but from the outside it all looked like modsets, with no hint of CVS.
Interestingly, the time taken to perform a climate run hasn’t changed over the life of the UM, because climate scientists take advantage of increases in supercomputer power to increase the resolution and complexity of the models. A century-long climate simulation typically takes a couple of months to run on an NEC SX-8. Scientists more often run the models for just 1-2 decades of simulation, which can still take a couple of weeks, depending on the model configuration.
Which is mostly true. Also:
Met Office staff play a number of distinct roles, organised like the ‘onion’ model often observed in open source projects. At the core, about twelve people from the two IT support teams (Met R&D and CR) control the acceptance of changes into the trunk of the UM. They act as experts for integration and platform-specific issues. Many of them have scientific backgrounds, with PhDs in numerical computing or related fields. At the next layer, about 20 of the more senior scientists act as code owners, each responsible for specific sections of the UM (e.g. atmosphere, ocean, boundary layer, dynamical core, etc). Code owners are domain experts who keep up to date with the relevant science, and maintain oversight of developments to their sections of the model. Membership in these two layers rarely changes.
sounds right. They then talk about “bit reproducibility”, which may not mean much to people not steeped in this stuff, but is interesting, so I’ll expand on it. A computer program is deterministic (if not broken), but the weather isn’t. However, a given climate model, if fed with exactly the same inputs, should be re-runnable to produce *exactly* the same outputs, down to the lowest bit (and if it isn’t reproducible down to the lowest bit it will rapidly diverge: there is a good illustration of this in a 2005 RC post by JA and me). That is moderately trivial if the prog runs on a single processor, but less obviously true if the results have to be identical when the prog runs on multiple processors (so, for example, any averaging of numbers will need to happen in the same order every time), and even less obvious if it has to be true on an arbitrary number of processors. But the model manages it (unless you run with the “faster but non-reproducible code” option; generally a bad idea, because then if your model crashes you will never ever find out why). So then you can have code changes which in theory should not break bit-reproducibility (and can be tested as such). Of course, even the smallest scientifically interesting code change *will* inevitably break bit-repro with earlier models. And compiler upgrades tend to break it too. To validate other changes, you tend to need long-term (~decade) averages, to get rid of the weather noise.
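To see why the order of the averaging matters, recall that floating-point addition is not associative: regrouping a sum can change the last bits (or, in this deliberately extreme toy example of mine, rather more than the last bits):

```fortran
program sum_order
  implicit none
  real :: a(3), left, right
  a = (/ 1.0e8, -1.0e8, 1.0 /)
  left  = (a(1) + a(2)) + a(3)   ! cancellation first, then add 1.0: gives 1.0
  right = a(1) + (a(2) + a(3))   ! the 1.0 is absorbed into -1.0e8: gives 0.0
  print *, 'left-to-right:', left, '  regrouped:', right
end program sum_order
```

A parallel sum whose grouping depends on the processor count will therefore, in general, not be bit-identical between configurations unless you force a fixed order.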
The study says it asked five questions:
1. Correctness: How do scientists assess correctness of their code? What does correctness mean to them?
2. Reproducibility: How do scientists ensure experiments can be reproduced (e.g. for peer review)?
3. Shared Understanding: How do scientists develop and maintain a shared understanding of the large complex codes they use? E.g. what forms of external representation do they use when talking about their models?
4. Prioritization: How do scientists prioritize their requirements? For example, how do they balance between doing what is computationally feasible and what is scientifically interesting?
5. Debugging: How do scientists detect (and/or prevent) errors in the software?
…but it doesn’t really answer them (except for reproducibility, which has a trivial answer). Instead it answers some easier related questions. I think I should try to say something about correctness, since it is such an exciting topic. Correctness in the dynamical core is in theory verifiable in some limited situations, by comparison to known solutions. But for the whole GCM this isn’t even close to possible. You are left with a combination of comparison to previous model runs, comparison to observations, and process-based studies.
Comparison to previous runs is the easiest: you have a previous long control integration known to be good, or at least passable, and you can just check your own against this. Tools are available to do it automatically. If you think all you’ve done is make a minor change to the albedo of East Antarctica, you can do a 10-year run and check that the world’s climate hasn’t dramatically shifted. Arguably that doesn’t check your change is *right*, but it is a coarse check that you haven’t broken much.
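The shape of such a comparison tool is roughly as follows. This is a sketch only (the grid size, tolerance and stand-in data are all made up; it is not any actual Met Office tool):

```fortran
program compare_runs
  implicit none
  integer, parameter :: nx = 96, ny = 72   ! an illustrative grid, not the UM's
  real, parameter :: tol = 0.5             ! e.g. 0.5 K drift in decadal-mean temperature
  real :: control(nx,ny), test(nx,ny)
  integer :: nbad

  ! In real life both decadal-mean fields would be read from the runs'
  ! output files; stand-in data here so the sketch runs.
  call random_number(control)
  test = control

  nbad = count(abs(test - control) > tol)
  if (nbad > 0) then
    print *, 'WARNING:', nbad, 'points drifted by more than', tol
  else
    print *, 'new run agrees with control to within tolerance'
  end if
end program compare_runs
```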
Comparison to obs: also worth doing, but since the model will always have biases against the obs (which won’t be totally accurate anyway), it is not as useful as you might think.
Process studies: check that the new snowfall routine you’ve just put in really does increase the proportion of snow to rain near 0 °C in a given environment. Or some such. Lots of effort; this does the “is it right?” bit not checked above, but doesn’t check that you haven’t broken the world.
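For concreteness, a toy version of such a process check, with an entirely made-up linear rain/snow partition (the thresholds and the function are illustrative; they are not the UM’s microphysics):

```fortran
program snow_check
  implicit none
  real :: f_cold, f_warm

  f_cold = snow_fraction(-1.0)
  f_warm = snow_fraction(+1.0)
  print *, 'snow fraction at -1 C:', f_cold, '  at +1 C:', f_warm
  if (f_cold <= f_warm) print *, 'FAIL: partition does not favour snow when cold'

contains

  ! Linear ramp between all-snow below -2 C and all-rain above +2 C;
  ! purely illustrative numbers.
  real function snow_fraction(t_celsius)
    real, intent(in) :: t_celsius
    snow_fraction = max(0.0, min(1.0, (2.0 - t_celsius) / 4.0))
  end function snow_fraction

end program snow_check
```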
The release schedule is not driven by commercial pressure, because the code is used primarily by the developers themselves, rather than released to customers.
This is a bit of a porkie. IPCC schedules matter. Ditto:
The developers all have “day jobs” – they’re employed as scientists rather than coders, and only change the model when they need something fixed or enhanced. They do not delegate code development tasks to others because they have the necessary technical skills, understand what needs doing, and because it’s much easier than explaining their needs to someone else.
but also because *there is no-one else*. Incidentally, if any of this sounds like me angling for a highly-paid job as some kind of software consultant to the Met Office: I’m not. Finally:
Mapping their concepts onto terms used in the software engineering literature may be problematic. For example, it was hard to distinguish “software development” from other aspects of the scientific practice, including data analysis, theorizing, and the development of observational datasets. From a scientific point of view, the distinction between changing the code and changing the parameters is artificial, and scientists often conflate the two — they sometimes recompile even when it shouldn’t be necessary. Therefore, characterizations of model evolution based purely on source code changes miss an important part of the picture.
also sounds right.
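Incidentally, the clean way to keep parameters and code apart in Fortran is to read the tunables from a namelist at run time, so that changing them needs no recompile. A minimal sketch (the namelist, variable, and file names here are all hypothetical):

```fortran
program read_params
  implicit none
  real :: albedo_ice = 0.6   ! compiled-in default, overridable at run time
  namelist /run_params/ albedo_ice
  integer :: u, ios

  open(newunit=u, file='run_params.nml', status='old', iostat=ios)
  if (ios == 0) then
    read(u, nml=run_params)
    close(u)
  end if
  print *, 'using albedo_ice =', albedo_ice   ! changed without recompiling
end program read_params
```

The UM did, as it happens, use namelists extensively; the conflation the authors describe is more about habit than any technical necessity to recompile.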
[Update: if you're having problems commenting on this post, please mail me.
Minor thought: one of the problems with the GCMs is Fortran. Not because it is totally unusable, but in part because no-one from SE wants to go near it. One of the reasons I left was that I didn't want to keep writing Fortran for the rest of my life; and (looking at the job ads) it was pretty clear that it was a very restrictive career move.]