Engineering the Software for Understanding Climate Change

By stoat on June 19, 2010.

A post about "Engineering the Software for Understanding Climate Change" by Steve M. Easterbrook and Timbo "Not the Dark Lord" Johns (thanks Eli). For the sake of a pic to make things more interesting, here is one:

It is their fig 2, except I've annotated it a bit. Can you tell where? Yes that's right, I added the red bits. I've circled vn4.5, as that was the version I mostly used (a big step up from vn4.0, which was horrible. Anecdote: it was portablised Cray Fortran, which had automatic arrays, but real Fortran didn't. So there was an auto-generated C wrapper around each subroutine passed such things, which did the malloc required. Ugh). vn4.5 was, sort of, HadCM3, though the versioning didn't really work like that. Although that pic dates vn4.5 to 1999, that is misleading: it was widely used both within and without the Met Office until, well, outside it was still being used when I left in 2007, partly because HadGEM (which as I recall was vn6.0/1, though I could be wrong) was much harder to use. Also the "new dynamics" of vn5.0, although in theory deeply desirable, took a long time to bed in.

Note: you should also read Amateurish Supercomputing Codes? and the interesting comments therein.

Anyway *this* post is to read their paper and point out the bits I disagree with, as well as any interesting bits I do agree with. You can assume I agree, weakly or strongly, with the rest. [Actually this post seems to have rambled. Never mind. Also I seem to have ended up quoting so much of the paper that you might just as well read it yourself :-)].

Scientists have additional requirements for managing scientific code: they need to keep track of exactly which version of the code was used in a particular experiment, they need to re-run experiments with precisely repeatable results, and they need to build alternative versions of the software for different kinds of experiments [13]. For all these reasons, scientific teams tend to develop their own tools in-house, rather than relying on external providers [2].

None of this makes sense. Needing to keep track of what code was used for a given purpose is commonplace. Ditto re-running experiments (but see below). Building alternative versions: commonplace. Developing a climate model in house is obvious, because you have to (unless you use someone else's. In fact more centres should do this; there are too many climate models in the world). Developing the tools to work with it.. is less clear.

Computational scientists generally adopt an agile philosophy, because the requirements are generally not known up front, but they do not use standard agile process models [4]. Such projects focus on scientific goals rather than software quality goals, and so use measures of scientific progress rather than code metrics to manage their projects

As long as you understand "agile" not "Agile" then this is fair enough; I think it would be fairer to say that they adopt no philosophy at all. "Because the requirements are generally not known up front" is twaddle. HadCM3, for example, had very well known requirements up front: to be a global climate model.

Prior to this study, we had investigated the Met Office's code management practices [13]. Their use of state-of-the-art configuration management and issue tracking indicated we would be able to focus on the core scientific practices, and avoid the accidental complexity that arises from poor code management.

This doesn't sound right. They used to use (at least in vn4.5, probably in 6.0) the bizarre Cray "modset" method for the code configuration, which was arcane. Apparently they are now on Subversion, which is OK, but isn't s-o-t-a. As for issue tracking: this brings up one of the issues I was going to raise: a proper bug database referenced back to code changes. One thing you can't do at all easily in HadCM3 is find out who actually wrote each line of code and why. Later on they say "The current release of the UM is about 830,000 lines of Fortran source code. The code was maintained using CVS for a long time, but two years ago [i.e., in 2006] the Met Office adopted a new code management system, FCM, based on the open source tools Subversion and Trac [13]." This is news to me. Perhaps they did somewhere internal, but from the outside it all looked like modsets, no hint of CVS.

Snippet:

Interestingly, the time taken to perform a climate run hasn't changed over the life of the UM, because climate scientists take advantage of increases in supercomputer power to increase the resolution and complexity of the models. A century-long climate simulation typically takes a couple of months to run on an NEC SX-8. Scientists more often run the models for just 1-2 decades of simulation, which can still take a couple of weeks, depending on the model configuration.

Which is mostly true. Also:

Met Office staff play a number of distinct roles, organised like the 'onion' model often observed in open source projects. At the core, about twelve people from the two IT support teams (Met R&D and CR) control the acceptance of changes into the trunk of the UM. They act as experts for integration and platform-specific issues. Many of them have scientific backgrounds, with PhDs in numerical computing or related fields. At the next layer, about 20 of the more senior scientists act as code owners, each responsible for specific sections of the UM (e.g. atmosphere, ocean, boundary layer, dynamical core, etc). Code owners are domain experts who keep up to date with the relevant science, and maintain oversight of developments to their sections of the model. Membership in these two layers rarely changes

sounds right. They then talk about "bit reproducibility", which may not mean much to people not steeped in this stuff, but is interesting, so I'll expand on it. A computer program is deterministic (if not broken) but the weather isn't. But a given climate model, if fed with exactly the same inputs, should be re-runnable to produce *exactly* the same outputs, down to the lowest bit (and if it isn't reproducible down to the lowest bit it will rapidly diverge: there is a good illustration of this in a 2005 RC post by JA and me). That is moderately trivial if the prog runs on a single processor, but less obviously true if the prog has to be identical on multiple processors (so for example any averaging of numbers will need to happen in the same order every time) and even less obvious if it has to be true on an arbitrary number of processors. But the model manages it (unless you run with the "faster but non-reproducible code" option; generally a bad idea, because then if your model crashes you will never ever find out why). So then you can have code changes which in theory should not break bit-reproducibility (and can be tested as such). Of course even the smallest scientifically interesting code change *will* inevitably break bit-repro with earlier models. And compiler upgrades tend to break it too. To validate other changes, you tend to need long-term (~decade) averages, to get rid of the weather noise.
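To make the point about summation order concrete, here is a minimal Python sketch. It has nothing to do with the UM's actual code: `partitioned_sum`, the chunking scheme and the four input values are all made up for illustration. It mimics a reduction where each "processor" sums its own chunk and the partial sums are then combined, and shows that the answer can depend on how many processors you split the work over, because floating-point addition is not associative.

```python
def ordered_sum(values):
    """Add the values strictly left to right."""
    total = 0.0
    for v in values:
        total += v
    return total


def partitioned_sum(values, n_procs):
    """Mimic a parallel reduction (hypothetical scheme): each processor
    sums one contiguous chunk, then the partial sums are added in rank order."""
    chunk = (len(values) + n_procs - 1) // n_procs  # ceiling division
    partials = [ordered_sum(values[i:i + chunk])
                for i in range(0, len(values), chunk)]
    return ordered_sum(partials)


# Values chosen so that rounding matters: 1.0 is below the spacing of
# representable doubles near 1e16.
values = [1e16, 1.0, -1e16, 1.0]

one_proc = partitioned_sum(values, 1)
two_procs = partitioned_sum(values, 2)

# Same inputs, same arithmetic, different grouping: the results differ.
print(one_proc, two_procs, one_proc == two_procs)

# Re-running with the same grouping is perfectly repeatable.
print(partitioned_sum(values, 2) == two_procs)
```

Getting the same bits on any number of processors therefore means fixing the order of operations independently of the decomposition, which is presumably part of what the slower "reproducible" option pays for.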

The study says it asked five questions:

1. Correctness: How do scientists assess correctness of their code? What does correctness mean to them?
2. Reproducibility: How do scientists ensure experiments can be reproduced (e.g. for peer review)?
3. Shared Understanding: How do scientists develop and maintain a shared understanding of the large complex codes they use? E.g. what forms of external representation do they use when talking about their models?
4. Prioritization: How do scientists prioritize their requirements? For example, how do they balance between doing what is computationally feasible and what is scientifically interesting?
5. Debugging: How do scientists detect (and/or prevent) errors in the software?

...but it doesn't really answer them (except for reproducibility, which has a trivial answer). Instead it answers some easier related questions. I think I should try to say something about correctness, since it is such an exciting topic. Correctness in the dynamical core is in theory verifiable in some limited situations, by comparison to known solutions. But for the whole GCM this isn't even close to possible. You are left with a combination of comparison to previous model runs, comparison to observations, and process-based studies.

Comparison to previous runs is the easiest: you have a previous long control integration known to be good, or at least passable: you can just check your own against this. Tools are available to do it automatically. If you think all you've done is make a minor change to the albedo of East Antarctica you can do a 10-year run and check that the world's climate hasn't dramatically shifted. Arguably that doesn't check your change is *right* but it is a coarse check that you haven't broken much.
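As a rough sketch of what such a coarse check might look like (this is not one of the actual tools; the function names, the 3-sigma threshold and all the temperatures below are invented for illustration), one can compare the mean of a short test run against the spread of decadal means in the control:

```python
from statistics import mean, stdev


def decadal_means(annual_values, years_per_chunk=10):
    """Average an annual series over consecutive, non-overlapping decades."""
    n_chunks = len(annual_values) // years_per_chunk
    return [mean(annual_values[i * years_per_chunk:(i + 1) * years_per_chunk])
            for i in range(n_chunks)]


def looks_like_control(test_annual, control_annual, n_sigma=3.0):
    """Coarse check: is the test run's mean within n_sigma standard
    deviations of the control run's decadal means?"""
    control = decadal_means(control_annual)
    return abs(mean(test_annual) - mean(control)) <= n_sigma * stdev(control)


# Made-up global-mean surface temperatures in deg C: a 60-year control
# with some year-to-year wiggle and small decade-to-decade drift.
wiggle = [-0.2, 0.1, 0.0, 0.2, -0.1, -0.2, 0.1, 0.0, 0.2, -0.1]
decade_offsets = [0.00, 0.05, -0.05, 0.02, -0.03, 0.01]
control_run = [14.0 + off + w for off in decade_offsets for w in wiggle]

# Two hypothetical 10-year test runs: a minor tweak, and a broken model.
minor_tweak_run = [14.03 + w for w in wiggle]
broken_run = [15.0 + w for w in wiggle]

print(looks_like_control(minor_tweak_run, control_run))  # True
print(looks_like_control(broken_run, control_run))       # False
```

A real comparison would look at many fields and regions rather than one global number, but the logic is the same: the check can tell you the world has shifted, not that your change is right.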

Comparison to obs: also worth doing, but since the model will always have biases against the obs (which won't be totally accurate anyway) not as useful as you might think.

Process studies: check that the new snowfall routine you've just put in really does increase the proportion of snow to rain near 0 °C in a given environment. Or somesuch. Lots of effort; does the "right" bit not checked above, but doesn't check that you haven't broken the world.

The release schedule is not driven by commercial pressure, because the code is used primarily by the developers themselves, rather than released to customers.

This is a bit of a porkie. IPCC schedules matter. Ditto:

The developers all have "day jobs" - they're employed as scientists rather than coders, and only change the model when they need something fixed or enhanced. They do not delegate code development tasks to others because they have the necessary technical skills, understand what needs doing, and because it's much easier than explaining their needs to someone else.

but also because *there is no-one else*. Incidentally, if any of this sounds like me angling for a highly-paid job as some kind of software consultant to the Met Office: I'm not. Finally:

Mapping their concepts onto terms used in the software engineering literature may be problematic. For example, it was hard to distinguish "software development" from other aspects of the scientific practice, including data analysis, theorizing, and the development of observational datasets. From a scientific point of view, the distinction between changing the code and changing the parameters is artificial, and scientists often conflate the two -- they sometimes recompile even when it shouldn't be necessary. Therefore, characterizations of model evolution based purely on source code changes miss an important part of the picture.

also sounds right.

[Update: if you're having problems commenting on this post, please mail me.

Minor thought: one of the problems with the GCMs is Fortran. Not because it is totally unusable, but in part because no-one from SE wants to go near it. One of the reasons I left was because I didn't want to keep writing Fortran for the rest of my life; and (looking at the job ads) it was pretty clear that it was a very restrictive career move.]


PART 2

Steve, I'll respond to your comment in a while, if you can tolerate some more "nonsense" about independent VV&T. With your claimed expertise in the subject I am puzzled by your categorical statement that ".. GCMs are independently validated by domain experts to a much greater extent than any of NASA's mission critical spacecraft control systems ..". To me, a retired engineer with relevant experience in the telecomms industry, well-defined user requirements are essential to effective independent VV&T. This doesn't seem to fit with the statement in your paper (Note 4) that ".. Computational scientists generally adopt an agile philosophy, because the requirements are generally not known up front ..". Is the development of reliable climate modelling systems by academics really so different from systems engineered for use in industry and commerce that renders proven VV&T procedures irrelevant?

In the mean time I'll have a chat with Dr. Vincent Gray, author of "The Greenhouse Delusion" and member of the New Zealand Climate Science Coalition (Note 5): ".. IPCC expert reviewer (and signatory to Ref. 10) was responsible for having the IPCC admit that climate models had never been properly validated, despite the IPCC trying to suggest otherwise. In response to his comment, the chapter entitled 'Climate Models - Validation' in an early draft of the IPCC's 'The Science of Climate Change' had the word 'validation' replaced by 'evaluation' no less than 50 times ..".

Nick, may I congratulate you on your submission to the IPCC regarding AR5. If I'd known about it soon enough I'd have asked you to add my name to it. I agree with most of it and emphatically support the goal of The Clear Climate Code project of "increasing public confidence in climate science results". This will not be achieved if scientists employed in climate research activities in academia maintain their arrogant attitude towards sceptics.

Thanks for your comments on VV&T. I've been retired for 8 years now and in 2002 well respected computer systems integrators like Computer Sciences Corporation (CSC), Accenture (Andersen Consulting) and WIPRO were applying VV&T procedures effectively for Telecomms OSS/BSS. Practices change so VV&T could have become obsolescent but if so why are there numerous adverts for VV&T specialists these days?

Marco, I see that you continue to pour forth your pearls of wisdom.

NOTES:
1) see http://julesandjames.blogspot.com/2010/07/monbiot-exonerated.html
2) see http://en.wikipedia.org/wiki/William_Connolley
3) see http://media.wiley.com/product_data/excerpt/96/07695119/0769511996.pdf
4) see http://www.cs.toronto.edu/~sme/papers/2008/Easterbrook-Johns-2008.pdf
5) see http://nzclimatescience.net/index.php?option=com_content&task=view&id=3…

Best regards, Pete Ridley

William, checking up on your Green Party activism I came across the 26th June article "I feel the need to offer Wikipedia some ammunition in its quest to discredit me" (Note 1) by James Delingpole in the Spectator. James says "unfortunately Wikipedia is policed by climate alarmists. .. there's absolutely no point in trying to shift entries like this to a more balanced position because within moments one of Wikipedia's gang of in-house trolls will have shifted it back to the 'correct' ideological perspective. One of the most assiduous correctors is a British Green party activist called William Connolley, who .. was a founder member of Realclimate, an alarmist website originally designed by friends of Michael Mann to pump out propaganda in support of his hockey stick. By the end of last year .. Connolley had created or rewritten no fewer than 5,428 Wikipedia articles. Though Connolley has had his status as a Wikipedia 'administrator' .. he remains such an indefatigable contributor to the site's pages .. that he seems to consider it a full-time job ..". Even my friends at NZ CSC had a mention.

There was an even longer article on 23rd Dec 2009 in The Daily Bell News Wire (Note 2) which, among many other things, said ".. The general mainstream take on Connolley (when he is mentioned at all) is apparently that he is a driven leftist who reconfigured 5,000 articles on Wikipedia because of his belief system ..".

I remember reading quite a while back about someone allegedly "manipulating" the Wikipedia climate change entries but didn't make the connection when I joined your thread recently. Surely those guys aren't talking about you.

NOTES:
1) see http://www.spectator.co.uk/columnists/all/6099208/part_3/i-feel-the-nee…
2) see http://www.thedailybell.com/683/Wikipedia-as-Elite-Propaganda-Mill.html

Best regards, Pete Ridley

Gator, I get the impression some of you guys see little further than your software engineering, not the full picture from end user requirements definition through to final system integration and operation. That's how VV&T was applied when I was involved in it. (Contrary to what is claimed in "Engineering the Software for Understanding Climate Change" (Note 1) in the case of climate modelling systems, the end user is not only the scientists who develop these systems.)

Who is "Claiming that VV+T alone will produce quality SW"? I have no recollection of anyone ever suggesting that to me, whereas I recall plenty of instances when professionally applied and independent VV&T procedures identified defects in system performance due to deficient software engineering. Consequently deficiencies were rectified much earlier (and much cheaper) than would have occurred if left to the software engineers and defects only identified during operation. It is possible but highly unlikely that VV&T caused the SW to be "double the cost". It would certainly have cost many times more if those defects had remained undetected until during operational use. I don't expect to be the only person who has had such experiences.

I would be surprised if rectification of these deficiencies led to "quality" software but they did lead to software and operational systems that more closely satisfied the end users' requirements, used throughout the system development program as the prime objective.

Steve (Easterbrook) and Timothy Johns said (Note 1) "V&V practices rely on the fact that the developers are also the primary users". It could be argued that the prime users are the policymakers who are guided by the IPCC's SPMs which depend upon the projections of those climate models. Steve and Timothy "hypothesized that .. the developers will gradually evolve a set of processes that are highly customized to their context, irrespective of the advice of the software engineering literature ..".

I prefer the hypothesis of Post & Votta who say in their excellent 2005 paper "Computational Science Demands a New Paradigm" (Note 2) that ".. computational science needs a new paradigm to address the prediction challenge .. They point out that most fields of computational science lack a mature, systematic software validation process that would give confidence in predictions made from computational models". What they say about VV&T aligns with my own experience, including ".. Verification, validation, and quality management, we found, are all crucial to the success of a large-scale code-writing project. Although some computational science projects (those illustrated by figures 1-4, for example) stress all three requirements, many other current and planned projects give them insufficient attention. In the absence of any one of those requirements, one doesn't have the assurance of independent assessment, confirmation, and repeatability of results. Because it's impossible to judge the validity of such results, they often have little credibility and no impact ..".

Relevant to climate models they say "A computational simulation is only a model of physical reality. Such models may not accurately reflect the phenomena of interest. By verification we mean the determination that the code solves the chosen model correctly. Validation, on the other hand, is the determination that the model itself captures the essential physical phenomena with adequate fidelity. Without adequate verification and validation, computational results are not credible ..".

I agree with Steve that "further research into such comparisons is needed to investigate these observations" and suggest that in the meantime the VV&T procedures should be applied as understood by Post and Votta and currently practised successfully outside of the climate modelling community.

NOTES:
1) see http://www.cs.toronto.edu/~sme/papers/2008/Easterbrook-Johns-2008.pdf
2) see http://www.highproductivity.org/vol58no1p35_41.pdf

Best regards, Pete Ridley.

Hi William, I see that since I mentioned the involvement of Vincent Gray you have rejected two of my comments and removed one. What's wrong - don't you appreciate open debate about The (significant human-made global climate change) Hypothesis? Is it against the rules of your "Green" religion?

As I promised in my comment that you removed, I posted the previously rejected one elsewhere (Note 1). Senator Fielding doesn't censor comments but encourages open debate. A respondent there pointed me to another article about your Wikipedia activities - more interesting reading! That's the thing about the Internet, nothing can be hidden for long, as Michael Mann's "Hockey Team" discovered with the leaking of the UEA CRU files. The name William M Connolley comes up repeatedly in those E-mails - 1997, 2004, then several times in 2007 "Subject: Figure 7.1c from the 1990 IPCC Report" - something to do with a suspect graph. Not another "hockey stick" was it? That Connolley, as well as being a member of the "Hockey Team" is also a founder member of RealClimate so maybe it is you after all.

BTW, Vincent Gray is preparing a note on the non-validation of climate models which should be published soon. Meanwhile he has given me permission to quote from his E-mail. If you decide to allow my post onto your blog then I would appreciate you posting my earlier one which started with a response to Gator. If you don't have a copy then I can resend it. If you choose not to post these comments then I'll send them elsewhere, starting with Steve Easterbrook's relevant thread (Note 2), along with relevant parts of my comment on VV&T that you refused to publish. The debate will not go away, despite your efforts to squash it.

Steve (Easterbrook), Vincent Gray advises as follows:

"I cannot possibly agree with your belief that there has been a significant change in the attitude of IPCC or NASA to validation of climate models. The essential process of validation is that the model must be shown to be capable of forecasting of a range of future climate behaviour over the whole range for which the models are intended, to be used to a satisfactory level of accuracy. To my knowledge no such exercise has ever been attempted for climate models and the IPCC and there is no published document I am aware of that even discusses the processes that should be carried out to achieve it.

There are additional reasons why such a procedure cannot be made. The models all assume that emissions of greenhouse gases are the only reasons for changes in climate. There are other important factors such as changes in the sun, the earth's orbit, the ocean oscillations and volcanic eruptions which currently cannot be predicted to a level of accuracy that could satisfy a future forecasting system. For example there has recently been a fall in global temperature which is simply incapable of being predicted or modelled by any current climate model.

For these reasons I reject completely the claims that you have made. The IPCC seems to be firmly committed to the view that their models are mere "projections" which depend entirely on the plausibility and accuracy of their assumptions. They do not claim the models are capable of forecasting and they depend for their "likelihood" statements entirely on the unsupported opinions of those who are paid to produce the models".

NOTES:
1) see http://www.stevefielding.com.au/forums/viewthread/795/P2820/
2) see http://www.easterbrook.ca/steve/?p=1785

Best regards, Pete Ridley.

I'll repeat my question from the last post. Does anyone know if there are regression tests on the GCMs that run against the various modules to see whether each portion of the system is within specifications as changes are made? Or do they simply run the whole thing, and if it fits within their previous assumptions they assume it is fine?

[I don't think there are any regression tests. I also think it would be very difficult to do any and probably meaningless. I could expand on that if you're interested -W]

Yes, I found it an interesting paper. It reminded me vividly of working at Harlequin in the 90s. The company developed its own in-house cross-platform SCM system (called Hope). There were reasonable tools available, but obviously we couldn't use any of them because obviously we had our own unique requirements. We also went on to develop our own in-house solutions for a variety of other common problems.

None of the paper's reasons for an in-house system hold up. Subversion and Trac are good choices. Building 12,000 lines of Perl on top of them is probably not a good choice.

But an excellent article.

Very interesting subject. I'm doing QA for a large general purpose engineering code (at least 3 million lines last count), and the same issues exist. I've always had a different way of development than is common in scientific/engineering circles. My approach was based when possible on standalone tests of components that are intended for the code. I.e. create some sort of test environment for your module, and exercise it to see that it conforms to your mathematically defined expectations. Only then do you put it into the large code and start trying to verify that the integration has gone well. But, alas, my observation is that scientists/engineers almost always hack the main code, then start running full scale simulations, stopping to fix any obvious bugs that show up.
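A minimal sketch of that standalone-test idea, in Python rather than Fortran and with an invented component: `trapezoid` stands in for whatever numerical routine is about to go into the big code, and the "mathematically defined expectations" are a known exact integral plus the rule's second-order convergence. None of this comes from any real model.

```python
def trapezoid(f, a, b, n):
    """Composite trapezoid rule for the integral of f over [a, b], n intervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


def square(x):
    return x * x


# Known solution: the integral of x^2 over [0, 1] is exactly 1/3.
exact = 1.0 / 3.0
err_coarse = abs(trapezoid(square, 0.0, 1.0, 100) - exact)
err_fine = abs(trapezoid(square, 0.0, 1.0, 200) - exact)

# Expectation 1: the error is small (theory gives h^2/6 for this integrand).
assert err_coarse < 2e-5
# Expectation 2: halving the step size cuts the error by about a factor of 4.
assert 3.9 < err_coarse / err_fine < 4.1

print("component behaves as the maths says it should")
```

Checking the convergence rate as well as the answer is the useful part: a routine with an off-by-one in its loop can still land near the right value while converging at the wrong order.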

Even single processor bit repeatability can be an issue. We had an instance where changing an output parameter changed the solution. It turned out that the compiler generated code whose numerical ordering was dependent upon memory alignment (i.e. the low order bits of the address of arrays). Natural alignments wrt computer architecture (cache lines, pages etc.) do not correspond to the natural alignments of data structures, and unexpected effects can emerge.

The subject of scientists/engineers writing their own code, versus working with professional scientific programmers, is also an interesting one. Maybe computer training for scientists has improved from the bad old days, when a scientist's first exposure to programming was to be handed a computer manual and a terminal, and be expected to be able to generate efficient maintainable code without having first had any specific training.

Interesting about Fortran. The place I work at is a Fortran-77 shop. I figure the boss figures that after a few years none of his employees will have viable options to work elsewhere.

Internal vs external tools can depend on various factors. I've seen it (at companies as well) where the budgets are different and it's easier to get a few people in the department to write something, than get a sign-off on purchasing (and support) of an external tool. This doesn't apply to free tools, but then some managers will not allow you to use one.

Incidentally, a relative works for a large IT contract/out-sourcing company who helped write the UM (he was UKMO employed at the time). Apparently he's easily the best programmer in their section.

WMC,

I would be very interested in why you don't think regression tests would be helpful. Please do write about it.

I have used them in even the most sophisticated software development projects. Of course in the old days we didn't, but that caused terrible issues with integration, and a lot of bugs.

As I read the graph, there are now 800 lines of code in 8000 files. Is the graph perhaps missing a "(thousands)" on the left-hand axis?

When talking about differences in methodology, possibly what did not come through in the paper is that virtually every scientist is compiling a different code-base, because they are compiling in a different set of science sub-components and/or because they are including a different set of change-sets (code differences). This is perhaps a bit different from having periodic releases of a single codebase where every "customer" uses the same code-base, possibly with different options/input data and so forth.

The FCM system is designed to do this easily for dumb users like myself who may do a handful of changes per year, and also to speed up the process of repeat extractions and compiles (use of precompiled code and so forth). It's not rocket science. Looks like about 2500 lines of Perl including comments.

An example of one simple thing it does is to provide a basic wrapper to svn to simplify development from specific UM releases, including older releases (which remain in use long after newer releases come on stream), rather than from the head of the trunk or the latest release, and to enforce a naming convention to make it easier to keep track of which changes relate to which releases. As I've not used svn or cvs, I can't comment on how this might differ from other systems.

The difficult bit of designing FCM was getting the processes surrounding svn and Trac right for the developers.

Regression tests on individual bits of code are hard because each bit of code has a complicated interface. Given that we're modelling a coupled system, there is a high amount of coupling in the code! Setting up regression tests may not be useful since even a "correctly" coded change which produces a minor change in results over a single iteration of the code may produce highly erroneous results in a longer simulation.
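The sensitivity described here can be illustrated with a toy chaotic system. The sketch below uses the logistic map, which is emphatically not a climate model; the starting value, the size of the perturbation and the step counts are arbitrary choices for illustration. A difference that is negligible after one iteration grows until the two trajectories bear no resemblance to each other.

```python
def logistic_trajectory(x0, n_steps, r=4.0):
    """Iterate the logistic map x -> r * x * (1 - x), returning every state."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


baseline = logistic_trajectory(0.3, 100)
perturbed = logistic_trajectory(0.3 + 1e-12, 100)  # stand-in for a "minor" change

diffs = [abs(a - b) for a, b in zip(baseline, perturbed)]

print("difference after 1 step:     ", diffs[0 + 1])
print("largest difference, steps 60+:", max(diffs[60:]))
```

So a test that compares a single iteration against stored output will pass changes that wreck a long run, and a test that compares long runs bit-for-bit will fail any change at all, which is why the comparison usually ends up being statistical.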

The Unified Model has never used CVS!

PS. It's not a porkie to say release schedules are not "primarily" driven by commercial pressures. Normally they are not. Occasionally they are. But the deadlines tend to apply to individual developments or sets of developments and not to the key releases. As it happens, the recently completed HadGEM2 model, the CMIP5 model, is based on vn6.6, but the latest UM release is 7.6!

William, I am involved in debate on James Annan's blog about paleoclimate reconstructions and the validity of the projections from climate models. One of the questions that I put to James was about the extent to which VV&T procedures were applied to those models and he pointed me to the Easterbrook/Johns article. Before finding your critique here I scanned the article and would appreciate your reaction to my conclusions as presented to James today at http://julesandjames.blogspot.com/2010/07/monbiot-exonerated.html.

[Alas, I have no idea what W&T is, so I've no idea if it will be of interest. I think the principal point I would make is that in attempting to evaluate these models, it is essential to understand them and climate modelling and climatology. You cannot do any of that from an outside perspective.

Reading JA's blog your interests seem to be somewhat different: you mention "the validity of attempts to reconstruct past climates from air 'trapped' in ice cores" and ZJ (Zbigniew Jaworowski). I imagine you've spent some time on that, so I'm sorry to have to tell you that all ZJ's work on this subject is completely worthless. The best blog posts I know on the subject are here and The Golden Horseshoe Award: Jaworowski and the vast CO2 conspiracy. His work is universally rejected by anyone who knows anything about the subject. Please don't allow yourself to be fooled -W]

QUOTE:
James, thanks for those links to Steve Easterbrook's articles which at first glance look very interesting. The first provides a link to Jon Pipitone's Masters thesis "Software quality in climate modelling" and the second to "Engineering the Software for Understanding Climate Change" by Easterbrook & Johns. My initial impression after a quick read was that the first only looked at one aspect of the VV&T process, the engineering of the software. The second did little more, although it does mention Verification and Validation. In fact the conclusion acknowledges this with "Our goals in this study were to characterize the software development practices".

Several statements in the Easterbrook/Johns article made me cringe somewhat. "The findings show that climate scientists have developed customized techniques for verification and validation that are tightly integrated into their approach to scientific research", "Software Verification and Validation (V&V) is particularly hard in computational science [4], because of the lack of suitable test oracles and observational data". "V&V practices rely on the fact that the developers are also the primary users, and are motivated to try out one anothers' contributions". "The V&V practices are absorbed so thoroughly into the scientific research that the scientists don't regard them as V&V".

I spent the final years of my career closely involved in VV&T of operational and business support systems (OSS/BSS) developed for use in telecommunications networks and services. Commercial organisations recognised that such systems must be subjected to thorough VV&T procedures carried out by professional and independent practitioners before they can be brought into operational use. The above statements suggest to me that this is not even considered appropriate for global climate models. In my opinion without it the results they produce are highly suspect and should not be relied upon for making policy decisions.

Software engineer and ex-climate modeller William M. Connolley (Note 1) has presented a critique of the Easterbrook/Johns article (Note 2) on his "Stoat" blog which you may find of interest.

NOTES:
1) see http://en.wikipedia.org/wiki/User:William_M._Connolley
2) see http://scienceblogs.com/stoat/2010/06/engineering_the_software_for_u.php
UNQUOTE.

Best regards, Pete Ridley

William,
Thanks for the careful read of the paper, and the detailed critique. Let me quibble with a few of your points.

You quote the paragraph from the paper "Scientists have additional requirements for managing scientific code: ..." and say it's all twaddle. Well, I've thought a lot about this paragraph since we published the paper, because others have queried it too. So I think I would now modify it somewhat. It's not that other kinds of software don't have these requirements, it's more a matter of degree. So, for non-scientific software, precise version tracking and reproducibility is usually only important to the *development* team themselves, for debugging purposes - when you get a bug report, you have to know which version the user was running. For most software, the *users* don't care, and often have no idea what version they are using. For climate models, you get users who want to do things like re-run a model they used several years ago, with a different diagnostic switched on, and guarantee that nothing else changes. And, as you point out later, the "precisely repeatable results" refers to bit-for-bit comparison of approximate numerical routines, which is important because it's a simple and objective way to check that the climatology in the model didn't change, where the alternative is a subjective (human) judgement that the climate in the new model is "similar enough". This is an awkward requirement because it means most refactoring and automated optimization will come into conflict with it. The only other community I know of who care quite as much as this about which version of the code they are using would be open source systems/networking software, in the hands of expert users, where they tend to know exactly which version they're running, and why they chose that particular version.
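As a rough illustration of why bit-for-bit reproducibility conflicts with refactoring and optimisation, here is a minimal Python sketch (not taken from any climate model; the function names and the three-value input are made up for the example). It sums the same numbers in two different orders, as a compiler optimisation or a refactor might, and compares the results both bit-for-bit and with a tolerance.

```python
import math
import struct


def bits(x: float) -> bytes:
    """Return the raw IEEE 754 double-precision bytes of x."""
    return struct.pack(">d", x)


def sum_forward(values):
    """Accumulate left to right."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_reversed(values):
    """Same mathematical sum, accumulated in the opposite order."""
    total = 0.0
    for v in reversed(values):
        total += v
    return total


values = [0.1, 0.2, 0.3]  # illustrative data only
a = sum_forward(values)
b = sum_reversed(values)

# Objective check: are the two results identical down to the last bit?
bit_identical = bits(a) == bits(b)

# Judgement-based check: are they "similar enough" for some chosen tolerance?
close_enough = math.isclose(a, b, rel_tol=1e-12)

print(a, b, bit_identical, close_enough)
```

Floating-point addition is not associative, so the two orderings differ in the last bit here: the bit-for-bit test fails while the tolerance test passes. The tolerance (`rel_tol=1e-12`) is an arbitrary choice, which is the sense in which that kind of check relies on judgement.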

[Thanks for the comment. Had I known you were likely to read the post I'd have been a bit more polite. Please consider "twaddle" to be replaced by "perhaps in need of a slight rephrase" or somesuch; it is a convenient shorthand. I think there is a distinction to be drawn between users who "just" want to re-run exactly the same software with the same compilers on the same hardware (which always puts something of a time-limit on how long this stuff stays reproducible) and developers who want to be able to track down each line to its source -W]

Later in the paragraph, the phrase "tend to develop their own tools in-house" was meant to refer to software development tools, not the climate models themselves. Your comment about whether each lab should develop its own climate models is an interesting one, but different from the point I was trying to make.

I also have to take issue with your comment that "[the requirements not being known up front] is twaddle. HadCM3, for example, had very well known requirements up front: to be a global climate model."

With all respect, that's not a requirement, that's a very vague project goal. Requirements are detailed, testable statements of the precise functionality that the software is expected to have, and the other measurable qualities (portability, evolvability, reliability, etc) that its users care about. To press the point, HadCM3 could have met your "requirement" trivially, by being exactly the same as HadCM2. While there are project management plans that set out a number of science goals for each new version of the model, these are a long way from being detailed requirements specifications.

[OK, agreed, what I said was badly inexact. And some of the important emergent properties of HadCM3 - not needing flux correction perhaps most importantly - whilst undoubtedly project goals were perhaps not project requirements -W]

William, thanks for the response. First let me respond on the bit about air "trapped" in ice cores. I have seen the "Some are Boojums" articles (and many others) but find no mention of how the preferential fractionation of CO2 (in both the firn and in the "solid" ice beneath) due to its smaller kinetic diameter than other atmospheric gases cannot take place. This is the area of uncertainty of interest to me. None of the papers to which supporters of The (significant human-made global climate change) Hypothesis have provided links have provided evidence that convinces me this does not happen or has been properly researched. Jaworowski challenges acceptance of the validity of reconstructions from ice cores for a variety of reasons but I do not recall him making any specific reference to this one. Until I find a research paper that does address this specifically I remain sceptical about the validity of those reconstructions.

[You can if you like. You are wasting your time in an affair of no interest to anyone else, but if the existing evidence that ZJ's stuff is junk won't convince you, nothing else will -W]

Sorry that I didn't spell out that by VV&T I meant "Verification, Validation and Test" of a computer system throughout the process of its development, from requirements definition through to operation. "A rigorous development and integration processes, usually performed by a third party, of verifying, validating, and testing of systems or software to make sure that it is for deployment. The process includes validating that the integrated product meets the specified requirements and will perform its intended functionality in its intended operational environment, verifying the Load, stress, and performance of the product, and operational scenario testing" (Note 1).

A very important part of that definition is "usually performed by a third party" but I would add "independent". Because of the horrendous complexity of global climate processes and drivers and the poor state of scientific knowledge I am inclined to agree with your "in attempting to evaluate these models, it is essential to understand them and climate modelling and climatology. You cannot do any of that from an outside perspective" but does this preclude the application of the best possible professional VV&T procedures by "experts" in that field who are independent of the system developers and users? This would of course benefit enormously if the procedures applied were audited by independent VV&T experts, just as the paleoclimate reconstructions have benefited from the independent audit by expert statisticians like McIntyre and McKitrick.

[The paleo records haven't benefitted from the M&M "audit"; they've just introduced a great deal of noise and trouble. But, again, if you haven't been convinced by all the stuff up to now there is little hope.

As to the verification - yes, GCMs are verified. But they are verified in ways you don't understand because (a) you haven't read the right papers and (b) very likely you wouldn't understand them if you did. I find it irritating when people attempt to apply inappropriate outside methodologies to fields they don't understand.

Most commercial software fails a literal version of your description, unless you mean test-by-use and soak testing of the integrated system. But then, GCMs get all that -W]

Systems that are intended to be used to produce forecasts that are the basis for an organisation's essential policy decisions need to be subjected to VV&T procedures by independent professionals. This is recognised as essential and is commonplace in industrial and commercial organisations so why not in academia?

[First off, I don't think it is true. Second, a more obvious example would be economic models. Why aren't you interested in those? -W]

Steve (Easterbrook) can you shed some light on this? It appears from your article that you apply such techniques (although perhaps not using independent practitioners) to the software development phase. This is what VV&T practitioners regard as software Verification - does it do what it is designed to do - but what about the other phases of the programme, from user requirements specification through to operation, i.e. does the system do what it is required to do? I see the "definition of requirements" phase as being vital to the achievement of a system that can be relied upon to produce usable results. If that phase has not yet been successfully completed for any climate models then projections/predictions from such models must surely be treated with suspicion.

NOTES:
1) see http://www.birds-eye.net/definition/acronym/?id=1161563436

Best regards, Pete Ridley

"V, V, and T" is "verification, validation, and testing". Probably.

Pete,
Earlier in my career I was lead scientist at NASA's Independent Software Verification and Validation facility, where I studied the VV&T processes used by NASA for spaceflight control software.
I can categorically state that GCMs are independently validated by domain experts to a much greater extent than any of NASA's mission critical spacecraft control systems. Of course, no other software developers have the luxury of two dozen other labs around the world independently building software to do the same thing, with a regular systematic intercomparison project in which they benchmark and assess the quality of each other's models. You have heard of CMIP, right? If not, I suggest you go and do your homework, and stop posting nonsense.
Steve

Steve, if you click on Pete's name, you get to his blog. A cursory read will tell you everything you need to know about what type of person you are dealing with. Don't have anything in your mouth, though. Your computer screen (or your keyboard) will suffer.

Firstly, the idea that systems used to guide important policy decisions all undergo independent external VV+T is pretty bogus. It might be true in some very narrow fields but it is not true of most commercial software. As a trivial example, a huge number of key policy decisions in organisations throughout industry and the public sector are taken on the basis of (frighteningly flimsy) Excel spreadsheets. Do these organisations procure independent VV+T of the spreadsheets? Of Excel itself? You're just making us laugh now. (Incidentally, I'm not saying that the crappy-spreadsheet model of management is a good one).

Secondly, the models of software development in which a detailed req spec is developed and then thrown over the wall to the developers are utterly discredited, except, again, for some very narrow domains (which possibly just haven't woken up to smell the coffee). It doesn't work in industry, it doesn't work in the public sector, why on earth should it be made to work in climate science?

PART 1

William, you have said that " .. I'm sorry to have to tell you that all ZJ's work on this subject is completely worthless .. " and " . if the existing evidence that ZJ's stuff is junk won't convince you, nothing else will .. ". This tells me that you are convinced that preferential fractionation of CO2 cannot take place.

As I pointed out on James Annan's "Monbiot exonerated" thread (Note 1) Dr. Hartmut Frank, who wrote the foreword to Jaworowski's 1994 paper, says " .. Prof. Jaworowski's main argument is valid and will remain valid because it is based on simple, but hard physicochemical facts. Most of the facts can be found in the old, traditional "Gmelin's Handbook of Inorganic Chemistry" - but nobody reads such books anymore today. The facts are so basic that one cannot even start a research project on an investigation of the validity of such carbon dioxide analyses in ice cores because the referees would judge it too trivial. But if one would apply proper quality assurance/quality control principles, as they are common in most other areas of application of chemical-analytical methods (for instance in drug control or toxicology) the whole building of climate change would collapse because of the overlooked fault. And so one continues because there are so many living in or from this building".

Professor Frank, with his many years of scientific experience behind him, in his foreword to Jaworowski's 1994 paper, said of consensus "Also in scientific discussions the sentiment of the generally accepted view of the scientific community is heard - as if verification or falsification of scientific hypothesis is a matter of majority vote. There are many historical examples when the common belief, the majority of those who knew, hindered true progress. Derogatory statements about a person's scientific reputation are least helpful. Often the less firm arguments are, the more is the interpretation placed upon scientific 'authority through majority'."

On his 65th birthday in 2008 the journal Chromatographia said "Dr. Frank, Professor of Chemistry and Ecotoxicology, University of Bayreuth, Germany, is internationally recognised for his development of chiral separation phases .." an area of scientific research in which he has been involved since at least 1978.

I believe that research into chiral separation phases requires more understanding of the interaction of atoms in molecules than does the engineering of software for arctic sea ice models but please correct me if I am talking nonsense here. I understand that you are not a scientist but a software engineer who has worked as a climate modeller specialising in Antarctica sea ice. If you were in my shoes whose opinion would you be inclined to consider the best informed regarding molecular diffusion in firn and the "solid" ice beneath?

I see from the Wikipedia entry (Note 2) that you are also a Green Party activist and were a member of RealClimate. I had to chuckle at your comment that Wikipedia "gives no privilege to those who know what they're talking about". I understand where you are coming from now and why you feel obliged to say "The paleo records haven't benefitted from the M&M "audit"". Do you " .. find it irritating when people attempt to apply inappropriate outside methodologies to fields they don't understand" with regard to the independent audit by McIntyre and McKitrick?

I am surprised that on the one hand you say " .. Alas, I have no idea what W&T (sic) is .. " then almost in the next breath say " .. yes, GCMs are verified. But they are verified in ways you don't understand .. ". You might like to try "Chapter 1 - Software Systems Engineering" (Note 3) for helpful information about Verification, Validation & Test (VV&T).

PART 2

[VG: say no more guv! -W]

Marco, I see that you continue to pour forth your pearls of wisdom.

Best regards, Pete Ridley

An interesting thread derailed by Mr. Ridley. Too bad.

[Yes, sorry about that, I'm not doing my maintenance duties properly. I've just not-approved another irrelevant comment by him -W]

I work in a field where we are forced to follow a structured SW engineering process with integral VV+T (DO-178B.) As far as I can tell all this does is double the cost of any SW without a great increase in reliability. In theory it should, but VV+T is just one more complicated process where failure can enter the system. Claiming that VV+T alone will produce quality SW is naive at best.

Jo Nova's blog has an interesting new article "The models are wrong (but only by 400%)" (Note 1) which you should have a look at, along with the comments. It covers the recent paper "Panel and Multivariate Methods for Tests of Trend Equivalence in Climate Data Series" (Note 2) co-authored by those well-known and respected expert statisticians, McIntyre and McKitrick, along with Chad Herman.

[That is an odd way to describe them. Neither are expert statisticians, nor well-known for their statistics. Neither would primarily be described as "statisticians". But you're puffing them so as to be able to argue from Authority, which fails -W]

David Stockwell sums up the importance of this new paper with "This represents a basic validation test of climate models over a 30 year period, a validation test which SHOULD be fundamental to any belief in the models, and their usefulness for projections of global warming in the future".

David provides a more detailed comment on his Niche Modeling blog "How Bad are Climate Models? Temperature" thread (Note 3) in which he concludes "But you can rest assured. The models, in important ways that were once claimed to be proof of '... a discernible human influence on global climate', are now shown to be FUBAR. Wouldn't it have been better if they had just done the validation tests and rejected the models before trying to rule the world with them?".

Come on you model worshipers, let's have your refutation of the McIntyre et al. paper.

[To be fair, you ought to give me time to read it first. But I doubt it is of much interest. Attacking the temperature record is dull, since the (positive) trends are well known and the fit to the models also -W]

NOTES:
1) see http://joannenova.com.au/2010/08/the-models-are-wrong-but-only-by-400/#…
2) see http://rossmckitrick.weebly.com/uploads/4/8/0/8/4808045/mmh_asl2010.pdf
3) see http://landshape.org/enm/how-bad-are-climate-models/

Best regards, Pete Ridley

I think the rails warped from the heat.

Hi,

Excellent article. Thanks for your comments on VV&T. I've been retired for 8 years now and in 2002 well respected computer systems integrators like Computer Sciences Corporation (CSC), Accenture (Andersen Consulting) and WIPRO were applying VV&T procedures effectively for Telecomms OSS/BSS. Practices change so VV&T could have become obsolescent but if so why are there numerous adverts for VV&T specialists these days?
