Repeatability of Large Computations

By stoat on July 29, 2013.

Some parts of the discussion of Oh dear, oh dear, oh dear: chaos, weather and climate confuses denialists have turned into discussions of (bit) reproducibility of GCM code. mt has a post on this at P3 which he linked to, and I commented there, but most of the comments continued here. So it's worth splitting out into its own thread, I think. The comments on this issue on that thread are mostly mt against the world; I'm part of the world, but nonetheless I think it's worth discussing.

What is the issue?

The issue (for those not familiar with it, which I think is many: I briefly googled this and the top hit for "bit reproducibility gcm" is my old post, so I suspect there isn't much out there. Do put any useful links into comments. Because the internet is well known to be write-only, and no-one follows links, I'll repeat and amplify what I said there) is "can large-scale computer runs be (exactly) reproduced?". Without any great loss of generality we can restrict ourselves to climate model runs. Since we know these are based effectively on NWP-type code, and since we know from Lorenz's work or before that weather is chaotic, we know that on every time step, for every important variable, everything needs to be identical down to the very last bit of precision. Which is to say it's all-or-nothing: if it's not reproducible at every timestep down to the least significant bit, then it completely diverges weatherwise.
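As a toy illustration of that all-or-nothing behaviour, here is a minimal sketch using the logistic map as a stand-in for a chaotic model. The map, the parameter r = 3.9, the perturbation of 1e-15 (roughly rounding-error sized) and the function names are all assumptions chosen for illustration; none of this is GCM code.

```python
def logistic_trajectory(x0, steps, r=3.9):
    """Iterate the chaotic logistic map x -> r*x*(1-x) and keep every state."""
    states = [x0]
    x = x0
    for _ in range(steps):
        x = r * x * (1.0 - x)
        states.append(x)
    return states


def first_divergence(run_a, run_b, threshold=0.1):
    """Return the first step at which two runs differ by more than threshold."""
    for step, (xa, xb) in enumerate(zip(run_a, run_b)):
        if abs(xa - xb) > threshold:
            return step
    return None


x0 = 0.3
run_a = logistic_trajectory(x0, 500)
run_b = logistic_trajectory(x0 + 1e-15, 500)  # rounding-error-sized nudge
run_c = logistic_trajectory(x0, 500)          # exact repeat of run_a

diverged_at = first_divergence(run_a, run_b)
print("identical rerun:", run_a == run_c)
print("perturbed run departs visibly at step:", diverged_at)
```

The exact repeat matches bit for bit, while the nudged run ends up on a completely different trajectory after a few dozen steps, since the difference can grow by at most a factor of 3.9 per step.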

I think this can be divided into a hierarchy of cases:

The same code, on the same (single-processor) machine

Nowadays this is trivial: if you run the same code, you'll get the same answer (with trivial caveats: if you've deliberately included "true" random numbers then it won't reproduce; if you've added pseudo-random numbers from a known seed, then it will). Once upon a time this wasn't true: it was possible for OSs to dump your code to disk at reduced precision and restore it without telling you; I don't think that's true any more.

The (scientifically) same code, on different configurations of multiple processors

This is the "bit reproducibility" I'm familiar with (or was, 5+ years ago). And ter be 'onest, I'm only familiar with HadXM3 under MPP decomposition. Do let me know if I'm out of date. In this version your run is decomposed, essentially geographically, into N x M blocks and each processor gets a block (how big you can efficiently make N or M depends on the speed of your processor versus the speed of your interconnect; in the cases I recall on our little Beowulf cluster, N=1 and M=2 was best; at the Hadley Centre I think N = M = 4 was considered a fair trade-off between speed of completion of the run and efficiency).

Note that the decomposition is (always) on the same physical machine. It's possible to conceive of a physically distributed system; indeed Mechoso et al. 1993 does just that. But AFAIK it's a stupid idea and no-one does it; the network latency means your processors would block and the whole thing would be inefficient.

In this version, you need to start worrying about how your code behaves. Suppose you need a global variable, like surface temperature (this isn't a great example, since in practice nothing depends on global surface temperature, but never mind). Then some processor, say P0, needs to call out to P0..Pn for their average surface temperatures on their own blocks, and (area-)average the result. Of course you see immediately that, due to rounding error, this process isn't bit-reproducible across different decompositions. Indeed, it isn't necessarily even bit-reproducible on the same decomposition, if random delays mean that different processors put in their answers at different times. That would depend on exactly how you wrote your code. But note that all possible answers are scientifically equivalent. They differ only by rounding errors. It makes a difference to the future path of your computation which answer you take, but (as long as you don't have actual bugs in your code or compiler) it makes no scientific difference.
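The rounding effect is easy to see in miniature. The sketch below is an assumption-laden toy, not HadXM3 code: each inner list stands in for one processor's block, `global_sum` and `arrival_order` are hypothetical names, and a plain sum stands in for the area-average.

```python
def ordered_sum(values):
    """Add values strictly left to right, as a simple reduction loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


def global_sum(blocks, arrival_order=None):
    """Sum each block locally, then combine the partial sums on 'P0'.

    arrival_order optionally permutes the partial sums, mimicking
    processors that report their answers at different times.
    """
    partial_sums = [ordered_sum(block) for block in blocks]
    if arrival_order is not None:
        partial_sums = [partial_sums[i] for i in arrival_order]
    return ordered_sum(partial_sums)


# Same three values, two different decompositions onto two processors.
split_a = global_sum([[0.1, 0.2], [0.3]])
split_b = global_sum([[0.1], [0.2, 0.3]])

# Same decomposition onto three processors, different arrival order.
in_order = global_sum([[0.1], [0.2], [0.3]], arrival_order=[0, 1, 2])
late_p0 = global_sum([[0.1], [0.2], [0.3]], arrival_order=[1, 2, 0])

print(split_a, split_b)    # differ in the last bit
print(in_order, late_p0)   # likewise
```

Both pairs differ only in the final bit, which is the sense in which the answers are scientifically equivalent yet not bit-reproducible.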

Having this kind of bit-reproducibility is useful for a number of purposes. If you make a non-scientific change to the code, one which you are sure (in theory) doesn't affect the computation - say, to the IO efficiency or something - then you can re-run and check this is really true. Or, if you have a bug that causes the model to crash, or behave unphysically, then you can run the code with extra debugging and isolate the problem; this is tricky if the code is non-reproducible and refuses to run down the same path a second time.
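One way such a check might be automated is to fingerprint the raw bytes of the model state and compare before and after the change. This is only a sketch under stated assumptions: the "model" is a toy logistic-map step, and `state_fingerprint`, `step_original` and `step_refactored` are hypothetical names, not anything from a real GCM.

```python
import hashlib
import struct


def state_fingerprint(state):
    """Hash the raw IEEE 754 double bytes of a list of floats."""
    packed = struct.pack("<%dd" % len(state), *state)
    return hashlib.sha256(packed).hexdigest()


def run_model(step, state, n_steps):
    """Advance the toy model n_steps times with the given step function."""
    for _ in range(n_steps):
        state = step(state)
    return state


def step_original(state):
    # Toy "physics": one logistic-map update per grid point.
    return [3.9 * x * (1.0 - x) for x in state]


def step_refactored(state):
    # Restructured loop, same arithmetic in the same order.
    new_state = []
    for x in state:
        new_state.append(3.9 * x * (1.0 - x))
    return new_state


initial = [0.3, 0.4, 0.5]
before = state_fingerprint(run_model(step_original, initial, 1000))
after = state_fingerprint(run_model(step_refactored, initial, 1000))
print("refactor preserved bits:", before == after)

# A last-bit difference anywhere in the state changes the fingerprint.
print(state_fingerprint([0.6]) == state_fingerprint([0.6000000000000001]))
```

Comparing bytes rather than values is deliberately strict: it also flags differences that `==` would hide, such as 0.0 versus -0.0.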

Obviously, if you make scientific changes to the code, it can't be reproducible with code before the change. Indeed, this is practically the defn of a scientific change: something designed to change the output.

The same code, with a different compiler, on the same machine. Or, what amounts to much the same, the same code with "the same" compiler, on a different machine

Not all machines follow the IEEE model (VAXes didn't, and I'm pretty sure DEC Alphas didn't either). Fairly obviously (without massive effort and slowdown from the compiler) you can't expect the bitwise same results if you change the hardware fundamentally. Nor would you expect identical results if you run the same code at 32 bit and 64 bit. But two different machines with the same processor, or with different processors nominally implementing IEEE specs, ought to be able to produce the same answers. However, compiler optimisations inevitably sacrifice strict accuracy for speed, and two different compiler vendors will make different choices, so there's no way you'll get bit repro between different compilers at anything close to their full optimisation level. Which level you want to run at is a different matter; my recollection is that the Hadley folk did sacrifice a little speed for reproducibility, but on the same hardware.

Does it matter, scientifically?

In my view, no. Indeed, it's perhaps best turned round: anything that does depend on exact bit-repro isn't a scientific question.

Why bit-repro doesn't really matter scientifically

When we're running a GCM for climate purposes, we're interested in the climate. Which is the statistics of weather. And a stable climate - which is a scientifically reliable result - means that you've averaged out the bit-repro problems. If you did the same run again, in a non-bit-repro manner, you'd get the same (e.g.) average surface temperature, plus or minus a small amount to be determined by the statistics of how long you've done the run for. Which may require a small amount of trickery if you're doing a time-dependent run and are interested in the results in 2100, but never mind.
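The same toy setting shows "weather" diverging while "climate" does not. The sketch below again assumes the logistic map at r = 3.9 as a stand-in for a model, with a spin-up period and a long-run mean playing the role of a climate statistic; the names and run lengths are illustrative assumptions only.

```python
def logistic_run(x0, steps, r=3.9, spin_up=1000):
    """Discard a spin-up period, then record `steps` states of the map."""
    x = x0
    for _ in range(spin_up):
        x = r * x * (1.0 - x)
    states = []
    for _ in range(steps):
        x = r * x * (1.0 - x)
        states.append(x)
    return states


def mean(values):
    return sum(values) / len(values)


run_a = logistic_run(0.3, 200_000)
run_b = logistic_run(0.3 + 1e-15, 200_000)  # rounding-error-sized nudge

# "Weather": the step-by-step states bear no relation to each other.
weather_gap = max(abs(a - b) for a, b in zip(run_a, run_b))

# "Climate": the long-run averages agree to within sampling noise.
climate_gap = abs(mean(run_a) - mean(run_b))

print("largest pointwise difference:", weather_gap)
print("difference in long-run mean: ", climate_gap)
```

The residual difference in the means shrinks as the runs get longer, which is the "plus or minus a small amount" determined by run length.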

Similarly, if you're doing an NWP run where you do really care about the actual trajectory and are trying to model the real weather, you still don't care about bit-repro, because if errors down at the least-significant-bit level have expanded far enough to be showing measurable differences, then the inevitable errors in your initial conditions, which in any imaginable world are far far larger, have expanded too.

Related to this is the issue people sometimes bring up about being able to (bit?) reproduce the code by independent people starting from just the scientific description in the papers. But this is a joke. You couldn't get close. Certainly not to bit-repro. In the case of a very very well documented GCM you might manage to get close to climate-reproducibility, but I rather doubt any current model comes up to this kind of documentation spec.

[Update: Jules, correctly, chides me for failing to mention GMD (the famous journal, Geoscientific Model Development), where the goal is what they call "scientific reproducibility".]

Let's look at some issues mt has raised

mt wrote There are good scientific reasons for bit-for-bit reproducibility but didn't, in my view, provide convincing arguments. He provided a number of practical arguments, but that's a different matter.

1. A computation made only a decade ago on the top performing machines is in practice impossible to repeat bit-for-bit on any machines being maintained today. I don't think this is a scientific issue; it's a practical one. But if we wanted to re-run, say, the Hansen '88 runs that people talk about a lot then we could run them today, on different hardware and with, say, HadXM3 instead. And we'd get different answers, in detail, and probably on the large scale too. But that difference would be a matter for studying differences between the models - an interesting subject in itself, but more a matter of computational science than atmospheric science. Though in the process you might discover what key differences in the coding choices lead to divergences, which might well teach you something about important processes in atmospheric physics.

2. What’s more, since climate models in particular have a very interesting sensitivity to initial conditions, it is very difficult to determine if a recomputation is actually a realization of the same system, or whether a bug has been introduced. Since this is talking about bugs it's computational, not scientific. Note that most computer code can be expected to have bugs somewhere; it would be astonishing if the GCM codes are entirely bug-free. Correcting those bugs would introduce non-bit-repro, but (unless the bugs are important) that wouldn't much matter. So, to directly address one issue raised by The Recomputation Manifesto that mt points to: The result is inevitable: experimental results enter the literature which are just wrong. I don’t mean that the results don’t generalise. I mean that an algorithm which was claimed to do something just does not do that thing: for example, if the original implementation was bugged and was in fact a different algorithm. I don't think that's true; or rather, that it fails to distinguish between trivial and important bugs. Important bugs are bugs, regardless of the bit-repro issue. Trivial bugs (ones that lead, like non-bit-repro, to models with the same climate) don't really matter. TRM is very much a computational scientist's viewpoint, not an atmospheric scientist's.

3. Refactoring. Perhaps you want to rework some ugly code into elegant and maintainable form. It's a lot easier to test that you've done this right if the new and old are bit-repro. But again, it's coding not science.

4. If you seek to extend an ensemble but the platform changes out from under you, you want to ensure that you are running the same dynamics. It is quite conceivable that you aren’t. There’s a notorious example of a version of the Intel Fortran compiler that makes a version of CCM produce an ice age, perhaps apocryphal, but the issue is serious enough to worry about. This comes closest to being a real issue, but my answer is the section "Why bit-repro doesn't really matter scientifically". If you port your model to a new platform, then you need to perform long control runs and check that it's (climatologically) identical. It would certainly be naive to swap platform (platform here can be hardware, or compiler, or both) and just assume all was going to be well. If there is an Intel fcc that makes CCM produce an ice age, then that is a bug: either in the model, or the compiler, or some associated libraries. It's not a bit-repro issue (obviously; because it produces a real and obvious climatological difference).

Some issues that aren't issues

A few things have come up, either here or in the original lamentable WUWT post, that are irrelevant. So we may as well mark them as such:

1. Moving to 32 / 64 / 128 bit precision. This makes no fundamental difference: it just shifts the size of the initial bit differences, but since this is weather / climate, any bit differences inevitably grow to macroscopic dimensions.

2. Involving numerical analysis folk. I've seen it suggested that the fundamental problem is one with the algorithms; or with the way those are turned into code. Just as in point 1, this is fundamentally irrelevant to this point. But, FWIW, the Hadley Centre (and, I assume, any other GCM builder worth their salt) have plenty of people who understand NA in depth.

3. These issues are new and exciting. No, these issues are old and well known. If not to you :-).

4. Climate is chaotic. No, weather is chaotic. Climate isn't (probably).

Some very very stupid or ignorant comments from WUWT

Presented (almost) without further analysis. If you think any of these are useful, you're lost. But if you think any of these are sane and you're actually interested in having it explained why they are hopelessly wrong, do please ask in the comments.

1. Ingvar Engelbrecht says: July 27, 2013 at 11:59 am I have been a programmer since 1968 and I am still working. I have been programming in many different areas including forecasting. If I have undestood this correctly this type of forecasting is architected so that forecastin day N is built on results obtained for day N – 1. If that is the case I would say that its meaningless.

2. Frank K. says: July 27, 2013 at 12:16 pm ... “They follow patterns of synthetic weather”?? REALLY? Could you expand on that?? I have NEVER heard that one before…

3. DirkH says: July 27, 2013 at 12:21 pm ... mathematical definition of chaos as used by chaos theory is that a system is chaotic IFF its simulation on a finite resolution iterative model...

4. ikh says: July 27, 2013 at 1:57 pm I am absolutely flabbergasted !!! This is a novice programming error. Not only that, but they did not even test their software for this very well known problem. Software Engineers avoid floating point numbers like the plague...

5. Pointman says: July 27, 2013 at 2:19 pm Non-linear complex systems such as climate are by their very nature chaotic... (to be fair, this is merely wrong, not stupid)

6. Jimmy Haigh says: July 27, 2013 at 3:25 pm... Are the rounding errors always made to the high side?

7. RoyFOMR says: July 27, 2013 at 3:25 pm... Thank you Anthony and all those who contribute (for better or for worse) to demonstrate the future of learning and enquiry.

8. ROM says: July 27, 2013 at 8:38 pm... And I may be wrong but through this whole post and particularly the very illuminating comments section nary a climate scientist or climate modeler was to be seen or heard from. (He's missed Nick Stokes' valuable comments; and of course AW has banned most people who know what they're talking about)

9. PaulM says: July 28, 2013 at 2:57 am This error wouldn’t be possible outside of academia. In the real world it is important that the results are correct so we write lots of unit tests. (Speaking as a professional software engineer, I can assure you that this is drivel).

10. Mark says: July 28, 2013 at 4:48 am Dennis Ray Wingo says: Why in the bloody hell are they just figuring this out? (They aren't. It's been known for ages. The only people new to this are the Watties).

11. Mark Negovan says: July 28, 2013 at 6:03 am... THIS IS THE ACHILLES HEAL OF GCMs. (Sorry, was going to stop at 10, but couldn't resist).

Refs

* Consistency of Floating-Point Results using the Intel® Compiler or Why doesn’t my application always give the same answer? Dr. Martyn J. Corden and David Kreitzer, Software Services Group, Intel Corporation

"The same code, on the same (single-processor) machine"... a little bit more than a decade ago I was running large numbers of climate runs on a small, somewhat rickety Beowulf cluster, and there were some stability problems in that when we were doing a particularly large set of Monte Carlo analyses the cluster as a whole would heat to the point that there were occasional bit flips. Which meant that while 3 out of 4 runs of an identical experiment would be identical, the 4th would diverge. (and about 1 in 100 runs would crash due to a bit flip in a particularly sensitive area).

Not a problem scientifically, but it did make debugging occasionally annoying. e.g., if you are updating the code with some non-scientific updates, it is nice to get exactly the same result to confirm that in fact the update didn't change anything relevant to the computation itself.

-MMM

ps. This is also why it is preferable, when computationally feasible, to run ensembles rather than single runs, in order to avoid the appearance that one is attempting to project weather rather than climate.

well, 6 and 7 were particularly droll.

You state "anything that does depend on exact bit-repro isn’t a scientific question". This is a reasonable enough definition, but it doesn't leave us much worth discussing. You're inviting me to dispute a tautology of your own construction. I do not intend to dispute that there are no members of set A that are nonmembers of set A.

[I intended it more as a distinction worth making. As I've agreed, there are definitely coding issues relevant to bit-repro. And I think it's worth discriminating between these and scientific issues -W]

OTOH, the whole definition of science itself traces back to reproducibility. Thus, maximal reproducibility is a worthy goal for the advancement of science and that its sacrifice should be conscious, deliberate, documented and defended. It should not be a matter of course.

[Oh hold on, that's a bit of a step you've made there. I agree that repro is important - but as I've attempted to say, scientific repro isn't the same as bit-repro. Or at least, that's the argument I'm making. But to step from that to "maximal reproducibility is a worthy goal for the advancement of science" is not clearly justified -W]

There was a reproducibility track at SciPy here in Austin just last month,

https://conference.scipy.org/scipy2013/presentations.php#Reproducible%2…

which follows onto a special issue of Computing in Science and Engineering on Reproducibility in 2009.

Here's the introductory article to the latter.

http://www.computer.org/portal/c/document_library/get_file?uuid=55fc5f4…

The motto of this movement is probably "An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures."
—D. Donoho

[That's very much the CompSci view. But a paper about GCM results isn't "An article about computational science" in my view. Any more than an article about low-temperature superconductors is an article about liquid helium -W]

Last for now, see also http://sciencecodemanifesto.org/

[I should probably ask NB this... but hopefully he'll be along. I'll ping him. What does "All source code written specifically to process data for a published paper must be available to the reviewers and readers of the paper" mean, in the context of a GCM paper? Does it mean the software to post-process the GCM outputs, or include the GCM code itself? If the latter, does it mean all of it? And if we decide we care about compile optimisations, would we be obliged to provide the compiler code too? -W]

“The same code, on the same (single-processor) machine”

This really has to do with the structure of the application. Yes, multiprocessing is (in a sense) emulated on a single-processor machine, but this doesn't mean you can't get the same sort of bugs due to multiprocessing in that environment.

In practice, the kind of decomposition you're speaking of (for a Beowulf cluster or the like) is just silly and just makes your code run slower in a single-processor environment.

But multiprogramming errors are rampant even on code run on single processor machines. Totally common kind of bug in OS code, for instance.

And not really relevant to the points you make in your post, I'm being nitpicky.

MT:

"There was a reproducibility track at SciPy here in Austin just last month"

Skimming the papers there, nothing pops up that addresses the kind of issues being discussed here. Rather, there are presentations regarding organizing and disseminating software and data, tools for visualization, etc.

i.e. trying to reduce the difficulty in reproducing results due to things such as "your make files don't help me with MVC", or failure to verify that all bits of code and data used in the computations backing up a paper are actually included in a public release, etc.

Many, many years ago I printed out results of a model at every time step to full precision, and watched differences grow from the last decimal place upward. At the time I didn't know about Lorenz etc. so when I finally did hear about chaos and nonlinearity it was a great experience -- "Yes! I've seen that!"

Enough of an old guy telling pointless stories. Something perhaps more relevant is that so far as I understand it, IEEE754 does not extend to transcendental functions such as logarithms or trig functions. These are computed by various approximation methods which can vary from one vendor or compiler release to another.

#4 you are right. I did change the subject.

It was an opportunity to pitch what I'll call "metacomputational tools" to a community woefully underexposed to it.

It also seemed an obligation not to leave the reproducibility flag in Watts' hands for the climate community. There is more to it than their foolishness.

I fully concur that their argument is a clear demonstration of cluelessness, but only to those with sufficient clue.

But there is more to "reproducibility" than that. That's my point.

However, fair enough.

I need to clarify that I totally agree with the thrust of William's critique of Watts.

"4. Climate is chaotic. No, weather is chaotic. Climate isn’t (probably)."

What about D-O events? Perhaps it would be more correct to say that current climate doesn't seem to have any large chaotic cycles.

[No-one follows links any more :-( I meant Holocene climate -W]

American Idiot, a fine point, and one which strongly argues for open source compilers in most publicly funded science applications.

Accept half the performance per clock cycle, and spend the money you save in software licensing; spend it on CPUs and memory!

My claim was " Thus, maximal reproducibility is a worthy goal."

William complains it was "not clearly justified". Okay, consider this.

Reproducibility is clearly a monotonically beneficial metric of the performance of a research project. There isn't any such thing as excessive reproducibility in a result BY SOMEBODY ELSE; given imperfect reproducibility all else equal one might be indifferent to more of it but would never ask for less of it.

It's not THE goal of course! If it were, science would be happiest doing nothing at all, which is perfectly reproducible without effort. Watts and company demonstrate this sort of reproducibility daily.

But if you're going to do something, you might as well do it right. Maximizing the ease with which your work can be reproduced by others is right, in the sense that it provides maximum utility for the human enterprise.

Interests such as Portland Group, Matlab, SPSS, Wolfram, etc., who are more unhappy than they would like to admit about gnu fortran, SciPy, R, Sage, would be hard put to argue for openness. This is understandable, but they are not the tools for the job.

Some individual scientists also benefit from "owning" datasets. This is really terrible and shouldn't be tolerated.

A side point about all this is that whether or not climate has high sensitivity to initial conditions pretty much doesn't matter; all it means is that the error bars in our curves should be made larger. Sure, that means there's a better chance that global warming will be considerably less than predicted, but it also means a better chance that it will be considerably more than predicted.

[I don't think that's correct. If climate were sensitive to initial conditions then it wouldn't be possible to produce stable control runs; without that, any kind of assessment of the GCMs would be immensely harder -W]

Just to pop in to say that I agree almost 100% with William's framing.

In our GCM development, bit-wise reproducibility on the same machine, same compiler + differing number of processors (from 1 to N where N is a large(ish) number) is a key requirement. Without it, diagnosing errors is almost impossible, and failures at bit-wise repro almost always result from a misconception in the coding (assuming something was global that wasn't, improper halo updates) or something irrelevant (random number seeding).
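How a reduction can be made independent of the number of processors is not described above, so the following is only one possible approach, sketched under assumptions: every value carries its global index, and the final sum is always taken in global index order, whatever the decomposition. The function names and the tiny three-value field are hypothetical, and real models may well use something cheaper than gathering everything.

```python
def ordered_sum(values):
    """Add values strictly left to right."""
    total = 0.0
    for v in values:
        total += v
    return total


def naive_global_sum(blocks):
    """Per-block partial sums combined on one processor (order-dependent)."""
    partial_sums = [ordered_sum([v for _, v in block]) for block in blocks]
    return ordered_sum(partial_sums)


def reproducible_global_sum(blocks):
    """Gather (global_index, value) pairs and sum in global index order."""
    gathered = sorted((i, v) for block in blocks for i, v in block)
    return ordered_sum([v for _, v in gathered])


# The same field, as (global_index, value) pairs, on 1, 2 and 3 processors.
one_proc = [[(0, 0.1), (1, 0.2), (2, 0.3)]]
two_procs = [[(0, 0.1)], [(1, 0.2), (2, 0.3)]]
three_procs = [[(2, 0.3)], [(0, 0.1)], [(1, 0.2)]]

naive = [naive_global_sum(b) for b in (one_proc, two_procs, three_procs)]
fixed = [reproducible_global_sum(b) for b in (one_proc, two_procs, three_procs)]
print("naive:       ", naive)
print("reproducible:", fixed)
```

The trade-off is communication: a fixed-order sum gives identical bits on any processor count, at the price of moving every value rather than one partial sum per block.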

Different compilers, different machines (with different math libraries) give different trajectories for the reasons described above. We have not detected any differences in climatology because of this though (echoing William's point that the climate in these models is not chaotic, only the weather is).

Historically speaking, there have been cases where codes run on different machines had different climates. These issues tended to be associated with codes that used old-style common blocks and out-of-bounds references that would be expressed differently with different compilers. i.e. an error in the code led to two machines doing something differently wrong when it was run. This predates my time as a developer, and codes have been improved so that these kinds of problems do not arise anymore. It is conceivable that something similar could happen again, and so it is worth checking on the climate reproducibility even now.

MT:

"#4 you are right. I did change the subject.

It was an opportunity to pitch what I’ll call “metacomputational tools” to a community woefully underexposed to it."

It's extremely important work, IMO, I wasn't trying to belittle the work represented by those presentations, just didn't see the relationship with WMC's post ...

He’s missed Nick Stokes’ valuable comments; and of course AW has banned most people who know what they’re talking about.

The latter phrase in that statement is absolutely true, but you miss pointing out that Nick Stokes, while not yet banned, routinely has his comments deleted with a vengeance (and an ignorant, snarky, asstarded comment from Watts himself, usually), so it's not technically true to say that his comments were missed.

You can't miss what you never had the chance to see.

WUWT is a sad little echo chamber occupied by trolls.

What I find continuously amusing is that I grew up in Chico, CA, during the time when Anthony Watts was a TV weather guy, and we used to joke that the campus radio station "stick your head out the window and guess" forecast was consistently more accurate than Watts'. Further, the standard line was, "If Watts tells you to bring an umbrella, don't, but grab your sunscreen. You'll need it!"

How someone trained in, but so bad at, meteorology can so Dunning-Kruger his way into thinking anything he has to say about climatology is even remotely valid is simply amazing to me.

[Agreed, NS is being shamefully abused in those threads. However, some of his comments have survived, and are both valuable and correct, so ROM is still wrong -W]

Gavin:

"In our GCM development, bit-wise reproducibility on the same machine, same compiler + differing number of processors (from 1 to N where N is a large(ish) number) is a key requirement."

Why am I not surprised?

"Different compilers, different machines (with different math libraries) give different trajectories for the reasons described above. We have not detected any differences in climatology because of this though (echoing William’s point that the climate in these models is not chaotic, only the weather is)."

This comment also totally shoots down the WATTSian baloney along the lines of "they've never thought of this?", "they've never tested for such problems?" etc etc ...

It motivated me to take a look at the Model E documentation (something obviously none of the WATTSian types bothered to do):

"The code requires a FORTRAN 90/95 compiler. It has been tested with the SGI, IBM, and COMPAQ/DEC compilers on their respective workstations. For Linux, Macs or PCs, the choice of compiler is wider, and we have not been able to test all possibilities. The Absoft ProFortran compiler works well, as do the Lahey/Fujitsu and Portland Group compilers. We have been unable (as yet) to get the model to compile with the Intel or VAST compilers due to internal errors in those products. Please let us know if you succeed (or more importantly, fail) in compiling the code using any other compiler."

I assume "works well" and "We have not detected any differences in climatology because of this though" are saying essentially the same thing ...

I can't tell you how many times I've refuted denialist claims about GCMs simply by taking five minutes to dig around the excellent GISS Model E site to find out what is really done, vs. what denialists claim is done, by real-live climate modellers.

"Historically speaking, there have been cases where codes run on different machines had different climates. These issues tended to be associated with codes that used old-style common blocks..."

OMIGOD, old-style COMMON blocks ... :)

makeinu:

"How someone trained in, but so bad at meteorology..."

This is totally O/T, but keep in mind that Watts doesn't have a university degree, not in meteorology, nor anything else. He's old enough that he predates the degree requirement for certification. So his training may well have been nothing more than training in script-reading and green-screen hand-waving.

At GMD (the famous journal, Geoscientific Model Development) the goal is what we call "scientific reproducibility".

[I almost referred to you, and I should have. I'll correct that. How close do you think you come to this goal? Has it ever been tested? My guess would be "unknown" and "no" -W]

1) See this comment at RC a while back. People with different sorts of technical backgrounds often over-generalize.
Some kinds of simulations *must* be bit-for-bit.

2) DEC Alphas could do either VAX or IEEE.

3) The SPEC CPU benchmarks (which in one iteration or another have been used to design microprocessors for 20 years) do exact comparisons of results of integer benchmarks and "fuzzy" comparisons on floating-point.

Long ago, we wanted to use a big integer code, timberwolf, simulating annealing for chip layouts. Unfortunately, it used a tiny amount of floating point that depended on least significant bits, which had large effects on the layouts. Amongst a bunch of machines (VAX, MC68K and RISCs), there were 3 distinct answers. All layouts were feasible, i.e., useful in the real world, and not too different in size. But they had different computational complexity, so were useless as benchmarks.

4) It can take a vast amount of effort to get bit-for-bit compatibility from floating-point codes. SAS has done that for decades, including having to achieve that across VAX, IBM mainframes, etc. They would avoid using dynamically-linked math libraries, asking instead to get ones they could statically link, so they wouldn't get surprised by a change.
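The exact-versus-fuzzy distinction in point 3 can be sketched in a few lines. This is not SPEC's actual harness: the function names and the tolerance of 1e-9 are assumptions for illustration, using only the standard library's `math.isclose`.

```python
import math


def integer_outputs_match(reference, candidate):
    """Integer results are compared exactly."""
    return reference == candidate


def float_outputs_match(reference, candidate, rel_tol=1e-9, abs_tol=0.0):
    """Floating-point results are compared within a tolerance."""
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )


machine_a = [0.6, 1.0, 288.15]
machine_b = [0.6000000000000001, 1.0, 288.15]  # last-bit difference

print("exact match:", machine_a == machine_b)
print("fuzzy match:", float_outputs_match(machine_a, machine_b))
print("integers:   ", integer_outputs_match([1, 2, 3], [1, 2, 3]))
```

A fuzzy comparison only helps when small differences stay small; as the timberwolf story shows, a code that branches on the least significant bits can fail it while still producing useful answers.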

[Read all about it over here -W]

David Young:

"The references give some details of our recent work. We are not in the business of “getting people to read our papers” so they are not that visible unless you know what to look for."

You're right, so give us the references.

Anyway, glad you've overturned all of climate science. Get in line for your Nobel, you're just the last in a long line of those who have [claimed] to have done so.

And most of numerical modelling, too, good job!

I'm sure you'll never step aboard a modern airliner again ...

dhog, I am not claiming to overturn all climate science (where did you get that?), just that a very specific claim is questionable.

The reference is readily available on the AIAA site (aiaa.org) and on James' thread another reference is given that is also referenced in the subject paper.

DY:

"The doctrine that seems to me to be the basis of this whole thread and the previous one is the uniqueness of climate as a function of forcings. If this is false, pretty much everything else said here is questionable."

No, the issue is whether or not climate modellers are aware of the pitfalls. I would be amazed if climate modellers would claim that climate is "unique as a function of forcings", and even more amazed if they were unaware of relevant literature.

You're posting from a combination of personal authority and an assumption of professional ignorance on the part of modellers.

"Basically, if your problem is well-posed these rounding errors can be proven to not make much difference. For the ill-posed problems, they do make a difference and as discussed above"

Duh (presuming well-behaved rounding algorithms at the level of basic arithmetic operations).

Anyway, it is a genuine relief to see that David Young is here to teach Gavin Schmidt, among others, how to do their job ...

[Please stop feeding the trolls -W]

[Again, see the burrow -W]

[Getting pretty bored now. Your posts are going to just start disappearing entirely if you have nothing to say. See burrow]

It amuses me that the "skeptics" make a fuss about bit-wise reproducibility, given that if we modelled the climate by getting Magrathea to build us several duplicate Earths and used them to simulate our climate, that wouldn't be exactly reproducible either, even though it would be a perfect model with perfect climate physics and continuous spatial and temporal resolution. Bit-wise reproducibility is important for the reasons that Gavin mentions, but the validity of the basic approach to ensemble forecasting/projection isn't one of them.

A useful way of thinking about what we can reasonably expect from GCMs is to think of what a perfect GCM (a duplicate Earth) would be able to do and use that as an upper bound on your expectations.

You called?

So much to say, so little time. Yes, of course an article about GCM
results is an article about computational science. Computational
science is any science of which computation is an indispensable part,
which certainly includes anything which uses a swutting great model.

Bitwise reproducibility is not some sort of mystical unattainable
state, but unfortunately almost all working scientists are
ill-equipped, in skills, knowledge, and tools (most obviously:
language implementations and libraries with bitwise-defined
semantics), to deliver bitwise reproducibility, and so it is
unreasonable to expect it of them.

[But you must have talked to the UKMO folk. They do fully understand the issue. I'm rather sensitive to what can appear to be somewhat patronising paragraphs like the above, because the Watties have flung very similar but more highly charged assertions of incompetence. Calling it a "mystical unattainable state" is a straw man, no? -W]

Nonetheless it is a worthy goal
(as Our Gracious Host notes it can be useful for regression testing
among many other things), and there are definitely strong moves in
that direction as increasing numbers of scientists start to take it
seriously and remedy some of those deficiencies.

Scientific reproducibility is simply the ability to reproduce the
*scientific* results. Without scientific reproducibility, what you're
doing isn't science any more and you shouldn't get published (or
paid), but that's another rant. It *can* depend on bitwise
reproducibility (but usually doesn't).

[Ah, interesting. So far, I haven't seen any clear examples where sci-repro depends on bit-repro. Can you provide one? -W]

Looked at the other way
around, if you have bitwise reproducibility then that gets you a long
way towards scientific reproducibility.

[This is very hand-wavy, and I'm not at all convinced it's true. Bit-repro merely allows you to run someone else's code and get the same answer. Scientifically, that gets you nowhere - you might just as well have picked up their output files in that case and saved yourself the trouble of running the model -W]

So the difficulty we face is
in deciding to what extent a computational result is a scientific
result. For that purpose, the combination of domain expertise,
familiarity with the code base, and a good understanding of numerical
analysis, is indispensable. Only a climatologist is equipped to
decide what is "(climatologically) identical", and as long as she
makes (and states) that decision in advance of the comparison then
that's hunky-dory.

Alarmingly often in some "sciences" a researcher makes a mistake in
this, and decides that some "result" is scientific when in fact it's
just an artefact of the computational method. Reinhart & Rogoff is an
excellent example of this in economics for which we are all still
paying.

[If that's what I think you're thinking of, there was a significant bug in their code. We're all agreed that (scientifically) significant bugs in code are, errm, significant scientifically; so the relevance of this to the issue we're discussing here is unclear to me -W]

I've seen much less of it in the physical sciences such as
climate science, because mostly the practitioners are smart, careful,
and battle-scarred. I guess I should make an exception for the
bloggier denialist fringes, where people like Ross McKitrick sometimes
seem incapable of getting even the simplest things right (remember
http://scienceblogs.com/deltoid/2004/08/26/mckitrick6/ for example).

On implementation I defer to John Mashey, as we all should (it's a
long time since I last found a bug in one of his processors). I
haven't spent much time at this particular bit-face since the mid 90s,
round about the same time that people were first taking conformance to
the relevant IEEE standards seriously. I was hacking compiler
back-ends, for a programming language with a well-defined semantics,
evaluation order, etc. In that sort of environment, you stand a good
chance of getting bitwise reproducibility even if you change
processors, compilers, optimization levels, etc. Unfortunately, most
science codes are in FORTRAN, where such rigour is sometimes possible
(depending on your compiler) but very rare.

Yes, "compiler optimisations inevitably sacrifice strict accuracy for
speed", but IMO that's usually a mistake: it amounts to saying "Well,
I can't do *those* sums any quicker, but I can do *these other ones*
instead, which are kinda the same." Bah.

Yes, the Science Code Manifesto does mean *all* the code written for
the research. This ought to be no big deal (it's just a git commit
tag, right?). For almost all publications, if it *is* a big deal,
you're doing it wrong. It *does* include all the little shell scripts
and config files. There are more and more tools available to make
this reasonably easy (basically to deliver it for free as part of
writing the paper). For a good recent accessible example, see
http://nbviewer.ipython.org/urls/raw.github.com/robertodealmeida/notebo…

Some journals require code publication already, and it is the future,
so scientists need to get with the program or be crushed under the
juggernaut. Greg Wilson has it right: "Scientists won't submit,
publish, and download papers. They will fork and merge projects." For
my deliberately inflammatory presentation about code publication, see
http://climatecode.org/blog/2013/07/ten-reasons-you-must-publish-your-c…

Of course code publication is necessary but not sufficient for bitwise
reproducibility. Where a gap remains, you can close it with
virtualisation. For more on this, talk to Victoria Stodden.

Back to climate science: I don't see how anyone who actually lives on
the surface of a planet can assert that climate is chaotic. The
suggestion is an absurd and useless distraction, like pretty much
everything from WTFUWT.

If weather and climate excursions are limited by conservation of energy, dissipative processes have to limit butterfly effects, which makes it concerning that the models diverge depending on initiation, precision, etc.

[I'm not following you. Weather is a dynamic equilibrium. Effectively, the mid-latitude weather is the atmosphere shedding energy. Energy conservation makes no statements about which day of the week the next storm will turn up on -W]

My attempt to respond to some of your inline comments from my phone browser failed mysteriously. Briefly:
Yes, UKMO GCM people understand bitwise repro to be important, and we had a good chat about it when I went down there last year to talk about CCF. Since you don't find it important, perhaps you should take it up with them. Climate scientists in general, including many at UKMO, like scientists in other disciplines, don't see the importance. That's slowly changing, and a good thing too.
Yes, R&R had a bug in their code (in fact, several bugs: one could speculate on whether they were entirely accidental but that's rather off-topic). Everyone has bugs in their code. That's why code publication is important. Without the publication of R&R's spreadsheet, and the attempts by others to bitwise repro the results (i.e.: to re-run the spreadsheet) and analyse the code, we would never have known. Without code publication, any attempt at scientific repro is on thin ice (there are a number of studies showing this, although none in climate science AFAIK), and can swiftly degenerate into he said/she said. Without code publication at least, and bitwise repro at best, how are you going to tell whether your failed attempt at scientific repro (which may be necessary to build on someone else's result) is failing because (a) you made a mistake, (b) they made a mistake, or (c) they didn't communicate their methods well enough ?

By the way, I completely agree that the principle of bitwise reproducibility needs to be considered carefully for "large computations"; I just don't think it should be dismissed or abandoned lightly. And system size is no reason not to require code publication.

While I generally agree with you, I do not quite like how you generously shrug off the reproducibility issue. In detail, two things come to my mind:

1. Reproducibility should be a goal, even though bitwise reproducibility is over the top. If we know that, say, single trajectories are rather pointless, then at least it should be reasonably simple to produce the ensemble, from which you get the same values again. This should be part of papers IMHO.

[I agree that repro should be a goal; but as I said, I think the important goal is sci-repro -W]

In the course of my work, I tried to implement some reference cases for a program of my own, and repeatedly failed to reproduce published results until I contacted the authors ("Oh, this parameter is actually not 6, but probably more like 4", "Well, we did not actually use _this_ potential", ...). While these errors can happen when writing up a paper, they should be documented in a foolproof manner (e.g., by striving to reproduce the content of a paper with a literal button push, and adding the resulting scripts as supplement).

[This is where it starts to get tricky. For, say, HadXM3 you need the full code and config files, not just the scientific description. It certainly used to be true that the Hadley / UKMO folk (to be fair, the Head Honchos not the people actually doing the work) regarded this as semi-secret and not publishable; which I'd argue is wrong -W]

2. The finding of bugs may be a practical and not a "scientific" problem, but is nevertheless important for the science. How can you ever be reasonably sure that your calculations are correct? So rather strict reproducibility for the sake of automated testing should indeed be a goal, which goes a long way towards making your program bug-free and overall better designed.

Again, mt takes a wrong angle, but that does not mean his arguments can be discarded. In practice, you typically have different classes of tests, such as unit tests (make sure that a given piece of code does what it claims to do), integration tests (making sure that different pieces of code work together smoothly without hidden and violated assumptions), and application tests (make sure your complete program works well). These are muddied in the whole discussion IMO.

What you and mt are talking about are application tests. And mt's point is actually bad: If you cannot reproduce something well by virtue of the underlying physics, then it makes no sense to use this as a test case; instead, you have to use ensemble-averages, as you point out.

However, application tests, though not superfluous, are the least important tests. There are simply so many parameters entering the program that you can never even come close to exhaustively testing your application. This requires especially unit tests, and here it would make sense to cut the program down into units that produce repeatable results. I do not know about climate science, but if people there are as ignorant as in other fields, testability is likely to be poor to non-existent, which is far behind the state of the art.
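To make the unit-test idea concrete, here is a minimal sketch in Python. The unit under test, `trapezoid`, is a hypothetical stand-in for any small numerical routine cut out of a larger model, and the tolerance is an assumption chosen from the known error of the method.

```python
import unittest

def trapezoid(f, a, b, n):
    """Integrate f over [a, b] with n equal trapezoids (hypothetical unit)."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

class TrapezoidTest(unittest.TestCase):
    def test_matches_analytic_value(self):
        # The integral of x^2 over [0, 1] is 1/3; the tolerance is fixed in advance.
        result = trapezoid(lambda x: x * x, 0.0, 1.0, 1000)
        self.assertAlmostEqual(result, 1.0 / 3.0, delta=1e-6)

    def test_repeatable(self):
        # Same inputs in the same process: the unit should repeat bit for bit.
        first = trapezoid(lambda x: x * x, 0.0, 1.0, 1000)
        second = trapezoid(lambda x: x * x, 0.0, 1.0, 1000)
        self.assertEqual(first, second)

# Run the two tests and keep the result object for inspection.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TrapezoidTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The first test is a tolerance check against known physics or mathematics; the second is the strict repeatability check, which is cheap for a small unit even where it is hopeless for the whole application.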

If mt has this in mind, then, skipping all the silliness about bit-wise reproducibility, I can only support his notion.

#31 +1; pretty much what I am saying.

I think the value of bitwise reproducibility is being grossly underestimated.

People think I am supporting Watts.

[Good grief no. I never thought that -W]

In no way do I propose that if two computations of a climate are performed using the same equations and same initial conditions and reach an instantaneously different state, at least one must be incorrect. This claim shows a failure to understand the nature of the climate problem.

But consider my experience when the computing center "upgrades" a compiler and discontinues support of the one I have been using. I may be in the midst of a multi-month ensemble. Hundreds of thousands of CPU hours have been invested in the experiment. Now my executable becomes precious because I cannot replicate the build. But suppose something now changes in the physical environment and I have to rebuild against a new message passing library.

Now I have the question of whether I am in the "same" ensemble. Can I just take Gavin's "probably" for granted?

This all did actually happen to me, by the way.

I cannot replicate bit-for-bit using the same code on the same machine. Indeed, I cannot demonstrate that I am using the same executable or the same code for the different parts of the experiment. And it's not for lack of understanding the problem on my part. It's because the compilers we use and the policies of the host supercomputers are not entirely suitable for the purposes we use them for.

They may be computation friendly but they are software engineering hostile. These are platforms where by definition the usual commercial expectation that personnel cost dominates computation cost does not hold. But the product is neither the design of the computation nor its execution, but the science that results.

The impossibility of bitwise reproduction makes it impossible to confidently build directly on the work of others. That this is not done in practice does not mean it shouldn't be. There are many hurdles to building on an existing computational experiment, even in one's own group. Lack of bitwise reproducibility over time exacts a huge toll on group productivity and on the reliability of their output. Very careful practice in version control and tracking can alleviate this to some extent but many other aspects of the platforms and the culture, including scientists untrained in software engineering, and codes tested only on one platform, effectively make this excessively difficult as well.
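One cheap piece of the "tracking" mentioned above is to record a hash of the executable and its inputs with every run segment, so that one can at least show whether two parts of an experiment used the same binary. A minimal sketch in Python follows; the names `sha256_of` and `build_manifest` and the stand-in file `model.exe` are hypothetical, and this only detects a change, it does not prevent one.

```python
import hashlib
import json
import os
import platform
import sys
import tempfile

def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(paths):
    """Record file hashes plus basic platform details for one run segment."""
    return {
        "files": {os.path.basename(p): sha256_of(p) for p in paths},
        "machine": platform.machine(),
        "system": platform.platform(),
        "python": sys.version.split()[0],
    }

# Demonstration with a stand-in file instead of a real model executable.
with tempfile.TemporaryDirectory() as tmp:
    exe = os.path.join(tmp, "model.exe")
    with open(exe, "wb") as handle:
        handle.write(b"pretend binary")
    manifest = build_manifest([exe])

print(json.dumps(manifest, indent=2, sort_keys=True))
```

Storing such a manifest next to each segment's output makes "am I in the same ensemble?" a question about data rather than memory.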

[I think I'm entirely happy with the idea that there are genuine practical advantages to bitwise repro. And I hope I've never said otherwise -W]

I should concede that Michael's specific example in #33 is a good one. Having to split an ensemble over multiple machines and/or compilers is unfortunate and is best avoided. But a solution (given current infrastructure) is better management, surely? First, don't start big ensemble projects when the computer hardware/OS is changing, but if you do, make sure that you check that the ensemble characteristics and/or climatologies are stable (by doing just enough duplicates to check). If they don't match, then there is a bigger problem than not bit-repro.

I agree this would be unnecessary in an ideal world, but neither is this an everyday occurrence, and I don't see that it exacts a 'huge toll'. Building virtual machines for every computation is also a big job (and in some cases impossible) and the cost/benefit ratio really needs to be examined, rather than assumed.

sci-repro is by far the most important issue - and for that to be as solid as possible you want completely different codes and new ensembles because almost everything you care about is a function of the ensemble, not the individual trajectory (I can think of a few papers where this wasn't true, but they are rare).

Sadly, I'm out of bandwidth to say much on this, but I'd observe:

1) Back in the 1970s, when Bell Labs was 25,000 people, we had a vast range of software projects, from 1-person efforts by science researchers, to 300-person organizations working on large database or switching-machine systems.

2) The technical management was pretty savvy, and expected the *appropriate* level of software engineering. At one end, scientists at Murray Hill might write one-offs to get some results, and any good software engineer would cringe at the code.
At the other end, one would find software-engineering rigor beyond what most had ever heard of, already there in the 1970s. SCCS (the ancestor of most software version control schemes) was developed by the next office (Marc Rochkind) and my office-mate, with a lot of discussion on our whiteboard. That was ~1974. We had to support multiple rigidly-controlled libraries and software configurations across a million-person organization, with multiple versions in the field. I mentioned SAS, as they have similar issues, and people might want to see the size of staff they have to do this.
Stu Feldman did "make."
I did shell programming for procedural automation. We built test systems that could simulate people at terminals entering transactions so that results could be compared. The switching OS folks had elaborate test frames, and whole departments dedicated to that.

3) I used to occasionally be asked to lecture in the Software Engineering Project Management course, i.e., an internal course for Bell Labs managers. (Internal courses were interesting: take algorithms from Al Aho, or operating systems from Ken Thompson, etc.) I always reminded people that they needed to fit the software engineering methodology to the nature of the project.

4) The general model is the usual S-curve, with:
X-axis: level of effort involved in software engineering
Y-axis: value of doing that

In the 1970s, the initial part of the S-curve was pretty long - it could take a lot of effort to do OK, because the tools were pretty bad. People did a lot of work to make that easier, such as the shell scripting to stop people needing to write C programs just for simple tasks. John Chambers did S so that researchers could do better statistics, easier.
Of course, faster computers helped, but general progress moved the S-curve to the left.

But for each kind of work, there's always an S-curve, and at some point, spending vastly more effort on software engineering isn't yielding much benefit.

See SAS Institute, a $3B company with ~13,000 employees, about 25% in R&D. They work very hard at reproducibility and it's worth it, for them.

> The same code, on the same (single-processor) machine

Reproducibility also on this level is actually not necessarily guaranteed. Optimizing compilers regularly generate separate code paths depending on whether the data is aligned on 16, 32, etc byte boundaries in memory. Rounding errors on the different paths can be different, and since memory alignment on this level is random (depending on OS and on the structure of your code), your results too can be. In fact, Intel compilers sport separate flags for enforcing reproducibility at the expense of speed.

This of course has zero meaning on scientific results.
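The mechanism behind the claim above is that different code paths add the same numbers in a different grouping. A minimal sketch in Python, where `sum_in_lanes` is a hypothetical stand-in for a vectorised loop keeping several partial sums; it imitates the arithmetic only, not any particular compiler.

```python
def sum_sequential(values):
    """Add left to right, as a scalar loop would."""
    total = 0.0
    for v in values:
        total += v
    return total

def sum_in_lanes(values, lanes):
    """Keep `lanes` partial sums (like SIMD lanes), then combine them."""
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    return sum_sequential(partial)

values = [1e17, 1.0, -1e17, 1.0]      # the exact sum is 2.0

print(sum_sequential(values))         # 1.0: the first 1.0 is absorbed by 1e17
print(sum_in_lanes(values, 2))        # 2.0: the large terms cancel within one lane
```

The inputs are deliberately extreme so the difference is visible; in a real model the two paths would typically differ only in the last bit or two, which is all chaos needs.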

VP:

"Optimizing compilers regularly generate separate code paths depending on whether the data is aligned on 16, 32, etc byte boundaries in memory. Rounding errors on the different paths can be different, and since memory alignment on this level is random (depending on OS and on the structure of your code)"

Any OS I've worked with allows the compiler to generate object code directives to force alignment, and typically compilers do so. And typically the executable loader and linker both align code and data sections on an appropriate boundary, and the compiler is aware of this alignment.

And besides, if alignment were really random the compiler wouldn't know which code sequence would optimize memory access in the first place. Then you've got the processor's various memory caches that make the effect of alignment unpredictable anyway ...

It's pretty typical for compilers to allow one to specify alignment options, which are passed on to the linker and honored by the executable loader.

It makes no sense to me to delay such decisions to run time (i.e. dynamically test the alignment of a variable and execute different code). On the other hand, I've been out of the optimizing compiler business for 14 years, so if you know of documentation on a real-world compiler system that acts as you suggest, I'd welcome a link.

It's possible you're thinking about the fact that compilers do have to make decisions as to what precision intermediate results should be kept in, particularly when operands of varying precision (commonly 32-bit float and 64-bit double types in C) are mixed in expressions. Differing optimization levels can cause conversions to happen at different times, as can varying interpretations of the language's semantics.
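A minimal sketch of why the timing of a conversion matters, in Python. The helper `to_f32` is an assumed stand-in for a C `float` store, using the standard `struct` module to round a double to the nearest 32-bit value; the operands are chosen so the arithmetic can be checked by hand.

```python
import struct

def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

a = 1.0 + 2.0 ** -13          # exactly representable in binary32
b = 1.0 + 2.0 ** -12          # also exactly representable

# Intermediate product kept in double; only the final result is rounded to single.
wide = to_f32(a * a - b)

# Intermediate product rounded to single before the subtraction.
narrow = to_f32(to_f32(a * a) - b)

print(wide)     # 2**-26, about 1.49e-08
print(narrow)   # 0.0: the low bits of a*a were dropped by the early conversion
```

Same source expression, same inputs, two answers, depending only on when the intermediate was narrowed.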

All of which amounts in my ears to a ringing condemnation of the programming languages in use. How can anyone tolerate a system in which one can't reason rigorously about the value of an expression? Madness. Madness, I say.

Imagine if our approach to integer arithmetic were the same: voodoo and hearsay. We'd still get computation done, by hook or crook. People would say "oh, well, division, yes, you never know what you're going to get in the bottom bit, or that's what my PhD supervisor told me, so I wrote this extra routine to deal with that case."

"First, don’t start big ensemble projects when the computer hardware/OS is changing,

Sounds good. But in practice the "platform" you are building on is more like a debris raft. I think the VM solution is a nice one pending somebody finally coming up with a language that is suitable for scientific computation.

(Travis Oliphant promises that it will be Python itself. We'll see. It would be nice.)

"but if you do, make sure that you check that the ensemble characteristics and/or climatologies are stable (by doing just enough duplicates to check)"

Gaah! Is there a standard way of identifying two identical climatologies? It would seem to be definitive only in the limit of infinite computation! Cheaper to start the experiment over.

Nick Barnes:

"All of which amounts in my ears to a ringing condemnation of the programming languages in use. How can anyone tolerate a system in which one can’t reason rigorously about the value of an expression? Madness. Madness, I say."

As I'm sure you know, it is inherent in using floating point representations for computation, not language. As has been said repeatedly, compilers do and long have provided switches to turn off optimization and will guarantee execution order as written by the programmer other than following operator precedence rules (true in Python as well as FORTRAN). Of course, these rules can be overridden by the use of parentheses.

As to why floating point is used, that's a different discussion.

MT:

"Sounds good. But in practice the “platform” you are building on is more like a debris raft. I think the VM solution is a nice one pending somebody finally coming up with a language that is suitable for scientific computation."

Bah, your computer center people will just kill VM images when they want to force an upgrade on people :)

The problem is basically [mis]management, and a traditional computer center staff attitude that they are not really there to serve clients, but to run their domain as they see fit and as suits their needs.

One interesting thing about public cloud is that in many companies with overbearing IT departments, people wanting to get things done have found that buying VMs from a public cloud provider is a way to get out from under their thumb. IT departments are recognizing this too, and you can imagine the rest of the story :)

"Gaah! Is there a standard way of identifying two identical climatologies? It would seem to be definitive only in the limit of infinite computation! Cheaper to start the experiment over."

It would be nice if Gavin could shed some light on how NASA GISS decides this, or point to written stuff on their website or the literature.

Nick Barnes:

"Madness. Madness, I say."

It's also true that crafting algorithms and their implementations when floating point calculations are being used is in itself difficult when values that span wide ranges are involved. A single FP addition in the wrong place can make entire subexpressions disappear, for instance.

If you can force a language processor to execute expressions in the order the programmer writes them, then language doesn't particularly matter since algebraic expressions are very similar among all of them (other than, say, pure functional languages).

Algebraic expressions - the heart and soul of numerical computation. I'm not saying that languages don't matter over a wider domain, they do, but in this particular subset of the world differences aren't that great.

Floating point calculations are just plain hard to get reasonable accuracy out of in many cases. Fortunately, in many cases it is easy to ...

", compilers do and long have provided switches to turn off optimization and will guarantee execution order as written by the programmer other than following operator precedence rules (true in Python as well as FORTRAN). "

Not good enough.

We need the language spec to actually specify execution order for this to solve the remaining problems.

Scientific coders are grownups. (The Fast Fourier Transform was developed by hand, for instance.)

Execution order should not be left to the compiler in ordinary use cases in scientific/engineering computations. It should be explicit in the language.

If automatic refactoring tools have a non-obvious suggestion, they could offer modifications of the source code.

Performance is important in operational weather forecasting, and a few comparable applications I suppose. Some arguably antisocial:

http://www.davidbrin.com/transactionfee.html

It's a big deal in thousand year high resolution GCM runs, too, I suppose.

Most often, though, a twofold increase in executable performance is trivial in the throughput of actual science, and that cost would be won back easily with real repeatability once people understand its advantages.

MT:

"Not good enough.

We need the language spec to actually specify execution order for this to solve the remaining problems.

Scientific coders are grownups."

If scientific coders are grownups, then they should know to turn off troublesome optimizations, and this approach should be "good enough". You're suggesting they're not grown-up enough to do so.

I can see the development of a language standard that specifies various levels of guarantees of accuracy that compilers must support. I don't see as likely the development of a language and standard that strictly forbids optimization.

Like it or not, some CPU's implement fused multiply-add, which runs faster than a multiply followed by an add, and can be more accurate (one round instead of two), and helps in writing library functions ...
but of course is not bit-for-bit compatible with machines that lack the instruction or if compiled on same machine using separate multiply and add.
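A minimal sketch of the one-rounding-versus-two difference, in Python. Since a hardware fused multiply-add is not assumed to be available, `fma_emulated` is a hypothetical helper that computes the exact rational result with `fractions.Fraction` and then rounds once, which is what a fused instruction does for ordinary finite operands.

```python
from fractions import Fraction

def fma_emulated(x, y, z):
    """Compute x*y + z exactly, then round once (emulates a fused multiply-add)."""
    return float(Fraction(x) * Fraction(y) + Fraction(z))

x = 1.0 + 2.0 ** -30
p = x * x                          # separate multiply: rounded to 53 bits

separate = x * x - p               # multiply, round, subtract: exactly 0.0
fused = fma_emulated(x, x, -p)     # single rounding: exposes the product's error

print(separate)   # 0.0
print(fused)      # 2**-60, about 8.67e-19
```

The fused result is the more accurate one (it is the true rounding error of the product), but it is not the same number, so a run on a machine that uses the fused instruction will not match bit for bit a run on one that does not.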

JM:

"Like it or not, some CPU’s implement fused multiply-add, which runs faster than a multiply followed by an add, and can be more accurate (one round instead of two), and helps in writing library functions …
but of course is not bit-for-bit compatible with machines that lack the instruction or if compiled on same machine using separate multiply and add."

Modern compilers look for and exploit such instructions.

Useful for such things as Taylor series approximations of transcendentals, for instance (would normally be hand-written in assembly to take full advantage).

My exposure to the notion of sync'd multiply-add came with the old floating point systems auxiliary processor; a friend had a lot to do with formulating the relevant series and I was involved in helping scope out the code ...

"Modern compilers look for and exploit such instructions."

Yes. Starting no later than early 1990s for microprocessors that had it, and probably earlier. CPU architects and compiler folks enjoyed many spirited debates then.
Usually, better precision means slower, i.e., single to double to extended, but fused muladd has better precision and speed, but is *different*.

The point of all this is that it is *not* merely a language design and compiler issue, but the underlying hardware architecture matters. With IEEE floating point, and machines being 32 or 64-bit two's complement, life is much easier than with mixes of 32, 36, 48, 51, 60, 64-bit machines, including some that were ones-complement.

If multiply-accumulate does not yield the same result as multiply followed by accumulate (which I am not convinced of), then it is hard to believe it is standards compliant. If it is standards compliant then it is hard to believe the standard is not broken. If there's some subtle reason that this has to be different, and there are CPUs which I might want to use which don't support it, then I want a compiler mode where it isn't used.

Nobody is saying to not provide optimizations. What we are saying is that we want a language which can be run bit-for-bit stable **across platforms** and **across vendors**. Obviously this is even slower than unoptimized code which is repeatable in one place.

In most cases, within reason, I don't care. I am going to wait weeks for my result anyway. I'd rather wait six weeks for a result I believe in than three for one I doubt.

MT: fused multiply-add does not always produce the same result as multiply-then-add. It can be more precise (essentially because the intermediate result retains more mantissa bits), but it can also produce counter-intuitive results (the usual example is x*x - y*y, which should always be zero when x == y, but which with FMA can give a result which is zero or negative or positive, depending on the rounding direction of the first multiply). This could be OK - imagine a language in which one can specify such an operation explicitly, and therefore reason about the result in the knowledge of the operation's semantics - but such instructions are usually used in an ad-hoc manner by compilers when (a) the target platform provides them, (b) the intermediate representation can be tortured into an order which includes multiply-then-add, and (c) the optimiser can detect that.
And so we are left somewhat helpless, but it's a learned helplessness because it has always been thus with compilers of mainstream languages. To give a very trivial example, in the expression a+b+c, in Fortran (or in C, or many other languages), we don't know what order the additions will happen, and the compiler is even free to rearrange it as (a+c)+b (if, for instance, that makes life easier for the register allocator). Every bit of the result can depend on the order (consider for example the case when a == -b and both are much larger than c). If the language semantics guaranteed an operation order, then bitwise reproducibility would be possible across compilers (and across machines to the extent that they shared an arithmetic model). As it is, they don't, and so it isn't.
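The a+b+c case above can be checked directly. A minimal sketch in Python, which (unlike the Fortran and C situation described) evaluates `a + b + c` as `(a + b) + c`, so the rearrangement has to be written out by hand; the values are chosen to match the "a == -b and both much larger than c" case.

```python
a, b, c = 1e17, -1e17, 1.0

left_to_right = (a + b) + c    # a and b cancel first, so c survives
rearranged = (a + c) + b       # c is absorbed into a before the cancellation

print(left_to_right)   # 1.0
print(rearranged)      # 0.0
```

Both orders are legitimate readings of the same source expression if the language leaves the order open, and they differ in every significant bit.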

Re: #40. One easy test is to look at the Qflux model response to 2xCO2. Or the response to instant 4xCO2 in a coupled model. If there is a sci-repro problem, you'll see.

RE: following links. Yes, sounds like a good idea. I'll try to do better.

RE: #28
I live on the surface of a planet and I assert that climate is chaotic.

The amplitude of chaotic variations in climate doesn't seem to be large in the Holocene. Seems to have been larger at times in the past, see D-O events. Might be present now, see Bond events.

--------
#29

No but it makes statements about how often storms show up. The point is that divergence in climate models because of trivialities shows that the models have too much chaotic behavior. Most of the effects should dissipate.

[Still don't understand. In climate models, we expect the exact weather path to be chaotic. Just as it is in the same model considered as an NWP model. We don't expect that divergence to dissipate at all.

OTOH, in climate models we do expect the climate - as in the stats of the weather - to be stable. And it is.

I've no idea what you mean by "divergence in climate models because of trivialities shows that the models have too much chaotic behavior. Most of the effects should dissipate" -W]

Nick Barnes said: "Scientific reproducibility is simply the ability to reproduce the *scientific* results. Without scientific reproducibility, what you’re doing isn’t science any more and you shouldn’t get published (or paid), but that’s another rant. It *can* depend on bitwise reproducibility (but usually doesn’t)."

W responded: [Ah, interesting. So far, I haven't seen any clear examples where sci-repro depends on bit-repro. Can you provide one? -W]

I've seen a couple of examples where loss of bit repro (caused by a change of platform) produced different "science". In one, the cause was a bug. In the other the cause was unwisely conditioned IF tests - tests that depended on a real number being zero. On the new platform there were fewer zeroes!

I think these are rare, and both had minor impact on the science. The bigger impact was that it was difficult to test the impact of other minor changes (such as changes to optimisation/loop orderings) because the other minor changes triggered rapid divergence caused by the bugs and ill-conditioned if statements.
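A hypothetical illustration of that kind of IF test, sketched in Python (the function names and the tolerance are assumptions made up for this example, not taken from any model code):

```python
import math

# A quantity that is zero in exact arithmetic but not in floating point.
residual = (0.1 + 0.2) - 0.3

def branch_exact(x):
    """Fragile: the branch taken depends on the last bit of x."""
    return "zero" if x == 0.0 else "nonzero"

def branch_tolerant(x, tol=1e-12):
    """Less fragile: anything within tol of zero counts as zero (tol is an assumed value)."""
    return "zero" if math.isclose(x, 0.0, abs_tol=tol) else "nonzero"

print(residual, branch_exact(residual), branch_tolerant(residual))
```

If a platform change alters the last bit of such a residual, the exact test can switch branches, and the run then diverges for reasons unrelated to the change being tested.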

Additionally, climate scientists are worried about bit repro mainly for reproducibility. If the tape robot runs over the tape holding the data from years 2030-2035 of your 21st century simulation you want to be able to rerun just that section and have it match up perfectly with 2036. That's why they get nervous running their simulations on general purpose resources in computing centres as they do not trust the computing centre to either look after their data or to maintain a stable platform for long enough to complete the run and the subsequent data validation.
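The tape-robot scenario relies on a restart reproducing the original run bit for bit. A toy sketch in Python, with the logistic map standing in for the model and a hex string standing in for the restart file (both are illustrative assumptions, nothing like a real GCM):

```python
def step(x):
    """One step of a toy chaotic 'model' (logistic map), standing in for a timestep."""
    return 4.0 * x * (1.0 - x)

def run(x, nsteps):
    """Advance the state nsteps times and return the final state."""
    for _ in range(nsteps):
        x = step(x)
    return x

x0 = 0.3
straight_through = run(x0, 1000)

# Stop at step 400, "write a restart file" (here a hex string that
# preserves every bit of the double), then resume for the remaining 600.
checkpoint = run(x0, 400).hex()
restarted = run(float.fromhex(checkpoint), 600)

print(straight_through == restarted)
```

On one machine with one build this matches exactly; the worry in the comment is what happens when the platform changes between the two halves.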

Eli, I think you are changing the subject. I did that too, so who am I, in the words of the Pope, to criticize?

Anyway, the climates do not diverge. The instantaneous state (weather) diverges.

This in fact *defines* what climate modelers mean by climate - it is based on the concept of an ensemble of identical planets which of course is a fiction, but "physically realizable" in the sense of a gedanken-experiment.

Anyway, conceptually, the climate is the statistics of the ensemble of identical earths with epsilon-different initial conditions.

This claim requires that the statistics are well-behaved in the sense that the climates do NOT diverge into alternative states based on initial conditions, i.e. that the statistics of the system are "ergodic".

Weather is chaotic, unforced climate is stable. Our models are constructed this way and it is a useful way to think about the world.

We are agreed in this discussion, I think, that the climates of model instances do not diverge, i.e., that they are ergodic, given the SAME compilation of the same model on the SAME platform with the SAME compiler and sometimes even the SAME CPU decomposition. Without bit-for-bit reproducibility we cannot state for certain that we get the same climate varying any of the technical infrastructure, which has some practical implications.
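The weather-versus-climate distinction can be sketched with a toy system. The Python below uses the logistic map as a stand-in (an assumption for illustration only; it is not a climate model, and the step count and perturbation size are arbitrary choices):

```python
def step(x):
    """Logistic map at r = 4: a toy chaotic system."""
    return 4.0 * x * (1.0 - x)

def trajectory(x0, nsteps):
    """Return the list of states starting from x0."""
    xs = [x0]
    for _ in range(nsteps):
        xs.append(step(xs[-1]))
    return xs

nsteps = 10000
a = trajectory(0.3, nsteps)
b = trajectory(0.3 + 1e-12, nsteps)   # epsilon-different initial condition

# "Weather": the instantaneous states separate completely within a few dozen steps.
max_gap = max(abs(p - q) for p, q in zip(a[:200], b[:200]))

# "Climate": the long-run statistics of the two members stay close.
mean_a = sum(a) / len(a)
mean_b = sum(b) / len(b)

print(max_gap, mean_a, mean_b)
```

The two members disagree about the state at any given step but agree about the long-run mean, which is the sense of "ensemble statistics" used above.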

Back to your concern, it is not clear what "too much chaos" means. Basically, either we are sensitive to initial conditions or we aren't. Fluids are sensitive to initial conditions except in linear limits. Generally, there is chaos or there ain't. The instantaneous state (weather) is sensitive to initial conditions in climate models and most fluid models.

"Too much chaos" could mean the divergence happens too quickly, but I don't think that's what you are saying.

Should the actual climate statistics diverge, i.e., should the models be non-ergodic, this would indeed be problematic. I think the models are provably ergodic though. The long-term statistics do not change without a change in forcing.

I'm not sure this addresses your concern but I hope it does.

#54 "That’s why they get nervous running their simulations on general purpose resources in computing centres as they do not trust the computing centre to either look after their data or to maintain a stable platform for long enough to complete the run and the subsequent data validation."

I have a comment here from a knowledgeable guy who's followed this conversation second hand - it's a bit long and overlaps some previous comments. As he's been doing this sort of programming for a long time, I think it may help. He said it tersely at first, thought that might rub the wrong way, so here's his longer form comment:
______________________________

One problem here is one of establishing a standard result. If the code on which the standard is based is written by someone who does not know the (very substantial) difference between X*X-X and (X-1)*X then the standard results are probably not worth reproducing anyway. Same comment applies for code that uses the mathematical/infinite-precision version of quadratic root finding taught in high school versus more numerically stable flavors, and for a couple godzillion other examples. The point is not to slander people who do that, but merely to point out that insisting that the "standard" result be reproduced may be counterproductive.
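For the quadratic example, a minimal Python sketch of the two flavors (the function names and the test polynomial are assumptions chosen for illustration; the "stable" form shown is one common rearrangement, not necessarily the one the commenter had in mind):

```python
import math

def roots_textbook(a, b, c):
    """High-school formula; suffers cancellation when b*b >> 4*a*c."""
    d = math.sqrt(b * b - 4.0 * a * c)
    return (-b + d) / (2.0 * a), (-b - d) / (2.0 * a)

def roots_stable(a, b, c):
    """Avoids subtracting nearly equal numbers; assumes real, distinct roots."""
    d = math.sqrt(b * b - 4.0 * a * c)
    q = -0.5 * (b + math.copysign(d, b))
    return c / q, q / a

# x^2 + 1e8 x + 1 = 0 has roots near -1e-8 and -1e8.
small_textbook, _ = roots_textbook(1.0, 1e8, 1.0)
small_stable, _ = roots_stable(1.0, 1e8, 1.0)
print(small_textbook, small_stable)
```

The textbook form gets the small root wrong by a large relative margin, which is the commenter's point: reproducing that value bit for bit would be reproducing an error.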

Consider also the interaction between multiple standards, for example a language standard and a floating point hardware standard. For example, some computer language standards require that parentheses be honored, but do not require adherence to the structure of the source code. So a line of code may be written as:
A = B+C+D
Without parens, a compiler is permitted to evaluate that as:
A = (B+C)+D
A = B+(C+D)

The differing order of evaluation between those two expressions can (and probably will) lead to different results, both 100% standard conforming.

Once you parenthesize or otherwise organize your code so that there is only one possible order of operations under the language standard, there is the question of how operations should be carried out in the hardware standard. Consider these two lines of code:
A = B*C+D
A = (B*C)+D

Although there is some contention on this point, I think it is generally agreed that B*C+D permits a muladd instruction with only a single rounding at the end, but (B*C)+D requires two roundings. If the "standard" code was written without an understanding of the interplay between the language standard and the hardware implementation, possibly because the code was written before muladd became common or available at all, then there is no good way to determine what the standard result ought to be.
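The one-rounding versus two-roundings difference can be emulated without special hardware. The sketch below uses Python's exact rational arithmetic to imitate a fused multiply-add; the function names are made up for this example and this is an emulation, not a real muladd instruction:

```python
from fractions import Fraction

def mul_then_add(b, c, d):
    """Two roundings: the product is rounded to a double, then the sum is."""
    return (b * c) + d

def fused_mul_add(b, c, d):
    """Emulated fused multiply-add: exact rational arithmetic, one final rounding."""
    return float(Fraction(b) * Fraction(c) + Fraction(d))

x = 0.1
product = x * x                          # already rounded to a double
separate = mul_then_add(x, x, -product)  # rounds x*x again, then subtracts
fused = fused_mul_add(x, x, -product)    # exposes the rounding error of x*x

print(separate, fused)
```

The separate version gives exactly zero, while the emulated fused version returns the small rounding error of the multiply, so the two forms are not interchangeable bit for bit.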

Finally, it is hard or impossible to specify what a block of code in the "standard" is trying to accomplish. If I write a loop to add up a list of numbers, may I accumulate in a higher precision accumulator (e.g., the 80-bit floating point registers of x86) to serve the goal of accuracy and speed, or must I round each time to serve consistency?
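The accumulator question can be imitated one level down, with single precision playing the narrow format and double the wide one (an analogy to the 64-bit versus 80-bit case, assumed here for convenience; `to_f32` is a helper written for this sketch):

```python
import struct

def to_f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

values = [16777216.0, 1.0, 1.0]   # 2**24, where single precision runs out of integers

# Round after every addition (narrow accumulator).
narrow = 0.0
for v in values:
    narrow = to_f32(narrow + v)

# Accumulate in double and round once at the end (wide accumulator).
wide = to_f32(sum(values))

print(narrow, wide)
```

The wide accumulator gives the more accurate sum, the narrow one gives the result a strictly single-precision machine would produce, and they differ.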

So while I appreciate the willingness to wait longer or pay more for the right result, which presumably means a result that is closer to the standard result, the value of the standard as either an achievable or desirable goal seems not to reward that effort.

There. I said it. It was long and duplicates what people can readily find in a more lucid and authoritative form elsewhere, but I said it and now I'm going back to work.
______________________________
end comment ported in.

#57 misses the point altogether, and is a good illustration of how compiler writers subvert science.

My calculations are not stiff enough nor my model good enough that I care in the least whether I get

X *X - X
or
X * (X -1)

I mostly care that I don't have to work very hard to make sure I always get the same one every time! On every platform! Forever!

In particular the last thing I want is improvements in the math library. I'd much rather my cosine function, say, was rather slow and slightly wrong than that it suddenly and mysteriously gets faster and more accurate at a time not of my choosing.

JM:

"The point of all this is that it is *not* merely a language design and compiler issue, but the underlying hardware architecture matters. With IEEE floating point, and machines being 32 or 64-bit two’s complement, life is much easier..."

Yes, that's been noted a couple of times. Development and adoption of the IEEE standard was a huge step forward.

"than with mixes of 32, 36, 48, 51, 60, 64-bit machines, *including some that were ones-complement*."

I'm giving this sentence negative zero points :)

I was traumatized by ones-complement arithmetic when very young and idealistic ...

MT:

"In particular the last thing I want is improvements in the math library. I’d much rather my cosine function, say, was rather slow and slightly wrong than that it suddenly and mysteriously gets faster and more accurate at a time not of my choosing."

Not of your choosing being the key.

Not having you choose for everyone being another key that I'm not sure you recognize.

MT:

"#57 misses the point altogether, and is a good illustration of how compiler writers subvert science."

Compiler writers aren't building the computers they've got to generate code for. And #57 isn't undermining repeatability when code is run multiple times on the same platform.

Unless you are in a position to dictate that only one processor exists in the world, and to dictate the constraints that *you* think are relevant, it sounds like you're always going to be frustrated.

Again, the basic problem lies in our use of floating point representations and is inherent in that choice. There are other possible approaches to take, but they're deemed uneconomic as they're much slower and generally more space-consumptive than traditional floating point representations.

It's a compromise, like so many things in life, science and engineering. There will never be a "perfect" solution to the problem; the kind of issues brought up in #57 and by various of us elsewhere are going to exist.

Do you favor a standard that mandates, for instance, the precision of "e" or "pi" to be used in scientific calculations, with values either less or more precise forbidden by the standard?

That's sort of an analogue for what you seem to want.

#61 It's simple. We want to be able to rerun a calculation and get the same exact answer we got before, even if the machine we ran it on doesn't exist anymore. Ideally, forever. Because of the nature of climate simulations, essentially we can't tell if we have changed the model substantively if we can't get the same answer bitwise. "Close" simply does not occur in practice.

We are willing to pay a performance penalty for it at the initial run time. Whether we use such a feature in any particular instance should be an explicit tradeoff, not something utterly impractical.

Certainly we recognize that there are practical reasons to trade off this ability, but we want it recognized and we want it supported.

If FPU architectures stabilized that would not present us with any problem I can see.

MT:

"If FPU architectures stabilized that would not present us with any problem I can see."

Largely, it has, as JM and I have both said, along with probably others I'm forgetting at the moment, thanks to the IEEE standard on floating point representations and semantics.

But arguing that, in essence, technology should stand still forever just so existing programs will run forever in the future giving the same *inexact* and *inaccurate* answers as they do today is probably a non-starter. Just IMO. It's an interesting battle you're waging, though.

And, I guess, the near-universal adoption of the Intel architecture with SSE by computer manufacturers (AMD implements SSE as well).

I'm really having trouble making sense of the arguments in this thread. Technology is still changing. Anyone who demands exact bit-reproduction of some buggy-whip of a machine produced 30 years ago may have to emulate it. Said emulation may be 100 or 1000 or more times slower than code compiled for direct use on the same machine that's doing the emulation. (Analogous considerations are why Java is so often slower than watching paint dry.) It might become a question of delaying a three-week run by months or years, not just an extra three weeks.

Now, all sorts of approximations are made in numerical analysis that are bound to have much larger effects than some fiddly rounding error in an IEEE 754 compatible 64-bit FPU. For example, I can still readily find discussions of a value-by-value weighting scheme called "Simpson's Rule" (2/3, 4/3, 2/3, 4/3...) for numerical integration - but on the other hand, such considerations are normally ignored utterly in Digital Signal Processing systems. Almost everyone simply adds up the values - i.e. piecewise-linear integration - and the processors are optimized to do that very fast. (Knuth, I think it was, in his famous text, tried valiantly to show that Simpson's Rule was 'more accurate' than simply adding up the unweighted values, failed utterly, and was bemused and baffled by said failure.)

My SWAG is that the analogous considerations are likewise ignored utterly in weather, climate, and other three-dimensional models. Bringing them to bear would probably change the detailed trajectories far more than any non-nonsensical selection of rounding setting - as should simply changing the mesh size a little.

So yes, the obsession with bit repro does seem OK for debugging/porting convenience, most especially if nobody actually understands what the code is "really doing" - whatever that means - and is therefore relying on it to yield exactly the same valid answers and mistakes throughout eternity. And yes, checking bit-repro can certainly tell me, say, whether the new version of the compiler optimizes exactly the same way as the old. (Then again, why bother? I've already guessed "no", else why the new version?)

But then again, the obsession strikes me like saying that a chemistry result is not scientifically reproduced until and unless it's duplicated precisely down to the very last molecule. When was the last time *that* happened in the real world??

And to boot, FPUs and the like are made in multibillion dollar factories not controlled by scientists, who in any event lack the ability to stop time. Most of the chips are used in graphical displays and the like, and precious few purchasers give a stuff, or will ever need to, about bit-reproducibility. So I wish y'all good luck with embalming the buggy-whip and freezing it forever.

It might be more sensible and less frustrating to devise a more practical approach, and think more like a chemist.

After all, even with a weather model, if you're running it for enough steps so that a minor change in the settings, or a change in FPU brand, makes a difference in the outcome that's significant to you, you're running it too long for the precision, simple as that. Indeed, the existence of different settings and brands is not a problem but a golden opportunity to test the limits. While brand and setting still have no significant effect, you're still doing science, at least if the code is valid (and bit-reproducibility will reveal precisely nothing about that; a bad model will reproduce as well as a good one.) As they start to have significant effect, you're slipping over into numerology. Keep on stepping, and you'll soon be kidding yourself that you can forecast the weather three weeks from now, all the more so if you were ever able to contrive a world in which every brand and setting worked exactly the same way, so that every machine gave exactly the same nonsense.

Climate should be a far easier problem in these respects. It simply shouldn't matter (significantly) to the *climatology* whether, say, the notionally-same storm rains on Chicago with one FPU-and-settings, and on Peoria with another, and fails altogether to develop with a third. Any weather in the model can be expected to diverge fairly wildly given enough steps. However, if different brands of FPU produce materially different *climate* projections, you've slipped over into numerology. Indeed, one should be able to run the model "forever" and still go on getting an Earth-like climate. After all, if the real-world climate were as wildly unstable against minute changes to initial conditions and ultra-short-term noise as the obsession with bit-reproducibility seems to imply to yours truly as an outside observer, the perturbation from the next solar flare or even distant supernova might well kill us all - or more precisely, since there have been countless insults to the weather/climate system over Earth history, we simply wouldn't be here to talk about it.

In summary, it seems to me that the scientific results should be quite insensitive to the FPU brand and settings, or else they're simply rubbish.

Standard ML of New Jersey
http://www.smlnj.org/
seems to have what MT wants. All old versions remain available so nobody will suddenly change the support. SMLNJ is available on a wide variety of common platforms and several different operating systems. I've studied parts of the compiler, written in SMLNJ, and the organization is such that one will have little difficulty rewriting the back end for yet another platform.

SMLNJ comes equipped with a fine Basis Library
http://standardml.org/Basis/manpages.html
and an adequate Utility Library. MT might object to the rather primitive Array2 in the Basis Library, but for most of my work I have sparse two and three dimensional arrays so I don't use Array2 and instead work up some suitable arrangements based on hash tables and ordinary one dimensional arrays.

My colleague Carl Hauser also writes in Python. For larger programs, he opines that the static typing provided by SMLNJ gives a sense of security.

Regarding expression evaluation order, one would have to check, but as best as I could tell the SMLNJ compiler does not attempt the fancier (and misnamed) optimization techniques. Despite this, my SMLNJ codes run adequately fast with the slowest taking overnight, at most. But then, I'm not doing a big climate model.

What is SSE?

By coincidence, most of my compiler writing experience was for another Standard ML compiler (we initially bootstrapped using SML/NJ). This is one reason I can be a bit of a hard-ass about well-defined language semantics.

My company recently acquired the rights to that ML system, and open-sourced it. It suffered some bit-rot and needs some work to get it running again, as soon as I get a round tuit:

http://www.ravenbrook.com/project/mlworks/

#62 MT:

"Because of the nature of climate simulations, essentially we can’t tell if we have changed the model substantively if we can’t get the same answer bitwise."

Not sure this makes sense. If we change the model, we change answers (and we have to re-validate the model). That applies whether the change is due to a bug fix or a platform change.

Re-running the same model to get the same result makes no sense though. If you can demonstrate, say, that a move from platform A to platform B gives the same spread of results as you previously had when rerunning the set of experiments that defined your validation you should be happy. And further, if you also get the same spread of results if you tweak your method of validation (using a different seed to randomise your initial conditions in your ensemble spread, for example) you should be happier still.

I wonder whether the more strict desire for bit comparison may have related to the day when results depended on a single simulation (due to the lack of computing resource). Presumably in such a scenario the risk of your result being a "fluke" was higher and your reputation would be at more risk if your result was later disproved and if later re-runs of your model (on a different platform) did not give the same results.

Another point that came to my attention: The refactoring argument is actually wrong. In a world with limited precision floating point accuracy, almost any non-trivial code rearrangement will lead to the actual computation being different. Consequently, rounding errors will be different, and the result will not be bit-wise reproducible.

Effectively, bitwise reproducibility thus forbids you to ever substantially modify (improve) your code, which is a very harsh restriction.

Of course, you could go for arbitrary precision FP maths, but the performance penalty is such that you will not be able to do bleeding edge science any more.
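The refactoring point can be seen with the X*X - X versus (X-1)*X pair mentioned earlier in the thread. A minimal Python sketch, with a value of x picked so the arithmetic can be followed by hand (the choice of x is an assumption for illustration):

```python
x = 1.0 + 2.0**-30

original = x * x - x        # x*x is rounded before the subtraction
refactored = (x - 1.0) * x  # x - 1.0 is exact, and here the product is too

print(original.hex(), refactored.hex())
print(original == refactored)
```

The two expressions are algebraically identical, yet the results differ in the low-order bits, so a refactoring of this kind cannot pass a bit-for-bit comparison.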

Steve Milesworthy is making the same point as Eli. If the range of results is the same (e.g. the climate modelled) bit wise reproducibility is a feature, not a bug because it indicates that the simulation is stable to butterflies, which is damn well better bee.

[!NO! - you're completely wrong. Bit-wise repro tells you nothing about stability-to-butterflies. Bit-wise-repro NWP models are still chaotic -W]

Wrote that backwards. Sorry. Untangled

Steve Milesworthy is making the same point as Eli. If the range of results is the same (e.g. the climate modelled) bit wise reproducibility is a bug, not a feature because without it you get an indication that the simulation is stable to butterflies, which is damn well better bee.

[I'm assuming by "stable to butterflies" you mean non-chaotic. That's wrong, definitely. You might possibly mean "doesn't fall onto the floor in a heap" or "doesn't generate a different climate". Both of those would be true, but trivial -W]

Chuckle. MT, the fellow whose comment I posted above does want to be helpful. I've been trying to steer him toward climate work for years (so did people at NCAR); he started a company instead, not his first one either. He does well these days. But he's not working on the most interesting thing in the world (and I think you climate scientists are, and he may be coming around).

He greatly softened the language in his comment for public posting. You might get more out of his very brief, blunt, hard-argument draft. He'd likely be willing to say more to you directly if you're willing to listen; say so and I'll ask.

I note that the way programmers seem to listen to each other is by bumping heads. He's serious that the problem you're asking about is deep enough and subtle enough that it's not easy to explain.

I think he hasn't missed your point at all, and that this is one of those pointing-at-the-moon situations.

Like Stoat, in this "MT versus the world" exchange he's from the world. He might have what you need, to the extent reality allows it.

Just sayin'. More popcorn ...

If it were me against the world I'd drop it. I have other fish to fry.

But here I have Nick on my side making very cogent points so I am not feeling like I'm alone, anyway. Recall also this all started when I saw the Recomputation Manifesto, which does not come out of the climate community at all, but out of the hard core computational science community.

http://insidehpc.com/2013/07/15/good-science-is-repeatable-the-recomput…

We have at least five important use cases identified:

1) Someone wants to extend an old ensemble in some way, and wants to ensure that they are running the same executable

2) Part of a long run has corrupted data, and one wants to restore it

3) Someone wants to reproduce work exactly in order to verify that they understand what they are critiquing (the McIntyre/Liljegren use case, though one doubts they will take the trouble in a real HPC case)

4) Refactoring which is not intended to change the calculations.

5) Running an ensemble on a utility platform, and the infrastructure changes out from under you.

Of these, #4 is the practical use case that will occur most often, but #3 is the crucial one and you never know when something like that will matter.

I admit that refactoring the actual computations will of course ruin b4b.

CCSM lists versions as "answer-changing" and "non-answer-changing". A pretty childish use of the word "answer" it always struck me, but the right thing to do.

In practice, lots of times you think you are doing nothing to the "answer" and you are. This is easy to check on a machine you've already run on.

To do a complete validation on some other platform is a very expensive proposition compared to just running a single case for five clock minutes. The fact that this is possible doesn't mean it's practical and certainly doesn't mean it's done in practice by the hundreds of people running community models on university facilities.

And there's the fact that we have to hack the source slightly to do our particular experiments to complicate matters...

"In summary, it seems to me that the scientific results should be quite insensitive to the FPU brand and settings, or else they’re simply rubbish."

Yes, they should be. But proving it without bit reproducibility is very difficult. And to be honest, I'm not sure whether they are and I suspect they aren't.

And it may not be the code that is rubbish. We depend on a huge codebase in a horribly baroque language with a dozen competing vendors, each with a small user base compared to ordinary commercial products.

The compilers themselves can easily be at fault. Or they may interpret the language spec differently.

http://www.fortranstatement.com/cgi-bin/petition.pl

If climate and weather are not stable to butterflies about all anyone can do is pop another beer and pray. If climate models are not stable to butterflies then they are not modeling climate.

MT: I can't disagree with many of your use cases (3 is weak, 5 is important). But 1,2 and 4 are reduced in importance if the code/data is well-managed and experiments are carefully planned (to avoid system changes or to be less sensitive to them).

Question is whether the use cases have a higher priority than, say, the requirement to increase model complexity and resolution. Resources at my employer are vastly over-committed as it is in trying to reach these goals.

On the other hand, climate science's share of the supercomputer market seems small and the requirement for bit-reproducibility seems less in other areas of science. I saw a bunch of PRACE (EU supercomputing consortium) guys' jaws hit the floor when the requirement (for bit reprod on a rerun, or change of compute resources on the same platform) was explained to them. We are also being told that such an ability will be "impossible" in the exascale world where you will apparently need to build your model to survive regular individual processor failures mid-run.

(enjoy your popcorn, Hank!)

Steve, I am not excited (from the point of view of climate) by the exascale, by the way.

It's hard to argue otherwise than that scale makes reproducibility harder. I do not agree that the climate world has much to gain from scale, other than the capacity to run larger ensembles. But I admit that will not stop them.

Resolving a larger number of scales may offer some insights in creating parameterizations for smaller models, (and so don't really need b4b) but the real science will be done on very large ensembles of resolution not beyond what is ordinary now. At least I hope so, but nobody is asking me...

It is likely that a cloud-resolving GCM will be built, but those clouds will be coarse and have microphysics parameterizations (droplet size spectra). That just moves the problem downscale. I see no reason to believe that making the models even balkier in this way will achieve anything of importance.

The institutional pressures are always to scale these models up. I suspect that we have passed diminishing returns and are into negative returns here in the US. The Brits seem to have a better grip on things as observed from a distance but I imagine the pressures are similar.

I am afraid I am going to end up an exascale GCM skeptic. The more so if single-platform bitwise reproducibility is abandoned.

I have heard the argument made that if exascale platforms are going to be same-platform bitwise unreliable that all calculations be performed in triplicate; along with the voting mechanism this essentially quadruples the cost of the calculation.

But the developers can't imagine how to proceed in its absence. I can't imagine this is specific to climate, though it may be specific to messy complex systems.

@dhogaza (#37): Directly from the horse's mouth:
http://software.intel.com/sites/default/files/article/164389/fp-consist…
See the "Second example from WRF".

Vladimir, many thanks! It is interesting that this is so obscure, and that in practice I had to figure out the parts immediately relevant to my work (I was indeed using Intel Fortran) by experiment. A link to this should pop up the first time you invoke ifort at the command line!

There are two ways that William tends to minimize concerns: one is to say the thing is trivial or known for twenty years.

The second is to say that it's not a scientific question, or to acknowledge it might be important and then proceed to not discuss it at all. It's annoying.

The funny thing about all this is that the most important issues are invariably the trivial ones, the ones that have been known for twenty years and the ones that are "non-scientific". It's utterly ridiculous not to focus on these.

[Ah, c'mon big boy, don't go all coy on us: provide examples -W]
