A bit more stealth

OK, the Knight et al. paper is here, thanks folks. Clearly they have had some jolly fun dividing the runs up into trees, but the paper is a disappointment to me, as it doesn't really deal with the main issue, which is the physical plausibility of some of the runs. It *does* say "Our findings reinforce the fact that variation of parameters within plausible bounds may have a substantial systematic effect...", but that rather slides over the fact that varying a parameter within a plausible range is *not* the same thing as producing a model with a viable climate. As I reported ages ago, and I'm sure it's been said elsewhere, there is good evidence that low values of entrainment are physically implausible, and those low values are what give the high CS values, above 9 K. Knight et al. say: "Consistent with this, the highest predicted CSs (>9K) are all for low entcoef runs, associated with high rhcrit and ct and low vf1 (Fig 2, supporting Table 3), a combination indicative of reduced cloud formation." But what they don't do is mention the Palmer stuff (maybe it's not published?). As I understand it, the tests for a viable model simulation applied to the cp.net runs are very weak, and they seem determined not to improve them.


Sylvia has written:

"The main purpose of the paper was for it to be the default reference when people say 'but don't you get different results when you run a big complicated model on a PC to the ones you would get on a supercomputer' - which we still get asked on a suprisingly frequent basis, particularly in grant applications. The paper definatively shows that model parameters matter a lot more than hardware or software. "

If that is the purpose and you are looking for something quite different then that is likely to lead to disappointment.

Have you got any complaints about what is in the paper rather than what you would like it to cover?

[As I said, other than disappointment, no -W]

There is a bit of discussion here:

http://www.climateprediction.net/board/viewtopic.php?t=7325

What do you make of my complaint that they are not excluding bit-identical output files? Including them reduces the variation they get for hardware/software. Not considering this helps them reach their conclusion that the variation 'may be treated as equivalent to that caused by changes in initial conditions'.

[Not sure. The HW/SW variation is very small, as I recall, so increasing it - doubling say - wouldn't matter much? -W]
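
To put some toy numbers on that complaint (entirely made up, not the paper's data): if most matched hardware/software pairs return bit-identical output, averaging over all pairs dilutes the estimate of how big the differences are when they do occur.

```python
import statistics

# Hypothetical pairwise differences in some diagnostic between matched runs
# on different hardware/software; zeros stand for bit-identical pairs.
pairwise_diffs = [0.0] * 70 + [0.09, 0.11, 0.12, 0.14, 0.15] * 6

mean_all = statistics.mean(pairwise_diffs)
mean_nonzero = statistics.mean([d for d in pairwise_diffs if d > 0])

print(f"mean over all pairs:           {mean_all:.3f}")
print(f"mean over non-identical pairs: {mean_nonzero:.3f}")
# Reporting the second number, plus the fraction of pairs that differ at all,
# is more reusable than an average diluted by bit-identical pairs.
```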

You say "but that rather slides over the fact that varying a parameter within a plausible range is *not* the same thing as producing a model with a viable climate."

Agreed. Very different questions; the one slid over is much the more important one.

This is exactly what I meant on the globalchange list when I said:

"The Roe paper is fundamentally incorrect. It relies on a presumption that parameterization errors are independent. That isn't even close to true."

You put it better though.

Parameters are jointly tuned. Changing one without changing others to compensate leads to systems which don't model climate well, and whose sensitivities therefore don't matter in the least.

We are finding (Charles Jackson at U of Texas with modest assistance from myself) that objective joint tuning of parameters can improve model fidelity and reduce the uncertainty of sensitivity. Our efforts distinctly pull the CCSM sensitivity closer to 3 C.

This "fact that variation of parameters within plausible bounds may have a substantial systematic effect" is inconsequential to the extent that it is correct, and misleading to the extent that it appears consequential. It's very unfortunate and misleading.

[I think we may be agreeing on something :-) What this is pointing to, I believe, is a fundamental flaw in cp.net: they have prioritised getting lots of runs over assessing those runs for plausibility -W]

How does one rule out the implausible without ruling out the unanticipated, as happened with the automated ozone monitors for some years?

[There is something of an urban myth in there, although some truth. Yes, the satellites threw out low values, but those had been noticed and were in the process of being investigated.

In the case of cp.net, there are lots of easy things to do to check the plausibility of the climate - check that the seasonal cycle is OK, for example. And you *do* want to rule out the unanticipated - I'm talking about checking the control climate of these runs, not the CS -W]

Serious question, I know it's not easy.

I assume that a climate model running on a computer isn't going to discover something like ozone loss --- that's right, I'm fairly sure. Can we assume the more extreme results can't possibly happen given that you all know the programming inputs, and can be sure some of the model results aren't realistic?

But then, how _will_ we find the next such condition before it bites us too badly?

By Hank Roberts (not verified) on 04 Nov 2007 #permalink
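
For concreteness, here is a minimal sketch of the sort of "easy" plausibility check mentioned in the reply above - comparing a run's climatological seasonal cycle against a reference. The function, the tolerances and the assumption that twelve monthly means are available per run are all illustrative, not anything cp.net actually does:

```python
import numpy as np

def seasonal_cycle_ok(run_monthly_tas, ref_monthly_tas,
                      amp_tol=0.5, phase_tol_months=1):
    """Crude check that a run's climatological seasonal cycle looks sane.

    run_monthly_tas : 12 climatological monthly-mean temperatures from the run
    ref_monthly_tas : 12 monthly means from a reference (obs or a standard run)
    amp_tol         : allowed fractional error in seasonal amplitude (made up)
    phase_tol_months: allowed shift of the warmest month, in months (made up)
    """
    run = np.asarray(run_monthly_tas, dtype=float)
    ref = np.asarray(ref_monthly_tas, dtype=float)

    # Amplitude of the cycle: warmest minus coldest month.
    amp_run = run.max() - run.min()
    amp_ref = ref.max() - ref.min()
    amplitude_ok = abs(amp_run - amp_ref) <= amp_tol * amp_ref

    # Timing of the cycle: the warmest month shouldn't wander too far
    # (measured cyclically around the year).
    shift = abs(int(run.argmax()) - int(ref.argmax()))
    phase_ok = min(shift, 12 - shift) <= phase_tol_months

    return amplitude_ok and phase_ok
```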

That helps me understand, thanks William.

Is it possible to address plausibility checks incrementally -- to start taking out runs that fail the easy checks in a way that would incrementally improve the overall result for sure?

Doing the "easy things to do to check plausibility" would you then lose any run that failed such a check? I'd think you'd remove more runs in the long tail than in the low end.

Is there an argument for doing this kind of weeding of the results, statistically?

I'd like to see CP sit down with your list, or something like a public science draft paper built on it, and work with it. Not throwing away their complete results, but doing an agreed-upon weeding for plausible outcomes providing an, er, alternate scenario.

Are the other climate models already doing some kind of plausibility weeding process, so there might be a generally agreed list of easy things to do to check plausibility?

I realize the apparent risk, from the outsider's point of view, is that what gets thrown out is a judgment call, like trying to pick out the sour cherries after the harvest, or picking bad climate stations from photographs. I'm hoping there's some way that CP would agree on to decide what ought to be weeded out for such an alternate scenario set.

[You could fairly easily just keep a database of what checks had passed or failed. Checking the seasonal cycle is not hard - if you've got the data. I don't know what cp.net sent back. Even if they did this, the tests would be far weaker than those that conventional GCM control climates (e.g. HadCM3 standard) have to pass (with the obvious exception of IAP-FGOALS, of course, which is awful) -W]

By Hank Roberts (not verified) on 05 Nov 2007 #permalink
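
And a rough sketch of the "database of what checks had passed or failed" idea from the reply above - again purely illustrative, with made-up table and check names:

```python
import sqlite3

# One row per (run, check); nothing is ever deleted, and any "weeded"
# subset of runs is just a query over the recorded results.
conn = sqlite3.connect("cpdn_checks.db")
conn.execute("""CREATE TABLE IF NOT EXISTS checks (
                    run_id TEXT, check_name TEXT, passed INTEGER,
                    PRIMARY KEY (run_id, check_name))""")

def record(run_id, check_name, passed):
    conn.execute("INSERT OR REPLACE INTO checks VALUES (?, ?, ?)",
                 (run_id, check_name, int(passed)))
    conn.commit()

def runs_passing_all(check_names):
    placeholders = ",".join("?" * len(check_names))
    rows = conn.execute(
        f"""SELECT run_id FROM checks
            WHERE check_name IN ({placeholders})
            GROUP BY run_id HAVING SUM(passed) = ?""",
        (*check_names, len(check_names)))
    return [r[0] for r in rows]

# e.g. record("run_00042", "seasonal_cycle", True)
#      keep = runs_passing_all(["seasonal_cycle", "sea_ice_extent"])
```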

>[Not sure. The HW/SW variation is very small, as I recall, so increasing it - doubling say - wouldn't matter much? -W]

Sorry for the delay in replying.

Yes, I agree. It didn't take you long to say this, so why does it take the authors so long to reply? (Well, there were a few other bits to my queries.)

In part I want to object: if you are trying to write a "default reference when people say 'but don't you get different results when you run a big complicated model on a PC to the ones you would get on a supercomputer'", then it should be done sensibly, and why publish a long list of numbers that other people might want to use or refer to, but which was produced with a silly methodology? Wouldn't it be better to publish sensible numbers? No-one is going to use the same mix of computers again, so it is more sensible to provide information on the size of the differences when differences do occur.

Having said this, perhaps I am obsessing over details that don't really matter much - as you say, doubling wouldn't matter much to the main thrust of what is being said.

However perhaps part of what I am thinking is based on a false or untested assumption and maybe you could help.

Suppose you run one initial-condition ensemble with tiny differences, plus one ensemble where, in addition to the tiny initial-condition differences, there is a further tiny perturbation to a random cell in a random direction once a month. Also two more ensembles: one with the random perturbations once per day, and the other with such a perturbation once per timestep.

Would you expect the differences between members of the ensemble to be similar in size for each of these four ensembles, or would you expect larger differences the more frequently the perturbations are done?

[Any tiny perturbation will completely change the weather, but should leave the climate unchanged, unless things are odd. So a tiny perturbation once a day or timestep or month (or delivered via slightly different hardware) should lead to the same climate -W]
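
As a toy analogue of that reply (Lorenz-63 with crude Euler stepping, not a GCM; all numbers illustrative): tiny kicks, however often they are applied, completely change the trajectory - the "weather" - while leaving the long-run statistics - the "climate" - essentially alone:

```python
import numpy as np

def lorenz_x(n_steps, dt=0.01, kick_every=None, eps=1e-10, seed=0):
    """Euler-step the Lorenz-63 system, optionally adding a tiny random kick
    to x every `kick_every` steps (None = an unperturbed run)."""
    rng = np.random.default_rng(seed)
    sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0
    x, y, z = 1.0, 1.0, 1.0
    xs = np.empty(n_steps)
    for i in range(n_steps):
        dx, dy, dz = sigma * (y - x), x * (rho - z) - y, x * y - beta * z
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        if kick_every and i % kick_every == 0:
            x += eps * rng.standard_normal()
        xs[i] = x
    return xs

base = lorenz_x(100_000)
kicked = lorenz_x(100_000, kick_every=100)  # a tiny kick every 100 steps

# "Weather": the two trajectories end up in completely different places...
print("final-state difference:", abs(base[-1] - kicked[-1]))
# ..."climate": long-run statistics are essentially the same.
print("mean/std, no kicks:  ", base[20_000:].mean(), base[20_000:].std())
print("mean/std, with kicks:", kicked[20_000:].mean(), kicked[20_000:].std())
```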

>"[Any tiny perturbation will completely change the weather, but should leave the climate unchanged, unless things are odd. So a tiny perturbation once a day or timestep or month (or delivered via slightly different hardware) should lead to the same climate -W]"

Thank you for the answer, but...

Isn't that a contradiction of what the paper found? You seem to be saying a random tiny perturbation should lead to the same climate. So presumably the variations with different hardware should be expected to be the same as those between ic ensemble members.

As the paper found some increases in the variations with different hardware, does this imply that the differences you get with different maths libraries might be that they systematically round in a certain direction, leading to a slightly different climate (only enough to, say, double the differences between ic members - I think the paper showed something like a 70% increase, but I think this is biased towards being too low)?

Or do you see something else causing differences between hardware, other than the maths library in use? (There may also be differences in crash frequency, but the crashes that do not lead to full runs will not be compared - I don't really expect any differences from crash recovery, but it is possible.)

[One random tiny perturbation per timestep certainly should. Every single calculation might just possibly be different. As I recall, the hardware/software diffs were tiny, so maybe they aren't significant? It's hard to tell, and would probably take a lot of effort to sort out properly -W]
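
On the maths-library point, a tiny demonstration (a toy, nothing to do with the model code itself) of why bit-level differences between compilers, libraries and hardware are unsurprising: floating-point addition is not associative, so merely summing the same numbers in a different order shifts the last few bits:

```python
import random

random.seed(1)
values = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

# The same numbers, summed in two different orders:
forward = sum(values)
backward = sum(reversed(values))

print(forward)
print(backward)
print("difference:", forward - backward)  # nonzero, down in the last few bits
```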

I don't know if you are sticking to your "determined not to improve them" despite my and James's comments.

Anyway I thought you (and James) might be interested in this post:

http://www.climateprediction.net/board/viewtopic.php?p=71930#71930

"Earlier this year ROSALIND WEST completed her fourth year MPhys project in which she used CPDN data to investigate using the seasonal cycle in temperature to constrain climate sensitivity. She found quite different results for the slab and the coupled experiments and is in the process of writing this up for publication in the peer-reviewed literature."

Presumably this is part of what you want to see.

[Using the seasonal cycle is a good idea, though it's not quite clear from that brief extract what has been done. What fraction of runs does this throw out, for example? But your quote only makes it obvious how low a priority this has, and I think I've already said why I think that was -W]

There is also

"Third year DPhil student HELENE MURI is investigating how well CPDN models fare in simulating paleoclimates. She is looking particularly at the mid-Holocene period - 6000 years before present. Her work is important in helping us decide which models could most reliably be used for climate prediction. Her models will be tested against a variety of paleo-observations; as part of this she will use an offline vegetation model to determine which models produce a realistic shift of the Arctic treeline. She is very close to being able to distribute her slab experiments so look out for those!"

If you want to express an opinion that it is taking a lot of time, fair enough; I don't know how long these things should take. However, "determined not to improve them" seems to imply bad faith. Do you really feel you can justify an accusation of bad faith?

[Checking the seasonal cycle should be fast and easy. This isn't rocket science -W]

I find "is a fundamental flaw in cp.net: they have prioritised getting lots of runs over assessing those runs for plausibility" flabergasting as to use of 'fundamental'. It seems obvious to me that you need runs before you can analyse them.

[And you need a meaningful metric of acceptability before you can perform the analysis -W]

I also totally disagree. They had two computer people, Carl and Tolu, while setting up the system. I think this expected too much of them. The burden should be less now, and they have Tolu and Milo. I think it is clear they should have put more resources into the computing side early on and then switched the balance.