Amateurish Supercomputing Codes?

Via mt I find

too much of our scientific code base lacks solid numerical software engineering foundations. That potential weakness puts the correctness and performance of code at risk when major renovation of the code is required, such as the disruptive effect of multicore nodes, or very large degrees of parallelism on upcoming supercomputers [1]

The only code I knew even vaguely well was HadCM3. It wasn't amateurish, though it was written largely by "software amateurs". In the present state of the world, this is inevitable and bad (I'm sure I've said this before). However, the quote above is wrong: the numerical analysis foundations of the code were OK, as far as I could tell. It was the software engineering that was lacking. From my new perspective this is painfully obvious.

[Update: thanks to Eli for pointing to http://www.cs.toronto.edu/~sme/papers/2008/Easterbrook-Johns-2008.pdf. While interesting, it does contain some glaring errors (to my eye) which I'll comment on -W]

I agree that the numerics aren't really an issue in GCMs, largely because the numerics don't dominate the other practical issues, and largely because we aren't interested in particular solutions but in ensembles.

What software engineering techniques do you think Hadley should import? Which should they eschew? Which, if any, are they already doing well?

By Michael Tobis (not verified) on 18 Jun 2010 #permalink

I have heard software engineering described as being a craft like basket weaving. Take x number of programmers and you will get x different batches of code. Software engineering is still in its infancy.

The advanced programming languages that would help us get beyond Fortran and enable true parallel processing are still in their early years. The skills required to use them properly are not widely understood, and it takes a lot of effort to grasp the subtleties needed to take full advantage of them.

That is not the scientists' fault; that is just the state of the technology. The problem is solvable: just throw money at it, and hope you get genuine practitioners of the science in to help you. There are more charlatans in programming land than you could shake a stick at. By sticking to conventional Fortran and vector processing, they have at least got something that works, gets the job done, and is reasonably well understood by anyone who has to work on it.

The modern technologies are still developing, but should be explored. And I'm not talking about Java and Object Oriented programming. The two have put programming technology back years, and tend to produce products that remind me of the American cars of the 1960's.

I'm not sure how much we're allowed to talk about the proprietary MO & Hadley models -- but in my experience (porting various Hadley models on Linux & Mac & Windoze) I'd have to agree with stoat -- numerically sound but "engineeringly" a bit frightening. Just navigating the 500 huge files (the first 3000 lines being comments & variable declarations, followed by a few more thousand lines of code) was pretty horrifying. Having to dive into the D1 superarrays etc to get diagnostics was frightening and often ulcer-inducing. And discovering all the "interesting" climates you could get just by compiler options, bit options etc. I basically did a Monte Carlo experiment on a standard model with all sorts of compiler options and could get widely different "climates" out of the same params.
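(If the compiler-option sensitivity sounds surprising: floating-point addition isn't associative, and aggressive optimisation or vectorisation is free to re-order long sums, so the low-order bits move -- and a chaotic model amplifies them. A minimal, purely illustrative Python sketch of the effect; the data and the 8-way split are invented for the example, nothing here is UM code:)

import numpy as np

# Toy illustration of why compiler options can change model output:
# floating-point addition is not associative, so any change in summation
# order (vectorisation, different optimisation levels, fused instructions)
# perturbs the low-order bits of a long sum.
rng = np.random.default_rng(42)
x = rng.standard_normal(10_000_000) * 1e-3 + 1.0   # values of mixed magnitude

one_pass = x.sum()                                              # one summation order
eight_way = sum(chunk.sum() for chunk in np.array_split(x, 8))  # another order

print(f"one pass : {one_pass:.17g}")
print(f"eight-way: {eight_way:.17g}")
print(f"difference: {one_pass - eight_way:.3g}")
# The difference is tiny, but a chaotic model run for model-decades amplifies
# a last-bit perturbation into visibly different weather, which is why
# bit-for-bit reproducibility across compiler options is so hard to get.

(Whether the *climate statistics* converge again across such runs is a separate question -- mt's point about ensembles above.)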

I've been out of the loop on things for about a year (since last I looked at hadcm3 & hadgem1) but the new subversion-ish system the MO is now using seems to be a big improvement on the bizarrely antiquated "mods" from before. In an ideal world I would have loved to have been able to merge back the many mods I made in C & Fortran for cross-platform portability, but there was neither the time nor the money for that.

Oh yeah, the worst I found in the Had stuff was all the various ancillary scripts & executables you needed to run just to get things off the ground & running, i.e. set 100 environment variables, process ancil files etc. It seemed that could be streamlined a bit. I haven't seen anything analogous in my 20 years of commercial IT work; maybe if I'd started on legacy systems from 1972 or something! ;-)

I'm afraid I haven't looked at any other GCMs other than quick looks at CCSM (US) & ECHAM5 (Germany) so I can't really compare it to anything else.

So how are these systems tested? Are there regression tests for each of the component parts? It seems odd that compiler options would produce different outcomes.

By Nicolas Nierenberg (not verified) on 18 Jun 2010 #permalink

I'm 100% with William here. Earth scientists who pick up programming on their own (including myself) generally write numerically-sound but poorly-engineered code. From recent conversations with some of my old professors, it looks like there is something of a push in a couple of communities at least to get more software engineers deeply embedded in science problems, in the hope that the scientist-sort-of-programmers and software-engineers-learning-about-the-science will be able to produce good things. (As a shameless advertisement, http://csdms.colorado.edu, though I'm sure you climate folks are well beyond where we are.)

Thanks for the link to the PDF, Eli!

[I'm sure I've said this before but: the main problem with getting (competent) software engineers involved is that you can't afford them. Scientists are basically slave labour programmers :-( -W]

By Andy Wickert (not verified) on 18 Jun 2010 #permalink

Eli,

That's an interesting paper. Systems may be *very* different at some other institutes; in fact it's likely that the UKMO is an outlier due to the tight link with operational work, which most climate research institutes do not have.

Architectures can be over-engineered too. Some of the most knowledgeable software engineers become so intent on using all the design patterns they know to produce the "perfect" generic solution that they end up with something overly complicated, which ironically makes maintenance just as hard as if the code had been written poorly.

>you can't afford them

Another goal for Team Stoat! I was appalled when looking at career opportunities as a software geek in the sciences. Low salaries, no real career path since the HPC/computing lab jobs are all full of PhD scientists (often clueless about writing software or day-to-day operations), etc.

Here I expand on what I wrote @ OIIFTG, especially in light of Cambridge.

I suggest a quick read of Languages, Levels, Libraries and Longevity, especially the first section on software 5,000 years from now. Looking backwards there, see the comments on Doug McIlroy (~1968, on components) or David Wheeler (~1947). (Wheeler was the Cambridge guy usually credited with inventing subroutines; he died just a few years ago. Fortunately, I got to meet him beforehand. Later my wife and I took his widow out to a pub for lunch near Cambridge.)

People have been complaining about this for the mere 40+ years I've been involved in computing, and I'm sure they were complaining about it before that.

We *cannot* turn all scientists into numerical analysts, statisticians and good software engineers. At Bell Labs, we had relatively vast resources (25,000 people in R&D), in all phases of R&D (R2-D2 @ Dot Earth).
*We* didn't try. We certainly expected different behavior from researchers than from switching-system software people in 200-person projects. One of my bosses taught the standard software engineering project management course for mid-level managers. I used to lecture in it. A clear message was to size the methodology and procedures to fit the project.

Some of what we did do is the sort of thing mentioned in the first URL. We always had some specialist numerical software folks writing general-use routines for our computer centers. BUT, beyond that:

1) Too many scientists and engineers were writing their own statistics codes.
BAD.
Maybe we should have insisted on training everybody.
DOESN'T HAPPEN.
BETTER IDEA:
John Chambers did S, ancestor of R.

2) In the early 1970s, lots of people built their own OS's for every new computer type, toolkits, and machinery for managing their software, and it was a mess, as some was done by "amateurs" (in software engineering, anyway).
BAD IDEA.
Maybe we should train them to be better OS people and serious software engineers.
RARELY HAPPENS.
BETTER IDEA: port UNIX to many kinds of hardware.
Build the Programmer's Workbench (PWB) tools common to many programming efforts, and make sure the minicomputers running it could connect to the target machines. Move a lot of that common work off {IBM, Unisys, XDS, etc} onto a place with a good, common toolset. Beef up the underlying UNIX to make it robust enough for multiple groups to share computers safely.

3) In the early 1970s, IBM was pushing Chief Programmer Teams, including Program Librarians to handle versions and such. Big projects (there and elsewhere) had substantial staffs to do that, all with homegrown mechanisms that mixed manual procedures (like documentation standards for how to do rebuilds) with project-specific automation.

Research scientists and smaller software projects did not do that.
Let's have them do that.
BAD IDEA, NEVER PRACTICAL
BETTER IDEA:
SCCS (long-ago ancestor of most modern source control systems, developed by Rochkind & Glasser, between my office and the one next door.)
make
shell programming (to automate procedures, and one of the long ago genetic ancestors of many current scripting languages) (me)

4) Building better software is a function of (training+methodology) + tools/libraries with better leverage. For 60 years, the latter has worked better than the former in terms of improving productivity, maintainability and quality. That is not to ignore the former: it is important for making up for tool gaps, and some kinds of projects had better have savvy professionals, or else. But my favorite example is:
In the old days, people wrote one-off programs in FORTRAN or Basic for simple calculations. Much of this was awful code. Even worse was APL code, marvelously terse, elegant ... and write-only.
IDEA: train everyone to be better programmers so they can write decent code for such things.
BETTER IDEA: buy Excel.

5) But the *real* problem, at heart, is that of too many people *writing* too much code, rather than finding good code, reusing it, and sometimes contributing to the pool. The toolset for this is a *lot* better than it used to be, but for history, see Small is Beautiful (1977). The conference organizer asked me to resurrect that talk 25 years later, saying "There's a lot of young guys who really should hear it." I still had the *original* overhead foils, a novelty to many. We scanned the images and sent them over to USENIX for the website. The sysadmin sent me email asking if I could send him the PowerPoint. Sigh. One of the old guys in the audience leapt up at the end and said "I heard the original talk back then and &%&%& we haven't improved one bit." I said I somewhat disagreed - computers were much faster in 2002, the toolsets were better ... but the human issues were indeed still the same as they had been since software got going.

By John Mashey (not verified) on 19 Jun 2010 #permalink

hm. The publicly available GISS model E code is better documented and easier to read than most of the code I have worked with as a professional software developer. The anecdotes I hear about HadCM3 and other climate models sound no worse than the anecdotes I hear about the code bases of certain professional software products. Indeed, if the less pejorative half of what I have heard about the Windows and MS Office codebases from former and current MS employees is true, the HadCM3 codebase sounds better. But of course that's an apples to soap comparison based on hearsay.

It would certainly seem strange if scientists wrote code with better organization and clarity than professional software developers, but all we really have are anecdotes. So we don't know.

Finally, most of the last 20 years or so of software development advancements have been focused on object-oriented techniques. In my experience, object-oriented techniques are a big help with user interfaces, a modest help with business logic and a few other things, but a gross nightmare in dealing with most numerical or mathematical code. Climate models, in other words, rely heavily on the techniques that recent advancements in software development have ignored.
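(To make the "nightmare for numerical code" point concrete, here is a contrived little Python comparison -- the class and field names are invented purely for illustration. The object-per-grid-point style that textbook OO pushes you towards scatters the data and forces element-at-a-time loops; the flat-array style that numerical code actually wants is shorter and vastly faster:)

import numpy as np

# The object-oriented temptation: model each grid point as an object.
class GridPoint:
    def __init__(self, temperature):
        self.temperature = temperature

    def warmed(self, dt):
        return GridPoint(self.temperature + dt)

n = 1_000_000
temps = np.random.default_rng(0).uniform(250.0, 300.0, n)

# OO version: a million tiny objects and a Python-level loop over them.
points = [GridPoint(t) for t in temps]
warmed_oo = [p.warmed(0.1).temperature for p in points]

# Array version: one line over contiguous memory; same answer, far faster,
# and much closer to how Fortran model code is actually written.
warmed_arr = temps + 0.1

assert np.allclose(warmed_oo, warmed_arr)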

Er, I should note my analysis of the publicly available GISS model E code was quite cursory - a few days about a year ago, and a few more days some time before that.

The publicly available GISS model E code is better documented and easier to read than most of the code I have worked with as a professional software developer.

As FORTRAN goes, it's not bad at all. And they're working on improving it. They use source management tools. The physics is all documented in published papers, etc.

I agree entirely with what you say about the industry's obsession with object-oriented programming.

Part of the problem lies with the insane cost structure of government labs, at least in the U.S. When visiting a lab that offered me a position I noticed that there were lots of chiefs and few Indians. My hosts explained that overhead rates and benefits work out so that a mid-level programmer or systems admin costs nearly as much as an experienced Ph.D. scientist. So, they had lots of scientists and few programmers.

[And the problem is that you need to pay *more* than an experienced scientist gets -W]

By Joey Dzhugashvili (not verified) on 19 Jun 2010 #permalink

What they did not tell you is that the chiefs are supported by armies of consultant Indians in body shops, which often are universities who suck on the overhead. The ratio at a number of NASA installations Eli is familiar with is 2 or 3 to 1, consultants to civil service. Since much of the government cost is fringe benefits and retirement, and consultants don't have any tenure rights, this, in the short run, is less expensive.

As a software engineer at a climate modeling center, I must say I agree with most all of what has been said. I used to fiddle with climate model code, but not so much these days. Where I work, there are more scientists than SEs, and we SEs are all oversubscribed. I could easily split my job in two and work full-time at both. We have made huge progress on the SE aspects of our model - going from old CRAY 'update' to CVS and now SVN, with reasonably exhaustive testing after every code change. The model itself is still incredibly complicated to use, however, and I believe that's an irreducible problem. At least we abandoned the idea of a model GUI ages ago!
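(To make "reasonably exhaustive testing after every code change" concrete: the core of such a harness is just comparing the fields a candidate build produces against a stored reference run, within an agreed tolerance. A minimal Python sketch of that idea follows -- the file format, field names and tolerances are invented for the example and are not any centre's actual test suite:)

import sys
import numpy as np

# Minimal regression check: compare the output fields of a modified model
# build against a stored reference run.  A real harness would read the
# model's own output files (GRIB, netCDF, ...); .npz is used here only to
# keep the sketch self-contained.

FIELDS = ["surface_temperature", "mslp", "precip"]   # hypothetical field names
RTOL, ATOL = 1e-12, 0.0                              # near-bit-level tolerance

def compare_runs(reference_path, candidate_path):
    ref = np.load(reference_path)
    cand = np.load(candidate_path)
    all_ok = True
    for name in FIELDS:
        diff = np.max(np.abs(ref[name] - cand[name]))
        ok = np.allclose(ref[name], cand[name], rtol=RTOL, atol=ATOL)
        print(f"{name:20s} {'OK  ' if ok else 'FAIL'} max abs diff = {diff:.3e}")
        all_ok = all_ok and ok
    return all_ok

if __name__ == "__main__":
    reference, candidate = sys.argv[1], sys.argv[2]
    sys.exit(0 if compare_runs(reference, candidate) else 1)

(Whether the bar is bit-for-bit agreement or a looser statistical check is a policy choice in its own right, for exactly the compiler-option reasons discussed above.)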
The comments regarding the costs of SEs vs. scientists are also true - and scientists do the sexy stuff, so get rewarded; more than once I've heard "Oh, that's just coding" or "You're just a programmer", thereby rationalizing the second-class status of SEs. The software isn't where the career traction is, the science is where it is.

By Derecho64 (not verified) on 20 Jun 2010 #permalink

This is *not* just an issue of government funding and overhead structures.

It's irrelevant to go into the details, but I would claim that the 1970s/1980s Bell Labs:

1) Had a bunch of very good research scientists, probably more than any other industrial R&D lab.

2) Had some of the best software engineers and software engineering management around, and some of the best toolsets.

3) And had some of the best overall R&D management.
[Of course, there were still screwups, big ones.]

4) And best of all, we had monopoly money!
So, of course, every scientist must have had enough software engineering support to do everything well.
WRONG.

Each piece of an organization has some mission, and generally the focus is on the key specialists there, with enough others added as needed, but at lower priority. Put another way, career-wise, you don't want to be a software person in a mostly-hardware organization or vice-versa.

In our case, we had plenty more good software engineers than good research scientists, so why weren't they over there helping?

A: Because scientists could get their work done without great SE. Projects involving 10, 20, 100, 200 people, with multi-year schedules, impacts of $Ms, $100Ms or more, and maintenance/enhancement required for *decades*, could *not* do well without good SE ... so of course, that's where the people were.

In big organizations, there *never* were enough good people to do all the things you want to, well enough. Bell Labs had an awesome recruiting machine. They didn't send recruiters to campuses for job fairs or such. Each campus at which they recruited had a long-term team, mostly managers, often led by a Director. (Translation: someone who managed 4-6 Department Heads, each of whom managed 4-5 Supervisors, say 100-150 people in total, and probably had 15-20 years of working experience.) They'd be on campus a few times per year and got to know the faculty. They'd have figured out a year or two before graduation who they wanted to approach. Then, if you were a grad student, a BTL Director might wander into your office, introduce themselves, and explain why you might consider working there. Sometimes you'd make a 1-2-day visit and interview at (in my case) half a dozen different internal labs, who would in effect *bid* on the candidate if they liked them.
Since one needed a relevant Masters to be a Member of Technical Staff (MTS), they'd offer the following deal to undergraduates they wanted:

1) After you graduate, work over the summer at the hiring lab.
2) Then we send you to {good school} to get your MS. We take care of expenses, you get half-pay. [Which means a cushy life, compared to most assistantships.]
3) Then you come back. [No legal requirement to stay, but most people did.]
We did this because we simply could *not* hire all the MS/PhD students we wanted. It was a *great* deal.

As a hint of what this produced, I once was in a group that spent a month in a standard program to rub BTL engineers' noses into the realities in the field. We had 11 BTL MTS and 1 AT&T guy (they let them in every once in a while).
We'd visited a site that had a bunch of people (almost all women) doing a job that we all thought was horribly boring [a hyper-repetitive call center]; the job was being phased out and the remaining staff transferred elsewhere.

Although the Bell System was pretty good about retraining and trying to find jobs for people, they were pretty unhappy, and we couldn't understand why, because it was *so* boring. The AT&T guy suggested that the job was local, safe, they worked with their friends, and the women could trade off schedules to make it fit the rest of their lives, which were more important to them anyway.

The rest of us kept saying "But it is SO boring."
AT&T guy: So, how typical are you?
US: we're pretty typical, we're like the rest of the folks we work with.
AT&T guy: Sure, and how typical is *that*?
Let's see hands. Which of you were high-school valedictorians?
US: 11 of 11 hands went up. Oops. We shut up.

WE STILL COULDN'T GET AS MANY GOOD PEOPLE AS WE WANTED.

Anyway, science research organizations are simply not structured to incorporate as much SE as one might want. I haven't *ever* seen one anywhere that did. If some science research effort produces good software that actually wants to be turned into a *product* in its own right, it ends up moving to a product development organization (inside a company, or via spinoff), at which point SE gets serious.

I've looked at GISS code and procedures as well (a little), and thought it looked pretty good. I've actually seen quite a few chunks of code from different places that were pretty good, even though originally produced by scientists and engineers for their own use. Each kind of activity needs its own right mix of scientists and SEs.

I say all this because software practices *must* continue to improve, but people have to understand how things really work. It's not just an issue of gov't labs and climate models being some odd special case that needs fixing...

By John Mashey (not verified) on 20 Jun 2010 #permalink

Well, the window of opportunity for good software engineering practices & scientist/supercomputing/modelling codes may have passed anyway. In my recent commercial forays it seems that everyone is looking for a tedious "cookie cutter" design pattern solution. Perhaps because that makes it easier to offshore/outsource the code and have it worked on (cheaply) in India.

In the old days writing good software was more elegant and dare I say "artistic" than what happens nowadays in the more BS realm of software engineering (i.e. "agile SCRUM eXtreme Programmed design patterns"). A lot of this stuff is just glorified reusability which, from what I've seen, obfuscates and complicates things and makes what should be basic code run slower (in the zeal to conform to "design patterns" etc). Maybe I'm just an old codger (at 45 and still in IT, I'm basically Methuselah) - but the current trends in the industry are far from the elegant & fast code I was used to in the 90's.

So it may all be moot, i.e. if the MO paid for a brand new model from the ground up, built with object-oriented software engineering techniques, it might be far slower and buggier, with a much larger (& harder to follow) codebase than what they have now with UM version 7.1.

Maybe some software houses do the traditional engineering steps of
prototype
pilot plant
production version
but I don't know where that might be done these days.

Instead I see pair programming as cutting down on the number of thrown-away modules or just substantial rework after QA reports back. The other aspect of agile programming which seems to help is the frequent interactions with the eventual users, partly because requirements change during the development time. Don't know whether this last aspect is relevant to climate modeling, tho'. But IMO pair programming might well produce cleaner climate code.

By David B. Benson (not verified) on 20 Jun 2010 #permalink

I'll add my anecdotes here (one is in the other thread). When I worked in the IT department of UKMO I was easily the lowest paid of my friends from my IT degree.

We used standard IT "engineering" techniques in that I joined an SSADM-driven project that was rebuilding the main database for the organisation - MIDAS. We looked at an in-house offering from NOAA, but ended up going for CA-IDMS (I think it's now Oracle). There was little amateurish about the whole thing - it was not HPC though.

I indirectly know (of) someone who worked on the GM who stands out in his department at a large IT company. I don't know why he left the UKMO, but I'd put a tenner on him earning considerably more there than at UKMO.

Some of the code I've seen at many commercial sites is far, far worse than anything I saw whilst working at UKMO.

Nothing new there, but backs up some points above.

numerically sound but "engineeringly" a bit frightening. Just navigating the 500 huge files (the first 3000 lines being comments & variable declarations, followed by a few more thousand lines of code) was pretty horrifying.

You should see some of the crap produced by "professional" software engineers. At least you have comments.

It might be helpful for scientists to adopt some of the same metrics and tools used for commercial software. Then our codes could be as stable and reliable as commercial products, for example Microsoft Vista.

By Joey Dzhugashvili (not verified) on 22 Jun 2010 #permalink

Glancing quickly at the list of recent posts, I misread the title of this one as

Amish Supercomputing Codes

That brief moment of misunderstanding made me laugh, on a morning when I was otherwise burned out and exhausted with overwork.

[Horse powered? I'm not sure they would be allowed Steampunk -W]

re: #23
I rather doubt any Amish have supercomputers, but then very few Americans have supercomputers in their houses. (I do know one who has a Cray C-90 (although not powered), and another who used to have Convex, Alliant, etc in his basement, and actually used them. Well, OK, those are only minisupers.)

However, being from Pennsylvania, and having driven or biked around Lancaster, PA I observe that Amish rules vary *widely*. Some do use electricity and computers, not just craft oak computer desks.
See Wikipedia.

By John Mashey (not verified) on 23 Jun 2010 #permalink

Thanks, John. Yes, I formerly lived in a region in the Midwest US with a large Amish population. I would still be surprised to learn that more than a handful of individuals use computers.

I was just amused by the thought of going past one of their driveways and seeing a sign like "Supercomputing Codes (no Sunday sale)" in place of "maple syrup" or "honey" or whatever.

Or maybe this could be the start of the screenplay for the next Hollywood thriller -- part James Bond, part Harrison Ford in "Witness". A breakaway Amish sect has built a secret underground supercomputing facility from which they've been manipulating the global financial markets in an effort to bring about economic collapse and thus forcibly return the entire western world to pre-industrial conditions ...

Er, okay, maybe not.