Friday Fun: Epic failures: 11 infamous software bugs

Having started my working life as a software developer, I know a bit about epic bugs. Let's just say I've had my share and leave it at that. At the very least, I can say I never caused any vehicles to crash or any companies to fail.

So, from ComputerWorld, Epic failures: 11 infamous software bugs.

Instead, this story is about outright programming errors that caused key failures in their own right.

Have I missed anything important? Consider this a call for nominations for the biggest bugs of all time. These are my suggestions; if you have any honorable mentions, bring 'em on. The worst anyone can do is swat them.

The list includes:

  • The Mars Climate Orbiter doesn't orbit
  • Mariner 1's five-minute flight
  • Forty seconds of Ariane-5
  • Pentium chips fail math
  • Call waiting ... and waiting ... and waiting
  • Windows Genuine Disadvantage
  • Patriot missile mistiming
  • Therac-25 Medical Accelerator disaster
  • Multidata Systems/Cobalt-60 overdoses
  • Osprey aircraft crash
  • End-of-the-world bugs

Here's one with details:

Pentium chips fail math

In 1994, an entire line of CPUs by market leader Intel simply couldn't do their math. The Pentium floating-point flaw ensured that no matter what software you used, your results stood a chance of being inaccurate past the eighth decimal point. The problem lay in a faulty math coprocessor, also known as a floating-point unit. The result was a small possibility of tiny errors in hardcore calculations, but it was a costly PR debacle for Intel.

How did the first generation of Pentiums go wrong? Intel's laudable idea was to triple the execution speed of floating-point calculations by ditching the previous-generation 486 processor's clunky shift-and-subtract algorithm and substituting a lookup-table approach in the Pentium. So far, so smart. The lookup table consisted of 1,066 table entries, downloaded into the programmable logic array of the chip. But only 1,061 entries made it onto the first-generation Pentiums; five got lost on the way.

When the floating-point unit accessed any of the empty cells, it would get a zero response instead of the real answer. A zero response from one cell didn't actually return an answer of zero: A few obscure calculations returned slight errors typically around the tenth decimal digit, so the error passed by quality control and into production.

What did that mean for the lay user? Not much. With this kind of bug, there's a 1-in-360 billion chance that miscalculations could reach as high as the fourth decimal place. More likely, with odds of 1-to-9 billion against, was that any errors would happen in the 9th or 10th decimal digit.


The problem with the Pentium bug wasn't the bug itself (whatever CPU your current computer uses, it's sure to contain more than a few bugs too). The problem was that Intel tried to sweep it under the carpet, with arguments about the 1-in-10-billion odds and the like. That backfired rather spectacularly, and they had to offer replacement CPUs to anyone who could show the bug had a real effect.

And it did have a real effect. If one in 10 billion calculations was faulty on average, the CPU was running at 1 GHz, and a floating-point calculation happened, say, every ten clock ticks, you'd have a faulty result every minute and a half. If you'd used Pentium chips to build a parallel high-performance computer (exactly the kind of machine that does lots of floating-point calculations and likes to use the latest CPUs) with, say, a thousand nodes, then you'd have ten flawed calculations per second.
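To put rough numbers on that back-of-the-envelope estimate, here is a small sketch using the same hypothetical figures (a 1 GHz clock, one floating-point calculation per ten ticks, a 1-in-10-billion fault rate); none of these are measured values:

    # Hypothetical rates from the paragraph above, not measurements.
    clock_hz = 1e9                   # assumed 1 GHz clock
    fp_ops_per_sec = clock_hz / 10   # one FP calculation every ten clock ticks
    fault_prob = 1 / 1e10            # assumed 1-in-10-billion chance per calculation

    faults_per_sec = fp_ops_per_sec * fault_prob
    print(1 / faults_per_sec)        # ~100 seconds between faulty results on one CPU

    nodes = 1000                     # a hypothetical thousand-node parallel machine
    print(faults_per_sec * nodes)    # ~10 faulty results per second across the cluster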

Most CPUs have bugs. Consequential ones are patched at the OS level or through firmware loaded at runtime (check the detailed boot output if you run Linux sometime). Intel could probably have done that with this flaw as well if they'd simply acknowledged the fault and issued a workaround. But they didn't, and it blew up in their face.

And if your one-part-in-whatever error gets into a numerically sensitive algorithm, you can end up with garbage at the other end. Computer arithmetic is meant to be correct within a precise definition of correctness: usually that means the resulting bit pattern is the bit pattern of the number closest to the exact answer, i.e. given known inputs a and b, there is only one "correct" result c. So it seems pretty bad that they missed it at verification time. I think what they were doing was reciprocal approximation: use the firmware table lookup for a crude guess, then use a couple of Newton-like iterations to get up to full precision. A bad guess means you won't be fully converged when the result comes out the other end.
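For what it's worth, here is a minimal sketch of that kind of reciprocal refinement: a Newton-style iteration x <- x * (2 - b*x) that converges quickly from a decent first guess but goes nowhere if the initial lookup hands back zero. This only illustrates the general technique the comment speculates about, not Intel's actual divider.

    # Illustrative only: Newton-style refinement of a reciprocal guess.
    def newton_reciprocal(b, seed, iterations=3):
        """Refine an initial guess for 1/b via x <- x * (2 - b*x)."""
        x = seed
        for _ in range(iterations):
            x = x * (2.0 - b * x)
        return x

    b = 7.0
    print(newton_reciprocal(b, 0.14))  # decent seed: converges toward 0.142857...
    print(newton_reciprocal(b, 0.0))   # a zeroed table entry as seed: stuck at 0.0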

I do believe they came up with a software patch to avoid the couple of bad table entries. One moral is that it helps to know how something works when you test it; they should have devised a testing regime that verifies every table entry is correct.
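A sketch of that moral: if a unit depends on a lookup table, check every entry against an independent reference rather than sampling random inputs. The table and reference function here are made up purely for illustration.

    # Hypothetical example: exhaustively verify a lookup table against a reference.
    def verify_table(table, reference_fn):
        """Return the indices of entries that disagree with the reference model."""
        return [i for i, entry in enumerate(table) if entry != reference_fn(i)]

    reference = lambda i: 2 * i + 1          # stand-in for the "correct" entry values
    table = [reference(i) for i in range(1066)]
    for i in (100, 250, 512, 900, 1050):     # simulate five entries lost in transfer
        table[i] = 0
    print(verify_table(table, reference))    # -> [100, 250, 512, 900, 1050]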

But was it a software bug? (Did it come from software, or should we classify it as a hardware bug?)

In any case, I still meet people from time to time who are distrustful of Intel CPUs. Even though Intel seems to have learned its lesson, bad memories live on. I think their initial effort to pretend it wasn't a big deal that needed fixing was where they really went wrong.

By Omega Centauri (not verified) on 21 Sep 2010 #permalink