Friday Fun: Epic failures: 11 infamous software bugs

Having started my working life as a software developer, I know a bit about epic bugs. Let's just say I've had my share and leave it at that. At the very least, I can say I never caused any vehicles to crash or any companies to fail.

So, from ComputerWorld, Epic failures: 11 infamous software bugs.

Instead, this story is about outright programming errors that caused key failures in their own right.

Have I missed anything important? Consider this a call for nominations for the biggest bugs of all time. These are my suggestions; if you have any honorable mentions, bring 'em on. The worst anyone can do is swat them.

The list includes:

  • The Mars Climate Orbiter doesn't orbit
  • Mariner 1's five-minute flight
  • Forty seconds of Ariane-5
  • Pentium chips fail math
  • Call waiting ... and waiting ... and waiting
  • Windows Genuine Disadvantage
  • Patriot missile mistiming
  • Therac-25 Medical Accelerator disaster
  • Multidata Systems/Cobalt-60 overdoses
  • Osprey aircraft crash
  • End-of-the-world bugs

Here's one with details:

Pentium chips fail math

In 1994, an entire line of CPUs by market leader Intel simply couldn't do their math. The Pentium floating-point flaw ensured that no matter what software you used, your results stood a chance of being inaccurate past the eighth decimal point. The problem lay in a faulty math coprocessor, also known as a floating-point unit. The result was a small possibility of tiny errors in hardcore calculations, but it was a costly PR debacle for Intel.

How did the first generation of Pentiums go wrong? Intel's laudable idea was to triple the execution speed of floating-point calculations by ditching the previous-generation 486 processor's clunky shift-and-subtract algorithm and substituting a lookup-table approach in the Pentium. So far, so smart. The lookup table consisted of 1,066 table entries, downloaded into the programmable logic array of the chip. But only 1,061 entries made it onto the first-generation Pentiums; five got lost on the way.

When the floating-point unit accessed any of the empty cells, it would get a zero response instead of the real answer. A zero response from one cell didn't actually return an answer of zero: A few obscure calculations returned slight errors typically around the tenth decimal digit, so the error passed by quality control and into production.

What did that mean for the lay user? Not much. With this kind of bug, there's a 1-in-360 billion chance that miscalculations could reach as high as the fourth decimal place. More likely, with odds of 1-to-9 billion against, was that any errors would happen in the 9th or 10th decimal digit.


The problem with the Pentium bug wasn't the bug itself (whatever CPU your current computer uses, it's sure to contain more than a few bugs too). The problem was that Intel tried to sweep it under the carpet, with arguments about the 1-in-10-billion odds and the like. That backfired rather spectacularly, and they had to offer replacement CPUs to anyone who could show the bug had a real effect.

And it did have a real effect. If one in 10 billion calculations was faulty on average, the CPU was running at 1 GHz, and a floating-point calculation happened, say, every ten clock ticks, you'd have a faulty result every minute and a half. If you'd used Pentium chips to build a parallel high-performance computer (exactly the kind of machine that does lots of floating-point calculations and likes to use the latest CPUs) with, say, a thousand nodes, then you'd have ten flawed calculations per second.
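To put rough numbers on that back-of-the-envelope estimate, here is a small sketch using the same hypothetical figures (a 1 GHz clock, one floating-point calculation per ten ticks, a 1-in-10-billion fault rate); none of these are measured values:

    # Hypothetical rates from the paragraph above, not measurements.
    clock_hz = 1e9                   # assumed 1 GHz clock
    fp_ops_per_sec = clock_hz / 10   # one FP calculation every ten clock ticks
    fault_prob = 1 / 1e10            # assumed 1-in-10-billion chance per calculation

    faults_per_sec = fp_ops_per_sec * fault_prob
    print(1 / faults_per_sec)        # ~100 seconds between faulty results on one CPU

    nodes = 1000                     # a hypothetical thousand-node parallel machine
    print(faults_per_sec * nodes)    # ~10 faulty results per second across the cluster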

Most CPUs have bugs. Consequential ones are patched at the OS level or through firmware loaded at runtime (check the detailed boot output if you run Linux sometime). Intel could probably have done that with this flaw as well if they'd simply acknowledged the fault and issued a workaround. But they didn't, and it blew up in their face.

And if your one-part-in-whatever error gets into a numerically sensitive algorithm, you can end up with garbage at the other end. Computer arithmetic is meant to be correct within a precise definition of correctness: usually that means the resulting bit pattern is the bit pattern of the number closest to the exact answer, i.e. given known inputs a and b, there is only one "correct" result c. So it seems pretty bad that they missed it at verification time. I think what they were doing was reciprocal approximation: use the firmware table lookup for a crude guess, then use a couple of Newton-like iterations to get up to full precision. A bad guess means you won't be fully converged when the result comes out the other end.
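For what it's worth, here is a minimal sketch of that kind of reciprocal refinement: a Newton-style iteration x <- x * (2 - b*x) that converges quickly from a decent first guess but goes nowhere if the initial lookup hands back zero. This only illustrates the general technique the comment speculates about, not Intel's actual divider.

    # Illustrative only: Newton-style refinement of a reciprocal guess.
    def newton_reciprocal(b, seed, iterations=3):
        """Refine an initial guess for 1/b via x <- x * (2 - b*x)."""
        x = seed
        for _ in range(iterations):
            x = x * (2.0 - b * x)
        return x

    b = 7.0
    print(newton_reciprocal(b, 0.14))  # decent seed: converges toward 0.142857...
    print(newton_reciprocal(b, 0.0))   # a zeroed table entry as seed: stuck at 0.0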

I do believe they came up with a software patch to avoid the couple of bad table entries. One moral is that it helps to know how something works when you test it; they should have devised a testing regime that verifies every table entry is correct.
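A sketch of that moral: if a unit depends on a lookup table, check every entry against an independent reference rather than sampling random inputs. The table and reference function here are made up purely for illustration.

    # Hypothetical example: exhaustively verify a lookup table against a reference.
    def verify_table(table, reference_fn):
        """Return the indices of entries that disagree with the reference model."""
        return [i for i, entry in enumerate(table) if entry != reference_fn(i)]

    reference = lambda i: 2 * i + 1          # stand-in for the "correct" entry values
    table = [reference(i) for i in range(1066)]
    for i in (100, 250, 512, 900, 1050):     # simulate five entries lost in transfer
        table[i] = 0
    print(verify_table(table, reference))    # -> [100, 250, 512, 900, 1050]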

But was it a software bug? (Did it come from software, or should we classify it as a hardware bug?)

In any case, I still meet people from time to time who are distrustful of Intel CPUs. Even though Intel seems to have learned its lesson, bad memories live on. I think their initial effort to pretend it wasn't a big deal that needed fixing was where they really went wrong.

By Omega Centauri (not verified) on 21 Sep 2010 #permalink