18. Divide and Conquer:

The Pentium Bug

The Question:

The four arithmetic operators – addition, subtraction, multiplication and division – are familiar to grade school pupils. It’s just about as simple as it gets. Division requires a bit more effort but is also simple.

Correct?

The Paradox:

Not, if you divide by zero. If you do, things tend to explode. Nor if you had a computer in the mid 1990s, running a Pentium chip and you divided a certain number by itself. You would have got an erroneous result.

Background:

On September 21, 1997, the USS Yorktown, a missile cruiser of the US Navy, lost control of its propulsion system and drifted dead in the water for about two hours and 45 minutes. Eventually, it had to be towed to the base at Norfolk, Va., though the US Navy denied that the vessel had to be towed.

* * *

In the summer of 1994, Thomas Nicely, a math professor at Lynchburg College in Virginia, noticed something strange while doing research on prime numbers. (These are numbers that can only be divided by one or by themselves.) He was investigating the so-called ‘twin primes’, pairs of consecutive odd numbers that are both prime, like 5 and 7 or 41 and 43.

When dividing a certain prime number by itself, the computer did not return 1.0, as it should have, but something slightly smaller. Now, we know that digital computers need to round numbers or truncate them (see the chapters on ////). Nevertheless, the result should have been correct to nineteen significant digits, i.e, it should have been something like 0.9999999999999999999275... Instead he got 0.999999996274709702. The difference may seem very small but is actually huge. Moreover, the error did not occur only for this number, but for a whole range of numbers.

Dénouement:

The incident involving the USS Yorktown occurred because of what is commonly called a ‘human error’. The administrator of the missile cruiser’s Standard Monitoring Control System entered ‘zero’ into a data field. Unfortunately, further on in the computer program, something had to be divided by the number in that data field. Since nothing can be divided by zero, computers, if left to their own devices, try to circumvent the problem by posting an immensely large number as a result. In fact, the result was so large that it overflowed into adjacent areas in the computer’s memory, setting off a cascade of system failures throughout the vessel's electronic control systems.

* * *

The number that so offended Nicely was 824633702441. It is the smaller member of one of the rare pairs of twin primes. Namely, the next higher odd number, 824633702443, is also prime. Nicely tried to compute the inverse of these twin primes.

Dividing it by itself, the computer did not return 1.0, or something only a teeny-weeny bit smaller or larger, no, the result was 0.999999996274709702…. The result should have had eighteen or nineteen 9s before trailing off. The fact that the digits wandered away after only eight 9s indicated that something was seriously wrong. By the standards then in force, the error was ten billion times larger than what was acceptable.

The bug had nothing to do with the fact that the offending number is prime. In fact, Nicely found that his personal computer returned erroneous results for most of the 32 numbers between 824,633,702,418 and 824,633,702,449. Moreover, when this interval was multiplied or divided by 2, 4, 8, 16, …, the numbers contained in these intervals always produced errors when divided by themselves. This was definitely a bug in the processor of his personal computer.

The math professor reported the bug to Intel Corporation, the makers of the Pentium processor. At first, Intel was dismissive about Nicely’s finding, calling it not a bug but a ‘flaw’. Anyway, “there is a probability that 1 in 9 billion randomly fed divide or remainder instructions will produce inaccurate results,” was their response. IBM, makers of a competing processor, put this flaw’s incidence at a much higher level: one every ////.

Technical supplement:

The Yorktown incident was a classical ‘Gigo’ scenario (garbage in, garbage out). By entering the number zero into a certain lookup table, the table was set for disaster. More modern operating systems prevent something like that from occurring. They do not allow division by zero. In fact, good software performs range checks before performing a division in order to ascertain that all divisors are nonzero.

The Pentium Bug was caused by the omission of five entries in a table of 1,066 values (part of the chip's circuitry) used by a speedy algorithm known as SRT division, after the initials of the three researchers who independently introduced the scheme more than 30 years ago. The five cells should have contained the constant +2, but because the cells were empty, the processor fetched a zero. This could throw off a calculation, which would result in a number that was always slightly less precise than the correct answer.

With a goal to boosting the execution of floating-point scalar code by 3 times and vector code by 5 times, compared to the 486DX chip, Intel decided to use the SRT algorithm that can generate two quotient bits per clock cycle, while the traditional 486 shift-and-subtract algorithm was generating only one quotient bit per cycle. This SRT algorithm uses a lookup table to calculate the intermidiate quotients necessary for floating-point division. Intel's lookup table consists of 1066 table entries, of which, due to a programming error, five were not downloaded into the programmable logic array (PLA). When any of these five cells is accessed by the floating point unit (FPU), it (the FPU) fetches zero instead of +2, which was supposed to be contained in the "missing" cells. This throws off the calculation and results in a less precise number than the correct answer (Byte Magazine, March 1995).

Comments, corrections, observations: