18. To Err is Human, to Really Foul Up Requires a Computer:

IEEE 754

The Question:

Computers cannot err. Software algorithms are tested, validated and vetted before they are let loose. Thus everybody may rely on them, be they users of personal computers, laptops or smartphones like you and me, or airplane pilots, medical doctors, or operators who control the nuclear button.

Correct?

The Paradox:

In general, yes. On rare occasions, no!

Computers only do as they are told. They perform mathematical computations, mostly just by adding, subtracting and comparing bits of zeros and ones. So what could go wrong? Plenty…sometimes with disastrous consequences.

Background:

On Friday afternoon, November 25, 1983, the Vancouver Stock Exchange Index closed at 524.811 points. At the start of trading on the following Monday morning, the index stood at 1098.892 points. There had been no trading over the weekend. So what had happened?

On February 25, 1991, during the first Gulf War, a Patriot anti-ballistic missile failed to track and intercept an incoming Iraqi Scud missile. The Scud slammed into an American army barracks, killing 28 soldiers and wounding a hundred more. Patriots had successfully downed numerous Scuds aimed at Israel and at Saudi Arabia. So what had happened this time?

On April 5, 1992, when election results for the parliament in the German state of Schleswig-Holstein were announced, the Green Party breathed a sigh of relief. Monitors showed that the party had cleared the electoral hurdle of 5.0 percent, albeit by a hair’s breadth. Minutes later, relief turned into disappointment. The election committee announced that the Greens had not passed the threshold after all. No additional votes had been counted in the intervening minutes. So what had happened?

On June 4, 1996, after a decade of development costing seven billion dollars, the European Space Agency launched the unmanned Ariane 5 space rocket. Forty seconds after lift-off from French Guiana, it exploded.

Dénouement:

In the chapter ‘iPhones and the Butterfly’, I pointed out that numerical data must be rounded or truncated in digital computers. It is a fact of silicon life that electronic computing devices can carry only a finite number of digits, about 35 at best. But sometimes numbers are cut off much sooner, due to slipups by the coders.

The Vancouver stock exchange glitch happened because of such rounding errors. Some twenty-two months before it occurred, in January 1982, the index was set to 1000 points. After every trade, the index was recalculated and updated. The recalculation was performed to three digits after the decimal point and then…no, it was not rounded. It was truncated! In other words, the tail of the recomputed number was just ignored. Thus, in effect, the index was rounded down each time, no matter whether the fourth digit was a 0 or a 9. With about 3000 trades a day, the index lost about one point each day. Over the weekend of November 25–28, 1983, the error was corrected, raising the value of the index to about double its value from the previous Friday.
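The downward drift caused by truncation can be reproduced in a few lines. The sketch below is not the exchange's actual algorithm: the random index changes, the seed and the helper name `simulate` are assumptions made purely for illustration. It applies the same sequence of tiny changes twice, once chopping after three decimals and once rounding.

```python
import random
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN

STEP = Decimal("0.001")  # three digits after the decimal point

def simulate(updates, rounding, seed=1):
    """Apply the same made-up index changes, keeping three decimals each time."""
    rng = random.Random(seed)
    index = Decimal("1000.000")
    for _ in range(updates):
        # Hypothetical change with six decimals, zero on average.
        change = Decimal(rng.randint(-99999, 99999)) / Decimal(1000000)
        index = (index + change).quantize(STEP, rounding=rounding)
    return index

truncated = simulate(3000, ROUND_DOWN)       # roughly one trading day, chopped
rounded = simulate(3000, ROUND_HALF_EVEN)    # same changes, properly rounded
drift = rounded - truncated                  # points lost to truncation alone
print(truncated, rounded, drift)
```

Each truncation discards half a thousandth of a point on average, so a day of about 3000 updates costs on the order of a point, while the rounded index shows no systematic loss.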

The tragic miss of the Scud by the Patriot anti-ballistic missile was due to an inaccurate calculation of the timespan since the Patriot’s start-up. The system measured time in tenths of seconds and had been running for 100 hours. The binary expansion of one tenth (remember, computers operate in zeros and ones) is the never-ending string 0.0001100110011001100110011001100... The Patriot system stored this infinitely long number in a 24-bit register, chopping off the tail and thus introducing a minute error of 0.0000000000000000000000011001100... (about 0.000000095 in decimal notation) each time. After a hundred hours, i.e., after 100×60×60×10 tenths of seconds, the accumulated error amounted to about 0.34 seconds. Since a Scud rocket travels about one and a half kilometers per second, the Patriot was off by close to half a kilometer.
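The arithmetic of the Patriot error can be checked with exact fractions. This is a back-of-the-envelope sketch, not the missile's software. The constant `FRACTION_BITS = 23` is an assumption: it is the number of binary digits after the point that reproduces the per-tick error of about 0.000000095 quoted above. The speed of 1.5 km/s is the rough figure from the text.

```python
from fractions import Fraction

FRACTION_BITS = 23  # assumed number of bits kept after the binary point

tenth = Fraction(1, 10)
# Chop (do not round) the binary expansion of 1/10 after FRACTION_BITS bits.
stored = Fraction(int(tenth * 2**FRACTION_BITS), 2**FRACTION_BITS)
error_per_tick = tenth - stored              # error per tenth of a second

ticks = 100 * 60 * 60 * 10                   # tenths of a second in 100 hours
clock_error = float(error_per_tick * ticks)  # accumulated drift in seconds

scud_speed_km_s = 1.5                        # rough figure quoted in the text
miss_km = clock_error * scud_speed_km_s
print(f"{float(error_per_tick):.3e} s per tick, "
      f"{clock_error:.2f} s after 100 hours, {miss_km:.2f} km off")
```

An error of less than a ten-millionth of a second looks harmless. Multiplied by 3.6 million ticks and by the speed of the target, it is not.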

The explosion of the Ariane 5 rocket was due to an error relating to one of the flight parameters. A 64-bit floating-point number was converted to a 16-bit signed integer. Unfortunately, that number was larger than 32,767, which is the largest value that can be stored as a 16-bit signed integer. The conversion failed and disaster ensued.
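What it means for a number not to fit into 16 bits can be illustrated as follows. This is a Python sketch, not the rocket's flight software. The function names and the sample value 40000.0 are hypothetical. The first function refuses values that are out of range, the second shows what remains if the check is skipped and only the low 16 bits are kept.

```python
import struct

INT16_MIN = -32_768
INT16_MAX = 32_767

def to_int16_checked(value: float) -> int:
    """Convert to a 16-bit signed integer, refusing values that do not fit."""
    as_int = int(value)  # drops the fractional part
    if not INT16_MIN <= as_int <= INT16_MAX:
        raise OverflowError(f"{value} does not fit into 16 bits")
    return as_int

def to_int16_wrapped(value: float) -> int:
    """Unchecked variant: keep only the low 16 bits and reinterpret them as signed."""
    low_bits = int(value) & 0xFFFF
    return struct.unpack("<h", struct.pack("<H", low_bits))[0]

print(to_int16_checked(32767.9))    # still fits: 32767
print(to_int16_wrapped(40000.0))    # too large: wraps around to -25536
```

Either outcome is dangerous if nobody planned for it: an error that is raised but not handled stops the program, and a silently wrapped value feeds nonsense into everything downstream.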

In Schleswig-Holstein, the Greens rejoiced after the election committee determined that they had obtained 74,014 votes out of 1,487,909 votes cast. To the Greens’ chagrin, however, this amounted to only 4.97% of the total. When the votes were counted and the proportions calculated, the results were unfortunately displayed to only one digit after the decimal point. And 4.97%, rounded to one decimal, gives 5.0%, which led to the Greens’ premature joy.
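The election figures are easy to verify. The sketch below uses the vote counts given above. The variable names are mine, and the integer comparison is only one possible way of testing the threshold without any rounding.

```python
votes, total = 74_014, 1_487_909

share = 100 * votes / total          # percentage, full floating-point precision
displayed = round(share, 1)          # what a one-decimal display shows

passes_on_screen = displayed >= 5.0          # misleading: compares the rounded value
passes_exactly = 100 * votes >= 5 * total    # integers only, nothing is rounded

print(f"{share:.4f}% is displayed as {displayed}%")
print(passes_on_screen, passes_exactly)
```

The display and the decision must not share the same rounded number: rounding belongs at the very end, for human eyes only.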

Technical supplement:

Calculators and computers can compute and store numerical values only to a certain number of digits after the decimal point. When the first computers were built in the mid-1940s, this did not seem a problem. But in the following decades problems were recognized. Above all, there was the tiresome business of the numbers’ magnitudes. In the early days computers allocated a fixed width to each number, let’s say eight positions before the decimal point and two behind. So when the number 12345678.90 was added to the number 0.0123456789, the calculation in effect became 12345678.90 plus 0.01.
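The fixed-width example can be replayed with Python's `decimal` module. The helper `to_fixed` is a hypothetical stand-in for a machine that keeps exactly two positions behind the decimal point and drops the rest.

```python
from decimal import Decimal, ROUND_DOWN

CENT = Decimal("0.01")  # two positions behind the decimal point

def to_fixed(text: str) -> Decimal:
    """Store a number in the fixed format, dropping digits that do not fit."""
    return Decimal(text).quantize(CENT, rounding=ROUND_DOWN)

big = to_fixed("12345678.90")
small = to_fixed("0.0123456789")   # only 0.01 survives
total = big + small

print(small, total)
```

Eight of the small number's nine significant digits are lost before the addition even starts.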

As a response, floating-point arithmetic was developed. It doesn’t care where the decimal point is, and always registers the same number of significant digits. After the values of the digits have been established, the decimal point floats in and lands at the appropriate position.
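The idea of a fixed number of significant digits with a movable decimal point can be imitated in decimal notation. The sketch assumes a hypothetical format with ten significant digits. Real hardware works in binary, but the principle is the same.

```python
from decimal import Context

ctx = Context(prec=10)   # hypothetical format: ten significant digits

tiny = ctx.create_decimal("0.0123456789")   # nine digits, stored in full
huge = ctx.create_decimal("12345678.90")    # ten digits, stored in full
total = ctx.add(huge, tiny)                 # the sum is cut back to ten digits

# The exponent records where the decimal point has "landed" for each number.
print(tiny, tiny.as_tuple().exponent)
print(huge, huge.as_tuple().exponent)
print(total)
```

Each number on its own keeps all its digits, whatever its magnitude. Only when numbers of very different size are combined do the trailing digits of the smaller one fall off the end.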

Floating-point arithmetic was an advance but no panacea, since digital computation always requires truncation of numbers. In the mid-1970s anarchy reigned among computer manufacturers. The companies decided on their own where to truncate numbers, how to handle division by zero, what infinity looks like, etc. Different machines used different procedures to truncate or round off numbers, and when a program was transferred from one machine to another, strange things happened. X – X was not always equal to zero, and X – Y sometimes was, even when X and Y differed. 0^0 produced an “error” on some machines and 1.0 on others, 0/0 occasionally equaled zero, while it produced an “error” at other times, and the sign of zero was sometimes positive, sometimes negative, and sometimes undefined. Blunders and mistakes made machine computation very unreliable. It became increasingly obvious that standards were needed to make the different practices compatible.

It was only when William Kahan, a professor of computer science at UC Berkeley, and colleagues presented the IEEE 754 standard for floating-point arithmetic in the mid-1980s that some order was established.
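Some of the order the standard brought can be observed from any language that uses IEEE 754 double precision, as CPython does on mainstream hardware (an assumption of this sketch). The particular values are arbitrary. Python itself raises `ZeroDivisionError` for `0.0 / 0.0` instead of returning a value, so the "not a number" result is produced here by subtracting infinity from infinity.

```python
import math

x = 0.1 + 0.2
assert x - x == 0.0                      # X - X is exactly zero for finite X

# Two different, very small numbers: the difference is tiny but not zero.
a, b = 2.5e-308, 2.3e-308
assert a != b and a - b != 0.0

neg_zero = -0.0
assert neg_zero == 0.0                          # both zeros compare equal...
assert math.copysign(1.0, neg_zero) == -1.0     # ...but the sign is preserved

nan = math.inf - math.inf                # invalid operation: NaN, not a guess
assert math.isnan(nan) and nan != nan    # NaN is not even equal to itself

assert math.pow(0.0, 0.0) == 1.0         # Python's documented result for 0^0
print("all checks passed")
```

The same script gives the same answers on every conforming machine, which is precisely what was missing in the 1970s.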

Again, it was an advance but flaws remained. However, the examples I presented above, as well as many more, date from the previous century. In the meantime, safeguards, warning flags, emergency brakes, as well as overrides of emergency brakes, have been built into software. Rounding after one digit (Schleswig-Holstein) or three digits (Vancouver stock exchange) is usually no longer an option. Neither is the truncation of vital data after 16 bits (Ariane 5) or 24 bits (Patriot). In fact, Kahan now argues emphatically for the adoption of ‘quadruple precision arithmetic’, which stores each number in 128 bits and thus retains about 34 significant decimal digits.
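How bits translate into decimal digits is a one-line calculation: each binary digit is worth log10(2), or about 0.3, decimal digits. The sketch assumes the IEEE 754 layouts, in which double precision devotes 53 bits and quadruple precision 113 of its 128 bits to the significand. The function name is mine.

```python
import math
import sys

def decimal_digits(significand_bits: int) -> float:
    """Approximate number of decimal digits carried by a binary significand."""
    return significand_bits * math.log10(2)

double_digits = decimal_digits(sys.float_info.mant_dig)  # 53 bits for a double
quad_digits = decimal_digits(113)                        # quadruple precision

print(f"double: about {double_digits:.1f} digits, "
      f"quadruple: about {quad_digits:.1f} digits")
```

Quadruple precision thus roughly doubles the number of trustworthy digits compared with the double precision found in most of today's software.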

What is still out of reach are tools to ascertain that software is error-free not only for all conceivable numerical inputs, but for all numerical inputs, without qualifier. After all, that is what mathematical proofs do.

Comments, corrections, observations:

George G. Szpiro