June 19th
# Mathematical Expression and Reasoning for Computer Science

Rounding
Most numbers are not exactly representable in a floating-point system using a given base
β. For example, when β = 10, you cannot represent 1/3 exactly (no matter how large t
is), so we used 3.33 × 10⁻¹ when t = 3. What should we do with something like the base
of natural logarithms, e = 2.718281828…? Two approaches are used.

• Round to nearest.

• Truncate to zero.
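As a sketch of the two approaches, Python's decimal module can stand in for the toy β = 10, t = 3 system (the choice of module and rounding-mode names is mine, not from the notes):

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN, ROUND_DOWN

value = Decimal("2.718281828")  # e, to more digits than the system keeps

# Two 3-significant-digit (beta = 10, t = 3) contexts, one per approach.
nearest = Context(prec=3, rounding=ROUND_HALF_EVEN).plus(value)
truncated = Context(prec=3, rounding=ROUND_DOWN).plus(value)

print(nearest)    # 2.72  (round to nearest)
print(truncated)  # 2.71  (truncate to zero)
```

The unary `plus` operation applies the context's precision and rounding mode, which is all we need here.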

Overflow

There is no way to represent a number larger than the largest floating-point number. In
our first example (β = 10, t = 3, e ∈ [-4, +4]), there is no way to represent 99901 or
greater.

Underflow

There is no way to represent a positive number smaller than the smallest positive
floating-point number. In our first example, there is no way to represent 0.00001 (or any
smaller positive number).
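Overflow can be sketched with the same decimal module by bounding the exponent range; modelling the notes' first example with Emax/Emin is my own illustration:

```python
from decimal import Context, Decimal, Overflow

# Model of the first example: beta = 10, t = 3, e in [-4, +4].
ctx = Context(prec=3, Emax=4, Emin=-4)

largest = ctx.plus(Decimal("9.99E+4"))  # 99900: the largest representable value
print(largest)

try:
    ctx.plus(Decimal("99999"))  # would round to 1.00E+5, past Emax
    overflowed = False
except Overflow:                # Overflow is trapped by default
    overflowed = True
print(overflowed)  # True
```

Note that decimal allows gradual underflow (subnormal results), so values below 1.00 × 10⁻⁴ lose digits rather than failing outright; the hard cutoff on the small side is a property of the notes' idealized system.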

Absolute rounding error

We can calculate the difference between the true value we're trying to represent and
the value of its floating-point representation. For example, the absolute error in our
representation of e is |2.71 - 2.718281828 … | = |0.008281828 … |.

Relative error

100 and 100.1 are closer than 1 and 1.1, even though the absolute difference is 0.1 in
both cases. We should look at the size of the error in terms of the size of the value being
represented. For x ≠ 0, the relative error between the approximate value x′ and the
"real" value x is

|x′ − x| / |x|

For example, approximating 1.1 by 1 gives |1 − 1.1|/|1.1| = 0.0909… ≈ 9%. However,
approximating 100.1 by 100 gives |100 − 100.1|/|100.1| = 0.000999… ≈ 0.1%.
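A minimal helper makes the two computations above concrete:

```python
def rel_error(x, x_approx):
    """Relative error |x' - x| / |x| for true value x and approximation x'."""
    return abs(x_approx - x) / abs(x)

print(rel_error(1.1, 1.0))      # ~0.0909, about 9%
print(rel_error(100.1, 100.0))  # ~0.000999, about 0.1%
```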

Relative error in round-to-nearest

When we round numbers to represent them in a floating-point system, can we bound the
relative error for positive numbers with no overflow or underflow?

In general, a number of the form

d₀.d₁d₂…dₜ₋₁dₜdₜ₊₁… × β^e

gets rounded either up or down to one of

d₀.d₁d₂…dₜ₋₁ × β^e   or   (d₀.d₁d₂…dₜ₋₁ + β^(−(t−1))) × β^e

The representation of the second number may be different (the +1 in the last kept digit
may cause a number of carries in the addition), but the value is the same. The difference
between these two numbers is simply a 1 in the position occupied by dₜ₋₁, for a difference
of β^(e−(t−1)). Since we round to nearest, our error is at most half this value, (1/2)β^(e−t+1).

The relative error can now be bounded. Since we use the convention that the leading digit, d₀,
is non-zero, the smallest denominator (hence the largest bound) occurs when the value is
1.00…0 × β^e, giving a relative error of at most

(1/2)β^(e−t+1) / β^e = (1/2)β^(1−t)

This matches our intuition that by increasing t (the number of digits) we get more precision.

In our example of a floating-point system with β = 2, t = 3, this gives a bound on the
relative error of round-to-nearest of (1/2) × 2^(1−3) = 1/8. This is also clear from the
number line, or from the 24 values in this representation.
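The bound can be checked numerically; this sketch rounds samples in [1, 2) (so the exponent stays fixed at 0) to three binary digits and tracks the worst relative error, where the sampling scheme is my own illustration:

```python
# Empirically check the (1/2) * beta^(1-t) bound for beta = 2, t = 3,
# so the bound is 1/8 = 0.125.
beta, t = 2, 3
bound = 0.5 * beta ** (1 - t)

def round_nearest(x):
    """Round x in [1, 2) to t significant base-beta digits, exponent 0."""
    scale = beta ** (t - 1)
    return round(x * scale) / scale

worst = 0.0
for i in range(10000):
    x = 1.0 + i / 10000.0
    err = abs(round_nearest(x) - x) / x
    worst = max(worst, err)

print(worst <= bound)  # True: every sample stays within the 1/8 bound
```

Python's `round` uses ties-to-even, matching the usual round-to-nearest convention.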

Accumulation of error

Since we can't represent all real numbers exactly in a floating-point representation, what
happens when we repeat operations? For example, take β = 10, t = 3, e ∈ [-2, +2], and
consider the sum

100 + 0.1 + 0.1 + … + 0.1 [n times]

The first part of the sum, 100 + 0.1, is represented by 1.00 × 10². As we add additional
terms with value 0.1, the result is still 1.00 × 10², since our representation cannot capture
the additional 0.1. The relative error can become arbitrarily large, given large enough
n. If you play around with FloatExample.cumulativeError (on the web page), with, say,
t1 = 1.0 and n ≥ 10, you'll see this demonstrated.

The easy way to work around this particular problem is to add the 0.1 terms first, and
then add the 100. In other words, addition is not associative for floating-point numbers:

(a + b) + c ≠ a + (b + c).
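The whole effect can be reproduced with Python's decimal module set to three significant digits (a sketch; the term count of 20 is an arbitrary choice of mine):

```python
from decimal import Decimal, getcontext

getcontext().prec = 3  # beta = 10, t = 3

# Left to right: the running sum of 100 absorbs every 0.1 term.
left = Decimal("100")
for _ in range(20):
    left += Decimal("0.1")  # 100.1 rounds back to 1.00 x 10^2 each time

# Small terms first: their sum (2.0) is large enough to survive the final add.
right = sum(Decimal("0.1") for _ in range(20)) + Decimal("100")

print(left)   # 100
print(right)  # 102
```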

Catastrophic cancellation

Use the same floating-point system as in the previous example to compute b² − 4ac for
b = 3.34, a = 1.22, and c = 2.28. The exact value is 0.0292 = 2.92 × 10⁻², and this exact
value is representable in our floating-point system. Look at how the value is calculated,
though.

b² = (3.34)²
= 11.1556 ≈ 1.12 × 10¹

4ac
= 4 × 1.22 × 2.28
= 4.88 × 2.28
= 11.1264 ≈ 1.11 × 10¹

b² − 4ac
≈ 1.12 × 10¹ − 1.11 × 10¹
= 0.01 × 10¹ = 1.00 × 10⁻¹

Compared to our exact answer of 2.92 × 10⁻², this has a relative error of

|1.00 × 10⁻¹ − 2.92 × 10⁻²| / |2.92 × 10⁻²| ≈ 2.42, i.e. roughly 242%.

Subtracting two floating-point numbers that are very close together leaves very few
significant digits; a great deal of information is lost. Since the true value is very small, the
round-off error becomes much more significant, and sometimes becomes much larger than
the value being computed (see above).
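Reproducing the calculation with Python's decimal module at three significant digits shows the cancellation (a sketch, not the notes' FloatExample code):

```python
from decimal import Decimal, getcontext

a, b, c = Decimal("1.22"), Decimal("3.34"), Decimal("2.28")

getcontext().prec = 3       # the toy system: every product rounds to 3 digits
disc = b * b - 4 * a * c    # (1.12 - 1.11) x 10^1
print(disc)                 # 0.1

getcontext().prec = 28      # plenty of digits: no harmful intermediate rounding
exact = b * b - 4 * a * c
print(exact)                # 0.0292
```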

The expression b² − 4ac crops up in the solution to the quadratic equation ax² + bx + c =
0. The general form of the solution, for the two roots x₁ and x₂, is

x = (−b ± √(b² − 4ac)) / (2a)

We may not have to worry about the large relative error in b² − 4ac, since it may be small
in absolute value compared to −b. Here's a case where computing a root (using the values
that lead to catastrophic cancellation above) gives a fairly acceptable value using
floating-point operations:

(−3.34 + √(1.00 × 10⁻¹)) / (2 × 1.22) ≈ (−3.34 + 0.316) / 2.44 ≈ −1.24

Compare this to the result if there were no error in the computation of b² − 4ac, which is

(−3.34 + √(2.92 × 10⁻²)) / (2 × 1.22) ≈ (−3.34 + 0.171) / 2.44 ≈ −1.30

… for a relative error of less than 5%.

Stability

There is a built-in problem with the formulas used above to compute the roots of a
quadratic equation: if b² − 4ac is close to b², then there will be catastrophic cancellation
between −b and √(b² − 4ac). This is separate from the catastrophic cancellation that may
happen if b² is close to 4ac.

Definition. A formula (or algorithm) is called "unstable" if errors in the input values
get magnified during the computation (i.e., if the relative error in the final answer can be
larger than the relative error in the input values).

Dealing with instability
Our first example (100 + 0.1 + …) was unstable, but there was an easy way to use a
different algorithm that is stable: perform the operations in a different order.

Our second example (b² − 4ac) was also unstable, because of potential catastrophic
cancellation, and there is, unfortunately, no easy fix. You could increase the number of
significant digits to make the round-off error smaller, but you will still have the potential
for catastrophic cancellation when you subtract numbers that are very close, even with
your increased precision. The formula is unstable.

Our third example is also unstable (it includes the instability of example 2, plus its
own instability), but the possibility of cancellation between −b and √(b² − 4ac) can be
avoided by changing the formula. Suppose b > 0 (otherwise swap the roles of x₁ and x₂
below if b < 0). Then avoid the subtraction in the numerator of the quadratic formula by
computing

x₁ = (−b − √(b² − 4ac)) / (2a)

Since both −b and −√(b² − 4ac) have the same (negative) sign, there will be no catastrophic
cancellation. Now compute x₂ using

x₂ = c / (a x₁)

(multiply the two roots of the quadratic formula to see that x₁x₂ = c/a).

This formula involves no subtraction, so there is no catastrophic cancellation.
There are two ways, in general, to deal with unstable formulas or algorithms.
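A sketch of the stable computation in Python (ordinary binary floating point rather than the toy decimal system; the test values a = 1, b = 10⁹, c = 1 are my own, chosen so the naive formula fails completely):

```python
import math

def stable_roots(a, b, c):
    """Real roots of ax^2 + bx + c = 0, avoiding cancellation with -b.

    The discriminant may still cancel catastrophically when b*b is close
    to 4*a*c; only the -b +/- sqrt(...) subtraction is avoided here.
    """
    sq = math.sqrt(b * b - 4 * a * c)
    if b >= 0:                   # choose the sign that adds magnitudes
        x1 = (-b - sq) / (2 * a)
    else:
        x1 = (-b + sq) / (2 * a)
    x2 = c / (a * x1)            # since x1 * x2 = c / a
    return x1, x2

a, b, c = 1.0, 1e9, 1.0
naive_small = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)
x1, x2 = stable_roots(a, b, c)
print(naive_small)  # 0.0 -- the small root is lost to cancellation
print(x2)           # -1e-09 -- recovered by the stable formula
```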
There are two ways, in general, to deal with unstable formulas or algorithms.

• Increase the precision (the number of significant digits). This does not change the
fact that the formula or algorithm is unstable, but can help minimize the magnitude
of the errors, for some inputs.

• Use a different, stable, algorithm or formula to compute the result. When possible,
this is preferred.

Conditioning

In the previous example we examined the error produced in calculating the roots of ax² +
bx + c. The numbers a, b, and c may come from measurement and already have some error
associated with them. By using a stable algorithm we get a relatively correct answer for
slightly incorrect inputs.

But, independently of the particular algorithm used to compute roots, what can be
said about the effect that errors in the measurement of a, b, and c have on values of roots?
That is, if the values of a, b, or c change slightly, could the roots change dramatically?

To simplify the discussion, let's look at the special case of the quadratic equation x² − c = 0
(so we're finding √c).

Suppose that c = 0.25, but we use a bad approximation c' = 0.36 instead of the true
value. Then our answer will be 0.6 instead of 0.5. The relative error in the input is
0.11/0.25 = 0.44, while the relative error in the result is 0.1/0.5 = 0.2. Taking the square
root makes the relative error smaller!

This isn't something special about the particular case we chose. Let's work the algebra
to see the general case of computing √c using an approximation c′ of c. The ratio of the
relative error of the result to the relative error of the input is (assuming c and c′ are close
enough to have the same sign)

(|√c′ − √c| / √c) / (|c′ − c| / c) = c / (√c (√c′ + √c)) = √c / (√c′ + √c)

using the factoring |c′ − c| = |√c′ − √c| (√c′ + √c). This ratio is always less than 1 (so the
error improves), and when c′ is very close to c, the ratio is close to 1/2. So the error is
never increased, and as the measurements improve, the ratio approaches 1/2.

When computing the function f(x) using the approximation x′ instead of the true value
x, we say that the condition number is the ratio of the relative error of the
result to the relative error of the input, i.e.

(|f(x′) − f(x)| / |f(x)|) / (|x′ − x| / |x|)

If you take the limit as x′ → x (and assume that f is differentiable, and that f(x) ≠ 0),
this is

|x f′(x) / f(x)|

For √x, the condition number is 1/2 (well, do the derivative)! Not all functions
have good condition numbers. Compute the condition number for cos(x), and you'll see
that you can get a huge condition number by choosing an appropriate x.
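A small numerical sketch of the limiting condition number |x f′(x)/f(x)| (the helper and the sample points are my own):

```python
import math

def condition_number(f, dfdx, x):
    """|x * f'(x) / f(x)|: the limiting ratio of output to input relative error."""
    return abs(x * dfdx(x) / f(x))

cond_sqrt = condition_number(math.sqrt, lambda x: 0.5 / math.sqrt(x), 2.0)
cond_cos = condition_number(math.cos, lambda x: -math.sin(x), 1.57)

print(cond_sqrt)  # ~0.5, independent of x
print(cond_cos)   # ~1970: cos is badly conditioned near pi/2, where cos(x) is tiny
```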
