**1 Floating-point numbers**

In a digital computer, the set of real numbers R is approximated by a set of
floating-point numbers.

Representing a quantity as a floating-point number is very similar to
representing it using scientific notation. In scientific notation, a number
that has very large or very small magnitude is represented as a number of
moderate magnitude multiplied by some power of ten. For example, 0.000008119
can be expressed as $8.119 \times 10^{-6}$. The decimal point moves, or floats,
as the power of 10 changes.
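As a small illustration of the notation above, the following Python sketch formats a number in scientific notation. The helper name `to_scientific` and its `digits` parameter are made up for this example and do not come from the handout.

```python
def to_scientific(x: float, digits: int = 4) -> str:
    """Format x in scientific notation with `digits` significant digits."""
    # The 'e' format keeps one digit before the point, so digits - 1 follow it.
    return f"{x:.{digits - 1}e}"


print(to_scientific(0.000008119))  # 8.119e-06
print(to_scientific(811900.0))     # 8.119e+05
```

The same four digits appear in both results; only the power of ten changes, which is the sense in which the decimal point "floats".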

A set of floating-point numbers F is defined by four integers:

- β, the base
- p, the precision
- [L, U], the exponent range

A floating-point number x ∈ F is written as follows:

$$x = \pm \left( d_0 + \frac{d_1}{\beta} + \frac{d_2}{\beta^2} + \cdots + \frac{d_{p-1}}{\beta^{p-1}} \right) \beta^E$$

where each $d_i$ is an integer such that $0 \le d_i \le \beta - 1$ and E is an
integer such that L ≤ E ≤ U.

The string of base-β digits $d_0 d_1 \cdots d_{p-1}$ is called the fraction.
Notice that the value of the precision p determines how many digits there are
in the floating-point number. The number E is called the exponent. A set of
floating-point numbers is said to be normalized if the lead digit $d_0$ is
nonzero unless the number represented is zero. In other words, in a normalized
set of floating-point numbers, a number has a lead digit of 0 if and only if
that number is 0. In virtually every modern computer, floating-point numbers
are normalized and use base 2 (i.e. β = 2). Internally to the computer, a
floating-point number is represented by three integers: the sign (1 bit), the
fraction, and the exponent.
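To make the definition concrete, here is a minimal Python sketch that enumerates every value in a normalized system given β, p, L and U. The function name `enumerate_system` and the parameter values used below (β = 2, p = 3, L = -1, U = 1) are assumptions chosen only for illustration; they are not taken from this handout.

```python
from fractions import Fraction


def enumerate_system(beta: int, p: int, L: int, U: int) -> list[Fraction]:
    """Return every value of the normalized system (beta, p, [L, U]), sorted."""
    values = {Fraction(0)}  # zero is the only value whose lead digit is 0
    # m encodes the digit string d_0 d_1 ... d_{p-1} as an integer with d_0 != 0.
    for m in range(beta ** (p - 1), beta ** p):
        for E in range(L, U + 1):
            magnitude = Fraction(m, beta ** (p - 1)) * Fraction(beta) ** E
            values.add(magnitude)
            values.add(-magnitude)
    return sorted(values)


# Hypothetical parameters, chosen only for illustration.
toy = enumerate_system(beta=2, p=3, L=-1, U=1)
print(len(toy))                           # 25
print(float(min(v for v in toy if v > 0)))  # 0.5
print(float(max(toy)))                    # 3.5
```

Exact rational arithmetic (`Fraction`) is used so that the enumeration itself introduces no rounding.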

A real number y ∈ R is not always exactly representable in a floating-point
system F. In this case, y is usually approximated by x ∈ F where x = fl(y) is
the closest floating-point number to y. This function fl is called round to
nearest. In the case where two floating-point numbers are the same distance
from y, the nearest floating-point number ending in an even digit is chosen.
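The tie-breaking rule can be tried out with Python's `decimal` module. This is only a sketch: it assumes a hypothetical decimal system with β = 10 and p = 3, and the name `fl` simply mirrors the notation in the text.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# Hypothetical system for illustration: base 10, precision p = 3.
ctx = Context(prec=3, rounding=ROUND_HALF_EVEN)


def fl(y: str) -> Decimal:
    """Round the decimal literal y to the nearest 3-digit number, ties to even."""
    return ctx.create_decimal(y)


print(fl("2.3451"))  # 2.35 (not a tie: 2.35 is strictly closer)
print(fl("2.345"))   # 2.34 (tie: the candidate ending in an even digit wins)
print(fl("2.355"))   # 2.36 (tie: the candidate ending in an even digit wins)
```

The inputs are passed as strings so that the value being rounded is exactly the decimal literal shown, rather than a binary approximation of it.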

**2 IEEE Standard 754**

The two sets of floating-point numbers most commonly supported on modern
computer systems are specified by IEEE Standard 754. They are:

**IEEE single-precision:** a base-2 (i.e. β = 2) system with 1 sign bit, 8-bit
exponents, 23-bit fractions and a bias of 127

**IEEE double-precision:** a base-2 (i.e. β = 2) system with 1 sign bit, 11-bit
exponents, 52-bit fractions and a bias of 1023

We denote the exponent as E, the sign bit as S, the fraction as F and the bias
as b. The bias allows the IEEE 754 standard to represent both positive and
negative exponents. Assume we have a bias of b where b is a positive integer.
If the value of the exponent bits, interpreted as an unsigned integer, is E
then the exponent of the floating-point number is E - b. The fraction, F, can
be interpreted in two ways. When representing a **normalized** value, the
fraction represents the value 1.F, where an implicit leading 1 is prepended,
followed by the radix point and then the bits making up F. When representing
**unnormalized** values, the fraction is regarded as representing 0.F. The
standard also includes special values used to represent ±∞ and **NaN**, which
stands for “Not a Number”. We can summarize the rules for single-precision
IEEE 754 numbers as follows:

1. If E = 0 and F = 0 then the value is 0 with S indicating positive or
negative.

2. If E = 255 and F ≠ 0, then the value is **NaN**.

3. If E = 255 and F = 0 then the value is ∞ with S indicating positive or
negative.

4. If 0 < E < 255 then the number represented is the **normalized** value
$(-1)^S \times 1.F \times 2^{E-127}$.

5. If E = 0 and F ≠ 0 then the number represented is the **unnormalized**
value $(-1)^S \times 0.F \times 2^{-126}$.

Similar rules apply for double-precision numbers: replace E = 255 with
E = 2047, and use $(-1)^S \times 0.F \times 2^{-1022}$ for unnormalized values
and $(-1)^S \times 1.F \times 2^{E-1023}$ for normalized values.
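The five single-precision rules can be sketched directly in Python. The helper names `decode_single` and `float_to_bits` are made up for this example; `struct` is from the Python standard library and is used only to obtain the 32-bit pattern of a value for a round-trip check.

```python
import struct


def decode_single(bits: int) -> float:
    """Decode a 32-bit IEEE 754 single-precision pattern using the rules above."""
    S = (bits >> 31) & 0x1    # 1 sign bit
    E = (bits >> 23) & 0xFF   # 8 exponent bits, read as an unsigned integer
    F = bits & 0x7FFFFF       # 23 fraction bits
    sign = -1.0 if S else 1.0
    if E == 255:
        # Rule 2 (NaN) and rule 3 (signed infinity).
        return float("nan") if F != 0 else sign * float("inf")
    if E == 0:
        # Rule 1 (signed zero) and rule 5 (unnormalized): 0.F x 2^-126.
        return sign * (F / 2**23) * 2.0**-126
    # Rule 4 (normalized): 1.F x 2^(E - 127).
    return sign * (1 + F / 2**23) * 2.0 ** (E - 127)


def float_to_bits(x: float) -> int:
    """Return the single-precision bit pattern of x as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


print(decode_single(0x3F800000))                 # 1.0
print(decode_single(0xC0000000))                 # -2.0
print(decode_single(0x7F800000))                 # inf
print(decode_single(0x00000001) == 2.0**-149)    # True
print(decode_single(float_to_bits(0.15625)))     # 0.15625
```

The results are returned as Python floats, which are double precision, so every single-precision value is represented exactly.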

**3 Floating-Point Arithmetic**

To add or subtract two floating-point numbers, the exponents must match before
the fractions can be added or subtracted. If they do not match, the fraction of
one must be shifted until the two exponents match. As a result, the sum or
difference of two floating-point numbers will not necessarily be equal to the
true sum or difference of the two real numbers they represent. In other words,
since the sum of two p-digit numbers can have more than p digits, the excess
digits cannot be represented by a p-digit floating-point number and will be
lost. While multiplication and division of floating-point numbers do not
require the exponents to match, these operations may also produce answers
different from the corresponding operation on real numbers.
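The loss of excess digits can be observed with a short Python sketch. The first part assumes a hypothetical 4-digit decimal system modelled with the `decimal` module; the second part uses Python's built-in floats, which are IEEE double precision on common platforms.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# Hypothetical system for illustration: base 10, precision p = 4.
ctx = Context(prec=4, rounding=ROUND_HALF_EVEN)

a = Decimal("1234")    # 1.234 x 10^3
b = Decimal("0.5678")  # 5.678 x 10^-1

print(a + b)           # 1234.5678 (default context keeps 28 digits)
print(ctx.add(a, b))   # 1235      (only 4 digits survive)

# Same effect in binary double precision: the 1.0 is lost entirely.
print(1e16 + 1.0 == 1e16)  # True
```

In the decimal case the exact sum needs eight digits, so four of them are rounded away once the result is forced back into the 4-digit system.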

Since floating-point operations only approximate traditional arithmetic
operations on real numbers, they are often represented by different symbols:
⊕, ⊖, ⊗, ⊘ represent floating-point addition, subtraction, multiplication, and
division respectively.

The associative law does not hold for floating-point numbers. The following
laws do:

$u \oplus v = v \oplus u$

$u \oplus v = 0$ if and only if $v = -u$
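A quick way to see the failure of associativity is to try it with Python's built-in floats (IEEE double precision on common platforms). The values below are chosen for illustration and are not from the handout.

```python
big = 1e16

left = (big + -big) + 1.0   # the large terms cancel first, then 1.0 is added
right = big + (-big + 1.0)  # 1.0 is absorbed into -big before the cancellation

print(left)   # 1.0
print(right)  # 0.0

# Changing the order of the two operands, by contrast, makes no difference.
print(0.1 + 0.2 == 0.2 + 0.1)  # True
```

The two groupings differ only in where rounding happens: in `right`, the intermediate sum `-big + 1.0` is rounded back to `-big`, so the 1.0 never reaches the final result.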

**4 Problems**

In answering the following questions, remember that in a binary number, binary
digits after the “decimal point” are multiplied by successively smaller powers
of 2, e.g. $(0.11)_2 = 1 \times 2^{-1} + 1 \times 2^{-2} = 0.75$. It may be
useful to review section 3.6 of the textbook, which discusses representing
integers in different bases.

**1. Floating-Point Number Systems [10 points]**

(a) A Small Set of Floating-Point Numbers

Suppose we create a simple **normalized** floating-point system F with

The smallest positive number we can represent in this system is 0.5.

What is the largest positive number representable in the system?

How many numbers are in this system?

(b) Picturing Floating-Point Numbers

Using decimal notation, write out the entire set of numbers representable in
the floating-point system described above (e.g. we have seen 0.5 is one of the
numbers represented in the system so it will be in the list you write out).

Now, draw a number line representing the real numbers in the interval [-3, 3]
and put tick marks to show where the floating-point numbers in F fall on the
line. Are the floating-point numbers uniformly distributed on the line?

**2. IEEE Standard Floating-Point Systems [15 points]**

Answer the following questions for both single precision numbers and double
precision numbers.

(a) How many numbers are in the set?

(b) What is the smallest positive number that can be represented?

(c) What is the largest positive number that can be represented?

**3. Round to Nearest [10 points]**

Suppose we have a round to nearest function fl that maps real numbers to the
small floating-point system F described in Problem 1. Is this function
one-to-one? Is it onto? Justify your answers. Hint: for one-to-one, use a
specific example; for onto, you can write a simple direct proof showing that
any x ∈ F has a real number that maps to it.

**4. Arithmetic Operations [15 Points]**

(a) We will prove floating-point addition is not associative. Suppose we have a
decimal (i.e. β = 10) floating-point number system with a precision of 8 digits
that uses round to nearest. Under these conditions, we would have the following
example.

Show that a different answer is reached for

(b) Consider a 2-digit decimal number system that uses round to nearest. In
this system,

Consider the algebraic identity
. Is this
identity valid

in the number system we are using? To answer that question, try setting
a = 1.8 and b = 1.7 and calculate both the left-hand side and the right-hand
side of the identity using 2-digit decimal arithmetic, rounding if necessary
after each operation. What answers do you get?

(c) This problem involves fixed-point arithmetic, but the lesson about the
perils of finite precision is relevant for floating-point numbers as well. In
1982 the Vancouver Stock Exchange instituted a new index initialized to a
decimal value of 1000.000. By design, the index was fixed to always have three
digits after the decimal point. After each transaction, a new index value was
computed and truncated to three trailing decimal digits. Twenty-two months
later, the value of the index should have been 1098.892. Instead, the
calculated value was 524.881. In one or two sentences, explain what you think
caused this inaccuracy.