CS61C Spring 1999
Section 23 TA: Gek Siong Low (cs61c-tb)
Week 5 Discussion notes
Floating Point Representation
I think many people were confused during the last lecture, so I'll try to explain this topic in plain English so that you'll get at least the basics right.
Some Basics
A real number consists of an integer part and a fractional part. We've studied the integer part in some detail when we did two's complement. Now we'll deal with representing the fractional part of the number.
Recall that in decimal notation we can write:
e.g. 12.34 = 1x10^1 + 2x10^0 + 3x10^(-1) + 4x10^(-2)
When we use base 2 (or any other base), the principle is the same:
e.g. 11.101 (base 2) = 1x2^1 + 1x2^0 + 1x2^(-1) + 0x2^(-2) + 1x2^(-3) = 3.625
Normalization/Scientific Notation
Normalized form is when you have only one digit to the left of the decimal/binary point.
Decimal example: 12.34 = 1.234 x 10^1
Binary example: 11.101 = 1.1101 x 2^1
For any base, when we shift the decimal/binary point to the left by one position, we increase the exponent of the base by one, and when we shift it to the right by one, we decrease the exponent by one.
Note that in scientific notation we don't have a unique normalized form for zero. As we will see later, we'll need a special representation for zero in our floating point representation scheme.
Converting fractional part to binary
Method 1: The 'trial and error' way
Just find the largest power of two less than or equal to the number, subtract it, and repeat. This method works for integers too, but it's not easy to subtract fractions when you don't have a calculator around.
Method 2:
This is best illustrated with an example. Suppose we want to represent 12.3 in binary.
12 = 1100 (base 2)

    remainder    remainder x 2    leading digit
    0.3          0.6              0
    0.6          1.2              1    <-+
    0.2          0.4              0      | this sequence
    0.4          0.8              0      | repeats
    0.8          1.6              1    <-+
    0.6

We stop here because we get 0.6 again, and continuing will simply repeat the sequence. Thus, 0.3 is 0.0100110011001... in binary. The final result is therefore 12.3 = 1100.0100110011001...

An important observation here is that what can be represented in base 10 with finite digits may often require infinite digits to represent in binary (i.e. infinite precision). We'll see this again later.
The IEEE floating point representation
The IEEE representation is a floating point representation. This simply means that the binary point can move around, with appropriate changes to the exponent. Compare this with fixed point representation, where the number of digits after the binary point is fixed. In floating point representation, we trade off fraction precision for the ability to represent very large numbers. The larger the number, the fewer digits you have left to represent the fraction (after shifting the binary point, that is). Floating point also allows us to represent very small numbers with higher precision.
Notice that the assignments of the 2^n bit strings in this case will not be spread out evenly along the real number line. The density of representable numbers is higher for smaller numbers, and lower for the larger numbers, which makes sense since we would like to represent large numbers sometimes but want higher precision for smaller numbers which we use more often. You may ask what's the use if large numbers are represented imprecisely, but remember that with 32 bits we are talking about roughly 4 billion bit strings.
IEEE Format
+---+----------------+----------------------------------+
| S | E | F |
+---+----------------+----------------------------------+
No. of bits:   1   8  (biased-127)    23    single-precision (float in C)
               1   11 (biased-1023)   52    double-precision (double in C)
S is the sign bit. If it is 1, the number is negative; if it is 0, the number is positive.
E is the exponent in biased-127 (1023 for double) notation. "Biased" just means that a constant is
added to the actual exponent before it is stored. This is so that the larger numbers will
have larger E values (unsigned) and simplifies the hardware for floating point number comparisons.
F is the fractional part of the number after normalization. The 1 to the left of the binary
point (as in 1.xxxx...) is implicit: since it is always there, it is not stored. Only the digits to the
right of the binary point are stored. This implicit bit is called the hidden bit.
Converting to and from IEEE floats
To IEEE representation: example 12.3
1. Break into integer and fractional parts 12 and 0.3
2. Integer part 12 = 1100
3. Fractional part 0.3 = 0.0100110011001...
4. Put them together 12.3 = 1100.0100110011001...
5. Normalize 12.3 = 1.1000100110011001... x 2^3
6.
S = 0 (positive)
E = 3 + 127 = 130 = 1000 0010 (base 2)
F = 1000 1001 1001 1001 1001 101 (23 bits only and rounding up at the end)
7. Result = 0 1000 0010 1000 1001 1001 1001 1001 101
From IEEE representation:
1. We already know the sign is positive and the exponent is 130 - 127 = 3.
2. We have 1.10001001100110011001101 x 2^3 = 1100.01001100110011001101
3. Integer part = 12
4. Fractional part = 2^(-2)+2^(-5)+2^(-6)+2^(-9)+2^(-10)+2^(-13)+2^(-14)+2^(-17)+2^(-18)+2^(-20) = 0.3000001907348633
5. Result = 12.3000001907348633
Therefore, what is stored may not be exact. This is why the numbers you put in a float variable may not display the same way when printed back. Because of this, you shouldn't use floats as loop counters: there's no guarantee the loop will even terminate!
Special Values
As mentioned earlier, there is no way to represent 0 using the "normal" method because there is no leading 1 for the hidden bit. What IEEE representation does here is to reserve special values for zero and for positive and negative infinity, and for NaN ("Not a Number" - when the result is undefined, e.g. when you divide zero by zero).
zero        E=F=0                      Note that S is not specified,
                                       so we can have +0 and -0.
+infinity   S=0, E=255 (all ones), F=0
-infinity   S=1, E=255 (all ones), F=0
NaN         E=255, F != 0
Note that the special values do not just take away 4 bit strings but a whole chunk of them. For example,
zero takes up the entire block of bit strings with E=0. Thus the total number of unique numbers we can
represent with the IEEE representation is less than the 2^n bit strings we have.
What's the range?
Smallest +ve normalized number: 1.0 x 2^(-126)
smallest exponent = -126 (E=00000001=1) Remember E=0 is already reserved for zero
Largest +ve number: (2 - 2^(-23)) x 2^127, or about 3.4 x 10^38
largest exponent = +127 (E=11111110=254) Remember E=255 is already reserved for NaN, +/- infinities
The same idea goes for negative ranges.
Overflow/Underflow
Overflow occurs when the exponent gets too large.
Underflow is the reverse - it occurs when the exponent is too small, i.e. the number is too close to zero. For example, underflow can occur when you repeatedly divide a number by 2; eventually you'll get zero. Until underflow occurs, you can get back the original number by repeatedly multiplying by two. Note that dividing by 2 changes only the exponent and not the fractional part.
There is a partial solution to the underflow problem. We can represent even smaller numbers by denormalizing the representation, where we define the hidden bit to be 0 instead of 1 when E=0. This method in effect uses the chunk of bit strings that was taken up by the representation of zero, so we can now represent more numbers than was previously possible. You don't need to know about denormalization techniques; it's just a possible solution and is not implemented everywhere.
Written by : Gek Siong Low (cs61c-tb), Spring 1999