FLOATING POINT ROUNDING ERROR

Ravindran Kugan
May 8, 2021

Let's take a look at the following Java code block.

for (float a = 5; a != 0.0; a -= 0.1)
{
    System.out.println(a);
}

If this code block is run, the expected output is the values from 5.0 down to 0.1, with the loop running 50 times. But the actual output looks like this.

5.0
4.9
4.8
4.7000003
4.6000004
4.5000005
4.4000006
4.3000007
4.200001
4.100001
4.000001
3.900001
3.8000011
3.7000012
3.6000013
3.5000014
3.4000015
3.3000016
3.2000017
3.1000018
3.000002
2.900002
2.800002
2.7000022
2.6000023
2.5000024
2.4000025
2.3000026
2.2000027
2.1000028
2.0000029
1.9000028
1.8000028
1.7000028
1.6000028
1.5000027
1.4000027
1.3000027
1.2000027
1.1000026
1.0000026
0.9000026
0.8000026
0.70000255
0.6000025
0.5000025
0.4000025
0.30000252
0.20000252
0.10000252
2.5197864E-6
-0.09999748
-0.19999748
-0.29999748
-0.39999747
-0.49999747
-0.59999746
-0.6999975
-0.7999975
-0.89999753
-0.99999756
-1.0999975
-1.1999975
-1.2999976
-1.3999976
-1.4999976
-1.5999976
-1.6999977
-1.7999977
-1.8999977
-1.9999977
...

As seen from the output, the program never reaches exactly 0.0. The value steps over zero (2.5197864E-6 is followed by -0.09999748) and keeps decreasing endlessly. The reason for this is the IEEE 754 floating point representation, which is explained in the following section.
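The drift can be observed without an endless loop by capping the number of subtractions. The following is only a minimal sketch: the class name FloatDrift and the method valueAfterSteps are made up for this illustration and are not part of the original code.

```java
final class FloatDrift {
    // Subtracts 0.1 from 5 a fixed number of times, using the same
    // compound assignment as the loop in the article.
    static float valueAfterSteps(int steps) {
        float a = 5;
        for (int i = 0; i < steps; i++) {
            a -= 0.1; // computed in double, then narrowed back to float
        }
        return a;
    }

    public static void main(String[] args) {
        float left = valueAfterSteps(50);
        // Mathematically 5 - 50 * 0.1 is 0, but a small remainder is left over.
        System.out.println(left);
        System.out.println("exactly zero? " + (left == 0.0f));
    }
}
```

Because the remainder after 50 steps is tiny but not zero, a condition such as a != 0.0 never becomes false.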

IEEE 754 floating point standard

To represent a floating point number, IEEE 754 divides it into three parts.

1. Sign bit: This bit tells whether the number is positive or negative. 0 means positive and 1 means negative.

2. Exponent: In single precision the exponent is stored in 8 bits, with a bias of 127 added to it. If the value is 12.12, which is 1.515 x 2³, the stored (biased) exponent is 127 + 3 = 130. (This will be explained in detail in the upcoming sections.)

3. Mantissa: The mantissa holds the fraction of the number written in base 2 scientific notation, that is, the bits that come after the binary point.

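According to IEEE 754, floating point numbers are laid out as follows.

IEEE 754 Single Precision: 1 sign bit, 8 exponent bits, 23 mantissa bits (32 bits in total).
IEEE 754 Double Precision: 1 sign bit, 11 exponent bits, 52 mantissa bits (64 bits in total).
IEEE 754 Long Double Precision: a wider extended format whose exact layout depends on the platform.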
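The three parts of a single precision value can be pulled apart in Java with Float.floatToIntBits. This is a minimal sketch for illustration; the class FloatFields and its helper methods are hypothetical names, and 0.15625 is used because it is exactly 1.01 x 2⁻³ in binary.

```java
final class FloatFields {
    // Bit 31 is the sign bit.
    static int signBit(float f) {
        return Float.floatToIntBits(f) >>> 31;
    }

    // Bits 30..23 hold the exponent with the bias of 127 already added.
    static int biasedExponent(float f) {
        return (Float.floatToIntBits(f) >>> 23) & 0xFF;
    }

    // Bits 22..0 hold the mantissa (the leading 1 is not stored).
    static int mantissaBits(float f) {
        return Float.floatToIntBits(f) & 0x7FFFFF;
    }

    // Left-pads the binary form of value with zeros up to the given width.
    static String toBinary(int value, int width) {
        String s = Integer.toBinaryString(value);
        StringBuilder sb = new StringBuilder();
        for (int i = s.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(s).toString();
    }

    static String layout(float f) {
        return signBit(f) + " | " + toBinary(biasedExponent(f), 8)
                + " | " + toBinary(mantissaBits(f), 23);
    }

    public static void main(String[] args) {
        System.out.println(layout(0.15625f));  // 0 | 01111100 | 01000000000000000000000
        System.out.println(layout(-0.15625f)); // 1 | 01111100 | 01000000000000000000000
    }
}
```

The exponent field 01111100 is 124, which is -3 + 127, and only the sign bit differs between the two lines.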

Now let's take the example number 12.7 and convert it into the IEEE 754 single precision format.

First we will convert 12 into binary.

12 -> 1100

Now let's take the .7 and convert it into binary.

0.7 -> 0.1011001100110011001100110011....

After the first bit the group 0110 keeps repeating forever, so for the moment let's stop at this point.

Now the number 12.7 can be written like this.

12.7 -> 1100.1011001100110011001100110011....

In scientific notation the binary point is moved so that it comes right after the leftmost 1.

12.7 -> 1.1001011001100110011001100110011.... x 2³

To get the biased exponent we add the 3 from 2³ to 127 (the single precision bias), which gives us 130. 130 in binary is this.

10000010

Next is the mantissa. In single precision the mantissa has 23 bits. We can leave out the 1 that comes before the binary point, because in binary scientific notation the point always comes after a 1.

10010110011001100110011

This is a positive number so the sign bit will be 0. With all this information the IEEE 754 representation of 12.7 is this.

0 | 10000010 | 10010110011001100110011

The true mantissa has more than 23 bits, so the computer has to round it. It looks at the bits from the 24th onward, which here are 0011001100..... If the discarded bits amount to less than half of the 23rd bit, the mantissa is simply cut off. If they amount to more than half, 1 is added to the 23rd bit (an exact tie goes to the neighbour whose last bit is 0). Here the 24th bit is 0, so nothing is added. Either way the stored number is no longer exactly 12.7, which shows up when converting back to decimal.

Let's convert this value back to decimal so we can see how the value has changed.

1.10010110011001100110011 x 2³ is the value we have to convert. (The leading 1 that was left out during the conversion has been put back.) After moving the point 3 bits to the right this is 1100.10110011001100110011. Now if we convert (2³x1+2²x1+2⁻¹x1+2⁻³x1+2⁻⁴x1+2⁻⁷x1+2⁻⁸x1+2⁻¹¹x1+2⁻¹²x1+2⁻¹⁵x1+2⁻¹⁶x1+2⁻¹⁹x1+2⁻²⁰x1) the value is 12.69999980926513671875, slightly less than 12.7.
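A hand conversion like this can be checked by asking the JVM what it actually stores for 12.7f. The sketch below is only an illustration; the class StoredValueCheck and its method names are assumptions made up for this example.

```java
import java.math.BigDecimal;

final class StoredValueCheck {
    // Widening float to double is exact, and BigDecimal(double) expands the
    // binary value digit for digit, so this is the exact stored value.
    static String exactDecimal(float f) {
        return new BigDecimal(f).toString();
    }

    // The raw 32 bits of the float, as hexadecimal.
    static String hexBits(float f) {
        return Integer.toHexString(Float.floatToIntBits(f));
    }

    public static void main(String[] args) {
        System.out.println(hexBits(12.7f));      // 414b3333
        System.out.println(exactDecimal(12.7f)); // 12.69999980926513671875
    }
}
```

In binary, 414b3333 is 0 | 10000010 | 10010110011001100110011.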

This is also the reason why we get those values in the first coding example: 0.1 cannot be stored exactly in binary either, so every subtraction carries a small error.

To overcome this problem programmers can use the BigDecimal class in the java.math package.

BigDecimal

BigDecimal has multiple constructors. An int value, a String value or a double value can be passed as the parameter to create a BigDecimal object.

BigDecimal(double val)
BigDecimal(int val)
BigDecimal(String val)

Now let's say you create the BigDecimal objects by passing in numeric values, and then subtract 0.2 from 5.

import java.math.BigDecimal;

BigDecimal a = new BigDecimal(5);
BigDecimal b = new BigDecimal(0.2); // double literal: already carries the binary rounding error
a = a.subtract(b);
System.out.println(a.toString());

This is the output that you will be getting.

4.799999999999999988897769753748434595763683319091796875

The reason for this is that the double literal 0.2 is already inexact before BigDecimal sees it, and the BigDecimal(double) constructor keeps that inexact binary value digit for digit when no MathContext is passed. By passing a MathContext we can specify the precision (the number of significant digits) to round to, and also the rounding mode to use (in this instance the rounding mode is not specified, so the default HALF_UP is used).

import java.math.BigDecimal;
import java.math.MathContext;

BigDecimal a = new BigDecimal(5);
BigDecimal b = new BigDecimal(0.2, new MathContext(1)); // rounded to 1 significant digit: 0.2
a = a.subtract(b);
System.out.println(a.toString());

With this the output will be 4.8.

Note: MathContext is needed only on object b because b is the only value that starts out inexact. The MathContext rounds b to exactly 0.2 when it is constructed. A BigDecimal object does not keep the MathContext afterwards, and a.subtract(b) is then an exact decimal subtraction.
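The effect of the precision setting, and the fact that it applies only at construction time, can be seen in a small experiment. This is a minimal sketch under assumptions: MathContextDemo and its methods are hypothetical names, and the values 123.456 and 5.123 are chosen only for illustration.

```java
import java.math.BigDecimal;
import java.math.MathContext;

final class MathContextDemo {
    // Precision counts significant digits, not digits after the decimal point.
    static BigDecimal roundedAtConstruction(String text, int precision) {
        return new BigDecimal(text, new MathContext(precision));
    }

    // b is rounded to 0.2 once, when it is created. The subtraction itself
    // is exact and does not inherit b's MathContext.
    static BigDecimal subtractRounded(String minuend) {
        BigDecimal b = new BigDecimal(0.2, new MathContext(1));
        return new BigDecimal(minuend).subtract(b);
    }

    public static void main(String[] args) {
        System.out.println(roundedAtConstruction("123.456", 4)); // 123.5
        System.out.println(subtractRounded("5.123"));            // 4.923
    }
}
```

The second result keeps all three decimal places of 5.123, which would not be the case if a precision of 1 were carried over into the subtraction.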

If passing in a MathContext seems like too much work, passing the value as a String will simply solve the issue: BigDecimal b = new BigDecimal("0.2");. The reason is that the String constructor stores exactly the decimal digits that were written, with the scale (the number of digits after the decimal point) taken from the string. In this instance it will be 1.
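The difference between the ways of creating the value 0.2 can be compared side by side. The sketch below is illustrative only; ConstructorComparison is a made-up class name. BigDecimal.valueOf(double) is an additional standard factory method, not mentioned above, that goes through the double's shortest decimal string form.

```java
import java.math.BigDecimal;

final class ConstructorComparison {
    // Keeps the exact binary value of the double 0.2.
    static String fromDouble() {
        return new BigDecimal(0.2).toString();
    }

    // Keeps exactly the digits written in the string.
    static String fromString() {
        return new BigDecimal("0.2").toString();
    }

    // Converts via Double.toString(0.2), which is "0.2".
    static String fromValueOf() {
        return BigDecimal.valueOf(0.2).toString();
    }

    public static void main(String[] args) {
        System.out.println(fromDouble());  // 0.200000000000000011102230246251565404236316680908203125
        System.out.println(fromString());  // 0.2
        System.out.println(fromValueOf()); // 0.2
    }
}
```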

As seen from the above example, BigDecimal has its own arithmetic operations. The following methods can be used for arithmetic calculations. (Consider bigDecimal as the object that was created.) Each method returns a new BigDecimal and leaves the original objects unchanged.

bigDecimal.add(BigDecimal obj) -> adds the two values.
bigDecimal.subtract(BigDecimal obj) -> subtracts the obj value from the bigDecimal value.
bigDecimal.multiply(BigDecimal obj) -> multiplies the two values.
bigDecimal.divide(BigDecimal obj) -> divides the bigDecimal value by the obj value.
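The four operations can be tried on the values 5 and 0.2. One detail worth knowing is that divide(BigDecimal) throws an ArithmeticException when the exact quotient has a non-terminating decimal expansion, such as 1 / 3; the overload that takes a scale and a RoundingMode avoids this. The following is a minimal sketch with made-up names (ArithmeticDemo, exactDivideFails, roundedThird).

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

final class ArithmeticDemo {
    static final BigDecimal A = new BigDecimal("5");
    static final BigDecimal B = new BigDecimal("0.2");
    static final BigDecimal THREE = new BigDecimal("3");

    // 1 / 3 has no exact decimal representation, so the exact divide fails.
    static boolean exactDivideFails() {
        try {
            BigDecimal.ONE.divide(THREE);
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    // Asking for 4 decimal places with an explicit rounding mode succeeds.
    static BigDecimal roundedThird() {
        return BigDecimal.ONE.divide(THREE, 4, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        System.out.println(A.add(B));      // 5.2
        System.out.println(A.subtract(B)); // 4.8
        System.out.println(A.multiply(B));
        System.out.println(A.divide(B));
        System.out.println(exactDivideFails()); // true
        System.out.println(roundedThird());     // 0.3333
    }
}
```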

The BigDecimal class also has methods to compare two BigDecimal values. The two methods below can be used for comparisons.

BigDecimal a = new BigDecimal("5");
BigDecimal b = new BigDecimal("0.2");
boolean val = a.equals(b);
int res = a.compareTo(b);

The equals() method returns true only if the two objects have the same value and the same scale, so 5.0 and 5.00 are not equal according to equals(). The compareTo() method compares only the numeric value: it returns 1 if the calling object (a in this instance) is bigger, -1 if the calling object is smaller and 0 if both objects have equal values. For the above example val = false and res = 1.
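One point that is easy to miss is that equals() also takes the scale into account, while compareTo() looks only at the numeric value. The sketch below illustrates this; ComparisonDemo and its two helper methods are hypothetical names used only here.

```java
import java.math.BigDecimal;

final class ComparisonDemo {
    // equals() is true only when both value and scale match.
    static boolean sameByEquals(String x, String y) {
        return new BigDecimal(x).equals(new BigDecimal(y));
    }

    // compareTo() ignores the scale and compares numeric values.
    static int compare(String x, String y) {
        return new BigDecimal(x).compareTo(new BigDecimal(y));
    }

    public static void main(String[] args) {
        System.out.println(sameByEquals("5.0", "5.00")); // false
        System.out.println(compare("5.0", "5.00"));      // 0
        System.out.println(compare("5", "0.2"));         // 1
        System.out.println(compare("0.2", "5"));         // -1
    }
}
```

For checks such as "has the value reached zero", compareTo() is usually the safer choice for this reason.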

More details about the BigDecimal class can be found in its official Java API documentation.

Now before I end this article let's rewrite the original problem using the BigDecimal class and get the correct output.

import java.math.BigDecimal;

BigDecimal a = new BigDecimal("5.0"); // "5.0" rather than "5" so the first line prints as 5.0
BigDecimal b = new BigDecimal("0.1");
BigDecimal c = new BigDecimal("0.0");
for (; !a.equals(c); a = a.subtract(b))
{
    System.out.println(a);
}
System.out.println("Loop Successfully Finished");

Note: Instead of the equals() check, a.compareTo(c) != 0 can also be used. Unlike equals(), it ignores the scale, so it still works if c is written as "0" or "0.00". And remember, as we are passing String values, it is not necessary to specify the precision using MathContext.

The output will be this.

5.0
4.9
4.8
4.7
4.6
4.5
4.4
4.3
4.2
4.1
4.0
3.9
3.8
3.7
3.6
3.5
3.4
3.3
3.2
3.1
3.0
2.9
2.8
2.7
2.6
2.5
2.4
2.3
2.2
2.1
2.0
1.9
1.8
1.7
1.6
1.5
1.4
1.3
1.2
1.1
1.0
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1

This time we will be getting the expected output.

When to use BigDecimal

Now that you know about BigDecimal, is it necessary to use BigDecimal for all floating point operations? The answer is no. BigDecimal should be used when dealing with critical operations where exact decimal results matter. Advanced physics calculations and accurate accounting are a couple of examples. For other normal operations that do not require exact values, float or double can be used.

These videos uploaded by Krishantha Dinesh were a huge help in creating this article.

These are the other references that are used to write this blog.

Written by Ravindran Kugan

Associate Software Engineer at Virtusa
