Why does changing the sum order returns a different result?

java javascript floating-point

23.53 + 5.88 + 17.64 = 47.05

23.53 + 17.64 + 5.88 = 47.050000000000004

Both Java and JavaScript return the same results.

I understand that, due to the way floating point numbers are represented in binary, some rational numbers (like 1/3 - 0.333333...) cannot be represented precisely.

Why does simply changing the order of the elements affect the result?

Sum of real numbers is associative and commutative. Floating-points aren't real numbers. In fact you just proved that their operations are not commutative. It's pretty easy to show that they aren't associative too (e.g. (2.0^53 + 1) - 1 == 2.0^53 - 1 != 2^53 == 2^53 + (1 - 1)). Hence, yes: be wary when choosing the order of sums and other operations. Some languages provide a built-in to perform "high-precision" sums (e.g. python's math.fsum), so you might consider using these functions instead of the naive sum algorithm.

@RBerteig That can be determined by examining the language's order of operations for arithmetic expressions and, unless their representation of floating point numbers in memory is different, the results will be the same if their operator precedence rules are the same. Another point of note: I wonder how long it took the devs who develop banking applications to figure this out? Those extra 0000000000004 cents really add up!

@ChrisCirefice: if you have 0.00000004 cents, you're doing it wrong. You should never use a binary floating point type for financial calculations.

@DanielPryden Ah alas, it was a joke... just throwing around the idea that people who really need to get this type of problem resolved had one of the most important jobs that you know, holds the monetary status of the people and all that. I was being very sarcastic...

Very dry (and old, but still relevant): What Every Computer Scientist Should Know About Floating-Point Arithmetic

Community

Maybe this question is stupid, but why does simply changing the order of the elements affects the result?

It will change the points at which the values are rounded, based on their magnitude. As an example of the kind of thing that we're seeing, let's pretend that instead of binary floating point, we were using a decimal floating point type with 4 significant digits, where each addition is performed at "infinite" precision and then rounded to the nearest representable number. Here are two sums:

1/3 + 2/3 + 2/3 = (0.3333 + 0.6667) + 0.6667
                = 1.000 + 0.6667 (no rounding needed!)
                = 1.667 (where 1.6667 is rounded to 1.667)

2/3 + 2/3 + 1/3 = (0.6667 + 0.6667) + 0.3333
                = 1.333 + 0.3333 (where 1.3334 is rounded to 1.333)
                = 1.666 (where 1.6663 is rounded to 1.666)

We don't even need non-integers for this to be a problem:

10000 + 1 - 10000 = (10000 + 1) - 10000
                  = 10000 - 10000 (where 10001 is rounded to 10000)
                  = 0

10000 - 10000 + 1 = (10000 - 10000) + 1
                  = 0 + 1
                  = 1

This demonstrates possibly more clearly that the important part is that we have a limited number of significant digits - not a limited number of decimal places. If we could always keep the same number of decimal places, then with addition and subtraction at least, we'd be fine (so long as the values didn't overflow). The problem is that when you get to bigger numbers, smaller information is lost - the 10001 being rounded to 10000 in this case. (This is an example of the problem that Eric Lippert noted in his answer.)

It's important to note that the values on the first line of the right hand side are the same in all cases - so although it's important to understand that your decimal numbers (23.53, 5.88, 17.64) won't be represented exactly as double values, that's only a problem because of the problems shown above.

May extend this later - out of time right now! waiting eagerly for it @Jon

when I say that I will get back to an answer later the community is slightly less kind to me ... will get back to this later.

@ZongZhengLi: While it's certainly important to understand that, it's not the root cause in this case. You could write a similar example with values which are represented exactly in binary, and see the same effect. The problem here is maintaining large scale information and small scale information at the same time.

@Buksy: Rounded to 10000 - because we're dealing with a datatype which can only store 4 significant digits. (so x.xxx * 10^n)

@meteors: No, it doesn't cause an overflow - and you're using the wrong numbers. It's 10001 being rounded to 10000, not 1001 being rounded to 1000. To make it clearer, 54321 would be rounded to 54320 - because that only has four significant digits. There's a big difference between "four significant digits" and "a maximum value of 9999". As I said before, you're basically representing x.xxx * 10^n, where for 10000, x.xxx would be 1.000, and n would be 4. This is just like double and float, where for very large numbers, consecutive representable numbers are more than 1 apart.

rgettman

Here's what's going on in binary. As we know, some floating-point values cannot be represented exactly in binary, even if they can be represented exactly in decimal. These 3 numbers are just examples of that fact.

With this program I output the hexadecimal representations of each number and the results of each addition.

public class Main{
   public static void main(String args[]) {
      double x = 23.53;   // Inexact representation
      double y = 5.88;    // Inexact representation
      double z = 17.64;   // Inexact representation
      double s = 47.05;   // What math tells us the sum should be; still inexact

      printValueAndInHex(x);
      printValueAndInHex(y);
      printValueAndInHex(z);
      printValueAndInHex(s);

      System.out.println("--------");

      double t1 = x + y;
      printValueAndInHex(t1);
      t1 = t1 + z;
      printValueAndInHex(t1);

      System.out.println("--------");

      double t2 = x + z;
      printValueAndInHex(t2);
      t2 = t2 + y;
      printValueAndInHex(t2);
   }

   private static void printValueAndInHex(double d)
   {
      System.out.println(Long.toHexString(Double.doubleToLongBits(d)) + ": " + d);
   }
}

The printValueAndInHex method is just a hex-printer helper.

The output is as follows:

403787ae147ae148: 23.53
4017851eb851eb85: 5.88
4031a3d70a3d70a4: 17.64
4047866666666666: 47.05
--------
403d68f5c28f5c29: 29.41
4047866666666666: 47.05
--------
404495c28f5c28f6: 41.17
4047866666666667: 47.050000000000004

The first 4 numbers are x, y, z, and s's hexadecimal representations. In IEEE floating point representation, bits 2-12 represent the binary exponent, that is, the scale of the number. (The first bit is the sign bit, and the remaining bits for the mantissa.) The exponent represented is actually the binary number minus 1023.

The exponents for the first 4 numbers are extracted:

    sign|exponent
403 => 0|100 0000 0011| => 1027 - 1023 = 4
401 => 0|100 0000 0001| => 1025 - 1023 = 2
403 => 0|100 0000 0011| => 1027 - 1023 = 4
404 => 0|100 0000 0100| => 1028 - 1023 = 5

First set of additions

The second number (y) is of smaller magnitude. When adding these two numbers to get x + y, the last 2 bits of the second number (01) are shifted out of range and do not figure into the calculation.

The second addition adds x + y and z and adds two numbers of the same scale.

Second set of additions

Here, x + z occurs first. They are of the same scale, but they yield a number that is higher up in scale:

404 => 0|100 0000 0100| => 1028 - 1023 = 5

The second addition adds x + z and y, and now 3 bits are dropped from y to add the numbers (101). Here, there must be a round upwards, because the result is the next floating point number up: 4047866666666666 for the first set of additions vs. 4047866666666667 for the second set of additions. That error is significant enough to show in the printout of the total.

In conclusion, be careful when performing mathematical operations on IEEE numbers. Some representations are inexact, and they become even more inexact when the scales are different. Add and subtract numbers of similar scale if you can.

The scales being different is the important part. You could write (in decimal) the exact values that are being represented in binary as the inputs, and still have the same problem.

@rgettman As a programmer, I like your answer better =) +1 for your hex-printer helper... that's really neat!

Eric Lippert

Jon's answer is of course correct. In your case the error is no larger than the error you would accumulate doing any simple floating point operation. You've got a scenario where in one case you get zero error and in another you get a tiny error; that's not actually that interesting a scenario. A good question is: are there scenarios where changing the order of calculations goes from a tiny error to a (relatively) enormous error? The answer is unambiguously yes.

Consider for example:

x1 = (a - b) + (c - d) + (e - f) + (g - h);

x2 = (a + c + e + g) - (b + d + f + h);

x3 = a - b + c - d + e - f + g - h;

Obviously in exact arithmetic they would be the same. It is entertaining to try to find values for a, b, c, d, e, f, g, h such that the values of x1 and x2 and x3 differ by a large quantity. See if you can do so!

How do you define a large quantity? Are we talking on the order of 1000ths? 100ths? 1's???

@Cruncher: Compute the exact mathematical result and the x1 and x2 values. Call the exact mathematical difference between the true and computed results e1 and e2. There are now several ways to think about error size. The first is: can you find a scenario in which either | e1 / e2 | or | e2 / e1 | are large? Like, can you make the error of one ten times the error of the other? The more interesting one though is if you can make the error of one a significant fraction of the size of the correct answer.

I realize he's talking about runtime, but I wonder: If the expression was a compile-time (say, constexpr) expression, are compilers smart enough to minimize the error?

@kevinhsu in general no, the compiler is not that smart. Of course the compiler could choose to do the operation in exact arithmetic if it so chose, but it usually does not.

@frozenkoi: Yes, the error can be infinite very easily. For example, consider the C#: double d = double.MaxValue; Console.WriteLine(d + d - d - d); Console.WriteLine(d - d + d - d); - the output is Infinity then 0.

Compass

This actually covers much more than just Java and Javascript, and would likely affect any programming language using floats or doubles.

In memory, floating points use a special format along the lines of IEEE 754 (the converter provides much better explanation than I can).

Anyways, here's the float converter.

http://www.h-schmidt.net/FloatConverter/

The thing about the order of operations is the "fineness" of the operation.

Your first line yields 29.41 from the first two values, which gives us 2^4 as the exponent.

Your second line yields 41.17 which gives us 2^5 as the exponent.

We're losing a significant figure by increasing the exponent, which is likely to change the outcome.

Try ticking the last bit on the far right on and off for 41.17 and you can see that something as "insignificant" as 1/2^23 of the exponent would be enough to cause this floating point difference.

Edit: For those of you who remember significant figures, this would fall under that category. 10^4 + 4999 with a significant figure of 1 is going to be 10^4. In this case, the significant figure is much smaller, but we can see the results with the .00000000004 attached to it.

jbx

Floating point numbers are represented using the IEEE 754 format, which provides a specific size of bits for the mantissa (significand). Unfortunately this gives you a specific number of 'fractional building blocks' to play with, and certain fractional values cannot be represented precisely.

What is happening in your case is that in the second case, the addition is probably running into some precision issue because of the order the additions are evaluated. I haven't calculated the values, but it could be for example that 23.53 + 17.64 cannot be precisely represented, while 23.53 + 5.88 can.

Unfortunately it is a known problem that you just have to deal with.

hotforfeature

I believe it has to do with the order of evaulation. While the sum is naturally the same in a math world, in the binary world instead of A + B + C = D, it's

A + B = E
E + C = D(1)

So there's that secondary step where floating point numbers can get off.

When you change the order,

A + C = F
F + B = D(2)

I think this answer avoids the real reason. "there's that secondary step where floating point numbers can get off". Clearly, this is true, but what we want to explain is why.

Richard

To add a different angle to the other answers here, this SO answer shows that there are ways of doing floating-point math where all summation orders return exactly the same value at the bit level.

Why does changing the sum order returns a different result?

Follow WeChat

Want to stay one step ahead of the latest teleworks?

相似问题

Platform

Support

Links

Contact US