For clarity, if I'm using a language that implements IEEE 754 floats and I declare:
float f0 = 0.f;
float f1 = 1.f;
...and then print them back out, I'll get 0.0000 and 1.0000 - exactly.
But IEEE 754 isn't capable of representing all the numbers along the real line. Close to zero, the 'gaps' are small; as you get further away, the gaps get larger.
So, my question is: for an IEEE 754 float, what is the first (closest to zero) integer that cannot be exactly represented? I'm only really concerned with 32-bit floats for now, although I'll be interested to hear the answer for 64-bit if someone gives it!
I thought this would be as simple as calculating 2^bits_of_mantissa and adding 1, where bits_of_mantissa is how many bits the standard exposes. I did this for 32-bit floats on my machine (MSVC++, Win64), and it seemed fine.
2^(mantissa bits + 1) + 1
The +1 in the exponent (mantissa bits + 1) is because, if the mantissa contains abcdef..., the number it represents is actually 1.abcdef... × 2^e, providing an extra implicit bit of precision.
Therefore, the first integer that cannot be accurately represented and will be rounded is:
For float, 16,777,217 (2^24 + 1).
For double, 9,007,199,254,740,993 (2^53 + 1).
>>> 9007199254740993.0
9007199254740992.0
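A quick C++ check of the same boundary (a minimal sketch; it assumes IEEE 754 binary32/binary64 types and the usual round-to-nearest default, so the expected output is shown in the comments):

#include <cstdio>

int main() {
    float f = 16777217.0f;          // 2^24 + 1: not representable as float, rounds to 2^24
    double d = 9007199254740993.0;  // 2^53 + 1: not representable as double, rounds to 2^53

    std::printf("%.1f\n", f);       // expected: 16777216.0
    std::printf("%.1f\n", d);       // expected: 9007199254740992.0
    return 0;
}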
The largest value representable by an n-bit integer is 2^n − 1. As noted above, a float has 24 bits of precision in the significand, which would seem to imply that 2^24 wouldn't fit.
However.
Powers of 2 within the range of the exponent are exactly representable as 1.0 × 2^n, so 2^24 can fit and consequently the first unrepresentable integer for float is 2^24 + 1. As noted above. Again.
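A small sketch confirming that boundary (the helper name is illustrative): a value is exactly representable as float precisely when it survives a double → float → double round trip.

#include <cstdio>

// Illustrative helper: does the value survive a double -> float -> double round trip?
static bool fits_in_float(double value) {
    return static_cast<double>(static_cast<float>(value)) == value;
}

int main() {
    std::printf("2^24     fits: %d\n", fits_in_float(16777216.0)); // 1: exactly representable
    std::printf("2^24 + 1 fits: %d\n", fits_in_float(16777217.0)); // 0: rounds to 2^24
    return 0;
}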
I declared a float and set it equal to 16,777,217. But when I printed it using cout, it resulted in 16,777,216. I'm using C++. Why can't I get 16,777,217?

In C++ this can be computed as (1 << std::numeric_limits<float>::digits) + 1, and in C as (1 << FLT_MANT_DIG) + 1. The former is nice because it can be part of a template. Don't add the +1 if you just want the largest representable integer.
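For example, a sketch of the template form described in that comment (the function name is made up for illustration; note the shift needs a 64-bit type once you ask about double, since 1 << 53 overflows a plain int):

#include <cfloat>
#include <cstdio>
#include <limits>

// Hypothetical helper: first integer the type cannot represent exactly,
// i.e. 2^(significand bits) + 1.
template <typename T>
constexpr long long first_unrepresentable() {
    return (1LL << std::numeric_limits<T>::digits) + 1;
}

int main() {
    std::printf("float : %lld\n", first_unrepresentable<float>());   // 16777217
    std::printf("double: %lld\n", first_unrepresentable<double>());  // 9007199254740993
    std::printf("C macro: %lld\n", (1LL << FLT_MANT_DIG) + 1);       // 16777217
    return 0;
}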