What range of numbers can be represented in 16-, 32- and 64-bit IEEE-754 systems?

floating-point precision numerical ieee-754

I know a little bit about how floating-point numbers are represented, but not enough, I'm afraid.

The general question is:

For a given precision (for my purposes, the number of accurate decimal places in base 10), what range of numbers can be represented for 16-, 32- and 64-bit IEEE-754 systems?

Specifically, I'm only interested in the range of 16-bit and 32-bit numbers accurate to +/-0.5 (the ones place) or +/- 0.0005 (the thousandths place).

@bendin: Yes, it exists. en.wikipedia.org/wiki/Half_precision_floating-point_format

Related: Is floating point precision mutable or invariant?

@bendin Floats of 8 bits or fewer also exist and are often taught in computer science curricula. They are also used in ARM instruction encoding. 10-, 11- and 14-bit floats exist as well.

gnovice

For a given IEEE-754 floating point number X, if

2^E <= abs(X) < 2^(E+1)

then the distance from X to the next larger (in magnitude) representable floating-point number (epsilon) is:

epsilon = 2^(E-52)    % For a 64-bit float (double precision)
epsilon = 2^(E-23)    % For a 32-bit float (single precision)
epsilon = 2^(E-10)    % For a 16-bit float (half precision)
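
As a rough check of these formulas, here is a minimal Python sketch. Python is my choice for illustration, and the helper name `spacing` and its `fraction_bits` parameter are made up for this example. It assumes normal (not subnormal) nonzero finite inputs, and `math.ulp` requires Python 3.9 or later.

```python
import math

# Fraction (stored mantissa) widths of the three formats discussed above.
FRACTION_BITS = {"half": 10, "single": 23, "double": 52}

def spacing(x, fraction_bits):
    """Gap between representable values near x (normal range only)."""
    # frexp returns (m, e) with x = m * 2**e and 0.5 <= |m| < 1, so E = e - 1.
    _, e = math.frexp(x)
    return 2.0 ** (e - 1 - fraction_bits)

for name, bits in FRACTION_BITS.items():
    print(name, spacing(1000.0, bits))

# For doubles the result should agree with Python's built-in ulp.
assert spacing(1000.0, 52) == math.ulp(1000.0)
```

The half- and single-precision values are computed from the formula only; Python itself always works in double precision here.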

The above equations allow us to compute the following:

For half precision... If you want an accuracy of +/-0.5 (or 2^-1), the maximum size that the number can be is 2^10. Any larger than this and the distance between floating point numbers is greater than 0.5. If you want an accuracy of +/-0.0005 (about 2^-11), the maximum size that the number can be is 1. Any larger than this and the distance between floating point numbers is greater than 0.0005.

For single precision... If you want an accuracy of +/-0.5 (or 2^-1), the maximum size that the number can be is 2^23. Any larger than this and the distance between floating point numbers is greater than 0.5. If you want an accuracy of +/-0.0005 (about 2^-11), the maximum size that the number can be is 2^13. Any larger than this and the distance between floating point numbers is greater than 0.0005.

For double precision... If you want an accuracy of +/-0.5 (or 2^-1), the maximum size that the number can be is 2^52. Any larger than this and the distance between floating point numbers is greater than 0.5. If you want an accuracy of +/-0.0005 (about 2^-11), the maximum size that the number can be is 2^42. Any larger than this and the distance between floating point numbers is greater than 0.0005.
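
The limits in the three paragraphs above can be reproduced with a short Python sketch. The function name `max_magnitude` is hypothetical, and the sketch assumes the tolerance is rounded down to a power of two (so 0.0005 is treated as 2^-11), as in the text.

```python
import math

def max_magnitude(tolerance, fraction_bits):
    """Power of two below which the spacing stays within `tolerance`."""
    # Largest k with 2**k <= tolerance, computed exactly via frexp.
    _, e = math.frexp(tolerance)
    k = e - 1
    # Spacing in [2**E, 2**(E+1)) is 2**(E - fraction_bits); require it <= 2**k.
    largest_E = k + fraction_bits
    return 2.0 ** (largest_E + 1)

for name, bits in (("half", 10), ("single", 23), ("double", 52)):
    print(name, max_magnitude(0.5, bits), max_magnitude(0.0005, bits))
```

For half precision this gives 1024 and 1, matching the 2^10 and 1 quoted above; the single- and double-precision rows come out as 2^23, 2^13, 2^52 and 2^42.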

In terms of meters, this means that, at 1m and 1mm precision respectively, half-precision allows 1km and 1m, single-precision allows 8Mm and 8km, and double-precision allows 4Pm and 4Tm.

Rick Regan

For floating-point integers (I'll give my answer in terms of IEEE double-precision), every integer between 1 and 2^53 is exactly representable. Beyond 2^53, integers that are exactly representable are spaced apart by increasing powers of two. For example:

Every 2nd integer between 2^53 + 2 and 2^54 can be represented exactly.

Every 4th integer between 2^54 + 4 and 2^55 can be represented exactly.

Every 8th integer between 2^55 + 8 and 2^56 can be represented exactly.

Every 16th integer between 2^56 + 16 and 2^57 can be represented exactly.

Every 32nd integer between 2^57 + 32 and 2^58 can be represented exactly.

Every 64th integer between 2^58 + 64 and 2^59 can be represented exactly.

Every 128th integer between 2^59 + 128 and 2^60 can be represented exactly.

Every 256th integer between 2^60 + 256 and 2^61 can be represented exactly.

Every 512th integer between 2^61 + 512 and 2^62 can be represented exactly. . . .

Integers that are not exactly representable are rounded to the nearest representable integer, so the worst case rounding is 1/2 the spacing between representable integers.
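
The spacing pattern above is easy to observe directly. This is a small Python sketch (Python floats are IEEE doubles on essentially all current platforms, which is the assumption here); the helper names `is_exact` and `exact_stride` are invented for the example.

```python
def is_exact(n):
    """True if the integer n survives a round trip through a double unchanged."""
    return int(float(n)) == n

def exact_stride(lo, count=64):
    """Step between consecutive exactly representable integers starting at lo."""
    exact = [n for n in range(lo, lo + count) if is_exact(n)]
    return exact[1] - exact[0]

for power in (52, 53, 54, 55):
    print(f"2^{power}: every {exact_stride(2 ** power)} integer(s) is exact")
```

Converting an integer that is not representable, such as `float(2**53 + 1)`, silently rounds it to a neighbouring representable value.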

bendin

The precision quoted from Peter R's link to the MSDN ref is probably a good rule of thumb, but of course reality is more complicated.

The fact that the "point" in "floating point" is a binary point and not decimal point has a way of defeating our intuitions. The classic example is 0.1, which needs a precision of only one digit in decimal but isn't representable exactly in binary at all.
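
To see what is actually stored for 0.1, one option is Python's standard `fractions` and `decimal` modules, which can display the exact value of a double. This is only an illustrative sketch; the variable name `stored` is arbitrary.

```python
from decimal import Decimal
from fractions import Fraction

stored = Fraction(0.1)            # exact rational value of the double nearest to 0.1
print(stored)                     # a fraction whose denominator is a power of two
print(Decimal(0.1))               # the same value written out in decimal
print(stored == Fraction(1, 10))  # False
```

Because the denominator can only be a power of two, no finite binary fraction equals one tenth exactly.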

If you have a weekend to kill, have a look at What Every Computer Scientist Should Know About Floating-Point Arithmetic. You'll probably be particularly interested in the sections on Precision and Binary to Decimal Conversion.

Ry-

First off, IEEE 754-1985 has no 16-bit float; IEEE 754-2008 adds one (binary16), with a 5-bit exponent and 10-bit fraction. IEEE-754 uses a dedicated sign bit, so the positive and negative ranges are the same. Also, the fraction has an implied 1 in front, so you get an extra bit.

If you want accuracy to the ones place, as in you can represent each integer, the answer is fairly simple: the exponent shifts the binary point to the right end of the fraction. So a 10-bit fraction (plus the implied leading 1) gets you ±2^11.

If you want one bit after the binary point, you give up one bit before it, so you have ±2^10.

Single-precision has a 23-bit fraction, so you'd have ±2^24 integers.
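
These limits can be checked by round-tripping values through the narrower formats. The sketch below uses Python's standard `struct` module, whose `'e'` and `'f'` format codes store half- and single-precision values; the helper name `round_trip` and the chosen test values are mine.

```python
import struct

def round_trip(x, fmt):
    """Store x in the given struct format ('e' = half, 'f' = single) and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

checks = [
    ("half", "e", 2048.0),        # 2^11: last point where every integer fits
    ("half", "e", 2049.0),        # odd integer above 2^11 is lost
    ("half", "e", 1023.5),        # below 2^10 a half step still fits
    ("half", "e", 1024.5),        # above 2^10 it does not
    ("single", "f", 16777216.0),  # 2^24
    ("single", "f", 16777217.0),  # 2^24 + 1 is lost
]
for name, fmt, x in checks:
    print(name, x, "exact" if round_trip(x, fmt) == x else "rounded")
```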

How many bits of precision you need after the decimal point depends entirely on the calculations you're doing, and how many you're doing.

2^10 = 1,024

2^11 = 2,048

2^23 = 8,388,608

2^24 = 16,777,216

2^53 = 9,007,199,254,740,992 (double-precision)

2^113 = 10,384,593,717,069,655,257,060,992,658,440,192 (quad-precision)

See also

Double-precision

Half-precision

Community

See IEEE 754-1985:

https://upload.wikimedia.org/math/7/7/5/775c2ad6fc57863c981972a84dc42f52.png

Note (1 + fraction). As @bendin points out, using binary floating point, you cannot express simple decimal values such as 0.1. The implication is that you can introduce rounding errors by doing simple additions many, many times or by calling things like truncation. If you are interested in any sort of precision whatsoever, the only way to achieve it is to use a fixed-point decimal, which is basically a scaled integer.
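
A minimal Python sketch of the accumulation problem and the scaled-integer alternative follows. The variable names are arbitrary, and the scale (counting in tenths) is just an assumption chosen to fit the 0.1 example.

```python
# Add one tenth ten times, first as a binary double, then as a scaled integer.
float_total = 0.0
for _ in range(10):
    float_total += 0.1

tenths_total = 0          # fixed point: the integer counts tenths
for _ in range(10):
    tenths_total += 1

print(float_total == 1.0)    # False: each addition rounded slightly
print(tenths_total == 10)    # True: integer arithmetic is exact
```

An explicit loop is used rather than `sum()` because newer Python versions apply compensated summation inside `sum()`, which would hide the effect.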

Peter R

If I understand your question correctly, it depends on your language.
For C#, check out the MSDN ref. Float has a 7 digit precision and double 15-16 digit precision.

Actually, IEEE-754 defines the precision, so it shouldn't be language-specific.

PanCrit

It took me quite a while to figure out that when using doubles in Java, I wasn't losing significant precision in calculations. Floating point actually has a very good ability to represent numbers to quite reasonable precision. The precision I was losing was lost immediately upon converting decimal numbers typed by users to the binary floating-point representation that is natively supported. I've recently started converting all my numbers to BigDecimal. BigDecimal is much more work to deal with in the code than floats or doubles, since it's not one of the primitive types. But on the other hand, I'll be able to exactly represent the numbers that users type in.
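
The same effect can be sketched outside Java. The example below uses Python's `decimal.Decimal` as a rough stand-in for BigDecimal (an analogy on my part, not the answer's own code), with a hypothetical list of strings playing the role of user input.

```python
from decimal import Decimal

user_input = ["0.1", "0.2"]           # hypothetical strings typed by a user

as_floats = [float(s) for s in user_input]
as_decimals = [Decimal(s) for s in user_input]

print(as_floats[0] + as_floats[1] == 0.3)                  # False
print(as_decimals[0] + as_decimals[1] == Decimal("0.3"))   # True
```

The key detail is constructing the decimal from the original string; building it from an already converted float would carry the binary rounding along.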
