ChatGPT解决这个技术问题 Extra ChatGPT

How to get the sign, mantissa and exponent of a floating point number

I have a program, which is running on two processors, one of which does not have floating point support. So, I need to perform floating point calculations using fixed point in that processor. For that purpose, I will be using a floating point emulation library.

I need to first extract the signs, mantissas and exponents of floating point numbers on the processor which do support floating point. So, my question is how can I get the sign, mantissa and exponent of a single precision floating point number.

Following the format from this figure,

https://i.stack.imgur.com/tPPeM.jpg

void getSME( int& s, int& m, int& e, float number )
{
    unsigned int* ptr = (unsigned int*)&number;

    s = *ptr >> 31;
    e = *ptr & 0x7f800000;
    e >>= 23;
    m = *ptr & 0x007fffff;
}
Try to start from here: en.wikipedia.org/wiki/Single-precision_floating-point_format, but I am almost sure that you saw this
Aliasing through pointer conversion is not supported by the C standard and may be troublesome in some compilers. It is preferable to use (union { float f; uint32_t u; }) { number } .u. This returns a uint32_t that is the bytes of the float number reinterpreted as a 32-bit unsigned integer.
I'm assuming IEEE 754 32 bit binary. Are you aware of the following issues? (1) The exponent is biassed, by adding 127 to the actual exponent. (2) All except very small floats are normalized, and the leading 1 bit of a normalized float mantissa is not stored.
Do you mean C or C++ (C has no references, only pointers)
Three problems: 0. not removing the bias from the encoded exponent 1. not adding the implicit mantissa bit for normal nonzero numbers 2. not handling denormals, infinities and sNaN/qNaNs's

S
Stargateur

I think it is better to use unions to do the casts, it is clearer.

#include <stdio.h>

typedef union {
  float f;
  struct {
    unsigned int mantisa : 23;
    unsigned int exponent : 8;
    unsigned int sign : 1;
  } parts;
} float_cast;

int main(void) {
  float_cast d1 = { .f = 0.15625 };
  printf("sign = %x\n", d1.parts.sign);
  printf("exponent = %x\n", d1.parts.exponent);
  printf("mantisa = %x\n", d1.parts.mantisa);
}

Example based on http://en.wikipedia.org/wiki/Single_precision


"For some reason, this original purpose of the union got "overriden" with something completely different: writing one member of a union and then inspecting it through another member. This kind of memory reinterpretation is not a valid use of unions. It generally leads to undefined behavior." stackoverflow.com/a/2313676/1127387
There's no law that says you have to only use things for what they were originally created for. Otherwise the first plane wouldn't have used bits of bicycle. "Generally" undefined? What about those occasions when it is defined, or when you're happy with the behaviour on a given platform/situation?
This method fails when 1) float is not IEEE 754 32 bit binary (not so rare) 2) unsigned is 16-bit (common in embedded world) 3) endian of unsigned/float do not match. (rare). 4) Mathematical interpretation is used for exponent/mantissa as this answer shows the biased exponent and the incomplete significand/mantissa.
Is the above code portable? What happens on big and little endian machines?
Very late to the party here, but no, the union is not better because it is not guaranteed to work at all. It certainly is not portable. Nothing constrains the C implementation to lay out the bitfields such that the union maps them to the desired pieces of the float representation, the separate question of relying on type punning at all notwithstanding.
P
Pietro Braione

My advice is to stick to rule 0 and not redo what standard libraries already do, if this is enough. Look at math.h (cmath in standard C++) and functions frexp, frexpf, frexpl, that break a floating point value (double, float, or long double) in its significand and exponent part. To extract the sign from the significand you can use signbit, also in math.h / cmath, or copysign (only C++11). Some alternatives, with slighter different semantics, are modf and ilogb/scalbn, available in C++11; http://en.cppreference.com/w/cpp/numeric/math/logb compares them, but I didn't find in the documentation how all these functions behave with +/-inf and NaNs. Finally, if you really want to use bitmasks (e.g., you desperately need to know the exact bits, and your program may have different NaNs with different representations, and you don't trust the above functions), at least make everything platform-independent by using the macros in float.h/cfloat.


Is there any potable way to know the existence of the hidden bit in the significand? And what about the detection of the exponent bias?
C
Community

Find out the format of the floating point numbers used on the CPU that directly supports floating point and break it down into those parts. The most common format is IEEE-754.

Alternatively, you could obtain those parts using a few special functions (double frexp(double value, int *exp); and double ldexp(double x, int exp);) as shown in this answer.

Another option is to use %a with printf().


Too bad I'm searching for a portable solution to implement dtoa, which is somewhat a subset of printf...
X
Xymostech

You're &ing the wrong bits. I think you want:

s = *ptr >> 31;
e = *ptr & 0x7f800000;
e >>= 23;
m = *ptr & 0x007fffff;

Remember, when you &, you are zeroing out bits that you don't set. So in this case, you want to zero out the sign bit when you get the exponent, and you want to zero out the sign bit and the exponent when you get the mantissa.

Note that the masks come directly from your picture. So, the exponent mask will look like:

0 11111111 00000000000000000000000

and the mantissa mask will look like:

0 00000000 11111111111111111111111


@MetallicPriest Try now, I had the wrong masks the first time.
What about the so called hidden bit? I don't see anyone set it: m |= 0x00800000;. Note that the number should be checked for special values (denormals, NaN, infinities) first, since these require different treatment.
@RudyVelthuis From their original code, it doesn't look they were trying to actually obtain the values of the exponent and mantissa, just trying to get the bit representation of each. I'm assuming this because they didn't or in the hidden bit or normalize the sign, but I could be wrong.
I'm assuming they forgot and that is why they got wrong values. But I can only guess.
M
Maxim Egorushkin

On Linux package glibc-headers provides header #include <ieee754.h> with floating point types definitions, e.g.:

union ieee754_double
  {
    double d;

    /* This is the IEEE 754 double-precision format.  */
    struct
      {
#if __BYTE_ORDER == __BIG_ENDIAN
    unsigned int negative:1;
    unsigned int exponent:11;
    /* Together these comprise the mantissa.  */
    unsigned int mantissa0:20;
    unsigned int mantissa1:32;
#endif              /* Big endian.  */
#if __BYTE_ORDER == __LITTLE_ENDIAN
# if    __FLOAT_WORD_ORDER == __BIG_ENDIAN
    unsigned int mantissa0:20;
    unsigned int exponent:11;
    unsigned int negative:1;
    unsigned int mantissa1:32;
# else
    /* Together these comprise the mantissa.  */
    unsigned int mantissa1:32;
    unsigned int mantissa0:20;
    unsigned int exponent:11;
    unsigned int negative:1;
# endif
#endif              /* Little endian.  */
      } ieee;

    /* This format makes it easier to see if a NaN is a signalling NaN.  */
    struct
      {
#if __BYTE_ORDER == __BIG_ENDIAN
    unsigned int negative:1;
    unsigned int exponent:11;
    unsigned int quiet_nan:1;
    /* Together these comprise the mantissa.  */
    unsigned int mantissa0:19;
    unsigned int mantissa1:32;
#else
# if    __FLOAT_WORD_ORDER == __BIG_ENDIAN
    unsigned int mantissa0:19;
    unsigned int quiet_nan:1;
    unsigned int exponent:11;
    unsigned int negative:1;
    unsigned int mantissa1:32;
# else
    /* Together these comprise the mantissa.  */
    unsigned int mantissa1:32;
    unsigned int mantissa0:19;
    unsigned int quiet_nan:1;
    unsigned int exponent:11;
    unsigned int negative:1;
# endif
#endif
      } ieee_nan;
  };

#define IEEE754_DOUBLE_BIAS 0x3ff /* Added to exponent.  */

How do we use these in practice ? If we have to check whether we got a nan, how do we do ?
@DimitriLesnoff Start with std::isnan.
This looks like a great (thorough) start, but would it be too much to ask for a little test code that shows the input and the output, and how the two work (or don’t)?
佚名

Don't make functions that do multiple things. Don't mask then shift; shift then mask. Don't mutate values unnecessarily because it's slow, cache-destroying and error-prone. Don't use magic numbers.

/* NaNs, infinities, denormals unhandled */
/* assumes sizeof(float) == 4 and uses ieee754 binary32 format */
/* assumes two's-complement machine */
/* C99 */
#include <stdint.h>

#define SIGN(f) (((f) <= -0.0) ? 1 : 0)

#define AS_U32(f) (*(const uint32_t*)&(f))
#define FLOAT_EXPONENT_WIDTH 8
#define FLOAT_MANTISSA_WIDTH 23
#define FLOAT_BIAS ((1<<(FLOAT_EXPONENT_WIDTH-1))-1) /* 2^(e-1)-1 */
#define MASK(width)  ((1<<(width))-1) /* 2^w - 1 */
#define FLOAT_IMPLICIT_MANTISSA_BIT (1<<FLOAT_MANTISSA_WIDTH)

/* correct exponent with bias removed */
int float_exponent(float f) {
  return (int)((AS_U32(f) >> FLOAT_MANTISSA_WIDTH) & MASK(FLOAT_EXPONENT_WIDTH)) - FLOAT_BIAS;
}

/* of non-zero, normal floats only */
int float_mantissa(float f) {
  return (int)(AS_U32(f) & MASK(FLOAT_MANTISSA_BITS)) | FLOAT_IMPLICIT_MANTISSA_BIT;
}

/* Hacker's Delight book is your friend. */

"Don't use magic numbers" strongly suggests using DBL_MANT_DIG rather than creating one's own version!
@TobySpeight It's FLT_MANT_DIG.
A
AymenTM

See this IEEE_754_types.h header for the union types to extract: float, double and long double, (endianness handled). Here is an extract:

/*
** - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
**  Single Precision (float)  --  Standard IEEE 754 Floating-point Specification
*/

# define IEEE_754_FLOAT_MANTISSA_BITS (23)
# define IEEE_754_FLOAT_EXPONENT_BITS (8)
# define IEEE_754_FLOAT_SIGN_BITS     (1)

.
.
.

# if (IS_BIG_ENDIAN == 1)
    typedef union {
        float value;
        struct {
            __int8_t   sign     : IEEE_754_FLOAT_SIGN_BITS;
            __int8_t   exponent : IEEE_754_FLOAT_EXPONENT_BITS;
            __uint32_t mantissa : IEEE_754_FLOAT_MANTISSA_BITS;
        };
    } IEEE_754_float;
# else
    typedef union {
        float value;
        struct {
            __uint32_t mantissa : IEEE_754_FLOAT_MANTISSA_BITS;
            __int8_t   exponent : IEEE_754_FLOAT_EXPONENT_BITS;
            __int8_t   sign     : IEEE_754_FLOAT_SIGN_BITS;
        };
    } IEEE_754_float;
# endif

And see dtoa_base.c for a demonstration of how to convert a double value to string form.

Furthermore, check out section 1.2.1.1.4.2 - Floating-Point Type Memory Layout of the C/CPP Reference Book, it explains super well and in simple terms the memory representation/layout of all the floating-point types and how to decode them (w/ illustrations) following the actually IEEE 754 Floating-Point specification.

It also has links to really really good ressources that explain even deeper.


关注公众号,不定期副业成功案例分享
Follow WeChat

Success story sharing

Want to stay one step ahead of the latest teleworks?

Subscribe Now