# Floating Point Number

## Definition

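A single-precision float represents a number in the form

$$ (-1)^s \cdot 1.m \cdot 2^{e-127} $$

where $s$ is the sign bit, $m$ is the mantissa (the binary digits after the leading 1), and $e$ is the stored exponent. The constant 127 is the exponent bias, which depends on the precision: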

Precision | Width | Exp. | Bias |
---|---|---|---|
Half | 16 bit | 5 bit | 15 |
Single | 32 bit | 8 bit | 127 |
Double | 64 bit | 11 bit | 1023 |
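The bias values in the table follow from the exponent width: for an exponent field of $k$ bits the bias is $2^{k-1} - 1$. A minimal sketch (the function name `exponent_bias` is chosen here for illustration):

```python
def exponent_bias(exponent_bits):
    """Return the exponent bias for a field that is `exponent_bits` wide."""
    return 2 ** (exponent_bits - 1) - 1

# Exponent widths of half, single, and double precision.
for name, bits in [("Half", 5), ("Single", 8), ("Double", 11)]:
    print(name, exponent_bias(bits))
```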

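Example: $-7 = -111_2 = -1.11_2 \cdot 2^2$, so in single precision $s = 1$, $m = 11_2$ (followed by zeros), and the stored exponent is $e = 2 + 127 = 129$.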
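The decomposition of -7 can be checked by looking at the raw bits. The following is a small sketch using Python's standard `struct` module; the helper name `decode_float32` is an assumption made for this example, and it assumes the single-precision layout of 1 sign bit, 8 exponent bits, and 23 mantissa bits.

```python
import struct

def decode_float32(x):
    """Split x, rounded to single precision, into (sign, stored exponent, mantissa)."""
    # Reinterpret the 4 bytes of the float as an unsigned 32-bit integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with bias 127
    mantissa = bits & 0x7FFFFF        # 23 bits, the digits after "1."
    return sign, exponent, mantissa

sign, exponent, mantissa = decode_float32(-7.0)
print(sign, exponent, hex(mantissa))  # prints: 1 129 0x600000
```

The mantissa `0x600000` has its two top bits set, which matches the `.11` in $1.11_2$, and $129 - 127 = 2$ is the power of two.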

## Special Numbers

Floats can also represent signed zeros ($\pm 0$), infinities ($\pm \infty$), and Not-a-Number (NaN).

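For single precision, the classes are distinguished by the stored (biased) 8-bit exponent field and the 23-bit mantissa:

Num | Sign | Stored exp. | Mant. |
---|---|---|---|
Normal | 0 or 1 | [1, 254] | $[0, 2^{23}-1]$ |
$\pm 0$ | 0 or 1 | 0 | 0 |
$\pm \infty$ | 0 or 1 | 255 | 0 |
Subnormal | 0 or 1 | 0 | != 0 |
QNaN | 0 or 1 | 255 | != 0 & MSB=1 |
SNaN | 0 or 1 | 255 | != 0 & MSB=0 |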
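The classification can be sketched as a function over raw 32-bit patterns. It works on the stored exponent field (0 to 255 in single precision), where 0 marks zeros and subnormals and 255 marks infinities and NaNs. The name `classify_float32` is hypothetical, and the function takes an integer bit pattern rather than a Python float so that signaling NaNs are not altered by a conversion.

```python
def classify_float32(bits):
    """Classify a 32-bit pattern by its stored exponent and mantissa fields."""
    exponent = (bits >> 23) & 0xFF    # stored (biased) exponent
    mantissa = bits & 0x7FFFFF        # 23-bit mantissa
    if exponent == 0:
        return "zero" if mantissa == 0 else "subnormal"
    if exponent == 0xFF:
        if mantissa == 0:
            return "infinity"
        # The most significant mantissa bit separates quiet from signaling NaNs.
        return "qnan" if mantissa & 0x400000 else "snan"
    return "normal"

print(classify_float32(0x7FC00000))  # prints: qnan
```

The sign bit is ignored on purpose, since every class exists with either sign.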

## Exceptions

Five exceptions are supported:

- Invalid operation: the result of the operation is a NaN
- Division by zero
- Overflow: the result of the operation is ±∞ or ±Max, depending on the rounding mode
- Underflow: the result of the operation is a denormalized number
- Inexact result: caused by rounding
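As a rough illustration, the sketch below collects the results that Python (whose `float` is a 64-bit double) produces for one operation of each kind. It only shows default results: Python does not expose the floating-point status flags directly, and it raises `ZeroDivisionError` for a float division by zero instead of returning an infinity. The function name `default_results` is made up for this example.

```python
import math
import sys

def default_results():
    """Collect Python's results for one operation per exception class."""
    inf = float("inf")
    results = {}
    results["invalid"] = inf - inf                 # NaN
    results["overflow"] = sys.float_info.max * 2   # +inf under round-to-nearest
    results["underflow"] = sys.float_info.min / 4  # below the smallest normal number
    results["inexact"] = 0.1 + 0.2                 # rounded, not exactly 0.3
    try:
        results["division_by_zero"] = 1.0 / 0.0
    except ZeroDivisionError:
        # Python raises here rather than returning an infinity.
        results["division_by_zero"] = "ZeroDivisionError"
    return results

for name, value in default_results().items():
    print(name, value)
```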

## Subnormal Numbers

If the result of a calculation is smaller in magnitude than the smallest normal number, there are two options:

- hard underflow: directly assign zero
- gradual underflow: subnormal number

Subnormal numbers fill the gap between zero and the smallest normal number, so small results lose precision gradually instead of jumping directly to zero.
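The difference between the two strategies can be seen with two distinct values close to the smallest normal number. The sketch below uses Python's 64-bit doubles; `flush_to_zero` is a hypothetical helper that only models hard underflow and does not correspond to an actual hardware mode switch.

```python
import sys

smallest_normal = sys.float_info.min   # 2**-1022 for a 64-bit double

def flush_to_zero(x):
    """Model hard underflow: anything below the smallest normal becomes zero."""
    return 0.0 if abs(x) < smallest_normal else x

a = 1.5 * smallest_normal
b = smallest_normal

gradual = a - b                # a subnormal number, half of smallest_normal
hard = flush_to_zero(a - b)    # 0.0, although a and b are different

print(gradual != 0.0, hard == 0.0)  # prints: True True
```

With gradual underflow, `a != b` guarantees `a - b != 0`, which keeps checks such as guarding a division by `a - b` meaningful.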