# Floating Point Numbers

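A float represents a number using a sign bit $s$, a mantissa $m$, and a biased exponent $e$. For single precision (bias 127), a normal number has the value

$(-1)^s \cdot 1.m \cdot 2^{e-127}$

The other precisions use the same layout with a different width and bias:

| Precision | Width | Exponent | Bias |
|-----------|--------|----------|------|
| Half | 16 bit | 5 bit | 15 |
| Single | 32 bit | 8 bit | 127 |
| Double | 64 bit | 11 bit | 1023 |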

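Example: $-7 = (-1)^1 \cdot 1.11_2 \cdot 2^2$, so $s = 1$, the mantissa bits are $11$ followed by zeros, and the stored single-precision exponent is $e = 2 + 127 = 129$.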
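The example can be checked by pulling the three fields out of the bit pattern. The following is a minimal sketch using Python's standard `struct` module; the helper name `decode_float32` is made up for illustration and assumes single precision.

```python
import struct

def decode_float32(x):
    """Split x (rounded to single precision) into (sign, stored exponent, mantissa)."""
    # Pack as big-endian IEEE 754 binary32, then reinterpret the 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31               # 1 bit
    exponent = (bits >> 23) & 0xFF  # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF      # 23 bits, without the implicit leading 1
    return sign, exponent, mantissa

s, e, m = decode_float32(-7.0)
print(s, e, f"{m:023b}")  # 1 129 11000000000000000000000

# Rebuild the value from the fields (this formula covers normal numbers only).
value = (-1) ** s * (1 + m / 2**23) * 2.0 ** (e - 127)
print(value)  # -7.0
```

The stored exponent 129 corresponds to the actual exponent $129 - 127 = 2$, matching the hand calculation.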

## Special Numbers

Floats can also represent signed zeros ($$\pm 0$$), infinities ($$\pm \infty$$), and Not-a-Number (NaN) values.

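The single-precision encodings are listed below. Exp. is the stored (biased) exponent field $e$ and Mant. is the 23-bit mantissa field.

| Num | Sign | Exp. | Mant. |
|-----|------|------|-------|
| Normal | 0 or 1 | $$[1, 254]$$ | $$[0, 2^{23}-1]$$ |
| $$\pm 0$$ | 0 or 1 | 0 | 0 |
| $$\pm \infty$$ | 0 or 1 | 255 | 0 |
| Subnormal | 0 or 1 | 0 | != 0 |
| QNaN | 0 or 1 | 255 | != 0 and MSB = 1 |
| SNaN | 0 or 1 | 255 | != 0 and MSB = 0 |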
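The classification rules can be written as a short decision procedure on the raw bits. This is a sketch for single precision only: `classify_float32` is a hypothetical helper, it works on the stored exponent field (0 to 255), and it assumes the quiet/signaling convention given above (mantissa MSB set means quiet).

```python
def classify_float32(bits):
    """Name the class of a 32-bit pattern (illustrative helper, not a library function)."""
    sign = "-" if bits >> 31 else "+"
    exponent = (bits >> 23) & 0xFF  # stored (biased) exponent field
    mantissa = bits & 0x7FFFFF      # 23-bit mantissa field
    if exponent == 0:
        # All-zero exponent: zero or subnormal, decided by the mantissa.
        return f"{sign}0" if mantissa == 0 else "subnormal"
    if exponent == 0xFF:
        # All-one exponent: infinity or NaN, decided by the mantissa.
        if mantissa == 0:
            return f"{sign}inf"
        # The mantissa MSB (bit 22) separates quiet from signaling NaNs.
        return "QNaN" if mantissa & 0x400000 else "SNaN"
    return "normal"

for pattern in (0x80000000, 0x7F800000, 0x00000001, 0x7FC00000, 0x7F800001, 0x3F800000):
    print(f"{pattern:#010x}", classify_float32(pattern))
```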

## Exceptions

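Five exceptions are supported:

• Invalid operation: the operation has no meaningful result (for example $$0/0$$ or $$\infty - \infty$$), so the result is a NaN
• Division by zero: a finite nonzero number is divided by zero, so the result is ±∞
• Overflow: the result is too large in magnitude to be represented, so it becomes ±∞ or ±MAX depending on the rounding mode
• Underflow: the result is smaller in magnitude than the smallest normal number, so it becomes a subnormal number or zero
• Inexact result: the result had to be rounded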
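The default results of these situations can be observed with ordinary double-precision arithmetic. The sketch below uses plain Python floats; it shows only the resulting values, since standard Python does not expose the underlying status flags, and it assumes the default round-to-nearest mode.

```python
import math
import sys

# Invalid operation: inf - inf has no meaningful value, so the result is NaN.
invalid = math.inf - math.inf
# Overflow: the exact product exceeds the largest double, so round-to-nearest gives +inf.
overflow = sys.float_info.max * 2.0
# Underflow: the exact quotient lies below the smallest normal double
# and is rounded to a subnormal number.
underflow = sys.float_info.min / 3.0
# Inexact: 0.1 and 0.2 are not exactly representable, and their sum is rounded again.
inexact = 0.1 + 0.2

print(math.isnan(invalid))                    # True
print(overflow)                               # inf
print(0.0 < underflow < sys.float_info.min)   # True
print(inexact == 0.3)                         # False

# Division by zero: Python raises an exception for float division instead of returning inf.
try:
    1.0 / 0.0
except ZeroDivisionError as exc:
    print("division by zero:", exc)
```

Python's choice to raise `ZeroDivisionError` is a language-level decision; environments that follow the default exception handling would return ±∞ here.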

## Subnormal Numbers

If the result of a calculation is smaller in magnitude than the smallest normal number, there are two options:

• Hard underflow: the result is set directly to zero
• Gradual underflow: the result is rounded to a subnormal number

Subnormal numbers fill the gap between zero and the smallest normal number, which avoids the unwanted behavior caused by jumping directly to zero. For example, with gradual underflow $$x - y = 0$$ holds only if $$x = y$$.
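The effect of gradual underflow can be seen with double-precision values near the bottom of the normal range. This is a small sketch assuming the platform does not run in a flush-to-zero mode (the usual case for Python on common hardware).

```python
import math
import sys

smallest_normal = sys.float_info.min          # 2**-1022 for doubles
smallest_subnormal = math.ldexp(1.0, -1074)   # 2**-1074, the last step before zero

x = 1.5 * smallest_normal
y = smallest_normal
diff = x - y  # exactly 2**-1023, which lies below the normal range

print(x != y, diff != 0.0)                    # True True
print(diff == math.ldexp(1.0, -1023))         # True
print(0.0 < smallest_subnormal < smallest_normal)  # True

# Below the smallest subnormal there is nothing left: halving it rounds to zero.
print(smallest_subnormal / 2)                 # 0.0
```

With hard underflow, `diff` would be zero even though `x` and `y` differ, which is the kind of surprise subnormal numbers are meant to avoid.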