/ informatics / theory /

[edit]

Definition

Floats represents a

$$ (-1)^s \cdot 1.m \cdot 2^{e-127} $$

Precision Width Exp. Bias
Half 16 bit 5 bit 15
Single 32 bit 8 bit 127
Double 64 bit 11bit 1023

Example -7 = 1.11 * 2^2

Special Numbers

Floats can also represent signed zeros ($\pm 0$), infinity $\pm \infty$, and Not-A-Number (NaN).

Num Sign Exp. Mant.
Normal [0,1] [-127,126] $[0,2^{23}]$
$\pm 0$ [0,1] -128 0
$\pm \infty$ [0,1] 127 0
Subnormal [0,1] -128 != 0
QNaN [0,1] 127 !=0 & MSB=1
SNaN [0,1] 127 !=0 & MSB=0

Exceptions

5 exceptions are supported: • Invalid operation: the result of the operation is a NaN • Division by zero • Overflow: the result of the operation is ±∞ or ±Max depending on the rounding mode • Underflow: the result of the operation is a denormalized number • Inexact result: caused by rounding

Subnormal Numbers

If the result of a calculation is smaller than the smallest normal number there are two option:

To prevent unwanted behavior by jumping directly to zero, subnormal numbers fill the gap between zero and the smallest normal number.

Not-A-Number (NaN)

-->