Overflow and Underflow

Goal

Understand the definitions of overflow and underflow, learn how IEEE 754 handles them and the role of denormalized numbers, and master typical occurrence scenarios and avoidance techniques.

Prerequisites

Basic structure of IEEE 754 floating-point numbers

1. Overflow

Overflow occurs when the absolute value of an arithmetic result exceeds the maximum representable floating-point number.

$$|x| > x_{\max} \quad \Rightarrow \quad \text{Overflow}$$

| Precision | Maximum $x_{\max}$ | Max exponent |
|---|---|---|
| Single (float32) | $\approx 3.4 \times 10^{38}$ | $2^{127}$ |
| Double (float64) | $\approx 1.8 \times 10^{308}$ | $2^{1023}$ |

In IEEE 754, under the default round-to-nearest mode, the result is set to $\pm\infty$ on overflow. Arithmetic with $\infty$ follows rules such as $\infty + 1 = \infty$, $\infty \times 2 = \infty$, $\infty - \infty = \text{NaN}$.

Exact limits for double precision

  • Largest normalized: $(2 - 2^{-52}) \times 2^{1023} \approx 1.7976931348623157 \times 10^{308}$
  • Smallest normalized: $2^{-1022} \approx 2.2250738585072014 \times 10^{-308}$
  • Smallest subnormal: $2^{-1074} \approx 4.9406564584124654 \times 10^{-324}$

The exponent field is 11 bits (bias 1023), so the largest exponent of a finite double is 1023; any result whose rounded magnitude would reach $2^{1024}$ is replaced by $\pm\infty$.
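The limits and the infinity rules above can be checked from Python, whose `float` is an IEEE 754 double on mainstream platforms. This is a minimal sketch using only the standard library; the variable names are illustrative.

```python
import math
import sys

# Largest finite double: (2 - 2^-52) * 2^1023
x_max = sys.float_info.max
assert x_max == (2.0 - 2.0**-52) * 2.0**1023

# Multiplying past x_max overflows silently to +inf
overflowed = x_max * 2.0
print(overflowed)                 # inf

# Arithmetic with infinity
inf = math.inf
print(inf + 1.0 == inf)           # True
print(inf * 2.0 == inf)           # True
print(math.isnan(inf - inf))      # True
```

Note that some Python operations, such as `**` on floats and `math.exp`, raise `OverflowError` instead of returning `inf`; plain `+`, `-`, `*` follow the IEEE 754 behavior shown here.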

2. Underflow

Underflow occurs when the absolute value of a nonzero result is smaller than the minimum normalized number.

$$0 < |x| < x_{\min} \quad \Rightarrow \quad \text{underflow}$$

| Precision | Min normalized $x_{\min}$ | Min denormalized |
|---|---|---|
| Single (float32) | $\approx 1.2 \times 10^{-38}$ | $\approx 1.4 \times 10^{-45}$ |
| Double (float64) | $\approx 2.2 \times 10^{-308}$ | $\approx 5.0 \times 10^{-324}$ |

Underflow is often less critical than overflow. IEEE 754 gradual underflow ensures results approach zero by gradually losing precision rather than being flushed to zero suddenly.
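Gradual underflow can be observed by repeatedly halving the smallest normalized double. The sketch below assumes Python's `float` is an IEEE 754 double with subnormals enabled (the usual default).

```python
import sys

x = sys.float_info.min      # smallest normalized double, 2^-1022
halvings = 0
while x > 0.0:
    x /= 2.0                # below 2^-1022 each nonzero result is subnormal
    halvings += 1

print(halvings)             # 53: 52 subnormal steps, then 0.0
```

With an abrupt flush-to-zero policy the loop would stop after a single halving; the 52 extra steps are the subnormal range.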

3. Handling in IEEE 754

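[Figure: number line from $-\infty$ to $+\infty$ marking $-x_{\max}$, $-x_{\min}$, $0$, $x_{\min}$, $x_{\max}$. Subnormal regions (gradual underflow) lie between $0$ and $\pm x_{\min}$, normalized regions between $\pm x_{\min}$ and $\pm x_{\max}$, and overflow regions beyond $\pm x_{\max}$.]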
Figure 1. Representable range of floating-point numbers (symmetric about zero). Near zero, subnormal numbers provide gradual underflow. Beyond $\pm x_{\max}$, results overflow to $\pm\infty$.

Special values in IEEE 754:

  • $\pm\infty$: Result of overflow. Also produced by dividing a nonzero finite number by zero: $1/0 = +\infty$, $-1/0 = -\infty$.
  • NaN (Not a Number): Result of undefined operations such as $0/0$, $\infty - \infty$, $\sqrt{-1}$.
  • $\pm 0$: Positive and negative zero are distinguished. $1/(+0) = +\infty$, $1/(-0) = -\infty$.
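The special values listed above can be inspected from Python. One caveat for this sketch: Python raises `ZeroDivisionError` for `1 / 0.0` rather than returning an infinity, so the sign of zero is shown through `math.copysign` and `math.atan2` instead.

```python
import math

inf = math.inf

# NaN from an undefined operation; NaN compares unequal even to itself
undefined = inf - inf
print(math.isnan(undefined), undefined == undefined)   # True False

# Signed zeros compare equal but carry different signs
pos_zero, neg_zero = 0.0, -0.0
print(pos_zero == neg_zero)                            # True
print(math.copysign(1.0, pos_zero), math.copysign(1.0, neg_zero))  # 1.0 -1.0

# The sign of zero is observable through functions such as atan2
print(math.atan2(0.0, pos_zero), math.atan2(0.0, neg_zero))  # 0.0 3.141592653589793
```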

4. Denormalized Numbers (Subnormals)

Denormalized numbers (subnormal numbers) represent values smaller than $x_{\min}$ by setting the implicit leading bit of the significand to 0 when the exponent is at its minimum value.
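To make the encoding concrete, the following sketch unpacks the bit fields of a double with the standard `struct` module. The helper name `decode` is illustrative, not part of any library.

```python
import struct
import sys

def decode(x):
    """Return (sign, exponent_field, fraction_field) of a double."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF      # 11-bit biased exponent
    fraction = bits & ((1 << 52) - 1)    # 52 stored significand bits
    return sign, exponent, fraction

smallest_normal = sys.float_info.min                            # 2^-1022
smallest_subnormal = smallest_normal * sys.float_info.epsilon   # 2^-1074

print(decode(smallest_normal))       # (0, 1, 0)
print(decode(smallest_subnormal))    # (0, 0, 1)
print(decode(smallest_normal / 2))   # (0, 0, 2251799813685248), fraction = 2^51
```

An exponent field of 0 with a nonzero fraction marks a subnormal: the value is $0.f \times 2^{-1022}$ instead of $1.f \times 2^{e - 1023}$.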

For finite $x$ and $y$, denormalized numbers guarantee the important property $x - y = 0 \Leftrightarrow x = y$ (this property breaks with flush-to-zero underflow).

However, denormalized numbers have fewer significant digits than normalized numbers, and on some processors, arithmetic with denormalized numbers is significantly slower (tens to over 100x), posing a performance concern.
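The property $x - y = 0 \Leftrightarrow x = y$ can be illustrated with two distinct values whose difference is smaller than $x_{\min}$. This is a minimal sketch that assumes subnormals are enabled, as they are by default in CPython on common platforms.

```python
import sys

x_min = sys.float_info.min   # 2^-1022
x = 1.5 * x_min
y = x_min

diff = x - y                 # exact value 2^-1023, below x_min
print(x != y, diff != 0.0)   # True True
print(0.0 < diff < x_min)    # True: the difference is a subnormal
```

Under flush-to-zero, the same subtraction would return 0 even though `x != y`.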

Subnormal performance penalty

On many x86 implementations, arithmetic involving subnormals falls back to microcode and incurs a tens-to-hundreds-of-times slowdown relative to normalized numbers (the exact ratio is strongly processor-generation dependent). Real-time, DSP, audio, and physics-simulation code often enables FTZ (Flush-to-Zero), trading strict IEEE 754 semantics for throughput:

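// Enable FTZ + DAZ in the calling thread's MXCSR (x86 SSE; GCC/Clang/MSVC)
#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE

void enable_ftz_daz(void) {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          // subnormal results -> 0
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  // subnormal inputs  -> 0
}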

5. Typical Occurrence Scenarios

Overflow

  • Factorials: $170! \approx 7.26 \times 10^{306}$ is within double precision range, but $171! \approx 1.24 \times 10^{309}$ overflows.
  • Exponential function: $e^{709} \approx 8.2 \times 10^{307}$ is within range, but $e^{710}$ overflows.
  • Vector norms: In $\|x\|_2 = \sqrt{\displaystyle\sum x_i^2}$, large $x_i$ values can cause $x_i^2$ to overflow.
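The three overflow scenarios above can be reproduced in a few lines. Note one assumption about the host language: Python signals overflow in int-to-float conversion and in `math.exp` by raising `OverflowError`, whereas C would return $+\infty$.

```python
import math

# Factorials: 170! converts to a double, 171! does not
print(float(math.factorial(170)))          # about 7.257e306
try:
    float(math.factorial(171))
except OverflowError as err:
    print("171! overflows:", err)

# exp: math.exp raises instead of returning inf
exp_709 = math.exp(709)
print(exp_709)                             # about 8.218e307
try:
    math.exp(710)
except OverflowError as err:
    print("exp(710) overflows:", err)

# Sum of squares: the squares overflow although the norm is representable
x = [1e200, 1e200]
sum_sq = sum(xi * xi for xi in x)
print(sum_sq)                              # inf
```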

Underflow

  • Probability products: Products of many small probabilities easily underflow. Compute in log-space (log-probabilities).
  • Exponential decay: In double precision, $e^{-x}$ becomes subnormal for $x \gtrsim 708.4$ and underflows to zero for $x \gtrsim 745.1$.
  • Gaussian density: $e^{-x^2/2}$ at distant points is extremely small.
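The exponential and Gaussian cases can be checked directly. In this sketch, `gaussian_logpdf` is a hypothetical helper introduced only for illustration; it returns the logarithm of the standard normal density, which stays representable where the density itself does not.

```python
import math
import sys

x_min = sys.float_info.min

print(math.exp(-708) >= x_min)        # True: still a normalized double
print(0.0 < math.exp(-745) < x_min)   # True: subnormal
print(math.exp(-746))                 # 0.0: underflowed completely

def gaussian_logpdf(x):
    """Log of the standard normal density (illustrative helper)."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)

print(math.exp(-0.5 * 40.0 * 40.0))   # 0.0: the density factor underflows
print(gaussian_logpdf(40.0))          # about -800.92
```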

6. Avoidance Techniques

Log-Scale Computation

Convert products to sums: compute $\displaystyle\sum \log p_i$ instead of $\displaystyle\prod p_i$. Apply $\exp$ only when the final result is needed.
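As a small illustration with made-up data (1000 probabilities of 0.01 each), the direct product underflows to zero while the sum of logarithms stays comfortably in range:

```python
import math

probs = [0.01] * 1000                      # hypothetical example data

naive = 1.0
for p in probs:
    naive *= p                             # true product is 1e-2000
print(naive)                               # 0.0

log_prob = sum(math.log(p) for p in probs)
print(log_prob)                            # about -4605.17
```

Comparisons between models or hypotheses can usually be carried out on `log_prob` directly, so the final `exp` is often unnecessary.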

Log-Sum-Exp Trick

To compute $\log\left(\displaystyle\sum_i e^{x_i}\right)$ stably, factor out $e^M$ with $M = \max_i x_i$:

$$\log\left(\displaystyle\sum_i e^{x_i}\right) = \log\left(e^M \displaystyle\sum_i e^{x_i - M}\right) = M + \log\left(\displaystyle\sum_i e^{x_i - M}\right)$$

Since $x_i - M \le 0$, we have $e^{x_i-M} \le 1$, preventing overflow. At least one term equals $e^0 = 1$, so the argument of $\log$ lies in $[1, n]$, also preventing the underflow case $\log 0 = -\infty$.
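The identity translates into a few lines of code. This is a minimal sketch rather than a reference implementation; the function name `logsumexp` is chosen here for illustration, and the early return for an infinite maximum is an added guard not discussed above.

```python
import math

def logsumexp(xs):
    """Stable log(sum(exp(x) for x in xs)) via the max-shift identity."""
    m = max(xs)
    if math.isinf(m):
        # All inputs are -inf (the sum is 0), or some input is +inf
        return m
    return m + math.log(sum(math.exp(x - m) for x in xs))

print(logsumexp([1000.0, 1001.0, 1002.0]))   # about 1002.4076
print(logsumexp([-1000.0, -1001.0]))         # about -999.6867
```

Without the shift, the first call would overflow in `exp(1000)` and the second would reduce to `log(0)`.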

Scaling (Normalization)

When computing the vector norm $\|x\|_2$, divide by $m = \max_i |x_i|$ first:

$$\|x\|_2 = m \sqrt{\displaystyle\sum_i (x_i / m)^2}$$

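Since $(x_i/m)^2 \le 1$, overflow is prevented (the case $m = 0$ must be handled separately to avoid $0/0$). The reference BLAS routine dnrm2, distributed with LAPACK, guards against overflow with a scaling scheme of this kind.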
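A direct transcription of the scaled formula might look as follows. This is a sketch under the assumptions stated in the comments (the name `scaled_norm` and the handling of zero and infinite inputs are choices made here), not a reproduction of any library routine.

```python
import math

def scaled_norm(xs):
    """Euclidean norm computed after dividing by the largest magnitude."""
    m = max((abs(x) for x in xs), default=0.0)
    if m == 0.0 or math.isinf(m):
        return m                  # zero vector, or an infinite component
    return m * math.sqrt(sum((x / m) ** 2 for x in xs))

print(scaled_norm([1e200, 1e200]))    # about 1.414e200
print(scaled_norm([3.0, 4.0]))        # 5.0
```

The price of the scaling is an extra pass over the data and one division per element; very small ratios $(x_i/m)^2$ may underflow, but they then contribute negligibly to the sum.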

Formula Transformation

Transforming $e^a / e^b$ to $e^{a-b}$, or $\sqrt{ab}$ to $\sqrt{a} \cdot \sqrt{b}$ (for $a, b \ge 0$), can avoid intermediate overflow/underflow. For $\sqrt{a^2 + b^2}$, use the dedicated function hypot(a, b), which IEEE 754-2008 lists among its recommended operations and which C99 provides in <math.h>.
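Two small instances of rewriting a quotient so that no intermediate leaves the representable range are sketched below. The factorial ratio via `math.lgamma` is an extra illustration in the same spirit, using $\log n! = \texttt{lgamma}(n + 1)$.

```python
import math

a, b = 800.0, 799.0

# math.exp(a) / math.exp(b) would fail: exp(800) exceeds the double range
ratio = math.exp(a - b)
print(ratio)                       # about 2.718 (e^1)

# 200! / 198! via log-gamma; 200! alone is far above the double range
log_ratio = math.lgamma(201) - math.lgamma(199)
factorial_ratio = math.exp(log_ratio)
print(factorial_ratio)             # about 39800 (= 200 * 199)
```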

7. Worked Examples

Example 1: Log-Sum-Exp for softmax

Consider computing softmax for $x = (1000, 1001, 1002)$. A naive evaluation

$$s_i = \dfrac{e^{x_i}}{\displaystyle\sum_j e^{x_j}}$$

in double precision overflows on every term, since $e^{1000} \approx 1.97 \times 10^{434}$, and yields $\infty / \infty = \text{NaN}$. Shifting by $M = \max_j x_j = 1002$ gives

$$s_i = \dfrac{e^{x_i - M}}{\displaystyle\sum_j e^{x_j - M}} = \dfrac{(e^{-2},\, e^{-1},\, e^{0})_i}{e^{-2} + e^{-1} + 1} \approx (0.0900,\, 0.2447,\, 0.6652)_i$$

so all intermediate values fall in $(0, 1]$. The result is identical in exact arithmetic. Python implementation:

import math

def softmax(x):
    m = max(x)
    exp_x = [math.exp(xi - m) for xi in x]
    s = sum(exp_x)
    return [e / s for e in exp_x]

print(softmax([1000, 1001, 1002]))
# [0.09003057317038046, 0.24472847105479764, 0.6652409557748219]

Example 2: Scaling a vector norm

For $x = (10^{200},\, 10^{200})$, naive evaluation of $\|x\|_2 = \sqrt{x_1^2 + x_2^2}$ gives $x_1^2 = 10^{400}$, which exceeds the double-precision range ($\sim 10^{308}$) and produces $\infty$. Scaling by $m = 10^{200}$ instead gives

$$\|x\|_2 = m \sqrt{(x_1/m)^2 + (x_2/m)^2} = 10^{200} \sqrt{1 + 1} = \sqrt{2} \cdot 10^{200} \approx 1.414 \times 10^{200}$$

which is computed correctly: every intermediate is at most 2, and the true value is below $10^{308}$ and thus representable.

Example 3: Safe evaluation via hypot

For $\sqrt{a^2 + b^2}$, use hypot(a, b), one of the operations recommended by IEEE 754-2008. For instance, C's hypot(1e200, 1e200) avoids the intermediate overflow and returns approximately $\sqrt{2} \cdot 10^{200}$ directly. Similarly, std::expm1(x) ($e^x - 1$) and std::log1p(x) ($\log(1+x)$) avoid the loss of precision that the naive expressions suffer for $|x| \ll 1$, where $e^x$ and $1 + x$ round to values so close to 1 that the small part is lost.
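Python's `math` module exposes the same three functions, so the behavior can be checked without a C toolchain. A brief sketch:

```python
import math

h = math.hypot(1e200, 1e200)
print(h)                               # about 1.414e200, no overflow

x = 1e-20
naive_expm1 = math.exp(x) - 1.0
naive_log1p = math.log(1.0 + x)
print(naive_expm1)                     # 0.0: exp(x) rounds to exactly 1.0
print(math.expm1(x))                   # about 1e-20
print(naive_log1p)                     # 0.0: 1.0 + x rounds to exactly 1.0
print(math.log1p(x))                   # about 1e-20
```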

8. Pitfalls

  • FTZ / DAZ modes: For performance, SSE/AVX code is often run with FTZ (Flush-to-Zero, denormal outputs become 0) and DAZ (Denormals-Are-Zero, denormal inputs treated as 0) enabled, and some compiler fast-math options set these modes at program startup. This disables gradual underflow, so $x \ne y$ can yield $x - y = 0$.
  • Denormal performance penalty: On many x86 processors, arithmetic involving denormals takes a microcode path and can be tens to hundreds of times slower, depending on the processor generation. Real-time and DSP code often enables FTZ intentionally.
  • Signed-zero comparison: $+0 = -0$ is true, but $1/(+0) = +\infty$ and $1/(-0) = -\infty$, so taking reciprocals yields $+\infty \ne -\infty$ and downstream branches diverge.
  • Intermediate overflow: Even when the final result is in range, intermediate computations can overflow (e.g., $\sqrt{a^2 + b^2}$, $(a \cdot b)/c$). Reorder evaluations or use specialized functions.
  • Distinct from integer overflow: Integer overflow is outside IEEE 754; signed integer overflow is undefined behavior in C/C++. This article addresses floating-point only.
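The intermediate-overflow pitfall is easy to reproduce with $(a \cdot b)/c$. The values below are chosen for illustration so that the final result, about $10^{100}$, is well within range while the product $a \cdot b$ is not.

```python
a, b, c = 1e200, 1e200, 1e300

naive = (a * b) / c          # a * b overflows to inf before the division
reordered = a * (b / c)      # b / c is about 1e-100, so the product stays in range

print(naive)                 # inf
print(reordered)             # about 1e100
```

Reordering is not free of risk: for other magnitudes the quotient `b / c` could underflow instead, so the safe order depends on the expected ranges of the operands.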

Detecting overflow and underflow

IEEE 754 overflow and underflow occur silently; explicit detection requires inspecting the floating-point exception flags. C99 / C++11 expose them via <fenv.h> / <cfenv>:

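#include <cfenv>
#pragma STDC FENV_ACCESS ON  // may be ignored (with a warning) by some compilers

// Placeholder for the computation under test; here it overflows on purpose.
double dangerous_computation() { volatile double big = 1e308; return big * 10.0; }

int main() {
    std::feclearexcept(FE_ALL_EXCEPT);
    double r = dangerous_computation();
    if (std::fetestexcept(FE_OVERFLOW))  { /* r overflowed to +/-inf */ }
    if (std::fetestexcept(FE_UNDERFLOW)) { /* tiny, inexact result: subnormal or 0 */ }
    (void)r;  // r would be used here
}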

A SIGFPE-trapping mode (feenableexcept on glibc) also exists, but it is non-portable, costly, and requires careful flag-state management across library boundaries.
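Where the exception flags are not readily accessible (for example in plain Python without extension modules), a coarser alternative is to inspect the result after the fact. The helper `classify` below is hypothetical and only a heuristic: it cannot see events in intermediate values, and it cannot tell a result that underflowed to exactly zero from a true zero.

```python
import math
import sys

def classify(result):
    """Heuristic post-hoc check of a float result (illustrative helper)."""
    if math.isnan(result):
        return "nan"
    if math.isinf(result):
        return "overflow-or-inf"
    if result != 0.0 and abs(result) < sys.float_info.min:
        return "subnormal"
    return "ok"

print(classify(1e308 * 10.0))     # overflow-or-inf
print(classify(1e-308 / 1e10))    # subnormal
print(classify(1.0))              # ok
```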

9. Frequently Asked Questions

Q1. What is overflow?

Overflow occurs when the absolute value of an arithmetic result exceeds the maximum representable floating-point number. In IEEE 754, the result becomes $\pm\infty$. The maximum double precision value is approximately $1.8 \times 10^{308}$.

Q2. What is underflow?

Underflow occurs when the absolute value of a nonzero result is smaller than the minimum normalized number. IEEE 754 gradual underflow (denormalized numbers) reduces precision gradually but avoids sudden flushing to zero.

Q3. How can overflow and underflow be avoided?

Key methods include log-scale computation (log-sum-exp), formula transformations, scaling/normalization, and extended precision arithmetic.

10. References

  • Wikipedia, "Integer overflow" (English; see also the floating-point section)
  • Wikipedia, "Arithmetic underflow" (English)
  • Wikipedia, "Denormal number" (English)
  • D. Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic," ACM Computing Surveys, vol. 23, no. 1, pp. 5-48, 1991.
  • N. J. Higham, Accuracy and Stability of Numerical Algorithms, 2nd ed., SIAM, 2002 (§2.7 analyses overflow/underflow; §27 covers safe primitives such as hypot).
  • IEEE 754-2019, IEEE Standard for Floating-Point Arithmetic, IEEE, 2019.

Implementation in sangi

The multi-precision floating-point class sangi Float stores the exponent in a 64-bit integer, so it is not constrained by the $\pm 10^{308}$ range of IEEE 754 double precision and can represent magnitudes up to about $10^{10^{18}}$. Most of the "intermediate overflow" cases discussed here are eliminated naturally by carrying the computation through in multi-precision.