Overflow and Underflow
Goal
Understand the definitions of overflow and underflow, learn how IEEE 754 handles them and the role of denormalized numbers, and master typical occurrence scenarios and avoidance techniques.
Prerequisites
Basic structure of IEEE 754 floating-point numbers
1. Overflow
Overflow occurs when the absolute value of an arithmetic result exceeds the maximum representable floating-point number.
| Precision | Maximum $x_{\max}$ | Largest power of two $2^{e_{\max}}$ |
|---|---|---|
| Single (float32) | $\approx 3.4 \times 10^{38}$ | $2^{127}$ |
| Double (float64) | $\approx 1.8 \times 10^{308}$ | $2^{1023}$ |
Under the default round-to-nearest mode, IEEE 754 sets the result to $\pm\infty$ on overflow and raises the overflow flag. Arithmetic with $\infty$ follows rules such as $\infty + 1 = \infty$, $\infty \times 2 = \infty$, $\infty - \infty = \text{NaN}$.
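The rules above can be checked directly. The following is a minimal sketch in Python, whose `float` is an IEEE 754 double on common platforms. The variable names are illustrative only. Note that plain `*`, `+` and `-` follow IEEE semantics, whereas some Python functions (for example `math.exp`) raise `OverflowError` instead of returning infinity.

```python
import math

big = 1e308
overflowed = big * 10            # exceeds the double range, becomes +inf
print(overflowed)                # inf
print(overflowed + 1)            # inf
print(overflowed * 2)            # inf

diff = overflowed - overflowed   # inf - inf is an undefined operation
print(math.isnan(diff))          # True
```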
Exact limits for double precision
- Largest normalized: $(2 - 2^{-52}) \times 2^{1023} \approx 1.7976931348623157 \times 10^{308}$
- Smallest normalized: $2^{-1022} \approx 2.2250738585072014 \times 10^{-308}$
- Smallest subnormal: $2^{-1074} \approx 4.9406564584124654 \times 10^{-324}$
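The exponent field is 11 bits (bias 1023). The all-ones field value is reserved for $\infty$ and NaN, so the largest usable exponent is 1023; any result whose magnitude rounds to $2^{1024}$ or more overflows to $\pm\infty$.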
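These limits can be reproduced from their closed forms. The sketch below assumes Python floats are IEEE 754 doubles and compares the formulas with the values reported by `sys.float_info`; the variable names are illustrative.

```python
import math
import sys

largest = (2 - 2**-52) * 2.0**1023          # largest normalized double
smallest_normal = 2.0**-1022                 # smallest normalized double
smallest_subnormal = math.ldexp(1.0, -1074)  # 2**-1074, built exactly

print(largest == sys.float_info.max)          # True
print(smallest_normal == sys.float_info.min)  # True
print(smallest_subnormal)                     # 5e-324
print(smallest_subnormal / 2)                 # 0.0 (nothing smaller is representable)
```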
2. Underflow
Underflow occurs when the absolute value of a nonzero result is smaller than the minimum normalized number.
| Precision | Min normalized $x_{\min}$ | Min denormalized |
|---|---|---|
| Single (float32) | $\approx 1.2 \times 10^{-38}$ | $\approx 1.4 \times 10^{-45}$ |
| Double (float64) | $\approx 2.2 \times 10^{-308}$ | $\approx 5.0 \times 10^{-324}$ |
Underflow is often less critical than overflow. IEEE 754 gradual underflow ensures results approach zero by gradually losing precision rather than being flushed to zero suddenly.
3. Handling in IEEE 754
Special values in IEEE 754:
- $\pm\infty$: Result of overflow. Also includes $1/0 = +\infty$, $-1/0 = -\infty$.
- NaN (Not a Number): Result of undefined operations such as $0/0$, $\infty - \infty$, $\sqrt{-1}$.
- $\pm 0$: Positive and negative zero are distinguished. $1/(+0) = +\infty$, $1/(-0) = -\infty$.
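The behavior of NaN and signed zero can be observed with a short sketch. Python raises `ZeroDivisionError` for `1 / 0.0` rather than returning infinity, so this example inspects the sign of zero with `math.copysign` and `math.atan2` instead of taking a reciprocal.

```python
import math

inf = float("inf")
nan = inf - inf                       # undefined operation yields NaN
print(nan == nan)                     # False: NaN compares unequal even to itself
print(math.isnan(nan))                # True

neg_zero = -0.0
print(neg_zero == 0.0)                # True: the two zeros compare equal
print(math.copysign(1.0, neg_zero))   # -1.0: the sign bit is still there
print(math.atan2(0.0, neg_zero))      # pi, whereas atan2(0.0, 0.0) is 0.0
```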
4. Denormalized Numbers (Subnormals)
Denormalized numbers (subnormal numbers) represent values smaller than $x_{\min}$ by setting the implicit leading bit of the significand to 0 when the exponent is at its minimum value.
Denormalized numbers guarantee the important property $x - y = 0 \Leftrightarrow x = y$ (this property breaks with flush-to-zero underflow).
However, denormalized numbers have fewer significant digits than normalized numbers, and on some processors, arithmetic with denormalized numbers is significantly slower (tens to over 100x), posing a performance concern.
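The property $x - y = 0 \Leftrightarrow x = y$ can be seen with two values near the bottom of the normalized range. This is a minimal sketch assuming IEEE 754 doubles with gradual underflow enabled (the default in CPython on common platforms); the variable names are illustrative.

```python
import sys

tiny = sys.float_info.min   # smallest normalized double, 2**-1022
x = 1.5 * tiny
y = tiny
d = x - y                   # exactly 0.5 * tiny, which is subnormal

print(x != y)               # True
print(d != 0.0)             # True: the difference survives as a subnormal
print(0.0 < d < tiny)       # True

# Repeated halving walks through the subnormal range before reaching zero.
v, steps = tiny, 0
while v > 0.0:
    v /= 2
    steps += 1
print(steps)                # 53
```

Under flush-to-zero, `d` would be 0 even though `x != y`.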
Subnormal performance penalty
On many x86 implementations, arithmetic involving subnormals falls back to microcode and incurs a tens-to-hundreds-of-times slowdown relative to normalized numbers (the exact ratio is strongly processor-generation dependent). Real-time, DSP, audio, and physics-simulation code often enables FTZ (Flush-to-Zero), trading strict IEEE 754 semantics for throughput:
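```c
// Enable FTZ + DAZ via the SSE control register (GCC/Clang/MSVC on x86)
#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE

// The macros are statements, so they must run inside a function.
void enable_ftz_daz(void) {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}
```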
5. Typical Occurrence Scenarios
Overflow
- Factorials: $170! \approx 7.26 \times 10^{306}$ is within double precision range, but $171! \approx 1.24 \times 10^{309}$ overflows.
- Exponential function: $e^{709} \approx 8.2 \times 10^{307}$ is within range, but $e^{710}$ overflows.
- Vector norms: In $\|x\|_2 = \sqrt{\displaystyle\sum x_i^2}$, large $x_i$ values can cause $x_i^2$ to overflow.
Underflow
- Probability products: Products of many small probabilities easily underflow. Compute in log-space (log-probabilities).
- Exponential decay: in double precision, $e^{-x}$ becomes subnormal for $x \gtrsim 708.4$ and underflows to zero for $x \gtrsim 745.13$.
- Gaussian density: $e^{-x^2/2}$ at distant points is extremely small.
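The thresholds listed above can be probed numerically. In this sketch, `overflows` is a hypothetical helper introduced only for illustration. Python reports these particular overflows by raising `OverflowError` rather than returning infinity, while underflow silently yields a subnormal or zero.

```python
import math

def overflows(compute):
    """Return True if calling compute() raises OverflowError."""
    try:
        compute()
    except OverflowError:
        return True
    return False

print(float(math.factorial(170)))                     # about 7.26e306
print(overflows(lambda: float(math.factorial(171))))  # True
print(math.exp(709))                                  # about 8.2e307
print(overflows(lambda: math.exp(710)))               # True
print(math.exp(-745) > 0.0)                           # True: a subnormal remains
print(math.exp(-746))                                 # 0.0
```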
6. Avoidance Techniques
Log-Scale Computation
Convert products to sums: compute $\displaystyle\sum \log p_i$ instead of $\displaystyle\prod p_i$. Apply $\exp$ only when the final result is needed.
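As a small illustration, take 1000 probabilities of 0.01 each (values chosen only for the example). The true product is $10^{-2000}$, far below the double range, while the log-probability is an ordinary number.

```python
import math

probs = [0.01] * 1000

naive = math.prod(probs)                     # true value is 1e-2000
log_prob = sum(math.log(p) for p in probs)   # stays comfortably in range

print(naive)      # 0.0
print(log_prob)   # about -4605.17
```

Comparisons between such quantities can then be made on `log_prob` directly, without ever exponentiating.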
Log-Sum-Exp Trick
To compute $\log\left(\displaystyle\sum_i e^{x_i}\right)$ stably, factor out $e^M$ with $M = \max_i x_i$:
$$\log\left(\displaystyle\sum_i e^{x_i}\right) = \log\left(e^M \displaystyle\sum_i e^{x_i - M}\right) = M + \log\left(\displaystyle\sum_i e^{x_i - M}\right)$$

Since $x_i - M \le 0$, we have $e^{x_i-M} \le 1$, preventing overflow. At least one term equals $e^0 = 1$, so the argument of $\log$ lies in $[1, n]$, also preventing the underflow case $\log 0 = -\infty$.
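The identity above translates into a few lines of code. This is a sketch rather than a library-grade routine: `logsumexp` is a name chosen here, the input is assumed to be a non-empty list of floats without NaN, and an infinite maximum is returned unchanged.

```python
import math

def logsumexp(xs):
    """Return log(sum(exp(x) for x in xs)) using the max-shift identity."""
    m = max(xs)
    if math.isinf(m):
        # Either every input is -inf (the sum is 0) or some input is +inf.
        return m
    return m + math.log(sum(math.exp(x - m) for x in xs))

print(logsumexp([1000.0, 1001.0, 1002.0]))     # about 1002.4076
print(logsumexp([-1000.0, -1001.0, -1002.0]))  # about -999.5924
```

The first call would overflow in a naive evaluation, and the second would underflow to $\log 0$.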
Scaling (Normalization)
When computing the vector norm $\|x\|_2$, divide by $m = \max_i |x_i|$ first:
$$\|x\|_2 = m \sqrt{\displaystyle\sum_i (x_i / m)^2}$$

Since $(x_i/m)^2 \le 1$, overflow is prevented. The BLAS routine dnrm2 (distributed with LAPACK) is built on the same idea of scaling before squaring.
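A direct transcription of the scaled formula might look as follows. `scaled_norm` is a hypothetical helper, written as a two-pass sketch for a non-empty list of floats; production routines typically use a one-pass variant.

```python
import math

def scaled_norm(xs):
    """Euclidean norm computed after dividing by the largest magnitude."""
    m = max(abs(x) for x in xs)
    if m == 0.0 or math.isinf(m):
        return m  # all-zero vector, or a vector with an infinite component
    return m * math.sqrt(sum((x / m) ** 2 for x in xs))

x = [1e200, 1e200]
naive = math.sqrt(sum(xi * xi for xi in x))  # xi * xi is already inf
print(naive)            # inf
print(scaled_norm(x))   # about 1.414e200
```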
Formula Transformation
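Transforming $e^a / e^b$ to $e^{a-b}$, or $\sqrt{ab}$ to $\sqrt{a} \cdot \sqrt{b}$, can avoid intermediate overflow/underflow, because the exponentials or the product $ab$ may leave the representable range even when the final result does not. For $\sqrt{a^2 + b^2}$, use the dedicated function hypot(a, b), which is part of C99 and is listed among the recommended operations of IEEE 754-2008.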
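The effect of such rewrites can be seen with concrete numbers (chosen only for illustration). For the square root, it is the product under a single root that overflows, so the sketch takes the roots first. Python raises `OverflowError` for `math.exp(800)`, which the example records as `None`.

```python
import math

a, b = 800.0, 799.0
try:
    ratio_naive = math.exp(a) / math.exp(b)   # exp(800) is out of range
except OverflowError:
    ratio_naive = None                         # Python raises instead of returning inf
ratio_safe = math.exp(a - b)                   # e**1, well within range

p, q = 1e200, 1e200
root_naive = math.sqrt(p * q)                  # p * q overflows to inf first
root_safe = math.sqrt(p) * math.sqrt(q)        # each factor is about 1e100

print(ratio_naive, ratio_safe)   # None, then about 2.71828
print(root_naive, root_safe)     # inf, then about 1e200
```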
7. Worked Examples
Example 1: Log-Sum-Exp for softmax
Consider computing softmax for $x = (1000, 1001, 1002)$. A naive evaluation
$$s_i = \dfrac{e^{x_i}}{\displaystyle\sum_j e^{x_j}}$$

in double precision overflows on each term since $e^{1000} \approx 1.97 \times 10^{434}$, yielding $\infty / \infty = \text{NaN}$. Shifting by $M = \max_j x_j = 1002$ gives

$$s_i = \dfrac{e^{x_i - M}}{\displaystyle\sum_j e^{x_j - M}} = \dfrac{(e^{-2},\, e^{-1},\, e^{0})_i}{e^{-2} + e^{-1} + 1} \approx (0.0900,\, 0.2447,\, 0.6652)_i$$

so every exponential falls in $(0, 1]$ and the denominator is at least 1. The result is identical in exact arithmetic. Python implementation:
```python
import math

def softmax(x):
    m = max(x)  # shift by the maximum so every exponent is <= 0
    exp_x = [math.exp(xi - m) for xi in x]
    s = sum(exp_x)
    return [e / s for e in exp_x]

print(softmax([1000, 1001, 1002]))
# [0.09003057317038046, 0.24472847105479764, 0.6652409557748219]
```
Example 2: Scaling a vector norm
For $x = (10^{200},\, 10^{200})$, naive evaluation of $\|x\|_2 = \sqrt{x_1^2 + x_2^2}$ gives $x_1^2 = 10^{400}$, exceeding the double-precision range ($\sim 10^{308}$), producing $\infty$. Scaling by $m = 10^{200}$, the norm

$$\|x\|_2 = m \sqrt{(x_1/m)^2 + (x_2/m)^2} = 10^{200} \sqrt{1 + 1} = \sqrt{2} \cdot 10^{200} \approx 1.414 \times 10^{200}$$

is computed correctly. The true value is below $10^{308}$ and thus representable.
Example 3: Safe evaluation via hypot
For $\sqrt{a^2 + b^2}$, use hypot(a, b), which is part of C99 and is listed among the recommended operations of IEEE 754-2008. For instance, C's hypot(1e200, 1e200) avoids the intermediate overflow and returns $\sqrt{2} \cdot 10^{200}$ directly. Similarly, std::expm1(x) ($e^x - 1$) and std::log1p(x) ($\log(1+x)$) avoid the precision loss that the naive expressions suffer for $|x| \ll 1$, where $e^x$ and $1 + x$ both round to exactly 1 and the naive result collapses to 0.
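Python's `math` module offers functions with the same names, which makes the behavior easy to try out. The input `1e-20` below is an arbitrary small value chosen for illustration.

```python
import math

print(math.hypot(1e200, 1e200))   # about 1.414e200, no intermediate overflow

x = 1e-20
print(math.exp(x) - 1.0)          # 0.0: exp(x) rounds to exactly 1.0
print(math.expm1(x))              # about 1e-20
print(math.log(1.0 + x))          # 0.0: 1.0 + x rounds to exactly 1.0
print(math.log1p(x))              # about 1e-20
```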
8. Pitfalls
- FTZ / DAZ modes: For performance, SSE/AVX code often runs with FTZ (Flush-to-Zero: subnormal results become 0) and DAZ (Denormals-Are-Zero: subnormal inputs are treated as 0) enabled, either explicitly or as a side effect of some compilers' fast-math options. This disables gradual underflow, so $x \ne y$ can yield $x - y = 0$.
- Denormal performance penalty: On many x86 processors, arithmetic involving denormals switches to microcode and slows down by tens to hundreds of times, depending on the processor generation. Real-time and DSP code often enables FTZ intentionally.
- Signed-zero comparison: $+0 = -0$ is true, but $1/(+0) = +\infty$ and $1/(-0) = -\infty$, so taking reciprocals yields $+\infty \ne -\infty$ and downstream branches diverge.
- Intermediate overflow: Even when the final result is in range, intermediate computations can overflow (e.g., $\sqrt{a^2 + b^2}$, $(a \cdot b)/c$). Reorder evaluations or use specialized functions.
- Distinct from integer overflow: Integer overflow is outside IEEE 754; signed integer overflow is undefined behavior in C/C++. This article addresses floating-point only.
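The intermediate-overflow pitfall for $(a \cdot b)/c$ can be reproduced with a few illustrative values. This is only a sketch: which evaluation order is safe depends on the magnitudes involved, and reordering can change the last bits of the result.

```python
a, b, c = 1e200, 1e200, 1e150

naive = (a * b) / c        # a * b overflows to inf before the division
reordered = a * (b / c)    # b / c is about 1e50, so the product is about 1e250

print(naive)       # inf
print(reordered)   # about 1e250
```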
Detecting overflow and underflow
IEEE 754 overflow and underflow occur silently; explicit detection requires inspecting the floating-point exception flags. C99 / C++11 expose them via <fenv.h> / <cfenv>:
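```cpp
#include <cfenv>
#include <cstdio>
#pragma STDC FENV_ACCESS ON

double dangerous_computation() { volatile double x = 1e308; return x * 10.0; }  // stand-in

int main() {
    std::feclearexcept(FE_ALL_EXCEPT);
    double r = dangerous_computation();
    if (std::fetestexcept(FE_OVERFLOW))  std::printf("overflow, r = %g\n", r);  /* +/-inf occurred */
    if (std::fetestexcept(FE_UNDERFLOW)) std::puts("underflow");                /* subnormal or 0 */
}
```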
A SIGFPE-trapping mode (feenableexcept on glibc) also exists, but it is non-portable, costly, and requires careful flag-state management across library boundaries.
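Where the exception flags are not readily accessible (Python's standard library, as far as I know, does not expose them), a coarser alternative is to inspect results after the fact. The `classify` helper below is hypothetical. It cannot tell an overflow from an infinite input, and it cannot see an underflow that produced exactly zero.

```python
import math
import sys

def classify(value):
    """Coarse after-the-fact check; not a substitute for the exception flags."""
    if math.isnan(value):
        return "nan"
    if math.isinf(value):
        return "overflow or infinite input"
    if value != 0.0 and abs(value) < sys.float_info.min:
        return "subnormal (gradual underflow)"
    return "normal or zero"

print(classify(1e308 * 10))               # overflow or infinite input
print(classify(sys.float_info.min / 4))   # subnormal (gradual underflow)
print(classify(1e-200 * 1e-200))          # normal or zero: underflow to 0 is invisible
```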
9. Frequently Asked Questions
Q1. What is overflow?
Overflow occurs when the absolute value of an arithmetic result exceeds the maximum representable floating-point number. In IEEE 754, the result becomes $\pm\infty$. The maximum double precision value is approximately $1.8 \times 10^{308}$.
Q2. What is underflow?
Underflow occurs when the absolute value of a nonzero result is smaller than the minimum normalized number. IEEE 754 gradual underflow (denormalized numbers) reduces precision gradually but avoids sudden flushing to zero.
Q3. How can overflow and underflow be avoided?
Key methods include log-scale computation (log-sum-exp), formula transformations, scaling/normalization, and extended precision arithmetic.
10. References
- Wikipedia, "Integer overflow" (English; see also the floating-point section)
- Wikipedia, "Arithmetic underflow" (English)
- Wikipedia, "Denormal number" (English)
- D. Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic," ACM Computing Surveys, vol. 23, no. 1, pp. 5--48, 1991.
- N. J. Higham, Accuracy and Stability of Numerical Algorithms, 2nd ed., SIAM, 2002 (§2.7 analyses overflow/underflow; §27 covers safe primitives such as hypot).
- IEEE 754-2019, IEEE Standard for Floating-Point Arithmetic, IEEE, 2019.
Implementation in sangi
The multi-precision floating-point class sangi Float stores the exponent in a 64-bit integer, so it is not constrained by the $\pm 10^{308}$ range of IEEE 754 double precision and can represent magnitudes up to about $10^{10^{18}}$. Most of the "intermediate overflow" cases discussed here are eliminated naturally by carrying the computation through in multi-precision.
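The general effect of a wider exponent range can be illustrated without sangi. The sketch below is only an analogy and does not use or represent the sangi API: it relies on Python's standard `decimal` module, whose default context permits exponents far beyond $10^{308}$, to evaluate the naive norm formula from Example 2 without scaling.

```python
from decimal import Decimal

# Decimal's default context allows exponents far beyond 10**308,
# so squaring 1e200 does not overflow here.
x = Decimal("1e200")
norm = (x * x + x * x).sqrt()

print(x * x)         # 1E+400
print(float(norm))   # about 1.414e200
```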