Retries feel harmless in isolation. One request fails, you try again. But in aggregate, retries are how small failures turn into big ones.
When a dependency slows down, timeouts fire and clients retry. Queues grow, workers surge, and the dependency slows further. You have created a feedback loop—a “thundering herd” where the herd is made of your own systems.
The Failure Spiral
Retries do not create capacity; they create load. The pattern is predictable and lethal:
- Latency Spikes: Callers hit their timeouts.
- Synchronized Retries: Callers try again, usually all at once.
- Load Multiplication: The dependency receives more traffic precisely when it is least able to handle it.
- Queue Saturation: Work arrives faster than it completes, so queues fill while workers sit blocked on slow calls, and every retried task adds to the backlog.
- Total Collapse: Everything fails, including services that were healthy a minute ago.
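To put a number on "retries create load," here is a rough sketch of the arithmetic. It assumes every attempt fails independently with the same probability, which is a simplification (real failures are correlated), and the function name `expected_attempts` is made up for illustration.

```python
def expected_attempts(failure_rate: float, max_attempts: int) -> float:
    """Average calls sent per request under a simple independence assumption.

    Each attempt is assumed to fail with probability `failure_rate`, and the
    caller gives up after `max_attempts` tries.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # Attempt k+1 only happens if the first k attempts all failed.
    return sum(failure_rate ** k for k in range(max_attempts))


if __name__ == "__main__":
    for rate in (0.0, 0.5, 1.0):
        print(rate, expected_attempts(rate, max_attempts=3))
```

The healthier the dependency, the closer the multiplier sits to 1. When everything is failing, each request costs the full `max_attempts` calls, which is exactly when the dependency can least afford it.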
What to Bound
If you want retries, you need limits that hold under stress. During an incident, “try again” becomes “try harder.” You must bound these four areas:
1. Bound the Attempts
Have a maximum count. Infinite retries are not resilience; they are denial.
2. Bound the Time
Set a total time budget per operation. If the budget is spent, stop. Otherwise, you turn a slow dependency into a thread pool starvation problem.
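A minimal sketch of the first two bounds working together might look like this. The names `call_with_bounds` and `RetryBudgetExceeded` and the default values are assumptions for illustration, not a recommendation. The sketch also catches every exception, where a real caller would retry only errors it knows to be transient.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryBudgetExceeded(Exception):
    """Raised when the attempt count or the time budget runs out."""


def call_with_bounds(
    op: Callable[[], T],
    max_attempts: int = 3,
    budget_s: float = 2.0,
    pause_s: float = 0.1,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    deadline = clock() + budget_s
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except Exception as exc:  # assumption: every failure is retryable
            last_error = exc
        # Stop when attempts are used up, or when pausing would spend the budget.
        if attempt == max_attempts or clock() + pause_s >= deadline:
            break
        sleep(pause_s)
    raise RetryBudgetExceeded(f"gave up after {attempt} attempt(s)") from last_error
```

Note that the budget is only checked between attempts. It does not interrupt a call that is already hanging, so each call still needs its own timeout.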
3. Bound Concurrency
Limit how many in-flight calls you allow to a dependency. Use breakers, bulkheads, or per-dependency pools so that retries can neither exhaust the caller nor flood the dependency.
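One way to cap in-flight calls is a small bulkhead built on a semaphore. This is a sketch, not a full implementation: the class names `Bulkhead` and `BulkheadFull` are hypothetical, and it rejects immediately when full rather than waiting for a slot.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFull(Exception):
    """Raised when the dependency already has the maximum calls in flight."""


class Bulkhead:
    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def call(self, op: Callable[[], T]) -> T:
        # Non-blocking acquire: fail fast instead of queueing behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("too many in-flight calls")
        try:
            return op()
        finally:
            self._slots.release()
```

Failing fast is a deliberate choice here. A caller that blocks waiting for a slot is just another hidden queue.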
4. Bound the Queue
Queues are stored latency and stored load. Cap your queue depth or age. When you hit the limit, choose an explicit behavior: drop, shed, or degrade.
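As a sketch of what capping depth and age could look like, the queue below refuses new work when full and discards stale work when read. `BoundedQueue` and its parameters are illustrative names, and the class is single-threaded as written. A shared queue would need a lock.

```python
import time
from collections import deque
from typing import Any, Callable, Optional


class BoundedQueue:
    def __init__(self, max_depth: int, max_age_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._items: deque = deque()
        self._max_depth = max_depth
        self._max_age_s = max_age_s
        self._clock = clock
        self.dropped_stale = 0

    def offer(self, item: Any) -> bool:
        """Enqueue if there is room; return False to tell the caller to shed."""
        if len(self._items) >= self._max_depth:
            return False
        self._items.append((self._clock(), item))
        return True

    def poll(self) -> Optional[Any]:
        """Return the oldest item that is still fresh, or None if there is none."""
        while self._items:
            enqueued_at, item = self._items.popleft()
            if self._clock() - enqueued_at <= self._max_age_s:
                return item
            self.dropped_stale += 1  # too old to be worth doing
        return None
```

Returning `False` from `offer` pushes the decision back to the producer, which is where "drop, shed, or degrade" has to be chosen.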
Back Off Like You Mean It
If retries are synchronized, they amplify spikes. Use exponential backoff with jitter. The goal isn’t just to be “polite”; it is to break synchronization across your fleet so that callers don’t all strike at once.
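A common way to do this is the "full jitter" variant: pick a random delay between zero and an exponentially growing ceiling. The sketch below assumes that variant; the function name and the default `base_s` and `cap_s` values are placeholders, not tuned numbers.

```python
import random


def backoff_delay(attempt: int, base_s: float = 0.1, cap_s: float = 10.0,
                  rng=random) -> float:
    """Seconds to wait before retry number `attempt` (1-based)."""
    if attempt < 1:
        raise ValueError("attempt must be at least 1")
    # Clamp the exponent so a very large attempt number cannot overflow.
    exponent = min(attempt - 1, 30)
    ceiling = min(cap_s, base_s * (2 ** exponent))
    # Randomizing over the whole range spreads callers apart in time.
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, round(backoff_delay(n), 3))
```

Without the random draw, every caller that failed at the same moment would come back at the same moment, only later.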
Know What to Drop
Dropping work is uncomfortable because it is explicit, but it is better than a total outage. Do not retry:
- Non-idempotent operations that cannot be safely deduped.
- Stale work that will be irrelevant by the time it completes.
- Low-priority tasks that can be replaced by a fallback like “last-known-good.”
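The list above can be written down as an explicit check that runs before any retry. The `Task` fields here are hypothetical; a real system would derive them from its own request metadata.

```python
from dataclasses import dataclass


@dataclass
class Task:
    safe_to_repeat: bool   # idempotent, or safely dedupable downstream
    expires_at: float      # time after which the result is irrelevant
    low_priority: bool
    has_fallback: bool     # e.g. a last-known-good value is available


def should_retry(task: Task, now: float) -> bool:
    if not task.safe_to_repeat:
        return False  # repeating it could apply the effect twice
    if now >= task.expires_at:
        return False  # stale: nobody will use the answer
    if task.low_priority and task.has_fallback:
        return False  # serve the fallback instead of adding load
    return True
```

Making the decision a function means dropped work shows up in code review and metrics instead of happening by accident.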
How Breakers Stop the Spiral
Circuit breakers don’t make the dependency healthy; they stop you from piling on. A breaker does two jobs:
- It stops callers from generating load when failure is evident.
- It forces you to pick a fallback behavior instead of “retry until it works.”
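Both jobs show up in even a very small breaker. The sketch below is a simplified, single-threaded illustration with made-up names and thresholds. Production breakers usually track failure rates over a window and guard the trial call with a lock.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    def call(self, op: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                return fallback()  # open: send no load to the dependency
            # Otherwise let one trial call through (half-open).
        try:
            result = op()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The `fallback` argument is required on purpose: the caller cannot use the breaker without deciding what happens when the dependency is unavailable.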
The Practical Rule: Retries should be a last resort behind constraints, not a default comfort blanket. If you do not have bounds, you do not have a retry strategy—you have an incident multiplier.