Half-Open: The Critical Recovery Phase
Half-Open is the most delicate part of a circuit breaker’s lifecycle. It is a transition, and transitions are where systems fail in surprising ways.
If you have not read Integration Patterns, start there.
What Half-Open Is For
When a breaker is Open, Tripswitch is confident the dependency is unhealthy. Traffic is stopped. At some point, that certainty expires. You need to test whether the dependency has recovered.
Half-Open exists to answer one question: Is it safe to start sending traffic again?
It does not answer whether the dependency is healthy, whether you can resume normal traffic, or whether the incident is over. Half-Open is a probing phase, not a recovery declaration.
Why Half-Open Is Risky
The most common failure mode: the dependency goes down, the breaker trips Open, the dependency comes back, traffic resumes too quickly, and the dependency falls over again.
Half-Open exists to prevent premature recovery. But Half-Open itself introduces risk: you are intentionally sending traffic to something that was recently failing, from many instances at once, often in the middle of an incident. The goal is controlled exposure.
The Illusion of “Small” Allow Rates
Tripswitch models Half-Open using an allow rate. An allow rate of 0.1 means roughly 10% of requests are permitted through as probes. This sounds conservative, and per instance it often is.
The surprise comes from aggregation.
Allow rates apply per instance, not globally.
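If you run 100 instances, each with a 10% allow rate, you are not sending “a few” probes. The fleet as a whole forwards roughly 10% of its total request volume, and every one of those 100 instances can have probes in flight at the same moment.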
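The aggregation effect is easy to estimate with back-of-the-envelope arithmetic. The sketch below is illustrative only: `fleetProbeRate` is a hypothetical helper, not part of the Tripswitch SDK, and it assumes every instance receives the same request rate and the same allow rate.

```go
package main

import "fmt"

// fleetProbeRate estimates the aggregate probes per second that reach a
// dependency while every instance is Half-Open with the same allow rate.
// Assumption: traffic is spread evenly across instances.
func fleetProbeRate(instances int, perInstanceRPS, allowRate float64) float64 {
	return float64(instances) * perInstanceRPS * allowRate
}

func main() {
	// Hypothetical fleet: 100 instances, 50 requests/second each, allow rate 0.1.
	fmt.Printf("%.0f probes/sec\n", fleetProbeRate(100, 50, 0.1))
}
```

Running the same calculation with your own fleet size and request rate gives a rough upper bound on the load the dependency sees the moment Half-Open begins.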
What Tripswitch Does During Half-Open
Tripswitch’s responsibilities during Half-Open are intentionally limited.
Tripswitch Does
| Responsibility | What it means |
|---|---|
| Evaluate health centrally | The server determines breaker state based on collected samples |
| Choose when to enter Half-Open | Timing is controlled by cooldown and backoff settings |
| Specify an allow rate | The server tells SDKs what percentage of traffic to allow |
| Observe probe outcomes | Sample data from probes informs the Close/re-Open decision |
Tripswitch Does Not
| Responsibility | Why |
|---|---|
| Coordinate probes across instances | Would require global state and introduce cross-instance interference |
| Enforce a global probe budget | Each instance operates independently by design |
Why Tripswitch Does Not Coordinate Probes Globally
Global probe coordination sounds attractive, but it changes the system’s behavior in fundamental ways. If probe limits were shared, one instance could consume recovery capacity that another never gets, faster or closer instances would win more probes, and recovery would depend on fleet dynamics rather than local intent.
Each instance applies the same rule, makes its own decision, and cannot prevent another instance from probing. This keeps failure modes understandable and local, even if aggregate traffic is higher.
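A purely local gate of this kind can be sketched in a few lines. This is not the SDK's implementation: `allowProbe` is a hypothetical function, and the random source is passed in only so the behavior is easy to test. The point it illustrates is that the decision needs nothing except the allow rate and a local random draw, so no instance can block or starve another.

```go
package main

import (
	"fmt"
	"math/rand"
)

// allowProbe is a hypothetical stand-in for a per-instance Half-Open gate.
// It uses only local inputs: the allow rate and a random draw in [0, 1).
func allowProbe(allowRate float64, draw func() float64) bool {
	return draw() < allowRate
}

func main() {
	const allowRate = 0.1
	const total = 10000

	allowed := 0
	for i := 0; i < total; i++ {
		if allowProbe(allowRate, rand.Float64) {
			allowed++
		}
	}
	// The count varies from run to run but stays close to allowRate * total.
	fmt.Printf("allowed %d of %d requests\n", allowed, total)
}
```

Because the gate is probabilistic, the number of probes admitted in any short window fluctuates, which is one more reason to size for the aggregate rather than for a single instance.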
What You Must Reason About
During Half-Open, the SDK is doing exactly what it says it does. Most problems come from assumptions outside its scope.
Fleet Size
The more instances you run, the more probes you generate in aggregate.
Probe Cost
Not all requests are equal. Some probes are cheap health checks. Others are expensive database operations or third-party API calls.
Timeouts and Outcome Classification
Slow probes are worse than failed probes because they tie up resources and delay recovery signals, so probe requests should have tight timeouts. Outcome classification matters just as much: if your code treats “bad data” as success, Tripswitch will too. Use WithErrorEvaluator if an HTTP 200 with an invalid payload should count as a failure.
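Both ideas can be sketched with the Go standard library alone. Everything named here is an assumption made for illustration: `probe`, `classify`, `errInvalidPayload`, and the expected `id` field are hypothetical, and the sketch does not show the signature of WithErrorEvaluator, so check the SDK reference for how a classification function like this is actually wired in.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

// errInvalidPayload marks a response that arrived but cannot be trusted.
var errInvalidPayload = errors.New("invalid payload")

// classify is a hypothetical outcome check: an HTTP 200 whose body is not
// the expected JSON (here, an object with a non-empty "id") is a failure.
func classify(status int, body []byte) error {
	if status != 200 {
		return fmt.Errorf("unexpected status %d", status)
	}
	var payload struct {
		ID string `json:"id"`
	}
	if err := json.Unmarshal(body, &payload); err != nil || payload.ID == "" {
		return errInvalidPayload
	}
	return nil
}

// probe runs call under a tight deadline so a slow dependency fails fast
// instead of holding resources, then classifies whatever came back.
func probe(parent context.Context, timeout time.Duration,
	call func(context.Context) (int, []byte, error)) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	status, body, err := call(ctx)
	if err != nil {
		return err
	}
	return classify(status, body)
}

func main() {
	// Simulated slow dependency: it would answer after 2s, but honors cancellation.
	slow := func(ctx context.Context) (int, []byte, error) {
		select {
		case <-time.After(2 * time.Second):
			return 200, []byte(`{"id":"ok"}`), nil
		case <-ctx.Done():
			return 0, nil, ctx.Err()
		}
	}
	fmt.Println("slow probe:", probe(context.Background(), 50*time.Millisecond, slow))
	fmt.Println("bad payload:", classify(200, []byte(`not json`)))
}
```

The design choice worth copying is that both a timeout and a well-formed but meaningless response surface as errors, so neither is counted as evidence that the dependency has recovered.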
What Success Looks Like
A healthy Half-Open phase looks boring: a small number of probes succeed, failures are visible but contained, the breaker transitions cleanly to Closed, and traffic ramps up without a second collapse.
If Half-Open feels noisy, unpredictable, or flappy, it usually means too many probes, probes that are too expensive, or outcomes that don’t reflect reality.
What Half-Open Is Not
| Misconception | Reality |
|---|---|
| A retry mechanism | The SDK doesn’t retry. Requests that are not admitted as probes fail immediately. |
| A traffic ramp | Allow rates are probabilistic, not graduated. |
| A fairness system | No coordination between instances. |
| A guarantee of recovery | Tests can fail. That’s the point. |
The job of Half-Open is not to make recovery painless. It is to make recovery survivable.
Summary
The SDK handles probe gating locally and reports outcomes. You handle fleet sizing, probe cost, timeout configuration, and outcome classification.
If recovery feels fragile, look at what you control, not what Tripswitch controls.
Next Steps
- See Common Mistakes for concrete failure patterns
- See Tuning & Operations for guidance on allow rates and thresholds
- Revisit Integration Patterns if Half-Open behavior is surprising