Retry-as-Recovery
A Derived Failure Pattern of Temporal Assumptions and Hidden Side Effects
Summary
Retry-as-Recovery is a Failure Pattern in which retry becomes the de facto sole means of recovering from failures, and the assumptions about time, state, and side effects collapse as a result.
This Pattern is not about whether retry as a technique is appropriate. It addresses the structure in which, in an environment where partial failures are normal, individually rational decisions to "re-execute for now" accumulate and lead to unrecoverable states.
Context
In distributed systems and external API integration, temporary failures are unavoidable.
Network errors, timeouts, and conflicts often resolve on their own with time, so retry is commonly adopted as an effective countermeasure.
Forces
The main dynamics that generate this Pattern are as follows:
- Generalization of temporary failures
  Because many failures are transient, the expectation that retrying will succeed forms easily.
- Deferral of recovery design
  Introducing retry looks cheaper in the short term than designing explicit recovery strategies.
- Invisibility of side effects
  The impact of re-execution does not surface immediately; problems appear later.
- Unclear boundaries
  The range of processing that can be safely re-executed is not made explicit or shared.
Failure Mode
When retry is used as a substitute for recovery, assumptions about time and state no longer hold.
As a result, the following kinds of breakage progress simultaneously:
- The same processing is executed multiple times
  Processing that is not guaranteed to be idempotent is re-executed, causing data duplication or inconsistency.
- Delays expand side effects
  Retry delays processing, and other state changes intervene in the meantime.
- Failure causes become unclear
  The original failure and the outcome after retry become mixed, making the problem difficult to isolate.
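The first form of breakage can be illustrated with a minimal sketch. It assumes a hypothetical `FlakyLedger` service whose write succeeds but whose first reply is lost, and a hypothetical `charge_with_blind_retry` helper; neither name comes from a real library, and the scenario is an illustration rather than a description of any particular system.

```python
class FlakyLedger:
    """Hypothetical service: the write succeeds, but the first reply is lost."""

    def __init__(self):
        self.charges = []              # side effects that actually happened
        self._drop_next_reply = True

    def charge(self, account, amount):
        self.charges.append((account, amount))   # the side effect is applied first
        if self._drop_next_reply:
            self._drop_next_reply = False
            raise TimeoutError("reply lost after the charge was recorded")
        return "ok"


def charge_with_blind_retry(ledger, account, amount, attempts=3):
    """Re-execute on timeout without checking whether the first call took effect."""
    for attempt in range(1, attempts + 1):
        try:
            return ledger.charge(account, amount)
        except TimeoutError:
            if attempt == attempts:
                raise


ledger = FlakyLedger()
result = charge_with_blind_retry(ledger, "acct-1", 100)
print(result, ledger.charges)
```

The caller sees a single "ok", while the ledger holds two charges for the same request. The timeout said nothing about whether the side effect had happened, and the successful retry hides the original failure.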
Consequences
- Double execution and duplicate processing occur frequently
  (Part I: What Breaks — Time / State)
- Non-reproducible failures increase
  (Part I: What Breaks — Time / Operation)
- Recovery time becomes unpredictable
  (Part II: Why It Breaks — Measurement Gap)
- States of "succeeded but not sure if it was correct" increase
  (Part II: Why It Breaks — Context Erosion)
Countermeasures
The following is not a list of solutions but a set of counter-patterns that shift the dynamics behind the Failure Mode with minimal intervention.
- Make explicit which processing can be re-executed and which boundaries re-execution must not cross
- Position retry as one part of a recovery strategy, and define its success conditions and failure conditions separately
- Treat each occurrence of a retry as an event to learn from
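One way these three counter-patterns might fit together is sketched below. Every name here (`TransientError`, `IdempotentLedger`, `recover`, the `"req-42"` key) is a hypothetical chosen for illustration, and deduplicating by an idempotency key is just one possible way to mark a re-executable range, not a technique prescribed by this Pattern.

```python
from dataclasses import dataclass, field


class TransientError(Exception):
    """Hypothetical marker: the only failures declared safe to re-execute."""


@dataclass
class IdempotentLedger:
    """Hypothetical service that deduplicates requests by idempotency key."""
    charges: dict = field(default_factory=dict)
    drop_first_reply: bool = True

    def charge(self, key, account, amount):
        if key not in self.charges:        # a repeated key does not repeat the side effect
            self.charges[key] = (account, amount)
        if self.drop_first_reply:
            self.drop_first_reply = False
            raise TransientError("reply lost")
        return "ok"


def recover(operation, attempts, retry_log):
    """Retry only TransientError, record each retry, and return an explicit outcome."""
    for attempt in range(1, attempts + 1):
        try:
            return ("succeeded", operation())
        except TransientError as exc:
            # The retry itself is recorded so it can be reviewed later.
            retry_log.append({"attempt": attempt, "error": str(exc)})
    # Failure condition: retries are exhausted, so another strategy must take over.
    return ("needs_other_recovery", None)


ledger = IdempotentLedger()
retry_log = []
outcome = recover(
    lambda: ledger.charge("req-42", "acct-1", 100),
    attempts=3,
    retry_log=retry_log,
)
print(outcome, len(ledger.charges), retry_log)
```

In this sketch any exception other than `TransientError` propagates immediately, which is how the boundary that must not be re-executed is expressed. Exhausted retries come back as a distinct outcome rather than an endless loop, and `retry_log` keeps the original failure visible even though the call eventually succeeded.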
Resulting Context
Retry continues to be used, but as a strategy reserved for limited situations.
Recovery no longer depends on re-execution alone; it is treated as a design decision that accounts for state and side effects.
As a result, failures become controllable, and the understanding of time and state is restored.
See also
- Boundary-Blind Integration
  The foundational pattern in which failures at boundaries are not localized, so re-execution tends to be overused as a means of recovery.
- Test-Passing Illusion
  A derived pattern in which a single success after retry tends to create false confidence that recovery was correct.
Appendix: Conceptual References
- Information Hiding & Boundaries
  Background for structures in which the boundaries of re-executability are not made explicit, so side effects leak across the time axis.
- Systems Thinking & Constraints
  Background for dynamics in which local success (re-execution) impedes overall recovery.
- Feedback, Measurement & Learning
  Background for structures in which nothing is learned from retries and knowledge of failure causes does not accumulate.
Appendix: References
- David L. Parnas, On the Criteria To Be Used in Decomposing Systems into Modules, 1972.
- Fred Brooks, No Silver Bullet—Essence and Accidents of Software Engineering, 1987.
- Donella H. Meadows, Thinking in Systems: A Primer, 2008.
- W. Edwards Deming, Out of the Crisis, 1982.