Retry-as-Recovery

A Derived Failure Pattern of Temporal Assumptions and Hidden Side Effects

Summary

Retry-as-Recovery is a Failure Pattern in which retry becomes the de facto sole means of recovering from failures, and the assumptions the system makes about time, state, and side effects collapse.

This Pattern does not question whether retry itself is an appropriate technique. It addresses the structure in which, in an environment where partial failures are normal, individually rational decisions to "just re-execute for now" accumulate and lead to unrecoverable states.

Context

In distributed systems and external API integration, temporary failures are unavoidable.

Network errors, timeouts, and conflicts often resolve on their own with time, so retry is commonly adopted as an effective countermeasure.

Forces

The main dynamics that generate this Pattern are as follows:

Generalization of temporary failures
Because many failures are transient, the expectation that a retry will succeed forms easily.
Deferral of recovery design
Rather than designing explicit recovery strategies, introducing retry looks cheaper in the short term.
Invisibility of side effects
Impact from re-execution does not surface immediately, and problems appear later.
Unclear boundaries
The range that can be safely re-executed is not shared.

Failure Mode

When retry is used as a substitute for recovery, assumptions about time and state no longer hold.

As a result, the following kinds of breakage progress simultaneously:

The same processing is executed multiple times
Processing not guaranteed to be idempotent is re-executed, and data duplication or inconsistency occurs.
Delays expand side effects
Processing is delayed by retry, and other state changes intervene during that time.
Failure causes become unclear
The original failure and the outcome after retry become mixed together, making problem isolation difficult.
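
The first breakage above can be illustrated with a minimal sketch. It assumes a hypothetical `Ledger` whose `charge` commits its side effect before the response is lost, and a hypothetical `blind_retry` helper that re-executes on any error; neither name comes from a real library.

```python
class LostResponse(Exception):
    """Hypothetical error: the request was applied, but the reply never arrived."""


class Ledger:
    """Hypothetical non-idempotent service used only for illustration."""

    def __init__(self):
        self.charges = []
        self._calls = 0

    def charge(self, order_id, amount):
        self._calls += 1
        # The side effect is committed before the caller learns the outcome.
        self.charges.append((order_id, amount))
        if self._calls == 1:
            # Simulate a timeout on the first call only.
            raise LostResponse("timed out waiting for response")
        return "ok"


def blind_retry(operation, attempts=3):
    """Re-execute on any error, with no knowledge of state or side effects."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as error:
            last_error = error
    raise last_error


ledger = Ledger()
result = blind_retry(lambda: ledger.charge("order-1", 100))
print(result, ledger.charges)
# prints: ok [('order-1', 100), ('order-1', 100)]
```

The caller sees a clean "ok", while the ledger holds two charges for one order. The original failure is no longer visible in the result, which is how duplication and unclear failure causes arise together.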

Consequences

Double execution and duplicate processing occur frequently
(Part I: What Breaks — Time / State)
Non-reproducible failures increase
(Part I: What Breaks — Time / Operation)
Recovery time becomes unpredictable
(Part II: Why It Breaks — Measurement Gap)
Outcomes that "succeeded, but it is unclear whether they were correct" increase
(Part II: Why It Breaks — Context Erosion)

Countermeasures

The following is not a list of solutions, but a set of counter-patterns that shift the dynamics behind the Failure Mode with minimal intervention.

Make explicit the range that can be re-executed and the boundaries beyond which re-execution must not happen
Position retry as one part of a recovery strategy, and define its success conditions and failure conditions separately
Treat the fact that retry occurred as a learnable event
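
The three counter-patterns above can be sketched together as follows. This is one possible shape under stated assumptions, not a definitive implementation: `TransientError`, `PermanentError`, `RetryEvent`, `IdempotentLedger`, and `run_with_recovery` are all hypothetical names introduced for illustration, and backoff between attempts is omitted for brevity.

```python
from dataclasses import dataclass, field


class TransientError(Exception):
    """Hypothetical: a failure that may resolve with time (e.g. a timeout)."""


class PermanentError(Exception):
    """Hypothetical: a failure that re-execution cannot fix."""


@dataclass
class RetryEvent:
    """A record of one retry, kept so that retries can be reviewed later."""
    operation: str
    attempt: int
    error: str


@dataclass
class IdempotentLedger:
    """Hypothetical service that applies a charge at most once per key."""
    applied: dict = field(default_factory=dict)
    calls: int = 0

    def charge(self, key, amount):
        self.calls += 1
        if key not in self.applied:
            self.applied[key] = amount
        if self.calls == 1:
            # Simulate a lost response after the side effect was applied.
            raise TransientError("response lost")
        return self.applied[key]


def run_with_recovery(name, operation, *, safe_to_retry, max_attempts, events):
    """Retry only inside an explicit boundary and within a fixed budget."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError as error:
            if not safe_to_retry or attempt == max_attempts:
                # Failure condition: hand over to another recovery strategy.
                raise
            # Record the retry instead of letting it disappear.
            events.append(RetryEvent(name, attempt, str(error)))
        # PermanentError is not caught here, so it is never retried.


ledger = IdempotentLedger()
events = []
amount = run_with_recovery(
    "charge",
    lambda: ledger.charge("order-1", 100),
    safe_to_retry=True,  # the caller declares the re-executable range
    max_attempts=3,
    events=events,
)
print(amount, ledger.applied, len(events))
# prints: 100 {'order-1': 100} 1
```

In this sketch the boundary is declared by the caller through `safe_to_retry`, the failure conditions (a permanent error, an unsafe operation, or an exhausted budget) end the retry loop explicitly, and each retry leaves a `RetryEvent` that can be examined afterward.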

Resulting Context

Retry continues to be used, but as a strategy reserved for limited situations.

Recovery no longer depends on re-execution alone; it is treated as a design decision that accounts for state and side effects.

As a result, failures become controllable, and the understanding of time and state is restored.

Appendix: Conceptual References

Information Hiding & Boundaries
Background for structures in which side effects leak across time because the boundaries of re-executability are not made explicit.
Systems Thinking & Constraints
Background for dynamics in which a local success (re-execution) impedes overall recovery.
Feedback, Measurement & Learning
Background for structures in which nothing is learned from retries and knowledge of failure causes does not accumulate.

Appendix: References

David L. Parnas, On the Criteria To Be Used in Decomposing Systems into Modules, 1972.
Fred Brooks, No Silver Bullet—Essence and Accidents of Software Engineering, 1987.
Donella H. Meadows, Thinking in Systems: A Primer, 2008.
W. Edwards Deming, Out of the Crisis, 1982.
