Retry-as-Recovery

A Derived Failure Pattern of Temporal Assumptions and Hidden Side Effects

Summary

Retry-as-Recovery is a Failure Pattern in which retry becomes the de facto sole means of recovering from failures, and the assumptions the system makes about time, state, and side effects break down as a result.

This Pattern does not question whether retry is a sound technique in itself. It addresses the structure in which, in an environment where partial failures are normal, individually rational decisions to "just re-execute for now" accumulate until the system reaches an unrecoverable state.


Context

In distributed systems and external API integrations, transient failures are unavoidable.

Network errors, timeouts, and conflicts often resolve on their own with time, so retry is widely adopted as an effective countermeasure.

Forces

The main dynamics that generate this Pattern are as follows:

  • Generalization of temporary failures
    Because many failures are transient, expectations that retrying will succeed are easily formed.

  • Deferral of recovery design
    Rather than designing explicit recovery strategies, introducing retry looks cheaper in the short term.

  • Invisibility of side effects
    The impact of re-execution does not surface immediately; problems appear only later.

  • Unclear boundaries
    The range that can be safely re-executed is not shared.

Failure Mode

When retry is used as a substitute for recovery, assumptions about time and state no longer hold.

As a result, the following forms of breakage proceed simultaneously:

  • The same processing is executed multiple times
    Processing that is not guaranteed to be idempotent is re-executed, causing duplicated or inconsistent data.

  • Delays expand side effects
    Retries delay processing, and other state changes intervene in the meantime.

  • Failure causes become unclear
    The original failure and the outcome after retry blend together, making the problem hard to isolate.
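
The first form of breakage above can be reproduced in a minimal sketch. Everything in it is hypothetical: FlakyLedger, charge, and naive_retry are names invented for illustration, and the simulated timeout stands in for a response that is lost after the remote side has already applied the change.

```python
class FlakyLedger:
    """Hypothetical service: applies the charge, then loses the first response."""

    def __init__(self):
        self.charges = []                  # side effects that actually happened
        self._lose_next_response = True

    def charge(self, amount):
        self.charges.append(amount)        # the side effect is applied first
        if self._lose_next_response:
            self._lose_next_response = False
            raise TimeoutError("response lost after the charge was applied")
        return "ok"


def naive_retry(operation, attempts=3):
    """Re-execute on any timeout, with no knowledge of what already happened."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except TimeoutError as exc:
            last_error = exc
    raise last_error


ledger = FlakyLedger()
result = naive_retry(lambda: ledger.charge(100))
print(result, ledger.charges)  # ok [100, 100]
```

The caller observes a single success, while the ledger holds two charges. The original timeout is no longer visible anywhere in the result, which is also how the failure cause becomes unclear.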

Consequences

  • Double execution and duplicate processing occur frequently
    (Part I: What Breaks — Time / State)

  • Non-reproducible failures increase
    (Part I: What Breaks — Time / Operation)

  • Recovery time becomes unpredictable
    (Part II: Why It Breaks — Measurement Gap)

  • Outcomes that "succeeded, but may not have been correct" increase
    (Part II: Why It Breaks — Context Erosion)

Countermeasures

The following is not a list of solutions but a set of counter-patterns that shift the dynamics of the Failure Mode with minimal intervention.

  • Make explicit which range of processing may be re-executed and which boundaries re-execution must not cross
  • Position retry as one part of the recovery strategy, and define success conditions and failure conditions separately
  • Treat each occurrence of a retry as an event to learn from
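
One way the three counter-patterns might look in code is sketched below. This is an illustration under stated assumptions, not a prescribed implementation: RetryPolicy, run_with_policy, TransientError, and PermanentError are hypothetical names, and classifying failures into exactly two kinds is a simplification made for the example.

```python
import logging
import time
from dataclasses import dataclass, field

logger = logging.getLogger("retry")


class TransientError(Exception):
    """May resolve with time; a candidate for retry."""


class PermanentError(Exception):
    """Re-execution cannot fix this; recovery must take another path."""


@dataclass
class RetryPolicy:
    retry_safe: bool            # explicit boundary: may this be re-executed?
    max_attempts: int = 3
    base_delay: float = 0.0     # seconds; zero keeps the sketch fast
    events: list = field(default_factory=list)  # retries kept for later review


def run_with_policy(name, operation, policy):
    attempt = 1
    while True:
        try:
            return operation()
        except PermanentError:
            raise  # failure condition: surfaced immediately, never retried
        except TransientError as exc:
            if not policy.retry_safe or attempt >= policy.max_attempts:
                raise  # outside the re-executable range, or budget exhausted
            # The retry itself is recorded as an event that can be studied.
            policy.events.append(
                {"operation": name, "attempt": attempt, "error": str(exc)}
            )
            logger.warning("retrying %s after attempt %d: %s", name, attempt, exc)
            time.sleep(policy.base_delay * 2 ** (attempt - 1))
            attempt += 1


calls = {"count": 0}


def read_balance():
    """Hypothetical read: fails twice with a transient error, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("timeout")
    return 42


def send_payment():
    """Hypothetical write with side effects: not declared safe to re-execute."""
    raise TransientError("timeout")


read_policy = RetryPolicy(retry_safe=True)
balance = run_with_policy("read_balance", read_balance, read_policy)

write_policy = RetryPolicy(retry_safe=False)
try:
    run_with_policy("send_payment", send_payment, write_policy)
    payment_outcome = "completed"
except TransientError:
    payment_outcome = "escalated"

print(balance, len(read_policy.events), payment_outcome, len(write_policy.events))
# 42 2 escalated 0
```

The read is retried and leaves two recorded events behind. The write is never re-executed, so its failure reaches whatever recovery path the caller has designed for it.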

Resulting Context

Retry continues to be used, but as a strategy reserved for limited situations.

Recovery no longer depends on re-execution alone and is treated as a design decision that accounts for state and side effects.

As a result, failures become controllable, and the understanding of time and state is restored.
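
As a rough illustration of recovery that accounts for state, the sketch below inspects what actually happened before deciding whether to re-execute. It rests on assumptions that the text does not state: the hypothetical Ledger service accepts a caller-supplied request ID and offers a lookup call, and charge_with_recovery is an invented helper.

```python
class Ledger:
    """Hypothetical service that records each charge with its request ID."""

    def __init__(self):
        self.applied = []                  # (request_id, amount) pairs
        self._lose_next_response = True

    def charge(self, request_id, amount):
        self.applied.append((request_id, amount))   # side effect happens first
        if self._lose_next_response:
            self._lose_next_response = False
            raise TimeoutError("response lost after the charge was applied")
        return "ok"

    def lookup(self, request_id):
        """Return True if a charge with this request ID was already applied."""
        return any(rid == request_id for rid, _ in self.applied)


def charge_with_recovery(ledger, request_id, amount):
    try:
        ledger.charge(request_id, amount)
        return "applied"
    except TimeoutError:
        # The outcome is unknown, so check state instead of re-executing blindly.
        if ledger.lookup(request_id):
            return "confirmed-after-timeout"
        # Only re-execute once it is known that nothing was applied.
        # A second timeout propagates to the caller in this sketch.
        ledger.charge(request_id, amount)
        return "re-executed"


ledger = Ledger()
outcome = charge_with_recovery(ledger, "req-1", 100)
print(outcome, ledger.applied)  # confirmed-after-timeout [('req-1', 100)]
```

The same lost response that produces a double charge under blind re-execution now yields exactly one charge, and the outcome label records that a timeout occurred.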

See also

  • Boundary-Blind Integration
    The foundational pattern in which failures at boundaries are not localized, so re-execution tends to be overused as a means of recovery.

  • Test-Passing Illusion
    A derived pattern in which results that succeeded once after retry tend to create false confidence that recovery was correct.


Appendix: Conceptual References

  • David L. Parnas, On the Criteria To Be Used in Decomposing Systems into Modules, 1972.
  • Fred Brooks, No Silver Bullet—Essence and Accidents of Software Engineering, 1987.
  • Donella H. Meadows, Thinking in Systems: A Primer, 2008.
  • W. Edwards Deming, Out of the Crisis, 1982.