Error handling

Temporal automatically retries failed Activities and recovers from infrastructure failures through Durable Execution. But not all failures should be retried. This page covers how to categorize failures, when to mark errors as non-retryable, and how to handle failures that retries cannot resolve.

For background on how Temporal represents and propagates failures, see Application failures.

Categorize failures

When an operation fails, the appropriate response depends on the nature of the failure. Failures fall into three categories based on whether retrying can resolve them.

Transient failures

A transient failure is a one-off event that resolves on its own without intervention. For example, a Worker happens to make a network request at the exact moment an administrator replaces a network cable. The cause is unlikely to affect future requests.

Transient failures are resolved by retrying the operation shortly after the failure. Temporal's default Retry Policy handles transient failures automatically.

Intermittent failures

An intermittent failure is one that recurs but resolves over time. For example, a service that uses rate limiting will reject requests once the threshold is reached, but will accept requests again after the rate limiter resets.

Intermittent failures require retries spaced out over a longer period. Configure your Retry Policy with an appropriate backoffCoefficient and maximumInterval to avoid overwhelming the failing service.
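
To see how these two settings shape the retry schedule, the following sketch computes the wait before each retry. It is plain Python rather than Temporal SDK code. The function name `retry_intervals` and its parameters are hypothetical, and it assumes the usual exponential rule in which each interval is the previous one multiplied by the backoff coefficient, capped at the maximum interval.

```python
def retry_intervals(initial_interval, backoff_coefficient, maximum_interval, retries):
    """Return the wait (in seconds) before each of the first `retries` retries.

    Assumed rule: interval = initial_interval * backoff_coefficient ** n,
    capped at maximum_interval, where n is the zero-based retry index.
    """
    intervals = []
    for n in range(retries):
        interval = initial_interval * backoff_coefficient ** n
        intervals.append(min(interval, maximum_interval))
    return intervals


# Illustrative values only: the waits double until they reach the cap.
schedule = retry_intervals(
    initial_interval=1.0,
    backoff_coefficient=2.0,
    maximum_interval=10.0,
    retries=6,
)
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 10.0, 10.0]
```

A larger coefficient or a higher cap spreads the same number of retries over a longer window, which gives a rate limiter more time to reset.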

Permanent failures

A permanent failure is one that will recur indefinitely until the cause is fixed. For example, a request that fails due to an invalid email address will continue to fail no matter how many times the operation retries. The only resolution is to correct the email address.

Permanent failures cannot be resolved through retries. They require different input data, a code fix, or some external intervention. Mark these errors as non-retryable to fail fast instead of consuming resources on retries that will not succeed.

Mark permanent errors as non-retryable

When your code detects a permanent failure, mark the error as non-retryable to prevent unnecessary retry attempts. For background on what Application Failures are and how the non_retryable flag works, see Application Failure.

Use non-retryable errors for situations like:

Invalid input data: A malformed email address, a negative payment amount, or a missing required field.
Business rule violations: A customer outside the service area, an order exceeding credit limits, or an expired promotion code.
Authorization failures: The caller does not have permission to perform the operation.
Data validation errors: A referenced record does not exist, or data fails integrity checks.

There are two ways to mark errors as non-retryable:

In the Activity (implementer decides): Set the non_retryable flag when throwing an Application Failure. This enforces the constraint for all callers. Use this when the Activity implementer knows that the error can never be resolved through retries.

In the Retry Policy (caller decides): Add the error type to the Retry Policy's list of non-retryable error types. This lets different Workflows make different decisions about the same Activity. Use this when the decision depends on the caller's business logic.
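
The two approaches can be sketched as follows. This is a simplified model in plain Python, not Temporal SDK code: the `ApplicationFailure` class, the `send_welcome_email` Activity body, the `should_retry` helper, and the error type names are all stand-ins invented for illustration.

```python
class ApplicationFailure(Exception):
    """Stand-in for an SDK failure type; not the real Temporal class."""

    def __init__(self, message, error_type, non_retryable=False):
        super().__init__(message)
        self.error_type = error_type
        self.non_retryable = non_retryable


def send_welcome_email(address):
    """Hypothetical Activity body: the implementer marks bad input as permanent."""
    if "@" not in address:
        raise ApplicationFailure(
            f"invalid email address: {address!r}",
            error_type="InvalidEmail",
            non_retryable=True,  # implementer decides: never retry
        )
    return f"sent to {address}"


def should_retry(failure, non_retryable_error_types=()):
    """Simplified retry decision combining both mechanisms."""
    if failure.non_retryable:
        return False  # set in the Activity, applies to every caller
    if failure.error_type in non_retryable_error_types:
        return False  # listed in this caller's retry policy
    return True


try:
    send_welcome_email("not-an-address")
except ApplicationFailure as failure:
    invalid_email = failure

# A failure the implementer left retryable; one caller opts out of retrying it.
out_of_stock = ApplicationFailure("item out of stock", error_type="OutOfStock")

print(should_retry(invalid_email))                  # False
print(should_retry(out_of_stock))                   # True
print(should_retry(out_of_stock, ["OutOfStock"]))   # False
```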

Preserve retryability when wrapping errors

When an Activity returns an error, the SDK checks the outermost error type to determine retryability. If you catch a non-retryable Application Failure and re-throw it wrapped in a generic language error, the non_retryable flag is lost and the Activity will be retried.

To add context to an error while preserving its retry behavior, wrap it in another Application Failure with the same non_retryable flag. Do not wrap Application Failures in generic language errors.
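
The difference between the two kinds of wrapping can be sketched in plain Python. The `ApplicationFailure` class, `is_retryable`, and `add_context` below are hypothetical stand-ins, not Temporal SDK APIs, and the check is reduced to inspecting only the outermost error, as described above.

```python
class ApplicationFailure(Exception):
    """Stand-in for an SDK failure type; not the real Temporal class."""

    def __init__(self, message, non_retryable=False):
        super().__init__(message)
        self.non_retryable = non_retryable


def is_retryable(outermost_error):
    """Simplified check: only the outermost error is inspected."""
    return not getattr(outermost_error, "non_retryable", False)


def add_context(message, cause):
    """Wrap `cause` in a new failure that copies its non_retryable flag."""
    wrapped = ApplicationFailure(message, non_retryable=cause.non_retryable)
    wrapped.__cause__ = cause  # keep the original error in the chain
    return wrapped


original = ApplicationFailure("card number failed validation", non_retryable=True)

# Wrapping in a generic error hides the flag, so the check allows a retry.
generic = RuntimeError("payment step failed")
generic.__cause__ = original

# Wrapping in another failure with the same flag keeps the retry behavior.
preserved = add_context("payment step failed", original)

print(is_retryable(generic))    # True: flag lost
print(is_retryable(preserved))  # False: flag preserved
```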

For a detailed explanation of how the SDK-to-server chain works, see The outermost error type determines retryability.

Use non-retryable errors sparingly

In most cases, let the Retry Policy handle retry limits through timeouts and maximum attempts. Reserve non_retryable for cases where retrying is guaranteed to be futile.

For SDK-specific syntax and code examples, see the error handling guide for your language.

Design Activities for idempotence

Activities may execute more than once due to retries, so design them to be idempotent: an idempotent Activity produces the same result whether it executes once or multiple times.

This is especially important because of an edge case in distributed systems. A Worker can execute an Activity, complete it, and then crash before reporting the result to the Temporal Service. The Activity is retried even though it completed, because the Service has no record of the completion.

Use idempotency keys to prevent duplicate operations. Combine the Workflow Run ID and Activity ID for a value that is consistent across retries but unique across Workflow Executions.
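
A minimal sketch of this idea in plain Python follows. The `PaymentService` class and the ID values are made up for illustration; in a real Activity the Workflow Run ID and Activity ID would come from the SDK's Activity context, and the deduplication would happen in the downstream service or its datastore.

```python
def idempotency_key(workflow_run_id, activity_id):
    """Build a key that is stable across retries of the same Activity."""
    return f"{workflow_run_id}-{activity_id}"


class PaymentService:
    """Hypothetical downstream service that deduplicates by idempotency key."""

    def __init__(self):
        self.charges = {}  # key -> amount charged

    def charge(self, key, amount):
        if key in self.charges:
            return "duplicate ignored"
        self.charges[key] = amount
        return "charged"


service = PaymentService()
key = idempotency_key("run-123", "activity-7")  # made-up IDs

first = service.charge(key, 50)
# Simulate a retry after the Worker crashed before reporting completion.
second = service.charge(key, 50)

print(first, second, len(service.charges))  # charged duplicate ignored 1
```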

Implement compensation with the Saga pattern

When a multi-step process fails partway through, previous steps may need to be undone. The Saga pattern coordinates a sequence of operations where each step has a compensating action that reverses its effects. If any step fails, the compensating actions for previously completed steps execute in reverse order.
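
The control flow described above can be sketched in a few lines of plain Python. This is not Temporal SDK code: `run_saga`, the trip-booking steps, and the log entries are hypothetical, and a real Workflow would run Activities where this sketch calls local functions.

```python
def run_saga(steps):
    """Run (action, compensation) pairs; undo completed steps on failure.

    Returns True if every action succeeded, False if the saga was rolled back.
    """
    completed = []  # compensations for steps that finished
    for action, compensation in steps:
        try:
            action()
        except Exception:
            # Undo completed steps, most recent first.
            for compensate in reversed(completed):
                compensate()
            return False
        completed.append(compensation)
    return True


log = []


def fail_payment():
    raise RuntimeError("payment declined")


# Hypothetical trip-booking steps; each lambda records what it did.
steps = [
    (lambda: log.append("book flight"), lambda: log.append("cancel flight")),
    (lambda: log.append("book hotel"), lambda: log.append("cancel hotel")),
    (fail_payment, lambda: log.append("refund payment")),
]

succeeded = run_saga(steps)
print(succeeded, log)
# False ['book flight', 'book hotel', 'cancel hotel', 'cancel flight']
```

This sketch registers a compensation only after its action succeeds. If an action can partially take effect before failing, registering the compensation before running the action is a common alternative.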

For SDK-specific implementations with working code examples, see:

Python Saga pattern
