Skip to main content

How replication works

Keeping services available is crucial in distributed applications, especially during failures or incidents. Temporal’s replication and automated failover features ensure high availability. By enabling High Availability replication, you allow Temporal to copy Namespace metadata and Workflow Executions to a replica. This redundancy, combined with failover capability, enhances availability during outages.

High availability

High availability ensures that a system remains operational with minimal downtime. It achieves this with redundancy and failover mechanisms that handle failures, so end-users remain unaware of incidents. Temporal Cloud guarantees this high availability with its Service Level Agreements (SLA).

By default, Temporal Cloud provides high availability to all customers, even if you do not opt into additional High Availability features. Each Namespace is automatically distributed across three availability zones, offering the 99.9% uptime outlined in our SLA.

For added resilience, you can enable a Namespace replica, which also delivers automated failovers to your deployment. Your replica can be configured within the same region ("same-region") or across separate regions ("multi-region"). The failover process automatically configures the replica to assume the primary role during an outage. It ensures that even if parts of the Temporal Service fail or become unreachable, your application continues to function.

Replication

Replication is the process of copying and synchronizing data or services across Temporal Server deployments. This ensures availability and consistency in the event of a failure. Temporal uses replication to support high availability, ensuring that Workflows and data remain available even if parts of the system fail or become unreachable.

In Temporal, replication operates at the Namespace level. Each Namespace is replicated across isolation domains or separate regions. If one Namespace becomes unavailable, a replica can take over, ensuring that Workflows continue without interruption.

Temporal Cloud replicates both Workflow Execution details and metadata, including configurations such as retention periods, Search Attributes, and other settings. All parts of the system will eventually synchronize to a consistent view of the Namespace metadata, even if the primary and its replica temporarily lose communication.

Dive deeper — Workflow replication restrictions[+]

Temporal Cloud restricts certain Workflow operations to the primary:

  • You may only update Workflows in the primary.
  • You may only dispatch Workflow Tasks and Activity Tasks from the primary. Because of this, forward progress in a Workflow Execution can only be made in the primary.

These limits mean that certain requests, such as Start Workflow and Signal Workflow, are processed by and limited to the primary. Replicas may receive API requests from Clients and Workers. They automatically forward these requests to the primary for execution.

Namespaces with High Availability features provide an “all-active” experience for Temporal users. This helps limit or eliminate downtime during Namespace failover. There's a short time window from when a replica becomes active to when Clients and Workers receive a DNS update. During this time requests forward from the now passive (formerly active) primary to the newly active (formerly passive) replica.

As Workflow Executions progress and are operated on, replication Tasks created in the primary are dispatched to the replica. Processing these replication Tasks ensures that the replica undergoes the same state transitions as the active primary. This enables replicated tasks to synchronize and achieve the same state as the original tasks.

Replicas do not distribute Workflow or Activity Tasks. Instead, they perform verification tasks to confirm that intended operations are executed so Workflows reach the desired state. This mechanism ensures consistency and reliability in the replication process.


Deployment options

Temporal deploys your primary and its replica across separate isolation domains. These domains can be in the same region or in different regions, depending on your setup. Because each region has multiple isolation domains, you can place your replica in the same region as your primary (Same-region Replication) or in another region (Multi-region Replication).

  • Same-region Replication: Temporal replicates Namespaces across isolation domains within one region. This option is a good fit when your application is built for one region and you prefer to failover within that region. This provides a reliable failover mechanism while maintaining deployment simplicity.

  • Multi-region Replication: Temporal replicates Namespaces across regions, making sure Workflows and data are available even if a region fails. Asynchronous replication means changes aren’t immediately reflected in other regions but will sync over time, ensuring data integrity. This setup allows failovers between replicas without needing immediate consistency across regions. Replication across different regions enhances resilience and reliability.

caution

Even as you adopt Temporal's High Availability features, don't forget to consider the reliability of your own infrastructure and dependencies. Issues like network outages, hardware failures, or misconfigurations in your own systems can affect your application performance.

For the highest level of reliability, distribute your dependencies across regions, and choose our Multi-region replication feature. Using physically separated regions provides fuller tolerance for your application.

DeploymentDescription
Same‑region ReplicationIsolation domains are co-located within the same region.
Multi‑region ReplicationIsolation domains are located in separate regions.

Failovers

Occasionally, a Namespace may become temporarily unavailable due to an unexpected incident. Temporal Cloud detects these issues using regular health checks.

Health checks

Temporal Cloud monitors error rates, latencies, and infrastructure problems, such as request timeouts. If it finds unhealthy conditions where indicators exceed the allowed thresholds, Temporal automatically switches the primary to the replica. In most cases, the replica is unaffected by the issue. This process is known as failover.

Automatic failovers

Failovers prevent data loss and application interruptions. Existing Workflows continue, and new Workflows start as the incident is addressed. Once the incident is resolved, Temporal Cloud performs a "failback," shifting Workflow Execution processing back to the original Namespace.

Temporal Cloud handles failovers automatically, ensuring continuity without manual intervention.

On failover, the replica becomes active and the Namespace endpoint directs access to it.

On failover, the replica becomes active and the Namespace endpoint directs access to it.

For more control over the failover process, you can disable automated failovers.

tip

You can test the failover of Namespace with High Availability features by manually triggering a failover using the UI or the 'tcld' CLI utility. In most scenarios, we recommend you let Temporal handle failovers for you.

After failover, be aware of the following points:

  • When working with Multi-region Namespaces, your CNAME may change. For example, it may switch from aws-us-west-1.region.tmprl.com to aws-us-east-1.region.tmprl.com. This change doesn't affect same-region Namespaces.

  • Your Namespace endpoint will not change. If it is my_namespace.my_account.tmprl.cloud:7233 before failover, it will be my_namespace.my_account.tmprl.cloud:7233 after failover.

The failover process

Temporal's automated failover process works as follows:

  • During normal operation, the primary asynchronously copies operations and metadata to its replica, keeping them in sync.
  • If the primary becomes unavailable, Temporal detects the issue through health checks. It automatically switches to the replica, using one of its available failover scenarios.
  • The replica takes over the active role and becomes the primary. Operations continue with minimal disruption.
  • When the original primary recovers, the roles can either switch back (failback, by default) or remain as they are, based on your Namespace settings. Automatic role switching with failover and failback minimizes downtime for consistent availability.
info

A Namespace failover, which updates the "active region" field in the Namespace record, is a metadata update. This update is replicated through the Namespace metadata mechanism.

Failover scenarios

The Temporal Cloud failover mechanism supports several modes for executing Namespace failovers. These modes include graceful failover ("handover"), forced failover, and a hybrid mode. The hybrid mode is Temporal Cloud’s default Namespace behavior.

Graceful failover (handover)

In this mode, Temporal Cloud fully processes and drains replication Tasks. Temporal Cloud pauses traffic to the Namespace before the failover. Graceful failover prevents the loss of progress and avoids data conflicts.

The Namespace experiences a short period of unavailability, defaulting to 10 seconds. During this period:

  • Existing Workflows stop progress.
  • Temporal Cloud returns a "Service unavailable error". This error is retryable by the Temporal SDKs.
  • State transitions will not happen and tasks are not dispatched.
  • User requests like start/signal Workflow are rejected.
  • Operations are paused during handover.

This mode favors consistency over availability.

Forced failover

In this mode, a Namespace immediately activates in the replica. Events not replicated due to replication lag undergo conflict resolution upon reaching the new active Namespace.

This mode prioritizes availability over consistency.

Hybrid failover mode

While graceful failovers are preferred for consistency, they aren’t always practical. Temporal Cloud’s hybrid failover mode (the default mode) limits the initial graceful failover attempt to 10 seconds or less.

During this period:

  • Existing Workflows stop progress.
  • Temporal Cloud returns a "Service unavailable error", which is retried by SDKs.

If the graceful approach doesn’t resolve the issue, Temporal Cloud automatically switches to a forced failover.

This strategy balances consistency and availability requirements.

Scenario summary

Failover ScenarioCharacteristics
Graceful failover (handover)Favors consistency over availability.
Forced failoverPrioritizes availability over consistency.
Hybrid failover modeBalances consistency and availability requirements.

Network partitions

At any time only the primary or the replica is active. The only exception occurs in the event of a network partition, when a Network splits into separate subnetworks. Should this occur, you can promote a replica to active status. Caution: This temporarily makes both regions active. After the network partition is resolved and communication between the isolation domains/regions is restored, a conflict resolution algorithm determines whether the primary or replica remains active.

tip

In traditional active/active replication, multiple nodes serve requests and accept writes simultaneously, ensuring strong synchronous data consistency. In contrast, with a Temporal Cloud Namespace with High Availability Features, only the primary accepts requests and writes at any given time. Workflow History Events are written to the primary first and then asynchronously replicated to the replica, ensuring that the replica remains in sync.

Conflict resolution

Namespaces with replicas rely on asynchronous event replication. Updates made to the primary may not immediately be reflected in the replica due to replication lag, particularly during failovers. In the event of a non-graceful failover, replication lag may cause a temporary setback in Workflow progress.

Namespaces that aren't replicated can be configured to provide at-most-once semantics for Activities execution when a retry policy's maximum attempts is set to 0. High Availability Namespaces provide at-least-once semantics for execution of Activities. Completed Activities may be re-dispatched in a newly active Namespace, leading to repeated executions.

When a Workflow Execution is updated in a newly active replica following a failover, events from the previously active Namespace that arrive after the failover can't be directly applied. At this point, Temporal Cloud has forked the Workflow History.

After failover, Temporal Cloud creates a new branch history for execution, and begins its conflict resolution process. The Temporal Service ensures that Workflow Histories remain valid and are replayable by SDKs post-failover or after conflict resolution. This capability is crucial for ensuring Workflow Executions continue forward without losing progress, and for maintaining consistency across replication, even during incidents that cause disruptions in replication.