Triggering manual failovers
Temporal Cloud automatically initiates failovers when an incident or outage affects a Namespace with High Availability features. Namespace replicas duplicate data and prevent data loss during failover.
Perform a manual failover
For some users, Temporal's automated health checks and failovers don't provide sufficient nuance and control. For this reason, you can manually trigger failovers based on your own custom alerts and for testing purposes. This section explains how to trigger a failover and what to expect afterward.
Always check the replication lag before initiating a failover. Forcing a failover while replication lag is significant is more likely to roll back Workflow progress.
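If you trigger failovers from your own alerting, one way to apply this check is a small guard that refuses to proceed when the lag exceeds a limit you choose. The sketch below is illustrative only: the function name `lag_within_limit`, the 10-second limit, and the sample lag value are assumptions, and how you obtain the lag reading (for example, from your metrics system) is outside its scope.

```shell
#!/usr/bin/env bash
# Hypothetical pre-failover gate. The lag value must come from your own
# monitoring; this sketch only compares it against a limit you pick.

# Returns 0 (success) when the observed lag is within the allowed maximum.
# Both arguments are whole seconds.
lag_within_limit() {
  local lag_seconds="$1"
  local max_lag_seconds="$2"
  [ "$lag_seconds" -le "$max_lag_seconds" ]
}

# Example: 3 seconds of observed lag against a 10-second limit.
if lag_within_limit 3 10; then
  echo "lag acceptable: proceed with failover"
else
  echo "lag too high: hold the failover"
fi
```

A guard like this fits at the start of any script that triggers a failover, so a high lag reading stops the script before the command runs.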
Trigger the failover
You can trigger a failover manually using the Temporal Cloud Web UI or the tcld CLI, depending on your preference and setup.
The following table outlines the steps for each method:
Method | Instructions |
---|---|
Temporal Cloud Web UI | 1. Visit the Namespace page on the Temporal Cloud Web UI. 2. Navigate to your Namespace details page and select the Trigger a failover option from the menu. 3. After confirmation, Temporal initiates the failover. |
Temporal tcld CLI | To manually trigger a failover, run the following command in your terminal: `tcld namespace failover --namespace <namespace_id>.<account_id> --region <target_region>`. If using API key authentication with the `--api-key` flag, you must add it directly after the `tcld` command and before `namespace failover`. |
Temporal fails over the primary to the replica. When you're ready to fail back, follow the same failover instructions to move the primary back to the original region or isolation domain.
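For scripted use, the CLI command from the table can be wrapped in a small helper. This is a minimal sketch, not an official tool: the function name `trigger_failover`, the `DRY_RUN` variable, and the sample Namespace and region values are assumptions made for illustration. Only the `tcld namespace failover` command, its `--namespace` and `--region` flags, and the position of `--api-key` come from the instructions above.

```shell
#!/usr/bin/env bash
# Sketch: assemble the documented failover command.
# With DRY_RUN=1 (the default in this sketch) the command is printed, not run.
# Note that a dry run also prints the API key if you pass one.

trigger_failover() {
  local namespace="$1"    # <namespace_id>.<account_id>
  local region="$2"       # target region
  local api_key="${3:-}"  # optional third argument (hypothetical convenience)

  set -- namespace failover --namespace "$namespace" --region "$region"
  # --api-key goes directly after tcld, before the namespace subcommand.
  if [ -n "$api_key" ]; then
    set -- --api-key "$api_key" "$@"
  fi

  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "tcld $*"
  else
    tcld "$@"
  fi
}

# Placeholder values; substitute your own Namespace and target region.
trigger_failover "my-namespace.abc123" "target-region"
```

Set `DRY_RUN=0` to run the command for real. To fail back, call the helper again with the original region as the target.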
Post-failover event information
After any failover, whether triggered by you or by Temporal, an event appears in both the Temporal Cloud Web UI (on the Namespace detail page) and in your audit logs.
The audit log entry for a failover has "operation": "FailoverNamespace".
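If you export audit logs to a file, failover entries can be picked out by that operation value. The sketch below assumes one JSON object per line. That layout, the function name `failover_events`, and every field in the sample data other than "operation" are illustrative assumptions, not a description of the real audit log schema.

```shell
#!/usr/bin/env bash
# Sketch: print only the audit log lines that record a failover.
# Assumes one JSON object per line in the file given as the first argument.

failover_events() {
  # Tolerates optional whitespace around the colon.
  grep -E '"operation"[[:space:]]*:[[:space:]]*"FailoverNamespace"' "$1"
}

# Sample data: "example_field" and "SomeOtherOperation" are invented.
log_file="$(mktemp)"
cat > "$log_file" <<'EOF'
{"operation": "FailoverNamespace", "example_field": "a"}
{"operation": "SomeOtherOperation", "example_field": "b"}
EOF

failover_events "$log_file"
rm -f "$log_file"
```

For anything beyond a quick check, a JSON-aware tool is a safer choice than pattern matching on raw text.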
After failover, the replica becomes active, taking over in the isolation domain or region.
You don't need to monitor Temporal Cloud's failover response in real time. Whenever there is a failover event, users with the Account Owner and Global Admin roles automatically receive an alert email.
Returning to the primary with failbacks
After a Temporal-initiated failover, once the incident is resolved, Temporal Cloud shifts Workflow Execution processing back to the region or isolation domain that was active before the incident. This is called a "failback".
To fail back after a manually initiated failover, follow the manual failover instructions to fail over to the original primary.
Disabling Temporal-initiated failovers
When you add a replica to a Namespace, Temporal Cloud automatically fails over the Namespace to its replica during an incident or outage. This is the recommended and default option.
However, if you prefer to disable Temporal-initiated failovers and handle failovers yourself, follow these instructions:
Method | Instructions |
---|---|
Temporal Cloud Web UI | 1. Navigate to the Namespace detail page in Temporal Cloud. 2. Choose the "Disable Temporal-initiated failovers" option. |
Temporal tcld CLI | To disable Temporal-initiated failovers, run the following command in your terminal: `tcld namespace update-high-availability --namespace <namespace_id>.<account_id> --disable-auto-failover=true`. If using API key authentication with the `--api-key` flag, you must add it directly after the `tcld` command and before `namespace update-high-availability`. |
Temporal Cloud disables its health-check initiated failovers.
To restore the default behavior, unselect the option in the Web UI or change true to false in the CLI command.
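Both directions of this setting can be covered by one small wrapper. The following is a sketch under stated assumptions: the function name `set_auto_failover_disabled`, the `DRY_RUN` variable, and the sample Namespace value are illustrative. Only the `tcld namespace update-high-availability` command and its `--namespace` and `--disable-auto-failover` flags come from the instructions above.

```shell
#!/usr/bin/env bash
# Sketch: toggle Temporal-initiated failovers for one Namespace.
# With DRY_RUN=1 (the default in this sketch) the command is printed, not run.

set_auto_failover_disabled() {
  local namespace="$1"  # <namespace_id>.<account_id>
  local disabled="$2"   # "true" disables; "false" restores the default

  # Reject anything other than the two documented values.
  case "$disabled" in
    true|false) ;;
    *) echo "expected true or false, got: $disabled" >&2; return 1 ;;
  esac

  set -- namespace update-high-availability \
    --namespace "$namespace" \
    "--disable-auto-failover=$disabled"

  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "tcld $*"
  else
    tcld "$@"
  fi
}

# Restore the default behavior (placeholder Namespace value).
set_auto_failover_disabled "my-namespace.abc123" false
```

Validating the value first keeps a typo from being passed through to the CLI.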
Best practices: Workers and failovers
Enabling High Availability for Namespaces doesn't require specific Worker configuration. The process is invisible to the Workers. When a Namespace fails over to the replica, the DNS redirection orchestrated by Temporal ensures that your existing Workers continue to poll the Namespace without interruption.
When a Namespace fails over to a replica in a different region, Workers communicate with it across regions, which adds latency.
- If your application can’t tolerate this latency, deploy a second set of Workers in the replica's region or opt for a replica in the same region.
- In the case of a complete regional outage, Workers in the original region may fail alongside the original Namespace. To keep Workflows moving during this level of outage, deploy a second set of Workers to the secondary region.
Temporal Cloud enforces a maximum connection lifetime of 5 minutes. This offers your Workers an opportunity to re-resolve the DNS.
Best practices: scheduled failover testing
Microservices and external dependencies will fail at some point. Testing failovers ensures your app can handle these failures effectively. Temporal recommends regular and periodic failover testing for mission-critical applications in production. By testing in non-emergency conditions, you verify that your app continues to function, even when parts of the infrastructure fail.
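A periodic drill could be scripted as a failover to the replica, a pause, and a failover back to the original region. The sketch below is one possible shape, not a recommended procedure: the function names, the `DRY_RUN` variable, the pause length, and the sample values are all assumptions, and it treats both sides as regions. The only Temporal-specific part is the `tcld namespace failover` command with its `--namespace` and `--region` flags.

```shell
#!/usr/bin/env bash
# Sketch of a failover drill: fail over to the replica, wait, then fail back.
# With DRY_RUN=1 (the default in this sketch) each step is printed, not run.

run_step() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

failover_drill() {
  local namespace="$1"        # <namespace_id>.<account_id>
  local replica_region="$2"   # region of the replica
  local original_region="$3"  # region that is active before the drill
  local soak_seconds="$4"     # hypothetical: time to run on the replica

  # Stop at the first failing step so a failed failover is not followed
  # by a failback attempt.
  run_step tcld namespace failover --namespace "$namespace" --region "$replica_region" &&
    run_step sleep "$soak_seconds" &&
    run_step tcld namespace failover --namespace "$namespace" --region "$original_region"
}

# Placeholder values; substitute your own Namespace and regions.
failover_drill "my-namespace.abc123" "replica-region" "original-region" 600
```

Run it from whatever scheduler you already use, and check the replication lag before each failover step.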