The Temporal Global Namespace feature provides clients with the capability to continue their Workflow execution from another cluster in the event of a datacenter failover. Although you can configure a Global Namespace to be replicated to any number of clusters, it is only considered active in a single cluster.
Global Namespaces are a top-level entity in Temporal that supports replication of Workflow executions across clusters. Client applications need to run workers that poll for Activity/Decision tasks on all clusters. Temporal dispatches tasks only on the currently active cluster; workers on standby clusters sit idle until the Global Namespace is failed over.
Because Temporal provides highly consistent semantics, external events such as StartWorkflowExecution and SignalWorkflowExecution are only allowed on the active cluster. Global Namespaces rely on lightweight transactions (Paxos) on the local cluster (Local_Quorum) to update the Workflow execution state and to create replication tasks, which are applied asynchronously to replicate state across clusters. If an application makes these API calls on a cluster where the Global Namespace is in standby mode, Temporal rejects them with NamespaceNotActiveError, which contains the name of the current active cluster. It is the responsibility of the application to forward the external event to the cluster that is currently active.
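The forwarding responsibility described above can be sketched as follows. This is a minimal illustration, not SDK code: `FakeClusterClient`, `signal_with_forwarding`, and the local `NamespaceNotActiveError` class are hypothetical stand-ins, and the cluster names and Workflow ID are made up. A real application would wrap one SDK client per cluster and read the active cluster name from the error the service returns.

```python
from dataclasses import dataclass, field


class NamespaceNotActiveError(Exception):
    """Stand-in for the service error; carries the active cluster's name."""

    def __init__(self, namespace: str, active_cluster: str):
        super().__init__(
            f"namespace {namespace} is not active here; "
            f"active cluster is {active_cluster}"
        )
        self.namespace = namespace
        self.active_cluster = active_cluster


@dataclass
class FakeClusterClient:
    """Hypothetical per-cluster client used only for this illustration."""

    cluster: str
    active_cluster: str  # the cluster currently active for the namespace
    signals: list = field(default_factory=list)

    def signal_workflow(self, namespace: str, workflow_id: str, signal: str) -> str:
        # A standby cluster rejects external events and names the active one.
        if self.cluster != self.active_cluster:
            raise NamespaceNotActiveError(namespace, self.active_cluster)
        self.signals.append((workflow_id, signal))
        return self.cluster


def signal_with_forwarding(clients, local_cluster, namespace, workflow_id, signal):
    """Try the local cluster first; on rejection, retry once on the cluster
    named in the error."""
    try:
        return clients[local_cluster].signal_workflow(namespace, workflow_id, signal)
    except NamespaceNotActiveError as err:
        return clients[err.active_cluster].signal_workflow(
            namespace, workflow_id, signal
        )


clients = {
    "dc1": FakeClusterClient("dc1", active_cluster="dc2"),
    "dc2": FakeClusterClient("dc2", active_cluster="dc2"),
}
handled_by = signal_with_forwarding(
    clients, "dc1", "my-namespace-global", "order-42", "approve"
)
```

The sketch forwards only once. If a second failover happens between the rejection and the retry, the second call raises again, and the caller decides whether to retry.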
This config distinguishes namespaces that are local to a cluster from Global Namespaces. It controls whether replication tasks are created on updates, which allows state to be replicated across clusters. It is a read-only setting that can only be set when the namespace is provisioned.
A list of clusters to which the namespace can fail over, including the current active cluster. This is also a read-only setting that can only be set when the namespace is provisioned. A re-replication feature on the roadmap will allow this config to be updated to add or remove clusters.
Name of the current active cluster for the Global Namespace. This config is updated each time the Global Namespace is failed over to another cluster.
A unique failover version that also identifies the current active cluster for the Global Namespace. Temporal allows failover to be triggered from any cluster, so the failover version is designed to avoid conflicts if failover is mistakenly triggered simultaneously on two clusters.
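One way a version number can both identify a cluster and avoid collisions is sketched below. It assumes each cluster is assigned a unique initial version and all clusters share a fixed increment, so that the version modulo the increment maps back to a cluster. The constants and function names are illustrative assumptions, not values or APIs taken from this document.

```python
FAILOVER_VERSION_INCREMENT = 10  # assumed: shared by all clusters
INITIAL_FAILOVER_VERSION = {"dc1": 1, "dc2": 2}  # assumed: unique per cluster


def next_failover_version(current_version: int, target_cluster: str) -> int:
    """Return the smallest version greater than current_version whose
    remainder modulo the increment equals the target cluster's initial version."""
    initial = INITIAL_FAILOVER_VERSION[target_cluster]
    base = current_version - current_version % FAILOVER_VERSION_INCREMENT
    candidate = base + initial
    if candidate <= current_version:
        candidate += FAILOVER_VERSION_INCREMENT
    return candidate


def cluster_for_version(version: int) -> str:
    """Map a failover version back to the cluster it represents."""
    remainder = version % FAILOVER_VERSION_INCREMENT
    for cluster, initial in INITIAL_FAILOVER_VERSION.items():
        if initial == remainder:
            return cluster
    raise ValueError(f"no cluster owns failover version {version}")


# The namespace starts active in dc1 with version 1.
to_dc2 = next_failover_version(1, "dc2")       # fail over dc1 -> dc2
back_to_dc1 = next_failover_version(to_dc2, "dc1")  # fail back dc2 -> dc1

# Two failovers triggered at once from the same starting version land on
# different version numbers, so they can be ordered deterministically.
concurrent = {
    target: next_failover_version(1, target) for target in ("dc1", "dc2")
}
```

Under these assumptions, two clusters can never produce the same version, because their remainders differ. Concurrent failovers therefore yield distinct, comparable versions instead of a tie.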
Unlike local namespaces, which provide at-most-once semantics for Activity execution, Global Namespaces can only support at-least-once semantics. Temporal XDC relies on asynchronous replication of events across clusters, so after a failover an Activity may be dispatched again on the new active cluster because of replication task lag. This also means that once the new cluster updates a Workflow execution after a failover, any earlier replication tasks for that execution can no longer be applied. As a result, some progress made by the Workflow execution in the previous active cluster is lost. During this conflict resolution, Temporal re-injects any external events, such as Signals, into the new history before discarding the replication tasks. Even though some progress can roll back during failovers, Temporal guarantees that Workflows will not get stuck and will continue to make forward progress.
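Because an Activity can be dispatched more than once under at-least-once semantics, Activity code usually benefits from being idempotent. The sketch below shows one common approach, assuming the downstream system can deduplicate on a key the application supplies. `PaymentLedger`, `charge_activity`, and the order ID are hypothetical names for illustration only.

```python
class PaymentLedger:
    """Hypothetical downstream system that records one charge per key."""

    def __init__(self):
        self._receipts = {}

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        # A repeated key returns the earlier receipt instead of charging again.
        if idempotency_key in self._receipts:
            return self._receipts[idempotency_key]
        receipt = f"receipt-{len(self._receipts) + 1}"
        self._receipts[idempotency_key] = receipt
        return receipt

    def charge_count(self) -> int:
        return len(self._receipts)


def charge_activity(ledger: PaymentLedger, order_id: str, amount_cents: int) -> str:
    """Activity body: derive the key from a business identifier that stays
    the same no matter which cluster dispatches the Activity."""
    return ledger.charge(f"charge:{order_id}", amount_cents)


ledger = PaymentLedger()
first = charge_activity(ledger, "order-42", 1999)
# Simulated redispatch on the new active cluster after a failover.
second = charge_activity(ledger, "order-42", 1999)
```

The key is built from a business identifier rather than from anything tied to a particular dispatch, so a repeated execution maps to the same record.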
All Visibility APIs are allowed on both active and standby clusters. This enables Temporal Web to work seamlessly for Global Namespaces as all visibility records for Workflow executions can be queried from any cluster the namespace is replicated to. Applications making API calls directly to the Temporal Visibility API will continue to work even if a Global Namespace is in standby mode. However, they might see a lag due to replication delay when querying the Workflow execution state from a standby cluster.
The Temporal CLI can also be used to query the namespace config or perform failovers. Here are some useful commands.
The following command can be used to describe Global Namespace metadata:
The following command can be used to failover Global Namespace my-namespace-global to the dc2 cluster:
Temporal does not forward Activity completions across clusters. Any outstanding Activity will eventually time out based on its configuration. Your application should have retry logic in place so that the Activity is retried and dispatched again to a worker after the failover to the new DC. Handling this is essentially the same as handling an Activity timeout caused by a worker restart, even without Global Namespaces.
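The timeout-then-retry behavior described above can be modeled with a toy loop. This is only a simulation of the idea: in a real application the retry is normally expressed as Activity retry options rather than hand-written code. `ActivityTimeout`, `run_with_retries`, and `make_attempt_fn` are hypothetical names.

```python
class ActivityTimeout(Exception):
    """Stand-in for an Activity attempt that timed out."""


def run_with_retries(attempt_fn, max_attempts: int):
    """Call attempt_fn(attempt_number) until it succeeds or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return attempt_fn(attempt)
        except ActivityTimeout as err:
            last_error = err  # remember the failure and try again
    raise last_error


def make_attempt_fn(attempts_lost_to_failover: int):
    """Build an attempt function whose first N attempts time out, as if their
    completions were reported to the old cluster and never forwarded."""

    def attempt_fn(attempt: int) -> str:
        if attempt <= attempts_lost_to_failover:
            raise ActivityTimeout(f"attempt {attempt} timed out")
        return f"completed on attempt {attempt}"

    return attempt_fn


result = run_with_retries(make_attempt_fn(attempts_lost_to_failover=1), max_attempts=3)

try:
    run_with_retries(make_attempt_fn(attempts_lost_to_failover=5), max_attempts=2)
    exhausted = False
except ActivityTimeout:
    exhausted = True
```

The second call shows the other outcome: if every allowed attempt times out, the last timeout is raised to the caller.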
When a start or signal call reaches a cluster where the Global Namespace is in standby mode, Temporal rejects the call and returns NamespaceNotActiveError. It is the responsibility of the application to forward the failed call to the active cluster based on the information provided in the error.
The recommendation at this point is to publish events to a Kafka topic if they can be generated in any DC. Then, run a consumer that reads from the aggregated Kafka topic in the same DC and sends the events to Temporal. The Kafka consumer and the Global Namespace need to be failed over together.