Skip to main content

Monitoring health

Support, stability, and dependency info

Same-region Replication is in Public Preview for Temporal Cloud.

This section provides how-to instructions for monitoring replica and Service health when using Namespaces with High Availability features:

Replica health

You can monitor your replica status with the Temporal Cloud UI. If the replica is unhealthy, Temporal Cloud diables the “Trigger a failover” option to prevent failing over to an unhealthy replica. An unhealthy replica might be due to:

  • Data synchronization issues: The replica fails to remain in sync with the primary due to network or performance problems.
  • Replication lag: The replica falls behind the primary, causing it to be out of sync.
  • Network issues: Loss of communication between the replica and the primary causes problems.
  • Failed health checks: If the replica fails health checks, it’s marked as unhealthy.

These issues prevent the replica from being used during a failover, ensuring system stability and consistency.

Service health and latency metrics

Temporal Cloud’s High Availability features use asynchronous replication between the primary and the replica. Workflow updates in the primary, along with associated History Events, are transmitted to the replica. Replication lag refers to the transmission delay of Workflow updates and history events from the primary to the replica.

tip

Temporal Cloud strives to maintain a P95 replication lag of less than 1 minute. In this context, P95 means 95% of updates are processed faster than this limit.

A forced failover, when there is significant replication lag, increases the likelihood of rolling back Workflow progress. Always check the replication lag metrics before initiating a failover.

Temporal Cloud emits three replication lag-specific metrics. The following samples demonstrate how you can use these metrics to monitor and explore replication lag:

P99 replication lag histogram:

histogram_quantile(0.99, sum(rate(temporal_cloud_v0_replication_lag_bucket[$__rate_interval])) by (temporal_namespace, le))

Average replication lag:

sum(rate(temporal_cloud_v0_replication_lag_sum[$__rate_interval])) by (temporal_namespace)
/
sum(rate(temporal_cloud_v0_replication_lag_count[$__rate_interval])) by (temporal_namespace)

Monitoring and observability

You can view and set alerts for key Cloud metrics using the Web UI, the 'tcld' CLI utility, or Temporal Cloud APIs. For instance, when adding a region to a Namespace, you can track how Workflow replication progresses. If any errors occur, they will be displayed in the Namespace Web UI.

tip

If a Namespace is using a replica, you may notice that the Action count in temporal_cloud_v0_total_action_count is doubled (2x). This happens because operations are replicated; they occur on both the primary and the replica.

Checking failovers

Temporal Cloud offers several ways to monitor failovers:

  • When Temporal triggers failovers, the audit log will update with details. Look for "operation": "FailoverNamespace" in the logs.
  • You can set up alerts for Temporal-initiated failover events.
  • After a failover, verify that the Namespace is active in the new region using the Temporal Cloud Web UI.