Worker health - Temporal Cloud feature guide
This page is a guide to monitoring a Temporal Worker fleet and covers the following scenarios:
- Configuring minimal observations
- How to detect a backlog of Tasks
- How to detect greedy Worker resources
- How to detect misconfigured Workers
- How to configure Sticky cache
Minimal Observations
Configure and understand these alerts first to gain insight into your application's health and behavior. They assume that your Workers already export the Temporal SDK metrics to your observability system; a minimal export sketch follows the lists below.
- Create monitors and alerts for the Schedule-To-Start latency SDK metrics (for both Workflow Executions and Activity Executions). See the Detect Task backlog section to explore sample queries and appropriate responses that accompany these values.
- Alert at >200ms for your p99 value
- Plot >100ms for your p95 value
- Create a Grafana panel called Sync Match Rate. See the Sync Match Rate section to explore example queries and appropriate responses that accompany these values.
- Alert at <95% for your p99 value
- Plot <99% for your p95 value
- Create a Grafana panel called Poll Success Rate. See the Detect greedy Workers section for example queries and appropriate responses that accompany these values.
- Alert at <90% for your p99 value
- Plot <95% for your p95 value
The following alerts build on the above to dive deeper into specific potential causes of Worker-related issues you might be experiencing.
- Create monitors and alerts for the temporal_worker_task_slots_available SDK metric. See the Detect misconfigured Workers section for appropriate responses based on the value.
- Alert at 0 for your p99 value
- Create monitors for the temporal_sticky_cache_size SDK metric. See the Configure Sticky Cache section for more details on this configuration.
- Plot at {value} > {WorkflowCacheSize.Value}
- Create monitors for the temporal_sticky_cache_total_forced_eviction SDK metric. This metric is available in the Go and Java SDKs only. See the Configure Sticky Cache section for more details and appropriate responses.
- Alert at >{predetermined_high_number}
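These monitors and alerts rely on the SDK metrics being exported from your Worker processes. The following is a minimal sketch of exposing Temporal Java SDK metrics on a Prometheus scrape endpoint; the port (8077), the reporting interval, and the class name are illustrative assumptions, and Temporal Cloud connection options (target address, mTLS) are omitted.

import com.sun.net.httpserver.HttpServer;
import com.uber.m3.tally.RootScopeBuilder;
import com.uber.m3.tally.Scope;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;
import io.temporal.common.reporter.MicrometerClientStatsReporter;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;
import java.io.OutputStream;
import java.net.InetSocketAddress;

public class WorkerMetricsSetup {
  public static void main(String[] args) throws Exception {
    // Route SDK metrics (schedule_to_start latency, task slots, sticky cache, and so on) to Prometheus.
    PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
    Scope scope = new RootScopeBuilder()
        .reporter(new MicrometerClientStatsReporter(registry))
        .reportEvery(com.uber.m3.util.Duration.ofSeconds(10)); // reporting interval is an assumption

    // Attach the metrics scope to the service stubs your Workers use.
    // Pass these stubs to WorkflowClient.newInstance(service) when creating your Workers.
    WorkflowServiceStubs service = WorkflowServiceStubs.newServiceStubs(
        WorkflowServiceStubsOptions.newBuilder().setMetricsScope(scope).build());

    // Expose a /metrics endpoint for Prometheus to scrape; port 8077 is an assumption.
    HttpServer server = HttpServer.create(new InetSocketAddress(8077), 0);
    server.createContext("/metrics", exchange -> {
      byte[] body = registry.scrape().getBytes();
      exchange.sendResponseHeaders(200, body.length);
      try (OutputStream os = exchange.getResponseBody()) {
        os.write(body);
      }
    });
    server.start();
  }
}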
Detect Task Backlog
How to detect a backlog of Tasks.
Metrics to monitor:
- SDK metric: workflow_task_schedule_to_start_latency
- SDK metric: activity_schedule_to_start_latency
- Temporal Cloud metric: temporal_cloud_v0_poll_success_count
- Temporal Cloud metric: temporal_cloud_v0_poll_success_sync_count
Schedule-To-Start latency
The Schedule-To-Start metric represents how long Tasks stay unprocessed in the Task Queues. Put differently, it is the time between when a Task is enqueued and when it is picked up by a Worker. A high value likely means that your Workers can't keep up: either increase the number of Workers (if the host load is already high) or increase the number of pollers per Worker.
If your Schedule-To-Start latency alert triggers or the latency is high, check the Sync Match Rate to decide whether you need to adjust your Workers or fleet, or whether to contact Temporal Cloud support. If your Sync Match Rate is low, contact Temporal Cloud support.
The schedule_to_start_latency SDK metric for both Workflow Executions and Activity Executions should have alerts.
Prometheus query samples
Workflow Task Latency, 99th percentile
histogram_quantile(0.99, sum(rate(temporal_workflow_task_schedule_to_start_latency_seconds_bucket[5m])) by (le, namespace, task_queue))
Workflow Task Latency, average
sum(increase(temporal_workflow_task_schedule_to_start_latency_seconds_sum[5m])) by (namespace, task_queue)
/
sum(increase(temporal_workflow_task_schedule_to_start_latency_seconds_count[5m])) by (namespace, task_queue)
Activity Task Latency, 99th percentile
histogram_quantile(0.99, sum(rate(temporal_activity_schedule_to_start_latency_seconds_bucket[5m])) by (le, namespace, task_queue))
Activity Task Latency, average
sum(increase(temporal_activity_schedule_to_start_latency_seconds_sum[5m])) by (namespace, task_queue)
/
sum(increase(temporal_activity_schedule_to_start_latency_seconds_count[5m])) by (namespace, task_queue)
Target
This latency should be very low, close to zero. Any higher value indicates a bottleneck.
Sync Match Rate
The Sync Match Rate measures the proportion of Tasks that are delivered directly to a waiting Worker poller, compared to the total number of delivered Tasks.
A sync match is when a Task is immediately matched with a Worker that is polling the Task Queue.
An async match is when a Task cannot be matched to a waiting poller right away, for example because no poller is currently available. In this case, the Task is written to the Task Queue backlog and is delivered when a Worker polls for it.
Calculate Sync Match Rate
temporal_cloud_v0_poll_success_sync_count ÷ temporal_cloud_v0_poll_success_count = N
Prometheus query samples
sync_match_rate query
sum by(temporal_namespace) (
rate(
temporal_cloud_v0_poll_success_sync_count{temporal_namespace=~"$namespace"}[5m]
)
)
/
sum by(temporal_namespace) (
rate(
temporal_cloud_v0_poll_success_count{temporal_namespace=~"$namespace"}[5m]
)
)
Target
The Sync Match Rate should be greater than 95%, and preferably greater than 99%.
Interpretation
Consider whether a low Sync Match Rate is acceptable for your use case, for example, if you have known workloads or you intentionally throttle Tasks.
If Schedule-To-Start latencies are high, the Task Queue is experiencing a backlog of Tasks.
It's also important to understand what the fill and drain rates of async Tasks are during these windows:
Successful async polls
temporal_cloud_v0_poll_success_count - temporal_cloud_v0_poll_success_sync_count = N
sum(rate(temporal_cloud_v0_poll_success_count{temporal_namespace=~"$temporal_namespace"}[5m])) by (temporal_namespace, task_type)
-
sum(rate(temporal_cloud_v0_poll_success_sync_count{temporal_namespace=~"$temporal_namespace"}[5m])) by (temporal_namespace, task_type)
Actions
- Verify that your Worker setup is optimized for your instance:
  - Check the system CPU usage against task_slots and adjust the maxConcurrentWorkflowTaskExecutionSize and maxConcurrentActivityExecutionSize settings as necessary.
  - Check the system memory usage against sticky_cache_size and adjust the sticky cache size as necessary.
  - For a detailed explanation of these settings, see the Worker Performance section.
- Increase the Worker configuration for concurrent pollers for Workflow or Activity task_slots, if your Worker resources can accommodate the increased load. Reference Worker Performance > Poller Count.
- Increase the number of available Workers.
Setting the Schedule-To-Start Timeout in your Activity Options can skew your observations. Avoid setting a Schedule-To-Start Timeout when load testing for latency.
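For example, when configuring Activities for a latency test, you can leave the Schedule-To-Start Timeout unset and rely on a Start-To-Close Timeout instead. A minimal Java SDK sketch, where the timeout value is illustrative:

import io.temporal.activity.ActivityOptions;
import java.time.Duration;

// Leaving ScheduleToStartTimeout unset avoids skewing schedule_to_start_latency observations.
ActivityOptions options = ActivityOptions.newBuilder()
    .setStartToCloseTimeout(Duration.ofMinutes(5)) // illustrative value
    .build();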
Detect greedy Worker resources
How to detect greedy Worker resources.
You can have too many Workers. If you see the Poll Success Rate showing low numbers, you might have too many resources polling Temporal Cloud.
Metrics to monitor:
- Temporal Cloud metric: temporal_cloud_v0_poll_success_count
- Temporal Cloud metric: temporal_cloud_v0_poll_success_sync_count
- Temporal Cloud metric: temporal_cloud_v0_poll_timeout_count
- SDK metric: temporal_workflow_task_schedule_to_start_latency
- SDK metric: temporal_activity_schedule_to_start_latency
Calculate Poll Success Rate
(temporal_cloud_v0_poll_success_count + temporal_cloud_v0_poll_success_sync_count)
/
(temporal_cloud_v0_poll_success_count + temporal_cloud_v0_poll_success_sync_count + temporal_cloud_v0_poll_timeout_count)
Target
Poll Success Rate should be >90% in most cases of systems with a steady load. For high volume and low latency, try to target >95%.
Interpretation
There may be too many pollers for the amount of work available.
If you see both of the following at the same time, you might have too many Workers:
- ResourceExhausted errors on poll operations, for example temporal_long_request_failure{operation=~"Poll.*"}
- A low Poll Success Rate
Actions
Consider sizing down your Workers by either:
- Reducing the number of Workers polling the impacted Task Queue, OR
- Reducing the concurrent pollers per Worker (see the sketch after this list), OR
- Both of the above
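Reducing the concurrent pollers per Worker is a Worker-level setting. A minimal Java SDK sketch, where the poller counts and class name are illustrative assumptions rather than recommendations:

import io.temporal.client.WorkflowClient;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerOptions;

public class ReducedPollerWorker {
  // Fewer pollers per Worker means fewer concurrent long-poll requests against Temporal Cloud.
  public static Worker newWorker(WorkflowClient client, String taskQueue) {
    WorkerOptions options = WorkerOptions.newBuilder()
        .setMaxConcurrentWorkflowTaskPollers(2) // illustrative value, not a recommendation
        .setMaxConcurrentActivityTaskPollers(2) // illustrative value, not a recommendation
        .build();
    WorkerFactory factory = WorkerFactory.newInstance(client);
    return factory.newWorker(taskQueue, options);
  }
}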
Prometheus query samples
poll_success_rate query
(
(
sum by(temporal_namespace) (
rate(
temporal_cloud_v0_poll_success_count{temporal_namespace=~"$namespace"}[5m]
)
)
+
sum by(temporal_namespace) (
rate(
temporal_cloud_v0_poll_success_sync_count{temporal_namespace=~"$namespace"}[5m]
)
)
)
/
(
(
sum by(temporal_namespace) (
rate(
temporal_cloud_v0_poll_success_count{temporal_namespace=~"$namespace"}[5m]
)
)
+
sum by(temporal_namespace) (
rate(
temporal_cloud_v0_poll_success_sync_count{temporal_namespace=~"$namespace"}[5m]
)
)
)
+
sum by(temporal_namespace) (
rate(
temporal_cloud_v0_poll_timeout_count{temporal_namespace=~"$namespace"}[5m]
)
)
)
)
Detect misconfigured Workers
How to detect misconfigured Workers.
Worker configuration can negatively affect Task processing efficiency.
Metrics to monitor:
- SDK metric: temporal_worker_task_slots_available
- SDK metric: sticky_cache_size
- SDK metric: sticky_cache_total_forced_eviction
Execution Size Configuration
The maxConcurrentWorkflowTaskExecutionSize and maxConcurrentActivityExecutionSize settings define the total number of available Task slots for the Worker.
If these values are set too low, the Worker will not be able to keep up with processing Tasks.
Target
The temporal_worker_task_slots_available metric should always be >0.
Prometheus query samples
Over Time
avg_over_time(temporal_worker_task_slots_available{namespace="$namespace",worker_type="WorkflowWorker"}[10m])
Current Time
temporal_worker_task_slots_available{namespace="default", worker_type="WorkflowWorker", task_queue="$task_queue_name"}
Interpretation
You are likely experiencing a Task backlog if you frequently see inadequate slot counts; the work is not getting processed as fast as it could be.
Action
Increase the maxConcurrentWorkflowTaskExecutionSize and maxConcurrentActivityExecutionSize values, and keep an eye on your Worker resource metrics (CPU utilization, and so on) to make sure you haven't created a new issue.
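A minimal Java SDK sketch of raising these limits; the slot counts and class name are illustrative assumptions that you should tune against your CPU and memory headroom:

import io.temporal.client.WorkflowClient;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerOptions;

public class ResizedWorker {
  // More execution slots allow more Workflow and Activity Tasks to run concurrently on this Worker.
  public static Worker newWorker(WorkflowClient client, String taskQueue) {
    WorkerOptions options = WorkerOptions.newBuilder()
        .setMaxConcurrentWorkflowTaskExecutionSize(400) // illustrative value
        .setMaxConcurrentActivityExecutionSize(400)     // illustrative value
        .build();
    WorkerFactory factory = WorkerFactory.newInstance(client);
    return factory.newWorker(taskQueue, options);
  }
}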
Configure Sticky Execution Cache
How to configure your Workflow Sticky cache.
Sticky Execution means that a Worker caches a Workflow Execution Event History and creates a dedicated Task Queue to listen on. It significantly improves performance because the Temporal Service only sends new events to the Worker instead of entire Event Histories.
The WorkflowCacheSize setting should always be greater than the sticky_cache_size metric value.
Additionally, you can watch sticky_cache_total_forced_eviction for unusually high numbers, which likely indicate inefficiency because Workflows are being evicted from the cache.
Target
The sticky_cache_size metric should report a value less than or equal to your WorkflowCacheSize value.
Also, sticky_cache_total_forced_eviction should not report high numbers relative to your workload.
Action
If you see a high eviction count, verify there are no other inefficiencies in your Worker configuration or resource provisioning (backlog).
If you see the cache size metric exceed the WorkflowCacheSize value, increase WorkflowCacheSize if your Worker resources can accommodate it, or provision more Workers.
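In the Java SDK, the Workflow cache is configured on the WorkerFactory. A minimal sketch; the cache size and class name are illustrative assumptions:

import io.temporal.client.WorkflowClient;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerFactoryOptions;

public class CachedWorkerFactory {
  // WorkflowCacheSize bounds how many Workflow Executions the Workers created from this
  // factory keep cached; compare it against the sticky_cache_size metric.
  public static WorkerFactory newFactory(WorkflowClient client) {
    WorkerFactoryOptions factoryOptions = WorkerFactoryOptions.newBuilder()
        .setWorkflowCacheSize(600) // illustrative value
        .build();
    return WorkerFactory.newInstance(client, factoryOptions);
  }
}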
Finally, take time to review this document and see if it addresses other potential cache issues.
Prometheus query samples
Sticky Cache Size
max_over_time(temporal_sticky_cache_size{namespace="$namespace"}[10m])
Sticky Cache Evictions
rate(temporal_sticky_cache_total_forced_eviction_total{namespace="$namespace"}[5m])