Skip to main content

Temporal OSS Cluster metrics reference

OSS Cluster metrics

The information on this page is relevant to open source Temporal Cluster deployments.

See Cloud metrics for metrics emitted by Temporal Cloud.

See SDK metrics for metrics emitted by the SDKs.

A Temporal Cluster emits a range of metrics to help operators get visibility into the Cluster’s performance and to set up alerts.

All metrics emitted by the Cluster are listed in defs.go.

For details on setting up metrics in your Cluster configuration, see Temporal Cluster.

The dashboards repository contains community-driven Grafana dashboard templates that can be used as a starting point for monitoring the Temporal Cluster and SDK metrics. You can use these templates as references to build your own dashboards. For any metrics that are missing in the dashboards, use defs.go as a reference.

Note that, apart from these metrics emitted by the Cluster, you should also monitor infrastructure-specific metrics like CPU, memory, and network for all hosts that are running Temporal Cluster services.

Common metrics

Temporal emits metrics for each gRPC service request. These metrics are emitted with type, operation, and namespace tags, which provide visibility into Service usage and show the request rates across Services, Namespaces, and Operations.

  • Use the operation tag in your query to get request rates, error rates, or latencies per operation.
  • Use the service_name tag with the service role tag values to get details for the specific service.

All common tags that you can add in your query are defined in the defs.go file.

For example, to see service requests by operation on the Frontend Service, use the following:

sum by (operation) (rate(service_requests{service_name="frontend"}[2m]))

Note: All metrics queries in this topic are Prometheus queries.

The following list describes some metrics you can get started with.

service_requests

Shows service requests received per Task Queue. Example: Service requests by operation sum(rate(service_requests{operation=\"AddWorkflowTask\"}[2m]))

service_latency

Shows latencies for all Client request operations. Usually these are the starting point to investigate which operation is experiencing high-latency issues. Example: P95 service latency by operation for the Frontend Service histogram_quantile(0.95, sum(rate(service_latency_bucket{service_name="frontend"}[5m])) by (operation, le))

service_error_with_type

(Available only in v1.17.0+) Identifies errors encountered by the service. Example: Service errors by type for the Frontend Service sum(rate(service_errors_with_type{service_name="frontend"}[5m])) by (error_type)

client_errors

An indicator for connection issues between different Server roles. Example: Client errors sum(rate(client_errors{service_name="frontend",service_role="history"}[5m]))

In addition to these, you can define some service-specific metrics to get performance details for each service. Start with the following list, and use defs.go to define additional metrics as required.

Matching Service metrics

poll_success

Shows for Tasks that are successfully matched to a poller. Example: sum(rate(poll_success{}[5m]))

poll_timeouts

Shows when no Tasks are available for the poller within the poll timeout. Example: sum(rate(poll_timeouts{}[5m]))

asyncmatch_latency

Measures the time from creation to delivery for async matched Tasks. The larger this latency, the longer Tasks are sitting in the queue waiting for your Workers to pick them up. Example: histogram_quantile(0.95, sum(rate(asyncmatch_latency_bucket{service_name=~"matching"}[5m])) by (operation, le))

no_poller_tasks

Emitted whenever a task is added to a task queue that has no poller, and is a counter metric. This is usually an indicator that either the Worker or the starter programs are using the wrong Task Queue. Use no_poller_tasks_per_tl to get data per Task Queue.

History Service metrics

A History Task is an internal Task in Temporal that is created as part of a transaction to update Workflow state and is processed by the Temporal History service. It is critical to ensure that the History Task processing system is healthy. The following key metrics can be used to monitor the History Service health:

task_requests

Emitted on every Task process request. Example: sum(rate(task_requests{service="$service",operation=~"TransferActive.*"}[1m]))

task_errors

Emitted on every Task process error. Example: sum(rate(task_errors{operation=~"TransferActive.*"}[1m]))

task_attempt

Number of attempts on each Task Execution. A Task is retried forever, and each retry increases the attempt count. Example: histogram_quantile($percentile, sum(rate(task_attempt_bucket{service="$service",operation=~"TransferActive.*"}[1m])) by (operation, le))

task_latency_processing

Shows the processing latency per attempt. Example: histogram_quantile($percentile, sum(rate(task_latency_processing_bucket{operation=~"TransferActive.*",service="$service", service_name="history"}[1m])) by (operation, le))

task_latency

Measures the in-memory latency across multiple attempts.

task_latency_queue

Measures the duration, end-to-end, from when the Task should be executed (from the time it was fired) to when the Task is done.

task_latency_load

(Available only in v1.18.0+) Measures the duration from Task generation to Task loading (Task schedule to start latency for persistence queue).

task_latency_schedule

(Available only in v1.18.0+) Measures the duration from Task submission (to the Task scheduler) to processing (Task schedule to start latency for in-memory queue).

queue_latency_schedule

(Available only in v1.18.0+) Measures the time to schedule 100 Tasks in one Task channel in the host-level Task scheduler. If fewer than 100 Tasks are in the Task channel for 30 seconds, the latency is scaled to 100 Tasks upon emission. Note: This is still an experimental metric and is subject to change.

task_latency_userlatency

Shows the latency introduced because of Workflow logic. For example, if you have one Workflow scheduling many Activities or Child Workflows at the same time, it can cause a per-Workflow lock contention. The wait period for the per-Workflow lock is counted as userlatency.

The operation tag contains details about Task type and Active versus Standby statuses, and can be used to get request rates, error rates, or latencies per operation, which can help identify issues caused by database problems.

Persistence metrics

Temporal Server emits metrics for every persistence database read and write. Some of the most important ones are the following:

persistence_requests

Emitted on every persistence request. Examples:

  • Prometheus query for getting the total number of persistence requests by operation for the History Service: sum by (operation) (rate(persistence_requests{service="$service",service_name="history"}[1m]))
  • Prometheus query for getting the total number of persistence requests by operation for the Matching Service: sum by (operation) (rate(persistence_requests{cluster="$cluster",service_name="matching"}[5m]))

persistence_errors

Shows all persistence errors. This metric is a good indicator for connection issues between Temporal Cluster and the persistence store. Example:

  • Prometheus query for getting all persistence errors by service (history) sum (rate(persistence_errors{service="$service",service_name="history"}[1m]))

persistence_error_with_type

Shows all errors related to the persistence store with type, and contain an error_type tag.

  • Prometheus query for getting persistence errors with type by (history) and by error type: sum(rate(persistence_error_with_type{service="$service",service_name="history"}[1m])) by (error_type)

persistence_latency

Shows the latency on persistence operations. Example:

  • Prometheus query for getting latency by percentile: histogram_quantile($percentile, sum(rate(persistence_latency_bucket{service="$service" service_name="history"}[1m])) by (operation, le))

Schedule metrics

Temporal emits metrics that track the performance and outcomes of these Scheduled Executions.

Below are additional metrics that can help you monitor and optimize your Scheduled Workflow Executions.

schedule_buffer_overruns

Indicates instances where the buffer for holding Scheduled Workflows exceeds its maximum capacity. This scenario typically occurs when schedules with a buffer_all overlap policy have their average run length exceeding the average schedule interval.

Example: To monitor buffer overruns.

sum(rate(schedule_buffer_overruns{namespace="$namespace"}[5m]))

schedule_missed_catchup_window

Tracks occurrences when the system fails to execute a Scheduled Action within the defined catchup window. Missed catchup windows can result from extended outages beyond the configured catchup period.

Example: To identify missed catchup opportunities.

sum(rate(schedule_missed_catchup_window{namespace="$namespace"}[5m]))

schedule_rate_limited

Reflects instances where the creation of Workflows by a Schedule is throttled due to rate limiting policies within a Namespace. This metric is crucial for identifying scheduling patterns that frequently hit rate limits, potentially causing missed catchup windows.

Example: To assess the impact of rate limiting on Scheduled Executions.

sum(rate(schedule_rate_limited{namespace="$namespace"}[5m]))

schedule_action_success

Measures the successful execution of Workflows as per their schedules or through manual triggers. This metric is confirms that Workflows are running as expected without delays or errors.

Example: To track the success rate of Scheduled Workflow Executions.

sum(rate(schedule_action_success{namespace="$namespace"}[5m]))