
Temporal Server self-hosted production deployment

Overview#

While a lot of effort has been made to make it easy to run and test the Temporal Server in a development environment (see the Quick install guide), there is far less of an established framework for deploying Temporal to a live (production) environment. That is because the setup of the Server depends very much on your intended use case and hosting infrastructure.

This page is dedicated to providing a "first principles" approach to self-hosting the Temporal Server. As a reminder, experts are accessible via the Community forum and Slack should you have any questions.

Note: if you are interested in a managed service hosting Temporal Server, please register your interest in Temporal Cloud.

Setup principles#

Prerequisites#

The Temporal Server is a Go application which you can import or run as a binary.

The minimum dependency is a database. The Server supports Cassandra, MySQL, or PostgreSQL. Further dependencies are needed only to support optional features: for example, enhanced Workflow search can be achieved using Elasticsearch, and monitoring and observability are available with Prometheus and Grafana.

See the versions & dependencies page for the precise versions we support for each of these features.

Configuration#

At minimum, the development.yaml file needs to have the global and persistence parameters defined.

The Server configuration reference has a more complete list of possible parameters.
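As an illustration, a minimal development.yaml for a MySQL-backed deployment might look roughly like the following. Host names, credentials, and the shard count are placeholders, plugin names and defaults vary by Server version, and a complete production config also needs services, clusterMetadata, and related sections described in the configuration reference.

```yaml
persistence:
  defaultStore: default            # name of a datastore defined below
  visibilityStore: visibility
  numHistoryShards: 512            # cannot be changed after the cluster is created
  datastores:
    default:
      sql:
        pluginName: "mysql"
        databaseName: "temporal"
        connectAddr: "mysql.example.internal:3306"   # placeholder
        connectProtocol: "tcp"
        user: "temporal"
        password: "CHANGE_ME"
    visibility:
      sql:
        pluginName: "mysql"
        databaseName: "temporal_visibility"
        connectAddr: "mysql.example.internal:3306"   # placeholder
        connectProtocol: "tcp"
        user: "temporal"
        password: "CHANGE_ME"

global:
  membership:
    maxJoinDuration: 30s
    broadcastAddress: "10.0.0.1"   # address other cluster members can reach this host on
```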

Make sure to set Workflow and Activity timeouts everywhere.
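How you set these timeouts depends on your SDK. As a rough Go SDK sketch (the Workflow, Activity, and Task Queue names and all values here are hypothetical):

```go
package orders

import (
	"context"
	"time"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/workflow"
)

// OrderWorkflow sets Activity timeouts before executing any Activity.
func OrderWorkflow(ctx workflow.Context, orderID string) error {
	ao := workflow.ActivityOptions{
		StartToCloseTimeout:    30 * time.Second, // budget for a single Activity attempt
		ScheduleToCloseTimeout: 5 * time.Minute,  // total budget, including retries
		HeartbeatTimeout:       10 * time.Second, // detect stuck long-running Activities
	}
	ctx = workflow.WithActivityOptions(ctx, ao)
	return workflow.ExecuteActivity(ctx, ChargeCustomer, orderID).Get(ctx, nil)
}

// ChargeCustomer is a stand-in Activity.
func ChargeCustomer(ctx context.Context, orderID string) error {
	// ... call your payment service here ...
	return nil
}

// startOrder starts the Workflow with an execution-level timeout.
func startOrder(c client.Client, orderID string) error {
	_, err := c.ExecuteWorkflow(context.Background(), client.StartWorkflowOptions{
		ID:                       "order-" + orderID,
		TaskQueue:                "orders",
		WorkflowExecutionTimeout: 24 * time.Hour, // upper bound for the whole Workflow Execution
	}, OrderWorkflow, orderID)
	return err
}
```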

Scaling and Metrics#

The requirements of your Temporal system will vary widely based on your intended production workload. You will want to run your own proof of concept tests and watch for key metrics to understand the system health and scaling needs.

At a high level, you will want to track these 3 categories of metrics:

  • Service metrics: For each request made by the service handler we emit service_requests, service_errors, and service_latency metrics with type, operation, and namespace tags. This gives you basic visibility into service usage and allows you to look at request rates across services, namespaces and even operations.
  • Persistence metrics: The Server emits persistence_requests, persistence_errors and persistence_latency metrics for each persistence operation. These metrics include the operation tag such that you can get the request rates, error rates or latencies per operation. These are super useful in identifying issues caused by the database.
  • Workflow Execution stats: The Server also emits counters when Workflow Executions complete. These are useful for getting overall stats about Workflow Execution completions. Use the workflow_success, workflow_failed, workflow_timeout, workflow_terminate, and workflow_cancel counters for each type of Workflow Execution completion. These include the namespace tag. Additional information is available in this forum post.
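The Server publishes these metrics through the metrics reporter configured in its config file. A minimal sketch for exposing a Prometheus scrape endpoint is shown below; the listen address is a placeholder and the exact keys vary by Server version, so check the Server configuration reference for your release.

```yaml
global:
  metrics:
    prometheus:
      timerType: "histogram"
      listenAddress: "0.0.0.0:8000"   # Prometheus scrapes this endpoint on each Server host
```

Point Prometheus at this endpoint on every Temporal Server host and build Grafana dashboards on top of the resulting time series.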

Checklist for Scaling Temporal#

Temporal is highly scalable due to its event-sourced design. We have load tested up to 200 million concurrent Workflow Executions. Every shard is low-contention by design, and it is very difficult to oversubscribe to a Task Queue in the same cluster. With that said, here is some guidance on common bottlenecks:

  • Database. The vast majority of the time the database will be the bottleneck. We highly recommend setting alerts on ScheduleToStart latency to look out for this (see the example alerting rule after this list). Also check whether your database connections are getting saturated.
  • Internal services. The next layer is scaling the four internal services of Temporal (Frontend, Matching, History, and Worker). Monitor each accordingly. The Frontend service is more CPU bound, whereas the History and Matching services require more memory. If you need more instances of each service, spin them up separately with different command line arguments. You can learn more by cross-referencing our Helm chart with our Server configuration reference.
  • See the Server Limits section below for other limits you will want to keep in mind when doing system design, including event history length.
  • Multi-Cluster Replication is an experimental feature you can explore for heavy reads.
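As an illustration of the ScheduleToStart alerting recommended above, a Prometheus alerting rule might look roughly like the following. The metric name, label, and threshold are placeholders: the exact metric your Workers export depends on the SDK, the metrics reporter, and any prefix you configure, so substitute the schedule_to_start latency metric you actually see in Prometheus.

```yaml
groups:
  - name: temporal-workers
    rules:
      - alert: ActivityScheduleToStartLatencyHigh
        # Placeholder metric and label names; use whatever your SDK/reporter exposes.
        expr: |
          histogram_quantile(0.95,
            sum(rate(temporal_activity_schedule_to_start_latency_seconds_bucket[5m])) by (le, task_queue)
          ) > 5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Workers are not keeping up with Task Queue {{ $labels.task_queue }}"
```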

Finally, you will want to set up alerting and monitoring on Worker metrics. When Workers are able to keep up, ScheduleToStart latency stays close to zero. The default is 4 pollers (so called because Workers poll Task Queues for tasks), which should handle no more than 300 messages per second.

Specifically, the primary scaling configuration options are the following Worker options, set in your SDK:

  • MaxConcurrentActivityTaskPollers and MaxConcurrentWorkflowTaskPollers: default to 4
  • MaxConcurrentActivityExecutionSize and MaxConcurrentWorkflowTaskExecutionSize: default to 200

Scaling will depend on your workload. For example, for a Task Queue receiving 500 messages per second, you might want to scale up to 10 pollers. Provided you tune the concurrency of your pollers to your application, it should be possible to scale Workers based on standard resource utilization metrics (CPU, memory, etc.).
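As an illustration, in the Go SDK these options live on worker.Options when constructing a Worker. The Task Queue name and numbers below are placeholders to tune against your own metrics, and client.Dial requires a recent SDK version (older releases use client.NewClient).

```go
package workerexample

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func runWorker() error {
	c, err := client.Dial(client.Options{}) // defaults to localhost:7233; set HostPort for your cluster
	if err != nil {
		return err
	}
	defer c.Close()

	w := worker.New(c, "orders", worker.Options{
		// Poller counts control how aggressively this Worker polls the Task Queue.
		MaxConcurrentWorkflowTaskPollers: 10,
		MaxConcurrentActivityTaskPollers: 10,
		// Execution sizes cap how many tasks run concurrently on this Worker.
		MaxConcurrentWorkflowTaskExecutionSize: 500,
		MaxConcurrentActivityExecutionSize:     500,
	})

	// w.RegisterWorkflow(OrderWorkflow)
	// w.RegisterActivity(ChargeCustomer)

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Println("worker stopped:", err)
		return err
	}
	return nil
}
```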

It is possible to have too many Workers. Monitor the poll success (poll_success / poll_success_sync) and poll_timeouts metrics:

  • If you see low ScheduleToStart latency, a low percentage of poll successes, and a high percentage of timeouts, you might have too many Workers/pollers.
  • If you see close to 100% poll success and increasing ScheduleToStart latency, you need to scale up.

FAQ: Autoscaling Workers based on Task Queue load#

Temporal does not yet support returning the number of tasks in a Task Queue. The main technical hurdle is that each task can have its own ScheduleToStart timeout, so just counting how many tasks were added and consumed is not enough.

This is why we recommend tracking ScheduleToStart latency to determine whether a Task Queue has a backlog (that is, whether Workers are under-provisioned for that Task Queue). We do plan to add features that give more visibility into Task Queue state in the future.

Server limits#

Running into limits can cause unexpected failures, so be mindful when you design your systems. Here is a comprehensive list of all the hard (error) / soft (warn) server limits relevant to operating Temporal:

Debugging Temporal#

Debugging Temporal Server Configs#

Recommended configuration debugging techniques for production Temporal Server setups:

Debugging Workflows#

We recommend using Temporal Web to debug your Workflow Executions in development and production.

Tracing Workflows#

Temporal Web's tracing capabilities mainly track Activity Execution within a Temporal context. If you need custom tracing specific to your use case, you should make use of context propagation to add tracing logic accordingly.

Future content#

Topics this document will cover in the future (for now, please search or ask on the forum):

  • Recommended Environment
    • Staging/Test
    • using Temporal Web
  • More on Monitoring/Prometheus/Logging
    • Give guidance on how to set up alerts on Metrics provided by SDK
  • Setting up alerts for Workflow Task failures
  • Temporal Antipatterns
    • Trying to implement a queue in a workflow (because people hear we replace queues)
    • Serializing massive amounts of state into and out of the workflow.
    • Treating everything as an incredibly rigid, linear sequence of steps instead of dynamic logic
    • Implementing a DSL which is actually just a generic schema-based language
    • Polling in activities instead of using signals
    • Blocking on incredibly long RPC requests and not using heartbeats
    • Failing/retrying workflows without a very very specific business reason
  • Temporal Best practices
    • Mapping things to entities instead of traditional service design
    • Testing: unit, integration
    • Retries: figuring out right values for timeouts
    • Versioning
    • Workflows as the unit of scalability - break things into Workflows to scale; don't stuff everything into one Workflow!

Further Reading#

Understanding the Temporal Server architecture can help you debug and troubleshoot production deployment issues.

External Runbooks#

Third party content that may help: