Temporal Server self-hosted production deployment
Overview
While a lot of effort has been made to make it easy to run and test the Temporal Server in a development environment (see the Quick install guide), there is far less of an established framework for deploying Temporal to a live (production) environment. That is because the setup of the Server depends very much on your intended use case and hosting infrastructure.
This page is dedicated to providing a "first principles" approach to self-hosting the Temporal Server. As a reminder, experts are accessible via the Community forum and Slack should you have any questions.
info
If you are interested in a fully managed service hosting Temporal Server, please register your interest in Temporal Cloud. We have a waitlist for early Design Partners.
Temporal Server
Temporal Server is a Go application which you can import or run as a binary (we offer builds with every release).
Go is not required if you are only running the pre-built binary, but Go v1.16+ is required if you are building Temporal or running it from source.
While Temporal can be run as a single Go binary, we recommend that production deployments of Temporal Server run each of the 4 internal services separately (if you are using Kubernetes, one service per pod) so they can be scaled independently in the future.
See below for a refresher on the 4 internal services:
Temporal Cluster Architecture
A Temporal Cluster is the group of services, known as the Temporal Server, combined with persistence stores, that together act as a component of the Temporal Platform.
Persistence
A Temporal Cluster's only required dependency for basic operation is a database. Multiple types of databases are supported.
The database stores the following types of data:
- Tasks: Tasks to be dispatched.
- State of Workflow Executions:
- Execution table: A capture of the mutable state of Workflow Executions.
- History table: An append-only log of Workflow Execution History Events.
- Namespace metadata: Metadata of each Namespace in the Cluster.
- Visibility data: Enables operations like "show all running Workflow Executions". For production environments, we recommend using Elasticsearch.
An Elasticsearch database can be added to enable Advanced Visibility.
Versions
Temporal tests compatibility by spanning the minimum and maximum stable non-EOL major versions for each supported database. At the time of writing, these specific versions are used in our test pipelines and are actively tested before we release any version of Temporal:
- Cassandra v3.11 and v4.0
- PostgreSQL v10.18 and v13.4
- MySQL v5.7 and v8.0 (specifically 8.0.19+ due to a bug)
We update these support ranges once a year. The release notes for each Temporal Server release declare when we plan to drop support for database versions reaching End of Life.
- Because Temporal Server primarily relies on core database functionality, we do not expect compatibility to break often. Temporal has no opinions on database upgrade paths; as long as you can upgrade your database according to each project's specifications, Temporal should work with any version within supported ranges.
- We do not run tests with vendors like Vitess and CockroachDB, so you rely on their compatibility claims if you use them. Feel free to discuss them with fellow users in our forum.
- Temporal is working on official SQLite v3.x persistence, but this is meant only for development and testing, not production usage. Cassandra, MySQL, and PostgreSQL schemas are supported and thus can be used as the Server's database.
Monitoring & observation
Temporal emits metrics by default in a format supported by Prometheus. Monitoring and observing those metrics is optional. Any tool that can pull metrics in the same format could be used, but we only verify compatibility with the following versions of Prometheus and Grafana:
- Prometheus >= v2.0
- Grafana >= v2.5
Visibility
Temporal has built-in Visibility features. To enhance this feature, Temporal supports an integration with Elasticsearch.
- Elasticsearch v7.10 is supported from Temporal version 1.7.0 onwards
- Elasticsearch v6.8 is supported in all Temporal versions
- Both versions are explicitly supported with AWS Elasticsearch
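When Advanced Visibility is enabled, the Elasticsearch datastore is declared alongside the primary persistence store in the Server configuration. Below is a minimal sketch modeled on the sample Elasticsearch config that ships with the Server; the connection details and index name are placeholders, so verify the keys against the Server configuration reference for your version.

```yaml
persistence:
  # ...primary and standard visibility datastores omitted...
  advancedVisibilityStore: es-visibility
  datastores:
    es-visibility:
      elasticsearch:
        version: "v7"
        url:
          scheme: "http"
          host: "127.0.0.1:9200"
        indices:
          visibility: temporal_visibility_v1
```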
In practice, running the services separately means starting each container with a flag that specifies which service it should run, e.g.:
```bash
docker run
    # persistence/schema setup flags omitted
    -e SERVICES=history \                        -- Spinup one or more of: history, matching, worker, frontend
    -e LOG_LEVEL=debug,info \                    -- Logging level
    -e DYNAMIC_CONFIG_FILE_PATH=config/foo.yaml  -- Dynamic config file to be watched
    temporalio/server:<tag>
```
See the Docker source file for more details.
Each release also ships a Server with Auto Setup Docker image that includes an `auto-setup.sh` script we recommend using for initial schema setup of each supported database. You should familiarize yourself with what auto-setup does, as you will likely be replacing every part of the script to customize for your own infrastructure and tooling choices.
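For reference, the docker-compose samples wire the auto-setup image up roughly like the sketch below (a PostgreSQL example). The environment variable names follow the samples at the time of writing and should be treated as assumptions; check `auto-setup.sh` for the authoritative list.

```yaml
# docker-compose sketch using the auto-setup image
# (variable names per the samples; verify against auto-setup.sh)
services:
  temporal:
    image: temporalio/auto-setup:<tag>
    environment:
      - DB=postgresql
      - DB_PORT=5432
      - POSTGRES_SEEDS=postgres      # hostname of the PostgreSQL container
      - POSTGRES_USER=temporal
      - POSTGRES_PWD=temporal
    ports:
      - "7233:7233"
    depends_on:
      - postgres                     # a postgres service, omitted here
```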
Though neither is blessed for production use, you can consult our Docker-Compose repo or Helm Charts for more hints on configuration options.
Minimum Requirements
- The minimum Temporal Server dependency is a database. We support Cassandra, MySQL, or PostgreSQL, with SQLite on the way.
- Further dependencies are only needed to support optional features. For example, enhanced Workflow search can be achieved using Elasticsearch.
- Monitoring and observability are available with Prometheus and Grafana.
- Each language SDK also has minimum version requirements. See the versions & dependencies page for precise versions we support together with these features.
Kubernetes is not required for Temporal, but it is a popular deployment platform anyway. We do maintain a Helm chart you can use as a reference, but you are responsible for customizing it to your needs. We also hosted a YouTube discussion on how we think about the Kubernetes ecosystem in relation to Temporal.
Configuration
At minimum, the `development.yaml` file needs to have the `global` and `persistence` parameters defined.
The Server configuration reference has a more complete list of possible parameters.
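For illustration, here is a minimal sketch of those two sections, modeled on the sample configs that ship with the Server and assuming a local PostgreSQL datastore. Database names, credentials, and `numHistoryShards` are placeholders you must adapt (see the shard count reminder below), and the authoritative key list is in the Server configuration reference.

```yaml
# Minimal development.yaml sketch (placeholder values; verify keys against your Server version)
persistence:
  defaultStore: default
  visibilityStore: visibility
  numHistoryShards: 512          # cannot be changed later; see the shard count section below
  datastores:
    default:
      sql:
        pluginName: "postgres"
        databaseName: "temporal"
        connectAddr: "127.0.0.1:5432"
        connectProtocol: "tcp"
        user: "temporal"
        password: "temporal"
    visibility:
      sql:
        pluginName: "postgres"
        databaseName: "temporal_visibility"
        connectAddr: "127.0.0.1:5432"
        connectProtocol: "tcp"
        user: "temporal"
        password: "temporal"

global:
  membership:
    maxJoinDuration: 30s
    broadcastAddress: "127.0.0.1"
```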
Before you deploy: Reminder on shard count
A huge part of a production deployment is understanding current and future scale: the number of history shards cannot be changed after the Cluster is in use, so this decision must be made upfront. Shard count determines how much concurrency the Cluster can support and how well it avoids lock contention as load grows.
The default `numHistoryShards` is 4; deployments at scale can go up to 500-2000 shards.
Please consult our configuration docs and check with us for advice if you are worried about scaling.
Scaling and Metrics
The requirements of your Temporal system will vary widely based on your intended production workload. You will want to run your own proof of concept tests and watch for key metrics to understand the system health and scaling needs.
- Configure your metrics subsystem. Temporal supports three metrics providers out of the box via Uber's Tally interface: StatsD, Prometheus, and M3. Tally offers extensible custom metrics reporting, which we expose via `temporal.WithCustomMetricsReporter`. OpenTelemetry support is planned in the future. A Prometheus configuration sketch follows this list.
- Set up monitoring. You can use these Grafana dashboards as a starting point. The single most important metric to track is `schedule_to_start_latency`: if you get a spike in workload and don't have enough Workers, your tasks will get backlogged. We strongly recommend setting alerts for this metric. It is usually emitted by the client SDKs in both `temporal_activity_schedule_to_start_latency_*` and `temporal_workflow_task_schedule_to_start_latency_*` variants; see the Prometheus Go SDK example and the Go SDK source. There are plans to add it on the Server as well.
- Set up alerts for Workflow Task failures.
- Also set up monitoring/alerting for all Temporal Workers for standard metrics like CPU/memory utilization.
- Load testing. You can use the Maru benchmarking tool (author's guide here), see how we ourselves stress test Temporal, or write your own.
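For example, to use the built-in Prometheus reporter, you can add a metrics stanza to the Server config (shown here under `global`; it can also be configured per service). This is a sketch assuming a scrape target on port 8000; adjust the listen address and tags for your environment.

```yaml
global:
  metrics:
    tags:
      cluster: "production"           # arbitrary static tags added to every metric
    prometheus:
      timerType: "histogram"
      listenAddress: "0.0.0.0:8000"   # scrape this endpoint from Prometheus
```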
All metrics emitted by the server are listed in Temporal's source. There are also equivalent metrics that you can configure from the client side. At a high level, you will want to track these 3 categories of metrics:
- Service metrics: For each request made by the service handler we emit `service_requests`, `service_errors`, and `service_latency` metrics with `type`, `operation`, and `namespace` tags. This gives you basic visibility into service usage and lets you look at request rates across services, namespaces, and even operations; a recording-rule sketch follows this list.
- Persistence metrics: The Server emits `persistence_requests`, `persistence_errors`, and `persistence_latency` metrics for each persistence operation. These metrics include the `operation` tag so that you can get request rates, error rates, or latencies per operation. They are very useful in identifying issues caused by the database.
- Workflow Execution stats: The Server also emits counters when Workflow Executions complete. These are useful for getting overall stats about Workflow Execution completions. Use the `workflow_success`, `workflow_failed`, `workflow_timeout`, `workflow_terminate`, and `workflow_cancel` counters for each type of Workflow Execution completion. These include the `namespace` tag.
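As an illustration of how the service metrics compose, a Prometheus recording rule can derive a per-operation error ratio from `service_errors` and `service_requests`. The exported names can differ depending on your reporter configuration (prefixes, sanitization), so treat the metric names below as assumptions and verify them against your `/metrics` endpoint.

```yaml
# Hypothetical Prometheus recording rule; verify metric names against your /metrics output.
groups:
  - name: temporal-service-metrics
    rules:
      - record: temporal:service_error_ratio:rate5m
        expr: |
          sum by (operation) (rate(service_errors[5m]))
            /
          sum by (operation) (rate(service_requests[5m]))
```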
Please request any additional information in our forum. Key discussions are here:
- https://community.temporal.io/t/metrics-for-monitoring-server-performance/536/3
- https://community.temporal.io/t/guidance-on-creating-and-interpreting-grafana-dashboards/493
Checklist for Scaling Temporal
Temporal is highly scalable due to its event-sourced design. We have load tested up to 200 million concurrent Workflow Executions. Every shard is low contention by design and it is very difficult to oversubscribe to a Task Queue in the same cluster. With that said, here are some guidelines for common bottlenecks:
- Database. The vast majority of the time the database will be the bottleneck. We highly recommend setting alerts on `schedule_to_start_latency` to look out for this (a hypothetical alert-rule sketch follows this list). Also check whether your database connections are getting saturated.
- Internal services. The next layer is scaling the 4 internal services of Temporal (Frontend, Matching, History, and Worker). Monitor each accordingly. The Frontend service is more CPU bound, whereas the History and Matching services require more memory. If you need more instances of each service, spin them up separately with different command line arguments. You can learn more by cross-referencing our Helm chart with our Server Configuration reference.
- See the Server Limits section below for other limits you will want to keep in mind when doing system design, including Event History length.
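To make the `schedule_to_start_latency` advice concrete, here is a hypothetical Prometheus alerting rule. The metric name, histogram suffix, and `task_queue` label are assumptions that depend on your SDK version and metrics reporter, so verify them before relying on this.

```yaml
groups:
  - name: temporal-workers
    rules:
      - alert: TaskQueueBacklog
        # p95 Activity schedule-to-start latency above 1s for 5 minutes
        # (metric/label names are assumptions; check what your SDK actually exports)
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, task_queue) (rate(temporal_activity_schedule_to_start_latency_seconds_bucket[5m]))
          ) > 1
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Workers may be under-provisioned for {{ $labels.task_queue }}"
```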
Please see the dedicated docs on Tuning and Scaling Workers.
FAQs
FAQ: Autoscaling Workers based on Task Queue load
Temporal does not yet support returning the number of tasks in a Task Queue. The main technical hurdle is that each task can have its own `ScheduleToStart` timeout, so just counting how many tasks were added and consumed is not enough. This is why we recommend tracking `schedule_to_start_latency` to determine whether a Task Queue has a backlog (that is, whether your Workflow and Activity Workers are under-provisioned for a given Task Queue). We do plan to add features that give more visibility into Task Queue state in the future.
FAQ: High Availability cluster configuration
You can set up a high availability deployment by running more than one instance of the Server. Temporal also handles membership and routing. You can find more details in the `clusterMetadata` section of the Server Configuration reference.
```yaml
clusterMetadata:
  enableGlobalNamespace: false
  failoverVersionIncrement: 10
  masterClusterName: "active"
  currentClusterName: "active"
  clusterInformation:
    active:
      enabled: true
      initialFailoverVersion: 0
      rpcAddress: "127.0.0.1:7233"
```
FAQ: Multiple deployments on a single cluster
You may sometimes want to have multiple parallel deployments on the same cluster, for example:
- when you want to split Temporal deployments based on namespaces, e.g. staging/dev/uat, or for different teams who need to share common infrastructure.
- when you need a new deployment to change `numHistoryShards`.
We recommend not doing this if you can avoid it. If you need to do it anyway, double-check the following:
- Have a separate persistence (database) for each deployment
- Cluster membership ports should be different for each deployment (they can be set through environment variables or directly in the services config; see the sketch after this list). For example:
- Temporal1 services can have 7233 for frontend, 7234 for history, 7235 for matching
- Temporal2 services can have 8233 for frontend, 8234 for history, 8235 for matching
- There is no need to change gRPC ports.
More details about the reason are available here.
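If you manage the config files yourself rather than relying on the Docker image's environment variables, the membership ports live under each service's `rpc` block. Below is a hypothetical sketch for the second deployment, with port values chosen to mirror the example above; verify key names against the Server configuration reference.

```yaml
# Second deployment: distinct membership ports so the two rings never overlap.
# gRPC ports are left at their defaults, per the note above.
services:
  frontend:
    rpc:
      grpcPort: 7233
      membershipPort: 8233
  history:
    rpc:
      grpcPort: 7234
      membershipPort: 8234
  matching:
    rpc:
      grpcPort: 7235
      membershipPort: 8235
  worker:
    rpc:
      grpcPort: 7239
      membershipPort: 8239
```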
Server limits
Running into limits can cause unexpected failures, so be mindful when you design your systems. Here is a comprehensive list of all the hard (error) / soft (warn) Server limits relevant to operating Temporal (a dynamic config sketch for overriding some of them follows the list):
- gRPC: gRPC has a 4MB size limit per message received
- Event Batch Size: The `DefaultTransactionSizeLimit` limit is 4MB. This is the largest transaction size we allow for Event Histories to be persisted.
  - This is configurable with `TransactionSizeLimit`, if you know what you are doing.
- Blob size limit: for incoming payloads (including Workflow context) - source
  - We warn at 512KB: `Blob size exceeds limit.`
  - We error at 2MB: `ErrBlobSizeExceedsLimit: Blob data size exceeds limit.`
  - This is configurable with `BlobSizeLimitError` and `BlobSizeLimitWarn`, if you know what you are doing.
- History total size limit (leading to a terminated Workflow Execution):
  - We warn at 10MB: `history size exceeds warn limit.`
  - We error at 50MB: `history size exceeds error limit.`
  - This is configurable with `HistorySizeLimitError` and `HistorySizeLimitWarn`, if you know what you are doing.
- History total count limit (leading to a terminated Workflow Execution):
  - We warn at 10,000 events: `history size exceeds warn limit.`
  - We error at 50,000 events: `history size exceeds error limit.`
  - This is configurable with `HistoryCountLimitError` and `HistoryCountLimitWarn`, if you know what you are doing.
- Search Attributes:
  - Number of Search Attributes: max 100
  - Single Search Attribute size: 2KB
  - Total Search Attribute size: 40KB
  - This is configurable with `SearchAttributesNumberOfKeysLimit`, `SearchAttributesTotalSizeLimit`, and `SearchAttributesSizeOfValueLimit`, if you know what you are doing.
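Most of these knobs are dynamic config settings, so they can be overridden in the file pointed to by `DYNAMIC_CONFIG_FILE_PATH` without rebuilding the Server. The key names below are assumptions based on the dynamic config constants at the time of writing; confirm them against your Server version before relying on them.

```yaml
# Dynamic config overrides (key names are assumptions; verify against your Server version)
limit.blobSize.warn:
  - value: 524288       # 512KB
    constraints: {}
limit.blobSize.error:
  - value: 2097152      # 2MB
    constraints: {}
limit.historyCount.error:
  - value: 50000
    constraints: {}
```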
Securing Temporal
Please see our dedicated docs on Temporal Server Security.
Debugging Temporal
Debugging Temporal Server Configs
Recommended configuration debugging techniques for production Temporal Server setups:
- Containers (to be completed)
- Storage (to be completed)
- Networking
  - Temporal Cluster unable to establish ring membership, causing an infinite crash loop: use `tcurl` to audit it
Debugging Workflows
We recommend using Temporal Web to debug your Workflow Executions in development and production.
Tracing Workflows
Temporal Web's tracing capabilities mainly track Activity Execution within a Temporal context. If you need custom tracing specific to your use case, you should make use of context propagation to add tracing logic accordingly.
Further things to consider
warning
This document is still being written and we would welcome your questions and contributions.
Please search for these topics in our forum or ask on Slack.
Temporal Antipatterns
Please request elaboration on any of these.
- Trying to implement a queue in a workflow (because people hear we replace queues)
- Serializing massive amounts of state into and out of the workflow.
- Treating everything as a rigid/linear sequence of steps instead of dynamic logic
- Implementing a DSL which is actually just a generic schema-based language
- Polling in activities instead of using signals
- Blocking on incredibly long RPC requests and not using heartbeats
- Failing/retrying workflows without a very very specific business reason
Temporal Best practices
Please request elaboration on any of these.
- Mapping things to entities instead of traditional service design
- Testing: unit, integration
- Retries: figuring out right values for timeouts
- Versioning
- The Workflow is Temporal's fundamental unit of scalability - break things into workflows to scale, don't try to stuff everything in one workflow!
External Runbooks
Third party content that may help:
- Recommended Setup for Running Temporal with Cassandra on Production (Temporal Forums)
- How To Deploy Temporal to Azure Container Instances
- How To Deploy Temporal to Azure Kubernetes Service (AKS)
- AWS ECS runbook (we are seeking external contributions, please let us know if you'd like to work on this)
- AWS EKS runbook (we are seeking external contributions, please let us know if you'd like to work on this)