
Temporal Server self-hosted production deployment


While a lot of effort has been made to make it easy to run and test the Temporal Server in a development environment (see the Quick install guide), there is far less of an established framework for deploying Temporal to a live (production) environment. That is because the setup of the Server depends very much on your intended use case and your hosting infrastructure.

This page is dedicated to providing a "first principles" approach to self-hosting the Temporal Server. As a reminder, experts are available via the Community forum and Slack if you have any questions.


If you are interested in a fully managed service hosting Temporal Server, please register your interest in Temporal Cloud. We have a waitlist for early Design Partners.

Temporal Server

Temporal Server is a Go application which you can import or run as a binary (we offer builds with every release). While Temporal can run as a single Go binary, we recommend that production deployments of Temporal Server run each of the four internal services separately (if you are using Kubernetes, one service per pod) so they can be scaled independently in the future.

See below for a refresher on the four internal services:

Temporal Cluster Architecture

A Temporal Cluster is the Temporal Server paired with persistence.

The Temporal Server consists of four independently scalable services:

  • Frontend gateway: for rate limiting, routing, authorizing
  • History subsystem: maintains data (mutable state, queues, and timers)
  • Matching subsystem: hosts Task Queues for dispatching
  • Worker service: for internal background workflows

A Temporal Cluster (Server + persistence)

For example, a real-life production deployment might have 5 Frontend, 15 History, 17 Matching, and 3 Worker services per cluster.

The Temporal Server services can run independently or be grouped together into shared processes on one or more physical or virtual machines. For live (production) environments, we recommend that each service runs independently, because each has different scaling requirements and troubleshooting becomes easier. The History, Matching, and Worker Services can scale horizontally within a Cluster. The Frontend Service scales differently because it has no sharding or partitioning; it is completely stateless.

Each service is aware of the others, including scaled instances, through a membership protocol via Ringpop.

Frontend Service

The Frontend Service is a stateless gateway service that exposes a strongly typed Proto API. The Frontend Service is responsible for rate limiting, authorizing, validating, and routing all in-bound calls.


Types of inbound calls include the following:

  • Namespace CRUD
  • External events
  • Worker polls
  • Visibility requests
  • Admin operations via the CLI
  • Multi-cluster Replication related calls from a remote Cluster

Every inbound request related to a Workflow Execution must have a Workflow Id, which becomes hashed for routing purposes. The Frontend Service has access to the hash rings that maintain service membership information, including how many nodes (instances of each service) are in the Cluster.
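The routing step can be sketched as a stable hash of the Workflow Id onto a fixed number of History shards. This is illustrative only: Temporal's actual implementation uses its own hash function over the Namespace and Workflow Ids, but the key property is the same, so the same Workflow Id always routes to the same shard.

```python
import hashlib

def shard_for(namespace_id: str, workflow_id: str, num_history_shards: int) -> int:
    """Map a Workflow Execution to a History shard (illustrative sketch).

    Temporal's real routing uses its own hash, not MD5, but the key
    property is identical: the mapping is deterministic, so every
    request for a given Workflow Id lands on the same shard.
    """
    digest = hashlib.md5(f"{namespace_id}:{workflow_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_history_shards
```

Because the mapping is deterministic, a Frontend node can route any in-bound call without coordination; it only needs the hash ring to know which History node currently owns that shard.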

Inbound call rate limiting is applied per host and per namespace.

The Frontend service talks to the Matching service, History service, Worker service, the database, and Elasticsearch (if in use).

  • It uses the grpcPort 7233 to host the service handler.
  • It uses port 6933 for membership related communication.

History service

The History Service tracks the state of Workflow Executions.


The History Service scales horizontally via individual shards, configured during the Cluster's creation. The number of shards remains static for the life of the Cluster (so you should plan to scale and over-provision).

Each shard maintains data (routing Ids, mutable state) and queues. A History shard maintains three types of queues:

  • Transfer queue: used to transfer internal tasks to the Matching Service. Whenever a new Workflow Task needs to be scheduled, the History Service transactionally dispatches it to the Matching Service.
  • Timer queue: used to durably persist Timers.
  • Replicator queue: used only for the experimental Multi-Cluster feature.

The History service talks to the Matching Service and the Database.

  • It uses grpcPort 7234 to host the service handler.
  • It uses port 6934 for membership related communication.

Matching service

The Matching Service is responsible for hosting Task Queues for Task dispatching.


It is responsible for matching Workers to Tasks and routing new tasks to the appropriate queue. This service can scale internally by having multiple instances.

It talks to the Frontend service, History service, and the database.

  • It uses grpcPort 7235 to host the service handler.
  • It uses port 6935 for membership related communication.

Worker service

The Worker Service runs background processing for the replication queue, system Workflows, and in versions older than 1.5.0, the Kafka visibility processor.


It talks to the Frontend service.

  • It uses grpcPort 7239 to host the service handler.
  • It uses port 6939 for membership related communication.
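Putting the per-service ports above together, the corresponding services block of the Server configuration looks roughly like this (a sketch; see the Server configuration reference for the full set of fields):

```yaml
services:
  frontend:
    rpc:
      grpcPort: 7233
      membershipPort: 6933
      bindOnLocalHost: true
  history:
    rpc:
      grpcPort: 7234
      membershipPort: 6934
      bindOnLocalHost: true
  matching:
    rpc:
      grpcPort: 7235
      membershipPort: 6935
      bindOnLocalHost: true
  worker:
    rpc:
      grpcPort: 7239
      membershipPort: 6939
      bindOnLocalHost: true
```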


Database

The database provides storage for the system.

Cassandra, MySQL, and PostgreSQL schemas are supported and thus can be used as the Server's database.

The database stores the following types of data:

  • Tasks: Tasks to be dispatched.
  • State of Workflow Executions:
    • Execution table: A capture of the mutable state of Workflow Executions.
    • History table: An append only log of Workflow Execution History Events.
  • Namespace metadata: Metadata of each Namespace in the Cluster.
  • Visibility data: Enables operations like "show all running Workflow Executions". For production environments, we recommend using Elasticsearch.

In practice, this means you will run each container with a flag specifying the service it should host, e.g.:

```bash
# SERVICES: spin up one or more of history, matching, worker, frontend
# LOG_LEVEL: logging level
# DYNAMIC_CONFIG_FILE_PATH: dynamic config file to be watched
# (persistence/schema setup flags omitted)
docker run \
  -e SERVICES=history \
  -e LOG_LEVEL=debug,info \
  -e DYNAMIC_CONFIG_FILE_PATH=config/foo.yaml \
  temporalio/server
```

See the Docker source file for more details.

Each release also ships a Server with Auto Setup Docker image that includes a script we recommend using for initial schema setup of each supported database. You should familiarize yourself with what auto-setup does, as you will likely replace every part of the script to customize for your own infrastructure and tooling choices.

Though neither is blessed for production use, you can consult our Docker-Compose repo or Helm Charts for more hints on configuration options.

Minimum Requirements

Kubernetes is not required for Temporal, but it is a popular deployment platform. We do maintain a Helm chart you can use as a reference, but you are responsible for customizing it to your needs. We also hosted a YouTube discussion about how we think about the Kubernetes ecosystem in relation to Temporal.


At minimum, the development.yaml file needs to have the global and persistence parameters defined.
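A minimal sketch of such a development.yaml, assuming a local MySQL datastore; all values (addresses, credentials, shard count) are placeholders to adapt to your own infrastructure:

```yaml
global:
  membership:
    maxJoinDuration: 30s
persistence:
  defaultStore: default
  visibilityStore: visibility
  numHistoryShards: 512
  datastores:
    default:
      sql:
        pluginName: "mysql"
        databaseName: "temporal"
        connectAddr: "127.0.0.1:3306"
        connectProtocol: "tcp"
        user: "temporal"
        password: "temporal"
    visibility:
      sql:
        pluginName: "mysql"
        databaseName: "temporal_visibility"
        connectAddr: "127.0.0.1:3306"
        connectProtocol: "tcp"
        user: "temporal"
        password: "temporal"
```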

The Server configuration reference has a more complete list of possible parameters.

Before you deploy: Reminder on shard count

A huge part of a production deployment is understanding your current and future scale. The number of shards cannot be changed after the Cluster is in use, so this decision needs to be made upfront. A higher shard count improves concurrency and reduces lock contention. The default numHistoryShards is 4; deployments at scale can go up to 500-2000 shards. Please consult our configuration docs and check with us for advice if you are worried about scaling.

Scaling and Metrics

The requirements of your Temporal system will vary widely based on your intended production workload. You will want to run your own proof of concept tests and watch for key metrics to understand the system health and scaling needs.

All metrics emitted by the server are listed in Temporal's source. There are also equivalent metrics that you can configure from the client side. At a high level, you will want to track these 3 categories of metrics:

  • Service metrics: For each request made by the service handler we emit service_requests, service_errors, and service_latency metrics with type, operation, and namespace tags. This gives you basic visibility into service usage and allows you to look at request rates across services, namespaces and even operations.
  • Persistence metrics: The Server emits persistence_requests, persistence_errors and persistence_latency metrics for each persistence operation. These metrics include the operation tag such that you can get the request rates, error rates or latencies per operation. These are super useful in identifying issues caused by the database.
  • Workflow Execution stats: The Server also emits counters for when Workflow Executions are complete. These are useful in getting overall stats about Workflow Execution completions. Use workflow_success, workflow_failed, workflow_timeout, workflow_terminate and workflow_cancel counters for each type of Workflow Execution completion. These include the namespace tag.
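To expose these metrics, the Server's metrics config can point at a Prometheus listener; a minimal sketch (the listen address is a placeholder; see the Server configuration reference for the exact fields):

```yaml
global:
  metrics:
    prometheus:
      timerType: "histogram"
      listenAddress: "127.0.0.1:8000"
```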

Please request any additional information in our forum. Key discussions are here:

Checklist for Scaling Temporal

Temporal is highly scalable due to its event-sourced design. We have load tested up to 200 million concurrent Workflow Executions. Every shard is low-contention by design, and it is very difficult to oversubscribe a Task Queue in the same Cluster. With that said, here are some guidelines for common bottlenecks:

  • Database. The vast majority of the time the database will be the bottleneck. We highly recommend setting alerts on schedule_to_start_latency to look out for this. Also check if your database connection is getting saturated.
  • Internal services. The next layer is scaling the four internal services of Temporal (Frontend, Matching, History, and Worker). Monitor each accordingly. The Frontend Service is more CPU bound, whereas the History and Matching Services require more memory. If you need more instances of a service, spin them up separately with different command line arguments. You can learn more by cross-referencing our Helm chart with our Server Configuration reference.
  • See the Server Limits section below for other limits you will want to keep in mind when doing system design, including event history length.

Scaling Workflow and Activity Workers

Finally, you want to set up alerting and monitoring on Worker metrics. When Workers are able to keep up, schedule_to_start_latency is close to zero. The default is 4 Workers (each of which can have 2-4 pollers of Task Queues), which should handle no more than 300 messages per second.

Specifically, the primary scaling knobs are the Worker options exposed by the client SDKs:

  • MaxConcurrentActivityTaskPollers and MaxConcurrentWorkflowTaskPollers: Defaults to 4
  • MaxConcurrentActivityExecutionSize and MaxConcurrentWorkflowTaskExecutionSize: Defaults to 200

Scaling will depend on your workload — for example, for a Task Queue with 500 messages per second, you might want to scale up to 10 pollers. Provided you tune the concurrency of your pollers based on your application, it should be possible to scale them based on standard resource utilization metrics (CPU, Memory, etc).

It's possible to have too many workers. Monitor the poll success (poll_success/poll_success_sync) and poll_timeouts metrics:

  • poll_success_sync indicates a "sync match", i.e. a poller waiting for a task to appear, instead of a task waiting for a poller to appear.
  • Poll success should be > 90% in most cases; for high volume and low latency, target > 95%.
  • If you see low schedule_to_start_latency together with a low poll success percentage or a high timeout percentage, you might have too many Workers/pollers.
  • With 100% poll success and increasing schedule_to_start_latency, you need to scale up.
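The thresholds above are easy to compute from the raw counters. A minimal sketch (the function name is hypothetical; in practice you would express this as a query in your metrics system over the same counters):

```python
def poll_success_rate(poll_success: int, poll_timeouts: int) -> float:
    """Fraction of polls that returned a Task rather than timing out.

    Assumes poll_success counts all successful polls, including sync
    matches (poll_success_sync). Per the guidance above, > 0.90 is
    healthy in most cases; target > 0.95 for high volume / low latency.
    """
    total = poll_success + poll_timeouts
    return poll_success / total if total else 0.0
```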


FAQ: Autoscaling Workers based on Task Queue load

Temporal does not yet support returning the number of tasks in a task queue. The main technical hurdle is that each task can have its own ScheduleToStart timeout, so just counting how many tasks were added and consumed is not enough.

This is why we recommend tracking schedule_to_start_latency for determining if the task queue has a backlog (i.e. your Workflow and Activity Workers are under-provisioned for a given Task Queue). We do plan to add features that give more visibility into the Task Queue state in the future.

FAQ: High Availability cluster configuration

You can set up a high availability deployment by running more than one instance of the server. Temporal also handles membership and routing. You can find more details in the clusterMetadata section of the Server Configuration reference.

```yaml
clusterMetadata:
  enableGlobalNamespace: false
  failoverVersionIncrement: 10
  masterClusterName: "active"
  currentClusterName: "active"
  clusterInformation:
    active:
      enabled: true
      initialFailoverVersion: 0
      rpcAddress: ""
```

FAQ: Multiple deployments on a single cluster

You may sometimes want to run multiple parallel deployments on the same cluster, e.g.:

  • when you want to split Temporal deployments based on namespaces, e.g. staging/dev/uat, or for different teams who need to share common infrastructure.
  • when you need a new deployment to change numHistoryShards.

We recommend not doing this if you can avoid it. If you need to do it anyway, double-check the following:

  • Have a separate persistence (database) for each deployment
  • Cluster membership ports should be different for each deployment (they can be set through environment variables). For example:
    • Temporal1 services can have 7233 for frontend, 7234 for history, 7235 for matching
    • Temporal2 services can have 8233 for frontend, 8234 for history, 8235 for matching
  • There is no need to change gRPC ports.

More details about the reason here.

Server limits

Running into limits can cause unexpected failures, so be mindful when you design your systems. Here is a comprehensive list of all the hard (error) / soft (warn) server limits relevant to operating Temporal:

Securing Temporal

Please see our dedicated docs on Temporal Server Security.

Debugging Temporal

Debugging Temporal Server Configs

Recommended configuration debugging techniques for production Temporal Server setups:

Debugging Workflows

We recommend using Temporal Web to debug your Workflow Executions in development and production.

Tracing Workflows

Temporal Web's tracing capabilities mainly track Activity Execution within a Temporal context. If you need custom tracing specific to your use case, you should make use of context propagation to add tracing logic accordingly.

Further things to consider


This document is still being written and we would welcome your questions and contributions.

Please search for these topics in our forum or ask on Slack.

Temporal Antipatterns

Please request elaboration on any of these.

  • Trying to implement a queue in a workflow (because people hear we replace queues)
  • Serializing massive amounts of state into and out of the workflow.
  • Treating everything as a rigid/linear sequence of steps instead of dynamic logic
  • Implementing a DSL which is actually just a generic schema based language
  • Polling in activities instead of using signals
  • Blocking on incredibly long RPC requests and not using heartbeats
  • Failing/retrying workflows without a very very specific business reason

Temporal Best practices

Please request elaboration on any of these.

  • Mapping things to entities instead of traditional service design
  • Testing: unit, integration
  • Retries: figuring out right values for timeouts
  • Versioning
  • The Workflow is Temporal's fundamental unit of scalability - break things into workflows to scale, don't try to stuff everything in one workflow!

External Runbooks

Third party content that may help:
