A lot of effort has gone into making the Temporal Server easy to run and test in a development environment (see the Quick install guide), but there is far less of an established framework for deploying Temporal to a live (production) environment. That is because the setup of the Server depends heavily on the intended use case and the hosting infrastructure.
This page is dedicated to providing a "first principles" approach to self-hosting the Temporal Server. As a reminder, experts are accessible via the Community forum and Slack should you have any questions.
The Temporal Server is a Go application which you can import or run as a binary.
The minimum dependency is a database. The Server supports Cassandra, MySQL, or PostgreSQL. Further dependencies are only needed to support optional features. For example, enhanced Workflow search can be achieved using Elasticsearch, and monitoring and observability are available with Prometheus and Grafana.
See the versions & dependencies page for precise versions we support together with these features.
The Server configuration reference has a more complete list of possible parameters.
The requirements of your Temporal system will vary widely based on your intended production workload. You will want to run your own proof of concept tests and watch for key metrics to understand the system health and scaling needs.
- Configure your metrics subsystem. Temporal supports three metrics providers out of the box: StatsD, Prometheus, and M3.
- Set up monitoring. You can use these Grafana dashboards as a starting point.
- Load testing. You can use Maru (author's guide here) or write your own.
At a high level, you will want to track these 3 categories of metrics:
- Service metrics: For each request made by the service handler, the Server emits metrics with tags such as `namespace`. This gives you basic visibility into service usage and allows you to look at request rates across services, namespaces, and even operations.
- Persistence metrics: The Server emits `persistence_latency` metrics for each persistence operation. These metrics include the `operation` tag so that you can get the request rates, error rates, or latencies per operation. These are very useful for identifying issues caused by the database.
- Workflow stats: The Server also emits counters when Workflows complete, such as `workflow_cancel`, one for each type of Workflow completion. These are useful for getting overall stats about Workflow completions. They also include the `namespace` tag. Additional information is available in this forum post.
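As a rough illustration of slicing metrics by tag, the sketch below parses a few sample lines in the Prometheus text exposition format and computes an error ratio per `operation` tag. The sample values, the helper names, and the `persistence_requests` / `persistence_errors` counter names are assumptions made for this example; check which metric names your Server version actually emits before relying on them.

```python
import re
from collections import defaultdict

# Hypothetical sample scrape; metric names and values are illustrative only.
SAMPLE = """\
persistence_requests{operation="CreateWorkflowExecution"} 200
persistence_requests{operation="UpdateWorkflowExecution"} 1000
persistence_errors{operation="UpdateWorkflowExecution"} 25
"""

# Matches lines shaped like: name{label="value",...} number
LINE_RE = re.compile(r'^(\w+)\{([^}]*)\}\s+([0-9.eE+-]+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def totals_by_operation(text, metric):
    """Sum the samples of `metric`, grouped by the `operation` label."""
    totals = defaultdict(float)
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match or match.group(1) != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group(2)))
        totals[labels.get("operation", "unknown")] += float(match.group(3))
    return dict(totals)


def error_ratio_by_operation(text):
    """Return errors / requests for every operation that saw requests."""
    requests = totals_by_operation(text, "persistence_requests")
    errors = totals_by_operation(text, "persistence_errors")
    return {op: errors.get(op, 0.0) / count
            for op, count in requests.items() if count}


ratios = error_ratio_by_operation(SAMPLE)
for op, ratio in sorted(ratios.items()):
    print(f"{op}: {ratio:.1%} errors")
```

In a real deployment you would normally compute these ratios in your metrics backend rather than by hand; the point here is only that the `operation` tag lets you break a single metric down per operation.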
⚠️ This is a basic guide to troubleshooting/debugging Temporal applications. It is work-in-progress and we encourage reading about our Architecture for more detail. The better you understand how Temporal works, the better you will be at debugging your workflows.
If you have the time, we recommend watching our 19-minute video guide on YouTube, which demonstrates the debugging explained below.
The primary mechanism we recommend for debugging is Temporal Web, which is run in a separate process:
- Workflows are identified by their Workflow ID, which you provide when creating the workflow. They also have a Name which is directly taken from your code.
- Workflow Status is usually in one of a few states: Running, Completed, or Terminated, with Start Time and End Time shown accordingly.
- Workflow IDs are distinct from Run IDs, which uniquely identify one of potentially many Runs of Workflows with the same Workflow ID.
Tip: Don't confuse Runs with Workflow Executions - they are similar, but a long-running Workflow Execution can have multiple Runs. A Run is the atomic unit.
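The relationship between a Workflow ID and its Run IDs can be pictured with a small data model. This is only a sketch with hypothetical class and field names (`Run`, `WorkflowRuns`, `start_run`); it is not the data model used by the Server or any SDK.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Run:
    run_id: str
    status: str = "Running"


@dataclass
class WorkflowRuns:
    """All Runs that share one Workflow ID, oldest first."""
    workflow_id: str
    runs: list = field(default_factory=list)

    def start_run(self):
        # Every new Run gets its own unique Run ID; the Workflow ID is unchanged.
        run = Run(run_id=str(uuid.uuid4()))
        self.runs.append(run)
        return run

    def current_run(self):
        return self.runs[-1]


workflow = WorkflowRuns(workflow_id="order-1234")
first = workflow.start_run()
first.status = "Completed"
second = workflow.start_run()

print(workflow.workflow_id, [run.run_id for run in workflow.runs])
```

Looking something up by Workflow ID alone is therefore ambiguous once more than one Run exists; the pair of Workflow ID and Run ID pins down a single Run.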
The full state of every Run is inspectable in Temporal Web:
- If your workflows seem like they aren't receiving the right data, check the Input arguments given.
- If your workflows seem "stuck", check the Task Queue assigned to a given workflow to see that there are active workers polling.
- If you inspect the Pending Activities and see an activity with a lot of retry attempts, you can check the `lastFailure` field for a clue as to what happened.
- If you need to go back in time from the current state, check the History Events, where you can see the full Workflow Execution History logs (this is what makes Temporal so resilient).
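The Pending Activities check in the list above can also be scripted once you have the data in hand. The sketch below assumes a simplified, hypothetical snapshot: only `lastFailure` is named in the text above, while `activityType`, `attempt`, the nested `message` key, and the `max_attempts` threshold are assumptions for illustration.

```python
# Hypothetical snapshot of one Run's pending activities. Only `lastFailure`
# comes from the text above; the other field names are assumed.
pending_activities = [
    {"activityType": "ChargeCard", "attempt": 1, "lastFailure": None},
    {"activityType": "SendEmail", "attempt": 14,
     "lastFailure": {"message": "connection refused"}},
]


def suspicious_activities(activities, max_attempts=5):
    """Return (type, attempts, last failure message) for heavy retriers."""
    flagged = []
    for activity in activities:
        if activity["attempt"] > max_attempts:
            failure = activity.get("lastFailure") or {}
            flagged.append((
                activity["activityType"],
                activity["attempt"],
                failure.get("message", "no failure recorded"),
            ))
    return flagged


flagged = suspicious_activities(pending_activities)
for name, attempts, message in flagged:
    print(f"{name}: {attempts} attempts, last failure: {message}")
```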
Reading execution histories is one of the more reliable ways of debugging:
Here, you can see the exact sequence of events that has happened so far, which includes the relevant state for each event and details about what went wrong or what is preventing the next correct event. There are about 40 system events in total. See our Temporal Server Event Types reference for detailed descriptions.
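When a history is long, it can help to summarize it before reading it event by event. The following sketch works on a hypothetical, trimmed-down list of events; the dictionary layout (`eventId`, `eventType`) and the suffix-based failure check are assumptions for illustration, and real events carry many more attributes.

```python
# Hypothetical, trimmed-down history for one Run.
history = [
    {"eventId": 1, "eventType": "WorkflowExecutionStarted"},
    {"eventId": 2, "eventType": "WorkflowTaskScheduled"},
    {"eventId": 3, "eventType": "WorkflowTaskStarted"},
    {"eventId": 4, "eventType": "WorkflowTaskCompleted"},
    {"eventId": 5, "eventType": "ActivityTaskScheduled"},
    {"eventId": 6, "eventType": "ActivityTaskStarted"},
    {"eventId": 7, "eventType": "ActivityTaskFailed"},
]

# Assumed heuristic: event types ending in these suffixes signal trouble.
PROBLEM_SUFFIXES = ("Failed", "TimedOut")


def summarize_history(events):
    """Return the most recent event type and every failure/timeout event."""
    ordered = sorted(events, key=lambda event: event["eventId"])
    problems = [
        (event["eventId"], event["eventType"])
        for event in ordered
        if event["eventType"].endswith(PROBLEM_SUFFIXES)
    ]
    return {"last_event": ordered[-1]["eventType"], "problems": problems}


summary = summarize_history(history)
print(summary)
```

The last event tells you where the Run currently stands, and the list of problem events points at the places worth reading in full.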
Temporal also stores the stack trace of where a given activity is currently blocked:
This is often a good way to get a deep understanding of whether your workflow is executing as expected.
Here we will discuss how to proceed once you have identified and fixed the code for an erroring activity.
If your activity code is deterministic, you might be able to simply restart the worker to pick up the changes. Execution will continue from where it last succeeded. In other words, we get "hotfixing for free" due to Temporal's execution model.
However, if your activity is more complex, you will have to explicitly version your workflows or even manually terminate and restart the workflows.
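To build intuition for why execution can continue from where it last succeeded, here is a toy model of the idea. It is a deliberately simplified sketch and not how Temporal is implemented: the `run_workflow` helper, the step names, and the in-memory `history` dictionary are all hypothetical.

```python
def run_workflow(steps, history):
    """Run `steps` in order, reusing results already recorded in `history`.

    `steps` is a list of (name, function) pairs and `history` maps a step
    name to its recorded result. Steps found in `history` are not re-run.
    """
    executed = []
    for name, func in steps:
        if name in history:
            continue  # Succeeded before the restart; reuse the recorded result.
        history[name] = func()  # May raise; earlier successes stay recorded.
        executed.append(name)
    return executed


def reserve():
    return "reserved"


def charge_buggy():
    raise RuntimeError("bug in activity code")


def charge_fixed():
    return "charged"


history = {}
try:
    run_workflow([("reserve", reserve), ("charge", charge_buggy)], history)
except RuntimeError:
    pass  # The worker is stopped and redeployed with the fixed code.

# After the restart, only the step that never succeeded is executed.
executed_after_fix = run_workflow(
    [("reserve", reserve), ("charge", charge_fixed)], history
)
print(executed_after_fix)
```

The fixed code is picked up without repeating the work that had already succeeded, which is the "hotfixing for free" effect described above. If the fix changes the order or number of steps, this simple reuse of recorded results no longer lines up, which is where explicit versioning comes in.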
This section is still being written - if you have specific questions you'd like us to answer, please search or ask on the Temporal Forum.
Topics this document will cover in future: (for now, please search/ask on the forum)
- Recommended Environment
- Using Temporal Web
- More on Monitoring/Prometheus/Logging
- Give guidance on how to set up alerts on Metrics provided by SDK
- Setting up alerts for Workflow Task failures
- Best practices for writing Workflow Code:
  - Testing: unit, integration
  - Retries: figuring out right values for timeouts
Understanding the Temporal Server architecture can help you debug and troubleshoot production deployment issues.
Third party content that may help:
- Recommended Setup for Running Temporal with Cassandra on Production (Temporal Forums)
- How To Deploy Temporal to Azure Container Instances
- How To Deploy Temporal to Azure Kubernetes Service (AKS)
- ECS runbook (to be completed)
- EKS runbook (to be completed)