There is a tight coupling between Temporal Task Queues and Worker Processes.
What is a Worker?
In day-to-day conversations, the term Worker is used to denote either a Worker Program, a Worker Process, or a Worker Entity. Temporal documentation aims to be explicit and differentiate between them.
What is a Task?
There are two types of Tasks:
What is a Worker Program?
A Worker Program is the static code that defines the constraints of the Worker Process, developed using the APIs of a Temporal SDK.
What is a Worker Entity?
A Worker Entity is the individual Worker within a Worker Process that listens to a specific Task Queue.
A Worker Entity listens and polls on a single Task Queue. A Worker Entity contains a Workflow Worker and/or an Activity Worker, which makes progress on Workflow Executions and Activity Executions, respectively.
Can a Worker handle more Workflow Executions than its cache size or number of supported threads?
Yes it can. However, the trade off is added latency.
Workers are stateless, so any Workflow Execution in a blocked state can be safely removed from a Worker. Later on, it can be resurrected on the same or different Worker when the need arises (in the form of an external event). Therefore, a single Worker can handle millions of open Workflow Executions, assuming it can handle the update rate and that a slightly higher latency is not a concern.
What is a Worker Process?
Component diagram of a Worker Process and the Temporal Server
More formally, a Worker Process is any process that implements the Task Queue Protocol and the Task Execution Protocol.
- A Worker Process is a Workflow Worker Process if the process implements the Workflow Task Queue Protocol and executes the Workflow Task Execution Protocol to make progress on a Workflow Execution. A Workflow Worker Process can listen on an arbitrary number of Workflow Task Queues and can execute an arbitrary number of Workflow Tasks.
- A Worker Process is an Activity Worker Process if the process implements the Activity Task Queue Protocol and executes the Activity Task Processing Protocol to make progress on an Activity Execution. An Activity Worker Process can listen on an arbitrary number of Activity Task Queues and can execute an arbitrary number of Activity Tasks.
Worker Processes are external to a Temporal Cluster. Temporal Application developers are responsible for developing Worker Programs and operating Worker Processes. Said another way, the Temporal Cluster (including the Temporal Cloud) doesn't execute any of your code (Workflow and Activity Definitions) on Temporal Cluster machines. The Cluster is solely responsible for orchestrating State Transitions and providing Tasks to the next available Worker Entity.
While data transferred in Event Histories is secured by mTLS, by default, it is still readable at rest in the Temporal Cluster.
To solve this, Temporal SDKs offer a Data Converter API that you can use to customize the serialization of data going out of and coming back in to a Worker Entity, with the net effect of guaranteeing that the Temporal Cluster cannot read sensitive business data.
In many of our tutorials, we show you how to run both a Temporal Cluster and one Worker on the same machine for local development. However, a production-grade Temporal Application typically has a fleet of Worker Processes, all running on hosts external to the Temporal Cluster. A Temporal Application can have as many Worker Processes as needed.
A Worker Process can be both a Workflow Worker Process and an Activity Worker Process. Many SDKs support the ability to have multiple Worker Entities in a single Worker Process. (Worker Entity creation and management differ between SDKs.) A single Worker Entity can listen to only a single Task Queue. But if a Worker Process has multiple Worker Entities, the Worker Process could be listening to multiple Task Queues.
Worker Processes executing Activity Tasks must have access to any resources needed to execute the actions that are defined in Activity Definitions, such as the following:
- Network access for external API calls.
- Credentials for infrastructure provisioning.
- Specialized GPUs for machine learning utilities.
The Temporal Cluster itself has internal workers for system Workflow Executions. However, these internal workers are not visible to the developer.
What is a Workflow Task?
A Workflow Task is a Task that contains the context needed to make progress with a Workflow Execution.
- Every time a new external event that might affect a Workflow state is recorded, a Workflow Task that contains the event is added to a Task Queue and then picked up by a Workflow Worker.
- After the new event is handled, the Workflow Task is completed with a list of Commands.
- Handling of a Workflow Task is usually very fast and is not related to the duration of operations that the Workflow invokes.
What is a Workflow Task Execution?
What is an Activity Task?
An Activity Task contains the context needed to proceed with an Activity Task Execution. Activity Tasks largely represent the Activity Task Scheduled Event, which contains the data needed to execute an Activity Function.
If Heartbeat data is being passed, an Activity Task will also contain the latest Heartbeat details.
What is an Activity Task Execution?
The ActivityTaskScheduled Event corresponds to when the Temporal Cluster puts the Activity Task into the Task Queue.
The ActivityTaskStarted Event corresponds to when the Worker picks up the Activity Task from the Task Queue.
Either ActivityTaskCompleted or one of the other Closed Activity Task Events corresponds to when the Worker has yielded back to the Temporal Cluster.
The API to schedule an Activity Execution provides an "effectively once" experience, even though there may be several Activity Task Executions that take place to successfully complete an Activity.
Once an Activity Task finishes execution, the Worker responds to the Cluster with a specific Event:
What is a Task Queue?
There are two types of Task Queues, Activity Task Queues and Workflow Task Queues.
Task Queue component
Task Queues are very lightweight components. Task Queues do not require explicit registration but instead are created on demand when a Workflow Execution or Activity spawns or when a Worker Process subscribes to it. When a Task Queue is created, both a Workflow Task Queue and an Activity Task Queue are created under the same name. There is no limit to the number of Task Queues a Temporal Application can use or a Temporal Cluster can maintain.
Workers poll for Tasks in Task Queues via synchronous RPC. This implementation offers several benefits:
- A Worker Process polls for a message only when it has spare capacity, avoiding overloading itself.
- In effect, Task Queues enable load balancing across many Worker Processes.
- Task Queues enable what we call Task Routing, which is the routing of specific Tasks to specific Worker Processes or even a specific process.
- Task Queues support server-side throttling, which enables you to limit the Task dispatching rate to the pool of Worker Processes while still supporting Task dispatching at higher rates when spikes happen.
- When all Worker Processes are down, messages simply persist in a Task Queue, waiting for the Worker Processes to recover.
- Worker Processes do not need to advertise themselves through DNS or any other network discovery mechanism.
- Worker Processes do not need to have any open ports, which is more secure.
All Workers listening to a given Task Queue must have identical registrations of Activities and/or Workflows. The one exception is during a Server upgrade, where it is okay to have registration temporarily misaligned while the binary rolls out.
Where to set Task Queues
There are four places where the name of the Task Queue can be set by the developer.
- A Task Queue must be set when spawning a Workflow Execution:
- How to start a Workflow Execution using tctl
- How to start a Workflow Execution using the Go SDK
- How to start a Workflow Execution using the Java SDK
- How to start a Workflow Execution using the PHP SDK
- How to start a Workflow Execution using the Python SDK
- How to start a Workflow Execution using the TypeScript SDK
- A Task Queue name must be set when creating a Worker Entity and when running a Worker Process:
Note that all Worker Entities listening to the same Task Queue name must be registered to handle the exact same Workflows Types and Activity Types.
If a Worker Entity polls a Task for a Workflow Type or Activity Type it does not know about, it will fail that Task. However, the failure of the Task will not cause the associated Workflow Execution to fail.
- A Task Queue name can be provided when spawning an Activity Execution:
This is optional. An Activity Execution inherits the Task Queue name from its Workflow Execution if one is not provided.
- How to start an Activity Execution using the Go SDK
- How to start an Activity Execution using the Java SDK
- How to start an Activity Execution using the PHP SDK
- How to start an Activity Execution using the Python SDK
- How to start an Activity Execution using the TypeScript SDK
- A Task Queue name can be provided when spawning a Child Workflow Execution:
This is optional. A Child Workflow Execution inherits the Task Queue name from its Parent Workflow Execution if one is not provided.
- How to start a Child Workflow Execution using the Go SDK
- How to start a Child Workflow Execution using the Java SDK
- How to start a Child Workflow Execution using the PHP SDK
- How to start a Child Workflow Execution using the Python SDK
- How to start a Child Workflow Execution using the TypeScript SDK
Task Queues can be scaled by adding partitions. The default number of partitions is 4.
Task Queues with multiple partitions do not have any ordering guarantees. Once there is a backlog of Tasks that have been written to disk, Tasks that can be dispatched immediately (“sync matches”) are delivered before tasks from the backlog (“async matches”). This approach optimizes throughput.
Task Queues with a single partition are almost always first-in, first-out, with rare edge case exceptions. However, using a single partition limits you to low- and medium-throughput use cases.
This section is on the ordering of individual Tasks, and does not apply to the ordering of Workflow Executions, Activity Executions, or Events in a single Workflow Execution. The order of Events in a Workflow Execution is guaranteed to remain constant once they have been written to that Workflow Execution's History.
What is a Sticky Execution?
A Sticky Execution is when a Worker Entity caches the Workflow in memory and creates a dedicated Task Queue to listen on.
A Sticky Execution occurs after a Worker Entity completes the first Workflow Task in the chain of Workflow Tasks for the Workflow Execution.
The Worker Entity caches the Workflow in memory and begins polling the dedicated Task Queue for Workflow Tasks that contain updates, rather than the entire Event History.
If the Worker Entity does not pick up a Workflow Task from the dedicated Task Queue in an appropriate amount of time, the Cluster will resume Scheduling Workflow Tasks on the original Task Queue. Another Worker Entity can then resume the Workflow Execution, and can set up its own Sticky Execution for future Workflow Tasks.
Sticky Executions are the default behavior of the Temporal Platform.
What is Task Routing?
Task Routing is simply when a Task Queue is paired with one or more Workers, primarily for Activity Task Executions.
This could also mean employing multiple Task Queues, each one paired with a Worker Process.
Task Routing has many applicable use cases.
Some SDKs provide a Session API that provides a straightforward way to ensure that Activity Tasks are executed with the same Worker without requiring you to manually specify Task Queue names. It also includes features like concurrent session limitations and worker failure detection.
A Worker that consumes from a Task Queue asks for an Activity Task only when it has available capacity, so it is never overloaded by request spikes. If Activity Tasks get created faster than Workers can process them, they are backlogged in the Task Queue.
The rate at which each Activity Worker polls for and processes Activity Tasks is configurable per Worker. Workers do not exceed this rate even if it has spare capacity. There is also support for global Task Queue rate limiting. This limit works across all Workers for the given Task Queue. It is frequently used to limit load on a downstream service that an Activity calls into.
In some cases, you might need to execute Activities in a dedicated environment. To send Activity Tasks to this environment, use a dedicated Task Queue.
Route Activity Tasks to a specific host
In some use cases, such as file processing or machine learning model training, an Activity Task must be routed to a specific Worker Process or Worker Entity.
For example, suppose that you have a Workflow with the following three separate Activities:
- Download a file.
- Process the file in some way.
- Upload a file to another location.
The first Activity, to download the file, could occur on any Worker on any host. However, the second and third Activities must be executed by a Worker on the same host where the first Activity downloaded the file.
In a real-life scenario, you might have many Worker Processes scaled over many hosts. You would need to develop your Temporal Application to route Tasks to specific Worker Processes when needed.
Route Activity Tasks to a specific process
Some Activities load large datasets and cache them in the process. The Activities that rely on those datasets should be routed to the same process.
In this case, a unique Task Queue would exist for each Worker Process involved.
Workers with different capabilities
Some Workers might exist on GPU boxes versus non-GPU boxes. In this case, each type of box would have its own Task Queue and a Workflow can pick one to send Activity Tasks.
If your use case involves more than one priority, you can create one Task Queue per priority, with a Worker pool per priority.
Task Routing is the simplest way to version your code.
If you have a new backward-incompatible Activity Definition, start by using a different Task Queue.
What is a Worker Session?
A Worker Session is a feature provided by some SDKs that provides a straightforward API for Task Routing to ensure that Activity Tasks are executed with the same Worker without requiring you to manually specify Task Queue names. It also includes features like concurrent session limitations and Worker failure detection.
What is Worker Versioning?
Worker Versioning simplifies the process of deploying changes to Workflow Definitions. It does this by letting you define sets of versions that are compatible with each other, and then assigning a Build ID to the code that defines a Worker. The Temporal Server uses the Build ID to determine which versions of a Workflow Definition a Worker can process.
We recommend that you read about Workflow Definitions before proceeding, because Workflow Versioning is largely concerned with helping to manage nondeterministic changes to those definitions.
Worker Versioning helps manage nondeterministic changes by providing a convenient way to ensure that Workers with different Workflow and Activity Definitions operating on the same Task Queue don't attempt to process Workflow Tasks and Activity Tasks that they can't successfully process, according to sets of versions associated with that Task Queue that you've defined.
Accomplish this goal by assigning a Build ID (a free-form string) to the code that defines a Worker, and specifying which Build IDs are compatible with each other by updating the version sets associated with the Task Queue, stored by the Temporal Server.
When and why you should use Worker Versioning
The main reason to use this feature is to deploy incompatible changes to short-lived Workflows. On Task Queues using this feature, the Workflow starter doesn't have to know about the introduction of new versions.
The new code in the newly deployed Workers executes new Workflow Executions, while only Workers with an appropriate version process old Workflow Executions.
Decommission old Workers
You can decommission old Workers after you archive all open Workflows using their version. If you have no need to query closed Workflows, you can decommission them when no open Workflows remain at that version.
For example, if you have a Workflow that completes within a day, a good strategy is to assign a new Build ID to every new Worker build and add it as the new overall default in the version sets.
Because your Workflow completes in a day, you know that you won't need to keep older Workers running for more than a day after you deploy the new version (assuming availability).
Removing old versions from the sets isn't necessary. Leaving them doesn't cause any harm.
You can apply this technique to longer-lived Workflows too; however, you might need to run multiple Worker versions simultaneously while open Workflows complete.
Deploy code changes to Workers
The feature also lets you implement compatible changes to or prevent a buggy code path from executing on currently open Workflows. You can achieve this by adding a new version to an existing set and defining it as compatible with an existing version, which shouldn't execute any future Workflow Tasks. Because the new version processes existing Event Histories, it must adhere to the usual deterministic constraints, and you might need to use one of the versioning APIs.
Moreover, this feature lets you make incompatible changes to Activity Definitions in conjunction with incompatible changes to Workflow Definitions that use those Activities. This functionality works because any Activity that a Workflow schedules on the same Task Queue gets dispatched by default only to Workers compatible with the Workflow that scheduled it. If you want to change an Activity Definition's type signature while creating a new incompatible Build ID for a Worker, you can do so without worrying about the Activity failing to execute on some other Worker with an incompatible definition. The same principle applies to Child Workflows. For both Activities and Child Workflows, you can override the default behavior and run the Activity or Child Workflow on latest default version.
Public-facing Workflows on a versioned Task Queue shouldn't change their signatures because doing so contradicts the purpose of Workflow-launching Clients remaining unaware of changes in the Workflow Definition. If you need to change a Workflow's signature, use a different Workflow Type or a completely new Task Queue.
If you schedule an Activity or a Child Workflow on a different Task Queue from the one the Workflow runs on, the system doesn't assign a specific version. This means if the target queue is versioned, they run on the latest default, and if it's unversioned, they operate as they would have without this feature.
Continue-As-New and Worker Versioning
By default, a versioned Task Queue's Continue-as-New function starts the continued Workflow on the same compatible set as the original Workflow.
If you continue-as-new onto a different Task Queue, the system doesn't assign any particular version. You also have the option to specify that the continued Workflow should start using the Task Queue's latest default version.
How to use Worker Versioning
To use Worker Versioning, follow these steps:
- Define Worker build-identifier version sets for the Task Queue.
You can use either the
temporalCLI or your choice of SDK.
- Enable the feature on your Worker by specifying a Build ID.
Defining the version sets
Whether you use Temporal CLI or an SDK, updating the version sets feels the same. You specify the Task Queue that you're targeting, the Build ID that you're adding (or promoting), whether it becomes the new default version, and any existing versions it should be considered compatible with.
The rest of this section uses updates to one Task Queue's version sets as examples.
By default, both Task Queues and Workers are in an unversioned state. Unversioned Worker can poll unversioned Task Queues and receive tasks. To use this feature, both the Task Queue and the Worker must be associated with Build IDs.
If you run a Worker using versioning against a Task Queue that has not been set up to use versioning (or is missing that Worker's Build ID), it won't get any tasks. Likewise, a unversioned Worker polling a Task Queue with versioning won't work either.
The versions in the following examples look like semver versions for clarity, but they don't need to be. Versions can be any arbitrary string.
First, add a version
1.0 to the Task Queue as the new default.
Your version sets now look like this:
|set 1 (default)|
All new Workflows started on the Task Queue have their first tasks assigned to version
Workers with their Build ID set to
1.0 receive these Tasks.
If Workflows that don't have an assigned version are still running on the Task Queue, Workers without a version take those tasks. So ensure that such Workers are still operational if any Workflows were open when you added the first version. If you deployed any Workers with a different version, those Workers receive no Tasks.
Now, imagine you need change the Workflow for some reason.
2.0 to the sets as the new default:
|set 1||set 2 (default)|
|1.0 (default)||2.0 (default)|
All new Workflows started on the Task Queue have their first tasks assigned to version
1.0 Workflows keep generating tasks targeting
Each deployment of Workers receives their respective Tasks.
This same concept carries forward for each new incompatible version.
Maybe you have a bug in
2.0, and you want to make sure all open
2.0 Workflows switch to some new code as fast as possible.
So, you add
2.1 to the sets, marking it as compatible with
Now your sets look like this:
|set 1||set 2 (default)|
All new Workflow Tasks that are generated for Workflows whose last Workflow Task completion was on version
2.0 are now assigned to version
Because you specified that
2.1 is compatible with
2.0, Temporal Server assumes that Workers with this version can process the existing Event Histories successfully.
Continue with your normal development cycle, adding a
Nothing new here:
|set 1||set 2||set 3 (default)|
|1.0 (default)||2.0||3.0 (default)|
Now imagine that version
3.0 doesn't have an explicit bug, but something about the business logic
is less than ideal.
You are okay with existing
3.0 Workflows running to completion, but you want new Workflows to use the old
This operation is supported by performing an update targeting
2.0) and setting its set as the current default, which results in these sets:
|set 1||set 3||set 2 (default)|
|1.0 (default)||3.0 (default)||2.0|
Now new Workflows start on
Permitted and forbidden operations on version sets
A request to change the sets can do one of the following:
- Add a version to the sets as the new default version in a new overall-default compatible set.
- Add a version to an existing set that's compatible with an existing version.
- Optionally making it the default for that set.
- Optionally making that set the overall-default set.
- Promote a version within an existing set to become the default for that set.
- Promote a set to become the overall-default set.
You can't explicitly delete versions.This helps you avoid the situation in which Workflows accidentally become stuck with no means of making progress because the version they're associated with no longer exists.
However, sometimes you might want to do this intentionally.
If you want to make sure that all Workflows currently being processed by, say,
2.0 stop (even if you don't yet have a new version ready), you can add a new version
2.1 to the sets marked as compatible with
New tasks will target
2.1, but because you haven't deployed any
2.1 Workers, they won't make any progress.
The sets have a maximum size limit, which defaults to 100 build IDs across all sets.
This limit is configurable on Temporal Server via the
limit.versionBuildIdLimitPerQueue dynamic config property.
Operations to add new Build IDs to the sets fail if the limit would be exceeded.
There is also a limit on the number of sets, which defaults to 10.
This limit is configurable via the
limit.versionCompatibleSetLimitPerQueue dynamic config property.
In practice, these limits should rarely be a concern because a version is no longer needed after no open Workflows are using that version, and a background process will delete IDs and sets that are no longer needed.
There is also a limit on the size of each Build ID or version string, which defaults to 255 characters.
This limit is configurable on the server via the
limit.workerBuildIdSize dynamic config property.
Build ID reachability
Eventually, you'll want to know whether you can retire the old Worker versions. Temporal provides functionality to help you determine whether a version is still in use by open or closed Workflows. You can use the Temporal CLI to do this with the following command:
temporal task-queue get-build-id-reachability
The command determines, for each Task Queue, whether the Build ID in question is unreachable, only reachable by closed Workflows, or reachable by open and new Workflows. For example, this "2.0" Build ID is shown here by the CLI to be reachable by both new Workflows and some existing Workflows:
temporal task-queue get-build-id-reachability --build-id "2.0"
BuildId TaskQueue Reachability
2.0 build-id-versioning-dc0068f6-0426-428f-b0b2-703a7e409a97 [NewWorkflows
For more information, see the CLI documentation or help output.
You can also use this API
GetWorkerTaskReachability directly from within language SDKs.
Unversioned Workers refer to Workers that have not opted into the Worker Versioning feature in their configuration. They receive tasks only from Task Queues that do not have any version sets defined on them, or that have open workflows that began executing before versions were added to the queue.
To migrate from an unversioned Task Queue, add a new default Build ID to the Task Queue. From there, deploy Workers with the same Build ID. Unversioned Workers will continue processing open Workflows, while Workers with the new Build ID will process new Workflow Executions.