Develop code that durably executes - Python SDK dev guide

The Temporal Platform's ability to durably execute code depends in large part on the SDK's ability to Replay a Workflow Execution. This chapter introduces the development patterns that make Replay possible.

Develop for a Durable Execution

This chapter of the Temporal Python SDK developer's guide introduces best practices for developing deterministic Workflows that can be Replayed, enabling a Durable Execution.

By the end of this section you will know basic best practices for Workflow Definition development.

Learning objectives:

Identify SDK API calls that map to Events
Recognize non-deterministic Workflow code
Explain how Workflow code execution progresses

The information in this chapter is also available in the Temporal 102 course.

This chapter builds on the Construct a new Temporal Application project chapter and relies on the Background Check use case and sample applications as a means to contextualize the information.

This chapter walks through the following sequence:

Retrieve a Workflow Execution's Event History
Add a Replay test to your application
Intrinsic non-deterministic logic
Non-deterministic code changes

Retrieve a Workflow Execution's Event History

There are a few ways to view and download a Workflow Execution's Event History. We recommend starting off by using either the Temporal CLI or the Web UI to access it.

Using the Temporal CLI

Use the Temporal CLI's temporal workflow show command to save your Workflow Execution's Event History to a local file. Run the command from the project root so that the file is saved in the tests directory alongside the other testing files.

.
├── backgroundcheck.py
├── main.py
├── ssntraceactivity.py
└── tests
    ├── __init__.py
    ├── backgroundcheck_workflow_history.json
    ├── conftest.py
    └── replay_dacx_test.py

Local dev server

If you have been following along with the earlier chapters of this guide, your Workflow Id might be something like backgroundcheck_workflow.

temporal workflow show \
 --workflow-id backgroundcheck_workflow \
 --namespace backgroundcheck_namespace \
 --output json > tests/backgroundcheck_workflow_history.json

Workflow Id returns the most recent Workflow Execution

The most recent Event History for that Workflow Id is returned when you only use the Workflow Id to identify the Workflow Execution. Use the --run-id option as well to get the Event History of an earlier Workflow Execution by the same Workflow Id.

Temporal Cloud

For Temporal Cloud, remember to either provide the paths to your certificate and private keys as command options, or set those paths as environment variables:

temporal workflow show \
 --workflow-id backgroundcheck_workflow \
 --namespace backgroundcheck_namespace \
 --tls-cert-path /path/to/ca.pem \
 --tls-key-path /path/to/ca.key \
 --output json  > tests/backgroundcheck_workflow_history.json

Self-hosted Temporal Cluster

For self-hosted environments, you might be using the Temporal CLI command alias:

temporal_docker workflow show \
 --workflow-id backgroundcheck_workflow \
 --namespace backgroundcheck_namespace \
 --output json  > tests/backgroundcheck_workflow_history.json

Via the UI

A Workflow Execution's Event History is also available in the Web UI.

Navigate to the Workflows page in the UI and select the Workflow Execution.

Select a Workflow Execution from the Workflows page

From the Workflow details page you can copy the Event History from the JSON tab and paste it into the backgroundcheck_workflow_history.json file.

Copy Event History JSON object from the Web UI
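
Whichever way you save the file, a quick sanity check can confirm that it contains the Events you expect before you use it in a test. The following is a minimal sketch, not part of the sample application. It assumes the exported JSON has a top-level "events" list whose items carry an "eventType" field; the exact field names and the spelling of the Event types can differ between CLI and UI versions, so check your own file. The inline sample string is a hypothetical, heavily trimmed stand-in for backgroundcheck_workflow_history.json.

```python
import json

# Hypothetical, trimmed stand-in for tests/backgroundcheck_workflow_history.json.
# The "events" and "eventType" keys are assumptions about the export format.
SAMPLE_HISTORY = """
{
  "events": [
    {"eventId": "1", "eventType": "WorkflowExecutionStarted"},
    {"eventId": "2", "eventType": "WorkflowTaskScheduled"},
    {"eventId": "3", "eventType": "WorkflowTaskStarted"}
  ]
}
"""


def list_event_types(history_json: str) -> list[str]:
    """Return the Event types in the order they appear in the history."""
    history = json.loads(history_json)
    return [event["eventType"] for event in history["events"]]


event_types = list_event_types(SAMPLE_HISTORY)
print(event_types)
```

To run the same check against a real export, pass the contents of the saved file to list_event_types() instead of the sample string.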

Add a Replay test

Add the Replay test to the set of application tests. The Replayer is available from the Replayer class in the SDK. Register the Workflow Definition and then specify an existing Event History to compare to.

Run the tests in the test directory (pytest). If the Workflow Definition and the Event History are incompatible, then the test fails.

View the source code in the context of the rest of the application code.

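# Replayer is in temporalio.worker; WorkflowHistory is in temporalio.client.
@pytest.mark.asyncio
async def test_replay_workflow_history_from_file():
    async with await WorkflowEnvironment.start_time_skipping():
        with open("tests/backgroundcheck_workflow_history.json", "r") as f:
            history_json = json.load(f)
        # Replay the saved Event History against the current Workflow Definition.
        await Replayer(workflows=[BackgroundCheck]).replay_workflow(
            WorkflowHistory.from_json("backgroundcheck_workflow", history_json)
        )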

WorkflowEnvironment is a class in the Temporal Python SDK that provides a test environment for running Workflow and Activity code. start_time_skipping() starts an environment that can skip time in a Workflow Execution. By skipping time, you can quickly test how Workflows behave over extended periods without waiting in real time.

Why add a Replay test?

The Replay test is important because it verifies whether the current Workflow code (Workflow Definition) remains compatible with the Event Histories of earlier Workflow Executions.

A failed Replay test typically indicates that the Workflow code exhibits non-deterministic behavior. In other words, for a specific input, the Workflow code can follow different code paths during each execution, resulting in distinct sequences of Events. The Temporal Platform's ability to ensure durable execution depends on the SDK's capability to re-execute code and return to the most recent state of the Workflow Function execution.

The Replay test executes the same steps as the SDK and verifies compatibility.

Workflow code becomes non-deterministic through two main avenues:

Intrinsic non-deterministic logic: This occurs when Workflow state or branching logic within the Workflow is determined by factors beyond the SDK's control.
Non-deterministic code changes: This occurs when you change your Workflow code and deploy those changes while there are still active Workflow Executions relying on older code versions.

Intrinsic non-deterministic logic

"Intrinsic non-determinism" refers to types of Workflow code that can disrupt the completion of a Workflow by diverging from the expected code path based on the Event History. For instance, using a random number to decide which Activities to execute is a classic example of intrinsic non-deterministic code.

Luckily, for Python developers, the Python SDK employs a sort of “Sandbox” environment that either wraps many of the typical non-deterministic calls, making them safe to use, or prevents you from running the code in the first place.

Calls that are disallowed will cause a Workflow Task to fail with a "Restricted Workflow Access" error, necessitating code modification for the Workflow to proceed.

Calls such as random.randint() are caught by the SDK so that the resulting number persists and doesn’t cause non-deterministic issues.

However, the sandbox is not foolproof, and non-deterministic issues can still occur.

Developers are encouraged to use the SDK’s APIs when possible and avoid potentially intrinsically non-deterministic code:

Random Number Generation:
- Replace random.randint() with workflow.random().randint().
Time Management:
- Use workflow.now() instead of datetime.now(), or workflow.time() instead of time.time(), for the current time.
- Leverage the custom asyncio event loop in Workflows; use asyncio.sleep() as needed.
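
To see why a seeded generator makes random branching safe to Replay, consider the following plain-Python sketch. It does not use the Temporal SDK; it assumes that workflow.random() behaves like a random.Random instance whose seed is remembered across Replays. The seed value, the choose_activity helper, and the "skip_trace" name are hypothetical and exist only for illustration.

```python
import random

# Hypothetical seed; in a real Workflow the SDK would supply and remember it.
SEED = 1234


def choose_activity(rng: random.Random) -> str:
    """Pick a branch the way a Workflow might pick which Activity to run."""
    return "ssn_trace" if rng.randint(1, 100) < 50 else "skip_trace"


def decisions(seed: int, count: int = 5) -> list[str]:
    """Make a series of branching decisions from one seeded generator."""
    rng = random.Random(seed)
    return [choose_activity(rng) for _ in range(count)]


original_execution = decisions(SEED)
replay = decisions(SEED)

# The same seed yields the same branches, so a Replay follows the same path.
print(original_execution == replay)
```

An unseeded call to random.randint() offers no such guarantee, which is why a branch based on it could diverge from the recorded Event History.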

Read more about How the Python Sandbox works for details.

Other common ways to introduce non-deterministic issues into a Workflow:

External System Interaction:
- Avoid direct external API calls, file I/O operations, or interactions with other services.
- Utilize Activities for these operations.
Data Structure Iteration:
- Prefer Python dictionaries, which preserve insertion order, and sort any sets before iterating over them, because set iteration order is not guaranteed.
Run Id Usage:
- Be cautious with storing or evaluating the run Id.
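
The data structure point above can be illustrated with a short plain-Python sketch. The check names and states are hypothetical and are not part of the Background Check sample application.

```python
# Dictionaries preserve insertion order, so iteration is repeatable.
checks = {"ssn_trace": "pass", "employment": "pending", "education": "pass"}
dict_order = list(checks)

# Set iteration order is not guaranteed, so sort before iterating when the
# order can influence which Commands the Workflow produces.
states = {"pending", "pass", "fail"}
set_order = sorted(states)

print(dict_order)
print(set_order)
```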

Non-deterministic code changes

The most important thing to take away from this section is to make sure you have an application versioning plan whenever you are developing and maintaining a Temporal Application that will eventually deploy to a production environment.

Versioning APIs and versioning strategies are covered in other parts of the developer's guide; this chapter sets the stage for understanding why and how to approach those strategies.

The Event History

Inspect the Event History of a recent Background Check Workflow using the temporal workflow show command:

temporal workflow show \
 --workflow-id backgroundcheck_workflow \
 --namespace backgroundcheck_namespace

You should see output similar to this:

Progress:
  ID          Time                     Type
   1  2023-10-25T20:28:03Z  WorkflowExecutionStarted
   2  2023-10-25T20:28:03Z  WorkflowTaskScheduled
   3  2023-10-25T20:28:03Z  WorkflowTaskStarted
   4  2023-10-25T20:28:03Z  WorkflowTaskCompleted
   5  2023-10-25T20:28:03Z  ActivityTaskScheduled
   6  2023-10-25T20:28:03Z  ActivityTaskStarted
   7  2023-10-25T20:28:03Z  ActivityTaskCompleted
   8  2023-10-25T20:28:03Z  WorkflowTaskScheduled
   9  2023-10-25T20:28:03Z  WorkflowTaskStarted
  10  2023-10-25T20:28:03Z  WorkflowTaskCompleted
  11  2023-10-25T20:28:03Z  WorkflowExecutionCompleted

Result:
  Status: COMPLETED
  Output: ["pass"]

The preceding output shows eleven Events in the Event History ordered in a particular sequence. All Events are created by the Temporal Server in response to either a request coming from a Temporal Client, or a Command coming from the Worker.

Let's take a closer look:

WorkflowExecutionStarted: This Event is created in response to the request to start the Workflow Execution.
WorkflowTaskScheduled: This Event indicates a Workflow Task is in the Task Queue.
WorkflowTaskStarted: This Event indicates that a Worker successfully polled the Task and started evaluating Workflow code.
WorkflowTaskCompleted: This Event indicates that the Worker suspended execution after making as much progress as it could.
ActivityTaskScheduled: This Event indicates that the workflow.execute_activity() API was called and the Worker sent the ScheduleActivityTask Command to the Server.
ActivityTaskStarted: This Event indicates that the Worker successfully polled the Activity Task and started evaluating Activity code.
ActivityTaskCompleted: This Event indicates that the Worker completed evaluation of the Activity code and returned any results to the Server. In response, the Server schedules another Workflow Task to finish evaluating the Workflow code, resulting in the remaining Events: WorkflowTaskScheduled, WorkflowTaskStarted, WorkflowTaskCompleted, and WorkflowExecutionCompleted.
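
The way Workflow code advances in steps, one Workflow Task at a time, can be pictured with a plain-Python generator. This is a toy model under stated assumptions, not how the SDK is implemented: yielding stands in for awaiting workflow.execute_activity(), the tuples stand in for Commands, and the run_workflow_task helper and the SSN value are hypothetical.

```python
def background_check(ssn):
    # Yielding stands in for awaiting workflow.execute_activity().
    result = yield ("ScheduleActivityTask", "ssn_trace_activity", ssn)
    return result


def run_workflow_task(workflow, resume_with=None):
    """Advance the Workflow until it blocks or finishes; return the Command."""
    try:
        if resume_with is None:
            return next(workflow)
        return workflow.send(resume_with)
    except StopIteration as done:
        return ("CompleteWorkflowExecution", done.value)


wf = background_check("555-55-5555")

# First Workflow Task: the code runs until it needs the Activity result.
first_command = run_workflow_task(wf)

# Second Workflow Task: the Activity result arrives and the code finishes.
second_command = run_workflow_task(wf, "pass")

print(first_command)
print(second_command)
```

In this model, each call to run_workflow_task corresponds to one WorkflowTaskScheduled, WorkflowTaskStarted, WorkflowTaskCompleted group in the Event History, and each returned tuple corresponds to the Event that follows it.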

Event reference

The Event reference serves as a source of truth for all possible Events in the Workflow Execution's Event History and the data that is stored in them.

Add a call to sleep

In the following sample, we add a couple of logging statements and a Timer to the Workflow code to see how this affects the Event History.

Use the asyncio.sleep() API to cause the Workflow to sleep for a minute before the call to execute the Activity. The Temporal Python SDK also offers deterministic implementations of calls such as workflow.now(), workflow.time(), and workflow.random().

Use the workflow.logger API to log from Workflows to avoid seeing repeated logs from the Replay of the Workflow code.

View the source code in the context of the rest of the application code.

import asyncio
from datetime import timedelta

from temporalio import workflow

with workflow.unsafe.imports_passed_through():
    from ssntraceactivity import ssn_trace_activity

@workflow.defn()
class BackgroundCheck:
    @workflow.run
    async def run(self, ssn: str) -> str:
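        # workflow.random() is deterministic, so Replay takes the same branch.
        random_number = workflow.random().randint(1, 100)
        if random_number < 50:
            workflow.logger.info("Sleeping for 60 seconds")
            # In a Workflow, asyncio.sleep() starts a durable Timer.
            await asyncio.sleep(60)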
        return await workflow.execute_activity(
            ssn_trace_activity,
            ssn,
            schedule_to_close_timeout=timedelta(seconds=5),
        )
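
The reason to log through workflow.logger, rather than print() or a plain logger, can be pictured with a toy replay-aware logger. This is only a sketch of the idea and not the SDK's implementation; the ReplayAwareLogger class and its replaying flag are hypothetical.

```python
class ReplayAwareLogger:
    """Toy logger that stays silent while Workflow code is being Replayed."""

    def __init__(self):
        self.replaying = False
        self.emitted = []

    def info(self, message):
        # Skip messages that were already logged during the original run.
        if not self.replaying:
            self.emitted.append(message)


def workflow_body(logger):
    logger.info("Sleeping for 60 seconds")


logger = ReplayAwareLogger()
workflow_body(logger)      # original execution: the message is emitted
logger.replaying = True
workflow_body(logger)      # Replay: the same line runs but emits nothing

print(logger.emitted)
```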

Inspect the new Event History

After updating your Workflow code to include the logging and Timer, run your tests again. You should expect to see the test_replay_workflow_history_from_file test fail. This is because the added code creates new Events and alters the Event History sequence.

To get this test to pass, we must get an updated Event History JSON file. Start a new Workflow Execution and, after it completes, download the Event History as a JSON object.

Double check Task Queue names

Reminder that this guide jumps between several sample applications using multiple Task Queues. Make sure you are starting Workflows on the same Task Queue that the Worker is listening to. And, always make sure that all Workers listening to the same Task Queue are registered with the same Workflows and Activities.

If you inspect the new Event History, you will see two new Events that result from the asyncio.sleep() call, which sends the StartTimer Command to the Server:

TimerStarted
TimerFired

However, note that you don't see any Events related to logging. If you removed the asyncio.sleep() call but kept the logging statement, the code would still be compatible with the original Event History. This highlights that only certain changes to Workflow code are non-deterministic. The basic rule is this: if a change causes the Workflow code to produce a different sequence of Commands, and therefore Events that diverge from the existing Event History, then it is a non-deterministic change.

This becomes a critical aspect of Workflow development when there are running Workflows that have not yet completed and rely on earlier versions of the code.

Practically, that means non-deterministic changes include but are not limited to the following:

Adding, removing, or reordering an Activity call inside a Workflow Definition
Switching the Activity Type used in a call to workflow.execute_activity()
Adding or removing a Timer
Altering the execution order of Activities or Timers relative to one another

The following are a few examples of changes that do not lead to non-deterministic errors:

Modifying non-Command generating statements in a Workflow Definition, such as logging statements
Changing Activity options, such as timeouts, passed to workflow.execute_activity()
Modifying code inside of an Activity Definition
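
The difference between the two lists above can be sketched as a toy compatibility check. This is a deliberately simplified model and not the SDK's Replayer: it assumes each Command maps to exactly one Event type, ignores Event attributes, and only handles completed histories. The check_replay helper is hypothetical.

```python
# Simplified mapping from Worker Commands to the Events they produce.
COMMAND_TO_EVENT = {
    "ScheduleActivityTask": "ActivityTaskScheduled",
    "StartTimer": "TimerStarted",
    "CompleteWorkflowExecution": "WorkflowExecutionCompleted",
}

# Event types from the original Background Check Event History.
ORIGINAL_HISTORY = [
    "WorkflowExecutionStarted", "WorkflowTaskScheduled", "WorkflowTaskStarted",
    "WorkflowTaskCompleted", "ActivityTaskScheduled", "ActivityTaskStarted",
    "ActivityTaskCompleted", "WorkflowTaskScheduled", "WorkflowTaskStarted",
    "WorkflowTaskCompleted", "WorkflowExecutionCompleted",
]


def check_replay(commands, history):
    """Return None if compatible, else a description of the first mismatch."""
    command_events = set(COMMAND_TO_EVENT.values())
    recorded = [event for event in history if event in command_events]
    produced = [COMMAND_TO_EVENT[command] for command in commands]
    for index, (got, want) in enumerate(zip(produced, recorded)):
        if got != want:
            return f"command {index}: code produced {got}, history has {want}"
    if len(produced) != len(recorded):
        return "code and history disagree on the number of Commands"
    return None


# Original code, with or without extra logging: same Commands, so compatible.
original_result = check_replay(
    ["ScheduleActivityTask", "CompleteWorkflowExecution"], ORIGINAL_HISTORY
)

# Code with an added Timer: the first Command no longer matches the history.
timer_result = check_replay(
    ["StartTimer", "ScheduleActivityTask", "CompleteWorkflowExecution"],
    ORIGINAL_HISTORY,
)

print(original_result)
print(timer_result)
```

Logging statements and changes inside an Activity Definition never appear in the list of Commands, which is why they leave the result unchanged in this model.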
