Temporal: A Central ‘Brain’ for Box

Over the last decade, Box has emerged as the leading provider of a secure, scalable, and user-friendly platform for managing and collaborating on content in the cloud. As they continue to lead, they continuously search for ways to improve their product architecture to enable new features and improve the performance of existing functionality.

Managing massive, dynamic, diverse content

The Box Content Cloud isn’t just a repository for vast and diverse file types; it’s a dynamic hub where customers can seamlessly collaborate, share, and organize their intricate content. With folders housing millions of files, each with unique permissions and metadata, managing updates is crucial. Any failure to propagate changes across the folder hierarchy can lead to significant costs and potential data integrity issues, making seamless updates essential.

A simple example: if we accidentally copy a folder from Point A to Point B twice, we end up with two copies of the folder in the destination folder. Considering that any file in a given tree could be important, it’s critical that the system that handles propagation be strongly consistent and completely reliable.

Attempt 1: Homegrown orchestration with queues and databases

Box had originally solved this problem with a homegrown orchestration system powered by queues and events. For each file operation, tasks were added to a queue that workers would then pull from, re-queueing new tasks and state as needed. Maintaining consistency with this ad hoc system was not simple.

What we had previously was this system where we’d have this worker get a message off of a queue, perform some chunk of work and then go put a message back onto the queue with a pagination marker, to perform some other chunk of work.

Every queue demanded tailored logic and state management to address failures and temporary issues. Consequently, numerous databases proliferated, each hosting bespoke state machines for the intricate internal processes. Lacking a cohesive, overarching perspective on active workflows, monitoring an operation’s status and advancement, let alone pausing or restarting ongoing tasks, posed considerable challenges.
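The queue-driven pattern described above can be sketched roughly as follows. This is a minimal in-memory illustration, not Box's actual implementation: the `Task` message shape, `PAGE_SIZE`, and the `handle` and `drain` helpers are all hypothetical names chosen for the example, and a `deque` stands in for a real message queue.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List


@dataclass
class Task:
    """Hypothetical queue message: which folder to process and where to resume."""
    folder_id: str
    marker: int = 0  # pagination marker: index of the next item to process


PAGE_SIZE = 2  # assumed chunk size, purely illustrative


def handle(task: Task, queue: Deque[Task],
           folders: Dict[str, List[str]], processed: List[str]) -> None:
    """Process one page of a folder, then re-queue a task for the next page."""
    items = folders[task.folder_id]
    page = items[task.marker:task.marker + PAGE_SIZE]
    processed.extend(page)  # stand-in for the real chunk of work
    next_marker = task.marker + len(page)
    if next_marker < len(items):
        # All progress lives in the message itself; nothing else records it.
        queue.append(Task(task.folder_id, next_marker))


def drain(queue: Deque[Task], folders: Dict[str, List[str]]) -> List[str]:
    """Run a single worker loop until the queue is empty."""
    processed: List[str] = []
    while queue:
        handle(queue.popleft(), queue, folders, processed)
    return processed


folders = {"A": ["f1", "f2", "f3", "f4", "f5"]}
result = drain(deque([Task("A")]), folders)
```

In this sketch the only record of progress is the marker carried by the next message, which hints at why each queue needed its own failure handling and state tracking: a worker that stops between doing the work and re-queueing leaves no central place to see how far the operation got.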

Recognizing the limitations of the homegrown approach, the Box team initiated a project to transform this architecture into something less complex and more manageable. This effort wasn’t just about addressing the immediate problems, but also about finding a scalable, long-term solution for the business.

We needed a central brain where we can store state.

The first exploration was to build a “template” for a generic file operation, using an API inspired by AWS Simple Workflow without using the service itself under the hood. Alternatives such as Netflix Conductor were also considered, but were quickly dismissed because they did not support writing logic as code and relied on lesser-known storage and queueing systems.

Speed and Simplicity with Temporal

As the exploration continued, Temporal emerged later in the project’s lifecycle. It was recognized for sharing a similar conceptual framework with Simple Workflow, requiring only a minimal adjustment in both mental and programmatic approach to enable prototyping with Temporal as the orchestration engine.

Temporal was a relatively easy path forward for the team because they could use an “off-the-shelf” framework and still get the scale they needed. They just needed to figure out how to map their problem to durable workflows.

The value of Temporal was so convincing that Box decided to not only port orchestration logic during the migration, but also the service level business logic. Box employed a methodical and scientific transition plan and two major criteria were chosen to evaluate success:

Is the orchestration doing what we think it’s doing or at least touching all of the same things that we expect it to touch?

Is the logic that the orchestrator is orchestrating actually sound (considering that it’s also been reimplemented with Temporal)?

Armed with this framework, the team started with an in-memory simulation to validate the API and architecture. Once baseline compatibility was proven, a highly controlled rollout began - tenant by tenant.

Why Temporal for Box

Ultimately, the concept of workflow as code was a major attraction. It’s far more convenient for developers to work with and just felt right for the Box team. Defining workflows in JSON and similar formats can be cumbersome; with workflow as code, it’s straightforward. As an engineer tasked with implementing a SAGA said, “I find it very intuitive to look at this and think, Okay, this is how I would approach it.” The API provides the illusion of single-threaded execution without having to get into the details of trying to orchestrate multiple things.

If you describe your workflows using a JSON structure, not only are they difficult to understand, but they present a significant testing challenge as well. With Temporal, everything is code, so you can use a single test suite to validate its function end-to-end.
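To make the workflow-as-code idea concrete, here is a minimal sketch of the saga pattern mentioned above, written in plain Python. It deliberately does not use the Temporal SDK, and nothing here reflects Box's actual code: the `Saga` class, `copy_folder_workflow`, and the `fail_on` parameter are hypothetical names introduced only for illustration, and a plain list stands in for the destination folder. The sketch also leaves out what an orchestration engine adds, namely durable state and automatic retries.

```python
from typing import Callable, List, Optional


class Saga:
    """Records compensations so completed steps can be undone in reverse order."""

    def __init__(self) -> None:
        self._compensations: List[Callable[[], None]] = []

    def add_compensation(self, undo: Callable[[], None]) -> None:
        self._compensations.append(undo)

    def compensate(self) -> None:
        # Undo the most recent step first.
        while self._compensations:
            self._compensations.pop()()


def copy_folder_workflow(files: List[str], dest: List[str],
                         fail_on: Optional[str] = None) -> bool:
    """Copy files into dest one by one; roll back every copy if any step fails.

    fail_on is a hypothetical test hook that simulates a failing copy.
    """
    saga = Saga()
    try:
        for name in files:
            if name == fail_on:
                raise RuntimeError(f"copy failed for {name}")
            dest.append(name)  # stand-in for the real copy step
            saga.add_compensation(lambda n=name: dest.remove(n))
    except RuntimeError:
        saga.compensate()
        return False
    return True
```

Because the whole flow is an ordinary function that reads top to bottom, an ordinary unit test can exercise both the success path and the rollback path, for example by calling `copy_folder_workflow(["a", "b", "c"], dest, fail_on="c")` and checking that `dest` ends up empty.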

What’s next

While the use case for Temporal started out narrow and focused, additional teams within Box have begun adopting the platform. The data platform team is using it within their systems architecture. A file systems team is considering its use for traversals of the file structure to support file and folder governance. And there are more teams using it.
