Ordering, Concurrency, and Interleavings


Intro

I thought I would kick off a small devlog containing reflections on some of the approaches we’ve taken in the development of software at Marymoor Studios including both our games and the MSC (Marymoor Studios Core libraries). In this first post I’ll talk a little about ordering, scheduling, and safety concerns that led to the development of the Promise library.

Activities and Turns

One of the first challenges I always encounter when designing any highly concurrent and multithreaded software (including games!) is managing order. By order I mean the order of operations that execute to form the logic of the program, i.e. the program does A followed by B followed by C in that arrangement. Testing the program for correctness amounts to determining whether that order is the proper one to achieve the desired outcome. In a single-threaded program with no concurrency and no parallelism, the order of operations is strictly defined by the order of the code (i.e. the sequence of text written in a programming language – modulo loops, function calls, and the like). A straight-line program is much easier to understand than a concurrent one. It has a high degree of determinism – meaning it will always execute predictably in exactly the same way. If there is a bug in the logic then that bug will be exhibited every time the program is run, making it significantly easier to find and fix.

A concurrent program executes multiple larger operations (I’ll call them Activities) at the same time, but not necessarily simultaneously. A single CPU core may “take turns” working on two larger activities by breaking each up into smaller chunks (I’ll just call the chunks Turns for simplicity’s sake). Each activity can be composed of any number of turns, from one to infinity (e.g. infinite loops), and different activities can be made up of different numbers of turns. New activities can be started at any time, spawned either as a child or a sibling computation of an already executing activity. For example, a game might have separate activities to handle the movement of the player’s character and the NPC characters that occupy the screen at the same time. Each moves seemingly independently and concurrently, though a single CPU core might be switching rapidly between them to achieve this illusion.
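
To make this concrete, here is a minimal sketch in C# of two such activities written as async methods (the method and program names are mine, purely for illustration). Every await marks a turn boundary where the scheduler may switch to another activity:

    using System;
    using System.Threading.Tasks;

    // Two concurrent activities written as async methods. Every "await"
    // is a turn boundary where the scheduler may switch activities.
    class Activities
    {
        static async Task MovePlayerAsync()
        {
            for (int turn = 1; turn <= 3; turn++)
            {
                Console.WriteLine($"Player turn {turn}"); // one turn's work
                await Task.Yield();                       // yield to the scheduler
            }
        }

        static async Task MoveNpcAsync()
        {
            for (int turn = 1; turn <= 3; turn++)
            {
                Console.WriteLine($"NPC turn {turn}");    // one turn's work
                await Task.Yield();                       // yield to the scheduler
            }
        }

        static async Task Main()
        {
            // Start both activities, then wait for both to finish.
            await Task.WhenAll(MovePlayerAsync(), MoveNpcAsync());
        }
    }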

Concurrency and Interleavings

There are many reasons to leverage concurrent activities but the two that most often drive me are:

  1. Separation of Concerns: each activity is a separate logical thread of execution.
  2. IO: Input/Output between an activity and the outside world.

By leveraging activities, I find it much easier to separate the logic of different parts of the program into separate threads of control (mini programs), each of which can be reasoned about (and tested!) independently. The logic for controlling the player character is largely independent of the logic for controlling an NPC character, and being able to write and execute them as separate activities makes it easier to write and maintain them. IO happens when an activity interacts with the outside world, typically through one of the physical devices mediated by the operating system like the network, the disk drive, or another process. CPUs have gotten so fast compared to these IO devices that if an activity simply blocks waiting for an IO to complete then much of the CPU core’s processing power is wasted waiting instead of computing. If the program can execute another activity while one activity is waiting on IO, then the program can continue to make progress overall.

When multiple activities are executing concurrently, a turn from one may execute followed by a turn from another, say turn A1 from activity A followed by turn B1 from activity B. At any given time, there can be many activities running concurrently, each with an available turn. If there is more than one turn available, then something must decide which turn to run next. Deciding this is called scheduling, and the thing making the decision is called a Scheduler. The actual chosen sequence of turns executed by a single CPU core is called an interleaving. For example, a possible interleaving of the activities A and B mentioned above in a single run might have been: [A1 A2 B1 A3 B2 B3 A4]. A different interleaving of these same activities might have been [A1 B1 A2 B2 B3 A3 A4]. Both are valid. If the two activities are independent, both will produce the exact same outcome.

However, if the activities are NOT independent, then the two interleavings MAY produce different outcomes! For example, imagine that activity A adds 1 charge to the player’s power meter per turn (background charging), while activity B fires the player’s weapon consuming 1 charge (but only if the meter is not empty). These are NOT independent activities because they share a variable – the meter’s current charge level. The first interleaving will result in this sequence of charge level values [1 2 1 2 1 0 1], delivering 3 shots and thus defeating the enemy. The second interleaving, however, will result in [1 0 1 0 # 1 2] (where # marks a misfire on an empty meter), delivering only 2 shots. The player will lose. The activities that executed in both of these runs were exactly the same. Each activity, from its own perspective, executed exactly the same logic. But the outcome of the combined activities was still different. The difference in outcome was determined by the chosen interleaving of the activities. A differing outcome not due to a change in the inputs is called nondeterminism. The thing creating the nondeterminism is called a source of the nondeterminism. The source of nondeterminism in the example above is the choice of interleaving. We’ll call this specific type of nondeterminism interleaving nondeterminism.
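
This example is easy to reproduce in code. The sketch below (my own illustrative layout, not code from a real game) replays both interleavings over a shared meter and counts the shots delivered:

    using System;

    // Replays the two interleavings from the text over a shared charge
    // meter. An 'A' turn adds one charge; a 'B' turn fires, consuming
    // one charge, but only if the meter is not empty.
    class Interleavings
    {
        static void Run(string name, string interleaving)
        {
            int meter = 0, shots = 0;
            foreach (char turn in interleaving)
            {
                if (turn == 'A') meter++;                 // background charging
                else if (meter > 0) { meter--; shots++; } // fire one shot
                // else: misfire, the meter was empty
            }
            Console.WriteLine($"{name}: {shots} shots");
        }

        static void Main()
        {
            Run("First ", "AABABBA");  // prints "First : 3 shots"
            Run("Second", "ABABBAA");  // prints "Second: 2 shots"
        }
    }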

Scheduling and Nondeterminism

If you are trying to track down a bug (say, there is a bug in the power recharge rate mentioned above) then having nondeterminism in the choice of interleavings might make it more difficult to find that bug. Some playthroughs would yield one outcome (no bug exhibited) while others would have a different outcome (the bug shows up). Which outcome happens is not dependent on the inputs (i.e. the actions of the player) and so finding that bug may prove difficult. It may not show up in any interleaving during testing and only appear in the field after the game has shipped. On the other hand, if the choice of interleaving were deterministic (say, the first interleaving above is ALWAYS chosen) then only one outcome is possible. If that outcome exhibits the bug, then that bug will ALWAYS happen on all playthroughs with the same inputs and is therefore more easily fixed before shipping. If the bug doesn’t occur in that outcome, then it will NEVER occur on ANY playthrough (with those inputs).

There are many different scheduling algorithms for choosing an interleaving. Each has its own pros and cons. For instance, imagine that the scheduler runs activity A until it is finished and then runs activity B until it is finished. Technically this is correct, but it might be undesirable. For one thing, it gives a poor simulation of concurrency. Second, it may suffer from starvation if, for example, activity A is infinite. Or it may hang if activity A requires state created by activity B before it can terminate. Consider another scheduler that uses a random number to pick between all of the currently runnable turns. This is reasonable and achieves something like “fair” progress for all concurrent activities (i.e. they are all guaranteed to make some progress). But this will introduce a source of interleaving nondeterminism such as that demonstrated above.

Lastly, consider the scheduler that strictly runs one turn from each activity in round-robin fashion. This scheduling algorithm produces a deterministic interleaving (i.e. ALWAYS the same one). For example, the interleaving of the above activities A and B would ALWAYS be [A1 B1 A2 B2 A3 B3 A4], which would result in the “3 shots” successful outcome. So, the choice of scheduler (and therefore the choice of scheduling algorithm) can impact how much interleaving nondeterminism a program exhibits, which in turn affects how hard it will be to write and test the program.
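
For illustration, here is a minimal sketch of such a round-robin scheduler, where each activity is just a queue of turns (a toy model, not the Promise library’s implementation):

    using System;
    using System.Collections.Generic;

    // A toy round-robin scheduler: each activity is a queue of turns,
    // and the scheduler runs exactly one turn from each non-empty
    // activity in a fixed order. The interleaving is deterministic.
    class RoundRobinScheduler
    {
        static void Main()
        {
            var a = new Queue<Action>(new Action[]
            {
                () => Console.Write("A1 "), () => Console.Write("A2 "),
                () => Console.Write("A3 "), () => Console.Write("A4 "),
            });
            var b = new Queue<Action>(new Action[]
            {
                () => Console.Write("B1 "), () => Console.Write("B2 "),
                () => Console.Write("B3 "),
            });

            var activities = new List<Queue<Action>> { a, b };
            while (activities.Exists(q => q.Count > 0))
            {
                foreach (var activity in activities)
                {
                    if (activity.Count > 0) activity.Dequeue()(); // run one turn
                }
            }
            Console.WriteLine(); // always prints: A1 B1 A2 B2 A3 B3 A4
        }
    }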

Parallelism and Nondeterminism

Most computing devices these days (PCs, phones, tablets, consoles, and even smart watches) contain multiple CPU cores. This enables the hardware to physically execute multiple instructions simultaneously. This is called Parallelism. While concurrency allows multiple high-level activities to be in flight at the same time (i.e. overlap), it does not guarantee that multiple turns actually run simultaneously (i.e. true parallelism). As we discussed above, a common strategy for concurrent activities is for the scheduler to take turns executing a little of one activity and then a little of another. No actual parallelism is needed when the scheduler takes turns over a single CPU core. However, when the hardware has multiple CPU cores, the scheduler can choose to execute multiple turns simultaneously, each on a different physical CPU core. Utilizing multiple CPU cores through parallelism increases a program’s computing power and can either allow it to do more in the same amount of time or reduce the latency of the work it is already doing.

Because CPU cores execute asynchronously from each other (and have other nondeterministic elements in their communication, synchronization, and timing), scheduling a set of activities across multiple CPU cores inherently results in interleaving nondeterminism. Even the round-robin scheduler discussed above would still produce a nondeterministic interleaving when scheduled over multiple CPU cores. If it sends turn A1 to CPU1 and then B1 to CPU2, there is no way to know whether A1 or B1 will complete first, and A2 would be scheduled on whichever CPU finishes first. If A1 and then A2 are both issued and both complete before B1 completes, this results in an interleaving beginning [A1 A2 B1...] even though the scheduler used a strict round-robin approach to dispatching turns.
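
A tiny Task-based sketch shows the effect. The dispatch order below is fixed, but the completion order is up to the hardware and the OS:

    using System;
    using System.Threading.Tasks;

    // Two turns are dispatched to the thread pool in a fixed order
    // (A1 then B1), but which one completes first can vary from run
    // to run; either line may print first.
    class ParallelDispatch
    {
        static async Task Main()
        {
            Task a1 = Task.Run(() => Console.WriteLine("A1 done"));
            Task b1 = Task.Run(() => Console.WriteLine("B1 done"));
            await Task.WhenAll(a1, b1);
        }
    }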

In addition to interleaving nondeterminism, parallelism also introduces a variety of thread safety and data race issues which typically must be addressed through the careful use of synchronization primitives (i.e. mutexes, locks, critical sections, semaphores, interlocks, etc.) - which I will collectively call locks. Locks can frequently be the source of bugs and performance issues. Locks are pessimistic by nature and must be placed in the proper parts of the code even when most interleavings never need them (i.e. there is no contention on the lock). Often only a very small number of interleavings, seen only rarely, lead to data corruption or memory safety issues unless properly protected by locks. In a nondeterministically scheduled environment, there can be no guarantee that any amount of testing has ever observed these rare interleavings. This makes it very hard to test whether the use of locks in the code is actually correct. The locking code could be wrong, but if no unsafe interleaving is ever selected by the scheduler then the latent bug will never be observed. Such bugs might only be encountered in the field after they lead to program malfunction or failure. For example, such bugs may lead to game misbehavior, crashes, save game corruption, or other similar undesirable outcomes.
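
Here is a sketch of the classic lost-update race that this kind of locking guards against. With the lock the final count is always 200000; remove it and updates can be silently lost:

    using System;
    using System.Threading.Tasks;

    // Two parallel activities increment a shared counter. The lock
    // makes each increment atomic; without it, increments are lost
    // nondeterministically depending on the interleaving.
    class RaceDemo
    {
        static int counter;
        static readonly object gate = new object();

        static void Add(int times)
        {
            for (int i = 0; i < times; i++)
            {
                lock (gate) { counter++; } // remove the lock to see lost updates
            }
        }

        static void Main()
        {
            Task a = Task.Run(() => Add(100_000));
            Task b = Task.Run(() => Add(100_000));
            Task.WaitAll(a, b);
            Console.WriteLine(counter); // always 200000 (thanks to the lock)
        }
    }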

Tasks

There are many ways to achieve concurrency (Tasks, multiple threads, goroutines, multiple processes, etc.) depending on your programming environment. In C# the Task Framework is built into both the language runtime and the language syntax itself (with keywords like async and await). This makes Tasks a very convenient mechanism for introducing concurrency into your Unity or Godot based games, since both engines support C#. The Task Framework and its scheduler are designed to achieve the maximum amount of parallelism the hardware can support. To achieve this, the Task Framework breaks activities up into Tasks and then schedules those tasks freely across a pool of threads. The number of threads in the pool is automatically tuned to the number of CPU cores available in the hardware. While at first this seems like an ideal design, in many programs this choice may NOT actually achieve the best performance and may lead to harder to write, harder to test, and harder to maintain software.
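
A minimal sketch of this free scheduling: work queued with Task.Run lands on whichever pool thread is free, and both the thread assignments and the output order vary from run to run:

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    // Queued tasks are spread across a pool of threads sized to the
    // machine's CPU cores. The thread ids (and the print order) are
    // not deterministic.
    class PoolDemo
    {
        static void Main()
        {
            Console.WriteLine($"Cores: {Environment.ProcessorCount}");
            var tasks = new Task[4];
            for (int i = 0; i < tasks.Length; i++)
            {
                int id = i;
                tasks[i] = Task.Run(() => Console.WriteLine(
                    $"Task {id} on thread {Thread.CurrentThread.ManagedThreadId}"));
            }
            Task.WaitAll(tasks);
        }
    }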

When using the Task Framework, Tasks are scheduled freely across all available CPU cores. This implies a high degree of interleaving nondeterminism. It also implies that thread safety and data race considerations MUST be carefully addressed through either memory isolation or locking. It is rare to use the Task Framework without also protecting most state changes with explicit locking. Even subsequent turns of the same activity may be scheduled on different CPU cores and therefore be subject to data races. This can be a very demanding programming environment for creatives and less experienced game developers (or even experienced developers!). They are forced to deal with some of the most challenging issues in computer science even when developing the simplest components. Being able to write lock-free code in a deterministically scheduled environment is so much less error-prone, but using Tasks requires abandoning that happy path.
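
Even a single activity is affected. In a plain console app there is no synchronization context to return to, so the continuation after an await may resume on a different thread-pool thread, as this sketch demonstrates:

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    // Consecutive turns of ONE activity can run on different threads.
    // The thread ids printed before and after the await frequently
    // differ, so state touched on both sides is crossing threads.
    class ThreadHop
    {
        static async Task Main()
        {
            Console.WriteLine($"Before await: thread {Thread.CurrentThread.ManagedThreadId}");
            await Task.Delay(10);
            Console.WriteLine($"After await:  thread {Thread.CurrentThread.ManagedThreadId}");
        }
    }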

Additionally, games are a common (but not unique) example of where maximizing parallelism may NOT be advantageous. Game engines, like Unity and Godot, offer abstractions centered around a single “game loop” that allows a collection of “game objects” to be manipulated between executions of a graphics rendering engine that realizes the game’s visual output for the player. Primarily for reasons involving performance and synchronization with the GPU, the renderer and the “game objects” are usually NOT thread-safe (i.e. they don’t use locks internally). It is, therefore, generally unsafe to interact with these “game objects” at any point in time other than a designated spot during the game loop, between frames, where the game engine calls into the game developer’s code. If a game developer pushes some of their game logic into a Task-based activity, that activity is freely scheduled on the thread-pool and will likely run in parallel with the main thread executing the game loop. If the activity directly accesses “game objects” from a background thread, then memory corruption and/or a game crash is likely to result. To avoid these errors a developer MUST either constrain their game logic to the main thread (by not using Task-based concurrency) or implement careful thread synchronization (which is hard to verify and so inevitably leads to bugs).
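
A common workaround (sketched below in an engine-agnostic form with illustrative names) is to never touch game objects from a background task, and instead post actions to a queue that the game loop drains on the main thread between frames:

    using System;
    using System.Collections.Concurrent;

    // Background tasks never touch game objects directly. They post
    // actions to this queue, and the game loop drains it on the main
    // thread between frames, where touching game objects is safe.
    static class MainThreadDispatcher
    {
        static readonly ConcurrentQueue<Action> pending = new ConcurrentQueue<Action>();

        // Called from any background task.
        public static void Post(Action action) => pending.Enqueue(action);

        // Called by the game loop on the main thread, once per frame.
        public static void DrainOnMainThread()
        {
            while (pending.TryDequeue(out Action action))
            {
                action();
            }
        }
    }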

Promises

Many of the above considerations led directly to the development of the Promise library within the larger MSC (Marymoor Studios Core). The Promise library provides a drop-in replacement for the Task Framework, including full support for the async and await keywords. Unlike the Task Framework, the Promise library does NOT use the thread-pool for scheduling. Instead, it implements its own deterministic single-threaded scheduler. All activities within the same SIP (Software Isolated Process) are executed by a single CPU core in a deterministic, cooperative, turn-based manner. (Multiple SIPs can be created to take advantage of multiple hardware CPU cores. I’ll discuss SIPs and multicore programming more in a future post.) In both Unity and Godot, the Promise scheduler can be configured to execute activities ONLY during the period between frame renderings in the main game loop. (There’s also support for using a Promise scheduler in a stand-alone C# program such as a game server, a utility, or anything else.) This means that async Promise-based activities can freely close over “game objects” and interact with both game objects and other game state in ANY turn without ANY special considerations for locking or game loop synchronization. All Marymoor Studios game logic uses no locks whatsoever.
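
The resulting programming model looks like ordinary async code. The sketch below is illustrative only (the stand-in types are mine, not the MSC API), but it shows the shape: game state is touched directly, with no locks, because every turn runs on the main thread between frames:

    using System.Threading.Tasks;

    // Illustrative only: the types here are stand-ins, not the MSC
    // API. The point is the shape of the code: the activity reads and
    // writes game state directly, with no locks and no cross-thread
    // marshalling, because every turn runs on the main thread.
    class FiringLoop
    {
        class PowerMeter { public int Charge; }  // hypothetical game state
        class Weapon { public void Fire() { } }  // hypothetical game object

        static async Task FireWhileChargedAsync(PowerMeter meter, Weapon weapon)
        {
            while (meter.Charge > 0)
            {
                meter.Charge--;     // direct access, no lock needed
                weapon.Fire();      // still on the main thread
                await Task.Yield(); // under the Promise scheduler, the
                                    // next turn is also main-thread
            }
        }
    }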

Promise-based activities have a very low degree of interleaving nondeterminism, making the resulting game more predictable and easier to test. In a single-threaded, cooperatively-scheduled environment (such as the Promise scheduler) a turn is NEVER preempted. In a Task-based thread-pool environment, however, all of the threads in the thread pool are preemptively scheduled and contend for the same CPU cores. A preemptively scheduled method can be interrupted mid-method at virtually any point (even literally in the middle of a line of code). In contrast, a cooperatively scheduled computation can ONLY be interrupted at the explicit points where it yields or awaits. This ensures that, once invariants have been established, they are never unintentionally invalidated by outside code in the middle of a method. This makes it much easier to reason about and prevent TOCTOU (time-of-check to time-of-use) issues.
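
Here is what that guarantee buys in practice. The check-then-act sequence below (illustrative names again) is safe within a single turn under cooperative scheduling; the very same code is a TOCTOU race under preemptive thread-pool scheduling and would need a lock:

    // Under cooperative, single-threaded scheduling no turn is ever
    // preempted, so the check and the act below always run back to
    // back and no other activity can empty the meter between them.
    class CheckThenAct
    {
        class PowerMeter { public int Charge; }  // hypothetical game state

        readonly PowerMeter meter = new PowerMeter { Charge = 1 };

        public void TryFire()
        {
            if (meter.Charge > 0)  // time of check
            {
                meter.Charge--;    // time of use: same turn, invariant holds
            }
        }
    }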

Conclusion

Use of the Promise library significantly eases the development of our game software. It can be easily integrated with game engines like Unity and Godot. It is a drop-in replacement for the Task Framework. It fully supports all C# language syntax like async and await. It allows us to seamlessly write async code that interacts directly with game objects and state without fear of cross-thread issues or lock considerations. And it significantly reduces the game’s overall interleaving nondeterminism, making all components of the game more deterministic and more predictable. That makes them easier to write and easier to test!

Feedback

Write us with feedback.
