Orphaned Computation
Intro
One of the most powerful tools in modern programming is the ease with which we can perform “background computation”, i.e. the ability to execute additional logic outside the primary flow of control. Way back in our Ordering post we introduced the concept of an activity, and throughout this blog we have highlighted many mechanisms for creating additional concurrent activities through constructs like threads, coroutines, goroutines, tasks, or even child processes. Almost all modern frameworks provide some mechanism for creating concurrent activities, and many modern programming languages even include built-in syntax to do so simply and easily. This gives rise to A LOT of concurrent activities!
With all these activities running around, we can’t be surprised by the emergence of one of the most common and most pesky programming hazards: Orphaned Computation. Orphaned computation occurs when:
Orphaned Computation:
A background activity that continues to execute after its result or side-effects are no longer required.
There could be lots of reasons why an activity’s result is no longer needed. Perhaps the player cancelled the operation, the player left the scene for another, the object being affected was destroyed, or the operation itself got stuck and was restarted. Whatever the reason, the background activity should have been terminated, but wasn’t (almost certainly a bug). We can categorize orphaned computations into two broad buckets:
- Benign: activities that continue to consume resources, but don’t otherwise negatively impact correctness.
- Hazardous: activities whose side-effects impact correctness.
In this post we’ll take a look at some of the impacts of both kinds of orphaned computations, and then in our next post we’ll suggest some techniques for avoiding them. In the interests of simplicity, I’ll focus these posts on activities created using async tasks in C#, but the same principles will almost certainly apply, regardless of your language or framework of choice.
Active Abstractions vs. Passive Abstractions
Activities can certainly arise ad hoc during the execution of some logic (say, in the middle of and for the duration of some method). However, they often emerge during the implementation of some abstraction where their existence is (rightfully) hidden behind the abstraction boundary. The consumer of such an abstraction cannot know whether or not background activities are executed internally - unless the specification says so. Abstractions that contain background activities are what we call Active Abstractions, i.e. those whose state can change independently of any direct interaction through their public-facing interface.
Contrast these with Passive Abstractions whose state is always completely determined by the sequence of their public interactions. Passive abstractions are more predictable and deterministic, making them easier to write and test.
Furthermore, because of their need to create background activities, active abstractions create implicit (or explicit, depending on the framework) dependencies on the mechanisms used to create their background activities. This may impose constraints on the concurrency or threading model they can execute within, or introduce dependencies on libraries or components that lead to compile-time and/or runtime dependency conflicts. Even built-in language constructs, like C#’s async, await, and Task, can impose these kinds of implicit constraints. For example, in C#, awaiting within an activity empowers the runtime (unless explicitly configured otherwise) to resume the activity on another thread owned by the threadpool, leading to potentially true parallelism and all the threading hazards that go with it (requiring proper locking to address).
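To make that concrete, here is a minimal self-contained sketch (not part of the class below, and the names are my own) showing that in a console app, where there is no SynchronizationContext to marshal continuations back, the code after an await may resume on a different threadpool thread:

```csharp
using System;
using System.Threading.Tasks;

internal static class AwaitThreadDemo
{
    public static async Task Main()
    {
        int before = Environment.CurrentManagedThreadId;
        await Task.Delay(100); // the runtime parks us here and later resumes us...
        int after = Environment.CurrentManagedThreadId;

        // ...often on a different threadpool thread. Two activities that share
        // state across awaits can therefore run truly in parallel.
        Console.WriteLine($"before await: thread {before}, after await: thread {after}");
    }
}
```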
For these reasons, I always prefer to design passive abstractions whenever possible and only resort to active abstractions when absolutely necessary. Nevertheless, there are many powerful use cases for active abstractions, and they are very common.
Let’s look at an example of a simple active abstraction:
WARNING:
The code below contains bugs!
internal sealed class RedirectStdErr : IDisposable
{
    private const int s_bufferSize = 10240;
    private readonly StringBuilder m_content;
    private readonly SafeFileHandle m_read;
    private readonly SafeFileHandle m_write;
    private readonly SafeFileHandle m_stderr;
    private readonly Task m_reader;

    public RedirectStdErr(StringBuilder content)
    {
        m_content = content;
        Win32NativeMethods.CreatePipe(out m_read, out m_write, default, s_bufferSize);
        m_stderr = Win32NativeMethods.GetStdHandle(Win32NativeMethods.StdErrorHandle);
        Win32NativeMethods.SetStdHandle(Win32NativeMethods.StdErrorHandle, m_write);
        m_reader = Read();
    }

    private async Task Read()
    {
        await using FileStream stm = new(m_read, FileAccess.Read);
        using TextReader reader = new StreamReader(stm);
        char[] buffer = new char[s_bufferSize];
        int n = await reader.ReadBlockAsync(buffer);
        while (n != 0)
        {
            Span<char> text = buffer.AsSpan(0, n);
            m_content.Append(text);
            n = await reader.ReadBlockAsync(buffer);
        }
    }

    /// <inheritdoc/>
    public void Dispose()
    {
        // Put the original stderr back, and release our copy of the handle.
        Win32NativeMethods.SetStdHandle(Win32NativeMethods.StdErrorHandle, m_stderr);
        m_stderr.SetHandleAsInvalid();
    }
}
This abstraction creates a simple object that redirects into a StringBuilder any text written to STDERR for the
duration of its lifetime. (I use a version of this class in my Steamworks library because, for reasons not entirely
clear to me, some Steamworks APIs provide their failure diagnostics only as text written to STDERR instead of, say,
returning a string or error code, go figure!). You might use this abstraction like this:
void SomeOperation()
{
    StringBuilder sb = m_pool.Get();
    try {
        using RedirectStdErr _ = new(sb);
        if (!m_steam.ApiThatWritesToStderr()) {
            throw new SteamException(sb.ToString());
        }
    }
    finally {
        m_pool.Return(sb);
    }
}
This abstraction is an active one because the amount of diagnostic string data written to STDERR is unknown a priori.
There is no way to guarantee that the buffer allocated to the Pipe will be sufficient to hold it. If the pipe buffer
is not big enough then writes to STDERR (say, through Console.Error.WriteLine()) will block until the pipe is read
to free up space. Liveness requires that the pipe be read and written to concurrently to avoid a stall. The background
activity within the abstraction concurrently reads the pipe in a loop while the main operation executes (and possibly
writes to the pipe). If the main operation fails, then the diagnostic data is collected and propagated through the
SteamException.
As written above, this abstraction orphans the internal background activity tracked by the field m_reader. Each time
SomeOperation is executed a new activity will be created and then leaked.
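The shape of the bug can be reduced to a tiny self-contained sketch (hypothetical names, unrelated to the class above): a scope starts a background task, finishes, and walks away while the task keeps running:

```csharp
using System;
using System.Threading.Tasks;

internal static class OrphanDemo
{
    // An activity with no exit condition: it runs until the process dies.
    private static Task StartActivity() => Task.Run(async () =>
    {
        while (true)
        {
            await Task.Delay(1000); // pretend to poll something
        }
    });

    public static void Main()
    {
        Task activity = StartActivity();

        // The "operation" completes and the caller moves on, but nothing ever
        // stops the activity: it is now orphaned, just like m_reader.
        Console.WriteLine(activity.IsCompleted); // False: still running
    }
}
```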
Wasted Resources
While benign activities won’t corrupt your game logic, they can still be a drag on resources. They may lead to performance issues, slow-downs, resource contention, and negatively affect gameplay, particularly for long-running playthroughs or server-side components that don’t see frequent restarts. There are four kinds of leaks related to orphaned computation that I particularly watch out for. An unaccounted rise in any of these could indicate the presence of orphaned computation.
Memory Leaks
The most obvious is memory. Background activities inherently consume some amount of memory. At the very least they allocate a resolver for their result (i.e. the task in m_reader), but any additional state that is closed over by the activity and whose lifetime is scoped to the activity’s lifetime will also be leaked when the activity becomes orphaned. In the above code a 10K buffer (see s_bufferSize) is allocated for the pipe. This additional memory will also be kept alive for the lifetime of the activity. If the activity is leaked, so is the buffer.
CPU Leaks
Another common impact of orphaned computation is wasted CPU. A background activity that performs work that is never used steals cycles from the CPU whenever it runs. The above example doesn’t waste CPU (because it performs an efficient wait on the pipe reader), but what if that abstraction also showed the user a progress indicator for the pending operation? Then the background activity might instead look something like this:
private async Task ReadWithProgress()
{
    await using FileStream stm = new(m_read, FileAccess.Read);
    stm.ReadTimeout = 1000; // once a second.
    using TextReader reader = new StreamReader(stm);
    char[] buffer = new char[s_bufferSize];
    int n = -1; // no data read yet.
    do
    {
        try {
            n = await reader.ReadBlockAsync(buffer);
            if (n != 0) {
                Span<char> text = buffer.AsSpan(0, n);
                m_content.Append(text);
            }
        } catch (IOException) {
            UpdateProgress(); // the read timed out; refresh the progress UI.
        }
    } while (n != 0);
}
This activity wakes once a second to call UpdateProgress. If UpdateProgress is cheap then this might never show up
on a CPU profile, but each orphaned activity silently steals away a little more CPU.
Constrained Resource Leaks
Constrained Resources are those that are inherently limited, usually due to physical scarcity. Examples include things like file handles, process handles, network ports, and (particularly in games) GPU memory. Leaking these resources can quickly lead to a stall, a crash, or otherwise odd failures. The significance of constrained resources is that subsequent failures more often than not occur in some other part of the code, when a legitimate use attempts to allocate a constrained resource and instead encounters a failure. Because of the lack of failure locality with the root cause, and the potential for a random distribution of subsequent failures, these issues can be hard to track down and fix.
In the RedirectStdErr example above, the Pipe read file handle is leaked. The background activity owns the
FileStream object which appears in a using-statement at its top. The stream will only be disposed (releasing the
pipe read file handle) when the activity terminates. Because the activity is orphaned and therefore never terminates,
the handle is never released. The handle can’t be garbage collected (finalized) because the pending IO from
ReadBlockAsync will keep it alive.
If SomeOperation is called often enough then the process may run out of file handles. After that, any attempt to open
a new file handle (e.g. to save a game) will fail with an unusual and unexpected error. This can lead to very hard to
debug situations with random components suddenly failing in seemingly unrelated ways.
Stack Leaks
The last resource I’d like to highlight as being uniquely leaked by orphaned computations is stacks. The
RedirectStdErr uses an awaitable method to perform its IO, and so does not park a stack, but not all IO devices
support async (overlapped) IO. If Pipe only supported blocking IO then this activity might look something like:
private void Read()
{
    using FileStream stm = new(m_read, FileAccess.Read);
    using TextReader reader = new StreamReader(stm);
    char[] buffer = new char[s_bufferSize];
    int n = reader.ReadBlock(buffer);
    while (n != 0)
    {
        Span<char> text = buffer.AsSpan(0, n);
        m_content.Append(text);
        n = reader.ReadBlock(buffer);
    }
}
Each running instance of this activity would park its stack while performing blocking IO (i.e. ReadBlock). The cost
of a stack can vary greatly depending on the underlying mechanism (for example, goroutines might be cheaper than
threads), but they all have some cost in both memory and scheduling overhead. In some frameworks stacks are also a
constrained resource, e.g. a threadpool with a maximum thread count. Exceeding that limit can lead other
unrelated components to experience random stalls as they block waiting for a stack (or thread) to become available in
the threadpool. The more stacks that are leaked, the smaller the supply of available stacks the pool has to work with.
If more stacks are leaked than the application’s minimum working set of stacks requires, then the application may become
deadlocked, unable to make forward progress even though nothing appears to be waiting on anything else.
Safety Hazards
While benign activities are a nuisance, not all orphaned computation is harmless. When an orphaned computation closes over state that is still in use after the orphan’s result is no longer needed then there is a high potential for unintended mutation or side-effects that can lead to unexpected behaviors, malfunctions, crashes, or corruption. Let’s look at a few examples:
Unexpected Mutation
There is nothing more nerve-wracking in the debugger than observing a variable in the watch-window suddenly change its value between steppings when none of the code you just stepped through changed it. Where is this new value coming from? “It MUST be memory corruption!”, you think, as your stomach drops through the floor, because you know that a memory corruption bug will take forever to track down. Fortunately for you (relatively speaking), when you place a breakpoint on every line that sets that variable you discover that there is indeed an executed line that changes it; it just isn’t in a function that should ever have been called at this point in the program.
Mutation caused by orphaned computation can be very surprising because it does not fall within the mental model the developer has of the program’s operation. It is hard to reason about something that shouldn’t be happening. Closing over state shared with other computation can be efficient and a powerful mechanism for communication, but when access extends beyond its intended lifetime it leads to unexpected results.
In SomeOperation above, we allocate a StringBuilder from a pool to avoid memory allocations on this commonly used
object. This is fine. We waited until the operation was complete. The RedirectStdErr was disposed. The sb value
is no longer needed and can be returned to the pool. Right? The problem is that the Read activity was orphaned. It is
still running. And it has closed over the sb object. It can append content to sb at any time. It’s possible that
ApiThatWritesToStderr failed, wrote some data to STDERR, and then exited. At which time, because of interleaving
nondeterminism (see Ordering), SomeOperation resumed and completed before the Read activity
read this data. Suppose sb was returned to the pool, and then was immediately rented again by another operation to
format a string to display an enemy character’s name. Now instead of showing “Pirate Captain 2” above his head it shows
“PirateAuthentication failed. Captain Check credentials2”. Well, that is not right.
Interleaving nondeterminism is effectively a timing issue. So this invalid behavior might never be observed during testing and only emerge later in the field after the game has been released. Though this example produces only a visual anomaly, it could just as easily be a mutation that has less obvious impact (e.g. erroneously reducing the player’s health), or lead to permanent corruption (e.g. mutate a list’s size in the middle of writing a savegame file corrupting the entire file).
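The pooled-StringBuilder interleaving can be reproduced with a toy sketch (hypothetical names; the delays merely force one possible interleaving that, in real code, timing would produce only occasionally):

```csharp
using System;
using System.Text;
using System.Threading.Tasks;

internal static class PoolCorruptionDemo
{
    public static async Task Main()
    {
        var sb = new StringBuilder();

        // The orphaned activity from the first operation still holds sb and
        // appends late-arriving STDERR text after sb has been "returned".
        Task orphan = Task.Run(async () =>
        {
            await Task.Delay(50);
            sb.Append("Authentication failed. Check credentials");
        });

        // A second operation rents the same builder to format a name.
        sb.Append("Pirate ");
        await Task.Delay(100); // interleaving window
        sb.Append("Captain 2");

        await orphan;
        Console.WriteLine(sb); // likely garbled: the orphan's text interleaved
    }
}
```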
Stale Results
Orphaned computations don’t only cause unexpected mutations, but they can also produce stale results and extraneous
signaling. For example, consider ReadWithProgress above where it periodically updates progress while an operation is
running. Suppose progress is updated in the UI by raising a signal or an event (a very common decoupled communication
mechanism). The signal handler sets the Value property of a ProgressBar control.
Now suppose that we cancelled the first operation after it got stuck at 20% and restarted it. But ReadWithProgress
was orphaned and the first activity is still running, now concurrently with the second restarted activity. Both
instances of ReadWithProgress compete to update the progress bar. Due to interleaving nondeterminism the user sees
the progress bar bounce back and forth between the actual progress and the stalled 20% value. Perhaps as [10%,
20%, 50%, 20%, 70%, 20%, 90%, 20%]. Well, that is not right. Where is that coming from?
Unexpected Errors
Often the interactions that orphaned computations attempt with other parts of the application lead to error conditions, violate invariants, or otherwise produce faults. Remember, these interactions are unintended and not part of the architectural model of the program.
What if the signal handler for progress updates validated monotonicity? The superfluous 20% signals now throw an
ArgumentException on the UI thread. How are these handled? Is a nuisance dialog shown to the user? Is the log file
filled with noise? Are telemetry statistics corrupted? Or does the game just crash?
Side Note: monotonicity is often a great invariant to assert, as it can cheaply catch many unintended mutations.
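A monotonicity assertion in a progress handler might look like this sketch (a hypothetical handler, not taken from the original code):

```csharp
using System;

internal sealed class ProgressSink
{
    private int m_last;

    // Progress must never move backwards; a regression is a strong hint that
    // a stale (possibly orphaned) activity is still signaling.
    public void OnProgress(int percent)
    {
        if (percent < m_last)
        {
            throw new ArgumentException(
                $"Progress regressed from {m_last}% to {percent}%.", nameof(percent));
        }
        m_last = percent;
    }
}
```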
Conclusion
In this post, we reviewed some of the dangers associated with orphaned computations. They can be the source of resource drains, constrained resource exhaustion, unintended mutations and side-effects, extraneous signaling, and even faults. When creating background activities, particularly when designing active abstractions that may hide the existence of their background activities, we MUST take extreme care to ensure that all activities terminate with proper lifetime bounds: what I call clean shutdown. In the next post, I’ll review some common techniques for implementing clean shutdown that will hopefully help make our code more composable and more reliable. Until next time, be diligent to know the state of your background activities, and code on!