Orphaned Computation: Clean Shutdown

Intro

In the previous post we introduce Orphaned Computations and demonstrated how they can leak resources. We also looked at how, due to unintended side-effects, they can cause a variety of safety hazards that may negatively impact the correctness of our program.

In this post we’ll take a look at some techniques for ensuring that all our activities eventually terminate, thereby avoiding orphaned computations. This is called clean shutdown. Clean shutdown is a design principle that applies hierarchically to each level of our software architecture from the simple class-based abstraction level all the way up to the multi-machine distributed systems level. Modules that exhibit clean shutdown are much easier to compose into larger modules which also exhibit clean shutdown. Modules lacking clean shutdown can taint all higher-level compositions making it difficult or impossible to deliver higher-level guarantees.

The Halting Problem

In the previous post we implemented an Active Abstraction (whose buggy code appears again below, for convenience). This abstraction included a leaked background activity implemented by the Read method. In this post we’ll try to develop some techinques for ensuring that background activities don’t leak. Before delving into some suggestions and examples, I’d like to acknowledge that guaranteeing that all background activities terminate at the appropriate time can be challenging. This is essentially the problem of proving if a program (i.e. the activity body) terminates given the input conditions that should have signalled its termination (i.e. its end of lifetime conditions). This sounds suspiciously like the Halting Problem which Alan Turing famously proved is undecidable (at least in the general case).

This makes it difficult to prove through testing alone that our background activities will terminate properly. Fortunately, we don’t have to solve this problem for the general case. We don’t have to accept just any program, but instead we have direct control over how the program under test is implemented. That means we can both (a) change the program to make it easier to determine the correctness of our termination logic, and (b) write tests with foreknowledge of that program’s logic to verify its specific termination conditions. While I’m a strong believer in units tests in general, particular care should always be given to the unit tests of active abstractions to ensure proper verification of clean shutdown.

There are diverse ways that background activities may become busy that prevents them in turn from processing their termination conditions. Keeping some of these in mind as we write background activities will help us plan for their termination. For example, even straight-line code can become blocked on IO, trying to acquire locks or mutexes, waiting on event signaling, or awaiting the resolution of a Promise, Tasks, or other future. Background activities commonly contain loops, and so the loop condition and the loop invariant almost certainly play an important roll in termination. The termination conditions should either directly affect the loop condition, or be combined with it.

Let’s reexamine our buggy active abstraction and then see if we can fix it:

WARNING:

The code below contains bugs!

internal sealed class RedirectStdErr : IDisposable
{
  private const int s_bufferSize = 10240;
  private readonly StringBuilder m_content;
  private readonly SafeFileHandle m_read;
  private readonly SafeFileHandle m_write;
  private readonly SafeFileHandle m_stderr;
  private readonly Task m_reader;

  public RedirectStdErr(StringBuilder content)
  {
    m_content = content;
    Win32NativeMethods.CreatePipe(out m_read, out m_write, default, s_bufferSize);
    m_stderr = Win32NativeMethods.GetStdHandle(Win32NativeMethods.StdErrorHandle);
    Win32NativeMethods.SetStdHandle(Win32NativeMethods.StdErrorHandle, m_write);
    m_reader = Read();
  }

  private async Task Read()
  {
    await using FileStream stm = new(m_read, FileAccess.Read);
    using TextReader reader = new StreamReader(stm);
    char[] buffer = new char[s_bufferSize];
    int n = await reader.ReadBlockAsync(buffer);
    while (n != 0)
    {
      Span<char> text = buffer.AsSpan(0, n);
      m_content.Append(text);
      n = await reader.ReadBlockAsync(buffer);
    }
  }

  /// <inheritdoc/>
  public void Dispose()
  {
    // Put the original stderr back, and release our copy of the handle.
    Win32NativeMethods.SetStdHandle(Win32NativeMethods.StdErrorHandle, m_stderr);
    m_stderr.SetHandleAsInvalid();
  }
}

The Flag or State Pattern

One of the simplest and most straightforward patterns to indicate that a background activity should terminate is through the use of a boolean flag. In C# this flag is often called m_done or m_disposed (if the abstraction is IDisposable). If the abstraction implements a finite state machine such that termination is tied to a terminal state of that machine, then the state member variable (e.g. m_state or m_current) can also be used in this pattern.

If we were use m_disposed as a termination flag in our RedirectStdErr abstraction it might look like:

internal sealed class RedirectStdErr : IDisposable
{
  ...
  private int m_disposed;

  private async Task Read()
  {
    await using FileStream stm = new(m_read, FileAccess.Read);
    using TextReader reader = new StreamReader(stm);
    char[] buffer = new char[s_bufferSize];
    int n = await reader.ReadBlockAsync(buffer);
    while (n != 0 && Volatile.Read(ref m_disposed) == 0)
    {
      Span<char> text = buffer.AsSpan(0, n);
      m_content.Append(text);
      n = await reader.ReadBlockAsync(buffer);
    }
  }

  /// <inheritdoc/>
  public void Dispose()
  {
    if (Interlocked.CompareExchange(ref m_disposed, 1, 0) == 0)
    {
      ... // Dispose stuff.
    }
  }
}

In this pattern, when its time to shutdown (as indicated by the call to Dispose here), we set the flag (m_disposed) to a value indicating that the background activity should terminate (e.g. 1 in this case, because there is no interlocked operations are bool variables in C#). We then logically combine the loop condition with the termination condition to guarantee that the loop will exit once Dispose has been called.

You’ll notice that we are using Interlocked.CompareExchange and Volatile.Read here instead of a simple boolean variable. The memory-model in C# provides weak guarantees for non-volatile reads, so to guarantee that the loop performs a true load on each iteration we must mark that explicitly as a voltile read (or the loop may never see the write by Dispose even though it has already happened due, say, to enregistering or optimization). Depending on the programming language you are working with, you may need be aware of its memory model and necessary memory barrier APIs.

This flag pattern is simple, efficient, and often effective. This code guarantees that the loop will eventually terminate after Dispose has been called, and everything else in the Read method is simple straight-line code. So the leak is fixed, the background activity will terminate, right? Well, almost.

Sequence Termination

A bug still remains in the code above. What if the code hasn’t even entered the loop yet, but is blocked waiting for the first read? The line of code right before the loop looks like this:

    int n = await reader.ReadBlockAsync(buffer);

And that line contains an await which blocks on IO waiting for the first read from the pipe. This structure roughly corresponds to an entire category of patterns for background computation termination. It is very common for a background activity to read a potentially unbounded, usually ordered, sequence of inputs through some communication medium (where inputs are produced either outside the abstraction or by the parent activity itself). The purpose of the background computation is to process each input in turn until the end of the sequence is reached, and then to terminate.

In our example, the sequence is defined by the pipe which receives a series of strings written to the pipe via STDERR. If we had told the background activity that the sequence was complete, then the background activity would naturally know when to exit. We should have done this by closing the pipe’s write file handle. So an alternative fix to this code might have been:

internal sealed class RedirectStdErr : IDisposable
{
  ... 

  /// <inheritdoc/>
  public void Dispose()
  {
    // Put the original stderr back, and release our copy of the handle.
    Win32NativeMethods.SetStdHandle(Win32NativeMethods.StdErrorHandle, m_stderr);
    // Close the pipe to indicate EOS.
    m_write.Dispose();
    m_stderr.SetHandleAsInvalid();
  }
}

Once we have restored the original STDERR handle, no further writes will be made to the pipe. By closing the pipe’s write handle (m_write) we indicate that the input sequence is complete. A call to ReadBlockAsync on the pipe after it has been closed will return the value 0 (assuming all previous writes have already been read). The loop will eventually terminate when n == 0. So the leak is fixed, the background activity will terminate, right? Well, almost.

Internal Cancellation

It is true that most implementations of common sequence mediums (e.g. pipes, files, channels) WILL immediately abort any pending async (overlapped) reads that are waiting at the moment the write handle is closed to indicate EOS (End of Sequence). However, this is not universally true. Some sequence mediums DON’T provide strong guarantees for termination signalling to pending reads (even if subsequent reads ARE guaranteed to complete immediately). For this reason, it is sometimes recommended to always include conservative timeouts on pending reads. If a read times out, the loop can explicitly check the termination condition (e.g. the m_disposed flag in the section above) and break out of the loop if termination has been requested, otherwise it can simply reissue the pending read to keep waiting. Though they certainly have their uses, I try to avoid timeouts wherever possible as they can respond poorly to varying latency conditions and complicate testing. I prefer prompt and explicit signalling.

CancellationToken to the rescue. In C# CancellationToken (like tokio::sync::CancellationToken in Rust or context.Context in golang) provides for non-blocking, lock-free, cooperative, multithread-safe, asynchronous, broadcast cancellation notification that composes well with the core libraries including the API ReadBlockAsync. We can use a CancellationToken (in place of m_disposed) in the internal implementation of an abstraction to signal the need for termination.

A better fix to our abstraction might be:

internal sealed class RedirectStdErr : IDisposable
{
  ...
  private CancellationTokenSource m_disposed;
  private CancellationToken m_cancel;

  public RedirectStdErr(StringBuilder content)
  {
    ...
    m_disposed = new CancellationTokenSource();
    m_cancel = m_disposed.Token;
    m_reader = Read(m_cancel);
  }

  private async Task Read(CancellationToken cancel)
  {
    await using FileStream stm = new(m_read, FileAccess.Read);
    using TextReader reader = new StreamReader(stm);
    char[] buffer = new char[s_bufferSize];
    int n;
    do
    {
      try {
        n = await reader.ReadBlockAsync(buffer, cancel)
        if (n != 0) {
          Span<char> text = buffer.AsSpan(0, n);
          m_content.Append(text);
        }
      } catch (OperationCanceledException ex) {
        break;
      }
    } while (n != 0 && !cancel.IsCancellationRequested);
  }

  /// <inheritdoc/>
  public void Dispose()
  {
    if (!m_cancel.IsCancellationRequested)
    {
      // Signal cancellation (and dispose the source).
      m_disposed.Cancel();
      m_disposed.Dispose();
      // Put the original stderr back, and release our copy of the handle.
      Win32NativeMethods.SetStdHandle(Win32NativeMethods.StdErrorHandle, m_stderr);
      // Close the pipe to indicate EOS.
      m_write.Dispose();
      m_stderr.SetHandleAsInvalid();
    }
  }
}

The CancellationTokenSource created in the constructor will be used to signal our background activity to terminate via the source’s CancellationToken (passed to the activity as an argument). Any pending calls to ReadBlockAsync are guaranteed to either complete or throw OperationCanceledException. Regardless of whether the EOS signal from the closure of the pipe’s write handle, or the cancellation notification is seen first by the background activity, it WILL exit the loop and terminate. The check against IsCancellationRequested in Dispose even protects this abstraction against double dispose. So the leak is fixed, the background activity will terminate, right? Well, almost.

Side Note: It is considered a best practice to provide an idempotent implementation of Dispose whenever it is not prohibitively expensive to do so.

Async Disposables

Cross-thread communication (including flag sets, sequence completion, and cancellation notification) is both asyncronous and subject to interleaving nondeterminism. We have argued that the background activity will eventually terminate, but have not proved any strong guarantees about when it will terminate. If it takes a long time before it terminates then it will continue to create the same potential hazards as any orphaned computation until it does actually terminate.

Let revisit our motivating usecase example:

void SomeOperation()
{
  StringBuilder sb = m_pool.Get();
  try {
    using RedirectStdErr _ = new(sb);
    if (!m_steam.ApiThatWritesToStderr()) {
      throw new SteamException(sb.ToString());
    }
  }
  finally {
    m_pool.Return(sb);
  }
}

We saw in the previous post in the section Unexpected Mutation that if the background activity continues to execute after RedirectStdErr.Dispose() returns then the string value of sb could be unexpectedly mutated by a stale background activity that had closed over sb. For SomeOperation to be correct, there MUST be a strong guarantee that any background activities created by RedirectStdErr have terminated before Dispose completes.

To satisfy this requirement Dispose MUST wait until the background activity has terminated before returning:

  /// <inheritdoc/>
  public void Dispose()
  {
    if (!m_cancel.IsCancellationRequested)
    {
      ...
    }

    // Wait for the background activity to terminate.
    m_reader.Wait();
  }

Now, we know that the background activity not only terminates, but has terminated before Dispose returns. Our termination guarantees are strong. Note that the m_reader.Wait() logic appears outside the idempotency check. If it did not then we would not get consistent idempotent behavior. Instead the first call would perform clean shutdown and block until the background activity has terminated, but subsequent calls to Dispose would return immediately WITHOUT the provided guarantee that all background activity had terminated. This would be incorrect.

Only one potential problem remains. The call to m_reader.Wait() will block until the background activity has terminated. But the background activity may be asynchronously busy waiting on IO and may not terminate until either the IO completes or the cancellation notification arrives which could take an arbitrary amount of time due to interleaving nondeterminism. The thread calling Dispose will be parked until that happens. To avoid this C# introduced IAsyncDisposable (an async analog to IDisposable):

internal sealed class RedirectStdErr : IAsyncDisposable
{
  ...

  /// <inheritdoc/>
  public ValueTask DisposeAsync()
  {
    if (!m_cancel.IsCancellationRequested)
    {
      ...
    }

    // Wait for the background activity to terminate.
    return new ValueTask(m_reader);
  }
}

If the background activity doesn’t terminate promptly, then calling DisposeAsync instead of Dispose allows the caller to yield their stack until the background activity has terminated. When implementing IAsyncDisposable the calling code is changed ever so slightly to use await using instead of using:

void SomeOperation()
{
  StringBuilder sb = m_pool.Get();
  try {
    await using RedirectStdErr _ = new(sb);
    if (!m_steam.ApiThatWritesToStderr()) {
      throw new SteamException(sb.ToString());
    }
  }
  finally {
    m_pool.Return(sb);
  }
}

Now, we are guaranteed that the background activity terminates. And we are guaranteed that it terminates before Dispose returns (or AsyncDispose resolves). So the leak is fixed, the background activity will terminate, right? YES!

Conclusion

In this is post, we reviewed some common techniques for implementing clean shutdown and showed some gotchas to watch out for along the way. Not all gotchas apply to all situations, and each of the above patterns have their uses. Use your judgement. Clean shutdown and clear lifetime semantics make code more composable and more reliable. Without discipline, we have nothing. Practicing clean shutdown discipline enables you to code on!

Read the previous post in this series.

Read the next post in this series.

Feedback

Write us with feedback.