SIP: Software Isolated Process
Intro
In several previous posts I’ve mentioned SIPs in passing without ever really getting into what they are. In this article I’d like to provide a little more detail about what I mean by the term SIP and why I like them (and why you should too).
SIP is an acronym for Software Isolated Process. I usually use the term SIP instead of Process because the word “Process” has become very overloaded in computing and can mean different things to different people. To understand how these terms have become so muddy, we’ll have to take a trip down memory lane (bad pun intended).
A Brief History of the Process
In the early days of computing, programs (aka jobs, or tasks) executed one at a time, sequentially on a single CPU. The CPU executed the instructions of the program in the order they were written. The program had complete control over all of the hardware including all memory and all IO devices.
By the 1960’s CPUs had already grown fast enough in comparison to IO devices that keeping the very expensive CPUs busy all of the time through the execution of a single computing job became difficult. To avoid wasting those costly resources, operating systems began experimenting with multiprogramming, that is, the execution of more than one program at the same time. When one program was waiting on IO, the CPU could switch to and execute another program instead of letting the CPU sit idle. This made good economic sense when a single computer in 1961, like the Burroughs B5000, could cost as much as $1M ($4-$8M today when adjusted for inflation)! (Incidentally, for you movie fans, the Burroughs ran an operating system named Master Control Program - the namesake for the evil MCP in the original Tron movie.)
Running multiple concurrent programs created a new set of issues. One of the most significant problems quickly became obvious on these early multiprogramming systems. Two programs running concurrently can write all over each other’s memory. If two programs try to use the same piece of memory at the same time, they are likely to corrupt each other’s operation. For two programs to safely run concurrently they must agree ahead of time to use different, non-overlapping, segments of memory. This made software design more difficult and negatively impacted composability.
By the late 1960’s CPU manufacturers introduced a new piece of hardware called the memory management unit (MMU). The MMU created a 2-level memory addressing scheme through the use of tables, loaded into hardware, that map the addresses used directly by programs (virtual memory) into other addresses (physical memory) used by the hardware. This allowed two programs to use the same memory address without fear of corrupting each other’s memory, as long as each loaded its own address tables into the MMU before accessing memory and those tables mapped the virtual address to different physical addresses. Once again, software could be written as if it were the only program running on the machine, without having to add a (segment) offset to every memory address it accessed to prevent memory collisions with other programs.
By the early 1970’s true multitasking operating systems (notably Unix) took advantage of MMUs to introduce our modern notion of the OS Process. An OS Process is a program designed to run in a multiprogramming environment whose lifetime is mediated by the OS and whose access to a strict subset of physical memory is enforced by MMU hardware under the exclusive control of the OS (rather than the program). Programs no longer needed to worry about loading their MMU tables. Programs had no visibility at all into memory not allocated to them by the OS.
Once again, each program was given the appearance of executing strictly sequentially on a single CPU with exclusive access to all of its addressable memory. A program could safely execute concurrently in this environment alongside any number of other programs without being concerned with corruption or interference. This made it considerably easier for different groups of people to write independent programs that ultimately ran on the same computer without having to coordinate with each other. This also empowered computers to execute more programs concurrently (keeping those expensive CPUs busy) without fear of operational failures.
A Brief History of Multiprocessing
In parallel with (see what I did there?) the evolution of hardware-assisted memory isolation, computer designers also explored the benefits of connecting multiple CPUs to the same system (multiprocessing). Even early systems in the 1960’s sported two or more CPUs. By the 1970s many commercial systems supported multiple-CPU configurations arranged either as symmetric multiprocessing (SMP) systems where all CPUs share access to all the memory, or as message-passing systems where programs running on different CPUs send messages to each other to coordinate access to CPU-attached memory.
One of the main benefits of multiprocessing systems in the 1980s was that the OS could not only run multiple programs concurrently, but could actually run them in parallel. Multiple processes, each running independently under hardware-assisted memory isolation, were none the wiser about the switch between concurrent and parallel execution. Hardware-assisted memory isolation is a powerful abstraction for the separation of concerns between processes.
However, some programs, particularly data-intensive ones, found they could benefit from using multiple CPUs directly over the same shared data. By the 1990’s this led directly to the development of a new programming methodology called multithreading. OSes that supported multithreading introduced mechanisms (e.g. clone(), CreateThread()) for creating additional logical threads of execution within the same OS process. Each of these threads was like a separate program in that it executed its own independent program logic concurrently (or even in parallel) with the others.
However, unlike separate OS processes, all of the threads in a single program share the same hardware memory address space. The MMU no longer prohibits them from reading and writing each other’s memory. This means they can all operate at the same time over the same potentially large data structures, such as a database’s B-tree, without having to first make copies. It means all of the threads can contribute to a shared computed solution for a large computing problem, such as a beam search for a computer player’s next game move. Multithreading makes the use of such large data structures and large computing problems practical.
With great power, however, also comes great responsibility. The thread abstraction strips away the benefits of hardware-assisted memory isolation that programs previously enjoyed. It reintroduces concurrency hazards like memory corruption, race conditions, and deadlocks that memory isolation and single-threaded execution had avoided. The multithreaded programming model introduced new complexities, like synchronization primitives (locks), to try to address these issues. Despite their intense popularity (and perhaps their status as a badge of prowess), locks are challenging to use properly and fraught with subtle issues (as we discussed in our earlier post on Ordering).
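To make the hazard concrete, here is a minimal Python sketch (an illustrative shared counter, not from this post) showing why a shared read-modify-write needs a lock:

```python
import threading

counter = 0
lock = threading.Lock()

def worker(n: int) -> None:
    """Increment the shared counter n times, under a lock."""
    global counter
    for _ in range(n):
        # Without the lock, this read-modify-write could interleave with
        # another thread's, silently losing increments (a data race).
        with lock:
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000 -- deterministic only because of the lock
```

Remove the `with lock:` and the final count may fall short, because two threads can both read the same old value before either writes back.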
Enter SIP, Stage Left
A SIP isn’t really a particular technology or mechanism, but more of a programming methodology and a discipline. A way of looking at and thinking about the logic executing in a program (particularly one with multiple threads). A simple definition of a SIP might be:
A single-threaded, sequential, logical thread of execution with exclusive access to its own isolated memory.
The key idea here is to try to restore the benefits we, as software engineers, enjoyed from the single-threaded, hardware-assisted memory isolation environment that existed before multithreading while still allowing the specific pieces of our software that benefit the most from multithreaded parallelism to leverage it without permitting that power to taint the whole application programming model.
A SIP is self-contained both in code execution and in state. Because it is single-threaded, its execution behavior is sequential and totally ordered, making it easy to write, easy to reason about, easy to debug, and easy to maintain. It is free from concurrency hazards, even when running in a highly concurrent environment. It has no race conditions. All of its state is stored in (and only in) a segment of memory that only it has access to, guaranteeing that its state is also self-contained. It cannot be corrupted by other computations in the same process. It can only be mutated by the totally ordered sequential logic the SIP itself executes, making mutation easy to follow (and easy to document through logging or debugging). The logic and state can be independently reasoned about, making a SIP easy to test independently of the larger application that uses it. Multiple instances of the same SIP can be instantiated at the same time, each with an independent identity and an independent copy of its state. This makes a SIP reusable with high composability.
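As a minimal sketch of these properties, here is one way a SIP might look, assuming Python's `threading` and `queue` modules as the underlying mechanism (the class and method names are illustrative, not from any particular library):

```python
import queue
import threading

class CounterSip:
    """A minimal SIP sketch: one thread, private state, explicit channels.

    All state lives in locals of _run(); the outside world interacts
    solely through by-value messages on the inbox/outbox queues.
    """

    def __init__(self) -> None:
        self.inbox: queue.Queue = queue.Queue()
        self.outbox: queue.Queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def _run(self) -> None:
        total = 0  # private state: no other thread can reach this local
        while True:
            msg = self.inbox.get()
            if msg is None:          # sentinel: tear the SIP down
                break
            total += msg             # sequential, totally ordered mutation
            self.outbox.put(total)

sip = CounterSip()
sip.start()
for value in (1, 2, 3):
    sip.inbox.put(value)
totals = [sip.outbox.get() for _ in range(3)]
sip.inbox.put(None)  # destroy the SIP
print(totals)  # [1, 3, 6]
```

Note that nothing outside `_run()` can observe or corrupt `total`; the only way to influence it is a message, in order, through the inbox.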
In many ways SIPs harken back to the original design philosophy of Unix. The Unix philosophy was about simplicity, modularity, and composability. Instead of building giant, all-in-one programs, Unix encouraged small, purpose-built processes that could be chained together to solve complex problems - an idea that has influenced everything from scripting to modern microservices. The SIP methodology applies all of these same ideas but to the multiple threads that make up a complex modern software application (especially and including games).
SIP DIY
OK, I’m convinced. How do I get one? There are many adjacent technologies that can be used to achieve the functionality of a SIP including threads, tasks, coroutines, goroutines, single-threaded apartments, and even child processes. Let’s look at some of the requirements a SIP needs to achieve its desired properties.
First things first. You need a mechanism to start a separate logical thread of execution. One that can run single-threaded program logic concurrently (or even in parallel). Once started, however, the logic MUST remain independent of any other SIP (including the one that created it). Obviously a child process could achieve this, but other mechanisms like a thread or a task will work equally well (and at a lower cost) assuming we can meet the other requirements.
Next, we need to guarantee that the new SIP has exclusive access to its own isolated memory. This is trivial if we chose a child process in the step above, since hardware-assisted memory isolation guarantees exactly this property at the hardware level. To achieve the same guarantees within a single OS process, however, we will need to employ some programming discipline. In particular:
1. Clean Launch Context: We must be careful when launching our new thread that its closure does NOT include any references to shared objects (i.e. pointers or by-reference objects). Exposing any references creates an undesirable back-channel for hidden communication between the new SIP and its parent, breaking the illusion of memory isolation.
2. No Mutable Statics: We must avoid process-wide mutable statics. This includes mutable global variables, class statics, and function statics. Most modern programming languages allow the definition of both mutable and immutable statics. Immutable statics are invaluable as constants and compile-time defined data structures. However, in a multithreaded environment, mutable statics can be accessed by any thread at any time and MUST be protected by locks. Furthermore, mutable statics have weakly defined initialization ordering, which can lead to unexpected results when multiple SIPs are created at the same time. (Incidentally, avoiding mutable statics has other benefits in composability, testing, separation of concerns, and maintenance. So I tend to recommend this precept anyway, independently of SIPs.)
3. Communicate Only Through Explicit Channels: When launching a new SIP, it will obviously need some mechanism to communicate with other SIPs. We talked a lot about Transports and Sessions in our previous post. What will work best for you depends on many factors, such as what kind of information you need to communicate, whether communication is one-directional or bi-directional, or whether you are crossing process or network boundaries. A communication channel could be as simple as a thread-safe queue or as sophisticated as an RPC session. The key principle is that it only allows explicit, structured message exchange. It MUST NOT allow either party unstructured access to the other party’s memory.
4. Communicate Only By-Value or Linear Transfer: When a SIP sends a message to another SIP, the message content itself MUST NOT introduce new references to shared objects. The easiest way to guarantee this is to communicate only through memory copying (i.e. by value). A good serialization library can help efficiently flatten complex structures into an easily copyable sequence of bytes. Then the channel can perform a blit (aka byte block transfer) from the sending SIP’s memory to the receiving SIP’s memory. Alternatively, for very large or complex object trees, a Linear Transfer could be employed. Warning: for a linear transfer to be applicable, the entire object tree to be transferred must itself be self-contained with no escaping references (otherwise those references would become shared object references and break the illusion of memory isolation).
5. Memory Safety Must Be Obeyed At All Times: Obviously, direct unsafe memory accesses have the potential to violate the memory isolation of SIPs.
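A small sketch of the by-value discipline, using `json` purely as a stand-in for a real serialization library (the message shape and names are illustrative):

```python
import json
import queue
import threading

inbox: queue.Queue = queue.Queue()

def by_value_send(ch: queue.Queue, payload: dict) -> None:
    # Flatten the message to bytes before it crosses the channel, so the
    # receiver materializes a fresh copy and no references are shared.
    ch.put(json.dumps(payload).encode())

def receiver(results: list) -> None:
    msg = json.loads(inbox.get().decode())  # materialize a private copy
    msg["score"] = 1.0                      # mutate the copy only
    results.append(msg)

original = {"move": "e2e4", "score": 0.3}
results: list = []
t = threading.Thread(target=receiver, args=(results,))
t.start()
by_value_send(inbox, original)
t.join()

print(original["score"], results[0]["score"])  # 0.3 1.0 -- no sharing
```

Had we put the dict on the queue directly, both SIPs would hold a reference to the same object, and the receiver's mutation would have silently leaked back into the sender.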
Almost any modern programming language can be used to implement a SIP, but some will demand greater developer discipline than others. Some will provide better guarantees through compile-time or run-time enforcement. How easy these requirements are to implement will greatly depend on the programming language you use, your personal style, and the operating system you have to work with.
Memory-safe languages like C#, Rust, Go, or Java can make it significantly easier to achieve the memory isolation properties described above. Their compilers can enforce (1) and (5) at compile-time through their type systems. There are many existing libraries that provide suitable abstractions for use as channels in (3). And (4) can be guaranteed by-construction through serialization and materialization. Only (2) relies on developer discipline. (In some environments even (2) can be enforced at compile-time through additional tooling, such as C#’s Roslyn-based analyzers, which can detect and flag the unintentional use of mutable statics.)
Even in environments as inhospitable to memory safety as C++ is, I have had great success in utilizing the SIP paradigm through the judicious use of modern C++ primitives and some developer care. IMHO the benefits have always vastly outweighed any cost to expressiveness.
Failure Domains and Recovery
There is one final property of SIPs that I would like to emphasize. A SIP forms a natural Failure Domain. A failure domain is a physical or logical subset of a hardware or software system that can fail independently of the rest of the system. If a component within the domain fails it may affect other things within the domain, but everything outside the domain continues to function normally. Architects can use failure domains in their designs to define the scope of impact when something goes wrong, and to limit the extent of required recovery operations.
The concept of failure domains is a powerful tool in building reliable software systems, even those that only run on a single computer (like a game). I like to use failure domains to encapsulate parts of my software. For instance, when writing a game, I like to run the game session in a separate failure domain from the main UI. If there is a bug in the game logic that causes the game session to crash, the UI remains responsive. The user can see an error message, they might be able to save the game, and they can gracefully exit to the main screen to restart the game or to quit.
A good SIP implementation makes a great mechanism to easily employ failure domains like this. Launching a separate SIP containing the game session and communicating with it using a Promise RPC proxy is both simple and powerful. If the game ends, I simply destroy the SIP. If the user leaves the game early, I simply destroy the SIP. If the game session fails or crashes then I simply destroy the SIP. Because of the SIP’s self-contained nature, failures cannot escape and impact the larger system. Destroying the SIP always guarantees proper cleanup of all resources contained within. No leaks are possible, by construction. And SIP teardown is definitive. When the SIP is destroyed it is guaranteed to terminate (i.e. no longer be running - an important liveness property). There can be no orphaned background computation that might corrupt recovery efforts or a restart, by construction. Recovery, if needed, is therefore also simple. Once the SIP is destroyed, create a new one and reload the persisted state.
Conclusion
In this post, we talked about the Software Isolated Process, or SIP, as a programming methodology that promotes dissecting complex multithreaded programs into independent segments of logic and state, each of which executes sequentially with exclusive access to its own block of isolated memory.
Because of their clean and well-defined properties, much of Distributed Computing theory is written in terms of abstract processes that are in fact SIPs. While it is by no means required to use SIPs to leverage the learnings from these theories, it can certainly make it much easier to reconcile theory with practice.
In summary, the benefits of SIPs include:
- Freedom from the complexities and subtleties inherent in the use of locks.
- Freedom from concurrency hazards including corruption, data races, and deadlocks.
- Simpler and well-defined reasoning about the order of program execution.
- Explicit and well-defined cross-thread interactions through message passing.
- Well-defined ordering in logging and diagnostics (due to single-threaded execution).
- Greater composability due to a strong separation of concerns.
- Easier cleanup and recovery due to well-defined failure domains.
I hope you find good uses for your SIPs. Until next time, code on!