RPC Part 2: Interface Ordering

Intro

In the previous post we considered what properties to look for in a Message Passing system and introduced the Promise RPC library which attempts to meet some of these needs. In this post we’ll look at some new challenges around ordering in message passing systems.

Message Ordering

So far, in our discussion of message passing systems, we have implied that communication between components is made up of sequences of messages, but we haven’t said anything about how those messages are ordered. The simplest and most intuitive expectation would be to assume that messages “happen” in the order they are sent. We might call this ordering Sequential (or even Serializable in the database or distributed computing sense). And, while this sequential assumption is intuitively satisfying, it is often wrong. To understand why it can be wrong we have to first take a look at the 5 phases of an RPC call’s lifecycle:

Issue:
The caller makes (or issues) a message by calling an RPC method (or sending a request, or sending a message). This act creates a logical intent (represented by the physical message) that demands a resolution (either a success or a failure). The sequence of calls themselves define a call order from the perspective of the caller (and the developer). The call order is implicitly tied to the developer’s intuitive expectations around causal ordering, which in turn affects the developer’s intuition about the correctness of their program.
Dispatch:
Messages take some time to travel (i.e. latency). When the callee receives the message, a new activity will be dispatched to compute the response. Remember from the previous post that our message passing systems give the callee some control over when new messages are dispatched (in case the callee is already busy doing something else at the time the message was issued). Once dispatched, the response may be computed promptly (i.e. synchronously and without preemption), or asynchronously. If asynchronously, as soon as the message’s activity yields for the first time (i.e. awaits), another message might also be dispatched. This leads to multiple message activities executing concurrently. Those activities’ subsequent turns are interleaved according to the policy of the scheduler. Therefore, only the first turn of each message activity is well ordered with respect to the message passing system. We call the ordering of these first turns the dispatch order.
Retirement:
Eventually the callee’s activity completes (either successfully or as a failure) which results in a response message sent back to the caller. The order in which the responses are sent is called retirement order. The time between the dispatch and the retirement represents the period over which the callee (sometimes called the server) has committed its resources to handling the call. For instance, the scalability of a server is often related to its retirement rate (i.e. the number of retirements per second for a given dispatch rate).
Reception:
Latency, of course, applies in both directions. Eventually (assuming a lossless communication medium) the caller receives the response message. At the point of reception, the message passing system has concluded its handling of this particular message and any resources related it can be released. The response is now just data at the caller. But the caller may not be ready to consume the response immediately. If the call was issued asynchronously (e.g. it produced a Promise for the result) then the caller may be busy issuing additional concurrent calls before awaiting on the result. Or the caller’s scheduler may have cooperatively scheduled another activity to execute while waiting for the response.
Completion:
Eventually the caller’s activity awaits, resumes, and then consumes the response. The order in which the caller sees and consumes a set of responses defines the completion order.

Out of Order Retirement

There are several orderings mentioned here that might be of interest: call order, dispatch order, retirement order, and completion order. In the synchronous local call model (i.e. regular functions) all four of these orders are identical. The call is issued and dispatched in the same instant. The call then blocks the stack, preventing the caller from making any other calls, until the call both retires and completes (which also happen at the same instant). Our intuition from straightline synchronous code is that all these orderings are the same.

But, we have already seen examples in the asynchronous call model, including IO, where it is undesireable to tie completion order to call order. If I call a method that takes a long time, and then concurrently call a shorter method, it would be undesireable if the shorter method were not allowed to complete until after the long method was done. For instance, consider an example where the long call loads a save game at the game server, and the short call returns a percentage of completion for that load activity (to show a progress bar to the user). It would be useless if the progress bar couldln’t be updated until after the load operation itself was complete. Obviously we don’t want completion order to be tied to call order. We name this condition Out of Order Retirement, and we will allow both retirement order and completion order to be independent of either call order or dispatch order.

Out of order retirement means you can make any number of calls asynchronously and then handle each response as soon as it is available, without having to wait for any of the other responses to complete. That seems desirable.

Pipelining

But there is something intrinsically causal in the relationship between the call order and the dispatch order. In the load game and progress call example above, it seems wrong (or at least unintuitive) if the progress call (which was issued strictly after the load call) were to race ahead of the load call at the callee and be dispatched first. How should the callee interpret a request for progress on a load that hasn’t even happened yet? If the progress call were to fail, how should the caller interpet such a failure? Did the load fail? The caller isn’t really interested in any interleaving where the progress call happens before the load call.

The caller could attempt to defer issuing the progress call until after the load has begun, but without some extra complexity, the caller has no mechanism to know when the load call has been dispatch. Therefore the caller has no way to reliable know when to issue the progress call so that it will always work. Deferring the progress call, adds extra latency. If the load turns out to be very quick (say much faster than the network latency) then deferring the progress call could delay the entire operation. You could split the load into two calls, a StartLoadGame() and a GetLoadGameResult() and then wait for the first to complete before issuing the progress call. This also introduces latency since, at minimum, a full network round-trip is required for StartLoadGame() to complete before the first progress call could be made, and a second full network round-trip is required for GetLoadGameResult(). What if the caller never calls GetLoadGameResult()? How long should the callee hold on to the result before discarding it? Retirement rate is now bounded by both the network, the caller’s program logic, and the callee’s own implementation (an undesirable optimization problem).

Ideally, the caller wants to issue both calls immediately but express a constraint on the dispatch order to the callee. Let’s call this feature Pipelining. And the constraint to be expressed is: (within a given scope) the dispatch order must match the call order. So, if (a) the message passing system implements pipelining, (b) the load call and the progress call are in the same scope, and (c) they are issued in the desired semantic ordering, then (d) the dispatch order will match the call order. The game server will NEVER see a request for progress on a load that has never started. The caller can pipeline both the load and the progress immediately (and even reissue the progress call periodically to update the UI) without any unexpected states occurring. If the load turns out to be quick, both the load and the progress calls can be retired promptly. Both retire time and network latency are minimized. Retire time depends only on the server implementation.

Interface Order

What is the right scope over which to apply the ordering constraint described above? Obviously the scope must be bounded by the target component that receives the messages. It wouldln’t make sense to require strong ordering for messages to two different components. However, even a single component might accept messages for many different unrelated functions. Typically we group methods that are semantically related to each other into a single specification. Such a specification is in a position to describe any interactions between methods while still promoting local reasoning. Let’s call such a specification over a set of methods an Interface. As we saw in the previous post, Promise RPC defines an Interface through the definition of a C# interface with an attribute. For now, let’s propose that we apply the ordering constraint above across all methods in the same Interface, but not (necessarily) to methods across Interfaces. We’ll name this Interface Order. (In a later post we’ll narrow this proposal even further to just a specific instance of an Interface, but we’ll need Capabilities to define what instance really means in an RPC.)

Interface Order is an intuitively satisfying scope. Most object oriented programming languages support interfaces. And developers have an natural expectation of ordering when making calls on such an interface. For example, if I were to write:

Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine("This text is green!");

I expect the text to come out green. If this same code were expressed in a distributed system:

// Pipeline two calls.
Promise p1 = console.SetForegroundColor(ConsoleColor.Green);
Promise p2 = console.WriteLine("This text is green!");

// Await the completion.
await p1;
await p2;

The developer would still expect the text to come out green, and they would be very surprised if the WriteLine call were to race ahead of the SetForegroundColor call and write some other color text instead.

Can You Recover Order If You Lose It?

Although interface order seems to meet natural developer expectations, very few systems actually try to deliver strong guarantees on dispatch ordering. Let’s take a look at why it is challenging to implement interface ordering in a messaging system by looking at how HTTP deals with ordering.

HTTP-based RPC interfaces will typically make calls through an HTTP client library (e.g. HttpClient in C#). Though there are many different library implementations to choose from, what most of them have in common is that, for a variety of performance reasons, they all use pools of outgoing connections. When a request is sent, an available connection is chosen from the pool, the request is issued, and when complete the connection is returned to the pool for later reuse. Two requests issued by the same caller will likely use two different connections and so execute truly in parallel. This minimizes end-to-end latency, and reuse makes this fairy efficient. But… the semantic ordering of the two requests is already lost at the time of issue. Two packets on two separate TCP sockets are totally unordered from each other, and the second may arrive (even at the same server) before the first. Even if the server executes the requests promptly at the time of dispatch, reordering of the requests in flight to the server will lead to out of order dispatch.

Even if we ignore client-side pooling, and assume that both requests were pipelined on the same socket, almost all HTTP servers (e.g. Kestrel) utilize the thread-pool to dispatch incoming requests. Even if the requests come off the wire in call order, the scheduled execution of the first line of code in the callee is free-threaded and therefore may occur in any order. Leading, once again, to out of order dispatch.

Complicating this story is the heavy use of proxies, load balancers and reverse proxies in front of HTTP servers. These tools parse the sequence of incoming HTTP requests, apply policy, and then reissue those requests to one or more other downstream HTTP servers. These tools have weak ordering guarantees, and may even forward two pipelined requests from the same client-side socket onto two different servers for processing, leading to both loss of ordering and loss of session identity.

All of these ordering transformations stem from HTTP’s basic assumption that each request is semantically independent (and so unordered with respect to each other). Request independence helps make HTTP highly scalable and fault tolerant at the cost of session identity and dispatch ordering. For HTTP, this is generally a very good tradeoff given its original intent. A web browser largely doesn’t care which physical server delivers its page elements or what order they arrive in.

Typically, HTTP-based systems that care about ordering must recover any required ordering through explicit states that are persisted in external storage systems (e.g. databases). A dispatched RPC must load the externalized state, obtain ordering control (e.g. by opening a transaction, or mutating the persisted state to acquire a lock), perform the operation, write back any changes to the externalized state, and finally release ordering control (e.g. by committing the transaction or mutating the state again to release the lock).

This may work fine for large scale stateless systems or stateful systems with a significant data center presence (which includes externalized storage). However, it is a very challenging regime and is generally a poor fit for games which involve large numbers of otherwise small but highly mutable stateful systems. Many games are even peer-to-peer with little or no data center presence. Giving up ordering only to try to recover it again is a waste of resources. And recoving ordering incorrectly leads to very subtle bugs, race conditions, and nondeterminism.

Promise RPC and Ordering

Promise RPC provides ordering guarantees at the proxy level. Ordering on proxy instances enables powerful composition with the Capability model (which we will talk about more in a future post). All method calls made on the same proxy are guaranteed to be dispatched in the same order they were issued. Because the Promise scheduler at the callee is single-thread and cooperative this ordering ensures that the callee both observes the dispatch order and has the opportunity to impose as much or as little downstream ordering on retirement as the interface’s semantics require. For example, a callee can impose a total order (e.g. serializability) by deferring additional dispatches until the pending activity is complete. Alternatively, a callee can allow concurrent execution while maintaining ephemeral state that records the original dispatch order (e.g. log sequence numbers or LSNs) and it can use that information to enforce some kind of partial order. And implementations with only prompt methods (a very common sub-case) achieve a total order on internal state mutations without any additional information.

These strong ordering guarantees are available out of the box, and implemented directly by the messaging system, requiring no extra effort by the developer. Simply define an interface with one or more methods, make calls on those methods in some order, and the callee will dispatch those methods in the same order. Guaranteed. Regardless of which transport is in use (e.g. same-SIP, cross-thread, cross-process, TCP networking, Steam networking).

Dispatch ordering provides significant power to the callee in implementing any desired serializability semantics, but doesn’t impose any restrictions on the caller. All Promise RPC methods are asynchronous and return Promise-based results. The caller can issue calls in any order. Can choose to await calls immediately, or overlap their execution. Can even await their completion out of order. The caller’s choices never impact the callee’s semantics or the callee’s implementation requirements.

Example

As an example of how interface ordering makes writing games easier: Consider a Game Lobby screen for a multiplayer game that allows all players to collaboratively configure the game’s settings while engaging in chat and then launch the game. The lobby implements an interface called IGameLauncher which provides methods for changing various settings, sending chat messages to the group, and launching the game. Some settings have data dependencies (e.g. you can’t set the map style on round 3 if the game only has 2 rounds). To keep the interface responsive, all changes to the game settings are pipelined. When a player changes the settings a method call is issued to the launcher (running on the host or game server) where it is replicated asynchonrously to all joined players after imposing an LSN based on dispatch order at the launcher. A player can both increase the round count and then configure the new round’s settings without having to wait for all other players’ interfaces to catch up with the changes. But all players still see the same updates in the same order (eventually) in their own UI. The LSN-checks at the launcher ensure a total order to all changes and resolve conflicts that arise if two players attempt to make mutually exclusive changes simultaneously. Similarly, chat messages are totally ordered, so all players see the same chat messages in the same order. Replies never get reordered from the messages they replied to (even though the ordering of the messages is not explicitly tracked). The game can be launched by any player at any time, but ONLY IF they have seen all committed settings and chat messages. This is enforced simply by including the LSN that the player’s UI most recently saw before the Launch button was pressed. If the LSN doesn’t match the latest LSN at the launcher then the launch will fail. The launch intent can never be reordered in front of a settings change made by the same UI. So, I can’t click a setting and then click launch and have the game launch without that setting because the launch RPC raced in front of the settings change. None of this ordering correctness requires any explicit code from the developer.

Conclusion

In this post we talked about RPC method lifecycles, message ordering, and using interfaces to scope ordering constraints. We discussed what semantics the Promise RPC library implements for message ordering, and gave some examples about how that might be useful.

This is Part 2 of our look at the MSC RPC system. In Part 3 and beyond we’ll look at remote object identity through capability exchange, streaming with sequences and bytes, and finally we’ll see how session lifetime relates to aborts and cancellation. Until next time, code on!

Read the previous post in this series.

Read the next post in this series.

Feedback

Write us with feedback.