BEAMitude: What if times out?

Problem statement

I had the pleasure to dive a bit into the gen_server OTP pattern (doc, exdoc). That experience left me with a few questions about its failure modes.

First of which I’m exploring in this post:

What happens when synchronous calls to GenServer have timeouts that are often too short?

Wouldn’t that bloat the client’s mailbox?

Introduction to GenServer (gen_server)

Much ink has been spilled about GenServer1. For a good reason. It’s a well thought out, battle-tested, and rather universal abstraction. It is mainly useful in the Actor model world (because other paradigms use shared memory, mutexes, etc).

It provides reasonable building block for stateful servers. In the following diagram are two processes, client and server and their gen_server interaction (on a high level):


There are three main use-cases:

For the rest of this article, only call will be interesting. Behind the scenes it works in the following fashion (elixir syntax):

  1. Client sends Server a message: {:"$gen_call", {self(), tag}, request}
  2. Client enters into receive block (waiting for a message), with an optional timeout
  3. Server’s receive loop (supplied by GenServer) invokes appropriate handle_call(request, _from, state) callback which typically returns {:reply, response, newstate}
  4. Server’s receive loop sends back the response to the stated from address (with an unique tag), and loops with newstate
  5. Client picks up the message from the receive block… and: profit.


In order to verify the behavior under timing out, I whipped up a simple module to verify the behavior:

# a.exs

defmodule MyApp.SlowServer do
  @name __MODULE__
  use GenServer

  # Client interface

  def start_link(_) do
    GenServer.start_link(__MODULE__, %{}, name: @name)

  def slow_call(val) do @name, {:slow, val}, 10

  # Server callbacks
  @impl true
  def handle_call({:slow, val}, _from, state) do
    {:reply, val, state}

  @impl true
  def init(state) do
    {:ok, state}

defmodule Main do
  def slow_call(t) do
    try do
      IO.puts "Slow calling: #{t}"
      IO.puts "reply: #{inspect MyApp.SlowServer.slow_call(t)}"
      a, b -> IO.puts "error: #{inspect a}, #{inspect b}"
    IO.puts "Mailbox: #{inspect, :messages)}"
    IO.puts ""

IO.puts "Ex version: #{System.version}"
IO.puts "OTP release: #{System.otp_release}"
IO.puts "Client pid: #{inspect self()}"
IO.puts ""
{:ok, pid} = MyApp.SlowServer.start_link(:foo)
:sys.trace(pid, true)

# Expect: success, empty mailbox

# Expect: fail, mailbox with 1 message

# Expect: fail, mailbox with 2 messages

And sure enough, when running on an old enough OTP, I’m getting the behavior I thought I would:

$ elixir a.exs
Ex version: 1.7.4
OTP release: 21
Client pid: #PID<0.91.0>

Slow calling: 5
*DBG* 'Elixir.MyApp.SlowServer' got call {slow,5} from <0.91.0>
*DBG* 'Elixir.MyApp.SlowServer' sent 5 to <0.91.0>, new state #{}
reply: 5
Mailbox: {:messages, []}

Slow calling: 20
*DBG* 'Elixir.MyApp.SlowServer' got call {slow,20} from <0.91.0>
error: :exit, {:timeout, {GenServer, :call, [MyApp.SlowServer, {:slow, 20}, 10]}}
*DBG* 'Elixir.MyApp.SlowServer' sent 20 to <0.91.0>, new state #{}
Mailbox: {:messages, [{#Reference<0.3808279768.984612871.134230>, 20}]}

Slow calling: 19
*DBG* 'Elixir.MyApp.SlowServer' got call {slow,19} from <0.91.0>
error: :exit, {:timeout, {GenServer, :call, [MyApp.SlowServer, {:slow, 19}, 10]}}
*DBG* 'Elixir.MyApp.SlowServer' sent 19 to <0.91.0>, new state #{}
Mailbox: {:messages, [{#Reference<0.3808279768.984612871.134230>, 20},
                      {#Reference<0.3808279768.984612871.134239>, 19}]}

In other words, if I trap the exit resulting from the call timing out, the Client mailbox accumulates a stray message.

Elixir documentation even mentions that:

If the caller catches the failure and continues running, and the server is just late with the reply, it may arrive at any time later into the caller’s message queue. The caller must in this case be prepared for this and discard any such garbage messages that are two-element tuples with a reference as the first element.

So I shouldn’t have been surprised at all.

But this is all actually a lie3.

You see, when you run this on a more recent OTP distro the problem isn’t there:

$ elixir a.exs 
Ex version: 1.13.1
OTP release: 24
Client pid: #PID<0.94.0>

Slow calling: 5
*DBG* 'Elixir.MyApp.SlowServer' got call {slow,5} from <0.94.0>
*DBG* 'Elixir.MyApp.SlowServer' sent 5 to <0.94.0>, new state #{}
error: :exit, {:timeout, {GenServer, :call, [MyApp.SlowServer, {:slow, 5}, 10]}}
Mailbox: {:messages, []}

Slow calling: 20
*DBG* 'Elixir.MyApp.SlowServer' got call {slow,20} from <0.94.0>
error: :exit, {:timeout, {GenServer, :call, [MyApp.SlowServer, {:slow, 20}, 10]}}
*DBG* 'Elixir.MyApp.SlowServer' sent 20 to <0.94.0>, new state #{}
Mailbox: {:messages, []}

Slow calling: 19
*DBG* 'Elixir.MyApp.SlowServer' got call {slow,19} from <0.94.0>
error: :exit, {:timeout, {GenServer, :call, [MyApp.SlowServer, {:slow, 19}, 10]}}
*DBG* 'Elixir.MyApp.SlowServer' sent 19 to <0.94.0>, new state #{}
Mailbox: {:messages, []}

And I actually spent a few hours figuring out what the hell is going on. Because my toy example from above refused to work as expected.

Only to find an interesting comment in gen.erl:

Using alias to prevent responses after ‘noconnection’ and timeouts.

Turns out, OTP 24 implemented EEP-53: Process aliases which allows gen_server (and others) to temporarily alias the process id (to an active monitor). So after demonitoring, the alias is similar in behavior to a dead process – the message gets dropped.

That’s neat. And I all can say is: cool. GenServer got even better. :)

Closing words

One question down, many more to go. When I get a few more hours to burn, I’ll post a new installment.

  1. e.g. Clients and Servers from LYSE, OTP in Elixir: Learn GenServer by Building Your Own URL Shortener, Learning Elixir’s GenServer with a real-world example.

  2. Not strictly true, since handle_info is for a grab bag of things (all non-matching messages).

  3. Because I started writing this on Ex 1.13.1 with OTP 24 (Alpine Linux :: edge) and then spent a few hours digging through OTP and Elixir codebase and debugging the hell out of the little module… to figure out “Y U NO WORK”. So the timeline on this blogpost is a lie. Because I think it reads better that way.