BEAMitude: What if GenServer.call times out?
Problem statement
I had the pleasure to dive a bit into the
gen_server
OTP pattern
(doc,
exdoc).
That experience left me with a few questions about its failure modes.
First of which I’m exploring in this post:
What happens when synchronous calls to GenServer have timeouts that are often too short?
Wouldn’t that bloat the client’s mailbox?
Introduction to GenServer (gen_server)
Much ink has been spilled about GenServer1. For a good reason. It’s a well thought out, battle-tested, and rather universal abstraction. It is mainly useful in the Actor model world (because other paradigms use shared memory, mutexes, etc).
It provides reasonable building block for stateful servers. In the following
diagram are two processes, client
and server
and their gen_server
interaction (on a high level):
There are three main use-cases:
- synchronous call (returning a reply)
- asynchronous cast (fire and forget RPC)
- internal housekeeping (info)2
For the rest of this article, only call
will be interesting. Behind the
scenes it works in the following fashion (elixir syntax):
- Client sends Server a message:
{:"$gen_call", {self(), tag}, request}
- Client enters into
receive
block (waiting for a message), with an optional timeout - Server’s
receive
loop (supplied byGenServer
) invokes appropriatehandle_call(request, _from, state)
callback which typically returns{:reply, response, newstate}
- Server’s
receive
loop sends back theresponse
to the statedfrom
address (with an uniquetag
), and loops withnewstate
- Client picks up the message from the
receive
block… and: profit.
Answer
In order to verify the behavior under timing out, I whipped up a simple module to verify the behavior:
# a.exs
defmodule MyApp.SlowServer do
@name __MODULE__
use GenServer
# Client interface
def start_link(_) do
GenServer.start_link(__MODULE__, %{}, name: @name)
end
def slow_call(val) do
GenServer.call @name, {:slow, val}, 10
end
# Server callbacks
@impl true
def handle_call({:slow, val}, _from, state) do
:timer.sleep(val)
{:reply, val, state}
end
@impl true
def init(state) do
{:ok, state}
end
end
defmodule Main do
def slow_call(t) do
try do
IO.puts "Slow calling: #{t}"
IO.puts "reply: #{inspect MyApp.SlowServer.slow_call(t)}"
catch
a, b -> IO.puts "error: #{inspect a}, #{inspect b}"
end
:timer.sleep(20)
IO.puts "Mailbox: #{inspect Process.info(self(), :messages)}"
IO.puts ""
end
end
IO.puts "Ex version: #{System.version}"
IO.puts "OTP release: #{System.otp_release}"
IO.puts "Client pid: #{inspect self()}"
IO.puts ""
{:ok, pid} = MyApp.SlowServer.start_link(:foo)
:sys.trace(pid, true)
# Expect: success, empty mailbox
Main.slow_call(5)
# Expect: fail, mailbox with 1 message
Main.slow_call(20)
# Expect: fail, mailbox with 2 messages
Main.slow_call(19)
And sure enough, when running on an old enough OTP, I’m getting the behavior I thought I would:
$ elixir a.exs
Ex version: 1.7.4
OTP release: 21
Client pid: #PID<0.91.0>
Slow calling: 5
*DBG* 'Elixir.MyApp.SlowServer' got call {slow,5} from <0.91.0>
*DBG* 'Elixir.MyApp.SlowServer' sent 5 to <0.91.0>, new state #{}
reply: 5
Mailbox: {:messages, []}
Slow calling: 20
*DBG* 'Elixir.MyApp.SlowServer' got call {slow,20} from <0.91.0>
error: :exit, {:timeout, {GenServer, :call, [MyApp.SlowServer, {:slow, 20}, 10]}}
*DBG* 'Elixir.MyApp.SlowServer' sent 20 to <0.91.0>, new state #{}
Mailbox: {:messages, [{#Reference<0.3808279768.984612871.134230>, 20}]}
Slow calling: 19
*DBG* 'Elixir.MyApp.SlowServer' got call {slow,19} from <0.91.0>
error: :exit, {:timeout, {GenServer, :call, [MyApp.SlowServer, {:slow, 19}, 10]}}
*DBG* 'Elixir.MyApp.SlowServer' sent 19 to <0.91.0>, new state #{}
Mailbox: {:messages, [{#Reference<0.3808279768.984612871.134230>, 20},
{#Reference<0.3808279768.984612871.134239>, 19}]}
In other words, if I trap the exit
resulting from the call
timing out, the
Client mailbox accumulates a stray message.
Elixir documentation even mentions that:
If the caller catches the failure and continues running, and the server is just late with the reply, it may arrive at any time later into the caller’s message queue. The caller must in this case be prepared for this and discard any such garbage messages that are two-element tuples with a reference as the first element.
So I shouldn’t have been surprised at all.
But this is all actually a lie3.
You see, when you run this on a more recent OTP distro the problem isn’t there:
$ elixir a.exs
Ex version: 1.13.1
OTP release: 24
Client pid: #PID<0.94.0>
Slow calling: 5
*DBG* 'Elixir.MyApp.SlowServer' got call {slow,5} from <0.94.0>
*DBG* 'Elixir.MyApp.SlowServer' sent 5 to <0.94.0>, new state #{}
error: :exit, {:timeout, {GenServer, :call, [MyApp.SlowServer, {:slow, 5}, 10]}}
Mailbox: {:messages, []}
Slow calling: 20
*DBG* 'Elixir.MyApp.SlowServer' got call {slow,20} from <0.94.0>
error: :exit, {:timeout, {GenServer, :call, [MyApp.SlowServer, {:slow, 20}, 10]}}
*DBG* 'Elixir.MyApp.SlowServer' sent 20 to <0.94.0>, new state #{}
Mailbox: {:messages, []}
Slow calling: 19
*DBG* 'Elixir.MyApp.SlowServer' got call {slow,19} from <0.94.0>
error: :exit, {:timeout, {GenServer, :call, [MyApp.SlowServer, {:slow, 19}, 10]}}
*DBG* 'Elixir.MyApp.SlowServer' sent 19 to <0.94.0>, new state #{}
Mailbox: {:messages, []}
And I actually spent a few hours figuring out what the hell is going on. Because my toy example from above refused to work as expected.
Only to find an interesting comment in gen.erl
:
Using alias to prevent responses after ‘noconnection’ and timeouts.
Turns out, OTP 24 implemented
EEP-53: Process aliases
which allows gen_server
(and others) to temporarily alias the process id (to
an active monitor). So after demonitoring, the alias is similar in behavior to
a dead process – the message gets dropped.
That’s neat. And I all can say is: cool. GenServer got even better. :)
Closing words
One question down, many more to go. When I get a few more hours to burn, I’ll post a new installment.
-
e.g. Clients and Servers from LYSE, OTP in Elixir: Learn GenServer by Building Your Own URL Shortener, Learning Elixir’s GenServer with a real-world example. ↩
-
Not strictly true, since
handle_info
is for a grab bag of things (all non-matching messages). ↩ -
Because I started writing this on Ex 1.13.1 with OTP 24 (Alpine Linux :: edge) and then spent a few hours digging through OTP and Elixir codebase and debugging the hell out of the little module… to figure out “Y U NO WORK”. So the timeline on this blogpost is a lie. Because I think it reads better that way. ↩