Debugging Production Elixir with Observer and Recon
Remote shell access, process inspection, and tracing in production safely
It's 3 AM. Your phone won't stop buzzing. The production dashboard shows request latency climbing from 50ms to 12 seconds; memory usage is spiking and users are churning. You've got two options: restart the node and hope, or connect to the running system and actually understand what's happening.
Most platforms force you into the first option. The BEAM doesn't.
The BEAM virtual machine gives us something rare in production systems — the ability to inspect, trace, and modify running processes without stopping them. This isn't a parlor trick. It's a fundamental architectural decision that the Erlang ecosystem made decades ago, and it remains one of Elixir's most underused capabilities.
The Remote Shell: Your Entry Point
IEx provides remote shell access through the --remsh flag; it establishes a connection to a named Erlang node running somewhere else on the network. Your production node needs to be started with a name. In a typical release, this happens via your rel/env.sh.eex or through environment variables:
# Named node with full hostname
elixir --name myapp@prod-server-01.example.com -S mix phx.server
# Named node with short name (single host or local development)
elixir --sname myapp -S mix phx.server
To connect from another machine, you need three things: the node name, network access, and a matching Erlang cookie. The cookie is a shared secret that authorizes node-to-node communication.
# Connect to the remote node
iex --name debug@your-machine.example.com \
--cookie YOUR_PRODUCTION_COOKIE \
--remsh myapp@prod-server-01.example.com
Once connected, you're inside the production runtime. Every command you type executes in that node's process space. This is power. This is also danger.
A few rules for remote shell sessions:
- Never paste untested code. Compile errors in a remote shell can behave unpredictably.
- Avoid blocking the shell process on long-running operations.
- Keep your session short; idle connections consume resources.
- Use read-only operations first. Mutate only when you understand the system state.
For production nodes behind firewalls, establish an SSH tunnel first:
ssh -L 4369:localhost:4369 -L 9001:localhost:9001 prod-server-01.example.com
Port 4369 is EPMD — the Erlang Port Mapper Daemon — which maps node names to distribution ports. Port 9001 is an example distribution port; your node's actual port will differ.
Observer: The Visual Debugger
Observer is Erlang's built-in GUI for system inspection. Most developers know it from local development but never point it at production nodes. That's a missed opportunity.
To run Observer locally against a remote node:
# First, ensure your local node is started with distribution
iex --name observer@localhost --cookie YOUR_PRODUCTION_COOKIE
# Then connect to the remote node
Node.connect(:"myapp@prod-server-01.example.com")
# Start Observer
:observer.start()
Once Observer launches, go to Nodes in the menu bar and select your production node. You've now got a live view of processes, memory allocation, and system statistics.
The Applications tab shows your supervision tree. The Processes tab lets you sort by message queue length, memory usage, or reductions — the BEAM's unit of work accounting. The System tab covers schedulers, memory, and I/O.
One caveat: Observer adds overhead. The GUI polls the remote node for data, which burns CPU and network bandwidth. Don't leave it connected for extended periods on heavily loaded systems. Get your data and disconnect.
Process Inspection: Finding the Culprit
When latency spikes or memory climbs, a process is usually responsible. Your job is to find it.
The :recon library belongs in every production deployment. Add it to your dependencies:
# mix.exs
{:recon, "~> 2.5"}
Finding Heavy Processes
# Top 10 processes by memory usage
:recon.proc_count(:memory, 10)
# Top 10 processes by message queue length
:recon.proc_count(:message_queue_len, 10)
# Top 10 processes by heap size
:recon.proc_count(:heap_size, 10)
# Top 10 processes by reductions (CPU work) over a 1-second window
:recon.proc_window(:reductions, 10, 1000)
The difference between proc_count and proc_window matters. proc_count gives you a snapshot — who has the most right now. proc_window gives you a delta — who did the most work during the measurement period. Use proc_window for CPU-bound issues; use proc_count for accumulation problems like message queues.
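Both functions return a list of {pid, value, info} triples, worst offender first, so you can feed the result straight into Process.info/2. A minimal follow-up, sketched under the assumption of a connected shell with :recon available:

```elixir
# Grab the process with the longest mailbox and inspect it.
# :recon.proc_count/2 returns {pid, value, info} triples, sorted descending.
[{pid, queue_len, _info} | _] = :recon.proc_count(:message_queue_len, 10)

IO.inspect(queue_len, label: "mailbox length")

# Read-only follow-up: who is it, and what is it doing right now?
Process.info(pid, [:registered_name, :current_function, :memory])
```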
Deep Process Inspection
Once you've got a suspicious process, dig deeper:
pid = pid(0, 1234, 0) # IEx helper: builds a PID from the three numbers in <0.1234.0>
# Basic info
Process.info(pid, [:registered_name, :current_function, :message_queue_len])
# Full info dump
Process.info(pid)
# Erlang-level details
:erlang.process_info(pid, :current_stacktrace)
The current_stacktrace key is gold. It tells you exactly what that process is doing right now; if you see a process stuck in a blocking call to an external service, you've found your bottleneck.
For GenServers, check the state directly:
:sys.get_state(pid)
If the state is enormous — say, a list with 500,000 items — you've found your memory problem.sys-format-status
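To put a number on "enormous" without printing the whole term to your shell, you can ask the runtime for the state's flat size. A sketch; note that :erts_debug is an internal, undocumented module, so treat this as a debugging convenience rather than a stable API:

```elixir
state = :sys.get_state(pid)

# Flat size in machine words; multiply by the word size to estimate bytes.
# :erts_debug.size/1 walks the whole term, so avoid it on truly huge states.
words = :erts_debug.size(state)
bytes = words * :erlang.system_info(:wordsize)
```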
Memory Leak Hunting
Elixir processes have independent heaps, which simplifies garbage collection but complicates memory debugging. A leak might be a single process hoarding data; it might be binary references held across process boundaries. The two look very different from the outside.
Binary Leaks
Binary data larger than 64 bytes lives in a shared heap and is reference-counted.refc-binary-threshold If a process holds a reference to a sub-binary — a slice of a larger binary — the entire original binary stays in memory until all references are released.
This pattern causes leaks:
def parse_header(<<header::binary-size(4), _rest::binary>>) do
# header is a sub-binary referencing the entire original binary
{:ok, header}
end
The :recon library can find processes holding onto binary references:
# Find processes with the most binary memory
:recon.bin_leak(10)
Under the hood, this snapshots each process's binary references, forces a garbage collection on every process, and returns the ones that released the most references — a heuristic for leak potential. It's not perfect; some processes legitimately hold large binaries. But during a memory investigation, it narrows the field fast.
Allocator Analysis
For system-wide memory issues, :recon_alloc gives you visibility into the BEAM's memory allocators:
# Ratio of memory used to memory allocated, per allocator
:recon_alloc.memory(:usage)
# Total memory reserved by the allocators
:recon_alloc.memory(:allocated)
# Cache hit rates for allocators
:recon_alloc.cache_hit_rates()
High fragmentation — allocated far exceeding used — indicates the allocators are holding onto memory they can't efficiently reuse. This often happens with highly variable workload patterns; the allocators reserve blocks for peak demand and don't return them to the OS. Sometimes the answer is restarting the node. Sometimes it's tuning allocator flags. Neither is glamorous.
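To quantify that gap directly, compare the allocated and used totals yourself. A sketch, assuming :recon is loaded on the node:

```elixir
# :recon_alloc.memory/1 also accepts :used; the :usage atom above
# computes this same ratio for you.
allocated = :recon_alloc.memory(:allocated)
used = :recon_alloc.memory(:used)

# A ratio well below 1.0 means the allocators hold far more than is in use.
Float.round(used / allocated, 2)
```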
Tracing in Production
Tracing lets you observe function calls, arguments, and return values in real time. It's also the fastest way to crash a production node if you do it wrong.
Never use :dbg or :erlang.trace directly in production. They have no rate limiting. A function called 50,000 times per second will generate 50,000 trace messages per second; your shell drowns, the node's memory spikes, and you've turned a debugging session into a second incident.
Use :recon_trace instead. It has built-in rate limiting:
# Trace calls to MyApp.Worker.process/2, max 100 traces total
:recon_trace.calls({MyApp.Worker, :process, 2}, 100)
# Trace with return values (return_trace goes in the match spec,
# not the options list)
:recon_trace.calls({MyApp.Worker, :process, [{:_, [], [{:return_trace}]}]}, 100)
# Trace all functions in a module, at most 50 traces per 1000 milliseconds
:recon_trace.calls({MyApp.Worker, :_, :_}, {50, 1000}, [])
The rate-limiting format {max_traces, time_in_ms} is your safety valve: no more than max_traces trace messages per time window. Start conservative. You can always increase the limit; you can't un-crash a node.
To stop all tracing:
:recon_trace.clear()
Match Specs for Surgical Tracing
Sometimes you only want to trace calls with specific arguments:
# Trace only when the first argument is :error
:recon_trace.calls(
{MyApp.Handler, :handle_event, [{[:error, :_], [], [{:return_trace}]}]},
100
)
Match specs are Erlang's data representation of patterns, guards, and actions; the format is [{pattern, guards, actions}]. They're dense and the documentation isn't great, but learning them is the difference between useful traces and noise.
Tracing by Process
You can restrict tracing to specific processes:
# Trace calls only from a specific process (the :pid option restricts
# which processes are traced; :scope controls local vs. global calls)
pid = Process.whereis(MyApp.ProblematicWorker)
:recon_trace.calls({MyApp.Database, :query, :_}, 100, [{:pid, pid}])
This cuts the noise dramatically when you already know which process is misbehaving but want to understand what it's calling downstream.
GenServer Debugging with :sys
The :sys module provides introspection hooks for any process built on OTP behaviors. Every GenServer, GenStage, and GenStateMachine supports these functions — no dependencies required.
pid = Process.whereis(MyApp.Worker)
# Get the current state
:sys.get_state(pid)
# Get formatted status (includes state, message queue, etc.)
:sys.get_status(pid)
# Enable trace messages for this process
:sys.trace(pid, true)
# Disable tracing
:sys.trace(pid, false)
The trace/2 function outputs debug messages to the shell for every event the process handles. Unlike :recon_trace, this is process-specific and relatively low overhead.
For stuck processes, :sys.get_status/1 reveals whether the process is waiting for a message, handling a call, or blocked in a function.
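One wrinkle: a truly wedged process may never answer the system message, in which case :sys.get_status/1 blocks your shell along with it. The two-argument form takes a timeout; a sketch of guarding against that:

```elixir
# :sys functions exit (rather than raise) on timeout, so use catch.
try do
  :sys.get_status(pid, 500)
catch
  :exit, {:timeout, _} -> :process_not_responding_to_system_messages
end
```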
Suspending and Resuming Processes
When you need to freeze a process to inspect it without it processing more messages:
# Suspend the process — it stops handling messages
:sys.suspend(pid)
# Inspect state while frozen
state = :sys.get_state(pid)
# Resume normal operation
:sys.resume(pid)
This matters most when debugging race conditions. Suspend the suspect process, examine its state, then resume; messages queue up during suspension and process normally after. The process doesn't know it was paused.
Replacing State in Running Processes
In emergencies, you can modify a GenServer's state without restarting it:
:sys.replace_state(pid, fn state ->
# Return the new state
%{state | stuck_flag: false}
end)
Use this sparingly. It violates the normal message-passing model and can introduce inconsistencies. But when you need to unstick a process at 3 AM without a deployment, it's there.
War Stories
The Binary That Wouldn't Die
A Phoenix application leaked 2GB of memory over 48 hours. Standard metrics showed nothing — no single process with outsized memory. :recon.bin_leak(20) told a different story: a GenServer that processed file uploads was storing parsed headers in its state.
The headers were extracted as sub-binaries from the original upload. Tiny slices, a few bytes each. But the reference-counting system kept the entire multi-megabyte upload binary alive because those sub-binaries still pointed into it.
The fix was almost insulting in its simplicity:
def parse_header(<<header::binary-size(4), _rest::binary>>) do
{:ok, :binary.copy(header)} # Breaks the reference to the parent binary
end
:binary.copy/1 creates a new, independent binary. Thirty seconds of inspection. Three lines of code. Two gigabytes reclaimed.
The Queue That Ate the Node
A messaging application experienced cascading latency. :recon.proc_count(:message_queue_len, 5) showed a single process with 340,000 pending messages.
The process was a GenServer that wrote events to an external database. The database had slowed down due to an unrelated index problem; the GenServer's mailbox filled faster than it could drain. Classic unbounded queue.
The immediate fix: kill the process. Supervision restarted it with an empty queue, and backpressure from the caller side kicked in.
The permanent fix took longer — implementing a bounded mailbox pattern using GenStage with explicit backpressure signals. The architectural lesson was more interesting than the debugging itself: every GenServer that talks to an external service is an unbounded queue waiting to happen, unless you've explicitly designed it not to be.
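A minimal version of the load-shedding side of that idea can fit in a single handle_cast. The names here (MyApp.EventWriter, @max_queue, write_to_database) are illustrative, not from the incident:

```elixir
defmodule MyApp.EventWriter do
  use GenServer

  # Above this many queued messages, drop instead of buffering forever.
  @max_queue 10_000

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts), do: {:ok, %{dropped: 0}}

  @impl true
  def handle_cast({:write, event}, state) do
    {:message_queue_len, len} = Process.info(self(), :message_queue_len)

    if len > @max_queue do
      # Shed load; a counter in state makes the drops observable.
      {:noreply, %{state | dropped: state.dropped + 1}}
    else
      write_to_database(event)
      {:noreply, state}
    end
  end

  # Stand-in for the real external write.
  defp write_to_database(_event), do: :ok
end
```

Real backpressure, as in the GenStage fix above, beats dropping; the sketch only shows how cheap the mailbox-length check is.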
The Runaway Recursion
CPU usage spiked to 100% on two of eight schedulers. :recon.proc_window(:reductions, 5, 1000) identified two processes burning through reductions at 10x the rate of anything else.
:erlang.process_info(pid, :current_stacktrace) showed both processes deep in a recursive function — no termination condition for a specific edge case in the input data. The fix required a code deployment, but the diagnosis took under two minutes. Without BEAM introspection, we'd have been reading logs and guessing for hours.
The Silent Deadlock
Two GenServers called each other synchronously. Under normal load, the calls completed fast enough that the default 5-second timeout never triggered. Under heavy load, both processes occasionally called each other simultaneously, each waiting for the other to respond. Classic deadlock.
The symptom was intermittent timeouts with no obvious pattern. :sys.get_status/1 on both processes showed them in a waiting state; the stack traces revealed they were both blocked in GenServer.call/3, waiting for each other.
The fix: convert one of the calls to cast and handle the response asynchronously. What started as a debugging session revealed an architectural flaw that had been latent for months. That's the thing about production debugging on the BEAM — you often find problems you didn't know you had.
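The hazard reduces to a few lines. With hypothetical modules ServerA and ServerB, each handling a call by synchronously calling the other:

```elixir
# In ServerA: handle_call blocks this process until ServerB replies.
def handle_call(:ping, _from, state) do
  # If ServerB is simultaneously inside its own GenServer.call back to
  # ServerA, neither can reply; both exit when the 5-second timeout fires.
  reply = GenServer.call(ServerB, :get_value)
  {:reply, reply, state}
end
```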
The Discipline of It
These tools won't save you if you use them carelessly. I've watched engineers connect a remote shell and start typing exploratory code without a hypothesis; twenty minutes later they've learned a lot about the system but nothing about the problem.
Hypothesize first, then instrument. Start with :recon.proc_count and :recon.proc_window — they add almost no overhead. Escalate to tracing only when you need to see call patterns. Escalate to Observer when you need the visual map. Each level adds cost; don't skip to the expensive tool because it's more impressive.
Document what you find. The patterns that cause problems in your system will recur. Build runbooks. And practice in staging — the first time you connect a remote shell should not be during an incident at 3 AM with your phone still buzzing.
The BEAM gives you the ability to understand running systems while they run. Most platforms make you fly blind, piecing together behavior from logs and metrics after the fact. Elixir lets you open the hood while the engine is running. I keep coming back to that; it's the reason I reach for this stack when reliability actually matters.