Tags: elixir, testing, streamdata, property-testing

Your test suite is lying to you.

Not maliciously. It's doing exactly what you asked; you wrote tests for the cases you thought of, and those cases pass. The problem is everything else---the edge cases hiding in the combinatorial explosion of possible inputs that no human could enumerate by hand.

I've watched production systems fail on inputs that seemed obvious in hindsight. A Unicode character in a name field. A negative timestamp. An empty list where the code assumed at least one element. Each failure had a test suite that passed with flying colors; the tests verified what the developers imagined, not what users would actually do.

Property-based testing inverts the whole approach. Instead of testing specific examples, you define properties that should hold for all inputs, then let the computer generate thousands of test cases trying to break them. It's the difference between checking a few points on a curve and verifying the equation that defines it.


The Limits of Example-Based Testing

Traditional unit tests are example-based---you pick an input, call your function, assert the output:

test "adds two numbers" do
  assert Calculator.add(2, 3) == 5
  assert Calculator.add(0, 0) == 0
  assert Calculator.add(-1, 1) == 0
end

Three cases. Your function handles integers from negative infinity to positive infinity. Three divided by infinity is not great coverage.

You could add more examples---maybe ten, maybe a hundred. But you're still guessing which inputs matter; you're almost certainly missing the weird ones, the inputs that expose integer overflow or rounding errors or the assumption buried three functions deep that a list is never empty.

Property-based testing asks a different question: what should be true for any valid input? For addition, several properties come to mind immediately---commutativity, identity, associativity:

  • Commutativity: add(a, b) should equal add(b, a)
  • Identity: add(a, 0) should equal a
  • Associativity: add(add(a, b), c) should equal add(a, add(b, c))

These properties don't depend on specific values; they should hold whether you're adding 2 and 3 or 999,999 and -42. Generate random integers, verify the properties across thousands of combinations, and you get far more confidence than three hand-picked examples ever provide.
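Using the StreamData syntax introduced in the next section, the identity and associativity properties can be sketched directly. Calculator.add/2 is the hypothetical function from the example above; swap in your own module:

```elixir
defmodule CalculatorPropertiesTest do
  use ExUnit.Case
  use ExUnitProperties

  # Identity: adding zero should change nothing, for any integer.
  property "zero is the additive identity" do
    check all a <- integer() do
      assert Calculator.add(a, 0) == a
    end
  end

  # Associativity: grouping should not affect the result.
  property "addition is associative" do
    check all a <- integer(),
              b <- integer(),
              c <- integer() do
      assert Calculator.add(Calculator.add(a, b), c) ==
               Calculator.add(a, Calculator.add(b, c))
    end
  end
end
```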


StreamData: Elixir's Property Testing Library

StreamData is Elixir's property-based testing library, maintained by the core team. It gives you two things: generators for creating random data, and the check all macro for running property tests.

Add it to your mix.exs:

defp deps do
  [
    {:stream_data, "~> 1.0", only: [:dev, :test]}
  ]
end

Then import it in your test files:

defmodule MyApp.CalculatorTest do
  use ExUnit.Case
  use ExUnitProperties

  property "addition is commutative" do
    check all a <- integer(),
              b <- integer() do
      assert Calculator.add(a, b) == Calculator.add(b, a)
    end
  end
end

The check all macro generates random integers for a and b, then runs the assertion. By default it runs 100 cases per property. If any case fails, StreamData reports the failing input and shrinks it to the minimal reproducer.


Writing Properties That Actually Test Something

The hard part isn't the syntax---it's figuring out what properties to test. Commutativity works for math; most business logic doesn't have clean mathematical properties. Four patterns show up again and again across different domains.

Round-trip properties: encode then decode; you should get the original value back.

property "JSON encoding round-trips" do
  check all map <- map_of(string(:alphanumeric), integer()) do
    assert map == map |> Jason.encode!() |> Jason.decode!()
  end
end

Invariant properties: certain conditions should always hold after an operation. No exceptions.

property "sorting produces ordered output" do
  check all list <- list_of(integer()) do
    sorted = Enum.sort(list)
    pairs = Enum.zip(sorted, Enum.drop(sorted, 1))
    assert Enum.all?(pairs, fn {a, b} -> a <= b end)
  end
end

Oracle properties: compare your implementation against a known-correct reference---slow but trusted.

property "my_sort matches Enum.sort" do
  check all list <- list_of(integer()) do
    assert MySorter.sort(list) == Enum.sort(list)
  end
end

Idempotence: applying an operation twice produces the same result as applying it once. This one catches more bugs than you'd expect; normalization functions are notorious for breaking on the second pass.

property "normalizing email is idempotent" do
  check all email <- email_generator() do
    once = Email.normalize(email)
    twice = Email.normalize(once)
    assert once == twice
  end
end

Generators: Built-In and Custom

StreamData ships with generators for the common types:

# Primitives
integer()                    # Any integer
positive_integer()           # 1, 2, 3, ...
float()                      # Any float
boolean()                    # true or false
binary()                     # Random binary data
string(:alphanumeric)        # Letters and numbers
atom(:alphanumeric)          # Atoms from alphanumeric strings

# Collections
list_of(integer())           # [1, -3, 42, ...]
map_of(atom(:alphanumeric), string(:alphanumeric))
tuple({integer(), string(:alphanumeric)})

# Choosing from options
member_of([:pending, :active, :cancelled])
one_of([integer(), float()])

Real applications need custom generators. You build them by composing primitives with gen all---same shape as check all, but it returns a generator instead of running assertions:

def user_generator do
  gen all name <- string(:alphanumeric, min_length: 1, max_length: 100),
          email <- email_generator(),
          age <- integer(18..120) do
    %User{name: name, email: email, age: age}
  end
end

def email_generator do
  gen all local <- string(:alphanumeric, min_length: 1, max_length: 64),
          domain <- string(:alphanumeric, min_length: 1, max_length: 63) do
    "#{local}@#{domain}.com"
  end
end

A word of caution on that email generator---it only produces .com addresses with alphanumeric local parts. Good enough for testing most validation logic, but it won't catch bugs triggered by plus-addressing, internationalized domains, or the truly bizarre edge cases that RFC 5321 allows.
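If you need a bit more variety, the generator can be widened by composing one_of and member_of; this is a sketch, still nowhere near the full RFC, and the specific TLD list is arbitrary:

```elixir
# A slightly richer email generator: optional plus-addressing tags
# and a handful of TLDs instead of hard-coded ".com".
def richer_email_generator do
  gen all base <- string(:alphanumeric, min_length: 1, max_length: 40),
          tag <- one_of([constant(nil), string(:alphanumeric, min_length: 1, max_length: 10)]),
          domain <- string(:alphanumeric, min_length: 1, max_length: 63),
          tld <- member_of(["com", "org", "io", "dev"]) do
    local = if tag, do: "#{base}+#{tag}", else: base
    "#{local}@#{domain}.#{tld}"
  end
end
```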

property "users can be serialized" do
  check all user <- user_generator() do
    assert {:ok, _} = User.to_json(user)
  end
end

Shrinking: Finding the Minimal Failure

When a property fails, the random input that triggered it is usually large and noisy. Shrinking strips it down to the smallest input that still fails.

Consider this buggy function:

defmodule Buggy do
  def process(list) when length(list) > 5 do
    raise "Can't handle more than 5 elements"
  end
  def process(list), do: list
end
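A simple identity property is enough to expose it; any generated list longer than five elements raises:

```elixir
# Assumes the Buggy module defined above.
property "process returns its input unchanged" do
  check all list <- list_of(integer()) do
    assert Buggy.process(list) == list
  end
end
```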

A property test might generate [42, -7, 999, 0, 8, -3, 100] as the failing input. Seven elements, arbitrary values, noise everywhere. StreamData shrinks this to [0, 0, 0, 0, 0, 0]---six zeros. The minimal list that triggers the bug; every element reduced to its simplest form.

Shrinking works by trying progressively simpler values. For integers, it moves toward zero. For lists, it removes elements and shrinks what remains. For custom generators built with gen all, shrinking composes automatically from the component generators---you don't have to define shrink behavior yourself.

The output looks like this:

1) property users can be serialized (MyApp.UserTest)
   test/my_app/user_test.exs:10
   Failed with generated values (after 3 successful runs):

       * Clause:    user <- user_generator()
         Generated: %User{name: "", email: "@.com", age: 18}

That shrunk result tells you exactly where to look. Empty name, malformed email. The original randomly generated user might have had a 50-character name obscuring what actually matters.


Testing Ecto Schemas and Changesets

Property testing is at its most useful when validating Ecto changesets. Instead of checking a handful of invalid inputs by hand, generate thousands:

defmodule MyApp.AccountTest do
  use ExUnit.Case
  use ExUnitProperties
  alias MyApp.Accounts.User

  property "valid users pass changeset validation" do
    check all attrs <- valid_user_attrs() do
      changeset = User.changeset(%User{}, attrs)
      assert changeset.valid?, "Expected valid changeset for: #{inspect(attrs)}"
    end
  end

  property "empty email fails validation" do
    check all attrs <- valid_user_attrs() do
      bad_attrs = Map.put(attrs, :email, "")
      changeset = User.changeset(%User{}, bad_attrs)
      refute changeset.valid?
      assert {:email, _} = hd(changeset.errors)
    end
  end

  property "age under 18 fails validation" do
    check all attrs <- valid_user_attrs(),
              bad_age <- integer(-1000..17) do
      bad_attrs = Map.put(attrs, :age, bad_age)
      changeset = User.changeset(%User{}, bad_attrs)
      refute changeset.valid?
    end
  end

  defp valid_user_attrs do
    gen all name <- string(:alphanumeric, min_length: 1, max_length: 100),
            email <- email_generator(),
            age <- integer(18..120) do
      %{name: name, email: email, age: age}
    end
  end

  defp email_generator do
    gen all local <- string(:alphanumeric, min_length: 1, max_length: 64),
            domain <- string(:alphanumeric, min_length: 1, max_length: 63) do
      "#{local}@#{domain}.com"
    end
  end
end

This finds edge cases that manual tests miss. Maybe your email regex rejects single-character local parts. Maybe age validation allows nil when it shouldn't. Maybe there's a length constraint you forgot about that only surfaces when StreamData generates a 100-character name. Property tests surface these by throwing variety at your code; they're doing the tedious work of exploring the input space so you don't have to.


Testing Business Logic

Business logic is where property-based testing earns its keep. Financial calculations, state machines, pricing engines---anywhere the logic is complex enough that enumerating cases by hand is a losing proposition.

A shopping cart:

defmodule MyApp.CartTest do
  use ExUnit.Case
  use ExUnitProperties
  alias MyApp.Commerce.Cart

  property "cart total equals sum of line items" do
    check all items <- list_of(cart_item_generator(), min_length: 1) do
      cart = Cart.new(items)
      expected_total = items |> Enum.map(&(&1.price * &1.quantity)) |> Enum.sum()
      assert Cart.total(cart) == expected_total
    end
  end

  property "removing an item decreases total" do
    check all items <- list_of(cart_item_generator(), min_length: 2),
              index <- integer(0..(length(items) - 1)) do
      cart = Cart.new(items)
      item_to_remove = Enum.at(items, index)

      original_total = Cart.total(cart)
      updated_cart = Cart.remove_item(cart, item_to_remove.sku)
      new_total = Cart.total(updated_cart)

      assert new_total < original_total or item_to_remove.quantity == 0
    end
  end

  property "applying discount never increases total" do
    check all items <- list_of(cart_item_generator(), min_length: 1),
              discount_percent <- integer(0..100) do
      cart = Cart.new(items)
      original_total = Cart.total(cart)

      discounted = Cart.apply_discount(cart, discount_percent)
      discounted_total = Cart.total(discounted)

      assert discounted_total <= original_total
    end
  end

  defp cart_item_generator do
    gen all sku <- string(:alphanumeric, length: 8),
            price <- positive_integer(),
            quantity <- integer(0..100) do
      %{sku: sku, price: price, quantity: quantity}
    end
  end
end

Those properties capture business rules without hard-coding prices or quantities. If someone changes the discount calculation and accidentally makes it increase prices for certain inputs---say, a rounding error on quantities above 50---the property test catches it. That's the kind of bug that slips through three hand-picked examples and shows up six months later in a customer invoice.


Practical Considerations

Property-based tests run slower than example-based tests; they're generating and evaluating hundreds of cases instead of three. You'll want to tune iteration counts---the default is 100, which works for most things. Lower it for expensive operations, raise it for critical paths:

property "critical calculation is correct" do
  check all value <- integer(), max_runs: 1000 do
    # ...
  end
end

Note that max_runs is an option to check all, not to the property macro itself.

When a property fails, StreamData prints the seed that generated the failing case. Pin it with the initial_seed option to reproduce the failure deterministically:

property "my property" do
  check all a <- integer(), initial_seed: 12345 do
    # ...
  end
end

One thing I've learned the hard way: start with simple generators. The instinct is to build generators that produce "realistic" data---proper names, well-formed emails, sensible prices. Resist it, at least initially. A generator that spits out empty strings and zeros finds more bugs than one producing polished test fixtures. The realistic generator feels more satisfying; the dumb one catches more issues.
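If you later want the best of both, you can bias a generator toward degenerate values explicitly. This sketch uses StreamData's frequency/1; the weights are arbitrary:

```elixir
# Roughly half the generated names are degenerate on purpose:
# empty strings and maximum-length strings are what break code.
def name_generator do
  frequency([
    {4, constant("")},
    {1, constant(String.duplicate("a", 100))},
    {5, string(:alphanumeric, max_length: 100)}
  ])
end
```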

Mix property tests and example tests freely. Properties verify general behavior across the input space; example tests document specific scenarios and edge cases you've already encountered in production. They're not competing approaches. Use both.
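In practice that often means a general property sitting next to a regression example pinned after a production incident; Slug.from_title/1 here is hypothetical:

```elixir
defmodule MyApp.SlugTest do
  use ExUnit.Case
  use ExUnitProperties

  # Property: a general invariant across the whole input space.
  property "slugs never contain spaces" do
    check all title <- string(:printable, max_length: 200) do
      refute String.contains?(Slug.from_title(title), " ")
    end
  end

  # Example: one specific input that once broke production,
  # kept as documentation and as a regression guard.
  test "collapses repeated whitespace in titles" do
    assert Slug.from_title("Hello   World") == "hello-world"
  end
end
```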


What Changes When You Think in Properties

The first few property tests take longer to write than example tests. You have to think differently---not "what's the right answer for this input?" but "what should always be true, regardless of input?" That shift is uncomfortable. It forces you to articulate invariants you've been relying on implicitly; assumptions buried so deep in the code that nobody thought to write them down.

But once you have generators for your core domain types, new properties come fast. And they catch things example tests never would---the Unicode character that breaks your parser, the boundary condition at integer limits, the empty collection that shouldn't have been empty but was.

I've seen property tests catch bugs that survived months in production, hiding behind edge cases nobody had thought to check. That's the gap between testing what you imagine and testing what's actually possible. The cases you think of are the easy ones. The hard ones---the ones that wake you up at 3 AM---are exactly the cases you'd never write by hand.


What do you think of what I said?

Share your thoughts with me; you can tweet me at @allanmacgregor.