jepsen-patterns

Reference for Jepsen-style distributed-systems testing - consistency models hierarchy (linearizability vs sequential vs causal vs monotonic-reads vs eventual), nemesis primitives (network partitions, clock skew, kill nodes), workload generators, Knossos + Elle linearizability checkers. Reference-only because Jepsen tests are typically Clojure-bespoke per system; use this skill to evaluate vendor claims and structure your own test.

Per the Jepsen consistency docs, "A consistency model is a safety property which declares what a system can do." Jepsen tests distributed databases by injecting faults (the nemesis) and checking the operation history against a consistency model.

This skill is reference-only - Jepsen itself is a Clojure library + DSL; production tests are bespoke per database. Use this skill to: read vendor "we passed Jepsen" claims with the right framing, scope a custom Jepsen-style test, or evaluate a competing data-store choice.

When to use

  • Evaluating a distributed database vendor's consistency claims.
  • Designing in-house consistency tests for a custom store (CRDT-based KV, custom replication).
  • Onboarding to a system already tested by Jepsen - read the report intelligently.

Step 1 - Map the consistency model your system claims

Per the Jepsen consistency docs, models are organized by their guarantees + the phenomena they prohibit:

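| Model | Guarantee | What it forbids or still permits |
|---|---|---|
| Linearizability | Operations appear to take effect in one total order that respects real time | Forbids stale reads and lost updates on a single object |
| Sequential consistency | One total order that respects each process's own order, but not real time | Permits real-time ordering violations across processes (a read may lag behind a write that already completed elsewhere) |
| Causal consistency | Cause-before-effect order is respected everywhere | Permits causally unrelated operations to appear in different orders to different observers |
| Monotonic reads | Once a process reads value v, none of its later reads sees an older value | Permits divergence between clients |
| Eventual consistency | Replicas converge eventually | Permits stale reads and windows of inconsistency |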

Your system claims one of these (or a hybrid: snapshot isolation, read-your-writes, etc.). The test must match the claim.
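To make one row of the table concrete, here is a minimal Python sketch of a monotonic-reads check. It is not part of Jepsen: the function name and history format are hypothetical, and it assumes every read returns a comparable version number, which many real systems do not expose.

```python
def monotonic_read_violations(reads):
    """Find reads that went backwards for the same client.

    `reads` is an iterable of (client, version) pairs, listed in the order
    each client issued them. Returns a list of
    (client, highest_version_seen, observed_version) tuples.
    """
    highest = {}      # client -> highest version that client has observed
    violations = []
    for client, version in reads:
        prev = highest.get(client)
        if prev is not None and version < prev:
            # This client already saw something newer: monotonic reads broken.
            violations.append((client, prev, version))
        else:
            highest[client] = version
    return violations


history = [("c1", 1), ("c2", 3), ("c1", 2), ("c2", 2), ("c1", 2)]
print(monotonic_read_violations(history))  # [('c2', 3, 2)]
```

Note that c1 reading version 2 after c2 has already read version 3 is not flagged: the guarantee is per client, so divergence between clients is permitted.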

Step 2 - Pick a nemesis

Nemesis primitives Jepsen ships:

| Nemesis | What it does |
|---|---|
| Partition | Splits the cluster into N groups; communication between groups is blocked |
| Crash | Hard-kills a process |
| Pause | SIGSTOPs a process (it hangs without disconnecting) |
| Clock skew | Shifts the clock each node sees (what gettimeofday() returns) |
| Slow disk | Adds I/O latency |
| Bitflip | Corrupts disk contents |

Combine nemeses (partition + crash + clock skew) to find compound bugs.
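As an illustration of what a partition nemesis has to compute, the Python sketch below splits a node list into two random groups and returns, for each node, the set of peers whose traffic it should drop (Jepsen's own code calls a map of this kind a "grudge"). The function name and signature are hypothetical, and actually enforcing the result (for example with firewall rules on each node) is out of scope here.

```python
import random


def partition_halves(nodes, rng=random):
    """Split `nodes` into two random groups.

    Returns {node: set of nodes whose traffic it should drop}, so that
    communication inside a group still works and communication between
    the two groups is blocked in both directions.
    """
    shuffled = list(nodes)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    group_a, group_b = set(shuffled[:mid]), set(shuffled[mid:])

    grudge = {}
    for node in group_a:
        grudge[node] = set(group_b)   # members of A ignore everyone in B
    for node in group_b:
        grudge[node] = set(group_a)   # and vice versa
    return grudge


plan = partition_halves(["n1", "n2", "n3", "n4", "n5"], random.Random(42))
for node in sorted(plan):
    print(node, "drops traffic from", sorted(plan[node]))
```

With five nodes this yields a minority of two and a majority of three, which is the usual shape for testing whether the minority side keeps accepting writes it should refuse.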

Step 3 - Generator: construct the workload

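A Jepsen workload is a stream of per-client operations: each client invokes read / write / cas / append and records the outcome (ok = the operation definitely happened, fail = it definitely did not, info = indeterminate, for example after a timeout).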

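Sketch in the style of the Jepsen tutorial (the Jepsen DSL is Clojure):

(require '[jepsen.generator :as gen])
;; Functions act as generators: each call yields a fresh random operation.
(defn r   [_ _] {:type :invoke, :f :read,  :value nil})
(defn w   [_ _] {:type :invoke, :f :write, :value (rand-int 5)})
(defn cas [_ _] {:type :invoke, :f :cas,   :value [(rand-int 5) (rand-int 5)]})
(gen/mix [r w cas])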

N concurrent clients hit the system; their outcomes are recorded as a history (an ordered list of invocations and completions).
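To show what such a history looks like, here is a self-contained Python sketch that runs a few client threads against an in-memory register and records every invocation and completion. Everything in it (`Register`, `random_op`, `run_clients`, the event dictionary layout) is a hypothetical stand-in for illustration, not Jepsen's API; a real test talks to a real cluster over the network.

```python
import random
import threading


class Register:
    """In-memory stand-in for the system under test (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = None

    def apply(self, f, value):
        with self._lock:
            if f == "read":
                return "ok", self._value
            if f == "write":
                self._value = value
                return "ok", value
            if f == "cas":
                old, new = value
                if self._value == old:
                    self._value = new
                    return "ok", value
                return "fail", value
        raise ValueError(f"unknown op {f!r}")


def random_op(rng):
    """Pick a random read, write, or cas with small integer values."""
    f = rng.choice(["read", "write", "cas"])
    if f == "read":
        return f, None
    if f == "write":
        return f, rng.randrange(5)
    return f, (rng.randrange(5), rng.randrange(5))


def run_clients(n_clients=3, ops_per_client=5, seed=0):
    """Run the clients concurrently and return the recorded history."""
    register = Register()
    history = []
    history_lock = threading.Lock()

    def record(event):
        with history_lock:
            history.append(event)

    def client(process):
        rng = random.Random(seed + process)
        for _ in range(ops_per_client):
            f, value = random_op(rng)
            record({"process": process, "type": "invoke", "f": f, "value": value})
            try:
                outcome, result = register.apply(f, value)
            except Exception:
                # Outcome unknown: the operation may or may not have happened.
                outcome, result = "info", value
            record({"process": process, "type": outcome, "f": f, "value": result})

    threads = [threading.Thread(target=client, args=(p,)) for p in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return history


for event in run_clients()[:6]:
    print(event)
```

The interleaving of events differs from run to run because the threads race, which is the point: the checker has to reason about whatever order the history actually recorded.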

Step 4 - Check the history with Knossos / Elle

| Checker | Use |
|---|---|
| Knossos | Linearizability checker for register-style ops (read/write/cas) |
| Elle | Transactional anomaly checker (G0/G1a/G1b/G1c, G-nonadjacent, G-single, G2-item, G2) - finds dirty/non-monotonic/non-repeatable read violations |

Both surface counterexamples (specific operation sequences) that violate the claimed model. Counterexamples are the value: vendor claim says "linearizable"; checker says "here's an op sequence that isn't" → you have evidence.
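For intuition about what a linearizability check does, the Python sketch below brute-forces every ordering of a small register history. It is not Knossos's algorithm and will not scale past a handful of operations. It also assumes the history contains only operations that completed successfully, each with distinct `start` and `end` timestamps; failed and indeterminate operations are ignored for simplicity. All names are hypothetical.

```python
def apply_op(state, op):
    """Apply one op to a register. Returns (is_consistent, new_state)."""
    f, value = op["f"], op["value"]
    if f == "write":
        return True, value
    if f == "read":
        # A read is consistent only if it observed the current state.
        return value == state, state
    if f == "cas":
        old, new = value
        return (True, new) if state == old else (False, state)
    raise ValueError(f"unknown op {f!r}")


def is_linearizable(ops, initial=None):
    """Search for a total order that respects real time and register semantics."""

    def search(remaining, state):
        if not remaining:
            return True
        for i, op in enumerate(remaining):
            # Real-time rule: op cannot go next if some other pending op
            # had already finished before op even started.
            if any(o["end"] < op["start"] for o in remaining if o is not op):
                continue
            ok, new_state = apply_op(state, op)
            if ok and search(remaining[:i] + remaining[i + 1:], new_state):
                return True
        return False

    return search(list(ops), initial)


# The read starts after write(2) completed but still returns 1: a stale read.
stale = [
    {"start": 0, "end": 1, "f": "write", "value": 1},
    {"start": 2, "end": 3, "f": "write", "value": 2},
    {"start": 4, "end": 5, "f": "read", "value": 1},
]
# Here the read overlaps write(2), so it may legally take effect first.
overlapping = [
    {"start": 0, "end": 1, "f": "write", "value": 1},
    {"start": 2, "end": 5, "f": "write", "value": 2},
    {"start": 3, "end": 4, "f": "read", "value": 1},
]
print(is_linearizable(stale))        # False
print(is_linearizable(overlapping))  # True
```

The first history is the kind of counterexample a checker hands back: three operations that no real-time-respecting order can explain.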

Step 5 - Workload patterns

Common workload shapes per consistency claim:

| Workload | Tests |
|---|---|
| Register (single-key R/W/CAS) | Linearizability of single-key |
| Append (per-key list, append + read) | Per-key history monotonicity |
| Set (insert + read all) | No lost insert; eventual visibility window |
| Bank transfer (txn-level read + write) | Transactional invariants (sum stays constant) |

Pick the workload closest to your system's user-facing invariants.
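The bank-transfer invariant is simple enough to check by hand. The Python sketch below assumes each read is a single transaction that returns the balance of every account, and that transfers only move money between accounts; the function name and data layout are hypothetical.

```python
def bank_violations(reads, expected_total):
    """Return (read_index, observed_total) for every read whose balances
    do not add up to `expected_total`.

    `reads` is a list of {account: balance} dicts, one per transactional
    read of all accounts.
    """
    problems = []
    for index, balances in enumerate(reads):
        total = sum(balances.values())
        if total != expected_total:
            problems.append((index, total))
    return problems


reads = [
    {"a": 50, "b": 50},
    {"a": 40, "b": 50},   # saw the debit from a but not the credit to b
    {"a": 30, "b": 70},
]
print(bank_violations(reads, expected_total=100))  # [(1, 90)]
```

A read that totals 90 means the transaction observed half of a transfer, which is the kind of anomaly a weaker-than-claimed isolation level lets through.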

Step 6 - Reading vendor Jepsen reports

Check for these red flags:

  • "Tested at default isolation level" → the default is often weaker than the level the marketing claims; check that the tested level matches the claim.
  • "With clock skew off" → clock skew is a common failure trigger for systems that depend on timestamps.
  • "Without disk-fsync nemesis" → disk-flush bugs are a major class.
  • Limited workload range → only read/write, no cas or transactions.

Per the Jepsen consistency docs, "consistency models and phenomena are often defined in terms of dependencies". A report can only vouch for the dependencies its workloads and nemeses actually exercised, so gaps in the test are gaps in confidence.

Step 7 - In-house test scoping

For your own system (custom KV / custom replication):

  1. Decide claim: what consistency level do you want to guarantee?
  2. Compose nemesis: at minimum partition + crash; add clock skew if timestamps used.
  3. Write workload: register-style for KV; bank-transfer-style for transactional.
  4. Run with Knossos (register) or Elle (transactional).
  5. Counterexamples → fix. Re-run. Add to CI suite.

Out-of-the-box Jepsen test rigs exist for many systems (jepsen-io/jepsen GitHub); fork rather than start from scratch.

Anti-patterns

| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Test under stable network only | Real production has partitions; bugs hide | Always include partition nemesis (Step 2) |
| Trust "we did our own consistency tests" without checker | Manual reasoning misses subtle violations | Use Knossos / Elle (Step 4) |
| Single-client workload | Concurrency bugs need concurrency | Multi-client generator (Step 3) |
| Skip clock skew if using NTP | NTP can step the clock backward, which triggers bugs | Include clock skew (Step 2) |
| Run for 60s | Bugs may take hours to surface | Run for hours; bisect to the specific operation in the history |

Limitations

  • Jepsen is Clojure-first; Python / Go ports exist but lag in features.
  • Test runs are infra-heavy: real cluster, real network, real disk. Cloud-friendly via Docker but expensive.
  • Not all bugs reproduce 100% - expect probabilistic findings.
  • This skill is a reference; actually running Jepsen requires Clojure familiarity + significant per-system engineering.
