jepsen-patterns
Reference for Jepsen-style distributed-systems testing: the consistency-model hierarchy (linearizability vs sequential vs causal vs monotonic reads vs eventual), nemesis primitives (network partitions, clock skew, node kills), workload generators, and the Knossos (linearizability) and Elle (transactional anomaly) checkers. Reference-only because Jepsen tests are typically bespoke Clojure written per system; use this skill to evaluate vendor claims and structure your own test.
Per the Jepsen consistency docs, "A consistency model is a safety property which declares what a system can do." Jepsen tests distributed databases by injecting faults (the nemesis) and checking the operation history against a consistency model.
This skill is reference-only: Jepsen itself is a Clojure library.
When to use
Step 1 - Map the consistency model your system claims
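Per the Jepsen consistency docs, models are organized by the guarantees they make and the phenomena they prohibit; anything a model does not prohibit is still allowed:
| Model | Guarantees | Still allows |
|---|---|---|
| Linearizability | Every operation appears to take effect atomically between its invocation and completion, in a total order consistent with real time. Forbids stale reads and lost updates | Nothing at the single-object level. Multi-object anomalies such as write skew are governed by transactional models (for example serializability), not by linearizability |
| Sequential consistency | Operations appear in one total order that respects each process's own order | That order need not match real time, so a process may read stale data |
| Causal consistency | Causally related operations are observed in the same order by every process | Causally unrelated (concurrent) operations may be observed in different orders by different processes |
| Monotonic reads | Once a process reads a value, its later reads never return an older one | Different clients may see different states at the same time |
| Eventual consistency | Replicas converge once writes stop | Stale reads and windows of inconsistency before convergence |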
Your system claims one of these (or a hybrid: snapshot isolation, read-your-writes, etc.). The test must match the claim.
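To make one of the weaker guarantees concrete, here is a minimal sketch of a monotonic-reads check over a hand-written history. It assumes a single key whose value is a version number that only grows, so "older" means "smaller". The op keys (`:process`, `:type`, `:f`, `:value`) mirror the general shape of Jepsen ops, but `monotonic-reads?` is a hypothetical helper, not part of Jepsen.

```clojure
(defn monotonic-reads?
  "True if, for every process, each successful read returns a version
  at least as new as that process's previous read."
  [history]
  (->> history
       (filter #(and (= :ok (:type %)) (= :read (:f %))))
       (group-by :process)          ; reads stay in history order per process
       vals
       (every? #(apply <= (map :value %)))))

;; Process 1 sees version 3 before process 0 sees version 2:
;; cross-client divergence, which monotonic reads still allows.
(def ok-history
  [{:process 0, :type :ok, :f :read, :value 1}
   {:process 1, :type :ok, :f :read, :value 3}
   {:process 0, :type :ok, :f :read, :value 2}])

;; Process 0 goes backwards from version 3 to version 2.
(def bad-history
  [{:process 0, :type :ok, :f :read, :value 3}
   {:process 0, :type :ok, :f :read, :value 2}])

(monotonic-reads? ok-history)  ; => true
(monotonic-reads? bad-history) ; => false
```

The same `ok-history` would fail a linearizability check if the reads did not overlap in time, which is why the test has to match the model the system claims.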
Step 2 - Pick a nemesis
Nemesis primitives Jepsen ships:
| Nemesis | What it does |
|---|---|
| Partition | Splits the cluster into N groups; communication between groups is blocked, while nodes within a group can still talk to each other |
| Crash | Hard-kills a process |
| Pause | SIGSTOPs a process (hangs without disconnecting) |
| Clock skew | Shifts or strobes the system clock on individual nodes |
| Slow disk | Adds I/O latency |
| Bitflip | Corrupts disk contents |
Combine nemeses (partition + crash + clock skew) to find compound bugs.
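A partition can be described as a "grudge": for each node, the set of nodes whose traffic it should drop. The sketch below computes one for a majority/minority split. The names `split-majority` and `grudge` are illustrative only; Jepsen's `jepsen.nemesis` namespace has its own partitioning helpers, which a real test should use instead.

```clojure
(require '[clojure.set :as set])

(defn split-majority
  "Splits nodes into [minority majority]; the first (quot n 2) nodes
  form the minority."
  [nodes]
  (let [k (quot (count nodes) 2)]
    [(take k nodes) (drop k nodes)]))

(defn grudge
  "Given a collection of node groups, returns a map of
  node -> set of nodes it should refuse traffic from
  (every node outside its own group)."
  [groups]
  (let [all (set (apply concat groups))]
    (into {}
          (for [group groups
                node  group]
            [node (set/difference all (set group))]))))

(grudge (split-majority ["n1" "n2" "n3" "n4" "n5"]))
;; => {"n1" #{"n3" "n4" "n5"}, "n2" #{"n3" "n4" "n5"},
;;     "n3" #{"n1" "n2"}, "n4" #{"n1" "n2"}, "n5" #{"n1" "n2"}}
;; (the print order inside each set may differ)
```

Applying the grudge (for example with firewall rules on each node) is left out here; the point is only that nodes inside a group keep talking while traffic between groups is cut.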
Step 3 - Generator: construct the workload
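A Jepsen workload is a stream of per-client operations: each client invokes a read / write / cas / append and records the outcome as ok (the operation succeeded), fail (it definitely did not happen), or info (indeterminate, for example after a timeout).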
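Sketch of the shape (Jepsen's generator DSL is Clojure):

    (require '[jepsen.generator :as gen])
    ;; Functions are called per operation, so each op draws fresh values;
    ;; the small range lets a cas sometimes match the current value.
    (defn r   [_ _] {:type :invoke, :f :read,  :value nil})
    (defn w   [_ _] {:type :invoke, :f :write, :value (rand-int 5)})
    (defn cas [_ _] {:type :invoke, :f :cas,   :value [(rand-int 5) (rand-int 5)]})
    (gen/mix [r w cas])

N concurrent clients run these operations against the system, and the outcomes are recorded as a history (an ordered list of invocations + completions).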
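To show what such a history looks like as data, here is a minimal sketch that pairs each invocation with its completion. It assumes each process has at most one operation in flight and that ops are plain maps with `:process` and `:type`; real Jepsen histories carry more fields, and `pair-ops` is a hypothetical helper.

```clojure
(defn pair-ops
  "Pairs each :invoke with the next completion (:ok, :fail, or :info)
  from the same process. Returns a vector of [invocation completion]."
  [history]
  (let [step (fn [{:keys [pending pairs] :as acc} op]
               (if (= :invoke (:type op))
                 ;; remember the open invocation for this process
                 (assoc-in acc [:pending (:process op)] op)
                 ;; completion: close the matching invocation
                 {:pending (dissoc pending (:process op))
                  :pairs   (conj pairs [(get pending (:process op)) op])}))]
    (:pairs (reduce step {:pending {} :pairs []} history))))

;; Two clients whose operations overlap in time.
(def history
  [{:process 0, :type :invoke, :f :write, :value 1}
   {:process 1, :type :invoke, :f :read,  :value nil}
   {:process 0, :type :ok,     :f :write, :value 1}
   {:process 1, :type :ok,     :f :read,  :value 1}])

(count (pair-ops history)) ; => 2
```

The interleaving matters: the read was invoked before the write completed, so the two operations are concurrent and a checker may order them either way. An info completion is harder still, because the operation may or may not have taken effect.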
Step 4 - Check the history with Knossos / Elle
| Checker | Use |
|---|---|
| Knossos | Linearizability checker for register-style ops (read/write/cas) |
| Elle | Transactional anomaly checker (G0/G1a/G1b/G1c, G-nonadjacent, G-single, G2-item, G2) - finds dirty/non-monotonic/non-repeatable read violations |
Both surface counterexamples (specific operation sequences) that violate the claimed model. Counterexamples are the value: vendor claim says "linearizable"; checker says "here's an op sequence that isn't" → you have evidence.
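The idea behind a linearizability check can be sketched as a brute-force search: try every ordering of the operations that respects real time and see whether any of them is legal for a single register. This is a toy under stated assumptions, not how Knossos is implemented: every op is assumed to have completed successfully, `:start` and `:end` are hypothetical timestamps, and the search is exponential in the number of ops.

```clojure
(defn step
  "Applies op to a register holding `state`. Returns the new state, or
  ::inconsistent if the op could not have succeeded against that state."
  [state {:keys [f value]}]
  (case f
    :write value
    :read  (if (= state value) state ::inconsistent)
    :cas   (let [[expected replacement] value]
             (if (= state expected) replacement ::inconsistent))))

(defn linearizable?
  "True if the ops can be arranged in a total order that respects real
  time and is legal for a register starting at `state`."
  [state ops]
  (or (empty? ops)
      (boolean
       (some (fn [op]
               (let [others (remove #(identical? op %) ops)]
                 (and
                  ;; op may go first only if no other op finished before it began
                  (not-any? #(< (:end %) (:start op)) others)
                  (let [state' (step state op)]
                    (and (not= ::inconsistent state')
                         (linearizable? state' others))))))
             ops))))

;; Stale read: the write of 2 finished (t=4) before the read began (t=5),
;; yet the read returned the older value 1.
(def stale
  [{:f :write, :value 1, :start 0, :end 1}
   {:f :write, :value 2, :start 2, :end 4}
   {:f :read,  :value 1, :start 5, :end 6}])

;; Same read, now overlapping the second write, so it may take effect first.
(def concurrent
  [{:f :write, :value 1, :start 0, :end 1}
   {:f :write, :value 2, :start 2, :end 4}
   {:f :read,  :value 1, :start 3, :end 6}])

(linearizable? nil stale)      ; => false
(linearizable? nil concurrent) ; => true
```

The `stale` history is the kind of counterexample a checker reports: three operations are enough to show the claim does not hold.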
Step 5 - Workload patterns
Common workload shapes per consistency claim:
| Workload | Tests |
|---|---|
| Register (single-key R/W/CAS) | Linearizability of single-key |
| Append (per-key list, append + read) | Transactional isolation: each read list exposes the key's version order, which Elle uses to find dependency cycles |
| Set (insert + read all) | No lost insert; eventual visibility window |
| Bank transfer (txn-level read + write) | Transactional invariants (sum stays constant) |
Pick the workload closest to your system's user-facing invariants.
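As an example of checking a user-facing invariant, here is a minimal sketch of the bank check: every successful read of all balances should sum to the same total. The op shapes (a `:read` whose `:value` maps account to balance, and a `:transfer` op) are assumptions for illustration, and `bad-reads` is a hypothetical helper rather than Jepsen's own bank checker.

```clojure
(defn bad-reads
  "Returns the successful reads whose balances do not sum to `total`."
  [total history]
  (->> history
       (filter #(and (= :ok (:type %)) (= :read (:f %))))
       (remove #(= total (reduce + (vals (:value %)))))))

(def history
  [{:type :ok, :f :read,     :value {:a 60, :b 40}}
   {:type :ok, :f :transfer, :value {:from :a, :to :b, :amount 10}}
   ;; debit visible, credit missing: the read saw half of a transfer
   {:type :ok, :f :read,     :value {:a 50, :b 40}}
   {:type :ok, :f :read,     :value {:a 50, :b 50}}])

(bad-reads 100 history)
;; => ({:type :ok, :f :read, :value {:a 50, :b 40}})
```

A read that sums to 90 means a transaction observed a partial transfer, which a system claiming snapshot isolation or stronger should never allow.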
Step 6 - Reading vendor Jepsen reports
Check for these red flags:
Per the Jepsen consistency docs, Jepsen's value is that "consistency models and phenomena are often defined in terms of dependencies" - gaps in the test = gaps in confidence.
Step 7 - In-house test scoping
For your own system (custom KV / custom replication):
Out-of-the-box Jepsen test rigs exist for many systems (jepsen-io/jepsen GitHub); fork rather than start from scratch.
Anti-patterns
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Test under stable network only | Real production has partitions; bugs hide | Always include partition nemesis (Step 2) |
| Trust "we did our own consistency tests" without checker | Manual reasoning misses subtle violations | Use Knossos / Elle (Step 4) |
| Single-client workload | Concurrency bugs need concurrency | Multi-client generator (Step 3) |
| Skip clock skew because NTP is running | NTP can step the clock backward, which breaks logic that assumes time only moves forward | Include clock skew (Step 2) |
| Run for only 60s | Some bugs take hours of faults to surface | Run for hours, then bisect to the specific operations in the history |