Session-Based Test Management Is the Audit Trail Exploratory Testing Needed

Testland, May 28, 2026

Session-based test management turns exploratory testing into evidence: charters, time-boxed sessions, and PROOF debriefs a stakeholder can audit.

Figure: a sample charter leads to a PROOF report with a TBS split of 60% test design and execution, 30% bug investigation, and 10% session setup. The flow ends at one debrief that turns the session into an audit trail.

Microsoft's Windows 10 Creators Update Bug Bash produced 108,900 Quests completed and 115,100 feedback items in a single themed event (Windows Insider Blog, Feb 24 2017). Exploratory testing at hyperscaler scale, inside the company that ships the world's most regulated desktop OS. So why does the practice still get dismissed as cowboy clicking in most organizations?

The credibility problem is not what testers do. Managers do not distrust testers. They distrust undocumented work. An engineering lead audits a unit-test report in 30s and a CI dashboard in 60s. A tester who spent 90min poking at the new permissions screen comes back with a story, and stories do not survive contact with a sprint review.

Session-Based Test Management (SBTM) is a structured approach to exploratory testing built on three artifacts: a charter that sets the mission, a time-boxed session of 60 to 120min, and a PROOF debrief that turns the tester's notes into auditable evidence. It does not change what a good tester does. It changes what they leave behind.

Why scripted coverage misses the bugs that hurt

Scripted tests defend behaviour someone already imagined. They cannot defend the behaviour nobody thought to write down. Michael Bolton and James Bach's "Testing vs Checking" (current revision Aug 10 2024) draws the line: checking confirms expectations, testing investigates the system to find out what the expectations should have been. The distinction matters most when manual testers and automation engineers share it rather than treat it as a turf line.

The bugs that hurt production live on the testing side of that line. Permissions that interact in a way the spec did not predict. A retry path that double-charges under a network blip. An export script that rounds correctly until the locale changes. None of these are scripted-coverage failures. They are imagination failures, and scripted suites do not fix imagination.

SBTM keeps the scripted half intact and defends the testing half: the work where a human forms and discards hypotheses in real time. A checklist is the wrong tool for an investigation, and an investigation logbook is the wrong tool for a regression gate organized by change impact. You need both, and they need to look different on purpose.

What Jon Bach and James Bach actually proposed in 2000

Jon Bach originated SBTM with his brother James Bach at Hewlett-Packard in 2000. The original Satisfice paper (published Nov 1 2000, last updated Jul 7 2021) is short, blunt, and still readable. It is the citable source; everything else is community gloss.

The proposal had three parts. First, a session: an uninterrupted, time-boxed unit of exploratory work. Wikipedia's session-based testing entry pins the canonical length at 1 to 2 hours. The 60min, 90min, 120min three-tier convention on QA blogs is community folklore, not verbatim Bach. Useful in practice, but a team agreement, not a citation.

Second, a charter: a short statement of mission. The charter makes the session bounded. Without it, exploratory testing is just clicking. With it, the tester has a target, a budget, and a reason to stop.

Third, a debrief: a short conversation between the tester and a peer or lead, after the session, against a structured report. The debrief turns a tester's notebook into team knowledge. A session that ends in a debriefed report is auditable. A session that ends in a Slack message is not.

A charter in three lines, using Hendrickson's template

The most quoted charter template in the QA practitioner community is "Explore (target) / With (resources) / To discover (information)." The phrasing is widely attributed to Bach, but does not appear verbatim in the 2000 paper. The citable canonical source is Elisabeth Hendrickson's Explore It! Reduce Risk and Increase Confidence with Exploratory Testing (Pragmatic Bookshelf, 2013, ISBN 9781937785024), which uses that exact wording.

Three concrete charters for a typical web product:

CHARTER 1: Checkout flow
Explore  checkout payment-method selection
With     three test cards (Visa accepted, Mastercard declined,
         AmEx unsupported) and a network-throttled connection
To discover  state mismatches between the cart total and the
             order confirmation

CHARTER 2: Permissions screen
Explore  the team-permissions edit modal
With     a viewer, an editor, and an owner test account
To discover  any combination where a permission change does not
             match the audit-log entry

CHARTER 3: Bug-fix follow-up
Explore  the receipts page after the BUG-4488 fix
With     the original failing PDF, the retry path, and the
         export-to-CSV path
To discover  whether the fix introduced rounding errors at
             currency boundaries

Each charter scopes the surface area, enumerates the resources, and names the kind of information that would make the session worth having. The third line is the hardest to write, and the one that makes the debrief useful.
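For teams that want charters in a consistent, machine-readable shape, the template can be sketched as a small data structure. This is a minimal illustration, not part of Hendrickson's book or the Satisfice paper: the `Charter` class, its field names, and the `render` layout are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Charter:
    """One charter in the Explore / With / To discover shape (hypothetical class)."""
    title: str
    explore: str       # target: the surface area under test
    with_: str         # resources: accounts, data, tools, conditions
    to_discover: str   # information that would make the session worth having

    def render(self) -> str:
        """Return the charter as plain text, one template slot per line."""
        return "\n".join([
            f"CHARTER: {self.title}",
            f"Explore      {self.explore}",
            f"With         {self.with_}",
            f"To discover  {self.to_discover}",
        ])


permissions = Charter(
    title="Permissions screen",
    explore="the team-permissions edit modal",
    with_="a viewer, an editor, and an owner test account",
    to_discover="any combination where a permission change does not "
                "match the audit-log entry",
)
print(permissions.render())
```

Keeping the three slots as separate fields means a later review step can inspect the "To discover" line on its own.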

How TBS splits a session into three buckets

TBS is a single acronym covering three task categories: T for test design and execution, B for bug investigation and reporting, S for session setup. The Satisfice paper adds it as a time-accounting layer on top of the charter. The tester records, roughly, what share of the session went to each bucket. Wikipedia's session-based testing entry covers the same split.

Bucket	What it covers	Healthy share of session time
Test design and execution	Designing fresh tests on the fly + running them	50-70%
Bug investigation and reporting	Reproducing, isolating, and writing up defects	20-40%
Session setup	Configuring data, environments, and tooling before testing	<20%

The percentages are diagnostic, not precise. A session where 70% went to setup is a tooling-debt signal, not a failure. A session where 80% went to bug investigation is a signal the area is in worse shape than the spec admits. Summarize five sessions in a row, find setup north of 30%, and you need a test-data fixture, not a new charter template. TBS pays off most when read across sessions, not inside one.
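Reading TBS across sessions rather than inside one can be sketched in a few lines. The session log below is invented for illustration, and the function and constant names are hypothetical; only the 30% setup threshold comes from the text.

```python
from statistics import mean

# Hypothetical session log: (charter, T%, B%, S%). Figures are illustrative.
sessions = [
    ("Checkout flow", 55, 15, 30),
    ("Permissions screen", 60, 30, 10),
    ("Bug-fix follow-up", 40, 20, 40),
    ("Checkout flow", 50, 15, 35),
    ("Permissions screen", 45, 15, 40),
]

SETUP_THRESHOLD = 30  # percent; the "north of 30%" line from the text


def average_split(rows):
    """Average each TBS bucket across sessions, rejecting rows that do not sum to 100."""
    for charter, t, b, s in rows:
        if t + b + s != 100:
            raise ValueError(f"{charter}: TBS split sums to {t + b + s}, not 100")
    return {
        "T": mean(row[1] for row in rows),
        "B": mean(row[2] for row in rows),
        "S": mean(row[3] for row in rows),
    }


split = average_split(sessions)
needs_fixture = split["S"] > SETUP_THRESHOLD
print(split, "setup-heavy: consider a test-data fixture" if needs_fixture else "ok")
```

No single session here looks alarming, but the five-session average puts setup just over the threshold, which is the kind of signal a per-session reading would miss.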

The PROOF debrief, in the right order

PROOF stands for Past, Results, Obstacles, Outlook, Feelings: the five fields of the canonical SBTM debrief, in that order, per Wikipedia's session-based testing entry. Treat the order as load-bearing: it walks the debriefer from what happened to what is left to what the tester suspects, in the sequence that lets a lead spot risk.

Past is what the tester actually did, not what the charter said they would. Results is what they found: bugs, surprises, confirmations. Obstacles is what got in the way and would slow the next session. Outlook is what still needs testing. Feelings is calibrated tester judgment, not sentiment: suspicion, confidence, fatigue. "I am suspicious of how this modal handles the owner case" is risk reporting, not venting.

CHARTER 2: Permissions screen
TBS: 60% test design and execution / 30% bug investigation / 10% setup

PAST
Ran 4 permission-change flows across viewer / editor / owner.
Diffed UI state against the audit log after each.

RESULTS
BUG-4488: owner -> editor downgrade lands in UI but the audit-log
          entry shows the previous role. Reproducible.
BUG-4489: viewer cannot see the modal at all but the help text
          implies they can. Cosmetic, low priority.

OBSTACLES
Audit-log entries take up to 90s to appear. Slowed every check.

OUTLOOK
Owner -> owner transfer not yet covered. SSO-provisioned users
not yet covered. Bulk-edit modal not yet touched.

FEELINGS
Suspicious of the audit-log timing. The 90s lag suggests an
async write path that may swallow errors. Worth one more session
focused only on audit-log integrity.

A debriefer reading that report decides in 60s whether to schedule the follow-up and whether BUG-4488 blocks the release. That is the audit trail managers ask for and rarely get.
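A report shape like the one above can be enforced with a small structure that always emits the five fields in PROOF order and flags any the tester left empty. This is a sketch under assumptions: the `ProofReport` class and its methods are hypothetical, not something the Satisfice paper defines.

```python
from dataclasses import dataclass, field

# Canonical field order, as given in the text.
PROOF_ORDER = ("past", "results", "obstacles", "outlook", "feelings")


@dataclass
class ProofReport:
    """Hypothetical container for one session's PROOF notes."""
    charter: str
    past: list = field(default_factory=list)
    results: list = field(default_factory=list)
    obstacles: list = field(default_factory=list)
    outlook: list = field(default_factory=list)
    feelings: list = field(default_factory=list)

    def missing_fields(self):
        """Names of PROOF fields the tester left empty."""
        return [name for name in PROOF_ORDER if not getattr(self, name)]

    def render(self):
        """Plain-text report with the five sections in PROOF order."""
        sections = [f"CHARTER: {self.charter}"]
        for name in PROOF_ORDER:
            lines = getattr(self, name) or ["(not filled in)"]
            sections.append("\n".join([name.upper(), *lines]))
        return "\n\n".join(sections)


report = ProofReport(
    charter="Permissions screen",
    past=["Ran 4 permission-change flows across viewer / editor / owner."],
    results=["BUG-4488: owner -> editor downgrade not reflected in the audit log."],
    obstacles=["Audit-log entries take up to 90s to appear."],
    feelings=["Suspicious of the audit-log timing."],
)
print(report.missing_fields())
print(report.render())
```

An empty Outlook is worth catching before the debrief, because it is the field that seeds the next charter.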

How hyperscalers run exploratory testing under different names

Microsoft does this at scale and calls it a Bug Bash. Feature engineers author Quests: charters in everything but name, with a scoped target, resources, and an outcome to look for. Events are themed and time-boxed. The 2017 Creators Update bash hit 108,900 Quests completed and 115,100 feedback items (Windows Insider Blog). Same shape as SBTM, scaled to a population.

Google's Chat team described its own adaptation on the Google Testing Blog in a May 20 2008 post by Joel Hynoski. The team used what Hynoski called "a broader definition of a session" tied to a "daily milestone." The post is dated, but durable evidence that SBTM vocabulary crossed into a hyperscaler and survived a shipping product team.

Microsoft surfaces the practice again in a June 8 2023 Azure DevOps blog by Ravi Kumar: "testing becomes a team sport with exploratory testing whereby multiple stakeholders can collaborate on the end product to discover and solve issues faster." The same post defends manual testing as "less costly to implement and provides quicker feedback." Netflix uses "bug bash" on its tech blog (Mar 21 2019) for pre-bounty internal hunts.

The silence elsewhere is also data. Spotify, GitHub, Stripe, Airbnb, Uber, and Atlassian engineering blogs have zero first-party posts using "session-based test management" or "test charter" as current QA practice. The vocabulary lives in the practitioner community, and that gap is why engineering leads find it hard to defend the practice upward: the language they trust does not contain the words.

Three objections charter-based testing has to answer

"This is just unstructured clicking"

The answer is the artifact set. A charter scopes the work before the session starts. TBS records where the time went. The PROOF report names what was found, missed, and suspected. A peer or lead debriefs the report in 15min. Four artifacts per session, all reviewable: more documentation than most unit-test runs produce.

"How do we know what was actually covered?"

Coverage in exploration is multi-axis: product areas, risk classes, configurations, user roles, recent changes. Any single percentage is fiction. The honest dial is "areas not touched in N weeks." A weekly rollup showing the receipts page has not had a session in three weeks tells leadership more than a 90% scripted-coverage number unchanged for three months. Resist the urge to fabricate a charter-coverage percent: it will damage trust the first time a missed area produces an incident.
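The "areas not touched in N weeks" dial is simple to compute from a log of session dates. The sketch below uses invented dates and a hypothetical `untouched_areas` helper; it assumes you record at least the date of the most recent session per product area.

```python
from datetime import date, timedelta

# Hypothetical last-session date per product area.
last_session = {
    "Checkout": date(2026, 5, 26),
    "Permissions": date(2026, 5, 22),
    "Receipts": date(2026, 5, 4),
    "Settings": date(2026, 5, 25),
}


def untouched_areas(last_seen, today, weeks):
    """Return areas whose latest session is at least `weeks` weeks old, oldest first."""
    cutoff = today - timedelta(weeks=weeks)
    stale = [(area, seen) for area, seen in last_seen.items() if seen <= cutoff]
    return [area for area, _ in sorted(stale, key=lambda pair: pair[1])]


print(untouched_areas(last_session, today=date(2026, 5, 28), weeks=3))
```

The output is a list of names, not a percentage, which keeps the report honest about what it measures.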

"This does not scale"

The Windows 10 Creators Update Bug Bash is the empirical refutation: 108,900 Quests in a single themed event. SBTM scales the way agile estimation scales, by adding sessions and debriefers, not by changing the unit. The bottleneck is debrief throughput, and you solve it by training more debriefers.

Where charters complement scripted checks, and why they do not replace them

Charters do not replace automated checks: the two practices defend different things and pair cleanly when each stays on its own side of the line. Push one into the other's territory and you end up with brittle E2E suites and undocumented exploratory work at once.

Use scripted or automated checks for	Use charter-based sessions for
Regression of known behaviour	Investigation of recently changed areas
Pass/fail with an unambiguous oracle	Outcomes requiring tester judgment
Critical-path coverage at every deploy	Edge configurations and rare combinations
Contract checks against an API spec	Bug-fix follow-up and incident investigation
Compliance evidence trails	Areas added during a feature freeze

The ISTQB Foundation Level syllabus v4.0 (effective May 9 2023) covers exploratory testing in §4.4.2 as an experience-based technique. It mentions session-based testing, time-boxes, test charters, and debriefing in a few sentences, but it does not describe TBS or PROOF. Worth knowing if your organization treats ISTQB as canon: the syllabus gives you cover to do exploratory testing in sessions, but the rest of the management framework (charter templates, TBS, PROOF, the debrief protocol) you bring from the practitioner literature.

Reporting charter coverage upward without inventing fake metrics

A weekly rollup an engineering lead can read in 60s does more for SBTM adoption than any whitepaper. Three columns. Resist the urge to add a "percent complete."

Area	Activity this week	Risk read
Checkout	3 sessions, 2 charters, 4 bugs (1 P1)	Fix the P1 before next release
Permissions	1 session, 1 charter, 2 bugs	Audit-log mismatch needs design review
Receipts	0 sessions in 3 weeks	Untouched; schedule a charter
Settings	2 sessions, 0 bugs	Stable; reduce cadence

Three things make the report work. The activity column is countable, so nobody argues about the number. The risk read is a sentence a lead can act on this week, not a status colour. And the Receipts row makes the gap visible: an area untouched for three weeks is more actionable than 84% scripted coverage unchanged for three months.

Frame this with stakeholders as a risk dashboard, not a coverage dashboard. SBTM produces investigation evidence and risk hypotheses. Those are different from a regression-pass percentage, and they pair cleanly with one.

Where SBTM is heading next

Test management platforms are starting to ship native PROOF and TBS fields. Most still force charters into a Test Case shape that loses the structure, and teams paper over it with prose. Whichever platform ships the right primitives first will collect the practitioner community.

LLM transcript assistants are pulling the writing cost of a session report toward zero. A tester narrates while exploring, then asks an assistant to draft a PROOF report in the canonical order. The tester still owns the Feelings field, but the friction barrier on documenting exploratory work is collapsing.

The Microsoft 2023 Azure DevOps post is an early signal that exploratory-testing vocabulary is migrating from the practitioner community into engineering blogs. As test observability matures, expect more.

Five questions QA leads ask about charters

What does a bad charter look like, and how do you spot it before the session starts?

A bad charter is unbounded ("Test the new release") or has a vague "To discover" line ("any issues"). Spot it in pre-session review: if you cannot picture a finding that would make the session worth having, the third line is wrong. Send it back.
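Part of that pre-session review can be roughed out as a lint pass. The sketch below is a heuristic only: the phrase lists and the `charter_problems` function are assumptions invented for this example, and a human reviewer still has to judge whether the third line describes a finding worth a session.

```python
# Hypothetical heuristics; tune the phrase lists to your own team's habits.
UNBOUNDED_TARGETS = {"the new release", "the release", "the app", "everything"}
VAGUE_FINDINGS = {"any issues", "issues", "any bugs", "bugs", "problems"}


def _normalize(text):
    """Lowercase and trim so comparisons ignore case and a trailing period."""
    return text.strip().lower().rstrip(".")


def charter_problems(explore, with_, to_discover):
    """Return a list of reasons to send a charter back before the session."""
    problems = []
    if _normalize(explore) in UNBOUNDED_TARGETS:
        problems.append("Explore: target is unbounded")
    if not with_.strip():
        problems.append("With: no resources listed")
    if _normalize(to_discover) in VAGUE_FINDINGS:
        problems.append("To discover: no specific finding named")
    return problems


print(charter_problems("the new release", "", "any issues"))
print(charter_problems(
    "the team-permissions edit modal",
    "a viewer, an editor, and an owner test account",
    "any combination where a permission change does not match the audit-log entry",
))
```

An empty list means only that the charter passed the crude checks, not that it is a good charter.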

What is the simplest way to track TBS percentages without buying a test management product?

Three columns in a shared spreadsheet: charter ID, session date, T/B/S split as rough integers. Testers fill it in during the debrief, not the session. After ten rows, sort by S share descending. The top row is your next tooling-debt fix.
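If the spreadsheet is exported as CSV, the sort takes a few lines of standard-library Python. The column names and rows below are hypothetical, and the T/B/S split is stored as three integer columns rather than one cell, an assumption made so the setup share can be sorted directly.

```python
import csv
import io

# Hypothetical export of the shared spreadsheet: charter ID, session date,
# and the T/B/S split as rough integers.
SHEET = """charter_id,session_date,t,b,s
CH-1,2026-05-04,55,15,30
CH-2,2026-05-06,60,30,10
CH-3,2026-05-11,40,20,40
CH-1,2026-05-13,65,25,10
"""


def rows_by_setup_share(csv_text):
    """Parse the sheet and return rows sorted by setup share, largest first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        for bucket in ("t", "b", "s"):
            row[bucket] = int(row[bucket])
    return sorted(rows, key=lambda row: row["s"], reverse=True)


ranked = rows_by_setup_share(SHEET)
print(ranked[0]["charter_id"], ranked[0]["s"])
```

With a real export, replace `io.StringIO(csv_text)` with an opened file; the rest stays the same.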

How do you introduce charters when the team already runs heavy automation?

Pick the area where the last three production incidents slipped past the suite. Charter one 90min session against that area's recent changes. Show the engineering lead the PROOF report next to the green CI run that missed the same bug. Pitch SBTM as the layer above the suite, not a replacement.

How many parallel sessions can one lead realistically review per week?

About fifteen well-formed PROOF reports per debriefer per week, at 15min per debrief and assuming the lead is not also writing charters. The bottleneck is reading time, not session time. Past twenty per week, train a second debriefer rather than skimming. Skimmed debriefs are how the audit trail rots.
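The arithmetic behind those numbers is worth making explicit. The sketch below counts only the debrief conversation itself, not the time spent reading each report beforehand, and treats twenty reports per week as the hard ceiling per debriefer; the function and constant names are hypothetical.

```python
MINUTES_PER_DEBRIEF = 15   # from the text
COMFORTABLE_LOAD = 15      # reports per debriefer per week, from the text
CEILING = 20               # past this, the text says to train a second debriefer


def weekly_debrief_hours(reports):
    """Hours of debrief conversation implied by a weekly report count."""
    return reports * MINUTES_PER_DEBRIEF / 60


def debriefers_needed(reports):
    """Smallest number of debriefers that keeps each at or under the ceiling."""
    return max(1, -(-reports // CEILING))  # ceiling division


for reports in (COMFORTABLE_LOAD, CEILING, 21):
    print(reports, weekly_debrief_hours(reports), debriefers_needed(reports))
```

At fifteen reports the conversation time alone is 3.75 hours a week, which is why reading time on top of it becomes the constraint.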

Can an LLM run a charter unattended, and if so what is the role of the human tester?

No, and the framing is wrong. An LLM can draft the PROOF report from a narrated session, suggest follow-up charters from prior Outlook fields, and flag missing TBS data. It cannot form a Feeling, which is calibrated tester suspicion. The human owns hypothesis generation and risk judgment.

Getting started

You do not need a tool, a process change, or executive buy-in to run a first session this week.

1. Pick one product area scoped to about 90min of focused work this Friday.
2. Write one charter using Hendrickson's template: Explore (target) / With (resources) / To discover (information).
3. Run the session. Take notes. Track TBS share roughly: how much was testing, how much was bug write-up, how much was setup.
4. Spend 10min writing a PROOF report (Past, Results, Obstacles, Outlook, Feelings). Debrief with one teammate for 15min.
5. After four or five sessions and reports, share Bach's original SBTM paper with your engineering lead and propose making the cadence standing.

The cadence is the point. One session in isolation is a curiosity. Five sessions debriefed against a consistent report shape is an audit trail, and an audit trail is what gets exploratory testing out of the credibility hole.