qa-search-relevance

Search relevance testing: 6 skills (elasticsearch-relevance-tests, hybrid-search-eval-author, judgment-list-author, opensearch-relevance-tests, solr-relevance-tests, vector-search-precision-tests) and 1 agent (relevance-regression-reviewer). IR-metrics-driven NDCG / MRR / Recall@k regression detection.

Part of role bundle: qa-role-ai

IR-metrics-driven search relevance testing - judgment lists, NDCG / MRR / Recall@k, and the vector recall@k vs latency Pareto curve. Six skills plus one reviewer agent that synthesizes per-query regression analysis across term-based and vector search.

Components

Type	Name	Description
Skill	elasticsearch-relevance-tests	`_rank_eval` API; judgment lists; metrics (Precision@K, Recall@K, MRR, DCG/NDCG, ERR); per-query regression detection; Quepid + Splainer integration
Skill	opensearch-relevance-tests	Search Relevance Workbench; reuse ES judgment format; neural query DSL; hybrid (BM25 + neural) ranking + pipeline weighting; ES → OS migration parity
Skill	vector-search-precision-tests	Brute-force ground truth; recall@k vs latency Pareto; HNSW M / ef_construct / ef sweep; embedding-model-upgrade drift; ANN-Benchmarks framework
Agent	relevance-regression-reviewer	Adversarial reviewer; per-query regression detection; refuses when head queries drop > 0.05 OR when judgments are stale (> 50% unrated)
Skill	solr-relevance-tests	Apache Solr relevance testing: LTR, debugQuery score explain, edismax tuning, nDCG checks.
Skill	judgment-list-author	Bootstrap human-judgment ground-truth lists: query sampling, grading scales, kappa, Quepid, pooling.
Skill	hybrid-search-eval-author	Evaluate hybrid retrieval (BM25 + vector + reranker) with RRF fusion and nDCG/MRR.

Install

/plugin marketplace add testland/qa
/plugin install qa-search-relevance@testland-qa

Skills

elasticsearch-relevance-tests

Author Elasticsearch relevance regression tests using the Ranking Evaluation API (`POST <index>/_rank_eval`) - judgment lists (query + expected docs at ranks), per-query metrics (Precision@K, Recall@K, MRR, DCG, ERR), reproducible test corpora; pair with Quepid + Splainer for interactive judgment authoring.
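
As a minimal sketch of what a judgment list looks like on the wire, the helper below builds a `_rank_eval` request body with NDCG@k as the metric. The function name, the `title` field, and the `products` index are hypothetical; the body is only constructed and printed, so sending it (`POST <index>/_rank_eval`) is left to whatever HTTP client the team uses.

```python
import json


def build_rank_eval_body(judgments, index, field="title", k=10):
    """Build a _rank_eval body from {query_text: {doc_id: grade}}.

    `index` and `field` are illustrative assumptions, not fixed names.
    """
    requests = []
    for i, (query_text, grades) in enumerate(judgments.items()):
        requests.append({
            "id": f"q{i}",  # per-query id, reported back in the response details
            "request": {"query": {"match": {field: query_text}}},
            "ratings": [
                {"_index": index, "_id": doc_id, "rating": grade}
                for doc_id, grade in grades.items()
            ],
        })
    # "dcg" with normalize=true asks Elasticsearch for NDCG at cutoff k
    return {"requests": requests, "metric": {"dcg": {"k": k, "normalize": True}}}


body = build_rank_eval_body({"red shoes": {"d1": 3, "d2": 0}}, index="products")
print(json.dumps(body, indent=2))
```

Swapping the `metric` object for `precision`, `recall`, `mean_reciprocal_rank`, or `expected_reciprocal_rank` keeps the same judgment list and changes only how it is scored.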

hybrid-search-eval-author

Evaluates hybrid retrieval pipelines (BM25 + vector + reranker) end-to-end: authors ground-truth judgment sets, computes nDCG@k and MRR over fused results, measures the lift from Reciprocal Rank Fusion vs weighted fusion vs single-stage retrieval, and quantifies reranker (cross-encoder/Cohere/bge) impact. Use when a production system combines lexical and semantic retrieval and you need a numeric relevance baseline, fusion-strategy comparison, or evidence that a reranker is earning its latency cost.
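
To make the fusion comparison concrete, here is a small sketch of Reciprocal Rank Fusion followed by nDCG@k. It assumes the commonly used RRF constant k=60, exponential gain (2^grade - 1), and that unrated documents count as grade 0; the document IDs and grades are made up for illustration.

```python
import math


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism
    return sorted(scores, key=lambda d: (-scores[d], d))


def dcg(grades):
    """Discounted cumulative gain with exponential gain."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades))


def ndcg_at_k(ranking, judgments, k=10):
    """nDCG@k; unrated docs are treated as grade 0 (an assumption)."""
    gains = [judgments.get(d, 0) for d in ranking[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


bm25 = ["a", "b", "c"]      # hypothetical lexical ranking
vector = ["c", "a", "d"]    # hypothetical vector ranking
judgments = {"a": 3, "c": 2, "b": 0}

fused = rrf_fuse([bm25, vector])
print(fused, ndcg_at_k(fused, judgments), ndcg_at_k(bm25, judgments))
```

Comparing `ndcg_at_k` for the fused list against each single-stage list, averaged over the whole judgment set, gives the lift figure the skill reports.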

judgment-list-author

Bootstraps human-relevance judgment lists (query sets, grading scales, rater guidelines, inter-rater agreement, Quepid tooling, TREC-style pooling, and refresh cadence) that serve as ground truth for the other search-relevance skills and the relevance-regression-reviewer agent. Use when a team needs to create or refresh the judgment corpus before running NDCG / MRR / Recall@k evaluations.
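
Inter-rater agreement is usually reported as Cohen's kappa. The sketch below computes the unweighted form for two raters grading the same query-document pairs; the function name and sample grades are illustrative, and a weighted kappa may suit ordinal grading scales better.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of grades."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must grade the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same grade
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal grade distribution
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[g] * count_b[g] for g in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical grade throughout
    return (observed - expected) / (1 - expected)


print(cohens_kappa([0, 1, 2, 3], [0, 1, 2, 3]))
print(cohens_kappa([1, 1, 0, 0], [1, 0, 1, 0]))
```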

opensearch-relevance-tests

Author OpenSearch relevance tests with Search Relevance Workbench (judgment lists, query sets, experiments), the `_rank_eval` API (inherited from the Elasticsearch codebase OpenSearch was forked from), and hybrid BM25 + neural ranking eval. Reuse the Elasticsearch judgment list format; document the differences (neural search query DSL, hybrid weighting via the `normalization-processor` in a search pipeline, default model IDs via `neural_query_enricher`).
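
When sweeping hybrid weights, it can help to reproduce the score combination offline before touching the cluster. The sketch below assumes min-max normalization followed by a weighted arithmetic mean, which is one common configuration; it approximates the idea and is not OpenSearch's exact implementation. The scores and weights are invented.

```python
def min_max(scores):
    """Min-max normalize a {doc_id: score} map into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}


def combine(bm25_scores, neural_scores, weights=(0.3, 0.7)):
    """Weighted arithmetic mean of normalized scores; missing docs score 0."""
    norm_b, norm_n = min_max(bm25_scores), min_max(neural_scores)
    w_b, w_n = weights
    combined = {
        d: (w_b * norm_b.get(d, 0.0) + w_n * norm_n.get(d, 0.0)) / (w_b + w_n)
        for d in set(norm_b) | set(norm_n)
    }
    return sorted(combined, key=lambda d: (-combined[d], d))


bm25 = {"a": 12.0, "b": 8.0, "c": 4.0}    # hypothetical BM25 scores
neural = {"c": 0.9, "a": 0.5, "d": 0.1}   # hypothetical neural scores

print(combine(bm25, neural, weights=(0.3, 0.7)))
print(combine(bm25, neural, weights=(0.7, 0.3)))
```

Feeding each weighted ranking into the same judgment list shows which weighting is worth configuring in the real pipeline.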

solr-relevance-tests

Tests Apache Solr search relevance by querying a test core, asserting ranking and score expectations, uploading LTR feature stores and models via the `/schema/feature-store` and `/schema/model-store` REST APIs, using `debugQuery` for per-document score explain, tuning eDisMax parameters (`qf`, `pf`, `mm`, `bq`), and computing judgment-driven nDCG checks against pinned corpora. Use when the search stack runs Apache Solr (enterprise, SolrCloud, or embedded) and you need a pre-deploy relevance gate or LTR model verification.
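
As a rough illustration of an eDisMax relevance check, the sketch below builds a `/select` URL with `debugQuery` enabled and provides a simple ranking assertion helper. The host, core name, field boosts, and document IDs are hypothetical, and the request itself is not sent here.

```python
from urllib.parse import urlencode


def edismax_url(base_url, core, query, qf, pf, mm, bq=None, rows=10):
    """Build a Solr /select URL for an eDisMax query with score explain on."""
    params = [
        ("q", query),
        ("defType", "edismax"),
        ("qf", qf),            # query fields with boosts
        ("pf", pf),            # phrase fields with boosts
        ("mm", mm),            # minimum-should-match
        ("fl", "id,score"),
        ("rows", str(rows)),
        ("debugQuery", "true"),
    ]
    if bq:
        params.append(("bq", bq))  # additive boost query
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


def ranked_above(doc_ids, higher, lower):
    """True when `higher` appears before `lower` in the returned id list."""
    return (higher in doc_ids and lower in doc_ids
            and doc_ids.index(higher) < doc_ids.index(lower))


url = edismax_url("http://localhost:8983/solr", "test_core", "red shoes",
                  qf="title^3 body", pf="title^5", mm="2", bq="in_stock:true")
print(url)
```

In a test, the ids parsed from the response `docs` array would be passed to `ranked_above`, with the pinned corpus guaranteeing that both documents exist.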

vector-search-precision-tests

Vector search benchmarking - recall@k vs latency tradeoffs, ground-truth construction via brute-force, HNSW tuning (M / ef_construct / ef per Qdrant docs), embedding-model-upgrade drift detection. Use ANN-Benchmarks framework for cross-engine comparison; per-engine clients (Qdrant, Weaviate, pgvector, Pinecone, Elasticsearch k-NN, Milvus) for in-product tests.
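
The ground-truth step can be sketched without any engine: rank every corpus vector by exact cosine similarity, then measure how much of that exact top-k an approximate result recovers. The tiny corpus and the `approx` result below are invented; a real run would take `approx` from the engine's ANN query at a given HNSW setting.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def brute_force_top_k(query, corpus, k):
    """Exact nearest neighbours by scoring every vector in the corpus."""
    ranked = sorted(corpus, key=lambda d: (-cosine(query, corpus[d]), d))
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k that the approximate top-k recovered."""
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 0.0
    return len(set(approx_ids[:k]) & exact_top) / len(exact_top)


corpus = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
exact = brute_force_top_k([1.0, 0.0], corpus, k=2)
approx = ["a", "c"]  # hypothetical ANN result at a low ef setting
print(exact, recall_at_k(approx, exact, k=2))
```

Recording this recall next to the measured query latency for each parameter combination yields the points of the Pareto curve.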

Agents

relevance-regression-reviewer

Adversarial reviewer of search relevance changes (algorithm tuning, schema changes, embedding model upgrades). Runs the team's judgment list against the before and after states; computes per-metric deltas (NDCG / MRR / Recall@k); flags regressions per query; suggests which new judgments are needed when too many docs go unrated. Refuses to approve (✅) when net relevance drops or when judgment coverage falls below threshold.
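
The gating logic can be sketched roughly as follows. The 0.05 head-query drop and 50% unrated thresholds come from the Components table; treating "net relevance" as the mean per-query delta, along with the function and field names, is an assumption for illustration and not the agent's actual implementation.

```python
def review(before, after, head_queries, unrated_fraction,
           max_head_drop=0.05, max_unrated=0.50):
    """Compare per-query metric values and decide whether to approve."""
    # Only queries present in both runs can be compared
    deltas = {q: after[q] - before[q] for q in before if q in after}
    regressions = sorted(q for q, d in deltas.items() if d < 0)
    reasons = []

    head_drops = sorted(q for q in head_queries
                        if deltas.get(q, 0.0) < -max_head_drop)
    if head_drops:
        reasons.append(f"head queries dropped more than {max_head_drop}: {head_drops}")

    # Assumption: net relevance is the mean of the per-query deltas
    mean_delta = sum(deltas.values()) / len(deltas) if deltas else 0.0
    if mean_delta < 0:
        reasons.append(f"net relevance dropped (mean delta {mean_delta:.3f})")

    if unrated_fraction > max_unrated:
        reasons.append(f"judgments stale: {unrated_fraction:.0%} of results unrated")

    return {"approved": not reasons, "regressions": regressions,
            "deltas": deltas, "reasons": reasons}


before = {"shoes": 0.80, "red dress": 0.60}   # hypothetical NDCG@10 per query
after = {"shoes": 0.70, "red dress": 0.65}
result = review(before, after, head_queries={"shoes"}, unrated_fraction=0.10)
print(result["approved"], result["reasons"])
```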