qa-search-relevance
Search relevance testing: 6 skills (elasticsearch-relevance-tests, hybrid-search-eval-author, judgment-list-author, opensearch-relevance-tests, solr-relevance-tests, vector-search-precision-tests) and 1 agent (relevance-regression-reviewer). IR-metrics-driven NDCG / MRR / Recall@k regression detection.
Install this plugin
/plugin install qa-search-relevance@testland-qa
Part of role bundle: qa-role-ai
qa-search-relevance
IR-metrics-driven search relevance testing - judgment lists, NDCG / MRR / Recall@k, and vector recall@k vs latency Pareto curves. Six skills + one reviewer agent that synthesizes per-query regression analysis across term-based and vector search.
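As a rough illustration of the metrics named above, here is a minimal Python sketch. It is not code from the plugin: the function names and the `{doc_id: grade}` judgment format are assumptions, and NDCG uses the linear-gain form (some tools use `2^grade - 1` instead).

```python
import math

def dcg(gains):
    # Discounted cumulative gain: linear gain, log2 position discount.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked_ids, judgments, k):
    # Unjudged documents are treated as grade 0 (an assumption).
    gains = [judgments.get(d, 0) for d in ranked_ids[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

def reciprocal_rank(ranked_ids, judgments, min_grade=1):
    # 1 / rank of the first relevant result; MRR is the mean over queries.
    for rank, d in enumerate(ranked_ids, start=1):
        if judgments.get(d, 0) >= min_grade:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, judgments, k, min_grade=1):
    # Fraction of judged-relevant documents found in the top k.
    relevant = {d for d, g in judgments.items() if g >= min_grade}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)

# Hypothetical judgments for one query, and one ranking to score.
judgments = {"d1": 3, "d2": 1, "d3": 0}
ranked = ["d3", "d1", "d2"]
```

Scoring the same judgment list before and after a change, query by query, is what makes per-query regression detection possible.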
Components
| Type | Name | Description |
|---|---|---|
| Skill | elasticsearch-relevance-tests | _rank_eval API; judgment lists; metrics (Precision@K, Recall@K, MRR, DCG/NDCG, ERR); per-query regression detection; Quepid + Splainer integration |
| Skill | opensearch-relevance-tests | Search Relevance Workbench; reuse ES judgment format; neural query DSL; hybrid (BM25 + neural) ranking + pipeline weighting; ES → OS migration parity |
| Skill | vector-search-precision-tests | Brute-force ground truth; recall@k vs latency Pareto; HNSW M / ef_construct / ef sweep; embedding-model-upgrade drift; ANN-Benchmarks framework |
| Agent | relevance-regression-reviewer | Adversarial reviewer; per-query regression detection; refuses when head queries drop > 0.05 OR when judgments are stale (> 50% unrated) |
| Skill | solr-relevance-tests | Apache Solr relevance testing: LTR, debugQuery score explain, edismax tuning, nDCG checks. |
| Skill | judgment-list-author | Bootstrap human-judgment ground-truth lists: query sampling, grading scales, kappa, Quepid, pooling. |
| Skill | hybrid-search-eval-author | Evaluate hybrid retrieval (BM25 + vector + reranker) with RRF fusion and nDCG/MRR. |
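The reviewer agent's refusal rule from the table above can be sketched as a simple gate. This is an illustrative reading, not the agent's actual logic: the function name, the per-query score dicts, and the interpretation of "unrated" as the fraction of returned documents without a judgment are all assumptions.

```python
def review_gate(baseline, candidate, head_queries, unrated_ratio,
                max_head_drop=0.05, max_unrated=0.50):
    """Return refusal reasons; an empty list means the change passes."""
    reasons = []
    # Stale judgments: too many returned documents have no rating.
    if unrated_ratio > max_unrated:
        reasons.append(f"stale judgments: {unrated_ratio:.0%} of results unrated")
    # Per-query check on head queries, so one bad query is not averaged away.
    for query in head_queries:
        drop = baseline[query] - candidate[query]
        if drop > max_head_drop:
            reasons.append(f"head query '{query}' dropped by {drop:.2f}")
    return reasons

# Hypothetical per-query NDCG scores before and after a ranking change.
baseline = {"laptop": 0.80, "usb cable": 0.70}
candidate = {"laptop": 0.72, "usb cable": 0.71}
reasons = review_gate(baseline, candidate, ["laptop", "usb cable"], unrated_ratio=0.2)
```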
Install
/plugin marketplace add testland/qa
/plugin install qa-search-relevance@testland-qa
Skills
elasticsearch-relevance-tests
Author Elasticsearch relevance regression tests using the Ranking Evaluation API (`POST <index>/_rank_eval`) - judgment lists (query + expected docs at ranks), per-query metrics (Precision@K, Recall@K, MRR, DCG, ERR), reproducible test corpora; pair with Quepid + Splainer for interactive judgment authoring.
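For orientation, a `_rank_eval` request body can be assembled from a judgment list roughly as follows. This is a sketch under assumptions: the helper name, the in-memory judgment format, the `products` index, and the `title` field are hypothetical; the `requests` / `ratings` / `metric` layout follows the Ranking Evaluation API.

```python
import json

def build_rank_eval_body(judgment_list, index, k=10):
    """Build a _rank_eval body that scores each query with NDCG@k."""
    requests = []
    for query_id, entry in judgment_list.items():
        requests.append({
            "id": query_id,
            # Assumption: a simple match query on a hypothetical "title" field.
            "request": {"query": {"match": {"title": entry["query"]}}},
            "ratings": [
                {"_index": index, "_id": doc_id, "rating": rating}
                for doc_id, rating in entry["ratings"].items()
            ],
        })
    # "normalize": True turns DCG into NDCG.
    return {"requests": requests, "metric": {"dcg": {"k": k, "normalize": True}}}

# Hypothetical judgment list with a single query.
judgment_list = {
    "q_laptop": {"query": "laptop", "ratings": {"doc-1": 3, "doc-7": 1, "doc-9": 0}},
}
body = build_rank_eval_body(judgment_list, index="products")
payload = json.dumps(body, indent=2)  # send as the body of POST products/_rank_eval
```

The response carries an overall `metric_score` plus a per-query entry under `details`, which is the part a regression test would compare against a stored baseline.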
hybrid-search-eval-author
Evaluates hybrid retrieval pipelines (BM25 + vector + reranker) end-to-end: authors ground-truth judgment sets, computes nDCG@k and MRR over fused results, measures the lift from Reciprocal Rank Fusion vs weighted fusion vs single-stage retrieval, and quantifies reranker (cross-encoder/Cohere/bge) impact. Use when a production system combines lexical and semantic retrieval and you need a numeric relevance baseline, fusion-strategy comparison, or evidence that a reranker is earning its latency cost.
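Reciprocal Rank Fusion itself is small enough to sketch. The version below is illustrative rather than the skill's implementation: the function name and document IDs are made up, and `k=60` is a commonly used default rather than a value taken from this plugin.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc IDs; each list adds 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical lexical and vector result lists for one query.
bm25_hits = ["a", "b", "c"]
vector_hits = ["c", "a", "d"]
fused = rrf_fuse([bm25_hits, vector_hits])
```

Because RRF uses only ranks, it needs no score normalization across retrievers. Scoring `fused`, `bm25_hits`, and `vector_hits` against the same judgments gives the lift comparison described above.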
judgment-list-author
Bootstraps human-relevance judgment lists (query sets, grading scales, rater guidelines, inter-rater agreement, Quepid tooling, TREC-style pooling, and refresh cadence) that serve as ground truth for the other search-relevance skills and the relevance-regression-reviewer agent. Use when a team needs to create or refresh the judgment corpus before running NDCG / MRR / Recall@k evaluations.
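Inter-rater agreement is often summarized with Cohen's kappa for two raters. The sketch below assumes that statistic (the skill may use a different one, such as Fleiss' kappa for more raters), and the grades are made-up values on a 0-3 scale.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters grading the same items in the same order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must grade the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same grade.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's own grade distribution.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[g] * freq_b[g] for g in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical grade throughout
    return (observed - expected) / (1 - expected)

# Hypothetical grades from two raters on six query-document pairs.
rater_a = [3, 2, 0, 1, 3, 0]
rater_b = [3, 2, 0, 0, 3, 1]
kappa = cohens_kappa(rater_a, rater_b)
```

A low kappa usually points at ambiguous rater guidelines rather than careless raters, so it is worth checking before the list is used as ground truth.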
opensearch-relevance-tests
Author OpenSearch relevance tests with Search Relevance Workbench (judgment lists, query sets, experiments), the `_rank_eval` API (inherited from the Elasticsearch fork), and hybrid BM25 + neural ranking eval. Reuse the Elasticsearch judgment list format; document the differences (neural search query DSL; hybrid weighting via the search pipeline's `normalization-processor`, with `neural_query_enricher` supplying default model IDs).
solr-relevance-tests
Tests Apache Solr search relevance by querying a test core, asserting ranking and score expectations, uploading LTR feature stores and models via the `/schema/feature-store` and `/schema/model-store` REST APIs, using `debugQuery` for per-document score explain, tuning eDisMax parameters (`qf`, `pf`, `mm`, `bq`), and computing judgment-driven nDCG checks against pinned corpora. Use when the search stack runs Apache Solr (enterprise, SolrCloud, or embedded) and you need a pre-deploy relevance gate or LTR model verification.
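A test harness might build its eDisMax debug request along these lines. This is a sketch only: the helper name, core name, field names, and boost values are hypothetical, while `defType`, `qf`, `pf`, `mm`, `bq`, `fl`, and `debugQuery` are standard Solr request parameters.

```python
from urllib.parse import urlencode

def edismax_debug_url(base_url, core, user_query):
    """Build a /select URL that runs eDisMax and asks Solr to explain scores."""
    params = {
        "defType": "edismax",
        "q": user_query,
        "qf": "title^3 body",     # fields to search, with boosts (hypothetical)
        "pf": "title^5",          # phrase boost when terms appear together
        "mm": "2<75%",            # minimum-should-match rule
        "bq": "in_stock:true^2",  # additive boost query (hypothetical field)
        "fl": "id,score",
        "debugQuery": "true",     # adds per-document score explanations
        "wt": "json",
    }
    return f"{base_url}/{core}/select?{urlencode(params)}"

url = edismax_debug_url("http://localhost:8983/solr", "relevance_test", "usb c cable")
```

With `debugQuery` on, the JSON response includes a `debug` section whose `explain` entries show how each document's score was composed, which helps when an assertion on rank order fails.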
vector-search-precision-tests
Vector search benchmarking - recall@k vs latency tradeoffs, ground-truth construction via brute-force, HNSW tuning (M / ef_construct / ef per Qdrant docs), embedding-model-upgrade drift detection. Use ANN-Benchmarks framework for cross-engine comparison; per-engine clients (Qdrant, Weaviate, pgvector, Pinecone, Elasticsearch k-NN, Milvus) for in-product tests.
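Ground-truth construction and recall@k can be sketched without any vector engine. The example below is illustrative: the function names, the toy two-dimensional vectors, and the choice of cosine similarity are assumptions, and `ann_ids` stands in for whatever an approximate index returned.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def brute_force_top_k(query, corpus, k):
    """Exact nearest neighbours by scanning every vector (the ground truth)."""
    ranked = sorted(corpus, key=lambda doc_id: (-cosine(query, corpus[doc_id]), doc_id))
    return ranked[:k]

def ann_recall_at_k(ann_ids, truth_ids):
    """Share of the exact top-k that the approximate search also returned."""
    truth = set(truth_ids)
    return len(truth & set(ann_ids)) / len(truth)

# Toy corpus; real tests would use the production embedding model's vectors.
corpus = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
truth = brute_force_top_k([1.0, 0.0], corpus, k=2)
ann_ids = ["a", "c"]  # hypothetical result from an HNSW index
recall = ann_recall_at_k(ann_ids, truth)
```

Repeating this over a query sample for each HNSW setting, and recording latency next to recall, yields the points for the recall@k vs latency Pareto curve.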