solr-relevance-tests
Tests Apache Solr search relevance by querying a test core, asserting ranking and score expectations, uploading LTR feature stores and models via the `/schema/feature-store` and `/schema/model-store` REST APIs, using `debugQuery` for per-document score explain, tuning eDisMax parameters (`qf`, `pf`, `mm`, `bq`), and computing judgment-driven nDCG checks against pinned corpora. Use when the search stack runs Apache Solr (enterprise, SolrCloud, or embedded) and you need a pre-deploy relevance gate or LTR model verification.
Apache Solr is the primary Elasticsearch/OpenSearch alternative in enterprise search. Unlike the ES/OS _rank_eval endpoint, Solr has no single built-in IR-metrics endpoint: relevance testing is assembled from debugQuery score explain, the Learning To Rank (LTR) contrib module, eDisMax tuning, and a test harness that computes nDCG externally. This skill covers that assembly.
Nearest neighbors and differentiation:
| Skill | Engine | Differentiation axis |
|---|---|---|
| elasticsearch-relevance-tests | Elasticsearch | Built-in _rank_eval endpoint; no LTR store API |
| opensearch-relevance-tests | OpenSearch | ES-fork + neural search; different LTR surface |
| solr-relevance-tests | Apache Solr | debugQuery explain, LTR feature/model store REST, eDisMax qf/pf/mm tuning |
When to use
Step 1 - Start a test core
Per the Solr CLI reference:
bin/solr start -p 8983
bin/solr create -c test_products -d _default
Index a snapshot of your production corpus (or a representative subset). Freeze the index before running any judgment-driven tests: new documents silently shift relevance baselines.
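One way to notice an index that was not actually frozen is to pin a fingerprint of the document IDs and compare it before the relevance tests run. The sketch below is an assumption, not part of Solr: `corpus_fingerprint` and `assert_corpus_frozen` are hypothetical helpers, and the IDs would come from your own export of the test core. It only detects added or removed documents, not edits to existing ones.

```python
import hashlib


def corpus_fingerprint(doc_ids):
    """Order-independent SHA-256 over the document IDs in the test core."""
    digest = hashlib.sha256()
    for doc_id in sorted(doc_ids):
        digest.update(doc_id.encode("utf-8"))
        digest.update(b"\n")  # separator so "ab"+"c" differs from "a"+"bc"
    return digest.hexdigest()


def assert_corpus_frozen(doc_ids, pinned_fingerprint):
    """Raise AssertionError if the set of IDs differs from the pinned snapshot."""
    actual = corpus_fingerprint(doc_ids)
    if actual != pinned_fingerprint:
        raise AssertionError(
            f"Test corpus changed: expected {pinned_fingerprint[:12]}..., "
            f"got {actual[:12]}..."
        )
```

Store the pinned value next to the judgment list and regenerate both together whenever the corpus snapshot is intentionally refreshed.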
Step 2 - Build the judgment list
Same 4-point scale as elasticsearch-relevance-tests: 0 = irrelevant, 1 = somewhat, 2 = relevant, 3 = highly relevant.
query,doc_id,rating
"running shoes",SKU-1234,3
"running shoes",SKU-5678,2
"running shoes",SKU-9999,0
"red dress",SKU-2222,3
Collect judgments via query logs + click data, domain SME review, or Quepid (open source judgment UI with Solr support).
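Because every later metric depends on this file, a quick sanity check on the CSV can catch ratings outside the 0 to 3 scale and duplicate (query, doc_id) pairs before they skew nDCG. The following is a minimal sketch; `validate_judgments` is a hypothetical helper, and it assumes the three-column layout shown above with one record per line.

```python
import csv
import io

VALID_RATINGS = {0, 1, 2, 3}


def validate_judgments(csv_text: str) -> list[str]:
    """Return a list of human-readable problems found in the judgment CSV."""
    problems = []
    seen = set()
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        key = (row["query"], row["doc_id"])
        if key in seen:
            problems.append(f"line {line_no}: duplicate judgment for {key}")
        seen.add(key)
        try:
            rating = int(row["rating"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: rating {row['rating']!r} is not an integer")
            continue
        if rating not in VALID_RATINGS:
            problems.append(f"line {line_no}: rating {rating} is outside 0-3")
    return problems
```

In a test, read `tests/judgments.csv` as text and assert that the returned list is empty.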
Step 3 - Query and collect ranked results
Solr has no _rank_eval equivalent. Call the query endpoint and collect ranked doc IDs per query:
import requests

SOLR = "http://localhost:8983/solr/test_products"

def ranked_ids(query: str, rows: int = 10) -> list[str]:
    # Return the top `rows` document IDs for `query`, in ranked order.
    r = requests.get(f"{SOLR}/select", params={
        "q": query,
        "defType": "edismax",
        "qf": "title^5.0 description^1.0",
        "rows": rows,
        "fl": "id,score",
    })
    r.raise_for_status()  # surface HTTP errors instead of a KeyError below
    return [doc["id"] for doc in r.json()["response"]["docs"]]

Step 4 - Compute nDCG externally
import csv, math

def load_judgments(path: str) -> dict[str, dict[str, int]]:
    # Map each query to {doc_id: rating} from the judgment CSV.
    j: dict[str, dict[str, int]] = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            j.setdefault(row["query"], {})[row["doc_id"]] = int(row["rating"])
    return j

def dcg(ratings: list[int]) -> float:
    # Linear gain with a log2 position discount (rank 1 is divided by log2(2) = 1).
    return sum(r / math.log2(i + 2) for i, r in enumerate(ratings))

def ndcg_at_k(query: str, doc_ids: list[str],
              judgments: dict[str, int], k: int = 10) -> float:
    # Unjudged documents count as rating 0; `query` is not used in the computation.
    ranked = [judgments.get(d, 0) for d in doc_ids[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    idcg = dcg(ideal)
    return dcg(ranked) / idcg if idcg > 0 else 0.0

def test_ndcg_baseline():
    judgments = load_judgments("tests/judgments.csv")
    scores = {}
    for query, rels in judgments.items():
        ids = ranked_ids(query)
        scores[query] = ndcg_at_k(query, ids, rels)
    mean = sum(scores.values()) / len(scores)
    assert mean >= 0.75, f"Mean nDCG@10 regressed: {mean:.3f}"

Step 5 - Per-query regression guard
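A mean-only gate can stay green while a single query collapses. The sketch below uses made-up nDCG values (the queries and numbers are illustrative, not measured) and a hypothetical `dropped_queries` helper to show the mean holding at about 0.80 while one query loses 0.30.

```python
from statistics import mean


def dropped_queries(baseline, current, max_drop=0.10):
    """Queries whose score fell by more than `max_drop` (absolute) vs. baseline."""
    return sorted(
        q for q, b in baseline.items()
        if current.get(q, 0.0) - b < -max_drop
    )


# Illustrative numbers only: two queries improve, one regresses badly.
baseline = {"running shoes": 0.80, "red dress": 0.80, "rain jacket": 0.80}
current = {"running shoes": 0.95, "red dress": 0.95, "rain jacket": 0.50}

print(f"mean before={mean(baseline.values()):.3f} after={mean(current.values()):.3f}")
print("dropped:", dropped_queries(baseline, current))
```

Both means clear a 0.75 floor, so only a per-query comparison flags the regressed query.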
import json
from pathlib import Path

def test_no_query_drops_more_than_10_percent():
    # The threshold is an absolute drop of 0.10 nDCG points per query.
    baseline = json.loads(Path("tests/solr_baseline.json").read_text())
    judgments = load_judgments("tests/judgments.csv")
    for query, rels in judgments.items():
        current = ndcg_at_k(query, ranked_ids(query), rels)
        b = baseline[query]  # a KeyError means the baseline predates this query
        delta = current - b
        assert delta >= -0.10, (
            f"Query '{query}' dropped {delta:.3f} "
            f"(was {b:.3f}, now {current:.3f})"
        )
Save a new baseline after any intentional improvement:
python3 tests/capture_baseline.py > tests/solr_baseline.json

Step 6 - debugQuery for score explain
Per the Solr debugQuery reference, append debug=results&debug.explain.structured=true to receive a nested score breakdown per document:
def explain_top(query: str, rows: int = 5) -> dict:
    # Structured explain for the top `rows` results, keyed by document ID.
    r = requests.get(f"{SOLR}/select", params={
        "q": query,
        "defType": "edismax",
        "qf": "title^5.0 description^1.0",
        "rows": rows,
        "debug": "results",
        "debug.explain.structured": "true",
    })
    return r.json()["debug"]["explain"]

def test_top_doc_score_above_threshold():
    explain = explain_top("running shoes")
    top_id = next(iter(explain))
    score = explain[top_id]["value"]
    assert score >= 5.0, f"Top document score {score:.2f} below expected floor"
Use explainOther (per the Solr debugQuery reference) to compare scoring of an expected document against the actual top results:
?q=running+shoes&explainOther=id:SKU-1234&debug=results
This surfaces why SKU-1234 ranked lower than expected.
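The same comparison can be scripted. The sketch below builds the request parameters and measures how far the expected document trails the top result. It assumes the explanation comes back under `debug.explainOther`, keyed by document ID and in the same structured form as `debug.explain`; verify that against your Solr version. `explain_other_params` and `score_gap` are hypothetical helper names.

```python
def explain_other_params(query: str, expected_id: str) -> dict:
    """Query parameters requesting an explain for one specific document."""
    return {
        "q": query,
        "defType": "edismax",
        "qf": "title^5.0 description^1.0",
        "debug": "results",
        "debug.explain.structured": "true",
        "explainOther": f"id:{expected_id}",
    }


def score_gap(debug: dict, expected_id: str) -> float:
    """Top-result score minus the expected document's score.

    Assumes `debug` is the "debug" section of a Solr response with
    structured explain enabled (each entry carries a numeric "value").
    """
    top_score = max(entry["value"] for entry in debug["explain"].values())
    expected_score = debug["explainOther"][expected_id]["value"]
    return top_score - expected_score
```

Pass the parameters to `requests.get(f"{SOLR}/select", params=...)` and feed `r.json()["debug"]` to `score_gap`; a large positive gap points at the field weights or boosts worth inspecting in the explain tree.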
Step 7 - eDisMax tuning verification
Per the Solr eDisMax reference, the key parameters affecting relevance are:
| Parameter | Effect |
|---|---|
| qf | Field weights: title^5.0 description^1.0 |
| pf | Phrase proximity boost when all terms appear together |
| mm | Minimum-should-match: 75% requires 3 of 4 terms |
| bq | Additive boost query: bq=category:shoes^2.0 |
| bf | Additive function boost: bf=recip(rord(price),1,1000,1000) |
| tie | Tie-breaker across qf fields (default 0.0) |
| ps | Phrase slop: ps=3 allows 3 intervening words |
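The mm row is easy to misjudge for short queries. The sketch below works the arithmetic for a plain percentage, assuming the documented rule that the computed count is rounded down and that a negative percentage states how many clauses may be missing. `required_clauses` is a hypothetical helper and does not handle conditional specs such as `2<75%`.

```python
def required_clauses(optional_clauses: int, mm_percent: int) -> int:
    """Number of optional clauses that must match for a percentage-style mm."""
    if mm_percent >= 0:
        # Positive: that share of the clauses is required, rounded down.
        required = optional_clauses * mm_percent // 100
    else:
        # Negative: that share of the clauses may be missing, rounded down.
        allowed_missing = optional_clauses * (-mm_percent) // 100
        required = optional_clauses - allowed_missing
    # Clamp to the valid range.
    return max(0, min(optional_clauses, required))


for terms in (2, 3, 4):
    print(terms, "terms at 75% ->", required_clauses(terms, 75), "required")
```

For a four-term query this gives the 3 of 4 from the table, while a three-term query only needs 2, which is worth keeping in mind when judging short queries.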
Pin the eDisMax config in tests so a config file change is caught before deploy:
EDISMAX_PARAMS = {
    "defType": "edismax",
    "qf": "title^5.0 description^1.0 brand^3.0",
    "pf": "title^10.0",
    "mm": "75%",
    "tie": "0.1",
}

def test_edismax_params_unchanged():
    # Fails if the live handler returns different defaults
    r = requests.get(f"{SOLR}/config/requestHandler",
                     params={"componentName": "/select"})
    handler = r.json()["config"]["requestHandler"]["/select"]
    defaults = handler.get("defaults", {})
    for key, expected in EDISMAX_PARAMS.items():
        # Compare as strings: the Config API may return numeric defaults
        # (e.g. tie) as numbers depending on how solrconfig.xml declares them.
        actual = defaults.get(key)
        assert actual is not None and str(actual) == expected, (
            f"eDisMax param '{key}' changed: expected {expected!r}, "
            f"got {actual!r}"
        )

Step 8 - LTR feature store upload and verification
Per the Solr LTR reference, features are uploaded and verified through the feature store REST API:
# Upload features
curl -XPUT 'http://localhost:8983/solr/test_products/schema/feature-store' \
  --data-binary "@tests/ltr/features.json" \
  -H 'Content-type:application/json'
# Verify store contents
curl 'http://localhost:8983/solr/test_products/schema/feature-store/_DEFAULT_'
Minimal features.json with a query-match feature (SolrFeature) and a field-value feature on published_date that serves as a recency signal:
[
  {
    "name": "titleMatch",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": { "q": "title:(${query})" }
  },
  {
    "name": "recency",
    "class": "org.apache.solr.ltr.feature.FieldValueFeature",
    "params": { "field": "published_date" }
  }
]
def test_feature_store_uploaded():
    r = requests.get(
        f"{SOLR}/schema/feature-store/_DEFAULT_"
    )
    names = {f["name"] for f in r.json()["features"]}
    assert "titleMatch" in names
    assert "recency" in names

Step 9 - LTR model upload and re-ranking test
Per the Solr LTR reference, upload a MultipleAdditiveTreesModel (LambdaMART):
curl -XPUT 'http://localhost:8983/solr/test_products/schema/model-store' \
  --data-binary "@tests/ltr/lambdamart_v1.json" \
  -H 'Content-type:application/json'
Then assert the LTR re-ranked list improves nDCG vs the baseline BM25 list. Per the Solr LTR reference, the rq parameter with reRankDocs controls how many top BM25 candidates are re-scored:
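def ranked_ids_ltr(query: str, rows: int = 10,
                   rerank_docs: int = 100) -> list[str]:
    # titleMatch reads ${query}, which LTR fills from efi.query (external
    # feature information), so the user query is passed along inside rq.
    efi_query = query.replace("\\", "\\\\").replace("'", "\\'")
    rq = (f"{{!ltr model=lambdamart_v1 reRankDocs={rerank_docs} "
          f"efi.query='{efi_query}'}}")
    r = requests.get(f"{SOLR}/select", params={
        "q": query,
        "defType": "edismax",
        "qf": "title^5.0 description^1.0",
        "rq": rq,
        "rows": rows,
        "fl": "id,score,[features]",
    })
    return [doc["id"] for doc in r.json()["response"]["docs"]]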
def test_ltr_improves_ndcg():
    judgments = load_judgments("tests/judgments.csv")
    bm25_scores, ltr_scores = {}, {}
    for query, rels in judgments.items():
        bm25_scores[query] = ndcg_at_k(query, ranked_ids(query), rels)
        ltr_scores[query] = ndcg_at_k(query, ranked_ids_ltr(query), rels)
    bm25_mean = sum(bm25_scores.values()) / len(bm25_scores)
    ltr_mean = sum(ltr_scores.values()) / len(ltr_scores)
    assert ltr_mean >= bm25_mean, (
        f"LTR model did not improve nDCG: BM25={bm25_mean:.3f}, LTR={ltr_mean:.3f}"
    )

Step 10 - CI integration
# .github/workflows/solr-relevance.yml
jobs:
  relevance:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Start Solr
        run: |
          bin/solr start -p 8983
          bin/solr create -c test_products -d _default
      - name: Index test corpus
        run: python3 tests/index_corpus.py
      - name: Upload LTR feature store
        run: |
          curl -XPUT 'http://localhost:8983/solr/test_products/schema/feature-store' \
            --data-binary "@tests/ltr/features.json" \
            -H 'Content-type:application/json'
      - name: Upload LTR model
        run: |
          curl -XPUT 'http://localhost:8983/solr/test_products/schema/model-store' \
            --data-binary "@tests/ltr/lambdamart_v1.json" \
            -H 'Content-type:application/json'
      - name: Run relevance tests
        run: pytest tests/ -v --tb=short

Anti-patterns
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Mutable test core | Index changes shift baselines between runs | Snapshot + restore before each CI run |
| Only asserting aggregate nDCG | Per-query regressions hide in the mean | Per-query guard (Step 5) |
| LTR model tested without BM25 baseline | Improvement is unmeasurable | Capture BM25 nDCG first, then compare (Step 9) |
| Fetching debugQuery output without debug.explain.structured=true | String parse is fragile across Solr versions | Always use structured explain |
| Uploading model before feature store | Model references features that don't exist yet | Features first, model second (Steps 8-9) |
| Hard-coded reRankDocs=10 | Candidate pool too small; LTR can't reorder enough docs | Set reRankDocs to at least 3x the result page size |
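Two of the fixes in the table can be checked offline, before any request reaches Solr: that every feature a model references exists in the feature file, and that the re-rank pool is large enough. The sketch below assumes the model JSON lists its inputs under a top-level `features` array of objects with a `name` key, as in the Solr LTR reference examples. `missing_features` and `rerank_pool_ok` are hypothetical helpers.

```python
def missing_features(model: dict, store_features: list[dict]) -> list[str]:
    """Names the model references that are absent from the feature store file."""
    known = {feature["name"] for feature in store_features}
    return [
        feature["name"]
        for feature in model.get("features", [])
        if feature["name"] not in known
    ]


def rerank_pool_ok(rerank_docs: int, rows: int, factor: int = 3) -> bool:
    """True when the re-rank pool is at least `factor` times the page size."""
    return rerank_docs >= factor * rows
```

In a test, load `tests/ltr/features.json` and `tests/ltr/lambdamart_v1.json` with `json.load`, assert that `missing_features` returns an empty list, and assert `rerank_pool_ok(rerank_docs, rows)` for the values the application actually sends.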