solr-relevance-tests

Tests Apache Solr search relevance by querying a test core, asserting ranking and score expectations, uploading LTR feature stores and models via the `/schema/feature-store` and `/schema/model-store` REST APIs, using `debugQuery` for per-document score explain, tuning eDisMax parameters (`qf`, `pf`, `mm`, `bq`), and computing judgment-driven nDCG checks against pinned corpora. Use when the search stack runs Apache Solr (enterprise, SolrCloud, or embedded) and you need a pre-deploy relevance gate or LTR model verification.

Apache Solr is a widely deployed alternative to Elasticsearch and OpenSearch in enterprise search. Unlike the ES/OS _rank_eval endpoint, Solr has no single built-in IR-metrics endpoint: relevance testing is assembled from debugQuery score explain, the Learning To Rank (LTR) contrib module, eDisMax tuning, and a test harness that computes nDCG externally. This skill covers that assembly.

Nearest neighbors and differentiation:

| Skill | Engine | Differentiation axis |
| --- | --- | --- |
| elasticsearch-relevance-tests | Elasticsearch | Built-in _rank_eval endpoint; no LTR store API |
| opensearch-relevance-tests | OpenSearch | ES-fork + neural search; different LTR surface |
| solr-relevance-tests | Apache Solr | debugQuery explain, LTR feature/model store REST, eDisMax qf/pf/mm tuning |

When to use

  • Production stack runs Apache Solr (standalone or SolrCloud) and you need a pre-deploy gate before changing query config, schema, or analyzers.
  • A trained LTR model (LambdaMART via MultipleAdditiveTreesModel, LinearModel, or NeuralNetworkModel) must be verified to improve or preserve nDCG before promotion.
  • eDisMax field weights (qf) or phrase boosts (pf) were edited and you need to confirm no per-query regression.
  • Score explain output shows an unexpected ranking and you need to reproduce and assert it in a test.

Step 1 - Start a test core

Per the Solr CLI reference:

bin/solr start -p 8983
bin/solr create -c test_products -d _default

Index a snapshot of your production corpus (or a representative subset). Freeze the index before running any judgment-driven tests - new documents shift relevance baselines silently.
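
One way to notice an accidentally mutated core is to compare a fingerprint of the indexed document IDs against a value recorded when the snapshot was taken. The helper below is only a sketch: `corpus_fingerprint` is a hypothetical name, not a Solr feature, and it assumes the caller supplies the IDs (for example from an `fl=id` export of the test core).

```python
import hashlib
from typing import Iterable


def corpus_fingerprint(doc_ids: Iterable[str]) -> str:
    """Return an order-independent SHA-256 digest of the given document IDs."""
    digest = hashlib.sha256()
    for doc_id in sorted(set(doc_ids)):
        digest.update(doc_id.encode("utf-8"))
        digest.update(b"\n")  # separator so ("ab", "c") differs from ("a", "bc")
    return digest.hexdigest()


# Hypothetical usage: record this value next to the snapshot, then compare in CI.
pinned = corpus_fingerprint(["SKU-1234", "SKU-5678", "SKU-9999"])
```

This only detects added or removed documents. Detecting edits to existing documents would need a content hash or a version field, which is outside this sketch.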

Step 2 - Build the judgment list

Same 4-point scale as elasticsearch-relevance-tests: 0 = irrelevant, 1 = somewhat, 2 = relevant, 3 = highly relevant.

query,doc_id,rating
"running shoes",SKU-1234,3
"running shoes",SKU-5678,2
"running shoes",SKU-9999,0
"red dress",SKU-2222,3

Collect judgments via query logs + click data, domain SME review, or Quepid (open source judgment UI with Solr support).
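
Whatever the collection method, it can help to lint the judgment file before it drives a gate. The following is a minimal sketch, assuming the three-column CSV layout shown above and the 0-3 scale; `validate_judgments` is a hypothetical helper name.

```python
import csv
import io

VALID_RATINGS = {0, 1, 2, 3}  # the 4-point scale described above


def validate_judgments(csv_text: str) -> list[str]:
    """Return human-readable problems found in a judgments CSV (empty if clean)."""
    problems: list[str] = []
    seen: set[tuple[str, str]] = set()
    reader = csv.DictReader(io.StringIO(csv_text))
    # Line 1 is the header; assumes no field spans multiple lines.
    for line_no, row in enumerate(reader, start=2):
        key = (row["query"], row["doc_id"])
        if key in seen:
            problems.append(f"line {line_no}: duplicate judgment for {key}")
        seen.add(key)
        try:
            rating = int(row["rating"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: rating {row['rating']!r} is not an integer")
            continue
        if rating not in VALID_RATINGS:
            problems.append(f"line {line_no}: rating {rating} outside 0-3")
    return problems
```

Duplicates matter because the loader in Step 4 keeps only the last rating for a (query, doc_id) pair, so a conflicting earlier row would be dropped without warning.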

Step 3 - Query and collect ranked results

Solr has no _rank_eval equivalent. Call the query endpoint and collect ranked doc IDs per query:

import requests

SOLR = "http://localhost:8983/solr/test_products"

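def ranked_ids(query: str, rows: int = 10) -> list[str]:
    """Return the top `rows` document IDs for `query`, in ranked order."""
    r = requests.get(f"{SOLR}/select", params={
        "q": query,
        "defType": "edismax",
        "qf": "title^5.0 description^1.0",
        "rows": rows,
        "fl": "id,score",
    }, timeout=10)
    r.raise_for_status()  # surface Solr errors instead of a KeyError below
    return [doc["id"] for doc in r.json()["response"]["docs"]]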

Step 4 - Compute nDCG externally

import csv, math

def load_judgments(path: str) -> dict[str, dict[str, int]]:
    j: dict[str, dict[str, int]] = {}
    with open(path) as f:
        for row in csv.DictReader(f):
            j.setdefault(row["query"], {})[row["doc_id"]] = int(row["rating"])
    return j

def dcg(ratings: list[int]) -> float:
    return sum(r / math.log2(i + 2) for i, r in enumerate(ratings))

def ndcg_at_k(query: str, doc_ids: list[str],
              judgments: dict[str, int], k: int = 10) -> float:
    ranked = [judgments.get(d, 0) for d in doc_ids[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    idcg = dcg(ideal)
    return dcg(ranked) / idcg if idcg > 0 else 0.0

def test_ndcg_baseline():
    judgments = load_judgments("tests/judgments.csv")
    scores = {}
    for query, rels in judgments.items():
        ids = ranked_ids(query)
        scores[query] = ndcg_at_k(query, ids, rels)
    mean = sum(scores.values()) / len(scores)
    assert mean >= 0.75, f"Mean nDCG@10 regressed: {mean:.3f}"
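
To make the arithmetic concrete, here is a small worked example using the three "running shoes" judgments from Step 2. The returned order is invented for illustration; the `dcg` function repeats the linear-gain formula used above so the snippet runs on its own.

```python
import math


def dcg(ratings: list[int]) -> float:
    # Linear gain, log2 discount: rank 1 divides by log2(2) = 1.
    return sum(r / math.log2(i + 2) for i, r in enumerate(ratings))


judged = {"SKU-1234": 3, "SKU-5678": 2, "SKU-9999": 0}
returned = ["SKU-9999", "SKU-1234", "SKU-5678"]  # hypothetical Solr order

gains = [judged.get(doc_id, 0) for doc_id in returned]  # [0, 3, 2]
ideal = sorted(judged.values(), reverse=True)           # [3, 2, 0]
ndcg = dcg(gains) / dcg(ideal)
print(f"DCG={dcg(gains):.3f} IDCG={dcg(ideal):.3f} nDCG={ndcg:.3f}")
```

Placing the irrelevant document at rank 1 pulls nDCG for this query down to roughly 0.68, which would fail a 0.75 floor if every query behaved this way.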

Step 5 - Per-query regression guard

import json
from pathlib import Path

def test_no_query_drops_more_than_10_percent():
    baseline = json.loads(Path("tests/solr_baseline.json").read_text())
    judgments = load_judgments("tests/judgments.csv")
    for query, rels in judgments.items():
        current = ndcg_at_k(query, ranked_ids(query), rels)
        b = baseline[query]
        delta = current - b
        assert delta >= -0.10, (
            f"Query '{query}' dropped {delta:.3f} "
            f"(was {b:.3f}, now {current:.3f})"
        )

Save a new baseline after any intentional improvement:

python3 tests/capture_baseline.py > tests/solr_baseline.json
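
The contents of `tests/capture_baseline.py` are not shown in this skill. A minimal sketch of what its core could look like follows, assuming the baseline file is a flat JSON object mapping each query to its nDCG@10. `build_baseline` is a hypothetical name, and the ranking function is passed in so the logic can be exercised without a running Solr.

```python
import json
import math
from typing import Callable


def dcg(ratings: list[int]) -> float:
    return sum(r / math.log2(i + 2) for i, r in enumerate(ratings))


def ndcg_at_k(doc_ids: list[str], judgments: dict[str, int], k: int = 10) -> float:
    ranked = [judgments.get(d, 0) for d in doc_ids[:k]]
    idcg = dcg(sorted(judgments.values(), reverse=True)[:k])
    return dcg(ranked) / idcg if idcg > 0 else 0.0


def build_baseline(
    judgments: dict[str, dict[str, int]],
    rank_fn: Callable[[str], list[str]],
    k: int = 10,
) -> dict[str, float]:
    """Compute per-query nDCG@k; rounding keeps diffs of the JSON file readable."""
    return {
        query: round(ndcg_at_k(rank_fn(query), rels, k), 4)
        for query, rels in sorted(judgments.items())
    }


def to_json(baseline: dict[str, float]) -> str:
    return json.dumps(baseline, indent=2, sort_keys=True)
```

In the script itself, something like `print(to_json(build_baseline(load_judgments("tests/judgments.csv"), ranked_ids)))` would produce the output that the shell redirect above writes to `tests/solr_baseline.json`.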

Step 6 - debugQuery for score explain

Per the Solr debugQuery reference, append debug=results&debug.explain.structured=true to receive a nested score breakdown per document:

def explain_top(query: str, rows: int = 5) -> dict:
    r = requests.get(f"{SOLR}/select", params={
        "q": query,
        "defType": "edismax",
        "qf": "title^5.0 description^1.0",
        "rows": rows,
        "debug": "results",
        "debug.explain.structured": "true",
    })
    return r.json()["debug"]["explain"]

def test_top_doc_score_above_threshold():
    explain = explain_top("running shoes")
    top_id = next(iter(explain))
    score = explain[top_id]["value"]
    assert score >= 5.0, f"Top document score {score:.2f} below expected floor"

Use explainOther (per the Solr debugQuery reference) to compare scoring of an expected document against the actual top results:

?q=running+shoes&explainOther=id:SKU-1234&debug=results

This surfaces why SKU-1234 ranked lower than expected.
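
When comparing two structured explains, it is often easier to flatten each tree into its leaf factors first. The sketch below assumes each explain node is a dict with `value`, `description`, and an optional `details` list, which is the shape structured explain output generally takes. `explain_leaves` and the sample tree are hypothetical.

```python
def explain_leaves(node: dict, path: str = "") -> list[tuple[str, float]]:
    """Flatten a structured explain tree into (description path, value) leaves."""
    label = node.get("description", "")
    here = f"{path} > {label}" if path else label
    children = node.get("details") or []
    if not children:
        return [(here, float(node.get("value", 0.0)))]
    leaves: list[tuple[str, float]] = []
    for child in children:
        leaves.extend(explain_leaves(child, here))
    return leaves


# Hypothetical explain fragment, not real Solr output.
sample = {
    "match": True,
    "value": 6.0,
    "description": "sum of:",
    "details": [
        {"match": True, "value": 5.0, "description": "weight(title:shoes)", "details": []},
        {"match": True, "value": 1.0, "description": "weight(description:shoes)"},
    ],
}

for path, value in explain_leaves(sample):
    print(f"{value:6.2f}  {path}")
```

Leaf values are factors, not necessarily additive contributions: inner nodes may combine children by sum, product, or max, so read each leaf together with its path.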

Step 7 - eDisMax tuning verification

Per the Solr eDisMax reference, the key parameters affecting relevance are:

| Parameter | Effect |
| --- | --- |
| qf | Field weights: title^5.0 description^1.0 |
| pf | Phrase proximity boost when all terms appear together |
| mm | Minimum-should-match: 75% requires 3 of 4 terms |
| bq | Additive boost query: bq=category:shoes^2.0 |
| bf | Additive function boost: bf=recip(rord(price),1,1000,1000) |
| tie | Tie-breaker across qf fields (default 0.0) |
| ps | Phrase slop: ps=3 allows 3 intervening words |
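
The mm arithmetic is worth spelling out, because the computed clause count is rounded down. The sketch below covers only a plain positive percentage. Solr's mm syntax also accepts integers, negative values, and conditional expressions, none of which are modeled here. `mm_required` is a hypothetical helper.

```python
def mm_required(optional_clauses: int, percent: int) -> int:
    """Clauses that must match for a positive mm percentage (rounded down)."""
    if optional_clauses < 0:
        raise ValueError("optional_clauses must be non-negative")
    if not 0 <= percent <= 100:
        raise ValueError("this sketch only handles percentages from 0 to 100")
    return (optional_clauses * percent) // 100


for terms in (1, 2, 4, 5):
    print(terms, "terms ->", mm_required(terms, 75), "required")
```

With mm=75%, four terms require three matches, but five terms also require only three, which is a common source of surprise when long queries match more loosely than expected.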

Pin the eDisMax config in tests so a config file change is caught before deploy:

EDISMAX_PARAMS = {
    "defType": "edismax",
    "qf": "title^5.0 description^1.0 brand^3.0",
    "pf": "title^10.0",
    "mm": "75%",
    "tie": "0.1",
}

def test_edismax_params_unchanged():
    # Fails if the live handler returns different defaults
    r = requests.get(f"{SOLR}/config/requestHandler",
                     params={"componentName": "/select"})
    handler = r.json()["config"]["requestHandler"]["/select"]
    defaults = handler.get("defaults", {})
    for key, expected in EDISMAX_PARAMS.items():
        actual = defaults.get(key)
        # Compare as strings: numeric defaults such as tie may come back as JSON numbers
        assert actual is not None and str(actual) == expected, (
            f"eDisMax param '{key}' changed: expected {expected!r}, "
            f"got {actual!r}"
        )

Step 8 - LTR feature store upload and verification

Per the Solr LTR reference, the feature store REST API:

# Upload features
curl -XPUT 'http://localhost:8983/solr/test_products/schema/feature-store' \
  --data-binary "@tests/ltr/features.json" \
  -H 'Content-type:application/json'

# Verify store contents
curl 'http://localhost:8983/solr/test_products/schema/feature-store/_DEFAULT_'

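Minimal features.json with a query-match feature (SolrFeature) and a field-value feature used as a recency signal. The ${query} placeholder is filled from external feature information, so requests that use this store pass efi.query inside rq: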

[
  {
    "name": "titleMatch",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": { "q": "title:(${query})" }
  },
  {
    "name": "recency",
    "class": "org.apache.solr.ltr.feature.FieldValueFeature",
    "params": { "field": "published_date" }
  }
]

def test_feature_store_uploaded():
    r = requests.get(
        f"{SOLR}/schema/feature-store/_DEFAULT_"
    )
    names = {f["name"] for f in r.json()["features"]}
    assert "titleMatch" in names
    assert "recency" in names
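
A malformed features.json is cheaper to catch before the upload than after it. The following sketch checks only what the example file above relies on: every entry has a `name` and a `class`, and names are unique. These checks are this sketch's assumptions, not Solr's own validation rules, and `check_features` is a hypothetical helper.

```python
import json

REQUIRED_KEYS = ("name", "class")  # assumption: minimum keys this sketch insists on


def check_features(features_json: str) -> list[str]:
    """Return problems found in a features.json payload (empty list if none)."""
    features = json.loads(features_json)
    if not isinstance(features, list):
        return ["top-level JSON value must be a list of feature objects"]
    problems: list[str] = []
    seen: set[str] = set()
    for index, feature in enumerate(features):
        for key in REQUIRED_KEYS:
            if key not in feature:
                problems.append(f"feature #{index}: missing '{key}'")
        name = feature.get("name")
        if name is not None:
            if name in seen:
                problems.append(f"feature #{index}: duplicate name '{name}'")
            seen.add(name)
    return problems
```

Running this in the same pytest session, before the curl upload step, keeps a typo in the feature file from surfacing later as a confusing model-upload failure.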

Step 9 - LTR model upload and re-ranking test

Per the Solr LTR reference, upload a MultipleAdditiveTreesModel (LambdaMART):

curl -XPUT 'http://localhost:8983/solr/test_products/schema/model-store' \
  --data-binary "@tests/ltr/lambdamart_v1.json" \
  -H 'Content-type:application/json'

Then assert the LTR re-ranked list improves nDCG vs the baseline BM25 list. Per the Solr LTR reference, the rq parameter with reRankDocs controls how many top BM25 candidates are re-scored:

def ranked_ids_ltr(query: str, rows: int = 10,
                   rerank_docs: int = 100) -> list[str]:
    # efi.query feeds the ${query} placeholder used by the titleMatch feature.
    # Assumes the query text contains no single quotes; escape them otherwise.
    rq = (f"{{!ltr model=lambdamart_v1 reRankDocs={rerank_docs} "
          f"efi.query='{query}'}}")
    r = requests.get(f"{SOLR}/select", params={
        "q": query,
        "defType": "edismax",
        "qf": "title^5.0 description^1.0",
        "rq": rq,
        "rows": rows,
        "fl": "id,score,[features]",
    })
    return [doc["id"] for doc in r.json()["response"]["docs"]]

def test_ltr_improves_ndcg():
    judgments = load_judgments("tests/judgments.csv")
    bm25_scores, ltr_scores = {}, {}
    for query, rels in judgments.items():
        bm25_scores[query] = ndcg_at_k(query, ranked_ids(query), rels)
        ltr_scores[query] = ndcg_at_k(query, ranked_ids_ltr(query), rels)
    bm25_mean = sum(bm25_scores.values()) / len(bm25_scores)
    ltr_mean = sum(ltr_scores.values()) / len(ltr_scores)
    assert ltr_mean >= bm25_mean, (
        f"LTR model did not improve nDCG: BM25={bm25_mean:.3f}, LTR={ltr_mean:.3f}"
    )
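
Before uploading a model, it can be useful to confirm locally that every feature it names exists in the feature store, since a model that references unknown features is a common upload failure. This sketch assumes the model JSON has a top-level `features` list of objects with a `name` key, as Solr LTR model files do, and that the store listing is the `features` array returned in Step 8. `missing_features` and the sample data are hypothetical.

```python
def missing_features(model: dict, store_features: list[dict]) -> list[str]:
    """Names of features the model references that are absent from the store."""
    available = {feature["name"] for feature in store_features}
    return [
        feature["name"]
        for feature in model.get("features", [])
        if feature["name"] not in available
    ]


# Hypothetical model and store contents for illustration only.
model = {
    "class": "org.apache.solr.ltr.model.MultipleAdditiveTreesModel",
    "name": "lambdamart_v1",
    "features": [{"name": "titleMatch"}, {"name": "recency"}, {"name": "priceRank"}],
    "params": {"trees": []},
}
store = [{"name": "titleMatch"}, {"name": "recency"}]

print(missing_features(model, store))
```

An empty result is the precondition for the model upload. Anything else points at a feature that must be added to the store first.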

Step 10 - CI integration

# .github/workflows/solr-relevance.yml
jobs:
  relevance:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Start Solr
        run: |
          bin/solr start -p 8983
          bin/solr create -c test_products -d _default
      - name: Index test corpus
        run: python3 tests/index_corpus.py
      - name: Upload LTR feature store
        run: |
          curl -XPUT 'http://localhost:8983/solr/test_products/schema/feature-store' \
            --data-binary "@tests/ltr/features.json" \
            -H 'Content-type:application/json'
      - name: Upload LTR model
        run: |
          curl -XPUT 'http://localhost:8983/solr/test_products/schema/model-store' \
            --data-binary "@tests/ltr/lambdamart_v1.json" \
            -H 'Content-type:application/json'
      - name: Run relevance tests
        run: pytest tests/ -v --tb=short

Anti-patterns

| Anti-pattern | Why it fails | Fix |
| --- | --- | --- |
| Mutable test core | Index changes shift baselines between runs | Snapshot + restore before each CI run |
| Only asserting aggregate nDCG | Per-query regressions hide in the mean | Per-query guard (Step 5) |
| LTR model tested without BM25 baseline | Improvement is unmeasurable | Capture BM25 nDCG first, then compare (Step 9) |
| Fetching debugQuery output without debug.explain.structured=true | String parse is fragile across Solr versions | Always use structured explain |
| Uploading model before feature store | Model references features that don't exist yet | Features first, model second (Steps 8-9) |
| Hard-coded reRankDocs=10 | Candidate pool too small; LTR can't reorder enough docs | Set reRankDocs to at least 3x the result page size |

Limitations

  • Solr has no native IR-metrics API (_rank_eval equivalent); nDCG must be computed in the test harness. ES/OS teams may prefer the built-in API.
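  • LTR requires the ltr module (a contrib in older Solr releases) to be enabled, with the ltr query parser, the features transformer, and the feature vector cache declared in solrconfig.xml. Without this configuration the {!ltr} re-rank query is not available, so verify the setup before trusting any BM25-vs-LTR comparison.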
  • Large LTR models (deep tree ensembles) may exceed ZooKeeper's buffer limits in SolrCloud; use DefaultWrapperModel with an external resource reference (per the Solr LTR reference).
  • Click-derived judgments carry position bias; correct using click models before using them as ground truth.
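
As one illustration of position-bias correction, the sketch below computes clicks over expected clicks (COEC): each impression contributes the examination probability of its position to the expected count, and a document's score is its clicks divided by that expectation. This is a simple stand-in for a full click model, and the propensity values are made up for the example. In practice they would be estimated from your own logs.

```python
from collections import defaultdict


def coec(
    events: list[tuple[str, int, bool]],
    propensity: dict[int, float],
) -> dict[str, float]:
    """Clicks over expected clicks per document.

    events: (doc_id, position, clicked) impressions.
    propensity: assumed probability that a user examines each position.
    """
    clicks: dict[str, float] = defaultdict(float)
    expected: dict[str, float] = defaultdict(float)
    for doc_id, position, clicked in events:
        clicks[doc_id] += 1.0 if clicked else 0.0
        expected[doc_id] += propensity[position]
    return {d: clicks[d] / expected[d] for d in expected if expected[d] > 0}


# Hypothetical log: both documents have a raw CTR of 0.5.
events = [
    ("SKU-1234", 1, True), ("SKU-1234", 1, False),
    ("SKU-5678", 3, True), ("SKU-5678", 3, False),
]
propensity = {1: 1.0, 3: 0.25}  # invented examination probabilities

scores = coec(events, propensity)
```

Here the document shown at position 3 earns the same raw click-through rate from far fewer expected examinations, so its corrected score is higher. Mapping such scores onto the 0-3 rating scale still needs thresholds chosen and reviewed by a person.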

References