This page describes what is evaluated in the repository, how it is measured,
and where the limits are. There is no leaderboard here and no cross-provider
comparison: the committed evaluation surface is an offline fixture plus
runtime evaluators, not a live benchmark of any specific model.
For the deterministic path that walks a single supported question end to
end, see Reproduce E20.
evaluation/test_cases.json ships 12 questions across six categories: error_codes, reset_password, warranty, installation, billing, general. Each case carries question, expected_keywords, expected_answer, and category fields.
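As a rough illustration of the case shape described above, the sketch below checks that every case carries the four fields. It assumes the fixture is a top-level JSON list of objects, which the page does not state. The `validate_cases` helper and the sample cases are hypothetical and are not taken from the repository.

```python
import json

# Field names come from the fixture description above.
REQUIRED_FIELDS = {"question", "expected_keywords", "expected_answer", "category"}

def validate_cases(cases):
    """Return (index, missing_fields) pairs for cases lacking required fields."""
    problems = []
    for i, case in enumerate(cases):
        missing = REQUIRED_FIELDS - case.keys()
        if missing:
            problems.append((i, sorted(missing)))
    return problems

# Hypothetical sample data, invented for illustration only.
sample = json.loads("""
[
  {"question": "What does error E20 mean?",
   "expected_keywords": ["E20"],
   "expected_answer": "Illustrative answer text.",
   "category": "error_codes"},
  {"question": "How do I reset my password?",
   "category": "reset_password"}
]
""")

problems = validate_cases(sample)
print(problems)  # the second case is missing two fields
```

To check the real fixture, the same helper could be pointed at the parsed contents of evaluation/test_cases.json, provided the top-level structure matches the assumption.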
Offline evaluator
evaluation/ragas_eval.py implements a RAGEvaluator with four keyword-based metrics without the ragas package: faithfulness, answer_relevancy, context_precision, context_recall. The fallback is deterministic and runs without a paid embedding API.
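To make "keyword-based metric" concrete, here is a minimal sketch of one way such a score could be computed: the fraction of expected keywords found in an answer. This is an assumption for illustration, not the scoring logic in RAGEvaluator; the function names `tokenize` and `keyword_coverage` are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_coverage(answer, expected_keywords):
    """Fraction of expected keywords whose tokens all appear in the answer.

    Hypothetical stand-in for a keyword-based metric; deterministic and
    independent of any embedding service.
    """
    if not expected_keywords:
        return 0.0
    answer_tokens = tokenize(answer)
    hits = sum(1 for kw in expected_keywords if tokenize(kw) <= answer_tokens)
    return hits / len(expected_keywords)

score = keyword_coverage(
    "Hold the reset button for five seconds",
    ["reset", "button", "power"],
)
print(round(score, 3))
```

A score like this depends only on string overlap, which is why it is reproducible offline and also why paraphrased answers can score low.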
Benchmark runner
evaluation/benchmark_runner.py loads the fixture, scores each case against its expected_answer through RAGEvaluator, and writes aggregate plus per-question results to data/evaluation/benchmark_results.json. --use-embeddings swaps the keyword-based answer_relevancy for the embedding-based variant.
Online evaluators
evaluation/online_evaluators.py scores live trace state at request time: citation coverage, answer-length anomaly via z-score, retrieval hit-rate over relevance_score, tool-use efficiency, refusal-pattern match, and PII detection driven by config/evaluator_patterns.yml.
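The answer-length check can be pictured with a small z-score sketch. It assumes length is measured in whitespace-separated words and compared against a reference sample of earlier answer lengths; the threshold of 3.0 and both function names are illustrative assumptions, not values read from online_evaluators.py.

```python
from statistics import mean, pstdev

def length_z_score(answer, reference_lengths):
    """z-score of the answer's word count against reference word counts."""
    mu = mean(reference_lengths)
    sigma = pstdev(reference_lengths)
    if sigma == 0:
        # All reference answers have the same length; no spread to compare against.
        return 0.0
    return (len(answer.split()) - mu) / sigma

def is_length_anomaly(answer, reference_lengths, threshold=3.0):
    """Flag answers whose length is more than `threshold` deviations from the mean."""
    return abs(length_z_score(answer, reference_lengths)) > threshold

reference = [40, 50, 60]            # hypothetical word counts of earlier answers
typical = " ".join(["word"] * 55)   # 55 words, close to the mean
very_long = " ".join(["word"] * 100)

print(is_length_anomaly(typical, reference))
print(is_length_anomaly(very_long, reference))
```

The zero-variance guard matters in practice: with a small or uniform reference sample, dividing by the standard deviation would otherwise fail.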
No cross-provider live-graph numbers. The repo runs the offline path,
not a multi-model bake-off, and the docs do not claim provider-comparison
results.
No leaderboard. There is no committed “model A beats model B on these
twelve questions” output, and the fixture is too small to make that claim
even if such output existed.
No paid-API baseline. RAGEvaluator runs without ragas or an embedding
service; the embedding-based variant of answer_relevancy is opt-in via
--use-embeddings in the benchmark runner.
The seeded demo KB ships three documents (demo/seed_docs.py): warranty,
returns, and errors_e10_e30.md. The 12-case fixture spans six categories,
so three of them (installation, reset_password, general) have no
supporting document in the demo KB, and billing is covered only partially
by the returns document. Those cases still execute through the graph, but
the retrieval step is expected to surface low-relevance chunks, and the
evaluator scores reflect that by design.
If you want the deterministic trust path on the supported scenario, the
Reproduce E20 page walks ingest,
ask, and trace inspection end to end against the seeded source.
There are two runners. The default one, evaluation/benchmark_runner.py,
runs against empty context (so only answer_relevancy is meaningful); the
second, evaluation/benchmark_offline_with_docs.py, feeds each case the
seeded demo doc whose category matches, which exercises all four metrics.
Both runners write data/evaluation/benchmark_results.json. The committed
file is the output of benchmark_offline_with_docs.py so the numbers below
are reproducible from a clean checkout.
The numbers below come from evaluation/benchmark_offline_with_docs.py
against the seeded demo/docs/ KB. Three categories have a supporting doc;
three do not — that gap is the dominant signal in the aggregate.
| Metric | Aggregate (n=12) |
| --- | --- |
| faithfulness | 0.042 |
| answer_relevancy | 0.549 |
| context_precision | 0.190 |
| context_recall | 0.250 |
Per-category mean scores:
| Category | n | Has doc? | faithfulness | answer_relevancy | context_precision | context_recall |
| --- | --- | --- | --- | --- | --- | --- |
| error_codes | 3 | yes | 0.167 | 0.528 | 0.317 | 0.333 |
| warranty | 2 | yes | 0.000 | 0.167 | 0.500 | 0.833 |
| billing | 2 | yes (returns_policy.md, partial overlap) | 0.000 | 0.833 | 0.167 | 0.167 |
| reset_password | 2 | no | 0.000 | 0.667 | 0.000 | 0.000 |
| installation | 2 | no | 0.000 | 0.500 | 0.000 | 0.000 |
| general | 1 | no | 0.000 | 0.667 | 0.000 | 0.000 |
faithfulness is low because the answers in the fixture are written as
prose that does not always overlap word-for-word with the demo doc text;
context_precision/recall are zero for unsupported categories by design,
not by accident. These are honest numbers, not a leaderboard.
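The aggregate row can be cross-checked against the per-category means, since each aggregate should equal the n-weighted mean of its category values. The sketch below does that arithmetic using only the numbers reported on this page; the `weighted_aggregate` helper is illustrative and is not part of the repository.

```python
# Per-category rows copied from the table above:
# (n, faithfulness, answer_relevancy, context_precision, context_recall)
per_category = {
    "error_codes":    (3, 0.167, 0.528, 0.317, 0.333),
    "warranty":       (2, 0.000, 0.167, 0.500, 0.833),
    "billing":        (2, 0.000, 0.833, 0.167, 0.167),
    "reset_password": (2, 0.000, 0.667, 0.000, 0.000),
    "installation":   (2, 0.000, 0.500, 0.000, 0.000),
    "general":        (1, 0.000, 0.667, 0.000, 0.000),
}
METRICS = ("faithfulness", "answer_relevancy", "context_precision", "context_recall")

def weighted_aggregate(rows):
    """n-weighted mean of each metric across categories, rounded to 3 decimals."""
    total_n = sum(r[0] for r in rows.values())
    return {
        name: round(sum(r[0] * r[i + 1] for r in rows.values()) / total_n, 3)
        for i, name in enumerate(METRICS)
    }

aggregate = weighted_aggregate(per_category)
print(aggregate)
```

Run against these rows, the weighted means round to 0.042, 0.549, 0.190, and 0.250, matching the aggregate table. The same check shows how much of each aggregate is driven by the three categories with no supporting document.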
benchmark_runner.py (no docs) writes the same output shape, but with empty
context only the keyword-based answer_relevancy is non-zero (≈ 0.549); the
other three aggregate fields are zero. That is the baseline the demo-doc
run improves on.