Evaluation

This page describes what is evaluated in the repository, how it is measured, and where the limits are. There is no leaderboard here and no cross-provider comparison: the committed evaluation surface is an offline fixture plus runtime evaluators, not a live benchmark of any specific model.

For the deterministic path that walks a single supported question end to end, see Reproduce E20.

What is committed

12-case fixture

evaluation/test_cases.json ships 12 questions across six categories: error_codes, reset_password, warranty, installation, billing, general. Each case carries question, expected_keywords, expected_answer, and category fields.
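The case shape described above can be checked with a few lines of Python. This is a minimal sketch, not code from the repository: it assumes the fixture's top-level JSON value is a list of case objects, and `load_cases` and `validate_cases` are hypothetical helper names. The sample case is illustrative and is not copied from the fixture.

```python
import json
from collections import Counter

# Field and category names are taken from the fixture description above.
REQUIRED_FIELDS = {"question", "expected_keywords", "expected_answer", "category"}
CATEGORIES = {"error_codes", "reset_password", "warranty",
              "installation", "billing", "general"}


def load_cases(path="evaluation/test_cases.json"):
    """Read the fixture. Assumes the file holds a JSON list of case objects."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def validate_cases(cases):
    """Return per-category counts; raise ValueError on a malformed case."""
    counts = Counter()
    for i, case in enumerate(cases):
        missing = REQUIRED_FIELDS - case.keys()
        if missing:
            raise ValueError(f"case {i} is missing fields: {sorted(missing)}")
        if case["category"] not in CATEGORIES:
            raise ValueError(f"case {i} has unknown category: {case['category']!r}")
        counts[case["category"]] += 1
    return dict(counts)


# Hypothetical case for illustration only; not copied from the fixture.
example = [{
    "question": "What does error E20 mean?",
    "expected_keywords": ["E20"],
    "expected_answer": "placeholder answer text",
    "category": "error_codes",
}]
print(validate_cases(example))
```

If the list assumption holds, `validate_cases(load_cases())` run from the repository root should report twelve cases spread over the six categories.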

Offline evaluator

evaluation/ragas_eval.py implements RAGEvaluator, which computes four keyword-based metrics without depending on the ragas package: faithfulness, answer_relevancy, context_precision, context_recall. The fallback is deterministic and runs without a paid embedding API.
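To make "keyword-based" concrete, here is one way such scores can be computed from token overlap. It is a sketch under stated assumptions and not the formulas used by RAGEvaluator: the function names `tokenize`, `keyword_recall`, and `token_overlap` are hypothetical, and the actual metric definitions live in evaluation/ragas_eval.py.

```python
import re


def tokenize(text):
    """Lowercased set of word tokens. The regex is Unicode-aware in Python 3."""
    return set(re.findall(r"\w+", text.lower()))


def keyword_recall(expected_keywords, text):
    """Fraction of expected keywords whose tokens all appear in text."""
    if not expected_keywords:
        return 0.0
    tokens = tokenize(text)
    hits = sum(1 for kw in expected_keywords if tokenize(kw) <= tokens)
    return hits / len(expected_keywords)


def token_overlap(answer, context):
    """Fraction of answer tokens that also occur in the context."""
    answer_tokens = tokenize(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & tokenize(context)) / len(answer_tokens)


print(keyword_recall(["warranty", "two years"], "The warranty lasts two years."))
print(token_overlap("reset device", ""))
```

A score of this kind is deterministic and needs no model call, and any overlap against an empty context is zero, which is consistent with context-dependent metrics carrying no signal when no document is supplied.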

Benchmark runner

evaluation/benchmark_runner.py loads the fixture, scores each case against its expected_answer through RAGEvaluator, and writes aggregate plus per-question results to data/evaluation/benchmark_results.json. The --use-embeddings flag swaps the keyword-based answer_relevancy for the embedding-based variant.

Online evaluators

evaluation/online_evaluators.py scores live trace state at request time: citation coverage, answer-length anomaly via z-score, retrieval hit-rate over relevance_score, tool-use efficiency, refusal-pattern match, and PII detection driven by config/evaluator_patterns.yml.
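The answer-length check is the easiest of these to illustrate. The sketch below flags an answer whose character length is far from the mean of previously observed lengths. It is an assumption-laden example, not the code in evaluation/online_evaluators.py: the function names, the use of character length, the population standard deviation, and the threshold of 3.0 are all choices made here for illustration.

```python
from statistics import mean, pstdev


def length_z_score(answer, reference_lengths):
    """Z-score of the answer's character length against earlier lengths."""
    if len(reference_lengths) < 2:
        return 0.0  # too little history to estimate a spread
    mu = mean(reference_lengths)
    sigma = pstdev(reference_lengths)
    if sigma == 0:
        return 0.0  # all earlier answers had the same length
    return (len(answer) - mu) / sigma


def is_length_anomaly(answer, reference_lengths, threshold=3.0):
    """True when the absolute z-score exceeds the (assumed) threshold."""
    return abs(length_z_score(answer, reference_lengths)) > threshold


history = [100, 110, 90, 100]  # hypothetical lengths of earlier answers
print(is_length_anomaly("x" * 200, history))
print(is_length_anomaly("x" * 100, history))
```

Returning 0.0 when the history is short or has zero spread keeps the check from dividing by zero and from flagging the first few requests.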

What is not committed

No cross-provider live-graph numbers. The repo runs the offline path, not a multi-model bake-off, and the docs do not claim provider-comparison results.
No leaderboard. There is no committed “model A beats model B on these twelve questions” output, and the fixture is too small to make that claim even if such output existed.
No paid-API baseline. RAGEvaluator runs without ragas or an embedding service; the embedding-based variant of answer_relevancy is opt-in via --use-embeddings in the benchmark runner.

Why the fixture is limited

The seeded demo KB (demo/seed_docs.py) ships three documents: a warranty document, a returns policy (returns_policy.md), and errors_e10_e30.md. The 12-case fixture spans six categories, so three of them (reset_password, installation, general) have no supporting document in the demo KB, and billing is covered only partially by returns_policy.md. Those cases still execute through the graph, but the retrieval step is expected to surface low-relevance chunks, and the evaluator scores reflect that by design.

If you want the deterministic trust path on the supported scenario, the Reproduce E20 page walks ingest, ask, and trace inspection end to end against the seeded source.

How to run the offline benchmark

There are two runners. The default one runs against empty context (so only answer_relevancy is meaningful); the second one feeds each case the seeded demo doc whose category matches, which exercises all four metrics.

cd D:\RAG_Support_Assistant                           # replace with the path to your checkout
python -m demo.seed_docs                              # writes demo/docs/*.md

# Baseline: empty context, scores answer_relevancy only.
python evaluation/benchmark_runner.py

# With docs: category->doc mapping for the three supported categories.
python evaluation/benchmark_offline_with_docs.py

# Or write to a custom output path:
python evaluation/benchmark_runner.py --output data/evaluation/my_run.json

Both runners write to data/evaluation/benchmark_results.json by default. The committed file is the output of benchmark_offline_with_docs.py, so the numbers below are reproducible from a clean checkout.

Measured baseline

The numbers below come from evaluation/benchmark_offline_with_docs.py against the seeded demo/docs/ KB. Three categories have a supporting doc; three do not — that gap is the dominant signal in the aggregate.

Metric	Aggregate (n=12)
`faithfulness`	0.042
`answer_relevancy`	0.549
`context_precision`	0.190
`context_recall`	0.250

Per-category mean scores:

Category	n	Has doc?	faithfulness	answer_relevancy	context_precision	context_recall
`error_codes`	3	yes	0.167	0.528	0.317	0.333
`warranty`	2	yes	0.000	0.167	0.500	0.833
`billing`	2	yes (returns_policy.md, partial overlap)	0.000	0.833	0.167	0.167
`reset_password`	2	no	0.000	0.667	0.000	0.000
`installation`	2	no	0.000	0.500	0.000	0.000
`general`	1	no	0.000	0.667	0.000	0.000

faithfulness is low because the expected answers in the fixture are written as prose that does not always overlap word-for-word with the demo doc text. context_precision and context_recall are zero for the unsupported categories because no demo document is mapped to them, which is the intended behavior. These numbers describe the fixture and the demo KB; they are not a leaderboard.
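The aggregate row can be cross-checked against the per-category table, since each aggregate is the mean of the category means weighted by the number of cases. The sketch below copies the rounded values from the table; the names `PER_CATEGORY`, `REPORTED`, and `weighted_aggregate` are introduced here for illustration only.

```python
METRICS = ("faithfulness", "answer_relevancy", "context_precision", "context_recall")

# category: (n, faithfulness, answer_relevancy, context_precision, context_recall)
PER_CATEGORY = {
    "error_codes":    (3, 0.167, 0.528, 0.317, 0.333),
    "warranty":       (2, 0.000, 0.167, 0.500, 0.833),
    "billing":        (2, 0.000, 0.833, 0.167, 0.167),
    "reset_password": (2, 0.000, 0.667, 0.000, 0.000),
    "installation":   (2, 0.000, 0.500, 0.000, 0.000),
    "general":        (1, 0.000, 0.667, 0.000, 0.000),
}

# Aggregate values as reported in the committed results file.
REPORTED = {
    "faithfulness": 0.0417,
    "answer_relevancy": 0.5486,
    "context_precision": 0.1905,
    "context_recall": 0.25,
}


def weighted_aggregate(per_category):
    """Mean of the category means, weighted by the number of cases."""
    total = sum(row[0] for row in per_category.values())
    return {
        metric: sum(row[0] * row[i + 1] for row in per_category.values()) / total
        for i, metric in enumerate(METRICS)
    }


aggregate = weighted_aggregate(PER_CATEGORY)
for metric in METRICS:
    # Differences come only from the three-decimal rounding in the table.
    print(metric, round(aggregate[metric], 4), REPORTED[metric])
```

The recomputed values agree with the reported aggregates to within 0.001, which is the precision the rounded table allows.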

Output shape

The committed data/evaluation/benchmark_results.json has the structure below, which is what the offline-with-docs runner writes after a run:

{
  "aggregate": {
    "faithfulness": 0.0417,
    "answer_relevancy": 0.5486,
    "context_precision": 0.1905,
    "context_recall": 0.25
  },
  "per_question": [
    {
      "question": "Что означает ошибка E401?",
      "category": "error_codes",
      "scores": {
        "faithfulness": 0.0,
        "answer_relevancy": 0.75,
        "context_precision": 0.1429,
        "context_recall": 0.0
      }
    }
  ],
  "num_cases": 12,
  "context_source": "demo/docs (category->doc mapping)",
  "category_coverage": { "error_codes": 3, "warranty": 2, "billing": 2,
                          "reset_password": 2, "installation": 2, "general": 1 },
  "supported_categories": ["billing", "error_codes", "warranty"],
  "timestamp": "…"
}
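Given that shape, per-category means can be derived from the per_question entries. This is a minimal sketch that relies only on the keys shown above (per_question, category, scores); `per_category_means` is a hypothetical helper, and the sample input is made up rather than taken from the committed file.

```python
import json
from collections import defaultdict


def per_category_means(results):
    """Average every metric in `scores` over the questions of each category."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for item in results["per_question"]:
        category = item["category"]
        counts[category] += 1
        for metric, value in item["scores"].items():
            sums[category][metric] += value
    return {
        category: {metric: total / counts[category]
                   for metric, total in metrics.items()}
        for category, metrics in sums.items()
    }


# Made-up input with the same keys as the results file, for illustration only.
sample = {
    "per_question": [
        {"question": "q1", "category": "error_codes",
         "scores": {"faithfulness": 0.0, "answer_relevancy": 0.75}},
        {"question": "q2", "category": "error_codes",
         "scores": {"faithfulness": 0.0, "answer_relevancy": 0.25}},
    ]
}
print(json.dumps(per_category_means(sample), indent=2))
```

To apply it to a real run, load data/evaluation/benchmark_results.json with `json.load` and pass the resulting dictionary to `per_category_means`.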

benchmark_runner.py (no docs) writes the same shape, but only answer_relevancy is nonzero in the aggregate (keyword-only, ≈ 0.549); the other three fields are zero. That is the baseline the demo-doc run improves on.