What the assistant does

The assistant answers customer-support questions in Russian on top of a tenant’s knowledge base. The goal is to keep the model from emitting unsupported answers about a tenant’s error codes, warranty terms, or billing flow. Answers are grounded in retrieved KB passages, and when fact verification is enabled and the retrieved context is non-empty, up to ten claims extracted from the draft answer are re-checked against those passages before the answer leaves the graph.

Product surface

The first screenshot shows the assistant’s chat UI rendering a live E20 answer, the same case walked end to end under “A real question, end to end” below. It is the static UI shell from static/chat.html driven by the exact JSON that a real /api/ask graph run returned (sources, citations, and quality/route badges). The run was captured against an isolated demo tenant so the owner’s corpus is untouched; the layout, citation pill, and “Источники” disclosure are all production DOM populated by that genuine response. The other three screenshots come from the docs hub itself: the landing entry, the measured-baseline section, and the auto-generated LangGraph flowchart.

RAG Support Assistant chat UI: user question 'Как исправить E20?' followed by the assistant's grounded three-step answer with [1] citation pills, Качество: 95 and Маршрут: auto badges, and an expanded Источники block citing errors_e10_e30.md. — Chat UI — E20 walkthrough rendered through the production `addMessage` path against a live `/api/ask` response (external-mistral, captured 2026-06-16; citation, badges, sources disclosure all real DOM).

Landing page of the RAG Support Assistant docs hub: title, intro paragraph, See the example / Reproduce E20 / GitHub buttons, and the start of the Proof points cards. — Landing entry — title, three navigation buttons, Proof points.

Evaluation page — Measured baseline section with aggregate metric table (faithfulness 0.042, answer_relevancy 0.549, context_precision 0.190, context_recall 0.250) and per-category breakdown. — Evaluation page — measured aggregate + per-category numbers from `benchmark_offline_with_docs.py`.

LangGraph state machine flowchart rendered inline: classify_complexity, transform_query, retrieve, grade_docs, generate, verify_facts, evaluate, route_or_retry with branches to suggest_questions / rewrite_query / handle_error. — Architecture / LangGraph — Mermaid flowchart auto-generated from `agent/graph.py`.

A real question, end to end

The repo ships a 12-case fixture set under evaluation/test_cases.json covering six categories: error codes, password reset, warranty, installation, billing, general. The bundled demo seeder (demo/seed_docs.py) writes source files for later ingestion via the normal KB-upload flow; the seeder itself does not build a vector index. Below is one of those cases walked through the full LangGraph pipeline against that bundled corpus once ingested.

Demo corpus: warranty, returns, and E10-E30 errors (errors_e10_e30.md).

Input

Как исправить E20?

What happens inside the graph

The pipeline is a LangGraph state machine. It lives in agent/graph.py and the full flowchart — with every node and conditional edge — is rendered automatically on the LangGraph state machine page. The walkthrough below names each node in the order the graph actually executes.

Retrieve 1-3: classify_complexity — classifies the question as simple, complex, or unknown. The label drives model routing (fast model for simple, strong model for complex), not the path through the graph: every question still passes through transform_query next.
Retrieve 1-3: transform_query — rewrites the user question into a single search_query. When chat history is present, follow-ups are first rewritten into a standalone query. When HyDE (hypothetical document embeddings) is enabled, the LLM drafts a plausible answer that is embedded and used as the retrieval probe. That draft is stored as hyde_query, and retrieve uses it in preference to search_query when HyDE succeeds.
Retrieve 1-3: retrieve — hybrid search against a per-tenant ChromaDB collection. Tenant isolation is enforced by collection separation: the tenant id is resolved from the JWT at API entry and used to select a dedicated Chroma collection named {prefix}_{sanitize(tenant_id)} (see vectordb/manager.py). The sanitizer replaces every character outside [A-Za-z0-9._-] with _ and truncates the name to fit Chroma’s 63-char collection-name limit, so isolation holds only as long as tenant ids remain distinct after sanitization and truncation. Postgres-side queries filter by tenant_id at the ORM layer.
Generate 4-5: grade_docs (Corrective RAG: the LLM re-grades each retrieved chunk YES/NO for actual relevance rather than trusting the vector similarity ranking) — chunks graded NO are filtered out before generation, with the top-ranked vector hit preserved as a fallback when no other chunk survives grading. This shapes what generate sees but does not by itself trigger a retry; the retry decision is made later in route_or_retry from the self-eval scores.
Generate 4-5: generate — the QA prompt receives only the graded chunks plus the question, with each chunk presented as [Документ N | source=<filename>]. The model is chosen by the active routing profile (default gracekelly-primary, with Ollama fallback on provider failure — see Provider routing).
Verify & Decide 6-8: verify_facts — when fact verification is enabled and retrieved context is non-empty, up to ten claims are extracted from the draft answer and each one is re-checked against the retrieved chunks. Each checked claim is labelled supported / unsupported in the trace, and factuality_score is written as the percentage of supported claims; skipped or no-claim paths write 100. This labelling is independent of the next step’s self-eval.
Verify & Decide 6-8: evaluate — runs an LLM self-eval against the retrieved context and produces quality_score (1–100) and relevance_score (derived as quality_score / 100, rounded to three decimals — normally 0.01–1.0 because quality_score is clamped to 1–100). These are the scores route_or_retry reads next.
Verify & Decide 6-8: route_or_retry — sets a verdict from the two scores and the remaining iteration budget. Both gates must pass for auto:
- auto (quality ≥ min_quality and relevance ≥ min_relevance, defaults 80 and 0.8) → suggest_questions → log
- retry (either gate below threshold, iterations left) → rewrite_query → retrieve → … (Self-RAG — the loop that rewrites the query and re-retrieves when the self-eval is too low, capped by max_iterations)
- human (either gate below threshold, iterations exhausted) → log directly, no suggestions
- error (state carries an error from an earlier node) → handle_error → END
Finalize 9-10: suggest_questions — three follow-ups are generated from the answer context to seed the next turn. Only runs on the auto verdict.
Finalize 9-10: log — the trace store records node order, state snapshots, retrieved-doc references and citations, provider/model, token usage, and estimated cost. Per-node latency and LLM span timing are carried by Langfuse and OpenTelemetry when those exporters are configured.
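The verdict logic described for evaluate and route_or_retry can be sketched in a few lines. This is a minimal illustration, not the code in agent/graph.py: the names Thresholds, relevance_from_quality, and decide_verdict are hypothetical, and the reading of "iterations left" as iteration < max_iterations is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Thresholds:
    # Defaults quoted in the walkthrough; the class itself is hypothetical.
    min_quality: int = 80
    min_relevance: float = 0.8


def relevance_from_quality(quality_score: int) -> float:
    """Clamp quality to 1..100, then derive relevance as quality / 100."""
    clamped = max(1, min(100, quality_score))
    return round(clamped / 100, 3)


def decide_verdict(
    quality_score: int,
    iteration: int,
    max_iterations: int,
    error: Optional[str] = None,
    thresholds: Thresholds = Thresholds(),
) -> str:
    """Return 'error', 'auto', 'retry', or 'human' for one graph pass."""
    if error:
        return "error"  # an earlier node failed; handle_error takes over
    relevance = relevance_from_quality(quality_score)
    both_gates_pass = (
        quality_score >= thresholds.min_quality
        and relevance >= thresholds.min_relevance
    )
    if both_gates_pass:
        return "auto"
    # Assumption: `iteration` counts attempts already made.
    if iteration < max_iterations:
        return "retry"
    return "human"
```

With the default thresholds a quality score of 95 on the first pass yields auto, while a score of 60 yields retry until the iteration budget is spent and human afterwards.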

Output

The QA prompt labels each retrieved chunk with a source=<filename> tag internally (see agent/prompts.py) and asks the model to emit a numbered [N] citation whenever it leans on a specific fact. The filenames the citations point at come back separately in the sources / citations fields of the /api/ask JSON response. A grounded answer for the bundled E20 case looks roughly like:

Ошибка E20 связана с проблемой слива воды. Возможные причины:
засорённый сливной фильтр, перегиб сливного шланга или неисправность
сливного насоса [1].

Возможно, вас также интересует:
  • Что делать, если фильтр чистый, а ошибка остаётся?
  • Где найти инструкцию по разборке сливного узла?
  • Какие коды ошибок ещё связаны с водой?

{
  "sources": [
    { "source": "errors_e10_e30.md", "page_content": "E20 — проблема со сливом воды …" }
  ],
  "citations": [
    { "index": 1, "doc_id": "errors_e10_e30.md", "title": "errors_e10_e30.md", "excerpt": "клапан слива / фильтр …" }
  ]
}

The exact field values vary by run, but the shape is fixed: sources is a list of {source, page_content} and citations is a list of {index, doc_id, title, excerpt} — see the AskResponse, SourceInfo, and Citation Pydantic models in api/routers/conversation.py. The body of the answer is constrained to what the errors_e10_e30.md entry for E20 actually says (clogged filter, kinked drain hose, faulty drain pump).
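On the client side, the [N] markers in the answer text can be joined back to filenames through the citations list. The sketch below uses TypedDict stand-ins that mirror the field names shown above; they are not the Pydantic models from api/routers/conversation.py, and resolve_citations is a hypothetical helper.

```python
import re
from typing import Dict, List, TypedDict


class SourceInfo(TypedDict):
    source: str
    page_content: str


class Citation(TypedDict):
    index: int
    doc_id: str
    title: str
    excerpt: str


def resolve_citations(answer_text: str, citations: List[Citation]) -> Dict[int, str]:
    """Map each [N] marker found in the answer to the cited doc_id."""
    by_index = {c["index"]: c["doc_id"] for c in citations}
    markers = {int(m) for m in re.findall(r"\[(\d+)\]", answer_text)}
    # Markers with no matching citation entry are silently skipped here.
    return {n: by_index[n] for n in sorted(markers) if n in by_index}


example_citations: List[Citation] = [
    {
        "index": 1,
        "doc_id": "errors_e10_e30.md",
        "title": "errors_e10_e30.md",
        "excerpt": "клапан слива / фильтр …",
    }
]
example_answer = "Ошибка E20 связана с проблемой слива воды [1]."
print(resolve_citations(example_answer, example_citations))
```

A stricter client might instead flag markers that have no citation entry, since those would point at a source the response never returned.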

The case’s expected_keywords (ошибка, проверьте, устройство) are not checked against the answer text — benchmark_runner feeds them to context_recall, which scores whether the keywords surface in the retrieved context. The expected_answer field is consumed separately as the answer fixture for offline answer-quality metrics.
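To make the keyword check concrete, here is one plausible way a keyword-based context_recall could be computed. The exact formula inside benchmark_runner is not shown on this page, so the case-insensitive substring match and the function name keyword_context_recall are assumptions for illustration only.

```python
from typing import List


def keyword_context_recall(expected_keywords: List[str], retrieved_chunks: List[str]) -> float:
    """Fraction of expected keywords that appear in the retrieved context."""
    if not expected_keywords:
        return 0.0  # arbitrary choice for this sketch: nothing to recall
    # casefold() gives case-insensitive matching that also works for Cyrillic.
    context = " ".join(retrieved_chunks).casefold()
    hits = sum(1 for kw in expected_keywords if kw.casefold() in context)
    return hits / len(expected_keywords)


keywords = ["ошибка", "проверьте", "устройство"]
chunks = ["E20: ошибка слива. Проверьте фильтр."]
print(keyword_context_recall(keywords, chunks))
```

In this example two of the three keywords occur in the chunk, so the score is two thirds regardless of what the generated answer says.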

What changes across question types

Every question traverses the same backbone: classify → transform → retrieve → grade → generate → verify → evaluate → route_or_retry. What differs across categories is which scores evaluate tends to produce, and therefore which verdict route_or_retry picks. The table below names what tends to drive a retry per category; it is a qualitative read on the bundled fixture set, not a measured number.

Category	Example	What can drive a retry
`error_codes`	Что означает ошибка E401?	Question mentions a code not present in the indexed KB (E401 is in the fixture set but not in `errors_e10_e30.md`, which only covers E10/E20/E25/E30) — retrieval returns off-topic chunks, `evaluate` drops, `route_or_retry` rewrites. The walked-through case above (E20) is in the KB and grades cleanly.
`reset_password`	Как сбросить пароль?	Short factual; the demo KB has no password-reset document, so retrieval returns empty/off-topic and `evaluate` scores would drop.
`warranty`	Сколько длится гарантия?	Short factual; the demo KB has the warranty document, so retrieval should ground cleanly.
`installation`	Как установить приложение?	No installation document in the demo KB — retrieval returns empty or off-topic, `evaluate` scores drop, `route_or_retry` triggers `retry` or `human`.
`billing`	Как отменить подписку?	No billing document in the demo KB — same path as `installation`.
`general`	Как связаться с поддержкой?	Short factual; if the answer can be inferred from the bundled docs, `evaluate` passes; otherwise the same empty-retrieval path applies.

What this actually demonstrates

Grounded answers, not unsupported generation

Up to ten claims are extracted from the draft answer and each is cross-checked against the retrieved chunks in verify_facts, which labels them supported / unsupported and writes a factuality_score as the supported-claim percentage. evaluate then runs an independent self-eval against the retrieved context, and route_or_retry uses those self-eval scores (not the verify_facts labels) to decide whether to retry or ship.
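The factuality_score arithmetic can be sketched as follows. The function name, the boolean label representation, and the rounding to a whole percentage are assumptions; only the ten-claim cap, the supported-claim percentage, and the value 100 for skipped or no-claim paths come from the description above.

```python
from typing import List

MAX_CLAIMS = 10  # cap on claims checked per answer, per the description


def factuality_score(claim_supported: List[bool], verification_ran: bool = True) -> int:
    """Percentage of checked claims labelled supported.

    Skipped verification or an answer with no extracted claims yields 100.
    """
    if not verification_ran or not claim_supported:
        return 100
    checked = claim_supported[:MAX_CLAIMS]
    supported = sum(1 for ok in checked if ok)
    return round(100 * supported / len(checked))


print(factuality_score([True, True, False, True]))
```

Three supported claims out of four give 75. That number lands in the trace but, as noted, does not feed the retry decision.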

Self-correcting retrieval

Corrective RAG (grade_docs) drops off-topic chunks before generation. Self-RAG retry (route_or_retry → rewrite_query → retrieve → …) is triggered by the evaluate scores, not by chunk count, and is capped by max_iterations.
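The Corrective RAG half of this can be sketched as a plain filter. The grader is passed in as a callable standing in for the LLM YES/NO call, chunks are assumed to arrive ordered by retrieval rank, and the fallback is assumed to apply only when the grader rejects every chunk; filter_graded is a hypothetical name.

```python
from typing import Callable, List


def filter_graded(chunks: List[str], is_relevant: Callable[[str], bool]) -> List[str]:
    """Keep chunks the grader accepts; fall back to the top-ranked hit."""
    kept = [chunk for chunk in chunks if is_relevant(chunk)]
    if not kept and chunks:
        # Assumption: chunks[0] is the top-ranked vector hit.
        return [chunks[0]]
    return kept


def toy_grader(chunk: str) -> bool:
    # Stand-in for the LLM grader: accept chunks that mention the code.
    return "E20" in chunk


ranked = ["E20: drain problem", "Warranty lasts 24 months"]
print(filter_graded(ranked, toy_grader))
```

Keeping one chunk in the all-rejected case means generate always sees some context, and the later self-eval decides whether that context was good enough.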

Multi-tenant by construction

The tenant id is resolved from the JWT at API entry and routes every vector query to a dedicated per-tenant ChromaDB collection ({prefix}_{sanitize(tenant_id)}). Postgres-side queries filter by tenant_id at the ORM layer. Tenants land in separate collections as long as their raw ids stay distinct after the sanitization and 63-char truncation applied to fit Chroma’s collection-name rules.
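The naming rule and its collision caveat can be illustrated with a short sketch. It follows the description above (replace characters outside [A-Za-z0-9._-] with _, then truncate to 63 characters) but is not the code in vectordb/manager.py; the prefix value "kb" and the choice to truncate the whole name rather than only the tenant part are assumptions.

```python
import re

MAX_COLLECTION_NAME = 63  # Chroma collection-name length limit


def sanitize(tenant_id: str) -> str:
    """Replace every character outside [A-Za-z0-9._-] with an underscore."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", tenant_id)


def collection_name(prefix: str, tenant_id: str) -> str:
    """Build the per-tenant collection name and truncate it to the limit."""
    return f"{prefix}_{sanitize(tenant_id)}"[:MAX_COLLECTION_NAME]


# Two distinct raw ids that collapse to the same collection name.
print(collection_name("kb", "acme corp"))
print(collection_name("kb", "acme/corp"))
```

Both calls produce kb_acme_corp, which is exactly the situation the caveat warns about: isolation depends on tenant ids being issued in a form that stays unique after sanitization and truncation.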

Observable end to end

The trace store records node order, state snapshots, retrieved-doc references, provider/model, token usage, and cost — so any answer in production can be reconstructed offline. Langfuse and OpenTelemetry carry LLM timings and span data when configured.

docker compose up -d                # postgres, redis, chroma, api
python -m demo.seed_docs            # writes demo KB files to demo/docs/
                                    # (then ingest them via your usual
                                    #  KB upload flow — seed_docs only
                                    #  creates the source files)
curl -X POST http://localhost:8000/api/ask \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $JWT" \
  -d '{"question": "Как исправить E20?"}'

# Evaluate the 12 expected-answer fixtures with the offline evaluator
# (this scores fixtures with context_recall / answer fields; it does not
#  run the live graph end-to-end).
python -m evaluation.benchmark_runner \
  --cases evaluation/test_cases.json \
  --output runs/local.json

Where to go next

LangGraph state machine — the auto-generated flowchart of the nodes referenced above.
Architecture overview — request lifecycle, modules, datastores, cross-cutting concerns.
Provider routing — which models answer which profile, with costs.
API routes catalog — the surface a client integrates against.