The assistant answers customer-support questions in Russian on top of a tenant’s
knowledge base. The goal is to keep the model from emitting unsupported answers
about a tenant’s error codes, warranty terms, or billing flow. Answers are
grounded in retrieved KB passages, and when fact verification is enabled and
the retrieved context is non-empty, up to ten extracted claims are re-checked
against those passages before the answer leaves the graph.
The first screenshot is the assistant’s chat UI rendered against a live
E20 answer — the same case walked end to end under “A real question, end
to end” below. It is the static UI shell from static/chat.html driven by
the exact JSON a real /api/ask graph run returned (sources + citations +
quality/route badges), captured against an isolated demo tenant so the
owner’s corpus is untouched; the layout, citation pill, and “Источники”
disclosure are all production DOM populated by that genuine response. The
other three screenshots are from the docs hub itself: the landing entry,
the measured-baseline section, and the auto-generated LangGraph flowchart.
Chat UI: E20 walkthrough rendered through the production addMessage path against a live /api/ask response (external-mistral, captured 2026-06-16; citation, badges, sources disclosure all real DOM).
Landing entry: title, three navigation buttons, Proof points.
Evaluation page: measured aggregate + per-category numbers from benchmark_offline_with_docs.py.
Architecture / LangGraph: Mermaid flowchart auto-generated from agent/graph.py.
The repo ships a 12-case fixture set under evaluation/test_cases.json
covering six categories: error codes, password reset, warranty, installation,
billing, general. The bundled demo seeder (demo/seed_docs.py) writes source
files for later ingestion via the normal KB-upload flow; the seeder itself
does not build a vector index. Below is one of those cases walked through the
full LangGraph pipeline against that bundled corpus once ingested.
Demo corpus: warranty, returns, and E10-E30 errors (errors_e10_e30.md).
The pipeline is a LangGraph state machine. It lives in
agent/graph.py
and the full flowchart — with every node and conditional edge — is rendered
automatically on the
LangGraph state machine page.
The walkthrough below names each node in the order the graph actually executes.
Retrieve 1-3: classify_complexity — classifies the question as simple, complex,
or unknown. The label drives model routing (fast model for simple, strong
model for complex), not the path through the graph: every question still
passes through transform_query next.
Retrieve 1-3: transform_query — rewrites the user question into a single
search_query. When chat history is present, follow-ups are first
rewritten into a standalone query. When HyDE is enabled (hypothetical-
document embedding — the LLM drafts a plausible answer, which is then
embedded and used as the retrieval probe), the generated hypothetical
answer is stored as hyde_query, and retrieve uses it in preference
to search_query when HyDE succeeds.
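The preference order described above (HyDE draft first, plain query otherwise) can be sketched as follows. This is a minimal illustration, not the code in agent/graph.py: the field names search_query and hyde_query come from the text, while QueryState and retrieval_probe are hypothetical names, and treating an empty or missing hyde_query as "HyDE did not succeed" is an assumption.

```python
from typing import Optional, TypedDict


class QueryState(TypedDict, total=False):
    # search_query / hyde_query are named in the text; the class is hypothetical.
    search_query: str
    hyde_query: Optional[str]


def retrieval_probe(state: QueryState) -> str:
    """Return the text to embed for retrieval.

    Assumption: a missing, None, or blank hyde_query means HyDE was
    disabled or failed, so the rewritten search_query is used instead.
    """
    hyde = (state.get("hyde_query") or "").strip()
    return hyde if hyde else state["search_query"]
```

With this shape, a failed HyDE call degrades to ordinary query-based retrieval rather than blocking the step.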
Retrieve 1-3: retrieve — hybrid search against a per-tenant ChromaDB collection.
Tenant isolation is enforced by collection separation: the tenant id is
resolved from the JWT at API entry and used to select a dedicated Chroma
collection ({prefix}_{sanitize(tenant_id)} — see vectordb/manager.py).
The sanitizer strips non-[A-Za-z0-9._-] characters to _ and truncates
to fit Chroma’s 63-char collection-name limit, so isolation holds as long
as raw tenant ids remain distinct after sanitization. Postgres-side
queries filter by tenant_id at the ORM layer.
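The naming rule above, and the collision caveat that comes with it, can be sketched as follows. This is an illustration under assumptions, not a copy of vectordb/manager.py: the function names are hypothetical, the prefix "kb" is made up, and applying the 63-character cut to the whole name (prefix included) is a guess.

```python
import re

MAX_COLLECTION_NAME = 63  # collection-name limit cited in the text


def sanitize(tenant_id: str) -> str:
    """Replace every character outside [A-Za-z0-9._-] with an underscore."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", tenant_id)


def collection_name(prefix: str, tenant_id: str) -> str:
    """Build {prefix}_{sanitize(tenant_id)} and truncate it.

    Assumption: truncation is applied to the full name, prefix included.
    """
    return f"{prefix}_{sanitize(tenant_id)}"[:MAX_COLLECTION_NAME]


# Two distinct raw ids that collapse to the same collection name.
print(collection_name("kb", "acme/eu"))  # kb_acme_eu
print(collection_name("kb", "acme:eu"))  # kb_acme_eu
```

The two calls show why isolation depends on ids staying distinct after sanitization: a check at tenant-creation time that rejects ids whose sanitized form already exists would close that gap.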
Generate 4-5: grade_docs (Corrective RAG: the LLM re-grades each retrieved
chunk YES/NO for actual relevance rather than trusting the vector
similarity ranking). Chunks graded NO are filtered out before generation,
with the top-ranked vector hit preserved as a fallback when other
chunks survive. This shapes what generate sees but does not by itself
trigger a retry; the retry decision is made later in route_or_retry
from the self-eval scores.
Generate 4-5: generate — the QA prompt receives only the graded chunks plus the
question, with each chunk presented as [Документ N | source=<filename>].
The model is chosen by the active routing profile (default
gracekelly-primary, with Ollama fallback on provider failure — see
Provider routing).
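The chunk header format mentioned above can be illustrated with a short sketch. Only the header string [Документ N | source=<filename>] comes from the text. The Chunk class, the format_context function, numbering from 1, and placing the chunk body on the line after the header are all assumptions; the real template lives in agent/prompts.py.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Chunk:
    # Hypothetical container: a filename plus the chunk text.
    source: str
    text: str


def format_context(chunks: Iterable[Chunk]) -> str:
    """Render graded chunks as numbered, source-tagged blocks.

    Assumptions: numbering starts at 1, the body follows the header on
    its own line, and blocks are separated by a blank line.
    """
    blocks = [
        f"[Документ {i} | source={chunk.source}]\n{chunk.text}"
        for i, chunk in enumerate(chunks, start=1)
    ]
    return "\n\n".join(blocks)
```

Numbering the blocks is what lets the model emit a [N] citation that can later be mapped back to a filename.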
Verify & Decide 6-8: verify_facts — when fact verification is enabled and retrieved
context is non-empty, up to ten claims are extracted from the draft
answer and each one is re-checked against the retrieved chunks. Each
checked claim is labelled supported / unsupported in the trace,
and factuality_score is written as the percentage of supported
claims; skipped or no-claim paths write 100. This labelling is
independent of the next step’s self-eval.
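The scoring rule above reduces to a few lines of arithmetic. The sketch below assumes each checked claim has already been labelled as a boolean (True for supported); the function name, the MAX_CLAIMS constant, and the absence of rounding are assumptions, while the ten-claim cap and the default of 100 for skipped or no-claim paths come from the text.

```python
from typing import Sequence

MAX_CLAIMS = 10  # "up to ten claims" per the text


def factuality_score(supported: Sequence[bool]) -> float:
    """Percentage of supported claims among the first MAX_CLAIMS labels.

    Returns 100.0 when there is nothing to check, mirroring the
    skipped / no-claim behaviour described above.
    """
    checked = list(supported)[:MAX_CLAIMS]
    if not checked:
        return 100.0
    return 100.0 * sum(checked) / len(checked)


print(factuality_score([True, True, False, True]))  # 75.0
print(factuality_score([]))                         # 100.0
```

Note that a score of 100 is therefore ambiguous on its own: it can mean "every claim was supported" or "verification did not run", so the trace labels are needed to tell the two apart.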
Verify & Decide 6-8: evaluate — runs an LLM self-eval against the retrieved context
and produces quality_score (1–100) and relevance_score (derived
as quality_score / 100, rounded to three decimals — normally 0.01–1.0
because quality_score is clamped to 1–100). These are the scores
route_or_retry reads next.
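The derivation of relevance_score from quality_score can be written out directly. The clamp to 1–100, the division by 100, and the three-decimal rounding are from the text; the function names are hypothetical.

```python
def clamp_quality(raw_score: int) -> int:
    """Clamp a raw self-eval score into the 1..100 range."""
    return max(1, min(100, raw_score))


def relevance_from_quality(raw_score: int) -> float:
    """Derive relevance_score as clamped quality_score / 100, 3 decimals."""
    return round(clamp_quality(raw_score) / 100, 3)


print(relevance_from_quality(80))   # 0.8
print(relevance_from_quality(0))    # 0.01 (clamped up to 1 first)
print(relevance_from_quality(150))  # 1.0  (clamped down to 100 first)
```

Because relevance is a rescaled copy of quality in this derivation, the default gates of 80 and 0.8 pass or fail together.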
Verify & Decide 6-8: route_or_retry — sets a verdict from the two scores and the
remaining iteration budget. Both gates must pass for auto:
• auto (quality ≥ min_quality and relevance ≥ min_relevance, defaults 80 and 0.8) → suggest_questions → log
• retry (either gate below threshold, iterations left) → rewrite_query → retrieve → … (Self-RAG: the loop that rewrites the query and re-retrieves when the self-eval is too low, capped by max_iterations)
• human (either gate below threshold, iterations exhausted) → log directly, no suggestions
• error (state carries an error from an earlier node) → handle_error → END
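The four verdicts above can be sketched as a single pure function. This is an illustration rather than the code in agent/graph.py: the function name, the iterations_left parameter, and checking the error state before the score gates are assumptions. The thresholds and verdict names come from the text.

```python
from typing import Literal, Optional

Verdict = Literal["auto", "retry", "human", "error"]


def decide_verdict(
    quality: int,
    relevance: float,
    iterations_left: int,
    error: Optional[str] = None,
    min_quality: int = 80,
    min_relevance: float = 0.8,
) -> Verdict:
    """Map self-eval scores and the remaining budget to a verdict."""
    if error:
        # Assumption: an error from an earlier node short-circuits the gates.
        return "error"
    if quality >= min_quality and relevance >= min_relevance:
        return "auto"   # both gates passed
    if iterations_left > 0:
        return "retry"  # a gate failed, budget remains
    return "human"      # a gate failed, budget exhausted


print(decide_verdict(85, 0.85, iterations_left=2))  # auto
print(decide_verdict(60, 0.60, iterations_left=1))  # retry
print(decide_verdict(60, 0.60, iterations_left=0))  # human
```

Keeping the decision free of I/O makes the routing table easy to unit-test independently of the LLM calls that produce the scores.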
Finalize 9-10: suggest_questions — three follow-ups are generated from the answer
context to seed the next turn. Only runs on the auto verdict.
Finalize 9-10: log — the trace store records node order, state snapshots,
retrieved-doc references and citations, provider/model, token usage, and
estimated cost. Per-node latency and LLM span timing are carried by
Langfuse and OpenTelemetry when those exporters are configured.
The QA prompt labels each retrieved chunk with a source=<filename> tag
internally (see agent/prompts.py) and asks the model to emit a numbered
[N] citation whenever it leans on a specific fact. The filenames the
citations point at come back separately in the sources / citations
fields of the /api/ask JSON response. A grounded answer for the bundled
E20 case looks roughly like:
Ошибка E20 связана с проблемой слива воды. Возможные причины:
засорённый сливной фильтр, перегиб сливного шланга или неисправность
сливного насоса [1].
Возможно, вас также интересует:
• Что делать, если фильтр чистый, а ошибка остаётся?
• Где найти инструкцию по разборке сливного узла?
• Какие коды ошибок ещё связаны с водой?
{
"sources": [
{ "source": "errors_e10_e30.md", "page_content": "E20 — проблема со сливом воды …" }
  ],
  …
}
The exact field values vary by run, but the shape is fixed: sources is a
list of {source, page_content} and citations is a list of
{index, doc_id, title, excerpt} — see the AskResponse, SourceInfo,
and Citation Pydantic models in api/routers/conversation.py. The body
of the answer is constrained to what the errors_e10_e30.md entry for E20
actually says (clogged filter, kinked drain hose, faulty drain pump).
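A client that consumes /api/ask could check the documented shape before rendering anything. The sketch below uses plain dictionaries rather than the Pydantic models in api/routers/conversation.py; the key names come from the text, while the function name and the sample values are hypothetical placeholders, not captured output.

```python
SOURCE_KEYS = {"source", "page_content"}
CITATION_KEYS = {"index", "doc_id", "title", "excerpt"}


def has_documented_shape(payload: dict) -> bool:
    """Check that sources and citations carry the documented keys."""
    sources = payload.get("sources")
    citations = payload.get("citations")
    if not isinstance(sources, list) or not isinstance(citations, list):
        return False
    sources_ok = all(
        isinstance(s, dict) and SOURCE_KEYS <= s.keys() for s in sources
    )
    citations_ok = all(
        isinstance(c, dict) and CITATION_KEYS <= c.keys() for c in citations
    )
    return sources_ok and citations_ok


# Placeholder payload for illustration only; values are made up.
sample = {
    "sources": [{"source": "errors_e10_e30.md", "page_content": "..."}],
    "citations": [{"index": 1, "doc_id": "...", "title": "...", "excerpt": "..."}],
}
print(has_documented_shape(sample))  # True
```

The check only covers key presence; value types are left unchecked because the text does not state them.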
The case’s expected_keywords (ошибка, проверьте, устройство) are not
checked against the answer text — benchmark_runner feeds them to
context_recall, which scores whether the keywords surface in the
retrieved context. The expected_answer field is consumed separately
as the answer fixture for offline answer-quality metrics.
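To make the distinction concrete, a keyword-based recall over the retrieved context might look like the sketch below. It is a guess at the general idea, not the metric benchmark_runner implements: the function name, case-insensitive substring matching, the fraction-of-keywords formula, and returning 0.0 for an empty keyword list are all assumptions.

```python
from typing import Sequence


def keyword_context_recall(
    expected_keywords: Sequence[str], retrieved_chunks: Sequence[str]
) -> float:
    """Fraction of expected keywords found in the retrieved context.

    The generated answer is deliberately not an input: only the
    retrieved chunks are searched.
    """
    if not expected_keywords:
        return 0.0  # assumption: nothing to recall counts as zero
    haystack = " ".join(retrieved_chunks).casefold()
    hits = sum(1 for kw in expected_keywords if kw.casefold() in haystack)
    return hits / len(expected_keywords)
```

Under these assumptions, a context containing "ошибка" and "проверьте" but not "устройство" would score two thirds regardless of what the answer says.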
Every question traverses the same backbone:
classify → transform → retrieve → grade → generate → verify → evaluate → route_or_retry. What differs across categories is which scores evaluate
tends to produce, and therefore which verdict route_or_retry picks. The
table below names what tends to drive a retry per category; it is a
qualitative read on the bundled fixture set, not a measured number.
| Category | Example | What can drive a retry |
| --- | --- | --- |
| error_codes | Что означает ошибка E401? | The question mentions a code not present in the indexed KB (E401 is in the fixture set but not in errors_e10_e30.md, which only covers E10/E20/E25/E30): retrieval returns off-topic chunks, evaluate scores drop, and route_or_retry rewrites the query. The walked-through case above (E20) is in the KB and grades cleanly. |
| reset_password | Как сбросить пароль? | Short factual; the demo KB has no password-reset document, so retrieval returns empty/off-topic and evaluate scores would drop. |
| warranty | Сколько длится гарантия? | Short factual; the demo KB has the warranty document, so retrieval should ground cleanly. |
| installation | Как установить приложение? | No installation document in the demo KB: retrieval returns empty or off-topic, evaluate scores drop, route_or_retry triggers retry or human. |
| billing | Как отменить подписку? | No billing document in the demo KB: same path as installation. |
| general | Как связаться с поддержкой? | Short factual; if the answer can be inferred from the bundled docs, evaluate passes; otherwise the same empty-retrieval path applies. |
Up to ten claims are extracted from the draft answer and each is
cross-checked against the retrieved chunks in verify_facts, which
labels them supported / unsupported and writes a factuality_score
as the supported-claim percentage. evaluate then runs an independent
self-eval against the retrieved context, and route_or_retry uses that
score (not the verify_facts labels) to decide retry vs ship.
Self-correcting retrieval
Corrective RAG (grade_docs) drops off-topic chunks before generation.
Self-RAG retry (route_or_retry → rewrite_query → retrieve → …) is
triggered by the evaluate scores, not by chunk count, and is capped
by max_iterations.
Multi-tenant by construction
The tenant id is resolved from the JWT at API entry and routes every
vector query to a dedicated per-tenant ChromaDB collection
({prefix}_{sanitize(tenant_id)}). Postgres-side queries filter by
tenant_id at the ORM layer. Tenants land in separate collections as
long as their raw ids stay distinct after the sanitization and 63-char
truncation applied to fit Chroma’s collection-name rules.
Observable end to end
The trace store records node order, state snapshots, retrieved-doc
references, provider/model, token usage, and cost — so any answer in
production can be reconstructed offline. Langfuse and OpenTelemetry
carry LLM timings and span data when configured.