Operations & evaluation — RAG Support Assistant
Moved out of the top-level README to keep it scannable; this is the full operations and evaluation reference.
Experiments
Prompt, model, and retrieval changes can be tracked as YAML experiments in
evaluation/experiments/.
Create a new draft from the current runtime snapshot:

    python scripts/experiment_new.py --name "concise-answers" --from current --description "Shorter support replies"

Stage an experiment without changing committed defaults:

    python scripts/experiment_apply.py 2026-04-21-concise-answers --mode stage
    EXPERIMENT_ID=2026-04-21-concise-answers python -c "from config.settings import get_settings; print(get_settings().retrieval_top_k)"

Recommended workflow:
- Create a draft with experiment_new.py.
- Stage it with experiment_apply.py --mode stage.
- Run nightly or regression evaluation against the staged EXPERIMENT_ID.
- Deploy with experiment_apply.py --mode deploy once metrics look acceptable.
In stage mode, EXPERIMENT_ID plus config/experiment_override.yaml applies to
the runtime pipeline on the next request without any git edits or deploy-mode
rewrite of agent/prompts.py.
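One way to picture stage mode is as a per-request overlay on top of the committed defaults. The sketch below is illustrative only and is not the project's implementation: resolve_settings, the DEFAULTS values, and the shape of the override mapping (keyed by experiment id) are assumptions; only EXPERIMENT_ID and retrieval_top_k come from the text above.

```python
import os

# Hypothetical committed defaults; the real values live in config/settings.py.
DEFAULTS = {"retrieval_top_k": 5, "answer_style": "default"}


def resolve_settings(overrides, env=None):
    """Overlay the staged experiment's values on top of the defaults.

    `overrides` stands in for the parsed config/experiment_override.yaml and
    is assumed to map experiment ids to partial settings.
    """
    env = os.environ if env is None else env
    experiment_id = env.get("EXPERIMENT_ID")
    settings = dict(DEFAULTS)  # copy, so committed defaults are never mutated
    if experiment_id:
        settings.update(overrides.get(experiment_id, {}))
    return settings
```

Because the overlay is computed on each call, unsetting EXPERIMENT_ID returns the pipeline to the defaults without touching any file.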
Admin endpoints:
- GET /api/admin/experiments
- GET /api/admin/experiments/{id}
- POST /api/admin/experiments/{id}/archive
- POST /api/admin/experiments/{id}/regression-run?baseline=current
- GET /api/admin/regression-runs
- GET /api/admin/regression-runs/{run_id}
Regression eval
Curated regression runs compare a baseline against current or an experiment
without invoking the heavy nightly RAGAS pipeline.

    python scripts/regression_eval.py \
      --baseline current \
      --candidate 2026-04-21-concise-answers \
      --dataset evaluation/curated_cases.jsonl \
      --tenant all \
      --max-cases 100 \
      --seed 42

- The script writes reports/regression/<timestamp>-<baseline>-vs-<candidate>.md and a JSON sidecar next to it.
- Exit code 0 means the candidate satisfied the regression gate, 1 means gate failure, and 2 is reserved for infrastructure/runtime errors.
- Each completed run is persisted into eval_results with kind='regression', run_id, baseline/candidate experiment ids, and the report path.
- temperature=0 is forced for Ollama-backed regression runs to keep comparisons reproducible.
- GitHub Actions exposes an informational regression-eval job on pull requests that touch agent/prompts.py, config/settings.py, or evaluation/experiments/*.yaml.
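The exit-code contract above makes the script easy to wrap in other automation. A minimal sketch, assuming the wrapper is started from the repository root; describe_exit and run_regression are hypothetical helper names, and only flags shown in the command above are used (whether the remaining flags are required is not stated here).

```python
import subprocess
import sys

# Meanings taken from the exit-code contract described above.
EXIT_MEANINGS = {
    0: "gate passed",
    1: "gate failed",
    2: "infrastructure/runtime error",
}


def describe_exit(code):
    """Map a regression_eval.py exit code to a short label."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_regression(candidate, baseline="current"):
    """Run the documented command and return (exit_code, label)."""
    cmd = [
        sys.executable, "scripts/regression_eval.py",
        "--baseline", baseline,
        "--candidate", candidate,
        "--dataset", "evaluation/curated_cases.jsonl",
    ]
    code = subprocess.run(cmd).returncode
    return code, describe_exit(code)
```

Keeping 1 and 2 distinct lets a caller retry on infrastructure errors while treating a gate failure as a real signal about the candidate.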
Provider benchmarking
The same regression runner also supports provider/model benchmarks by passing registry aliases instead of experiment ids.

    python scripts/regression_eval.py \
      --baseline ollama-small \
      --candidate mistral-small-latest \
      --dataset evaluation/curated_cases.jsonl \
      --tenant all \
      --max-cases 50 \
      --seed 42

- Alias resolution comes from config/providers.yml, so ollama-small, gk-fast, gk-strong, and mistral-small-latest can be used directly.
- Use mistral-small-latest for direct Mistral benchmarks; bare mistral-small belongs to the GraceKelly profile.
- Default mode is mock-provider-benchmark: answers are derived from the curated dataset and pricing/latency/refusal metrics are simulated so CI does not call live external providers accidentally.
- Live calls require explicit opt-in via --allow-paid-apis or LLM_BENCHMARK_ALLOW_PAID_APIS=true.
- Reports compare pass rate, latency, total cost, and refusal rate for the baseline and candidate provider targets.
- Direct-provider profiles are blocked when DAILY_COST_LIMIT_USD is already exhausted for the current UTC day.
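The opt-in and budget rules above can be read as a small decision function. This is a sketch under stated assumptions, not the runner's actual code: live_calls_allowed and benchmark_mode are hypothetical names, the "live" and "blocked" labels are illustrative, only the literal value "true" is treated as opt-in, and the budget check is assumed to apply only once live calls are requested.

```python
import os


def live_calls_allowed(cli_flag, env=None):
    """True only when the caller opted in via the flag or the env variable."""
    env = os.environ if env is None else env
    # Assumption: only "true" (case-insensitive) counts as an opt-in.
    value = env.get("LLM_BENCHMARK_ALLOW_PAID_APIS", "")
    return bool(cli_flag) or value.strip().lower() == "true"


def benchmark_mode(cli_flag, spent_usd, daily_limit_usd, env=None):
    """Pick the run mode for a provider benchmark."""
    if not live_calls_allowed(cli_flag, env):
        return "mock-provider-benchmark"  # documented default
    if spent_usd >= daily_limit_usd:
        return "blocked"  # DAILY_COST_LIMIT_USD exhausted for the UTC day
    return "live"
```

Defaulting to the mock mode means a misconfigured CI job can only fail to spend money, never spend it by accident.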
Online evaluators
Online evaluators are synchronous, low-cost checks that run on the final trace
state after the main graph finishes. They are enabled by
ONLINE_EVALUATORS_ENABLED=true and persist one row per evaluator into
trace_evaluations.
- citation_coverage measures the share of answer sentences carrying [N] citations.
- answer_length_anomaly flags answers whose word-count z-score falls outside the baseline.
- retrieval_hit_rate measures the share of retrieved documents whose rerank relevance_score is above 0.5.
- tool_use_efficiency compares final-answer tokens against tool-call token spend.
- refusal_detected flags refusal-style phrases from config/evaluator_patterns.yml.
- pii_leak_suspicion flags phone/email/card-like patterns and stores only the matched pattern names, never raw values.
- language_mismatch compares detected query and answer languages.
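Two of the evaluators listed above are simple ratios, so a short sketch may help. It assumes a naive sentence splitter and a list of dicts with a relevance_score key; the project's real tokenization and document schema may differ, and both function bodies are illustrative rather than the shipped evaluators.

```python
import re

CITATION = re.compile(r"\[\d+\]")  # matches [1], [12], ...


def citation_coverage(answer):
    """Share of sentences that contain at least one [N] marker."""
    # Naive split after ., ! or ? followed by whitespace. A marker placed
    # after the closing period would be attributed to the next sentence.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    cited = sum(1 for s in sentences if CITATION.search(s))
    return cited / len(sentences)


def retrieval_hit_rate(docs, threshold=0.5):
    """Share of retrieved documents scoring strictly above the threshold."""
    if not docs:
        return 0.0
    hits = sum(1 for d in docs if d.get("relevance_score", 0.0) > threshold)
    return hits / len(docs)
```

Both functions return 0.0 on empty input so that an empty answer or an empty retrieval set still produces a row instead of an error.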
Operational surfaces:
- GET /api/admin/evaluations/trends?evaluator=<name>&days=30
- GET /api/admin/evaluations/worst?evaluator=<name>&limit=20
- python scripts/eval_daily_snapshot.py --date 2026-04-20
- deploy/helm/templates/cronjob-eval-snapshot.yaml runs the daily snapshot at 02:00 UTC
Monitoring
The project exposes three observability surfaces.
1. GET /api/metrics - JSON snapshot from SQLite
    {
      "latency": {"p50_sec": 2.1, "p95_sec": 8.4, "p99_sec": 14.2, "window": "24h"},
      "escalation": {"total_traces": 120, "escalated": 18, "rate_pct": 15.0, "window": "24h"},
      "quality": {"scored_traces": 840, "avg_quality": 78.3, "low_quality_share_pct": 12.5, "window": "7d"},
      "errors": {"total_started": 120, "likely_failed": 2, "likely_failure_rate_pct": 1.7, "window": "24h"},
      "feedback": {"total": 95, "thumbs_down": 11, "thumbs_down_rate_pct": 11.6, "window": "7d"}
    }

/static/metrics.html renders this snapshot with auto-refresh and status
coloring for operators and admins.
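The percentage fields in the sample snapshot are consistent with a plain ratio of the adjacent counts, rounded to one decimal. The helper below is a sketch of that arithmetic; rate_pct and recompute_rates are hypothetical names, and the rounding rule is inferred from the sample values rather than confirmed by the source.

```python
def rate_pct(part, total):
    """Percentage rounded to one decimal; 0.0 when the denominator is zero."""
    return round(100.0 * part / total, 1) if total else 0.0


def recompute_rates(snapshot):
    """Recompute the three rate fields from the raw counts in a snapshot."""
    esc = snapshot["escalation"]
    err = snapshot["errors"]
    fb = snapshot["feedback"]
    return {
        "escalation.rate_pct": rate_pct(esc["escalated"], esc["total_traces"]),
        "errors.likely_failure_rate_pct": rate_pct(err["likely_failed"], err["total_started"]),
        "feedback.thumbs_down_rate_pct": rate_pct(fb["thumbs_down"], fb["total"]),
    }
```

A check like this is handy in a smoke test: if a recomputed rate disagrees with the served one, the counts and the rate were probably taken over different windows.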
2. GET /metrics - Prometheus
monitoring/prometheus.py currently initializes Prometheus collectors
(Counter, Gauge, Histogram, or Summary):
- HTTP and latency: rag_requests_total{route}, rag_request_duration_seconds, rag_http_requests_total{method,endpoint,status}, rag_http_request_duration_seconds{method,endpoint}
- Quality and feedback: rag_quality_score, rag_factuality_score, rag_escalation_total, rag_feedback_total{rating}, rag_model_routing_total{complexity}, rag_eval_drift{metric_name}, regression_runs_total{result}, regression_runs_duration_seconds, regression_last_pass_rate{baseline,candidate}, online_evaluator_score{evaluator}, online_evaluator_runs_total{evaluator,verdict}, online_evaluator_errors_total{evaluator}
- Resilience and protection: rag_circuit_breaker_state{name}, rag_circuit_breaker_transitions_total{name,to_state}, rag_ollama_retry_events_total{event}, rag_request_timeouts_total{endpoint}, rag_inflight_pipelines, rag_pipeline_rejections_total{reason}, rag_rate_limit_rejections_total{endpoint}, rag_body_size_rejections_total{reason}
- Platform health and data: rag_component_up{component}, rag_db_pool_size, rag_db_pool_checked_out, rag_db_pool_overflow, rag_active_sessions, rag_vector_store_documents, llm_cost_usd_total{provider,model,tenant}, rag_stale_important_docs_count, llm_cache_hits_total{tenant}, llm_cache_misses_total{tenant}, rag_traces_purged_total{table}, rag_audit_purged_total, rag_auth_failures_total{reason}, review_queue_pending_total{reason}, review_queue_confirmed_total{verdict}, review_queue_oldest_pending_seconds
3. Alert rules and scheduled checks
- monitoring/alert_rules.yml defines Prometheus alert groups for resilience, health, quality, latency, nightly eval drift, and stale docs.
- scripts/check_alerts.py is a lightweight SQLite-based checker that can run every five minutes and push alerts through ALERT_WEBHOOK_URL.
- scripts/nightly_eval.py records evaluation drift, and scripts/weekly_report.py produces tenant-specific weekly reports.
- scripts/build_review_queue.py is intended for hourly automation and feeds the continuous-learning review backlog.
- scripts/generate_improvement_backlog.py aggregates the weekly actionable backlog and writes reports/improvement_backlog/<YYYY-Www>.md.

    python scripts/check_alerts.py --dry-run
    python scripts/build_review_queue.py --days 1 --tenant all
    python scripts/nightly_eval.py
    python scripts/weekly_report.py --tenant TEST --dry-run
    python scripts/generate_improvement_backlog.py --tenant all --week 2026-W17 --out reports/improvement_backlog/2026-W17.md

Review queue
The review queue keeps weak or high-risk traces from being lost between tracing, feedback, and escalation flows.

    python scripts/build_review_queue.py --days 7 --tenant all

- The builder inserts pending cases for thumbs_down, low_quality, escalated, fact_fail, and slow_trace signals.
- The admin UI exposes a Review Queue tab with filters, status counters, Confirm good / Confirm bad / Dismiss actions, and links to /admin/traces/{trace_id}.
- For single-user offline review, export a JSONL batch, annotate review.* fields in an editor, then import the verdicts back through the CLI.
- For Kubernetes, use deploy/helm/templates/cronjob-review-queue.yaml to run the builder hourly with --days 1.
Improvement backlog
The improvement backlog turns one week of review, KB, freshness, latency, and evaluation signals into a ranked list of changes worth making this week.

    python scripts/generate_improvement_backlog.py --tenant all --week 2026-W17 --out reports/improvement_backlog/2026-W17.md

- Ranking uses impact * frequency * recency, where recency decays exponentially from the latest occurrence.
- The generator keeps at most BACKLOG_MAX_ITEMS items and renders markdown sections for critical, high, and medium priorities.
- GET /api/admin/improvement-backlog/current returns the latest backlog JSON for the current admin tenant.
- GET /api/admin/improvement-backlog/archive?year=2026 lists stored markdown backlog weeks under reports/improvement_backlog/.
- For Kubernetes, use deploy/helm/templates/cronjob-improvement-backlog.yaml to generate the backlog every Monday at 06:00 UTC.
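The ranking formula above can be sketched in a few lines. The source says only that recency decays exponentially, so the 7-day half-life, the item field names (impact, frequency, last_seen), and the function names are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def recency_weight(last_seen, now, half_life_days=7.0):
    """Exponential decay: 1.0 for "just now", 0.5 after one half-life."""
    age_days = max((now - last_seen).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / half_life_days)


def score(item, now):
    """impact * frequency * recency, as described above."""
    return item["impact"] * item["frequency"] * recency_weight(item["last_seen"], now)


def rank(items, now, max_items):
    """Highest score first, truncated to max_items (cf. BACKLOG_MAX_ITEMS)."""
    return sorted(items, key=lambda i: score(i, now), reverse=True)[:max_items]
```

With this weighting, an issue seen 20 times but last seen two weeks ago can rank below a rarer issue that occurred today, which matches the goal of picking changes worth making this week.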
Curated dataset
Confirmed review cases can be promoted into a reusable JSONL dataset for regression checks and provider benchmarks.

    python scripts/build_curated_dataset.py --tenant all --since 2026-04-01 --out evaluation/curated_cases.jsonl --include-bad

- Each line in evaluation/curated_cases.jsonl is a standalone case with input.query, input.context_hint, input.channel, expected.*, human_verdict, reviewer_notes, source_trace_id, and created_at.
- confirmed_good rows always participate; add --include-bad to also export confirmed_bad rows.
- Re-running the builder is idempotent by case_id, so the file can be refreshed safely during iterative review.
- GET /api/admin/curated-dataset/stats returns dataset counts split by verdict, tenant, and channel.
- POST /api/admin/curated-dataset/rebuild queues an async rebuild and stores progress in Redis under a curated-dataset-job:<job_id> tracker key.
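Idempotence by case_id can be achieved by keying the rows on that id before writing. The sketch below shows one such approach and is not taken from build_curated_dataset.py: merge_cases is a hypothetical helper, and a top-level case_id field on each row is assumed.

```python
import json


def merge_cases(existing_lines, new_cases):
    """Merge new cases into existing JSONL lines, keyed by case_id."""
    merged = {}
    for line in existing_lines:
        line = line.strip()
        if line:
            case = json.loads(line)
            merged[case["case_id"]] = case
    for case in new_cases:
        # A repeated id replaces the earlier row instead of duplicating it.
        merged[case["case_id"]] = case
    # Stable ordering and key order keep repeated runs byte-identical.
    return [json.dumps(merged[k], sort_keys=True) for k in sorted(merged)]
```

Sorting by id and serializing with sorted keys means a rebuild with no new verdicts produces an unchanged file, which keeps diffs quiet during iterative review.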
Offline review workflow
Use the export/import pair when reviewing pending cases locally instead of clicking through the admin UI one by one.

    python scripts/review_export.py --status pending --tenant all --limit 5
    python scripts/review_import.py .review_local/review_batch_<timestamp>.jsonl --dry-run
    python scripts/review_import.py .review_local/review_batch_<timestamp>.jsonl --confirm

- scripts/review_export.py writes a comment header plus one JSON object per review case with query, answer, retrieved_docs, tool_calls, citations, and an empty review object for manual annotation.
- By default export files are written to .review_local/, and both .review_local/ and review_batch_*.jsonl are ignored by git.
- Edit only the nested review object per line: verdict = good | bad | dismiss, optional notes, optional fix_hint, optional tags.
- scripts/review_import.py skips comments, ignores rows with review.verdict = null, and refuses to overwrite items that are no longer pending.
- Set REVIEWER_EMAIL before import. In the current schema it must match an existing users.username so the import can persist reviewed_by.
- For large batches, either pass --confirm up front or answer the interactive confirmation prompt when more than 10 verdicts would be applied.
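The import rules above amount to a filter over the JSONL lines. The following is a rough sketch of that filter, not the code in review_import.py: the "#" comment prefix, the id key on each row, and both function names are assumptions.

```python
import json

VALID_VERDICTS = {"good", "bad", "dismiss"}


def collect_verdicts(lines, comment_prefix="#"):
    """Return (case id, verdict) pairs for annotated rows only."""
    verdicts = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(comment_prefix):
            continue  # comment header or blank line
        row = json.loads(line)
        verdict = (row.get("review") or {}).get("verdict")
        if verdict is None:
            continue  # row was not annotated
        if verdict not in VALID_VERDICTS:
            raise ValueError(f"unknown verdict: {verdict!r}")
        verdicts.append((row.get("id"), verdict))
    return verdicts


def needs_confirmation(verdicts, confirm_flag, limit=10):
    """Prompt only for batches above the limit when --confirm was not passed."""
    return len(verdicts) > limit and not confirm_flag
```

Rejecting unknown verdict strings up front is a design choice in this sketch: a typo such as "godo" fails the whole batch rather than being silently skipped.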
Threshold tuning
Threshold recommendations are generated from recent traces plus
review_queue verdicts. The analyzer writes a markdown report and exposes a
JSON view for admin tooling.

    python scripts/analyze_thresholds.py --tenant all --days 30 --out reports/threshold_recommendations.md

- scripts/analyze_thresholds.py evaluates QUALITY_THRESHOLD, FACT_VERIFICATION_MIN_SCORE, ESCALATION_THRESHOLD, and SLOW_TRACE_THRESHOLD_MS against labeled bad/good traces and suggests the best cutoff by F1.
- GET /api/admin/thresholds/analysis?days=30 returns the latest cached JSON analysis for the current tenant.
- POST /api/admin/thresholds/refresh?days=30 forces a refresh and rewrites reports/threshold_recommendations.md.
- deploy/helm/templates/cronjob-threshold-analysis.yaml runs the analyzer weekly.
- If fewer than THRESHOLD_ANALYSIS_MIN_LABELS labeled traces are available, the report keeps the metric section but marks it as insufficient data.
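Choosing a cutoff by F1 can be illustrated with a small sweep over the observed scores. This sketch assumes the "score below cutoff means bad" direction that suits a quality-style metric (a latency threshold would flip the comparison); f1_at, best_cutoff, and the default min_labels value are hypothetical and not taken from analyze_thresholds.py.

```python
def f1_at(cutoff, scored):
    """F1 when every trace scoring below `cutoff` is flagged as bad.

    `scored` is a list of (score, is_bad) pairs from labeled traces.
    """
    tp = sum(1 for s, bad in scored if s < cutoff and bad)
    fp = sum(1 for s, bad in scored if s < cutoff and not bad)
    fn = sum(1 for s, bad in scored if s >= cutoff and bad)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_cutoff(scored, min_labels=20):
    """Return the cutoff with the highest F1, or None on insufficient data."""
    if len(scored) < min_labels:
        return None  # mirrors the "insufficient data" case described above
    candidates = sorted({s for s, _ in scored})
    # Ties are broken toward the lower cutoff, which flags fewer traces.
    return max(candidates, key=lambda c: (f1_at(c, scored), -c))
```

Only observed scores are tried as candidates, since F1 can change only when the cutoff crosses a labeled trace.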