Configuration — RAG Support Assistant

Moved out of the top-level README to keep it scannable; this is the full runtime-configuration reference.

Environment Variables

Copy `.env.example` to `.env`, then adjust only what your deployment needs.

LLM and models

Variable	Default	Description
`OLLAMA_BASE_URL`	`http://localhost:11434`	Base URL for explicit `local-first` Ollama mode or GraceKelly fallback
`OLLAMA_MODEL_NAME`	`qwen2.5:7b`	Primary Ollama model when `LLM_PROVIDER_PROFILE=local-first`
`OLLAMA_FAST_MODEL_NAME`	`llama3.2:3b`	Faster Ollama model for explicit local helper/tool flows
`MODEL_ROUTING_ENABLED`	`false`	Enable simple/complex/global model routing
`OLLAMA_REQUEST_TIMEOUT_SEC`	`60`	Timeout for a single Ollama HTTP request
`REQUIRE_OLLAMA`	`false`	Fail fast at startup if explicit Ollama mode/fallback validation requires Ollama
`LANGFUSE_PUBLIC_KEY`	`-`	Optional Langfuse public key
`LANGFUSE_SECRET_KEY`	`-`	Optional Langfuse secret key
`LANGFUSE_HOST`	`https://cloud.langfuse.com`	Langfuse host for LLM observability

Provider profiles and live external APIs

Variable	Default	Description
`PROVIDER_REGISTRY_PATH`	`config/providers.yml`	YAML registry with providers, pricing, capabilities, and routing profiles
`LLM_PROVIDER_PROFILE`	`gracekelly-primary`	Active routing profile; defaults to the local GraceKelly orchestrator
`LLM_BENCHMARK_ALLOW_PAID_APIS`	`false`	Backward-compatible flag that allows live external-provider calls in provider benchmarks
`DAILY_COST_LIMIT_USD`	`5.0`	Fail fast when tracked direct-provider spend for the current UTC day reaches this limit
`MISTRAL_API_KEY`	`changeme`	Direct Mistral API key; placeholder values are treated as missing
`GRACEKELLY_BASE_URL`	`http://127.0.0.1:8011`	Base URL for the local GraceKelly orchestrator
`GRACEKELLY_API_KEY`	`-`	Optional GraceKelly bearer token for non-public endpoints
`GRACEKELLY_API_KEY_ENV`	`GRACEKELLY_API_KEY`	Env var name used by the runtime to look up the optional GraceKelly API key
`GRACEKELLY_HEALTH_CHECK_TIMEOUT_SEC`	`2.0`	Readiness-probe timeout before GraceKelly is considered unavailable
`GRACEKELLY_REQUEST_TIMEOUT_SEC`	`30.0`	Timeout for a single GraceKelly `/api/v1/smart` call
`FAILOVER_CHAIN_ENABLED`	`true`	Enable GraceKelly -> Ollama automatic failover for profiles that declare a local fallback
`FAILOVER_FALLBACK_CACHE_SECONDS`	`300`	Cache a successful local fallback decision for this many seconds
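
Several settings above (`GRACEKELLY_API_KEY_ENV`, and `RAG_EMBEDDING_REMOTE_API_KEY_ENV` in the RAG section) store the name of an env var rather than the secret itself, and placeholder values such as `changeme` count as missing. A minimal sketch of that lookup follows. The helper name `resolve_api_key` and the exact placeholder set are assumptions for illustration, not the project's actual code.

```python
import os
from typing import Mapping, Optional

# Assumption: only "changeme" is documented as a placeholder; the real set may be larger.
PLACEHOLDER_VALUES = {"", "changeme"}


def resolve_api_key(
    key_env_name: str, environ: Optional[Mapping[str, str]] = None
) -> Optional[str]:
    """Read the env var *named by* key_env_name; return None if missing or a placeholder."""
    env = os.environ if environ is None else environ
    value = env.get(key_env_name, "").strip()
    if value.lower() in PLACEHOLDER_VALUES:
        return None
    return value


if __name__ == "__main__":
    # The setting holds the variable name; the secret is read only at lookup time.
    name = os.environ.get("GRACEKELLY_API_KEY_ENV", "GRACEKELLY_API_KEY")
    print("GraceKelly key configured:", resolve_api_key(name) is not None)
```

Keeping only the variable name in settings means the secret never has to appear in a settings dump or log line.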

RAG pipeline

Variable	Default	Description
`RAG_EMBEDDING_MODEL`	`BAAI/bge-m3`	Embedding model used for documents and queries (local backend)
`RAG_EMBEDDING_BACKEND`	`local`	`local` = SentenceTransformer on `RAG_DEVICE`; `remote` = OpenAI/Mistral-compatible embeddings API. Remote frees ingest/search from loading the heavy local model (e.g. unblocks Windows under the 1-GiB/process rule). Remote vectors are L2-normalized to match the local path
`RAG_EMBEDDING_REMOTE_URL`	`https://api.mistral.ai/v1/embeddings`	Remote embeddings endpoint (OpenAI-compatible `{model, input:[...]}`)
`RAG_EMBEDDING_REMOTE_MODEL`	`mistral-embed`	Remote embedding model name
`RAG_EMBEDDING_REMOTE_API_KEY_ENV`	`MISTRAL_API_KEY`	Name of the env var holding the remote API key (the key itself is never stored in settings/logs)
`RAG_EMBEDDING_REMOTE_BATCH`	`32`	Inputs per remote embeddings request
`RAG_EMBEDDING_REMOTE_TIMEOUT_SEC`	`60`	Timeout for a single remote embeddings request
`RAG_RERANKER_MODEL`	`BAAI/bge-reranker-v2-m3`	Multilingual cross-encoder reranker (pairs with BGE-M3)
`RAG_HYBRID_SEARCH`	`true`	Combine BM25 with vector retrieval
`RAG_RETRIEVAL_STRATEGY`	`hybrid`	Retrieval strategy: `vector`, `hybrid`, `graph`, or `factcard`; `graph` and `factcard` fall back to `hybrid` when their store is absent. `factcard` (opt-in) serves whole fact-cards for enumeration queries (fields/documents/conditions), which closes the `customs-clearance-fields` recall gap; build the collection with `scripts/build_factcards.py`. Auto-routing into `factcard` is intentionally not the default (NO-SHIP pending the Phase-5 offline delta; see `docs/operations/2026-06-14-adaptive-retrieval-closure.md`)
`RAG_RETRIEVAL_TOP_K`	`20`	Candidate documents fetched before reranking
`RAG_RERANK_TOP_K`	`5`	Final document count after reranking
`RRF_K`	`60`	Reciprocal Rank Fusion smoothing constant
`RRF_DOC_KEY_CHARS`	`200`	Prefix length used to deduplicate RRF document keys
`QUALITY_THRESHOLD`	`80`	Default quality threshold used by routing/evaluation logic
`CHUNK_SIZE`	`800`	Default chunk size for ingestion
`CHUNK_OVERLAP`	`200`	Default chunk overlap for ingestion
`API_DEFAULT_PAGE_SIZE`	`50`	Default page size for list-style admin endpoints
`RAG_SEMANTIC_CHUNKING`	`true`	Enable semantic chunking
`RAG_CONTEXTUAL_HEADERS`	`true`	Prepend contextual headers during ingestion. Cheap by default (`build_vector_store` derives headers from chunk metadata — no LLM/network). The LLM-generated variant runs only when `INGESTION_BATCH_ENABLED=true`, and then per document, not per chunk
`INGESTION_CONTEXTUAL_CONCURRENCY`	`4`	Bounded concurrency for the LLM contextual-header fallback (providers without a native batch API). `1` = strictly serial. Caps in-flight requests so a full-corpus ingest cannot fan out unbounded provider calls. Progress is logged as `[contextual_headers] i/N`
`RAG_AGENTIC_MODE`	`false`	Enable the tool-calling agent graph
`RAG_HYDE`	`false`	Enable Hypothetical Document Embeddings
`RAG_PARENT_CHILD`	`false`	Enable parent-child chunking
`RAG_STRUCTURAL_CHUNKING`	`true`	Split markdown by headers (sections), cap to `CHUNK_SIZE`
`RAG_PARENT_EXPANSION`	`true`	Post-rerank: supplement final chunks with neighbouring sections of their source
`RAG_PARENT_EXPANSION_WINDOW`	`2`	Sections taken from each side of a selected chunk
`RAG_PARENT_EXPANSION_MAX_CHARS`	`3600`	Cap on expanded chunk text (core + neighbours)
`RAG_GRAPH_RETRIEVAL`	`off`	Graph-lane activation gate: `off`, `on`, or `auto`. The condition is evaluated and logged at ingestion; the lane itself is Phase 2 and not yet built
`RAG_GRAPH_MIN_CHUNKS`	`20000`	`auto` mode: minimum chunk count required to consider the graph lane
`RAG_GRAPH_MIN_CROSSDOC_SHARE`	`0.15`	`auto` mode: minimum cross-document entity share (connectivity gate)
`RAG_GRAPH_CROSSDOC_SHARE`	unset	Measured probe value (`scripts/graph_probe.py`; 2026-06-06 corpus: 0.296, gate passed); unset = probe not run, `auto` stays off
`RAG_ASK_BUDGET_SEC`	`0`	Optional wall-clock budget for a single `ConversationSession.ask()` outside the HTTP path (which already has `request_timeout_sec`). `0` = off (blocking). When >0 and exceeded, `ask()` returns a graceful degraded result (`route="timeout"`) instead of hanging on a flapping provider; the background run is not cancellable
`RAG_SELF_RAG_MAX_ITER`	`2`	Maximum Self-RAG iterations
`RAG_SELF_RAG_MIN_QUALITY`	`70`	Minimum quality score to avoid retry/escalation
`STREAMING_QUALITY_EVAL`	`true`	Streaming `/api/ask/stream` runs one cheap Self-RAG self-eval so streamed answers are quality-routed on par with non-streaming; set `false` to roll back to the legacy synthetic-score streaming path
`FACT_VERIFICATION_ENABLED`	`true`	Run fact verification after generation
`FACT_VERIFICATION_MIN_SCORE`	`70`	Minimum factuality score threshold
`FACT_VERIFY_CONTEXT_MAX_DOCS`	`5`	Max retrieved docs used as evidence when verifying answer facts
`FACT_VERIFY_CONTEXT_CHARS_PER_DOC`	`3600`	Chars per doc used as fact-verification evidence; aligned with `RAG_PARENT_EXPANSION_MAX_CHARS` so verification sees full parent-expanded chunks
`SLOW_TRACE_THRESHOLD_MS`	`10000`	Trace-duration threshold for review queue collection
`THRESHOLD_ANALYSIS_MIN_LABELS`	`20`	Minimum labeled traces required before suggesting a new threshold
`REVIEW_QUEUE_ENABLED`	`true`	Enable review queue builder and admin endpoints
`ONLINE_EVALUATORS_ENABLED`	`true`	Enable lightweight per-trace online evaluators, persistence, and admin views. When persistence fails (e.g. Postgres unreachable in a standalone graph run), the first failure logs at WARNING and identical repeats drop to DEBUG — one signal per process, not one per request; answers are unaffected
`ONLINE_EVALUATORS_TIMEOUT_SEC`	`1.0`	Per-trace online-evaluator wall-clock budget; runs that exceed it are dropped and counted in `rag_online_evaluators_dropped_total{reason}`
`REGRESSION_GATE_MAX_REGRESSIONS`	`2`	Maximum allowed curated regressions before the gate fails
`REGRESSION_GATE_MIN_PASS_RATE`	`0.85`	Minimum candidate pass rate required by the regression gate
`RAG_VECTOR_BACKEND`	`chroma`	Vector store backend
`VECTORDB_COLLECTION_PREFIX`	`rag_docs`	Chroma collection prefix; full name is `{prefix}_{tenant_id}`
`CATEGORIES_CONFIG_PATH`	`config/categories.yml`	Taxonomy file for upload auto-categorization
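
To show how `RRF_K` and `RRF_DOC_KEY_CHARS` interact in hybrid retrieval, here is a minimal sketch of standard Reciprocal Rank Fusion, where each ranking contributes `1 / (k + rank)` to a document's score. The function name `rrf_fuse` and the use of a plain text prefix as the dedup key are assumptions; the project's implementation may differ in detail.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

RRF_K = 60               # smoothing constant (table default)
RRF_DOC_KEY_CHARS = 200  # prefix length used as the dedup key (table default)


def rrf_fuse(
    rankings: Sequence[Sequence[str]],
    k: int = RRF_K,
    key_chars: int = RRF_DOC_KEY_CHARS,
) -> List[str]:
    """Fuse several ranked lists of document texts into one list, best first."""
    scores: Dict[str, float] = defaultdict(float)
    first_text: Dict[str, str] = {}
    for ranking in rankings:
        seen_in_ranking = set()
        for rank, text in enumerate(ranking, start=1):
            key = text[:key_chars]  # documents sharing a prefix are treated as one
            if key in seen_in_ranking:
                continue  # count each document once per ranking
            seen_in_ranking.add(key)
            scores[key] += 1.0 / (k + rank)
            first_text.setdefault(key, text)
    ordered = sorted(scores, key=lambda key: scores[key], reverse=True)
    return [first_text[key] for key in ordered]


if __name__ == "__main__":
    bm25_hits = ["doc a", "doc b", "doc c"]
    vector_hits = ["doc b", "doc c", "doc a"]
    print(rrf_fuse([bm25_hits, vector_hits]))
```

A larger `k` flattens the gap between top and lower ranks, so agreement between BM25 and vector results matters more than a single first-place hit.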

Resilience and capacity

Resilience layers apply in this order: timeout -> retry -> circuit breaker -> bounded concurrency -> request wall-time.

Variable	Default	Description
`OLLAMA_RETRY_MAX_ATTEMPTS`	`3`	Retry attempts including the first call; `1` disables retries
`OLLAMA_RETRY_BASE_DELAY_SEC`	`0.5`	Base retry delay
`OLLAMA_RETRY_MAX_DELAY_SEC`	`5.0`	Maximum retry delay
`OLLAMA_RETRY_JITTER`	`true`	Apply jitter to retry delays
`CIRCUIT_BREAKER_ENABLED`	`true`	Enable circuit-breaker protection for Ollama
`CIRCUIT_BREAKER_FAILURE_THRESHOLD`	`5`	Consecutive failures before the breaker opens
`CIRCUIT_BREAKER_RESET_TIMEOUT_SEC`	`30`	Delay before half-open probing
`REQUEST_TIMEOUT_SEC`	`30`	Wall-time limit for one `/api/ask` request
`STREAMING_TIMEOUT_SEC`	`120`	Wall-clock budget for the SSE token loop in `/api/ask/stream` (separate from `REQUEST_TIMEOUT_SEC`)
`DB_PERSIST_TIMEOUT_SEC`	`2.0`	Timeout for persisting one conversation message to Postgres before the write is dropped and counted in `rag_message_persist_failures_total{operation}`
`MAX_CONCURRENT_PIPELINES`	`8`	Maximum concurrent `/api/ask` pipelines
`PIPELINE_ACQUIRE_TIMEOUT_SEC`	`0.5`	How long to wait for a pipeline slot before returning `503`
`SESSION_TTL_SECONDS`	`7200`	Session idle timeout in seconds
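
The `OLLAMA_RETRY_*` settings can be read as a delay schedule. The sketch below assumes exponential doubling from the base delay, capped at the maximum, with "full jitter" (a uniform draw between 0 and the capped delay). The table only names a base delay, a maximum delay, and a jitter switch, so the doubling and the jitter style are assumptions, as is the helper name `retry_delays`.

```python
import random
from typing import List, Optional


def retry_delays(
    max_attempts: int = 3,    # OLLAMA_RETRY_MAX_ATTEMPTS (includes the first call)
    base_delay: float = 0.5,  # OLLAMA_RETRY_BASE_DELAY_SEC
    max_delay: float = 5.0,   # OLLAMA_RETRY_MAX_DELAY_SEC
    jitter: bool = True,      # OLLAMA_RETRY_JITTER
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return the sleep before each retry; max_attempts=1 yields no retries."""
    if rng is None:
        rng = random.Random()
    delays: List[float] = []
    for retry_index in range(max_attempts - 1):
        # Assumed policy: double the delay on every retry, never above max_delay.
        delay = min(max_delay, base_delay * (2 ** retry_index))
        if jitter:
            delay = rng.uniform(0.0, delay)  # assumed "full jitter"
        delays.append(delay)
    return delays


if __name__ == "__main__":
    print(retry_delays(jitter=False))
    print(retry_delays(max_attempts=6, jitter=False))
```

Under these assumptions the defaults produce two retries, and the cap only starts to matter from the fifth retry onward.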

Security and auth

Variable	Default	Description
`API_KEY`	`-`	Legacy `X-API-Key` protection for API endpoints; JWT is preferred
`ADMIN_USERNAME`	`admin`	Username for `/api/auth/login`
`ADMIN_PASSWORD_HASH`	`-`	Bcrypt password hash; if empty, dev mode accepts `admin/admin`
`JWT_SECRET`	`dev-secret-change-in-production!`	Secret for access/refresh tokens
`JWT_ACCESS_TTL`	`3600`	Access-token TTL in seconds
`JWT_REFRESH_TTL`	`604800`	Refresh-token TTL in seconds
`SESSION_SECRET_KEY`	`JWT_SECRET` fallback	Secret used by `SessionMiddleware` and OIDC state cookies
`GOOGLE_OIDC_CLIENT_ID`	`-`	Google OIDC client ID
`GOOGLE_OIDC_CLIENT_SECRET`	`-`	Google OIDC client secret
`AZURE_OIDC_TENANT`	`-`	Azure AD tenant used for issuer discovery
`AZURE_OIDC_CLIENT_ID`	`-`	Azure AD OIDC client ID
`AZURE_OIDC_CLIENT_SECRET`	`-`	Azure AD OIDC client secret
`TENANT_EMAIL_DOMAINS`	`""`	Domain-to-tenant mapping, for example `acme.com:tenant-acme,beta.io:tenant-beta`
`RAG_ENV`	`development`	`development`, `staging`, or `production`
`CORS_ORIGINS`	`*`	Comma-separated allowed origins; `*` is forbidden in production
`CORS_MAX_AGE_SEC`	`600`	Preflight cache TTL
`MAX_REQUEST_BODY_BYTES`	`1048576`	1 MiB request-body limit for non-upload endpoints
`MAX_UPLOAD_BYTES`	`52428800`	50 MiB upload limit for `/api/upload`
`ALLOW_ANONYMOUS_ADMIN`	`-`	Opt-in escape hatch when `API_KEY` is empty: set to `1`/`true` to permit anonymous admin (otherwise endpoints return HTTP 503). Local-dev only. Added in the 2026-04-26 audit.
`HOST`	`127.0.0.1` (bare run)	Used only when launching via `python main.py`. Default Docker Compose is local-dev only and binds host ports to `127.0.0.1`.
`PORT`	`8000`	Used only when launching via `python main.py` (bare run).
`UVICORN_RELOAD`	`false`	`python main.py` only: enable uvicorn auto-reload. Default off is headless-safe — auto-reload restarts the API on any write under `data/`/`demo/`, which flaps headless ingest/eval runs. Set `true` for the local dev loop.
`AUTO_MIGRATE`	`true`	Run `alembic upgrade head` in startup lifespan. In production, errors abort startup unless `AUTO_MIGRATE_FAIL_OPEN=true` is explicitly set.
`AUTO_MIGRATE_FAIL_OPEN`	`false`	Production escape hatch for temporarily logging migration failures instead of aborting startup.
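
The `CORS_ORIGINS` rule (comma-separated list, `*` forbidden in production) can be pictured as a small startup check. This is a sketch only: the function name `parse_cors_origins` and the choice to raise `ValueError` are assumptions, and the real validation may report the problem differently.

```python
from typing import List


def parse_cors_origins(raw: str, rag_env: str) -> List[str]:
    """Split CORS_ORIGINS on commas and reject the wildcard in production."""
    origins = [origin.strip() for origin in raw.split(",") if origin.strip()]
    if rag_env == "production" and "*" in origins:
        # Assumed behaviour: fail loudly instead of silently allowing every origin.
        raise ValueError("CORS_ORIGINS='*' is not allowed when RAG_ENV=production")
    return origins


if __name__ == "__main__":
    print(parse_cors_origins("https://a.example, https://b.example", "production"))
    print(parse_cors_origins("*", "development"))
```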

Database, cache, tracing, and analytics

Variable	Default	Description
`POSTGRES_PASSWORD`	`rag_dev_password`	Local Compose password for the Postgres container
`DATABASE_URL`	`postgresql://rag:rag_dev_password@localhost:5432/rag_assistant`	Postgres DSN for sessions, audit, analytics, and copilot data
`DB_ENCRYPTION_KEY`	dev fallback	Key used by `pgcrypto`; required in production and for migration `008`
`REDIS_URL`	`redis://localhost:6379/0`	Redis cache URL
`LLM_CACHE_ENABLED`	`false`	Enable tenant-scoped response caching for `/api/ask`
`LLM_CACHE_TTL_SECONDS`	`3600`	TTL for cached LLM responses
`OTEL_ENABLED`	`false`	Enable OpenTelemetry SDK + instrumentation
`OTEL_EXPORTER_OTLP_ENDPOINT`	`http://localhost:4317`	OTLP gRPC endpoint for Jaeger/Tempo/collectors
`OTEL_SERVICE_NAME`	`rag-support-assistant`	`service.name` resource attribute
`LLM_INPUT_PRICE_PER_1M_TOKENS`	`0.0`	Fallback input-token price when a model is not listed in the provider registry
`LLM_OUTPUT_PRICE_PER_1M_TOKENS`	`0.0`	Fallback output-token price when a model is not listed in the provider registry
`LLM_MODEL_PRICES`	`-`	Optional JSON override for legacy analytics or unregistered models
`LLM_COST_PER_1M_TOKENS`	legacy fallback	Backward-compatible legacy pricing format kept for old local setups
`TRACE_RETENTION_DAYS`	`90`	Retention window for SQLite traces; `0` disables purge
`TRACE_PURGE_INTERVAL_SEC`	`86400`	Background trace-purge interval
`AUDIT_RETENTION_DAYS`	`180`	Retention window for `audit_log`
`AUDIT_PURGE_INTERVAL_SEC`	`86400`	Background audit purge interval
`SHUTDOWN_READY_DELAY_SEC`	`5`	Drain delay between readiness flip and shutdown

Channels, escalation, and reporting

Variable	Default	Description
`SUPPORT_SINK_BACKEND`	`local`	Escalation backend: `local` or `bitrix`
`BITRIX_WEBHOOK_URL`	`-`	Bitrix24 webhook URL
`TELEGRAM_BOT_TOKEN`	`-`	Optional Telegram bot token
`ALERT_WEBHOOK_URL`	`-`	Webhook used by `scripts/check_alerts.py`
`ALERT_ESCALATION_PCT`	`35`	Escalation-rate alert threshold over 24h
`ALERT_QUALITY_MIN`	`65`	Minimum 7-day average quality
`ALERT_LOW_QUALITY_PCT`	`30`	Threshold for low-quality answer share
`ALERT_P95_LATENCY_SEC`	`12`	24h p95 latency alert threshold
`ALERT_THUMBS_DOWN_PCT`	`20`	7-day thumbs-down threshold
`ALERT_THUMBS_DOWN_MIN_N`	`50`	Minimum feedback volume before thumbs-down alerts trigger
`REPORT_SLACK_WEBHOOK`	`-`	Slack webhook for weekly reports
`REPORT_EMAIL_RECIPIENTS`	`""`	Comma-separated email list for weekly reports
`REPORT_SMTP_HOST`	`SMTP_HOST` fallback	SMTP host override for weekly reports
`REPORT_SMTP_PORT`	`SMTP_PORT` fallback or `587`	SMTP port override for weekly reports
`REPORT_SMTP_USER`	`SMTP_USER` fallback	SMTP user override for weekly reports
`REPORT_SMTP_PASS`	`SMTP_PASS` fallback	SMTP password override for weekly reports
`BACKLOG_WEIGHT_REVIEW_BAD`	`3.0`	Impact weight for confirmed-bad review backlog items
`BACKLOG_WEIGHT_THUMBS_DOWN`	`2.0`	Impact weight for thumbs-down backlog items
`BACKLOG_WEIGHT_SLOW`	`1.5`	Impact weight for slow-endpoint backlog items
`BACKLOG_WEIGHT_FRESHNESS`	`1.0`	Impact weight for stale-document backlog items
`BACKLOG_WEIGHT_EVALUATOR_DRIFT`	`2.5`	Impact weight for evaluator drift backlog items
`BACKLOG_MAX_ITEMS`	`30`	Maximum number of improvement backlog items kept after ranking
`BACKLOG_FRESHNESS_MAX_DAYS`	`90`	Freshness cutoff for stale-doc backlog items
`BACKLOG_EMAIL_ENABLED`	`false`	Email the generated backlog to `TENANT_ADMIN_EMAIL` after each run
`TENANT_ADMIN_EMAIL`	`""`	Optional recipient for backlog email delivery
`EMAIL_CHANNEL_MODE`	`disabled`	Email channel mode: `disabled`, `imap`, or `webhook`
`IMAP_HOST`	`""`	IMAP server hostname
`IMAP_PORT`	`993`	IMAP server port
`IMAP_USER`	`""`	IMAP username
`IMAP_PASS`	`-`	IMAP password (`IMAP_PASSWORD` is also accepted)
`IMAP_FOLDER`	`INBOX`	IMAP folder polled by `scripts/email_poller.py`
`IMAP_POLL_INTERVAL_SEC`	`60`	Delay between IMAP polling cycles
`SMTP_HOST`	`""`	SMTP hostname for email replies
`SMTP_PORT`	`587`	SMTP port for email replies
`SMTP_USER`	`""`	SMTP username
`SMTP_PASS`	`-`	SMTP password (`SMTP_PASSWORD` is also accepted)
`SMTP_FROM_ADDRESS`	`support@example.com`	Default sender address for outbound replies
`EMAIL_WEBHOOK_SIGNING_SECRET`	`-`	Shared secret used to verify inbound email webhooks (`EMAIL_WEBHOOK_SECRET` remains a legacy fallback)

Email channel

IMAP mode runs through `scripts/email_poller.py` and polls `IMAP_FOLDER` every `IMAP_POLL_INTERVAL_SEC` seconds.
`python scripts/email_poller.py --once` is the easiest dev-mode smoke check for one poll cycle.
Webhook mode supports SendGrid Inbound Parse style payloads with `from`, `to`, `subject`, `text`, optional `html`, and optional raw headers.
The webhook accepts `POST /webhook/email`; `/api/channels/email/inbound` remains as a compatibility alias.
Signatures are `HMAC-SHA256(body, EMAIL_WEBHOOK_SIGNING_SECRET)`, sent in the `X-Signature` header.
Tenant routing matches the sender's email domain against `TENANT_EMAIL_DOMAINS`, for example `TENANT_EMAIL_DOMAINS=acme.com:acme,*:default`.
Low-quality email answers are persisted into `escalated_tickets` with `status="pending_response"` for the existing operator flow.
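
The signature check and tenant routing described above can be sketched as follows. Two details are assumptions because the text does not state them: the `X-Signature` value is taken to be a hex digest, and `*` in `TENANT_EMAIL_DOMAINS` is treated as a catch-all entry. The function names are illustrative, not the project's API.

```python
import hashlib
import hmac
from email.utils import parseaddr
from typing import Dict, Optional


def verify_signature(body: bytes, signature_header: str, secret: str) -> bool:
    """Recompute HMAC-SHA256 over the raw body and compare in constant time."""
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    received = signature_header.strip().lower()  # assumed hex encoding
    return hmac.compare_digest(expected.encode("ascii"), received.encode("utf-8"))


def parse_tenant_domains(raw: str) -> Dict[str, str]:
    """Parse 'acme.com:acme,*:default' into {'acme.com': 'acme', '*': 'default'}."""
    mapping: Dict[str, str] = {}
    for pair in raw.split(","):
        domain, sep, tenant = pair.strip().partition(":")
        if sep and domain.strip() and tenant.strip():
            mapping[domain.strip().lower()] = tenant.strip()
    return mapping


def tenant_for_sender(sender: str, mapping: Dict[str, str]) -> Optional[str]:
    """Map a From header to a tenant, falling back to the assumed '*' entry."""
    _, address = parseaddr(sender)
    domain = address.rsplit("@", 1)[-1].lower()
    return mapping.get(domain, mapping.get("*"))


if __name__ == "__main__":
    domains = parse_tenant_domains("acme.com:acme,*:default")
    print(tenant_for_sender("Ann <ann@acme.com>", domains))
    print(tenant_for_sender("bob@other.org", domains))
```

The signature must be computed over the raw request bytes before any JSON or form parsing, since re-serialized payloads rarely match byte for byte.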

LLM response caching

The final `/api/ask` response is cached for `(tenant, normalized_question)`, where normalization is `.strip().lower()`.
Keys look like `llm_resp:{tenant}:{sha256(question)[:16]}`, so the raw question is not stored in Redis.
Uploads invalidate the tenant namespace `llm_resp:{tenant}:*`.
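
A minimal sketch of the key scheme above. It assumes the question is UTF-8 encoded before hashing and that the 16 characters are taken from the hex digest; the function names are hypothetical.

```python
import hashlib


def llm_cache_key(tenant: str, question: str) -> str:
    """Build llm_resp:{tenant}:{sha256(question)[:16]} from the normalized question."""
    normalized = question.strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]
    return f"llm_resp:{tenant}:{digest}"


def tenant_invalidation_pattern(tenant: str) -> str:
    """Glob pattern covering every cached response of one tenant."""
    return f"llm_resp:{tenant}:*"


if __name__ == "__main__":
    print(llm_cache_key("acme", "  How do I reset my password? "))
    print(tenant_invalidation_pattern("acme"))
```

Because normalization is only `.strip().lower()`, two questions that differ in inner whitespace or punctuation produce different keys and are cached separately.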

Providers

Provider routing is configured through `config/providers.yml`, which defines:

enabled providers (`ollama`, `gracekelly`, `mistral`)
model aliases such as `ollama-small`, `gk-fast`, and `mistral-small-latest`
per-model input/output pricing, rate limits, and capability flags
routing profiles `local-first`, `gracekelly-primary`, `gracekelly-mixed`, and `external-mistral`

Runtime behavior:

`gracekelly-primary` is the default profile and routes both tiers through the local GraceKelly orchestrator.
`local-first` is the explicit Ollama-only profile and keeps both fast/strong lanes on Ollama.
`gracekelly-primary` falls back only to the declared Ollama fallback when GraceKelly is unavailable and failover is enabled.
`gracekelly-mixed` keeps browser-backed strong answer generation on GraceKelly while routing fast helper/evaluator calls through direct Mistral; use it only for explicit live benchmark runs.
`external-mistral` uses the direct Mistral API and is the intended non-local deployment option when GraceKelly is not present.
Startup validation loads the registry, verifies `LLM_PROVIDER_PROFILE`, and treats placeholder credentials such as `changeme` as missing.
Each traced LLM step records `provider_name`, `model_name`, token usage, and cost; Prometheus exports `llm_cost_usd_total{provider,model,tenant}`.
Automatic failover events are exported as `llm_provider_fallback_total{from_provider,to_provider,reason}`.
`mistral-small` is GraceKelly’s local fast-lane model name; use `mistral-small-latest` when you want the direct Mistral alias.
The admin UI exposes a Providers tab backed by `GET /api/admin/providers`, including active profile, configured providers, 1-minute usage, 24-hour cost, and the last successful call timestamp.

GraceKelly provider

`gracekelly-primary` is intended for local setups where `D:\GraceKelly\` runs on `http://127.0.0.1:8011`.
The provider uses `GET /healthz/ready` before the first request and calls `POST /api/v1/smart` with `reliability_level=quick`.
If GraceKelly is down or times out, the runtime switches only to the declared local fallback (`ollama`) and caches that decision for `FAILOVER_FALLBACK_CACHE_SECONDS`. Ollama is not otherwise required by the default health path.
GraceKelly calls are treated as proxy/orchestrator traffic, so `cost_usd` remains `0.0` in local traces.
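
The failover behaviour above (probe readiness, switch to the declared fallback, cache that decision for `FAILOVER_FALLBACK_CACHE_SECONDS`) can be sketched as a small selector. The class name `FailoverSelector`, its constructor arguments, and the error raised when failover is disabled are assumptions; the readiness probe is passed in as a callable so the sketch stays independent of any HTTP client.

```python
import time
from typing import Callable, Optional


class FailoverSelector:
    """Choose between 'gracekelly' and the local fallback 'ollama'."""

    def __init__(
        self,
        primary_ready: Callable[[], bool],  # e.g. a GET /healthz/ready probe
        cache_seconds: float = 300.0,       # FAILOVER_FALLBACK_CACHE_SECONDS
        enabled: bool = True,               # FAILOVER_CHAIN_ENABLED
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._primary_ready = primary_ready
        self._cache_seconds = cache_seconds
        self._enabled = enabled
        self._clock = clock
        self._fallback_until: Optional[float] = None

    def choose(self) -> str:
        now = self._clock()
        # While a cached fallback decision is valid, skip probing the primary.
        if self._fallback_until is not None and now < self._fallback_until:
            return "ollama"
        if self._primary_ready():
            self._fallback_until = None
            return "gracekelly"
        if not self._enabled:
            # Assumed behaviour when failover is switched off.
            raise RuntimeError("GraceKelly is unavailable and failover is disabled")
        self._fallback_until = now + self._cache_seconds
        return "ollama"


if __name__ == "__main__":
    selector = FailoverSelector(primary_ready=lambda: False)
    print(selector.choose())
```

Caching the decision avoids paying the health-check timeout on every request while GraceKelly is down, at the cost of staying on the fallback for up to the cache window after it recovers.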

Mistral provider

`external-mistral` is the direct Mistral fallback for deployments where GraceKelly is unavailable.
The provider uses `POST https://api.mistral.ai/v1/chat/completions` with OpenAI-compatible chat payloads and reads token usage from `usage.prompt_tokens` / `usage.completion_tokens`.
Placeholder `MISTRAL_API_KEY=changeme` is treated as missing both in startup validation and in the provider constructor.
`DAILY_COST_LIMIT_USD` applies to the direct Mistral profile and blocks new runtime creation after the current UTC-day spend is exhausted.
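
To make the cost accounting concrete, the sketch below reads the two usage fields from an OpenAI-compatible response body, prices them per 1M tokens, and compares the day's spend with `DAILY_COST_LIMIT_USD`. The function names and the prices in the usage example are hypothetical; real prices come from `config/providers.yml` or the `LLM_*_PRICE_PER_1M_TOKENS` fallbacks.

```python
from typing import Any, Mapping, Tuple


def read_usage(response: Mapping[str, Any]) -> Tuple[int, int]:
    """Extract (prompt_tokens, completion_tokens); missing fields count as 0."""
    usage = response.get("usage") or {}
    return int(usage.get("prompt_tokens", 0)), int(usage.get("completion_tokens", 0))


def call_cost_usd(
    prompt_tokens: int,
    completion_tokens: int,
    input_price_per_1m: float,
    output_price_per_1m: float,
) -> float:
    """Cost of one call given per-1M-token input and output prices."""
    total = prompt_tokens * input_price_per_1m + completion_tokens * output_price_per_1m
    return total / 1_000_000


def daily_limit_reached(spent_today_usd: float, daily_limit_usd: float = 5.0) -> bool:
    """True once tracked spend for the current UTC day reaches the limit."""
    return spent_today_usd >= daily_limit_usd


if __name__ == "__main__":
    response = {"usage": {"prompt_tokens": 1200, "completion_tokens": 300}}
    prompt, completion = read_usage(response)
    # Hypothetical prices, for illustration only.
    print(call_cost_usd(prompt, completion, input_price_per_1m=0.2, output_price_per_1m=0.6))
    print(daily_limit_reached(4.99), daily_limit_reached(5.0))
```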