
Configuration — RAG Support Assistant

This is the full runtime-configuration reference, moved out of the top-level README to keep the README scannable.

Copy .env.example to .env, then adjust only what your deployment needs.

| Variable | Default | Description |
| --- | --- | --- |
| OLLAMA_BASE_URL | http://localhost:11434 | Base URL for explicit local-first Ollama mode or GraceKelly fallback |
| OLLAMA_MODEL_NAME | qwen2.5:7b | Primary Ollama model when LLM_PROVIDER_PROFILE=local-first |
| OLLAMA_FAST_MODEL_NAME | llama3.2:3b | Faster Ollama model for explicit local helper/tool flows |
| MODEL_ROUTING_ENABLED | false | Enable simple/complex/global model routing |
| OLLAMA_REQUEST_TIMEOUT_SEC | 60 | Timeout for a single Ollama HTTP request |
| REQUIRE_OLLAMA | false | Fail fast at startup if explicit Ollama mode/fallback validation requires Ollama |
| LANGFUSE_PUBLIC_KEY | - | Optional Langfuse public key |
| LANGFUSE_SECRET_KEY | - | Optional Langfuse secret key |
| LANGFUSE_HOST | https://cloud.langfuse.com | Langfuse host for LLM observability |

| Variable | Default | Description |
| --- | --- | --- |
| PROVIDER_REGISTRY_PATH | config/providers.yml | YAML registry with providers, pricing, capabilities, and routing profiles |
| LLM_PROVIDER_PROFILE | gracekelly-primary | Active routing profile; defaults to the local GraceKelly orchestrator |
| LLM_BENCHMARK_ALLOW_PAID_APIS | false | Backward-compatible flag that allows live external-provider calls in provider benchmarks |
| DAILY_COST_LIMIT_USD | 5.0 | Fail fast when tracked direct-provider spend for the current UTC day reaches this limit |
| MISTRAL_API_KEY | changeme | Direct Mistral API key; placeholder values are treated as missing |
| GRACEKELLY_BASE_URL | http://127.0.0.1:8011 | Base URL for the local GraceKelly orchestrator |
| GRACEKELLY_API_KEY | - | Optional GraceKelly bearer token for non-public endpoints |
| GRACEKELLY_API_KEY_ENV | GRACEKELLY_API_KEY | Env var name used by the runtime to look up the optional GraceKelly API key |
| GRACEKELLY_HEALTH_CHECK_TIMEOUT_SEC | 2.0 | Readiness-probe timeout before GraceKelly is considered unavailable |
| GRACEKELLY_REQUEST_TIMEOUT_SEC | 30.0 | Timeout for a single GraceKelly /api/v1/smart call |
| FAILOVER_CHAIN_ENABLED | true | Enable GraceKelly -> Ollama automatic failover for profiles that declare a local fallback |
| FAILOVER_FALLBACK_CACHE_SECONDS | 300 | Cache a successful local fallback decision for this many seconds |

| Variable | Default | Description |
| --- | --- | --- |
| RAG_EMBEDDING_MODEL | BAAI/bge-m3 | Embedding model used for documents and queries (local backend) |
| RAG_EMBEDDING_BACKEND | local | local = SentenceTransformer on RAG_DEVICE; remote = OpenAI/Mistral-compatible embeddings API. Remote frees ingest/search from loading the heavy local model (e.g. unblocks Windows under the 1-GiB/process rule). Remote vectors are L2-normalized to match the local path |
| RAG_EMBEDDING_REMOTE_URL | https://api.mistral.ai/v1/embeddings | Remote embeddings endpoint (OpenAI-compatible {model, input:[...]}) |
| RAG_EMBEDDING_REMOTE_MODEL | mistral-embed | Remote embedding model name |
| RAG_EMBEDDING_REMOTE_API_KEY_ENV | MISTRAL_API_KEY | Name of the env var holding the remote API key (the key itself is never stored in settings/logs) |
| RAG_EMBEDDING_REMOTE_BATCH | 32 | Inputs per remote embeddings request |
| RAG_EMBEDDING_REMOTE_TIMEOUT_SEC | 60 | Timeout for a single remote embeddings request |
| RAG_RERANKER_MODEL | BAAI/bge-reranker-v2-m3 | Multilingual cross-encoder reranker (pairs with BGE-M3) |
| RAG_HYBRID_SEARCH | true | Combine BM25 with vector retrieval |
| RAG_RETRIEVAL_STRATEGY | hybrid | Retrieval strategy: vector, hybrid, graph, or factcard; graph and factcard fall back to hybrid when their store is absent. factcard (opt-in) serves whole fact-cards for enumeration queries (fields/documents/conditions) and closes the customs-clearance-fields recall gap; build the collection with scripts/build_factcards.py. Auto-routing into factcard is intentionally NOT default (NO-SHIP pending Phase-5 offline-delta; see docs/operations/2026-06-14-adaptive-retrieval-closure.md) |
| RAG_RETRIEVAL_TOP_K | 20 | Candidate documents fetched before reranking |
| RAG_RERANK_TOP_K | 5 | Final document count after reranking |
| RRF_K | 60 | Reciprocal Rank Fusion smoothing constant |
| RRF_DOC_KEY_CHARS | 200 | Prefix length used to deduplicate RRF document keys |
| QUALITY_THRESHOLD | 80 | Default quality threshold used by routing/evaluation logic |
| CHUNK_SIZE | 800 | Default chunk size for ingestion |
| CHUNK_OVERLAP | 200 | Default chunk overlap for ingestion |
| API_DEFAULT_PAGE_SIZE | 50 | Default page size for list-style admin endpoints |
| RAG_SEMANTIC_CHUNKING | true | Enable semantic chunking |
| RAG_CONTEXTUAL_HEADERS | true | Prepend contextual headers during ingestion. Cheap by default (build_vector_store derives headers from chunk metadata, with no LLM or network calls). The LLM-generated variant runs only when INGESTION_BATCH_ENABLED=true, and then per document, not per chunk |
| INGESTION_CONTEXTUAL_CONCURRENCY | 4 | Bounded concurrency for the LLM contextual-header fallback (providers without a native batch API). 1 = strictly serial. Caps in-flight requests so a full-corpus ingest cannot fan out unbounded provider calls. Progress is logged as [contextual_headers] i/N |
| RAG_AGENTIC_MODE | false | Enable the tool-calling agent graph |
| RAG_HYDE | false | Enable Hypothetical Document Embeddings |
| RAG_PARENT_CHILD | false | Enable parent-child chunking |
| RAG_STRUCTURAL_CHUNKING | true | Split markdown by headers (sections), cap to CHUNK_SIZE |
| RAG_PARENT_EXPANSION | true | Post-rerank: supplement final chunks with neighbouring sections of their source |
| RAG_PARENT_EXPANSION_WINDOW | 2 | Sections taken from each side of a selected chunk |
| RAG_PARENT_EXPANSION_MAX_CHARS | 3600 | Cap on expanded chunk text (core + neighbours) |
| RAG_GRAPH_RETRIEVAL | off | Graph-lane activation gate: off/on/auto; condition evaluated & logged at ingestion (lane itself = Phase 2, not built) |
| RAG_GRAPH_MIN_CHUNKS | 20000 | auto: minimal chunk count to consider the graph lane |
| RAG_GRAPH_MIN_CROSSDOC_SHARE | 0.15 | auto: minimal cross-doc entity share (connectivity gate) |
| RAG_GRAPH_CROSSDOC_SHARE | unset | Measured probe value (scripts/graph_probe.py; 2026-06-06 corpus: 0.296, gate passed); unset = probe not run, auto stays off |
| RAG_ASK_BUDGET_SEC | 0 | Optional wall-clock budget for a single ConversationSession.ask() outside the HTTP path (which already has request_timeout_sec). 0 = off (blocking). When >0 and exceeded, ask() returns a graceful degraded result (route="timeout") instead of hanging on a flapping provider; the background run is not cancellable |
| RAG_SELF_RAG_MAX_ITER | 2 | Maximum Self-RAG iterations |
| RAG_SELF_RAG_MIN_QUALITY | 70 | Minimum quality score to avoid retry/escalation |
| STREAMING_QUALITY_EVAL | true | Streaming /api/ask/stream runs one cheap Self-RAG self-eval so streamed answers are quality-routed on par with non-streaming; set false to roll back to the legacy synthetic-score streaming path |
| FACT_VERIFICATION_ENABLED | true | Run fact verification after generation |
| FACT_VERIFICATION_MIN_SCORE | 70 | Minimum factuality score threshold |
| FACT_VERIFY_CONTEXT_MAX_DOCS | 5 | Max retrieved docs used as evidence when verifying answer facts |
| FACT_VERIFY_CONTEXT_CHARS_PER_DOC | 3600 | Chars per doc used as fact-verification evidence; aligned with RAG_PARENT_EXPANSION_MAX_CHARS so verification sees full parent-expanded chunks |
| SLOW_TRACE_THRESHOLD_MS | 10000 | Trace-duration threshold for review queue collection |
| THRESHOLD_ANALYSIS_MIN_LABELS | 20 | Minimum labeled traces required before suggesting a new threshold |
| REVIEW_QUEUE_ENABLED | true | Enable review queue builder and admin endpoints |
| ONLINE_EVALUATORS_ENABLED | true | Enable lightweight per-trace online evaluators, persistence, and admin views. When persistence fails (e.g. Postgres unreachable in a standalone graph run), the first failure logs at WARNING and identical repeats drop to DEBUG (one signal per process, not one per request); answers are unaffected |
| ONLINE_EVALUATORS_TIMEOUT_SEC | 1.0 | Per-trace online-evaluator wall-clock budget; runs that exceed it are dropped and counted in rag_online_evaluators_dropped_total{reason} |
| REGRESSION_GATE_MAX_REGRESSIONS | 2 | Maximum allowed curated regressions before the gate fails |
| REGRESSION_GATE_MIN_PASS_RATE | 0.85 | Minimum candidate pass rate required by the regression gate |
| RAG_VECTOR_BACKEND | chroma | Vector store backend |
| VECTORDB_COLLECTION_PREFIX | rag_docs | Chroma collection prefix; full name is {prefix}_{tenant_id} |
| CATEGORIES_CONFIG_PATH | config/categories.yml | Taxonomy file for upload auto-categorization |
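
To make RRF_K and RRF_DOC_KEY_CHARS more concrete, here is a minimal sketch of Reciprocal Rank Fusion over a BM25 ranking and a vector ranking. It uses the standard RRF formula (each list contributes 1 / (k + rank), with ranks starting at 1); the function name `rrf_fuse` and the choice to key documents by their first RRF_DOC_KEY_CHARS characters are illustrative assumptions, not a copy of the project's code.

```python
from collections import defaultdict

RRF_K = 60               # documented default smoothing constant
RRF_DOC_KEY_CHARS = 200  # documented default prefix length for dedup keys


def rrf_fuse(rankings, k=RRF_K, key_chars=RRF_DOC_KEY_CHARS):
    """Fuse several ranked lists of document texts into one scored list.

    Hypothetical helper: documents are identified by a text prefix, and each
    ranking adds 1 / (k + rank) to the document's score (rank starts at 1).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, text in enumerate(ranking, start=1):
            scores[text[:key_chars]] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


bm25_hits = ["doc A", "doc B", "doc C"]
vector_hits = ["doc B", "doc D", "doc A"]
fused = rrf_fuse([bm25_hits, vector_hits])
print([key for key, _ in fused])  # ['doc B', 'doc A', 'doc D', 'doc C']
```

A larger k flattens the difference between adjacent ranks, so documents that appear in both lists tend to win over documents ranked first in only one.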

Resilience layers apply in this order: timeout -> retry -> circuit breaker -> bounded concurrency -> request wall-time

| Variable | Default | Description |
| --- | --- | --- |
| OLLAMA_RETRY_MAX_ATTEMPTS | 3 | Retry attempts including the first call; 1 disables retries |
| OLLAMA_RETRY_BASE_DELAY_SEC | 0.5 | Base retry delay |
| OLLAMA_RETRY_MAX_DELAY_SEC | 5.0 | Maximum retry delay |
| OLLAMA_RETRY_JITTER | true | Apply jitter to retry delays |
| CIRCUIT_BREAKER_ENABLED | true | Enable circuit-breaker protection for Ollama |
| CIRCUIT_BREAKER_FAILURE_THRESHOLD | 5 | Consecutive failures before the breaker opens |
| CIRCUIT_BREAKER_RESET_TIMEOUT_SEC | 30 | Delay before half-open probing |
| REQUEST_TIMEOUT_SEC | 30 | Wall-time limit for one /api/ask request |
| STREAMING_TIMEOUT_SEC | 120 | Wall-clock budget for the SSE token loop in /api/ask/stream (separate from REQUEST_TIMEOUT_SEC) |
| DB_PERSIST_TIMEOUT_SEC | 2.0 | Timeout for persisting one conversation message to Postgres before the write is dropped and counted in rag_message_persist_failures_total{operation} |
| MAX_CONCURRENT_PIPELINES | 8 | Maximum concurrent /api/ask pipelines |
| PIPELINE_ACQUIRE_TIMEOUT_SEC | 0.5 | How long to wait for a pipeline slot before returning 503 |
| SESSION_TTL_SECONDS | 7200 | Session idle timeout in seconds |
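
The retry settings interact in a way that is easier to see with numbers. The sketch below assumes exponential backoff (the delay doubles on each retry) capped at the maximum delay, with "full jitter" drawn uniformly between 0 and the capped delay. The table only states that there is a base delay, a maximum delay, and optional jitter, so both the doubling and the jitter style are assumptions, and `retry_delay` / `planned_delays` are hypothetical names.

```python
import random


def retry_delay(retry_number, base_delay=0.5, max_delay=5.0, jitter=True, rng=None):
    """Seconds to wait before retry `retry_number` (1 = first retry).

    Assumed policy: base_delay * 2 ** (retry_number - 1), capped at max_delay,
    then optionally replaced by a uniform draw in [0, capped].
    """
    capped = min(max_delay, base_delay * (2 ** (retry_number - 1)))
    if not jitter:
        return capped
    source = rng if rng is not None else random
    return source.uniform(0.0, capped)


def planned_delays(max_attempts=3, **kwargs):
    """OLLAMA_RETRY_MAX_ATTEMPTS counts the first call, so there are
    max_attempts - 1 waits; max_attempts=1 means no retries at all."""
    return [retry_delay(n, **kwargs) for n in range(1, max_attempts)]


print(planned_delays(3, jitter=False))  # [0.5, 1.0]
print(planned_delays(1, jitter=False))  # []
```

With the documented defaults this gives at most two waits per request, which stays well inside the 30-second REQUEST_TIMEOUT_SEC budget as long as individual calls fail quickly.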

| Variable | Default | Description |
| --- | --- | --- |
| API_KEY | - | Legacy X-API-Key protection for API endpoints; JWT is preferred |
| ADMIN_USERNAME | admin | Username for /api/auth/login |
| ADMIN_PASSWORD_HASH | - | Bcrypt password hash; if empty, dev mode accepts admin/admin |
| JWT_SECRET | dev-secret-change-in-production! | Secret for access/refresh tokens |
| JWT_ACCESS_TTL | 3600 | Access-token TTL in seconds |
| JWT_REFRESH_TTL | 604800 | Refresh-token TTL in seconds |
| SESSION_SECRET_KEY | JWT_SECRET fallback | Secret used by SessionMiddleware and OIDC state cookies |
| GOOGLE_OIDC_CLIENT_ID | - | Google OIDC client ID |
| GOOGLE_OIDC_CLIENT_SECRET | - | Google OIDC client secret |
| AZURE_OIDC_TENANT | - | Azure AD tenant used for issuer discovery |
| AZURE_OIDC_CLIENT_ID | - | Azure AD OIDC client ID |
| AZURE_OIDC_CLIENT_SECRET | - | Azure AD OIDC client secret |
| TENANT_EMAIL_DOMAINS | "" | Domain-to-tenant mapping, for example acme.com:tenant-acme,beta.io:tenant-beta |
| RAG_ENV | development | development, staging, or production |
| CORS_ORIGINS | * | Comma-separated allowed origins; * is forbidden in production |
| CORS_MAX_AGE_SEC | 600 | Preflight cache TTL |
| MAX_REQUEST_BODY_BYTES | 1048576 | 1 MiB request-body limit for non-upload endpoints |
| MAX_UPLOAD_BYTES | 52428800 | 50 MiB upload limit for /api/upload |
| ALLOW_ANONYMOUS_ADMIN | - | Opt-in escape hatch when API_KEY is empty: set to 1/true to permit anonymous admin (otherwise endpoints return HTTP 503). Local-dev only. Added in the 2026-04-26 audit. |
| HOST | 127.0.0.1 (bare run) | Used only when launching via python main.py. Default Docker Compose is local-dev only and binds host ports to 127.0.0.1. |
| PORT | 8000 | Same as HOST: bare run only. |
| UVICORN_RELOAD | false | python main.py only: enable uvicorn auto-reload. Default off is headless-safe: auto-reload restarts the API on any write under data//demo/, which flaps headless ingest/eval runs. Set true for the local dev loop. |
| AUTO_MIGRATE | true | Run alembic upgrade head in startup lifespan. In production, errors abort startup unless AUTO_MIGRATE_FAIL_OPEN=true is explicitly set. |
| AUTO_MIGRATE_FAIL_OPEN | false | Production escape hatch for temporarily logging migration failures instead of aborting startup. |
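
The TENANT_EMAIL_DOMAINS format (comma-separated domain:tenant pairs) can be illustrated with a small parser. This is a sketch under assumptions: the helper names are hypothetical, domains are compared case-insensitively, and a `*` entry is treated as a catch-all, which is how the `*:default` example elsewhere in this reference reads but is not confirmed here.

```python
from typing import Dict, Optional


def parse_tenant_domains(raw: str) -> Dict[str, str]:
    """Parse 'acme.com:tenant-acme,beta.io:tenant-beta' into a dict."""
    mapping: Dict[str, str] = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty input and trailing commas
        domain, sep, tenant = pair.partition(":")
        if not sep or not domain.strip() or not tenant.strip():
            raise ValueError(f"expected domain:tenant, got {pair!r}")
        mapping[domain.strip().lower()] = tenant.strip()
    return mapping


def tenant_for_email(email: str, mapping: Dict[str, str]) -> Optional[str]:
    """Resolve a tenant from the sender's domain; '*' is an assumed catch-all."""
    domain = email.rsplit("@", 1)[-1].strip().lower()
    return mapping.get(domain, mapping.get("*"))


mapping = parse_tenant_domains("acme.com:tenant-acme,beta.io:tenant-beta")
print(tenant_for_email("Bob@ACME.com", mapping))  # tenant-acme
```

Rejecting malformed pairs at parse time keeps a typo in the env var from silently routing mail to the wrong tenant.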

| Variable | Default | Description |
| --- | --- | --- |
| POSTGRES_PASSWORD | rag_dev_password | Local Compose password for the Postgres container |
| DATABASE_URL | postgresql://rag:rag_dev_password@localhost:5432/rag_assistant | Postgres DSN for sessions, audit, analytics, and copilot data |
| DB_ENCRYPTION_KEY | dev fallback | Key used by pgcrypto; required in production and for migration 008 |
| REDIS_URL | redis://localhost:6379/0 | Redis cache URL |
| LLM_CACHE_ENABLED | false | Enable tenant-scoped response caching for /api/ask |
| LLM_CACHE_TTL_SECONDS | 3600 | TTL for cached LLM responses |
| OTEL_ENABLED | false | Enable OpenTelemetry SDK + instrumentation |
| OTEL_EXPORTER_OTLP_ENDPOINT | http://localhost:4317 | OTLP gRPC endpoint for Jaeger/Tempo/collectors |
| OTEL_SERVICE_NAME | rag-support-assistant | service.name resource attribute |
| LLM_INPUT_PRICE_PER_1M_TOKENS | 0.0 | Fallback input-token price when a model is not listed in the provider registry |
| LLM_OUTPUT_PRICE_PER_1M_TOKENS | 0.0 | Fallback output-token price when a model is not listed in the provider registry |
| LLM_MODEL_PRICES | - | Optional JSON override for legacy analytics or unregistered models |
| LLM_COST_PER_1M_TOKENS | legacy fallback | Backward-compatible legacy pricing format kept for old local setups |
| TRACE_RETENTION_DAYS | 90 | Retention window for SQLite traces; 0 disables purge |
| TRACE_PURGE_INTERVAL_SEC | 86400 | Background trace-purge interval |
| AUDIT_RETENTION_DAYS | 180 | Retention window for audit_log |
| AUDIT_PURGE_INTERVAL_SEC | 86400 | Background audit purge interval |
| SHUTDOWN_READY_DELAY_SEC | 5 | Drain delay between readiness flip and shutdown |
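
The fallback prices are expressed per one million tokens, so a per-call cost is a simple weighted sum. The sketch below shows the arithmetic only; `llm_cost_usd` is a hypothetical helper, and the prices in the usage line are made-up illustration values, not real provider pricing.

```python
def llm_cost_usd(prompt_tokens, completion_tokens,
                 input_price_per_1m=0.0, output_price_per_1m=0.0):
    """Cost of one call in USD given per-1M-token prices.

    The 0.0 defaults mirror LLM_INPUT_PRICE_PER_1M_TOKENS and
    LLM_OUTPUT_PRICE_PER_1M_TOKENS, so unregistered models cost nothing
    until a price is configured.
    """
    total = prompt_tokens * input_price_per_1m + completion_tokens * output_price_per_1m
    return total / 1_000_000


# Illustrative prices only: 0.20 USD per 1M input, 0.60 USD per 1M output tokens.
cost = llm_cost_usd(12_000, 800, input_price_per_1m=0.20, output_price_per_1m=0.60)
print(round(cost, 5))  # 0.00288
```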

| Variable | Default | Description |
| --- | --- | --- |
| SUPPORT_SINK_BACKEND | local | Escalation backend: local or bitrix |
| BITRIX_WEBHOOK_URL | - | Bitrix24 webhook URL |
| TELEGRAM_BOT_TOKEN | - | Optional Telegram bot token |
| ALERT_WEBHOOK_URL | - | Webhook used by scripts/check_alerts.py |
| ALERT_ESCALATION_PCT | 35 | Escalation-rate alert threshold over 24h |
| ALERT_QUALITY_MIN | 65 | Minimum 7-day average quality |
| ALERT_LOW_QUALITY_PCT | 30 | Threshold for low-quality answer share |
| ALERT_P95_LATENCY_SEC | 12 | 24h p95 latency alert threshold |
| ALERT_THUMBS_DOWN_PCT | 20 | 7-day thumbs-down threshold |
| ALERT_THUMBS_DOWN_MIN_N | 50 | Minimum feedback volume before thumbs-down alerts trigger |
| REPORT_SLACK_WEBHOOK | - | Slack webhook for weekly reports |
| REPORT_EMAIL_RECIPIENTS | "" | Comma-separated email list for weekly reports |
| REPORT_SMTP_HOST | SMTP_HOST fallback | SMTP host override for weekly reports |
| REPORT_SMTP_PORT | SMTP_PORT fallback or 587 | SMTP port override for weekly reports |
| REPORT_SMTP_USER | SMTP_USER fallback | SMTP user override for weekly reports |
| REPORT_SMTP_PASS | SMTP_PASS fallback | SMTP password override for weekly reports |
| BACKLOG_WEIGHT_REVIEW_BAD | 3.0 | Impact weight for confirmed-bad review backlog items |
| BACKLOG_WEIGHT_THUMBS_DOWN | 2.0 | Impact weight for thumbs-down backlog items |
| BACKLOG_WEIGHT_SLOW | 1.5 | Impact weight for slow-endpoint backlog items |
| BACKLOG_WEIGHT_FRESHNESS | 1.0 | Impact weight for stale-document backlog items |
| BACKLOG_WEIGHT_EVALUATOR_DRIFT | 2.5 | Impact weight for evaluator drift backlog items |
| BACKLOG_MAX_ITEMS | 30 | Maximum number of improvement backlog items kept after ranking |
| BACKLOG_FRESHNESS_MAX_DAYS | 90 | Freshness cutoff for stale-doc backlog items |
| BACKLOG_EMAIL_ENABLED | false | Email the generated backlog to TENANT_ADMIN_EMAIL after each run |
| TENANT_ADMIN_EMAIL | "" | Optional recipient for backlog email delivery |
| EMAIL_CHANNEL_MODE | disabled | Email channel mode: disabled, imap, or webhook |
| IMAP_HOST | "" | IMAP server hostname |
| IMAP_PORT | 993 | IMAP server port |
| IMAP_USER | "" | IMAP username |
| IMAP_PASS | - | IMAP password (IMAP_PASSWORD is also accepted) |
| IMAP_FOLDER | INBOX | IMAP folder polled by scripts/email_poller.py |
| IMAP_POLL_INTERVAL_SEC | 60 | Delay between IMAP polling cycles |
| SMTP_HOST | "" | SMTP hostname for email replies |
| SMTP_PORT | 587 | SMTP port for email replies |
| SMTP_USER | "" | SMTP username |
| SMTP_PASS | - | SMTP password (SMTP_PASSWORD is also accepted) |
| SMTP_FROM_ADDRESS | support@example.com | Default sender address for outbound replies |
| EMAIL_WEBHOOK_SIGNING_SECRET | - | Shared secret used to verify inbound email webhooks (EMAIL_WEBHOOK_SECRET remains a legacy fallback) |
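
Inbound email webhooks are verified with an HMAC-SHA256 of the request body keyed by EMAIL_WEBHOOK_SIGNING_SECRET and carried in the X-Signature header. A sender-side and receiver-side sketch might look like the following. The hex encoding of the digest is an assumption (the encoding is not specified here), and `sign_body` / `verify_signature` are hypothetical names.

```python
import hashlib
import hmac


def sign_body(body: bytes, secret: str) -> str:
    """Compute the value a sender would place in X-Signature.

    Assumption: the digest is hex-encoded.
    """
    return hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()


def verify_signature(body: bytes, header_value: str, secret: str) -> bool:
    """Check a received X-Signature value against the raw request body."""
    expected = sign_body(body, secret).encode("ascii")
    received = header_value.strip().lower().encode("utf-8")
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected, received)


payload = b'{"from": "user@acme.com", "subject": "Help"}'
signature = sign_body(payload, "example-secret")
print(verify_signature(payload, signature, "example-secret"))  # True
```

The signature must be computed over the exact bytes received, before any JSON or form parsing, since re-serialising the payload can change whitespace or key order.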
Email channel notes:

  • IMAP mode runs through scripts/email_poller.py and polls IMAP_FOLDER every IMAP_POLL_INTERVAL_SEC seconds.
  • python scripts/email_poller.py --once is the easiest dev-mode smoke check for one poll cycle.
  • Webhook mode supports SendGrid Inbound Parse style payloads with from, to, subject, text, optional html, and optional raw headers.
  • The webhook accepts POST /webhook/email; /api/channels/email/inbound remains as a compatibility alias.
  • Signatures use HMAC-SHA256(body, EMAIL_WEBHOOK_SIGNING_SECRET) in the X-Signature header.
  • Tenant routing uses the sender email domain from TENANT_EMAIL_DOMAINS, for example TENANT_EMAIL_DOMAINS=acme.com:acme,*:default.
  • Low-quality email answers are persisted into escalated_tickets with status="pending_response" for the existing operator flow.

Response cache notes:

  • The final /api/ask response is cached for (tenant, normalized_question), where normalization is .strip().lower().
  • Keys look like llm_resp:{tenant}:{sha256(question)[:16]}, so the raw question is not stored in Redis.
  • Uploads invalidate the tenant namespace llm_resp:{tenant}:*.
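
The cache-key scheme above can be sketched in a few lines. This assumes the hash is taken over the normalized question encoded as UTF-8 and rendered as hex; the function names are hypothetical.

```python
import hashlib


def normalize_question(question: str) -> str:
    """Normalization as described above: .strip().lower()."""
    return question.strip().lower()


def cache_key(tenant: str, question: str) -> str:
    """Build llm_resp:{tenant}:{sha256(question)[:16]}.

    Assumptions: the normalized question is hashed, UTF-8 encoded, hex digest.
    """
    digest = hashlib.sha256(normalize_question(question).encode("utf-8")).hexdigest()
    return f"llm_resp:{tenant}:{digest[:16]}"


def invalidation_pattern(tenant: str) -> str:
    """Key pattern covering one tenant's namespace, as used after an upload."""
    return f"llm_resp:{tenant}:*"


key_a = cache_key("acme", "  How do I reset my password? ")
key_b = cache_key("acme", "how do i reset my password?")
print(key_a == key_b)  # True
```

Because the tenant is part of the key, two tenants asking the same question never share a cached answer.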

Provider routing is configured through config/providers.yml, which defines:

  • enabled providers (ollama, gracekelly, mistral)
  • model aliases such as ollama-small, gk-fast, and mistral-small-latest
  • per-model input/output pricing, rate limits, and capability flags
  • routing profiles local-first, gracekelly-primary, gracekelly-mixed, and external-mistral

Runtime behavior:

  • gracekelly-primary is the default profile and routes both tiers through the local GraceKelly orchestrator.
  • local-first is the explicit Ollama-only profile and keeps both fast/strong lanes on Ollama.
  • gracekelly-primary falls back only to the declared Ollama fallback when GraceKelly is unavailable and failover is enabled.
  • gracekelly-mixed keeps browser-backed strong answer generation on GraceKelly while routing fast helper/evaluator calls through direct Mistral; use it only for explicit live benchmark runs.
  • external-mistral uses the direct Mistral API and is the intended non-local deployment option when GraceKelly is not present.
  • Startup validation loads the registry, verifies LLM_PROVIDER_PROFILE, and treats placeholder credentials such as changeme as missing.
  • Each traced LLM step now records provider_name, model_name, token usage, and cost; Prometheus exports llm_cost_usd_total{provider,model,tenant}.
  • Automatic failover events are exported as llm_provider_fallback_total{from_provider,to_provider,reason}.
  • mistral-small is GraceKelly’s local fast-lane model name; use mistral-small-latest when you want the direct Mistral alias.
  • The admin UI exposes a Providers tab backed by GET /api/admin/providers, including active profile, configured providers, 1-minute usage, 24-hour cost, and the last successful call timestamp.
  • gracekelly-primary is intended for local setups where D:\GraceKelly\ runs on http://127.0.0.1:8011.
  • The provider uses GET /healthz/ready before the first request and calls POST /api/v1/smart with reliability_level=quick.
  • If GraceKelly is down or times out, the runtime switches only to the declared local fallback (ollama) and caches that decision for FAILOVER_FALLBACK_CACHE_SECONDS. Ollama is not otherwise required by the default health path.
  • GraceKelly calls are treated as proxy/orchestrator traffic, so cost_usd remains 0.0 in local traces.
  • external-mistral is the direct Mistral fallback for deployments where GraceKelly is unavailable.
  • The provider uses POST https://api.mistral.ai/v1/chat/completions with OpenAI-compatible chat payloads and reads token usage from usage.prompt_tokens / usage.completion_tokens.
  • Placeholder MISTRAL_API_KEY=changeme is treated as missing both in startup validation and in the provider constructor.
  • DAILY_COST_LIMIT_USD applies to the direct Mistral profile and blocks new runtime creation after the current UTC-day spend is exhausted.
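
The GraceKelly -> Ollama failover with a cached decision can be pictured as a small selector that probes readiness, switches to the fallback on failure, and skips further probes until the cache window expires. This is a sketch of the idea only: the class name, the injected `primary_ready` probe (standing in for GET /healthz/ready), and the injected clock are all hypothetical, and the sketch ignores FAILOVER_CHAIN_ENABLED and per-profile fallback declarations.

```python
import time
from typing import Callable, Optional


class FailoverSelector:
    """Choose a provider, remembering a fallback decision for a fixed window."""

    def __init__(self, primary_ready: Callable[[], bool],
                 cache_seconds: float = 300.0,  # FAILOVER_FALLBACK_CACHE_SECONDS default
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._primary_ready = primary_ready
        self._cache_seconds = cache_seconds
        self._clock = clock
        self._fallback_until: Optional[float] = None

    def choose(self) -> str:
        now = self._clock()
        # A cached fallback decision wins until it expires; no probe is made.
        if self._fallback_until is not None and now < self._fallback_until:
            return "ollama"
        if self._primary_ready():
            self._fallback_until = None
            return "gracekelly"
        # Primary is down or timed out: fall back and cache the decision.
        self._fallback_until = now + self._cache_seconds
        return "ollama"


selector = FailoverSelector(primary_ready=lambda: False)
print(selector.choose())  # ollama
```

Caching the decision avoids paying the readiness-probe timeout on every request while the primary is down, at the cost of staying on the fallback for up to the cache window after the primary recovers.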