Audit Codex 30.05.2026 - RAG Support Assistant
Audit date: 2026-05-30
Project: D:\RAG_Support_Assistant
Branch: master
HEAD: 4d60479 (ci: clarify weekly report delivery workflow)
Remote: origin/master at the same HEAD
Tracked files: 698 (git ls-files)
rg --files: 648 files, excluding ignored artefacts
JS bundle/key count baseline: not applicable to the main application; the UI is static HTML/JS, and docs-site is a separate Astro subproject.
Compact Note
The previous audit was not used as a source of truth. I reconstructed the state from the current checkout, git status, durable docs and local gates. audit_codex_30_05_26.md already existed as an untracked artefact of an earlier attempt; the current file is fully overwritten with a fresh audit.
Executive Summary
The project is in good engineering shape: the full pytest run is green, coverage is above the gate, the Python lint/type/security gates are green, Helm lint/render is green, and Alembic has a single head revision. Architecturally it is a mature FastAPI + LangGraph RAG service with Postgres/Redis, a Chroma/Qdrant abstraction, provider runtime, observability, evaluation loop, review queue, backup/restore and docs-site.
The main risks are not in the backend gates but at the seams:
- High: DOM XSS in the agent UI via `innerHTML` fed with API data from tickets/messages; the bearer token is stored alongside in `localStorage`, which turns XSS into takeover of an agent/admin session.
- High: `docs-site` has an npm audit high vulnerability in the transitive `devalue@5.8.0`.
- Medium: the production app adds no security headers/CSP and leaves FastAPI docs/OpenAPI enabled by default.
- Medium: `docker-compose.yml` looks dev-only, but publishes app/Postgres/Redis/Ollama/Jaeger on host ports and does not set `RAG_ENV=production`.
- Medium: fail-open startup auto-migration can bring the application up with an incomplete DB schema.
- Medium: `api/app.py` and `agent/graph.py` remain large centers of complexity; coverage is weakest precisely in the critical orchestration/admin areas.
- Low: durable handoff state is stale relative to the current HEAD, which looks like the cause of the "impossible compact".
Verification
| Gate | Result |
|---|---|
| `python -m pytest -p no:schemathesis` | PASS: 748 passed, 5 skipped, 14 warnings, 17:34 |
| `python -m pytest -p no:schemathesis --cov --cov-report=term` | PASS: 748 passed, 5 skipped, 15 warnings, total coverage 71.56%, gate 70% |
| `python -m ruff check .` | PASS |
| strict mypy scope | PASS: 18 source files clean; warning about unused api.app override section |
| `python -m mypy api/app.py --no-incremental --follow-imports=skip` | PASS; warning about unused agent override sections |
| `python -m bandit -r . -ll -c pyproject.toml ...` | PASS: 0 medium/high; 42 low informational |
| `python -m pip_audit ... -r requirements.lock` | PASS: no known Python vulns, 1 ignored Chroma advisory |
| `npm --prefix docs-site run astro -- build` | PASS, warning: /404 conflicts with catch-all route |
| `npm --prefix docs-site audit --audit-level=moderate` | FAIL: 1 high severity vulnerability in devalue |
| `helm lint deploy/helm/ --strict` | PASS |
| `helm template ... --set secrets.existingSecret=ci-placeholder ...` | PASS |
| `python -m alembic heads` | PASS: single head 017 |
Architecture Snapshot
- API entrypoint: `main.py` re-exports `api.app:app`.
- Core app: `api/app.py`, 1,633 LOC, owns startup, middleware, legacy helpers, health probes, vector-store initialization, retention tasks and root API router.
- Routers: extracted under `api/routers/` for conversation, upload, auth, admin ops, review queue, experiments, KB, analytics, agent endpoints and system health.
- RAG graph: `agent/graph.py`, 1,824 LOC, builds LangGraph flow: classify -> transform -> retrieve -> grade -> generate -> verify -> evaluate -> retry/suggest/log.
- Retrieval: Chroma default, Qdrant stub/path, hybrid BM25 + vector, optional reranker, semantic chunking, parent-child retrieval.
- LLM runtime: provider registry in `config/providers.yml`, default profile `gracekelly-primary`, local Ollama fallback, direct Mistral profile available with credential validation.
- Persistence: Postgres via SQLAlchemy async + Alembic 001..017; trace store also uses SQLite for operational traces.
- Auth: JWT + legacy API key; RBAC dependencies; OIDC SSO callback issues cookies.
- UI: static HTML pages under `static/`; docs-site is separate Astro/Starlight subproject.
- Deploy: Dockerfile, docker-compose, Helm chart, GitHub Actions CI, Pages docs workflow and weekly-report workflow.
Findings
H1 - Agent UI DOM XSS can steal bearer tokens
Severity: High
Files:
- `static/agent.html:162-165`
- `static/agent.html:183-185`
- `static/agent.html:248-250`
- `api/routers/agent.py:225-240`
- `api/routers/agent.py:297-321`
- `static/agent.html:126-128`
- `static/agent.html:301-303`
Evidence:
- The backend returns raw `user_question`, `operator_response`, `message.content`, and similar-ticket fields from the DB.
- The agent UI renders those fields with `innerHTML`.
- The same page reads/writes `agent_token` in `localStorage`.
Impact:
An escalated user question, stored conversation message, or operator response containing HTML/JS can execute in an agent’s browser. Because the bearer token is in localStorage, this can become account/session takeover for agent/admin workflows.
Recommended fix:
- Replace dynamic `innerHTML` paths with `textContent` and DOM node construction.
- Keep `innerHTML` only for hardcoded static SVG/empty-state snippets.
- Add a regression test that injects `<img onerror=...>`/`<script>`-like ticket content and asserts it is rendered as text.
- Add CSP after the UI is cleaned up.
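A complementary static guard can sit next to the browser-level regression test recommended above; it does not replace it. The sketch below is a regex heuristic, not a JavaScript parser: it flags every `innerHTML` assignment whose right-hand side is not a single plain string literal. The function name and the sample snippet are hypothetical, and the heuristic is deliberately conservative (template literals and literals containing `;` are flagged too).

```python
import re

# Matches `.innerHTML = <rhs>` or `.innerHTML += <rhs>` (but not `==`),
# capturing the right-hand side up to the end of the statement or line.
_INNER_HTML = re.compile(r"\.innerHTML\s*\+?=(?!=)\s*(?P<rhs>[^;\n]*)")

# Static means: exactly one quoted literal, no concatenation, no escapes.
# Template literals are rejected because they may interpolate values.
_STATIC_LITERAL = re.compile(r"""^(?:'[^'\\]*'|"[^"\\]*")$""")


def find_dynamic_inner_html(source: str) -> list[tuple[int, str]]:
    """Return (line_number, rhs) for innerHTML assignments that are not plain literals."""
    findings = []
    for match in _INNER_HTML.finditer(source):
        rhs = match.group("rhs").strip()
        if not _STATIC_LITERAL.match(rhs):
            line_number = source.count("\n", 0, match.start()) + 1
            findings.append((line_number, rhs))
    return findings
```

In a test, the guard could read `static/agent.html` and assert that the returned list is empty once the dynamic paths have moved to `textContent`.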
H2 - docs-site has a high npm vulnerability
Severity: High
Files:
- `docs-site/package-lock.json:2469`
- `docs-site/package-lock.json:3484-3488`
Evidence:
- `npm --prefix docs-site audit --audit-level=moderate` reports `devalue 5.6.3 - 5.8.0`, severity high, `GHSA-77vg-94rm-hx3p`.
- Current lock has `devalue@5.8.0`.
- `npm --prefix docs-site outdated` shows available updates: `astro 6.3.0 -> 6.4.2`, `@astrojs/starlight 0.39.1 -> 0.39.2`, `yaml 2.8.4 -> 2.9.0`.
Impact:
The public docs build may carry a known high-risk transitive dependency. Even if the app backend is clean by pip-audit, Node supply-chain security is not currently green.
Recommended fix:
- Run a controlled docs-site dependency bump, starting with `npm audit fix` or explicit Astro/Starlight updates.
- Commit `docs-site/package-lock.json`.
- Add `npm audit --audit-level=moderate` to `docs-site.yml` or CI.
M1 - Missing CSP/security headers and public OpenAPI surface
Severity: Medium
Files:
- `api/app.py:1720`
- `api/app.py:1816-1825`
- `api/app.py:1893-1896`
Evidence:
- `FastAPI(...)` is created without production-specific `docs_url`, `redoc_url`, or `openapi_url` controls, so `/docs`, `/redoc`, and `/openapi.json` are available by default.
- Middleware adds `X-Request-Id`, but no CSP, `X-Frame-Options`, `X-Content-Type-Options`, HSTS, Referrer-Policy, or Permissions-Policy.
- Static files are mounted directly via `StaticFiles`.
Impact:
The app exposes route/schema metadata and has no browser-side mitigation against the XSS class found above. This is especially relevant because the project ships admin/agent static pages.
Recommended fix:
- Add production settings for docs/OpenAPI exposure.
- Add security headers middleware, at minimum CSP after inline script cleanup, `X-Content-Type-Options: nosniff`, `Referrer-Policy`, frame policy and HSTS when behind HTTPS.
- Consider serving admin/agent UI behind auth-gated routes only.
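One possible shape for these two fixes, written as a framework-free ASGI sketch so it does not depend on the project's actual middleware stack. `docs_kwargs`, `SecurityHeadersMiddleware` and the header values are assumptions for illustration; the CSP string in particular is a placeholder to be tuned after inline scripts are removed from the static pages.

```python
# Assumed baseline header set; values are placeholders, not audited policy.
SECURITY_HEADERS = {
    b"x-content-type-options": b"nosniff",
    b"referrer-policy": b"no-referrer",
    b"x-frame-options": b"DENY",
    b"content-security-policy": b"default-src 'self'",
}
HSTS = (b"strict-transport-security", b"max-age=31536000")


def docs_kwargs(env: str) -> dict:
    """Keyword arguments for FastAPI(...): no docs or schema routes in production."""
    if env == "production":
        return {"docs_url": None, "redoc_url": None, "openapi_url": None}
    return {}


class SecurityHeadersMiddleware:
    """Pure ASGI middleware that appends security headers to HTTP responses."""

    def __init__(self, app, https: bool = False):
        self.app = app
        self.https = https  # send HSTS only when the app is served over HTTPS

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        async def send_with_headers(message):
            if message["type"] == "http.response.start":
                headers = list(message.get("headers", []))
                present = {name.lower() for name, _ in headers}
                extra = list(SECURITY_HEADERS.items())
                if self.https:
                    extra.append(HSTS)
                # Headers already set by a route are left untouched.
                headers += [(k, v) for k, v in extra if k not in present]
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_with_headers)
```

Under these assumptions the wiring would look like `app = FastAPI(**docs_kwargs(env))` followed by `app.add_middleware(SecurityHeadersMiddleware, https=True)`, where `env` comes from `RAG_ENV`.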
M2 - docker-compose is unsafe if treated as production
Severity: Medium
Files:
- `docker-compose.yml:5-6`
- `docker-compose.yml:29-37`
- `docker-compose.yml:46-50`
- `docker-compose.yml:59-70`
- `docker-compose.yml:73-76`
Evidence:
- Ollama, Postgres, Redis, Jaeger and app are published to host ports.
- Postgres has fallback password `${POSTGRES_PASSWORD:-rag_dev_password}`.
- App environment does not set `RAG_ENV=production`; it relies on `.env` and defaults.
Impact:
This is fine for local dev, but risky if someone runs compose on a reachable host. In that mode production fail-fast checks for CORS/secrets may never activate, and DB/Redis/Jaeger become network-exposed.
Recommended fix:
- Label compose explicitly as local-dev only in README and compose comments.
- Bind services to `127.0.0.1` or move infra ports behind profiles.
- Add a production compose override only if needed, with `RAG_ENV=production` and no default DB password.
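The loopback-binding recommendation can be checked mechanically. The sketch below classifies a Compose short-syntax port string and treats anything without an explicit loopback host IP as network-reachable. The function name is hypothetical, the port values in the usage note are illustrative rather than taken from the audited file, and long-syntax port mappings are not handled.

```python
import ipaddress


def is_network_reachable(port_spec: str) -> bool:
    """Return True unless a short-syntax port mapping is bound to a loopback address."""
    mapping = port_spec.split("/", 1)[0]  # drop an optional "/tcp" or "/udp" suffix
    parts = mapping.rsplit(":", 2)
    if len(parts) < 3:
        # "8000:8000" or "8000": no host IP, so by default all interfaces are bound.
        return True
    host_ip = parts[0].strip("[]")  # IPv6 hosts are written as "[::1]"
    try:
        return not ipaddress.ip_address(host_ip).is_loopback
    except ValueError:
        # Unparseable host, for example an unexpanded ${VAR}: stay conservative.
        return True
```

For example, `is_network_reachable("5432:5432")` is true while `is_network_reachable("127.0.0.1:5432:5432")` is false, so a CI step could fail when any infra service in the dev compose file reports true.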
M3 - Alembic auto-migration fails open on startup
Severity: Medium
File: `api/app.py:1460-1486`
Evidence:
- `_run_alembic_upgrade()` defaults to `AUTO_MIGRATE=true`.
- On any migration exception, startup logs a warning and continues.
- Lifespan always calls it before vector initialization (`api/app.py:1503`).
Impact:
In production, a failed migration can leave the app serving traffic against an incompatible schema. The CI migration round-trip gate reduces probability, but not runtime failure impact.
Recommended fix:
- In `RAG_ENV=production`, fail startup on migration failure unless an explicit `AUTO_MIGRATE_FAIL_OPEN=true` is set.
- Or disable auto-migrate in production and require a separate migration job as the Helm/CI contract.
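The first option could look roughly like the following. This is a sketch of the policy only, not the existing `_run_alembic_upgrade()`: the `upgrade` callable stands in for the real Alembic call, `run_startup_migrations` is a hypothetical name, and the set of accepted truthy strings is an assumption.

```python
import logging
from collections.abc import Callable, Mapping

logger = logging.getLogger(__name__)


def _flag(env: Mapping[str, str], name: str, default: str) -> bool:
    # Assumed truthy spellings; align with the project's own settings parser.
    return env.get(name, default).strip().lower() in {"1", "true", "yes"}


def run_startup_migrations(upgrade: Callable[[], None], env: Mapping[str, str]) -> bool:
    """Run migrations; return True if applied, False if skipped or tolerated.

    In production a failure is re-raised unless AUTO_MIGRATE_FAIL_OPEN is set.
    """
    if not _flag(env, "AUTO_MIGRATE", "true"):
        return False
    try:
        upgrade()
        return True
    except Exception:
        production = env.get("RAG_ENV", "") == "production"
        if production and not _flag(env, "AUTO_MIGRATE_FAIL_OPEN", "false"):
            logger.error("Migration failed in production; aborting startup")
            raise
        logger.warning("Migration failed; continuing (fail-open)", exc_info=True)
        return False
```

Passing `os.environ` as `env` keeps the function easy to test while preserving today's behaviour outside production.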
M4 - Central modules remain large and under-covered
Severity: Medium
Evidence:
- Largest tracked Python modules:
  - `agent/graph.py`: 1,824 LOC
  - `api/app.py`: 1,633 LOC
  - `api/routers/conversation.py`: 837 LOC
  - `config/settings.py`: 798 LOC
  - `vectordb/_base_manager.py`: 727 LOC
- Fresh coverage weak spots:
  - `api/app.py`: 55%
  - `agent/tools.py`: 37%
  - `auth/oidc.py`: 44%
  - `api/routers/admin_review.py`: 52%
  - `api/routers/analytics.py`: 56%
  - `channels/email_channel.py`: 54%
  - ingestion modules around 54-59%
Impact:
The test suite is broad, but risk concentrates in orchestration, auth/SSO, ingestion, admin review and legacy app helpers. These are exactly areas where regressions can be expensive.
Recommended fix:
- Continue extracting `api/app.py` into startup, health, vector-store, regression/admin service modules.
- Add focused tests for uncovered branches instead of only raising the global threshold.
- Put module-level coverage targets on auth/SSO, admin review and agent tools.
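Module-level targets can be enforced from the JSON report that `coverage json` writes, which lists each file under `files` with a `summary.percent_covered` value. The sketch below is one way to do that; the function name and the threshold numbers in the usage note are hypothetical and would need to be agreed per module.

```python
def check_module_targets(report: dict, targets: dict[str, float]) -> list[str]:
    """Compare per-file coverage from a `coverage json` report against minimums.

    Returns one human-readable message per module that is missing or below target.
    """
    # Normalize Windows separators so targets can always use forward slashes.
    files = {path.replace("\\", "/"): entry for path, entry in report.get("files", {}).items()}
    failures = []
    for path, minimum in targets.items():
        entry = files.get(path)
        if entry is None:
            failures.append(f"{path}: missing from coverage report")
            continue
        covered = entry["summary"]["percent_covered"]
        if covered < minimum:
            failures.append(f"{path}: {covered:.0f}% < {minimum:.0f}%")
    return failures
```

With a target of 60% on `auth/oidc.py`, the 44% measured in this audit would be reported as a failure, while the global 70% gate stays unchanged.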
L1 - Deprecation warnings should be closed before upstream breaks
Severity: Low
Files:
- `agent/graph.py:213-222`
- `llm/providers/ollama.py:104-122`
- `scripts/restore_verify.py:202-203`
- `auth/oidc.py:12-15`
Evidence from pytest:
- LangChain deprecates `Ollama`/`ChatOllama` imports used by the project.
- Authlib warns `authlib.jose` is deprecated in favor of `joserfc`.
- Python 3.14 will change default `tar.extractall` filtering.
- Coverage warns that the C tracer is unavailable in the local Python 3.13 environment.
Impact:
No current failure, but these warnings are calendar-driven maintenance debt.
Recommended fix:
- Move Ollama integrations to `langchain-ollama`.
- Track Authlib/Joserfc migration path.
- Pass an explicit safe tar extraction filter or member validation.
- Check local coverage wheel/install if coverage runtime matters on Windows.
L2 - Durable state docs are stale relative to current HEAD
Severity: Low
Files:
- `AGENT_STATE.md:7-17`
- `AGENT_STATE.md:29-31`
- `next-session-3-subagents.md`
Evidence:
- Current HEAD is `4d60479`.
- `AGENT_STATE.md` still records branch source through `a86b44c`, baseline HEAD `415d4c8`, and 697 tracked files.
- Current tracked file count is 698.
- `next-session-3-subagents.md` also describes the state as closed through `a86b44c`.
Impact:
This is not a runtime defect, but it is likely related to the previous impossible compact: future agents reading durable state can incorrectly chase stale work or stale HEAD assumptions.
Recommended fix:
- Update durable state after this audit only if autonomous handoff docs are still expected to be source of truth.
- Prefer durable docs that avoid volatile HEAD/file-count assertions unless the file is explicitly a snapshot.
L3 - Local ignored artefacts are large and some cache dirs are inaccessible
Severity: Low
Evidence:
- `.mypy_cache`: 22,136 files, 438.75 MB
- `docs-site/node_modules`: 21,262 files, 309.82 MB
- `.tmp`: 1,881 files, 173.42 MB
- `data`: 677 files, 117.99 MB
- `htmlcov`: 108 files, 9.23 MB
- `git status --ignored` and `rg` report permission denied for many `tests/pytest-cache-files-*` directories and one `data/tmp...` directory.
Impact:
This slows audits and creates noisy permission errors. It also makes file-count baselines hard to compare.
Recommended fix:
- Add a documented local cleanup command or script that only targets ignored caches.
- Do not delete data/reports automatically; require explicit operator opt-in for runtime data.
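A cleanup helper along these lines could be as small as the sketch below. The allowlist is an assumption: it names only the directories from the evidence that are clearly regenerable, and deliberately leaves out `data` and `.tmp`. The function defaults to a dry run so that nothing is deleted without an explicit opt-in.

```python
import shutil
from pathlib import Path

# Assumed allowlist of regenerable caches; runtime data is intentionally absent.
CACHE_DIRS = (".mypy_cache", "htmlcov", "docs-site/node_modules")


def clean_caches(root: Path, dry_run: bool = True) -> list[Path]:
    """Return the allowlisted cache directories under `root`, deleting them unless dry_run."""
    selected = []
    for relative in CACHE_DIRS:
        target = root / relative
        if target.is_dir():
            if not dry_run:
                shutil.rmtree(target)
            selected.append(target)
    return selected
```

Calling `clean_caches(Path("."))` only lists what would be removed; `clean_caches(Path("."), dry_run=False)` performs the deletion.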
Strengths
- Full Python test suite is large and green on Python 3.13.
- Coverage gate is honest and currently passes.
- Python dependency lock is hash-based and audited.
- CI covers lint, type-check, unit/integration matrix, migrations, Helm, Bandit, pip-audit and regression eval when prompt/settings/experiment inputs change.
- Production settings validate CORS, JWT secret, session secret, admin password hash, DB encryption key and paid-provider credentials.
- Helm chart has stronger production guardrails than compose: required secrets and `CORS_ORIGINS` fail-fast.
- Tenant-aware tests exist across sessions, vector store, review queue, analytics and admin surfaces.
- Upload handling sanitizes filenames, rejects dotfiles and enforces upload byte limits.
- PII redaction exists for trace state snapshots via `tracing.sqlite_trace`.
- Observability is broad: Prometheus metrics, component health, retry/circuit-breaker metrics, alert rules, trace/audit retention.
Prioritized Remediation Plan
- Fix `static/agent.html` XSS and add a regression test.
- Add CSP/security headers after inline-script constraints are understood.
- Update `docs-site` dependencies and add `npm audit` to CI.
- Decide production policy for `/docs`, `/redoc`, `/openapi.json`.
- Make Alembic startup failure policy production-safe.
- Mark `docker-compose.yml` as dev-only and bind infra ports more narrowly.
- Close deprecation warnings before LangChain/Authlib/Python 3.14 changes become breaking.
- Refresh or simplify durable handoff docs to avoid compact-resume drift.
- Continue breaking down `api/app.py` and add focused tests for low-coverage critical modules.
Final Assessment
Current readiness: strong local/CI readiness with two security fixes required before public production exposure.
Backend quality is materially better than the static UI and docs-site supply-chain posture. I would not expose the admin/agent UI on an untrusted network until H1 is fixed and browser security headers are added. For internal/local use, the project is operationally solid and well verified.