Skip to content

Phased Roadmap

Last updated: 2026-04-27 (post-audit CI/static UI/analytics/root pytest/model catalog/recon telemetry/agent gate fixes landed; tag v0.1.0-pre-simplify)

2026-04-27 Post-audit CI + static UI contract fixes

Section titled “2026-04-27 Post-audit CI + static UI contract fixes”

Follow-up work from the 2026-04-26 Codex audit closed the no-regret red/contract items without starting the deferred Option B Simplify refactor.

  • 53bc0d4: CI parity gates restored. mypy src/ tests/ is green again after fixing the usage-telemetry Path.open monkeypatch typing; CI coverage gate now uses the measured 94% baseline instead of stale 97% after playwright_driver.py and tooling modules were kept visible to coverage.
  • 4e7b774: GRACEKELLY_API_KEY no longer blocks the static browser UI shell. /, /*.html, /js/*, /css/*, and /icons/* stay public while API and metrics routes remain protected.
  • 21e6b93: CSP split by surface. API/non-HTML routes keep the strict CSP; static HTML pages get the inline/CDN permissions required by the existing vanilla pages and Chart.js analytics dashboard.
  • eed6b9c: main sidebar stats now read the live /api/v1/analytics fields (total_executions, successful, total_models) instead of removed total_requests / successful_requests fields.
  • 729f816: /analytics.html now calls /api/v1/analytics only and removed stale /api/analytics/* and analytics.db assumptions.
  • Root-level pytest collection now ignores the workspace temp roots .tmp, .workflow, and pytest-temp, so the canonical project gate can run from the repository root without collecting archived/temp test copies.
  • Default dry-run/no-browser startup now serves a dry-run-static browser model catalog plus API models for /api/v1/models; a later browser-enabled startup treats that static snapshot as refreshable and replaces it with the live authenticated Perplexity menu when refresh succeeds.
  • Main UI footer no longer advertises archived static tools (english, interview, webpage) whose legacy /api/* backend routes are not present in the current /api/v1 app surface.
  • Local security preflight docs now mirror CI’s on-demand pip-audit and bandit[toml] install/run commands, so the CI-only scanner gate is locally reproducible without permanently expanding the dev extra.
  • The gracekelly.storage.postgres mypy override was rechecked and retained: removing it makes mypy src/ tests/ fail with 21 strict-typing errors in src/gracekelly/storage/postgres.py, so it is an active exception rather than stale config drift.
  • AGENTS.md now matches README/CI for the canonical local gates: python -m pytest -p no:schemathesis --tb=short -q and python -m mypy src/ tests/.
  • The 2026-04-27 Codex review follow-up closed two integration defects: gracekelly-recon-weekly now loads the repo .env before resolving GRACEKELLY_BROWSER_PROFILE_DIR, and usage telemetry now generates and returns a request id for Redis rate-limited 429 responses that bypass the correlation middleware.

Verification checkpoint: python -m pytest -p no:schemathesis --collect-only -q collected 2674 tests before the follow-up; after the Codex review fixes, python -m pytest -p no:schemathesis --tb=short -q passed with 2670 passed, 6 skipped, 14 subtests.

Tag v0.1.0-pre-simplify cut on 8518785 as a safety net before any simplification refactor. BCG-style strategic audit audit_opus_2026-04-26.md landed at the repo root: engineering 9/10, strategic fit for personal use 5/10, four options A/B/C/D, recommendation Option B Simplify (-42% src LOC, -67% test LOC) once 30-day usage telemetry confirms which endpoints are actually exercised.

  • batch-109/docs (3628def): BCG audit + README “Operating Risks” section (Perplexity ToS/Cloudflare risk, Chrome profile lock, UI drift).
  • batch-109/109-1 (032a94d): usage telemetry middleware. Opt-in JSONL append per HTTP request to <repo>/logs/usage.jsonl, gated by GRACEKELLY_USAGE_TELEMETRY_ENABLED. Wired as the outermost middleware so rate-limited and error responses are still recorded. 13 unit tests.
  • batch-109/109-2 (76a8e26): selectors weekly recon cron. gracekelly-recon-weekly console entry, weekly Friday 03:00 Windows Task Scheduler XML, install/uninstall bat helpers, structural drift detection with logs/recon-drift.jsonl + .workflow/state/perplexity-selectors-drift.flag. 8 unit tests.
  • batch-109/closure (0799485): outbox report and .done refresh.
  • batch-110 (b166de8): fix three pre-existing fails in tests/test_playwright_driver.py after ceeb27d added timeout=30_000 kwarg to two production page.goto sites without updating the _FakePage.goto / _HomeNavigationPage.goto test doubles. Surgical 4-ins/3-del kwarg addition.
  • batch-111 (df48b41): fix install_recon_cron.bat — schtasks /XML rejects UTF-8 (“(1,40): unable to switch the encoding”) and requires UTF-16 LE; pipe between Get-Content and Set-Content was caret-escaped (^|) which breaks under bash → cmd; failure-branch echo carried (rc=%RC%). whose closing paren prematurely terminated the if-block. Production-verified by registering, running once against live Perplexity, capturing the baseline (13 models including Claude Opus 4.7), and re-registering via the fixed bat.

Test status after closure: 2661 passed, 0 failed, 6 skipped, 11 subtests (was 2658 / 3 / 6 before the pre-existing fix). mypy --strict 0 errors on 107 source files (added tools.recon_weekly). ruff check src tests scripts clean.

Live smoke (2026-04-26 19:46 UTC):

  • gracekelly-recon-weekly real-Playwright run captured 13 models / 5 home buttons / DOM flags into .workflow/state/perplexity-selectors-baseline.json.
  • Telemetry middleware wrote one JSONL record per request after uvicorn restart with GRACEKELLY_USAGE_TELEMETRY_ENABLED=true.
  • scripts/ecosystem_smoke.py --skip-rag --skip-agent-toolkit --skip-juhub: V2 /smart and /orchestrate PASS.

Open question (carried forward, requires data not decision): which of Options A/B/C/D in audit_opus_2026-04-26.md §7 to take. Decision deferred until ~30 days of logs/usage.jsonl data is available (target ~2026-05-26).

  • batch-101-b/c: Mistral-as-LLM ripped out; Mistral kept for embeddings.
  • batch-102: /api/v1/orchestrate dry-run profile-gate fix.
  • batch-103: cross-audit plus smoke over 8 sync routes for the dry-run profile-gate (audit doc docs/audits/2026-04-25-dry-run-gate-audit.md).
  • batch-104: integration story documented across README / architecture / runbook / roadmap.
  • batch-105: scripts/ecosystem_smoke.py — single-command health check across V2 + 3 known clients.
  • batch-106: scripts/win-autostart/ — Windows Task Scheduler artefacts so V2 boots on user logon (manual install_autostart.bat from Admin to enable).
  • batch-107: mypy --strict src tests cleanup — full stack now type-clean (264 source files).
  • Integrators verified end-to-end:
    • RAG_Support_Assistant (3 PASS / 5 SKIP / 0 FAIL gracekelly_smoke).
    • agent_toolkit (66 unit + 10 integration PASS, including live test_orchestrator_live::test_single_query against running V2).
    • juhub (V2 endpoints, V1 auto-start dropped, Grok removed; landed in Perplexity_Orchestrator2 HEAD 8ff3886).
  • V1 orchestrator at Perplexity_Orchestrator2 (:8001, /api/gk/*) formally deprecated 2026-04-25 — DEPRECATED.md in that repo, SERVERS.md updated, deprecation banner in CLAUDE.md. No client uses V1; code preserved read-only for archeology.

Single-day roadmap closure run. All phases now marked complete; gates PASS; deferred items explicitly scoped as trigger-reactive.

  • batch-86: silent model-mismatch detection via actual_label unverified (9f2aa5f)
  • batch-87: scoped-menu search port from Perplexity_Orchestrator2 (f430d39)
  • batch-88: pattern-equivalence verification vs Orchestrator2 (48b5a93)
  • batch-89: model fallback policy single-shot (8665875)
  • batch-90: browser request budget (bd8d978)
  • batch-91: live smoke harness 5 patterns (a76b632)
  • batch-92: post-Phase-2 comprehensive audit — health 7.6/10 (5163cb8)
  • batch-93: settings injection + roadmap sync (8803a0c, d3e32d6)
  • batch-94: coverage blind spot — unhide playwright_driver (ca3d68b)
  • batch-95: mypy override cleanup + fallback logging (bfebff2)
  • batch-96: README + architecture docs sync (9108343, fa36d49)
  • batch-97: Gate 2 / Gate 3 self-review + AUTH3/cyrillic/API-hedge closure (aeaa9bf)
  • batch-98: /healthz/ready semantic readiness (7e3928f)
  • batch-99: Phase 13/14 trigger-reactive relabel (f356bc6)
  • batch-100: final docs polish (a991b1d — this batch)

Status: complete

Deliverables:

  • independent project root
  • app factory
  • public API contract
  • canonical model registry
  • memory-backed task repository

Status: complete

Deliverables:

  • adapter interface for prompt execution
  • execution result envelope
  • failure taxonomy and retry policy contract
  • multi-model execution plan contract
  • cooperative cancel-on-quorum execution flow
  • typed task, step, and event contracts
  • model timeout and concurrency hints enforced in runtime
  • async route offloading via asyncio.to_thread
  • thread-safe in-memory repository
  • strict step/result cardinality enforcement
  • event persistence failure logging

Review gates (self-review completed 2026-04-23):

  • Gate 2 (operational readiness): PASS for single-user local scope — see docs/gates/2026-04-23-gate-2-operational-review.md; use GET /api/v1/readiness for the full component breakdown.
  • Gate 3 (execution-policy): PASS for single-user local scope — see docs/gates/2026-04-23-gate-3-execution-policy-review.md.

Status: complete, live-smoke harness extended to smart/debate/consensus/compare/upload in batch-91

Deliverables:

  • isolated browser adapter package
  • session lifecycle abstraction
  • model selection verification rules
  • popup and auth recovery hooks
  • scripted browser automation backend
  • live Perplexity DOM recon note
  • centralized selector module
  • thin Playwright-backed BrowserAutomationPort implementation
  • external Gate 4 boundary review completed in audit2.md

Delivered after logging hardening:

  • circuit breaker state transition logging (trip/close/half-open/fail-fast)
  • browser adapter auth check, execution duration, response source logging
  • Playwright session reuse and response extraction diagnostics

Delivered after catalog refresh:

  • “Thinking” model added from recon evidence
  • “Model” text removed from ready_markers and shell_noise_lines (no longer in Perplexity UI)
  • Kimi K2.5 kept in registry with runtime observed_unavailable handling

Delivered after the first live-driver smoke:

  • concrete browser-driver cleanup on top of the app lifespan hook, including idle-session reset and stale runtime detection

Delivered after Phase 2 closing:

  • batch-91 HARNESS-expand-patterns: live smoke harness covers smart/debate/consensus/compare/upload with pattern-specific evaluation; operator runbook section added (a76b632, b965c3c)

Support in place:

  • manual-gated live Playwright smoke test in tests/test_playwright_live.py
  • dedicated profile bootstrap helper via gracekelly-create-perplexity-profile

Status: complete for current scope

Deliverables:

  • PostgreSQL backend alongside memory storage
  • task event log
  • health and integrity checks
  • packaged SQL migration and schema-diff tooling
  • validation CLI and optional live-PostgreSQL tests

Delivered after migration tooling:

  • gk_schema_migrations tracking table with ordered application
  • gracekelly-migrate-postgres CLI with --dry-run
  • migration status (available/applied/pending) in schema_report()

Delivered after pooling:

  • optional psycopg_pool connection pooling via GRACEKELLY_POSTGRES_POOL_ENABLED

Delivered after validation tooling:

  • JSON snapshot export/import CLIs for PostgreSQL task, step, and event data

Status: complete

Deliverables:

  • account pool manager
  • model fallback policy
  • request budget and concurrency limits
  • circuit breakers around adapters
  • quorum and merge policy for multi-model execution

Already delivered:

  • per-model timeout defaults
  • in-process concurrency limits
  • minimal browser-adapter circuit breaker behavior with readiness visibility
  • execution saturation visibility in health/readiness
  • fail-fast plan/result cardinality invariants
  • explicit retry-schema deferral tests
  • API error response sanitization (no internal detail leakage)
  • browser profile-dir path-traversal validation
  • opt-in API key authentication (GRACEKELLY_API_KEY)
  • opt-in per-IP rate limiting (GRACEKELLY_RATE_LIMIT_RPM)

Delivered after account-pool and retry:

  • thread-safe AccountPool with LRU selection and configurable cooldown
  • task-level retry via retry_of_task_id linkage and POST /api/v1/tasks/{id}/retry
  • migration 0002_add_retry_of_task_id.sql

Delivered after Phase 4 closing:

  • batch-89 CORE-model-fallback-policy: ModelSpec.fallback_model_id + single-shot ExecutionRouter fallback on AUTH_FAILED / PROVIDER_UNAVAILABLE / TIMEOUT; env GRACEKELLY_ENABLE_MODEL_FALLBACK default off (8665875)
  • batch-90 CORE-request-budget-browser: RequestBudgetTracker with per-task + rolling-hourly caps; browser-only gate; env GRACEKELLY_MAX_BROWSER_SUBMITS_PER_TASK / GRACEKELLY_MAX_BROWSER_SUBMITS_PER_HOUR default None (bd8d978)
  • batch-93 CORE-inject-settings-into-router: ExecutionRouter accepts explicit Settings via create_app, eliminating the module-global config leak (8803a0c)

Status: complete

Deliverables:

  • metrics endpoint
  • task inspection endpoint
  • operator runbook
  • lightweight admin surface if still justified

Already delivered:

  • /metrics endpoint backed by existing readiness/runtime state
  • /health and /api/v1/readiness
  • operator runbook for the current browser/storage/runtime surfaces
  • structured key-value logging across orchestrator, browser, API route, and PostgreSQL degradation paths
  • recent-task list with operator filters
  • rich GET /api/v1/tasks/{task_id} task, step, and event views
  • execution saturation and terminal-summary diagnostics

Evaluation outcome:

  • admin UI is not justified at this stage — API endpoints, CLI tools, Prometheus /metrics, operator runbook, and structured logging cover all current operator needs without adding frontend build complexity, dependencies, or security surface

Status: consolidated

Deliverables:

  • provider API adapter interface implementation
  • first low-cost provider integration path
  • provider-specific auth and rate-limit handling

Already delivered:

  • initially shipped a Mistral-compatible adapter; later removed when Mistral was constrained to embeddings-only usage
  • OpenAI-compatible adapter
  • shared BaseApiAdapter with common execute, post, healthcheck, and error handling

Trigger-reactive follow-up (not scheduled):

  • expand the API hedge beyond a single OpenAI-compatible model only if browser fragility becomes material under production-like load; this is not a currently open item for the single-user local deployment path

Phase 6–10: Core smart endpoints and consensus V1

Section titled “Phase 6–10: Core smart endpoints and consensus V1”

Status: complete

Deliverables:

  • Smart endpoint with execution profile resolution
  • Consensus V1 engine with majority voting and confidence scoring
  • Analytics endpoint with graceful degradation
  • Batch endpoint for parallel multi-prompt execution
  • Embeddings client integration

Phase 11: Consensus V2 + Infrastructure Integration

Section titled “Phase 11: Consensus V2 + Infrastructure Integration”

Status: complete

Deliverables:

  • Consensus V2 engine: HAC clustering, cluster confidence, cross-pollination, debate round, divergence handling, adaptive parameters
  • Full ConsensusExecutorV2 pipeline with peer review reranking and round weighting
  • Infrastructure modules: account loader, account pool manager, execution history, round weighting, multi-model executor, peer review reranker
  • 4 new endpoints: POST /api/v1/batch, POST /api/v1/pipeline, GET /api/v1/health/detailed, POST /api/v1/smart/v2
  • Route inventory smoke test covering all 15 endpoints
  • Audit fixes: consensus error sanitization, analytics graceful degradation

Status: complete

Deliverables:

  • POST /api/v1/debate — Devil’s Advocate debate endpoint with challenge, defense, improved response
  • POST /api/v1/compare — multi-model FIVE_MODELS_COMPARE pattern with Judge analysis
  • POST /api/v1/batch — already delivered in Phase 11
  • Task graph system: core graph, builder (sequential/parallel/fan-out-fan-in/pipeline), executor with topological sort and skip-on-failure

Status: complete

Delivered:

  • ✅ GitHub Actions CI pipeline (Python 3.11/3.12, import check, coverage reporting, pip-audit)
  • ✅ Dockerfile: Python 3.12-slim, non-root user, HEALTHCHECK
  • ✅ docker-compose (standalone + postgres-backed configurations)
  • ✅ .dockerignore
  • ✅ Security audit: no hardcoded keys, all 44 core modules importable, no silent exception swallowing in routes
  • ✅ mypy strict mode: 0 errors across all 98 source files; CI enforces —strict on every push
  • ✅ Typed AppState, typed route helpers, cast() for all json.loads returns, typed middleware
  • ✅ Health endpoint security: minimal response by default, details opt-in via GRACEKELLY_HEALTH_EXPOSE_DETAILS
  • ✅ Startup/shutdown lifecycle closes browser adapters, executors, and PostgreSQL pools cleanly
  • ✅ Orchestrate request timeout: returns HTTP 504 on breach (GRACEKELLY_ORCHESTRATE_TIMEOUT_SECONDS)
  • ✅ Analytics N+1 query fix: batch step loading via list_steps_batch
  • ✅ Event pagination for GET /tasks/{id}: events_limit / events_offset query params
  • ✅ httpx migration for API adapter (replaces requests)
  • ✅ Prometheus latency histogram: gracekelly_http_request_duration_seconds (buckets, sum, count)
  • ✅ SAST with bandit added to CI pipeline (|| true); nosec annotations for false positives
  • ✅ dry_run default changed from True to False (HIGH audit finding — prevents silent DryRunAdapter in production)
  • ✅ AnthropicApiAdapter._post_json signature aligned with BaseApiAdapter (extra_headers param)
  • ✅ app_state.py: added test coverage
  • ✅ 2309 tests passing (from 2229 start of session)

Trigger-reactive follow-up (production/multi-user scope): These items are not open work for the single-user local deploy. Each activates on its own trigger:

  • Async adapters (async httpx / async playwright) — trigger: sustained event-loop stalls under concurrent load. Current sync adapters wrapped in asyncio.to_thread are sufficient for single-user throughput.
  • Redis-backed rate limiting for multi-process deployments — trigger: horizontal scale-out beyond one uvicorn worker. In-process limiting is sufficient single-process.
  • OpenTelemetry distributed tracing — trigger: multi-service topology or collecting perf data across requests. Single-user local has structured logs.
  • Error tracking integration (Sentry) — trigger: production fleet where operator not the user. Single-user local reads logs directly.
  • Load testing framework — trigger: SLA-bound deploy or before a material scale change. Not required for local scope.

Status: complete

Deliverables:

  • ✅ README.md: Quick Start, Configuration table, API table (23 endpoints), Development, Architecture
  • ✅ X-Request-ID correlation middleware: echoes client header or generates UUID; propagated in all responses
  • ✅ RFC 7807 Problem Details: all 4xx/5xx responses use {type, title, status, detail} format
  • ✅ API key authentication documented and enforced through GRACEKELLY_API_KEY for protected endpoints
  • ✅ Kubernetes probes: GET /healthz/live (always 200) and GET /healthz/ready (checks storage, 503 if unavailable)
  • ✅ Settings.validate(): fail-fast startup validation (postgres+no DSN, timeout<1s)
  • ✅ Event buffering: OrchestratorService._event_buffer (deque maxlen=500) buffers events on storage failure, flushes on next submit
  • ✅ Property-based tests: hypothesis invariants for similarity symmetry, clustering bounds, confidence normalization
  • ✅ Coverage gap tests: smart_v2, base adapter, pipeline, storage/base
  • ✅ Multi-stage Docker build: builder + runtime stages, non-root user, HEALTHCHECK
  • ✅ playwright_driver.py excluded from coverage (requires live browser)
  • ✅ CI coverage gate raised: 90% → 93%
  • ✅ 2355 tests passing, 0 failures, coverage 93.85%

Trigger-reactive follow-up (merged with Phase 13):

  • Async adapters (async httpx.AsyncClient) — merged with Phase 13 async-adapter item above (same underlying work, same trigger). asyncio.to_thread wrapper remains in place until triggered.

Phase 15: Async adapters, observability, HTML SPA UI

Section titled “Phase 15: Async adapters, observability, HTML SPA UI”

Status: complete

Scope: replace Streamlit frontend with an HTML SPA, finish async adapter work, plug in optional observability and rate-limiting backends.

Delivered:

  • feat: async adapters (4fbfe26) — execute_async() on adapter ABC, AsyncClient in BaseApiAdapter, routes updated to await
  • feat: Sentry error tracking + Redis rate limiting (optional, env-driven) (8fc51fe)
  • feat: OpenTelemetry tracing (optional, env-driven) (ba2ca8d)
  • feat: model refresh endpoint, modern UI, chat robustness (2757daf)
  • feat: context truncation (3000 chars) + chat error recovery (df5bfb4)
  • fix: browser model routing in stream, chat dry_run toggle, UI cleanup (0b632b4)
  • feat: session chain (20 turns), auto-decomposition, file attachments (d03bb58)
  • fix: 3 smoke-дефекта — dry_run output_text, stream ValueError, extra fields 422 (316242c)
  • feat: Playwright image upload, UI file uploader, async tests, live smoke (466fc80)
  • refactor: orchestrator split + session chain + file attachments (cf2562d)
  • feat: HTML SPA UI replaces Streamlit (PO2 design) (88b8106)
  • chore: CI hardening + security dependency upgrades (3e0c0cd)
  • fix: smoke regressions — dry_run propagation, healthz/live, stream task_id (f7838b7)
  • feat: browser profile safety + screenshot debugging (9da447e)

mypy-only hardening batches (mypy tests passing under —strict):

  • 8340ba7 refactor: remove hasattr guards + async cleanup (-421 errors)
  • 9b6cfe2 fix: mypy tests -190 errors
  • 38a7c52 fix: mypy src/ tests/ → 0 errors
  • b1bdb6d fix: coverage 95.66%→97%+, CI mypy src/ tests/ lockdown

Phase 16: Browser-backed orchestration hardening

Section titled “Phase 16: Browser-backed orchestration hardening”

Status: complete (batches 69 through 74)

Scope: close the 500-on-browser-adapter gap, rebuild the PO2 HTML shell, keep the browser-side model catalog in sync with what Perplexity actually exposes, and formalise the auth-overlay handling.

Delivered:

  • batch-69 DIAG-orchestrate-500: map PermissionError and adapter timeouts to structured failure on both sync and async routes (810b4c1, f3fc848)
  • batch-69 UI1-UI3 PO2 rebuild: PO2 HTML skeleton, styles, chat.js features restored (c9751d1, f7b067b, b6c05ca)
  • batch-69 MODEL-dynamic-catalog: runtime snapshot replaces the static browser model registry (e87498c)
  • batch-70 CLOSE-69: ledger + cleanup of 69 scope (745e66f, 332f20d, d95b7d9, 36fbbc5, 8cd7603)
  • AUTH1/AUTH2 + AUTH-FIX1/FIX2/FIX3: sync returns HTTP 503 {"code":"model_auth_required"}, async task captures the same code in task.error; inline UI banner with Retry; shared constants and Playwright regression (8763c19, 0aa592d, f19ed1b, ae4f07e, 4f7ed1a)
  • batch-72 BOOTSTRAP-onboarding: dedicated Chrome profile bootstrap helper + onboarding doc; chrome-profile/ now gitignored (014493e, 2bf3bab)
  • batch-72/73 consolidated: catalog async lifespan, profile safety validator, logger visibility fix (67fc496)
  • batch-73 SMOKE-rerun: live Perplexity single-pattern smoke returns 200 (cffa4fc)
  • batch-74 UI-PO2-parity-rework: copy PO2 icons + pages + align DOM/CSS/JS (48710aa)
  • batch-74 FLAKY-postgres-test-triage: hypothesis + bisect plan recorded, fix deferred (412cee4)
  • chore: drain timed-out orchestrate submissions so unhandled futures don’t leak into unrelated tests (772ef34)
  • batch-74 closure + evidence: workflow done marker refresh, UI parity screenshots, live catalog snapshot (11 models, no Kimi) (88e1738, e1f7185)

Phase 17: Live UI smoke and workflow hygiene

Section titled “Phase 17: Live UI smoke and workflow hygiene”

Status: closed for smart/debate arc, adapter-timeout follow-up deferred

Scope: validate complex browser scenarios (file upload, decomposition, debate) against the real Perplexity UI, and keep the .workflow/ task queue clean.

Delivered:

  • batch-75 FLAKY-postgres-fix: flaky test_main_rejects_checksum_mismatch no longer reproduces on the current tip; full suite runs 2579 passed / 0 failed twice back-to-back, no skip/xfail/ordering workaround added (98b8835)
  • batch-75 INBOX-cleanup: six processed task specs archived, .ready cleared, batch-75 spec moved to .workflow/done/ (2b909d6, deebc5a)
  • batch-76 UI-SMOKE-prep + upload: / lives at http://127.0.0.1:8011/; live Claude 4.6 single-pattern smoke with attachment returns 200 and the sentinel in output_text (not committed — artefacts only)
  • batch-77 UI-MENU-extend: smart and debate exposed as user-selectable items (“Умный выбор”, “Дебаты”) in the Авто group of the PO2 model menu (f456067)
  • batch-77 UI-BROWSER-regressions: three Playwright regression tests drive the real SPA with mocked /api/v1/smart, /api/v1/debate, /api/v1/orchestrate/upload — no live Perplexity dependency (063f29b)
  • batch-79 FIX-ui-contract: debate/smart items now carry pinned_model; _resolveItem short-circuit resolves it so getSelection().model is never null (a89ca78). Mock regressions extended to assert the body carries a non-empty model.
  • batch-79 FIX-env-docs: README corrected to describe GRACEKELLY_EXECUTION_PROFILE as dry-run / api-only / hybrid (prior value default was never valid) (614b181)
  • batch-80 BACKEND-smart-debate-browser-support: /api/v1/smart and /api/v1/debate accept browser-backed models via state.browser_adapter; unit tests cover browser-path success + API-path regression (dd632c7)
  • batch-80 FLAKE-triage-http-api: test_list_tasks_exposes_winning_model_and_short_circuit_summary stabilised with a deterministic cancellable adapter; three back-to-back full runs + one coverage run green (5c62065)
  • batch-82 UI-PIN-claude-sonnet: smart/debate pinned_model switched from the unstable "best" Perplexity alias to explicit "claude-sonnet-4-6"; mock regressions updated (1e448e0)
  • batch-84 BACKEND-unify-browser-adapter: shared adapter-lookup helper across smart/debate/smart_v2/consensus/compare; /api/v1/smart/v2, /api/v1/consensus, and /api/v1/compare now accept browser-backed models (e877527)
  • batch-85 ADAPTER-raise-call-timeout: default browser adapter per-call timeout raised 60s → 120s with GRACEKELLY_BROWSER_CALL_TIMEOUT_SECONDS env override (43d29b2)
  • batch-86 BROWSER-actual-label-unverified-reflects-ui: unverified select_model returns parsed actual UI label (button-text-after or menu-snapshot first line) instead of echoing provider_model_id; _model_matches_expected now detects silent Sonar-for-Claude swaps (9f2aa5f)
  • batch-87 TOOL-extend-recon-with-attrs + RECON-best-alias-live: capture_perplexity_recon.py emits new recon-03-model-menu-attrs.json with per-item attributes; live recon bundle captured in tmp/browser-recon/2026-04-23/ confirms Best is a div[role="menuitemradio"] inside [data-radix-popper-content-wrapper], shares class patterns with other entries — nested-text-node ambiguity, not duplicate items (11f4a9a, 9203f50)
  • batch-87 BROWSER-scoped-menu-search: PlaywrightBrowserAutomation.select_model now tries menu-scoped option lookup (model_menu_candidatesmodel_menu_item_selector role-filter) before falling back to the legacy global get_by_role/text path; model_menu_candidates extended with [role="menu"]; details["option_lookup_source"] records which path matched. Pattern ported from Perplexity_Orchestrator2/browser/model_selection.py:182-240 (f430d39)
  • batch-88 VERIFY-scoped-menu-vs-orchestrator2: scoped-menu-search pattern verified equivalent to Perplexity_Orchestrator2/browser/model_selection.py:182-240 (production on ~30 Perplexity Pro accounts); multi-match nested-text-node behavior covered by regression tests — Best-alias follow-up closed (48b5a93)
  • inline AUTH-settle-unknown-state: auth_status makes a bounded _wait_for_shell retry when neither signed-out markers nor the prompt input are visible; unblocks smart auto-decomposition sub-execs that previously hit [auth_failed] mid-flight after exec #1 landed. Structured browser_auth_unknown diagnostic log added for any remaining logged-out decisions (66a64a8)
  • inline RESPONSE-strip-streaming-chrome: shell_noise_lines extended with Thinking, Ask a follow-up, Stop response, Regenerate, Sources, Answer so candidate-text cleanup filters them out (d0acbd4)
  • inline SMOKE-live-smart-acceptance: scripts/live_smart_smoke.py reusable harness drives PO2 SPA + captures /api/v1/smart response; inline run against a live chrome-profile returned status=200 model_id=claude-sonnet-4-6 answer_len=990 with a structured EV-markets answer in 49.7s, no [auth_failed] or shell-chrome markers (352c6b8, 94fd2d9)
  • inline SMOKE-live-debate-acceptance: harness extended with --pattern debate; live run returned status=200 model_id=claude-sonnet-4-6 answer_len=3293 with a full initial_position/challenge/defense/improved_response chain on an EV CO2 debate topic, 61.2s (fc6307f)
  • inline BROWSER-reset-page-state-between-execs: PerplexityBrowserAdapter.execute now calls automation.reset_page_state() (navigates to home via _navigate_home_for_model_selection) after dismiss_popups, so role-based fan-out / decomposition sub-execs no longer collide on a stale thread URL. Live SMART role-based fan-out proven: three real Perplexity sub-execs (123s/100s/53s, 9429/8962/5102 chars — each distinct), final 8962-char multi-region analysis (66ddfdb)

Incidents (deprecated follow-up specs recorded for history, not in Delivered):

  • batch-78 Live UI smoke SMART/DEBATE (first attempt): failed — UI did not expose the patterns; superseded by batch-77 + batch-79 work.
  • batch-81 Live SMART on "best" alias: failed — Perplexity UI non-deterministically selected Sonar instead of Best, one run extracted shell chrome text ("Thinking / Ask a follow-up / Model") instead of the answer. Workaround landed in batch-82 (pin to explicit model); the root-cause fix landed later in the batch-87/batch-88 browser adapter work.
  • batch-82 Live SMART with cyrillic prompt: failed — CX harness’ PowerShell pipeline converted the cyrillic prompt to ? before it reached Playwright. POST /api/v1/smart still returned 200 with the expected model_id, confirming the batch-80 backend fix, but the acceptance (meaningful answer) could not be validated.
  • batch-83 Live SMART with UTF-8 harness: failed — smart auto-decomposition fired three Perplexity calls for the prompt; adapter timeout of 60s per call clipped two of them, the third extracted 2730 chars at 53s. Route-level response was not captured within the harness’ outer 180s window.

Closed / Scoped out (2026-04-23):

  • Cyrillic prompts via some harnesses lose encoding — closed as out-of-scope: the corruption happens on the PowerShell/automation-harness side, not in GraceKelly. See docs/operator-runbook.md under Harness limitations.
  • AUTH3 persistent session reuse — closed by design via the batch-69 rationale: AUTH1 sync mapping, AUTH2 inline banner recovery, and the dedicated Chrome profile directory already cover the single-user local recovery path. Batch-72/bootstrap already covered profile lifecycle.