Phased Roadmap
Phased Roadmap
Section titled “Phased Roadmap”Last updated: 2026-04-27 (post-audit CI/static UI/analytics/root pytest/model catalog/recon telemetry/agent gate fixes landed; tag v0.1.0-pre-simplify)
2026-04-27 Post-audit CI + static UI contract fixes
Section titled “2026-04-27 Post-audit CI + static UI contract fixes”Follow-up work from the 2026-04-26 Codex audit closed the no-regret red/contract items without starting the deferred Option B Simplify refactor.
53bc0d4: CI parity gates restored.mypy src/ tests/is green again after fixing the usage-telemetryPath.openmonkeypatch typing; CI coverage gate now uses the measured 94% baseline instead of stale 97% afterplaywright_driver.pyand tooling modules were kept visible to coverage.4e7b774:GRACEKELLY_API_KEYno longer blocks the static browser UI shell./,/*.html,/js/*,/css/*, and/icons/*stay public while API and metrics routes remain protected.21e6b93: CSP split by surface. API/non-HTML routes keep the strict CSP; static HTML pages get the inline/CDN permissions required by the existing vanilla pages and Chart.js analytics dashboard.eed6b9c: main sidebar stats now read the live/api/v1/analyticsfields (total_executions,successful,total_models) instead of removedtotal_requests/successful_requestsfields.729f816:/analytics.htmlnow calls/api/v1/analyticsonly and removed stale/api/analytics/*andanalytics.dbassumptions.- Root-level pytest collection now ignores the workspace temp roots
.tmp,.workflow, andpytest-temp, so the canonical project gate can run from the repository root without collecting archived/temp test copies. - Default dry-run/no-browser startup now serves a
dry-run-staticbrowser model catalog plus API models for/api/v1/models; a later browser-enabled startup treats that static snapshot as refreshable and replaces it with the live authenticated Perplexity menu when refresh succeeds. - Main UI footer no longer advertises archived static tools (
english,interview,webpage) whose legacy/api/*backend routes are not present in the current/api/v1app surface. - Local security preflight docs now mirror CI’s on-demand
pip-auditandbandit[toml]install/run commands, so the CI-only scanner gate is locally reproducible without permanently expanding the dev extra. - The
gracekelly.storage.postgresmypy override was rechecked and retained: removing it makesmypy src/ tests/fail with 21 strict-typing errors insrc/gracekelly/storage/postgres.py, so it is an active exception rather than stale config drift. AGENTS.mdnow matches README/CI for the canonical local gates:python -m pytest -p no:schemathesis --tb=short -qandpython -m mypy src/ tests/.- The 2026-04-27 Codex review follow-up closed two integration defects:
gracekelly-recon-weeklynow loads the repo.envbefore resolvingGRACEKELLY_BROWSER_PROFILE_DIR, and usage telemetry now generates and returns a request id for Redis rate-limited 429 responses that bypass the correlation middleware.
Verification checkpoint: python -m pytest -p no:schemathesis --collect-only -q
collected 2674 tests before the follow-up; after the Codex review fixes,
python -m pytest -p no:schemathesis --tb=short -q passed with
2670 passed, 6 skipped, 14 subtests.
2026-04-26 Audit + telemetry + recon cron
Section titled “2026-04-26 Audit + telemetry + recon cron”Tag v0.1.0-pre-simplify cut on 8518785 as a safety net before any
simplification refactor. BCG-style strategic audit audit_opus_2026-04-26.md
landed at the repo root: engineering 9/10, strategic fit for personal use
5/10, four options A/B/C/D, recommendation Option B Simplify (-42% src LOC,
-67% test LOC) once 30-day usage telemetry confirms which endpoints are
actually exercised.
batch-109/docs(3628def): BCG audit + README “Operating Risks” section (Perplexity ToS/Cloudflare risk, Chrome profile lock, UI drift).batch-109/109-1(032a94d): usage telemetry middleware. Opt-in JSONL append per HTTP request to<repo>/logs/usage.jsonl, gated byGRACEKELLY_USAGE_TELEMETRY_ENABLED. Wired as the outermost middleware so rate-limited and error responses are still recorded. 13 unit tests.batch-109/109-2(76a8e26): selectors weekly recon cron.gracekelly-recon-weeklyconsole entry, weekly Friday 03:00 Windows Task Scheduler XML, install/uninstall bat helpers, structural drift detection withlogs/recon-drift.jsonl+.workflow/state/perplexity-selectors-drift.flag. 8 unit tests.batch-109/closure(0799485): outbox report and.donerefresh.batch-110(b166de8): fix three pre-existing fails intests/test_playwright_driver.pyafterceeb27daddedtimeout=30_000kwarg to two productionpage.gotosites without updating the_FakePage.goto/_HomeNavigationPage.gototest doubles. Surgical 4-ins/3-del kwarg addition.batch-111(df48b41): fixinstall_recon_cron.bat— schtasks /XML rejects UTF-8 (“(1,40): unable to switch the encoding”) and requires UTF-16 LE; pipe between Get-Content and Set-Content was caret-escaped (^|) which breaks under bash → cmd; failure-branch echo carried(rc=%RC%).whose closing paren prematurely terminated the if-block. Production-verified by registering, running once against live Perplexity, capturing the baseline (13 models including Claude Opus 4.7), and re-registering via the fixed bat.
Test status after closure: 2661 passed, 0 failed, 6 skipped, 11 subtests
(was 2658 / 3 / 6 before the pre-existing fix). mypy --strict 0 errors on
107 source files (added tools.recon_weekly). ruff check src tests scripts
clean.
Live smoke (2026-04-26 19:46 UTC):
gracekelly-recon-weeklyreal-Playwright run captured 13 models / 5 home buttons / DOM flags into.workflow/state/perplexity-selectors-baseline.json.- Telemetry middleware wrote one JSONL record per request after uvicorn
restart with
GRACEKELLY_USAGE_TELEMETRY_ENABLED=true. scripts/ecosystem_smoke.py --skip-rag --skip-agent-toolkit --skip-juhub: V2/smartand/orchestratePASS.
Open question (carried forward, requires data not decision): which of
Options A/B/C/D in audit_opus_2026-04-26.md §7 to take. Decision deferred
until ~30 days of logs/usage.jsonl data is available (target
~2026-05-26).
2026-04-25 Integration closure
Section titled “2026-04-25 Integration closure”batch-101-b/c: Mistral-as-LLM ripped out; Mistral kept for embeddings.batch-102:/api/v1/orchestratedry-run profile-gate fix.batch-103: cross-audit plus smoke over 8 sync routes for the dry-run profile-gate (audit docdocs/audits/2026-04-25-dry-run-gate-audit.md).batch-104: integration story documented across README / architecture / runbook / roadmap.batch-105:scripts/ecosystem_smoke.py— single-command health check across V2 + 3 known clients.batch-106:scripts/win-autostart/— Windows Task Scheduler artefacts so V2 boots on user logon (manualinstall_autostart.batfrom Admin to enable).batch-107:mypy --strict src testscleanup — full stack now type-clean (264 source files).- Integrators verified end-to-end:
RAG_Support_Assistant(3 PASS / 5 SKIP / 0 FAIL gracekelly_smoke).agent_toolkit(66 unit + 10 integration PASS, including livetest_orchestrator_live::test_single_queryagainst running V2).juhub(V2 endpoints, V1 auto-start dropped, Grok removed; landed inPerplexity_Orchestrator2HEAD8ff3886).
- V1 orchestrator at
Perplexity_Orchestrator2(:8001,/api/gk/*) formally deprecated 2026-04-25 —DEPRECATED.mdin that repo,SERVERS.mdupdated, deprecation banner inCLAUDE.md. No client uses V1; code preserved read-only for archeology.
2026-04-23 Closure Timeline
Section titled “2026-04-23 Closure Timeline”Single-day roadmap closure run. All phases now marked complete; gates PASS; deferred items explicitly scoped as trigger-reactive.
batch-86: silent model-mismatch detection via actual_label unverified (9f2aa5f)batch-87: scoped-menu search port from Perplexity_Orchestrator2 (f430d39)batch-88: pattern-equivalence verification vs Orchestrator2 (48b5a93)batch-89: model fallback policy single-shot (8665875)batch-90: browser request budget (bd8d978)batch-91: live smoke harness 5 patterns (a76b632)batch-92: post-Phase-2 comprehensive audit — health 7.6/10 (5163cb8)batch-93: settings injection + roadmap sync (8803a0c,d3e32d6)batch-94: coverage blind spot — unhide playwright_driver (ca3d68b)batch-95: mypy override cleanup + fallback logging (bfebff2)batch-96: README + architecture docs sync (9108343,fa36d49)batch-97: Gate 2 / Gate 3 self-review + AUTH3/cyrillic/API-hedge closure (aeaa9bf)batch-98: /healthz/ready semantic readiness (7e3928f)batch-99: Phase 13/14 trigger-reactive relabel (f356bc6)batch-100: final docs polish (a991b1d— this batch)
Phase 0: Clean foundation
Section titled “Phase 0: Clean foundation”Status: complete
Deliverables:
- independent project root
- app factory
- public API contract
- canonical model registry
- memory-backed task repository
Phase 1: Execution contract
Section titled “Phase 1: Execution contract”Status: complete
Deliverables:
- adapter interface for prompt execution
- execution result envelope
- failure taxonomy and retry policy contract
- multi-model execution plan contract
- cooperative cancel-on-quorum execution flow
- typed task, step, and event contracts
- model timeout and concurrency hints enforced in runtime
- async route offloading via
asyncio.to_thread - thread-safe in-memory repository
- strict step/result cardinality enforcement
- event persistence failure logging
Review gates (self-review completed 2026-04-23):
- Gate 2 (operational readiness): PASS for single-user local scope — see
docs/gates/2026-04-23-gate-2-operational-review.md; useGET /api/v1/readinessfor the full component breakdown. - Gate 3 (execution-policy): PASS for single-user local scope — see
docs/gates/2026-04-23-gate-3-execution-policy-review.md.
Phase 2: Browser worker
Section titled “Phase 2: Browser worker”Status: complete, live-smoke harness extended to smart/debate/consensus/compare/upload in batch-91
Deliverables:
- isolated browser adapter package
- session lifecycle abstraction
- model selection verification rules
- popup and auth recovery hooks
- scripted browser automation backend
- live Perplexity DOM recon note
- centralized selector module
- thin Playwright-backed
BrowserAutomationPortimplementation - external Gate 4 boundary review completed in
audit2.md
Delivered after logging hardening:
- circuit breaker state transition logging (trip/close/half-open/fail-fast)
- browser adapter auth check, execution duration, response source logging
- Playwright session reuse and response extraction diagnostics
Delivered after catalog refresh:
- “Thinking” model added from recon evidence
- “Model” text removed from ready_markers and shell_noise_lines (no longer in Perplexity UI)
- Kimi K2.5 kept in registry with runtime
observed_unavailablehandling
Delivered after the first live-driver smoke:
- concrete browser-driver cleanup on top of the app lifespan hook, including idle-session reset and stale runtime detection
Delivered after Phase 2 closing:
- batch-91 HARNESS-expand-patterns: live smoke harness covers smart/debate/consensus/compare/upload with pattern-specific evaluation; operator runbook section added (
a76b632,b965c3c)
Support in place:
- manual-gated live Playwright smoke test in
tests/test_playwright_live.py - dedicated profile bootstrap helper via
gracekelly-create-perplexity-profile
Phase 3: Durable state
Section titled “Phase 3: Durable state”Status: complete for current scope
Deliverables:
- PostgreSQL backend alongside memory storage
- task event log
- health and integrity checks
- packaged SQL migration and schema-diff tooling
- validation CLI and optional live-PostgreSQL tests
Delivered after migration tooling:
gk_schema_migrationstracking table with ordered applicationgracekelly-migrate-postgresCLI with--dry-run- migration status (available/applied/pending) in
schema_report()
Delivered after pooling:
- optional
psycopg_poolconnection pooling viaGRACEKELLY_POSTGRES_POOL_ENABLED
Delivered after validation tooling:
- JSON snapshot export/import CLIs for PostgreSQL task, step, and event data
Phase 4: Reliability controls
Section titled “Phase 4: Reliability controls”Status: complete
Deliverables:
- account pool manager
- model fallback policy
- request budget and concurrency limits
- circuit breakers around adapters
- quorum and merge policy for multi-model execution
Already delivered:
- per-model timeout defaults
- in-process concurrency limits
- minimal browser-adapter circuit breaker behavior with readiness visibility
- execution saturation visibility in health/readiness
- fail-fast plan/result cardinality invariants
- explicit retry-schema deferral tests
- API error response sanitization (no internal detail leakage)
- browser profile-dir path-traversal validation
- opt-in API key authentication (
GRACEKELLY_API_KEY) - opt-in per-IP rate limiting (
GRACEKELLY_RATE_LIMIT_RPM)
Delivered after account-pool and retry:
- thread-safe
AccountPoolwith LRU selection and configurable cooldown - task-level retry via
retry_of_task_idlinkage andPOST /api/v1/tasks/{id}/retry - migration
0002_add_retry_of_task_id.sql
Delivered after Phase 4 closing:
- batch-89 CORE-model-fallback-policy:
ModelSpec.fallback_model_id+ single-shotExecutionRouterfallback onAUTH_FAILED/PROVIDER_UNAVAILABLE/TIMEOUT; envGRACEKELLY_ENABLE_MODEL_FALLBACKdefault off (8665875) - batch-90 CORE-request-budget-browser:
RequestBudgetTrackerwith per-task + rolling-hourly caps; browser-only gate; envGRACEKELLY_MAX_BROWSER_SUBMITS_PER_TASK/GRACEKELLY_MAX_BROWSER_SUBMITS_PER_HOURdefaultNone(bd8d978) - batch-93 CORE-inject-settings-into-router:
ExecutionRouteraccepts explicitSettingsviacreate_app, eliminating the module-global config leak (8803a0c)
Phase 5: Operations surface
Section titled “Phase 5: Operations surface”Status: complete
Deliverables:
- metrics endpoint
- task inspection endpoint
- operator runbook
- lightweight admin surface if still justified
Already delivered:
/metricsendpoint backed by existing readiness/runtime state/healthand/api/v1/readiness- operator runbook for the current browser/storage/runtime surfaces
- structured key-value logging across orchestrator, browser, API route, and PostgreSQL degradation paths
- recent-task list with operator filters
- rich
GET /api/v1/tasks/{task_id}task, step, and event views - execution saturation and terminal-summary diagnostics
Evaluation outcome:
- admin UI is not justified at this stage — API endpoints, CLI tools, Prometheus /metrics, operator runbook, and structured logging cover all current operator needs without adding frontend build complexity, dependencies, or security surface
Parallel track: API adapters
Section titled “Parallel track: API adapters”Status: consolidated
Deliverables:
- provider API adapter interface implementation
- first low-cost provider integration path
- provider-specific auth and rate-limit handling
Already delivered:
- initially shipped a Mistral-compatible adapter; later removed when Mistral was constrained to embeddings-only usage
- OpenAI-compatible adapter
- shared
BaseApiAdapterwith common execute, post, healthcheck, and error handling
Trigger-reactive follow-up (not scheduled):
- expand the API hedge beyond a single OpenAI-compatible model only if browser fragility becomes material under production-like load; this is not a currently open item for the single-user local deployment path
Phase 6–10: Core smart endpoints and consensus V1
Section titled “Phase 6–10: Core smart endpoints and consensus V1”Status: complete
Deliverables:
- Smart endpoint with execution profile resolution
- Consensus V1 engine with majority voting and confidence scoring
- Analytics endpoint with graceful degradation
- Batch endpoint for parallel multi-prompt execution
- Embeddings client integration
Phase 11: Consensus V2 + Infrastructure Integration
Section titled “Phase 11: Consensus V2 + Infrastructure Integration”Status: complete
Deliverables:
- Consensus V2 engine: HAC clustering, cluster confidence, cross-pollination, debate round, divergence handling, adaptive parameters
- Full ConsensusExecutorV2 pipeline with peer review reranking and round weighting
- Infrastructure modules: account loader, account pool manager, execution history, round weighting, multi-model executor, peer review reranker
- 4 new endpoints: POST /api/v1/batch, POST /api/v1/pipeline, GET /api/v1/health/detailed, POST /api/v1/smart/v2
- Route inventory smoke test covering all 15 endpoints
- Audit fixes: consensus error sanitization, analytics graceful degradation
Phase 12: Child Project APIs
Section titled “Phase 12: Child Project APIs”Status: complete
Deliverables:
- POST /api/v1/debate — Devil’s Advocate debate endpoint with challenge, defense, improved response
- POST /api/v1/compare — multi-model FIVE_MODELS_COMPARE pattern with Judge analysis
- POST /api/v1/batch — already delivered in Phase 11
- Task graph system: core graph, builder (sequential/parallel/fan-out-fan-in/pipeline), executor with topological sort and skip-on-failure
Phase 13: Production Hardening
Section titled “Phase 13: Production Hardening”Status: complete
Delivered:
- ✅ GitHub Actions CI pipeline (Python 3.11/3.12, import check, coverage reporting, pip-audit)
- ✅ Dockerfile: Python 3.12-slim, non-root user, HEALTHCHECK
- ✅ docker-compose (standalone + postgres-backed configurations)
- ✅ .dockerignore
- ✅ Security audit: no hardcoded keys, all 44 core modules importable, no silent exception swallowing in routes
- ✅ mypy strict mode: 0 errors across all 98 source files; CI enforces —strict on every push
- ✅ Typed AppState, typed route helpers, cast() for all json.loads returns, typed middleware
- ✅ Health endpoint security: minimal response by default, details opt-in via GRACEKELLY_HEALTH_EXPOSE_DETAILS
- ✅ Startup/shutdown lifecycle closes browser adapters, executors, and PostgreSQL pools cleanly
- ✅ Orchestrate request timeout: returns HTTP 504 on breach (GRACEKELLY_ORCHESTRATE_TIMEOUT_SECONDS)
- ✅ Analytics N+1 query fix: batch step loading via list_steps_batch
- ✅ Event pagination for GET /tasks/{id}: events_limit / events_offset query params
- ✅ httpx migration for API adapter (replaces requests)
- ✅ Prometheus latency histogram: gracekelly_http_request_duration_seconds (buckets, sum, count)
- ✅ SAST with bandit added to CI pipeline (|| true); nosec annotations for false positives
- ✅ dry_run default changed from True to False (HIGH audit finding — prevents silent DryRunAdapter in production)
- ✅ AnthropicApiAdapter._post_json signature aligned with BaseApiAdapter (extra_headers param)
- ✅ app_state.py: added test coverage
- ✅ 2309 tests passing (from 2229 start of session)
Trigger-reactive follow-up (production/multi-user scope): These items are not open work for the single-user local deploy. Each activates on its own trigger:
- Async adapters (async httpx / async playwright) — trigger: sustained event-loop stalls under concurrent load. Current sync adapters wrapped in
asyncio.to_threadare sufficient for single-user throughput. - Redis-backed rate limiting for multi-process deployments — trigger: horizontal scale-out beyond one uvicorn worker. In-process limiting is sufficient single-process.
- OpenTelemetry distributed tracing — trigger: multi-service topology or collecting perf data across requests. Single-user local has structured logs.
- Error tracking integration (Sentry) — trigger: production fleet where operator not the user. Single-user local reads logs directly.
- Load testing framework — trigger: SLA-bound deploy or before a material scale change. Not required for local scope.
Phase 14: Quality Excellence
Section titled “Phase 14: Quality Excellence”Status: complete
Deliverables:
- ✅ README.md: Quick Start, Configuration table, API table (23 endpoints), Development, Architecture
- ✅ X-Request-ID correlation middleware: echoes client header or generates UUID; propagated in all responses
- ✅ RFC 7807 Problem Details: all 4xx/5xx responses use
{type, title, status, detail}format - ✅ API key authentication documented and enforced through
GRACEKELLY_API_KEYfor protected endpoints - ✅ Kubernetes probes: GET /healthz/live (always 200) and GET /healthz/ready (checks storage, 503 if unavailable)
- ✅ Settings.validate(): fail-fast startup validation (postgres+no DSN, timeout<1s)
- ✅ Event buffering: OrchestratorService._event_buffer (deque maxlen=500) buffers events on storage failure, flushes on next submit
- ✅ Property-based tests: hypothesis invariants for similarity symmetry, clustering bounds, confidence normalization
- ✅ Coverage gap tests: smart_v2, base adapter, pipeline, storage/base
- ✅ Multi-stage Docker build: builder + runtime stages, non-root user, HEALTHCHECK
- ✅ playwright_driver.py excluded from coverage (requires live browser)
- ✅ CI coverage gate raised: 90% → 93%
- ✅ 2355 tests passing, 0 failures, coverage 93.85%
Trigger-reactive follow-up (merged with Phase 13):
- Async adapters (async httpx.AsyncClient) — merged with Phase 13 async-adapter item above (same underlying work, same trigger).
asyncio.to_threadwrapper remains in place until triggered.
Phase 15: Async adapters, observability, HTML SPA UI
Section titled “Phase 15: Async adapters, observability, HTML SPA UI”Status: complete
Scope: replace Streamlit frontend with an HTML SPA, finish async adapter work, plug in optional observability and rate-limiting backends.
Delivered:
feat: async adapters(4fbfe26) —execute_async()on adapter ABC,AsyncClientinBaseApiAdapter, routes updated to awaitfeat: Sentry error tracking + Redis rate limiting (optional, env-driven)(8fc51fe)feat: OpenTelemetry tracing (optional, env-driven)(ba2ca8d)feat: model refresh endpoint, modern UI, chat robustness(2757daf)feat: context truncation (3000 chars) + chat error recovery(df5bfb4)fix: browser model routing in stream, chat dry_run toggle, UI cleanup(0b632b4)feat: session chain (20 turns), auto-decomposition, file attachments(d03bb58)fix: 3 smoke-дефекта — dry_run output_text, stream ValueError, extra fields 422(316242c)feat: Playwright image upload, UI file uploader, async tests, live smoke(466fc80)refactor: orchestrator split + session chain + file attachments(cf2562d)feat: HTML SPA UI replaces Streamlit (PO2 design)(88b8106)chore: CI hardening + security dependency upgrades(3e0c0cd)fix: smoke regressions — dry_run propagation, healthz/live, stream task_id(f7838b7)feat: browser profile safety + screenshot debugging(9da447e)
mypy-only hardening batches (mypy tests passing under —strict):
8340ba7refactor: remove hasattr guards + async cleanup (-421 errors)9b6cfe2fix: mypy tests -190 errors38a7c52fix: mypy src/ tests/ → 0 errorsb1bdb6dfix: coverage 95.66%→97%+, CI mypy src/ tests/ lockdown
Phase 16: Browser-backed orchestration hardening
Section titled “Phase 16: Browser-backed orchestration hardening”Status: complete (batches 69 through 74)
Scope: close the 500-on-browser-adapter gap, rebuild the PO2 HTML shell, keep the browser-side model catalog in sync with what Perplexity actually exposes, and formalise the auth-overlay handling.
Delivered:
- batch-69 DIAG-orchestrate-500: map
PermissionErrorand adapter timeouts to structured failure on both sync and async routes (810b4c1,f3fc848) - batch-69 UI1-UI3 PO2 rebuild: PO2 HTML skeleton, styles, chat.js features restored (
c9751d1,f7b067b,b6c05ca) - batch-69 MODEL-dynamic-catalog: runtime snapshot replaces the static browser model registry (
e87498c) - batch-70 CLOSE-69: ledger + cleanup of 69 scope (
745e66f,332f20d,d95b7d9,36fbbc5,8cd7603) - AUTH1/AUTH2 + AUTH-FIX1/FIX2/FIX3: sync returns
HTTP 503 {"code":"model_auth_required"}, async task captures the same code intask.error; inline UI banner with Retry; shared constants and Playwright regression (8763c19,0aa592d,f19ed1b,ae4f07e,4f7ed1a) - batch-72 BOOTSTRAP-onboarding: dedicated Chrome profile bootstrap helper + onboarding doc;
chrome-profile/now gitignored (014493e,2bf3bab) - batch-72/73 consolidated: catalog async lifespan, profile safety validator, logger visibility fix (
67fc496) - batch-73 SMOKE-rerun: live Perplexity single-pattern smoke returns
200(cffa4fc) - batch-74 UI-PO2-parity-rework: copy PO2 icons + pages + align DOM/CSS/JS (
48710aa) - batch-74 FLAKY-postgres-test-triage: hypothesis + bisect plan recorded, fix deferred (
412cee4) - chore: drain timed-out orchestrate submissions so unhandled futures don’t leak into unrelated tests (
772ef34) - batch-74 closure + evidence: workflow done marker refresh, UI parity screenshots, live catalog snapshot (11 models, no Kimi) (
88e1738,e1f7185)
Phase 17: Live UI smoke and workflow hygiene
Section titled “Phase 17: Live UI smoke and workflow hygiene”Status: closed for smart/debate arc, adapter-timeout follow-up deferred
Scope: validate complex browser scenarios (file upload, decomposition, debate)
against the real Perplexity UI, and keep the .workflow/ task queue clean.
Delivered:
- batch-75 FLAKY-postgres-fix: flaky
test_main_rejects_checksum_mismatchno longer reproduces on the current tip; full suite runs2579 passed / 0 failedtwice back-to-back, no skip/xfail/ordering workaround added (98b8835) - batch-75 INBOX-cleanup: six processed task specs archived,
.readycleared, batch-75 spec moved to.workflow/done/(2b909d6,deebc5a) - batch-76 UI-SMOKE-prep + upload:
/lives athttp://127.0.0.1:8011/; live Claude 4.6 single-pattern smoke with attachment returns200and the sentinel inoutput_text(not committed — artefacts only) - batch-77 UI-MENU-extend: smart and debate exposed as user-selectable items (“Умный выбор”, “Дебаты”) in the
Автоgroup of the PO2 model menu (f456067) - batch-77 UI-BROWSER-regressions: three Playwright regression tests drive the real SPA with mocked
/api/v1/smart,/api/v1/debate,/api/v1/orchestrate/upload— no live Perplexity dependency (063f29b) - batch-79 FIX-ui-contract: debate/smart items now carry
pinned_model;_resolveItemshort-circuit resolves it sogetSelection().modelis never null (a89ca78). Mock regressions extended to assert the body carries a non-empty model. - batch-79 FIX-env-docs: README corrected to describe
GRACEKELLY_EXECUTION_PROFILEasdry-run/api-only/hybrid(prior valuedefaultwas never valid) (614b181) - batch-80 BACKEND-smart-debate-browser-support:
/api/v1/smartand/api/v1/debateaccept browser-backed models viastate.browser_adapter; unit tests cover browser-path success + API-path regression (dd632c7) - batch-80 FLAKE-triage-http-api:
test_list_tasks_exposes_winning_model_and_short_circuit_summarystabilised with a deterministic cancellable adapter; three back-to-back full runs + one coverage run green (5c62065) - batch-82 UI-PIN-claude-sonnet: smart/debate
pinned_modelswitched from the unstable"best"Perplexity alias to explicit"claude-sonnet-4-6"; mock regressions updated (1e448e0) - batch-84 BACKEND-unify-browser-adapter: shared adapter-lookup helper across smart/debate/smart_v2/consensus/compare;
/api/v1/smart/v2,/api/v1/consensus, and/api/v1/comparenow accept browser-backed models (e877527) - batch-85 ADAPTER-raise-call-timeout: default browser adapter per-call timeout raised 60s → 120s with
GRACEKELLY_BROWSER_CALL_TIMEOUT_SECONDSenv override (43d29b2) - batch-86 BROWSER-actual-label-unverified-reflects-ui: unverified
select_modelreturns parsed actual UI label (button-text-after or menu-snapshot first line) instead of echoingprovider_model_id;_model_matches_expectednow detects silent Sonar-for-Claude swaps (9f2aa5f) - batch-87 TOOL-extend-recon-with-attrs + RECON-best-alias-live:
capture_perplexity_recon.pyemits newrecon-03-model-menu-attrs.jsonwith per-item attributes; live recon bundle captured intmp/browser-recon/2026-04-23/confirmsBestis adiv[role="menuitemradio"]inside[data-radix-popper-content-wrapper], shares class patterns with other entries — nested-text-node ambiguity, not duplicate items (11f4a9a,9203f50) - batch-87 BROWSER-scoped-menu-search:
PlaywrightBrowserAutomation.select_modelnow tries menu-scoped option lookup (model_menu_candidates→model_menu_item_selectorrole-filter) before falling back to the legacy globalget_by_role/textpath;model_menu_candidatesextended with[role="menu"];details["option_lookup_source"]records which path matched. Pattern ported fromPerplexity_Orchestrator2/browser/model_selection.py:182-240(f430d39) - batch-88 VERIFY-scoped-menu-vs-orchestrator2: scoped-menu-search pattern verified equivalent to
Perplexity_Orchestrator2/browser/model_selection.py:182-240(production on ~30 Perplexity Pro accounts); multi-match nested-text-node behavior covered by regression tests — Best-alias follow-up closed (48b5a93) - inline AUTH-settle-unknown-state:
auth_statusmakes a bounded_wait_for_shellretry when neither signed-out markers nor the prompt input are visible; unblocks smart auto-decomposition sub-execs that previously hit[auth_failed]mid-flight after exec #1 landed. Structuredbrowser_auth_unknowndiagnostic log added for any remaining logged-out decisions (66a64a8) - inline RESPONSE-strip-streaming-chrome:
shell_noise_linesextended withThinking,Ask a follow-up,Stop response,Regenerate,Sources,Answerso candidate-text cleanup filters them out (d0acbd4) - inline SMOKE-live-smart-acceptance:
scripts/live_smart_smoke.pyreusable harness drives PO2 SPA + captures/api/v1/smartresponse; inline run against a live chrome-profile returnedstatus=200 model_id=claude-sonnet-4-6 answer_len=990with a structured EV-markets answer in 49.7s, no[auth_failed]or shell-chrome markers (352c6b8,94fd2d9) - inline SMOKE-live-debate-acceptance: harness extended with
--pattern debate; live run returnedstatus=200 model_id=claude-sonnet-4-6 answer_len=3293with a full initial_position/challenge/defense/improved_response chain on an EV CO2 debate topic, 61.2s (fc6307f) - inline BROWSER-reset-page-state-between-execs:
PerplexityBrowserAdapter.executenow callsautomation.reset_page_state()(navigates to home via_navigate_home_for_model_selection) afterdismiss_popups, so role-based fan-out / decomposition sub-execs no longer collide on a stale thread URL. Live SMART role-based fan-out proven: three real Perplexity sub-execs (123s/100s/53s, 9429/8962/5102 chars — each distinct), final 8962-char multi-region analysis (66ddfdb)
Incidents (deprecated follow-up specs recorded for history, not in Delivered):
- batch-78 Live UI smoke SMART/DEBATE (first attempt): failed — UI did not expose the patterns; superseded by batch-77 + batch-79 work.
- batch-81 Live SMART on
"best"alias: failed — Perplexity UI non-deterministically selected Sonar instead of Best, one run extracted shell chrome text ("Thinking / Ask a follow-up / Model") instead of the answer. Workaround landed in batch-82 (pin to explicit model); the root-cause fix landed later in the batch-87/batch-88 browser adapter work. - batch-82 Live SMART with cyrillic prompt: failed — CX harness’ PowerShell pipeline converted the cyrillic prompt to
?before it reached Playwright.POST /api/v1/smartstill returned200with the expected model_id, confirming the batch-80 backend fix, but the acceptance (meaningful answer) could not be validated. - batch-83 Live SMART with UTF-8 harness: failed — smart auto-decomposition fired three Perplexity calls for the prompt; adapter timeout of 60s per call clipped two of them, the third extracted 2730 chars at 53s. Route-level response was not captured within the harness’ outer 180s window.
Closed / Scoped out (2026-04-23):
- Cyrillic prompts via some harnesses lose encoding — closed as out-of-scope: the corruption happens on the PowerShell/automation-harness side, not in GraceKelly. See
docs/operator-runbook.mdunderHarness limitations. - AUTH3 persistent session reuse — closed by design via the batch-69 rationale:
AUTH1sync mapping,AUTH2inline banner recovery, and the dedicated Chrome profile directory already cover the single-user local recovery path. Batch-72/bootstrap already covered profile lifecycle.