Skip to content

Operator Runbook

Last updated: 2026-04-27

This runbook covers the current operating surface for GraceKelly:

  • API authentication
  • web UI startup
  • browser execution via Perplexity as the primary path
  • service liveness and readiness
  • metrics scraping
  • browser-adapter recovery
  • storage validation and task-scoped snapshot restore

It is intentionally limited to the current in-process deployment model.

  1. Step 1 — Boot Start the backend with the browser runtime enabled:

    Terminal window
    set GRACEKELLY_BROWSER_ENABLED=true
    set GRACEKELLY_BROWSER_AUTOMATION_BACKEND=playwright
    set GRACEKELLY_EXECUTION_PROFILE=hybrid
    set GRACEKELLY_BROWSER_PROFILE_DIR=<repo>\tmp\browser-recon\perplexity-profile
    python -m uvicorn gracekelly.main:create_app --factory --host 127.0.0.1 --port 8011

    Keep that process running, then open http://127.0.0.1:8011/.

  2. Step 2 — Authenticate browser Bootstrap a dedicated Chrome profile once:

    Terminal window
    gracekelly-create-perplexity-profile

    Finish the Perplexity login manually in that profile, close every Chrome window using it, and reuse the same GRACEKELLY_BROWSER_PROFILE_DIR for the backend.

  3. Step 3 — First smoke

    Terminal window
    python scripts/live_smart_smoke.py --pattern smart

    Expected result: HTTP 200, a meaningful answer, and roughly 1-3 browser submits for the SMART flow. With GRACEKELLY_EXECUTION_PROFILE=dry-run, all eight sync routes auto-gate to dry-run execution without requiring dry_run: true in the request body.

For deeper operations see the sections below:

scripts/ecosystem_smoke.py is the single-command health check across the V2 backend and all three known clients (RAG_Support_Assistant, agent_toolkit, juhub).

Terminal window
.venv\Scripts\python scripts\ecosystem_smoke.py

Step order: pre-flight :8011/healthz/ready → V2 direct (/smart + /orchestrate) → RAG smoke (if :8000 reachable) → agent_toolkit pytest tests/integration/ (if agent_toolkit exists) → juhub --dry-run debate (if Perplexity_Orchestrator2\juhub exists). Missing components are reported as SKIP, not FAIL. Exit code 0 if every step is PASS or SKIP, 1 on the first FAIL.

Useful flags:

  • --skip-rag, --skip-agent-toolkit, --skip-juhub — narrow the run.
  • --gracekelly-url, --rag-url — override base URLs.
  • --verbose — show each subprocess stdout.

This script does not start uvicorn itself; boot V2 first.

scripts\win-autostart\ ships a Windows Task Scheduler XML and .bat helpers to keep V2 running on user logon. This is optional — purely a convenience for single-user local deploy where juhub cron at 08:30 and RAG async traffic both rely on V2 being already up.

Install once, as Administrator:

Terminal window
cd <repo>\scripts\win-autostart
install_autostart.bat

Verify:

Terminal window
schtasks /Query /TN "GraceKelly Autostart" /V /FO LIST

Switch execution profile without editing files:

Terminal window
set_profile.bat hybrid :: or dry-run / api-only

The wrapper gracekelly_uvicorn.bat reads %LOCALAPPDATA%\GraceKelly\profile.env on each start; restart the task to pick up changes. Logs land in %LOCALAPPDATA%\GraceKelly\uvicorn.log. Uninstall: uninstall_autostart.bat (also as Administrator). See scripts\win-autostart\README.md for full reference and troubleshooting.

Optional per-request JSONL log appended to <repo>/logs/usage.jsonl. Designed for honest 30-day usage audits before any simplify-driven refactor (see audit_opus_2026-04-26.md §R1).

Terminal window
set GRACEKELLY_USAGE_TELEMETRY_ENABLED=true
:: optional path override, default: <cwd>/logs/usage.jsonl
set GRACEKELLY_USAGE_TELEMETRY_PATH=

Then restart uvicorn so the new .env is read.

One JSON object per line, written after call_next completes:

{"ts":"2026-04-26T16:55:30.740374Z",
"endpoint":"/api/v1/orchestrate",
"method":"POST",
"status":200,
"duration_ms":14,
"request_id":"308deca4-c1c7-4c84-981c-2a27ed6dd95e",
"prompt_hash":"e32ca25d4ac598a59600c9b6dcc10eaf4f0636acd4e0db2ce70560adc7df146f"}
  • endpoint is UUID-normalised (e.g. /api/v1/tasks/{id}/retry).
  • prompt_hash is sha256(body_bytes) for orchestration POST routes only (/orchestrate, /orchestrate/upload, /orchestrate/stream, /consensus, /compare, /debate, /smart, /smart/v2, /batch, /pipeline); null elsewhere. The body itself is not persisted — only its hash.
  • request_id falls back to the X-Request-ID request header when no correlation middleware is wired, otherwise picks the response header. If a Redis rate-limit 429 returns before correlation middleware runs, telemetry generates a UUID and returns it as X-Request-ID on that 429 response.
Terminal window
python -c "import json,collections; c=collections.Counter(); ^
[c.update([json.loads(l)['endpoint']]) for l in open('logs/usage.jsonl')]; print(c.most_common())"

The middleware never blocks: write failures emit one usage_telemetry.write_failed warning per process, then degrade silently. The body is replayed via request._receive so downstream handlers see it intact.

Weekly Friday 03:00 scheduled task that captures the live Perplexity DOM and diffs it against a stored baseline so UI drift is detected before it breaks a live run (see audit_opus_2026-04-26.md §R4). The task runs the gracekelly-recon-weekly console entry point. It loads the repo .env before resolving CLI defaults, so the scheduled task uses the same GRACEKELLY_BROWSER_PROFILE_DIR as Settings.from_env() unless --profile-dir is passed explicitly.

Right-click <repo>\scripts\win-autostart\install_recon_cron.batRun as administrator. The installer renders recon-task.xml with the current %USERDOMAIN%\%USERNAME% substituted in and converts the file to UTF-16 LE before calling schtasks /Create /XML.

Verify:

Terminal window
schtasks /Query /TN "GraceKelly Selectors Recon" /V /FO LIST
PathMeaning
.workflow/state/perplexity-selectors-baseline.jsonReference snapshot. Created on first run, updated only on explicit acknowledgement.
.workflow/state/perplexity-selectors-latest.jsonMost recent capture, always overwritten.
.workflow/state/perplexity-selectors-drift.flagPresent iff drift was detected on the latest run; deleted automatically when the next run matches the baseline again.
logs/recon-drift.jsonlAppend-only {ts, added, removed, changed} lines for every drifted run.

The captured snapshot is structural: home-button labels, the model menu list, manifest flags (direct_model_button_visible, more_button_visible, more_clicked, model_button_visible_after_more), and the artefact-file inventory. Screenshots and intermediate HTML are written to a temporary directory and discarded after the snapshot is extracted.

When the flag is present:

  1. Inspect logs/recon-drift.jsonl for the structural diff.

  2. Decide whether the drift is benign (a new model added, a button renamed) or breaking (a selector path no longer resolves).

  3. Breaking drift — fix the selector module and rerun integration tests.

  4. To accept the new state as the baseline:

    Terminal window
    copy /Y "<repo>\.workflow\state\perplexity-selectors-latest.json" ^
    "<repo>\.workflow\state\perplexity-selectors-baseline.json"
    del "<repo>\.workflow\state\perplexity-selectors-drift.flag"

The next run will exit 0 again until the next drift.

Terminal window
.\.venv\Scripts\gracekelly-recon-weekly.exe

Exit codes: 0 no drift, 1 drift detected, 2 missing --profile-dir / GRACEKELLY_BROWSER_PROFILE_DIR after .env loading. The manual run requires no other Chrome window to be holding the same profile directory open, otherwise Playwright hits BrowserProfileBusyError — stop the autostart task or close stray Chrome processes first.

Right-click uninstall_recon_cron.batRun as administrator. The script runs schtasks /Delete /TN "GraceKelly Selectors Recon" /F.

The built-in web UI is served from the main app at http://127.0.0.1:8011/. Run the backend, then open that address in the browser.

Static UI shell paths (/, /*.html, /js/*, /css/*, /icons/*) remain public even when GRACEKELLY_API_KEY is configured. This lets the browser load the SPA and linked tools. API calls from that UI still hit protected /api/v1/* routes and need a bearer token or X-API-Key header when endpoint auth is enabled.

HTML pages use a static-compatible CSP because the current vanilla UI still has inline handlers/scripts/styles and analytics.html loads Chart.js from https://cdn.jsdelivr.net. Non-HTML routes keep the stricter CSP without 'unsafe-inline'.

/analytics.html reads only GET /api/v1/analytics. It renders totals, per-model rows, and top models from the current response fields: total_models, total_executions, models, and top_models.

Set GRACEKELLY_API_KEY to require API key on all protected endpoints. Clients must include one of:

  • Authorization: Bearer <key> header
  • X-API-Key: <key> header

Public endpoints (no key required): /health, /healthz/live, /healthz/ready, /docs, /openapi.json, /redoc, /, /*.html, /js/*, /css/*, and /icons/*.

When GRACEKELLY_API_KEY is not set, all endpoints are open (development default).

GraceKelly executes models through your Perplexity Pro subscription via browser automation. Direct provider APIs remain optional fallbacks when you need separate provider access.

  1. Create a Chrome profile logged into Perplexity Pro
  2. Set in .env:
    Terminal window
    GRACEKELLY_BROWSER_ENABLED=true
    GRACEKELLY_BROWSER_AUTOMATION_BACKEND=playwright
    GRACEKELLY_BROWSER_PROFILE_DIR=/path/to/chrome/profile
  3. Available models depend on your Perplexity subscription tier

If the browser adapter fails 3 times consecutively, the circuit breaker opens for 60 seconds. Check /metrics for gracekelly_browser_circuit_breaker_state. MODEL_MISMATCH does not count toward the breaker (Sonar auto-route is recovered by retry, not by tripping). Only PROVIDER_UNAVAILABLE, TIMEOUT, and UNKNOWN_ERROR are counted.

Configure via:

  • GRACEKELLY_BROWSER_CIRCUIT_BREAKER_FAILURE_THRESHOLD (default 3)
  • GRACEKELLY_BROWSER_CIRCUIT_BREAKER_COOLDOWN_SECONDS (default 60)

The browser adapter has three layered protections to keep sessions healthy across long runs:

  • Cold-start navigation — initial page.goto(perplexity.ai) and home re-navigations use a 30s timeout (was 5s). Cold Chromium launches no longer fail the first request.
  • Sonar auto-route retry — when Perplexity overrides the requested model to Sonar, the adapter retries select_model up to 2 extra times with a 1.5s delay before returning MODEL_MISMATCH. Class constants _MODEL_SELECT_RETRIES / _MODEL_SELECT_RETRY_DELAY_S in adapters/browser/perplexity.py.
  • Force session reset on exception — after TIMEOUT or unknown exceptions, the adapter best-effort-closes Playwright/Chromium so the next request relaunches a fresh session. Without this, a degraded session cascades through the breaker.
  • Thinking-toggle memoization — if Perplexity’s UI does not surface a separate “Thinking” toggle for the active model, the adapter records the miss once per session and skips the menu probe on subsequent calls (otherwise ~2s wasted per call).
  • Submit click force=True — the prompt-submit button uses force=True to bypass actionability waits when an overlay briefly covers it.

Live smoke verification: 12/12 sequential /api/v1/smart calls landed clean (0 failures, 0 warnings, 0 breaker trips) on HEAD ceeb27d.

After the 2026-04-26 cold-start refactor, three test doubles in tests/test_playwright_driver.py (_FakePage.goto, _HomeNavigationPage.goto) needed *, timeout: int | None = None to match the production signature. Without the kwarg, production goto(..., timeout=30_000) raised TypeError, the surrounding try/except swallowed it, and model_selection_attempted returned False. Fixed in b166de8; if a future stability change adds a new kwarg to a real page.goto call, propagate it to those test doubles.

  • GET /health
    • fast summary for service, environment, storage backend, active model executions, saturated models
  • GET /api/v1/readiness
    • component-by-component status for storage, execution router, and adapters
  • GET /metrics
    • Prometheus-style gauges for readiness, component states, execution saturation, storage counts when available, and browser circuit-breaker state
  • GET /api/v1/tasks
    • recent operator task summaries with status, execution_mode, dry_run, and failure_code filters
  • GET /api/v1/tasks/{task_id}
    • full execution context: plan scalars, steps, events, terminal execution details
  1. Confirm the process is live:
    • curl http://127.0.0.1:8011/health
  2. Confirm readiness semantics:
    • curl http://127.0.0.1:8011/api/v1/readiness
  3. Confirm scrape surface:
    • curl http://127.0.0.1:8011/metrics

Expected development baseline:

  • storage_backend=memory
  • readiness may be ok even if browser is optional and degraded under the active execution profile
  • gracekelly_execution_active_model_executions 0 when idle

CI installs security scanners on demand rather than keeping them in the default dev extra. To reproduce the CI security gates locally:

Terminal window
pip install pip-audit "bandit[toml]"
pip-audit --ignore-vuln PYSEC-2022-42969
bandit -r src/gracekelly/ -ll -x src/gracekelly/adapters/browser/

storage component:

  • ok: repository reachable and schema report acceptable
  • degraded: connectivity or schema drift issue
  • Action:
    • for PostgreSQL, run the validation CLI
    • for memory, restart the process if the in-memory store itself is corrupted

execution-router component:

  • use active_model_executions, active_by_model, model_limits, and saturated_models
  • saturated_models means requests are being rejected with rate_limited for those models

browser.perplexity component:

  • session shows configuration and last session error
  • automation shows live-driver or scripted-driver detail
  • circuit_breaker shows whether repeated infrastructure failures have opened the browser adapter

Key metric groups:

  • gracekelly_readiness_state
  • gracekelly_component_state
  • gracekelly_execution_active_model_executions
  • gracekelly_execution_model_active
  • gracekelly_execution_model_limit
  • gracekelly_execution_model_saturated
  • gracekelly_storage_task_count, gracekelly_storage_step_count, gracekelly_storage_event_count
  • gracekelly_browser_circuit_breaker_state
  • gracekelly_browser_circuit_breaker_consecutive_failures
  • gracekelly_browser_circuit_breaker_open_count
  • gracekelly_browser_circuit_breaker_fail_fast_rejections

Storage-count gauges are present on both the in-memory backend and PostgreSQL when the repository healthcheck can read the durable tables successfully.

Common task-level failure codes:

auth_failed:

  • browser profile is not logged in or Perplexity showed a late sign-in overlay
  • Recovery:
    • create or refresh a dedicated profile:
      • gracekelly-create-perplexity-profile
    • point runtime to that directory:
      • set GRACEKELLY_BROWSER_PROFILE_DIR=<repo>\tmp\browser-recon\perplexity-profile
    • rerun the live smoke
  • Diagnostics:
    • the adapter logs a structured browser_auth_unknown warning with url/title/body_length/prompt_input state whenever auth still resolves to logged_out after the settle retry. Grep the uvicorn log for browser_auth_unknown to see the actual page state that defeated the auth check.

provider_unavailable:

  • browser driver missing, profile directory busy, browser disabled, or circuit breaker currently open
  • Recovery:
    • confirm browser_enabled=true
    • close any Chrome windows using the same profile directory
    • inspect /api/v1/readiness for browser.perplexity.details.circuit_breaker
    • if the circuit breaker is open, wait for cooldown or restart the service

model_mismatch:

  • requested browser model was not confirmed in the current authenticated UI
  • Recovery:
    • inspect GET /api/v1/models
    • if model availability drift is suspected, capture fresh recon artifacts

Dry-run first start:

  • when GRACEKELLY_EXECUTION_PROFILE=dry-run and browser automation is disabled, GET /api/v1/models returns a dry-run-static browser catalog plus API models so the UI can populate before any live Perplexity refresh exists.
  • after browser automation is enabled, startup treats that static snapshot as refreshable and replaces it with the authenticated Perplexity menu when the catalog refresh succeeds.

timeout or unknown_error:

  • live UI or automation state unstable
  • If an external integrator receives failure_code: "unknown_error" with a Playwright traceback while the backend is running with GRACEKELLY_EXECUTION_PROFILE=dry-run, treat it as the known dry-run profile-gate regression (the dry-run profile must not execute real adapters).
  • Recovery:
    • inspect browser.perplexity health details and breaker counters
    • capture fresh DOM recon
    • rerun the live smoke with debug enabled
  • Per-call budget is controlled by GRACEKELLY_BROWSER_CALL_TIMEOUT_SECONDS (default 120s). Raise it for very long prompts or when SMART fan-out sub-calls are still within the budget but tight.

Fan-out / decomposition (SMART used_roles=True or DEBATE):

  • each sub-exec is routed through the same browser session. The adapter calls reset_page_state() (navigates the UI back to the home ask-input) before every submit so consecutive sub-execs do not extract stale body_after_prompt from the previous thread. If you see multiple sub-execs completing with identical output lengths or anomalously short durations (<2s) in the log, confirm the “Navigating Perplexity UI back to” log line is present between them — if missing, the reset pathway itself has regressed.

Create or refresh a dedicated authenticated profile:

Terminal window
gracekelly-create-perplexity-profile

Capture fresh authenticated recon:

Terminal window
gracekelly-capture-perplexity-recon --prompt "Reply with only OK" --timeout-seconds 60

Run the manual-gated live smoke after the backend is already running with the browser env settings from the Quick Start:

Terminal window
python scripts/live_smart_smoke.py --pattern smart

Browser circuit breaker semantics:

  • counts only provider_unavailable, timeout, and unknown_error
  • opens after the configured threshold
  • fail-fast blocks new browser executions until cooldown expires
  • the next allowed probe closes the breaker on success or reopens it on another counted failure

Runtime knobs:

Terminal window
set GRACEKELLY_BROWSER_CIRCUIT_BREAKER_ENABLED=true
set GRACEKELLY_BROWSER_CIRCUIT_BREAKER_FAILURE_THRESHOLD=3
set GRACEKELLY_BROWSER_CIRCUIT_BREAKER_COOLDOWN_SECONDS=60

Operational guidance:

  • prefer waiting for cooldown if the root cause is transient UI or provider instability
  • restart the service if the browser runtime itself is wedged and cooldown alone is not enough
  • investigate repeated open_count growth before increasing thresholds

Validate PostgreSQL connectivity and schema:

Terminal window
set GRACEKELLY_POSTGRES_DSN=postgresql://postgres:postgres@localhost:5432/gracekelly
set GRACEKELLY_POSTGRES_CONNECT_TIMEOUT_SECONDS=5
python -m gracekelly.tools.validate_postgres

Use --no-bootstrap if the target database should not be modified during validation.

Export a JSON snapshot of recent durable-state records:

Terminal window
set GRACEKELLY_POSTGRES_DSN=postgresql://postgres:postgres@localhost:5432/gracekelly
gracekelly-export-postgres --limit 100

Export specific tasks only:

Terminal window
gracekelly-export-postgres --task-id task-1 --task-id task-2

Export artifacts now carry snapshot_format_version, gracekelly_version, and snapshot_sha256 so restores can reject incompatible or corrupted JSON before task rows are touched. The export command summary now also echoes generated_at, compressed_output, output_exists, output_size_bytes, manifest_status, snapshot_status_consistency_status, selection_status, missing_task_ids_status, field-level manifest verification statuses, requested_task_ids, exported_task_ids, missing_task_ids, task_count, step_count, event_count, repository_health, and repository_schema, so the operator can capture both selection results and storage state without opening the snapshot file immediately. If export fails after the snapshot manifest was already assembled, the error payload preserves that manifest context too. If the export path ends with .gz, the snapshot is written as gzip-compressed JSON.

Inspect a snapshot artifact offline before restore:

Terminal window
gracekelly-inspect-snapshot --input <repo>\tmp\postgres-export\selected.json

That command verifies snapshot_sha256 when present and reports manifest details such as manifest_status, snapshot_status_consistency_status, selection_status, missing_task_ids_status, field-level manifest verification statuses, selection, task_count, step_count, event_count, exported_task_ids, missing_task_ids, input_size_bytes, and import_ready without requiring database connectivity. If the file cannot be parsed, the error payload still includes compressed_input and input_size_bytes.

Restore a snapshot back into PostgreSQL:

Terminal window
set GRACEKELLY_POSTGRES_DSN=postgresql://postgres:postgres@localhost:5432/gracekelly
gracekelly-import-postgres --input <repo>\tmp\postgres-export\selected.json

Restore semantics:

  • imported task_id values are replaced in place
  • related step and event rows are replaced together with the task
  • unrelated tasks remain in the database
  • snapshot_format_version is verified when present
  • snapshot_sha256 is verified when present

Restore only selected task bundles from a larger snapshot:

Terminal window
gracekelly-import-postgres --input <repo>\tmp\postgres-export\selected.json --task-id task-1 --task-id task-2

If one or more requested task_id values are absent, the command still restores the bundles that exist and returns status=partial plus missing_task_ids in the JSON summary.

Use --allow-degraded-schema only for deliberate manual recovery when the guardrail would otherwise block a needed restore.

Validate restore inputs without writing:

Terminal window
gracekelly-import-postgres --input <repo>\tmp\postgres-export\selected.json --dry-run

That success payload includes repository_health and repository_schema, so operators can confirm the target backend state in the same preflight call. It also echoes compressed_input, input_size_bytes, source_format_status, source_migration_status, source_checksum_status, source_snapshot_sha256, source_import_ready, source_status_consistency_status, source_manifest_status, source_selection_status, source_selection, source_task_count, source_step_count, source_event_count, source_exported_task_ids, source_missing_task_ids, and source_missing_task_ids_status, so the restore report preserves the source artifact manifest context. Failed import preflights also include compressed_input, input_size_bytes, and the source compatibility verdict fields derivable from the parsed artifact, so the operator can still identify and classify the rejected snapshot from the error payload. Compressed .json.gz snapshot input is supported directly.

scripts/live_smart_smoke.py is a manual-gated operator harness for end-to-end browser-backed smoke checks. It is not scheduled and is not part of CI; run it only when you explicitly want to spend live browser quota.

  1. Chrome profile is already authenticated to Perplexity Pro, for example <repo>/chrome-profile/.
  2. Uvicorn is running on http://127.0.0.1:8011/ with at least:
    Terminal window
    $env:GRACEKELLY_BROWSER_ENABLED="true"
    $env:GRACEKELLY_EXECUTION_PROFILE="hybrid"
  3. No other chrome.exe process is using that profile:
    Terminal window
    Get-CimInstance Win32_Process -Filter "name = 'chrome.exe'" |
    Where-Object { $_.CommandLine -like '*<repo>\\chrome-profile*' } |
    Select-Object ProcessId, CommandLine
    The command should return no rows before you launch the smoke.
PatternAPI pathUI labelDefault prompt summaryExpected quotaMin answer length
smart/api/v1/smartУмный выборEV market comparison across Europe, USA, China1-3 submits500 chars
debate/api/v1/debateДебатыEV market comparison with challenge/defense loop3-5 submits500 chars
consensus/api/v1/consensusnot surfaced; direct POST fallback3 leading EV manufacturers in China3-5 submits300 chars
compare/api/v1/comparenot surfaced; direct POST fallbackClaude Sonnet 4.6 vs GPT-5.4 reasoning comparison5 submits400 chars
upload/api/v1/orchestrate/uploadn/a; composer attachment flowsummarize attached file1 submit150 chars

Quota expectations are approximate and assume a healthy authenticated browser session. smart may fan out into 1-3 submits, debate usually needs 3-5, consensus usually needs 3-5, compare fans out across five models, and upload is expected to be a single submit.

The UI upload path intentionally collapses any current multi-model menu selection to one model form field before POSTing /api/v1/orchestrate/upload. With the default Claude + GPT menu item, the upload smoke uses the first resolved model (Claude Sonnet 4.6) and should report model_count=1 with no quorum cancellation.

Terminal window
python scripts/live_smart_smoke.py --pattern smart
python scripts/live_smart_smoke.py --pattern debate
python scripts/live_smart_smoke.py --pattern consensus
python scripts/live_smart_smoke.py --pattern compare
python scripts/live_smart_smoke.py --pattern upload --attachment <path>

Reports are written to .workflow/outbox/<tag>-<PATTERN>-report.md and raw payloads to .workflow/outbox/<tag>-<PATTERN>-response.json.

Status: success means the harness completed prompt-to-response end-to-end and the evaluator accepted the pattern-specific response without AUTH_FAILED, shell-chrome, forbidden markers, or length/topic failures.

Status: failure means the report contains explicit rejection reasons such as non-200 status, missing answer field, too-short output, forbidden markers, or missing topic keywords. Inspect the paired response.json to see the captured HTTP status and the raw response body fields that the evaluator examined.

Fallback behaviour is validated via unit-tests in tests/test_router_fallback.py, not through the live harness; browser-adapter failure is not reproduced artificially in smoke runs.

This harness does not cover smart/v2, batch, or pipeline. Those paths stay validated through unit tests and route-level smoke coverage such as tests/test_routes_*.

When the harness or any other CLI tool passes a cyrillic prompt through a PowerShell pipe (echo 'привет' | python ...), PowerShell’s default encoding can downgrade the text to ? placeholders before the child process sees it. That is a PowerShell / harness issue, not a GraceKelly backend bug.

Workarounds:

  • pass the prompt directly with --prompt, for example python scripts/live_smart_smoke.py --prompt "привет"
  • set $OutputEncoding = [System.Text.Encoding]::UTF8 before piping in the current session
  • use --ascii-fallback for deterministic ASCII smoke prompts

Reference incident: Phase 17 / batch-82 live SMART failure recorded in docs/phased-roadmap.md.

Authentication is persisted through the dedicated Chrome profile directory (default chrome-profile/, configurable via GRACEKELLY_BROWSER_PROFILE_DIR). There is no separate session token file to rotate or copy.

Current local operation is single-account: GRACEKELLY_BROWSER_PROFILE_DIR=<repo>/chrome-profile and no GRACEKELLY_ACCOUNTS pool. Keep that profile signed in with the intended Gmail-backed Perplexity account; do not add alternate browser profiles or API fallback keys unless the operating mode changes deliberately.

If another Chrome process is still using that profile, startup can fail with the live-profile guard or BrowserProfileBusyError. Use a dedicated profile created by gracekelly-create-perplexity-profile and follow docs/onboarding.md for the bootstrap / recovery flow.

Selector drift symptom: if model selection reports Upload files or images, Search, or a stray New model, rerun the live smoke and inspect PerplexitySelectors.model_button. The model button can include a mode suffix such as Gemini 3.1 Pro Thinking; the selector must prefer known model labels over the broad composer menu fallback. For the current Pro-backed profile, keep GPT-5.4 as the GPT browser model; do not admit Max-only menu labels such as GPT-5.5, Claude Opus, or Max into the runtime browser catalog.

  1. Find the recent failures:
    • GET /api/v1/tasks?status=failed&dry_run=false
  2. Narrow by backend shape:
    • GET /api/v1/tasks?execution_mode=browser
  3. Narrow by failure class:
    • GET /api/v1/tasks?failure_code=provider_unavailable
  4. Inspect one task deeply:
    • GET /api/v1/tasks/{task_id}

Use execution_details, terminal event payloads, and step event details together. That is where current adapter diagnostics, browser driver metadata, and circuit-breaker-origin failures surface without widening storage tables.

If callers supply metadata.trace_id on POST /api/v1/orchestrate, GraceKelly now echoes that value in:

  • route-level orchestrate.request / orchestrate.accepted
  • orchestrator-level task.submit.started / task.submit.completed
  • task.event_persistence_failed warnings

That gives a minimal correlation key across HTTP entry, task creation, and best-effort event logging without requiring an external tracing system.

The GET /health endpoint returns a minimal summary by default (status, environment, backend name, saturation counts). Internal component details are hidden.

To expose full details (storage schema, browser circuit-breaker state, adapter keys present/absent):

Terminal window
set GRACEKELLY_HEALTH_EXPOSE_DETAILS=true

Security implications:

  • The detailed view reveals which adapters have API keys configured and whether the browser session is authenticated.
  • Keep GRACEKELLY_HEALTH_EXPOSE_DETAILS=false (default) in any internet-facing deployment.
  • The detailed view is safe on an internal monitoring network or when the health endpoint is behind API key authentication.
  • GET /api/v1/health/detailed always returns full adapter and embeddings status; protect it with GRACEKELLY_API_KEY if health endpoints are public.

POST /api/v1/orchestrate runs synchronously in a thread pool. To cap execution time and return HTTP 504 to the caller instead of holding the connection indefinitely:

Terminal window
set GRACEKELLY_ORCHESTRATE_TIMEOUT_SECONDS=60

Setting 0 (default) disables the timeout.

How it works:

  • The orchestration coroutine is wrapped in asyncio.wait_for with the configured timeout.
  • On breach, the endpoint returns 504 Gateway Timeout with detail: "Orchestration request timed out."
  • The background thread continues running until the underlying adapter call completes or fails on its own - the timeout only affects the HTTP response, not the execution itself.

Tuning guidance:

  • Start with the slowest expected model timeout + 10 s of overhead (e.g. Anthropic 120 s -> set 130 s).
  • For dry-run mode, 5 s is sufficient.
  • For consensus V2 with multiple rounds, account for max_rounds x variations_per_round x model_timeout_seconds.
  • Pair this setting with load-balancer / reverse-proxy timeouts: both must be larger than the orchestrate timeout.

V2 is the only active orchestrator. All three known clients run on http://127.0.0.1:8011:

  • RAG_Support_Assistant (RAG_Support_Assistant, port 8000)
    • Smoke: python RAG_Support_Assistant\scripts\gracekelly_smoke.py
    • Failover provider: ollama (when V2 returns 5xx).
  • agent_toolkit (agent_toolkit)
    • LangGraph wrapper (OrchestratorChatModel) → V2 endpoints by GKPattern.
    • Test: cd agent_toolkit && uv run pytest tests/integration/
  • juhub (Perplexity_Orchestrator2\juhub, scheduled 08:30 daily)
    • backend/scheduler.py does pre-flight :8011/healthz/ready; if V2 is down, the run is skipped with an error log (no auto-start).
    • Manual dry-run: cd Perplexity_Orchestrator2 && set GK_DRY_RUN=1 && python -m juhub.backend.scheduler --now

Legacy V1 orchestrator at Perplexity_Orchestrator2 (:8001, /api/gk/*) is deprecated 2026-04-25 and not used by any client. See Perplexity_Orchestrator2\DEPRECATED.md.