From d36fa6538aaca1c8063aa755901d6f1dd9bf7253 Mon Sep 17 00:00:00 2001
From: Caret
Date: Mon, 6 Apr 2026 12:50:22 +0000
Subject: [PATCH] =?UTF-8?q?research:=20Phase=200=20reports=202=20and=203?=
 =?UTF-8?q?=20=E2=80=94=20gateway=20internals=20+=20live=20state=20audit?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 research/RESEARCH-02-gateway-internals.md | 163 ++++++++++++++++++++++
 research/RESEARCH-03-live-state-audit.md  | 132 ++++++++++++++++++
 2 files changed, 295 insertions(+)
 create mode 100644 research/RESEARCH-02-gateway-internals.md
 create mode 100644 research/RESEARCH-03-live-state-audit.md

diff --git a/research/RESEARCH-02-gateway-internals.md b/research/RESEARCH-02-gateway-internals.md
new file mode 100644
index 0000000..588779f
--- /dev/null
+++ b/research/RESEARCH-02-gateway-internals.md
@@ -0,0 +1,163 @@
+# Research 02: openclaw gateway internals
+
+**Subagent:** `ae5ca38f70b1e9626` (Explore)
+**Completed:** 2026-04-06 12:50 UTC
+
+## Gateway API surface
+
+WebSocket-first RPC at `ws://localhost:18789/`, with HTTP fallback routes.
+
+### HTTP endpoints
+
+| Method | Path                                | Purpose                                                                 |
+|--------|-------------------------------------|-------------------------------------------------------------------------|
+| POST   | `/hooks/{hookPath}/wake`            | Trigger heartbeat or immediate agent wake. Body: `{text, mode}`.        |
+| POST   | `/hooks/{hookPath}/agent`           | Spawn isolated agent session. Body: `{agentId, sessionKey, message, channel, to, deliver, model, thinking, timeoutSeconds}`. Returns `{ok, runId}`. Idempotency: 60s dedup by `Authorization + X-Idempotency-Key`. |
+| POST   | `/tools/invoke`                     | Call a tool directly. Body: `{tool, action, args, sessionKey, dryRun}`. |
+| GET    | `/health` / `/healthz` / `/ready`   | Liveness / readiness probes.                                            |
+| GET    | `/` and `/app/*`                    | Built-in web control UI (the SPA we saw when probing earlier).          |
+|        | Plugin-registered routes            | Custom plugin HTTP endpoints; auth enforced per plugin's `requiresAuth`. |
+
+### Authentication
+
+- `Authorization: Bearer <token>` OR `X-OpenClaw-Token: <token>` header
+- Token sources: `gateway.auth.token` in config, `OPENCLAW_GATEWAY_TOKEN` env var, device token at `~/.openclaw/credentials/device-token`
+- WebSocket auth: passed in URL query `?token=...` or connect frame
+
+### RPC method RBAC scopes
+
+- READ: `health`, `channels.status`, `sessions.list`, `cron.list`, `node.list`, ...
+- WRITE: `send`, `agent`, `agent.wait`, `wake`, `node.invoke`, ...
+- ADMIN: `config.set`, `agents.create`, `cron.add`, `sessions.reset`, ...
+- APPROVALS, PAIRING: narrower scoped methods.
+
+## Session spawn recipe
+
+### The primary spawn path
+
+```
+Client RPC request → gateway dispatch → agentHandlers.agent() → agentCommandFromIngress() → in-process task
+```
+
+Not a child process. Sessions run as in-process tasks under the gateway. Each session's message history lives in `~/.openclaw/sessions/*.jsonl`.
+
+### Agent identity & tool allowlist resolution at spawn
+
+1. Resolve agent ID from `params.agentId` or `agents.defaults.id`.
+2. Resolve tool allowlist: first match wins among `agents[id].tools.allow/deny` → `agents[id].toolProfile` → `agents.defaults.tools.*` → subagent role restrictions.
+3. Hard-deny list always wins (`exec.approval.*`, `node_invoke_system_run`, etc.).
+4. Runtime context: `runtime="subagent"` (sandboxed) or `"acp"` (host access).
+5. Workspace and session store selected from agent's config.
+
+### Subagent / ACP spawn (for nesting)
+
+```typescript
+const result = await spawn({
+  task: "Analyze the attached image",
+  mode: "run", // or "session"
+  thread: true,
+  agentId: "analyzer"
+});
+// Returns { status, childSessionKey: "subagent:uuid", runId }
+```
+
+Sessions prefixed `subagent:*` run in a sandbox (gVisor or Docker container). `acp:*` runs on host under parent's cwd. Parent sees subagent output but can't reach into its filesystem.
+
+## Cron / heartbeat mechanism
+
+**It's not a crontab. It's an in-process scheduler built into the gateway.**
+
+### Heartbeat loop
+
+1. At gateway boot, `startHeartbeatRunner()` in `src/infra/heartbeat-runner.ts` starts.
+2. For each agent where `agents[id].heartbeat.enabled == true`:
+   - Parse `heartbeat.every` interval
+   - Calculate next-due time
+   - Set a timer (internally a `setInterval` that checks wall clock every ~10s)
+3. When timer fires:
+   - Read `memory/heartbeat-state.json` (for dedup / avoid double-fires)
+   - Read pending `memory/system-events/` (queued by cron jobs, exec completions, etc.)
+   - Build a prompt from heartbeat config + pending events
+   - Spawn agent with `extraSystemPrompt` = heartbeat prompt
+   - Agent responds (may be empty)
+   - Update heartbeat state file
+
+### Cron service (parallel to heartbeat)
+
+- Class: `CronService` in `src/cron/service.ts`
+- Config: `cron.jobs[].schedule` (cron expression)
+- State: `~/.openclaw/memory/cron/store.json` with `{id, schedule, agentId, prompt, lastRunMs, nextDueMs}`
+- Run logs: `~/.openclaw/memory/cron/runs/`
+- Can enqueue `system-events/*.json` that heartbeat picks up next cycle.
+
+### Ad hoc triggers
+
+- `openclaw wake --now` fires heartbeat immediately
+- `openclaw cron run --force` fires a cron job immediately
+- `openclaw system-event "text"` queues an event for next heartbeat
+
+## Plugin discovery and wiring
+
+### Loader
+
+`src/plugins/loader.ts` → `loadOpenClawPlugins()`:
+
+1. Scan `~/.openclaw/plugins/` directory
+2. Read each plugin's manifest (plugin.yaml or package.json exports)
+3. Dynamic-import plugin module via jiti
+4. Initialize `PluginRuntime` with sandbox context, gateway request handler, scoped filesystem access
+5. Register plugin's hooks (lifecycle events) and gateway methods (HTTP/RPC)
+
+### Example: Telegram plugin
+
+- Starts a polling loop calling Telegram Bot API `getUpdates()`
+- For each incoming message, calls `dispatchGatewayMethod("agent", {...})` to spawn a Claude session
+- Claude's response routed back via plugin's send handler
+
+## Replacement difficulty matrix
+
+| Component                                  | Difficulty | Notes                                                          |
+|--------------------------------------------|-----------|----------------------------------------------------------------|
+| Session storage (JSONL messages)           | Easy      | Simple file format, adopt as-is                                |
+| Heartbeat scheduler                        | Medium    | Timer logic easy; state/dedup is the work                      |
+| Cron service                               | Medium    | Schedule parsing + state persistence                           |
+| Hook API (POST /hooks)                     | Easy      | Stateless request/response                                     |
+| RPC / WebSocket protocol                   | Hard      | Custom protocol with dedup, framing, RBAC                      |
+| Tool policy and allowlist resolution       | Medium    | Glob pattern + inheritance hierarchy                           |
+| Plugin system                              | Hard      | Dynamic loading, sandboxed runtime contexts                    |
+| Subagent / ACP spawn                       | Hard      | Nesting, thread binding, runtime isolation                     |
+| Delivery system (Telegram, Slack, etc.)    | Hard      | Multi-channel abstraction; tightly coupled                     |
+| Control UI                                 | Medium    | React SPA; can be replaced if protocol stays compatible        |
+| Authentication and RBAC                    | Medium    | Token validation + scope checks                                |
+
+## Don't reinvent this
+
+1. **Session transcript storage** (`src/config/sessions/`): JSONL with dedup, compression, archival. Adopt.
+2. **Plugin SDK** (`src/plugin-sdk/`): type-safe hook runners, tool registration. Many plugins depend on it.
+3. **Tool policy resolution** (`src/agents/tool-policy*.ts`): battle-tested glob + inheritance. 2-3 weeks to replace.
+4. **Delivery system** (`src/infra/outbound/`): routes to Telegram/Slack/Discord/WhatsApp with retries and dedup. Very tightly coupled.
+5. **Exec approvals** (`src/infra/exec-approvals-*`): human-in-the-loop for sensitive ops. Keep if you plan approvals.
+6. **Hot-reload config** (`src/gateway/config-reload.ts`): atomic updates with broadcasts.
+
+## Migration path summary
+
+To replace openclaw's orchestration while keeping agents and tools:
+
+1. Adopt existing session storage (or thin DB adapter)
+2. Keep plugin system, at minimum the hook-runner pattern for startup/shutdown
+3. Reimplement heartbeat scheduler as a background job
+4. Reimplement cron service with same semantics
+5. Build your own HTTP/RPC gateway, keeping `/tools/invoke` signature for compatibility
+6. Map hook API to your agent spawn endpoint
+7. Reimplement tool policy resolution using your config schema
+8. Adopt delivery system or build equivalent (biggest lift)
+
+**Estimated effort:** 4-8 weeks for a competent team, assuming Claude SDK agent harness is mostly intact and session/tool abstractions reused.
+
+## Caret's conclusion
+
+Full orchestration replacement is a 4-8 week project. That's NOT what I want.
+
+**What I DO want is much smaller**: the specific slice that handles Gitea webhook events → policy enforcement → optional agent wake-up. That's a ~600-800 line bun listener, not a whole orchestrator. Everything else (session storage, plugin SDK, delivery system, tool policy) I keep depending on openclaw for, or reuse Claude Code's native primitives (Channels plugins, CronCreate, hooks).
+
+The research confirms the right shape: build a **minimal webhook listener + event router + script fan-out** that can run standalone, and wire it into Claude Code's native Channels mechanism for the judgment wake-ups. Don't try to replicate the whole orchestrator.
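Editor's sketch (not part of the patch): a minimal illustration of the "event router + script fan-out" core that the conclusion of Research 02 describes, written in TypeScript because the listener is planned for bun. The event names are the ones these reports mention; the `Route` shape, the `routeEvent` helper, and every script name except `post-repo-audit.sh` are hypothetical placeholders, not openclaw or Gitea APIs.

```typescript
// Hypothetical shape: which deterministic scripts to run for an event,
// and whether the event also warrants a judgment wake-up of an agent.
type Route = {
  scripts: string[];
  wakeAgent: boolean;
};

// Illustrative routing table. A Map is used instead of a plain object so
// that names like "constructor" cannot match inherited properties.
const ROUTES = new Map<string, Route>([
  ["issues", { scripts: ["check-issue-policy.sh"], wakeAgent: true }],
  ["issue_comment", { scripts: [], wakeAgent: true }],
  ["issue_label", { scripts: ["check-issue-policy.sh"], wakeAgent: false }],
  ["repository", { scripts: ["post-repo-audit.sh"], wakeAgent: false }],
]);

// Unknown events are accepted but produce no work, so enabling a new event
// type on the Gitea side does not break the listener.
function routeEvent(eventType: string): Route {
  return ROUTES.get(eventType) ?? { scripts: [], wakeAgent: false };
}

// Example: a repository event fans out to the audit script without a wake-up.
const decision = routeEvent("repository");
console.log(decision.scripts, decision.wakeAgent);
```

In a real listener the event name would most likely be read from the delivery's `X-Gitea-Event` header, and the returned script list would be handed to whatever process runner the listener uses. Keeping the routing decision a pure function makes the policy table easy to test without a running server.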
diff --git a/research/RESEARCH-03-live-state-audit.md b/research/RESEARCH-03-live-state-audit.md
new file mode 100644
index 0000000..b2aff15
--- /dev/null
+++ b/research/RESEARCH-03-live-state-audit.md
@@ -0,0 +1,132 @@
+# Research 03: live openclaw state audit
+
+**Subagent:** `abf0cb0928d823a0b` (Explore)
+**Completed:** 2026-04-06 12:50 UTC
+**Status:** openclaw system currently DEGRADED: multiple cron timeouts and model misconfiguration.
+
+## 🚨 CRITICAL FINDING: openclaw is degraded right now
+
+The subagent discovered the openclaw system is already unhealthy:
+
+1. **Model misconfiguration.** `openclaw.json` line 138 references `claudehack/claude-sonnet-4-6`, which does not exist in the models provider list. All heartbeat jobs fail with `FailoverError: Unknown model` before any LLM call is made. This matches the errors I saw in the gateway logs earlier today.
+2. **Cron job failures.** 8 of 12 scheduled cron jobs are in error state with 4-20 consecutive failures each:
+   - `self-review-3day`, `compress-daily-notes`, `project-archive`, `rooh-style-review`, `rooh-style-reply-handler` (20 consecutive failures), `queue-doctor-6h`, `ws-sync`, `webhook-verify`: all failing
+   - Only `rooh-style-learner` is currently succeeding
+3. **API pressure.** Failover chains exhaust within 10-11s per attempt: Anthropic Opus/Sonnet time out, OpenAI Codex is rate-limited, and openai/gpt-5.4 has no API key in this env.
+4. **Stale data.** `db/repos.json` is stale: the `ws-sync` cron has been failing since at least 2026-04-05, so the cached repo list is 48+ hours old.
+5. **Webhook E2E test failing every 6h.** The `webhook-verify` cron that's supposed to do a full pipeline delivery test hasn't succeeded for multiple cycles.
+
+**Implication for the migration:** this is not a "replace a working system" project. It's a "replace a system that's already showing cracks". The migration deadline gets more urgent because the current pipeline is on thin ice.
+
+## Registered Gitea webhooks
+
+The sol token lacks `read:admin` scope so system-level webhooks can't be listed. But the per-repo webhooks paint a clear picture:
+
+| Repo                 | URL                                    | Events                                                              | Secret    | Content-Type       |
+|----------------------|----------------------------------------|---------------------------------------------------------------------|-----------|--------------------|
+| gitea-webhooks       | https://slack.solio.tech/hooks/gitea   | issues, issue_comment, issue_label, issue_assign, issue_milestone   | NOT SET   | application/json   |
+| openclaw-mattermost  | https://slack.solio.tech/hooks/gitea   | (same)                                                              | NOT SET   | application/json   |
+| openclaw-commands    | https://slack.solio.tech/hooks/gitea   | (same)                                                              | NOT SET   | application/json   |
+| e2e-ticket-system    | https://slack.solio.tech/hooks/gitea   | (same)                                                              | NOT SET   | application/json   |
+
+**Pattern:** every repo has a webhook that posts to the same single endpoint. The HMAC secret is not set on any of them, confirming Research 01's finding that openclaw's pipeline is authenticated via bearer token + nginx ACL, not Gitea HMAC.
+
+Event set is uniform: the five issue/PR-style events. Notably MISSING from the event list: `push`, `repository`, `create`, `fork`. Either those are configured elsewhere (system-level admin webhooks, which I can't see) or the `post-repo-audit.sh` pipeline only fires when a human manually triggers it.
+
+## Running cron / timer infrastructure
+
+- **No systemd timers** (systemctl not in container)
+- **No classic crontab**
+- **All scheduling is openclaw's internal scheduler**: `queue-daemon.js` running as PID 4 inside the gateway container, reading `~/.openclaw/cron/jobs.json`
+- Jobs are in-process timers managed by the gateway's `CronService`
+
+### Active cron jobs (from jobs.json)
+
+| Job ID                      | Schedule (UTC)      | Status   | Last result                     |
+|-----------------------------|---------------------|----------|---------------------------------|
+| self-review-3day            | `0 6 */3 * *`       | Enabled  | Error (timeout)                 |
+| compress-daily-notes        | `0 4 * * *`         | Enabled  | Error (timeout)                 |
+| project-archive             | `30 3 * * *`        | Enabled  | Error (timeout)                 |
+| rooh-style-review           | `0 */4 * * *`       | Enabled  | Error (timeout)                 |
+| rooh-style-reply-handler    | `*/5 * * * *`       | Enabled  | Error (timeout, 20 consecutive) |
+| queue-doctor-6h             | `0 */6 * * *`       | Enabled  | Error (timeout)                 |
+| rooh-style-learner          | `0 */2 * * *`       | Enabled  | OK ✓                            |
+| ws-sync                     | daily               | Enabled  | Error (timeout)                 |
+| webhook-verify              | `0 */6 * * *`       | Enabled  | Error (timeout)                 |
+
+Daemon: `/root/.openclaw/hooks/queue-daemon.js` (42KB, active). PID tracked at `/root/.openclaw/hooks/queue-daemon.pid`.
+
+## Heartbeat checklists
+
+- **Main agent** (`/root/.openclaw/workspace/HEARTBEAT.md`): comprehensive, 28 checks across 7 intervals, managed by `tools/heartbeat-scheduler.sh` with state at `memory/heartbeat-state.json`. Last check 2026-04-05 15:18 UTC. Every-heartbeat items include protocol re-read, project DB validation, incident checking, webhook auditing. Every-6h includes repo policy audit, webhook E2E verification.
+- **Xen agent** (`/root/.openclaw/agents/xen/workspace/HEARTBEAT.md`): intentionally empty (comment-only). **No periodic tasks for Xen.**
+- **Spawner, gitea-worker, coder-agent, god-agent, global-calendar, gym-designer, nutrition, particles-ai**: all have empty or task-driven heartbeat configs, not time-driven.
+
+**Pattern:** only the main agent runs the comprehensive heartbeat. Everything else is event-driven.
+
+## Live agent processes
+
+| Agent            | Runtime                              | Status                                                 |
+|------------------|--------------------------------------|--------------------------------------------------------|
+| main             | Embedded in openclaw gateway         | Running (part of the gateway process)                  |
+| xen              | Directory config only                | Configured, not actively running                       |
+| spawner          | Directory config only                | Configured, not actively running                       |
+| gitea-worker     | Directory config only                | Configured, event-driven (spawned by queue daemon)     |
+| coder-agent      | Directory config only                | Not running                                            |
+| god-agent        | Workspace only                       | Not running                                            |
+| global-calendar  | Workspace only                       | Not running                                            |
+| gym-designer     | Workspace only                       | Not running                                            |
+| nutrition-agent  | Shares gym workspace                 | Not running                                            |
+| particles-ai     | Directory config only                | Not running                                            |
+
+### Key discovery: main agent is NOT a separate process
+
+The main agent ("Xen") isn't a long-lived process. It's **embedded in the openclaw-openclaw-gateway-1 container**, running as in-process Claude sessions triggered by user input and heartbeat events. The gateway was last restarted ~5 hours ago.
+
+Configuration:
+- Primary model: `anthropic/claude-opus-4-6`
+- Fallback chain: Sonnet, gpt-5.4, gpt-5.3-codex
+- Workspace: `/root/.openclaw/workspace` (155 active projects)
+
+### claude-worker (me) is NOT alive right now
+
+The subagent checked the process list for `tmux` sessions matching the claude-worker startup scripts (`/root/start-claude-worker.sh`, `claude-worker-watcher.sh`) and **found no such session alive**. My current running session (this conversation) was started directly, not via those scripts.
+
+## Shared state and data stores
+
+All under `/root/.openclaw/`:
+
+| Store                   | Path                                 | Purpose                                                | Freshness                        |
+|-------------------------|--------------------------------------|--------------------------------------------------------|----------------------------------|
+| Project registry        | `workspace/projects/registry.json`   | Master index of 155+ projects, status, repo links, manager assignments | Real-time, updated on lifecycle events |
+| Repo database           | `workspace/db/repos.json`            | Cached list of sol/* Gitea repos, metadata             | **STALE** (48+ hours, ws-sync failing) |
+| Infrastructure registry | `workspace/db/infra.json`            | Running containers, scripts, transforms, cron, sandboxes | Last good scan 2026-04-05 15:04 UTC |
+| Agent source mapping    | `workspace/db/agent-sources.json`    | Which repo built which agent                           | Manual                           |
+| Memory system           | `workspace/memory/`                  | Daily logs, lessons, errors, checklist state, heartbeat timestamps, rooh-translator staging | Real-time |
+| Projects DB (text)      | `workspace/PROJECTS_DB.md`           | Human-readable project summary                         | Manual                           |
+| Cron jobs state         | `cron/jobs.json`                     | 12 scheduled jobs, execution state, error counts       | Very high churn (failing jobs retrying) |
+| Cron logs               | `hooks/logs/audit.jsonl`, `incidents.jsonl` | JSONL audit trail                               | Real-time                        |
+| Queue inbox             | `hooks/queue-inbox/`                 | Temporary staging for webhook payloads                 | Ephemeral                        |
+| Vault                   | `vault-data/`                        | Encrypted configuration, secret state                  | Sandbox-managed                  |
+| Credentials             | `credentials/`                       | API tokens, OAuth profiles, Anthropic keys (perms 700) | Updated by sync-oauth-token.sh   |
+| Agent auth profiles     | `agents/*/agent/auth-profiles.json`  | Per-agent Anthropic, Google, OpenAI, Codex creds       | Last morning sync 2026-04-06 07:11 |
+
+**Architecture:** event-driven with scheduled backup scans. Real-time data in `workspace/` is Git-versioned (branch `main`). Calculated data in `db/` is cached output of pure bash+jq scripts. No central database; on-disk state is source of truth.
+
+## What the migration actually has to replace
+
+A **single main agent** (Opus, embedded in gateway) managing 155 projects across 47 sol/* Gitea repos with a distributed sub-agent orchestration pattern. The agent operates on:
+
+- **Inbound:** Gitea webhooks → `slack.solio.tech/hooks/gitea` → queue daemon → gitea-worker transform → agent spawn
+- **Outbound:** Cron jobs (12 scheduled, 8 failing) → delivery queue → agent heartbeats → Mattermost DMs and Gitea issue updates
+- **State:** Git-versioned `/root/.openclaw/workspace/` with 5-level agent hierarchy and a 28-item heartbeat checklist
+
+The migration only needs to replace the **Gitea-facing slice** (webhook ingress, event router, script fan-out, repo policy enforcement). The rest of the system (the 155 projects, the workspace, the memory system, the sub-agent orchestration) stays owned by openclaw unless Rooh expands the scope.
+
+## Caret's immediate observations
+
+1. **The model misconfiguration needs to be fixed NOW.** Even if I don't touch the migration yet, the `claudehack/claude-sonnet-4-6` reference is breaking the heartbeat. That's Xen's production pipeline. Worth flagging to Rooh.
+2. **The admin scope issue is real.** I cannot list system-level webhooks. Any migration requires either token elevation or Rooh manually registering the new endpoint.
+3. **Current security is flimsier than the code looks.** All four known webhooks have no HMAC secret. The protection is bearer token + nginx ACL. My replacement should do better: implement real HMAC on the raw body.
+4. **The migration scope is smaller than I feared.** I don't need to replicate the 28-item heartbeat or the 155-project workspace. I only need the Gitea-facing slice. That's a few days of work, not 4-8 weeks.
+5. **Rooh's earlier 6-hour cron I set up (`eaeef6ff`) is the right shape** for the basic policy sweep. It doesn't need to be elaborate.
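Editor's sketch (not part of the patch): one way the "real HMAC on the raw body" from the observations in Research 03 could look in the TypeScript listener. It assumes each hook is given a shared secret and that Gitea sends the hex-encoded HMAC-SHA256 of the request body in the `X-Gitea-Signature` header; the function name `verifyGiteaSignature` and its parameters are hypothetical. It uses only `node:crypto`, which bun also provides.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Returns true only if signatureHex is the HMAC-SHA256 of rawBody under secret.
// rawBody must be the bytes exactly as received, before any JSON parsing,
// because re-serialising the payload can change whitespace and key order.
function verifyGiteaSignature(
  rawBody: string | Uint8Array,
  signatureHex: string,
  secret: string,
): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  // Malformed hex decodes to a shorter buffer and is rejected by the length check.
  const received = Buffer.from(signatureHex, "hex");
  if (received.length !== expected.length) {
    return false;
  }
  // Constant-time comparison avoids leaking how many leading bytes matched.
  return timingSafeEqual(received, expected);
}

// Example with a locally computed signature (illustrative secret and payload).
const exampleBody = '{"action":"opened"}';
const exampleSig = createHmac("sha256", "example-secret").update(exampleBody).digest("hex");
console.log(verifyGiteaSignature(exampleBody, exampleSig, "example-secret"));
```

The listener would call this before routing and reject the delivery (for example with HTTP 401) when it returns `false`. A bearer token and nginx ACL could stay in place as additional layers.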