research: Phase 0 reports 2 and 3 — gateway internals + live state audit
This commit is contained in:
163
research/RESEARCH-02-gateway-internals.md
Normal file
163
research/RESEARCH-02-gateway-internals.md
Normal file
@@ -0,0 +1,163 @@
|
||||
# Research 02 — openclaw gateway internals
|
||||
|
||||
**Subagent:** `ae5ca38f70b1e9626` (Explore)
|
||||
**Completed:** 2026-04-06 12:50 UTC
|
||||
|
||||
## Gateway API surface
|
||||
|
||||
WebSocket-first RPC at `ws://localhost:18789/`, with HTTP fallback routes.
|
||||
|
||||
### HTTP endpoints
|
||||
|
||||
| Method | Path | Purpose |
|
||||
|--------|-------------------------------------|-------------------------------------------------------------------------|
|
||||
| POST | `/hooks/{hookPath}/wake` | Trigger heartbeat or immediate agent wake. Body: `{text, mode}`. |
|
||||
| POST | `/hooks/{hookPath}/agent` | Spawn isolated agent session. Body: `{agentId, sessionKey, message, channel, to, deliver, model, thinking, timeoutSeconds}`. Returns `{ok, runId}`. Idempotency: 60s dedup by `Authorization + X-Idempotency-Key`. |
|
||||
| POST | `/tools/invoke` | Call a tool directly. Body: `{tool, action, args, sessionKey, dryRun}`. |
|
||||
| GET | `/health` / `/healthz` / `/ready` | Liveness / readiness probes. |
|
||||
| GET | `/` and `/app/*` | Built-in web control UI (the SPA we saw when probing earlier). |
|
||||
| Plugin-registered routes | Custom plugin HTTP endpoints; auth enforced per plugin's `requiresAuth`.|
|
||||
|
||||
### Authentication
|
||||
|
||||
- `Authorization: Bearer <token>` OR `X-OpenClaw-Token: <token>` header
|
||||
- Token sources: `gateway.auth.token` in config, `OPENCLAW_GATEWAY_TOKEN` env var, device token at `~/.openclaw/credentials/device-token`
|
||||
- WebSocket auth: passed in URL query `?token=...` or connect frame
|
||||
|
||||
### RPC method RBAC scopes
|
||||
|
||||
- READ: `health`, `channels.status`, `sessions.list`, `cron.list`, `node.list`, ...
|
||||
- WRITE: `send`, `agent`, `agent.wait`, `wake`, `node.invoke`, ...
|
||||
- ADMIN: `config.set`, `agents.create`, `cron.add`, `sessions.reset`, ...
|
||||
- APPROVALS, PAIRING: narrower scoped methods.
|
||||
|
||||
## Session spawn recipe
|
||||
|
||||
### The primary spawn path
|
||||
|
||||
```
|
||||
Client RPC request → gateway dispatch → agentHandlers.agent() → agentCommandFromIngress() → in-process task
|
||||
```
|
||||
|
||||
Not a child process. Sessions run as in-process tasks under the gateway. Each session's message history lives in `~/.openclaw/sessions/*.jsonl`.
|
||||
|
||||
### Agent identity & tool allowlist resolution at spawn
|
||||
|
||||
1. Resolve agent ID from `params.agentId` or `agents.defaults.id`.
|
||||
2. Resolve tool allowlist: first match wins among `agents[id].tools.allow/deny` → `agents[id].toolProfile` → `agents.defaults.tools.*` → subagent role restrictions.
|
||||
3. Hard-deny list always wins (`exec.approval.*`, `node_invoke_system_run`, etc.).
|
||||
4. Runtime context: `runtime="subagent"` (sandboxed) or `"acp"` (host access).
|
||||
5. Workspace and session store selected from agent's config.
|
||||
|
||||
### Subagent / ACP spawn (for nesting)
|
||||
|
||||
```typescript
|
||||
const result = await spawn({
|
||||
task: "Analyze the attached image",
|
||||
mode: "run" | "session",
|
||||
thread: true,
|
||||
agentId: "analyzer"
|
||||
});
|
||||
// Returns { status, childSessionKey: "subagent:uuid", runId }
|
||||
```
|
||||
|
||||
Sessions prefixed `subagent:*` run in a sandbox (gVisor or Docker container). `acp:*` runs on host under parent's cwd. Parent sees subagent output but can't reach into its filesystem.
|
||||
|
||||
## Cron / heartbeat mechanism
|
||||
|
||||
**It's not a crontab. It's an in-process scheduler built into the gateway.**
|
||||
|
||||
### Heartbeat loop
|
||||
|
||||
1. At gateway boot, `startHeartbeatRunner()` in `src/infra/heartbeat-runner.ts` starts.
|
||||
2. For each agent where `agents[id].heartbeat.enabled == true`:
|
||||
- Parse `heartbeat.every` interval
|
||||
- Calculate next-due time
|
||||
- Set a timer (internally a `setInterval` that checks wall clock every ~10s)
|
||||
3. When timer fires:
|
||||
- Read `memory/heartbeat-state.json` (for dedup / avoid double-fires)
|
||||
- Read pending `memory/system-events/` (queued by cron jobs, exec completions, etc.)
|
||||
- Build a prompt from heartbeat config + pending events
|
||||
- Spawn agent with `extraSystemPrompt` = heartbeat prompt
|
||||
- Agent responds (may be empty)
|
||||
- Update heartbeat state file
|
||||
|
||||
### Cron service (parallel to heartbeat)
|
||||
|
||||
- Class: `CronService` in `src/cron/service.ts`
|
||||
- Config: `cron.jobs[].schedule` (cron expression)
|
||||
- State: `~/.openclaw/memory/cron/store.json` with `{id, schedule, agentId, prompt, lastRunMs, nextDueMs}`
|
||||
- Run logs: `~/.openclaw/memory/cron/runs/`
|
||||
- Can enqueue `system-events/*.json` that heartbeat picks up next cycle.
|
||||
|
||||
### Ad hoc triggers
|
||||
|
||||
- `openclaw wake --now` fires heartbeat immediately
|
||||
- `openclaw cron run <id> --force` fires a cron job immediately
|
||||
- `openclaw system-event "text"` queues an event for next heartbeat
|
||||
|
||||
## Plugin discovery and wiring
|
||||
|
||||
### Loader
|
||||
|
||||
`src/plugins/loader.ts` → `loadOpenClawPlugins()`:
|
||||
|
||||
1. Scan `~/.openclaw/plugins/` directory
|
||||
2. Read each plugin's manifest (plugin.yaml or package.json exports)
|
||||
3. Dynamic-import plugin module via jiti
|
||||
4. Initialize `PluginRuntime` with sandbox context, gateway request handler, scoped filesystem access
|
||||
5. Register plugin's hooks (lifecycle events) and gateway methods (HTTP/RPC)
|
||||
|
||||
### Example: Telegram plugin
|
||||
|
||||
- Starts a polling loop calling Telegram Bot API `getUpdates()`
|
||||
- For each incoming message, calls `dispatchGatewayMethod("agent", {...})` to spawn a Claude session
|
||||
- Claude's response routed back via plugin's send handler
|
||||
|
||||
## Replacement difficulty matrix
|
||||
|
||||
| Component | Difficulty | Notes |
|
||||
|--------------------------------------------|-----------|----------------------------------------------------------------|
|
||||
| Session storage (JSONL messages) | Easy | Simple file format, adopt as-is |
|
||||
| Heartbeat scheduler | Medium | Timer logic easy; state/dedup is the work |
|
||||
| Cron service | Medium | Schedule parsing + state persistence |
|
||||
| Hook API (POST /hooks) | Easy | Stateless request/response |
|
||||
| RPC / WebSocket protocol | Hard | Custom protocol with dedup, framing, RBAC |
|
||||
| Tool policy and allowlist resolution | Medium | Glob pattern + inheritance hierarchy |
|
||||
| Plugin system | Hard | Dynamic loading, sandboxed runtime contexts |
|
||||
| Subagent / ACP spawn | Hard | Nesting, thread binding, runtime isolation |
|
||||
| Delivery system (Telegram, Slack, etc.) | Hard | Multi-channel abstraction; tightly coupled |
|
||||
| Control UI | Medium | React SPA; can be replaced if protocol stays compatible |
|
||||
| Authentication and RBAC | Medium | Token validation + scope checks |
|
||||
|
||||
## Don't reinvent this
|
||||
|
||||
1. **Session transcript storage** (`src/config/sessions/`) — JSONL with dedup, compression, archival. Adopt.
|
||||
2. **Plugin SDK** (`src/plugin-sdk/`) — type-safe hook runners, tool registration. Many plugins depend on it.
|
||||
3. **Tool policy resolution** (`src/agents/tool-policy*.ts`) — battle-tested glob + inheritance. 2-3 weeks to replace.
|
||||
4. **Delivery system** (`src/infra/outbound/`) — routes to Telegram/Slack/Discord/WhatsApp with retries and dedup. Very tightly coupled.
|
||||
5. **Exec approvals** (`src/infra/exec-approvals-*`) — human-in-the-loop for sensitive ops. Keep if you plan approvals.
|
||||
6. **Hot-reload config** (`src/gateway/config-reload.ts`) — atomic updates with broadcasts.
|
||||
|
||||
## Migration path summary
|
||||
|
||||
To replace openclaw's orchestration while keeping agents and tools:
|
||||
|
||||
1. Adopt existing session storage (or thin DB adapter)
|
||||
2. Keep plugin system — at minimum the hook-runner pattern for startup/shutdown
|
||||
3. Reimplement heartbeat scheduler as a background job
|
||||
4. Reimplement cron service with same semantics
|
||||
5. Build your own HTTP/RPC gateway, keeping `/tools/invoke` signature for compatibility
|
||||
6. Map hook API to your agent spawn endpoint
|
||||
7. Reimplement tool policy resolution using your config schema
|
||||
8. Adopt delivery system or build equivalent (biggest lift)
|
||||
|
||||
**Estimated effort:** 4-8 weeks competent team, assuming Claude SDK agent harness is mostly intact and session/tool abstractions reused.
|
||||
|
||||
## Caret's conclusion
|
||||
|
||||
Full orchestration replacement is a 4-8 week project. That's NOT what I want.
|
||||
|
||||
**What I DO want is much smaller**: the specific slice that handles Gitea webhook events → policy enforcement → optional agent wake-up. That's a ~600-800 line bun listener, not a whole orchestrator. Everything else (session storage, plugin SDK, delivery system, tool policy) I keep depending on openclaw for, or reuse Claude Code's native primitives (Channels plugins, CronCreate, hooks).
|
||||
|
||||
The research confirms the right shape: build a **minimal webhook listener + event router + script fan-out** that can run standalone, and wire it into Claude Code's native Channels mechanism for the judgment wake-ups. Don't try to replicate the whole orchestrator.
|
||||
Reference in New Issue
Block a user