# Research 03: live openclaw state audit

**Subagent:** `abf0cb0928d823a0b` (Explore)
**Completed:** 2026-04-06 12:50 UTC
**Status:** openclaw system currently DEGRADED: multiple cron timeouts and model misconfiguration.

## 🚨 CRITICAL FINDING: openclaw is degraded right now

The subagent discovered the openclaw system is already unhealthy:

1. **Model misconfiguration.** `openclaw.json` line 138 references `claudehack/claude-sonnet-4-6`, which does not exist in the models provider list. All heartbeat jobs fail with `FailoverError: Unknown model` before any LLM call is made. This matches the errors I saw in the gateway logs earlier today.
2. **Cron job failures.** 8 of 12 scheduled cron jobs are in error state with 4-20 consecutive failures each:
   - `self-review-3day`, `compress-daily-notes`, `project-archive`, `rooh-style-review`, `rooh-style-reply-handler` (20 consecutive failures), `queue-doctor-6h`, `ws-sync`, `webhook-verify` are all failing
   - Only `rooh-style-learner` is currently succeeding
3. **API pressure.** Failover chains exhaust within 10-11s per attempt: Anthropic Opus/Sonnet time out, OpenAI Codex is rate-limited, and `openai/gpt-5.4` has no API key in this env.
4. **Stale data.** `db/repos.json` is stale: the `ws-sync` cron has been failing since at least 2026-04-05, so the cached repo list is 48+ hours old.
5. **Webhook E2E test failing every 6h.** The `webhook-verify` cron that's supposed to do a full pipeline delivery test hasn't succeeded for multiple cycles.

**Implication for the migration:** this is not a "replace a working system" project. It's a "replace a system that's already showing cracks". The migration deadline gets more urgent because the current pipeline is on thin ice.

## Registered Gitea webhooks

The sol token lacks `read:admin` scope so system-level webhooks can't be listed.
But the per-repo webhooks paint a clear picture:

| Repo | URL | Events | Secret | Content-Type |
|------|-----|--------|--------|--------------|
| gitea-webhooks | https://slack.solio.tech/hooks/gitea | issues, issue_comment, issue_label, issue_assign, issue_milestone | NOT SET | application/json |
| openclaw-mattermost | https://slack.solio.tech/hooks/gitea | (same) | NOT SET | application/json |
| openclaw-commands | https://slack.solio.tech/hooks/gitea | (same) | NOT SET | application/json |
| e2e-ticket-system | https://slack.solio.tech/hooks/gitea | (same) | NOT SET | application/json |

**Pattern:** every repo has a webhook that posts to the same single endpoint. **The HMAC secret is not set on any of them**, confirming Research 01's finding that openclaw's pipeline is authenticated via bearer token + nginx ACL, not Gitea HMAC. Event set is uniform: the five issue/PR-style events.

Notably MISSING from the event list: `push`, `repository`, `create`, `fork`. Either those are configured elsewhere (system-level admin webhooks, which I can't see) or the `post-repo-audit.sh` pipeline only fires when a human manually triggers it.
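
The two gaps noted above (no secret, no repo-level events) can be checked mechanically once the hook list is in hand. The sketch below is a minimal illustration, not openclaw's tooling: the `Hook` record and `audit_hook` function are hypothetical names, and the expected-event set is simply the four events this audit found missing.

```python
from dataclasses import dataclass

# Events this audit expected but did not see on any per-repo hook.
# Assumption: these are the events a repo-policy pipeline would need.
EXPECTED_EVENTS = frozenset({"push", "repository", "create", "fork"})


@dataclass(frozen=True)
class Hook:
    """Hypothetical record for one registered webhook."""
    repo: str
    url: str
    events: frozenset
    has_secret: bool


def audit_hook(hook: Hook) -> list[str]:
    """Return human-readable findings for a single webhook."""
    findings = []
    if not hook.has_secret:
        findings.append("no HMAC secret")
    missing = sorted(EXPECTED_EVENTS - hook.events)
    if missing:
        findings.append("missing events: " + ", ".join(missing))
    return findings


# Example record mirroring the first row of the table.
gitea_webhooks = Hook(
    repo="gitea-webhooks",
    url="https://slack.solio.tech/hooks/gitea",
    events=frozenset(
        {"issues", "issue_comment", "issue_label", "issue_assign", "issue_milestone"}
    ),
    has_secret=False,
)
```

Calling `audit_hook(gitea_webhooks)` reports both the missing secret and the four missing events for that row.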

## Running cron / timer infrastructure

- **No systemd timers** (systemctl not in container)
- **No classic crontab**
- **All scheduling is openclaw's internal scheduler**: `queue-daemon.js` running as PID 4 inside the gateway container, reading `~/.openclaw/cron/jobs.json`
- Jobs are in-process timers managed by the gateway's `CronService`

### Active cron jobs (from jobs.json)

| Job ID | Schedule (UTC) | Status | Last result |
|--------|----------------|--------|-------------|
| self-review-3day | `0 6 */3 * *` | Enabled | Error (timeout) |
| compress-daily-notes | `0 4 * * *` | Enabled | Error (timeout) |
| project-archive | `30 3 * * *` | Enabled | Error (timeout) |
| rooh-style-review | `0 */4 * * *` | Enabled | Error (timeout) |
| rooh-style-reply-handler | `*/5 * * * *` | Enabled | Error (timeout, 20 consecutive) |
| queue-doctor-6h | `0 */6 * * *` | Enabled | Error (timeout) |
| rooh-style-learner | `0 */2 * * *` | Enabled | OK ✓ |
| ws-sync | daily | Enabled | Error (timeout) |
| webhook-verify | `0 */6 * * *` | Enabled | Error (timeout) |

Daemon: `/root/.openclaw/hooks/queue-daemon.js` (42KB, active). PID tracked at `/root/.openclaw/hooks/queue-daemon.pid`.

## Heartbeat checklists

- **Main agent** (`/root/.openclaw/workspace/HEARTBEAT.md`): comprehensive, 28 checks across 7 intervals, managed by `tools/heartbeat-scheduler.sh` with state at `memory/heartbeat-state.json`. Last check 2026-04-05 15:18 UTC. Every-heartbeat items include protocol re-read, project DB validation, incident checking, webhook auditing. Every-6h includes repo policy audit, webhook E2E verification.
- **Xen agent** (`/root/.openclaw/agents/xen/workspace/HEARTBEAT.md`): intentionally empty (comment-only). **No periodic tasks for Xen.**
- **Spawner, gitea-worker, coder-agent, god-agent, global-calendar, gym-designer, nutrition, particles-ai**: all have empty or task-driven heartbeat configs, not time-driven.

**Pattern:** only the main agent runs the comprehensive heartbeat. Everything else is event-driven.

## Live agent processes

| Agent | Runtime | Status |
|-------|---------|--------|
| main | Embedded in openclaw gateway | Running (part of the gateway process) |
| xen | Directory config only | Configured, not actively running |
| spawner | Directory config only | Configured, not actively running |
| gitea-worker | Directory config only | Configured, event-driven (spawned by queue daemon) |
| coder-agent | Directory config only | Not running |
| god-agent | Workspace only | Not running |
| global-calendar | Workspace only | Not running |
| gym-designer | Workspace only | Not running |
| nutrition-agent | Shares gym workspace | Not running |
| particles-ai | Directory config only | Not running |

### Key discovery: main agent is NOT a separate process

The main agent ("Xen") isn't a long-lived process. It's **embedded in the openclaw-openclaw-gateway-1 container**, running as in-process Claude sessions triggered by user input and heartbeat events. The gateway was last restarted ~5 hours ago.

Configuration:

- Primary model: `anthropic/claude-opus-4-6`
- Fallback chain: Sonnet, gpt-5.4, gpt-5.3-codex
- Workspace: `/root/.openclaw/workspace` (155 active projects)

### claude-worker (me) is NOT alive right now

The subagent checked the process list for `tmux` sessions matching the claude-worker startup scripts (`/root/start-claude-worker.sh`, `claude-worker-watcher.sh`) and **found no such session alive**. My current running session (this conversation) was started directly, not via those scripts.

## Shared state and data stores

All under `/root/.openclaw/`:

| Store | Path | Purpose | Freshness |
|-------|------|---------|-----------|
| Project registry | `workspace/projects/registry.json` | Master index of 155+ projects, status, repo links, manager assignments | Real-time, updated on lifecycle events |
| Repo database | `workspace/db/repos.json` | Cached list of sol/* Gitea repos, metadata | **STALE** (48+ hours, ws-sync failing) |
| Infrastructure registry | `workspace/db/infra.json` | Running containers, scripts, transforms, cron, sandboxes | Last good scan 2026-04-05 15:04 UTC |
| Agent source mapping | `workspace/db/agent-sources.json` | Which repo built which agent | Manual |
| Memory system | `workspace/memory/` | Daily logs, lessons, errors, checklist state, heartbeat timestamps, rooh-translator staging | Real-time |
| Projects DB (text) | `workspace/PROJECTS_DB.md` | Human-readable project summary | Manual |
| Cron jobs state | `cron/jobs.json` | 12 scheduled jobs, execution state, error counts | Very high churn (failing jobs retrying) |
| Cron logs | `hooks/logs/audit.jsonl`, `incidents.jsonl` | JSONL audit trail | Real-time |
| Queue inbox | `hooks/queue-inbox/` | Temporary staging for webhook payloads | Ephemeral |
| Vault | `vault-data/` | Encrypted configuration, secret state | Sandbox-managed |
| Credentials | `credentials/` | API tokens, OAuth profiles, Anthropic keys (perms 700) | Updated by sync-oauth-token.sh |
| Agent auth profiles | `agents/*/agent/auth-profiles.json` | Per-agent Anthropic, Google, OpenAI, Codex creds | Last morning sync 2026-04-06 07:11 |

**Architecture:** event-driven with scheduled backup scans. Real-time data in `workspace/` is Git-versioned (branch `main`). Calculated data in `db/` is cached output of pure bash+jq scripts. No central database; on-disk state is source of truth.
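
Because the cached files in `db/` go stale silently when their cron fails, a freshness check against a per-store age budget is one way to surface that. The sketch below is illustrative only: `is_stale` is a hypothetical helper, and the 6-hour and 24-hour budgets are assumptions, not values taken from openclaw's configuration.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_updated: datetime, now: datetime, max_age: timedelta) -> bool:
    """Return True if the store was last refreshed more than max_age ago."""
    return now - last_updated > max_age


# Timestamps taken from this audit: the last good infra scan and the
# time the audit completed.
infra_last_scan = datetime(2026, 4, 5, 15, 4, tzinfo=timezone.utc)
audit_time = datetime(2026, 4, 6, 12, 50, tzinfo=timezone.utc)

# Age is 21h46m, so the result depends on the budget chosen.
stale_under_6h_budget = is_stale(infra_last_scan, audit_time, timedelta(hours=6))
stale_under_24h_budget = is_stale(infra_last_scan, audit_time, timedelta(hours=24))
```

With these inputs the infra registry is stale under a 6-hour budget but not under a 24-hour one, which is why the budget has to be chosen per store.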

## What the migration actually has to replace

A **single main agent** (Opus, embedded in gateway) managing 155 projects across 47 sol/* Gitea repos with a distributed sub-agent orchestration pattern. The agent operates on:

- **Inbound:** Gitea webhooks → `slack.solio.tech/hooks/gitea` → queue daemon → gitea-worker transform → agent spawn
- **Outbound:** Cron jobs (12 scheduled, 8 failing) → delivery queue → agent heartbeats → Mattermost DMs and Gitea issue updates
- **State:** Git-versioned `/root/.openclaw/workspace/` with 5-level agent hierarchy and a 28-item heartbeat checklist

The migration only needs to replace the **Gitea-facing slice** (webhook ingress, event router, script fan-out, repo policy enforcement). The rest of the system (the 155 projects, the workspace, the memory system, the sub-agent orchestration) stays owned by openclaw unless Rooh expands the scope.

## Caret's immediate observations

1. **The model misconfiguration needs to be fixed NOW.** Even if I don't touch the migration yet, the `claudehack/claude-sonnet-4-6` reference is breaking the heartbeat. That's Xen's production pipeline. Worth flagging to Rooh.
2. **The admin scope issue is real.** I cannot list system-level webhooks. Any migration requires either token elevation or Rooh manually registering the new endpoint.
3. **Current security is flimsier than the code looks.** All four known webhooks have no HMAC secret. The protection is bearer token + nginx ACL. My replacement should do better: implement real HMAC on the raw body.
4. **The migration scope is smaller than I feared.** I don't need to replicate the 28-item heartbeat or the 155-project workspace. I only need the Gitea-facing slice. That's a few days of work, not 4-8 weeks.
5. **Rooh's earlier 6-hour cron I set up (`eaeef6ff`) is the right shape** for the basic policy sweep. It doesn't need to be elaborate.
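
For observation 3, "real HMAC on the raw body" could look roughly like the sketch below. It assumes the replacement receiver gets the unparsed request bytes plus a hex-encoded HMAC-SHA256 signature, which is how Gitea's documentation describes its `X-Gitea-Signature` header; the header name and encoding should be verified against the deployed Gitea version. `verify_signature` is a hypothetical helper name.

```python
import hashlib
import hmac


def verify_signature(raw_body: bytes, secret: str, signature_hex: str) -> bool:
    """Return True if signature_hex is the HMAC-SHA256 of raw_body under secret.

    raw_body must be the exact bytes received, before any JSON parsing:
    re-serialising the payload can change whitespace or key order and
    invalidate the signature.
    """
    expected = hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    received = signature_hex.strip().lower()
    # compare_digest takes time independent of where the inputs differ,
    # so it does not leak how much of the signature matched.
    return hmac.compare_digest(expected.encode("ascii"), received.encode("utf-8"))
```

A receiver would call this before parsing or queueing the payload and reject the request when it returns `False`, including the case where the signature header is absent and an empty string is passed in.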