Research 03 — live openclaw state audit
Subagent: abf0cb0928d823a0b (Explore)
Completed: 2026-04-06 12:50 UTC
Status: openclaw system currently DEGRADED — multiple cron timeouts and model misconfiguration.
🚨 CRITICAL FINDING — openclaw is degraded right now
The subagent discovered the openclaw system is already unhealthy:
- Model misconfiguration. `openclaw.json` line 138 references `claudehack/claude-sonnet-4-6`, which does not exist in the models provider list. All heartbeat jobs fail with `FailoverError: Unknown model` before any LLM call is made. This matches the errors I saw in the gateway logs earlier today.
- Cron job failures. 8 of 12 scheduled cron jobs are in error state with 4-20 consecutive failures each:
  - `self-review-3day`, `compress-daily-notes`, `project-archive`, `rooh-style-review`, `rooh-style-reply-handler` (20 consecutive failures), `queue-doctor-6h`, `ws-sync`, `webhook-verify`: all failing
  - Only `rooh-style-learner` is currently succeeding
- API pressure. Failover chains exhaust within 10-11s per attempt: Anthropic Opus/Sonnet timeout, OpenAI Codex rate-limited, openai/gpt-5.4 has no API key in this env.
- Stale data. `db/repos.json` is stale: the `ws-sync` cron has been failing since at least 2026-04-05, so the cached repo list is 48+ hours old.
- Webhook E2E test failing every 6h. The `webhook-verify` cron that's supposed to do a full pipeline delivery test hasn't succeeded for multiple cycles.
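A pre-flight check could catch a dangling model reference like the one above before any heartbeat runs. The following is a minimal sketch only: the config shape (a `models.providers` map and a `jobs` list with a `model` field), the function name, and the sample job IDs are all assumptions for illustration, since the real `openclaw.json` schema was not inspected beyond line 138.

```python
def find_unknown_models(config: dict) -> list[tuple[str, str]]:
    """Return (job_id, model_ref) pairs whose model is not declared by any provider."""
    # Assumed shape: {"models": {"providers": {"anthropic": ["claude-opus-4-6", ...]}}}
    providers = config.get("models", {}).get("providers", {})
    known = {f"{name}/{model}" for name, models in providers.items() for model in models}

    unknown = []
    # Assumed shape: {"jobs": [{"id": "...", "model": "provider/model"}]}
    for job in config.get("jobs", []):
        ref = job.get("model")
        if ref is not None and ref not in known:
            unknown.append((job["id"], ref))
    return unknown


# Illustrative data only; job IDs and the provider list are hypothetical.
example = {
    "models": {"providers": {"anthropic": ["claude-opus-4-6", "claude-sonnet-4-6"]}},
    "jobs": [
        {"id": "heartbeat", "model": "claudehack/claude-sonnet-4-6"},
        {"id": "example-ok-job", "model": "anthropic/claude-opus-4-6"},
    ],
}

for job_id, ref in find_unknown_models(example):
    print(f"{job_id}: unknown model {ref}")
```

Running a check of this kind at gateway start would turn a silent per-heartbeat failure into a single startup error.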
Implication for the migration: this is not a "replace a working system" project. It's a "replace a system that's already showing cracks". The migration deadline gets more urgent because the current pipeline is on thin ice.
Registered Gitea webhooks
The sol token lacks read:admin scope so system-level webhooks can't be listed. But the per-repo webhooks paint a clear picture:
| Repo | URL | Events | Secret | Content-Type |
|---|---|---|---|---|
| gitea-webhooks | https://slack.solio.tech/hooks/gitea | issues, issue_comment, issue_label, issue_assign, issue_milestone | NOT SET | application/json |
| openclaw-mattermost | https://slack.solio.tech/hooks/gitea | (same) | NOT SET | application/json |
| openclaw-commands | https://slack.solio.tech/hooks/gitea | (same) | NOT SET | application/json |
| e2e-ticket-system | https://slack.solio.tech/hooks/gitea | (same) | NOT SET | application/json |
Pattern: every repo has a webhook that posts to the same single endpoint. HMAC secret is not set on any of them — confirming Research 01's finding that openclaw's pipeline is authenticated via bearer token + nginx ACL, not Gitea HMAC.
Event set is uniform: the five issue/PR-style events. Notably MISSING from the event list: push, repository, create, fork. Either those are configured elsewhere (system-level admin webhooks, which I can't see) or the post-repo-audit.sh pipeline only fires when a human manually triggers it.
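The two gaps described above (no secret, no push-style events) can be expressed as a small audit over the table rows. This is a sketch under assumptions: the row dictionaries and the `secret_set` field are a hand-made mirror of the table, not the shape of any Gitea API response.

```python
# The five events registered on every audited per-repo webhook.
ISSUE_EVENTS = {"issues", "issue_comment", "issue_label", "issue_assign", "issue_milestone"}

# Events the audit notes as absent from the per-repo hooks.
EXPECTED_EXTRA = {"push", "repository", "create", "fork"}

# Hand-built rows mirroring the table; "secret_set" is a hypothetical field name.
hooks = [
    {"repo": "gitea-webhooks", "events": ISSUE_EVENTS, "secret_set": False},
    {"repo": "openclaw-mattermost", "events": ISSUE_EVENTS, "secret_set": False},
    {"repo": "openclaw-commands", "events": ISSUE_EVENTS, "secret_set": False},
    {"repo": "e2e-ticket-system", "events": ISSUE_EVENTS, "secret_set": False},
]


def audit_hook(hook: dict, expected: set[str]) -> dict:
    """Report which expected events a hook lacks and whether it needs a secret."""
    return {
        "repo": hook["repo"],
        "missing_events": sorted(expected - hook["events"]),
        "needs_secret": not hook["secret_set"],
    }


reports = [audit_hook(h, EXPECTED_EXTRA) for h in hooks]
for report in reports:
    print(report)
```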
Running cron / timer infrastructure
- No systemd timers (systemctl not in container)
- No classic crontab
- All scheduling is openclaw's internal scheduler: `queue-daemon.js` running as PID 4 inside the gateway container, reading `~/.openclaw/cron/jobs.json`
- Jobs are in-process timers managed by the gateway's `CronService`
Active cron jobs (from jobs.json)
| Job ID | Schedule (UTC) | Status | Last result |
|---|---|---|---|
| self-review-3day | `0 6 */3 * *` | Enabled | Error (timeout) |
| compress-daily-notes | `0 4 * * *` | Enabled | Error (timeout) |
| project-archive | `30 3 * * *` | Enabled | Error (timeout) |
| rooh-style-review | `0 */4 * * *` | Enabled | Error (timeout) |
| rooh-style-reply-handler | `*/5 * * * *` | Enabled | Error (timeout, 20 consecutive) |
| queue-doctor-6h | `0 */6 * * *` | Enabled | Error (timeout) |
| rooh-style-learner | `0 */2 * * *` | Enabled | OK ✓ |
| ws-sync | daily | Enabled | Error (timeout) |
| webhook-verify | `0 */6 * * *` | Enabled | Error (timeout) |
Daemon: /root/.openclaw/hooks/queue-daemon.js (42KB, active). PID tracked at /root/.openclaw/hooks/queue-daemon.pid.
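A status table like the one above could be regenerated from the job state file rather than by hand. The sketch below assumes a hypothetical record shape (`id`, `enabled`, `lastStatus`, `consecutiveErrors`); the actual field names in `jobs.json` were not recorded in this audit, and the third sample record is purely illustrative.

```python
def failing_jobs(jobs: list[dict]) -> list[tuple[str, int]]:
    """Return (job_id, consecutive_errors) for enabled jobs in error, worst first."""
    failing = [
        (job["id"], job.get("consecutiveErrors", 0))
        for job in jobs
        if job.get("enabled", True) and job.get("lastStatus") == "error"
    ]
    return sorted(failing, key=lambda item: item[1], reverse=True)


# Field names are assumptions; only the first two rows reflect audited values.
sample = [
    {"id": "rooh-style-reply-handler", "enabled": True, "lastStatus": "error", "consecutiveErrors": 20},
    {"id": "rooh-style-learner", "enabled": True, "lastStatus": "ok", "consecutiveErrors": 0},
    {"id": "example-job", "enabled": True, "lastStatus": "error", "consecutiveErrors": 4},  # illustrative
]

for job_id, errors in failing_jobs(sample):
    print(f"{job_id}: {errors} consecutive failures")
```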
Heartbeat checklists
- Main agent (`/root/.openclaw/workspace/HEARTBEAT.md`): comprehensive, 28 checks across 7 intervals, managed by `tools/heartbeat-scheduler.sh` with state at `memory/heartbeat-state.json`. Last check 2026-04-05 15:18 UTC. Every-heartbeat items include protocol re-read, project DB validation, incident checking, webhook auditing. Every-6h includes repo policy audit, webhook E2E verification.
- Xen agent (`/root/.openclaw/agents/xen/workspace/HEARTBEAT.md`): intentionally empty (comment-only). No periodic tasks for Xen.
- Spawner, gitea-worker, coder-agent, god-agent, global-calendar, gym-designer, nutrition, particles-ai: all have empty or task-driven heartbeat configs, not time-driven.
Pattern: only the main agent runs the comprehensive heartbeat. Everything else is event-driven.
Live agent processes
| Agent | Runtime | Status |
|---|---|---|
| main | Embedded in openclaw gateway | Running (part of the gateway process) |
| xen | Directory config only | Configured, not actively running |
| spawner | Directory config only | Configured, not actively running |
| gitea-worker | Directory config only | Configured, event-driven (spawned by queue daemon) |
| coder-agent | Directory config only | Not running |
| god-agent | Workspace only | Not running |
| global-calendar | Workspace only | Not running |
| gym-designer | Workspace only | Not running |
| nutrition-agent | Shares gym workspace | Not running |
| particles-ai | Directory config only | Not running |
Key discovery — main agent is NOT a separate process
The main agent ("Xen") isn't a long-lived process. It's embedded in the openclaw-openclaw-gateway-1 container, running as in-process Claude sessions triggered by user input and heartbeat events. The gateway was last restarted ~5 hours ago.
Configuration:
- Primary model: `anthropic/claude-opus-4-6`
- Fallback chain: Sonnet, gpt-5.4, gpt-5.3-codex
- Workspace: `/root/.openclaw/workspace` (155 active projects)
claude-worker (me) is NOT alive right now
The subagent checked the process list for tmux sessions matching the claude-worker startup scripts (/root/start-claude-worker.sh, claude-worker-watcher.sh) and found no such session alive. My current running session (this conversation) was started directly, not via those scripts.
Shared state and data stores
All under /root/.openclaw/:
| Store | Path | Purpose | Freshness |
|---|---|---|---|
| Project registry | `workspace/projects/registry.json` | Master index of 155+ projects, status, repo links, manager assignments | Real-time, updated on lifecycle events |
| Repo database | `workspace/db/repos.json` | Cached list of sol/* Gitea repos, metadata | STALE (48+ hours, ws-sync failing) |
| Infrastructure registry | `workspace/db/infra.json` | Running containers, scripts, transforms, cron, sandboxes | Last good scan 2026-04-05 15:04 UTC |
| Agent source mapping | `workspace/db/agent-sources.json` | Which repo built which agent | Manual |
| Memory system | `workspace/memory/` | Daily logs, lessons, errors, checklist state, heartbeat timestamps, rooh-translator staging | Real-time |
| Projects DB (text) | `workspace/PROJECTS_DB.md` | Human-readable project summary | Manual |
| Cron jobs state | `cron/jobs.json` | 12 scheduled jobs, execution state, error counts | Very high churn (failing jobs retrying) |
| Cron logs | `hooks/logs/audit.jsonl`, `incidents.jsonl` | JSONL audit trail | Real-time |
| Queue inbox | `hooks/queue-inbox/` | Temporary staging for webhook payloads | Ephemeral |
| Vault | `vault-data/` | Encrypted configuration, secret state | Sandbox-managed |
| Credentials | `credentials/` | API tokens, OAuth profiles, Anthropic keys (perms 700) | Updated by sync-oauth-token.sh |
| Agent auth profiles | `agents/*/agent/auth-profiles.json` | Per-agent Anthropic, Google, OpenAI, Codex creds | Last morning sync 2026-04-06 07:11 |
Architecture: event-driven with scheduled backup scans. Real-time data in workspace/ is Git-versioned (branch main). Calculated data in db/ is cached output of pure bash+jq scripts. No central database; on-disk state is source of truth.
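Because the cached stores in `db/` are only as fresh as the cron jobs that write them, a freshness check is a cheap guard. The sketch below uses the two timestamps recorded in this audit (the infra scan and the audit completion time); the 24-hour threshold and the function names are assumptions, not anything openclaw defines.

```python
from datetime import datetime, timedelta, timezone


def age_of(last_updated: datetime, now: datetime) -> timedelta:
    """Elapsed time between a store's last update and the reference time."""
    return now - last_updated


def is_stale(last_updated: datetime, now: datetime, max_age: timedelta) -> bool:
    """True when the store is older than the allowed maximum age."""
    return age_of(last_updated, now) > max_age


# Timestamps taken from this audit; the threshold is a hypothetical choice.
audit_time = datetime(2026, 4, 6, 12, 50, tzinfo=timezone.utc)
infra_scan = datetime(2026, 4, 5, 15, 4, tzinfo=timezone.utc)
MAX_AGE = timedelta(hours=24)

print(age_of(infra_scan, audit_time))             # 21:46:00
print(is_stale(infra_scan, audit_time, MAX_AGE))  # False
```

Under that assumed threshold the infra registry would still pass, while a store 48 hours old, as reported for `repos.json`, would be flagged.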
What the migration actually has to replace
A single main agent (Opus, embedded in gateway) managing 155 projects across 47 sol/* Gitea repos with a distributed sub-agent orchestration pattern. The agent operates on:
- Inbound: Gitea webhooks → `slack.solio.tech/hooks/gitea` → queue daemon → gitea-worker transform → agent spawn
- Outbound: Cron jobs (12 scheduled, 8 failing) → delivery queue → agent heartbeats → Mattermost DMs and Gitea issue updates
- State: Git-versioned `/root/.openclaw/workspace/` with 5-level agent hierarchy and a 28-item heartbeat checklist
The migration only needs to replace the Gitea-facing slice (webhook ingress, event router, script fan-out, repo policy enforcement). The rest of the system — the 155 projects, the workspace, the memory system, the sub-agent orchestration — stays owned by openclaw unless Rooh expands the scope.
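The event router and script fan-out in that slice can be quite small. The following is a sketch of one possible shape, not the design openclaw uses: the `EventRouter` class, the handler names, and the payload keys (`action`, `number`) are assumptions for illustration. The event name would typically come from the webhook's event-type header.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], str]


class EventRouter:
    """Maps an event name to every handler registered for it (fan-out)."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def on(self, event: str) -> Callable[[Handler], Handler]:
        """Decorator that registers a handler for one event name."""
        def register(fn: Handler) -> Handler:
            self._handlers[event].append(fn)
            return fn
        return register

    def dispatch(self, event: str, payload: dict) -> list[str]:
        """Run all handlers for the event in registration order; unknown events are a no-op."""
        return [handler(payload) for handler in self._handlers.get(event, [])]


router = EventRouter()


@router.on("issues")
def log_issue(payload: dict) -> str:  # hypothetical handler
    return f"logged issue #{payload['number']} ({payload['action']})"


@router.on("issues")
def check_policy(payload: dict) -> str:  # hypothetical handler
    return f"policy check queued for issue #{payload['number']}"


print(router.dispatch("issues", {"action": "opened", "number": 7}))
print(router.dispatch("push", {}))  # no handlers registered
```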
Caret's immediate observations
- The model misconfiguration needs to be fixed NOW. Even if I don't touch the migration yet, the `claudehack/claude-sonnet-4-6` reference is breaking the heartbeat. That's Xen's production pipeline. Worth flagging to Rooh.
- The admin scope issue is real. I cannot list system-level webhooks. Any migration requires either token elevation or Rooh manually registering the new endpoint.
- Current security is flimsier than the code looks. All four known webhooks have no HMAC secret. The protection is bearer token + nginx ACL. My replacement should do better and implement real HMAC verification on the raw request body.
- The migration scope is smaller than I feared. I don't need to replicate the 28-item heartbeat or the 155-project workspace. I only need the Gitea-facing slice. That's a few days of work, not 4-8 weeks.
- The 6-hour cron I set up earlier for Rooh (`eaeef6ff`) is the right shape for the basic policy sweep. It doesn't need to be elaborate.
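HMAC verification on the raw body, as proposed in the observations above, could look like the sketch below. It assumes the sender signs the exact request bytes with HMAC-SHA256 and transmits the hex digest in a header (Gitea's documentation describes an `X-Gitea-Signature` header for this; confirm against the deployed version). The function name and sample secret are hypothetical.

```python
import hashlib
import hmac


def verify_signature(raw_body: bytes, secret: bytes, signature_hex: str) -> bool:
    """Check a hex HMAC-SHA256 signature against the unparsed request body."""
    # Sign the raw bytes as received; re-serialising parsed JSON would change them.
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    provided = signature_hex.strip().lower()
    # Constant-time comparison on bytes avoids timing leaks and non-ASCII errors.
    return hmac.compare_digest(expected.encode("utf-8"), provided.encode("utf-8"))


# Illustrative values only.
secret = b"example-shared-secret"
body = b'{"action": "opened", "number": 7}'
good_signature = hmac.new(secret, body, hashlib.sha256).hexdigest()

print(verify_signature(body, secret, good_signature))         # True
print(verify_signature(body + b" ", secret, good_signature))  # False
```

The check has to run before the body is parsed, so the ingress handler needs access to the untouched request bytes.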