Research 03 — live openclaw state audit

Subagent: abf0cb0928d823a0b (Explore)
Completed: 2026-04-06 12:50 UTC
Status: openclaw system currently DEGRADED, with multiple cron timeouts and a model misconfiguration.

🚨 CRITICAL FINDING — openclaw is degraded right now

The subagent discovered the openclaw system is already unhealthy:

Model misconfiguration. openclaw.json line 138 references claudehack/claude-sonnet-4-6, which does not exist in the models provider list. All heartbeat jobs fail with FailoverError: Unknown model before any LLM call is made. This matches the errors I saw in the gateway logs earlier today.
Cron job failures. 8 of 12 scheduled cron jobs are in error state with 4-20 consecutive failures each:
- self-review-3day, compress-daily-notes, project-archive, rooh-style-review, rooh-style-reply-handler (20 consecutive failures), queue-doctor-6h, ws-sync, webhook-verify — all failing
- Only rooh-style-learner is currently succeeding
API pressure. Each failover chain is exhausted within 10-11s per attempt: Anthropic Opus and Sonnet time out, OpenAI Codex is rate-limited, and openai/gpt-5.4 has no API key in this environment.
Stale data. db/repos.json is stale — the ws-sync cron has been failing since at least 2026-04-05, so the cached repo list is 48+ hours old.
Webhook E2E test failing every 6h. The webhook-verify cron that's supposed to do a full pipeline delivery test hasn't succeeded for multiple cycles.
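
The model misconfiguration is the kind of error a pre-flight check could catch before any heartbeat fires. A minimal sketch in Python, assuming a simplified config shape: the `providers` and `model_refs` keys are hypothetical and do not reflect the real openclaw.json schema, which this audit did not capture.

```python
def find_unknown_models(config: dict) -> list[str]:
    """Return model references that no configured provider declares.

    Assumed (hypothetical) shape, not the real openclaw.json schema:
      config["providers"]  -> {provider_name: [model_name, ...]}
      config["model_refs"] -> ["provider/model", ...]
    """
    known = {
        f"{provider}/{model}"
        for provider, models in config.get("providers", {}).items()
        for model in models
    }
    # Preserve the order in which the references appear in the config.
    return [ref for ref in config.get("model_refs", []) if ref not in known]


# Illustrative data only; the provider list here is made up for the example.
example = {
    "providers": {"anthropic": ["claude-opus-4-6"]},
    "model_refs": ["anthropic/claude-opus-4-6", "claudehack/claude-sonnet-4-6"],
}
unknown = find_unknown_models(example)
```

Running a check like this at gateway startup would surface the bad reference once, instead of as a repeated FailoverError on every heartbeat.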

Implication for the migration: this is not a "replace a working system" project. It's a "replace a system that's already showing cracks". The migration deadline gets more urgent because the current pipeline is on thin ice.

Registered Gitea webhooks

The sol token lacks read:admin scope so system-level webhooks can't be listed. But the per-repo webhooks paint a clear picture:

Repo	URL	Events	Secret	Content-Type
gitea-webhooks	https://slack.solio.tech/hooks/gitea	issues, issue_comment, issue_label, issue_assign, issue_milestone	NOT SET	application/json
openclaw-mattermost	https://slack.solio.tech/hooks/gitea	(same)	NOT SET	application/json
openclaw-commands	https://slack.solio.tech/hooks/gitea	(same)	NOT SET	application/json
e2e-ticket-system	https://slack.solio.tech/hooks/gitea	(same)	NOT SET	application/json

Pattern: every listed repo has a webhook that posts to the same single endpoint. No HMAC secret is set on any of them, which confirms Research 01's finding that openclaw's pipeline is authenticated via bearer token + nginx ACL, not Gitea HMAC.

Event set is uniform: the same five issue-related events on every hook. Notably MISSING from the event list: push, repository, create, fork. Either those are configured elsewhere (system-level admin webhooks, which I can't see) or the post-repo-audit.sh pipeline only fires when a human manually triggers it.
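
The two gaps above (no secret, no repo-level events) can be checked mechanically once the hook list is in hand. A rough sketch in Python, working on in-memory records like the rows of the table, not on a live API call. The `Hook` type and `audit_hooks` function are illustrative names, not part of openclaw.

```python
from dataclasses import dataclass

# Events the audit found absent from every per-repo hook.
EXPECTED_REPO_EVENTS = frozenset({"push", "repository", "create", "fork"})
ISSUE_EVENTS = frozenset(
    {"issues", "issue_comment", "issue_label", "issue_assign", "issue_milestone"}
)


@dataclass(frozen=True)
class Hook:
    """One per-repo webhook row (hypothetical record type for this sketch)."""
    repo: str
    url: str
    events: frozenset
    has_secret: bool


def audit_hooks(hooks: list) -> dict:
    """Report, per repo, whether the secret is missing and which events are absent."""
    return {
        hook.repo: {
            "missing_secret": not hook.has_secret,
            "missing_events": sorted(EXPECTED_REPO_EVENTS - hook.events),
        }
        for hook in hooks
    }


hooks = [
    Hook(repo, "https://slack.solio.tech/hooks/gitea", ISSUE_EVENTS, has_secret=False)
    for repo in ("gitea-webhooks", "openclaw-mattermost", "openclaw-commands", "e2e-ticket-system")
]
report = audit_hooks(hooks)
```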

Running cron / timer infrastructure

No systemd timers (systemctl not in container)
No classic crontab
All scheduling is openclaw's internal scheduler — queue-daemon.js running as PID 4 inside the gateway container, reading ~/.openclaw/cron/jobs.json
Jobs are in-process timers managed by the gateway's CronService

Active cron jobs (from jobs.json)

Job ID	Schedule (UTC)	Status	Last result
self-review-3day	`0 6 */3 * *`	Enabled	Error (timeout)
compress-daily-notes	`0 4 * * *`	Enabled	Error (timeout)
project-archive	`30 3 * * *`	Enabled	Error (timeout)
rooh-style-review	`0 */4 * * *`	Enabled	Error (timeout)
rooh-style-reply-handler	`*/5 * * * *`	Enabled	Error (timeout, 20 consecutive)
queue-doctor-6h	`0 */6 * * *`	Enabled	Error (timeout)
rooh-style-learner	`0 */2 * * *`	Enabled	OK ✓
ws-sync	daily	Enabled	Error (timeout)
webhook-verify	`0 */6 * * *`	Enabled	Error (timeout)

Daemon: /root/.openclaw/hooks/queue-daemon.js (42KB, active). PID tracked at /root/.openclaw/hooks/queue-daemon.pid.
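
A small helper can turn the job state into a worst-first failure list for triage. This is only a sketch: the field names (`id`, `enabled`, `last_status`, `consecutive_errors`) are assumptions, since the real jobs.json layout is not reproduced in this audit.

```python
def failing_jobs(jobs: list) -> list:
    """Return enabled jobs whose last run errored, most consecutive failures first.

    Field names are hypothetical; adjust them to the real jobs.json layout.
    """
    failing = [
        job for job in jobs
        if job.get("enabled", False) and job.get("last_status") == "error"
    ]
    return sorted(failing, key=lambda job: job.get("consecutive_errors", 0), reverse=True)


sample = [
    {"id": "rooh-style-reply-handler", "enabled": True, "last_status": "error", "consecutive_errors": 20},
    {"id": "rooh-style-learner", "enabled": True, "last_status": "ok", "consecutive_errors": 0},
    # The count for ws-sync is illustrative; the audit only gives a 4-20 range.
    {"id": "ws-sync", "enabled": True, "last_status": "error", "consecutive_errors": 4},
]
worst_first = [job["id"] for job in failing_jobs(sample)]
```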

Heartbeat checklists

Main agent (/root/.openclaw/workspace/HEARTBEAT.md): comprehensive, 28 checks across 7 intervals, managed by tools/heartbeat-scheduler.sh with state at memory/heartbeat-state.json. Last check 2026-04-05 15:18 UTC. Every-heartbeat items include protocol re-read, project DB validation, incident checking, webhook auditing. Every-6h includes repo policy audit, webhook E2E verification.
Xen agent (/root/.openclaw/agents/xen/workspace/HEARTBEAT.md): intentionally empty (comment-only). No periodic tasks for Xen.
Spawner, gitea-worker, coder-agent, god-agent, global-calendar, gym-designer, nutrition, particles-ai: all have empty or task-driven heartbeat configs, not time-driven.

Pattern: only the main agent runs the comprehensive heartbeat. Everything else is event-driven.

Live agent processes

Agent	Runtime	Status
main	Embedded in openclaw gateway	Running (part of the gateway process)
xen	Directory config only	Configured, not actively running
spawner	Directory config only	Configured, not actively running
gitea-worker	Directory config only	Configured, event-driven (spawned by queue daemon)
coder-agent	Directory config only	Not running
god-agent	Workspace only	Not running
global-calendar	Workspace only	Not running
gym-designer	Workspace only	Not running
nutrition-agent	Shares gym workspace	Not running
particles-ai	Directory config only	Not running

Key discovery — main agent is NOT a separate process

The main agent ("Xen") isn't a long-lived process. It's embedded in the openclaw-openclaw-gateway-1 container, running as in-process Claude sessions triggered by user input and heartbeat events. The gateway was last restarted ~5 hours ago.

Configuration:

Primary model: anthropic/claude-opus-4-6
Fallback chain: Sonnet, gpt-5.4, gpt-5.3-codex
Workspace: /root/.openclaw/workspace (155 active projects)
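
To make the failover behaviour concrete, here is a minimal sketch of a fallback chain in Python. It is not the gateway's implementation: `call_with_failover` and `demo_call` are made-up names, and the `FailoverError` class is a stand-in that only borrows its name from the logged error.

```python
from typing import Callable, Sequence


class FailoverError(Exception):
    """Stand-in for the error seen in the logs: every model in the chain failed."""


def call_with_failover(models: Sequence[str], call: Callable[[str], str]) -> str:
    """Try each model in order and return the first successful result."""
    errors = []
    for model in models:
        try:
            return call(model)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(f"{model}: {exc}")
    raise FailoverError("; ".join(errors))


# Chain as described above; the short names are labels, not verified model IDs.
CHAIN = ["anthropic/claude-opus-4-6", "Sonnet", "gpt-5.4", "gpt-5.3-codex"]


def demo_call(model: str) -> str:
    """Fake client for the example: only the last model in the chain answers."""
    if model != "gpt-5.3-codex":
        raise TimeoutError("timed out")
    return f"ok from {model}"


result = call_with_failover(CHAIN, demo_call)
```

When every entry fails, as in the API-pressure finding, the chain raises after the last attempt and carries each per-model error in its message.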

claude-worker (me) is NOT alive right now

The subagent checked the process list for tmux sessions matching the claude-worker startup scripts (/root/start-claude-worker.sh, claude-worker-watcher.sh) and found no such session alive. My current running session (this conversation) was started directly, not via those scripts.

Shared state and data stores

All under /root/.openclaw/:

Store	Path	Purpose	Freshness
Project registry	`workspace/projects/registry.json`	Master index of 155+ projects, status, repo links, manager assignments	Real-time, updated on lifecycle events
Repo database	`workspace/db/repos.json`	Cached list of sol/* Gitea repos, metadata	STALE (48+ hours, ws-sync failing)
Infrastructure registry	`workspace/db/infra.json`	Running containers, scripts, transforms, cron, sandboxes	Last good scan 2026-04-05 15:04 UTC
Agent source mapping	`workspace/db/agent-sources.json`	Which repo built which agent	Manual
Memory system	`workspace/memory/`	Daily logs, lessons, errors, checklist state, heartbeat timestamps, rooh-translator staging	Real-time
Projects DB (text)	`workspace/PROJECTS_DB.md`	Human-readable project summary	Manual
Cron jobs state	`cron/jobs.json`	12 scheduled jobs, execution state, error counts	Very high churn (failing jobs retrying)
Cron logs	`hooks/logs/audit.jsonl`, `incidents.jsonl`	JSONL audit trail	Real-time
Queue inbox	`hooks/queue-inbox/`	Temporary staging for webhook payloads	Ephemeral
Vault	`vault-data/`	Encrypted configuration, secret state	Sandbox-managed
Credentials	`credentials/`	API tokens, OAuth profiles, Anthropic keys (perms 700)	Updated by sync-oauth-token.sh
Agent auth profiles	`agents/*/agent/auth-profiles.json`	Per-agent Anthropic, Google, OpenAI, Codex creds	Last morning sync 2026-04-06 07:11

Architecture: event-driven with scheduled backup scans. Real-time data in workspace/ is Git-versioned (branch main). Calculated data in db/ is cached output of pure bash+jq scripts. No central database; on-disk state is source of truth.
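
Because on-disk state is the source of truth, staleness like the repos.json case can be detected from the files themselves. A minimal sketch, assuming file modification time is an acceptable proxy for freshness (the audit does not say how freshness is actually tracked); the function names and the 48-hour threshold in the usage note are illustrative.

```python
import os
import time


def hours_between(earlier: float, later: float) -> float:
    """Elapsed hours between two Unix timestamps."""
    return (later - earlier) / 3600.0


def file_is_stale(path: str, max_age_hours: float, now=None) -> bool:
    """True if the file at `path` was last modified more than `max_age_hours` ago."""
    current = time.time() if now is None else now
    return hours_between(os.path.getmtime(path), current) > max_age_hours
```

For example, `file_is_stale("/root/.openclaw/workspace/db/repos.json", 48)` would flag the repo cache once it passes the 48-hour mark.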

What the migration actually has to replace

A single main agent (Opus, embedded in gateway) managing 155 projects across 47 sol/* Gitea repos with a distributed sub-agent orchestration pattern. The agent operates on:

Inbound: Gitea webhooks → slack.solio.tech/hooks/gitea → queue daemon → gitea-worker transform → agent spawn
Outbound: Cron jobs (12 scheduled, 8 failing) → delivery queue → agent heartbeats → Mattermost DMs and Gitea issue updates
State: Git-versioned /root/.openclaw/workspace/ with 5-level agent hierarchy and a 28-item heartbeat checklist

The migration only needs to replace the Gitea-facing slice (webhook ingress, event router, script fan-out, repo policy enforcement). The rest of the system — the 155 projects, the workspace, the memory system, the sub-agent orchestration — stays owned by openclaw unless Rooh expands the scope.
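
The event router and script fan-out in that slice could be as small as a dictionary of handlers. A sketch under stated assumptions: the event name is taken to come from Gitea's `X-Gitea-Event` request header, and the handler names (`triage_issue`, `run_repo_audit`) are hypothetical placeholders, not existing openclaw scripts.

```python
from typing import Callable

Handler = Callable[[dict], str]


class EventRouter:
    """Maps a webhook event name to the handlers registered for it."""

    def __init__(self) -> None:
        self._handlers: dict = {}

    def on(self, event: str, handler: Handler) -> None:
        """Register a handler; several handlers per event give the fan-out."""
        self._handlers.setdefault(event, []).append(handler)

    def dispatch(self, event: str, payload: dict) -> list:
        """Run every handler for the event; unknown events are ignored."""
        return [handler(payload) for handler in self._handlers.get(event, [])]


def triage_issue(payload: dict) -> str:
    # Hypothetical handler: stands in for the issue-facing automation.
    return f"triage:{payload['repository']['full_name']}"


def run_repo_audit(payload: dict) -> str:
    # Hypothetical handler: stands in for a repo policy audit script.
    return f"repo-audit:{payload['repository']['full_name']}"


router = EventRouter()
router.on("issues", triage_issue)
router.on("push", run_repo_audit)
```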

Caret's immediate observations

The model misconfiguration needs to be fixed NOW. Even if I don't touch the migration yet, the claudehack/claude-sonnet-4-6 reference is breaking the heartbeat. That's Xen's production pipeline. Worth flagging to Rooh.
The admin scope issue is real. I cannot list system-level webhooks. Any migration requires either token elevation or Rooh manually registering the new endpoint.
Current security is flimsier than the code looks. All four known webhooks have no HMAC secret. The protection is bearer token + nginx ACL. My replacement should do better — implement real HMAC on the raw body.
The migration scope is smaller than I feared. I don't need to replicate the 28-item heartbeat or the 155-project workspace. I only need the Gitea-facing slice. That's a few days of work, not 4-8 weeks.
The 6-hour cron I set up earlier for Rooh (eaeef6ff) is the right shape for the basic policy sweep. It doesn't need to be elaborate.
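
For the HMAC point, the check itself is short. A minimal sketch in Python, assuming the signature arrives as the hex-encoded HMAC-SHA256 of the raw request body in Gitea's `X-Gitea-Signature` header; the function name and the sample secret are illustrative.

```python
import hashlib
import hmac


def verify_gitea_signature(secret: bytes, raw_body: bytes, signature_hex: str) -> bool:
    """Check a hex HMAC-SHA256 signature against the raw, unparsed request body."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    received = signature_hex.strip().lower()
    # Constant-time comparison; comparing bytes avoids errors on non-ASCII input.
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))


# Example with a made-up secret and body.
secret = b"example-secret"
body = b'{"action": "opened"}'
good_signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
```

The body must be verified exactly as received: re-serializing parsed JSON can change whitespace or key order and invalidate the signature.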
