Research 03 — live openclaw state audit

Subagent: abf0cb0928d823a0b (Explore)
Completed: 2026-04-06 12:50 UTC
Status: openclaw system currently DEGRADED, with multiple cron timeouts and a model misconfiguration.

🚨 CRITICAL FINDING — openclaw is degraded right now

The subagent discovered the openclaw system is already unhealthy:

  1. Model misconfiguration. openclaw.json line 138 references claudehack/claude-sonnet-4-6, which does not exist in the models provider list. All heartbeat jobs fail with FailoverError: Unknown model before any LLM call is made. This matches the errors I saw in the gateway logs earlier today.
  2. Cron job failures. 8 of 12 scheduled cron jobs are in error state with 4-20 consecutive failures each:
    • self-review-3day, compress-daily-notes, project-archive, rooh-style-review, rooh-style-reply-handler (20 consecutive failures), queue-doctor-6h, ws-sync, webhook-verify — all failing
    • Only rooh-style-learner is currently succeeding
  3. API pressure. Failover chains exhaust within 10-11s per attempt: Anthropic Opus/Sonnet time out, OpenAI Codex is rate-limited, and openai/gpt-5.4 has no API key in this environment.
  4. Stale data. db/repos.json is stale — the ws-sync cron has been failing since at least 2026-04-05, so the cached repo list is 48+ hours old.
  5. Webhook E2E test failing every 6h. The webhook-verify cron that's supposed to do a full pipeline delivery test hasn't succeeded for multiple cycles.
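A preflight check along the following lines could catch finding 1 before any heartbeat job runs. This is a minimal sketch: the function name and the two flat lists are assumptions made for illustration, and the real layout of openclaw.json is not shown in this audit, so extracting the IDs from the config is left out.

```python
def find_unknown_models(referenced, known):
    """Return referenced model IDs that are missing from the known provider list."""
    known_ids = set(known)
    return sorted({model for model in referenced if model not in known_ids})


# Illustrative values only; the real provider list lives in openclaw.json
# and its exact structure is an unknown here.
known_models = ["anthropic/claude-opus-4-6", "openai/gpt-5.4"]
referenced_models = ["anthropic/claude-opus-4-6", "claudehack/claude-sonnet-4-6"]

unknown = find_unknown_models(referenced_models, known_models)
if unknown:
    print("unknown model references:", ", ".join(unknown))
```

Running a check like this at gateway start would turn a silent per-job FailoverError into a single visible configuration error.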

Implication for the migration: this is not a "replace a working system" project. It's a "replace a system that's already showing cracks". The migration deadline gets more urgent because the current pipeline is on thin ice.

Registered Gitea webhooks

The sol token lacks read:admin scope so system-level webhooks can't be listed. But the per-repo webhooks paint a clear picture:

| Repo | URL | Events | Secret | Content-Type |
|---|---|---|---|---|
| gitea-webhooks | https://slack.solio.tech/hooks/gitea | issues, issue_comment, issue_label, issue_assign, issue_milestone | NOT SET | application/json |
| openclaw-mattermost | https://slack.solio.tech/hooks/gitea | (same) | NOT SET | application/json |
| openclaw-commands | https://slack.solio.tech/hooks/gitea | (same) | NOT SET | application/json |
| e2e-ticket-system | https://slack.solio.tech/hooks/gitea | (same) | NOT SET | application/json |

Pattern: every repo inspected has a webhook that posts to the same single endpoint. The HMAC secret is not set on any of them, which confirms Research 01's finding that openclaw's pipeline is authenticated via bearer token + nginx ACL, not Gitea HMAC.

Event set is uniform: the five issue/PR-style events. Notably MISSING from the event list: push, repository, create, fork. Either those are configured elsewhere (system-level admin webhooks, which I can't see) or the post-repo-audit.sh pipeline only fires when a human manually triggers it.
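The gap described above can be checked mechanically once the per-repo hook list has been fetched. The sketch below assumes a simplified record shape (`repo` and `events` keys) chosen for illustration; it is not the raw Gitea API response, and the helper name is hypothetical.

```python
def missing_events(hooks, wanted):
    """Map each repo to the wanted events its webhook does not subscribe to."""
    report = {}
    for hook in hooks:
        absent = sorted(set(wanted) - set(hook["events"]))
        if absent:
            report[hook["repo"]] = absent
    return report


# The five issue events observed on every per-repo webhook in this audit.
ISSUE_EVENTS = ["issues", "issue_comment", "issue_label", "issue_assign", "issue_milestone"]

# Simplified records (assumed shape), one per repo from the table above.
hooks = [
    {"repo": name, "events": list(ISSUE_EVENTS)}
    for name in ("gitea-webhooks", "openclaw-mattermost", "openclaw-commands", "e2e-ticket-system")
]

report = missing_events(hooks, ["push", "repository", "create", "fork"])
for repo, absent in report.items():
    print(f"{repo}: not subscribed to {', '.join(absent)}")
```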

Running cron / timer infrastructure

  • No systemd timers (systemctl not in container)
  • No classic crontab
  • All scheduling is openclaw's internal scheduler: queue-daemon.js running as PID 4 inside the gateway container, reading ~/.openclaw/cron/jobs.json
  • Jobs are in-process timers managed by the gateway's CronService

Active cron jobs (from jobs.json)

| Job ID | Schedule (UTC) | Status | Last result |
|---|---|---|---|
| self-review-3day | 0 6 */3 * * | Enabled | Error (timeout) |
| compress-daily-notes | 0 4 * * * | Enabled | Error (timeout) |
| project-archive | 30 3 * * * | Enabled | Error (timeout) |
| rooh-style-review | 0 */4 * * * | Enabled | Error (timeout) |
| rooh-style-reply-handler | */5 * * * * | Enabled | Error (timeout, 20 consecutive) |
| queue-doctor-6h | 0 */6 * * * | Enabled | Error (timeout) |
| rooh-style-learner | 0 */2 * * * | Enabled | OK ✓ |
| ws-sync | daily | Enabled | Error (timeout) |
| webhook-verify | 0 */6 * * * | Enabled | Error (timeout) |

Daemon: /root/.openclaw/hooks/queue-daemon.js (42KB, active). PID tracked at /root/.openclaw/hooks/queue-daemon.pid.
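A health summary over jobs.json could look roughly like the sketch below. The field names (`id`, `lastStatus`, `consecutiveErrors`) are assumptions, since the real jobs.json schema is not reproduced in this audit, and the failure count for ws-sync in the sample is a placeholder. Only the 20 consecutive failures for rooh-style-reply-handler comes from the table above.

```python
import json


def failing_jobs(jobs):
    """Return jobs whose last run errored, worst offenders first."""
    failing = [job for job in jobs if job.get("lastStatus") == "error"]
    return sorted(failing, key=lambda job: job.get("consecutiveErrors", 0), reverse=True)


# Hypothetical sample in an assumed schema; not a copy of the real jobs.json.
sample = json.loads("""
[
  {"id": "ws-sync", "lastStatus": "error", "consecutiveErrors": 4},
  {"id": "rooh-style-learner", "lastStatus": "ok", "consecutiveErrors": 0},
  {"id": "rooh-style-reply-handler", "lastStatus": "error", "consecutiveErrors": 20}
]
""")

for job in failing_jobs(sample):
    print(f"{job['id']}: {job['consecutiveErrors']} consecutive failures")
```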

Heartbeat checklists

  • Main agent (/root/.openclaw/workspace/HEARTBEAT.md): comprehensive, 28 checks across 7 intervals, managed by tools/heartbeat-scheduler.sh with state at memory/heartbeat-state.json. Last check 2026-04-05 15:18 UTC. Every-heartbeat items include protocol re-read, project DB validation, incident checking, webhook auditing. Every-6h includes repo policy audit, webhook E2E verification.
  • Xen agent (/root/.openclaw/agents/xen/workspace/HEARTBEAT.md): intentionally empty (comment-only). No periodic tasks for Xen.
  • Spawner, gitea-worker, coder-agent, god-agent, global-calendar, gym-designer, nutrition, particles-ai: all have empty or task-driven heartbeat configs, not time-driven.

Pattern: only the main agent runs the comprehensive heartbeat. Everything else is event-driven.

Live agent processes

| Agent | Runtime | Status |
|---|---|---|
| main | Embedded in openclaw gateway | Running (part of the gateway process) |
| xen | Directory config only | Configured, not actively running |
| spawner | Directory config only | Configured, not actively running |
| gitea-worker | Directory config only | Configured, event-driven (spawned by queue daemon) |
| coder-agent | Directory config only | Not running |
| god-agent | Workspace only | Not running |
| global-calendar | Workspace only | Not running |
| gym-designer | Workspace only | Not running |
| nutrition-agent | Shares gym workspace | Not running |
| particles-ai | Directory config only | Not running |

Key discovery — main agent is NOT a separate process

The main agent ("Xen") isn't a long-lived process. It's embedded in the openclaw-openclaw-gateway-1 container, running as in-process Claude sessions triggered by user input and heartbeat events. The gateway was last restarted ~5 hours ago.

Configuration:

  • Primary model: anthropic/claude-opus-4-6
  • Fallback chain: Sonnet, gpt-5.4, gpt-5.3-codex
  • Workspace: /root/.openclaw/workspace (155 active projects)

claude-worker (me) is NOT alive right now

The subagent checked the process list for tmux sessions matching the claude-worker startup scripts (/root/start-claude-worker.sh, claude-worker-watcher.sh) and found no such session alive. My current running session (this conversation) was started directly, not via those scripts.

Shared state and data stores

All under /root/.openclaw/:

| Store | Path | Purpose | Freshness |
|---|---|---|---|
| Project registry | workspace/projects/registry.json | Master index of 155+ projects, status, repo links, manager assignments | Real-time, updated on lifecycle events |
| Repo database | workspace/db/repos.json | Cached list of sol/* Gitea repos, metadata | STALE (48+ hours, ws-sync failing) |
| Infrastructure registry | workspace/db/infra.json | Running containers, scripts, transforms, cron, sandboxes | Last good scan 2026-04-05 15:04 UTC |
| Agent source mapping | workspace/db/agent-sources.json | Which repo built which agent | Manual |
| Memory system | workspace/memory/ | Daily logs, lessons, errors, checklist state, heartbeat timestamps, rooh-translator staging | Real-time |
| Projects DB (text) | workspace/PROJECTS_DB.md | Human-readable project summary | Manual |
| Cron jobs state | cron/jobs.json | 12 scheduled jobs, execution state, error counts | Very high churn (failing jobs retrying) |
| Cron logs | hooks/logs/audit.jsonl, incidents.jsonl | JSONL audit trail | Real-time |
| Queue inbox | hooks/queue-inbox/ | Temporary staging for webhook payloads | Ephemeral |
| Vault | vault-data/ | Encrypted configuration, secret state | Sandbox-managed |
| Credentials | credentials/ | API tokens, OAuth profiles, Anthropic keys (perms 700) | Updated by sync-oauth-token.sh |
| Agent auth profiles | agents/*/agent/auth-profiles.json | Per-agent Anthropic, Google, OpenAI, Codex creds | Last morning sync 2026-04-06 07:11 |

Architecture: event-driven with scheduled backup scans. Real-time data in workspace/ is Git-versioned (branch main). Calculated data in db/ is cached output of pure bash+jq scripts. No central database; on-disk state is source of truth.
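Because the cached files in db/ are only as fresh as the last successful cron run, a consumer could guard against stale caches with a modification-time check. The sketch below is one possible approach; the helper names and the 48-hour threshold in the usage comment are assumptions, not part of the existing tooling.

```python
import os
import time


def age_hours(path, now=None):
    """Hours elapsed since the file at `path` was last modified."""
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) / 3600.0


def is_stale(path, max_age_hours, now=None):
    """True when the cached file is older than the allowed age."""
    return age_hours(path, now) > max_age_hours


# Hypothetical usage against the repo cache described above:
# if is_stale("/root/.openclaw/workspace/db/repos.json", max_age_hours=48):
#     print("repos.json is stale; check the ws-sync cron before trusting it")
```

Modification time is only a proxy: a job that rewrites the file with unchanged or partial data would still look fresh, so a timestamp written inside the file would be a stronger signal.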

What the migration actually has to replace

A single main agent (Opus, embedded in gateway) managing 155 projects across 47 sol/* Gitea repos with a distributed sub-agent orchestration pattern. The agent operates on:

  • Inbound: Gitea webhooks → slack.solio.tech/hooks/gitea → queue daemon → gitea-worker transform → agent spawn
  • Outbound: Cron jobs (12 scheduled, 8 failing) → delivery queue → agent heartbeats → Mattermost DMs and Gitea issue updates
  • State: Git-versioned /root/.openclaw/workspace/ with 5-level agent hierarchy and a 28-item heartbeat checklist

The migration only needs to replace the Gitea-facing slice (webhook ingress, event router, script fan-out, repo policy enforcement). The rest of the system — the 155 projects, the workspace, the memory system, the sub-agent orchestration — stays owned by openclaw unless Rooh expands the scope.
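The event router and script fan-out in that slice could be as small as a dispatch table keyed by event name. The sketch below is illustrative only: the class, the handler names, and the payload shape are assumptions. In a real ingress the event name would typically come from the X-Gitea-Event request header.

```python
from typing import Callable, Dict, List

Handler = Callable[[dict], str]


class EventRouter:
    """Fan one inbound event out to every handler registered for its type."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = {}

    def on(self, event: str, handler: Handler) -> None:
        # Multiple handlers per event are allowed and run in registration order.
        self._handlers.setdefault(event, []).append(handler)

    def dispatch(self, event: str, payload: dict) -> List[str]:
        # Unknown events are ignored rather than raising.
        return [handler(payload) for handler in self._handlers.get(event, [])]


router = EventRouter()
# Hypothetical handlers; the payload key "repo" is a simplified assumption.
router.on("issues", lambda payload: f"triage:{payload['repo']}")
router.on("issues", lambda payload: f"policy-check:{payload['repo']}")

results = router.dispatch("issues", {"repo": "sol/gitea-webhooks"})
print(results)
```

Returning an empty list for unregistered events keeps the ingress tolerant of event types such as push being enabled later, before a handler exists.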

Caret's immediate observations

  1. The model misconfiguration needs to be fixed NOW. Even if I don't touch the migration yet, the claudehack/claude-sonnet-4-6 reference is breaking the heartbeat. That's Xen's production pipeline. Worth flagging to Rooh.
  2. The admin scope issue is real. I cannot list system-level webhooks. Any migration requires either token elevation or Rooh manually registering the new endpoint.
  3. Current security is flimsier than the code looks. All four known webhooks have no HMAC secret. The protection is bearer token + nginx ACL. My replacement should do better — implement real HMAC on the raw body.
  4. The migration scope is smaller than I feared. I don't need to replicate the 28-item heartbeat or the 155-project workspace. I only need the Gitea-facing slice. That's a few days of work, not 4-8 weeks.
  5. The 6-hour cron I set up earlier for Rooh (eaeef6ff) is the right shape for the basic policy sweep. It doesn't need to be elaborate.
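For observation 3, HMAC verification on the raw body could be sketched as follows. Gitea signs the request payload with HMAC-SHA256 and sends the hex digest in the X-Gitea-Signature header. The function names and the sample secret here are assumptions for illustration, not the replacement's actual code.

```python
import hashlib
import hmac


def sign(secret: bytes, raw_body: bytes) -> str:
    """Hex HMAC-SHA256 of the exact bytes received, before any JSON parsing."""
    return hmac.new(secret, raw_body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, raw_body: bytes, received_hex: str) -> bool:
    """Constant-time comparison of the expected and received signatures."""
    expected = sign(secret, raw_body)
    return hmac.compare_digest(expected.encode(), received_hex.encode())


# Hypothetical secret and body, for demonstration only.
secret = b"example-shared-secret"
body = b'{"action": "opened"}'
header_value = sign(secret, body)  # stands in for the X-Gitea-Signature header

print(verify_signature(secret, body, header_value))         # valid signature
print(verify_signature(secret, body + b" ", header_value))  # body was altered
```

The signature has to be computed over the raw request bytes: re-serializing parsed JSON can change whitespace or key order and make a legitimate delivery fail verification.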