ARCHITECTURE.md — current openclaw Gitea slice and migration boundary

Repo: sol/openclaw-to-caret-migration
Date: 2026-04-06
Source reports:

  • research/RESEARCH-01-gitea-webhooks-deep-read.md
  • research/RESEARCH-02-gateway-internals.md
  • research/RESEARCH-03-live-state-audit.md

Executive summary

The migration target is much smaller than a full OpenClaw replacement.

OpenClaw today owns a large orchestration platform: gateway auth, session storage, plugin loading, subagent spawning, cron, heartbeat, tool policy enforcement, multi-channel delivery, and the long-lived workspace for 155+ projects. Replacing all of that would be a 4-8 week systems project.

But the Gitea-facing slice that this migration actually needs is narrower:

  1. Webhook ingress
  2. Event validation / routing
  3. Deterministic script fan-out
  4. Issue workflow gates / lock logic
  5. Optional judgment wake-up when automation is not enough

That slice can be rebuilt as a small standalone listener plus a handful of copied/adapted scripts. The practical shape is a 600-800 line Bun listener with raw-body signature verification, dedup, file locks, script dispatch, and structured logs.

The live audit also changed the urgency: this is not a clean migration away from a stable system. The current OpenClaw installation is already degraded, with 8 of 12 cron jobs failing due to a bad model reference (claudehack/claude-sonnet-4-6). That does not directly prove the Gitea webhook path is broken, but it does mean the surrounding automation is already brittle and parts of the verification pipeline are failing.

Scope boundary

In scope for this migration

  • Gitea webhook receiver for repo / issue / comment style events
  • Authentication of incoming webhook traffic
  • Deduplication and idempotency checks
  • Event router
  • Deterministic script execution for policy enforcement and repo hygiene
  • File-based issue lock management
  • Minimal queue / retry behavior where needed
  • Structured audit logging
  • Optional handoff into a Claude-native judgment path

Explicitly out of scope

These stay owned by OpenClaw unless Phase 1 expands scope intentionally:

  • Full gateway RPC / WebSocket protocol
  • Session transcript storage system
  • General subagent orchestration framework
  • Global cron and heartbeat scheduler
  • Plugin SDK and plugin runtime
  • Delivery abstraction for Mattermost / Telegram / Discord / WhatsApp
  • Full tool allowlist inheritance engine
  • Existing 155-project workspace and project registry
  • Global memory / archive / compaction machinery

Current system: end-to-end picture

Deterministic path today

Gitea
  -> HTTPS POST https://slack.solio.tech/hooks/gitea
  -> nginx
     - TLS termination
     - local forwarding
     - injects Authorization: Bearer <OPENCLAW_HOOKS_TOKEN>
  -> OpenClaw gateway /hooks/agent
  -> gitea-transform.js
  -> event router
  -> pure scripts (post-repo-audit, policy audit, security checks, etc.)
  -> logs / queue state / lock files

Judgment / agent path today

Gitea event
  -> transform validation and trust checks
  -> route decision
  -> if issue workflow requires agent action:
       precompute spawn params
       async dispatch to spawner / manager path
       OpenClaw creates isolated session
       agent writes back to Gitea / chat surfaces

Platform services supporting both

OpenClaw gateway
  - auth / bearer validation
  - hook ingestion
  - session spawn
  - tool allowlist resolution
  - cron service
  - heartbeat runner
  - plugin loading
  - outbound delivery
  - workspace/session state persistence

Security model: what exists now

1) Incoming Gitea webhooks are not protected by Gitea HMAC today

This was the most important architecture surprise.

Although Gitea supports X-Gitea-Signature, the current OpenClaw transform layer does not have access to the raw request body, so it does not perform real body-level HMAC verification. The live repo audit also showed the visible repo webhooks have no secret set.

Current protection is instead layered as:

  1. HTTPS via nginx
  2. nginx forwarding only to local gateway
  3. injected bearer token (Authorization: Bearer ...)
  4. gateway token validation
  5. delivery dedup by X-Gitea-Delivery

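This is workable, but weaker and more indirect than true webhook HMAC: the bearer token is injected by nginx, so it proves only that a request passed through nginx, not that Gitea produced the payload or that the body arrived unmodified.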

2) Spawn signatures are a separate HMAC system

There is HMAC in the system, but it protects a different boundary.

When sol creates an [IMPLEMENT] issue, the issue body includes a spawn signature comment:

<!-- xen-spawn-sig:HMAC:TIMESTAMP -->

The transform recomputes HMAC-SHA256 over repo|title|timestamp, validates it with a local secret, and rejects invalid or stale signatures. This is not webhook authentication. It is an authorization gate for a privileged workflow.
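The check described above might look roughly like the following sketch. The source only specifies HMAC-SHA256 over repo|title|timestamp and the comment format; the hex encoding of the HMAC, the Unix-seconds timestamp, the 10-minute freshness window, and the name `checkSpawnSignature` are assumptions made for illustration.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumption: the HMAC is hex-encoded (64 chars) and TIMESTAMP is Unix seconds.
const SPAWN_SIG_RE = /<!--\s*xen-spawn-sig:([0-9a-f]{64}):(\d+)\s*-->/i;

export type SpawnSigResult = "ok" | "missing" | "invalid" | "stale";

// Hypothetical helper: validates the spawn signature embedded in an issue body.
export function checkSpawnSignature(
  issueBody: string,
  repo: string,
  title: string,
  secret: string,
  nowSec: number,
  maxAgeSec = 600, // assumed freshness window, not taken from the source
): SpawnSigResult {
  const match = SPAWN_SIG_RE.exec(issueBody);
  if (!match) return "missing";
  const macHex = match[1] as string;
  const timestamp = match[2] as string;

  // Recompute HMAC-SHA256 over repo|title|timestamp with the local secret.
  const expected = createHmac("sha256", secret)
    .update(`${repo}|${title}|${timestamp}`)
    .digest();
  // The regex guarantees 64 hex chars, so both buffers are 32 bytes long.
  if (!timingSafeEqual(expected, Buffer.from(macHex, "hex"))) return "invalid";

  // Reject signatures too far from "now" in either direction.
  if (Math.abs(nowSec - Number(timestamp)) > maxAgeSec) return "stale";
  return "ok";
}
```

Because the title is part of the signed string, editing the issue title after creation invalidates the signature, which fits the role of an authorization gate rather than transport authentication.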

3) Trust routing is identity-aware

The transform classifies senders into trust levels such as owner, collaborator/contributor, and readonly. That trust level affects:

  • which agent receives the event
  • whether approval words are honored
  • whether a manager spawn may occur
  • whether an event is ignored as untrusted or looped
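As a rough illustration of how trust level can gate those decisions, consider the sketch below. The type names, the config shape, and especially the specific policy values (who may approve, who may trigger a manager spawn) are hypothetical; the source states only that trust level influences these outcomes, not the exact rules.

```typescript
export type Trust = "owner" | "collaborator" | "readonly";

// Hypothetical config shape; the real transform may derive these differently.
export interface TrustConfig {
  owners: Set<string>;
  collaborators: Set<string>;
  agentLogins: Set<string>; // accounts the automation itself posts as
}

export interface RouteGate {
  trust: Trust;
  honorApprovals: boolean;
  allowManagerSpawn: boolean;
}

export function classifySender(login: string, cfg: TrustConfig): Trust {
  if (cfg.owners.has(login)) return "owner";
  if (cfg.collaborators.has(login)) return "collaborator";
  return "readonly";
}

// Returns null when the event should be dropped as a loop (agent-authored).
export function gateFor(login: string, cfg: TrustConfig): RouteGate | null {
  if (cfg.agentLogins.has(login)) return null;
  const trust = classifySender(login, cfg);
  return {
    trust,
    // Assumed policy: readonly senders cannot approve; only owners can spawn.
    honorApprovals: trust !== "readonly",
    allowManagerSpawn: trust === "owner",
  };
}
```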

4) Issue lock files are a core safety mechanism

Issue workflows are protected with file locks under a hooks lock directory. Locks have TTL-based behavior, and closed issues move into a short grace state before release. This matters because concurrent comments or duplicate deliveries can otherwise spawn duplicate work.
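A minimal sketch of a TTL-based issue lock follows, assuming one lock file per repo/issue pair and using the file's mtime as the lock age. The function names, the file naming (slashes in the repo name replaced with underscores), and the stale-takeover rule are assumptions; the grace state for closed issues is not modeled here.

```typescript
import { closeSync, mkdirSync, openSync, statSync, unlinkSync, writeSync } from "node:fs";
import { join } from "node:path";

// Assumed naming: <repo>-<issue>.lock with "/" in the repo name replaced by "_".
function lockPath(lockDir: string, repo: string, issue: number): string {
  return join(lockDir, `${repo.replace(/\//g, "_")}-${issue}.lock`);
}

// Returns true if the lock was acquired, false if a live lock already exists.
export function tryAcquireIssueLock(
  lockDir: string,
  repo: string,
  issue: number,
  ttlMs: number,
  now: number = Date.now(),
): boolean {
  mkdirSync(lockDir, { recursive: true });
  const path = lockPath(lockDir, repo, issue);
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      // "wx" creates the file exclusively and fails with EEXIST if it exists.
      const fd = openSync(path, "wx");
      writeSync(fd, JSON.stringify({ pid: process.pid, acquiredAt: now }));
      closeSync(fd);
      return true;
    } catch (err) {
      if ((err as { code?: string }).code !== "EEXIST") throw err;
      let mtimeMs: number;
      try {
        mtimeMs = statSync(path).mtimeMs;
      } catch {
        continue; // lock vanished between open and stat; retry once
      }
      if (now - mtimeMs <= ttlMs) return false; // lock is still live
      // Stale lock: remove and retry. This takeover is not atomic across
      // processes, so the sketch assumes a single listener process.
      try {
        unlinkSync(path);
      } catch {
        // someone else removed it first; the retry will sort it out
      }
    }
  }
  return false;
}

export function releaseIssueLock(lockDir: string, repo: string, issue: number): void {
  try {
    unlinkSync(lockPath(lockDir, repo, issue));
  } catch {
    // already released
  }
}
```

Exclusive create is what prevents two near-simultaneous deliveries for the same issue from both starting work: only one `openSync` call can succeed.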

Live-state findings that affect architecture

Current health status: DEGRADED

The current OpenClaw deployment is degraded now, not hypothetically later.

Confirmed problems

  • openclaw.json references a non-existent model alias: claudehack/claude-sonnet-4-6
  • 8 of 12 cron jobs are failing repeatedly
  • ws-sync is failing, so cached repo state is stale
  • webhook-verify is failing, so the pipeline's own end-to-end verification job is unhealthy
  • failover chains are slow and noisy under API pressure

Why this matters for migration design

  • The migration should reduce dependency on fragile global cron/heartbeat behavior
  • The replacement should make ingress validation and deterministic enforcement stand on their own
  • The replacement should log every event locally, even when downstream agent work fails
  • The replacement should avoid hidden couplings to provider/model config where possible

Current components and responsibilities

1) nginx edge

Responsibilities today:

  • TLS termination
  • forwarding inbound webhook traffic
  • injecting the gateway bearer token
  • relying on network locality and host-level topology as part of trust

Migration implication: The new Caret listener can either:

  • keep using nginx as the front door and share the bearer-token pattern, or
  • terminate webhook traffic directly and verify raw-body HMAC itself

The second option is better if Rooh wants the replacement to improve security rather than merely preserve behavior.

2) OpenClaw gateway

Responsibilities today:

  • receive hook traffic
  • authenticate requests
  • dispatch transform logic
  • spawn agent sessions
  • run heartbeats and cron jobs
  • host plugins and outbound delivery
  • enforce tool policies

Migration implication: We should not replace the whole gateway. We only need a listener for the Gitea slice.

3) gitea-transform.js

This is the current Gitea event router. It performs:

  • event-type filtering
  • dedup checks
  • trust classification
  • loop prevention
  • rate limiting
  • lock checks
  • route decisions
  • script execution for deterministic cases
  • manager/spawner dispatch for workflow cases
  • audit logging

Migration implication: This is the closest thing to the spec for the new listener. The replacement should preserve its behavior selectively, not copy the whole gateway.

4) Deterministic script layer

Examples found in research:

  • post-repo-audit.sh
  • audit-webhooks.sh
  • audit-repo-policies.sh
  • secret-scan.sh
  • check-implement-orphans.sh
  • spawn-manager.sh

These are mostly stateless bash/node tools, but they are coupled to OpenClaw-specific paths and config.

Migration implication: Do not rewrite these from scratch unless necessary. Copy/adapt the working ones, strip OpenClaw-specific paths, and make config explicit.

5) Session / workflow orchestration

OpenClaw provides:

  • isolated session spawn
  • role/tool policy resolution
  • session transcript storage
  • channel delivery
  • wake mechanisms

Migration implication: This is the expensive part to rebuild. Avoid it. Use Claude-native primitives only for the narrow judgment path.

The minimal replacement architecture

The smallest viable Caret-owned architecture is:

Gitea
  -> Caret listener (Bun)
     - raw body capture
     - HMAC verify
     - delivery dedup
     - trust + routing
     - file locks
     - structured logs
     - script fan-out
     - optional judgment trigger
  -> deterministic tools/
  -> optional Claude-native wake-up path

Listener responsibilities

The listener should own exactly these jobs:

  1. Read raw request body before parsing
  2. Verify X-Gitea-Signature with timing-safe HMAC compare
  3. Parse event metadata and delivery ID
  4. Deduplicate by delivery ID
  5. Apply event-type filters
  6. Classify sender / trust level
  7. Enforce loop prevention for agent-authored comments
  8. Acquire/check per-issue lock where needed
  9. Dispatch deterministic scripts by event type
  10. Emit structured JSON logs for all outcomes
  11. Optionally trigger a judgment wake-up when deterministic automation cannot decide
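Steps 1 and 2 are the part the current system cannot do, so a sketch may help. It assumes Gitea's X-Gitea-Signature header carries the hex-encoded HMAC-SHA256 of the raw payload under the webhook secret; the function name `verifyGiteaSignature` is hypothetical.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical helper: true only if `signatureHex` is the hex HMAC-SHA256 of
// the exact raw request bytes under `secret`.
export function verifyGiteaSignature(
  rawBody: Uint8Array,
  signatureHex: string | null,
  secret: string,
): boolean {
  // Reject missing or malformed headers before doing any crypto.
  if (!signatureHex || !/^[0-9a-f]{64}$/i.test(signatureHex)) return false;
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch; the regex guarantees 32 bytes.
  return timingSafeEqual(expected, received);
}
```

In a Bun handler, the raw bytes would typically come from `new Uint8Array(await req.arrayBuffer())`, with `JSON.parse` applied only after verification succeeds. Parsing first and re-serializing would change the bytes and break the comparison.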

Deterministic script fan-out

The likely event map after design review:

| Event | Action |
|---|---|
| repository.create | collaborator add + webhook ensure + repo policy baseline |
| push to protected branch | secret scan + policy re-check |
| issues.opened on automation-tagged issues | route to gated workflow logic |
| issue_comment on active workflow issue | approval parsing, lock check, optional wake-up |
| unsupported / irrelevant event | log and ignore |

This keeps the zero-token path zero-token.
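One way to express the event map is as a pure routing function that the listener calls after validation. This is a sketch under assumptions: the action names are placeholders rather than the real script names, the "automation tag" is modeled as a title prefix such as [IMPLEMENT], and the event type strings follow Gitea's X-Gitea-Event header values (with repository.create taken to mean a `repository` event whose action is `created`).

```typescript
export type Decision =
  | { kind: "scripts"; actions: string[] }
  | { kind: "workflow"; step: "gate" | "comment" }
  | { kind: "ignore"; reason: string };

// Only the payload fields this sketch reads.
export interface GiteaPayload {
  action?: string;
  ref?: string;
  issue?: { number?: number; title?: string };
}

// Hypothetical router configuration.
export interface RouterConfig {
  protectedRefs: Set<string>; // e.g. "refs/heads/main"
  automationTitlePrefix: string; // e.g. "[IMPLEMENT]"
  hasActiveWorkflow: (issueNumber: number) => boolean; // e.g. backed by lock state
}

export function route(eventType: string, payload: GiteaPayload, cfg: RouterConfig): Decision {
  switch (eventType) {
    case "repository":
      if (payload.action === "created") {
        return { kind: "scripts", actions: ["add-collaborator", "ensure-webhook", "policy-baseline"] };
      }
      break;
    case "push":
      if (payload.ref !== undefined && cfg.protectedRefs.has(payload.ref)) {
        return { kind: "scripts", actions: ["secret-scan", "policy-recheck"] };
      }
      break;
    case "issues":
      if (payload.action === "opened" && (payload.issue?.title ?? "").startsWith(cfg.automationTitlePrefix)) {
        return { kind: "workflow", step: "gate" };
      }
      break;
    case "issue_comment": {
      const issueNumber = payload.issue?.number;
      if (issueNumber !== undefined && cfg.hasActiveWorkflow(issueNumber)) {
        return { kind: "workflow", step: "comment" };
      }
      break;
    }
  }
  // Everything else is logged and ignored.
  return { kind: "ignore", reason: `no route for ${eventType}` };
}
```

Keeping the router free of I/O makes it straightforward to unit-test every row of the map without a running listener.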

Judgment path

Only use judgment for cases that deterministic automation cannot safely resolve, such as:

  • ambiguous repo type
  • policy enforcement failure requiring explanation
  • explicit request for AI review
  • human-authored workflow step that needs synthesis rather than a script

This should not require recreating OpenClaw's full spawn/orchestration model. The design target should be a small Claude-native wake-up primitive, not a manager framework clone.

Hard dependencies vs removable dependencies

Dependencies the new Gitea slice can remove

  • OpenClaw hook ingestion for Gitea webhooks
  • OpenClaw transform execution for Gitea routing
  • reliance on nginx bearer injection as the only authenticity check
  • OpenClaw-specific queue inbox / lock path layout
  • OpenClaw-specific script path assumptions

Dependencies the new slice should keep, at least initially

  • Gitea itself
  • existing policy scripts and repo hygiene logic
  • existing human workflow semantics where already working
  • OpenClaw-owned broader workspace/project system
  • OpenClaw-owned non-Gitea cron/heartbeat ecosystem
  • Claude-native or OpenClaw-native judgment wake-up until a better primitive is chosen

Data / state the replacement must own

The replacement does not need a database. File-backed state is enough.

Required local state

  • logs/events.jsonl or similar structured event log
  • state/dedup.json for recent delivery IDs
  • state/locks/<repo>-<issue>.lock for per-issue workflow control
  • state/runs/ or similar optional execution receipts
  • config files for webhook secret, Gitea endpoint, token, allowed repos/users
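The dedup and event-log state can stay very small. The sketch below assumes state/dedup.json holds a flat map from delivery ID to first-seen time in epoch milliseconds and that log entries are one JSON object per line; the function names and the retention window are hypothetical.

```typescript
import { appendFileSync, existsSync, readFileSync, writeFileSync } from "node:fs";

// Delivery ID -> first-seen time (epoch ms).
export type DedupState = Map<string, number>;

// Returns true the first time a delivery ID is seen within the window and
// records it; returns false for a repeat. Expired entries are pruned first.
export function isFirstDelivery(
  state: DedupState,
  deliveryId: string,
  now: number,
  windowMs: number,
): boolean {
  state.forEach((seenAt, id) => {
    if (now - seenAt > windowMs) state.delete(id);
  });
  if (state.has(deliveryId)) return false;
  state.set(deliveryId, now);
  return true;
}

export function loadDedup(file: string): DedupState {
  if (!existsSync(file)) return new Map();
  const raw = JSON.parse(readFileSync(file, "utf8")) as Record<string, number>;
  return new Map(Object.entries(raw));
}

export function saveDedup(file: string, state: DedupState): void {
  const plain: Record<string, number> = {};
  state.forEach((seenAt, id) => {
    plain[id] = seenAt;
  });
  writeFileSync(file, JSON.stringify(plain));
}

// One JSON object per line, timestamp first, suitable for logs/events.jsonl.
export function formatLogLine(entry: Record<string, unknown>, now: Date): string {
  return JSON.stringify({ ts: now.toISOString(), ...entry }) + "\n";
}

export function logEvent(file: string, entry: Record<string, unknown>): void {
  appendFileSync(file, formatLogLine(entry, new Date()));
}
```

Logging the outcome of every delivery, including duplicates and rejects, keeps the audit trail complete even when downstream agent work fails.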

Nice-to-have state

  • replay queue for transient failures
  • dead-letter folder for malformed events
  • event latency counters / health summaries

Architectural differences between current and target state

| Concern | Current OpenClaw state | Target Caret state |
|---|---|---|
| Webhook auth | bearer token + nginx locality | raw-body Gitea HMAC preferred |
| Router | transform inside gateway | standalone Bun listener |
| Deterministic actions | scripts invoked by transform | same scripts invoked by listener |
| Locks | OpenClaw hooks lock dir | Caret-owned lock dir |
| Dedup | OpenClaw cache file | Caret-owned dedup state |
| Judgment wake-up | OpenClaw session spawn | Claude-native minimal wake-up |
| Cron/heartbeat | OpenClaw global scheduler | only if truly needed for this slice |
| Workspace ownership | OpenClaw workspace | unchanged unless explicitly expanded |

Main migration conclusions

Conclusion 1: do not rebuild OpenClaw

That would be a category error. The gateway, plugin runtime, delivery layer, cron/heartbeat engine, and session/orchestration stack are a separate platform project.

Conclusion 2: rebuild the Gitea ingress/router slice only

This is the actual migration target and is small enough to complete quickly.

Conclusion 3: improve security while migrating

The replacement should implement actual raw-body Gitea HMAC verification. The current webhook path does not.

Conclusion 4: keep deterministic work pure-script

The current split is correct. Repo policy and enforcement work should remain fast, cheap, and idempotent.

Conclusion 5: judgment must be narrow and explicit

Do not wake Claude on every webhook. Use it only for ambiguity, escalation, or clearly user-requested reasoning.

Conclusion 6: design should assume the current system is fragile

Because surrounding cron/verification infrastructure is already degraded, the replacement should be independently observable and easy to test without depending on OpenClaw's unhealthy scheduler chain.

Open questions for Phase 1 design

These questions should be answered in DESIGN.md.

  1. Ingress topology: keep nginx in front, or let the Caret listener terminate the webhook directly?
  2. Auth model: bearer only for parity, or proper Gitea HMAC as the new standard?
  3. Judgment primitive: Channels plugin, direct Claude Code primitive, or temporary dependency on OpenClaw for wake-up?
  4. Script packaging: copy the existing scripts wholesale first, or split them into library + thin wrappers?
  5. Repo registration: per-repo hooks only, or system-level hook once token/admin constraints are solved?
  6. Retry model: synchronous fire-and-log only, or file-backed retry queue for transient failures?
  7. Observability: plain JSONL logs only, or add a health endpoint plus counters and replay tooling?
  8. Workflow semantics: which current issue/comment workflows are worth preserving exactly, and which can be simplified?

Move to Phase 1 — Architecture design with the following framing:

  • Treat this document as the baseline map of the current system
  • Design only the Gitea-facing slice, not a gateway replacement
  • Preserve the deterministic/judgment split
  • Improve webhook authentication with real HMAC
  • Make observability first-class because the current environment is already degraded

That keeps the project in the "days" category instead of letting it sprawl back into a multi-week platform rewrite.