openclaw-agent 416c5759b5 docs: add architecture synthesis

2026-04-06 12:52:16 +00:00

ARCHITECTURE.md — current openclaw Gitea slice and migration boundary

Repo: sol/openclaw-to-caret-migration
Date: 2026-04-06
Source reports:

research/RESEARCH-01-gitea-webhooks-deep-read.md
research/RESEARCH-02-gateway-internals.md
research/RESEARCH-03-live-state-audit.md

Executive summary

The migration target is much smaller than a full OpenClaw replacement.

OpenClaw today owns a large orchestration platform: gateway auth, session storage, plugin loading, subagent spawning, cron, heartbeat, tool policy enforcement, multi-channel delivery, and the long-lived workspace for 155+ projects. Replacing all of that would be a 4-8 week systems project.

But the Gitea-facing slice that this migration actually needs is narrower:

Webhook ingress
Event validation / routing
Deterministic script fan-out
Issue workflow gates / lock logic
Optional judgment wake-up when automation is not enough

That slice can be rebuilt as a small standalone listener plus a handful of copied/adapted scripts. The practical shape is a 600-800 line Bun listener with raw-body signature verification, dedup, file locks, script dispatch, and structured logs.

The live audit also changed the urgency: this is not a clean migration away from a stable system. The current OpenClaw installation is already degraded, with 8 of 12 cron jobs failing due to a bad model reference (claudehack/claude-sonnet-4-6). That does not directly prove the Gitea webhook path is broken, but it does mean the surrounding automation is already brittle and parts of the verification pipeline are failing.

Scope boundary

In scope for this migration

Gitea webhook receiver for repo / issue / comment style events
Authentication of incoming webhook traffic
Deduplication and idempotency checks
Event router
Deterministic script execution for policy enforcement and repo hygiene
File-based issue lock management
Minimal queue / retry behavior where needed
Structured audit logging
Optional handoff into a Claude-native judgment path

Explicitly out of scope

These stay owned by OpenClaw unless Phase 1 expands scope intentionally:

Full gateway RPC / WebSocket protocol
Session transcript storage system
General subagent orchestration framework
Global cron and heartbeat scheduler
Plugin SDK and plugin runtime
Delivery abstraction for Mattermost / Telegram / Discord / WhatsApp
Full tool allowlist inheritance engine
Existing 155-project workspace and project registry
Global memory / archive / compaction machinery

Current system: end-to-end picture

Deterministic path today

Gitea
  -> HTTPS POST https://slack.solio.tech/hooks/gitea
  -> nginx
     - TLS termination
     - local forwarding
     - injects Authorization: Bearer <OPENCLAW_HOOKS_TOKEN>
  -> OpenClaw gateway /hooks/agent
  -> gitea-transform.js
  -> event router
  -> pure scripts (post-repo-audit, policy audit, security checks, etc.)
  -> logs / queue state / lock files

Judgment / agent path today

Gitea event
  -> transform validation and trust checks
  -> route decision
  -> if issue workflow requires agent action:
       precompute spawn params
       async dispatch to spawner / manager path
       OpenClaw creates isolated session
       agent writes back to Gitea / chat surfaces

Platform services supporting both

OpenClaw gateway
  - auth / bearer validation
  - hook ingestion
  - session spawn
  - tool allowlist resolution
  - cron service
  - heartbeat runner
  - plugin loading
  - outbound delivery
  - workspace/session state persistence

Security model: what exists now

1) Incoming Gitea webhooks are not protected by Gitea HMAC today

This was the most important architecture surprise.

Although Gitea supports X-Gitea-Signature, the current OpenClaw transform layer does not have access to the raw request body, so it does not perform real body-level HMAC verification. The live repo audit also showed the visible repo webhooks have no secret set.

Current protection is instead layered as:

HTTPS via nginx
nginx forwarding only to local gateway
injected bearer token (Authorization: Bearer ...)
gateway token validation
delivery dedup by X-Gitea-Delivery

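This is workable, but weaker and more indirect than true webhook HMAC: the bearer token is injected by nginx, so it proves only that a request passed through nginx, not that Gitea sent it or that the body arrived unmodified.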

2) Spawn signatures are a separate HMAC system

There is HMAC in the system, but it protects a different boundary.

When sol creates an [IMPLEMENT] issue, the issue body includes a spawn signature comment:

<!-- xen-spawn-sig:HMAC:TIMESTAMP -->

The transform recomputes HMAC-SHA256 over repo|title|timestamp, validates it with a local secret, and rejects invalid or stale signatures. This is not webhook authentication. It is an authorization gate for a privileged workflow.
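
A minimal sketch of that check, for orientation only. The comment format and the `repo|title|timestamp` message come from the research above; the hex encoding, epoch-second timestamps, the 15-minute freshness window, and the names `signSpawn` and `verifySpawnSignature` are assumptions, not the transform's actual code.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumed layout: <!-- xen-spawn-sig:<64 hex chars>:<epoch seconds> -->
const SPAWN_SIG_RE = /<!--\s*xen-spawn-sig:([0-9a-f]{64}):(\d+)\s*-->/i;

// Hypothetical signer: HMAC-SHA256 over "repo|title|timestamp", hex-encoded.
function signSpawn(repo: string, title: string, timestamp: number, secret: string): string {
  return createHmac("sha256", secret).update(`${repo}|${title}|${timestamp}`).digest("hex");
}

// Returns true only when the issue body carries a fresh, valid signature.
function verifySpawnSignature(
  issueBody: string,
  repo: string,
  title: string,
  secret: string,
  nowSec: number,
  maxAgeSec = 900, // assumed freshness window
): boolean {
  const match = SPAWN_SIG_RE.exec(issueBody);
  if (!match) return false;

  const timestamp = Number(match[2]);
  // Reject stale signatures and ones dated more than a minute in the future.
  if (nowSec - timestamp > maxAgeSec || timestamp - nowSec > 60) return false;

  const expected = Buffer.from(signSpawn(repo, title, timestamp, secret), "hex");
  const received = Buffer.from(match[1], "hex");
  // timingSafeEqual throws on unequal lengths, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

Because the title is part of the signed message, editing the issue title after creation invalidates the signature under this sketch.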

3) Trust routing is identity-aware

The transform classifies senders into trust levels such as owner, collaborator/contributor, and readonly. That trust level affects:

which agent receives the event
whether approval words are honored
whether a manager spawn may occur
whether an event is ignored as untrusted or looped
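
One way the trust gate could be expressed, as a sketch only. The three trust levels and the four effects are from the list above; the `TrustConfig` shape, the function name `classify`, and the specific policy (only owners approve, readonly senders are ignored) are illustrative assumptions that the real transform may not share.

```typescript
type Trust = "owner" | "collaborator" | "readonly";

// Hypothetical config: logins are expected in lower case.
interface TrustConfig {
  owners: string[];
  collaborators: string[];
  agentLogins: string[]; // accounts the automation itself posts as
}

interface Gate {
  trust: Trust;
  ignored: boolean;         // dropped as untrusted or looped
  honorApproval: boolean;   // approval words are acted on
  maySpawnManager: boolean; // a manager spawn may occur
}

function classify(login: string, cfg: TrustConfig): Gate {
  const name = login.toLowerCase();
  // Loop prevention: events authored by the agent must not re-trigger it.
  const looped = cfg.agentLogins.includes(name);
  const trust: Trust = cfg.owners.includes(name)
    ? "owner"
    : cfg.collaborators.includes(name)
      ? "collaborator"
      : "readonly";
  return {
    trust,
    ignored: looped || trust === "readonly",
    honorApproval: !looped && trust === "owner",
    maySpawnManager: !looped && trust !== "readonly",
  };
}
```

Keeping the decision in one pure function makes the policy easy to table-test before any webhook traffic reaches it.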

4) Issue lock files are a core safety mechanism

Issue workflows are protected with file locks under a hooks lock directory. Locks have TTL-based behavior, and closed issues move into a short grace state before release. This matters because concurrent comments or duplicate deliveries can otherwise spawn duplicate work.
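
The lock behavior described above can be sketched roughly as follows. This assumes a JSON lock file with an `expiresAt` field, a single listener process, and hypothetical helper names (`lockPath`, `acquireLock`, `releaseWithGrace`); the existing lock format in OpenClaw may differ.

```typescript
import { mkdirSync, readFileSync, rmSync, writeFileSync } from "node:fs";
import { join } from "node:path";

interface LockRecord { holder: string; expiresAt: number }

// Assumed naming: <repo>-<issue>.lock, with "/" and other unsafe characters replaced.
function lockPath(dir: string, repo: string, issue: number): string {
  return join(dir, `${repo.replace(/[^A-Za-z0-9._-]/g, "_")}-${issue}.lock`);
}

// Try to take the lock; an expired lock is treated as stale and replaced.
function acquireLock(
  dir: string, repo: string, issue: number,
  holder: string, ttlMs: number, now = Date.now(),
): boolean {
  mkdirSync(dir, { recursive: true });
  const path = lockPath(dir, repo, issue);
  const record: LockRecord = { holder, expiresAt: now + ttlMs };
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      // "wx" fails with EEXIST if the file already exists (exclusive create).
      writeFileSync(path, JSON.stringify(record), { flag: "wx" });
      return true;
    } catch (err) {
      if ((err as { code?: string }).code !== "EEXIST") throw err;
      let existing: LockRecord | null = null;
      try {
        existing = JSON.parse(readFileSync(path, "utf8")) as LockRecord;
      } catch {
        // Unreadable or vanished lock file: fall through and treat as stale.
      }
      if (existing && existing.expiresAt > now) return false; // still held
      rmSync(path, { force: true }); // stale: remove and retry once
    }
  }
  return false;
}

// On issue close, shorten the lock to a brief grace window instead of deleting it.
function releaseWithGrace(
  dir: string, repo: string, issue: number, graceMs: number, now = Date.now(),
): void {
  const record: LockRecord = { holder: "closed-grace", expiresAt: now + graceMs };
  writeFileSync(lockPath(dir, repo, issue), JSON.stringify(record));
}
```

The stale-lock replacement is not safe across multiple competing processes; that is acceptable only under the single-listener assumption stated above.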

Live-state findings that affect architecture

Current health status: DEGRADED

The current OpenClaw deployment is already degraded. This is an observed state from the live audit, not a projected risk.

Confirmed problems

openclaw.json references a non-existent model alias: claudehack/claude-sonnet-4-6
8 of 12 cron jobs are failing repeatedly
ws-sync is failing, so cached repo state is stale
webhook-verify is failing, so the pipeline's own end-to-end verification job is unhealthy
failover chains are slow and noisy under API pressure

Why this matters for migration design

The migration should reduce dependency on fragile global cron/heartbeat behavior
The replacement should make ingress validation and deterministic enforcement stand on their own
The replacement should log every event locally, even when downstream agent work fails
The replacement should avoid hidden couplings to provider/model config where possible

Current components and responsibilities

1) nginx edge

Responsibilities today:

TLS termination
forwarding inbound webhook traffic
injecting the gateway bearer token
relying on network locality and host-level topology as part of trust

Migration implication: The new Caret listener can either:

keep using nginx as the front door and share the bearer-token pattern, or
terminate webhook traffic directly and verify raw-body HMAC itself

The second option is better if Rooh wants the replacement to improve security rather than merely preserve behavior.

2) OpenClaw gateway

Responsibilities today:

receive hook traffic
authenticate requests
dispatch transform logic
spawn agent sessions
run heartbeats and cron jobs
host plugins and outbound delivery
enforce tool policies

Migration implication: We should not replace the whole gateway. We only need a listener for the Gitea slice.

3) `gitea-transform.js`

This is the current Gitea event router. It performs:

event-type filtering
dedup checks
trust classification
loop prevention
rate limiting
lock checks
route decisions
script execution for deterministic cases
manager/spawner dispatch for workflow cases
audit logging

Migration implication: This is the closest thing to the spec for the new listener. The replacement should preserve its behavior selectively, not copy the whole gateway.

4) Deterministic script layer

Examples found in research:

post-repo-audit.sh
audit-webhooks.sh
audit-repo-policies.sh
secret-scan.sh
check-implement-orphans.sh
spawn-manager.sh

These are mostly stateless bash/node tools with path/config coupling.

Migration implication: Do not rewrite these from scratch unless necessary. Copy/adapt the working ones, strip OpenClaw-specific paths, and make config explicit.

5) Session / workflow orchestration

OpenClaw provides:

isolated session spawn
role/tool policy resolution
session transcript storage
channel delivery
wake mechanisms

Migration implication: This is the expensive part to rebuild. Avoid it. Use Claude-native primitives only for the narrow judgment path.

The minimal replacement architecture

The smallest viable Caret-owned architecture is:

Gitea
  -> Caret listener (Bun)
     - raw body capture
     - HMAC verify
     - delivery dedup
     - trust + routing
     - file locks
     - structured logs
     - script fan-out
     - optional judgment trigger
  -> deterministic tools/
  -> optional Claude-native wake-up path

Listener responsibilities

The listener should own exactly these jobs:

Read raw request body before parsing
Verify X-Gitea-Signature with timing-safe HMAC compare
Parse event metadata and delivery ID
Deduplicate by delivery ID
Apply event-type filters
Classify sender / trust level
Enforce loop prevention for agent-authored comments
Acquire/check per-issue lock where needed
Dispatch deterministic scripts by event type
Emit structured JSON logs for all outcomes
Optionally trigger a judgment wake-up when deterministic automation cannot decide
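
The first two responsibilities are the security-critical ones, so here is a hedged sketch of them. It assumes Gitea's `X-Gitea-Signature` header carries the hex-encoded HMAC-SHA256 of the raw payload, computed with the webhook secret; `verifyGiteaSignature` is a hypothetical helper name, not existing code.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Returns true when signatureHex matches the HMAC-SHA256 of the exact bytes received.
// rawBody must be the unparsed request body; re-serialized JSON will not match.
function verifyGiteaSignature(
  rawBody: Uint8Array,
  signatureHex: string | null,
  secret: string,
): boolean {
  // Reject missing or malformed headers before doing any crypto.
  if (!signatureHex || !/^[0-9a-f]{64}$/i.test(signatureHex)) return false;

  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, "hex");

  // timingSafeEqual throws on unequal lengths, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

In a Bun fetch handler this would typically be called with `new Uint8Array(await req.arrayBuffer())` and `req.headers.get("x-gitea-signature")`, and only after a successful check would the same bytes be decoded and passed to `JSON.parse`.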

Deterministic script fan-out

The likely event map after design review:

| Event | Action |
| --- | --- |
| `repository.create` | collaborator add + webhook ensure + repo policy baseline |
| `push` to protected branch | secret scan + policy re-check |
| `issues.opened` on automation-tagged issues | route to gated workflow logic |
| `issue_comment` on active workflow issue | approval parsing, lock check, optional wake-up |
| unsupported / irrelevant event | log and ignore |

This keeps the zero-token path zero-token.
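
The event map could be encoded as a pure routing function along these lines. The event labels mirror the table; the script-to-event assignment, the `main` protected branch, the `automation` label, and the `RoutedEvent` / `Decision` shapes are placeholders to be settled in design review.

```typescript
type Decision =
  | { kind: "scripts"; scripts: string[] }
  | { kind: "workflow"; step: "gate" | "comment" }
  | { kind: "ignore"; reason: string };

// Hypothetical normalized event, built by the listener after parsing the payload.
interface RoutedEvent {
  name: string;             // label from the event map, e.g. "repository.create"
  branch?: string;          // for push events
  labels?: string[];        // for issue events
  workflowActive?: boolean; // true when a workflow lock exists for the issue
}

const PROTECTED_BRANCHES = new Set(["main"]); // assumption
const AUTOMATION_LABEL = "automation";        // assumption

function route(ev: RoutedEvent): Decision {
  switch (ev.name) {
    case "repository.create":
      // Script assignment is illustrative; names come from the research reports.
      return {
        kind: "scripts",
        scripts: ["post-repo-audit.sh", "audit-webhooks.sh", "audit-repo-policies.sh"],
      };
    case "push":
      return ev.branch && PROTECTED_BRANCHES.has(ev.branch)
        ? { kind: "scripts", scripts: ["secret-scan.sh", "audit-repo-policies.sh"] }
        : { kind: "ignore", reason: "unprotected branch" };
    case "issues.opened":
      return ev.labels?.includes(AUTOMATION_LABEL)
        ? { kind: "workflow", step: "gate" }
        : { kind: "ignore", reason: "not automation-tagged" };
    case "issue_comment":
      return ev.workflowActive
        ? { kind: "workflow", step: "comment" }
        : { kind: "ignore", reason: "no active workflow" };
    default:
      return { kind: "ignore", reason: "unsupported event" };
  }
}
```

Since `route` has no side effects, every ignored event still yields a reason string that can go straight into the structured log.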

Judgment path

Only use judgment for cases that deterministic automation cannot safely resolve, such as:

ambiguous repo type
policy enforcement failure requiring explanation
explicit request for AI review
human-authored workflow step that needs synthesis rather than a script

This should not require recreating OpenClaw's full spawn/orchestration model. The design target should be a small Claude-native wake-up primitive, not a manager framework clone.

Hard dependencies vs removable dependencies

Dependencies the new Gitea slice can remove

OpenClaw hook ingestion for Gitea webhooks
OpenClaw transform execution for Gitea routing
reliance on nginx bearer injection as the only authenticity check
OpenClaw-specific queue inbox / lock path layout
OpenClaw-specific script path assumptions

Dependencies the new slice should keep, at least initially

Gitea itself
existing policy scripts and repo hygiene logic
existing human workflow semantics where already working
OpenClaw-owned broader workspace/project system
OpenClaw-owned non-Gitea cron/heartbeat ecosystem
Claude-native or OpenClaw-native judgment wake-up until a better primitive is chosen

Data / state the replacement must own

The replacement does not need a database. File-backed state is enough.

Required local state

logs/events.jsonl or similar structured event log
state/dedup.json for recent delivery IDs
state/locks/<repo>-<issue>.lock for per-issue workflow control
state/runs/ or similar optional execution receipts
config files for webhook secret, Gitea endpoint, token, allowed repos/users
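
A sketch of how the dedup file and the event log might be maintained with plain file I/O. The file roles follow the list above; the on-disk shape (an array of `[deliveryId, timestamp]` pairs), the pruning window, and the helper names `seenBefore` and `logEvent` are assumptions.

```typescript
import {
  appendFileSync, existsSync, mkdirSync, readFileSync, renameSync, writeFileSync,
} from "node:fs";
import { dirname } from "node:path";

// Returns true if deliveryId was already recorded within the window.
// Otherwise records it and returns false.
function seenBefore(
  dedupFile: string, deliveryId: string, windowMs: number, now = Date.now(),
): boolean {
  const seen = new Map<string, number>(
    existsSync(dedupFile)
      ? (JSON.parse(readFileSync(dedupFile, "utf8")) as [string, number][])
      : [],
  );
  // Prune entries older than the window so the file stays small.
  seen.forEach((at, id) => {
    if (now - at > windowMs) seen.delete(id);
  });

  const duplicate = seen.has(deliveryId);
  if (!duplicate) seen.set(deliveryId, now);

  // Write to a temp file and rename, so a crash cannot leave a half-written file.
  mkdirSync(dirname(dedupFile), { recursive: true });
  const tmp = `${dedupFile}.tmp`;
  writeFileSync(tmp, JSON.stringify(Array.from(seen.entries())));
  renameSync(tmp, dedupFile);
  return duplicate;
}

// Append one JSON object per line; every outcome gets a line, including failures.
function logEvent(logFile: string, entry: Record<string, unknown>): void {
  mkdirSync(dirname(logFile), { recursive: true });
  appendFileSync(logFile, JSON.stringify({ ts: new Date().toISOString(), ...entry }) + "\n");
}
```

Rewriting the whole dedup file on each delivery is only reasonable at low webhook volume; a higher-traffic deployment would want an append-only or in-memory structure instead.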

Nice-to-have state

replay queue for transient failures
dead-letter folder for malformed events
event latency counters / health summaries

Architectural differences between current and target state

| Concern | Current OpenClaw state | Target Caret state |
| --- | --- | --- |
| Webhook auth | bearer token + nginx locality | raw-body Gitea HMAC preferred |
| Router | transform inside gateway | standalone Bun listener |
| Deterministic actions | scripts invoked by transform | same scripts invoked by listener |
| Locks | OpenClaw hooks lock dir | Caret-owned lock dir |
| Dedup | OpenClaw cache file | Caret-owned dedup state |
| Judgment wake-up | OpenClaw session spawn | Claude-native minimal wake-up |
| Cron/heartbeat | OpenClaw global scheduler | only if truly needed for this slice |
| Workspace ownership | OpenClaw workspace | unchanged unless explicitly expanded |

Main migration conclusions

Conclusion 1: do not rebuild OpenClaw

That would be a category error. The gateway, plugin runtime, delivery layer, cron/heartbeat engine, and session/orchestration stack are a separate platform project.

Conclusion 2: rebuild the Gitea ingress/router slice only

This is the actual migration target, and it is small enough to complete in days rather than weeks.

Conclusion 3: improve security while migrating

The replacement should implement actual raw-body Gitea HMAC verification. The current webhook path does not.

Conclusion 4: keep deterministic work pure-script

The current split is correct. Repo policy and enforcement work should remain fast, cheap, and idempotent.

Conclusion 5: judgment must be narrow and explicit

Do not wake Claude on every webhook. Use it only for ambiguity, escalation, or clearly user-requested reasoning.

Conclusion 6: design should assume the current system is fragile

Because surrounding cron/verification infrastructure is already degraded, the replacement should be independently observable and easy to test without depending on OpenClaw's unhealthy scheduler chain.

Open questions for Phase 1 design

These questions should be answered in DESIGN.md.

Ingress topology: keep nginx in front, or let the Caret listener terminate the webhook directly?
Auth model: bearer only for parity, or proper Gitea HMAC as the new standard?
Judgment primitive: Channels plugin, direct Claude Code primitive, or temporary dependency on OpenClaw for wake-up?
Script packaging: copy the existing scripts wholesale first, or split them into library + thin wrappers?
Repo registration: per-repo hooks only, or system-level hook once token/admin constraints are solved?
Retry model: synchronous fire-and-log only, or file-backed retry queue for transient failures?
Observability: plain JSONL logs only, or add a health endpoint plus counters and replay tooling?
Workflow semantics: which current issue/comment workflows are worth preserving exactly, and which can be simplified?

Recommended next step

Move to Phase 1 — Architecture design with the following framing:

Treat this document as the baseline map of the current system
Design only the Gitea-facing slice, not a gateway replacement
Preserve the deterministic/judgment split
Improve webhook authentication with real HMAC
Make observability first-class because the current environment is already degraded

That keeps the project in the "days" category instead of letting it sprawl back into a multi-week platform rewrite.
