From 416c5759b598861f531d6581610bfc170739a51b Mon Sep 17 00:00:00 2001
From: openclaw-agent
Date: Mon, 6 Apr 2026 12:52:16 +0000
Subject: [PATCH] docs: add architecture synthesis

---
 ARCHITECTURE.md | 407 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 407 insertions(+)
 create mode 100644 ARCHITECTURE.md

diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
new file mode 100644
index 0000000..f3a287d
--- /dev/null
+++ b/ARCHITECTURE.md
@@ -0,0 +1,407 @@
+# ARCHITECTURE.md — current openclaw Gitea slice and migration boundary
+
+**Repo:** `sol/openclaw-to-caret-migration`
+**Date:** 2026-04-06
+**Source reports:**
+- `research/RESEARCH-01-gitea-webhooks-deep-read.md`
+- `research/RESEARCH-02-gateway-internals.md`
+- `research/RESEARCH-03-live-state-audit.md`
+
+## Executive summary
+
+The migration target is much smaller than a full OpenClaw replacement.
+
+OpenClaw today owns a large orchestration platform: gateway auth, session storage, plugin loading, subagent spawning, cron, heartbeat, tool policy enforcement, multi-channel delivery, and the long-lived workspace for 155+ projects. Replacing all of that would be a 4-8 week systems project.
+
+But the **Gitea-facing slice** that this migration actually needs is narrower:
+
+1. **Webhook ingress**
+2. **Event validation / routing**
+3. **Deterministic script fan-out**
+4. **Issue workflow gates / lock logic**
+5. **Optional judgment wake-up when automation is not enough**
+
+That slice can be rebuilt as a small standalone listener plus a handful of copied/adapted scripts. The practical shape is a **600-800 line Bun listener** with raw-body signature verification, dedup, file locks, script dispatch, and structured logs.
+
+The live audit also changed the urgency: this is not a clean migration away from a stable system. The current OpenClaw installation is already **degraded**, with 8 of 12 cron jobs failing due to a bad model reference (`claudehack/claude-sonnet-4-6`). That does not directly prove the Gitea webhook path is broken, but it does mean the surrounding automation is already brittle and parts of the verification pipeline are failing.
+
+## Scope boundary
+
+### In scope for this migration
+
+- Gitea webhook receiver for repo / issue / comment style events
+- Authentication of incoming webhook traffic
+- Deduplication and idempotency checks
+- Event router
+- Deterministic script execution for policy enforcement and repo hygiene
+- File-based issue lock management
+- Minimal queue / retry behavior where needed
+- Structured audit logging
+- Optional handoff into a Claude-native judgment path
+
+### Explicitly out of scope
+
+These stay owned by OpenClaw unless Phase 1 expands scope intentionally:
+
+- Full gateway RPC / WebSocket protocol
+- Session transcript storage system
+- General subagent orchestration framework
+- Global cron and heartbeat scheduler
+- Plugin SDK and plugin runtime
+- Delivery abstraction for Mattermost / Telegram / Discord / WhatsApp
+- Full tool allowlist inheritance engine
+- Existing 155-project workspace and project registry
+- Global memory / archive / compaction machinery
+
+## Current system: end-to-end picture
+
+### Deterministic path today
+
+```text
+Gitea
+  -> HTTPS POST https://slack.solio.tech/hooks/gitea
+  -> nginx
+       - TLS termination
+       - local forwarding
+       - injects Authorization: Bearer
+  -> OpenClaw gateway /hooks/agent
+  -> gitea-transform.js
+  -> event router
+  -> pure scripts (post-repo-audit, policy audit, security checks, etc.)
+  -> logs / queue state / lock files
+```
+
+### Judgment / agent path today
+
+```text
+Gitea event
+  -> transform validation and trust checks
+  -> route decision
+  -> if issue workflow requires agent action:
+       precompute spawn params
+       async dispatch to spawner / manager path
+       OpenClaw creates isolated session
+       agent writes back to Gitea / chat surfaces
+```
+
+### Platform services supporting both
+
+```text
+OpenClaw gateway
+  - auth / bearer validation
+  - hook ingestion
+  - session spawn
+  - tool allowlist resolution
+  - cron service
+  - heartbeat runner
+  - plugin loading
+  - outbound delivery
+  - workspace/session state persistence
+```
+
+## Security model: what exists now
+
+## 1) Incoming Gitea webhooks are not protected by Gitea HMAC today
+
+This was the most important architecture surprise.
+
+Although Gitea supports `X-Gitea-Signature`, the current OpenClaw transform layer does not have access to the raw request body, so it does **not** perform real body-level HMAC verification. The live repo audit also showed the visible repo webhooks have **no secret set**.
+
+Current protection is instead layered as:
+
+1. HTTPS via nginx
+2. nginx forwarding only to local gateway
+3. injected bearer token (`Authorization: Bearer ...`)
+4. gateway token validation
+5. delivery dedup by `X-Gitea-Delivery`
+
+This is workable, but weaker and more indirect than true webhook HMAC.
+
+## 2) Spawn signatures are a separate HMAC system
+
+There *is* HMAC in the system, but it protects a different boundary.
+
+When `sol` creates an `[IMPLEMENT]` issue, the issue body includes a spawn signature comment:
+
+```html
+
+```
+
+The transform recomputes HMAC-SHA256 over `repo|title|timestamp`, validates it with a local secret, and rejects invalid or stale signatures. This is not webhook authentication. It is an authorization gate for a privileged workflow.
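
The spawn-signature check described above can be sketched roughly as follows. This is a minimal illustration, not the transform's actual code: it assumes the signature is a hex-encoded HMAC-SHA256 and that the timestamp is in Unix seconds. The function names (`signSpawn`, `verifySpawnSignature`) and the 900-second staleness window are hypothetical.

```javascript
const { createHmac, timingSafeEqual } = require("node:crypto");

// Hypothetical helper: HMAC-SHA256 over the `repo|title|timestamp` string.
// Hex output and Unix-second timestamps are assumptions for this sketch.
function signSpawn(secret, repo, title, timestamp) {
  return createHmac("sha256", secret)
    .update(`${repo}|${title}|${timestamp}`)
    .digest("hex");
}

// Returns { ok, reason }. maxAgeSeconds is an assumed staleness window.
function verifySpawnSignature(secret, fields, nowSeconds, maxAgeSeconds = 900) {
  const { repo, title, timestamp, signature } = fields;
  if (!Number.isFinite(timestamp) || Math.abs(nowSeconds - timestamp) > maxAgeSeconds) {
    return { ok: false, reason: "stale" };
  }
  const expected = Buffer.from(signSpawn(secret, repo, title, timestamp), "hex");
  const actual = Buffer.from(String(signature), "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (actual.length !== expected.length || !timingSafeEqual(actual, expected)) {
    return { ok: false, reason: "invalid" };
  }
  return { ok: true, reason: "valid" };
}
```

Checking staleness before the HMAC comparison keeps the two rejection reasons distinct in logs. Any change to the repo, title, or timestamp produces a different digest, so a signature copied onto another issue fails as `invalid`.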
+
+## 3) Trust routing is identity-aware
+
+The transform classifies senders into trust levels such as owner, collaborator/contributor, and readonly. That trust level affects:
+
+- which agent receives the event
+- whether approval words are honored
+- whether a manager spawn may occur
+- whether an event is ignored as untrusted or looped
+
+## 4) Issue lock files are a core safety mechanism
+
+Issue workflows are protected with file locks under a hooks lock directory. Locks have TTL-based behavior, and closed issues move into a short grace state before release. This matters because concurrent comments or duplicate deliveries can otherwise spawn duplicate work.
+
+## Live-state findings that affect architecture
+
+## Current health status: DEGRADED
+
+The current OpenClaw deployment is degraded now, not hypothetically later.
+
+### Confirmed problems
+
+- `openclaw.json` references a non-existent model alias: `claudehack/claude-sonnet-4-6`
+- 8 of 12 cron jobs are failing repeatedly
+- `ws-sync` is failing, so cached repo state is stale
+- `webhook-verify` is failing, so the pipeline's own end-to-end verification job is unhealthy
+- failover chains are slow and noisy under API pressure
+
+### Why this matters for migration design
+
+- The migration should reduce dependency on fragile global cron/heartbeat behavior
+- The replacement should make ingress validation and deterministic enforcement stand on their own
+- The replacement should log every event locally, even when downstream agent work fails
+- The replacement should avoid hidden couplings to provider/model config where possible
+
+## Current components and responsibilities
+
+### 1) nginx edge
+
+Responsibilities today:
+
+- TLS termination
+- forwarding inbound webhook traffic
+- injecting the gateway bearer token
+- relying on network locality and host-level topology as part of trust
+
+**Migration implication:**
+The new Caret listener can either:
+
+- keep using nginx as the front door and share the bearer-token pattern, or
+- terminate webhook traffic directly and verify raw-body HMAC itself
+
+The second option is better if Rooh wants the replacement to improve security rather than merely preserve behavior.
+
+### 2) OpenClaw gateway
+
+Responsibilities today:
+
+- receive hook traffic
+- authenticate requests
+- dispatch transform logic
+- spawn agent sessions
+- run heartbeats and cron jobs
+- host plugins and outbound delivery
+- enforce tool policies
+
+**Migration implication:**
+We should not replace the whole gateway. We only need a listener for the Gitea slice.
+
+### 3) `gitea-transform.js`
+
+This is the current Gitea event router. It performs:
+
+- event-type filtering
+- dedup checks
+- trust classification
+- loop prevention
+- rate limiting
+- lock checks
+- route decisions
+- script execution for deterministic cases
+- manager/spawner dispatch for workflow cases
+- audit logging
+
+**Migration implication:**
+This is the closest thing to the spec for the new listener. The replacement should preserve its behavior selectively, not copy the whole gateway.
+
+### 4) Deterministic script layer
+
+Examples found in research:
+
+- `post-repo-audit.sh`
+- `audit-webhooks.sh`
+- `audit-repo-policies.sh`
+- `secret-scan.sh`
+- `check-implement-orphans.sh`
+- `spawn-manager.sh`
+
+These are mostly stateless bash/node tools with path/config coupling.
+
+**Migration implication:**
+Do not rewrite these from scratch unless necessary. Copy/adapt the working ones, strip OpenClaw-specific paths, and make config explicit.
+
+### 5) Session / workflow orchestration
+
+OpenClaw provides:
+
+- isolated session spawn
+- role/tool policy resolution
+- session transcript storage
+- channel delivery
+- wake mechanisms
+
+**Migration implication:**
+This is the expensive part to rebuild. Avoid it. Use Claude-native primitives only for the narrow judgment path.
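
Of the transform duties listed above, the per-issue lock check is the one most easily carried over as plain file operations. The sketch below is one possible shape, assuming a lock is a file created exclusively and that TTL is measured from the file's modification time. The helper names, the file naming, and the TTL values are hypothetical, and the closed-issue grace state is deliberately left out.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper: take a per-issue lock by creating a file exclusively.
// lockDir, the file naming scheme, and ttlMs are assumptions for this sketch.
function acquireIssueLock(lockDir, repo, issueNumber, ttlMs, now = Date.now()) {
  fs.mkdirSync(lockDir, { recursive: true });
  const name = `${repo.replace(/[^A-Za-z0-9_.-]/g, "_")}-${issueNumber}.lock`;
  const lockPath = path.join(lockDir, name);
  try {
    const stat = fs.statSync(lockPath);
    if (now - stat.mtimeMs < ttlMs) return { acquired: false, lockPath };
    fs.unlinkSync(lockPath); // lock exists but outlived its TTL: treat as stale
  } catch (err) {
    if (err.code !== "ENOENT") throw err;
  }
  try {
    // The "wx" flag fails with EEXIST if another process created the file first.
    const body = JSON.stringify({ repo, issueNumber, takenAt: now });
    fs.writeFileSync(lockPath, body, { flag: "wx" });
    return { acquired: true, lockPath };
  } catch (err) {
    if (err.code === "EEXIST") return { acquired: false, lockPath };
    throw err;
  }
}

// Hypothetical helper: release is just removing the file.
function releaseIssueLock(lockPath) {
  fs.rmSync(lockPath, { force: true });
}
```

The exclusive-create step is what stops two concurrent deliveries from both starting work. The stale-lock takeover is not fully race-free, which may be acceptable for a single-process listener but would need more care if several processes share the lock directory.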
+
+## The minimal replacement architecture
+
+The smallest viable Caret-owned architecture is:
+
+```text
+Gitea
+  -> Caret listener (Bun)
+       - raw body capture
+       - HMAC verify
+       - delivery dedup
+       - trust + routing
+       - file locks
+       - structured logs
+       - script fan-out
+       - optional judgment trigger
+  -> deterministic tools/
+  -> optional Claude-native wake-up path
+```
+
+### Listener responsibilities
+
+The listener should own exactly these jobs:
+
+1. Read raw request body before parsing
+2. Verify `X-Gitea-Signature` with timing-safe HMAC compare
+3. Parse event metadata and delivery ID
+4. Deduplicate by delivery ID
+5. Apply event-type filters
+6. Classify sender / trust level
+7. Enforce loop prevention for agent-authored comments
+8. Acquire/check per-issue lock where needed
+9. Dispatch deterministic scripts by event type
+10. Emit structured JSON logs for all outcomes
+11. Optionally trigger a judgment wake-up when deterministic automation cannot decide
+
+### Deterministic script fan-out
+
+The likely event map after design review:
+
+| Event | Action |
+|---|---|
+| `repository.create` | collaborator add + webhook ensure + repo policy baseline |
+| `push` to protected branch | secret scan + policy re-check |
+| `issues.opened` on automation-tagged issues | route to gated workflow logic |
+| `issue_comment` on active workflow issue | approval parsing, lock check, optional wake-up |
+| unsupported / irrelevant event | log and ignore |
+
+This keeps the zero-token path zero-token.
+
+### Judgment path
+
+Only use judgment for cases that deterministic automation cannot safely resolve, such as:
+
+- ambiguous repo type
+- policy enforcement failure requiring explanation
+- explicit request for AI review
+- human-authored workflow step that needs synthesis rather than a script
+
+This should not require recreating OpenClaw's full spawn/orchestration model. The design target should be a small Claude-native wake-up primitive, not a manager framework clone.
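
The first four listener responsibilities above (raw body, signature check, parse, dedup) can be sketched as a small ingress gate. This assumes Gitea's `X-Gitea-Signature` header carries the hex-encoded HMAC-SHA256 of the raw payload. The names `verifyGiteaSignature`, `DeliveryDedup`, and `acceptDelivery`, the status codes, and the in-memory dedup store are illustrative only; a file-backed store could sit behind the same interface.

```javascript
const { createHmac, timingSafeEqual } = require("node:crypto");

// Verify the X-Gitea-Signature header against the raw (unparsed) request body.
function verifyGiteaSignature(rawBody, signatureHex, secret) {
  if (typeof signatureHex !== "string") return false;
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const actual = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return actual.length === expected.length && timingSafeEqual(actual, expected);
}

// Hypothetical in-memory dedup keyed by X-Gitea-Delivery, with a TTL.
class DeliveryDedup {
  constructor(ttlMs) {
    this.ttlMs = ttlMs;
    this.seen = new Map();
  }
  // Returns true only the first time an ID is seen within the TTL window.
  firstSeen(deliveryId, now = Date.now()) {
    for (const [id, at] of this.seen) {
      if (now - at > this.ttlMs) this.seen.delete(id);
    }
    if (this.seen.has(deliveryId)) return false;
    this.seen.set(deliveryId, now);
    return true;
  }
}

// Hypothetical ingress gate: verify first, then dedup, then parse.
// Malformed JSON would throw here; a real listener would catch and log it.
function acceptDelivery({ rawBody, signature, deliveryId }, secret, dedup) {
  if (!verifyGiteaSignature(rawBody, signature, secret)) {
    return { status: 401, outcome: "bad_signature" };
  }
  if (!deliveryId || !dedup.firstSeen(deliveryId)) {
    return { status: 200, outcome: "duplicate_or_missing_id" };
  }
  return { status: 202, outcome: "accepted", event: JSON.parse(rawBody.toString("utf8")) };
}
```

In a Bun handler the raw body would typically be read with `await req.text()` before any JSON parsing, and the two header values with `req.headers.get(...)`. Verifying before parsing means unauthenticated payloads never reach `JSON.parse`.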
+
+## Hard dependencies vs removable dependencies
+
+### Dependencies the new Gitea slice can remove
+
+- OpenClaw hook ingestion for Gitea webhooks
+- OpenClaw transform execution for Gitea routing
+- reliance on nginx bearer injection as the only authenticity check
+- OpenClaw-specific queue inbox / lock path layout
+- OpenClaw-specific script path assumptions
+
+### Dependencies the new slice should keep, at least initially
+
+- Gitea itself
+- existing policy scripts and repo hygiene logic
+- existing human workflow semantics where already working
+- OpenClaw-owned broader workspace/project system
+- OpenClaw-owned non-Gitea cron/heartbeat ecosystem
+- Claude-native or OpenClaw-native judgment wake-up until a better primitive is chosen
+
+## Data / state the replacement must own
+
+The replacement does not need a database. File-backed state is enough.
+
+### Required local state
+
+- `logs/events.jsonl` or similar structured event log
+- `state/dedup.json` for recent delivery IDs
+- `state/locks/-.lock` for per-issue workflow control
+- `state/runs/` or similar optional execution receipts
+- config files for webhook secret, Gitea endpoint, token, allowed repos/users
+
+### Nice-to-have state
+
+- replay queue for transient failures
+- dead-letter folder for malformed events
+- event latency counters / health summaries
+
+## Architectural differences between current and target state
+
+| Concern | Current OpenClaw state | Target Caret state |
+|---|---|---|
+| Webhook auth | bearer token + nginx locality | raw-body Gitea HMAC preferred |
+| Router | transform inside gateway | standalone Bun listener |
+| Deterministic actions | scripts invoked by transform | same scripts invoked by listener |
+| Locks | OpenClaw hooks lock dir | Caret-owned lock dir |
+| Dedup | OpenClaw cache file | Caret-owned dedup state |
+| Judgment wake-up | OpenClaw session spawn | Claude-native minimal wake-up |
+| Cron/heartbeat | OpenClaw global scheduler | only if truly needed for this slice |
+| Workspace ownership | OpenClaw workspace | unchanged unless explicitly expanded |
+
+## Main migration conclusions
+
+### Conclusion 1: do not rebuild OpenClaw
+
+That would be a category error. The gateway, plugin runtime, delivery layer, cron/heartbeat engine, and session/orchestration stack are a separate platform project.
+
+### Conclusion 2: rebuild the Gitea ingress/router slice only
+
+This is the actual migration target and is small enough to complete quickly.
+
+### Conclusion 3: improve security while migrating
+
+The replacement should implement actual raw-body Gitea HMAC verification. The current webhook path does not.
+
+### Conclusion 4: keep deterministic work pure-script
+
+The current split is correct. Repo policy and enforcement work should remain fast, cheap, and idempotent.
+
+### Conclusion 5: judgment must be narrow and explicit
+
+Do not wake Claude on every webhook. Use it only for ambiguity, escalation, or clearly user-requested reasoning.
+
+### Conclusion 6: design should assume the current system is fragile
+
+Because surrounding cron/verification infrastructure is already degraded, the replacement should be independently observable and easy to test without depending on OpenClaw's unhealthy scheduler chain.
+
+## Open questions for Phase 1 design
+
+These questions should be answered in `DESIGN.md`.
+
+1. **Ingress topology:** keep nginx in front, or let the Caret listener terminate the webhook directly?
+2. **Auth model:** bearer only for parity, or proper Gitea HMAC as the new standard?
+3. **Judgment primitive:** Channels plugin, direct Claude Code primitive, or temporary dependency on OpenClaw for wake-up?
+4. **Script packaging:** copy the existing scripts wholesale first, or split them into library + thin wrappers?
+5. **Repo registration:** per-repo hooks only, or system-level hook once token/admin constraints are solved?
+6. **Retry model:** synchronous fire-and-log only, or file-backed retry queue for transient failures?
+7. **Observability:** plain JSONL logs only, or add a health endpoint plus counters and replay tooling?
+8. **Workflow semantics:** which current issue/comment workflows are worth preserving exactly, and which can be simplified?
+
+## Recommended next step
+
+Move to **Phase 1 — Architecture design** with the following framing:
+
+- Treat this document as the baseline map of the current system
+- Design only the **Gitea-facing slice**, not a gateway replacement
+- Preserve the deterministic/judgment split
+- Improve webhook authentication with real HMAC
+- Make observability first-class because the current environment is already degraded
+
+That keeps the project in the "days" category instead of letting it sprawl back into a multi-week platform rewrite.