docs: add architecture synthesis

2026-04-06 12:52:16 +00:00
parent d36fa6538a
commit 416c5759b5
1 changed file with 407 additions and 0 deletions
--- a/ARCHITECTURE.md
+++ b/ARCHITECTURE.md
@@ -0,0 +1,407 @@
+# ARCHITECTURE.md — current openclaw Gitea slice and migration boundary
+
+**Repo:** `sol/openclaw-to-caret-migration`
+**Date:** 2026-04-06
+**Source reports:**
+- `research/RESEARCH-01-gitea-webhooks-deep-read.md`
+- `research/RESEARCH-02-gateway-internals.md`
+- `research/RESEARCH-03-live-state-audit.md`
+
+## Executive summary
+
+The migration target is much smaller than a full OpenClaw replacement.
+
+OpenClaw today owns a large orchestration platform: gateway auth, session storage, plugin loading, subagent spawning, cron, heartbeat, tool policy enforcement, multi-channel delivery, and the long-lived workspace for 155+ projects. Replacing all of that would be a 4-8 week systems project.
+
+But the **Gitea-facing slice** that this migration actually needs is narrower:
+
+1. **Webhook ingress**
+2. **Event validation / routing**
+3. **Deterministic script fan-out**
+4. **Issue workflow gates / lock logic**
+5. **Optional judgment wake-up when automation is not enough**
+
+That slice can be rebuilt as a small standalone listener plus a handful of copied/adapted scripts. The practical shape is a **600-800 line Bun listener** with raw-body signature verification, dedup, file locks, script dispatch, and structured logs.
+
+The live audit also changed the urgency: this is not a clean migration away from a stable system. The current OpenClaw installation is already **degraded**, with 8 of 12 cron jobs failing due to a bad model reference (`claudehack/claude-sonnet-4-6`). That does not directly prove the Gitea webhook path is broken, but it does mean the surrounding automation is already brittle and parts of the verification pipeline are failing.
+
+## Scope boundary
+
+### In scope for this migration
+
+- Gitea webhook receiver for repo / issue / comment style events
+- Authentication of incoming webhook traffic
+- Deduplication and idempotency checks
+- Event router
+- Deterministic script execution for policy enforcement and repo hygiene
+- File-based issue lock management
+- Minimal queue / retry behavior where needed
+- Structured audit logging
+- Optional handoff into a Claude-native judgment path
+
+### Explicitly out of scope
+
+These stay owned by OpenClaw unless Phase 1 expands scope intentionally:
+
+- Full gateway RPC / WebSocket protocol
+- Session transcript storage system
+- General subagent orchestration framework
+- Global cron and heartbeat scheduler
+- Plugin SDK and plugin runtime
+- Delivery abstraction for Mattermost / Telegram / Discord / WhatsApp
+- Full tool allowlist inheritance engine
+- Existing 155-project workspace and project registry
+- Global memory / archive / compaction machinery
+
+## Current system: end-to-end picture
+
+### Deterministic path today
+
+```text
+Gitea
+  -> HTTPS POST https://slack.solio.tech/hooks/gitea
+  -> nginx
+     - TLS termination
+     - local forwarding
+     - injects Authorization: Bearer <OPENCLAW_HOOKS_TOKEN>
+  -> OpenClaw gateway /hooks/agent
+  -> gitea-transform.js
+  -> event router
+  -> pure scripts (post-repo-audit, policy audit, security checks, etc.)
+  -> logs / queue state / lock files
+```
+
+### Judgment / agent path today
+
+```text
+Gitea event
+  -> transform validation and trust checks
+  -> route decision
+  -> if issue workflow requires agent action:
+       precompute spawn params
+       async dispatch to spawner / manager path
+       OpenClaw creates isolated session
+       agent writes back to Gitea / chat surfaces
+```
+
+### Platform services supporting both
+
+```text
+OpenClaw gateway
+  - auth / bearer validation
+  - hook ingestion
+  - session spawn
+  - tool allowlist resolution
+  - cron service
+  - heartbeat runner
+  - plugin loading
+  - outbound delivery
+  - workspace/session state persistence
+```
+
+## Security model: what exists now
+
+### 1) Incoming Gitea webhooks are not protected by Gitea HMAC today
+
+This was the most important architecture surprise.
+
+Although Gitea supports `X-Gitea-Signature`, the current OpenClaw transform layer does not have access to the raw request body, so it does **not** perform real body-level HMAC verification. The live repo audit also showed the visible repo webhooks have **no secret set**.
+
+Current protection is instead layered as:
+
+1. HTTPS via nginx
+2. nginx forwarding only to local gateway
+3. injected bearer token (`Authorization: Bearer ...`)
+4. gateway token validation
+5. delivery dedup by `X-Gitea-Delivery`
+
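+This is workable, but weaker and more indirect than true webhook HMAC: the bearer token is injected by nginx, so it shows only that a request passed through nginx, not that Gitea produced the payload.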
+
+### 2) Spawn signatures are a separate HMAC system
+
+There *is* HMAC in the system, but it protects a different boundary.
+
+When `sol` creates an `[IMPLEMENT]` issue, the issue body includes a spawn signature comment:
+
+```html
+<!-- xen-spawn-sig:HMAC:TIMESTAMP -->
+```
+
+The transform recomputes HMAC-SHA256 over `repo|title|timestamp` using a local secret, compares the result with the value embedded in the comment, and rejects invalid or stale signatures. This is not webhook authentication. It is an authorization gate for a privileged workflow.
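The spawn-signature check described above can be sketched roughly as follows. This is a minimal illustration, not the code in `gitea-transform.js`: the hex encoding of the HMAC, a timestamp in epoch seconds, the 600-second staleness window, and the name `verifySpawnSignature` are all assumptions.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Marker format from the issue body: <!-- xen-spawn-sig:HMAC:TIMESTAMP -->
// Assumption: HMAC is hex-encoded and TIMESTAMP is epoch seconds.
const SPAWN_SIG_RE = /<!--\s*xen-spawn-sig:([0-9a-f]+):(\d+)\s*-->/i;

function verifySpawnSignature(
  repo: string,
  title: string,
  issueBody: string,
  secret: string,
  nowSeconds: number,
  maxAgeSeconds: number = 600, // assumption: the real window is not stated
): boolean {
  const match = SPAWN_SIG_RE.exec(issueBody);
  if (!match) return false;
  const sigHex = match[1];
  const timestamp = match[2];

  // Reject stale or future-dated signatures before doing any crypto.
  const age = nowSeconds - Number(timestamp);
  if (!Number.isFinite(age) || age < 0 || age > maxAgeSeconds) return false;

  // Recompute the HMAC over repo|title|timestamp with the local secret.
  const expected = createHmac("sha256", secret)
    .update(`${repo}|${title}|${timestamp}`)
    .digest();
  const received = Buffer.from(sigHex, "hex");

  // timingSafeEqual throws on unequal lengths, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

Because the title is part of the signed string, editing the issue title after creation would invalidate the signature under this scheme.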
+
+### 3) Trust routing is identity-aware
+
+The transform classifies senders into trust levels such as owner, collaborator/contributor, and readonly. That trust level affects:
+
+- which agent receives the event
+- whether approval words are honored
+- whether a manager spawn may occur
+- whether an event is ignored as untrusted or looped
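One way to picture the trust classification is the sketch below. The three levels come from the text above; everything else is hypothetical, including the `TrustConfig` shape, the rule that only owners have approval words honored, and the rule that read-only senders cannot trigger a manager spawn. The actual policy lives in the transform and should be checked there.

```typescript
type TrustLevel = "owner" | "collaborator" | "readonly";

// Hypothetical config shape; the real transform's config is not shown in the research.
interface TrustConfig {
  owners: string[];
  collaborators: string[];
  agentLogins: string[]; // accounts the automation itself posts as
}

interface TrustDecision {
  level: TrustLevel;
  ignore: boolean;             // true for looped (agent-authored) events
  honorApprovalWords: boolean; // assumption: owners only
  allowManagerSpawn: boolean;  // assumption: owners and collaborators
}

function classifySender(login: string, cfg: TrustConfig): TrustDecision {
  const level: TrustLevel = cfg.owners.includes(login)
    ? "owner"
    : cfg.collaborators.includes(login)
      ? "collaborator"
      : "readonly";

  // Loop prevention: events authored by the automation's own accounts are dropped.
  const looped = cfg.agentLogins.includes(login);

  return {
    level,
    ignore: looped,
    honorApprovalWords: !looped && level === "owner",
    allowManagerSpawn: !looped && level !== "readonly",
  };
}
```

Keeping the decision as a plain data object makes it easy to write into the audit log next to the event that produced it.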
+
+### 4) Issue lock files are a core safety mechanism
+
+Issue workflows are protected with file locks under a hooks lock directory. Locks have TTL-based behavior, and closed issues move into a short grace state before release. This matters because concurrent comments or duplicate deliveries can otherwise spawn duplicate work.
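A TTL-based file lock of the kind described above might look like the following sketch. The lock file format (a JSON object with `acquiredAt`) and the function names are assumptions made for illustration; the grace state for closed issues is not modeled.

```typescript
import { readFileSync, rmSync, writeFileSync } from "node:fs";

/**
 * Try to take a per-issue lock. Returns true if this caller now holds it.
 * A lock older than ttlMs is treated as expired and replaced.
 */
function tryAcquireLock(lockPath: string, ttlMs: number, now: number = Date.now()): boolean {
  try {
    const held = JSON.parse(readFileSync(lockPath, "utf8")) as { acquiredAt: number };
    if (now - held.acquiredAt < ttlMs) return false; // still live
    rmSync(lockPath, { force: true });               // expired: clear it
  } catch (err) {
    // A missing file simply means nobody holds the lock.
    if ((err as { code?: string }).code !== "ENOENT") throw err;
  }
  try {
    // "wx" creates the file exclusively and fails with EEXIST if it already exists.
    writeFileSync(lockPath, JSON.stringify({ acquiredAt: now }), { flag: "wx" });
    return true;
  } catch (err) {
    if ((err as { code?: string }).code === "EEXIST") return false; // lost the race
    throw err;
  }
}

function releaseLock(lockPath: string): void {
  rmSync(lockPath, { force: true });
}
```

The exclusive create covers the common case of two deliveries racing for a fresh lock. The expire-then-recreate step is not atomic, so two processes that both see an expired lock can still interleave; a single-process listener avoids that case.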
+
+## Live-state findings that affect architecture
+
+## Current health status: DEGRADED
+
+The current OpenClaw deployment is degraded now, not hypothetically later.
+
+### Confirmed problems
+
+- `openclaw.json` references a non-existent model alias: `claudehack/claude-sonnet-4-6`
+- 8 of 12 cron jobs are failing repeatedly
+- `ws-sync` is failing, so cached repo state is stale
+- `webhook-verify` is failing, so the pipeline's own end-to-end verification job is unhealthy
+- failover chains are slow and noisy under API pressure
+
+### Why this matters for migration design
+
+- The migration should reduce dependency on fragile global cron/heartbeat behavior
+- The replacement should make ingress validation and deterministic enforcement stand on their own
+- The replacement should log every event locally, even when downstream agent work fails
+- The replacement should avoid hidden couplings to provider/model config where possible
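The "log every event locally, even when downstream work fails" requirement can be illustrated with a small wrapper. This is a sketch under assumptions: the entry fields, the `recordOutcome` name, and the synchronous `work` callback are illustrative choices, not an existing interface.

```typescript
import { appendFileSync } from "node:fs";

// Hypothetical log entry shape for a JSONL event log.
interface EventLogEntry {
  ts: string;
  deliveryId: string;
  event: string;
  outcome: "handled" | "failed";
  detail?: string;
}

/**
 * Run the downstream work for one event and always append one JSON line,
 * whether the work succeeded or threw.
 */
function recordOutcome(
  logPath: string,
  deliveryId: string,
  event: string,
  work: () => void,
): EventLogEntry {
  const entry: EventLogEntry = {
    ts: new Date().toISOString(),
    deliveryId,
    event,
    outcome: "handled",
  };
  try {
    work();
  } catch (err) {
    // A downstream failure is recorded rather than allowed to hide the event.
    entry.outcome = "failed";
    entry.detail = err instanceof Error ? err.message : String(err);
  }
  appendFileSync(logPath, JSON.stringify(entry) + "\n");
  return entry;
}
```

Because the log write does not depend on any model or provider config, the event trail survives the kind of cron and model-alias failures listed above.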
+
+## Current components and responsibilities
+
+### 1) nginx edge
+
+Responsibilities today:
+
+- TLS termination
+- forwarding inbound webhook traffic
+- injecting the gateway bearer token
+- relying on network locality and host-level topology as part of trust
+
+**Migration implication:**
+The new Caret listener can either:
+
+- keep using nginx as the front door and share the bearer-token pattern, or
+- terminate webhook traffic directly and verify raw-body HMAC itself
+
+The second option is better if Rooh wants the replacement to improve security rather than merely preserve behavior.
+
+### 2) OpenClaw gateway
+
+Responsibilities today:
+
+- receive hook traffic
+- authenticate requests
+- dispatch transform logic
+- spawn agent sessions
+- run heartbeats and cron jobs
+- host plugins and outbound delivery
+- enforce tool policies
+
+**Migration implication:**
+We should not replace the whole gateway. We only need a listener for the Gitea slice.
+
+### 3) `gitea-transform.js`
+
+This is the current Gitea event router. It performs:
+
+- event-type filtering
+- dedup checks
+- trust classification
+- loop prevention
+- rate limiting
+- lock checks
+- route decisions
+- script execution for deterministic cases
+- manager/spawner dispatch for workflow cases
+- audit logging
+
+**Migration implication:**
+This is the closest thing to the spec for the new listener. The replacement should preserve its behavior selectively, not copy the whole gateway.
+
+### 4) Deterministic script layer
+
+Examples found in research:
+
+- `post-repo-audit.sh`
+- `audit-webhooks.sh`
+- `audit-repo-policies.sh`
+- `secret-scan.sh`
+- `check-implement-orphans.sh`
+- `spawn-manager.sh`
+
+These are mostly stateless bash/node tools, but they are coupled to OpenClaw-specific paths and config.
+
+**Migration implication:**
+Do not rewrite these from scratch unless necessary. Copy/adapt the working ones, strip OpenClaw-specific paths, and make config explicit.
+
+### 5) Session / workflow orchestration
+
+OpenClaw provides:
+
+- isolated session spawn
+- role/tool policy resolution
+- session transcript storage
+- channel delivery
+- wake mechanisms
+
+**Migration implication:**
+This is the expensive part to rebuild. Avoid it. Use Claude-native primitives only for the narrow judgment path.
+
+## The minimal replacement architecture
+
+The smallest viable Caret-owned architecture is:
+
+```text
+Gitea
+  -> Caret listener (Bun)
+     - raw body capture
+     - HMAC verify
+     - delivery dedup
+     - trust + routing
+     - file locks
+     - structured logs
+     - script fan-out
+     - optional judgment trigger
+  -> deterministic tools/
+  -> optional Claude-native wake-up path
+```
+
+### Listener responsibilities
+
+The listener should own exactly these jobs:
+
+1. Read raw request body before parsing
+2. Verify `X-Gitea-Signature` with timing-safe HMAC compare
+3. Parse event metadata and delivery ID
+4. Deduplicate by delivery ID
+5. Apply event-type filters
+6. Classify sender / trust level
+7. Enforce loop prevention for agent-authored comments
+8. Acquire/check per-issue lock where needed
+9. Dispatch deterministic scripts by event type
+10. Emit structured JSON logs for all outcomes
+11. Optionally trigger a judgment wake-up when deterministic automation cannot decide
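Steps 1 and 2 of the list above are the part the current system cannot do, so a sketch may help. It assumes `X-Gitea-Signature` carries the hex-encoded HMAC-SHA256 of the raw request body, and the function name `verifyGiteaSignature` is illustrative.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

/**
 * Verify a Gitea webhook signature against the raw, unparsed request body.
 * Assumption: the header value is the hex-encoded HMAC-SHA256 of the body.
 */
function verifyGiteaSignature(
  rawBody: Uint8Array,
  signatureHeader: string | null,
  secret: string,
): boolean {
  if (!signatureHeader) return false;

  const expected = createHmac("sha256", secret).update(rawBody).digest();
  // Invalid hex decodes to a shorter buffer, which then fails the length check.
  const received = Buffer.from(signatureHeader.trim(), "hex");

  // timingSafeEqual throws on unequal lengths, so compare lengths first.
  if (received.length !== expected.length) return false;
  return timingSafeEqual(received, expected);
}
```

In a Bun handler the raw bytes would typically come from `await req.arrayBuffer()` wrapped in a `Uint8Array`, read before any JSON parsing, since re-serializing a parsed body is not guaranteed to reproduce the signed bytes.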
+
+### Deterministic script fan-out
+
+The likely event map after design review:
+
+| Event | Action |
+|---|---|
+| `repository.create` | collaborator add + webhook ensure + repo policy baseline |
+| `push` to protected branch | secret scan + policy re-check |
+| `issues.opened` on automation-tagged issues | route to gated workflow logic |
+| `issue_comment` on active workflow issue | approval parsing, lock check, optional wake-up |
+| unsupported / irrelevant event | log and ignore |
+
+This keeps the zero-token path zero-token: events that scripts can handle never invoke a model.
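The event map above could be expressed as a lookup table along these lines. The event keys are copied from the table; the pairing of events to specific script names is a guess based on the script list found in research, and the conditions in the table (protected branch, automation tag, active workflow) are deliberately left out of this sketch.

```typescript
interface Route {
  kind: "scripts" | "workflow" | "ignore";
  scripts: string[]; // deterministic scripts to run, in order
}

// Assumption: which scripts belong to which event is illustrative only.
const ROUTES = new Map<string, Route>([
  ["repository.create", {
    kind: "scripts",
    scripts: ["post-repo-audit.sh", "audit-webhooks.sh", "audit-repo-policies.sh"],
  }],
  ["push", { kind: "scripts", scripts: ["secret-scan.sh", "audit-repo-policies.sh"] }],
  ["issues.opened", { kind: "workflow", scripts: [] }],
  ["issue_comment", { kind: "workflow", scripts: [] }],
]);

const IGNORE: Route = { kind: "ignore", scripts: [] };

/**
 * Resolve a route from the event type and optional action.
 * Tries the specific "event.action" key first, then the bare event name,
 * and falls back to log-and-ignore for anything unmapped.
 */
function resolveRoute(event: string, action?: string): Route {
  const specific = action ? ROUTES.get(`${event}.${action}`) : undefined;
  return specific ?? ROUTES.get(event) ?? IGNORE;
}
```

Only routes of kind `workflow` would ever reach lock checks or a judgment wake-up; everything of kind `scripts` completes without a model call. The exact event and action strings should be confirmed against real Gitea payloads before this table is relied on.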
+
+### Judgment path
+
+Only use judgment for cases that deterministic automation cannot safely resolve, such as:
+
+- ambiguous repo type
+- policy enforcement failure requiring explanation
+- explicit request for AI review
+- human-authored workflow step that needs synthesis rather than a script
+
+This should not require recreating OpenClaw's full spawn/orchestration model. The design target should be a small Claude-native wake-up primitive, not a manager framework clone.
+
+## Hard dependencies vs removable dependencies
+
+### Dependencies the new Gitea slice can remove
+
+- OpenClaw hook ingestion for Gitea webhooks
+- OpenClaw transform execution for Gitea routing
+- reliance on nginx bearer injection as the only authenticity check
+- OpenClaw-specific queue inbox / lock path layout
+- OpenClaw-specific script path assumptions
+
+### Dependencies the new slice should keep, at least initially
+
+- Gitea itself
+- existing policy scripts and repo hygiene logic
+- existing human workflow semantics where already working
+- OpenClaw-owned broader workspace/project system
+- OpenClaw-owned non-Gitea cron/heartbeat ecosystem
+- Claude-native or OpenClaw-native judgment wake-up until a better primitive is chosen
+
+## Data / state the replacement must own
+
+The replacement does not need a database. File-backed state is enough.
+
+### Required local state
+
+- `logs/events.jsonl` or similar structured event log
+- `state/dedup.json` for recent delivery IDs
+- `state/locks/<repo>-<issue>.lock` for per-issue workflow control
+- `state/runs/` or similar optional execution receipts
+- config files for webhook secret, Gitea endpoint, token, allowed repos/users
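As an illustration of how little machinery the file-backed dedup state needs, here is a sketch of a delivery-ID check against a JSON file such as `state/dedup.json`. The file layout (a map from delivery ID to first-seen time), the retention window parameter, and the function names are assumptions.

```typescript
import { readFileSync, writeFileSync } from "node:fs";

type SeenMap = Record<string, number>; // delivery ID -> first-seen epoch ms

function loadSeen(path: string): SeenMap {
  try {
    return JSON.parse(readFileSync(path, "utf8")) as SeenMap;
  } catch {
    return {}; // missing or unreadable file: start with an empty window
  }
}

/**
 * Returns true if the delivery ID was already recorded inside the window.
 * Otherwise records it and returns false. Old entries are pruned on each call.
 */
function isDuplicateDelivery(
  path: string,
  deliveryId: string,
  windowMs: number,
  now: number = Date.now(),
): boolean {
  const seen = loadSeen(path);
  for (const [id, ts] of Object.entries(seen)) {
    if (now - ts > windowMs) delete seen[id];
  }
  const duplicate = Object.prototype.hasOwnProperty.call(seen, deliveryId);
  if (!duplicate) seen[deliveryId] = now;
  writeFileSync(path, JSON.stringify(seen));
  return duplicate;
}
```

The read-modify-write cycle here is only safe with a single listener process; running several workers would need the same kind of exclusive-create guard used for issue locks.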
+
+### Nice-to-have state
+
+- replay queue for transient failures
+- dead-letter folder for malformed events
+- event latency counters / health summaries
+
+## Architectural differences between current and target state
+
+| Concern | Current OpenClaw state | Target Caret state |
+|---|---|---|
+| Webhook auth | bearer token + nginx locality | raw-body Gitea HMAC preferred |
+| Router | transform inside gateway | standalone Bun listener |
+| Deterministic actions | scripts invoked by transform | same scripts invoked by listener |
+| Locks | OpenClaw hooks lock dir | Caret-owned lock dir |
+| Dedup | OpenClaw cache file | Caret-owned dedup state |
+| Judgment wake-up | OpenClaw session spawn | Claude-native minimal wake-up |
+| Cron/heartbeat | OpenClaw global scheduler | only if truly needed for this slice |
+| Workspace ownership | OpenClaw workspace | unchanged unless explicitly expanded |
+
+## Main migration conclusions
+
+### Conclusion 1: do not rebuild OpenClaw
+
+That would be a category error. The gateway, plugin runtime, delivery layer, cron/heartbeat engine, and session/orchestration stack are a separate platform project.
+
+### Conclusion 2: rebuild the Gitea ingress/router slice only
+
+This is the actual migration target, and it is small enough to complete in days rather than weeks.
+
+### Conclusion 3: improve security while migrating
+
+The replacement should implement actual raw-body Gitea HMAC verification. The current webhook path does not.
+
+### Conclusion 4: keep deterministic work pure-script
+
+The current split is correct. Repo policy and enforcement work should remain fast, cheap, and idempotent.
+
+### Conclusion 5: judgment must be narrow and explicit
+
+Do not wake Claude on every webhook. Use it only for ambiguity, escalation, or clearly user-requested reasoning.
+
+### Conclusion 6: design should assume the current system is fragile
+
+Because surrounding cron/verification infrastructure is already degraded, the replacement should be independently observable and easy to test without depending on OpenClaw's unhealthy scheduler chain.
+
+## Open questions for Phase 1 design
+
+These questions should be answered in `DESIGN.md`.
+
+1. **Ingress topology:** keep nginx in front, or let the Caret listener terminate the webhook directly?
+2. **Auth model:** bearer only for parity, or proper Gitea HMAC as the new standard?
+3. **Judgment primitive:** Channels plugin, direct Claude Code primitive, or temporary dependency on OpenClaw for wake-up?
+4. **Script packaging:** copy the existing scripts wholesale first, or split them into library + thin wrappers?
+5. **Repo registration:** per-repo hooks only, or system-level hook once token/admin constraints are solved?
+6. **Retry model:** synchronous fire-and-log only, or file-backed retry queue for transient failures?
+7. **Observability:** plain JSONL logs only, or add a health endpoint plus counters and replay tooling?
+8. **Workflow semantics:** which current issue/comment workflows are worth preserving exactly, and which can be simplified?
+
+## Recommended next step
+
+Move to **Phase 1 — Architecture design** with the following framing:
+
+- Treat this document as the baseline map of the current system
+- Design only the **Gitea-facing slice**, not a gateway replacement
+- Preserve the deterministic/judgment split
+- Improve webhook authentication with real HMAC
+- Make observability first-class because the current environment is already degraded
+
+That keeps the project in the "days" category instead of letting it sprawl back into a multi-week platform rewrite.