Compare commits: main...feat/archi (1 commit, 416c5759b5)

ARCHITECTURE.md (new file, 407 lines)
@@ -0,0 +1,407 @@

# ARCHITECTURE.md: current OpenClaw Gitea slice and migration boundary

**Repo:** `sol/openclaw-to-caret-migration`
**Date:** 2026-04-06
**Source reports:**
- `research/RESEARCH-01-gitea-webhooks-deep-read.md`
- `research/RESEARCH-02-gateway-internals.md`
- `research/RESEARCH-03-live-state-audit.md`

## Executive summary

The migration target is much smaller than a full OpenClaw replacement.

OpenClaw today owns a large orchestration platform: gateway auth, session storage, plugin loading, subagent spawning, cron, heartbeat, tool policy enforcement, multi-channel delivery, and the long-lived workspace for 155+ projects. Replacing all of that would be a 4-8 week systems project.

But the **Gitea-facing slice** that this migration actually needs is narrower:

1. **Webhook ingress**
2. **Event validation / routing**
3. **Deterministic script fan-out**
4. **Issue workflow gates / lock logic**
5. **Optional judgment wake-up when automation is not enough**

That slice can be rebuilt as a small standalone listener plus a handful of copied/adapted scripts. The practical shape is a **600-800 line Bun listener** with raw-body signature verification, dedup, file locks, script dispatch, and structured logs.

The live audit also changed the urgency: this is not a clean migration away from a stable system. The current OpenClaw installation is already **degraded**, with 8 of 12 cron jobs failing due to a bad model reference (`claudehack/claude-sonnet-4-6`). That does not directly prove the Gitea webhook path is broken, but it does mean the surrounding automation is already brittle and parts of the verification pipeline are failing.

## Scope boundary

### In scope for this migration

- Gitea webhook receiver for repo / issue / comment style events
- Authentication of incoming webhook traffic
- Deduplication and idempotency checks
- Event router
- Deterministic script execution for policy enforcement and repo hygiene
- File-based issue lock management
- Minimal queue / retry behavior where needed
- Structured audit logging
- Optional handoff into a Claude-native judgment path

### Explicitly out of scope

These stay owned by OpenClaw unless Phase 1 expands scope intentionally:

- Full gateway RPC / WebSocket protocol
- Session transcript storage system
- General subagent orchestration framework
- Global cron and heartbeat scheduler
- Plugin SDK and plugin runtime
- Delivery abstraction for Mattermost / Telegram / Discord / WhatsApp
- Full tool allowlist inheritance engine
- Existing 155-project workspace and project registry
- Global memory / archive / compaction machinery

## Current system: end-to-end picture

### Deterministic path today

```text
Gitea
  -> HTTPS POST https://slack.solio.tech/hooks/gitea
  -> nginx
       - TLS termination
       - local forwarding
       - injects Authorization: Bearer <OPENCLAW_HOOKS_TOKEN>
  -> OpenClaw gateway /hooks/agent
  -> gitea-transform.js
  -> event router
  -> pure scripts (post-repo-audit, policy audit, security checks, etc.)
  -> logs / queue state / lock files
```

### Judgment / agent path today

```text
Gitea event
  -> transform validation and trust checks
  -> route decision
  -> if issue workflow requires agent action:
       precompute spawn params
       async dispatch to spawner / manager path
       OpenClaw creates isolated session
       agent writes back to Gitea / chat surfaces
```

### Platform services supporting both

```text
OpenClaw gateway
  - auth / bearer validation
  - hook ingestion
  - session spawn
  - tool allowlist resolution
  - cron service
  - heartbeat runner
  - plugin loading
  - outbound delivery
  - workspace/session state persistence
```

## Security model: what exists now

## 1) Incoming Gitea webhooks are not protected by Gitea HMAC today

This was the most important architecture surprise.

Although Gitea supports `X-Gitea-Signature`, the current OpenClaw transform layer does not have access to the raw request body, so it does **not** perform real body-level HMAC verification. The live repo audit also showed the visible repo webhooks have **no secret set**.

Current protection is instead layered as:

1. HTTPS via nginx
2. nginx forwarding only to local gateway
3. injected bearer token (`Authorization: Bearer ...`)
4. gateway token validation
5. delivery dedup by `X-Gitea-Delivery`

This is workable, but weaker and more indirect than true webhook HMAC.

## 2) Spawn signatures are a separate HMAC system

There *is* HMAC in the system, but it protects a different boundary.

When `sol` creates an `[IMPLEMENT]` issue, the issue body includes a spawn signature comment:

```html
<!-- xen-spawn-sig:HMAC:TIMESTAMP -->
```

The transform recomputes HMAC-SHA256 over `repo|title|timestamp`, validates it with a local secret, and rejects invalid or stale signatures. This is not webhook authentication. It is an authorization gate for a privileged workflow.
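
The spawn-signature check described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the transform's actual code: the hex digest encoding, the Unix-seconds timestamp, the 10-minute freshness window, and the names `signSpawn` / `verifySpawn` are all hypothetical. Only the comment format and the `repo|title|timestamp` input come from the research.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumptions for illustration: hex digest, Unix-seconds timestamp, 10-minute window.
const MAX_AGE_SECONDS = 600;
const SIG_PATTERN = /<!--\s*xen-spawn-sig:([0-9a-f]{64}):(\d+)\s*-->/;

/** Builds the signature comment that would be embedded in an issue body (hypothetical helper). */
function signSpawn(repo: string, title: string, timestamp: number, secret: string): string {
  const mac = createHmac("sha256", secret).update(`${repo}|${title}|${timestamp}`).digest("hex");
  return `<!-- xen-spawn-sig:${mac}:${timestamp} -->`;
}

/** Returns true only if the issue body carries a fresh, valid signature for this repo and title. */
function verifySpawn(
  repo: string,
  title: string,
  issueBody: string,
  secret: string,
  nowSeconds: number,
): boolean {
  const match = SIG_PATTERN.exec(issueBody);
  if (!match) return false;
  const claimedHex = match[1] ?? "";
  const timestamp = Number(match[2]);
  // Reject signatures outside the window in either direction (stale or future-dated).
  if (Math.abs(nowSeconds - timestamp) > MAX_AGE_SECONDS) return false;
  const expected = createHmac("sha256", secret).update(`${repo}|${title}|${timestamp}`).digest();
  const claimed = Buffer.from(claimedHex, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return claimed.length === expected.length && timingSafeEqual(claimed, expected);
}
```

Because the title is part of the signed input, editing an issue title after creation would invalidate the signature in this sketch. Whether that matches current behavior should be confirmed against `gitea-transform.js`.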

## 3) Trust routing is identity-aware

The transform classifies senders into trust levels such as owner, collaborator/contributor, and readonly. That trust level affects:

- which agent receives the event
- whether approval words are honored
- whether a manager spawn may occur
- whether an event is ignored as untrusted or looped
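
To make the trust levels above concrete, here is a minimal sketch of a classifier. The config shape, the function names, and especially the policy attached to each level (who may approve, who may trigger a manager spawn) are assumptions for illustration. The real rules live in `gitea-transform.js` and should be read from there before anything like this is adopted.

```typescript
type TrustLevel = "owner" | "collaborator" | "readonly";

// Hypothetical config shape; the real source of these lists is not specified here.
interface TrustConfig {
  owners: string[];
  collaborators: string[];
  agentLogins: string[]; // bot accounts whose own events must not re-trigger work
}

interface TrustDecision {
  trust: TrustLevel;
  ignore: boolean;
  honorApprovalWords: boolean;
  allowManagerSpawn: boolean;
  reason: string;
}

function includesLogin(list: string[], login: string): boolean {
  const needle = login.toLowerCase();
  return list.some((entry) => entry.toLowerCase() === needle);
}

function classifySender(login: string, cfg: TrustConfig): TrustDecision {
  // Loop prevention comes first: agent-authored events are dropped regardless of privileges.
  if (includesLogin(cfg.agentLogins, login)) {
    return { trust: "readonly", ignore: true, honorApprovalWords: false, allowManagerSpawn: false, reason: "loop" };
  }
  if (includesLogin(cfg.owners, login)) {
    return { trust: "owner", ignore: false, honorApprovalWords: true, allowManagerSpawn: true, reason: "owner" };
  }
  if (includesLogin(cfg.collaborators, login)) {
    // Assumed policy: collaborators may approve but may not trigger a manager spawn.
    return { trust: "collaborator", ignore: false, honorApprovalWords: true, allowManagerSpawn: false, reason: "collaborator" };
  }
  // Everyone else is treated as readonly: logged and routed, but with no privileged effects.
  return { trust: "readonly", ignore: false, honorApprovalWords: false, allowManagerSpawn: false, reason: "unknown sender" };
}
```

Returning a single decision object keeps every trust-dependent branch in one place, which makes the policy easy to log alongside each event.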

## 4) Issue lock files are a core safety mechanism

Issue workflows are protected with file locks under a hooks lock directory. Locks have TTL-based behavior, and closed issues move into a short grace state before release. This matters because concurrent comments or duplicate deliveries can otherwise spawn duplicate work.
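
A per-issue file lock with TTL expiry, as described above, might look like the sketch below. The 30-minute TTL, the file naming, and the function names are assumptions. The post-close grace state is not modeled here.

```typescript
import { closeSync, mkdirSync, openSync, statSync, unlinkSync, writeSync } from "node:fs";
import { join } from "node:path";

const LOCK_TTL_MS = 30 * 60 * 1000; // assumed TTL; the research does not state the real value

function lockPath(lockDir: string, repo: string, issue: number): string {
  // Flatten "owner/name" so each lock is a single file in the lock directory.
  return join(lockDir, `${repo.replace(/\//g, "__")}-${issue}.lock`);
}

/** Tries to take the lock. Returns false if a live (non-expired) lock already exists. */
function acquireLock(lockDir: string, repo: string, issue: number, now: number = Date.now()): boolean {
  mkdirSync(lockDir, { recursive: true });
  const path = lockPath(lockDir, repo, issue);
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      // "wx" creates the file exclusively and fails with EEXIST if it already exists.
      const fd = openSync(path, "wx");
      writeSync(fd, JSON.stringify({ repo, issue, acquiredAt: now }));
      closeSync(fd);
      return true;
    } catch (err) {
      if ((err as { code?: string }).code !== "EEXIST") throw err;
      let ageMs: number;
      try {
        ageMs = now - statSync(path).mtimeMs;
      } catch {
        continue; // the holder released the lock between our open and stat; retry
      }
      if (ageMs < LOCK_TTL_MS) return false;
      // Expired lock: remove it and retry once. Known limitation: two workers breaking
      // the same stale lock at the same instant can race; a sketch accepts that risk.
      try {
        unlinkSync(path);
      } catch {
        // another worker already removed it
      }
    }
  }
  return false;
}

function releaseLock(lockDir: string, repo: string, issue: number): void {
  try {
    unlinkSync(lockPath(lockDir, repo, issue));
  } catch (err) {
    if ((err as { code?: string }).code !== "ENOENT") throw err;
  }
}
```

Exclusive creation is what makes the lock safe against duplicate deliveries arriving close together: only one `openSync(..., "wx")` call can succeed for a given issue.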

## Live-state findings that affect architecture

## Current health status: DEGRADED

The current OpenClaw deployment is degraded now, not hypothetically later.

### Confirmed problems

- `openclaw.json` references a non-existent model alias: `claudehack/claude-sonnet-4-6`
- 8 of 12 cron jobs are failing repeatedly
- `ws-sync` is failing, so cached repo state is stale
- `webhook-verify` is failing, so the pipeline's own end-to-end verification job is unhealthy
- failover chains are slow and noisy under API pressure

### Why this matters for migration design

- The migration should reduce dependency on fragile global cron/heartbeat behavior
- The replacement should make ingress validation and deterministic enforcement stand on their own
- The replacement should log every event locally, even when downstream agent work fails
- The replacement should avoid hidden couplings to provider/model config where possible
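
The "log every event locally" requirement can be met with an append-only JSONL file written before any downstream work starts. The sketch below assumes a hypothetical entry shape and outcome vocabulary; neither is taken from the existing OpenClaw logs.

```typescript
import { appendFileSync, mkdirSync, readFileSync } from "node:fs";
import { dirname } from "node:path";

// Hypothetical entry shape; field names are illustrative only.
interface EventLogEntry {
  ts: string;
  deliveryId: string;
  event: string;
  repo: string;
  outcome: "dispatched" | "ignored" | "rejected" | "failed";
  detail?: string;
}

/** Appends one JSON object per line. Appending never rewrites earlier entries. */
function logEvent(logPath: string, entry: Omit<EventLogEntry, "ts">, now: Date = new Date()): void {
  mkdirSync(dirname(logPath), { recursive: true });
  const line: EventLogEntry = { ts: now.toISOString(), ...entry };
  appendFileSync(logPath, JSON.stringify(line) + "\n");
}

/** Reads the log back, e.g. for health summaries or replay tooling. */
function readEvents(logPath: string): EventLogEntry[] {
  return readFileSync(logPath, "utf8")
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as EventLogEntry);
}
```

Recording a `failed` outcome with its own line, rather than updating the earlier `dispatched` line, keeps the log append-only and makes downstream failures visible without depending on any scheduler.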

## Current components and responsibilities

### 1) nginx edge

Responsibilities today:

- TLS termination
- forwarding inbound webhook traffic
- injecting the gateway bearer token
- relying on network locality and host-level topology as part of trust

**Migration implication:**
The new Caret listener can either:

- keep using nginx as the front door and share the bearer-token pattern, or
- terminate webhook traffic directly and verify raw-body HMAC itself

The second option is better if Rooh wants the replacement to improve security rather than merely preserve behavior.

### 2) OpenClaw gateway

Responsibilities today:

- receive hook traffic
- authenticate requests
- dispatch transform logic
- spawn agent sessions
- run heartbeats and cron jobs
- host plugins and outbound delivery
- enforce tool policies

**Migration implication:**
We should not replace the whole gateway. We only need a listener for the Gitea slice.

### 3) `gitea-transform.js`

This is the current Gitea event router. It performs:

- event-type filtering
- dedup checks
- trust classification
- loop prevention
- rate limiting
- lock checks
- route decisions
- script execution for deterministic cases
- manager/spawner dispatch for workflow cases
- audit logging

**Migration implication:**
This is the closest thing to the spec for the new listener. The replacement should preserve its behavior selectively, not copy the whole gateway.

### 4) Deterministic script layer

Examples found in research:

- `post-repo-audit.sh`
- `audit-webhooks.sh`
- `audit-repo-policies.sh`
- `secret-scan.sh`
- `check-implement-orphans.sh`
- `spawn-manager.sh`

These are mostly stateless bash/node tools with path/config coupling.

**Migration implication:**
Do not rewrite these from scratch unless necessary. Copy/adapt the working ones, strip OpenClaw-specific paths, and make config explicit.

### 5) Session / workflow orchestration

OpenClaw provides:

- isolated session spawn
- role/tool policy resolution
- session transcript storage
- channel delivery
- wake mechanisms

**Migration implication:**
This is the expensive part to rebuild. Avoid it. Use Claude-native primitives only for the narrow judgment path.

## The minimal replacement architecture

The smallest viable Caret-owned architecture is:

```text
Gitea
  -> Caret listener (Bun)
       - raw body capture
       - HMAC verify
       - delivery dedup
       - trust + routing
       - file locks
       - structured logs
       - script fan-out
       - optional judgment trigger
  -> deterministic tools/
  -> optional Claude-native wake-up path
```

### Listener responsibilities

The listener should own exactly these jobs:

1. Read raw request body before parsing
2. Verify `X-Gitea-Signature` with timing-safe HMAC compare
3. Parse event metadata and delivery ID
4. Deduplicate by delivery ID
5. Apply event-type filters
6. Classify sender / trust level
7. Enforce loop prevention for agent-authored comments
8. Acquire/check per-issue lock where needed
9. Dispatch deterministic scripts by event type
10. Emit structured JSON logs for all outcomes
11. Optionally trigger a judgment wake-up when deterministic automation cannot decide
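
Steps 1 to 3 of the list above are the security-critical ones, so a sketch may help. It assumes the listener has already captured the raw bytes (in a fetch-style handler, for example via `new Uint8Array(await req.arrayBuffer())`) and lowercased the header names. The `Received` type and the function names are hypothetical. Gitea sends the hex-encoded HMAC-SHA256 of the payload in `X-Gitea-Signature`.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical result type for the first stage of the pipeline.
type Received =
  | { ok: true; event: string; deliveryId: string; payload: unknown }
  | { ok: false; status: 400 | 401; reason: string };

/** Timing-safe check of the hex HMAC-SHA256 signature against the raw, unparsed body. */
function verifyGiteaSignature(rawBody: Uint8Array, signatureHex: string | undefined, secret: string): boolean {
  if (!signatureHex || !/^[0-9a-fA-F]{64}$/.test(signatureHex)) return false;
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, "hex");
  return received.length === expected.length && timingSafeEqual(received, expected);
}

/** Verifies first, parses second. `headers` is assumed to use lowercased names. */
function receive(rawBody: Uint8Array, headers: Record<string, string | undefined>, secret: string): Received {
  if (!verifyGiteaSignature(rawBody, headers["x-gitea-signature"], secret)) {
    return { ok: false, status: 401, reason: "bad_signature" };
  }
  const event = headers["x-gitea-event"];
  const deliveryId = headers["x-gitea-delivery"];
  if (!event || !deliveryId) {
    return { ok: false, status: 400, reason: "missing_headers" };
  }
  try {
    // JSON parsing happens only after the bytes have been authenticated.
    const payload: unknown = JSON.parse(Buffer.from(rawBody).toString("utf8"));
    return { ok: true, event, deliveryId, payload };
  } catch {
    return { ok: false, status: 400, reason: "invalid_json" };
  }
}
```

The ordering is the point: the signature is computed over the exact bytes Gitea sent, so any re-serialization of a parsed body before verification would break it. That is the limitation the current transform layer has.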

### Deterministic script fan-out

The likely event map after design review:

| Event | Action |
|---|---|
| `repository.create` | collaborator add + webhook ensure + repo policy baseline |
| `push` to protected branch | secret scan + policy re-check |
| `issues.opened` on automation-tagged issues | route to gated workflow logic |
| `issue_comment` on active workflow issue | approval parsing, lock check, optional wake-up |
| unsupported / irrelevant event | log and ignore |

This keeps the zero-token path zero-token.
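
The event map above translates naturally into a lookup table plus a few guards. In this sketch the route keys reuse the table's labels, the step names are placeholders for whichever scripts end up implementing each action, and the `EventFacts` fields are hypothetical inputs that earlier pipeline stages would have to compute.

```typescript
interface Route {
  steps: string[]; // placeholder step names, to be bound to real scripts in DESIGN.md
  mayWake: boolean; // whether this route is allowed to trigger the judgment path
}

// Facts assumed to be derived from headers and payload before routing.
interface EventFacts {
  event: string;
  action?: string;
  protectedBranch?: boolean;
  automationTagged?: boolean;
  activeWorkflow?: boolean;
}

const IGNORE: Route = { steps: [], mayWake: false };

const ROUTES = new Map<string, Route>([
  ["repository.create", { steps: ["collaborator-add", "webhook-ensure", "repo-policy-baseline"], mayWake: false }],
  ["push", { steps: ["secret-scan", "policy-recheck"], mayWake: false }],
  ["issues.opened", { steps: ["gated-workflow"], mayWake: false }],
  ["issue_comment", { steps: ["approval-parse", "lock-check"], mayWake: true }],
]);

function route(facts: EventFacts): Route {
  // Prefer an "event.action" entry, then fall back to the bare event name.
  const specific = facts.action ? `${facts.event}.${facts.action}` : facts.event;
  const key = ROUTES.has(specific) ? specific : facts.event;
  const candidate = ROUTES.get(key);
  if (!candidate) return IGNORE; // unsupported / irrelevant event: log and ignore
  // Guards for the qualifiers in the table.
  if (key === "push" && !facts.protectedBranch) return IGNORE;
  if (key === "issues.opened" && !facts.automationTagged) return IGNORE;
  if (key === "issue_comment" && !facts.activeWorkflow) return IGNORE;
  return candidate;
}
```

Only one route sets `mayWake`, which keeps the zero-token property auditable: any future route that wants the judgment path has to opt in explicitly in the table.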

### Judgment path

Only use judgment for cases that deterministic automation cannot safely resolve, such as:

- ambiguous repo type
- policy enforcement failure requiring explanation
- explicit request for AI review
- human-authored workflow step that needs synthesis rather than a script

This should not require recreating OpenClaw's full spawn/orchestration model. The design target should be a small Claude-native wake-up primitive, not a manager framework clone.

## Hard dependencies vs removable dependencies

### Dependencies the new Gitea slice can remove

- OpenClaw hook ingestion for Gitea webhooks
- OpenClaw transform execution for Gitea routing
- reliance on nginx bearer injection as the only authenticity check
- OpenClaw-specific queue inbox / lock path layout
- OpenClaw-specific script path assumptions

### Dependencies the new slice should keep, at least initially

- Gitea itself
- existing policy scripts and repo hygiene logic
- existing human workflow semantics where already working
- OpenClaw-owned broader workspace/project system
- OpenClaw-owned non-Gitea cron/heartbeat ecosystem
- Claude-native or OpenClaw-native judgment wake-up until a better primitive is chosen

## Data / state the replacement must own

The replacement does not need a database. File-backed state is enough.

### Required local state

- `logs/events.jsonl` or similar structured event log
- `state/dedup.json` for recent delivery IDs
- `state/locks/<repo>-<issue>.lock` for per-issue workflow control
- `state/runs/` or similar optional execution receipts
- config files for webhook secret, Gitea endpoint, token, allowed repos/users
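
As one example of how little machinery file-backed state needs, the dedup file could be handled as in the sketch below. The 24-hour retention window, the JSON layout (delivery ID mapped to first-seen time), and the function names are assumptions. It also assumes a single listener process, since concurrent writers would need the same kind of locking as issue workflows.

```typescript
import { existsSync, mkdirSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { dirname } from "node:path";

const DEDUP_WINDOW_MS = 24 * 60 * 60 * 1000; // assumed retention for delivery IDs

/** Loads delivery ID -> first-seen epoch ms. A missing or corrupt file is treated as empty. */
function loadDedup(path: string): Map<string, number> {
  if (!existsSync(path)) return new Map();
  try {
    const parsed = JSON.parse(readFileSync(path, "utf8")) as Record<string, number>;
    return new Map(Object.entries(parsed));
  } catch {
    return new Map();
  }
}

function saveDedup(path: string, state: Map<string, number>): void {
  mkdirSync(dirname(path), { recursive: true });
  // Write to a temp file and rename so a crash never leaves a half-written state file.
  const tmp = `${path}.tmp`;
  writeFileSync(tmp, JSON.stringify(Object.fromEntries(state)));
  renameSync(tmp, path);
}

/** Returns true the first time a delivery ID is seen inside the window, false for repeats. */
function firstSeen(path: string, deliveryId: string, now: number = Date.now()): boolean {
  const state = loadDedup(path);
  for (const [id, seenAt] of state) {
    if (now - seenAt > DEDUP_WINDOW_MS) state.delete(id); // prune expired entries
  }
  if (state.has(deliveryId)) return false;
  state.set(deliveryId, now);
  saveDedup(path, state);
  return true;
}
```

Treating a corrupt file as empty favors availability: the worst case is one reprocessed delivery, which the idempotent scripts and issue locks are already expected to tolerate.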

### Nice-to-have state

- replay queue for transient failures
- dead-letter folder for malformed events
- event latency counters / health summaries

## Architectural differences between current and target state

| Concern | Current OpenClaw state | Target Caret state |
|---|---|---|
| Webhook auth | bearer token + nginx locality | raw-body Gitea HMAC preferred |
| Router | transform inside gateway | standalone Bun listener |
| Deterministic actions | scripts invoked by transform | same scripts invoked by listener |
| Locks | OpenClaw hooks lock dir | Caret-owned lock dir |
| Dedup | OpenClaw cache file | Caret-owned dedup state |
| Judgment wake-up | OpenClaw session spawn | Claude-native minimal wake-up |
| Cron/heartbeat | OpenClaw global scheduler | only if truly needed for this slice |
| Workspace ownership | OpenClaw workspace | unchanged unless explicitly expanded |

## Main migration conclusions

### Conclusion 1: do not rebuild OpenClaw

That would be a category error. The gateway, plugin runtime, delivery layer, cron/heartbeat engine, and session/orchestration stack are a separate platform project.

### Conclusion 2: rebuild the Gitea ingress/router slice only

This is the actual migration target and is small enough to complete quickly.

### Conclusion 3: improve security while migrating

The replacement should implement actual raw-body Gitea HMAC verification. The current webhook path does not.

### Conclusion 4: keep deterministic work pure-script

The current split is correct. Repo policy and enforcement work should remain fast, cheap, and idempotent.

### Conclusion 5: judgment must be narrow and explicit

Do not wake Claude on every webhook. Use it only for ambiguity, escalation, or clearly user-requested reasoning.

### Conclusion 6: design should assume the current system is fragile

Because surrounding cron/verification infrastructure is already degraded, the replacement should be independently observable and easy to test without depending on OpenClaw's unhealthy scheduler chain.

## Open questions for Phase 1 design

These questions should be answered in `DESIGN.md`.

1. **Ingress topology:** keep nginx in front, or let the Caret listener terminate the webhook directly?
2. **Auth model:** bearer only for parity, or proper Gitea HMAC as the new standard?
3. **Judgment primitive:** Channels plugin, direct Claude Code primitive, or temporary dependency on OpenClaw for wake-up?
4. **Script packaging:** copy the existing scripts wholesale first, or split them into library + thin wrappers?
5. **Repo registration:** per-repo hooks only, or system-level hook once token/admin constraints are solved?
6. **Retry model:** synchronous fire-and-log only, or file-backed retry queue for transient failures?
7. **Observability:** plain JSONL logs only, or add a health endpoint plus counters and replay tooling?
8. **Workflow semantics:** which current issue/comment workflows are worth preserving exactly, and which can be simplified?

## Recommended next step

Move to **Phase 1: Architecture design** with the following framing:

- Treat this document as the baseline map of the current system
- Design only the **Gitea-facing slice**, not a gateway replacement
- Preserve the deterministic/judgment split
- Improve webhook authentication with real HMAC
- Make observability first-class because the current environment is already degraded

That keeps the project in the "days" category instead of letting it sprawl back into a multi-week platform rewrite.