research: Phase 0 report 1 — gitea-webhooks deep read

Caret
2026-04-06 12:43:18 +00:00
parent 40c0ca3300
commit f3296d136a


@@ -0,0 +1,127 @@
# Research 01 — gitea-webhooks, workspace-ops, agent-reliability deep read
**Subagent:** `afa92905872a43a9b` (Explore)
**Completed:** 2026-04-06 12:40 UTC
**Scope:** Architecture of the current openclaw gitea-webhooks pipeline and the two smaller reference repos.
## Key architectural findings
### Ingress path (the surprise)
```
Gitea → HTTPS POST https://slack.solio.tech/hooks/gitea
→ nginx (TLS term, strips path, injects Bearer header)
→ http://127.0.0.1:18789/hooks/agent (openclaw gateway)
→ gitea-transform.js
→ event router
```
**HMAC is NOT verified by the pipeline.** Gitea sends `X-Gitea-Signature` but the raw request body is not available to the transform layer, so real HMAC-SHA256 verification never happens. Authentication is layered differently:
1. **TLS** via Let's Encrypt on nginx
2. **Nginx localhost ACL** on port 18789 (gateway only accessible from the host)
3. **Bearer token** injected by nginx (`Authorization: Bearer {OPENCLAW_HOOKS_TOKEN}`); the gateway responds with 401 on mismatch
4. **Dedup cache** of `X-Gitea-Delivery` headers, 24h window, persisted at `/root/.openclaw/hooks/logs/dedup-cache.json`
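A minimal sketch of how the dedup layer could work in my own listener, assuming the cache is a flat JSON object mapping delivery ID to first-seen epoch milliseconds. The file layout and the names `loadCache` and `isDuplicate` are my assumptions, not openclaw's actual code; only the 24h window and the `X-Gitea-Delivery` key come from the findings above.
```javascript
const fs = require('node:fs');

const DEDUP_WINDOW_MS = 24 * 60 * 60 * 1000; // 24h window, per the findings above

// Hypothetical helper: load the persisted cache, tolerating a missing or corrupt file.
function loadCache(cachePath) {
  try {
    const parsed = JSON.parse(fs.readFileSync(cachePath, 'utf8'));
    return parsed !== null && typeof parsed === 'object' && !Array.isArray(parsed) ? parsed : {};
  } catch {
    return {}; // start empty; replays are possible until the cache refills
  }
}

// Returns true if deliveryId was already seen inside the window; otherwise records it.
function isDuplicate(cachePath, deliveryId, now = Date.now()) {
  const cache = loadCache(cachePath);
  // Drop entries that have aged out of the window.
  for (const [id, seenAt] of Object.entries(cache)) {
    if (now - seenAt > DEDUP_WINDOW_MS) delete cache[id];
  }
  const duplicate = Object.prototype.hasOwnProperty.call(cache, deliveryId);
  if (!duplicate) cache[deliveryId] = now;
  // Write to a temp file, then rename, so a crash never leaves a half-written cache.
  const tmpPath = `${cachePath}.tmp`;
  fs.writeFileSync(tmpPath, JSON.stringify(cache));
  fs.renameSync(tmpPath, cachePath);
  return duplicate;
}
```
Usage would be one call per request, e.g. `isDuplicate(cachePath, req.headers['x-gitea-delivery'])`, before any routing happens. Rewriting the whole file on every delivery is only reasonable at low webhook volume.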
There IS an HMAC in the system, but it's for a completely different purpose: **spawn signatures**. When the `sol` account creates an `[IMPLEMENT]` issue, it embeds `<!-- xen-spawn-sig:HMAC:TIMESTAMP -->` in the body. The transform re-computes HMAC-SHA256 over `repo|title|timestamp` with secret at `/root/.openclaw/hooks/spawn-secret`, verifies against the embedded value, and rejects stale or invalid signatures (2h TTL). This lets sol trigger SPAWN_MANAGER directly (bypassing gitea-worker), while keeping unauthorized creators locked out.
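The spawn-signature check described above can be sketched as follows. This is an illustration under assumptions: I assume the HMAC is hex-encoded, the timestamp is epoch milliseconds, and the comment has exactly the shape `<!-- xen-spawn-sig:HMAC:TIMESTAMP -->`. The function names `signSpawn` and `verifySpawnSignature` are hypothetical, and the real transform may differ in any of these details.
```javascript
const crypto = require('node:crypto');

const SPAWN_SIG_TTL_MS = 2 * 60 * 60 * 1000; // 2h TTL, per the findings above
// Assumed comment shape: 64 hex chars of HMAC-SHA256, then a numeric timestamp.
const SIG_PATTERN = /<!-- xen-spawn-sig:([0-9a-f]{64}):(\d+) -->/;

// HMAC-SHA256 over "repo|title|timestamp", hex-encoded (encoding is an assumption).
function signSpawn(secret, repo, title, timestamp) {
  return crypto
    .createHmac('sha256', secret)
    .update(`${repo}|${title}|${timestamp}`)
    .digest('hex');
}

// Returns true only for a well-formed, fresh, matching signature.
function verifySpawnSignature(issueBody, repo, title, secret, now = Date.now()) {
  const match = SIG_PATTERN.exec(issueBody);
  if (!match) return false;
  const [, embeddedSig, timestampText] = match;
  // Assumption: the timestamp is epoch milliseconds.
  if (now - Number(timestampText) > SPAWN_SIG_TTL_MS) return false; // stale
  const expected = signSpawn(secret, repo, title, timestampText);
  // Both sides decode to 32 bytes because the regex enforces 64 hex characters.
  return crypto.timingSafeEqual(
    Buffer.from(embeddedSig, 'hex'),
    Buffer.from(expected, 'hex'),
  );
}
```
Because the title is part of the signed string, editing the issue title after creation would invalidate the signature under this scheme.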
**Implication for Caret migration:** I inherit the same security model (bearer + ACL) OR I add real HMAC to my own listener. If I run a separate listener and want to share the Gitea webhook config, I need to cooperate with nginx or stand up a second endpoint. If I terminate events myself, I can do proper HMAC-SHA256 of the raw body using the `X-Gitea-Signature` header.
### Transform phases (gitea-transform.js v13.0)
1. **Validation gates** — event type filter, dedup, sender validation, `clawbot` loop prevention, agent echo suppression (`<!-- openclaw-agent -->` HTML comment), rate limiting (5 concurrent).
2. **Trust level detection** — Rooh (ID 29) → owner/main, collaborators → contributor/gitea-worker, unknown → readonly.
3. **Session lock check** — file-based locks at `/root/.openclaw/hooks/locks/{owner}-{repo}-{issue}`. 2h TTL. Closed issues transition to IS_DONE with 5min grace.
4. **Event routing** (the heart of it):
- `repository` / `create` → `execSync post-repo-audit.sh` — pure script, zero tokens
- `push` → skip main/master, route to main or gitea-worker with ACTION: RUN_CI
- `issues.opened` with `[IMPLEMENT]` title → verify spawn sig (if sol) → `precomputeSpawnParams` → `asyncDispatchToSpawner` → returns SPAWN_MANAGER directive
- `issue_comment` with approval words from Rooh → acquire lock → precompute → dispatch → EXECUTE_PLAN
5. **Logging & audit** — `logs/audit.jsonl`, `logs/webhook-events-YYYY-MM.jsonl`, `logs/incidents.jsonl`.
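Phases 2 and 4 above can be sketched as a pure routing function. This is a simplified illustration, not the v13.0 logic: it covers only `push` and `issues.opened`, omits locks, spawn-signature verification, and rate limiting, and assumes that readonly senders are skipped and that `[IMPLEMENT]` is a title prefix. The names `detectTrust` and `routeEvent` and the shape of the returned directive are hypothetical. The payload fields used (`sender.id`, `ref`, `action`, `issue.title`, `issue.number`) follow the usual Gitea webhook payload layout.
```javascript
const OWNER_ID = 29; // Rooh, per the trust-level phase above

// Phase 2: map the sender to a trust level and a target agent.
// collaboratorIds is a Set of user IDs; how it gets populated is out of scope here.
function detectTrust(senderId, collaboratorIds) {
  if (senderId === OWNER_ID) return { trust: 'owner', agent: 'main' };
  if (collaboratorIds.has(senderId)) return { trust: 'contributor', agent: 'gitea-worker' };
  return { trust: 'readonly', agent: null };
}

// Phase 4: turn an event into a directive. eventType comes from the X-Gitea-Event header.
function routeEvent(eventType, payload, collaboratorIds) {
  const { trust, agent } = detectTrust(payload.sender.id, collaboratorIds);
  // Assumption: readonly senders never trigger work.
  if (trust === 'readonly') return { action: 'SKIP', reason: 'readonly sender' };

  if (eventType === 'push') {
    const branch = payload.ref.replace('refs/heads/', '');
    if (branch === 'main' || branch === 'master') {
      return { action: 'SKIP', reason: 'default branch push' };
    }
    return { action: 'RUN_CI', agent, branch };
  }

  if (
    eventType === 'issues' &&
    payload.action === 'opened' &&
    payload.issue.title.startsWith('[IMPLEMENT]') // prefix match is an assumption
  ) {
    // The real pipeline verifies the spawn signature before this point when the creator is sol.
    return { action: 'SPAWN_MANAGER', agent: 'spawner', issue: payload.issue.number };
  }

  return { action: 'SKIP', reason: 'unhandled event' };
}
```
Keeping the router free of I/O is a deliberate choice in this sketch: it can then be tested against the fixture payloads without a gateway or a Gitea instance.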
### Tools fan-out (scripts invoked)
| Script | Trigger | Role |
|------------------------------|----------------------------------|-----------------------------------------------------------------------------|
| `post-repo-audit.sh` | repository/create (execSync) | Add Rooh as admin collaborator, ensure webhook exists. Pure script, ~seconds. |
| `audit-webhooks.sh` | heartbeat (15min) | Verify all webhooks exist and are healthy. `--fix` recreates missing hooks. |
| `audit-repo-policies.sh` | heartbeat (6h) + manual | Enforce required files (Makefile, .editorconfig, etc.) from sol/repo-policies. |
| `secret-scan.sh` | CI (`make check`) | Find private keys and high-entropy tokens. Allowlist at `.secret-scan-allowlist`. |
| `create-implement-issue.sh` | manual / agent | Create signed `[IMPLEMENT]` issue with HMAC spawn signature. |
| `check-implement-orphans.sh` | heartbeat (15min) | Detect stale pending spawn files, inactive Managers, orphaned Workers. |
| `spawn-manager.sh` | agent-called | Generate Manager spawn JSON from issue body. Creates project workspace. |
### Openclaw hard couplings I must replace
| Coupling | What I'll need | Difficulty |
|-----------------------------------------------------------------|--------------------------------------------------------|------------|
| `sessions_spawn` (openclaw's subagent spawn primitive) | Channels plugin or HTTP-triggered session creation | **Hard** |
| `wakeMode: now / next-heartbeat` | My own dispatch queue | Medium |
| Agent IDs (`main`, `gitea-worker`, `spawner`) | My own routing scheme (just `caret` + role tags) | Easy |
| Bearer token auth (`OPENCLAW_HOOKS_TOKEN`) | My own bearer token, shared with nginx or own listener | Easy |
| Workspace path (`/root/.openclaw/workspace/projects/PROJ-XXX-*`) | My own workspace path; update all scripts | Easy |
| Session lock dir (`/root/.openclaw/hooks/locks/`) | My own lock dir | Easy |
| Queue inbox (`/root/.openclaw/hooks/queue-inbox/`) | My own queue + daemon | Easy |
| Mattermost incident posting | Swap to tg-stream or keep calling Mattermost | Easy |
| Gitea API wrapper | Keep as-is (stateless curl) | Trivial |
### Surprising behaviors / gotchas
1. Session locks have a **2h TTL**, which is different from the archive timeout. Long-running Managers can lose their lock and trigger a race.
2. Closed-while-locked issues transition to **IS_DONE with 5min grace**, not immediate release.
3. Spawn signatures expire after **2h** regardless of issue state — approving a stale issue rejects the spawn.
4. Dedup cache is persisted to disk; crash recovery repopulates within 24h but replays may occur during that window.
5. Gitea HMAC not verified — bearer + nginx ACL are the only layers.
6. The `sol` account is treated as a contributor UNLESS its issue has a valid spawn signature. Approval words from sol are ignored.
7. Rate limiting is **per agent ID** (5 concurrent per agent), tracked in `active-sessions.json`.
8. `asyncDispatchToSpawner` is fire-and-forget HTTP POST — no error handling if the spawner is down.
9. `precomputeSpawnParams` calls `spawn-manager.sh` via `execSync` with a 30s timeout in the transform's hot path. Timeout = text-directive fallback.
10. Manager death before `STATE.json` checkpoint can cause duplicate respawns (flagged in agent-reliability).
11. Lock expiry cleanup is lazy — expired files accumulate until someone calls `check()`.
12. Queue daemon re-verifies spawn signatures using the stored timestamp, moving the TTL check off the hot path.
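The lock gotchas above (2h TTL, lazy cleanup) suggest what my own lock manager should look like. A minimal sketch, assuming each lock file holds a JSON object with an `acquiredAt` epoch-millisecond field. That file format and the names `acquireLock` and `sweepExpiredLocks` are my assumptions; only the `{owner}-{repo}-{issue}` naming and the TTL come from the findings.
```javascript
const fs = require('node:fs');
const path = require('node:path');

const LOCK_TTL_MS = 2 * 60 * 60 * 1000; // 2h TTL, per the findings above

function lockPath(lockDir, owner, repo, issue) {
  return path.join(lockDir, `${owner}-${repo}-${issue}`);
}

// Try to take the lock; returns true on success, false if a live lock exists.
function acquireLock(lockDir, owner, repo, issue, now = Date.now()) {
  const file = lockPath(lockDir, owner, repo, issue);
  fs.mkdirSync(lockDir, { recursive: true });
  try {
    // The 'wx' flag fails with EEXIST if the file is present, so creation is atomic.
    fs.writeFileSync(file, JSON.stringify({ acquiredAt: now }), { flag: 'wx' });
    return true;
  } catch (err) {
    if (err.code !== 'EEXIST') throw err;
  }
  // Lazy expiry: an existing lock only counts if it is younger than the TTL.
  const { acquiredAt } = JSON.parse(fs.readFileSync(file, 'utf8'));
  if (now - acquiredAt <= LOCK_TTL_MS) return false;
  // Takeover of an expired lock. Not atomic: two callers could both reach this line.
  fs.writeFileSync(file, JSON.stringify({ acquiredAt: now }));
  return true;
}

// Eager cleanup so expired files do not accumulate; returns the number removed.
function sweepExpiredLocks(lockDir, now = Date.now()) {
  let removed = 0;
  for (const name of fs.readdirSync(lockDir)) {
    const file = path.join(lockDir, name);
    const { acquiredAt } = JSON.parse(fs.readFileSync(file, 'utf8'));
    if (now - acquiredAt > LOCK_TTL_MS) {
      fs.unlinkSync(file);
      removed += 1;
    }
  }
  return removed;
}
```
Running `sweepExpiredLocks` from the existing 15min heartbeat would address gotcha 11. Gotcha 1 still needs a separate fix, such as having a long-running Manager refresh `acquiredAt` periodically.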
### HMAC recipe I can steal (for a proper replacement)
Gitea sends raw body + `X-Gitea-Signature: <hex HMAC-SHA256>` computed with the webhook secret over the raw JSON body. Node.js verification:
```javascript
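const crypto = require('node:crypto');

// rawBody: the exact bytes Gitea sent (Buffer or string), before any JSON parsing.
function verifyGiteaSignature(rawBody, signatureHeader, secret) {
  // Reject a missing header or anything that is not a 64-char hex SHA-256 digest;
  // Buffer.from(..., 'hex') would otherwise silently truncate at the first bad character.
  if (!signatureHeader || !/^[0-9a-f]{64}$/i.test(signatureHeader)) return false;
  const expected = crypto.createHmac('sha256', secret).update(rawBody).digest();
  const received = Buffer.from(signatureHeader, 'hex');
  // Timing-safe compare; both buffers are 32 bytes here, as timingSafeEqual requires.
  return crypto.timingSafeEqual(received, expected);
}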
```
The catch: the listener must read the **raw body before JSON parsing**. Middleware that parses JSON first strips the raw bytes and makes verification impossible. This is why openclaw's gateway never verified HMAC — it handed already-parsed JSON to the transform. My replacement listener must keep the raw body around (store it on the request object) until after verification.
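A minimal sketch of a listener that keeps the raw body until verification is done, using only `node:http` and `node:crypto`. The names `verifyThenParse` and `createListener`, the `onEvent` callback, and the choice of 401/204 status codes are hypothetical. A production version would also need a body size limit and the dedup check, both omitted here.
```javascript
const crypto = require('node:crypto');
const http = require('node:http');

// Verify first, parse second. rawBody must be the exact bytes Gitea sent (a Buffer).
// Returns the parsed payload, or null if the signature or the JSON is invalid.
function verifyThenParse(rawBody, signatureHeader, secret) {
  if (typeof signatureHeader !== 'string' || !/^[0-9a-f]{64}$/i.test(signatureHeader)) {
    return null;
  }
  const expected = crypto.createHmac('sha256', secret).update(rawBody).digest();
  const received = Buffer.from(signatureHeader, 'hex');
  if (!crypto.timingSafeEqual(received, expected)) return null;
  try {
    return JSON.parse(rawBody.toString('utf8'));
  } catch {
    return null;
  }
}

// Hypothetical listener: collects raw chunks itself instead of using JSON middleware.
// The server is returned unstarted; the caller decides the port and bind address.
function createListener(secret, onEvent) {
  return http.createServer((req, res) => {
    const chunks = [];
    req.on('data', (chunk) => chunks.push(chunk)); // no size cap in this sketch
    req.on('end', () => {
      const rawBody = Buffer.concat(chunks);
      const payload = verifyThenParse(rawBody, req.headers['x-gitea-signature'], secret);
      if (payload === null) {
        res.statusCode = 401;
        res.end();
        return;
      }
      onEvent(req.headers['x-gitea-event'], payload);
      res.statusCode = 204;
      res.end();
    });
  });
}
```
Starting it would look like `createListener(secret, handler).listen(port, '127.0.0.1')`, with the port and handler being my own choices. Binding to localhost keeps the existing nginx-in-front arrangement possible.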
### Test infrastructure I can reuse
- `tests/test-transform.js` — full unit tests for routing, trust levels, locks, rate limiting (Node.js, 50+ cases, mocks Gitea API). Directly portable.
- `tests/test-lifecycle.js` — Manager/Worker lifecycle including spawn signatures.
- `tests/test-spawn-manager.sh` — project ID generation, workspace creation (isolated, no API).
- `tools/test-transform.sh` — manual smoke test against a live gateway.
- Test fixtures in `tests/fixtures/` — mock Gitea payloads for events.
All are portable. I can copy them into my migration repo with their license intact and adapt paths.
## Answers to the questions that defined this research
**Q: Can I replace the deterministic side cleanly?** Yes. Every pure script is a standalone bash/node script with minimal coupling to openclaw. Main effort is rewriting the event router (`gitea-transform.js` equivalent) and the ingress path.
**Q: Can I replace the agent-spawn side cleanly?** Hard. `sessions_spawn` is a core openclaw primitive with session state, model selection, tool allowlists, wakeMode semantics, and process lifecycle management. The replacement needs a Channels plugin or a long-running worker pool; neither is a weekend project.
**Q: What's the minimum viable Caret pipeline?** An HTTP listener with proper HMAC verification, a router with the same event → script fan-out as gitea-transform.js, a file-based session lock manager, a structured log, and ONE script per event type. That's ~600-800 lines in one bun file, doable in Phase 2.
**Q: What should I NOT rebuild?** The tools themselves — `post-repo-audit.sh`, `audit-repo-policies.sh`, `spawn-manager.sh`, etc. Copy them, strip the openclaw path prefixes, ship them in `/host/root/.caret/tools/`. Don't reinvent the workspace/PROJ-XXX scheme unless Rooh explicitly asks — it's a working system with its own conventions I'd be recreating poorly.
## Next reads pending
- **Subagent `ae5ca38f70b1e9626`**: openclaw gateway internals — how does `sessions_spawn` actually fire a new Claude session? Where is the tool allowlist enforced? What does the cron/heartbeat scheduler look like?
- **Subagent `abf0cb0928d823a0b`**: live state audit. Partial results are already visible: four webhooks registered across sol/* repos, all pointing to `https://slack.solio.tech/hooks/gitea`, all with `Secret: NOT SET`. That's the smoking gun for "HMAC is not in the path right now."
When both finish, synthesize into `ARCHITECTURE.md` in the migration repo and move to Phase 1 design.