plan: initial migration plan and README

2026-04-06 12:37:12 +00:00
parent 3e69fb7beb
commit 40c0ca3300
2 changed files with 115 additions and 2 deletions
--- a/PLAN.md
+++ b/PLAN.md
@@ -0,0 +1,109 @@
+# openclaw → Caret migration plan
+
+**Status:** in progress, 2026-04-06
+**Owner:** Caret
+**Approver:** Rooh
+**Tracking:** issues in this repo
+
+## Goal
+
+Take over the agent infrastructure that openclaw currently runs through Xen (webhooks, policy enforcement, heartbeat checks, session spawning, scheduled work) and stand up a Caret-owned replacement that implements every one of those features correctly, so openclaw/Xen can be disabled later with zero regression.
+
+The migration has to preserve two categories of behavior cleanly:
+
+1. **Deterministic work** — pure-script operations that don't need an LLM. These must stay cheap (zero token cost), fast (sub-second), and reliable. Examples: adding collaborators on repo creation, ensuring the HMAC webhook exists on new repos, running the policy template baseline, HMAC verification.
+2. **Judgment work** — operations that benefit from Opus-level reasoning. These should *wake me up* via a native Claude Code primitive (channels plugin or similar) rather than run as a permanent process. Examples: drafting a README from commit history, deciding which template fits an unusual repo, reviewing PR policy violations conversationally, explaining anomalies in plain language.
+
+The split matters because openclaw's current gitea-webhooks pipeline is explicitly tagged `Zero tokens — pure script enforcement` in its `post-repo-audit.sh` header comment. Keeping the same split avoids ballooning token spend.
+
+## Phases
+
+### Phase 0 — Research (in progress)
+
+Read the three reference repos and the live system state to understand what I'm replacing. Three parallel Explore subagents are doing this now.
+
+- **R0.1** Read `sol/gitea-webhooks` deeply — data flow, HMAC, transform logic, tool fan-out, repo-type detection, openclaw couplings. (subagent afa92905872a43a9b)
+- **R0.2** Read `sol/workspace-ops` and `sol/agent-reliability` — scope, entry points, openclaw couplings, overlaps. (same subagent)
+- **R0.3** Map the openclaw gateway internals — session spawn API, cron/heartbeat mechanism, tool allowlist enforcement, plugin wiring, replacement difficulty matrix. (subagent ae5ca38f70b1e9626)
+- **R0.4** Audit the currently-running state — every registered Gitea webhook, live cron/timer infra, HEARTBEAT.md checklists, active agent processes, shared state stores. (subagent abf0cb0928d823a0b)
+- **R0.5** Synthesize the three reports into `ARCHITECTURE.md` in this repo — a single readable picture of what openclaw does today and what I'll need to rebuild.
+
+Exit criteria: I can answer any architectural question about the current openclaw Gitea pipeline without grepping the source again.
+
+### Phase 1 — Architecture design
+
+Before writing any code, lock down the target design with Rooh's sign-off.
+
+- **A1.1** Write `DESIGN.md` in this repo describing the target Caret-owned stack: components, data flows, ownership boundaries, storage, observability.
+- **A1.2** List every openclaw dependency the new stack removes, and every openclaw feature it depends on staying around (if any).
+- **A1.3** DMG-style brief: surface every design choice as a multiple-choice question with a recommendation and reasoning, so Rooh can approve or override each one. No code is written until this is signed off.
+- **A1.4** Define "100% feature parity" concretely — a test list that must pass after migration. Every test is a one-line assertion about behavior ("new sol/* repo gets Makefile within 10s", "HMAC-signed POST with wrong secret is rejected with 403", etc.).
+
+Exit criteria: `DESIGN.md` in the repo, signed off by Rooh, with a test list that defines done.
+
+### Phase 2 — Deterministic path build
+
+Build the pure-script side first. No LLM in the loop.
+
+- **B2.1** Scaffold the webhook listener (Bun HTTP server) in its own Docker container under `/host/root/.caret/`. Minimal footprint, one file where possible.
+- **B2.2** Implement HMAC verification exactly matching the openclaw pattern so existing Gitea webhooks work without re-registration.
+- **B2.3** Port the core scripts from `sol/gitea-webhooks/tools/` into `/host/root/.caret/tools/` with the openclaw-specific bits (hardcoded workspace paths, Mattermost channel IDs, etc.) stripped out.
+- **B2.4** Wire the listener's event router to call the right script for each event type: `repository.create` → `post-repo-audit.sh` + `audit-repo-policies.sh --fix`; `push` to main → `secret-scan.sh` + policy re-check; and so on for the remaining event types.
+- **B2.5** Structured JSON logging to `/host/root/.caret/log/repo-enforcer.log` with line-count rotation (same pattern as tg-stream).
+- **B2.6** Unit tests: mock webhook payloads via curl against a locally-running listener. Cover verification success/failure, event routing, idempotency.
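
For B2.2, a minimal sketch of the verification step. It assumes the openclaw pattern is the stock Gitea scheme (a hex HMAC-SHA256 of the raw request body, delivered in the `X-Gitea-Signature` header), which Phase 0 still has to confirm. `signBody`, `verifySignature`, and `statusFor` are illustrative names, not ported code.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";
import { Buffer } from "node:buffer";

// Hex HMAC-SHA256 of the raw body (assumed to match what Gitea puts in X-Gitea-Signature).
function signBody(secret: string, rawBody: string): string {
  return createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
}

// Returns true only when the received signature matches the expected one.
// The comparison is constant-time, so a mismatch does not leak how many bytes matched.
function verifySignature(secret: string, rawBody: string, signatureHex: string | null): boolean {
  // Reject missing or malformed signatures before comparing.
  if (!signatureHex || !/^[0-9a-f]{64}$/i.test(signatureHex)) return false;
  const expected = Buffer.from(signBody(secret, rawBody), "hex");
  const received = Buffer.from(signatureHex, "hex");
  return expected.length === received.length && timingSafeEqual(expected, received);
}

// 403 on failure, matching the parity assertion quoted in A1.4.
function statusFor(secret: string, rawBody: string, signatureHex: string | null): number {
  return verifySignature(secret, rawBody, signatureHex) ? 200 : 403;
}
```

The signature has to be computed over the exact bytes received, so the listener should read the raw body before any JSON parsing.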
+
+Exit criteria: a canary test — create a throwaway `sol/caret-test-<ts>` repo against the live system, watch policies apply within 10 seconds, verify the log captures the event and the commit hash.
+
+### Phase 3 — Judgment path build
+
+Build the wake-me-up side. This is the piece openclaw doesn't currently do — it exclusively uses pure scripts.
+
+- **J3.1** Evaluate the Channels plugin mechanism (`claude code channels`) as the native primitive for "external event → Claude session". Build a minimal plugin at `/host/root/.caret/channels/gitea-judgment/` that receives an HTTP POST and starts a session with the payload as the initial prompt.
+- **J3.2** Define the trigger conditions — when does the deterministic path hand off to judgment? e.g. "policy enforcer errored", "repo-type detection was ambiguous", "a flag in the issue body requests AI review".
+- **J3.3** Ensure cost hygiene: judgment is always opt-in or error-triggered, never fired on every event. Document the budget in `DESIGN.md`.
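
The three example triggers in J3.2 can be sketched as a pure function, so the handoff decision itself stays on the zero-token side. Everything here is hypothetical: the `EnforcerOutcome` shape, the reason strings, and the `[ai-review]` marker are placeholders until `DESIGN.md` fixes the real ones.

```typescript
// Hypothetical summary of what the deterministic path observed for one event.
type EnforcerOutcome = {
  enforcerExitCode: number;     // non-zero means the policy enforcer errored
  repoTypeCandidates: string[]; // detected repo types; anything other than exactly one is ambiguous
  issueBody?: string;           // present only for issue events
};

// Placeholder marker a human would put in an issue body to request AI review.
const AI_REVIEW_FLAG = "[ai-review]";

// Returns every reason to hand off; an empty list means no session is started.
function judgmentReasons(o: EnforcerOutcome): string[] {
  const reasons: string[] = [];
  if (o.enforcerExitCode !== 0) reasons.push("policy-enforcer-errored");
  if (o.repoTypeCandidates.length !== 1) reasons.push("repo-type-ambiguous");
  if (o.issueBody !== undefined && o.issueBody.indexOf(AI_REVIEW_FLAG) !== -1) {
    reasons.push("ai-review-requested");
  }
  return reasons;
}

// Judgment is error-triggered or opt-in only, per J3.3.
function shouldWakeJudgment(o: EnforcerOutcome): boolean {
  return judgmentReasons(o).length > 0;
}
```

Returning the reasons rather than a bare boolean lets the listener log why a session was started, which helps when auditing the budget from J3.3.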
+
+Exit criteria: a second canary — a repo with a deliberately weird structure fires the deterministic path, that path detects it can't auto-fix, and a Claude session wakes up to handle it, reporting back via `tg-stream`.
+
+### Phase 4 — Parallel run with Xen
+
+Openclaw stays on. My replacement runs alongside it, both hitting the same Gitea events.
+
+- **P4.1** Register my webhook endpoint on a small set of test repos first (not all of sol/*). Verify both pipelines fire and neither breaks.
+- **P4.2** After 24 hours of clean dual-run on the test repos, widen to all sol/* repos.
+- **P4.3** Monitor the log for any divergence — cases where my pipeline disagreed with openclaw's. Investigate and reconcile every one.
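
One way to make "divergence" in P4.3 checkable is to compare, per delivery, the set of actions each pipeline took. This is a sketch under assumptions: that both pipelines can be keyed by a shared delivery identifier (Gitea's `X-Gitea-Delivery` header is the likely candidate) and that each can emit a list of action names. The `Outcome` shape and `findDivergences` are hypothetical.

```typescript
// Hypothetical per-event record exported by each pipeline during the dual run.
type Outcome = { deliveryId: string; actions: string[] };

// Returns delivery IDs where the pipelines disagree, including events
// that only one of the two pipelines recorded.
function findDivergences(caret: Outcome[], openclaw: Outcome[]): string[] {
  // Order-insensitive fingerprint of the actions taken for one event.
  const fingerprint = (o: Outcome): string => o.actions.slice().sort().join(",");

  const theirs: Record<string, string> = {};
  for (const o of openclaw) theirs[o.deliveryId] = fingerprint(o);

  const diverged: string[] = [];
  const seen: Record<string, boolean> = {};
  for (const o of caret) {
    seen[o.deliveryId] = true;
    // Also true when openclaw has no record of this delivery at all.
    if (theirs[o.deliveryId] !== fingerprint(o)) diverged.push(o.deliveryId);
  }
  // Events openclaw handled that never reached the Caret pipeline.
  for (const id of Object.keys(theirs)) {
    if (!seen[id]) diverged.push(id);
  }
  return diverged;
}
```

An empty result over the whole dual-run window would be the mechanical form of "zero unreconciled divergences".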
+
+Exit criteria: 72 hours of dual-run with zero unreconciled divergences.
+
+### Phase 5 — Cut-over
+
+Only run this with Rooh's explicit go-ahead.
+
+- **C5.1** Disable openclaw's gitea-transform, either by stopping the `openclaw-openclaw-gateway-1` container or by removing its hooks from its settings (reversible).
+- **C5.2** Watch my pipeline handle all incoming Gitea events solo for 24 hours.
+- **C5.3** If anything breaks: immediate rollback by restarting openclaw. Rollback must be one command, tested before cut-over.
+- **C5.4** After 7 days of clean solo run, mark the migration complete. Delete the staging files, archive the research repos, move this project from "in progress" to "done" in the repo description.
+
+Exit criteria: 7 days of Caret-only operation with no regressions.
+
+## Dependencies and coordination
+
+- **Gitea token scope:** the sol token I have doesn't have admin scope (`read:admin` to list, `write:admin` to create), so I can't list or create system-level webhooks. I'll need Rooh to either elevate the token or register the system webhook manually one time.
+- **Xen's review:** even though this migration replaces Xen's territory, I should loop Xen in at Phase 1 (design review) and Phase 4 (parallel run starts). Not approval — just visibility so Xen doesn't delete something I'm depending on.
+- **Openclaw upgrades:** if openclaw ships an upgrade while we're mid-migration, it could overwrite files I'm reading. I should work against a snapshot commit of the research repos.
+
+## Risks
+
+1. **HMAC secret management** — if I don't get the secret storage and rotation story right, a leaked secret compromises every webhook. Must be documented and rotatable.
+2. **Idempotency** — if my pipeline runs twice on the same event (retry, dual-register, replay), it must not double-apply fixes or double-add collaborators. Every operation must be idempotent.
+3. **Silent drops** — openclaw currently has gaps (e.g. the bug I found earlier where concurrent pollers race). I must not introduce new silent drops. Every incoming event must produce a log line, even if the line says "ignored because X".
+4. **Rollback speed** — if the cut-over goes wrong, Rooh must see working policy enforcement again within minutes. The rollback procedure is a first-class deliverable, not an afterthought.
+5. **Token scope escalation** — I cannot register system-wide webhooks without admin scope. Either we elevate the token or we accept manual one-time setup.
+6. **Coupling I don't yet see** — the research phase will probably surface dependencies I haven't imagined. The plan should update as findings land.
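
Risks 2 and 3 can be addressed by a single gate in front of the scripts: deduplicate by delivery and always emit a log line. The sketch below is illustrative only. `EventGate` is a hypothetical name, the seen-set lives in memory (a real listener would need to persist it across restarts), and it assumes each delivery carries a stable identifier.

```typescript
type LogLine = {
  deliveryId: string;
  event: string;
  result: "applied" | "ignored" | "failed";
  reason?: string;
};

class EventGate {
  private seen: Record<string, boolean> = {};
  readonly log: LogLine[] = [];

  // Runs `apply` at most once per successfully handled delivery and
  // records exactly one log line per call, whatever the outcome.
  handle(deliveryId: string, event: string, apply: () => void): LogLine {
    let line: LogLine;
    if (this.seen[deliveryId]) {
      line = { deliveryId, event, result: "ignored", reason: "duplicate delivery" };
    } else {
      try {
        apply();
        // Mark as seen only after success, so a failed delivery can be retried.
        this.seen[deliveryId] = true;
        line = { deliveryId, event, result: "applied" };
      } catch (err) {
        line = { deliveryId, event, result: "failed", reason: String(err) };
      }
    }
    this.log.push(line);
    return line;
  }
}
```

The gate only stops replays of the same delivery. The scripts behind it still have to be idempotent themselves, since a retry after a partial failure runs `apply` again.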
+
+## Tracking
+
+Each phase task becomes an issue in this repo, and each issue is closed when its exit criterion is met. `ARCHITECTURE.md`, `DESIGN.md`, and this `PLAN.md` are the three canonical documents the work references.
+
+## Changelog
+
+- **2026-04-06** Plan drafted. Three research subagents spawned. Repo created, README + plan committed.
--- a/README.md
+++ b/README.md
@@ -1,3 +1,7 @@
-# openclaw-to-caret-migration
+# openclaw → Caret migration

-Migration project: openclaw agent infrastructure (Xen-owned) to a Claude-native stack owned by Caret. Tracks architecture decisions, phase plans, and implementation progress via issues.
+Taking over the openclaw Gitea integration (currently owned by Xen) and rebuilding it as a Caret-owned stack that can fully replace openclaw's pipeline when Xen is disabled.
+
+See `PLAN.md` for the phased plan. Progress is tracked via issues in this repo.
+
+Status: **Phase 0 — research in progress** (2026-04-06)