plan: initial migration plan and README

Caret
2026-04-06 12:37:12 +00:00
parent 3e69fb7beb
commit 40c0ca3300
2 changed files with 115 additions and 2 deletions

PLAN.md Normal file

@@ -0,0 +1,109 @@
# openclaw → Caret migration plan
**Status:** in progress, 2026-04-06
**Owner:** Caret
**Approver:** Rooh
**Tracking:** issues in this repo
## Goal
Take over the agent infrastructure that openclaw currently runs through Xen (webhooks, policy enforcement, heartbeat checks, session spawning, scheduled work) and stand up a Caret-owned replacement that reproduces 100% of those features correctly, so openclaw/Xen can be disabled later with zero regression.
The migration has to preserve two categories of behavior cleanly:
1. **Deterministic work** — pure-script operations that don't need an LLM. These must stay cheap (zero token cost), fast (sub-second), and reliable. Examples: adding collaborators on repo creation, ensuring the HMAC webhook exists on new repos, running the policy template baseline, HMAC verification.
2. **Judgment work** — operations that benefit from opus-level reasoning. These should *wake me up* via a native Claude Code primitive (channels plugin or similar), not a permanent process. Examples: drafting a README from commit history, deciding which template fits an unusual repo, reviewing PR policy violations conversationally, explaining anomalies in plain language.
The split matters because openclaw's current gitea-webhooks pipeline is explicitly tagged `Zero tokens — pure script enforcement` in its `post-repo-audit.sh` header comment. Keeping the same split avoids ballooning token spend.
## Phases
### Phase 0 — Research (in progress)
Read the three reference repos and the live system state to understand what I'm replacing. Three parallel Explore subagents are doing this now.
- **R0.1** Read `sol/gitea-webhooks` deeply — data flow, HMAC, transform logic, tool fan-out, repo-type detection, openclaw couplings. (subagent afa92905872a43a9b)
- **R0.2** Read `sol/workspace-ops` and `sol/agent-reliability` — scope, entry points, openclaw couplings, overlaps. (same subagent)
- **R0.3** Map the openclaw gateway internals — session spawn API, cron/heartbeat mechanism, tool allowlist enforcement, plugin wiring, replacement difficulty matrix. (subagent ae5ca38f70b1e9626)
- **R0.4** Audit the currently-running state — every registered Gitea webhook, live cron/timer infra, HEARTBEAT.md checklists, active agent processes, shared state stores. (subagent abf0cb0928d823a0b)
- **R0.5** Synthesize the three reports into `ARCHITECTURE.md` in this repo — a single readable picture of what openclaw does today and what I'll need to rebuild.
Exit criteria: I can answer any architectural question about the current openclaw gitea pipeline without grepping the source again.
### Phase 1 — Architecture design
Before writing any code, lock down the target design with Rooh's sign-off.
- **A1.1** Write `DESIGN.md` in this repo describing the target Caret-owned stack: components, data flows, ownership boundaries, storage, observability.
- **A1.2** List every openclaw dependency the new stack removes, and every openclaw feature it depends on staying around (if any).
- **A1.3** DMG-style brief: surface every design choice as a multiple-choice question with a recommendation and reasoning, let Rooh approve or override. No code written until this is signed off.
- **A1.4** Define "100% feature parity" concretely — a test list that must pass after migration. Every test is a one-line assertion about behavior ("new sol/* repo gets Makefile within 10s", "HMAC-signed POST with wrong secret is rejected with 403", etc.).
Exit criteria: `DESIGN.md` in the repo, signed off by Rooh, with a test list that defines done.
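One possible shape for the A1.4 test list, sketched under assumptions: each entry pairs the one-line assertion with a probe function, and "done" means every probe passes. The names `ParityCheck`, `runParityChecks`, and `migrationIsDone` are hypothetical, and the probes are synchronous here only to keep the sketch small; real probes against the live system would likely be async.

```typescript
// Hypothetical structure for the parity list; none of these names exist yet.
interface ParityCheck {
  id: string;
  assertion: string;      // the one-line behavioral claim, e.g. from A1.4
  check: () => boolean;   // probe that returns true when the behavior holds
}

interface ParityReport {
  passed: string[];
  failed: string[];
}

// Runs every probe and sorts the check IDs into passed/failed.
function runParityChecks(checks: ParityCheck[]): ParityReport {
  const report: ParityReport = { passed: [], failed: [] };
  for (const c of checks) {
    let ok = false;
    try {
      ok = c.check();
    } catch {
      ok = false; // a probe that throws counts as a failure
    }
    (ok ? report.passed : report.failed).push(c.id);
  }
  return report;
}

// "Done" requires at least one check and no failures.
function migrationIsDone(report: ParityReport): boolean {
  return report.passed.length > 0 && report.failed.length === 0;
}
```

Keeping the assertion text next to the probe means the list in `DESIGN.md` and the executable list can be compared line by line.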
### Phase 2 — Deterministic path build
Build the pure-script side first. No LLM in the loop.
- **B2.1** Scaffold the webhook listener (bun HTTP server) in its own docker container under `/host/root/.caret/`. Minimal footprint, one file where possible.
- **B2.2** Implement HMAC verification exactly matching the openclaw pattern so existing Gitea webhooks work without re-registration.
- **B2.3** Port the core scripts from `sol/gitea-webhooks/tools/` into `/host/root/.caret/tools/` with the openclaw-specific bits (hardcoded workspace paths, Mattermost channel IDs, etc.) stripped out.
- **B2.4** Wire the listener's event router to call the right script for each event type: `repository.create` → `post-repo-audit.sh` + `audit-repo-policies.sh --fix`; `push` to main → `secret-scan.sh` + policy re-check; and so on.
- **B2.5** Structured JSON logging to `/host/root/.caret/log/repo-enforcer.log` with line-count rotation (same pattern as tg-stream).
- **B2.6** Unit tests: mock webhook payloads via curl against a locally running listener. Cover verification success/failure, event routing, idempotency.
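A minimal sketch of the B2.2 verification step. It assumes the signature is the hex-encoded HMAC-SHA256 of the raw request body, which is what Gitea sends in its `X-Gitea-Signature` header; whether openclaw checks it in exactly this way is not confirmed until R0.1 reports back. The function names are hypothetical.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumption: signature = lowercase hex HMAC-SHA256 over the raw body bytes.
function signBody(secret: string, rawBody: string): string {
  return createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
}

// Returns true only when the header value matches the expected signature.
function verifySignature(
  secret: string,
  rawBody: string,
  headerSig: string | null,
): boolean {
  if (!headerSig) return false;
  const expected = Buffer.from(signBody(secret, rawBody), "utf8");
  const received = Buffer.from(headerSig.trim().toLowerCase(), "utf8");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (expected.length !== received.length) return false;
  return timingSafeEqual(expected, received);
}

// Maps the result to an HTTP status, following the A1.4 example (403 on mismatch).
function statusFor(
  secret: string,
  rawBody: string,
  headerSig: string | null,
): number {
  return verifySignature(secret, rawBody, headerSig) ? 200 : 403;
}
```

In the listener, the body would need to be read as raw text before any JSON parsing, since re-serializing the parsed payload can change the bytes and break the comparison.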
Exit criteria: a canary test — create a throwaway `sol/caret-test-<ts>` repo against the live system, watch policies apply within 10 seconds, verify the log captures the event and the commit hash.
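The event routing in B2.4 could be expressed as a lookup table, sketched here. The script names come from this plan; the route keys, the `IncomingEvent` shape, and the choice of `audit-repo-policies.sh` without `--fix` for the push-time policy re-check are assumptions for illustration.

```typescript
// Hypothetical route keys; the real event names depend on what Gitea delivers.
type RouteKey = "repository.create" | "push:main";

// Each route lists the commands to run, one argv array per script.
const ROUTES: Record<RouteKey, string[][]> = {
  "repository.create": [["post-repo-audit.sh"], ["audit-repo-policies.sh", "--fix"]],
  // Assumption: the "policy re-check" reuses the audit script in read-only mode.
  "push:main": [["secret-scan.sh"], ["audit-repo-policies.sh"]],
};

interface IncomingEvent {
  event: string;   // e.g. "repository.create" or "push"
  ref?: string;    // git ref for push events
}

interface Plan {
  commands: string[][];
  reason: string;  // always set, so every event can produce a log line
}

function routeKey(e: IncomingEvent): RouteKey | null {
  if (e.event === "repository.create") return "repository.create";
  if (e.event === "push" && e.ref === "refs/heads/main") return "push:main";
  return null;
}

// Resolves an event to the commands to run, or to an explicit "ignored" reason.
function planFor(e: IncomingEvent): Plan {
  const key = routeKey(e);
  if (key === null) {
    return { commands: [], reason: `ignored because no route matches "${e.event}"` };
  }
  return { commands: ROUTES[key], reason: `routed as ${key}` };
}
```

Returning a reason even for unrouted events keeps the router compatible with the rule that no incoming event is dropped silently.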
### Phase 3 — Judgment path build
Build the wake-me-up side. This is the piece openclaw doesn't currently do — it exclusively uses pure scripts.
- **J3.1** Evaluate the Channels plugin mechanism (`claude code channels`) as the native primitive for "external event → Claude session". Build a minimal plugin at `/host/root/.caret/channels/gitea-judgment/` that receives an HTTP POST and starts a session with the payload as the initial prompt.
- **J3.2** Define the trigger conditions — when does the deterministic path hand off to judgment? e.g. "policy enforcer errored", "repo-type detection was ambiguous", "a flag in the issue body requests AI review".
- **J3.3** Ensure cost hygiene: judgment is always opt-in or error-triggered, never fired on every event. Document the budget in `DESIGN.md`.
Exit criteria: a second canary — a repo with a deliberately weird structure fires the deterministic path, that path detects it can't auto-fix, and a Claude session wakes up to handle it, reporting back via `tg-stream`.
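The hand-off rule from J3.2 and the cost guard from J3.3 might look like the sketch below. The three trigger conditions are the examples listed above; the `DeterministicOutcome` fields and the daily cap are hypothetical, since the actual budget is still to be documented in `DESIGN.md`.

```typescript
// Hypothetical summary of what the deterministic path observed for one event.
interface DeterministicOutcome {
  enforcerExitCode: number;      // non-zero means the policy enforcer errored
  repoTypeCandidates: string[];  // repo types the detector considered plausible
  aiReviewRequested: boolean;    // a flag in the issue body asked for AI review
}

// Lists every reason the event qualifies for judgment; empty means none.
function judgmentTriggers(o: DeterministicOutcome): string[] {
  const reasons: string[] = [];
  if (o.enforcerExitCode !== 0) reasons.push("policy enforcer errored");
  if (o.repoTypeCandidates.length !== 1) reasons.push("repo-type detection was ambiguous");
  if (o.aiReviewRequested) reasons.push("AI review requested");
  return reasons;
}

// Wake a session only when a trigger fired and the (assumed) daily cap allows it.
function shouldWake(
  o: DeterministicOutcome,
  wakesToday: number,
  dailyCap: number,
): boolean {
  return judgmentTriggers(o).length > 0 && wakesToday < dailyCap;
}
```

Because the default outcome produces no triggers, judgment stays opt-in or error-triggered rather than firing on every event.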
### Phase 4 — Parallel run with Xen
Openclaw stays on. My replacement runs alongside it, both hitting the same Gitea events.
- **P4.1** Register my webhook endpoint on a small set of test repos first (not all of sol/*). Verify both pipelines fire and neither breaks.
- **P4.2** After 24 hours of clean dual-run on the test repos, widen to all sol/* repos.
- **P4.3** Monitor the log for any divergence — cases where my pipeline disagreed with openclaw's. Investigate and reconcile every one.
Exit criteria: 72 hours of dual-run with zero unreconciled divergences.
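Divergence monitoring in P4.3 could be reduced to a comparison of per-event outcomes from both pipelines, as in this sketch. It assumes both sides can be summarized as a list of actions keyed by a shared event ID; how openclaw's actions would be extracted is an open question for the research phase, and all names here are hypothetical.

```typescript
// Hypothetical per-event record: which actions a pipeline took for an event.
interface PipelineOutcome {
  eventId: string;
  actions: string[];
}

interface Divergence {
  eventId: string;
  kind: "mismatch" | "missing-in-openclaw" | "missing-in-caret";
}

// Order of actions is ignored; only the set of actions per event is compared.
function normalize(actions: string[]): string {
  return JSON.stringify(actions.slice().sort());
}

function findDivergences(
  caret: PipelineOutcome[],
  openclaw: PipelineOutcome[],
): Divergence[] {
  const theirs = new Map<string, string>();
  for (const o of openclaw) theirs.set(o.eventId, normalize(o.actions));

  const seen = new Set<string>();
  const out: Divergence[] = [];
  for (const c of caret) {
    seen.add(c.eventId);
    const other = theirs.get(c.eventId);
    if (other === undefined) {
      out.push({ eventId: c.eventId, kind: "missing-in-openclaw" });
    } else if (other !== normalize(c.actions)) {
      out.push({ eventId: c.eventId, kind: "mismatch" });
    }
  }
  // Events openclaw handled that the Caret pipeline never saw.
  for (const o of openclaw) {
    if (!seen.has(o.eventId)) out.push({ eventId: o.eventId, kind: "missing-in-caret" });
  }
  return out;
}
```

An empty result over the 72-hour window, or one where every entry has been investigated and reconciled, would correspond to the exit criterion.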
### Phase 5 — Cut-over
Only run this with Rooh's explicit go-ahead.
- **C5.1** Disable openclaw's gitea-transform by stopping the `openclaw-openclaw-gateway-1` container OR by removing its hooks from its settings (reversible).
- **C5.2** Watch my pipeline handle all incoming Gitea events solo for 24 hours.
- **C5.3** If anything breaks: immediate rollback by restarting openclaw. Rollback must be one command, tested before cut-over.
- **C5.4** After 7 days of clean solo run, mark the migration complete. Delete the staging files, archive the research repos, move this project from "in progress" to "done" in the repo description.
Exit criteria: 7 days of Caret-only operation with no regressions.
## Dependencies and coordination
- **Gitea token scope:** the sol token I have doesn't have `read:admin` scope, so I can't list or create system-level webhooks. I'll need Rooh to either elevate the token or register the system webhook manually one time.
- **Xen's review:** even though this migration replaces Xen's territory, I should loop Xen in at Phase 1 (design review) and Phase 4 (parallel run starts). Not approval — just visibility so Xen doesn't delete something I'm depending on.
- **Openclaw upgrades:** if openclaw ships an upgrade while we're mid-migration, it could overwrite files I'm reading. I should work against a snapshot commit of the research repos.
## Risks
1. **HMAC secret management** — if I don't get the secret storage and rotation story right, a leaked secret compromises every webhook. Must be documented and rotatable.
2. **Idempotency** — if my pipeline runs twice on the same event (retry, dual-register, replay), it must not double-apply fixes or double-add collaborators. Every operation must be idempotent.
3. **Silent drops** — openclaw currently has gaps (e.g. the bug I found earlier where concurrent pollers race). I must not introduce new silent drops. Every incoming event must produce a log line, even if the line says "ignored because X".
4. **Rollback speed** — if the cut-over goes wrong, Rooh must see working policy enforcement again within minutes. The rollback procedure is a first-class deliverable, not an afterthought.
5. **Token scope escalation** — I cannot register system-wide webhooks without admin scope. Either we elevate the token or we accept manual one-time setup.
6. **Coupling I don't yet see** — the research phase will probably surface dependencies I haven't imagined. The plan should update as findings land.
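Risks 2 and 3 can be addressed together by a small admission ledger: every delivery produces exactly one log line, and a repeated delivery is logged as ignored rather than processed again. This is a sketch under assumptions: it keys on a per-delivery ID (Gitea sends one in the `X-Gitea-Delivery` header), keeps state in memory only, and uses a hypothetical `EventLedger` name.

```typescript
interface LedgerLine {
  ts: string;
  deliveryId: string;
  decision: "handled" | "ignored";
  reason: string;
}

// In-memory sketch; a real version would persist seen IDs across restarts.
class EventLedger {
  private seen = new Set<string>();
  readonly lines: LedgerLine[] = [];

  // Returns true when the caller should process the event.
  // Always appends one log line, including for duplicates.
  admit(deliveryId: string, now: Date = new Date()): boolean {
    const duplicate = this.seen.has(deliveryId);
    if (!duplicate) this.seen.add(deliveryId);
    this.lines.push({
      ts: now.toISOString(),
      deliveryId,
      decision: duplicate ? "ignored" : "handled",
      reason: duplicate ? "ignored because delivery was already processed" : "first delivery",
    });
    return !duplicate;
  }
}
```

Deduplicating by delivery ID only covers retries and replays of the same delivery. A dual-registered webhook produces two distinct deliveries for one underlying change, so the scripts themselves still have to be idempotent.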
## Tracking
Issues in this repo track each phase task. Each task becomes an issue, each issue gets closed when its exit criterion is met. `ARCHITECTURE.md`, `DESIGN.md`, and this `PLAN.md` are the three canonical documents the work references.
## Changelog
- **2026-04-06** Plan drafted. Three research subagents spawned. Repo created, README + plan committed.


@@ -1,3 +1,7 @@
-# openclaw-to-caret-migration
+# openclaw → Caret migration
-Migration project: openclaw agent infrastructure (Xen-owned) to a Claude-native stack owned by Caret. Tracks architecture decisions, phase plans, and implementation progress via issues.
+Taking over the openclaw gitea integration (currently owned by Xen) and rebuilding it as a Caret-owned stack that can fully replace openclaw's pipeline when Xen is disabled.
+See `PLAN.md` for the phased plan. Progress is tracked via issues in this repo.
+Status: **Phase 0 — research in progress** (2026-04-06)