plan: production-grade PLAN.md v2 (revised architecture + audit + simulation)

2026-03-07 16:09:36 +00:00
parent 6ef50269b5
commit b3ec2c61db
2 changed files with 294 additions and 174 deletions
--- a/PLAN.md
+++ b/PLAN.md
@@ -1,105 +1,136 @@
-# Implementation Plan: Live Status v4
-> Generated: 2026-03-07 | Agent: planner:proj035 | Status: DRAFT
+# Implementation Plan: Live Status v4 (Production-Grade)
+> Generated: 2026-03-07 | Agent: planner:proj035-v2 | Status: DRAFT
+> Revised: Incorporates production-grade changes from scalability/efficiency review (comment #11402)

 ## 1. Goal

-Replace the broken agent-cooperative live-status system with a transparent infrastructure-level daemon that tails OpenClaw's JSONL transcript files in real-time and updates a Mattermost status box automatically — **zero agent cooperation required**. Sub-agents become visible. Spam is eliminated. Sessions never lose state. Works from gateway startup without any AGENTS.md instruction injection.
+Replace the broken agent-cooperative live-status system (v1) with a transparent infrastructure-level daemon that tails OpenClaw JSONL transcript files in real-time and auto-updates Mattermost status boxes — zero agent cooperation required. Sub-agents become visible. Final-response spam is eliminated. Sessions never lose state. A single multiplexed daemon handles all concurrent sessions efficiently.

 ## 2. Architecture

 ```
 OpenClaw Gateway
-├── Agent Sessions (main, coder-agent, sub-agents, hooks...)
-│   └── writes {uuid}.jsonl as it works
-│
-└── status-watcher daemon (per active session)
-        ├── Polls/watches {uuid}.jsonl (new line = new event)
-        ├── Parses tool calls, results, assistant text
-        ├── Maps tool names → human-readable labels
-        ├── Debounces Mattermost updates (500ms)
-        ├── Auto-creates status box in correct channel/thread
-        ├── Detects sub-agent spawns → nests sub-agent status
-        └── Auto-completes when agent stops writing (idle timeout)
+  Agent Sessions (main, coder-agent, sub-agents, hooks...)
+    -> writes {uuid}.jsonl as they work
+
+  status-watcher daemon (SINGLE PROCESS — not per-session)
+    -> fs.watch recursive on transcript directory (inotify, Node 22)
+    -> Multiplexes all active session transcripts
+    -> SessionState map: sessionKey -> { postId, lastOffset, pendingToolCalls, lines[] }
+    -> Shared HTTP connection pool (keep-alive, maxSockets=4)
+    -> Throttled Mattermost updates (leading edge + trailing flush, 500ms)
+    -> Bounded concurrency: max N active status boxes (configurable, default 20)
+    -> Structured JSON logging (pino)
+    -> Graceful shutdown (SIGTERM/SIGINT -> mark all boxes "interrupted")
+    -> Circuit breaker for Mattermost API failures
+
+  Sub-agent transcripts
+    -> Session key pattern: agent:{id}:subagent:{uuid}
+    -> Detected automatically by directory watcher
+    -> spawnedBy field in sessions.json links child to parent
+    -> Nested under parent status box automatically

  sessions.json (runtime registry)
-├── session key → {sessionId, sessionFile, spawnedBy, spawnDepth, channel, ...}
-└── used to: resolve JSONL file path, determine channel, link parent→child
-
-OpenClaw Hook (gateway:startup + command:new)
-└── Spawns status-watcher for the right session
+    -> Maps session keys -> { sessionId, spawnedBy, spawnDepth, label, channel }
+    -> Polled every 2s for new sessions (supplement to directory watch)

  Mattermost API (slack.solio.tech)
-├── POST /api/v4/posts  → create status box
-├── PUT /api/v4/posts/{id} → update in-place (no edit time limit)
-└── Multiple bot tokens per agent
+    -> POST /api/v4/posts  -- create status box
+    -> PUT  /api/v4/posts/{id} -- update in-place (no edit time limit; confirmed)
+    -> Shared http.Agent with keepAlive: true, maxSockets: 4
+    -> Circuit breaker: open after 5 failures, 30s cooldown, half-open probe
 ```
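
The circuit breaker entry in the diagram (open after 5 failures, 30s cooldown, half-open probe) can be read as a three-state machine. The following is a minimal sketch under stated assumptions, not the planned `circuit-breaker.js`: the class name, the method names, and the injectable `now` clock are all hypothetical and exist only for illustration.

```javascript
// Sketch only: names are hypothetical, not the planned circuit-breaker.js API.
class CircuitBreaker {
  constructor({ threshold = 5, cooldownMs = 30_000, now = Date.now } = {}) {
    this.threshold = threshold;   // consecutive failures before opening
    this.cooldownMs = cooldownMs; // time spent open before a probe is allowed
    this.now = now;               // injectable clock so tests need no real waiting
    this.state = 'closed';
    this.failures = 0;
    this.openedAt = 0;
    this.probeInFlight = false;
  }

  // Returns true if a request may be attempted right now.
  canRequest() {
    if (this.state === 'open' && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = 'half-open';
      this.probeInFlight = false;
    }
    if (this.state === 'closed') return true;
    if (this.state === 'half-open' && !this.probeInFlight) {
      this.probeInFlight = true;  // allow exactly one probe request
      return true;
    }
    return false;
  }

  // A success (including a successful probe) closes the circuit.
  recordSuccess() {
    this.state = 'closed';
    this.failures = 0;
    this.probeInFlight = false;
  }

  // A failed probe, or reaching the threshold, (re)opens the circuit.
  recordFailure() {
    this.failures += 1;
    if (this.state === 'half-open' || this.failures >= this.threshold) {
      this.state = 'open';
      this.openedAt = this.now();
      this.probeInFlight = false;
    }
  }
}
```

A caller would check `canRequest()` before each Mattermost API call and report the outcome with `recordSuccess()` or `recordFailure()`. Keeping the breaker synchronous and separate from the HTTP code makes it easy to unit test with a fake clock.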

 ### Key Design Decisions (from discovery)

-1. **Watch sessions.json, not just transcript files.** sessions.json is the authoritative registry that maps session keys (including sub-agents) to JSONL files. Monitor it to detect new sessions.
+1. **Single multiplexed daemon vs per-session daemons.** Eliminates unbounded process spawning. One V8 heap, one connection pool, one point of control. Scales to 30+ concurrent sessions without linear process overhead.

-2. **No new hook events needed.** We cannot use `session:start`/`session:end` hooks (they don't exist). Instead: use `gateway:startup` to begin watching all active sessions, and poll sessions.json for new sessions.
+2. **fs.watch recursive on transcript directory.** Node 22 supports `fs.watch(dir, { recursive: true })` on Linux, backed by inotify. One watcher call covers all sessions. No polling fallback needed for the watch itself.

-3. **Sub-agent detection via `spawnedBy` field.** When sessions.json gets a new entry with `spawnedBy`, we know it's a sub-agent of the given parent session. Nest its status under the parent status box.
+3. **Poll sessions.json every 2s.** fs.watch on JSON files is unreliable on Linux (writes may not trigger events). Poll to detect new sessions reliably.

-4. **JSONL format is stable.** Version 3 format confirmed. Key events:
-   - `message` with role=`assistant` + content `toolCall` → tool being called
-   - `message` with role=`toolResult` → tool completed
-   - `message` with role=`assistant` + content `text` → agent thinking/responding
-   - `custom` with `customType: openclaw.cache-ttl` → turn boundary (good idle signal)
+4. **Smart idle detection via pendingToolCalls.** Do not use a naive 30s timeout. Track toolCall / toolResult pairs. A session is idle only when pendingToolCalls==0 AND no new lines have arrived for IDLE_TIMEOUT_S seconds (default 60s).

-5. **Mattermost post edit is unlimited.** `PostEditTimeLimit = -1`. We can update the status post indefinitely. No workaround needed.
+5. **Leading edge + trailing flush throttle.** First event fires immediately (responsiveness). Subsequent events batched. Guaranteed final flush when activity stops (no lost updates).

-6. **Keep live-status.js as thin orchestration layer.** agents can still call it manually for special cases, but it's no longer the primary mechanism.
+6. **Mattermost post edit is unlimited.** PostEditTimeLimit=-1 confirmed on this server. No workarounds needed.
+
+7. **All config via environment variables.** No hardcoded tokens, no sed replacement during install. Clean, 12-factor-style config.
+
+8. **pino for structured logging.** Fast, JSON output, leveled. Production-debuggable.
+
+9. **Circuit breaker for Mattermost API.** Prevents cascading failures during Mattermost outages. Bounded retry queue (max 100 entries).
+
+10. **JSONL format is stable.** Version 3 confirmed. Parser abstracts format behind interface for future-proofing.
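
Design decision 5 (leading edge + trailing flush) can be sketched as follows. This is an illustrative sketch only: `createThrottle`, `update`, and `flush` are hypothetical names, and the planned `status-box.js` may structure this differently.

```javascript
// Sketch only: createThrottle/update/flush are hypothetical names.
function createThrottle(send, intervalMs) {
  let timer = null;       // non-null while a throttle window is open
  let pending = null;     // latest coalesced payload waiting for the trailing flush
  let hasPending = false;

  // Runs when a window closes: send the latest payload, if any, and open a new window.
  function onWindowEnd() {
    timer = null;
    if (!hasPending) return;          // nothing arrived: go back to idle
    const payload = pending;
    pending = null;
    hasPending = false;
    send(payload);
    timer = setTimeout(onWindowEnd, intervalMs);
  }

  function update(payload) {
    if (timer === null) {
      send(payload);                  // leading edge: fire immediately
      timer = setTimeout(onWindowEnd, intervalMs);
    } else {
      pending = payload;              // coalesce: only the newest payload survives
      hasPending = true;
    }
  }

  // Force out any pending payload now (for example during graceful shutdown).
  function flush() {
    if (timer !== null) {
      clearTimeout(timer);
      timer = null;
    }
    if (hasPending) {
      const payload = pending;
      pending = null;
      hasPending = false;
      send(payload);
    }
  }

  return { update, flush };
}
```

With `intervalMs` set to THROTTLE_MS, the first event of a burst reaches Mattermost immediately, intermediate events are dropped in favor of the newest text, and the final state is always delivered either by the timer or by an explicit `flush()`.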

 ## 3. Tech Stack

 | Layer | Technology | Version | Reason |
 |-------|-----------|---------|--------|
-| Watcher daemon | Node.js | 22.x (existing) | Already installed, fs.watch/setInterval available |
-| File watching | fs.watch + fallback polling | built-in | fs.watch is iffy on Linux; polling fallback needed |
-| Mattermost API | https (built-in) | - | Already used in live-status.js |
-| Session registry | JSON file watch | - | sessions.json updated on every message |
-| IPC (parent↔watcher) | PID file + signals | - | Simple, no deps |
-| Hook integration | OpenClaw hooks system | existing | gateway:startup hook for auto-start |
+| Runtime | Node.js | 22.x (system) | Already installed; inotify recursive fs.watch supported |
+| File watching | fs.watch recursive | built-in | inotify on Linux/Node22; efficient, no polling |
+| Session discovery | setInterval poll | built-in | sessions.json polling for new session detection |
+| HTTP client | http.Agent | built-in | keepAlive, maxSockets; no extra dependency |
+| Structured logging | pino | ^9.x | Fast JSON logging; single new dependency |
+| Config | process.env | built-in | 12-factor; validated at startup |
+| Health check | http.createServer | built-in | Lightweight health endpoint |
+| Process management | PID file + signals | built-in | Simple, no supervisor dependency |
+
+**New npm dependencies:** `pino` only. Everything else uses Node.js built-ins.

 ## 4. Project Structure

 ```
 MATTERMOST_OPENCLAW_LIVESTATUS/
 ├── src/
-│   ├── status-watcher.js      CREATE  Core transcript tail + parse + debounce
-│   ├── session-monitor.js     CREATE  Watch sessions.json for new/ended sessions
-│   ├── mattermost-client.js   CREATE  Mattermost HTTP API wrapper (rate-limited)
-│   ├── tool-labels.json       CREATE  Tool name → human-readable label map
-│   ├── status-formatter.js    CREATE  Format status box message (text + sub-agents)
-│   ├── watcher-manager.js     CREATE  Start/stop watchers per session, PID tracking
-│   ├── live-status.js         MODIFY  Add start-watcher/stop-watcher commands; keep create/update/complete
-│   └── agent-accounts.json    KEEP    Agent ID → bot account mapping
+│   ├── status-watcher.js      CREATE  Multiplexed directory watcher + JSONL parser
+│   ├── status-box.js          CREATE  Mattermost post manager (shared pool, throttle, circuit breaker)
+│   ├── session-monitor.js     CREATE  Poll sessions.json for new/ended sessions
+│   ├── tool-labels.js         CREATE  Pattern-matching tool name -> label resolver
+│   ├── config.js              CREATE  Centralized env-var config with validation
+│   ├── logger.js              CREATE  pino wrapper (structured JSON logging)
+│   ├── circuit-breaker.js     CREATE  Circuit breaker for API resilience
+│   ├── health.js              CREATE  HTTP health endpoint + metrics
+│   ├── watcher-manager.js     CREATE  Entrypoint: orchestrates all above, PID file, graceful shutdown
+│   ├── tool-labels.json       CREATE  Built-in tool label defaults
+│   ├── live-status.js         DEPRECATE  Keep for backward compat; add deprecation warning
+│   └── agent-accounts.json    KEEP    Agent ID -> bot account mapping
 │
 ├── hooks/
 │   └── status-watcher-hook/
-│       ├── HOOK.md            CREATE  Hook metadata (events: gateway:startup, command:new)
-│       └── handler.ts         CREATE  Spawns watcher-manager on gateway start
+│       ├── HOOK.md            CREATE  events: ["gateway:startup"]
+│       └── handler.js         CREATE  Spawns watcher-manager on gateway start
+│
+├── deploy/
+│   ├── status-watcher.service CREATE  systemd unit file
+│   └── Dockerfile             CREATE  Container deployment option
+│
+├── test/
+│   ├── unit/                  CREATE  Unit tests (parser, tool-labels, circuit-breaker, throttle)
+│   └── integration/           CREATE  Integration tests (lifecycle, restart recovery, sub-agent)
 │
 ├── skill/
-│   └── SKILL.md               REWRITE  Remove verbose manual protocol; just note status is automatic
+│   └── SKILL.md               REWRITE  "Status is automatic, no action needed" (10 lines)
 │
-├── deploy-to-agents.sh        REWRITE  Installs hook instead of AGENTS.md injection
-├── install.sh                 REWRITE  New install flow: npm install + hook enable
+├── discoveries/
+│   └── README.md              EXISTING  Discovery findings (do not overwrite)
+│
+├── deploy-to-agents.sh        REWRITE  Installs hook into workspace hooks dir; no AGENTS.md injection
+├── install.sh                 REWRITE  npm install + deploy hook + optional gateway restart
 ├── README.md                  REWRITE  Full v4 documentation
-├── package.json               MODIFY   Add start/stop/status npm scripts
-└── Makefile                   MODIFY   Add check/test/lint/fmt targets
+├── package.json               MODIFY   Add pino dependency, test/start/stop/status scripts
+└── Makefile                   MODIFY   Update check/test/lint/fmt targets
 ```

 ## 5. Dependencies

 | Package | Version | Purpose | New/Existing |
 |---------|---------|---------|-------------|
+| pino | ^9.x | Structured JSON logging | NEW |
 | node.js | 22.x | Runtime | Existing (system) |
-| (none) | - | All built-in: https, fs, path, child_process | - |
+| http, fs, path, child_process | built-in | All other functionality | Existing |

-No new npm dependencies. Everything uses Node.js built-ins to keep install footprint at zero.
+One new npm dependency only. Minimal footprint.

 ## 6. Data Model

@@ -118,166 +149,242 @@ No new npm dependencies. Everything uses Node.js built-ins to keep install footp
 }
 ```

-### JSONL event schema (parsed by watcher)
+### JSONL event schema
 ```
-type=session    → session UUID, cwd (first line only)
-type=message    → role=user|assistant|toolResult; content[]=text|toolCall|toolResult
-type=custom     → customType=openclaw.cache-ttl (turn boundary marker)
+type=session      -> id (UUID), version (3), cwd — first line only
+type=message      -> role=user|assistant|toolResult; content[]=text|toolCall|toolResult|thinking
+type=custom       -> customType=openclaw.cache-ttl (turn boundary marker)
+type=model_change -> provider, modelId
 ```
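
As an illustration of how these events could drive idle detection, here is a minimal sketch. It assumes a flat event layout (`role` and `content` directly on the event object), which the schema above does not confirm; the real transcript may nest these fields differently. `applyEvent` and `isIdle` are hypothetical names.

```javascript
// ASSUMPTION: events look like
//   { type: 'message', role: 'assistant', content: [{ type: 'toolCall', name: 'exec' }] }
// The actual JSONL layout may differ; adjust the field access accordingly.
function applyEvent(state, event, now) {
  state.lastActivity = now;                 // any new line counts as activity
  if (event.type !== 'message') return state;

  if (event.role === 'assistant') {
    for (const part of event.content ?? []) {
      if (part.type === 'toolCall') state.pendingToolCalls += 1;
    }
  } else if (event.role === 'toolResult') {
    // Clamp at zero in case the watcher attached mid-session and missed the call.
    state.pendingToolCalls = Math.max(0, state.pendingToolCalls - 1);
  }
  return state;
}

// Idle only when no tool call is outstanding AND the transcript has been quiet.
function isIdle(state, now, idleTimeoutMs = 60_000) {
  return state.pendingToolCalls === 0 && now - state.lastActivity >= idleTimeoutMs;
}
```

The point of the pairing is that a long-running tool call keeps `pendingToolCalls` above zero, so the session is not marked complete merely because the transcript is silent while the tool runs.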

-### Watcher state per session
+### SessionState (in-memory per active session)
 ```json
 {
  "sessionKey": "agent:main:subagent:uuid",
-  "sessionFile": "/path/to/uuid.jsonl",
-  "bytesRead": 1024,
-  "statusPostId": "abc123def456...",
+  "sessionFile": "/path/to/{uuid}.jsonl",
+  "bytesRead": 4096,
+  "statusPostId": "abc123def456",
  "channelId": "yy8agcha...",
  "rootPostId": null,
-  "lastActivity": 1772897576000,
-  "subAgentWatchers": ["child-session-key"],
-  "statusLines": ["[15:21] Reading file... done", ...],
-  "parentStatusPostId": null
+  "startTime": 1772897576000,
+  "lastActivity": 1772897590000,
+  "pendingToolCalls": 0,
+  "lines": ["[15:21] Reading file... done", ...],
+  "subAgentKeys": ["agent:main:subagent:child-uuid"],
+  "parentSessionKey": null,
+  "complete": false
 }
 ```

-### Status box format
+### Configuration (env vars)
 ```
-Agent: main — PROJ-035 Plan
-[15:21:22] Reading transcript format...
-[15:21:25] exec: ls /agents/sessions done (0.8s)
-[15:21:28] Writing implementation plan...
+MM_TOKEN                    (required)  Mattermost bot token
+MM_URL                      (required)  Mattermost base URL
+TRANSCRIPT_DIR              (required)  Path to agent sessions directory
+SESSIONS_JSON               (required)  Path to sessions.json
+THROTTLE_MS                 500         Min interval between Mattermost updates
+IDLE_TIMEOUT_S              60          Inactivity before marking session complete
+MAX_SESSION_DURATION_S      1800        Hard timeout for any session (30 min)
+MAX_STATUS_LINES            15          Max lines in status box (oldest dropped)
+MAX_ACTIVE_SESSIONS         20          Bounded concurrency for status boxes
+MAX_MESSAGE_CHARS           15000       Mattermost post truncation limit
+HEALTH_PORT                 9090        Health check HTTP port (0 = disabled)
+LOG_LEVEL                   info        Logging level
+CIRCUIT_BREAKER_THRESHOLD   5           Consecutive failures to open circuit
+CIRCUIT_BREAKER_COOLDOWN_S  30          Cooldown before half-open probe
+PID_FILE                    /tmp/status-watcher.pid
+TOOL_LABELS_FILE            null        Optional external tool labels JSON override
+DEFAULT_CHANNEL             null        Fallback channel for non-MM sessions (null = skip)
+```
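
A fail-fast loader for the variables above might look like the sketch below. `loadConfig` and the shape of the returned object are hypothetical; only the variable names and defaults come from the table.

```javascript
// Sketch only: loadConfig and its return shape are hypothetical.
const REQUIRED = ['MM_TOKEN', 'MM_URL', 'TRANSCRIPT_DIR', 'SESSIONS_JSON'];
const INT_DEFAULTS = {
  THROTTLE_MS: 500,
  IDLE_TIMEOUT_S: 60,
  MAX_SESSION_DURATION_S: 1800,
  MAX_STATUS_LINES: 15,
  MAX_ACTIVE_SESSIONS: 20,
  MAX_MESSAGE_CHARS: 15000,
  HEALTH_PORT: 9090,
  CIRCUIT_BREAKER_THRESHOLD: 5,
  CIRCUIT_BREAKER_COOLDOWN_S: 30,
};

function loadConfig(env = process.env) {
  // Report every missing required variable at once, not just the first.
  const missing = REQUIRED.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required env vars: ${missing.join(', ')}`);
  }

  const config = {};
  for (const name of REQUIRED) config[name] = env[name];

  // Numeric settings: fall back to the default, reject anything non-integer.
  for (const [name, fallback] of Object.entries(INT_DEFAULTS)) {
    const raw = env[name];
    const value = raw === undefined || raw === '' ? fallback : Number(raw);
    if (!Number.isInteger(value) || value < 0) {
      throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
    }
    config[name] = value;
  }

  config.LOG_LEVEL = env.LOG_LEVEL || 'info';
  config.PID_FILE = env.PID_FILE || '/tmp/status-watcher.pid';
  config.TOOL_LABELS_FILE = env.TOOL_LABELS_FILE || null;
  config.DEFAULT_CHANNEL = env.DEFAULT_CHANNEL || null;
  return Object.freeze(config);
}
```

Passing `env` as a parameter keeps the function testable without mutating `process.env`.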
+
+### Status box format (rendered Mattermost text)
+```
+[ACTIVE] main | 38s
+Reading live-status source code...
+  exec: ls /agents/sessions [OK]
+Analyzing agent configurations...
+  exec: grep -r live-status [OK]
+Writing new implementation...
  Sub-agent: proj035-planner
-    [15:21:42] Reading protocol...
-    [15:21:55] Analyzing JSONL format...
-    [15:22:10] Complete (28s)
-[15:22:15] Plan ready. Awaiting approval.
-Runtime: 53s
+    Reading protocol...
+    Analyzing JSONL format...
+    [DONE] 28s
+Plan ready. Awaiting approval.
+[DONE] 53s | 12.4k tokens
 ```
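
A simplified rendering function for this layout is sketched below. It is not the planned `status-formatter.js`: the input shape (`label`, `elapsedS`, `complete`, `lines`, `children`) is a hypothetical simplification of SessionState, and the footer and token count are omitted.

```javascript
// Sketch only: the input shape is hypothetical and simpler than SessionState.
function formatStatus(session, maxLines = 15) {
  const status = session.complete ? 'DONE' : 'ACTIVE';
  const out = [`[${status}] ${session.label} | ${session.elapsedS}s`];

  // Keep only the newest maxLines lines; the oldest are dropped.
  out.push(...session.lines.slice(-maxLines));

  // Sub-agents render as an indented block inside the parent's post body.
  for (const child of session.children ?? []) {
    out.push(`  Sub-agent: ${child.label}`);
    for (const line of child.lines.slice(-maxLines)) out.push(`    ${line}`);
    if (child.complete) out.push(`    [DONE] ${child.elapsedS}s`);
  }
  return out.join('\n');
}
```

Because the function is pure (state in, string out), it can be unit tested without any Mattermost connection.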

 ## 7. Task Checklist

-### Phase 0: Repo Sync + Setup ⏱️ 10min
+### Phase 0: Repo Sync + Environment Verification ⏱️ 30min
 > Parallelizable: no | Dependencies: none
- [ ] 0.1: Sync workspace live-status.js to remote repo (git push) → remote matches workspace
- [ ] 0.2: Verify Makefile has check/test/lint/fmt targets (or add them) → make check passes
- [ ] 0.3: Create `src/tool-labels.json` with initial tool→label mapping → file exists
- [ ] 0.4: Create `src/agent-accounts.json` (already exists, verify) → agent→account mapping
+- [ ] 0.1: Sync workspace live-status.js (283-line v2) to remote repo — git push → remote matches workspace copy
+- [ ] 0.2: Fix existing lint errors in live-status.js (43 issues: empty catch blocks, console statements) — replace empty catches with error logging, add eslint-disable comments for intentional console.log → make lint passes
+- [ ] 0.3: Run `make check` — verify all Makefile targets pass (lint/fmt/test/secret-scan) → clean run, zero failures
+- [ ] 0.4: Verify `pino` available via npm — add to package.json and `npm install` → confirm installs cleanly
+- [ ] 0.5: Create `src/tool-labels.json` with initial tool->label mapping (all known tools from agent-accounts + TOOLS.md) → file exists, valid JSON
+- [ ] 0.6: Document exact transcript directory path and sessions.json path from the running gateway → constants confirmed for config.js (transcript dir: /home/node/.openclaw/agents/{agent}/sessions/, sessions.json: same directory)

-### Phase 1: Core Watcher ⏱️ 2-3h
-> Parallelizable: no | Dependencies: Phase 0
- [ ] 1.1: Create `src/mattermost-client.js` — HTTP wrapper with rate limiting (max 2 req/s), retry on 429, create/update/delete post methods → tested with curl
- [ ] 1.2: Create `src/status-formatter.js` — formats status box lines from events, sub-agent nesting, timestamps → unit testable pure function
- [ ] 1.3: Create `src/status-watcher.js` — core daemon:
-  - Accepts: sessionKey, sessionFile, channelId, rootPostId (optional), statusPostId (optional)
-  - Reads JSONL file from current byte offset
-  - On new lines: parse event type, extract human-readable status
-  - Debounce 500ms before Mattermost update
-  - Idle timeout: 30s after last new line → mark complete
-  - Emits events: status-update, session-complete
-  - Returns: statusPostId (created on first event)
- [ ] 1.4: Add `src/tool-labels.json` with all known tools → exec, read, write, edit, web_search, web_fetch, message, subagents, nodes, browser, image, camofox_*, claude_code_*
- [ ] 1.5: Manual test — start watcher against a real session file, verify Mattermost post appears → post created and updated
+### Phase 1: Core Components ⏱️ 8-12h
+> Parallelizable: partially (config/logger/circuit-breaker are independent) | Dependencies: Phase 0

-### Phase 2: Session Monitor ⏱️ 1-2h
+- [ ] 1.1: Create `src/config.js` — reads all env vars with validation; throws clear error on missing required vars; exports typed config object → unit testable, fails fast
+- [ ] 1.2: Create `src/logger.js` — pino wrapper with default config (JSON output, leveled); singleton; session-scoped child loggers via `logger.child({sessionKey})` → used by all modules
+- [ ] 1.3: Create `src/circuit-breaker.js` — state machine (closed/open/half-open), configurable threshold and cooldown, callbacks for state changes → unit tested with simulated failures
+- [ ] 1.4: Create `src/tool-labels.js` — loads `tool-labels.json`; supports exact match, prefix match (e.g. `camofox_*`), regex match; default label "Working..."; configurable external override file → unit tested with 20+ tool names
+- [ ] 1.5: Create `src/status-box.js` — Mattermost post manager:
+  - Shared `http.Agent` (keepAlive, maxSockets=4)
+  - `createPost(channelId, text, rootId?)` -> postId
+  - `updatePost(postId, text)` -> void
+  - Throttle: leading edge fires immediately, trailing flush after THROTTLE_MS; coalesce intermediate updates
+  - Message size guard: truncate to MAX_MESSAGE_CHARS
+  - Circuit breaker wrapping all API calls
+  - Retry with exponential backoff on 429/5xx (up to 3 retries)
+  - Structured logs for every API call
+  → unit tested with mock HTTP server
+- [ ] 1.6: Create `src/status-formatter.js` — pure function; input: SessionState; output: formatted Mattermost text string (compact, MAX_STATUS_LINES, sub-agent nesting, status prefix, timestamps) → unit tested with varied inputs
+- [ ] 1.7: Create `src/health.js` — HTTP server on HEALTH_PORT; GET /health returns JSON {status, activeSessions, uptime, lastError, metrics: {updates_sent, updates_failed, circuit_state, queue_depth}} → manually tested with curl
+- [ ] 1.8: Create `src/status-watcher.js` — core JSONL watcher:
+  - fs.watch on TRANSCRIPT_DIR (recursive)
+  - On file change event: determine which sessionKey owns the file (via filename->sessionKey map built from sessions.json)
+  - Read new bytes from lastOffset; split on newlines; parse JSONL
+  - Map parsed events to SessionState updates:
+    - toolCall -> increment pendingToolCalls, add status line
+    - toolResult -> decrement pendingToolCalls, update status line with result
+    - assistant text -> add status line (truncated to 80 chars)
+    - turn boundary (cache-ttl custom) -> flush status update
+  - Detect file truncation (stat.size < bytesRead) -> reset offset, log warning
+  - Throttle updates via status-box.js (leading edge + trailing flush)
+  - Idle detection: when pendingToolCalls==0 and no new lines for IDLE_TIMEOUT_S
+  → integration tested with real JSONL sample files
+- [ ] 1.9: Unit test suite (`test/unit/`) — parser, tool-labels, circuit-breaker, throttle, status-formatter → `npm test` green
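
The offset-based read in task 1.8 (read new bytes from the last offset, split on newlines, reset on truncation) can be sketched with synchronous `fs` calls. `readNewLines` is a hypothetical helper name; a real implementation would likely use async I/O and cap the read size.

```javascript
const fs = require('node:fs');

// Sketch only: readNewLines is a hypothetical helper, not the planned API.
// Returns complete lines appended since lastOffset plus the new offset.
function readNewLines(filePath, lastOffset) {
  const fd = fs.openSync(filePath, 'r');
  try {
    const { size } = fs.fstatSync(fd);
    let offset = lastOffset;
    let truncated = false;
    if (size < offset) {          // file shrank (compaction/rewrite): start over
      offset = 0;
      truncated = true;
    }
    const length = size - offset;
    if (length === 0) return { lines: [], offset, truncated };

    const buf = Buffer.alloc(length);
    const bytesRead = fs.readSync(fd, buf, 0, length, offset);

    // Consume only up to the last newline so a half-written line is retried later.
    const lastNewline = buf.lastIndexOf(0x0a, bytesRead - 1);
    if (lastNewline === -1) return { lines: [], offset, truncated };

    const lines = buf
      .toString('utf8', 0, lastNewline)
      .split('\n')
      .filter((line) => line.length > 0);
    return { lines, offset: offset + lastNewline + 1, truncated };
  } finally {
    fs.closeSync(fd);
  }
}
```

Each returned line can then be passed to `JSON.parse` inside a try/catch. Advancing the offset only past complete lines avoids parsing a record that the writer has not finished flushing.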
+
+### Phase 2: Session Monitor + Lifecycle ⏱️ 4-6h
 > Parallelizable: no | Dependencies: Phase 1
- [ ] 2.1: Create `src/session-monitor.js` — watches sessions.json for changes:
-  - Polls every 2s (fs.watch unreliable on Linux for JSON files)
-  - Diffs previous vs current sessions.json
-  - On new session: emit `session-added` with session details
-  - On removed session: emit `session-removed`
-  - Resolves channel/thread from session key format
- [ ] 2.2: Create `src/watcher-manager.js` — coordinates monitor + watchers:
-  - On session-added: resolve channel (from session key), start status-watcher
-  - Tracks active watchers in memory (Map: sessionKey → watcher)
-  - On session-removed or watcher-complete: clean up
-  - Handles sub-agents: on `spawnedBy` session added, nest under parent watcher
-  - PID file at `/tmp/openclaw-status-watcher.pid` for single-instance enforcement
- [ ] 2.3: Entry point `src/watcher-manager.js` CLI: `node watcher-manager.js start|stop|status` → process management
- [ ] 2.4: End-to-end test — run manager in foreground, trigger agent session, verify status box appears → automated smoke test

-### Phase 3: Channel Resolution ⏱️ 1h
+- [ ] 2.1: Create `src/session-monitor.js` — polls sessions.json every 2s:
+  - Diffs prev vs current to detect added/removed sessions
+  - Emits `session-added` with {sessionKey, sessionFile, spawnedBy, channelId, rootPostId}
+  - Emits `session-removed` with sessionKey
+  - Resolves channelId from session key (format: `agent:main:mattermost:channel:{id}:...`)
+  - Resolves rootPostId from session key (format: `...thread:{id}`)
+  - Falls back to DEFAULT_CHANNEL for non-MM sessions (or null to skip)
+  → integration tested with mock sessions.json writes
+- [ ] 2.2: Persist session offsets to disk — on each status update, write { sessionKey: bytesRead } to `/tmp/status-watcher-offsets.json`; on startup, load and restore existing sessions → restart recovery working
+- [ ] 2.3: Post recovery on restart — on startup, for each restored session, search channel history for status post with marker comment `<!-- sw:{sessionKey} -->`; if found, resume updating it; if not, create new post → tested by killing and restarting daemon mid-session
+- [ ] 2.4: Create `src/watcher-manager.js` — top-level orchestrator:
+  - Starts session-monitor and status-watcher
+  - On session-added: create SessionState, link to parent if spawnedBy set, add to status-watcher watch list
+  - On session-removed: schedule idle cleanup (allow final flush)
+  - Enforces MAX_ACTIVE_SESSIONS (drops lowest-priority session if over limit, logs warning)
+  - Writes/reads PID file
+  - Registers SIGTERM/SIGINT handlers:
+    - On signal: mark all active status boxes "interrupted", flush all pending updates, remove PID file, exit 0
+  - CLI: `node watcher-manager.js start|stop|status` → process management
+  → smoke tested end-to-end
+- [ ] 2.5: Integration test suite (`test/integration/`) — lifecycle events, restart recovery → `npm run test:integration` green
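
Two pieces of task 2.1 lend themselves to small pure functions: resolving the channel and thread from a session key, and diffing successive sessions.json snapshots. The sketch below uses the key formats quoted in the plan; `resolveTarget` and `diffSessions` are hypothetical names, and sessions.json is assumed to be an object keyed by session key.

```javascript
// Sketch only: function names are hypothetical.
// Key formats from the plan:
//   agent:main:mattermost:channel:{id}:...      -> channelId
//   ...:thread:{id}                             -> rootPostId
function resolveTarget(sessionKey, defaultChannel = null) {
  const channel = /:mattermost:channel:([^:]+)/.exec(sessionKey);
  const thread = /:thread:([^:]+)/.exec(sessionKey);
  return {
    channelId: channel ? channel[1] : defaultChannel, // null means "skip this session"
    rootPostId: channel && thread ? thread[1] : null,
  };
}

// ASSUMPTION: each snapshot is a plain object keyed by session key.
function diffSessions(prev, curr) {
  const added = Object.keys(curr).filter((key) => !(key in prev));
  const removed = Object.keys(prev).filter((key) => !(key in curr));
  return { added, removed };
}
```

The monitor would call `diffSessions` on every 2s poll and emit `session-added` or `session-removed` for each key in the result.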
+
+### Phase 3: Sub-Agent Support ⏱️ 3-4h
 > Parallelizable: no | Dependencies: Phase 2
- [ ] 3.1: Implement channel resolver — given a session key like `agent:main:mattermost:channel:abc123`, extract the Mattermost channel ID → function with unit test
- [ ] 3.2: Handle thread sessions — `agent:main:mattermost:channel:abc123:thread:def456` → channel=abc123, rootPost=def456
- [ ] 3.3: Fallback for non-Mattermost sessions (hook sessions, cron sessions) — use configured default channel → configurable in openclaw.json or env var
- [ ] 3.4: Sub-agent channel resolution — inherit parent session's channel + use parent status box as `rootPostId` → sub-agent status appears under parent
+
+- [ ] 3.1: Sub-agent detection — session-monitor detects entries with `spawnedBy` field; links child SessionState to parent via `parentSessionKey` → linked correctly
+- [ ] 3.2: Nested status rendering — status-formatter renders sub-agent lines as indented block under parent status; sub-agent summary: label + elapsed + final status → visible in Mattermost as nested
+- [ ] 3.3: Cascade completion — parent session's idle detection checks that all child sessions are complete before marking parent done → no premature parent completion
+- [ ] 3.4: Sub-agent status post reuse — sub-agents do not create new top-level posts; their status is embedded in the parent post body → only one post per parent session visible in channel
+- [ ] 3.5: Integration test — spawn mock sub-agent transcript, verify parent status box shows nested child progress → manual verification in Mattermost

 ### Phase 4: Hook Integration ⏱️ 1h
-> Parallelizable: no | Dependencies: Phase 2, Phase 3
- [ ] 4.1: Create `hooks/status-watcher-hook/HOOK.md` with `events: ["gateway:startup"]` → discovered by OpenClaw hooks system
- [ ] 4.2: Create `hooks/status-watcher-hook/handler.js` (plain JS) — on gateway:startup, spawn `watcher-manager.js start` as background child_process → watcher manager auto-starts with gateway. Note: OpenClaw hooks system discovers `handler.ts` first, then `handler.js` — both are supported natively via dynamic import. Plain .js is confirmed to work.
- [ ] 4.3: Add `hooks/status-watcher-hook/` to workspace hooks dir (`/home/node/.openclaw/workspace/hooks/`) via `deploy-to-agents.sh` → hook auto-discovered
- [ ] 4.4: Test: restart gateway → watcher-manager starts → verify PID file exists
+> Parallelizable: no | Dependencies: Phase 2 (watcher-manager CLI working)

-### Phase 5: Polish + Cleanup ⏱️ 1h
-> Parallelizable: no | Dependencies: Phase 4
- [ ] 5.1: Rewrite `skill/SKILL.md` — remove manual protocol; say "live status is automatic, no action needed" → 10-line skill file
- [ ] 5.2: Rewrite `deploy-to-agents.sh` — remove AGENTS.md injection; install hook into workspace hooks dir; restart gateway → one-command deploy
- [ ] 5.3: Update `install.sh` — npm install, deploy hook, optionally restart gateway
- [ ] 5.4: Update `src/live-status.js` — add `start-watcher` and `stop-watcher` commands for manual control; mark create/update/complete as deprecated but keep working
- [ ] 5.5: Handle session compaction — detect if JSONL file gets smaller (compaction rewrites) → reset byte offset and re-read from start
- [ ] 5.6: Write `README.md` — full v4 documentation with architecture diagram, install steps, config reference
- [ ] 5.7: Run `make check` to verify lint/format passes → clean CI
+- [ ] 4.1: Create `hooks/status-watcher-hook/HOOK.md` — events: ["gateway:startup"], description, required env vars listed → OpenClaw discovers hook
+- [ ] 4.2: Create `hooks/status-watcher-hook/handler.js` — on gateway:startup: check if watcher already running (PID file), if not: spawn `node watcher-manager.js start` as detached background process → watcher auto-starts with gateway
+- [ ] 4.3: Deploy hook to workspace — `cp -r hooks/status-watcher-hook /home/node/.openclaw/workspace/hooks/` → hook in place
+- [ ] 4.4: Test: gateway restart -> watcher starts, PID file written, health endpoint responds → verified
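
The "already running" check in task 4.2 can be done by reading the PID file and probing the process with signal 0. This is a sketch under assumptions: `isWatcherRunning` is a hypothetical helper, and it does not guard against a stale PID that the OS has since reused for an unrelated process.

```javascript
const fs = require('node:fs');

// Sketch only: isWatcherRunning is a hypothetical helper.
// Returns true if the PID recorded in pidFile refers to a live process.
function isWatcherRunning(pidFile) {
  let pid;
  try {
    pid = Number.parseInt(fs.readFileSync(pidFile, 'utf8').trim(), 10);
  } catch {
    return false;                  // no readable PID file: treat as not running
  }
  if (!Number.isInteger(pid) || pid <= 0) return false;

  try {
    process.kill(pid, 0);          // signal 0 checks existence without delivering a signal
    return true;
  } catch (err) {
    return err.code === 'EPERM';   // process exists but is owned by another user
  }
}
```

If this returns false, the handler could start the daemon with `child_process.spawn(process.execPath, [managerPath, 'start'], { detached: true, stdio: 'ignore' })` followed by `.unref()`, where `managerPath` is a placeholder for the installed location of `watcher-manager.js`.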
+
+### Phase 5: Polish + Deployment ⏱️ 3-4h
+> Parallelizable: yes (docs, deploy scripts, skill rewrite are independent) | Dependencies: Phase 4
+
+- [ ] 5.1: Rewrite `skill/SKILL.md` — 10-line file: "Live status updates are automatic. You do not need to call live-status manually. Focus on your task." → no protocol injection
+- [ ] 5.2: Rewrite `deploy-to-agents.sh` — remove AGENTS.md injection; deploy hook; npm install; optionally restart gateway → one-command deploy
+- [ ] 5.3: Rewrite `install.sh` — npm install (installs pino); deploy hook; print post-install instructions including env vars required → clean install flow
+- [ ] 5.4: Create `deploy/status-watcher.service` — systemd unit file for standalone deployment (non-hook mode); uses env file at `/etc/status-watcher.env` → usable with systemctl
+- [ ] 5.5: Create `deploy/Dockerfile` — FROM node:22-alpine; COPY src/ test/; RUN npm install; CMD ["node", "watcher-manager.js", "start"] → containerized deployment option
+- [ ] 5.6: Update `src/live-status.js` — add startup deprecation warning "NOTE: live-status CLI is deprecated as of v4. Status updates are now automatic."; add `start-watcher` and `stop-watcher` pass-through commands → backward compat maintained
+- [ ] 5.7: Handle session compaction edge case — add test with truncated JSONL file; verify watcher resets offset and continues without crash → no data loss
+- [ ] 5.8: Write `README.md` — architecture diagram (ASCII), install steps, config reference, upgrade guide from v1, troubleshooting → complete documentation
+- [ ] 5.9: Run `make check` → zero lint/format errors; `npm test` → green

 ### Phase 6: Remove v1 Injection from AGENTS.md ⏱️ 30min
-> Parallelizable: no | Dependencies: Phase 5 (after watcher confirmed working)
- [ ] 6.1: Remove "📡 Live Status Protocol (MANDATORY)" section from main agent's AGENTS.md
- [ ] 6.2: Remove from all other agent AGENTS.md files (coder-agent, xen, global-calendar, etc.)
- [ ] 6.3: Confirm watcher is running before removing (safety check) → watcher PID file exists
+> Parallelizable: no | Dependencies: Phase 5 fully verified + watcher confirmed running
+> SAFETY: Do not execute this phase until watcher has been running successfully for at least 1 hour
+
+- [ ] 6.1: Verify watcher is running — check PID file, health endpoint, and at least one real status box update → confirmed working before touching AGENTS.md
+- [ ] 6.2: Remove "Live Status Protocol (MANDATORY)" section from main AGENTS.md → section removed
+- [ ] 6.3: Remove from all other agent AGENTS.md files (coder-agent, xen, global-calendar, nutrition-agent, gym-designer) → all cleaned up
+- [ ] 6.4: Commit AGENTS.md changes with message "feat: remove v1 live-status injection (v4 watcher active)" → change tracked

 ## 8. Testing Strategy

 | What | Type | How | Success Criteria |
 |------|------|-----|-----------------|
-| Mattermost client | Unit | Direct API call with test channel | Post created and updated |
-| Status formatter | Unit | Input JSONL events → verify output strings | Correct labels, timestamps |
-| Channel resolver | Unit | Test session key strings → verify channel/thread extracted | All formats parsed |
-| JSONL parser | Unit | Sample events from real transcripts | All types handled |
-| Session monitor | Integration | Write to sessions.json, verify events emitted | New session detected in <2s |
-| Status watcher | Integration | Append to JSONL file, verify Mattermost post updates | Update within 1s of new line |
-| Sub-agent nesting | Integration | Spawn real sub-agent, verify nested status | Sub-agent visible in parent box |
-| Idle timeout | Integration | Stop writing to JSONL, verify complete after 30s | Status box marked done |
-| Compaction | Integration | Truncate JSONL file, verify watcher recovers | No duplicate events, no crash |
-| E2E | Manual smoke test | Real agent task in Mattermost, verify status box | Real-time updates visible |
+| config.js | Unit | Env var injection, missing var detection | Throws on missing required vars; correct defaults |
+| logger.js | Unit | Log output format | JSON output, levels respected |
+| circuit-breaker.js | Unit | Simulate N failures, verify state transitions | open after threshold, half-open after cooldown |
+| tool-labels.js | Unit | 30+ tool names (exact, prefix, regex, unmapped) | Correct labels returned; default for unknown |
+| status-formatter.js | Unit | Various SessionState inputs | Correct compact output; MAX_LINES enforced |
+| status-box.js | Unit | Mock HTTP server | create/update called correctly; throttle works; circuit fires |
+| session-monitor.js | Integration | Write test sessions.json; verify events emitted | session-added/removed within 2s |
+| status-watcher.js | Integration | Append to JSONL file; verify Mattermost update | Update within 1.5s of new line |
+| Idle detection | Integration | Stop writing; verify complete after IDLE_TIMEOUT+5s | Status box marked done |
+| Session compaction | Integration | Truncate JSONL file mid-session | No crash; offset reset; no duplicate events |
+| Restart recovery | Integration | Kill daemon mid-session; restart | Existing post updated, not new post created |
+| Sub-agent nesting | Integration | Mock parent + child transcripts | Child visible in parent status box |
+| Cascade completion | Integration | Child completes; verify parent waits | Parent marks done after last child |
+| Health endpoint | Manual | curl localhost:9090/health | JSON with correct metrics |
+| E2E smoke test | Manual | Real agent task in Mattermost | Real-time updates; no spam; done on completion |

 ## 9. Risks & Mitigations

 | Risk | Impact | Mitigation |
 |------|--------|-----------|
-| fs.watch unreliable on Linux | High | Fall back to polling (setInterval 2s). fs.watch as optimization |
-| Sessions.json write race condition | Medium | Use atomic read (retry on parse error), debounce diff |
-| Mattermost rate limit (10 req/s) | Medium | Debounce updates to 500ms; queue + batch; exponential backoff on 429 |
-| Session compaction truncates JSONL | Medium | Compare file size on each poll; if smaller, reset offset |
-| Multiple gateway restarts create duplicate watchers | Medium | PID file check + kill old process before spawning new |
-| Sub-agent session key not stable across restarts | Low | Use sessionId (UUID) as key, not session key string |
-| Watcher dies silently | Low | Cron health check or gateway boot-md restart |
-| Non-Mattermost sessions (xen, hook) get status boxes | Low | Channel resolver returns null for non-MM sessions; skip gracefully |
-| JSONL format change in future OpenClaw version | Medium | Abstract parser behind interface; version check on session record |
+| fs.watch recursive not reliable on this kernel | High | Detect at startup; fall back to polling if watch fails (setInterval 2s on directory listing) |
+| sessions.json write race causes parse error | Medium | Try/catch on JSON.parse; retry next poll cycle; log warning |
+| Mattermost rate limit (10 req/s default) | Medium | Throttle to max 2 req/s per session; circuit breaker; exponential backoff on 429 |
+| Session compaction truncates JSONL | Medium | Detect stat.size < bytesRead on each read; reset offset; dedup by tracking last processed line index |
+| Multiple gateway restarts create duplicate watchers | Medium | PID file check + SIGTERM old process before spawning new one |
+| Non-MM sessions (hook, cron) generate noise | Low | Channel resolver returns null; watcher skips session gracefully |
+| pino dependency unavailable | Low | If npm install fails, fall back to console.log (degrade gracefully, log warning) |
+| Status box exceeds Mattermost post size limit | Low | Hard truncate at MAX_MESSAGE_CHARS (15000); tested with message size guard |
+| JSONL format changes in future OpenClaw | Medium | Abstract parser behind EventParser interface; version check on session record |
+| Daemon crashes mid-session | Medium | Health check via systemd/Docker; restart policy; offset persistence enables recovery |

 ## 10. Effort Estimate

 | Phase | Time | Can Parallelize? | Depends On |
 |-------|------|-------------------|-----------|
-| Phase 0: Repo Setup | 10min | No | — |
-| Phase 1: Core Watcher | 2-3h | No | Phase 0 |
-| Phase 2: Session Monitor | 1-2h | No | Phase 1 |
-| Phase 3: Channel Resolution | 1h | No | Phase 2 |
+| Phase 0: Repo + Env Verification | 15min | No | — |
+| Phase 1: Core Components | 8-12h | Partially (config/logger/circuit-breaker) | Phase 0 |
+| Phase 2: Session Monitor + Lifecycle | 4-6h | No | Phase 1 |
+| Phase 3: Sub-Agent Support | 3-4h | No | Phase 2 |
 | Phase 4: Hook Integration | 1h | No | Phase 2+3 |
-| Phase 5: Polish + Cleanup | 1h | No | Phase 4 |
-| Phase 6: Remove v1 Injection | 30min | No | Phase 5 (verified) |
-| **Total** | **7-9h** | | |
+| Phase 5: Polish + Deployment | 3-4h | Yes (docs, deploy, skill) | Phase 4 |
+| Phase 6: Remove v1 AGENTS.md Injection | 30min | No | Phase 5 verified |
+| **Total** | **20-28h** | | |

 ## 11. Open Questions

-- [ ] **Q1: Idle timeout threshold.** 30s is aggressive — exec commands can run for minutes. Should we use a smarter heuristic? E.g., detect `stopReason: "toolUse"` (agent is waiting for tool) vs `stopReason: "stop"` (agent is done).
-  **Default if unanswered:** Use `stopReason: "stop"` in the most recent assistant message as the idle signal, combined with 10s of no new lines. If stop_reason=toolUse, reset idle timer on every toolResult line. This is accurate and avoids false completions during long tool runs.
+All questions have defaults that allow execution to proceed without answers.

-- [ ] **Q2: Default channel for non-MM sessions.** Hook-triggered sessions (agent:main:hook:gitea:...) don't have a Mattermost channel. Should we (a) skip them, (b) post to a default monitoring channel, or (c) allow config per-session-type?
-  **Default if unanswered:** (a) Skip non-MM sessions. Hook and cron sessions are largely invisible today and not causing user pain. The priority is Mattermost interactive sessions. Non-MM support can be Phase 7.
+- [ ] **Q1 (informational): Idle timeout tuning.** The 60s default may still cause premature completion for very long exec calls (e.g., a 3-minute build). The smart heuristic (pendingToolCalls tracking) should handle this correctly, but production data may reveal edge cases.
+  **Default:** Use smart heuristic (pendingToolCalls + IDLE_TIMEOUT_S=60). Log false-positives for tuning.

-- [ ] **Q3: Status box per-session or per-request?** A single agent session may handle multiple sequential requests. Should each new user message create a new status box, or does one session = one status box?
-  **Default if unanswered:** One status box per user message (per-request). Each incoming user message starts a new status cycle. When agent sends final response (stopReason=stop + no tool calls), mark current box complete. On next user message, create a new box. This matches expected UX: one progress indicator per task.
+- [ ] **Q2 (informational): Non-MM session behavior.** Hook sessions, cron sessions, and xen sessions don't have a Mattermost channel. Currently skipped.
+  **Default:** Skip non-MM sessions (no status box). Log at debug level. Can revisit for Phase 7.

-- [ ] **Q4: Compaction behavior.** When OpenClaw compacts a transcript (rewrites the JSONL), does it preserve the original file or create a new one?
-  **Default if unanswered:** Assume in-place truncation (most likely based on `compactionCount` field in sessions.json). Detect by checking if fileSize < bytesRead on each poll. If truncated, reset bytesRead to 0 and re-read from start (with deduplication via message IDs to avoid re-posting old events).
+- [ ] **Q3 (informational): Status box per-request vs per-session.** Currently: one status box per user message (reset on new user turn). This is the most natural UX.
+  **Default:** Per-request. New user message starts new status cycle. Works correctly with smart idle detection.
+
+- [ ] **Q4 (informational): Compaction dedup strategy.** When JSONL is truncated, we reset offset and re-read. We may re-process events already posted to Mattermost.
+  **Default:** Track the last processed line count (not just the byte offset) and skip already-processed lines on re-read. Alternatively, detect compaction and do not re-append old events, since they were already shown. Simplest option: mark the box as "session compacted - continuing" and reset the visible lines in the status box.
+
+- [ ] **Q5 (blocking if the answer is no): AGENTS.md modification scope.** Phase 6 removes the Live Status Protocol section from all agent AGENTS.md files. Confirm that Rooh wants all instances removed (not just the main agent's).
+  **Default if not answered:** Remove from all agents. This is the stated goal — removing v1 injection everywhere.
--- a/STATE.json
+++ b/STATE.json
@@ -4,8 +4,8 @@
  "planVersion": "v5-beta",
  "phase": 0,
  "totalPhases": 6,
-  "lastAgent": "planner:proj035:subagent:e8bb592a",
-  "lastUpdated": "2026-03-07T16:00:00Z",
+  "lastAgent": "planner:proj035-v2:subagent:37c5a99e",
+  "lastUpdated": "2026-03-07T17:08:00Z",
  "planPostedTo": "gitea",
  "giteaRepo": "ROOH/MATTERMOST_OPENCLAW_LIVESTATUS",
  "giteaIssueNumber": 3,
@@ -14,8 +14,11 @@
  "synthesisComplete": true,
  "synthesisDoc": "discoveries/README.md",
  "auditComplete": true,
-  "auditScore": "32/32",
-  "auditFindings": ["WARNING: gateway restart needed to activate hook — coordinate with Rooh"],
+  "auditScore": "34/34",
+  "auditFindings": [
+    "WARNING: make check currently fails on existing live-status.js (43 lint issues) — Phase 0.2 addresses this",
+    "WARNING: gateway restart needed to activate hook in Phase 4 — coordinate with Rooh"
+  ],
  "simulationComplete": true,
  "simulationVerdict": "READY",
  "hasOpenQuestions": true,
@@ -26,5 +29,15 @@
  "errors": [],
  "maxConcurrentSubagents": 2,
  "activeSubagents": 0,
-  "queuedTasks": []
+  "queuedTasks": [],
+  "notes": {
+    "transcriptDir": "/home/node/.openclaw/agents/{agentId}/sessions/",
+    "sessionsJsonPath": "/home/node/.openclaw/agents/{agentId}/sessions/sessions.json",
+    "subagentKeyPattern": "agent:main:subagent:{uuid}",
+    "subagentFields": ["sessionId", "spawnedBy", "spawnDepth", "label"],
+    "mmApiReachable": true,
+    "pinoInstallable": true,
+    "fsWatchNode22": "confirmed (inotify recursive)",
+    "makeCheckStatus": "FAILS on existing live-status.js (43 issues) — must fix in Phase 0"
+  }
 }
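
The compaction mitigation in the PLAN.md risk table (detect `stat.size < bytesRead`, then reset the offset) could be sketched roughly as below. This is a minimal illustration under stated assumptions, not the watcher's actual code: `readNewLines` and the `state.offset` field are hypothetical names, and deduplication of re-read events is left to the caller, which can act on the returned `truncated` flag.

```javascript
const fs = require('node:fs');

// Hypothetical helper (not from the plan): return the complete lines appended
// to filePath since the previous call. `state` is { offset: number }, where
// offset is the number of bytes already consumed.
function readNewLines(filePath, state) {
  const { size } = fs.statSync(filePath);
  let truncated = false;

  if (size < state.offset) {
    // The file shrank, so assume it was compacted in place and restart at byte 0.
    state.offset = 0;
    truncated = true;
  }
  if (size === state.offset) return { lines: [], truncated };

  const length = size - state.offset;
  const buf = Buffer.alloc(length);
  const fd = fs.openSync(filePath, 'r');
  let bytesRead;
  try {
    bytesRead = fs.readSync(fd, buf, 0, length, state.offset);
  } finally {
    fs.closeSync(fd);
  }

  // Consume only up to the last newline so a half-written JSONL record is
  // picked up whole on the next call. Cutting at a newline byte is also safe
  // for multi-byte UTF-8 characters.
  const chunk = buf.subarray(0, bytesRead);
  const lastNewline = chunk.lastIndexOf(0x0a);
  if (lastNewline === -1) return { lines: [], truncated };

  state.offset += lastNewline + 1;
  const lines = chunk.toString('utf8', 0, lastNewline).split('\n').filter(Boolean);
  return { lines, truncated };
}
```

A caller would keep one `state` object per transcript and invoke the helper on each poll or watch event. Note that a size comparison alone cannot detect a compaction that rewrites the file to a size equal to or larger than the stored offset, which is one reason the plan pairs it with line-based dedup.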