# Implementation Plan: Live Status v4 (Production-Grade)

> Generated: 2026-03-07 | Agent: planner:proj035-v2 | Status: DRAFT
> Revised: Incorporates production-grade changes from scalability/efficiency review (comment #11402)

## 1. Goal

Replace the broken agent-cooperative live-status system (v1) with a transparent infrastructure-level daemon that tails OpenClaw JSONL transcript files in real time and auto-updates Mattermost status boxes, with zero agent cooperation required. Sub-agents become visible. Final-response spam is eliminated. Sessions never lose state. A single multiplexed daemon handles all concurrent sessions efficiently.

## 2. Architecture

```
OpenClaw Gateway
  Agent Sessions (main, coder-agent, sub-agents, hooks...)
    -> writes {uuid}.jsonl as they work

  status-watcher daemon (SINGLE PROCESS, not per-session)
    -> fs.watch recursive on transcript directory (inotify, Node 22)
    -> Multiplexes all active session transcripts
    -> SessionState map: sessionKey -> { postId, bytesRead, pendingToolCalls, lines[] }
    -> Shared HTTP connection pool (keep-alive, maxSockets=4)
    -> Throttled Mattermost updates (leading edge + trailing flush, 500ms)
    -> Bounded concurrency: max N active status boxes (configurable, default 20)
    -> Structured JSON logging (pino)
    -> Graceful shutdown (SIGTERM/SIGINT -> mark all boxes "interrupted")
    -> Circuit breaker for Mattermost API failures

  Sub-agent transcripts
    -> Session key pattern: agent:{id}:subagent:{uuid}
    -> Detected automatically by directory watcher
    -> spawnedBy field in sessions.json links child to parent
    -> Nested under parent status box automatically

  sessions.json (runtime registry)
    -> Maps session keys -> { sessionId, spawnedBy, spawnDepth, label, channel }
    -> Polled every 2s for new sessions (supplement to directory watch)

Mattermost API (slack.solio.tech)
  -> POST /api/v4/posts -- create status box
  -> PUT /api/v4/posts/{id} -- update in-place (no edit time limit confirmed)
  -> Shared http.Agent with keepAlive: true, maxSockets: 4
  -> Circuit breaker: open after 5 failures, 30s cooldown, half-open probe
```

### Key Design Decisions (from discovery)

1. **Single multiplexed daemon vs per-session daemons.** Eliminates unbounded process spawning. One V8 heap, one connection pool, one point of control. Scales to 30+ concurrent sessions without linear process overhead.
2. **fs.watch recursive on transcript directory.** Node 22 on Linux supports recursive `fs.watch` backed by inotify. One watcher covers all sessions. No polling is needed in the normal path; the startup fallback listed in Section 9 applies only if the watch fails.
3. **Poll sessions.json every 2s.** fs.watch on JSON files is unreliable on Linux (writes may not trigger events). Poll to detect new sessions reliably.
4. **Smart idle detection via pendingToolCalls.** Do not use a naive 30s timeout. Track tool_call / tool_result pairs. A session is idle only when pendingToolCalls==0 AND no new lines have arrived for IDLE_TIMEOUT_S seconds (default 60s).
5. **Leading edge + trailing flush throttle.** The first event fires immediately (responsiveness). Subsequent events are batched. A final flush is guaranteed when activity stops (no lost updates).
6. **Mattermost post edit is unlimited.** PostEditTimeLimit=-1 confirmed on this server. No workarounds needed.
7. **All config via environment variables.** No hardcoded tokens, no sed replacement during install. Clean, 12-factor-style config.
8. **pino for structured logging.** Fast, JSON output, leveled. Production-debuggable.
9. **Circuit breaker for Mattermost API.** Prevents cascading failures during Mattermost outages. Bounded retry queue (max 100 entries).
10. **JSONL format is stable.** Version 3 confirmed. The parser abstracts the format behind an interface for future-proofing.

## 3. Tech Stack

| Layer | Technology | Version | Reason |
| --- | --- | --- | --- |
| Runtime | Node.js | 22.x (system) | Already installed; inotify recursive fs.watch supported |
| File watching | fs.watch recursive | built-in | inotify on Linux/Node22; efficient, no polling |
| Session discovery | setInterval poll | built-in | sessions.json polling for new session detection |
| HTTP client | http.Agent | built-in | keepAlive, maxSockets; no extra dependency |
| Structured logging | pino | ^9.x | Fast JSON logging; single new dependency |
| Config | process.env | built-in | 12-factor; validated at startup |
| Health check | http.createServer | built-in | Lightweight health endpoint |
| Process management | PID file + signals | built-in | Simple, no supervisor dependency |

**New npm dependencies:** `pino` only. Everything else uses Node.js built-ins.

## 4. Project Structure

```
MATTERMOST_OPENCLAW_LIVESTATUS/
├── src/
│   ├── status-watcher.js      CREATE     Multiplexed directory watcher + JSONL parser
│   ├── status-box.js          CREATE     Mattermost post manager (shared pool, throttle, circuit breaker)
│   ├── status-formatter.js    CREATE     Pure function: SessionState -> Mattermost text (task 1.6)
│   ├── session-monitor.js     CREATE     Poll sessions.json for new/ended sessions
│   ├── tool-labels.js         CREATE     Pattern-matching tool name -> label resolver
│   ├── config.js              CREATE     Centralized env-var config with validation
│   ├── logger.js              CREATE     pino wrapper (structured JSON logging)
│   ├── circuit-breaker.js     CREATE     Circuit breaker for API resilience
│   ├── health.js              CREATE     HTTP health endpoint + metrics
│   ├── watcher-manager.js     CREATE     Entrypoint: orchestrates all above, PID file, graceful shutdown
│   ├── tool-labels.json       CREATE     Built-in tool label defaults
│   ├── live-status.js         DEPRECATE  Keep for backward compat; add deprecation warning
│   └── agent-accounts.json    KEEP       Agent ID -> bot account mapping
│
├── hooks/
│   └── status-watcher-hook/
│       ├── HOOK.md            CREATE     events: ["gateway:startup"]
│       └── handler.js         CREATE     Spawns watcher-manager on gateway start
│
├── deploy/
│   ├── status-watcher.service CREATE     systemd unit file
│   └── Dockerfile             CREATE     Container deployment option
│
├── test/
│   ├── unit/                  CREATE     Unit tests (parser, tool-labels, circuit-breaker, throttle)
│   └── integration/           CREATE     Integration tests (lifecycle, restart recovery, sub-agent)
│
├── skill/
│   └── SKILL.md               REWRITE    "Status is automatic, no action needed" (10 lines)
│
├── discoveries/
│   └── README.md              EXISTING   Discovery findings (do not overwrite)
│
├── deploy-to-agents.sh        REWRITE    Installs hook into workspace hooks dir; no AGENTS.md injection
├── install.sh                 REWRITE    npm install + deploy hook + optional gateway restart
├── README.md                  REWRITE    Full v4 documentation
├── package.json               MODIFY     Add pino dependency, test/start/stop/status scripts
└── Makefile                   MODIFY     Update check/test/lint/fmt targets
```

## 5. Dependencies

| Package | Version | Purpose | New/Existing |
| --- | --- | --- | --- |
| pino | ^9.x | Structured JSON logging | NEW |
| node.js | 22.x | Runtime | Existing (system) |
| http, fs, path, child_process | built-in | All other functionality | Existing |

One new npm dependency only. Minimal footprint.

## 6. Data Model

### sessions.json entry (relevant fields)

```json
{
  "agent:main:subagent:uuid": {
    "sessionId": "50dc13ad-...",
    "sessionFile": "50dc13ad-....jsonl",
    "spawnedBy": "agent:main:main",
    "spawnDepth": 1,
    "label": "proj035-planner",
    "channel": "mattermost",
    "groupChannel": "#channelId__botUserId"
  }
}
```

### JSONL event schema

```
type=session      -> id (UUID), version (3), cwd (first line only)
type=message      -> role=user|assistant|toolResult; content[]=text|toolCall|toolResult|thinking
type=custom       -> customType=openclaw.cache-ttl (turn boundary marker)
type=model_change -> provider, modelId
```

### SessionState (in-memory per active session)

```json
{
  "sessionKey": "agent:main:subagent:uuid",
  "sessionFile": "/path/to/{uuid}.jsonl",
  "bytesRead": 4096,
  "statusPostId": "abc123def456",
  "channelId": "yy8agcha...",
  "rootPostId": null,
  "startTime": 1772897576000,
  "lastActivity": 1772897590000,
  "pendingToolCalls": 0,
  "lines": ["[15:21] Reading file... done"],
  "subAgentKeys": ["agent:main:subagent:child-uuid"],
  "parentSessionKey": null,
  "complete": false
}
```

### Configuration (env vars)

```
MM_TOKEN                    (required)  Mattermost bot token
MM_URL                      (required)  Mattermost base URL
TRANSCRIPT_DIR              (required)  Path to agent sessions directory
SESSIONS_JSON               (required)  Path to sessions.json
THROTTLE_MS                 500         Min interval between Mattermost updates
IDLE_TIMEOUT_S              60          Inactivity before marking session complete
MAX_SESSION_DURATION_S      1800        Hard timeout for any session (30 min)
MAX_STATUS_LINES            15          Max lines in status box (oldest dropped)
MAX_ACTIVE_SESSIONS         20          Bounded concurrency for status boxes
MAX_MESSAGE_CHARS           15000       Mattermost post truncation limit
HEALTH_PORT                 9090        Health check HTTP port (0 = disabled)
LOG_LEVEL                   info        Logging level
CIRCUIT_BREAKER_THRESHOLD   5           Consecutive failures to open circuit
CIRCUIT_BREAKER_COOLDOWN_S  30          Cooldown before half-open probe
PID_FILE                    /tmp/status-watcher.pid
TOOL_LABELS_FILE            null        Optional external tool labels JSON override
DEFAULT_CHANNEL             null        Fallback channel for non-MM sessions (null = skip)
```

### Status box format (rendered Mattermost text)

```
[ACTIVE] main | 38s
Reading live-status source code...
  exec: ls /agents/sessions [OK]
Analyzing agent configurations...
  exec: grep -r live-status [OK]
Writing new implementation...
  Sub-agent: proj035-planner
    Reading protocol...
    Analyzing JSONL format...
    [DONE] 28s
Plan ready. Awaiting approval.
[DONE] 53s | 12.4k tokens
```

## 7. Task Checklist

### Phase 0: Repo Sync + Environment Verification ⏱️ 30min

> Parallelizable: no | Dependencies: none

- [ ] 0.1: Sync workspace live-status.js (283-line v2) to remote repo via git push → remote matches workspace copy
- [ ] 0.2: Fix existing lint errors in live-status.js (43 issues: empty catch blocks, console statements): replace empty catches with error logging, add eslint-disable comments for intentional console.log → `make lint` passes
- [ ] 0.3: Run `make check` and verify all Makefile targets pass (lint/fmt/test/secret-scan) → clean run, zero failures
- [ ] 0.4: Verify `pino` is available via npm: add to package.json and run `npm install` → confirm installs cleanly
- [ ] 0.5: Create `src/tool-labels.json` with initial tool->label mapping (all known tools from agent-accounts + TOOLS.md) → file exists, valid JSON
- [ ] 0.6: Document exact transcript directory path and sessions.json path from the running gateway → constants confirmed for config.js (transcript dir: /home/node/.openclaw/agents/{agent}/sessions/, sessions.json: same directory)

### Phase 1: Core Components ⏱️ 8-12h

> Parallelizable: partially (config/logger/circuit-breaker are independent) | Dependencies: Phase 0

- [ ] 1.1: Create `src/config.js`: reads all env vars with validation; throws a clear error on missing required vars; exports typed config object → unit testable, fails fast
- [ ] 1.2: Create `src/logger.js`: pino wrapper with default config (JSON output, leveled); singleton; session-scoped child loggers via `logger.child({sessionKey})` → used by all modules
- [ ] 1.3: Create `src/circuit-breaker.js`: state machine (closed/open/half-open), configurable threshold and cooldown, callbacks for state changes → unit tested with simulated failures
- [ ] 1.4: Create `src/tool-labels.js`: loads `tool-labels.json`; supports exact match, prefix match (e.g. `camofox_*`), regex match; default label "Working..."; configurable external override file → unit tested with 20+ tool names
- [ ] 1.5: Create `src/status-box.js`, the Mattermost post manager:
  - Shared `http.Agent` (keepAlive, maxSockets=4)
  - `createPost(channelId, text, rootId?)` -> postId
  - `updatePost(postId, text)` -> void
  - Throttle: leading edge fires immediately, trailing flush after THROTTLE_MS; coalesce intermediate updates
  - Message size guard: truncate to MAX_MESSAGE_CHARS
  - Circuit breaker wrapping all API calls
  - Retry with exponential backoff on 429/5xx (up to 3 retries)
  - Structured logs for every API call

  → unit tested with mock HTTP server
- [ ] 1.6: Create `src/status-formatter.js`: pure function; input: SessionState; output: formatted Mattermost text string (compact, MAX_STATUS_LINES, sub-agent nesting, status prefix, timestamps) → unit tested with varied inputs
- [ ] 1.7: Create `src/health.js`: HTTP server on HEALTH_PORT; GET /health returns JSON {status, activeSessions, uptime, lastError, metrics: {updates_sent, updates_failed, circuit_state, queue_depth}} → manually tested with curl
- [ ] 1.8: Create `src/status-watcher.js`, the core JSONL watcher:
  - fs.watch on TRANSCRIPT_DIR (recursive)
  - On file change event: determine which sessionKey owns the file (via filename->sessionKey map built from sessions.json)
  - Read new bytes from the stored offset (bytesRead); split on newlines; parse JSONL
  - Map parsed events to SessionState updates:
    - toolCall -> increment pendingToolCalls, add status line
    - toolResult -> decrement pendingToolCalls, update status line with result
    - assistant text -> add status line (truncated to 80 chars)
    - turn boundary (cache-ttl custom) -> flush status update
  - Detect file truncation (stat.size < bytesRead) -> reset offset, log warning
  - Debounce updates via status-box.js throttle
  - Idle detection: when pendingToolCalls==0 and no new lines for IDLE_TIMEOUT_S

  → integration tested with real JSONL sample files
- [ ] 1.9: Unit test suite (`test/unit/`): parser, tool-labels, circuit-breaker, throttle, status-formatter → `npm test` green

### Phase 2: Session Monitor + Lifecycle ⏱️ 4-6h

> Parallelizable: no | Dependencies: Phase 1

- [ ] 2.1: Create `src/session-monitor.js`, which polls sessions.json every 2s:
  - Diffs prev vs current to detect added/removed sessions
  - Emits `session-added` with {sessionKey, sessionFile, spawnedBy, channelId, rootPostId}
  - Emits `session-removed` with sessionKey
  - Resolves channelId from session key (format: `agent:main:mattermost:channel:{id}:...`)
  - Resolves rootPostId from session key (format: `...thread:{id}`)
  - Falls back to DEFAULT_CHANNEL for non-MM sessions (or null to skip)

  → integration tested with mock sessions.json writes
- [ ] 2.2: Persist session offsets to disk: on each status update, write { sessionKey: bytesRead } to `/tmp/status-watcher-offsets.json`; on startup, load and restore existing sessions → restart recovery working
- [ ] 2.3: Post recovery on restart: on startup, for each restored session, search channel history for status post with marker comment ``; if found, resume updating it; if not, create new post → tested by killing and restarting daemon mid-session
- [ ] 2.4: Create `src/watcher-manager.js`, the top-level orchestrator:
  - Starts session-monitor and status-watcher
  - On session-added: create SessionState, link to parent if spawnedBy set, add to status-watcher watch list
  - On session-removed: schedule idle cleanup (allow final flush)
  - Enforces MAX_ACTIVE_SESSIONS (drops lowest-priority session if over limit, logs warning)
  - Writes/reads PID file
  - Registers SIGTERM/SIGINT handlers:
    - On signal: mark all active status boxes "interrupted", flush all pending updates, remove PID file, exit 0
  - CLI: `node watcher-manager.js start|stop|status` → process management

  → smoke tested end-to-end
- [ ] 2.5: Integration test suite (`test/integration/`): lifecycle events, restart recovery → `npm run test:integration` green

### Phase 3: Sub-Agent Support ⏱️ 3-4h

> Parallelizable: no | Dependencies: Phase 2

- [ ] 3.1: Sub-agent detection: session-monitor detects entries with `spawnedBy` field; links child SessionState to parent via `parentSessionKey` → linked correctly
- [ ] 3.2: Nested status rendering: status-formatter renders sub-agent lines as indented block under parent status; sub-agent summary: label + elapsed + final status → visible in Mattermost as nested
- [ ] 3.3: Cascade completion: parent session's idle detection checks that all child sessions are complete before marking parent done → no premature parent completion
- [ ] 3.4: Sub-agent status post reuse: sub-agents do not create new top-level posts; their status is embedded in the parent post body → only one post per parent session visible in channel
- [ ] 3.5: Integration test: spawn mock sub-agent transcript, verify parent status box shows nested child progress → manual verification in Mattermost

### Phase 4: Hook Integration ⏱️ 1h

> Parallelizable: no | Dependencies: Phase 2 (watcher-manager CLI working) and Phase 3

- [ ] 4.1: Create `hooks/status-watcher-hook/HOOK.md`: events: ["gateway:startup"], description, required env vars listed → OpenClaw discovers hook
- [ ] 4.2: Create `hooks/status-watcher-hook/handler.js`: on gateway:startup, check if watcher already running (PID file); if not, spawn `node watcher-manager.js start` as detached background process → watcher auto-starts with gateway
- [ ] 4.3: Deploy hook to workspace: `cp -r hooks/status-watcher-hook /home/node/.openclaw/workspace/hooks/` → hook in place
- [ ] 4.4: Test: gateway restart -> watcher starts, PID file written, health endpoint responds → verified

### Phase 5: Polish + Deployment ⏱️ 3-4h

> Parallelizable: yes (docs, deploy scripts, skill rewrite are independent) | Dependencies: Phase 4

- [ ] 5.1: Rewrite `skill/SKILL.md` as a 10-line file: "Live status updates are automatic. You do not need to call live-status manually. Focus on your task." → no protocol injection
- [ ] 5.2: Rewrite `deploy-to-agents.sh`: remove AGENTS.md injection; deploy hook; npm install; optionally restart gateway → one-command deploy
- [ ] 5.3: Rewrite `install.sh`: npm install (installs pino); deploy hook; print post-install instructions including env vars required → clean install flow
- [ ] 5.4: Create `deploy/status-watcher.service`: systemd unit file for standalone deployment (non-hook mode); uses env file at `/etc/status-watcher.env` → usable with systemctl
- [ ] 5.5: Create `deploy/Dockerfile`: FROM node:22-alpine; COPY src/ test/; RUN npm install; CMD ["node", "watcher-manager.js", "start"] → containerized deployment option
- [ ] 5.6: Update `src/live-status.js`: add startup deprecation warning "NOTE: live-status CLI is deprecated as of v4. Status updates are now automatic."; add `start-watcher` and `stop-watcher` pass-through commands → backward compat maintained
- [ ] 5.7: Handle session compaction edge case: add test with truncated JSONL file; verify watcher resets offset and continues without crash → no data loss
- [ ] 5.8: Write `README.md`: architecture diagram (ASCII), install steps, config reference, upgrade guide from v1, troubleshooting → complete documentation
- [ ] 5.9: Run `make check` → zero lint/format errors; `npm test` → green

### Phase 6: Remove v1 Injection from AGENTS.md ⏱️ 30min

> Parallelizable: no | Dependencies: Phase 5 fully verified + watcher confirmed running
>
> SAFETY: Do not execute this phase until watcher has been running successfully for at least 1 hour

- [ ] 6.1: Verify watcher is running: check PID file, health endpoint, and at least one real status box update → confirmed working before touching AGENTS.md
- [ ] 6.2: Remove "Live Status Protocol (MANDATORY)" section from main AGENTS.md → section removed
- [ ] 6.3: Remove from all other agent AGENTS.md files (coder-agent, xen, global-calendar, nutrition-agent, gym-designer) → all cleaned up
- [ ] 6.4: Commit AGENTS.md changes with message "feat: remove v1 live-status injection (v4 watcher active)" → change tracked

## 8. Testing Strategy

| What | Type | How | Success Criteria |
| --- | --- | --- | --- |
| config.js | Unit | Env var injection, missing var detection | Throws on missing required vars; correct defaults |
| logger.js | Unit | Log output format | JSON output, levels respected |
| circuit-breaker.js | Unit | Simulate N failures, verify state transitions | Open after threshold, half-open after cooldown |
| tool-labels.js | Unit | 20+ tool names (exact, prefix, regex, unmapped) | Correct labels returned; default for unknown |
| status-formatter.js | Unit | Various SessionState inputs | Correct compact output; MAX_STATUS_LINES enforced |
| status-box.js | Unit | Mock HTTP server | create/update called correctly; throttle works; circuit fires |
| session-monitor.js | Integration | Write test sessions.json; verify events emitted | session-added/removed within 2s |
| status-watcher.js | Integration | Append to JSONL file; verify Mattermost update | Update within 1.5s of new line |
| Idle detection | Integration | Stop writing; verify complete after IDLE_TIMEOUT_S + 5s | Status box marked done |
| Session compaction | Integration | Truncate JSONL file mid-session | No crash; offset reset; no duplicate events |
| Restart recovery | Integration | Kill daemon mid-session; restart | Existing post updated, not new post created |
| Sub-agent nesting | Integration | Mock parent + child transcripts | Child visible in parent status box |
| Cascade completion | Integration | Child completes; verify parent waits | Parent marks done after last child |
| Health endpoint | Manual | curl localhost:9090/health | JSON with correct metrics |
| E2E smoke test | Manual | Real agent task in Mattermost | Real-time updates; no spam; done on completion |

## 9. Risks & Mitigations

| Risk | Impact | Mitigation |
| --- | --- | --- |
| fs.watch recursive not reliable on this kernel | High | Detect at startup; fall back to polling if watch fails (setInterval 2s on directory listing) |
| sessions.json write race causes parse error | Medium | Try/catch on JSON.parse; retry next poll cycle; log warning |
| Mattermost rate limit (10 req/s default) | Medium | Throttle to max 2 req/s per session; circuit breaker; exponential backoff on 429 |
| Session compaction truncates JSONL | Medium | Detect stat.size < bytesRead on each read; reset offset; dedup by tracking last processed line index |
| Multiple gateway restarts create duplicate watchers | Medium | PID file check + SIGTERM old process before spawning new one |
| Non-MM sessions (hook, cron) generate noise | Low | Channel resolver returns null; watcher skips session gracefully |
| pino dependency unavailable | Low | If npm install fails, fall back to console.log (degrade gracefully, log warning) |
| Status box exceeds Mattermost post size limit | Low | Hard truncate at MAX_MESSAGE_CHARS (15000); tested with message size guard |
| JSONL format changes in future OpenClaw | Medium | Abstract parser behind EventParser interface; version check on session record |
| Daemon crashes mid-session | Medium | Health check via systemd/Docker; restart policy; offset persistence enables recovery |

## 10. Effort Estimate

| Phase | Time | Can Parallelize? | Depends On |
| --- | --- | --- | --- |
| Phase 0: Repo + Env Verification | 30min | No | None |
| Phase 1: Core Components | 8-12h | Partially (config/logger/circuit-breaker) | Phase 0 |
| Phase 2: Session Monitor + Lifecycle | 4-6h | No | Phase 1 |
| Phase 3: Sub-Agent Support | 3-4h | No | Phase 2 |
| Phase 4: Hook Integration | 1h | No | Phase 2+3 |
| Phase 5: Polish + Deployment | 3-4h | Yes (docs, deploy, skill) | Phase 4 |
| Phase 6: Remove v1 AGENTS.md Injection | 30min | No | Phase 5 verified |
| **Total** | **20-28h** | | |

## 11. Open Questions

All questions have defaults that allow execution to proceed without answers.

- [ ] **Q1 (informational): Idle timeout tuning.** The 60s default may still cause premature completion for very long exec calls (e.g., a 3-minute build). The smart heuristic (pendingToolCalls tracking) should handle this correctly, but production data may reveal edge cases. **Default:** Use smart heuristic (pendingToolCalls + IDLE_TIMEOUT_S=60). Log false positives for tuning.
- [ ] **Q2 (informational): Non-MM session behavior.** Hook sessions, cron sessions, and xen sessions don't have a Mattermost channel. Currently skipped. **Default:** Skip non-MM sessions (no status box). Log at debug level. Can revisit in a later phase.
- [ ] **Q3 (informational): Status box per-request vs per-session.** Currently: one status box per user message (reset on new user turn). This is the most natural UX. **Default:** Per-request. A new user message starts a new status cycle. Works correctly with smart idle detection.
- [ ] **Q4 (informational): Compaction dedup strategy.** When the JSONL file is truncated, we reset the offset and re-read, so we may re-process events already posted to Mattermost. **Default:** Track the last processed line count (not just the byte offset) and skip lines already processed on re-read. Alternatively, detect compaction and do not re-append old events (since they were already shown). Simplest: mark the box as "session compacted - continuing" and reset the visible lines in the status box.
- [ ] **Q5 (blocking if no): AGENTS.md modification scope.** Phase 6 removes the Live Status Protocol section from all agent AGENTS.md files. Confirm Rooh wants all instances removed (not just the main agent). **Default if not answered:** Remove from all agents. This is the stated goal: removing v1 injection everywhere.
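
For Q4, the "simplest" default (detect truncation, reset the offset, reset the visible lines) could look roughly like the sketch below. It is illustrative only: `readNewEvents` and the reduced `{ bytesRead, lines }` state shape are hypothetical names, not the planned `status-watcher.js` API, and it assumes compaction always makes the file smaller than the stored offset. The offset only advances to the last complete newline, so a half-written JSONL line is re-read on the next call rather than parsed early.

```javascript
const fs = require('node:fs');

// Marker text from the Q4 default. Everything else here is a hypothetical sketch.
const COMPACTION_NOTE = 'session compacted - continuing';

// Returns the parsed events for every complete line appended since the last call.
// `state` is a reduced, hypothetical subset of SessionState: { bytesRead, lines }.
function readNewEvents(filePath, state) {
  const { size } = fs.statSync(filePath);

  if (size < state.bytesRead) {
    // File shrank: assume compaction, restart from byte 0, and reset the
    // visible lines instead of trying to dedup replayed events.
    state.bytesRead = 0;
    state.lines = [COMPACTION_NOTE];
  }
  if (size === state.bytesRead) return [];

  const length = size - state.bytesRead;
  const buffer = Buffer.alloc(length);
  const fd = fs.openSync(filePath, 'r');
  let read;
  try {
    read = fs.readSync(fd, buffer, 0, length, state.bytesRead);
  } finally {
    fs.closeSync(fd);
  }

  // Only consume up to the last newline; a trailing partial line stays unread.
  const data = buffer.subarray(0, read);
  const lastNewline = data.lastIndexOf(0x0a);
  if (lastNewline === -1) return [];
  state.bytesRead += lastNewline + 1;

  const events = [];
  for (const line of data.toString('utf8', 0, lastNewline).split('\n')) {
    if (line.trim() === '') continue;
    try {
      events.push(JSON.parse(line));
    } catch {
      // Skip malformed lines rather than crash the daemon.
    }
  }
  return events;
}
```

A caller would invoke this from the file-change handler and feed the returned events into the SessionState mapping from task 1.8. A compaction that leaves the file at or above the old offset would not be caught by the size check alone, which is why the line-count default in Q4 may still be worth implementing.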