
sol 43cfebee96 feat: Phase 0+1 — repo sync, pino, lint fixes, core components

Phase 0:
- Synced latest live-status.js from workspace (9928 bytes)
- Fixed 43 lint issues: empty catch blocks, console statements
- Added pino dependency
- Created src/tool-labels.json with all known tool mappings
- make check passes

Phase 1 (Core Components):
- src/config.js: env-var config with validation, throws on missing required vars
- src/logger.js: pino singleton with child loggers, level validation
- src/circuit-breaker.js: CLOSED/OPEN/HALF_OPEN state machine with callbacks
- src/tool-labels.js: exact/prefix/regex tool->label resolver with external override
- src/status-box.js: Mattermost post manager (keepAlive, throttle, retry, circuit breaker)
- src/status-formatter.js: pure SessionState->text formatter (nested, compact)
- src/health.js: HTTP health endpoint + metrics
- src/status-watcher.js: JSONL file watcher (inotify, compaction detection, idle detection)

Tests:
- test/unit/config.test.js: 7 tests
- test/unit/circuit-breaker.test.js: 12 tests
- test/unit/logger.test.js: 5 tests
- test/unit/status-formatter.test.js: 20 tests
- test/unit/tool-labels.test.js: 15 tests

All 59 unit tests pass. make check clean.

2026-03-07 17:26:53 +00:00


Implementation Plan: Live Status v4 (Production-Grade)

Generated: 2026-03-07 | Agent: planner:proj035-v2 | Status: DRAFT
Revised: Incorporates production-grade changes from scalability/efficiency review (comment #11402)

1. Goal

Replace the broken agent-cooperative live-status system (v1) with a transparent infrastructure-level daemon that tails OpenClaw JSONL transcript files in real-time and auto-updates Mattermost status boxes — zero agent cooperation required. Sub-agents become visible. Final-response spam is eliminated. Sessions never lose state. A single multiplexed daemon handles all concurrent sessions efficiently.

2. Architecture

OpenClaw Gateway
  Agent Sessions (main, coder-agent, sub-agents, hooks...)
    -> writes {uuid}.jsonl as they work

  status-watcher daemon (SINGLE PROCESS — not per-session)
    -> fs.watch recursive on transcript directory (inotify, Node 22)
    -> Multiplexes all active session transcripts
    -> SessionState map: sessionKey -> { statusPostId, bytesRead, pendingToolCalls, lines[] }
    -> Shared HTTP connection pool (keep-alive, maxSockets=4)
    -> Throttled Mattermost updates (leading edge + trailing flush, 500ms)
    -> Bounded concurrency: max N active status boxes (configurable, default 20)
    -> Structured JSON logging (pino)
    -> Graceful shutdown (SIGTERM/SIGINT -> mark all boxes "interrupted")
    -> Circuit breaker for Mattermost API failures

  Sub-agent transcripts
    -> Session key pattern: agent:{id}:subagent:{uuid}
    -> Detected automatically by directory watcher
    -> spawnedBy field in sessions.json links child to parent
    -> Nested under parent status box automatically

  sessions.json (runtime registry)
    -> Maps session keys -> { sessionId, spawnedBy, spawnDepth, label, channel }
    -> Polled every 2s for new sessions (supplement to directory watch)

  Mattermost API (slack.solio.tech)
    -> POST /api/v4/posts  -- create status box
    -> PUT  /api/v4/posts/{id} -- update in-place (confirmed: no edit time limit)
    -> Shared http.Agent with keepAlive: true, maxSockets: 4
    -> Circuit breaker: open after 5 failures, 30s cooldown, half-open probe
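
The circuit breaker behaviour listed above (open after 5 failures, 30s cooldown, half-open probe) could look roughly like the sketch below. This is a minimal illustration, not the contents of src/circuit-breaker.js: the class shape, option names, and the injectable clock are assumptions.

```javascript
// Minimal circuit breaker sketch. Class name, option names, and the
// injectable clock (`now`) are assumptions made for illustration.
class CircuitBreaker {
  constructor({ threshold = 5, cooldownMs = 30_000, now = Date.now } = {}) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
    this.state = 'CLOSED';
    this.failures = 0;
    this.openedAt = 0;
  }

  // Returns true if a request may be attempted right now.
  canRequest() {
    if (this.state === 'OPEN' && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = 'HALF_OPEN'; // cooldown elapsed: let probe requests through
    }
    return this.state !== 'OPEN';
  }

  recordSuccess() {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  recordFailure() {
    this.failures += 1;
    // A failed probe, or too many consecutive failures, (re)opens the circuit.
    if (this.state === 'HALF_OPEN' || this.failures >= this.threshold) {
      this.state = 'OPEN';
      this.openedAt = this.now();
    }
  }
}
```

A caller would check canRequest() before each Mattermost call and report the outcome with recordSuccess() or recordFailure().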

Key Design Decisions (from discovery)

Single multiplexed daemon vs per-session daemons. Eliminates unbounded process spawning. One V8 heap, one connection pool, one point of control. Scales to 30+ concurrent sessions without linear process overhead.
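fs.watch recursive on transcript directory. Node 22 supports recursive fs.watch on Linux, backed by inotify (inotify itself is per-directory, so Node registers a watch for each subdirectory internally). One fs.watch call covers all sessions. No polling fallback needed for the watch itself.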
Poll sessions.json every 2s. fs.watch on JSON files is unreliable on Linux (writes may not trigger events). Poll to detect new sessions reliably.
Smart idle detection via pendingToolCalls. Do not use a naive 30s timeout. Track tool_call / tool_result pairs. A session is idle only when pendingToolCalls==0 AND no new lines have arrived for IDLE_TIMEOUT_S seconds (default 60).
Leading edge + trailing flush throttle. First event fires immediately (responsiveness). Subsequent events batched. Guaranteed final flush when activity stops (no lost updates).
Mattermost post editing has no time limit. PostEditTimeLimit=-1 confirmed on this server. No workarounds needed.
All config via environment variables. No hardcoded tokens, no sed replacement during install. Clean, 12-factor-style config.
pino for structured logging. Fast, JSON output, leveled. Production-debuggable.
Circuit breaker for Mattermost API. Prevents cascading failures during Mattermost outages. Bounded retry queue (max 100 entries).
JSONL format is stable. Version 3 confirmed. Parser abstracts format behind interface for future-proofing.
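
As an illustration of the "leading edge + trailing flush" decision above, a throttle along these lines would send the first update immediately, coalesce later ones, and flush the newest value when the window closes. createThrottle and its parameters are hypothetical names; the timer functions are injectable only so the sketch can be exercised without real delays.

```javascript
// Leading-edge + trailing-flush throttle sketch. createThrottle and its
// parameters are hypothetical; `timers` is injectable for testing.
function createThrottle(send, intervalMs = 500, timers = { setTimeout }) {
  let timer = null;       // non-null while a throttle window is open
  let pending = null;     // newest coalesced value awaiting the trailing flush
  let hasPending = false;

  function openWindow() {
    timer = timers.setTimeout(() => {
      timer = null;
      if (hasPending) {
        const value = pending;
        pending = null;
        hasPending = false;
        send(value);      // trailing flush: the last update is never lost
        openWindow();     // the flush itself starts a new window
      }
    }, intervalMs);
  }

  return function update(value) {
    if (timer === null) {
      send(value);        // leading edge: fire immediately
      openWindow();
    } else {
      pending = value;    // coalesce: only the newest value survives
      hasPending = true;
    }
  };
}
```

With the default 500ms interval this caps a busy session at roughly two Mattermost updates per second while still delivering the final state.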

3. Tech Stack

Layer	Technology	Version	Reason
Runtime	Node.js	22.x (system)	Already installed; inotify recursive fs.watch supported
File watching	fs.watch recursive	built-in	inotify on Linux/Node22; efficient, no polling
Session discovery	setInterval poll	built-in	sessions.json polling for new session detection
HTTP client	http.Agent	built-in	keepAlive, maxSockets; no extra dependency
Structured logging	pino	^9.x	Fast JSON logging; single new dependency
Config	process.env	built-in	12-factor; validated at startup
Health check	http.createServer	built-in	Lightweight health endpoint
Process management	PID file + signals	built-in	Simple, no supervisor dependency

New npm dependencies: pino only. Everything else uses Node.js built-ins.

4. Project Structure

MATTERMOST_OPENCLAW_LIVESTATUS/
├── src/
│   ├── status-watcher.js      CREATE  Multiplexed directory watcher + JSONL parser
│   ├── status-box.js          CREATE  Mattermost post manager (shared pool, throttle, circuit breaker)
│   ├── session-monitor.js     CREATE  Poll sessions.json for new/ended sessions
│   ├── tool-labels.js         CREATE  Pattern-matching tool name -> label resolver
│   ├── config.js              CREATE  Centralized env-var config with validation
│   ├── logger.js              CREATE  pino wrapper (structured JSON logging)
│   ├── circuit-breaker.js     CREATE  Circuit breaker for API resilience
│   ├── health.js              CREATE  HTTP health endpoint + metrics
│   ├── watcher-manager.js     CREATE  Entrypoint: orchestrates all above, PID file, graceful shutdown
│   ├── tool-labels.json       CREATE  Built-in tool label defaults
│   ├── live-status.js         DEPRECATE  Keep for backward compat; add deprecation warning
│   └── agent-accounts.json    KEEP    Agent ID -> bot account mapping
│
├── hooks/
│   └── status-watcher-hook/
│       ├── HOOK.md            CREATE  events: ["gateway:startup"]
│       └── handler.js         CREATE  Spawns watcher-manager on gateway start
│
├── deploy/
│   ├── status-watcher.service CREATE  systemd unit file
│   └── Dockerfile             CREATE  Container deployment option
│
├── test/
│   ├── unit/                  CREATE  Unit tests (parser, tool-labels, circuit-breaker, throttle)
│   └── integration/           CREATE  Integration tests (lifecycle, restart recovery, sub-agent)
│
├── skill/
│   └── SKILL.md               REWRITE  "Status is automatic, no action needed" (10 lines)
│
├── discoveries/
│   └── README.md              EXISTING  Discovery findings (do not overwrite)
│
├── deploy-to-agents.sh        REWRITE  Installs hook into workspace hooks dir; no AGENTS.md injection
├── install.sh                 REWRITE  npm install + deploy hook + optional gateway restart
├── README.md                  REWRITE  Full v4 documentation
├── package.json               MODIFY   Add pino dependency, test/start/stop/status scripts
└── Makefile                   MODIFY   Update check/test/lint/fmt targets

5. Dependencies

Package	Version	Purpose	New/Existing
pino	^9.x	Structured JSON logging	NEW
node.js	22.x	Runtime	Existing (system)
http, fs, path, child_process	built-in	All other functionality	Existing

One new npm dependency only. Minimal footprint.

6. Data Model

sessions.json entry (relevant fields)

{
  "agent:main:subagent:uuid": {
    "sessionId": "50dc13ad-...",
    "sessionFile": "50dc13ad-....jsonl",
    "spawnedBy": "agent:main:main",
    "spawnDepth": 1,
    "label": "proj035-planner",
    "channel": "mattermost",
    "groupChannel": "#channelId__botUserId"
  }
}
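
To show how entries of this shape could drive file ownership and parent/child linking, here is a small sketch. indexSessions is a hypothetical helper, and it relies only on the sessionFile and spawnedBy fields shown above.

```javascript
// Sketch: index sessions.json entries by transcript filename and by parent.
// indexSessions is a hypothetical helper, not part of the plan's API.
function indexSessions(sessions) {
  const fileToKey = new Map();  // transcript filename -> session key
  const childrenOf = new Map(); // parent session key -> [child session keys]
  for (const [sessionKey, entry] of Object.entries(sessions)) {
    if (entry.sessionFile) fileToKey.set(entry.sessionFile, sessionKey);
    if (entry.spawnedBy) {
      if (!childrenOf.has(entry.spawnedBy)) childrenOf.set(entry.spawnedBy, []);
      childrenOf.get(entry.spawnedBy).push(sessionKey);
    }
  }
  return { fileToKey, childrenOf };
}
```

The watcher can then map a changed filename back to its session key in one lookup, and the formatter can find the children to nest under a parent.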

JSONL event schema

type=session      -> id (UUID), version (3), cwd — first line only
type=message      -> role=user|assistant|toolResult; content[]=text|toolCall|toolResult|thinking
type=custom       -> customType=openclaw.cache-ttl (turn boundary marker)
type=model_change -> provider, modelId
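
A rough sketch of folding one transcript line into session state follows. It assumes message records are flat objects of the form { type, role, content: [{ type, name, text }] }; the schema summary above does not show the exact nesting, so treat the field access as a placeholder. applyLine is a hypothetical name.

```javascript
// Sketch: fold one JSONL line into a SessionState-like object.
// ASSUMPTION: message records are flat ({ type, role, content: [...] }) and
// content parts carry `type`, `name`, and `text`; the real layout may differ.
function applyLine(state, line) {
  let record;
  try {
    record = JSON.parse(line);
  } catch {
    return state; // partial or corrupt line: skip it
  }
  if (record.type !== 'message' || !Array.isArray(record.content)) return state;

  if (record.role === 'toolResult') {
    // A result closes one outstanding tool call (never below zero).
    state.pendingToolCalls = Math.max(0, state.pendingToolCalls - 1);
    return state;
  }
  for (const part of record.content) {
    if (part.type === 'toolCall') {
      state.pendingToolCalls += 1;
      state.lines.push(`${part.name}...`);
    } else if (part.type === 'toolResult') {
      state.pendingToolCalls = Math.max(0, state.pendingToolCalls - 1);
    } else if (part.type === 'text' && record.role === 'assistant') {
      state.lines.push(String(part.text).slice(0, 80)); // 80-char cap, as in task 1.8
    }
  }
  return state;
}
```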

SessionState (in-memory per active session)

{
  "sessionKey": "agent:main:subagent:uuid",
  "sessionFile": "/path/to/{uuid}.jsonl",
  "bytesRead": 4096,
  "statusPostId": "abc123def456",
  "channelId": "yy8agcha...",
  "rootPostId": null,
  "startTime": 1772897576000,
  "lastActivity": 1772897590000,
  "pendingToolCalls": 0,
  "lines": ["[15:21] Reading file... done", ...],
  "subAgentKeys": ["agent:main:subagent:child-uuid"],
  "parentSessionKey": null,
  "complete": false
}

Configuration (env vars)

MM_TOKEN              (required) Mattermost bot token
MM_URL                (required) Mattermost base URL
TRANSCRIPT_DIR        (required) Path to agent sessions directory
SESSIONS_JSON         (required) Path to sessions.json
THROTTLE_MS           500        Min interval between Mattermost updates
IDLE_TIMEOUT_S        60         Inactivity before marking session complete
MAX_SESSION_DURATION_S 1800      Hard timeout for any session (30 min)
MAX_STATUS_LINES      15         Max lines in status box (oldest dropped)
MAX_ACTIVE_SESSIONS   20         Bounded concurrency for status boxes
MAX_MESSAGE_CHARS     15000      Mattermost post truncation limit
HEALTH_PORT           9090       Health check HTTP port (0 = disabled)
LOG_LEVEL             info       Logging level
CIRCUIT_BREAKER_THRESHOLD 5     Consecutive failures to open circuit
CIRCUIT_BREAKER_COOLDOWN_S 30   Cooldown before half-open probe
PID_FILE              /tmp/status-watcher.pid
TOOL_LABELS_FILE      null       Optional external tool labels JSON override
DEFAULT_CHANNEL       null       Fallback channel for non-MM sessions (null = skip)
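
A minimal sketch of fail-fast loading for the variables above. The variable names and defaults come from the table; loadConfig itself and the shape of the returned object are assumptions and may not match src/config.js.

```javascript
// Sketch of env-var loading with fail-fast validation. loadConfig and the
// returned object shape are assumptions; names and defaults follow the table.
const REQUIRED = ['MM_TOKEN', 'MM_URL', 'TRANSCRIPT_DIR', 'SESSIONS_JSON'];
const NUMERIC_DEFAULTS = {
  THROTTLE_MS: 500,
  IDLE_TIMEOUT_S: 60,
  MAX_SESSION_DURATION_S: 1800,
  MAX_STATUS_LINES: 15,
  MAX_ACTIVE_SESSIONS: 20,
  MAX_MESSAGE_CHARS: 15000,
  HEALTH_PORT: 9090,
  CIRCUIT_BREAKER_THRESHOLD: 5,
  CIRCUIT_BREAKER_COOLDOWN_S: 30,
};

function loadConfig(env = process.env) {
  const missing = REQUIRED.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required env vars: ${missing.join(', ')}`);
  }
  const config = {};
  for (const name of REQUIRED) config[name] = env[name];
  for (const [name, fallback] of Object.entries(NUMERIC_DEFAULTS)) {
    const raw = env[name];
    const value = raw === undefined || raw === '' ? fallback : Number(raw);
    if (!Number.isInteger(value) || value < 0) {
      throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
    }
    config[name] = value;
  }
  config.LOG_LEVEL = env.LOG_LEVEL || 'info';
  config.PID_FILE = env.PID_FILE || '/tmp/status-watcher.pid';
  config.TOOL_LABELS_FILE = env.TOOL_LABELS_FILE || null;
  config.DEFAULT_CHANNEL = env.DEFAULT_CHANNEL || null;
  return config;
}
```

Passing the environment as an argument keeps the function unit-testable without mutating process.env.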

Status box format (rendered Mattermost text)

[ACTIVE] main | 38s
Reading live-status source code...
  exec: ls /agents/sessions [OK]
Analyzing agent configurations...
  exec: grep -r live-status [OK]
Writing new implementation...
  Sub-agent: proj035-planner
    Reading protocol...
    Analyzing JSONL format...
    [DONE] 28s
Plan ready. Awaiting approval.
[DONE] 53s | 12.4k tokens
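
One way a pure formatter could produce the header, body, and nested sub-agent block shown above. formatStatus, the label field, and the children array are hypothetical additions to the SessionState shape; token counts and the closing footer line are left out of this sketch.

```javascript
// Sketch of a pure formatter. formatStatus plus the `label` and `children`
// fields are hypothetical; the layout mirrors the sample status box.
function formatStatus(state, now = Date.now(), maxLines = 15) {
  const end = state.complete ? state.lastActivity : now;
  const elapsed = `${Math.round((end - state.startTime) / 1000)}s`;
  const out = [`[${state.complete ? 'DONE' : 'ACTIVE'}] ${state.label} | ${elapsed}`];

  // Keep only the newest maxLines entries (oldest dropped).
  for (const line of state.lines.slice(-maxLines)) out.push(line);

  // Nest each sub-agent as an indented block under the parent.
  for (const child of state.children || []) {
    out.push(`  Sub-agent: ${child.label}`);
    for (const line of child.lines.slice(-maxLines)) out.push(`    ${line}`);
    if (child.complete) {
      const childS = Math.round((child.lastActivity - child.startTime) / 1000);
      out.push(`    [DONE] ${childS}s`);
    }
  }
  return out.join('\n');
}
```

Because the function takes the clock as an argument and touches no I/O, the same input always yields the same text.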

7. Task Checklist

Phase 0: Repo Sync + Environment Verification ⏱️ 30min

Parallelizable: no | Dependencies: none

0.1: Sync workspace live-status.js (283-line v2) to remote repo — git push → remote matches workspace copy
0.2: Fix existing lint errors in live-status.js (43 issues: empty catch blocks, console statements) — replace empty catches with error logging, add eslint-disable comments for intentional console.log → make lint passes
0.3: Run make check — verify all Makefile targets pass (lint/fmt/test/secret-scan) → clean run, zero failures
0.4: Verify pino available via npm — add to package.json and npm install → confirm installs cleanly
0.5: Create src/tool-labels.json with initial tool->label mapping (all known tools from agent-accounts + TOOLS.md) → file exists, valid JSON
0.6: Document exact transcript directory path and sessions.json path from the running gateway → constants confirmed for config.js (transcript dir: /home/node/.openclaw/agents/{agent}/sessions/; sessions.json lives in the same directory)

Phase 1: Core Components ⏱️ 8-12h

Parallelizable: partially (config/logger/circuit-breaker are independent) | Dependencies: Phase 0

1.1: Create src/config.js — reads all env vars with validation; throws clear error on missing required vars; exports typed config object → unit testable, fails fast
1.2: Create src/logger.js — pino wrapper with default config (JSON output, leveled); singleton; session-scoped child loggers via logger.child({sessionKey}) → used by all modules
1.3: Create src/circuit-breaker.js — state machine (closed/open/half-open), configurable threshold and cooldown, callbacks for state changes → unit tested with simulated failures
1.4: Create src/tool-labels.js — loads tool-labels.json; supports exact match, prefix match (e.g. camofox_*), regex match; default label "Working..."; configurable external override file → unit tested with 20+ tool names
1.5: Create src/status-box.js — Mattermost post manager:
- Shared http.Agent (keepAlive, maxSockets=4)
- createPost(channelId, text, rootId?) -> postId
- updatePost(postId, text) -> void
- Throttle: leading edge fires immediately, trailing flush after THROTTLE_MS; coalesce intermediate updates
- Message size guard: truncate to MAX_MESSAGE_CHARS
- Circuit breaker wrapping all API calls
- Retry with exponential backoff on 429/5xx (up to 3 retries)
- Structured logs for every API call → unit tested with mock HTTP server
1.6: Create src/status-formatter.js — pure function; input: SessionState; output: formatted Mattermost text string (compact, MAX_STATUS_LINES, sub-agent nesting, status prefix, timestamps) → unit tested with varied inputs
1.7: Create src/health.js — HTTP server on HEALTH_PORT; GET /health returns JSON {status, activeSessions, uptime, lastError, metrics: {updates_sent, updates_failed, circuit_state, queue_depth}} → manually tested with curl
1.8: Create src/status-watcher.js — core JSONL watcher:
- fs.watch on TRANSCRIPT_DIR (recursive)
- On file change event: determine which sessionKey owns the file (via filename->sessionKey map built from sessions.json)
- Read new bytes starting at the bytesRead offset; split on newlines; parse JSONL
- Map parsed events to SessionState updates:
  - toolCall -> increment pendingToolCalls, add status line
  - toolResult -> decrement pendingToolCalls, update status line with result
  - assistant text -> add status line (truncated to 80 chars)
  - turn boundary (cache-ttl custom) -> flush status update
- Detect file truncation (stat.size < bytesRead) -> reset offset, log warning
- Debounce updates via status-box.js throttle
- Idle detection: when pendingToolCalls==0 and no new lines for IDLE_TIMEOUT_S → integration tested with real JSONL sample files
1.9: Unit test suite (test/unit/) — parser, tool-labels, circuit-breaker, throttle, status-formatter → npm test green

Phase 2: Session Monitor + Lifecycle ⏱️ 4-6h

Parallelizable: no | Dependencies: Phase 1

2.1: Create src/session-monitor.js — polls sessions.json every 2s:
- Diffs prev vs current to detect added/removed sessions
- Emits session-added with {sessionKey, sessionFile, spawnedBy, channelId, rootPostId}
- Emits session-removed with sessionKey
- Resolves channelId from session key (format: agent:main:mattermost:channel:{id}:...)
- Resolves rootPostId from session key (format: ...thread:{id})
- Falls back to DEFAULT_CHANNEL for non-MM sessions (or null to skip) → integration tested with mock sessions.json writes
2.2: Persist session offsets to disk — on each status update, write { sessionKey: bytesRead } to /tmp/status-watcher-offsets.json; on startup, load and restore existing sessions → restart recovery working
2.3: Post recovery on restart — on startup, for each restored session, search channel history for the status post carrying its marker comment; if found, resume updating it; if not, create new post → tested by killing and restarting daemon mid-session
2.4: Create src/watcher-manager.js — top-level orchestrator:
- Starts session-monitor and status-watcher
- On session-added: create SessionState, link to parent if spawnedBy set, add to status-watcher watch list
- On session-removed: schedule idle cleanup (allow final flush)
- Enforces MAX_ACTIVE_SESSIONS (drops lowest-priority session if over limit, logs warning)
- Writes/reads PID file
- Registers SIGTERM/SIGINT handlers:
  - On signal: mark all active status boxes "interrupted", flush all pending updates, remove PID file, exit 0
- CLI: node watcher-manager.js start|stop|status → process management → smoke tested end-to-end
2.5: Integration test suite (test/integration/) — lifecycle events, restart recovery → npm run test:integration green
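
The snapshot diff and key parsing from task 2.1 could be sketched like this. Function names are hypothetical, and the regular expressions assume the key segments appear exactly as in the quoted formats (channel:{id} and thread:{id}, colon-separated).

```javascript
// Sketch: diff two sessions.json snapshots and extract Mattermost routing
// from a session key. Function names are hypothetical; the patterns assume
// the colon-separated formats quoted in task 2.1.
function diffSessions(prev, curr) {
  const added = Object.keys(curr).filter((key) => !(key in prev));
  const removed = Object.keys(prev).filter((key) => !(key in curr));
  return { added, removed };
}

function resolveRouting(sessionKey, defaultChannel = null) {
  const channel = /(?:^|:)channel:([^:]+)/.exec(sessionKey);
  const thread = /(?:^|:)thread:([^:]+)/.exec(sessionKey);
  return {
    channelId: channel ? channel[1] : defaultChannel, // null means "skip this session"
    rootPostId: thread ? thread[1] : null,
  };
}
```

A poll loop would call diffSessions on each 2s tick and emit session-added or session-removed for the returned keys.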

Phase 3: Sub-Agent Support ⏱️ 3-4h

Parallelizable: no | Dependencies: Phase 2

3.1: Sub-agent detection — session-monitor detects entries with spawnedBy field; links child SessionState to parent via parentSessionKey → linked correctly
3.2: Nested status rendering — status-formatter renders sub-agent lines as indented block under parent status; sub-agent summary: label + elapsed + final status → visible in Mattermost as nested
3.3: Cascade completion — parent session's idle detection checks that all child sessions are complete before marking parent done → no premature parent completion
3.4: Sub-agent status post reuse — sub-agents do not create new top-level posts; their status is embedded in the parent post body → only one post per parent session visible in channel
3.5: Integration test — spawn mock sub-agent transcript, verify parent status box shows nested child progress → manual verification in Mattermost
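
Cascade completion (3.3) combined with the idle rule can be expressed compactly. The helpers below are hypothetical; sessions is assumed to be a Map of sessionKey to SessionState, and only direct children are checked, on the assumption that each child applies the same rule to its own children.

```javascript
// Sketch: a parent may complete only when it is idle and every direct child
// is complete. isIdle and canComplete are hypothetical helpers.
function isIdle(state, now, idleTimeoutMs) {
  return state.pendingToolCalls === 0 && now - state.lastActivity >= idleTimeoutMs;
}

function canComplete(state, sessions, now, idleTimeoutMs = 60_000) {
  if (!isIdle(state, now, idleTimeoutMs)) return false;
  return state.subAgentKeys.every((key) => {
    const child = sessions.get(key);
    // A child missing from the map is treated as already cleaned up.
    return !child || child.complete;
  });
}
```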

Phase 4: Hook Integration ⏱️ 1h

Parallelizable: no | Dependencies: Phase 2 (watcher-manager CLI working)

4.1: Create hooks/status-watcher-hook/HOOK.md — events: ["gateway:startup"], description, required env vars listed → OpenClaw discovers hook
4.2: Create hooks/status-watcher-hook/handler.js — on gateway:startup: check if watcher already running (PID file), if not: spawn node watcher-manager.js start as detached background process → watcher auto-starts with gateway
4.3: Deploy hook to workspace — cp -r hooks/status-watcher-hook /home/node/.openclaw/workspace/hooks/ → hook in place
4.4: Test: gateway restart -> watcher starts, PID file written, health endpoint responds → verified
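
The "already running?" check in 4.2 can be done with a PID file and signal 0. The sketch below shows only that check and the detached spawn; the function names and the managerPath argument are hypothetical, and the OpenClaw hook handler signature is not shown in this plan.

```javascript
const fs = require('node:fs');
const { spawn } = require('node:child_process');

// Sketch of the startup check. Function names and managerPath are
// hypothetical; how the hook invokes this is not covered here.
function readLivePid(pidFile) {
  let pid;
  try {
    pid = Number.parseInt(fs.readFileSync(pidFile, 'utf8').trim(), 10);
  } catch {
    return null; // no PID file
  }
  if (!Number.isInteger(pid) || pid <= 0) return null;
  try {
    process.kill(pid, 0); // signal 0 sends nothing; it only tests existence
    return pid;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return err.code === 'EPERM' ? pid : null;
  }
}

function ensureWatcher(pidFile, managerPath) {
  if (readLivePid(pidFile) !== null) return false; // already running
  const child = spawn(process.execPath, [managerPath, 'start'], {
    detached: true,
    stdio: 'ignore',
  });
  child.unref(); // do not keep the gateway process waiting on the watcher
  return true;
}
```

A stale PID file (process gone) falls through to the spawn path, which matches the duplicate-watcher mitigation listed under Risks.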

Phase 5: Polish + Deployment ⏱️ 3-4h

Parallelizable: yes (docs, deploy scripts, skill rewrite are independent) | Dependencies: Phase 4

5.1: Rewrite skill/SKILL.md — 10-line file: "Live status updates are automatic. You do not need to call live-status manually. Focus on your task." → no protocol injection
5.2: Rewrite deploy-to-agents.sh — remove AGENTS.md injection; deploy hook; npm install; optionally restart gateway → one-command deploy
5.3: Rewrite install.sh — npm install (installs pino); deploy hook; print post-install instructions including env vars required → clean install flow
5.4: Create deploy/status-watcher.service — systemd unit file for standalone deployment (non-hook mode); uses env file at /etc/status-watcher.env → usable with systemctl
5.5: Create deploy/Dockerfile — FROM node:22-alpine; COPY package.json and src/; RUN npm install; CMD ["node", "src/watcher-manager.js", "start"] → containerized deployment option
5.6: Update src/live-status.js — add startup deprecation warning "NOTE: live-status CLI is deprecated as of v4. Status updates are now automatic."; add start-watcher and stop-watcher pass-through commands → backward compat maintained
5.7: Handle session compaction edge case — add test with truncated JSONL file; verify watcher resets offset and continues without crash → no data loss
5.8: Write README.md — architecture diagram (ASCII), install steps, config reference, upgrade guide from v1, troubleshooting → complete documentation
5.9: Run make check → zero lint/format errors; npm test → green

Phase 6: Remove v1 Injection from AGENTS.md ⏱️ 30min

Parallelizable: no | Dependencies: Phase 5 fully verified + watcher confirmed running
SAFETY: Do not execute this phase until watcher has been running successfully for at least 1 hour

6.1: Verify watcher is running — check PID file, health endpoint, and at least one real status box update → confirmed working before touching AGENTS.md
6.2: Remove "Live Status Protocol (MANDATORY)" section from main AGENTS.md → section removed
6.3: Remove from all other agent AGENTS.md files (coder-agent, xen, global-calendar, nutrition-agent, gym-designer) → all cleaned up
6.4: Commit AGENTS.md changes with message "feat: remove v1 live-status injection (v4 watcher active)" → change tracked

8. Testing Strategy

What	Type	How	Success Criteria
config.js	Unit	Env var injection, missing var detection	Throws on missing required vars; correct defaults
logger.js	Unit	Log output format	JSON output, levels respected
circuit-breaker.js	Unit	Simulate N failures, verify state transitions	open after threshold, half-open after cooldown
tool-labels.js	Unit	30+ tool names (exact, prefix, regex, unmapped)	Correct labels returned; default for unknown
status-formatter.js	Unit	Various SessionState inputs	Correct compact output; MAX_STATUS_LINES enforced
status-box.js	Unit	Mock HTTP server	create/update called correctly; throttle works; circuit fires
session-monitor.js	Integration	Write test sessions.json; verify events emitted	session-added/removed within 2s
status-watcher.js	Integration	Append to JSONL file; verify Mattermost update	Update within 1.5s of new line
Idle detection	Integration	Stop writing; verify complete after IDLE_TIMEOUT_S + 5s	Status box marked done
Session compaction	Integration	Truncate JSONL file mid-session	No crash; offset reset; no duplicate events
Restart recovery	Integration	Kill daemon mid-session; restart	Existing post updated, not new post created
Sub-agent nesting	Integration	Mock parent + child transcripts	Child visible in parent status box
Cascade completion	Integration	Child completes; verify parent waits	Parent marks done after last child
Health endpoint	Manual	curl localhost:9090/health	JSON with correct metrics
E2E smoke test	Manual	Real agent task in Mattermost	Real-time updates; no spam; done on completion
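
For the tool-labels.js row above, an exact, then prefix, then regex lookup could be sketched as below. The layout of the label data ({ exact, prefix, regex } sections) is an assumption; the real tool-labels.json format is not shown in this plan. createResolver is a hypothetical name.

```javascript
// Sketch of an exact -> prefix -> regex resolver. ASSUMPTION: label data has
// { exact: {}, prefix: {}, regex: [{ pattern, label }] } sections.
function createResolver(labels, fallback = 'Working...') {
  const exact = labels.exact || {};
  const prefixes = Object.entries(labels.prefix || {})
    .sort(([a], [b]) => b.length - a.length); // longest prefix wins
  const patterns = (labels.regex || []).map(({ pattern, label }) => ({
    regex: new RegExp(pattern),
    label,
  }));

  return function resolve(toolName) {
    if (Object.hasOwn(exact, toolName)) return exact[toolName];
    for (const [prefix, label] of prefixes) {
      if (toolName.startsWith(prefix)) return label;
    }
    for (const { regex, label } of patterns) {
      if (regex.test(toolName)) return label;
    }
    return fallback; // unmapped tools get the default label
  };
}
```

Compiling the patterns once at construction keeps per-event lookups cheap, and a unit test can cover each branch with a handful of names.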

9. Risks & Mitigations

Risk	Impact	Mitigation
fs.watch recursive not reliable on this kernel	High	Detect at startup; fall back to polling if watch fails (setInterval 2s on directory listing)
sessions.json write race causes parse error	Medium	Try/catch on JSON.parse; retry next poll cycle; log warning
Mattermost rate limit (10 req/s default)	Medium	Throttle to max 2 req/s per session; circuit breaker; exponential backoff on 429
Session compaction truncates JSONL	Medium	Detect stat.size < bytesRead on each read; reset offset; dedup by tracking last processed line index
Multiple gateway restarts create duplicate watchers	Medium	PID file check + SIGTERM old process before spawning new one
Non-MM sessions (hook, cron) generate noise	Low	Channel resolver returns null; watcher skips session gracefully
pino dependency unavailable	Low	If npm install fails, fall back to console.log (degrade gracefully, log warning)
Status box exceeds Mattermost post size limit	Low	Hard truncate at MAX_MESSAGE_CHARS (15000); tested with message size guard
JSONL format changes in future OpenClaw	Medium	Abstract parser behind EventParser interface; version check on session record
Daemon crashes mid-session	Medium	Health check via systemd/Docker; restart policy; offset persistence enables recovery

10. Effort Estimate

Phase	Time	Can Parallelize?	Depends On
Phase 0: Repo + Env Verification	30min	No	—
Phase 1: Core Components	8-12h	Partially (config/logger/circuit-breaker)	Phase 0
Phase 2: Session Monitor + Lifecycle	4-6h	No	Phase 1
Phase 3: Sub-Agent Support	3-4h	No	Phase 2
Phase 4: Hook Integration	1h	No	Phase 2
Phase 5: Polish + Deployment	3-4h	Yes (docs, deploy, skill)	Phase 4
Phase 6: Remove v1 AGENTS.md Injection	30min	No	Phase 5 verified
Total	20-28h

11. Open Questions

All questions have defaults that allow execution to proceed without answers.

Q1 (informational): Idle timeout tuning. 60s default may still cause premature completion for very long exec calls (e.g., a 3-minute build). Smart heuristic (pendingToolCalls tracking) should handle this correctly, but production data may reveal edge cases. Default: Use smart heuristic (pendingToolCalls + IDLE_TIMEOUT_S=60). Log false-positives for tuning.
Q2 (informational): Non-MM session behavior. Hook sessions, cron sessions, and xen sessions don't have a Mattermost channel. Currently skipped. Default: Skip non-MM sessions (no status box). Log at debug level. Can revisit for Phase 7.
Q3 (informational): Status box per-request vs per-session. Currently: one status box per user message (reset on new user turn). This is the most natural UX. Default: Per-request. New user message starts new status cycle. Works correctly with smart idle detection.
Q4 (informational): Compaction dedup strategy. When JSONL is truncated, we reset offset and re-read. We may re-process events already posted to Mattermost. Default: Track last processed line count (not just byte offset). Skip lines already processed on re-read. OR: detect compaction and do not re-append old events (since they were already shown). Simplest: mark box as "session compacted - continuing" and reset the visible lines in the status box.
Q5 (blocking if no): AGENTS.md modification scope. Phase 6 removes Live Status Protocol section from all agent AGENTS.md files. Confirm Rooh wants all instances removed (not just main agent). Default if not answered: Remove from all agents. This is the stated goal — removing v1 injection everywhere.
