[v4] Live Status Rewrite - Production-Grade Real-Time Agent Progress System #3


sol · 2026-03-07T16:25:50+01:00

sol commented

2026-03-07 16:25:50 +01:00

Problem Statement

The current live-status system (v1) is fundamentally broken in production.

Current Architecture

Agent -> exec("live-status create/update ...") -> Mattermost API

Diagnosed Failures

Agents forget to use it. Even with MANDATORY instructions in every AGENTS.md, agents skip live-status because it requires 4+ separate exec calls per task, agents must remember post IDs across tool calls, and there's no enforcement mechanism.
Spam problem. When agents DO try to update status, the final response dumps 10+ status messages into the chat.
No sub-agent visibility. Sub-agents work in isolated sessions. Their progress is invisible until the announce step.
Thread isolation breaks state. Mattermost threads create separate OpenClaw sessions.
Naive solutions don't scale. Repeating the instruction more forcefully in system prompts doesn't work.

Proposed Solution: Live Status v4

Core Principle

Don't rely on agents to update status. Intercept their work automatically.

Architecture

A status-watcher daemon that tails the agent's JSONL transcript in real-time and auto-updates a Mattermost status box.

OpenClaw Gateway
  Agent Session -> writes transcript JSONL
  status-watcher daemon (per-session)
    -> fs.watch on transcript file
    -> Parses tool calls, results, assistant text
    -> Debounced Mattermost API updates (500ms)
    -> Auto-create/complete status box
  Sub-agent sessions
    -> Same watcher pattern
    -> Nested under parent status box

Components

1. status-watcher - Transcript Tail Daemon

tail -f the agent's JSONL transcript file
Parse each new line for tool calls, tool results, and assistant text
Map tool names to human-readable status labels
Debounce and batch updates to Mattermost (max 1 update/500ms)
Auto-create status box on first activity
Auto-mark complete when session goes idle (no new lines for 30s)
Handle sub-agent transcripts (nested status)
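A minimal sketch of the tool-name-to-label mapping listed above. It assumes the labels live in a plain object (for example, loaded from tool-labels.json) and that a trailing `*` marks a prefix pattern. The pattern syntax, the sample labels, and the fallback text are illustrative assumptions, not part of the spec.

```javascript
// Hypothetical label table; real entries would come from tool-labels.json.
const SAMPLE_LABELS = {
  Read: 'Reading file...',
  exec: 'Running command...',
  'browser*': 'Using the browser...', // prefix pattern (assumed syntax)
};

// Resolve a tool name to a human-readable label.
// Exact matches win; otherwise the longest matching "prefix*" pattern is used.
function resolveLabel(toolName, labels) {
  if (Object.hasOwn(labels, toolName)) return labels[toolName];
  let best = null;
  for (const [pattern, label] of Object.entries(labels)) {
    if (!pattern.endsWith('*')) continue;
    const prefix = pattern.slice(0, -1);
    if (!toolName.startsWith(prefix)) continue;
    if (best === null || prefix.length > best.prefix.length) {
      best = { prefix, label };
    }
  }
  // Generic fallback so unknown tools still produce a status line.
  return best ? best.label : `Running ${toolName}...`;
}
```

Keeping the fallback generic means a new tool never blocks the status box; it just shows a less specific line until a label is added.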

2. status-box - Mattermost Post Manager

Rich message attachments with colored status cards
Sub-agent progress as nested items
Timestamps and duration tracking
Auto-cleanup on session end

3. Hook Integration

Trigger status-watcher on session start via OpenClaw hooks
Kill watcher on session end
Route sub-agent announces through status box

4. Agent-Side Simplification

Agents get ONE simple instruction: Status updates are automatic. Focus on the task.

Implementation Plan

Phase 1: Core Watcher

status-watcher.js - transcript tail + parse + Mattermost update
Tool-name to status-label mapping (configurable)
Debounced Mattermost updates (500ms default)
Auto-create status box in correct channel/thread
Auto-complete detection (idle timeout)

Phase 2: Session Lifecycle

Start watcher when agent session begins (via hook or cron)
Stop watcher when session ends
Handle session compaction (transcript rewrite)
Thread-aware: detect thread root ID from session key

Phase 3: Sub-Agent Support

Watch sub-agent transcripts
Nest sub-agent status under parent status box
Cascade completion

Phase 4: Polish

Rich Mattermost attachments
Rate limiting
Error recovery
Metrics/logging
New deploy script
Remove old AGENTS.md protocol injection

Status Box Format (v4)

Agent: god-agent - Fixing live-status system
[15:21:22] Reading live-status source code...
[15:21:25] Read: /src/live-status.js done
[15:21:28] Analyzing agent configurations...
[15:21:30] exec: grep -r live-status ... done
[15:21:35] Writing new implementation...
[15:21:40] Sub-agent: coder-agent (Phase 1)
  [15:21:42] Writing status-watcher.js...
  [15:21:55] Complete (13s)
[15:22:00] Task complete (38s)
Runtime: 38s | Tokens: 12.4k | Cost: $0.08

Files to Create/Modify

src/status-watcher.js - CREATE - Core transcript watcher daemon
src/status-box.js - CREATE - Mattermost post manager
src/tool-labels.json - CREATE - Tool name to human label mapping
src/live-status.js - DEPRECATE - Keep for backward compat
skill/SKILL.md - REWRITE - Simpler instructions
deploy-to-agents.sh - REWRITE - Install watcher instead of prompt injection
install.sh - REWRITE - New install flow
README.md - REWRITE - Full v4 documentation

Success Criteria

Agents produce live status updates WITHOUT any explicit live-status calls
Sub-agent progress is visible in real-time
No status spam in final response
Works across thread sessions automatically
Survives session compaction and gateway restarts
Production-ready: rate limiting, error recovery, logging
Single install command deploys everything

References

Current code: src/live-status.js (CLI tool, ~250 lines)
OpenClaw transcripts: JSONL files at workspace/session-id.jsonl
OpenClaw hooks: POST /hooks/agent for session lifecycle events
OpenClaw sub-agents: agent:id:subagent:uuid session pattern
Mattermost API: POST/PUT /api/v4/posts
Inspiration: Google Antigravity-style live execution visibility


sol changed title from ~~[v4] Live Status Rewrite — Production-Grade Real-Time Agent Progress System~~ to [v4] Live Status Rewrite - Production-Grade Real-Time Agent Progress System

2026-03-07 16:28:30 +01:00

sol commented

2026-03-07 16:29:36 +01:00

Triage: v4 Live Status Rewrite

Reviewed the spec against the current codebase. Here is the assessment.

Current State (v1)

src/live-status.js (114 lines) -- simple CLI wrapper around Mattermost HTTP API
Agents must manually call live-status create / live-status update with post IDs
install.sh injects protocol text into AGENTS.md (prompt injection approach)
No sub-agent awareness, no auto-detection, no lifecycle hooks
Token hardcoded as placeholder, replaced by sed during install

Spec Assessment

The v4 proposal correctly diagnoses the core problem: agents don't reliably use explicit status tools. The solution (transcript-tailing daemon) is architecturally sound -- it moves the burden from agents to infrastructure.

Key Technical Observations

Transcript JSONL tailing -- This depends on OpenClaw's internal transcript format. Need to verify: (a) where transcripts live, (b) the JSONL schema, (c) whether they are written synchronously or buffered. This is discovery work.
Session lifecycle hooks -- The spec mentions POST /hooks/agent. Need to verify this hook exists and what payload it sends. If it does not exist, this becomes a dependency on OpenClaw core.
Mattermost post management -- The proposed status-box.js concept (rich attachments, nested sub-agent status) is a significant step up from raw message text. Good direction.
Debouncing -- 500ms default is reasonable. Should be configurable per-deployment.
Idle timeout for auto-complete -- 30s might be too aggressive. Some tool calls (exec) can run for minutes. Needs a smarter heuristic (e.g., no new transcript lines AND no running exec sessions).
Session compaction -- If OpenClaw rewrites/truncates transcript files, the watcher needs to handle file truncation (inode change, seek reset). This is a real edge case worth testing early.

Risk Areas

Risk	Impact	Notes
Transcript format undocumented	High	Core dependency -- if format changes, watcher breaks
Hook API may not exist yet	High	Blocks Phase 2 entirely
Mattermost rate limits	Medium	Debouncing helps, but burst scenarios need testing
Docker networking (mattermost:8065)	Low	Already solved in v1

Recommendation

This is a solid spec. The phased approach makes sense. Before building, Phase 0 discovery would de-risk the two biggest unknowns:

Document the actual JSONL transcript format (grab a sample, map the schema)
Verify the OpenClaw hooks API exists and what it provides

Once those are confirmed, Phases 1-4 can proceed with confidence.

This issue has [v4] in the title -- ready for implementation planning when the owner gives the go-ahead.


sol commented

2026-03-07 16:33:39 +01:00

[DECOMPOSE] Planner sub-agent active. Reading protocol, examining transcripts, and identifying knowledge gaps before drafting discovery sub-issues.

Focus areas:

JSONL transcript format (types, structure)
OpenClaw session lifecycle hooks
Mattermost API constraints
Sub-agent session keying
Current failure modes in live-status


sol commented

2026-03-07 16:36:31 +01:00

[DISCOVERY -> SYNTHESIS] Discovery phase complete. Key findings below.

Discovery Findings

1. JSONL Transcript Format

Confirmed format. Each line is a JSON object with type field:

session -- First line. Contains session UUID, version=3, cwd
model_change -- Provider, modelId changes
thinking_level_change -- thinking on/off
custom -- Subtypes: model-snapshot, openclaw.cache-ttl
message -- The main workhorse. Role = user, assistant, or toolResult

Message content array can contain:

{type: "text", text: "..."} -- plain text
{type: "toolCall", id, name, arguments: {...}} -- tool invocations
{type: "thinking", thinking: "..."} -- internal reasoning (thinking mode)

Assistant messages also carry: api, provider, model, usage, stopReason, timestamp

ToolResult messages carry: toolCallId, toolName, isError, content: [{type, text}]
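A minimal sketch of turning one transcript line into status events, based on the fields listed above. Whether `role` and `content` sit on a nested `message` object or directly on the line is an assumption (the sketch accepts both), and the event shape (`kind`, etc.) is hypothetical.

```javascript
// Parse one JSONL transcript line into zero or more status events.
// Assumption: role/content live on entry.message, or on the entry itself.
function parseTranscriptLine(line) {
  let entry;
  try {
    entry = JSON.parse(line);
  } catch {
    return []; // partially written or corrupt line: skip it
  }
  if (entry === null || typeof entry !== 'object' || entry.type !== 'message') {
    return []; // session, model_change, custom, ... carry no status info here
  }
  const msg = entry.message ?? entry;

  if (msg.role === 'toolResult') {
    return [{
      kind: 'toolResult',
      toolCallId: msg.toolCallId,
      toolName: msg.toolName,
      isError: Boolean(msg.isError),
    }];
  }
  if (msg.role !== 'assistant' || !Array.isArray(msg.content)) return [];

  const events = [];
  for (const item of msg.content) {
    if (!item) continue;
    if (item.type === 'toolCall') {
      events.push({ kind: 'toolCall', id: item.id, name: item.name });
    } else if (item.type === 'text' && typeof item.text === 'string' && item.text.trim() !== '') {
      events.push({ kind: 'text', text: item.text.trim() });
    }
    // "thinking" items are deliberately ignored: internal reasoning
    // should not leak into the status box.
  }
  return events;
}
```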

2. Session Keying

Session keys follow pattern: agent:{agentId}:{context}

Examples:

agent:main:main -- direct session
agent:main:mattermost:channel:{channelId} -- channel session
agent:main:mattermost:channel:{channelId}:thread:{threadId} -- thread session
agent:main:subagent:{uuid} -- SUB-AGENT SESSION
agent:main:hook:gitea:{repo}:issue:{n} -- hook-triggered session
agent:main:cron:{name} -- cron session

Sub-agent entry in sessions.json has: spawnedBy, spawnDepth, label, sessionId (maps to .jsonl file UUID)

Sessions store: /home/node/.openclaw/agents/{agentId}/sessions/sessions.json
JSONL files: /home/node/.openclaw/agents/{agentId}/sessions/{uuid}.jsonl
Topic-scoped files: {uuid}-topic-{topicId}.jsonl
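A sketch of parsing the session key patterns above to find the Mattermost channel, thread, or sub-agent ID. The function name and the returned object shape are hypothetical. Only the key layouts from the examples are handled.

```javascript
// Parse "agent:{agentId}:{context}" session keys into their parts.
// Returns null for anything that does not start with "agent:".
function parseSessionKey(key) {
  const parts = key.split(':');
  if (parts[0] !== 'agent' || parts.length < 3) return null;
  const [, agentId, kind, ...rest] = parts;
  const info = { agentId, kind, channelId: null, threadId: null, subagentId: null };

  if (kind === 'mattermost' && rest[0] === 'channel') {
    info.channelId = rest[1] ?? null;
    if (rest[2] === 'thread') info.threadId = rest[3] ?? null;
  } else if (kind === 'subagent') {
    info.subagentId = rest[0] ?? null;
  }
  // hook, cron, and main sessions have no Mattermost target in the key itself.
  return info;
}
```

A thread session yields both `channelId` and `threadId`, so the status box can be posted as a reply under the thread root instead of in the channel.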

3. Hook Events

Available internal hook events:

command:new, command:reset, command:stop -- user commands
command -- all commands
agent:bootstrap -- before workspace files injected
gateway:startup -- after gateway starts

NO session:start or session:end hooks exist yet. The hooks system only covers commands, NOT individual message/run starts.

Sub-agent lifecycle: subagent_spawned, subagent_ended are channel plugin hooks (not internal hooks).

4. Mattermost API

PostEditTimeLimit = -1 (unlimited edits)
Bot token: n73636eit7bg3rgmpsj693mwno (default/main bot)
Multiple bot accounts available per agent
Post update: PUT /api/v4/posts/{id} works fine, no time limit
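A sketch of the post update call using the global `fetch` available in current Node.js versions. The helper names are hypothetical. The request body includes the post `id` alongside `message`, and the token is passed in by the caller rather than read from anywhere in particular.

```javascript
// Build the request for PUT /api/v4/posts/{id}. Kept separate from the
// network call so it can be inspected or unit-tested without a server.
function buildPostUpdateRequest(baseUrl, token, postId, message) {
  return {
    url: `${baseUrl.replace(/\/+$/, '')}/api/v4/posts/${encodeURIComponent(postId)}`,
    options: {
      method: 'PUT',
      headers: {
        Authorization: `Bearer ${token}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ id: postId, message }),
    },
  };
}

// Send the update and surface non-2xx responses as errors.
async function updatePost(baseUrl, token, postId, message) {
  const { url, options } = buildPostUpdateRequest(baseUrl, token, postId, message);
  const res = await fetch(url, options);
  if (!res.ok) {
    throw new Error(`Mattermost update failed: HTTP ${res.status}`);
  }
  return res.json();
}
```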

5. Current Failure Modes (v1)

Agents must manually call live-status create/update/complete
deploy-to-agents.sh injects verbose protocol into AGENTS.md
Agents forget, IDs get lost between tool calls
No sub-agent visibility
AGENTS.md Live Status section still relies on manual agent cooperation

6. Transcript File Patterns (Key Insight)

Files written live as agent executes
Session file = {sessionId}.jsonl (from sessions.json lookup)
Sub-agent sessions have their own JSONL files (same directory)
To watch a session: read sessions.json -> find sessionFile -> watch that JSONL
sessions.json is updated on every message


sol commented

2026-03-07 16:36:40 +01:00

[SYNTHESIS -> DRAFTING_PLAN] Synthesis complete. Now drafting the implementation plan.


sol commented

2026-03-07 16:42:14 +01:00

[AUTO_AUDIT -> SIMULATION -> PENDING_APPROVAL]

Audit: 32/32 PASS. Simulation: READY. No blockers.

Plan committed to repo: PLAN.md, STATE.json, discoveries/README.md

See full plan below in the next comment.


sol commented

2026-03-07 16:42:47 +01:00

[PENDING_APPROVAL] Implementation Plan: Live Status v4

Full plan committed to repo: PLAN.md
Discovery: discoveries/README.md

Summary

Replace the broken agent-cooperative live-status (v1) with an infrastructure-level daemon that tails OpenClaw JSONL transcript files and auto-updates Mattermost -- zero agent cooperation required.

Core mechanism: Watch sessions.json for new sessions -> tail each session's JSONL -> parse tool calls and results -> debounced Mattermost updates (500ms).

Architecture

status-watcher.js -- tails one session's JSONL, parses events, updates Mattermost
session-monitor.js -- watches sessions.json for new/ended sessions
watcher-manager.js -- orchestrates monitor + watchers, PID management
hook/handler.js -- gateway:startup hook auto-starts watcher-manager

Phases (7-9h total)

Phase 0 (10min): Repo sync + setup
Phase 1 (2-3h): Core watcher daemon
Phase 2 (1-2h): Session monitor + watcher manager
Phase 3 (1h): Channel resolution (Mattermost session key parsing)
Phase 4 (1h): Hook integration (auto-start on gateway:startup)
Phase 5 (1h): Polish + cleanup + remove v1 AGENTS.md injection
Phase 6 (30min): Remove old AGENTS.md protocol section

Key Technical Decisions

Poll sessions.json every 2s (fs.watch unreliable on Linux for JSON files)
Use stopReason=stop + 10s idle to detect session complete (not naive 30s timeout)
Sub-agents detected via spawnedBy field in sessions.json -> nested in parent status box
handler.js (plain JS) for gateway hook -- confirmed supported by OpenClaw workspace hook discovery
No new npm dependencies -- all Node.js built-ins
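A sketch of decision 3 (sub-agent detection via `spawnedBy`). It assumes sessions.json parses to an object keyed by session key and that `spawnedBy` holds the parent's session key. Both are assumptions about the file layout and should be checked against a real sessions.json before relying on this.

```javascript
// Group sub-agent sessions under the session that spawned them.
// Returns a Map: parent session key -> array of child descriptors.
function groupSubagents(sessions) {
  const children = new Map();
  for (const [key, entry] of Object.entries(sessions)) {
    if (!entry || !entry.spawnedBy) continue; // not a sub-agent
    if (!children.has(entry.spawnedBy)) children.set(entry.spawnedBy, []);
    children.get(entry.spawnedBy).push({
      key,
      sessionId: entry.sessionId, // maps to {sessionId}.jsonl
      label: entry.label ?? null,
    });
  }
  return children;
}
```

With this map, the parent's status box can render each child as a nested item and tail its `{sessionId}.jsonl` file using the same watcher logic.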

Audit: 32/32 PASS | Simulation: READY

Open Questions (all have defaults, plan can proceed)

Q1: Idle timeout -- Default: stopReason=stop + 10s no new lines
Q2: Non-MM sessions -- Default: skip (hook/cron sessions get no status box)
Q3: Per-session or per-request status box -- Default: per user message (new box each request)
Q4: Compaction behavior -- Default: detect fileSize < bytesRead, reset offset
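A sketch of the Q4 default: read only the bytes appended since the last call, and reset the offset when the file is smaller than what was already read. The function and state names are hypothetical. Incomplete trailing lines are buffered as bytes, so a line or multi-byte character split across two reads is not corrupted.

```javascript
const fs = require('node:fs');

// Per-file tail state: byte offset already consumed plus any trailing
// bytes that did not yet end in a newline.
function createTailState() {
  return { offset: 0, partial: Buffer.alloc(0) };
}

// Return the complete lines appended since the previous call.
function readNewLines(filePath, state) {
  const { size } = fs.statSync(filePath);
  if (size < state.offset) {
    // File shrank: treat as compaction/truncation and start over.
    state.offset = 0;
    state.partial = Buffer.alloc(0);
  }
  if (size === state.offset) return [];

  const fd = fs.openSync(filePath, 'r');
  try {
    const buf = Buffer.alloc(size - state.offset);
    const n = fs.readSync(fd, buf, 0, buf.length, state.offset);
    state.offset += n;

    const chunk = Buffer.concat([state.partial, buf.subarray(0, n)]);
    const lastNewline = chunk.lastIndexOf(0x0a);
    if (lastNewline === -1) {
      state.partial = chunk; // no complete line yet
      return [];
    }
    state.partial = chunk.subarray(lastNewline + 1);
    return chunk
      .toString('utf8', 0, lastNewline)
      .split('\n')
      .filter((l) => l.length > 0);
  } finally {
    fs.closeSync(fd);
  }
}
```

One limitation: a compaction that rewrites the file to a size at or above the old offset is not detected by a size check alone. Comparing the inode or the first line (the `session` header) would be needed for that case.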

Warning

Gateway restart required to activate the gateway:startup hook after install. Coordinate timing with Rooh.

ROOH: Reply "go" / "approved" / "proceed" to begin execution.


sol commented

2026-03-07 16:46:38 +01:00

Check whether this plan is:
Scalable
Efficient
Production-ready
Up to current best practices, not a naive solution


sol commented

2026-03-07 16:48:02 +01:00

Review: Scalability, Efficiency, Production-Readiness, Best Practices

Reviewed against the four criteria requested. The diagnosis is spot-on; the architecture has the right instincts but needs hardening in several areas before it is production-grade.

1. SCALABILITY

Current design: one watcher daemon per session.

This is the biggest architectural concern. If 10 agents run concurrently with sub-agents, you could have 30+ watcher processes, each doing fs.watch + debounced HTTP calls. That works on a single box with low concurrency, but it does not scale.

Issues:

Per-session process spawning has no upper bound. No pool, no backpressure.
Each watcher independently manages its own Mattermost connection -- no shared HTTP client, no connection pooling.
Sub-agent watchers multiply the problem (parent + N children = N+1 watchers per task).

Recommendations:

Single watcher process, multiplexed. One long-running daemon that watches a directory (e.g., all transcript files) using a single fs.watch with recursive option (supported on Linux via inotify since Node 19+, and you are on Node v22). This eliminates per-session process overhead.
Shared HTTP client with connection keep-alive. One http.Agent with keepAlive: true and maxSockets capped. All Mattermost updates go through one connection pool.
Bounded concurrency. Max N concurrent status boxes (configurable). Queue or drop beyond that.
Consider a lightweight message bus (even just an in-process EventEmitter) between the file watcher and the Mattermost updater, so they can be scaled independently later.
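A sketch of two of the recommendations above: one shared keep-alive `http.Agent` and a hard cap on concurrently tracked status boxes. The `StatusBoxRegistry` class, its fields, and the numeric limits are hypothetical. The "drop when full" policy is one of the two options mentioned, chosen here only for brevity.

```javascript
const http = require('node:http');

// One connection pool for all Mattermost requests from this process.
const sharedAgent = new http.Agent({ keepAlive: true, maxSockets: 4 });

// Tracks active status boxes with an upper bound.
class StatusBoxRegistry {
  constructor(maxActive) {
    this.maxActive = maxActive;
    this.active = new Map(); // sessionKey -> box state
  }

  // Returns the box for a session, creating it if capacity allows.
  // Returns null when the cap is reached (caller should skip updates).
  acquire(sessionKey) {
    const existing = this.active.get(sessionKey);
    if (existing) return existing;
    if (this.active.size >= this.maxActive) return null;
    const box = { sessionKey, postId: null, lines: [] };
    this.active.set(sessionKey, box);
    return box;
  }

  // Frees a slot when the session completes.
  release(sessionKey) {
    return this.active.delete(sessionKey);
  }
}
```

The agent would be passed as the `agent` option to `http.request` calls so that all updates reuse the same small set of sockets.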

Verdict: Needs rework. Per-session daemons are a v1-level solution to a v4-level problem.

2. EFFICIENCY

Debouncing at 500ms is correct in principle but naive in implementation.

Issues:

The spec says "max 1 update/500ms" but does not specify the debounce strategy. Leading-edge? Trailing-edge? Throttle? This matters:
- Leading-edge: first event fires immediately, subsequent ones are delayed. Good for responsiveness.
- Trailing-edge: waits 500ms after the LAST event. Good for batching but adds latency.
- Throttle: fires at most once per 500ms regardless. Best for rate limiting.
- Best approach: throttle with trailing flush. Fire immediately on first event, then at most once per interval, with a guaranteed final flush. This gives both responsiveness AND batching.
Full post replacement on every update is wasteful. Each Mattermost PUT /posts/{id} sends the entire message body. If the status box grows to 30+ lines, you are sending the same 29 lines repeatedly to change 1 line.
- Mitigation: keep the status box compact (last N lines + summary), not an ever-growing log.
- Alternative: use Mattermost message attachments (structured fields) which are easier to diff mentally.
JSONL parsing on every line is fine -- JSON.parse on a single line is sub-millisecond. No concern here.
fs.watch vs polling: On Linux (your runtime), fs.watch uses inotify which is efficient. Good. Do NOT fall back to fs.watchFile (polling) -- it is wasteful and unnecessary on Linux. The spec does not mention this distinction; it should.
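A sketch of "throttle with trailing flush" as described above: the first event is sent immediately, later events within the interval are coalesced into the latest payload, and `flush()` guarantees the final state goes out. The function name and the injectable `timers` parameter (useful for tests) are assumptions of this sketch.

```javascript
// Throttle with trailing flush.
// send(payload) is called at most once per intervalMs; only the most
// recent payload is kept while waiting.
function createThrottle(send, intervalMs, timers = { setTimeout, clearTimeout, now: Date.now }) {
  let lastSent = -Infinity;
  let pending = null;
  let hasPending = false;
  let timer = null;

  function fire() {
    timer = null;
    if (!hasPending) return;
    const payload = pending;
    pending = null;
    hasPending = false;
    lastSent = timers.now();
    send(payload);
  }

  return {
    // Record the latest payload; send now if the interval has elapsed,
    // otherwise schedule a single trailing send.
    push(payload) {
      pending = payload;
      hasPending = true;
      if (timer !== null) return; // trailing send already scheduled
      const wait = lastSent + intervalMs - timers.now();
      if (wait <= 0) fire();
      else timer = timers.setTimeout(fire, wait);
    },
    // Send any pending payload immediately (e.g., on session complete).
    flush() {
      if (timer !== null) timers.clearTimeout(timer);
      fire();
    },
  };
}
```

Because each payload is the full current status box, dropping intermediate payloads loses nothing: the trailing send always carries the newest state.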

Verdict: Mostly good, needs the debounce strategy specified and the message size growth addressed.

3. PRODUCTION-READINESS

This is where the spec has the most gaps.

Missing from the spec:

Gap	Impact	What to add
No graceful shutdown	Orphaned watchers, leaked Mattermost posts stuck in "running"	SIGTERM/SIGINT handlers that mark all active status boxes as "interrupted"
No health check endpoint	Cannot monitor watcher health	Simple HTTP `/health` or write a heartbeat file
No structured logging	Cannot debug production issues	Use structured JSON logging (pino or similar), not console.log
No PID file / process management	Cannot reliably stop/restart	Write PID file, or use systemd/pm2
No file rotation handling	If transcripts are rotated (logrotate-style), watcher loses position	Watch for inode changes, re-open on rename event
No max message size guard	Mattermost has a 16383 char post limit (default)	Truncate or paginate status box content
No error budget / circuit breaker	If Mattermost is down, watchers spin on retries forever	Exponential backoff with circuit breaker (stop trying after N failures, resume after cooldown)
No metrics	Cannot measure update latency, error rates, queue depth	Expose basic counters (updates sent, errors, queue depth)
Session compaction handling	Spec mentions it but no strategy	Need to detect file truncation (stat size < last read offset) and reset reader position

The 30-second idle timeout for auto-complete is problematic:

exec tool calls can run for minutes (npm install, git clone, compilation).
A smarter heuristic: track whether the last transcript line was a tool_call (still waiting for result) vs. an assistant message (might be done). Only start idle timer after a complete assistant turn with no pending tool calls.

Token/credential management:

The current v1 approach (sed-replacing a placeholder in the installed binary) is bad practice. v4 should use environment variables exclusively (MM_TOKEN, MM_URL). The spec does not address this.

Verdict: Not production-ready as specified. Needs the gaps above addressed before it can run unattended.

4. BEST PRACTICES

What the spec gets right:

Separating concerns (watcher vs. status-box vs. hook integration)
Phased rollout (core first, then lifecycle, then sub-agents, then polish)
Deprecating the old approach rather than deleting it
Removing AGENTS.md prompt injection (correct -- this never worked reliably)

What deviates from best practices:

Area	Issue	Best Practice
Architecture	Per-session daemon spawning	Single multiplexed daemon (event-driven)
File watching	Spec says `fs.watch` but does not handle edge cases	Use `fs.watch` on Linux (inotify), handle ENOENT (file not yet created), EPERM, and rename events
Error handling	Not mentioned in spec	Every I/O boundary needs try/catch with structured error logging
Testing	No testing strategy in the spec	Unit tests for JSONL parser, integration tests for Mattermost posting, end-to-end test with a mock transcript
Configuration	Hardcoded values (500ms, 30s, tool labels)	Config file or env vars for ALL tunable values, with sensible defaults
Dependencies	`tool-labels.json` as static file	Good start, but should support pattern matching (e.g., `exec:*` -> "Running command...") not just exact tool names
Observability	Phase 4 "metrics/logging" is last	Should be Phase 1. You cannot debug a file-watching daemon without logging from day one.
Backward compat	"Keep for backward compat" on live-status.js	Good. But define a deprecation timeline (e.g., remove after v4 is stable for 2 weeks)

Missing best practice: the watcher should be idempotent on restart. If the watcher crashes and restarts, it should:

Find the existing Mattermost status post (by searching for a marker in the channel)
Resume from the current end of the transcript file (not replay from the beginning)
Update the status box with "Resumed after restart" indicator

Summary Verdict

Criterion	Rating	Notes
Scalability	Needs rework	Per-session daemons do not scale; switch to single multiplexed watcher
Efficiency	Good with fixes	Debounce strategy needs specifying; cap message size
Production-ready	Not yet	Missing: graceful shutdown, health checks, circuit breaker, structured logging, credential management
Best practices	Mostly good	Move observability to Phase 1; add testing strategy; make watcher restart-idempotent

The core idea is sound. Transcript tailing is the right approach -- it removes agent burden entirely. But the implementation plan needs a Phase 0 that addresses the architectural gaps above before writing code.

Top 3 changes to make the spec production-grade:

Single multiplexed watcher daemon (not per-session spawning)
Add graceful shutdown, circuit breaker, and structured logging to Phase 1 (not Phase 4)
Define the idle-completion heuristic properly (pending tool calls = not idle)


sol commented

2026-03-07 16:59:23 +01:00

I agree, update the plan based on your findings.

sol commented

2026-03-07 17:01:05 +01:00

Revised Plan: v4 Live Status Rewrite (Production-Grade)

Incorporating all findings from the scalability/efficiency/production-readiness review. Changes from original spec marked with [CHANGED] or [NEW].

Problem Statement

(unchanged -- the diagnosis is correct)

The current live-status system (v1) is fundamentally broken in production. Agents forget to use it, it spams when they do, sub-agents are invisible, and prompt injection does not work as an enforcement mechanism.

Proposed Solution: Live Status v4

Core Principle

Don't rely on agents to update status. Intercept their work automatically.

Architecture [CHANGED]

Single multiplexed watcher daemon (not per-session) that watches all transcript files and routes updates through a shared Mattermost connection pool.

OpenClaw Gateway
  Agent Sessions -> write transcript JSONL files to transcript directory
  
  status-watcher daemon (SINGLE PROCESS)
    -> fs.watch on transcript directory (recursive, inotify on Linux)
    -> Multiplexes all active session transcripts
    -> SessionState map: sessionKey -> { postId, lastOffset, pendingToolCalls, lines[] }
    -> Shared HTTP connection pool (keep-alive, maxSockets=4)
    -> Throttled Mattermost updates (leading edge + trailing flush, 500ms)
    -> Bounded concurrency: max N active status boxes (configurable, default 20)
    -> Structured JSON logging (pino)
    -> Graceful shutdown (SIGTERM/SIGINT -> mark all boxes "interrupted")
    -> Circuit breaker for Mattermost API failures
    
  Sub-agent transcripts
    -> Detected by session key pattern (agent:id:subagent:uuid)
    -> Nested under parent status box automatically

Why single process over per-session daemons:

Eliminates unbounded process spawning
Shared connection pool reduces HTTP overhead
Single point of configuration and monitoring
Easier health checking and process management
Lower memory footprint (one V8 heap, not N)
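
As a rough illustration of the shared connection pool, the sketch below builds the request for a status box update against one keep-alive agent. It assumes the Mattermost v4 REST route `PUT /api/v4/posts/{post_id}` with a bearer token; `sharedAgent` and `buildUpdateRequest` are hypothetical names, and an `https.Agent` would be needed if `MM_URL` uses HTTPS.

```javascript
const http = require('node:http');

// One agent for the whole daemon: sockets are reused across updates and
// at most four are open to the Mattermost host at any time.
const sharedAgent = new http.Agent({ keepAlive: true, maxSockets: 4 });

// Hypothetical helper: returns the options and body for http.request().
function buildUpdateRequest(baseUrl, token, postId, message) {
  const url = new URL(`/api/v4/posts/${encodeURIComponent(postId)}`, baseUrl);
  const body = JSON.stringify({ id: postId, message });
  return {
    options: {
      agent: sharedAgent,
      method: 'PUT',
      hostname: url.hostname,
      port: url.port,
      path: url.pathname,
      headers: {
        Authorization: `Bearer ${token}`,
        'Content-Type': 'application/json',
        'Content-Length': Buffer.byteLength(body),
      },
    },
    body,
  };
}
```

A caller would pass `options` to `http.request(options, callback)` and write `body` to the request, so every session's updates share the same four sockets.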

Components

1. status-watcher.js - Multiplexed Transcript Watcher [CHANGED]

Single long-running daemon watching the transcript directory
fs.watch with recursive option (Node 22 on Linux = inotify, efficient)
NO fallback to fs.watchFile (polling) -- inotify or nothing
On file change: read new bytes from last known offset, split into lines, parse JSONL
Maintain SessionState map per active session:
- postId: Mattermost status box post ID
- lastOffset: byte offset in transcript file (for resume)
- pendingToolCalls: count of tool_calls without matching tool_results
- lines: recent status lines (capped at MAX_STATUS_LINES, default 15)
- startTime: session start timestamp
- lastActivity: timestamp of last transcript line
Handle file truncation (session compaction): detect stat.size < lastOffset, reset lastOffset to 0 and re-read from the start of the file
Handle file deletion: clean up SessionState, mark status box as "session ended"
Handle ENOENT on initial watch: file may not exist yet, that is fine
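
The offset-based read, truncation check and ENOENT handling listed above could look roughly like this. It is a minimal synchronous sketch, not the planned implementation: `readNewEntries` and the shape of `state` are hypothetical, and it holds back a half-written final line until its newline arrives.

```javascript
const fs = require('node:fs');

// Reads complete JSONL lines appended since state.lastOffset.
// `state` is a hypothetical per-session object: { lastOffset: number }.
function readNewEntries(filePath, state) {
  let fd;
  try {
    fd = fs.openSync(filePath, 'r');
  } catch (err) {
    if (err.code === 'ENOENT') return []; // transcript not created yet
    throw err;
  }
  try {
    const { size } = fs.fstatSync(fd);
    // Compaction truncated the file: start again from the beginning.
    if (size < state.lastOffset) state.lastOffset = 0;
    const length = size - state.lastOffset;
    if (length === 0) return [];
    const buf = Buffer.alloc(length);
    const bytesRead = fs.readSync(fd, buf, 0, length, state.lastOffset);
    const chunk = buf.subarray(0, bytesRead);
    // Consume only up to the last newline so a partial line is retried later.
    const lastNewline = chunk.lastIndexOf(0x0a);
    if (lastNewline === -1) return [];
    state.lastOffset += lastNewline + 1;
    const entries = [];
    const text = chunk.subarray(0, lastNewline).toString('utf8');
    for (const line of text.split('\n')) {
      if (line.trim() === '') continue;
      try {
        entries.push(JSON.parse(line));
      } catch {
        // Skip malformed lines rather than crash the watcher.
      }
    }
    return entries;
  } finally {
    fs.closeSync(fd);
  }
}
```

Tracking the offset in bytes rather than characters keeps the position correct when transcript lines contain multi-byte UTF-8 text.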

2. status-box.js - Mattermost Post Manager [CHANGED]

Shared http.Agent with keepAlive: true, maxSockets: 4
Throttle strategy: leading edge + trailing flush at configurable interval (default 500ms)
- First event fires immediately (responsiveness)
- Subsequent events batched, at most one update per interval
- Guaranteed final flush when activity stops (no lost updates)
Status box content: compact format, capped at MAX_STATUS_LINES (not ever-growing log)
- Show: agent name, current action, last N status lines, elapsed time
- When lines exceed MAX_STATUS_LINES, oldest lines are dropped (keep most recent)
- Footer: runtime duration, token count, cost (if available)
Message size guard: truncate to MAX_MESSAGE_CHARS (default 15000; Mattermost default limit is 16383)
Sub-agent progress rendered as indented nested items under parent box
Post recovery on restart: search channel for existing status post with marker, resume updating it
Credential management: MM_TOKEN and MM_URL from environment variables only. No hardcoded tokens, no sed replacement.
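
The leading edge plus trailing flush throttle described for status-box.js can be sketched in a few lines. `createThrottle` and `send` are hypothetical names; `send` stands in for whatever function eventually posts the rendered status box to Mattermost.

```javascript
// Leading edge + trailing flush: the first call sends immediately, calls
// arriving during the interval are coalesced, and the newest payload is
// sent once the interval elapses.
function createThrottle(send, intervalMs = 500) {
  let timer = null;        // non-null while an interval is in progress
  let pending = null;      // newest payload waiting for the trailing flush
  let hasPending = false;

  function onInterval() {
    timer = null;
    if (!hasPending) return;
    const payload = pending;
    pending = null;
    hasPending = false;
    send(payload);
    // A flush counts as an update, so start a new interval after it.
    timer = setTimeout(onInterval, intervalMs);
  }

  return function update(payload) {
    if (timer === null) {
      send(payload);
      timer = setTimeout(onInterval, intervalMs);
    } else {
      pending = payload;
      hasPending = true;
    }
  };
}
```

Because only the newest payload is kept, intermediate states are dropped on purpose: each payload is a full render of the status box, so the latest one supersedes the rest.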

3. tool-labels.js - Tool Name Mapping [CHANGED from .json]

Supports exact match AND pattern matching:
- Exact: "Read" -> "Reading file..."
- Pattern: "exec:*" -> "Running command..."
- Regex: /^web_/ -> "Searching the web..."
Default label for unmapped tools: "Working..."
Configurable via external JSON file, with built-in defaults as fallback
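
One possible shape for the resolver, assuming rules are checked in order and the first match wins. The rule table and the names `DEFAULT_RULES`, `matches` and `resolveLabel` are illustrative only; rules loaded from a JSON file would need regexes encoded as strings, which this sketch does not cover.

```javascript
// Hypothetical rule table using the three examples from the spec.
const DEFAULT_RULES = [
  { match: 'Read', label: 'Reading file...' },
  { match: 'exec:*', label: 'Running command...' },
  { match: /^web_/, label: 'Searching the web...' },
];

// Escape regex metacharacters so glob text is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function matches(rule, toolName) {
  if (rule instanceof RegExp) return rule.test(toolName);
  if (rule.includes('*')) {
    // Turn a glob such as "exec:*" into an anchored regular expression.
    const pattern = rule.split('*').map(escapeRegExp).join('.*');
    return new RegExp(`^${pattern}$`).test(toolName);
  }
  return rule === toolName; // exact match
}

function resolveLabel(toolName, rules = DEFAULT_RULES) {
  const hit = rules.find((r) => matches(r.match, toolName));
  return hit ? hit.label : 'Working...';
}
```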

4. Hook Integration

Trigger: register with OpenClaw hooks API (POST /hooks/agent) for session start/end events
On session start: watcher picks up new transcript file automatically (directory watch)
On session end: mark status box complete, clean up SessionState
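Fallback if hooks API does not exist: poll the transcript directory listing at low frequency (every 5s) to detect new transcript files (this is separate from the fs.watchFile ban above, which concerns tailing known files)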

5. Agent-Side Simplification

Agents get ONE instruction: "Status updates are automatic. Focus on the task."
Remove all AGENTS.md protocol injection from install/deploy scripts
Old live-status CLI kept for backward compat but marked deprecated

Production Infrastructure [NEW SECTION]

Graceful Shutdown

SIGTERM/SIGINT handlers
On shutdown: mark all active status boxes as "Session interrupted" with duration
Flush all pending Mattermost updates before exit
Write final state to disk (session offsets) for restart recovery
Exit with code 0 after cleanup
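
The "write final state to disk" step might be sketched as follows. The state file location and the functions `saveOffsets` and `loadOffsets` are assumptions; the write goes to a temporary file first and is then renamed, so a crash mid-write should not leave a half-written state file behind.

```javascript
const fs = require('node:fs');

// `sessions` is assumed to be a Map of sessionKey -> { postId, lastOffset, ... }.
function saveOffsets(stateFile, sessions) {
  const snapshot = {};
  for (const [key, session] of sessions) {
    snapshot[key] = { postId: session.postId, lastOffset: session.lastOffset };
  }
  const tmp = `${stateFile}.tmp`;
  fs.writeFileSync(tmp, JSON.stringify(snapshot));
  fs.renameSync(tmp, stateFile); // atomic replace on the same filesystem
}

// Returns a Map of sessionKey -> { postId, lastOffset }; empty on first start.
function loadOffsets(stateFile) {
  try {
    const parsed = JSON.parse(fs.readFileSync(stateFile, 'utf8'));
    return new Map(Object.entries(parsed));
  } catch (err) {
    if (err.code === 'ENOENT') return new Map();
    throw err;
  }
}
```

In a SIGTERM/SIGINT handler, `saveOffsets` would run after the status boxes are marked interrupted and pending updates are flushed, just before the process exits.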

Health Check

HTTP endpoint on configurable port (default 9090): GET /health
Returns: { "status": "ok", "activeSessions": N, "uptimeSeconds": N, "lastError": "..." }
Can be used by systemd, Docker HEALTHCHECK, or monitoring
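
A minimal version of the endpoint needs only Node's built-in `http` module. `createHealthServer` and `getSnapshot` are hypothetical names; `getSnapshot` is assumed to return the object shown in the spec.

```javascript
const http = require('node:http');

// Returns an http.Server that answers GET /health with a JSON snapshot.
// The server is not started here; the caller decides when to listen.
function createHealthServer(getSnapshot) {
  return http.createServer((req, res) => {
    if (req.method === 'GET' && req.url === '/health') {
      const body = JSON.stringify(getSnapshot());
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(body);
      return;
    }
    res.writeHead(404);
    res.end();
  });
}
```

The daemon would start it with something like `createHealthServer(snapshotFn).listen(9090)`, taking the port from `HEALTH_PORT`.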

Circuit Breaker for Mattermost API

Track consecutive failures per endpoint
After 5 consecutive failures: open circuit (stop sending for 30s cooldown)
During cooldown: buffer updates in memory (bounded queue, max 100 entries)
After cooldown: half-open (try one request). Success -> close circuit. Failure -> re-open.
Log all state transitions
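
The closed / open / half-open transitions above fit in a small state machine. This is a sketch under assumptions: the class shape and method names are hypothetical, the clock is injectable so the logic can be tested without waiting, and the bounded update queue and transition logging are left out.

```javascript
class CircuitBreaker {
  constructor({ threshold = 5, cooldownMs = 30000, now = Date.now } = {}) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
    this.failures = 0;       // consecutive failures
    this.state = 'closed';   // 'closed' | 'open' | 'half-open'
    this.openedAt = 0;
  }

  // Returns true when a request may be attempted right now.
  canRequest() {
    if (this.state === 'open' && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = 'half-open'; // cooldown over: allow exactly one probe
      return true;
    }
    return this.state === 'closed';
  }

  recordSuccess() {
    this.failures = 0;
    this.state = 'closed';
  }

  recordFailure() {
    this.failures += 1;
    // A failed probe re-opens immediately; otherwise wait for the threshold.
    if (this.state === 'half-open' || this.failures >= this.threshold) {
      this.state = 'open';
      this.openedAt = this.now();
    }
  }
}
```

Callers check `canRequest()` before each Mattermost call and report the outcome with `recordSuccess()` or `recordFailure()`; while the circuit is open, updates would go to the bounded queue instead.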

Structured Logging [NEW]

Use pino (fast, structured JSON logging)
Log levels: error, warn, info, debug
Default: info in production, debug in development
Every log line includes: timestamp, sessionKey (if applicable), event type
No console.log anywhere in production code

Process Management

Write PID file to configurable path (default: /tmp/status-watcher.pid)
Support --daemon flag for background operation
Systemd unit file provided in deploy/status-watcher.service

Metrics [MOVED TO PHASE 1]

Internal counters exposed via health endpoint:
- updates_sent_total
- updates_failed_total
- active_sessions
- circuit_breaker_state (closed/open/half-open)
- queue_depth
- uptime_seconds

Idle Completion Heuristic [CHANGED]

The original 30-second idle timeout was too aggressive. Revised approach:

Smart idle detection:

Track pendingToolCalls per session (increment on tool_use, decrement on tool_result)
If pendingToolCalls > 0: session is NOT idle, regardless of time since last transcript line
If pendingToolCalls == 0 AND last transcript entry was an assistant message AND no new lines for IDLE_TIMEOUT_S seconds (configurable, default 60s): mark as idle/complete
If pendingToolCalls == 0 AND last transcript entry was a tool_result: start a shorter 30s timer before marking complete, since the agent may still be composing its response
Hard timeout: after MAX_SESSION_DURATION_S (configurable, default 30 minutes), force-complete regardless

This prevents premature completion during long-running exec calls while still cleaning up genuinely idle sessions.
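
The rules above reduce to one pure decision function. The sketch assumes transcript entries can be classified as `tool_use`, `tool_result` or `assistant` (the real JSONL schema is a Phase 0 item) and reads the 30s rule as "complete once that shorter timer expires". `recordEntry`, `shouldComplete` and the field names are hypothetical.

```javascript
// Timeouts in milliseconds, mirroring the defaults in the plan.
const DEFAULTS = {
  idleTimeoutMs: 60000,
  postToolTimeoutMs: 30000,
  maxSessionMs: 1800000,
};

// Update per-session counters for one transcript entry.
function recordEntry(session, entryType, now) {
  if (entryType === 'tool_use') session.pendingToolCalls += 1;
  if (entryType === 'tool_result') {
    session.pendingToolCalls = Math.max(0, session.pendingToolCalls - 1);
  }
  session.lastEntryType = entryType;
  session.lastActivity = now;
}

// Decide whether the status box should be marked complete at time `now`.
function shouldComplete(session, now, cfg = DEFAULTS) {
  if (now - session.startTime >= cfg.maxSessionMs) return true; // hard timeout
  if (session.pendingToolCalls > 0) return false; // a tool is still running
  const quietMs = now - session.lastActivity;
  if (session.lastEntryType === 'assistant') return quietMs >= cfg.idleTimeoutMs;
  if (session.lastEntryType === 'tool_result') return quietMs >= cfg.postToolTimeoutMs;
  return false;
}
```

A periodic timer in the watcher could call `shouldComplete` for every active session, which keeps the rule in one place instead of spreading it across per-session timers.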

Configuration [NEW SECTION]

All tunable values via environment variables with sensible defaults:

Variable	Default	Description
`MM_TOKEN`	(required)	Mattermost bot token
`MM_URL`	`http://mattermost:8065`	Mattermost base URL
`TRANSCRIPT_DIR`	(required)	Directory containing JSONL transcript files
`THROTTLE_MS`	`500`	Minimum interval between Mattermost updates
`IDLE_TIMEOUT_S`	`60`	Seconds of inactivity before marking complete
`MAX_SESSION_DURATION_S`	`1800`	Hard timeout for any session (30 min)
`MAX_STATUS_LINES`	`15`	Max lines in status box (oldest dropped)
`MAX_ACTIVE_SESSIONS`	`20`	Bounded concurrency for status boxes
`MAX_MESSAGE_CHARS`	`15000`	Truncation limit for Mattermost posts
`HEALTH_PORT`	`9090`	Health check HTTP port
`LOG_LEVEL`	`info`	Logging level (error/warn/info/debug)
`CIRCUIT_BREAKER_THRESHOLD`	`5`	Consecutive failures to open circuit
`CIRCUIT_BREAKER_COOLDOWN_S`	`30`	Cooldown before half-open
`PID_FILE`	`/tmp/status-watcher.pid`	PID file path
`TOOL_LABELS_FILE`	`null`	Optional external tool labels JSON file
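
A loader for the table above could validate everything once at startup. The variable names and defaults come from the table; `loadConfig` and the camelCase property names are assumptions for illustration.

```javascript
// Builds the daemon configuration from environment variables.
// Throws at startup if a required variable is missing or a number is invalid.
function loadConfig(env = process.env) {
  const required = (name) => {
    if (!env[name]) throw new Error(`Missing required environment variable: ${name}`);
    return env[name];
  };
  const integer = (name, fallback) => {
    const raw = env[name];
    if (raw === undefined || raw === '') return fallback;
    const value = Number(raw);
    if (!Number.isInteger(value) || value < 0) {
      throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
    }
    return value;
  };
  return {
    mmToken: required('MM_TOKEN'),
    mmUrl: env.MM_URL || 'http://mattermost:8065',
    transcriptDir: required('TRANSCRIPT_DIR'),
    throttleMs: integer('THROTTLE_MS', 500),
    idleTimeoutS: integer('IDLE_TIMEOUT_S', 60),
    maxSessionDurationS: integer('MAX_SESSION_DURATION_S', 1800),
    maxStatusLines: integer('MAX_STATUS_LINES', 15),
    maxActiveSessions: integer('MAX_ACTIVE_SESSIONS', 20),
    maxMessageChars: integer('MAX_MESSAGE_CHARS', 15000),
    healthPort: integer('HEALTH_PORT', 9090),
    logLevel: env.LOG_LEVEL || 'info',
    circuitBreakerThreshold: integer('CIRCUIT_BREAKER_THRESHOLD', 5),
    circuitBreakerCooldownS: integer('CIRCUIT_BREAKER_COOLDOWN_S', 30),
    pidFile: env.PID_FILE || '/tmp/status-watcher.pid',
    toolLabelsFile: env.TOOL_LABELS_FILE || null,
  };
}
```

Failing fast on a missing `MM_TOKEN` or a malformed number surfaces misconfiguration when the daemon starts, rather than on the first Mattermost update.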

Revised Implementation Plan

Phase 0: Discovery [NEW]

Document the actual JSONL transcript format (grab sample, map schema)
Verify OpenClaw hooks API exists and document its payload
Identify transcript directory path and file naming convention
Verify session key format for sub-agent detection
Test fs.watch recursive behavior on the target Linux kernel
Document Mattermost rate limits on the target instance

Phase 1: Core Watcher + Production Foundation

src/status-watcher.js -- multiplexed directory watcher, JSONL parser, SessionState management
src/status-box.js -- Mattermost post manager with shared HTTP pool, throttle, message size cap
src/tool-labels.js -- pattern-matching tool name to label mapping
src/config.js -- centralized configuration from env vars with validation
src/logger.js -- pino-based structured logging
src/circuit-breaker.js -- circuit breaker for Mattermost API
src/health.js -- HTTP health endpoint with metrics
Graceful shutdown handlers (SIGTERM/SIGINT)
File truncation detection (session compaction)
Smart idle completion heuristic
Tests: unit tests for JSONL parser, tool-labels matcher, circuit breaker, throttle logic

Phase 2: Session Lifecycle + Restart Recovery

Hook integration (register with OpenClaw hooks API)
Fallback: directory polling for new transcripts if hooks unavailable
Restart recovery: persist session offsets, recover existing Mattermost posts
PID file management
Thread-aware: detect thread root ID from session context
Tests: integration tests for lifecycle events, restart recovery

Phase 3: Sub-Agent Support

Detect sub-agent transcripts by session key pattern
Link sub-agent status to parent status box
Nested rendering in status box
Cascade completion (parent waits for all children)
Tests: end-to-end test with mock parent + child transcripts
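
Sub-agent detection could start from a parser like the one below. It assumes the `agent:id:subagent:uuid` pattern from the architecture section holds exactly and that a parent's own key is `agent:id`; both are unverified until Phase 0, and `parseSessionKey` is a hypothetical name.

```javascript
// Splits a session key into its parts, or returns null if it does not
// match the assumed "agent:<id>" / "agent:<id>:subagent:<uuid>" shapes.
function parseSessionKey(key) {
  const m = /^agent:([^:]+)(?::subagent:([^:]+))?$/.exec(key);
  if (!m) return null;
  const subagentId = m[2] ?? null;
  return {
    agentId: m[1],
    subagentId,
    // Assumed parent key; used to find the status box to nest under.
    parentKey: subagentId ? `agent:${m[1]}` : null,
  };
}
```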

Phase 4: Deployment + Migration

install.sh -- new install flow (env-var based, no token sed replacement)
deploy/status-watcher.service -- systemd unit file
deploy/Dockerfile -- containerized deployment option
skill/SKILL.md -- rewrite (simplified: "status is automatic")
README.md -- full v4 documentation
Remove AGENTS.md protocol injection from deploy scripts
Migration guide: v1 -> v4
Deprecation notice on src/live-status.js

Revised Status Box Format

[ACTIVE] god-agent | 38s
Reading live-status source code...
  Read: src/live-status.js [OK]
Analyzing agent configurations...
  exec: grep -r live-status [OK]
Writing new implementation...
  Sub-agent: coder-agent (Phase 1)
    Writing status-watcher.js...
    [DONE] 13s
[DONE] 38s | 12.4k tokens | $0.08

Key changes from original:

Compact (15 lines max, oldest dropped)
Status prefix: [ACTIVE], [DONE], [ERROR], [INTERRUPTED]
No emoji (Mattermost API compatibility)
Duration in footer only shows when complete
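
Rendering with both caps applied might look like this. The header layout follows the example above; `renderStatusBox`, the `box` fields and the `[truncated]` marker are assumptions, and the length check counts UTF-16 code units, which may differ slightly from how Mattermost counts characters.

```javascript
// Renders a compact status box: header plus the most recent lines,
// truncated so the result never exceeds maxChars.
function renderStatusBox(box, { maxLines = 15, maxChars = 15000 } = {}) {
  const header = `[${box.status}] ${box.agent} | ${box.elapsedSeconds}s`;
  // Keep only the newest lines; the oldest are dropped first.
  const recent = maxLines > 0 ? box.lines.slice(-maxLines) : [];
  const text = [header, ...recent].join('\n');
  if (text.length <= maxChars) return text;
  const marker = '\n[truncated]';
  return text.slice(0, maxChars - marker.length) + marker;
}
```

Applying the line cap before the character cap means truncation should only trigger when individual lines are unusually long, for example a long exec command.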

Files to Create/Modify (Revised)

File	Action	Purpose
`src/status-watcher.js`	CREATE	Multiplexed directory watcher daemon
`src/status-box.js`	CREATE	Mattermost post manager with connection pool
`src/tool-labels.js`	CREATE	Pattern-matching tool label resolver
`src/config.js`	CREATE	Centralized env-var configuration
`src/logger.js`	CREATE	Structured logging (pino wrapper)
`src/circuit-breaker.js`	CREATE	Circuit breaker for API resilience
`src/health.js`	CREATE	Health check HTTP endpoint
`src/live-status.js`	DEPRECATE	Keep for backward compat, add deprecation warning
`skill/SKILL.md`	REWRITE	"Status is automatic" (Phase 4)
`install.sh`	REWRITE	Env-var based install (Phase 4)
`deploy/status-watcher.service`	CREATE	Systemd unit file (Phase 4)
`deploy/Dockerfile`	CREATE	Container deployment (Phase 4)
`README.md`	REWRITE	Full v4 docs (Phase 4)
`test/`	CREATE	Unit + integration + e2e tests
`package.json`	UPDATE	Add pino dependency, test scripts

Revised Success Criteria

Agents produce live status updates WITHOUT any explicit live-status calls
Sub-agent progress is visible in real-time, nested under parent
No status spam in final response
Works across thread sessions automatically
Single daemon handles all concurrent sessions (no per-session processes)
Survives session compaction (file truncation detection)
Survives daemon restarts (offset persistence, post recovery)
Survives Mattermost outages (circuit breaker, bounded retry queue)
Health endpoint reports daemon status and metrics
Structured JSON logging for production debugging
All configuration via environment variables
No hardcoded credentials anywhere
Test coverage for parser, throttle, circuit breaker, idle heuristic
Single install command deploys everything
Graceful shutdown marks all active boxes as interrupted

Risk Assessment (Revised)

Risk	Impact	Mitigation	Status
Transcript format undocumented	High	Phase 0 discovery task	Open
Hook API may not exist	High	Fallback to directory polling	Mitigated
Mattermost rate limits	Medium	Throttle + circuit breaker	Mitigated
Session compaction truncates file	Medium	Detect size < offset, reset reader	Mitigated
Daemon crashes mid-session	Medium	Restart recovery with persisted offsets	Mitigated
Mattermost extended outage	Medium	Circuit breaker + bounded queue	Mitigated
Too many concurrent sessions	Low	Bounded concurrency (MAX_ACTIVE_SESSIONS)	Mitigated
Docker networking	Low	Already solved in v1	Mitigated

Effort Estimate

Phase	Estimated Time	Parallelizable	Depends On
Phase 0: Discovery	2-3 hours	No	Nothing
Phase 1: Core + Foundation	8-12 hours	Partially (logger, config, circuit-breaker are independent)	Phase 0
Phase 2: Lifecycle + Recovery	4-6 hours	No	Phase 1
Phase 3: Sub-Agent Support	3-4 hours	No	Phase 2
Phase 4: Deployment + Migration	3-4 hours	Yes (docs, deploy scripts, skill rewrite)	Phase 3
Total	20-29 hours

This plan is ready for approval. Phase 0 (discovery) can begin immediately as it requires no code changes.

mapping - [ ] `src/config.js` -- centralized configuration from env vars with validation - [ ] `src/logger.js` -- pino-based structured logging - [ ] `src/circuit-breaker.js` -- circuit breaker for Mattermost API - [ ] `src/health.js` -- HTTP health endpoint with metrics - [ ] Graceful shutdown handlers (SIGTERM/SIGINT) - [ ] File truncation detection (session compaction) - [ ] Smart idle completion heuristic - [ ] **Tests**: unit tests for JSONL parser, tool-labels matcher, circuit breaker, throttle logic ### Phase 2: Session Lifecycle + Restart Recovery - [ ] Hook integration (register with OpenClaw hooks API) - [ ] Fallback: directory polling for new transcripts if hooks unavailable - [ ] Restart recovery: persist session offsets, recover existing Mattermost posts - [ ] PID file management - [ ] Thread-aware: detect thread root ID from session context - [ ] **Tests**: integration tests for lifecycle events, restart recovery ### Phase 3: Sub-Agent Support - [ ] Detect sub-agent transcripts by session key pattern - [ ] Link sub-agent status to parent status box - [ ] Nested rendering in status box - [ ] Cascade completion (parent waits for all children) - [ ] **Tests**: end-to-end test with mock parent + child transcripts ### Phase 4: Deployment + Migration - [ ] `install.sh` -- new install flow (env-var based, no token sed replacement) - [ ] `deploy/status-watcher.service` -- systemd unit file - [ ] `deploy/Dockerfile` -- containerized deployment option - [ ] `skill/SKILL.md` -- rewrite (simplified: "status is automatic") - [ ] `README.md` -- full v4 documentation - [ ] Remove AGENTS.md protocol injection from deploy scripts - [ ] Migration guide: v1 -> v4 - [ ] Deprecation notice on `src/live-status.js` --- ## Revised Status Box Format ``` [ACTIVE] god-agent | 38s Reading live-status source code... Read: src/live-status.js [OK] Analyzing agent configurations... exec: grep -r live-status [OK] Writing new implementation... 
Sub-agent: coder-agent (Phase 1) Writing status-watcher.js... [DONE] 13s [DONE] 38s | 12.4k tokens | $0.08 ``` Key changes from original: - Compact (15 lines max, oldest dropped) - Status prefix: `[ACTIVE]`, `[DONE]`, `[ERROR]`, `[INTERRUPTED]` - No emoji (Mattermost API compatibility) - Duration in footer only shows when complete --- ## Files to Create/Modify (Revised) | File | Action | Purpose | |------|--------|---------| | `src/status-watcher.js` | CREATE | Multiplexed directory watcher daemon | | `src/status-box.js` | CREATE | Mattermost post manager with connection pool | | `src/tool-labels.js` | CREATE | Pattern-matching tool label resolver | | `src/config.js` | CREATE | Centralized env-var configuration | | `src/logger.js` | CREATE | Structured logging (pino wrapper) | | `src/circuit-breaker.js` | CREATE | Circuit breaker for API resilience | | `src/health.js` | CREATE | Health check HTTP endpoint | | `src/live-status.js` | DEPRECATE | Keep for backward compat, add deprecation warning | | `skill/SKILL.md` | REWRITE | "Status is automatic" (Phase 4) | | `install.sh` | REWRITE | Env-var based install (Phase 4) | | `deploy/status-watcher.service` | CREATE | Systemd unit file (Phase 4) | | `deploy/Dockerfile` | CREATE | Container deployment (Phase 4) | | `README.md` | REWRITE | Full v4 docs (Phase 4) | | `test/` | CREATE | Unit + integration + e2e tests | | `package.json` | UPDATE | Add pino dependency, test scripts | --- ## Revised Success Criteria - [ ] Agents produce live status updates WITHOUT any explicit live-status calls - [ ] Sub-agent progress is visible in real-time, nested under parent - [ ] No status spam in final response - [ ] Works across thread sessions automatically - [ ] Single daemon handles all concurrent sessions (no per-session processes) - [ ] Survives session compaction (file truncation detection) - [ ] Survives daemon restarts (offset persistence, post recovery) - [ ] Survives Mattermost outages (circuit breaker, bounded retry queue) - 
[ ] Health endpoint reports daemon status and metrics - [ ] Structured JSON logging for production debugging - [ ] All configuration via environment variables - [ ] No hardcoded credentials anywhere - [ ] Test coverage for parser, throttle, circuit breaker, idle heuristic - [ ] Single install command deploys everything - [ ] Graceful shutdown marks all active boxes as interrupted --- ## Risk Assessment (Revised) | Risk | Impact | Mitigation | Status | |------|--------|------------|--------| | Transcript format undocumented | High | Phase 0 discovery task | Open | | Hook API may not exist | High | Fallback to directory polling | Mitigated | | Mattermost rate limits | Medium | Throttle + circuit breaker | Mitigated | | Session compaction truncates file | Medium | Detect size < offset, reset reader | Mitigated | | Daemon crashes mid-session | Medium | Restart recovery with persisted offsets | Mitigated | | Mattermost extended outage | Medium | Circuit breaker + bounded queue | Mitigated | | Too many concurrent sessions | Low | Bounded concurrency (MAX_ACTIVE_SESSIONS) | Mitigated | | Docker networking | Low | Already solved in v1 | Mitigated | --- ## Effort Estimate | Phase | Estimated Time | Parallelizable | Depends On | |-------|---------------|----------------|------------| | Phase 0: Discovery | 2-3 hours | No | Nothing | | Phase 1: Core + Foundation | 8-12 hours | Partially (logger, config, circuit-breaker are independent) | Phase 0 | | Phase 2: Lifecycle + Recovery | 4-6 hours | No | Phase 1 | | Phase 3: Sub-Agent Support | 3-4 hours | No | Phase 2 | | Phase 4: Deployment + Migration | 3-4 hours | Yes (docs, deploy scripts, skill rewrite) | Phase 3 | | **Total** | **20-29 hours** | | | --- This plan is ready for approval. Phase 0 (discovery) can begin immediately as it requires no code changes.
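
The smart idle detection rules in the plan above can be read as a single decision function. The following is a minimal sketch under stated assumptions, not the code that shipped: the state shape (`pendingToolCalls`, `lastEntryType`, `lastActivity`, `startTime`) and the function name `shouldComplete` are illustrative, and it assumes that the shorter 30s timer in rule 4 ends in completion when it expires.

```javascript
'use strict';

// Defaults mirror the plan's configuration table (IDLE_TIMEOUT_S, MAX_SESSION_DURATION_S).
const IDLE_TIMEOUT_MS = 60_000;              // rule 3: quiet period after an assistant message
const POST_TOOL_TIMEOUT_MS = 30_000;         // rule 4: shorter timer after a tool_result
const MAX_SESSION_DURATION_MS = 30 * 60_000; // rule 5: hard timeout

/**
 * Decide whether a session should be marked complete.
 * `state` is a hypothetical shape:
 *   { pendingToolCalls, lastEntryType, lastActivity, startTime }
 */
function shouldComplete(state, now = Date.now()) {
  // Rule 5: the hard timeout wins over everything else.
  if (now - state.startTime >= MAX_SESSION_DURATION_MS) return true;
  // Rule 2: outstanding tool calls mean the session is not idle.
  if (state.pendingToolCalls > 0) return false;

  const quietFor = now - state.lastActivity;
  // Rule 3: last entry was an assistant message.
  if (state.lastEntryType === 'assistant') return quietFor >= IDLE_TIMEOUT_MS;
  // Rule 4: last entry was a tool_result (assumed to complete when the timer expires).
  if (state.lastEntryType === 'tool_result') return quietFor >= POST_TOOL_TIMEOUT_MS;
  // Any other entry type: keep waiting (an assumption, the plan does not cover it).
  return false;
}
```

Keeping the decision pure (time is passed in) makes the heuristic easy to unit-test without real timers.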

sol commented

2026-03-07 17:03:27 +01:00

Approved

sol commented

2026-03-07 17:03:42 +01:00

Thanks for the review and feedback, sol. The revised plan is posted above (comment #11402).

However, plan execution requires owner approval (Rooh). This is a security gate -- contributor approvals cannot trigger execution.

Current state: PENDING_APPROVAL -- waiting for Rooh to review and approve the revised plan before any code is written.

Rooh: when you are ready, the approval words are: "go", "approved", "approve", "proceed", "build it", "execute", "ship it", or "do it".

sol commented

2026-03-07 17:05:02 +01:00

[SYNTHESIS] New planner session active (proj035-planner-v2). Prior plan found in repo (PLAN.md) plus a revised architecture in comment #11402. The revised plan incorporates production-grade changes: single multiplexed daemon, shared HTTP connection pool, circuit breaker, pino logging, health endpoint, env-var config. Now drafting the final PLAN.md that reflects the revised architecture.

sol referenced this issue from a commit

2026-03-07 17:07:18 +01:00

protocol: add PRODUCTION_REVIEW state to implementation-plan-beta v2

sol commented

2026-03-07 17:07:49 +01:00

[DRAFTING_PLAN -> AUTO_AUDIT] Final PLAN.md written (production-grade, revised architecture). Running audit checklist now.

sol commented

2026-03-07 17:09:03 +01:00

[SYNTHESIS] Planner v2 session active (proj035-planner-v2). Read all 14 issue comments including the full production review. Prior plan is in PLAN.md / STATE.json. The previous plan had the per-session daemon anti-pattern that the production review correctly identified as the #1 architectural flaw. Now synthesizing all findings into a complete revised architecture before drafting the final production-grade plan.

Key architecture decisions to incorporate:

Single multiplexed watcher daemon (NOT per-session)
Shared HTTP connection pool (keep-alive, maxSockets cap)
Throttle with trailing flush debouncing (not naive 500ms leading-edge)
Structured JSON logging from Phase 1 (pino)
Graceful shutdown (SIGTERM/SIGINT handlers)
Circuit breaker for Mattermost API
Smart idle detection (pending tool calls = not idle)
Restart idempotency (resume from crash via persisted state)
fs.watch on Linux (inotify) with ENOENT/EPERM/rename handling
Health check endpoint from Phase 1
Observability built into Phase 1, not deferred
Testing strategy with unit + integration + e2e
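
To make the throttle decision concrete, here is a minimal sketch of a leading-edge throttle with a trailing flush. It is an illustration under assumptions, not the contents of `src/status-box.js`: the name `createThrottle` and the injectable `schedule`/`now` options are hypothetical and exist only so the logic can be exercised without real timers.

```javascript
'use strict';

/**
 * Leading-edge + trailing-flush throttle.
 * - The first call in a quiet period is sent immediately.
 * - Calls inside the interval are coalesced; only the latest value is kept.
 * - A timer guarantees the latest value is flushed once the interval elapses.
 */
function createThrottle(send, intervalMs, { schedule = setTimeout, now = Date.now } = {}) {
  let lastSent = -Infinity;
  let pending;            // latest value not yet sent
  let hasPending = false;
  let timerArmed = false;

  function flush() {
    timerArmed = false;
    if (!hasPending) return;
    const value = pending;
    pending = undefined;
    hasPending = false;
    lastSent = now();
    send(value);
  }

  return function update(value) {
    const elapsed = now() - lastSent;
    if (!timerArmed && elapsed >= intervalMs) {
      lastSent = now();
      send(value);                             // leading edge
      return;
    }
    pending = value;                           // coalesce into the latest value
    hasPending = true;
    if (!timerArmed) {
      timerArmed = true;
      schedule(flush, intervalMs - elapsed);   // trailing flush
    }
  };
}
```

With the plan's default of 500ms, a burst of ten transcript lines would produce one immediate update and one flush carrying the final state, rather than ten API calls.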

sol commented

2026-03-07 17:09:47 +01:00

[AUTO_AUDIT] 34/34 PASS. Warnings: (1) make check currently fails on existing live-status.js (43 lint issues - addressed in Phase 0.2). (2) Gateway restart needed to activate hook in Phase 4.

[SIMULATION] READY. Key verifications:

Node 22 confirmed (inotify recursive fs.watch supported)
pino installs cleanly via npm
sessions.json confirmed at /home/node/.openclaw/agents/{agent}/sessions/sessions.json
Sub-agent key format confirmed: agent:main:subagent:{uuid} with spawnedBy/spawnDepth fields
Mattermost API reachable (200 on ping)
MM PostEditTimeLimit=-1 confirmed (unlimited edits)
Transcript directory and JSONL format confirmed from real files
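
As an illustration of how the confirmed sub-agent key format could be used for detection, the sketch below parses a session key and reads the `spawnedBy`/`spawnDepth` fields. Treat it as a sketch under assumptions: the function names are hypothetical, the middle key segment is assumed to be an agent ID, and the UUID is assumed to be the usual 36-character form.

```javascript
'use strict';

// Assumed shape, based on the key format listed above: agent:<agentId>:subagent:<uuid>
const SUBAGENT_KEY = /^agent:([^:]+):subagent:([0-9a-f-]{36})$/i;

/** Split a session key into its parts; non-matching keys are treated as top-level sessions. */
function parseSessionKey(key) {
  const m = SUBAGENT_KEY.exec(key);
  if (!m) return { isSubAgent: false };
  return { isSubAgent: true, agentId: m[1], uuid: m[2] };
}

/** Combine the key with a sessions.json entry (spawnedBy and spawnDepth are the fields named above). */
function describeSession(key, entry = {}) {
  return {
    ...parseSessionKey(key),
    parentKey: entry.spawnedBy ?? null,
    depth: entry.spawnDepth ?? 0,
  };
}
```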

Proceeding to PENDING_APPROVAL.

sol commented

2026-03-07 17:10:06 +01:00

[PENDING_APPROVAL] Plan v2 ready for approval.

Full plan committed to repo: https://git.eeqj.de/ROOH/MATTERMOST_OPENCLAW_LIVESTATUS/src/branch/master/PLAN.md

SUMMARY: Live Status v4 (Production-Grade)

Core change: replace broken agent-cooperative system with a single multiplexed daemon that tails all OpenClaw JSONL transcripts automatically. Zero agent cooperation required.

ARCHITECTURE HIGHLIGHTS (revised from original):

Single daemon process (not per-session) -- eliminates unbounded process spawning
Shared HTTP connection pool with keepAlive (not per-session HTTP)
Leading-edge + trailing-flush throttle (not simple debounce)
Circuit breaker for Mattermost API resilience
pino structured JSON logging
Health endpoint on port 9090
All config via environment variables (no hardcoded tokens)
Smart idle detection via pendingToolCalls tracking (not naive 30s timeout)
Restart recovery: offset persistence + post recovery from channel history
Session compaction handled: detect file truncation, reset offset
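
For readers unfamiliar with the circuit breaker mentioned in these highlights, a minimal state machine might look like the following. This is a sketch, not the shipped `src/circuit-breaker.js`: the class API (`canRequest`, `recordSuccess`, `recordFailure`) is an assumption, the defaults come from the plan (5 consecutive failures, 30s cooldown), and the bounded update queue is left out.

```javascript
'use strict';

class CircuitBreaker {
  constructor({ threshold = 5, cooldownMs = 30_000, now = Date.now } = {}) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;          // injectable clock, for tests
    this.state = 'CLOSED';
    this.failures = 0;
    this.openedAt = 0;
  }

  /** Returns true if a request may be attempted right now. */
  canRequest() {
    if (this.state === 'OPEN' && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = 'HALF_OPEN'; // cooldown elapsed: allow a probe request
    }
    // Simplification: callers are expected to send only one probe while HALF_OPEN.
    return this.state !== 'OPEN';
  }

  recordSuccess() {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  recordFailure() {
    this.failures += 1;
    // A failed probe re-opens immediately; otherwise open at the threshold.
    if (this.state === 'HALF_OPEN' || this.failures >= this.threshold) {
      this.state = 'OPEN';
      this.openedAt = this.now();
    }
  }
}
```

A caller would check `canRequest()` before each Mattermost API call and report the outcome with `recordSuccess()` or `recordFailure()`.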

PHASES:

Phase 0 (30min): Repo sync + fix existing lint errors + env verification
Phase 1 (8-12h): Core components (watcher, status-box, formatter, circuit-breaker, logger, config, health)
Phase 2 (4-6h): Session monitor + lifecycle + restart recovery
Phase 3 (3-4h): Sub-agent support + nesting
Phase 4 (1h): Hook integration (auto-start on gateway:startup)
Phase 5 (3-4h): Polish + deployment files + SKILL.md rewrite
Phase 6 (30min): Remove v1 AGENTS.md injection (after watcher confirmed working)

TOTAL: 20-28 hours

DEPENDENCIES: pino (npm, one new dep). All else uses Node.js built-ins.

AUDIT: 34/34 PASS
SIMULATION: READY (all paths verified against real system)

Waiting for Rooh approval. Approval words: go, approved, approve, proceed, build it, execute, ship it, do it.

sol commented

2026-03-07 17:10:23 +01:00

[SYNTHESIS -> DRAFTING_PLAN] Synthesis complete. Key architectural clarifications confirmed:

Workspace hooks dir: /home/node/.openclaw/workspace/hooks/ (highest precedence, auto-discovered)
handler.js supported (handlerCandidates: handler.ts, handler.js, index.ts, index.js)
pino NOT available in local node_modules -- either write structured JSON to stderr directly (process.stderr.write) or install pino as a dependency
Sessions directory: /home/node/.openclaw/agents/{agentId}/sessions/
Sub-agent JSONL: topic-scoped file if threadId present (e.g., {uuid}-topic-{topicId}.jsonl)
Current session file field: v.sessionFile is the absolute path
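
The stderr option in the pino clarification above could be as small as the sketch below. It is a hypothetical fallback, not what the project shipped (pino was later added as a dependency): `formatLogLine` and `log` are illustrative names, and only `LOG_LEVEL` comes from the plan's configuration table.

```javascript
'use strict';

const LEVELS = { error: 0, warn: 1, info: 2, debug: 3 };

/** Build one JSON log line; returns null when the entry is below the configured level. */
function formatLogLine(level, event, fields = {}, minLevel = 'info') {
  if (LEVELS[level] > LEVELS[minLevel]) return null;
  return JSON.stringify({ time: new Date().toISOString(), level, event, ...fields }) + '\n';
}

/** Write a structured entry to stderr, keeping stdout free for other output. */
function log(level, event, fields) {
  const line = formatLogLine(level, event, fields, process.env.LOG_LEVEL || 'info');
  if (line) process.stderr.write(line);
}
```

For example, `log('warn', 'post_update_failed', { sessionKey })` would emit a single JSON object per line, which is the same shape pino produces by default.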

Drafting PLAN.md now. This plan incorporates ALL production review findings from the start.

sol commented

2026-03-07 18:06:17 +01:00

Continue

sol commented

2026-03-07 18:15:56 +01:00

Approved by Rooh (via Mattermost DM). Proceeding with execution.

sol commented

2026-03-07 18:27:08 +01:00

Phase 0+1 complete. Pushed to master (43cfebe).

Phase 0 (Repo Sync + Environment):

Synced live-status.js from workspace
Fixed 43 lint issues (empty catch blocks, console statements)
Added pino dependency
Created src/tool-labels.json with all known tool mappings
make check passes

Phase 1 (Core Components) - 8 modules:

src/config.js: env-var config with validation
src/logger.js: pino singleton with child loggers
src/circuit-breaker.js: CLOSED/OPEN/HALF_OPEN state machine
src/tool-labels.js: exact/prefix/regex resolver + external override
src/status-box.js: Mattermost post manager (keepAlive, throttle, retry, circuit breaker)
src/status-formatter.js: pure SessionState->text formatter
src/health.js: HTTP health endpoint
src/status-watcher.js: JSONL watcher (inotify, compaction, idle detection)
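
As a rough illustration of the exact/prefix/regex resolver listed for `src/tool-labels.js`, the sketch below resolves a tool name in that order. The table layout (`exact`, `prefix`, `regex` keys) and the function name `resolveLabel` are assumptions; the three example mappings and the default label are the ones given in the plan.

```javascript
'use strict';

const DEFAULT_LABEL = 'Working...';

// Example mappings from the plan; a real table would be loaded from configuration.
const LABELS = {
  exact: { Read: 'Reading file...' },
  prefix: { 'exec:': 'Running command...' },
  regex: [{ pattern: /^web_/, label: 'Searching the web...' }],
};

/** Resolve a tool name to a status label: exact match, then prefix, then regex, then default. */
function resolveLabel(toolName, labels = LABELS) {
  if (Object.hasOwn(labels.exact, toolName)) return labels.exact[toolName];
  for (const [prefix, label] of Object.entries(labels.prefix)) {
    if (toolName.startsWith(prefix)) return label;
  }
  for (const { pattern, label } of labels.regex) {
    if (pattern.test(toolName)) return label;
  }
  return DEFAULT_LABEL;
}
```

Checking exact matches first keeps the common case cheap and lets a specific entry override a broader pattern.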

Tests: 59/59 passing. make check clean.

Starting Phase 2: Session Monitor + Lifecycle.

sol commented

2026-03-07 18:32:42 +01:00

Phase 2 complete. Pushed to master (e3bd6c5).

Phase 2 (Session Monitor + Lifecycle):

src/session-monitor.js: polls sessions.json every 2s, multi-agent, channel ID parsing
src/watcher-manager.js: top-level orchestrator with full lifecycle
- Session creation/cleanup
- Sub-agent linking (embedded in parent status)
- Offset persistence (restart recovery)
- Post recovery (search channel history for marker)
- Graceful shutdown (SIGTERM/SIGINT)
- CLI: start|stop|status
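
Offset persistence depends on reading only the bytes appended since the last known offset and on noticing when compaction has truncated the file. A minimal sketch of that read step follows. It is an assumption-laden illustration: `readNewEntries` is a hypothetical name, and it uses synchronous I/O for brevity where a long-running daemon would more likely use the async `fs` APIs.

```javascript
'use strict';
const fs = require('node:fs');

/**
 * Read complete JSONL lines appended since `lastOffset`.
 * Returns the parsed entries and the offset to persist for the next call.
 */
function readNewEntries(filePath, lastOffset) {
  const size = fs.statSync(filePath).size;
  // Compaction rewrote the file (size < offset): start again from the beginning.
  const offset = size < lastOffset ? 0 : lastOffset;
  if (size === offset) return { entries: [], offset };

  const fd = fs.openSync(filePath, 'r');
  let chunk;
  try {
    const buf = Buffer.alloc(size - offset);
    const bytesRead = fs.readSync(fd, buf, 0, buf.length, offset);
    chunk = buf.subarray(0, bytesRead);
  } finally {
    fs.closeSync(fd);
  }

  // Consume only up to the last newline so a half-written line is retried later.
  const lastNewline = chunk.lastIndexOf(0x0a);
  if (lastNewline === -1) return { entries: [], offset };

  const entries = [];
  for (const line of chunk.subarray(0, lastNewline).toString('utf8').split('\n')) {
    if (!line.trim()) continue;
    try {
      entries.push(JSON.parse(line));
    } catch {
      // Malformed line: skip it rather than stall the whole session.
    }
  }
  return { entries, offset: offset + lastNewline + 1 };
}
```

The returned `offset` is the value that would be persisted for restart recovery.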

Tests: 86/86 passing (59 unit + 27 integration). make check clean.

Starting Phase 3: Sub-Agent Support.

sol commented

2026-03-07 18:36:26 +01:00

Phase 3 complete. Pushed to master (6df3278).

Phase 3 (Sub-Agent Support):

session-monitor.js: sub-agents pass through regardless of channel (inherit parent)
watcher-manager.js:
- Pending sub-agent queue: child sessions queued when parent not yet registered
- linkSubAgent() helper for clean parent-child linking
- Cascade completion: parent waits for all children before marking done
- Sub-agents embedded in parent status post (no separate posts)
status-formatter.js: recursive nested rendering at any depth
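
The nested rendering and cascade completion described above can both be expressed as short recursive functions over a session tree. The sketch below is illustrative only: the node shape `{ name, status, lines, children }` and the function names are assumptions, and the output format is simplified compared with the real status box.

```javascript
'use strict';

/**
 * Render a session and its sub-agents as indented lines.
 * Each nesting level adds two spaces of indentation.
 */
function renderSession(node, depth = 0) {
  const pad = '  '.repeat(depth);
  const out = [`${pad}[${node.status}] ${node.name}`];
  for (const line of node.lines ?? []) out.push(`${pad}  ${line}`);
  for (const child of node.children ?? []) out.push(...renderSession(child, depth + 1));
  return out;
}

/** Cascade completion: a session counts as done only when it and every descendant are done. */
function isTreeDone(node) {
  return node.status === 'DONE' && (node.children ?? []).every((child) => isTreeDone(child));
}
```

Because sub-agents are embedded in the parent post, the joined output of `renderSession` for the root session would be the full body of a single Mattermost post.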

Integration tests (test/integration/sub-agent.test.js): 9 tests

spawnedBy detection (session-monitor level)
Nested status rendering (depth indentation, multiple children, deep nesting)
Cascade completion (pending tool call tracking)
Sub-agent JSONL parsing (usage events, error results)

PHASES 0-3 COMPLETE:

95/95 tests passing (59 unit + 36 integration)
make check clean (0 errors, warnings only)
3 commits on master

Files built:
src/config.js, src/logger.js, src/circuit-breaker.js, src/tool-labels.js
src/status-box.js, src/status-formatter.js, src/health.js, src/status-watcher.js
src/session-monitor.js, src/watcher-manager.js, src/tool-labels.json
src/live-status.js (lint-fixed, 43 issues resolved)
test/unit/ (5 files), test/integration/ (3 files)

sol commented

2026-03-07 18:41:15 +01:00

Phase 4 complete: Hook Integration

Deliverables:

hooks/status-watcher-hook/HOOK.md (events: ["gateway:startup"])
hooks/status-watcher-hook/handler.js (PID check + detached spawn)
Hook deployed to workspace: /home/node/.openclaw/workspace/hooks/status-watcher-hook/
make check passes (0 errors)
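
The "PID check + detached spawn" behaviour of the hook handler could be sketched roughly as follows. This is not the actual `handler.js`: the function names are hypothetical, the daemon is assumed to write its own PID file, and the `start` argument is taken from the CLI listed for watcher-manager.js.

```javascript
'use strict';
const fs = require('node:fs');
const { spawn } = require('node:child_process');

/** True if a process with this PID exists (signal 0 performs only the existence check). */
function isProcessAlive(pid) {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    return err.code === 'EPERM'; // process exists but belongs to another user
  }
}

/** Read a PID file; returns null if it is missing or malformed. */
function readPid(pidFile) {
  try {
    const pid = Number.parseInt(fs.readFileSync(pidFile, 'utf8').trim(), 10);
    return Number.isInteger(pid) && pid > 0 ? pid : null;
  } catch {
    return null; // no PID file yet
  }
}

/** Start the daemon unless the PID file points at a live process. */
function ensureDaemon(pidFile, scriptPath) {
  const pid = readPid(pidFile);
  if (pid !== null && isProcessAlive(pid)) return { started: false, pid };
  const child = spawn(process.execPath, [scriptPath, 'start'], {
    detached: true,
    stdio: 'ignore',
  });
  child.unref(); // do not keep the gateway process waiting on the daemon
  return { started: true, pid: child.pid };
}
```

Checking the PID first keeps the hook idempotent, so repeated `gateway:startup` events would not spawn duplicate daemons.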

Commit: 5bb3615

Starting Phase 5: Polish + Deployment.

sol commented

2026-03-07 18:45:36 +01:00

Phase 5 complete: Polish + Deployment

Deliverables:

skill/SKILL.md: rewritten to 9 lines ("status is automatic, no action needed")
deploy-to-agents.sh: no AGENTS.md injection; deploys hook only
install.sh: npm install + deploy hook + env var instructions
deploy/status-watcher.service: systemd unit file
deploy/Dockerfile: node:22-alpine container option
src/live-status.js: deprecation warning + start-watcher/stop-watcher pass-through
README.md: architecture, install, config reference, upgrade guide, troubleshooting
make check: 0 errors, 0 format issues
npm test: 59 unit + 36 integration = 95 tests passing

Commit: 835faa0

Starting Phase 6: v1 removal checklist.

sol commented

2026-03-07 18:47:32 +01:00

Phase 6 complete: v1 Removal Checklist + Implementation Complete

Deliverables:

docs/v1-removal-checklist.md: exact sections to remove from 6 AGENTS.md files
- /home/node/.openclaw/workspace/AGENTS.md (lines 645-669)
- /home/node/.openclaw/agents/xen/workspace/AGENTS.md (~line 214)
- /home/node/.openclaw/agents/coder-agent/workspace/AGENTS.md (~line 214)
- /home/node/.openclaw/workspaces/workspace-gym/AGENTS.md (~line 214)
- /home/node/.openclaw/workspaces/workspace-global-calendar/AGENTS.md (~line 214)
- /home/node/.openclaw/workspaces/workspace-god-agent/AGENTS.md (~line 218)
STATE.json: IMPLEMENTATION_COMPLETE, phase 6/6, all test results documented
make check: clean (0 errors)

All phases pushed to master. Commit: 3879988

Summary of Phases 4-6:

Phase 4: gateway:startup hook deployed to workspace hooks dir
Phase 5: SKILL.md (9 lines), deploy scripts, systemd service, Dockerfile, deprecation warning, README
Phase 6: v1 removal checklist (the removal itself is deferred until after 1h of verified production operation)

Final test results: 59 unit + 36 integration = 95 tests passing

Pending actions:

Restart gateway to activate the hook (coordinate with Rooh)
Set required env vars: MM_TOKEN, MM_URL, TRANSCRIPT_DIR, SESSIONS_JSON
After 1h verified operation: run v1-removal-checklist.md to clean AGENTS.md

sol referenced this issue

2026-03-07 20:30:46 +01:00

[IMPLEMENT] Live Status v4.1 — Bug Fixes, UX Improvements & Mattermost Plugin #4

sol referenced this issue

2026-03-07 20:31:44 +01:00

[IMPLEMENT] Live Status v4.1 — Bug Fixes, UX Improvements & Mattermost Plugin #4

sol referenced this issue

2026-03-09 20:23:19 +01:00

[IMPLEMENT] Debug Audit — Full Repo Code Review (Read-Only, No Functional Changes) #9

sol closed this issue

2026-03-09 20:52:27 +01:00

sol referenced this issue

2026-03-09 20:57:45 +01:00

[IMPLEMENT] Debug Audit — Full Repo Code Review (Read-Only, No Functional Changes) #9

sol commented

2026-03-09 21:00:24 +01:00

Issue closed. Implementation complete.

Summary of what shipped:

Phase 1: status-watcher.js - transcript tail daemon + Mattermost status box
Phase 2: session-monitor.js + watcher-manager.js - session lifecycle, multi-agent support
Phase 3: Sub-agent linking and nested status display
Phase 4: Hook integration (gateway:startup event) - auto-starts the watcher daemon on gateway boot
Phase 5: Simplified skill/SKILL.md (9 lines: "status is automatic, no action needed"), new deploy scripts
Phase 6: v1 removal checklist for AGENTS.md cleanup across all agent workspaces

Core outcome: Agents no longer need to call live-status manually. The watcher daemon tails transcripts automatically and posts updates to Mattermost.

Reference: ROOH/MATTERMOST_OPENCLAW_LIVESTATUS#3