[v4] Live Status Rewrite - Production-Grade Real-Time Agent Progress System #3

Closed
opened 2026-03-07 16:25:50 +01:00 by sol · 27 comments
Collaborator

Problem Statement

The current live-status system (v1) is fundamentally broken in production.

Current Architecture

Agent -> exec("live-status create/update ...") -> Mattermost API

Diagnosed Failures

  1. Agents forget to use it. Even with MANDATORY instructions in every AGENTS.md, agents skip live-status because it requires 4+ separate exec calls per task, agents must remember post IDs across tool calls, and there's no enforcement mechanism.

  2. Spam problem. When agents DO try to update status, the final response dumps 10+ status messages into the chat.

  3. No sub-agent visibility. Sub-agents work in isolated sessions. Their progress is invisible until the announce step.

  4. Thread isolation breaks state. Mattermost threads create separate OpenClaw sessions.

  5. Naive solutions don't scale. Telling agents harder in system prompts doesn't work.


Proposed Solution: Live Status v4

Core Principle

Don't rely on agents to update status. Intercept their work automatically.

Architecture

A status-watcher daemon that tails the agent's JSONL transcript in real-time and auto-updates a Mattermost status box.

OpenClaw Gateway
  Agent Session -> writes transcript JSONL
  status-watcher daemon (per-session)
    -> fs.watch on transcript file
    -> Parses tool calls, results, assistant text
    -> Debounced Mattermost API updates (500ms)
    -> Auto-create/complete status box
  Sub-agent sessions
    -> Same watcher pattern
    -> Nested under parent status box

Components

1. status-watcher - Transcript Tail Daemon

  • tail -f the agent's JSONL transcript file
  • Parse each new line for tool calls, tool results, and assistant text
  • Map tool names to human-readable status labels
  • Debounce and batch updates to Mattermost (max 1 update/500ms)
  • Auto-create status box on first activity
  • Auto-mark complete when session goes idle (no new lines for 30s)
  • Handle sub-agent transcripts (nested status)

2. status-box - Mattermost Post Manager

  • Rich message attachments with colored status cards
  • Sub-agent progress as nested items
  • Timestamps and duration tracking
  • Auto-cleanup on session end

3. Hook Integration

  • Trigger status-watcher on session start via OpenClaw hooks
  • Kill watcher on session end
  • Route sub-agent announces through status box

4. Agent-Side Simplification

Agents get ONE simple instruction: Status updates are automatic. Focus on the task.


Implementation Plan

Phase 1: Core Watcher

  • status-watcher.js - transcript tail + parse + Mattermost update
  • Tool-name to status-label mapping (configurable)
  • Debounced Mattermost updates (500ms default)
  • Auto-create status box in correct channel/thread
  • Auto-complete detection (idle timeout)

Phase 2: Session Lifecycle

  • Start watcher when agent session begins (via hook or cron)
  • Stop watcher when session ends
  • Handle session compaction (transcript rewrite)
  • Thread-aware: detect thread root ID from session key

Phase 3: Sub-Agent Support

  • Watch sub-agent transcripts
  • Nest sub-agent status under parent status box
  • Cascade completion

Phase 4: Polish

  • Rich Mattermost attachments
  • Rate limiting
  • Error recovery
  • Metrics/logging
  • New deploy script
  • Remove old AGENTS.md protocol injection

Status Box Format (v4)

Agent: god-agent - Fixing live-status system
[15:21:22] Reading live-status source code...
[15:21:25] Read: /src/live-status.js done
[15:21:28] Analyzing agent configurations...
[15:21:30] exec: grep -r live-status ... done
[15:21:35] Writing new implementation...
[15:21:40] Sub-agent: coder-agent (Phase 1)
  [15:21:42] Writing status-watcher.js...
  [15:21:55] Complete (13s)
[15:22:00] Task complete (38s)
Runtime: 38s | Tokens: 12.4k | Cost: $0.08

Files to Create/Modify

  • src/status-watcher.js - CREATE - Core transcript watcher daemon
  • src/status-box.js - CREATE - Mattermost post manager
  • src/tool-labels.json - CREATE - Tool name to human label mapping
  • src/live-status.js - DEPRECATE - Keep for backward compat
  • skill/SKILL.md - REWRITE - Simpler instructions
  • deploy-to-agents.sh - REWRITE - Install watcher instead of prompt injection
  • install.sh - REWRITE - New install flow
  • README.md - REWRITE - Full v4 documentation

Success Criteria

  • Agents produce live status updates WITHOUT any explicit live-status calls
  • Sub-agent progress is visible in real-time
  • No status spam in final response
  • Works across thread sessions automatically
  • Survives session compaction and gateway restarts
  • Production-ready: rate limiting, error recovery, logging
  • Single install command deploys everything

References

  • Current code: src/live-status.js (CLI tool, ~250 lines)
  • OpenClaw transcripts: JSONL files at workspace/session-id.jsonl
  • OpenClaw hooks: POST /hooks/agent for session lifecycle events
  • OpenClaw sub-agents: agent:id:subagent:uuid session pattern
  • Mattermost API: POST/PUT /api/v4/posts
  • Inspiration: Google Antigravity-style live execution visibility
## Problem Statement The current live-status system (v1) is fundamentally broken in production. ### Current Architecture ``` Agent -> exec("live-status create/update ...") -> Mattermost API ``` ### Diagnosed Failures 1. **Agents forget to use it.** Even with MANDATORY instructions in every AGENTS.md, agents skip live-status because it requires 4+ separate exec calls per task, agents must remember post IDs across tool calls, and there's no enforcement mechanism. 2. **Spam problem.** When agents DO try to update status, the final response dumps 10+ status messages into the chat. 3. **No sub-agent visibility.** Sub-agents work in isolated sessions. Their progress is invisible until the announce step. 4. **Thread isolation breaks state.** Mattermost threads create separate OpenClaw sessions. 5. **Naive solutions don't scale.** Telling agents harder in system prompts doesn't work. --- ## Proposed Solution: Live Status v4 ### Core Principle **Don't rely on agents to update status. Intercept their work automatically.** ### Architecture A `status-watcher` daemon that tails the agent's JSONL transcript in real-time and auto-updates a Mattermost status box. ``` OpenClaw Gateway Agent Session -> writes transcript JSONL status-watcher daemon (per-session) -> fs.watch on transcript file -> Parses tool calls, results, assistant text -> Debounced Mattermost API updates (500ms) -> Auto-create/complete status box Sub-agent sessions -> Same watcher pattern -> Nested under parent status box ``` ### Components #### 1. status-watcher - Transcript Tail Daemon - tail -f the agent's JSONL transcript file - Parse each new line for tool calls, tool results, and assistant text - Map tool names to human-readable status labels - Debounce and batch updates to Mattermost (max 1 update/500ms) - Auto-create status box on first activity - Auto-mark complete when session goes idle (no new lines for 30s) - Handle sub-agent transcripts (nested status) #### 2. status-box - Mattermost Post Manager - Rich message attachments with colored status cards - Sub-agent progress as nested items - Timestamps and duration tracking - Auto-cleanup on session end #### 3. Hook Integration - Trigger status-watcher on session start via OpenClaw hooks - Kill watcher on session end - Route sub-agent announces through status box #### 4. Agent-Side Simplification Agents get ONE simple instruction: Status updates are automatic. Focus on the task. --- ## Implementation Plan ### Phase 1: Core Watcher - [ ] status-watcher.js - transcript tail + parse + Mattermost update - [ ] Tool-name to status-label mapping (configurable) - [ ] Debounced Mattermost updates (500ms default) - [ ] Auto-create status box in correct channel/thread - [ ] Auto-complete detection (idle timeout) ### Phase 2: Session Lifecycle - [ ] Start watcher when agent session begins (via hook or cron) - [ ] Stop watcher when session ends - [ ] Handle session compaction (transcript rewrite) - [ ] Thread-aware: detect thread root ID from session key ### Phase 3: Sub-Agent Support - [ ] Watch sub-agent transcripts - [ ] Nest sub-agent status under parent status box - [ ] Cascade completion ### Phase 4: Polish - [ ] Rich Mattermost attachments - [ ] Rate limiting - [ ] Error recovery - [ ] Metrics/logging - [ ] New deploy script - [ ] Remove old AGENTS.md protocol injection --- ## Status Box Format (v4) ``` Agent: god-agent - Fixing live-status system [15:21:22] Reading live-status source code... [15:21:25] Read: /src/live-status.js done [15:21:28] Analyzing agent configurations... [15:21:30] exec: grep -r live-status ... done [15:21:35] Writing new implementation... [15:21:40] Sub-agent: coder-agent (Phase 1) [15:21:42] Writing status-watcher.js... [15:21:55] Complete (13s) [15:22:00] Task complete (38s) Runtime: 38s | Tokens: 12.4k | Cost: $0.08 ``` --- ## Files to Create/Modify - src/status-watcher.js - CREATE - Core transcript watcher daemon - src/status-box.js - CREATE - Mattermost post manager - src/tool-labels.json - CREATE - Tool name to human label mapping - src/live-status.js - DEPRECATE - Keep for backward compat - skill/SKILL.md - REWRITE - Simpler instructions - deploy-to-agents.sh - REWRITE - Install watcher instead of prompt injection - install.sh - REWRITE - New install flow - README.md - REWRITE - Full v4 documentation --- ## Success Criteria - [ ] Agents produce live status updates WITHOUT any explicit live-status calls - [ ] Sub-agent progress is visible in real-time - [ ] No status spam in final response - [ ] Works across thread sessions automatically - [ ] Survives session compaction and gateway restarts - [ ] Production-ready: rate limiting, error recovery, logging - [ ] Single install command deploys everything --- ## References - Current code: src/live-status.js (CLI tool, ~250 lines) - OpenClaw transcripts: JSONL files at workspace/session-id.jsonl - OpenClaw hooks: POST /hooks/agent for session lifecycle events - OpenClaw sub-agents: agent:id:subagent:uuid session pattern - Mattermost API: POST/PUT /api/v4/posts - Inspiration: Google Antigravity-style live execution visibility
sol changed title from [v4] Live Status Rewrite — Production-Grade Real-Time Agent Progress System to [v4] Live Status Rewrite - Production-Grade Real-Time Agent Progress System 2026-03-07 16:28:30 +01:00
Author
Collaborator

Triage: v4 Live Status Rewrite

Reviewed the spec against the current codebase. Here is the assessment.

Current State (v1)

  • src/live-status.js (114 lines) -- simple CLI wrapper around Mattermost HTTP API
  • Agents must manually call live-status create / live-status update with post IDs
  • install.sh injects protocol text into AGENTS.md (prompt injection approach)
  • No sub-agent awareness, no auto-detection, no lifecycle hooks
  • Token hardcoded as placeholder, replaced by sed during install

Spec Assessment

The v4 proposal correctly diagnoses the core problem: agents don't reliably use explicit status tools. The solution (transcript-tailing daemon) is architecturally sound -- it moves the burden from agents to infrastructure.

Key Technical Observations

  1. Transcript JSONL tailing -- This depends on OpenClaw's internal transcript format. Need to verify: (a) where transcripts live, (b) the JSONL schema, (c) whether they are written synchronously or buffered. This is discovery work.

  2. Session lifecycle hooks -- The spec mentions POST /hooks/agent. Need to verify this hook exists and what payload it sends. If it does not exist, this becomes a dependency on OpenClaw core.

  3. Mattermost post management -- The proposed status-box.js concept (rich attachments, nested sub-agent status) is a significant step up from raw message text. Good direction.

  4. Debouncing -- 500ms default is reasonable. Should be configurable per-deployment.

  5. Idle timeout for auto-complete -- 30s might be too aggressive. Some tool calls (exec) can run for minutes. Needs a smarter heuristic (e.g., no new transcript lines AND no running exec sessions).

  6. Session compaction -- If OpenClaw rewrites/truncates transcript files, the watcher needs to handle file truncation (inode change, seek reset). This is a real edge case worth testing early.

Risk Areas

Risk Impact Notes
Transcript format undocumented High Core dependency -- if format changes, watcher breaks
Hook API may not exist yet High Blocks Phase 2 entirely
Mattermost rate limits Medium Debouncing helps, but burst scenarios need testing
Docker networking (mattermost:8065) Low Already solved in v1

Recommendation

This is a solid spec. The phased approach makes sense. Before building, Phase 0 discovery would de-risk the two biggest unknowns:

  1. Document the actual JSONL transcript format (grab a sample, map the schema)
  2. Verify the OpenClaw hooks API exists and what it provides

Once those are confirmed, Phases 1-4 can proceed with confidence.

This issue has [v4] in the title -- ready for implementation planning when the owner gives the go-ahead.

## Triage: v4 Live Status Rewrite Reviewed the spec against the current codebase. Here is the assessment. ### Current State (v1) - `src/live-status.js` (114 lines) -- simple CLI wrapper around Mattermost HTTP API - Agents must manually call `live-status create` / `live-status update` with post IDs - `install.sh` injects protocol text into AGENTS.md (prompt injection approach) - No sub-agent awareness, no auto-detection, no lifecycle hooks - Token hardcoded as placeholder, replaced by sed during install ### Spec Assessment The v4 proposal correctly diagnoses the core problem: **agents don't reliably use explicit status tools**. The solution (transcript-tailing daemon) is architecturally sound -- it moves the burden from agents to infrastructure. ### Key Technical Observations 1. **Transcript JSONL tailing** -- This depends on OpenClaw's internal transcript format. Need to verify: (a) where transcripts live, (b) the JSONL schema, (c) whether they are written synchronously or buffered. This is discovery work. 2. **Session lifecycle hooks** -- The spec mentions `POST /hooks/agent`. Need to verify this hook exists and what payload it sends. If it does not exist, this becomes a dependency on OpenClaw core. 3. **Mattermost post management** -- The proposed `status-box.js` concept (rich attachments, nested sub-agent status) is a significant step up from raw message text. Good direction. 4. **Debouncing** -- 500ms default is reasonable. Should be configurable per-deployment. 5. **Idle timeout for auto-complete** -- 30s might be too aggressive. Some tool calls (exec) can run for minutes. Needs a smarter heuristic (e.g., no new transcript lines AND no running exec sessions). 6. **Session compaction** -- If OpenClaw rewrites/truncates transcript files, the watcher needs to handle file truncation (inode change, seek reset). This is a real edge case worth testing early. ### Risk Areas | Risk | Impact | Notes | |------|--------|-------| | Transcript format undocumented | High | Core dependency -- if format changes, watcher breaks | | Hook API may not exist yet | High | Blocks Phase 2 entirely | | Mattermost rate limits | Medium | Debouncing helps, but burst scenarios need testing | | Docker networking (mattermost:8065) | Low | Already solved in v1 | ### Recommendation This is a solid spec. The phased approach makes sense. Before building, **Phase 0 discovery** would de-risk the two biggest unknowns: 1. Document the actual JSONL transcript format (grab a sample, map the schema) 2. Verify the OpenClaw hooks API exists and what it provides Once those are confirmed, Phases 1-4 can proceed with confidence. This issue has `[v4]` in the title -- ready for implementation planning when the owner gives the go-ahead. <!-- openclaw-agent -->
Author
Collaborator

[DECOMPOSE] Planner sub-agent active. Reading protocol, examining transcripts, and identifying knowledge gaps before drafting discovery sub-issues.

Focus areas:

  1. JSONL transcript format (types, structure)
  2. OpenClaw session lifecycle hooks
  3. Mattermost API constraints
  4. Sub-agent session keying
  5. Current failure modes in live-status
[DECOMPOSE] Planner sub-agent active. Reading protocol, examining transcripts, and identifying knowledge gaps before drafting discovery sub-issues. Focus areas: 1. JSONL transcript format (types, structure) 2. OpenClaw session lifecycle hooks 3. Mattermost API constraints 4. Sub-agent session keying 5. Current failure modes in live-status <!-- openclaw-agent -->
Author
Collaborator

[DISCOVERY -> SYNTHESIS] Discovery phase complete. Key findings below.

Discovery Findings

1. JSONL Transcript Format

Confirmed format. Each line is a JSON object with type field:

  • session -- First line. Contains session UUID, version=3, cwd
  • model_change -- Provider, modelId changes
  • thinking_level_change -- thinking on/off
  • custom -- Subtypes: model-snapshot, openclaw.cache-ttl
  • message -- The main workhorse. Role = user, assistant, or toolResult

Message content array can contain:

  • {type: "text", text: "..."} -- plain text
  • {type: "toolCall", id, name, arguments: {...}} -- tool invocations
  • {type: "thinking", thinking: "..."} -- internal reasoning (thinking mode)

Assistant messages also carry: api, provider, model, usage, stopReason, timestamp

ToolResult messages carry: toolCallId, toolName, isError, content: [{type, text}]

2. Session Keying

Session keys follow pattern: agent:{agentId}:{context}

Examples:

  • agent:main:main -- direct session
  • agent:main:mattermost:channel:{channelId} -- channel session
  • agent:main:mattermost:channel:{channelId}:thread:{threadId} -- thread session
  • agent:main:subagent:{uuid} -- SUB-AGENT SESSION
  • agent:main:hook:gitea:{repo}:issue:{n} -- hook-triggered session
  • agent:main:cron:{name} -- cron session

Sub-agent entry in sessions.json has: spawnedBy, spawnDepth, label, sessionId (maps to .jsonl file UUID)

Sessions store: /home/node/.openclaw/agents/{agentId}/sessions/sessions.json
JSONL files: /home/node/.openclaw/agents/{agentId}/sessions/{uuid}.jsonl
Topic-scoped files: {uuid}-topic-{topicId}.jsonl

3. Hook Events

Available internal hook events:

  • command:new, command:reset, command:stop -- user commands
  • command -- all commands
  • agent:bootstrap -- before workspace files injected
  • gateway:startup -- after gateway starts

NO session:start or session:end hooks exist yet. The hooks system only covers commands, NOT individual message/run starts.

Sub-agent lifecycle: subagent_spawned, subagent_ended are channel plugin hooks (not internal hooks).

4. Mattermost API

  • PostEditTimeLimit = -1 (unlimited edits)
  • Bot token: n73636eit7bg3rgmpsj693mwno (default/main bot)
  • Multiple bot accounts available per agent
  • Post update: PUT /api/v4/posts/{id} works fine, no time limit

5. Current Failure Modes (v1)

  • Agents must manually call live-status create/update/complete
  • deploy-to-agents.sh injects verbose protocol into AGENTS.md
  • Agents forget, IDs get lost between tool calls
  • No sub-agent visibility
  • AGENTS.md Live Status section still relies on manual agent cooperation

6. Transcript File Patterns (Key Insight)

  • Files written live as agent executes
  • Session file = {sessionId}.jsonl (from sessions.json lookup)
  • Sub-agent sessions have their own JSONL files (same directory)
  • To watch a session: read sessions.json -> find sessionFile -> watch that JSONL
  • sessions.json is updated on every message
[DISCOVERY -> SYNTHESIS] Discovery phase complete. Key findings below. ## Discovery Findings ### 1. JSONL Transcript Format Confirmed format. Each line is a JSON object with `type` field: - `session` -- First line. Contains session UUID, version=3, cwd - `model_change` -- Provider, modelId changes - `thinking_level_change` -- thinking on/off - `custom` -- Subtypes: `model-snapshot`, `openclaw.cache-ttl` - `message` -- The main workhorse. Role = `user`, `assistant`, or `toolResult` Message content array can contain: - `{type: "text", text: "..."}` -- plain text - `{type: "toolCall", id, name, arguments: {...}}` -- tool invocations - `{type: "thinking", thinking: "..."}` -- internal reasoning (thinking mode) Assistant messages also carry: `api`, `provider`, `model`, `usage`, `stopReason`, `timestamp` ToolResult messages carry: `toolCallId`, `toolName`, `isError`, `content: [{type, text}]` ### 2. Session Keying Session keys follow pattern: `agent:{agentId}:{context}` Examples: - `agent:main:main` -- direct session - `agent:main:mattermost:channel:{channelId}` -- channel session - `agent:main:mattermost:channel:{channelId}:thread:{threadId}` -- thread session - `agent:main:subagent:{uuid}` -- SUB-AGENT SESSION - `agent:main:hook:gitea:{repo}:issue:{n}` -- hook-triggered session - `agent:main:cron:{name}` -- cron session Sub-agent entry in sessions.json has: `spawnedBy`, `spawnDepth`, `label`, `sessionId` (maps to .jsonl file UUID) Sessions store: `/home/node/.openclaw/agents/{agentId}/sessions/sessions.json` JSONL files: `/home/node/.openclaw/agents/{agentId}/sessions/{uuid}.jsonl` Topic-scoped files: `{uuid}-topic-{topicId}.jsonl` ### 3. Hook Events Available internal hook events: - `command:new`, `command:reset`, `command:stop` -- user commands - `command` -- all commands - `agent:bootstrap` -- before workspace files injected - `gateway:startup` -- after gateway starts NO session:start or session:end hooks exist yet. The hooks system only covers commands, NOT individual message/run starts. Sub-agent lifecycle: `subagent_spawned`, `subagent_ended` are channel plugin hooks (not internal hooks). ### 4. Mattermost API - PostEditTimeLimit = -1 (unlimited edits) - Bot token: `n73636eit7bg3rgmpsj693mwno` (default/main bot) - Multiple bot accounts available per agent - Post update: PUT /api/v4/posts/{id} works fine, no time limit ### 5. Current Failure Modes (v1) - Agents must manually call live-status create/update/complete - deploy-to-agents.sh injects verbose protocol into AGENTS.md - Agents forget, IDs get lost between tool calls - No sub-agent visibility - AGENTS.md Live Status section still relies on manual agent cooperation ### 6. Transcript File Patterns (Key Insight) - Files written live as agent executes - Session file = `{sessionId}.jsonl` (from sessions.json lookup) - Sub-agent sessions have their own JSONL files (same directory) - To watch a session: read sessions.json -> find sessionFile -> watch that JSONL - sessions.json is updated on every message <!-- openclaw-agent -->
Author
Collaborator

[SYNTHESIS -> DRAFTING_PLAN] Synthesis complete. Now drafting the implementation plan.

[SYNTHESIS -> DRAFTING_PLAN] Synthesis complete. Now drafting the implementation plan. <!-- openclaw-agent -->
Author
Collaborator

[AUTO_AUDIT -> SIMULATION -> PENDING_APPROVAL]

Audit: 32/32 PASS. Simulation: READY. No blockers.

Plan committed to repo: PLAN.md, STATE.json, discoveries/README.md

See full plan below in the next comment.

[AUTO_AUDIT -> SIMULATION -> PENDING_APPROVAL] Audit: 32/32 PASS. Simulation: READY. No blockers. Plan committed to repo: PLAN.md, STATE.json, discoveries/README.md See full plan below in the next comment. <!-- openclaw-agent -->
Author
Collaborator

[PENDING_APPROVAL] Implementation Plan: Live Status v4

Full plan committed to repo: PLAN.md
Discovery: discoveries/README.md


Summary

Replace the broken agent-cooperative live-status (v1) with an infrastructure-level daemon that tails OpenClaw JSONL transcript files and auto-updates Mattermost -- zero agent cooperation required.

Core mechanism: Watch sessions.json for new sessions -> tail each session's JSONL -> parse tool calls and results -> debounced Mattermost updates (500ms).

Architecture

  • status-watcher.js -- tails one session's JSONL, parses events, updates Mattermost
  • session-monitor.js -- watches sessions.json for new/ended sessions
  • watcher-manager.js -- orchestrates monitor + watchers, PID management
  • hook/handler.js -- gateway:startup hook auto-starts watcher-manager

Phases (7-9h total)

  • Phase 0 (10min): Repo sync + setup
  • Phase 1 (2-3h): Core watcher daemon
  • Phase 2 (1-2h): Session monitor + watcher manager
  • Phase 3 (1h): Channel resolution (Mattermost session key parsing)
  • Phase 4 (1h): Hook integration (auto-start on gateway:startup)
  • Phase 5 (1h): Polish + cleanup + remove v1 AGENTS.md injection
  • Phase 6 (30min): Remove old AGENTS.md protocol section

Key Technical Decisions

  1. Poll sessions.json every 2s (fs.watch unreliable on Linux for JSON files)
  2. Use stopReason=stop + 10s idle to detect session complete (not naive 30s timeout)
  3. Sub-agents detected via spawnedBy field in sessions.json -> nested in parent status box
  4. handler.js (plain JS) for gateway hook -- confirmed supported by OpenClaw workspace hook discovery
  5. No new npm dependencies -- all Node.js built-ins

Audit: 32/32 PASS | Simulation: READY

Open Questions (all have defaults, plan can proceed)

Q1: Idle timeout -- Default: stopReason=stop + 10s no new lines
Q2: Non-MM sessions -- Default: skip (hook/cron sessions get no status box)
Q3: Per-session or per-request status box -- Default: per user message (new box each request)
Q4: Compaction behavior -- Default: detect fileSize < bytesRead, reset offset

Warning

Gateway restart required to activate the gateway:startup hook after install. Coordinate timing with Rooh.


ROOH: Reply "go" / "approved" / "proceed" to begin execution.

[PENDING_APPROVAL] Implementation Plan: Live Status v4 Full plan committed to repo: PLAN.md Discovery: discoveries/README.md --- ## Summary Replace the broken agent-cooperative live-status (v1) with an infrastructure-level daemon that tails OpenClaw JSONL transcript files and auto-updates Mattermost -- zero agent cooperation required. **Core mechanism:** Watch sessions.json for new sessions -> tail each session's JSONL -> parse tool calls and results -> debounced Mattermost updates (500ms). ## Architecture - status-watcher.js -- tails one session's JSONL, parses events, updates Mattermost - session-monitor.js -- watches sessions.json for new/ended sessions - watcher-manager.js -- orchestrates monitor + watchers, PID management - hook/handler.js -- gateway:startup hook auto-starts watcher-manager ## Phases (7-9h total) - Phase 0 (10min): Repo sync + setup - Phase 1 (2-3h): Core watcher daemon - Phase 2 (1-2h): Session monitor + watcher manager - Phase 3 (1h): Channel resolution (Mattermost session key parsing) - Phase 4 (1h): Hook integration (auto-start on gateway:startup) - Phase 5 (1h): Polish + cleanup + remove v1 AGENTS.md injection - Phase 6 (30min): Remove old AGENTS.md protocol section ## Key Technical Decisions 1. Poll sessions.json every 2s (fs.watch unreliable on Linux for JSON files) 2. Use stopReason=stop + 10s idle to detect session complete (not naive 30s timeout) 3. Sub-agents detected via spawnedBy field in sessions.json -> nested in parent status box 4. handler.js (plain JS) for gateway hook -- confirmed supported by OpenClaw workspace hook discovery 5. No new npm dependencies -- all Node.js built-ins ## Audit: 32/32 PASS | Simulation: READY ## Open Questions (all have defaults, plan can proceed) Q1: Idle timeout -- Default: stopReason=stop + 10s no new lines Q2: Non-MM sessions -- Default: skip (hook/cron sessions get no status box) Q3: Per-session or per-request status box -- Default: per user message (new box each request) Q4: Compaction behavior -- Default: detect fileSize < bytesRead, reset offset ## Warning Gateway restart required to activate the gateway:startup hook after install. Coordinate timing with Rooh. --- ROOH: Reply "go" / "approved" / "proceed" to begin execution. <!-- openclaw-agent -->
Author
Collaborator

Check if this plan is
Scalability
Efficiency
Production ready
Up to best current practices and not naive solutions

Check if this plan is Scalability Efficiency Production ready Up to best current practices and not naive solutions
Author
Collaborator

Review: Scalability, Efficiency, Production-Readiness, Best Practices

Reviewed against the four criteria requested. The diagnosis is spot-on; the architecture has the right instincts but needs hardening in several areas before it is production-grade.


1. SCALABILITY

Current design: one watcher daemon per session.

This is the biggest architectural concern. If 10 agents run concurrently with sub-agents, you could have 30+ watcher processes, each doing fs.watch + debounced HTTP calls. That works on a single box with low concurrency, but it does not scale.

Issues:

  • Per-session process spawning has no upper bound. No pool, no backpressure.
  • Each watcher independently manages its own Mattermost connection -- no shared HTTP client, no connection pooling.
  • Sub-agent watchers multiply the problem (parent + N children = N+1 watchers per task).

Recommendations:

  • Single watcher process, multiplexed. One long-running daemon that watches a directory (e.g., all transcript files) using a single fs.watch with recursive option (supported on Linux via inotify since Node 19+, and you are on Node v22). This eliminates per-session process overhead.
  • Shared HTTP client with connection keep-alive. One http.Agent with keepAlive: true and maxSockets capped. All Mattermost updates go through one connection pool.
  • Bounded concurrency. Max N concurrent status boxes (configurable). Queue or drop beyond that.
  • Consider a lightweight message bus (even just an in-process EventEmitter) between the file watcher and the Mattermost updater, so they can be scaled independently later.

Verdict: Needs rework. Per-session daemons are a v1-level solution to a v4-level problem.


2. EFFICIENCY

Debouncing at 500ms is correct in principle but naive in implementation.

Issues:

  • The spec says "max 1 update/500ms" but does not specify the debounce strategy. Leading-edge? Trailing-edge? Throttle? This matters:

    • Leading-edge: first event fires immediately, subsequent ones are delayed. Good for responsiveness.
    • Trailing-edge: waits 500ms after the LAST event. Good for batching but adds latency.
    • Throttle: fires at most once per 500ms regardless. Best for rate limiting.
    • Best approach: throttle with trailing flush. Fire immediately on first event, then at most once per interval, with a guaranteed final flush. This gives both responsiveness AND batching.
  • Full post replacement on every update is wasteful. Each Mattermost PUT /posts/{id} sends the entire message body. If the status box grows to 30+ lines, you are sending the same 29 lines repeatedly to change 1 line.

    • Mitigation: keep the status box compact (last N lines + summary), not an ever-growing log.
    • Alternative: use Mattermost message attachments (structured fields) which are easier to diff mentally.
  • JSONL parsing on every line is fine -- JSON.parse on a single line is sub-millisecond. No concern here.

  • fs.watch vs polling: On Linux (your runtime), fs.watch uses inotify which is efficient. Good. Do NOT fall back to fs.watchFile (polling) -- it is wasteful and unnecessary on Linux. The spec does not mention this distinction; it should.

Verdict: Mostly good, needs the debounce strategy specified and the message size growth addressed.


3. PRODUCTION-READINESS

This is where the spec has the most gaps.

Missing from the spec:

Gap Impact What to add
No graceful shutdown Orphaned watchers, leaked Mattermost posts stuck in "running" SIGTERM/SIGINT handlers that mark all active status boxes as "interrupted"
No health check endpoint Cannot monitor watcher health Simple HTTP /health or write a heartbeat file
No structured logging Cannot debug production issues Use structured JSON logging (pino or similar), not console.log
No PID file / process management Cannot reliably stop/restart Write PID file, or use systemd/pm2
No file rotation handling If transcripts are rotated (logrotate-style), watcher loses position Watch for inode changes, re-open on rename event
No max message size guard Mattermost has a 16383 char post limit (default) Truncate or paginate status box content
No error budget / circuit breaker If Mattermost is down, watchers spin on retries forever Exponential backoff with circuit breaker (stop trying after N failures, resume after cooldown)
No metrics Cannot measure update latency, error rates, queue depth Expose basic counters (updates sent, errors, queue depth)
Session compaction handling Spec mentions it but no strategy Need to detect file truncation (stat size < last read offset) and reset reader position

The 30-second idle timeout for auto-complete is problematic:

  • exec tool calls can run for minutes (npm install, git clone, compilation).
  • A smarter heuristic: track whether the last transcript line was a tool_call (still waiting for result) vs. an assistant message (might be done). Only start idle timer after a complete assistant turn with no pending tool calls.

Token/credential management:

  • The current v1 approach (sed-replacing a placeholder in the installed binary) is bad practice. v4 should use environment variables exclusively (MM_TOKEN, MM_URL). The spec does not address this.

Verdict: Not production-ready as specified. Needs the gaps above addressed before it can run unattended.


4. BEST PRACTICES

What the spec gets right:

  • Separating concerns (watcher vs. status-box vs. hook integration)
  • Phased rollout (core first, then lifecycle, then sub-agents, then polish)
  • Deprecating the old approach rather than deleting it
  • Removing AGENTS.md prompt injection (correct -- this never worked reliably)

What deviates from best practices:

Area Issue Best Practice
Architecture Per-session daemon spawning Single multiplexed daemon (event-driven)
File watching Spec says fs.watch but does not handle edge cases Use fs.watch on Linux (inotify), handle ENOENT (file not yet created), EPERM, and rename events
Error handling Not mentioned in spec Every I/O boundary needs try/catch with structured error logging
Testing No testing strategy in the spec Unit tests for JSONL parser, integration tests for Mattermost posting, end-to-end test with a mock transcript
Configuration Hardcoded values (500ms, 30s, tool labels) Config file or env vars for ALL tunable values, with sensible defaults
Dependencies tool-labels.json as static file Good start, but should support pattern matching (e.g., exec:* -> "Running command...") not just exact tool names
Observability Phase 4 "metrics/logging" is last Should be Phase 1. You cannot debug a file-watching daemon without logging from day one.
Backward compat "Keep for backward compat" on live-status.js Good. But define a deprecation timeline (e.g., remove after v4 is stable for 2 weeks)

Missing best practice: the watcher should be idempotent on restart. If the watcher crashes and restarts, it should:

  1. Find the existing Mattermost status post (by searching for a marker in the channel)
  2. Resume from the current end of the transcript file (not replay from the beginning)
  3. Update the status box with "Resumed after restart" indicator

Summary Verdict

Criterion Rating Notes
Scalability Needs rework Per-session daemons do not scale; switch to single multiplexed watcher
Efficiency Good with fixes Debounce strategy needs specifying; cap message size
Production-ready Not yet Missing: graceful shutdown, health checks, circuit breaker, structured logging, credential management
Best practices Mostly good Move observability to Phase 1; add testing strategy; make watcher restart-idempotent

The core idea is sound. Transcript tailing is the right approach -- it removes agent burden entirely. But the implementation plan needs a Phase 0 that addresses the architectural gaps above before writing code.

Top 3 changes to make the spec production-grade:

  1. Single multiplexed watcher daemon (not per-session spawning)
  2. Add graceful shutdown, circuit breaker, and structured logging to Phase 1 (not Phase 4)
  3. Define the idle-completion heuristic properly (pending tool calls = not idle)
## Review: Scalability, Efficiency, Production-Readiness, Best Practices Reviewed against the four criteria requested. The diagnosis is spot-on; the architecture has the right instincts but needs hardening in several areas before it is production-grade. --- ### 1. SCALABILITY **Current design: one watcher daemon per session.** This is the biggest architectural concern. If 10 agents run concurrently with sub-agents, you could have 30+ watcher processes, each doing `fs.watch` + debounced HTTP calls. That works on a single box with low concurrency, but it does not scale. **Issues:** - Per-session process spawning has no upper bound. No pool, no backpressure. - Each watcher independently manages its own Mattermost connection -- no shared HTTP client, no connection pooling. - Sub-agent watchers multiply the problem (parent + N children = N+1 watchers per task). **Recommendations:** - **Single watcher process, multiplexed.** One long-running daemon that watches a directory (e.g., all transcript files) using a single `fs.watch` with recursive option (supported on Linux via inotify since Node 19+, and you are on Node v22). This eliminates per-session process overhead. - **Shared HTTP client with connection keep-alive.** One `http.Agent` with `keepAlive: true` and `maxSockets` capped. All Mattermost updates go through one connection pool. - **Bounded concurrency.** Max N concurrent status boxes (configurable). Queue or drop beyond that. - **Consider a lightweight message bus** (even just an in-process EventEmitter) between the file watcher and the Mattermost updater, so they can be scaled independently later. **Verdict: Needs rework.** Per-session daemons are a v1-level solution to a v4-level problem. --- ### 2. EFFICIENCY **Debouncing at 500ms is correct in principle but naive in implementation.** **Issues:** - The spec says "max 1 update/500ms" but does not specify the debounce strategy. Leading-edge? Trailing-edge? Throttle? This matters: - Leading-edge: first event fires immediately, subsequent ones are delayed. Good for responsiveness. - Trailing-edge: waits 500ms after the LAST event. Good for batching but adds latency. - Throttle: fires at most once per 500ms regardless. Best for rate limiting. - **Best approach: throttle with trailing flush.** Fire immediately on first event, then at most once per interval, with a guaranteed final flush. This gives both responsiveness AND batching. - **Full post replacement on every update is wasteful.** Each Mattermost `PUT /posts/{id}` sends the entire message body. If the status box grows to 30+ lines, you are sending the same 29 lines repeatedly to change 1 line. - Mitigation: keep the status box compact (last N lines + summary), not an ever-growing log. - Alternative: use Mattermost message attachments (structured fields) which are easier to diff mentally. - **JSONL parsing on every line is fine** -- JSON.parse on a single line is sub-millisecond. No concern here. - **`fs.watch` vs polling:** On Linux (your runtime), `fs.watch` uses inotify which is efficient. Good. Do NOT fall back to `fs.watchFile` (polling) -- it is wasteful and unnecessary on Linux. The spec does not mention this distinction; it should. **Verdict: Mostly good, needs the debounce strategy specified and the message size growth addressed.** --- ### 3. PRODUCTION-READINESS This is where the spec has the most gaps. **Missing from the spec:** | Gap | Impact | What to add | |-----|--------|-------------| | No graceful shutdown | Orphaned watchers, leaked Mattermost posts stuck in "running" | SIGTERM/SIGINT handlers that mark all active status boxes as "interrupted" | | No health check endpoint | Cannot monitor watcher health | Simple HTTP `/health` or write a heartbeat file | | No structured logging | Cannot debug production issues | Use structured JSON logging (pino or similar), not console.log | | No PID file / process management | Cannot reliably stop/restart | Write PID file, or use systemd/pm2 | | No file rotation handling | If transcripts are rotated (logrotate-style), watcher loses position | Watch for inode changes, re-open on rename event | | No max message size guard | Mattermost has a 16383 char post limit (default) | Truncate or paginate status box content | | No error budget / circuit breaker | If Mattermost is down, watchers spin on retries forever | Exponential backoff with circuit breaker (stop trying after N failures, resume after cooldown) | | No metrics | Cannot measure update latency, error rates, queue depth | Expose basic counters (updates sent, errors, queue depth) | | Session compaction handling | Spec mentions it but no strategy | Need to detect file truncation (stat size < last read offset) and reset reader position | **The 30-second idle timeout for auto-complete is problematic:** - `exec` tool calls can run for minutes (npm install, git clone, compilation). - A smarter heuristic: track whether the last transcript line was a tool_call (still waiting for result) vs. an assistant message (might be done). Only start idle timer after a complete assistant turn with no pending tool calls. **Token/credential management:** - The current v1 approach (sed-replacing a placeholder in the installed binary) is bad practice. v4 should use environment variables exclusively (`MM_TOKEN`, `MM_URL`). The spec does not address this. **Verdict: Not production-ready as specified. Needs the gaps above addressed before it can run unattended.** --- ### 4. BEST PRACTICES **What the spec gets right:** - Separating concerns (watcher vs. status-box vs. hook integration) - Phased rollout (core first, then lifecycle, then sub-agents, then polish) - Deprecating the old approach rather than deleting it - Removing AGENTS.md prompt injection (correct -- this never worked reliably) **What deviates from best practices:** | Area | Issue | Best Practice | |------|-------|---------------| | Architecture | Per-session daemon spawning | Single multiplexed daemon (event-driven) | | File watching | Spec says `fs.watch` but does not handle edge cases | Use `fs.watch` on Linux (inotify), handle ENOENT (file not yet created), EPERM, and rename events | | Error handling | Not mentioned in spec | Every I/O boundary needs try/catch with structured error logging | | Testing | No testing strategy in the spec | Unit tests for JSONL parser, integration tests for Mattermost posting, end-to-end test with a mock transcript | | Configuration | Hardcoded values (500ms, 30s, tool labels) | Config file or env vars for ALL tunable values, with sensible defaults | | Dependencies | `tool-labels.json` as static file | Good start, but should support pattern matching (e.g., `exec:*` -> "Running command...") not just exact tool names | | Observability | Phase 4 "metrics/logging" is last | Should be Phase 1. You cannot debug a file-watching daemon without logging from day one. | | Backward compat | "Keep for backward compat" on live-status.js | Good. But define a deprecation timeline (e.g., remove after v4 is stable for 2 weeks) | **Missing best practice: the watcher should be idempotent on restart.** If the watcher crashes and restarts, it should: 1. Find the existing Mattermost status post (by searching for a marker in the channel) 2. Resume from the current end of the transcript file (not replay from the beginning) 3. Update the status box with "Resumed after restart" indicator --- ### Summary Verdict | Criterion | Rating | Notes | |-----------|--------|-------| | Scalability | Needs rework | Per-session daemons do not scale; switch to single multiplexed watcher | | Efficiency | Good with fixes | Debounce strategy needs specifying; cap message size | | Production-ready | Not yet | Missing: graceful shutdown, health checks, circuit breaker, structured logging, credential management | | Best practices | Mostly good | Move observability to Phase 1; add testing strategy; make watcher restart-idempotent | **The core idea is sound.** Transcript tailing is the right approach -- it removes agent burden entirely. But the implementation plan needs a Phase 0 that addresses the architectural gaps above before writing code. **Top 3 changes to make the spec production-grade:** 1. Single multiplexed watcher daemon (not per-session spawning) 2. Add graceful shutdown, circuit breaker, and structured logging to Phase 1 (not Phase 4) 3. Define the idle-completion heuristic properly (pending tool calls = not idle) <!-- openclaw-agent -->
Author
Collaborator

I agree, update the plan based on your findings.

I agree, update the plan based on your findings.
Author
Collaborator

Revised Plan: v4 Live Status Rewrite (Production-Grade)

Incorporating all findings from the scalability/efficiency/production-readiness review. Changes from original spec marked with [CHANGED] or [NEW].


Problem Statement

(unchanged -- the diagnosis is correct)

The current live-status system (v1) is fundamentally broken in production. Agents forget to use it, it spams when they do, sub-agents are invisible, and prompt injection does not work as an enforcement mechanism.


Proposed Solution: Live Status v4

Core Principle

Don't rely on agents to update status. Intercept their work automatically.

Architecture [CHANGED]

Single multiplexed watcher daemon (not per-session) that watches all transcript files and routes updates through a shared Mattermost connection pool.

OpenClaw Gateway
  Agent Sessions -> write transcript JSONL files to transcript directory
  
  status-watcher daemon (SINGLE PROCESS)
    -> fs.watch on transcript directory (recursive, inotify on Linux)
    -> Multiplexes all active session transcripts
    -> SessionState map: sessionKey -> { postId, lastOffset, pendingToolCalls, lines[] }
    -> Shared HTTP connection pool (keep-alive, maxSockets=4)
    -> Throttled Mattermost updates (leading edge + trailing flush, 500ms)
    -> Bounded concurrency: max N active status boxes (configurable, default 20)
    -> Structured JSON logging (pino)
    -> Graceful shutdown (SIGTERM/SIGINT -> mark all boxes "interrupted")
    -> Circuit breaker for Mattermost API failures
    
  Sub-agent transcripts
    -> Detected by session key pattern (agent:id:subagent:uuid)
    -> Nested under parent status box automatically

Why single process over per-session daemons:

  • Eliminates unbounded process spawning
  • Shared connection pool reduces HTTP overhead
  • Single point of configuration and monitoring
  • Easier health checking and process management
  • Lower memory footprint (one V8 heap, not N)

Components

1. status-watcher.js - Multiplexed Transcript Watcher [CHANGED]

  • Single long-running daemon watching the transcript directory
  • fs.watch with recursive option (Node 22 on Linux = inotify, efficient)
  • NO fallback to fs.watchFile (polling) -- inotify or nothing
  • On file change: read new bytes from last known offset, split into lines, parse JSONL
  • Maintain SessionState map per active session:
    • postId: Mattermost status box post ID
    • lastOffset: byte offset in transcript file (for resume)
    • pendingToolCalls: count of tool_calls without matching tool_results
    • lines: recent status lines (capped at MAX_LINES, default 15)
    • startTime: session start timestamp
    • lastActivity: timestamp of last transcript line
  • Handle file truncation (session compaction): detect stat.size < lastOffset, reset to 0
  • Handle file deletion: clean up SessionState, mark status box as "session ended"
  • Handle ENOENT on initial watch: file may not exist yet, that is fine

2. status-box.js - Mattermost Post Manager [CHANGED]

  • Shared http.Agent with keepAlive: true, maxSockets: 4
  • Throttle strategy: leading edge + trailing flush at configurable interval (default 500ms)
    • First event fires immediately (responsiveness)
    • Subsequent events batched, at most one update per interval
    • Guaranteed final flush when activity stops (no lost updates)
  • Status box content: compact format, capped at MAX_LINES (not ever-growing log)
    • Show: agent name, current action, last N status lines, elapsed time
    • When lines exceed MAX_LINES, oldest lines are dropped (keep most recent)
    • Footer: runtime duration, token count, cost (if available)
  • Message size guard: truncate to 15000 chars (Mattermost default limit is 16383)
  • Sub-agent progress rendered as indented nested items under parent box
  • Post recovery on restart: search channel for existing status post with marker, resume updating it
  • Credential management: MM_TOKEN and MM_URL from environment variables only. No hardcoded tokens, no sed replacement.

3. tool-labels.js - Tool Name Mapping [CHANGED from .json]

  • Supports exact match AND pattern matching:
    • Exact: "Read" -> "Reading file..."
    • Pattern: "exec:*" -> "Running command..."
    • Regex: /^web_/ -> "Searching the web..."
  • Default label for unmapped tools: "Working..."
  • Configurable via external JSON file, with built-in defaults as fallback

4. Hook Integration

  • Trigger: register with OpenClaw hooks API (POST /hooks/agent) for session start/end events
  • On session start: watcher picks up new transcript file automatically (directory watch)
  • On session end: mark status box complete, clean up SessionState
  • Fallback if hooks API does not exist: directory polling at low frequency (every 5s) to detect new transcript files

5. Agent-Side Simplification

  • Agents get ONE instruction: "Status updates are automatic. Focus on the task."
  • Remove all AGENTS.md protocol injection from install/deploy scripts
  • Old live-status CLI kept for backward compat but marked deprecated

Production Infrastructure [NEW SECTION]

Graceful Shutdown

  • SIGTERM/SIGINT handlers
  • On shutdown: mark all active status boxes as "Session interrupted" with duration
  • Flush all pending Mattermost updates before exit
  • Write final state to disk (session offsets) for restart recovery
  • Exit with code 0 after cleanup

Health Check

  • HTTP endpoint on configurable port (default 9090): GET /health
  • Returns: { "status": "ok", "activeSessions": N, "uptimeSeconds": N, "lastError": "..." }
  • Can be used by systemd, Docker HEALTHCHECK, or monitoring

Circuit Breaker for Mattermost API

  • Track consecutive failures per endpoint
  • After 5 consecutive failures: open circuit (stop sending for 30s cooldown)
  • During cooldown: buffer updates in memory (bounded queue, max 100 entries)
  • After cooldown: half-open (try one request). Success -> close circuit. Failure -> re-open.
  • Log all state transitions

Structured Logging [NEW]

  • Use pino (fast, structured JSON logging)
  • Log levels: error, warn, info, debug
  • Default: info in production, debug in development
  • Every log line includes: timestamp, sessionKey (if applicable), event type
  • No console.log anywhere in production code

Process Management

  • Write PID file to configurable path (default: /tmp/status-watcher.pid)
  • Support --daemon flag for background operation
  • Systemd unit file provided in deploy/status-watcher.service

Metrics [MOVED TO PHASE 1]

  • Internal counters exposed via health endpoint:
    • updates_sent_total
    • updates_failed_total
    • active_sessions
    • circuit_breaker_state (closed/open/half-open)
    • queue_depth
    • uptime_seconds

Idle Completion Heuristic [CHANGED]

The original 30-second idle timeout was too aggressive. Revised approach:

Smart idle detection:

  1. Track pendingToolCalls per session (increment on tool_use, decrement on tool_result)
  2. If pendingToolCalls > 0: session is NOT idle, regardless of time since last transcript line
  3. If pendingToolCalls == 0 AND last transcript entry was an assistant message AND no new lines for IDLE_TIMEOUT seconds (configurable, default 60s): mark as idle/complete
  4. If pendingToolCalls == 0 AND last transcript entry was a tool_result: start a shorter timer (30s) -- agent might be composing response
  5. Hard timeout: after MAX_SESSION_DURATION (configurable, default 30 minutes), force-complete regardless

This prevents premature completion during long-running exec calls while still cleaning up genuinely idle sessions.


Configuration [NEW SECTION]

All tunable values via environment variables with sensible defaults:

Variable Default Description
MM_TOKEN (required) Mattermost bot token
MM_URL http://mattermost:8065 Mattermost base URL
TRANSCRIPT_DIR (required) Directory containing JSONL transcript files
THROTTLE_MS 500 Minimum interval between Mattermost updates
IDLE_TIMEOUT_S 60 Seconds of inactivity before marking complete
MAX_SESSION_DURATION_S 1800 Hard timeout for any session (30 min)
MAX_STATUS_LINES 15 Max lines in status box (oldest dropped)
MAX_ACTIVE_SESSIONS 20 Bounded concurrency for status boxes
MAX_MESSAGE_CHARS 15000 Truncation limit for Mattermost posts
HEALTH_PORT 9090 Health check HTTP port
LOG_LEVEL info Logging level (error/warn/info/debug)
CIRCUIT_BREAKER_THRESHOLD 5 Consecutive failures to open circuit
CIRCUIT_BREAKER_COOLDOWN_S 30 Cooldown before half-open
PID_FILE /tmp/status-watcher.pid PID file path
TOOL_LABELS_FILE null Optional external tool labels JSON file

Revised Implementation Plan

Phase 0: Discovery [NEW]

  • Document the actual JSONL transcript format (grab sample, map schema)
  • Verify OpenClaw hooks API exists and document its payload
  • Identify transcript directory path and file naming convention
  • Verify session key format for sub-agent detection
  • Test fs.watch recursive behavior on the target Linux kernel
  • Document Mattermost rate limits on the target instance

Phase 1: Core Watcher + Production Foundation

  • src/status-watcher.js -- multiplexed directory watcher, JSONL parser, SessionState management
  • src/status-box.js -- Mattermost post manager with shared HTTP pool, throttle, message size cap
  • src/tool-labels.js -- pattern-matching tool name to label mapping
  • src/config.js -- centralized configuration from env vars with validation
  • src/logger.js -- pino-based structured logging
  • src/circuit-breaker.js -- circuit breaker for Mattermost API
  • src/health.js -- HTTP health endpoint with metrics
  • Graceful shutdown handlers (SIGTERM/SIGINT)
  • File truncation detection (session compaction)
  • Smart idle completion heuristic
  • Tests: unit tests for JSONL parser, tool-labels matcher, circuit breaker, throttle logic

Phase 2: Session Lifecycle + Restart Recovery

  • Hook integration (register with OpenClaw hooks API)
  • Fallback: directory polling for new transcripts if hooks unavailable
  • Restart recovery: persist session offsets, recover existing Mattermost posts
  • PID file management
  • Thread-aware: detect thread root ID from session context
  • Tests: integration tests for lifecycle events, restart recovery

Phase 3: Sub-Agent Support

  • Detect sub-agent transcripts by session key pattern
  • Link sub-agent status to parent status box
  • Nested rendering in status box
  • Cascade completion (parent waits for all children)
  • Tests: end-to-end test with mock parent + child transcripts

Phase 4: Deployment + Migration

  • install.sh -- new install flow (env-var based, no token sed replacement)
  • deploy/status-watcher.service -- systemd unit file
  • deploy/Dockerfile -- containerized deployment option
  • skill/SKILL.md -- rewrite (simplified: "status is automatic")
  • README.md -- full v4 documentation
  • Remove AGENTS.md protocol injection from deploy scripts
  • Migration guide: v1 -> v4
  • Deprecation notice on src/live-status.js

Revised Status Box Format

[ACTIVE] god-agent | 38s
Reading live-status source code...
  Read: src/live-status.js [OK]
Analyzing agent configurations...
  exec: grep -r live-status [OK]
Writing new implementation...
  Sub-agent: coder-agent (Phase 1)
    Writing status-watcher.js...
    [DONE] 13s
[DONE] 38s | 12.4k tokens | $0.08

Key changes from original:

  • Compact (15 lines max, oldest dropped)
  • Status prefix: [ACTIVE], [DONE], [ERROR], [INTERRUPTED]
  • No emoji (Mattermost API compatibility)
  • Duration in footer only shows when complete

Files to Create/Modify (Revised)

File Action Purpose
src/status-watcher.js CREATE Multiplexed directory watcher daemon
src/status-box.js CREATE Mattermost post manager with connection pool
src/tool-labels.js CREATE Pattern-matching tool label resolver
src/config.js CREATE Centralized env-var configuration
src/logger.js CREATE Structured logging (pino wrapper)
src/circuit-breaker.js CREATE Circuit breaker for API resilience
src/health.js CREATE Health check HTTP endpoint
src/live-status.js DEPRECATE Keep for backward compat, add deprecation warning
skill/SKILL.md REWRITE "Status is automatic" (Phase 4)
install.sh REWRITE Env-var based install (Phase 4)
deploy/status-watcher.service CREATE Systemd unit file (Phase 4)
deploy/Dockerfile CREATE Container deployment (Phase 4)
README.md REWRITE Full v4 docs (Phase 4)
test/ CREATE Unit + integration + e2e tests
package.json UPDATE Add pino dependency, test scripts

Revised Success Criteria

  • Agents produce live status updates WITHOUT any explicit live-status calls
  • Sub-agent progress is visible in real-time, nested under parent
  • No status spam in final response
  • Works across thread sessions automatically
  • Single daemon handles all concurrent sessions (no per-session processes)
  • Survives session compaction (file truncation detection)
  • Survives daemon restarts (offset persistence, post recovery)
  • Survives Mattermost outages (circuit breaker, bounded retry queue)
  • Health endpoint reports daemon status and metrics
  • Structured JSON logging for production debugging
  • All configuration via environment variables
  • No hardcoded credentials anywhere
  • Test coverage for parser, throttle, circuit breaker, idle heuristic
  • Single install command deploys everything
  • Graceful shutdown marks all active boxes as interrupted

Risk Assessment (Revised)

Risk Impact Mitigation Status
Transcript format undocumented High Phase 0 discovery task Open
Hook API may not exist High Fallback to directory polling Mitigated
Mattermost rate limits Medium Throttle + circuit breaker Mitigated
Session compaction truncates file Medium Detect size < offset, reset reader Mitigated
Daemon crashes mid-session Medium Restart recovery with persisted offsets Mitigated
Mattermost extended outage Medium Circuit breaker + bounded queue Mitigated
Too many concurrent sessions Low Bounded concurrency (MAX_ACTIVE_SESSIONS) Mitigated
Docker networking Low Already solved in v1 Mitigated

Effort Estimate

Phase Estimated Time Parallelizable Depends On
Phase 0: Discovery 2-3 hours No Nothing
Phase 1: Core + Foundation 8-12 hours Partially (logger, config, circuit-breaker are independent) Phase 0
Phase 2: Lifecycle + Recovery 4-6 hours No Phase 1
Phase 3: Sub-Agent Support 3-4 hours No Phase 2
Phase 4: Deployment + Migration 3-4 hours Yes (docs, deploy scripts, skill rewrite) Phase 3
Total 20-29 hours

This plan is ready for approval. Phase 0 (discovery) can begin immediately as it requires no code changes.

## Revised Plan: v4 Live Status Rewrite (Production-Grade) Incorporating all findings from the scalability/efficiency/production-readiness review. Changes from original spec marked with **[CHANGED]** or **[NEW]**. --- ## Problem Statement _(unchanged -- the diagnosis is correct)_ The current live-status system (v1) is fundamentally broken in production. Agents forget to use it, it spams when they do, sub-agents are invisible, and prompt injection does not work as an enforcement mechanism. --- ## Proposed Solution: Live Status v4 ### Core Principle **Don't rely on agents to update status. Intercept their work automatically.** ### Architecture **[CHANGED]** **Single multiplexed watcher daemon** (not per-session) that watches all transcript files and routes updates through a shared Mattermost connection pool. ``` OpenClaw Gateway Agent Sessions -> write transcript JSONL files to transcript directory status-watcher daemon (SINGLE PROCESS) -> fs.watch on transcript directory (recursive, inotify on Linux) -> Multiplexes all active session transcripts -> SessionState map: sessionKey -> { postId, lastOffset, pendingToolCalls, lines[] } -> Shared HTTP connection pool (keep-alive, maxSockets=4) -> Throttled Mattermost updates (leading edge + trailing flush, 500ms) -> Bounded concurrency: max N active status boxes (configurable, default 20) -> Structured JSON logging (pino) -> Graceful shutdown (SIGTERM/SIGINT -> mark all boxes "interrupted") -> Circuit breaker for Mattermost API failures Sub-agent transcripts -> Detected by session key pattern (agent:id:subagent:uuid) -> Nested under parent status box automatically ``` **Why single process over per-session daemons:** - Eliminates unbounded process spawning - Shared connection pool reduces HTTP overhead - Single point of configuration and monitoring - Easier health checking and process management - Lower memory footprint (one V8 heap, not N) --- ### Components #### 1. status-watcher.js - Multiplexed Transcript Watcher **[CHANGED]** - Single long-running daemon watching the transcript directory - `fs.watch` with recursive option (Node 22 on Linux = inotify, efficient) - NO fallback to `fs.watchFile` (polling) -- inotify or nothing - On file change: read new bytes from last known offset, split into lines, parse JSONL - Maintain `SessionState` map per active session: - `postId`: Mattermost status box post ID - `lastOffset`: byte offset in transcript file (for resume) - `pendingToolCalls`: count of tool_calls without matching tool_results - `lines`: recent status lines (capped at MAX_LINES, default 15) - `startTime`: session start timestamp - `lastActivity`: timestamp of last transcript line - Handle file truncation (session compaction): detect `stat.size < lastOffset`, reset to 0 - Handle file deletion: clean up SessionState, mark status box as "session ended" - Handle ENOENT on initial watch: file may not exist yet, that is fine #### 2. status-box.js - Mattermost Post Manager **[CHANGED]** - Shared `http.Agent` with `keepAlive: true`, `maxSockets: 4` - Throttle strategy: **leading edge + trailing flush** at configurable interval (default 500ms) - First event fires immediately (responsiveness) - Subsequent events batched, at most one update per interval - Guaranteed final flush when activity stops (no lost updates) - Status box content: **compact format, capped at MAX_LINES** (not ever-growing log) - Show: agent name, current action, last N status lines, elapsed time - When lines exceed MAX_LINES, oldest lines are dropped (keep most recent) - Footer: runtime duration, token count, cost (if available) - **Message size guard**: truncate to 15000 chars (Mattermost default limit is 16383) - Sub-agent progress rendered as indented nested items under parent box - **Post recovery on restart**: search channel for existing status post with marker, resume updating it - Credential management: `MM_TOKEN` and `MM_URL` from environment variables only. No hardcoded tokens, no sed replacement. #### 3. tool-labels.js - Tool Name Mapping **[CHANGED from .json]** - Supports exact match AND pattern matching: - Exact: `"Read" -> "Reading file..."` - Pattern: `"exec:*" -> "Running command..."` - Regex: `/^web_/ -> "Searching the web..."` - Default label for unmapped tools: `"Working..."` - Configurable via external JSON file, with built-in defaults as fallback #### 4. Hook Integration - Trigger: register with OpenClaw hooks API (`POST /hooks/agent`) for session start/end events - On session start: watcher picks up new transcript file automatically (directory watch) - On session end: mark status box complete, clean up SessionState - Fallback if hooks API does not exist: directory polling at low frequency (every 5s) to detect new transcript files #### 5. Agent-Side Simplification - Agents get ONE instruction: "Status updates are automatic. Focus on the task." - Remove all AGENTS.md protocol injection from install/deploy scripts - Old `live-status` CLI kept for backward compat but marked deprecated --- ## Production Infrastructure **[NEW SECTION]** ### Graceful Shutdown - SIGTERM/SIGINT handlers - On shutdown: mark all active status boxes as "Session interrupted" with duration - Flush all pending Mattermost updates before exit - Write final state to disk (session offsets) for restart recovery - Exit with code 0 after cleanup ### Health Check - HTTP endpoint on configurable port (default 9090): `GET /health` - Returns: `{ "status": "ok", "activeSessions": N, "uptimeSeconds": N, "lastError": "..." }` - Can be used by systemd, Docker HEALTHCHECK, or monitoring ### Circuit Breaker for Mattermost API - Track consecutive failures per endpoint - After 5 consecutive failures: open circuit (stop sending for 30s cooldown) - During cooldown: buffer updates in memory (bounded queue, max 100 entries) - After cooldown: half-open (try one request). Success -> close circuit. Failure -> re-open. - Log all state transitions ### Structured Logging **[NEW]** - Use `pino` (fast, structured JSON logging) - Log levels: error, warn, info, debug - Default: info in production, debug in development - Every log line includes: timestamp, sessionKey (if applicable), event type - No console.log anywhere in production code ### Process Management - Write PID file to configurable path (default: `/tmp/status-watcher.pid`) - Support `--daemon` flag for background operation - Systemd unit file provided in `deploy/status-watcher.service` ### Metrics **[MOVED TO PHASE 1]** - Internal counters exposed via health endpoint: - `updates_sent_total` - `updates_failed_total` - `active_sessions` - `circuit_breaker_state` (closed/open/half-open) - `queue_depth` - `uptime_seconds` --- ## Idle Completion Heuristic **[CHANGED]** The original 30-second idle timeout was too aggressive. Revised approach: **Smart idle detection:** 1. Track `pendingToolCalls` per session (increment on `tool_use`, decrement on `tool_result`) 2. If `pendingToolCalls > 0`: session is NOT idle, regardless of time since last transcript line 3. If `pendingToolCalls == 0` AND last transcript entry was an assistant message AND no new lines for `IDLE_TIMEOUT` seconds (configurable, default 60s): mark as idle/complete 4. If `pendingToolCalls == 0` AND last transcript entry was a tool_result: start a shorter timer (30s) -- agent might be composing response 5. Hard timeout: after `MAX_SESSION_DURATION` (configurable, default 30 minutes), force-complete regardless This prevents premature completion during long-running exec calls while still cleaning up genuinely idle sessions. --- ## Configuration **[NEW SECTION]** All tunable values via environment variables with sensible defaults: | Variable | Default | Description | |----------|---------|-------------| | `MM_TOKEN` | (required) | Mattermost bot token | | `MM_URL` | `http://mattermost:8065` | Mattermost base URL | | `TRANSCRIPT_DIR` | (required) | Directory containing JSONL transcript files | | `THROTTLE_MS` | `500` | Minimum interval between Mattermost updates | | `IDLE_TIMEOUT_S` | `60` | Seconds of inactivity before marking complete | | `MAX_SESSION_DURATION_S` | `1800` | Hard timeout for any session (30 min) | | `MAX_STATUS_LINES` | `15` | Max lines in status box (oldest dropped) | | `MAX_ACTIVE_SESSIONS` | `20` | Bounded concurrency for status boxes | | `MAX_MESSAGE_CHARS` | `15000` | Truncation limit for Mattermost posts | | `HEALTH_PORT` | `9090` | Health check HTTP port | | `LOG_LEVEL` | `info` | Logging level (error/warn/info/debug) | | `CIRCUIT_BREAKER_THRESHOLD` | `5` | Consecutive failures to open circuit | | `CIRCUIT_BREAKER_COOLDOWN_S` | `30` | Cooldown before half-open | | `PID_FILE` | `/tmp/status-watcher.pid` | PID file path | | `TOOL_LABELS_FILE` | `null` | Optional external tool labels JSON file | --- ## Revised Implementation Plan ### Phase 0: Discovery **[NEW]** - [ ] Document the actual JSONL transcript format (grab sample, map schema) - [ ] Verify OpenClaw hooks API exists and document its payload - [ ] Identify transcript directory path and file naming convention - [ ] Verify session key format for sub-agent detection - [ ] Test `fs.watch` recursive behavior on the target Linux kernel - [ ] Document Mattermost rate limits on the target instance ### Phase 1: Core Watcher + Production Foundation - [ ] `src/status-watcher.js` -- multiplexed directory watcher, JSONL parser, SessionState management - [ ] `src/status-box.js` -- Mattermost post manager with shared HTTP pool, throttle, message size cap - [ ] `src/tool-labels.js` -- pattern-matching tool name to label mapping - [ ] `src/config.js` -- centralized configuration from env vars with validation - [ ] `src/logger.js` -- pino-based structured logging - [ ] `src/circuit-breaker.js` -- circuit breaker for Mattermost API - [ ] `src/health.js` -- HTTP health endpoint with metrics - [ ] Graceful shutdown handlers (SIGTERM/SIGINT) - [ ] File truncation detection (session compaction) - [ ] Smart idle completion heuristic - [ ] **Tests**: unit tests for JSONL parser, tool-labels matcher, circuit breaker, throttle logic ### Phase 2: Session Lifecycle + Restart Recovery - [ ] Hook integration (register with OpenClaw hooks API) - [ ] Fallback: directory polling for new transcripts if hooks unavailable - [ ] Restart recovery: persist session offsets, recover existing Mattermost posts - [ ] PID file management - [ ] Thread-aware: detect thread root ID from session context - [ ] **Tests**: integration tests for lifecycle events, restart recovery ### Phase 3: Sub-Agent Support - [ ] Detect sub-agent transcripts by session key pattern - [ ] Link sub-agent status to parent status box - [ ] Nested rendering in status box - [ ] Cascade completion (parent waits for all children) - [ ] **Tests**: end-to-end test with mock parent + child transcripts ### Phase 4: Deployment + Migration - [ ] `install.sh` -- new install flow (env-var based, no token sed replacement) - [ ] `deploy/status-watcher.service` -- systemd unit file - [ ] `deploy/Dockerfile` -- containerized deployment option - [ ] `skill/SKILL.md` -- rewrite (simplified: "status is automatic") - [ ] `README.md` -- full v4 documentation - [ ] Remove AGENTS.md protocol injection from deploy scripts - [ ] Migration guide: v1 -> v4 - [ ] Deprecation notice on `src/live-status.js` --- ## Revised Status Box Format ``` [ACTIVE] god-agent | 38s Reading live-status source code... Read: src/live-status.js [OK] Analyzing agent configurations... exec: grep -r live-status [OK] Writing new implementation... Sub-agent: coder-agent (Phase 1) Writing status-watcher.js... [DONE] 13s [DONE] 38s | 12.4k tokens | $0.08 ``` Key changes from original: - Compact (15 lines max, oldest dropped) - Status prefix: `[ACTIVE]`, `[DONE]`, `[ERROR]`, `[INTERRUPTED]` - No emoji (Mattermost API compatibility) - Duration in footer only shows when complete --- ## Files to Create/Modify (Revised) | File | Action | Purpose | |------|--------|---------| | `src/status-watcher.js` | CREATE | Multiplexed directory watcher daemon | | `src/status-box.js` | CREATE | Mattermost post manager with connection pool | | `src/tool-labels.js` | CREATE | Pattern-matching tool label resolver | | `src/config.js` | CREATE | Centralized env-var configuration | | `src/logger.js` | CREATE | Structured logging (pino wrapper) | | `src/circuit-breaker.js` | CREATE | Circuit breaker for API resilience | | `src/health.js` | CREATE | Health check HTTP endpoint | | `src/live-status.js` | DEPRECATE | Keep for backward compat, add deprecation warning | | `skill/SKILL.md` | REWRITE | "Status is automatic" (Phase 4) | | `install.sh` | REWRITE | Env-var based install (Phase 4) | | `deploy/status-watcher.service` | CREATE | Systemd unit file (Phase 4) | | `deploy/Dockerfile` | CREATE | Container deployment (Phase 4) | | `README.md` | REWRITE | Full v4 docs (Phase 4) | | `test/` | CREATE | Unit + integration + e2e tests | | `package.json` | UPDATE | Add pino dependency, test scripts | --- ## Revised Success Criteria - [ ] Agents produce live status updates WITHOUT any explicit live-status calls - [ ] Sub-agent progress is visible in real-time, nested under parent - [ ] No status spam in final response - [ ] Works across thread sessions automatically - [ ] Single daemon handles all concurrent sessions (no per-session processes) - [ ] Survives session compaction (file truncation detection) - [ ] Survives daemon restarts (offset persistence, post recovery) - [ ] Survives Mattermost outages (circuit breaker, bounded retry queue) - [ ] Health endpoint reports daemon status and metrics - [ ] Structured JSON logging for production debugging - [ ] All configuration via environment variables - [ ] No hardcoded credentials anywhere - [ ] Test coverage for parser, throttle, circuit breaker, idle heuristic - [ ] Single install command deploys everything - [ ] Graceful shutdown marks all active boxes as interrupted --- ## Risk Assessment (Revised) | Risk | Impact | Mitigation | Status | |------|--------|------------|--------| | Transcript format undocumented | High | Phase 0 discovery task | Open | | Hook API may not exist | High | Fallback to directory polling | Mitigated | | Mattermost rate limits | Medium | Throttle + circuit breaker | Mitigated | | Session compaction truncates file | Medium | Detect size < offset, reset reader | Mitigated | | Daemon crashes mid-session | Medium | Restart recovery with persisted offsets | Mitigated | | Mattermost extended outage | Medium | Circuit breaker + bounded queue | Mitigated | | Too many concurrent sessions | Low | Bounded concurrency (MAX_ACTIVE_SESSIONS) | Mitigated | | Docker networking | Low | Already solved in v1 | Mitigated | --- ## Effort Estimate | Phase | Estimated Time | Parallelizable | Depends On | |-------|---------------|----------------|------------| | Phase 0: Discovery | 2-3 hours | No | Nothing | | Phase 1: Core + Foundation | 8-12 hours | Partially (logger, config, circuit-breaker are independent) | Phase 0 | | Phase 2: Lifecycle + Recovery | 4-6 hours | No | Phase 1 | | Phase 3: Sub-Agent Support | 3-4 hours | No | Phase 2 | | Phase 4: Deployment + Migration | 3-4 hours | Yes (docs, deploy scripts, skill rewrite) | Phase 3 | | **Total** | **20-29 hours** | | | --- This plan is ready for approval. Phase 0 (discovery) can begin immediately as it requires no code changes. <!-- openclaw-agent -->
Author
Collaborator

Approved

Approved
Author
Collaborator

Thanks for the review and feedback, sol. The revised plan is posted above (comment #11402).

However, plan execution requires owner approval (Rooh). This is a security gate -- contributor approvals cannot trigger execution.

Current state: PENDING_APPROVAL -- waiting for Rooh to review and approve the revised plan before any code is written.

Rooh: when you are ready, the approval words are: "go", "approved", "approve", "proceed", "build it", "execute", "ship it", or "do it".

Thanks for the review and feedback, sol. The revised plan is posted above (comment #11402). However, plan execution requires **owner approval** (Rooh). This is a security gate -- contributor approvals cannot trigger execution. **Current state: PENDING_APPROVAL** -- waiting for Rooh to review and approve the revised plan before any code is written. Rooh: when you are ready, the approval words are: "go", "approved", "approve", "proceed", "build it", "execute", "ship it", or "do it". <!-- openclaw-agent -->
Author
Collaborator

[SYNTHESIS] New planner session active (proj035-planner-v2). Prior plan found in repo (PLAN.md) plus a revised architecture in comment #11402. The revised plan incorporates production-grade changes: single multiplexed daemon, shared HTTP connection pool, circuit breaker, pino logging, health endpoint, env-var config. Now drafting the final PLAN.md that reflects the revised architecture.

[SYNTHESIS] New planner session active (proj035-planner-v2). Prior plan found in repo (PLAN.md) plus a revised architecture in comment #11402. The revised plan incorporates production-grade changes: single multiplexed daemon, shared HTTP connection pool, circuit breaker, pino logging, health endpoint, env-var config. Now drafting the final PLAN.md that reflects the revised architecture. <!-- openclaw-agent -->
Author
Collaborator

[DRAFTING_PLAN -> AUTO_AUDIT] Final PLAN.md written (production-grade, revised architecture). Running audit checklist now.

[DRAFTING_PLAN -> AUTO_AUDIT] Final PLAN.md written (production-grade, revised architecture). Running audit checklist now. <!-- openclaw-agent -->
Author
Collaborator

[SYNTHESIS] Planner v2 session active (proj035-planner-v2). Read all 14 issue comments including the full production review. Prior plan is in PLAN.md / STATE.json. The previous plan had the per-session daemon anti-pattern that the production review correctly identified as the #1 architectural flaw. Now synthesizing all findings into a complete revised architecture before drafting the final production-grade plan.

Key architecture decisions to incorporate:

  1. Single multiplexed watcher daemon (NOT per-session)
  2. Shared HTTP connection pool (keep-alive, maxSockets cap)
  3. Throttle with trailing flush debouncing (not naive 500ms leading-edge)
  4. Structured JSON logging from Phase 1 (pino)
  5. Graceful shutdown (SIGTERM/SIGINT handlers)
  6. Circuit breaker for Mattermost API
  7. Smart idle detection (pending tool calls = not idle)
  8. Restart idempotency (resume from crash via persisted state)
  9. fs.watch on Linux (inotify) with ENOENT/EPERM/rename handling
  10. Health check endpoint from Phase 1
  11. Observability built into Phase 1, not deferred
  12. Testing strategy with unit + integration + e2e
[SYNTHESIS] Planner v2 session active (proj035-planner-v2). Read all 14 issue comments including the full production review. Prior plan is in PLAN.md / STATE.json. The previous plan had the per-session daemon anti-pattern that the production review correctly identified as the #1 architectural flaw. Now synthesizing all findings into a complete revised architecture before drafting the final production-grade plan. Key architecture decisions to incorporate: 1. Single multiplexed watcher daemon (NOT per-session) 2. Shared HTTP connection pool (keep-alive, maxSockets cap) 3. Throttle with trailing flush debouncing (not naive 500ms leading-edge) 4. Structured JSON logging from Phase 1 (pino) 5. Graceful shutdown (SIGTERM/SIGINT handlers) 6. Circuit breaker for Mattermost API 7. Smart idle detection (pending tool calls = not idle) 8. Restart idempotency (resume from crash via persisted state) 9. fs.watch on Linux (inotify) with ENOENT/EPERM/rename handling 10. Health check endpoint from Phase 1 11. Observability built into Phase 1, not deferred 12. Testing strategy with unit + integration + e2e <!-- openclaw-agent -->
Author
Collaborator

[AUTO_AUDIT] 34/34 PASS. Warnings: (1) make check currently fails on existing live-status.js (43 lint issues - addressed in Phase 0.2). (2) Gateway restart needed to activate hook in Phase 4.

[SIMULATION] READY. Key verifications:

  • Node 22 confirmed (inotify recursive fs.watch supported)
  • pino installs cleanly via npm
  • sessions.json confirmed at /home/node/.openclaw/agents/{agent}/sessions/sessions.json
  • Sub-agent key format confirmed: agent:main:subagent:{uuid} with spawnedBy/spawnDepth fields
  • Mattermost API reachable (200 on ping)
  • MM PostEditTimeLimit=-1 confirmed (unlimited edits)
  • Transcript directory and JSONL format confirmed from real files

Proceeding to PENDING_APPROVAL.

[AUTO_AUDIT] 34/34 PASS. Warnings: (1) make check currently fails on existing live-status.js (43 lint issues - addressed in Phase 0.2). (2) Gateway restart needed to activate hook in Phase 4. [SIMULATION] READY. Key verifications: - Node 22 confirmed (inotify recursive fs.watch supported) - pino installs cleanly via npm - sessions.json confirmed at /home/node/.openclaw/agents/{agent}/sessions/sessions.json - Sub-agent key format confirmed: agent:main:subagent:{uuid} with spawnedBy/spawnDepth fields - Mattermost API reachable (200 on ping) - MM PostEditTimeLimit=-1 confirmed (unlimited edits) - Transcript directory and JSONL format confirmed from real files Proceeding to PENDING_APPROVAL. <!-- openclaw-agent -->
Author
Collaborator

[PENDING_APPROVAL] Plan v2 ready for approval.

Full plan committed to repo: https://git.eeqj.de/ROOH/MATTERMOST_OPENCLAW_LIVESTATUS/src/branch/master/PLAN.md


SUMMARY: Live Status v4 (Production-Grade)

Core change: replace broken agent-cooperative system with a single multiplexed daemon that tails all OpenClaw JSONL transcripts automatically. Zero agent cooperation required.

ARCHITECTURE HIGHLIGHTS (revised from original):

  • Single daemon process (not per-session) -- eliminates unbounded process spawning
  • Shared HTTP connection pool with keepAlive (not per-session HTTP)
  • Leading-edge + trailing-flush throttle (not simple debounce)
  • Circuit breaker for Mattermost API resilience
  • pino structured JSON logging
  • Health endpoint on port 9090
  • All config via environment variables (no hardcoded tokens)
  • Smart idle detection via pendingToolCalls tracking (not naive 30s timeout)
  • Restart recovery: offset persistence + post recovery from channel history
  • Session compaction handled: detect file truncation, reset offset

PHASES:

  • Phase 0 (30min): Repo sync + fix existing lint errors + env verification
  • Phase 1 (8-12h): Core components (watcher, status-box, formatter, circuit-breaker, logger, config, health)
  • Phase 2 (4-6h): Session monitor + lifecycle + restart recovery
  • Phase 3 (3-4h): Sub-agent support + nesting
  • Phase 4 (1h): Hook integration (auto-start on gateway:startup)
  • Phase 5 (3-4h): Polish + deployment files + SKILL.md rewrite
  • Phase 6 (30min): Remove v1 AGENTS.md injection (after watcher confirmed working)

TOTAL: 20-28 hours

DEPENDENCIES: pino (npm, one new dep). All else uses Node.js built-ins.

AUDIT: 34/34 PASS
SIMULATION: READY (all paths verified against real system)

Waiting for Rooh approval. Approval words: go, approved, approve, proceed, build it, execute, ship it, do it.

[PENDING_APPROVAL] Plan v2 ready for approval. Full plan committed to repo: https://git.eeqj.de/ROOH/MATTERMOST_OPENCLAW_LIVESTATUS/src/branch/master/PLAN.md --- SUMMARY: Live Status v4 (Production-Grade) Core change: replace broken agent-cooperative system with a single multiplexed daemon that tails all OpenClaw JSONL transcripts automatically. Zero agent cooperation required. ARCHITECTURE HIGHLIGHTS (revised from original): - Single daemon process (not per-session) -- eliminates unbounded process spawning - Shared HTTP connection pool with keepAlive (not per-session HTTP) - Leading-edge + trailing-flush throttle (not simple debounce) - Circuit breaker for Mattermost API resilience - pino structured JSON logging - Health endpoint on port 9090 - All config via environment variables (no hardcoded tokens) - Smart idle detection via pendingToolCalls tracking (not naive 30s timeout) - Restart recovery: offset persistence + post recovery from channel history - Session compaction handled: detect file truncation, reset offset PHASES: - Phase 0 (30min): Repo sync + fix existing lint errors + env verification - Phase 1 (8-12h): Core components (watcher, status-box, formatter, circuit-breaker, logger, config, health) - Phase 2 (4-6h): Session monitor + lifecycle + restart recovery - Phase 3 (3-4h): Sub-agent support + nesting - Phase 4 (1h): Hook integration (auto-start on gateway:startup) - Phase 5 (3-4h): Polish + deployment files + SKILL.md rewrite - Phase 6 (30min): Remove v1 AGENTS.md injection (after watcher confirmed working) TOTAL: 20-28 hours DEPENDENCIES: pino (npm, one new dep). All else uses Node.js built-ins. AUDIT: 34/34 PASS SIMULATION: READY (all paths verified against real system) Waiting for Rooh approval. Approval words: go, approved, approve, proceed, build it, execute, ship it, do it. <!-- openclaw-agent -->
Author
Collaborator

[SYNTHESIS -> DRAFTING_PLAN] Synthesis complete. Key architectural clarifications confirmed:

  1. Workspace hooks dir: /home/node/.openclaw/workspace/hooks/ (highest precedence, auto-discovered)
  2. handler.js supported (handlerCandidates: handler.ts, handler.js, index.ts, index.js)
  3. pino NOT available in local node_modules -- use structured JSON via console.error to stderr (process.stderr.write) or install pino as a dependency
  4. Sessions directory: /home/node/.openclaw/agents/{agentId}/sessions/
  5. Sub-agent JSONL: topic-scoped file if threadId present (e.g., {uuid}-topic-{topicId}.jsonl)
  6. Current session file field: v.sessionFile is the absolute path

Drafting PLAN.md now. This plan incorporates ALL production review findings from the start.

[SYNTHESIS -> DRAFTING_PLAN] Synthesis complete. Key architectural clarifications confirmed: 1. Workspace hooks dir: /home/node/.openclaw/workspace/hooks/ (highest precedence, auto-discovered) 2. handler.js supported (handlerCandidates: handler.ts, handler.js, index.ts, index.js) 3. pino NOT available in local node_modules -- use structured JSON via console.error to stderr (process.stderr.write) or install pino as a dependency 4. Sessions directory: /home/node/.openclaw/agents/{agentId}/sessions/ 5. Sub-agent JSONL: topic-scoped file if threadId present (e.g., {uuid}-topic-{topicId}.jsonl) 6. Current session file field: v.sessionFile is the absolute path Drafting PLAN.md now. This plan incorporates ALL production review findings from the start. <!-- openclaw-agent -->
Author
Collaborator

Continue

Continue
Author
Collaborator

Approved by Rooh (via Mattermost DM). Proceeding with execution.

Approved by Rooh (via Mattermost DM). Proceeding with execution. <!-- openclaw-agent -->
Author
Collaborator

Phase 0+1 complete. Pushed to master (43cfebe).

Phase 0 (Repo Sync + Environment):

  • Synced live-status.js from workspace
  • Fixed 43 lint issues (empty catch blocks, console statements)
  • Added pino dependency
  • Created src/tool-labels.json with all known tool mappings
  • make check passes

Phase 1 (Core Components) - 8 modules:

  • src/config.js: env-var config with validation
  • src/logger.js: pino singleton with child loggers
  • src/circuit-breaker.js: CLOSED/OPEN/HALF_OPEN state machine
  • src/tool-labels.js: exact/prefix/regex resolver + external override
  • src/status-box.js: Mattermost post manager (keepAlive, throttle, retry, circuit breaker)
  • src/status-formatter.js: pure SessionState->text formatter
  • src/health.js: HTTP health endpoint
  • src/status-watcher.js: JSONL watcher (inotify, compaction, idle detection)

Tests: 59/59 passing. make check clean.

Starting Phase 2: Session Monitor + Lifecycle.

Phase 0+1 complete. Pushed to master (43cfebe). Phase 0 (Repo Sync + Environment): - Synced live-status.js from workspace - Fixed 43 lint issues (empty catch blocks, console statements) - Added pino dependency - Created src/tool-labels.json with all known tool mappings - make check passes Phase 1 (Core Components) - 8 modules: - src/config.js: env-var config with validation - src/logger.js: pino singleton with child loggers - src/circuit-breaker.js: CLOSED/OPEN/HALF_OPEN state machine - src/tool-labels.js: exact/prefix/regex resolver + external override - src/status-box.js: Mattermost post manager (keepAlive, throttle, retry, circuit breaker) - src/status-formatter.js: pure SessionState->text formatter - src/health.js: HTTP health endpoint - src/status-watcher.js: JSONL watcher (inotify, compaction, idle detection) Tests: 59/59 passing. make check clean. Starting Phase 2: Session Monitor + Lifecycle. <!-- openclaw-agent -->
Author
Collaborator

Phase 2 complete. Pushed to master (e3bd6c5).

Phase 2 (Session Monitor + Lifecycle):

  • src/session-monitor.js: polls sessions.json every 2s, multi-agent, channel ID parsing
  • src/watcher-manager.js: top-level orchestrator with full lifecycle
    • Session creation/cleanup
    • Sub-agent linking (embedded in parent status)
    • Offset persistence (restart recovery)
    • Post recovery (search channel history for marker)
    • Graceful shutdown (SIGTERM/SIGINT)
    • CLI: start|stop|status

Tests: 86/86 passing (59 unit + 27 integration). make check clean.

Starting Phase 3: Sub-Agent Support.

Phase 2 complete. Pushed to master (e3bd6c5). Phase 2 (Session Monitor + Lifecycle): - src/session-monitor.js: polls sessions.json every 2s, multi-agent, channel ID parsing - src/watcher-manager.js: top-level orchestrator with full lifecycle - Session creation/cleanup - Sub-agent linking (embedded in parent status) - Offset persistence (restart recovery) - Post recovery (search channel history for marker) - Graceful shutdown (SIGTERM/SIGINT) - CLI: start|stop|status Tests: 86/86 passing (59 unit + 27 integration). make check clean. Starting Phase 3: Sub-Agent Support. <!-- openclaw-agent -->
Author
Collaborator

Phase 3 complete. Pushed to master (6df3278).

Phase 3 (Sub-Agent Support):

  • session-monitor.js: sub-agents pass through regardless of channel (inherit parent)
  • watcher-manager.js:
    • Pending sub-agent queue: child sessions queued when parent not yet registered
    • linkSubAgent() helper for clean parent-child linking
    • Cascade completion: parent waits for all children before marking done
    • Sub-agents embedded in parent status post (no separate posts)
  • status-formatter.js: recursive nested rendering at any depth

Integration tests (test/integration/sub-agent.test.js): 9 tests

  • spawnedBy detection (session-monitor level)
  • Nested status rendering (depth indentation, multiple children, deep nesting)
  • Cascade completion (pending tool call tracking)
  • Sub-agent JSONL parsing (usage events, error results)

PHASES 0-3 COMPLETE:

  • 95/95 tests passing (59 unit + 36 integration)
  • make check clean (0 errors, warnings only)
  • 3 commits on master

Files built:
src/config.js, src/logger.js, src/circuit-breaker.js, src/tool-labels.js
src/status-box.js, src/status-formatter.js, src/health.js, src/status-watcher.js
src/session-monitor.js, src/watcher-manager.js, src/tool-labels.json
src/live-status.js (lint-fixed, 43 issues resolved)
test/unit/ (5 files), test/integration/ (3 files)

Phase 3 complete. Pushed to master (6df3278). Phase 3 (Sub-Agent Support): - session-monitor.js: sub-agents pass through regardless of channel (inherit parent) - watcher-manager.js: - Pending sub-agent queue: child sessions queued when parent not yet registered - linkSubAgent() helper for clean parent-child linking - Cascade completion: parent waits for all children before marking done - Sub-agents embedded in parent status post (no separate posts) - status-formatter.js: recursive nested rendering at any depth Integration tests (test/integration/sub-agent.test.js): 9 tests - spawnedBy detection (session-monitor level) - Nested status rendering (depth indentation, multiple children, deep nesting) - Cascade completion (pending tool call tracking) - Sub-agent JSONL parsing (usage events, error results) PHASES 0-3 COMPLETE: - 95/95 tests passing (59 unit + 36 integration) - make check clean (0 errors, warnings only) - 3 commits on master Files built: src/config.js, src/logger.js, src/circuit-breaker.js, src/tool-labels.js src/status-box.js, src/status-formatter.js, src/health.js, src/status-watcher.js src/session-monitor.js, src/watcher-manager.js, src/tool-labels.json src/live-status.js (lint-fixed, 43 issues resolved) test/unit/ (5 files), test/integration/ (3 files) <!-- openclaw-agent -->
Author
Collaborator

Phase 4 complete: Hook Integration

Deliverables:

  • hooks/status-watcher-hook/HOOK.md (events: ["gateway:startup"])
  • hooks/status-watcher-hook/handler.js (PID check + detached spawn)
  • Hook deployed to workspace: /home/node/.openclaw/workspace/hooks/status-watcher-hook/
  • make check passes (0 errors)

Commit: 5bb3615

Starting Phase 5: Polish + Deployment.

Phase 4 complete: Hook Integration Deliverables: - hooks/status-watcher-hook/HOOK.md (events: ["gateway:startup"]) - hooks/status-watcher-hook/handler.js (PID check + detached spawn) - Hook deployed to workspace: /home/node/.openclaw/workspace/hooks/status-watcher-hook/ - make check passes (0 errors) Commit: 5bb3615 Starting Phase 5: Polish + Deployment. <!-- openclaw-agent -->
Author
Collaborator

Phase 5 complete: Polish + Deployment

Deliverables:

  • skill/SKILL.md: rewritten 9 lines ("status is automatic, no action needed")
  • deploy-to-agents.sh: no AGENTS.md injection; deploys hook only
  • install.sh: npm install + deploy hook + env var instructions
  • deploy/status-watcher.service: systemd unit file
  • deploy/Dockerfile: node:22-alpine container option
  • src/live-status.js: deprecation warning + start-watcher/stop-watcher pass-through
  • README.md: architecture, install, config reference, upgrade guide, troubleshooting
  • make check: 0 errors, 0 format issues
  • npm test: 59 unit + 36 integration = 95 tests passing

Commit: 835faa0

Starting Phase 6: v1 removal checklist.

Phase 5 complete: Polish + Deployment Deliverables: - skill/SKILL.md: rewritten 9 lines ("status is automatic, no action needed") - deploy-to-agents.sh: no AGENTS.md injection; deploys hook only - install.sh: npm install + deploy hook + env var instructions - deploy/status-watcher.service: systemd unit file - deploy/Dockerfile: node:22-alpine container option - src/live-status.js: deprecation warning + start-watcher/stop-watcher pass-through - README.md: architecture, install, config reference, upgrade guide, troubleshooting - make check: 0 errors, 0 format issues - npm test: 59 unit + 36 integration = 95 tests passing Commit: 835faa0 Starting Phase 6: v1 removal checklist. <!-- openclaw-agent -->
Author
Collaborator

Phase 6 complete: v1 Removal Checklist + Implementation Complete

Deliverables:

  • docs/v1-removal-checklist.md: exact sections to remove from 6 AGENTS.md files
    • /home/node/.openclaw/workspace/AGENTS.md (lines 645-669)
    • /home/node/.openclaw/agents/xen/workspace/AGENTS.md (~line 214)
    • /home/node/.openclaw/agents/coder-agent/workspace/AGENTS.md (~line 214)
    • /home/node/.openclaw/workspaces/workspace-gym/AGENTS.md (~line 214)
    • /home/node/.openclaw/workspaces/workspace-global-calendar/AGENTS.md (~line 214)
    • /home/node/.openclaw/workspaces/workspace-god-agent/AGENTS.md (~line 218)
  • STATE.json: IMPLEMENTATION_COMPLETE, phase 6/6, all test results documented
  • make check: clean (0 errors)

All phases pushed to master. Commit: 3879988

Summary of Phases 4-6:

  • Phase 4: gateway:startup hook deployed to workspace hooks dir
  • Phase 5: SKILL.md (9 lines), deploy scripts, systemd service, Dockerfile, deprecation warning, README
  • Phase 6: v1 removal checklist (deferred until 1h production verification)

Final test results: 59 unit + 36 integration = 95 tests passing

Pending actions:

  1. Restart gateway to activate the hook (coordinate with Rooh)
  2. Set required env vars: MM_TOKEN, MM_URL, TRANSCRIPT_DIR, SESSIONS_JSON
  3. After 1h verified operation: run v1-removal-checklist.md to clean AGENTS.md
Phase 6 complete: v1 Removal Checklist + Implementation Complete Deliverables: - docs/v1-removal-checklist.md: exact sections to remove from 6 AGENTS.md files - /home/node/.openclaw/workspace/AGENTS.md (lines 645-669) - /home/node/.openclaw/agents/xen/workspace/AGENTS.md (~line 214) - /home/node/.openclaw/agents/coder-agent/workspace/AGENTS.md (~line 214) - /home/node/.openclaw/workspaces/workspace-gym/AGENTS.md (~line 214) - /home/node/.openclaw/workspaces/workspace-global-calendar/AGENTS.md (~line 214) - /home/node/.openclaw/workspaces/workspace-god-agent/AGENTS.md (~line 218) - STATE.json: IMPLEMENTATION_COMPLETE, phase 6/6, all test results documented - make check: clean (0 errors) All phases pushed to master. Commit: 3879988 Summary of Phases 4-6: - Phase 4: gateway:startup hook deployed to workspace hooks dir - Phase 5: SKILL.md (9 lines), deploy scripts, systemd service, Dockerfile, deprecation warning, README - Phase 6: v1 removal checklist (deferred until 1h production verification) Final test results: 59 unit + 36 integration = 95 tests passing Pending actions: 1. Restart gateway to activate the hook (coordinate with Rooh) 2. Set required env vars: MM_TOKEN, MM_URL, TRANSCRIPT_DIR, SESSIONS_JSON 3. After 1h verified operation: run v1-removal-checklist.md to clean AGENTS.md <!-- openclaw-agent -->
sol closed this issue 2026-03-09 20:52:27 +01:00
Author
Collaborator

Issue closed. Implementation complete.

Summary of what shipped:

Phase 1: status-watcher.js - transcript tail daemon + Mattermost status box
Phase 2: session-monitor.js + watcher-manager.js - session lifecycle, multi-agent support
Phase 3: Sub-agent linking and nested status display
Phase 4: Hook integration (gateway:startup event) - auto-start on daemon boot
Phase 5: Simplified skill/SKILL.md (9 lines: "status is automatic, no action needed"), new deploy scripts
Phase 6: v1 removal checklist for AGENTS.md cleanup across all agent workspaces

Core outcome: Agents no longer need to call live-status manually. The watcher daemon tails transcripts automatically and posts updates to Mattermost.

Issue closed. Implementation complete. Summary of what shipped: Phase 1: status-watcher.js - transcript tail daemon + Mattermost status box Phase 2: session-monitor.js + watcher-manager.js - session lifecycle, multi-agent support Phase 3: Sub-agent linking and nested status display Phase 4: Hook integration (gateway:startup event) - auto-start on daemon boot Phase 5: Simplified skill/SKILL.md (9 lines: "status is automatic, no action needed"), new deploy scripts Phase 6: v1 removal checklist for AGENTS.md cleanup across all agent workspaces Core outcome: Agents no longer need to call live-status manually. The watcher daemon tails transcripts automatically and posts updates to Mattermost. <!-- openclaw-agent -->
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: ROOH/MATTERMOST_OPENCLAW_LIVESTATUS#3