[v4] Live Status Rewrite - Production-Grade Real-Time Agent Progress System #3
Reference in New Issue
Block a user
Delete Branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Problem Statement
The current live-status system (v1) is fundamentally broken in production.
Current Architecture
Diagnosed Failures
Agents forget to use it. Even with MANDATORY instructions in every AGENTS.md, agents skip live-status because it requires 4+ separate exec calls per task, agents must remember post IDs across tool calls, and there's no enforcement mechanism.
Spam problem. When agents DO try to update status, the final response dumps 10+ status messages into the chat.
No sub-agent visibility. Sub-agents work in isolated sessions. Their progress is invisible until the announce step.
Thread isolation breaks state. Mattermost threads create separate OpenClaw sessions.
Naive solutions don't scale. Telling agents harder in system prompts doesn't work.
Proposed Solution: Live Status v4
Core Principle
Don't rely on agents to update status. Intercept their work automatically.
Architecture
A
status-watcherdaemon that tails the agent's JSONL transcript in real-time and auto-updates a Mattermost status box.Components
1. status-watcher - Transcript Tail Daemon
2. status-box - Mattermost Post Manager
3. Hook Integration
4. Agent-Side Simplification
Agents get ONE simple instruction: Status updates are automatic. Focus on the task.
Implementation Plan
Phase 1: Core Watcher
Phase 2: Session Lifecycle
Phase 3: Sub-Agent Support
Phase 4: Polish
Status Box Format (v4)
Files to Create/Modify
Success Criteria
References
[v4] Live Status Rewrite — Production-Grade Real-Time Agent Progress Systemto [v4] Live Status Rewrite - Production-Grade Real-Time Agent Progress SystemTriage: v4 Live Status Rewrite
Reviewed the spec against the current codebase. Here is the assessment.
Current State (v1)
src/live-status.js(114 lines) -- simple CLI wrapper around Mattermost HTTP APIlive-status create/live-status updatewith post IDsinstall.shinjects protocol text into AGENTS.md (prompt injection approach)Spec Assessment
The v4 proposal correctly diagnoses the core problem: agents don't reliably use explicit status tools. The solution (transcript-tailing daemon) is architecturally sound -- it moves the burden from agents to infrastructure.
Key Technical Observations
Transcript JSONL tailing -- This depends on OpenClaw's internal transcript format. Need to verify: (a) where transcripts live, (b) the JSONL schema, (c) whether they are written synchronously or buffered. This is discovery work.
Session lifecycle hooks -- The spec mentions
POST /hooks/agent. Need to verify this hook exists and what payload it sends. If it does not exist, this becomes a dependency on OpenClaw core.Mattermost post management -- The proposed
status-box.jsconcept (rich attachments, nested sub-agent status) is a significant step up from raw message text. Good direction.Debouncing -- 500ms default is reasonable. Should be configurable per-deployment.
Idle timeout for auto-complete -- 30s might be too aggressive. Some tool calls (exec) can run for minutes. Needs a smarter heuristic (e.g., no new transcript lines AND no running exec sessions).
Session compaction -- If OpenClaw rewrites/truncates transcript files, the watcher needs to handle file truncation (inode change, seek reset). This is a real edge case worth testing early.
Risk Areas
Recommendation
This is a solid spec. The phased approach makes sense. Before building, Phase 0 discovery would de-risk the two biggest unknowns:
Once those are confirmed, Phases 1-4 can proceed with confidence.
This issue has
[v4]in the title -- ready for implementation planning when the owner gives the go-ahead.[DECOMPOSE] Planner sub-agent active. Reading protocol, examining transcripts, and identifying knowledge gaps before drafting discovery sub-issues.
Focus areas:
[DISCOVERY -> SYNTHESIS] Discovery phase complete. Key findings below.
Discovery Findings
1. JSONL Transcript Format
Confirmed format. Each line is a JSON object with
typefield:session-- First line. Contains session UUID, version=3, cwdmodel_change-- Provider, modelId changesthinking_level_change-- thinking on/offcustom-- Subtypes:model-snapshot,openclaw.cache-ttlmessage-- The main workhorse. Role =user,assistant, ortoolResultMessage content array can contain:
{type: "text", text: "..."}-- plain text{type: "toolCall", id, name, arguments: {...}}-- tool invocations{type: "thinking", thinking: "..."}-- internal reasoning (thinking mode)Assistant messages also carry:
api,provider,model,usage,stopReason,timestampToolResult messages carry:
toolCallId,toolName,isError,content: [{type, text}]2. Session Keying
Session keys follow pattern:
agent:{agentId}:{context}Examples:
agent:main:main-- direct sessionagent:main:mattermost:channel:{channelId}-- channel sessionagent:main:mattermost:channel:{channelId}:thread:{threadId}-- thread sessionagent:main:subagent:{uuid}-- SUB-AGENT SESSIONagent:main:hook:gitea:{repo}:issue:{n}-- hook-triggered sessionagent:main:cron:{name}-- cron sessionSub-agent entry in sessions.json has:
spawnedBy,spawnDepth,label,sessionId(maps to .jsonl file UUID)Sessions store:
/home/node/.openclaw/agents/{agentId}/sessions/sessions.jsonJSONL files:
/home/node/.openclaw/agents/{agentId}/sessions/{uuid}.jsonlTopic-scoped files:
{uuid}-topic-{topicId}.jsonl3. Hook Events
Available internal hook events:
command:new,command:reset,command:stop-- user commandscommand-- all commandsagent:bootstrap-- before workspace files injectedgateway:startup-- after gateway startsNO session:start or session:end hooks exist yet. The hooks system only covers commands, NOT individual message/run starts.
Sub-agent lifecycle:
subagent_spawned,subagent_endedare channel plugin hooks (not internal hooks).4. Mattermost API
n73636eit7bg3rgmpsj693mwno(default/main bot)5. Current Failure Modes (v1)
6. Transcript File Patterns (Key Insight)
{sessionId}.jsonl(from sessions.json lookup)[SYNTHESIS -> DRAFTING_PLAN] Synthesis complete. Now drafting the implementation plan.
[AUTO_AUDIT -> SIMULATION -> PENDING_APPROVAL]
Audit: 32/32 PASS. Simulation: READY. No blockers.
Plan committed to repo: PLAN.md, STATE.json, discoveries/README.md
See full plan below in the next comment.
[PENDING_APPROVAL] Implementation Plan: Live Status v4
Full plan committed to repo: PLAN.md
Discovery: discoveries/README.md
Summary
Replace the broken agent-cooperative live-status (v1) with an infrastructure-level daemon that tails OpenClaw JSONL transcript files and auto-updates Mattermost -- zero agent cooperation required.
Core mechanism: Watch sessions.json for new sessions -> tail each session's JSONL -> parse tool calls and results -> debounced Mattermost updates (500ms).
Architecture
Phases (7-9h total)
Key Technical Decisions
Audit: 32/32 PASS | Simulation: READY
Open Questions (all have defaults, plan can proceed)
Q1: Idle timeout -- Default: stopReason=stop + 10s no new lines
Q2: Non-MM sessions -- Default: skip (hook/cron sessions get no status box)
Q3: Per-session or per-request status box -- Default: per user message (new box each request)
Q4: Compaction behavior -- Default: detect fileSize < bytesRead, reset offset
Warning
Gateway restart required to activate the gateway:startup hook after install. Coordinate timing with Rooh.
ROOH: Reply "go" / "approved" / "proceed" to begin execution.
Check if this plan is
Scalability
Efficiency
Production ready
Up to best current practices and not naive solutions
Review: Scalability, Efficiency, Production-Readiness, Best Practices
Reviewed against the four criteria requested. The diagnosis is spot-on; the architecture has the right instincts but needs hardening in several areas before it is production-grade.
1. SCALABILITY
Current design: one watcher daemon per session.
This is the biggest architectural concern. If 10 agents run concurrently with sub-agents, you could have 30+ watcher processes, each doing
fs.watch+ debounced HTTP calls. That works on a single box with low concurrency, but it does not scale.Issues:
Recommendations:
fs.watchwith recursive option (supported on Linux via inotify since Node 19+, and you are on Node v22). This eliminates per-session process overhead.http.AgentwithkeepAlive: trueandmaxSocketscapped. All Mattermost updates go through one connection pool.Verdict: Needs rework. Per-session daemons are a v1-level solution to a v4-level problem.
2. EFFICIENCY
Debouncing at 500ms is correct in principle but naive in implementation.
Issues:
The spec says "max 1 update/500ms" but does not specify the debounce strategy. Leading-edge? Trailing-edge? Throttle? This matters:
Full post replacement on every update is wasteful. Each Mattermost
PUT /posts/{id}sends the entire message body. If the status box grows to 30+ lines, you are sending the same 29 lines repeatedly to change 1 line.JSONL parsing on every line is fine -- JSON.parse on a single line is sub-millisecond. No concern here.
fs.watchvs polling: On Linux (your runtime),fs.watchuses inotify which is efficient. Good. Do NOT fall back tofs.watchFile(polling) -- it is wasteful and unnecessary on Linux. The spec does not mention this distinction; it should.Verdict: Mostly good, needs the debounce strategy specified and the message size growth addressed.
3. PRODUCTION-READINESS
This is where the spec has the most gaps.
Missing from the spec:
/healthor write a heartbeat fileThe 30-second idle timeout for auto-complete is problematic:
exectool calls can run for minutes (npm install, git clone, compilation).Token/credential management:
MM_TOKEN,MM_URL). The spec does not address this.Verdict: Not production-ready as specified. Needs the gaps above addressed before it can run unattended.
4. BEST PRACTICES
What the spec gets right:
What deviates from best practices:
fs.watchbut does not handle edge casesfs.watchon Linux (inotify), handle ENOENT (file not yet created), EPERM, and rename eventstool-labels.jsonas static fileexec:*-> "Running command...") not just exact tool namesMissing best practice: the watcher should be idempotent on restart. If the watcher crashes and restarts, it should:
Summary Verdict
The core idea is sound. Transcript tailing is the right approach -- it removes agent burden entirely. But the implementation plan needs a Phase 0 that addresses the architectural gaps above before writing code.
Top 3 changes to make the spec production-grade:
I agree, update the plan based on your findings.
Revised Plan: v4 Live Status Rewrite (Production-Grade)
Incorporating all findings from the scalability/efficiency/production-readiness review. Changes from original spec marked with [CHANGED] or [NEW].
Problem Statement
(unchanged -- the diagnosis is correct)
The current live-status system (v1) is fundamentally broken in production. Agents forget to use it, it spams when they do, sub-agents are invisible, and prompt injection does not work as an enforcement mechanism.
Proposed Solution: Live Status v4
Core Principle
Don't rely on agents to update status. Intercept their work automatically.
Architecture [CHANGED]
Single multiplexed watcher daemon (not per-session) that watches all transcript files and routes updates through a shared Mattermost connection pool.
Why single process over per-session daemons:
Components
1. status-watcher.js - Multiplexed Transcript Watcher [CHANGED]
fs.watchwith recursive option (Node 22 on Linux = inotify, efficient)fs.watchFile(polling) -- inotify or nothingSessionStatemap per active session:postId: Mattermost status box post IDlastOffset: byte offset in transcript file (for resume)pendingToolCalls: count of tool_calls without matching tool_resultslines: recent status lines (capped at MAX_LINES, default 15)startTime: session start timestamplastActivity: timestamp of last transcript linestat.size < lastOffset, reset to 02. status-box.js - Mattermost Post Manager [CHANGED]
http.AgentwithkeepAlive: true,maxSockets: 4MM_TOKENandMM_URLfrom environment variables only. No hardcoded tokens, no sed replacement.3. tool-labels.js - Tool Name Mapping [CHANGED from .json]
"Read" -> "Reading file...""exec:*" -> "Running command..."/^web_/ -> "Searching the web...""Working..."4. Hook Integration
POST /hooks/agent) for session start/end events5. Agent-Side Simplification
live-statusCLI kept for backward compat but marked deprecatedProduction Infrastructure [NEW SECTION]
Graceful Shutdown
Health Check
GET /health{ "status": "ok", "activeSessions": N, "uptimeSeconds": N, "lastError": "..." }Circuit Breaker for Mattermost API
Structured Logging [NEW]
pino(fast, structured JSON logging)Process Management
/tmp/status-watcher.pid)--daemonflag for background operationdeploy/status-watcher.serviceMetrics [MOVED TO PHASE 1]
updates_sent_totalupdates_failed_totalactive_sessionscircuit_breaker_state(closed/open/half-open)queue_depthuptime_secondsIdle Completion Heuristic [CHANGED]
The original 30-second idle timeout was too aggressive. Revised approach:
Smart idle detection:
pendingToolCallsper session (increment ontool_use, decrement ontool_result)pendingToolCalls > 0: session is NOT idle, regardless of time since last transcript linependingToolCalls == 0AND last transcript entry was an assistant message AND no new lines forIDLE_TIMEOUTseconds (configurable, default 60s): mark as idle/completependingToolCalls == 0AND last transcript entry was a tool_result: start a shorter timer (30s) -- agent might be composing responseMAX_SESSION_DURATION(configurable, default 30 minutes), force-complete regardlessThis prevents premature completion during long-running exec calls while still cleaning up genuinely idle sessions.
Configuration [NEW SECTION]
All tunable values via environment variables with sensible defaults:
MM_TOKENMM_URLhttp://mattermost:8065TRANSCRIPT_DIRTHROTTLE_MS500IDLE_TIMEOUT_S60MAX_SESSION_DURATION_S1800MAX_STATUS_LINES15MAX_ACTIVE_SESSIONS20MAX_MESSAGE_CHARS15000HEALTH_PORT9090LOG_LEVELinfoCIRCUIT_BREAKER_THRESHOLD5CIRCUIT_BREAKER_COOLDOWN_S30PID_FILE/tmp/status-watcher.pidTOOL_LABELS_FILEnullRevised Implementation Plan
Phase 0: Discovery [NEW]
fs.watchrecursive behavior on the target Linux kernelPhase 1: Core Watcher + Production Foundation
src/status-watcher.js-- multiplexed directory watcher, JSONL parser, SessionState managementsrc/status-box.js-- Mattermost post manager with shared HTTP pool, throttle, message size capsrc/tool-labels.js-- pattern-matching tool name to label mappingsrc/config.js-- centralized configuration from env vars with validationsrc/logger.js-- pino-based structured loggingsrc/circuit-breaker.js-- circuit breaker for Mattermost APIsrc/health.js-- HTTP health endpoint with metricsPhase 2: Session Lifecycle + Restart Recovery
Phase 3: Sub-Agent Support
Phase 4: Deployment + Migration
install.sh-- new install flow (env-var based, no token sed replacement)deploy/status-watcher.service-- systemd unit filedeploy/Dockerfile-- containerized deployment optionskill/SKILL.md-- rewrite (simplified: "status is automatic")README.md-- full v4 documentationsrc/live-status.jsRevised Status Box Format
Key changes from original:
[ACTIVE],[DONE],[ERROR],[INTERRUPTED]Files to Create/Modify (Revised)
src/status-watcher.jssrc/status-box.jssrc/tool-labels.jssrc/config.jssrc/logger.jssrc/circuit-breaker.jssrc/health.jssrc/live-status.jsskill/SKILL.mdinstall.shdeploy/status-watcher.servicedeploy/DockerfileREADME.mdtest/package.jsonRevised Success Criteria
Risk Assessment (Revised)
Effort Estimate
This plan is ready for approval. Phase 0 (discovery) can begin immediately as it requires no code changes.
Approved
Thanks for the review and feedback, sol. The revised plan is posted above (comment #11402).
However, plan execution requires owner approval (Rooh). This is a security gate -- contributor approvals cannot trigger execution.
Current state: PENDING_APPROVAL -- waiting for Rooh to review and approve the revised plan before any code is written.
Rooh: when you are ready, the approval words are: "go", "approved", "approve", "proceed", "build it", "execute", "ship it", or "do it".
[SYNTHESIS] New planner session active (proj035-planner-v2). Prior plan found in repo (PLAN.md) plus a revised architecture in comment #11402. The revised plan incorporates production-grade changes: single multiplexed daemon, shared HTTP connection pool, circuit breaker, pino logging, health endpoint, env-var config. Now drafting the final PLAN.md that reflects the revised architecture.
[DRAFTING_PLAN -> AUTO_AUDIT] Final PLAN.md written (production-grade, revised architecture). Running audit checklist now.
[SYNTHESIS] Planner v2 session active (proj035-planner-v2). Read all 14 issue comments including the full production review. Prior plan is in PLAN.md / STATE.json. The previous plan had the per-session daemon anti-pattern that the production review correctly identified as the #1 architectural flaw. Now synthesizing all findings into a complete revised architecture before drafting the final production-grade plan.
Key architecture decisions to incorporate:
[AUTO_AUDIT] 34/34 PASS. Warnings: (1) make check currently fails on existing live-status.js (43 lint issues - addressed in Phase 0.2). (2) Gateway restart needed to activate hook in Phase 4.
[SIMULATION] READY. Key verifications:
Proceeding to PENDING_APPROVAL.
[PENDING_APPROVAL] Plan v2 ready for approval.
Full plan committed to repo: https://git.eeqj.de/ROOH/MATTERMOST_OPENCLAW_LIVESTATUS/src/branch/master/PLAN.md
SUMMARY: Live Status v4 (Production-Grade)
Core change: replace broken agent-cooperative system with a single multiplexed daemon that tails all OpenClaw JSONL transcripts automatically. Zero agent cooperation required.
ARCHITECTURE HIGHLIGHTS (revised from original):
PHASES:
TOTAL: 20-28 hours
DEPENDENCIES: pino (npm, one new dep). All else uses Node.js built-ins.
AUDIT: 34/34 PASS
SIMULATION: READY (all paths verified against real system)
Waiting for Rooh approval. Approval words: go, approved, approve, proceed, build it, execute, ship it, do it.
[SYNTHESIS -> DRAFTING_PLAN] Synthesis complete. Key architectural clarifications confirmed:
Drafting PLAN.md now. This plan incorporates ALL production review findings from the start.
Continue
Approved by Rooh (via Mattermost DM). Proceeding with execution.
Phase 0+1 complete. Pushed to master (
43cfebe).Phase 0 (Repo Sync + Environment):
Phase 1 (Core Components) - 8 modules:
Tests: 59/59 passing. make check clean.
Starting Phase 2: Session Monitor + Lifecycle.
Phase 2 complete. Pushed to master (
e3bd6c5).Phase 2 (Session Monitor + Lifecycle):
Tests: 86/86 passing (59 unit + 27 integration). make check clean.
Starting Phase 3: Sub-Agent Support.
Phase 3 complete. Pushed to master (
6df3278).Phase 3 (Sub-Agent Support):
Integration tests (test/integration/sub-agent.test.js): 9 tests
PHASES 0-3 COMPLETE:
Files built:
src/config.js, src/logger.js, src/circuit-breaker.js, src/tool-labels.js
src/status-box.js, src/status-formatter.js, src/health.js, src/status-watcher.js
src/session-monitor.js, src/watcher-manager.js, src/tool-labels.json
src/live-status.js (lint-fixed, 43 issues resolved)
test/unit/ (5 files), test/integration/ (3 files)
Phase 4 complete: Hook Integration
Deliverables:
Commit:
5bb3615Starting Phase 5: Polish + Deployment.
Phase 5 complete: Polish + Deployment
Deliverables:
Commit:
835faa0Starting Phase 6: v1 removal checklist.
Phase 6 complete: v1 Removal Checklist + Implementation Complete
Deliverables:
All phases pushed to master. Commit:
3879988Summary of Phases 4-6:
Final test results: 59 unit + 36 integration = 95 tests passing
Pending actions:
Issue closed. Implementation complete.
Summary of what shipped:
Phase 1: status-watcher.js - transcript tail daemon + Mattermost status box
Phase 2: session-monitor.js + watcher-manager.js - session lifecycle, multi-agent support
Phase 3: Sub-agent linking and nested status display
Phase 4: Hook integration (gateway:startup event) - auto-start on daemon boot
Phase 5: Simplified skill/SKILL.md (9 lines: "status is automatic, no action needed"), new deploy scripts
Phase 6: v1 removal checklist for AGENTS.md cleanup across all agent workspaces
Core outcome: Agents no longer need to call live-status manually. The watcher daemon tails transcripts automatically and posts updates to Mattermost.