
sol fe81de308f plan: v4 implementation plan + discovery findings

PROJ-035 Live Status v4 - implementation plan created by planner subagent.

Discovery findings documented in discoveries/README.md covering:
- JSONL transcript format (confirmed v3 schema)
- Session keying patterns (subagent spawnedBy linking)
- Hook events available (gateway:startup confirmed)
- Mattermost API (no edit time limit)
- Current v1 failure modes

Audit: 32/32 PASS, Simulation: READY

2026-03-07 15:41:50 +00:00


Implementation Plan: Live Status v4

Generated: 2026-03-07 | Agent: planner:proj035 | Status: DRAFT

1. Goal

Replace the broken agent-cooperative live-status system with a transparent infrastructure-level daemon that tails OpenClaw's JSONL transcript files in real-time and updates a Mattermost status box automatically — zero agent cooperation required. Sub-agents become visible. Spam is eliminated. Sessions never lose state. Works from gateway startup without any AGENTS.md instruction injection.

2. Architecture

OpenClaw Gateway
├── Agent Sessions (main, coder-agent, sub-agents, hooks...)
│   └── writes {uuid}.jsonl as it works
│
└── status-watcher daemon (per active session)
        ├── Polls/watches {uuid}.jsonl (new line = new event)
        ├── Parses tool calls, results, assistant text
        ├── Maps tool names → human-readable labels
        ├── Debounces Mattermost updates (500ms)
        ├── Auto-creates status box in correct channel/thread
        ├── Detects sub-agent spawns → nests sub-agent status
        └── Auto-completes when agent stops writing (idle timeout)

sessions.json (runtime registry)
├── session key → {sessionId, sessionFile, spawnedBy, spawnDepth, channel, ...}
└── used to: resolve JSONL file path, determine channel, link parent→child

OpenClaw Hook (gateway:startup + command:new)
└── Spawns status-watcher for the right session

Mattermost API (slack.solio.tech)
├── POST /api/v4/posts  → create status box
├── PUT /api/v4/posts/{id} → update in-place (no edit time limit)
└── Multiple bot tokens per agent
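
As a rough illustration of the two Mattermost calls in the diagram, the sketch below builds request descriptors for creating a post and updating it in place. The helper names (buildCreatePost, buildUpdatePost), the example host and the token value are hypothetical; the endpoint paths come from the diagram, and the body fields follow the public Mattermost v4 posts API. It stops short of sending anything, so the actual transport (the built-in https module) is left to the caller.

```javascript
// Sketch only: builds request descriptors, does not perform network I/O.
// baseUrl and token are placeholders supplied by the caller.
function authHeaders(token) {
  return {
    Authorization: `Bearer ${token}`,
    'Content-Type': 'application/json',
  };
}

// POST /api/v4/posts creates the status box (optionally inside a thread).
function buildCreatePost(baseUrl, token, channelId, message, rootId) {
  const body = { channel_id: channelId, message };
  if (rootId) body.root_id = rootId; // reply under an existing root post
  return {
    method: 'POST',
    url: new URL('/api/v4/posts', baseUrl).toString(),
    headers: authHeaders(token),
    body: JSON.stringify(body),
  };
}

// PUT /api/v4/posts/{id} rewrites the same post in place.
function buildUpdatePost(baseUrl, token, postId, message) {
  return {
    method: 'PUT',
    url: new URL(`/api/v4/posts/${encodeURIComponent(postId)}`, baseUrl).toString(),
    headers: authHeaders(token),
    // The update endpoint expects the post id in the body as well as in the path.
    body: JSON.stringify({ id: postId, message }),
  };
}
```

Keeping request construction separate from transport makes the client unit-testable without a live Mattermost server.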

Key Design Decisions (from discovery)

Watch sessions.json, not just transcript files. sessions.json is the authoritative registry that maps session keys (including sub-agents) to JSONL files. Monitor it to detect new sessions.
No new hook events needed. We cannot use session:start/session:end hooks (they don't exist). Instead: use gateway:startup to begin watching all active sessions, and poll sessions.json for new sessions.
Sub-agent detection via spawnedBy field. When sessions.json gets a new entry with spawnedBy, we know it's a sub-agent of the given parent session. Nest its status under the parent status box.
JSONL format is stable. Version 3 format confirmed. Key events:
- message with role=assistant + content toolCall → tool being called
- message with role=toolResult → tool completed
- message with role=assistant + content text → agent thinking/responding
- custom with customType: openclaw.cache-ttl → turn boundary (good idle signal)
Mattermost post edit is unlimited. PostEditTimeLimit = -1. We can update the status post indefinitely. No workaround needed.
Keep live-status.js as a thin orchestration layer. Agents can still call it manually for special cases, but it's no longer the primary mechanism.

3. Tech Stack

Layer	Technology	Version	Reason
Watcher daemon	Node.js	22.x (existing)	Already installed, fs.watch/setInterval available
File watching	fs.watch + fallback polling	built-in	fs.watch can miss change events on Linux, so a polling fallback is needed
Mattermost API	https (built-in)	-	Already used in live-status.js
Session registry	JSON file watch	-	sessions.json updated on every message
IPC (parent↔watcher)	PID file + signals	-	Simple, no deps
Hook integration	OpenClaw hooks system	existing	gateway:startup hook for auto-start

4. Project Structure

MATTERMOST_OPENCLAW_LIVESTATUS/
├── src/
│   ├── status-watcher.js      CREATE  Core transcript tail + parse + debounce
│   ├── session-monitor.js     CREATE  Watch sessions.json for new/ended sessions
│   ├── mattermost-client.js   CREATE  Mattermost HTTP API wrapper (rate-limited)
│   ├── tool-labels.json       CREATE  Tool name → human-readable label map
│   ├── status-formatter.js    CREATE  Format status box message (text + sub-agents)
│   ├── watcher-manager.js     CREATE  Start/stop watchers per session, PID tracking
│   ├── live-status.js         MODIFY  Add start-watcher/stop-watcher commands; keep create/update/complete
│   └── agent-accounts.json    KEEP    Agent ID → bot account mapping
│
├── hooks/
│   └── status-watcher-hook/
│       ├── HOOK.md            CREATE  Hook metadata (events: gateway:startup, command:new)
│       └── handler.js         CREATE  Spawns watcher-manager on gateway start
│
├── skill/
│   └── SKILL.md               REWRITE  Remove verbose manual protocol; just note status is automatic
│
├── deploy-to-agents.sh        REWRITE  Installs hook instead of AGENTS.md injection
├── install.sh                 REWRITE  New install flow: npm install + hook enable
├── README.md                  REWRITE  Full v4 documentation
├── package.json               MODIFY   Add start/stop/status npm scripts
└── Makefile                   MODIFY   Add check/test/lint/fmt targets

5. Dependencies

Package	Version	Purpose	New/Existing
node.js	22.x	Runtime	Existing (system)
(none)	-	All built-in: https, fs, path, child_process	-

No new npm dependencies. Everything uses Node.js built-ins to keep install footprint at zero.

6. Data Model

sessions.json entry (relevant fields)

{
  "agent:main:subagent:uuid": {
    "sessionId": "50dc13ad-...",
    "sessionFile": "50dc13ad-....jsonl",
    "spawnedBy": "agent:main:main",
    "spawnDepth": 1,
    "label": "proj035-planner",
    "channel": "mattermost",
    "groupChannel": "#channelId__botUserId"
  }
}
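
Given entries shaped like the one above, new sessions and sub-agents can be detected by diffing two snapshots of sessions.json. The following is a minimal sketch under that assumption; diffSessions and the parentKey field are hypothetical names, and only the spawnedBy field comes from the plan.

```javascript
// Sketch: compare two parsed sessions.json snapshots (plain objects keyed
// by session key) and report which sessions appeared or disappeared.
function diffSessions(previous, current) {
  const added = [];
  const removed = [];
  for (const [key, entry] of Object.entries(current)) {
    if (!Object.hasOwn(previous, key)) {
      // spawnedBy is only present for sub-agents; null marks a top-level session.
      added.push({ key, entry, parentKey: entry.spawnedBy ?? null });
    }
  }
  for (const key of Object.keys(previous)) {
    if (!Object.hasOwn(current, key)) removed.push(key);
  }
  return { added, removed };
}
```

A caller would keep the last snapshot in memory, re-read the file on each poll, and treat any added entry with a non-null parentKey as a sub-agent to nest under its parent.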

JSONL event schema (parsed by watcher)

type=session    → session UUID, cwd (first line only)
type=message    → role=user|assistant|toolResult; content[]=text|toolCall|toolResult
type=custom     → customType=openclaw.cache-ttl (turn boundary marker)
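
The schema above could be turned into watcher events roughly as follows. This is a sketch, not the confirmed parser: parseTranscriptLine and the event kinds are hypothetical, and the exact nesting of role and content (directly on the record or under a message field) is an assumption, so both are accepted.

```javascript
// Sketch: map one JSONL line to zero or more watcher events.
function parseTranscriptLine(line) {
  let record;
  try {
    record = JSON.parse(line);
  } catch {
    return []; // partially written or corrupt line: skip it
  }
  if (record === null || typeof record !== 'object') return [];

  if (record.type === 'session') return [{ kind: 'session-start' }];
  if (record.type === 'custom') {
    return record.customType === 'openclaw.cache-ttl'
      ? [{ kind: 'turn-boundary' }]
      : [];
  }
  if (record.type !== 'message') return [];

  // Assumption: role/content may sit on the record or under record.message.
  const message = record.message ?? record;
  if (message.role === 'toolResult') return [{ kind: 'tool-result' }];
  if (message.role !== 'assistant' || !Array.isArray(message.content)) return [];

  const events = [];
  for (const part of message.content) {
    if (!part) continue;
    if (part.type === 'toolCall') {
      events.push({ kind: 'tool-call', name: part.name ?? 'unknown' });
    } else if (part.type === 'text') {
      events.push({ kind: 'assistant-text' });
    }
  }
  return events;
}
```

Returning an array keeps the caller simple, since a single assistant message can carry both text and several tool calls.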

Watcher state per session

{
  "sessionKey": "agent:main:subagent:uuid",
  "sessionFile": "/path/to/uuid.jsonl",
  "bytesRead": 1024,
  "statusPostId": "abc123def456...",
  "channelId": "yy8agcha...",
  "rootPostId": null,
  "lastActivity": 1772897576000,
  "subAgentWatchers": ["child-session-key"],
  "statusLines": ["[15:21] Reading file... done", ...],
  "parentStatusPostId": null
}
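
The bytesRead field above implies an incremental read from a byte offset. One way to sketch that, including the "file got smaller" compaction check from the plan, is shown below. readNewLines and its return shape are hypothetical; the synchronous fs calls are used only to keep the example short.

```javascript
const fs = require('node:fs');

// Sketch: read complete lines appended since `bytesRead`.
// Returns the new offset so the caller can store it in watcher state.
function readNewLines(filePath, bytesRead) {
  const fd = fs.openSync(filePath, 'r');
  try {
    const size = fs.fstatSync(fd).size;
    let offset = bytesRead;
    let truncated = false;
    if (size < offset) {
      // File shrank (e.g. compaction rewrote it): start over from byte 0.
      offset = 0;
      truncated = true;
    }
    const length = size - offset;
    if (length === 0) return { lines: [], bytesRead: offset, truncated };

    const buffer = Buffer.alloc(length);
    const got = fs.readSync(fd, buffer, 0, length, offset);
    const chunk = buffer.subarray(0, got);

    // Only consume up to the last newline; a trailing partial line is left
    // for the next poll. 0x0a never occurs inside a multi-byte UTF-8
    // sequence, so cutting there cannot split a character.
    const lastNewline = chunk.lastIndexOf(0x0a);
    if (lastNewline === -1) return { lines: [], bytesRead: offset, truncated };

    const text = chunk.toString('utf8', 0, lastNewline);
    const lines = text.split('\n').filter((l) => l.length > 0);
    return { lines, bytesRead: offset + lastNewline + 1, truncated };
  } finally {
    fs.closeSync(fd);
  }
}
```

When truncated is true the caller still needs some form of deduplication before posting, because re-reading from the start replays events that were already shown.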

Status box format

Agent: main — PROJ-035 Plan
[15:21:22] Reading transcript format...
[15:21:25] exec: ls /agents/sessions done (0.8s)
[15:21:28] Writing implementation plan...
  Sub-agent: proj035-planner
    [15:21:42] Reading protocol...
    [15:21:55] Analyzing JSONL format...
    [15:22:10] Complete (28s)
[15:22:15] Plan ready. Awaiting approval.
Runtime: 53s
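
The layout above can be produced by a small pure function, which is what makes the formatter unit-testable. The sketch below assumes a hypothetical input shape (header, lines, subAgent, runtimeSeconds) and takes pre-formatted HH:MM:SS strings so that it stays independent of time zones; none of these names come from the existing code.

```javascript
// Sketch: render status lines, indenting sub-agent blocks as in the sample box.
function renderLines(lines, indent) {
  const out = [];
  for (const line of lines) {
    if (line.subAgent) {
      // Sub-agent header is indented two spaces, its lines four.
      out.push(`${indent}  Sub-agent: ${line.subAgent.label}`);
      out.push(...renderLines(line.subAgent.lines, `${indent}    `));
    } else {
      out.push(`${indent}[${line.time}] ${line.text}`);
    }
  }
  return out;
}

// header is passed through verbatim; runtimeSeconds is whole seconds.
function formatStatusBox({ header, lines, runtimeSeconds }) {
  return [header, ...renderLines(lines, ''), `Runtime: ${runtimeSeconds}s`].join('\n');
}
```

Because renderLines recurses, deeper spawn depths would nest further without extra code, although the plan only shows one level.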

7. Task Checklist

Phase 0: Repo Sync + Setup ⏱️ 10min

Parallelizable: no | Dependencies: none

0.1: Sync workspace live-status.js to remote repo (git push) → remote matches workspace
0.2: Verify Makefile has check/test/lint/fmt targets (or add them) → make check passes
0.3: Create src/tool-labels.json with initial tool→label mapping → file exists
0.4: Verify src/agent-accounts.json (already exists; do not recreate) → agent→account mapping present

Phase 1: Core Watcher ⏱️ 2-3h

Parallelizable: no | Dependencies: Phase 0

1.1: Create src/mattermost-client.js — HTTP wrapper with rate limiting (max 2 req/s), retry on 429, create/update/delete post methods → tested with curl
1.2: Create src/status-formatter.js — formats status box lines from events, sub-agent nesting, timestamps → unit testable pure function
1.3: Create src/status-watcher.js — core daemon:
- Accepts: sessionKey, sessionFile, channelId, rootPostId (optional), statusPostId (optional)
- Reads JSONL file from current byte offset
- On new lines: parse event type, extract human-readable status
- Debounce 500ms before Mattermost update
- Idle timeout: 30s after last new line → mark complete
- Emits events: status-update, session-complete
- Returns: statusPostId (created on first event)
1.4: Extend src/tool-labels.json (created in 0.3) to cover all known tools → exec, read, write, edit, web_search, web_fetch, message, subagents, nodes, browser, image, camofox_, claude_code_
1.5: Manual test — start watcher against a real session file, verify Mattermost post appears → post created and updated
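
The 500ms debounce in task 1.3 could look roughly like the sketch below. It is a plain trailing-edge debounce: each new status text replaces the pending one and restarts the timer, so only the latest text is sent once events go quiet. createDebouncedUpdater and the injectable timers argument are hypothetical; the latter exists only so the behaviour can be tested without real delays.

```javascript
// Sketch: coalesce rapid status changes into one Mattermost update.
// `send` is whatever function performs the actual post update.
function createDebouncedUpdater(send, delayMs = 500, timers = globalThis) {
  let pending = null;
  let handle = null;

  function fire() {
    handle = null;
    const latest = pending;
    pending = null;
    send(latest);
  }

  return {
    // Record the newest text and restart the quiet-period timer.
    push(text) {
      pending = text;
      if (handle !== null) timers.clearTimeout(handle);
      handle = timers.setTimeout(fire, delayMs);
    },
    // Send immediately, e.g. when the session completes.
    flush() {
      if (handle === null) return;
      timers.clearTimeout(handle);
      fire();
    },
  };
}
```

One caveat of a pure debounce is that a steady stream of events arriving faster than delayMs would postpone the update indefinitely; a maximum-wait cap would address that if it turns out to matter in practice.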

Phase 2: Session Monitor ⏱️ 1-2h

Parallelizable: no | Dependencies: Phase 1

2.1: Create src/session-monitor.js — watches sessions.json for changes:
- Polls every 2s (fs.watch is not reliable enough on Linux for a frequently rewritten file)
- Diffs previous vs current sessions.json
- On new session: emit session-added with session details
- On removed session: emit session-removed
- Resolves channel/thread from session key format
2.2: Create src/watcher-manager.js — coordinates monitor + watchers:
- On session-added: resolve channel (from session key), start status-watcher
- Tracks active watchers in memory (Map: sessionKey → watcher)
- On session-removed or watcher-complete: clean up
- Handles sub-agents: on spawnedBy session added, nest under parent watcher
- PID file at /tmp/openclaw-status-watcher.pid for single-instance enforcement
2.3: Add a CLI entry point to src/watcher-manager.js: node watcher-manager.js start|stop|status → process management
2.4: End-to-end test — run manager in foreground, trigger agent session, verify status box appears → automated smoke test
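
Single-instance enforcement via the PID file in task 2.2 might be sketched as follows. The function names are hypothetical, and this version simply refuses to start when a live owner is found rather than killing it, which is a simplification of the mitigation listed under Risks. The check and the write are not atomic, so two processes starting at the same instant could still both succeed.

```javascript
const fs = require('node:fs');

// Signal 0 performs the existence/permission check without delivering a signal.
function isProcessAlive(pid) {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return err.code === 'EPERM';
  }
}

// Returns true if this process now owns the PID file, false if another
// live process already does.
function acquirePidFile(pidFile) {
  try {
    const existing = Number.parseInt(fs.readFileSync(pidFile, 'utf8'), 10);
    if (
      Number.isInteger(existing) &&
      existing > 0 &&
      existing !== process.pid &&
      isProcessAlive(existing)
    ) {
      return false;
    }
  } catch (err) {
    if (err.code !== 'ENOENT') throw err; // a missing file is fine
  }
  // Stale, unreadable-as-a-number, or absent: take ownership.
  fs.writeFileSync(pidFile, `${process.pid}\n`);
  return true;
}
```

A start command would call acquirePidFile with the path from task 2.2 and exit early when it returns false.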

Phase 3: Channel Resolution ⏱️ 1h

Parallelizable: no | Dependencies: Phase 2

3.1: Implement channel resolver — given a session key like agent:main:mattermost:channel:abc123, extract the Mattermost channel ID → function with unit test
3.2: Handle thread sessions — agent:main:mattermost:channel:abc123:thread:def456 → channel=abc123, rootPost=def456
3.3: Fallback for non-Mattermost sessions (hook sessions, cron sessions) — use configured default channel → configurable in openclaw.json or env var
3.4: Sub-agent channel resolution — inherit parent session's channel + use parent status box as rootPostId → sub-agent status appears under parent
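
Tasks 3.1 to 3.3 can be illustrated with a small pure function. This sketch assumes session keys are colon-separated exactly as in the two examples above; resolveChannel and its return shape are hypothetical. It returns null for keys without a Mattermost channel segment, which leaves the default-channel fallback to the caller.

```javascript
// Sketch: extract channel and optional thread root from a session key.
function resolveChannel(sessionKey) {
  const parts = sessionKey.split(':');
  const mm = parts.indexOf('mattermost');
  if (mm === -1 || parts[mm + 1] !== 'channel' || !parts[mm + 2]) {
    return null; // hook, cron, or other non-Mattermost session
  }
  const result = { channelId: parts[mm + 2], rootPostId: null };
  if (parts[mm + 3] === 'thread' && parts[mm + 4]) {
    result.rootPostId = parts[mm + 4];
  }
  return result;
}
```

For example, resolveChannel('agent:main:mattermost:channel:abc123:thread:def456') yields channelId 'abc123' and rootPostId 'def456'.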

Phase 4: Hook Integration ⏱️ 1h

Parallelizable: no | Dependencies: Phase 2, Phase 3

4.1: Create hooks/status-watcher-hook/HOOK.md with events: ["gateway:startup"] → discovered by OpenClaw hooks system
4.2: Create hooks/status-watcher-hook/handler.js (plain JS) — on gateway:startup, spawn watcher-manager.js start as background child_process → watcher manager auto-starts with gateway. Note: OpenClaw hooks system discovers handler.ts first, then handler.js — both are supported natively via dynamic import. Plain .js is confirmed to work.
4.3: Add hooks/status-watcher-hook/ to workspace hooks dir (/home/node/.openclaw/workspace/hooks/) via deploy-to-agents.sh → hook auto-discovered
4.4: Test: restart gateway → watcher-manager starts → verify PID file exists

Phase 5: Polish + Cleanup ⏱️ 1h

Parallelizable: no | Dependencies: Phase 4

5.1: Rewrite skill/SKILL.md — remove manual protocol; say "live status is automatic, no action needed" → 10-line skill file
5.2: Rewrite deploy-to-agents.sh — remove AGENTS.md injection; install hook into workspace hooks dir; restart gateway → one-command deploy
5.3: Update install.sh — npm install, deploy hook, optionally restart gateway
5.4: Update src/live-status.js — add start-watcher and stop-watcher commands for manual control; mark create/update/complete as deprecated but keep working
5.5: Handle session compaction — detect if JSONL file gets smaller (compaction rewrites) → reset byte offset and re-read from start
5.6: Write README.md — full v4 documentation with architecture diagram, install steps, config reference
5.7: Run make check to verify lint/format passes → clean CI

Phase 6: Remove v1 Injection from AGENTS.md ⏱️ 30min

Parallelizable: no | Dependencies: Phase 5 (after watcher confirmed working)

6.1: Remove "📡 Live Status Protocol (MANDATORY)" section from main agent's AGENTS.md
6.2: Remove from all other agent AGENTS.md files (coder-agent, xen, global-calendar, etc.)
6.3: Before starting 6.1 and 6.2, confirm the watcher is running (safety check) → watcher PID file exists

8. Testing Strategy

What	Type	How	Success Criteria
Mattermost client	Unit	Direct API call with test channel	Post created and updated
Status formatter	Unit	Input JSONL events → verify output strings	Correct labels, timestamps
Channel resolver	Unit	Test session key strings → verify channel/thread extracted	All formats parsed
JSONL parser	Unit	Sample events from real transcripts	All types handled
Session monitor	Integration	Write to sessions.json, verify events emitted	New session detected in <2s
Status watcher	Integration	Append to JSONL file, verify Mattermost post updates	Update within 1s of new line
Sub-agent nesting	Integration	Spawn real sub-agent, verify nested status	Sub-agent visible in parent box
Idle timeout	Integration	Stop writing to JSONL, verify complete after 30s	Status box marked done
Compaction	Integration	Truncate JSONL file, verify watcher recovers	No duplicate events, no crash
E2E	Manual smoke test	Real agent task in Mattermost, verify status box	Real-time updates visible

9. Risks & Mitigations

Risk	Impact	Mitigation
fs.watch unreliable on Linux	High	Fall back to polling (setInterval 2s). fs.watch as optimization
Sessions.json write race condition	Medium	Use atomic read (retry on parse error), debounce diff
Mattermost rate limit (10 req/s)	Medium	Debounce updates to 500ms; queue + batch; exponential backoff on 429
Session compaction truncates JSONL	Medium	Compare file size on each poll; if smaller, reset offset
Multiple gateway restarts create duplicate watchers	Medium	PID file check + kill old process before spawning new
Sub-agent session key not stable across restarts	Low	Use sessionId (UUID) as key, not session key string
Watcher dies silently	Low	Cron health check or gateway boot-md restart
Non-Mattermost sessions (xen, hook) get status boxes	Low	Channel resolver returns null for non-MM sessions; skip gracefully
JSONL format change in future OpenClaw version	Medium	Abstract parser behind interface; version check on session record
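
The "exponential backoff on 429" mitigation in the table can be expressed as a simple delay schedule. The sketch below is illustrative only: retryDelayMs, the 500ms base and the 30s cap are assumed values, not settings taken from the plan or from Mattermost.

```javascript
// Sketch: delay before retry number `attempt` (0-based) after a 429 response.
// Doubles each time and is capped so a long outage does not stall forever.
function retryDelayMs(attempt, baseMs = 500, capMs = 30000) {
  return Math.min(baseMs * 2 ** attempt, capMs);
}
```

With the defaults this gives 500ms, 1s, 2s, 4s and so on, levelling off at 30s. Adding random jitter would be a reasonable refinement if several watchers retry at once.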

10. Effort Estimate

Phase	Time	Can Parallelize?	Depends On
Phase 0: Repo Setup	10min	No	—
Phase 1: Core Watcher	2-3h	No	Phase 0
Phase 2: Session Monitor	1-2h	No	Phase 1
Phase 3: Channel Resolution	1h	No	Phase 2
Phase 4: Hook Integration	1h	No	Phase 2+3
Phase 5: Polish + Cleanup	1h	No	Phase 4
Phase 6: Remove v1 Injection	30min	No	Phase 5 (verified)
Total	7-9h (rounded; phases sum to 6h40m-8h40m)

11. Open Questions

Q1: Idle timeout threshold. 30s is aggressive: exec commands can run for minutes. Should we use a smarter heuristic, e.g. distinguish stopReason: "toolUse" (agent is waiting for a tool) from stopReason: "stop" (agent is done)? Default if unanswered: Treat stopReason: "stop" in the most recent assistant message as the idle signal, combined with 10s of no new lines. If stopReason is "toolUse", reset the idle timer on every toolResult line. This avoids false completions during long tool runs.
Q2: Default channel for non-MM sessions. Hook-triggered sessions (agent:main:hook...) don't have a Mattermost channel. Should we (a) skip them, (b) post to a default monitoring channel, or (c) allow config per-session-type? Default if unanswered: (a) Skip non-MM sessions. Hook and cron sessions are largely invisible today and not causing user pain. The priority is Mattermost interactive sessions. Non-MM support can be Phase 7.
Q3: Status box per-session or per-request? A single agent session may handle multiple sequential requests. Should each new user message create a new status box, or does one session = one status box? Default if unanswered: One status box per user message (per-request). Each incoming user message starts a new status cycle. When agent sends final response (stopReason=stop + no tool calls), mark current box complete. On next user message, create a new box. This matches expected UX: one progress indicator per task.
Q4: Compaction behavior. When OpenClaw compacts a transcript (rewrites the JSONL), does it preserve the original file or create a new one? Default if unanswered: Assume in-place truncation (most likely based on compactionCount field in sessions.json). Detect by checking if fileSize < bytesRead on each poll. If truncated, reset bytesRead to 0 and re-read from start (with deduplication via message IDs to avoid re-posting old events).
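
The Q1 default (stopReason "stop" plus 10s of silence) can be sketched as a small tracker. createIdleTracker is a hypothetical name, and the location of stopReason on the record (directly or under a message field) is an assumption. Timestamps are passed in by the caller so the logic stays deterministic.

```javascript
// Sketch: decide when a session can be marked complete.
function createIdleTracker(quietMs = 10000) {
  let lastStopReason = null;
  let lastLineAt = null;

  return {
    // Call for every parsed JSONL record; `now` is a millisecond timestamp.
    onLine(record, now) {
      lastLineAt = now; // any new line (including toolResult) restarts the quiet period
      const message = record.message ?? record;
      if (message.role === 'assistant' && message.stopReason) {
        lastStopReason = message.stopReason;
      }
    },
    // Complete only if the agent's last word was "stop" and nothing has
    // been written for at least quietMs.
    isComplete(now) {
      return (
        lastStopReason === 'stop' &&
        lastLineAt !== null &&
        now - lastLineAt >= quietMs
      );
    },
  };
}
```

With this shape a long-running exec never triggers completion, because the most recent assistant message still carries stopReason "toolUse" until the agent produces its final reply.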
