feat: Phase 0+1 — repo sync, pino, lint fixes, core components

Phase 0:
- Synced latest live-status.js from workspace (9928 bytes)
- Fixed 43 lint issues: empty catch blocks, console statements
- Added pino dependency
- Created src/tool-labels.json with all known tool mappings
- make check passes

Phase 1 (Core Components):
- src/config.js: env-var config with validation, throws on missing required vars
- src/logger.js: pino singleton with child loggers, level validation
- src/circuit-breaker.js: CLOSED/OPEN/HALF_OPEN state machine with callbacks
- src/tool-labels.js: exact/prefix/regex tool->label resolver with external override
- src/status-box.js: Mattermost post manager (keepAlive, throttle, retry, circuit breaker)
- src/status-formatter.js: pure SessionState->text formatter (nested, compact)
- src/health.js: HTTP health endpoint + metrics
- src/status-watcher.js: JSONL file watcher (inotify, compaction detection, idle detection)

Tests:
- test/unit/config.test.js: 7 tests
- test/unit/circuit-breaker.test.js: 12 tests
- test/unit/logger.test.js: 5 tests
- test/unit/status-formatter.test.js: 20 tests
- test/unit/tool-labels.test.js: 15 tests

All 59 unit tests pass. make check clean.
140
PLAN.md
@@ -1,4 +1,5 @@
# Implementation Plan: Live Status v4 (Production-Grade)

> Generated: 2026-03-07 | Agent: planner:proj035-v2 | Status: DRAFT
> Revised: Incorporates production-grade changes from scalability/efficiency review (comment #11402)

@@ -65,16 +66,16 @@ OpenClaw Gateway

## 3. Tech Stack

| Layer              | Technology         | Version       | Reason                                                  |
| ------------------ | ------------------ | ------------- | ------------------------------------------------------- |
| Runtime            | Node.js            | 22.x (system) | Already installed; inotify recursive fs.watch supported |
| File watching      | fs.watch recursive | built-in      | inotify on Linux/Node22; efficient, no polling          |
| Session discovery  | setInterval poll   | built-in      | sessions.json polling for new session detection         |
| HTTP client        | http.Agent         | built-in      | keepAlive, maxSockets; no extra dependency              |
| Structured logging | pino               | ^9.x          | Fast JSON logging; single new dependency                |
| Config             | process.env        | built-in      | 12-factor; validated at startup                         |
| Health check       | http.createServer  | built-in      | Lightweight health endpoint                             |
| Process management | PID file + signals | built-in      | Simple, no supervisor dependency                        |

**New npm dependencies:** `pino` only. Everything else uses Node.js built-ins.

@@ -124,17 +125,18 @@ MATTERMOST_OPENCLAW_LIVESTATUS/

## 5. Dependencies

| Package                       | Version  | Purpose                 | New/Existing      |
| ----------------------------- | -------- | ----------------------- | ----------------- |
| pino                          | ^9.x     | Structured JSON logging | NEW               |
| node.js                       | 22.x     | Runtime                 | Existing (system) |
| http, fs, path, child_process | built-in | All other functionality | Existing          |

One new npm dependency only. Minimal footprint.

## 6. Data Model

### sessions.json entry (relevant fields)

```json
{
  "agent:main:subagent:uuid": {
@@ -150,6 +152,7 @@ One new npm dependency only. Minimal footprint.
```

### JSONL event schema

```
type=session -> id (UUID), version (3), cwd — first line only
type=message -> role=user|assistant|toolResult; content[]=text|toolCall|toolResult|thinking
@@ -158,6 +161,7 @@ type=model_change -> provider, modelId
```

### SessionState (in-memory per active session)

```json
{
  "sessionKey": "agent:main:subagent:uuid",
@@ -177,6 +181,7 @@ type=model_change -> provider, modelId
```

### Configuration (env vars)

```
MM_TOKEN  (required)  Mattermost bot token
MM_URL    (required)  Mattermost base URL
@@ -198,6 +203,7 @@ DEFAULT_CHANNEL null Fallback channel for non-MM sessions (null = sk
```
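
To illustrate the "validated at startup, fail fast" behaviour the plan describes for these variables, here is a minimal sketch. It is not the project's `src/config.js`: the function name `loadConfig` and the returned field names are hypothetical, and treating `IDLE_TIMEOUT_S=60` and port 9090 (for `HEALTH_PORT`) as defaults is an assumption based on values mentioned elsewhere in this plan.

```javascript
// Hypothetical sketch of env-var validation (not the actual src/config.js).
// MM_TOKEN and MM_URL are the two variables the plan marks as required.
function loadConfig(env) {
  const missing = ['MM_TOKEN', 'MM_URL'].filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required env vars: ${missing.join(', ')}`);
  }

  // Parse an optional integer variable, falling back to a default.
  const intVar = (name, fallback) => {
    if (env[name] === undefined || env[name] === '') return fallback;
    const value = Number.parseInt(env[name], 10);
    if (Number.isNaN(value)) {
      throw new Error(`Env var ${name} must be an integer, got "${env[name]}"`);
    }
    return value;
  };

  return Object.freeze({
    mmToken: env.MM_TOKEN,
    mmUrl: env.MM_URL.replace(/\/+$/, ''), // strip trailing slashes
    defaultChannel: env.DEFAULT_CHANNEL || null, // null = skip non-MM sessions
    idleTimeoutS: intVar('IDLE_TIMEOUT_S', 60), // assumed default
    healthPort: intVar('HEALTH_PORT', 9090), // assumed default
  });
}
```

In a daemon this would typically be called once as `loadConfig(process.env)`, so a missing token stops the process before any watcher is started.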

### Status box format (rendered Mattermost text)

```
[ACTIVE] main | 38s
Reading live-status source code...
@@ -216,7 +222,9 @@ Plan ready. Awaiting approval.
## 7. Task Checklist

### Phase 0: Repo Sync + Environment Verification ⏱️ 30min

> Parallelizable: no | Dependencies: none

- [ ] 0.1: Sync workspace live-status.js (283-line v2) to remote repo — git push → remote matches workspace copy
- [ ] 0.2: Fix existing lint errors in live-status.js (43 issues: empty catch blocks, console statements) — replace empty catches with error logging, add eslint-disable comments for intentional console.log → make lint passes
- [ ] 0.3: Run `make check` — verify all Makefile targets pass (lint/fmt/test/secret-scan) → clean run, zero failures
@@ -225,6 +233,7 @@ Plan ready. Awaiting approval.
- [ ] 0.6: Document exact transcript directory path and sessions.json path from the running gateway → constants confirmed for config.js (transcript dir: /home/node/.openclaw/agents/{agent}/sessions/, sessions.json: same path)

### Phase 1: Core Components ⏱️ 8-12h

> Parallelizable: partially (config/logger/circuit-breaker are independent) | Dependencies: Phase 0

- [ ] 1.1: Create `src/config.js` — reads all env vars with validation; throws clear error on missing required vars; exports typed config object → unit testable, fails fast
@@ -240,7 +249,7 @@ Plan ready. Awaiting approval.
  - Circuit breaker wrapping all API calls
  - Retry with exponential backoff on 429/5xx (up to 3 retries)
  - Structured logs for every API call
    → unit tested with mock HTTP server
- [ ] 1.6: Create `src/status-formatter.js` — pure function; input: SessionState; output: formatted Mattermost text string (compact, MAX_STATUS_LINES, sub-agent nesting, status prefix, timestamps) → unit tested with varied inputs
- [ ] 1.7: Create `src/health.js` — HTTP server on HEALTH_PORT; GET /health returns JSON {status, activeSessions, uptime, lastError, metrics: {updates_sent, updates_failed, circuit_state, queue_depth}} → manually tested with curl
- [ ] 1.8: Create `src/status-watcher.js` — core JSONL watcher:
@@ -255,10 +264,11 @@ Plan ready. Awaiting approval.
  - Detect file truncation (stat.size < bytesRead) -> reset offset, log warning
  - Debounce updates via status-box.js throttle
  - Idle detection: when pendingToolCalls==0 and no new lines for IDLE_TIMEOUT_S
    → integration tested with real JSONL sample files
- [ ] 1.9: Unit test suite (`test/unit/`) — parser, tool-labels, circuit-breaker, throttle, status-formatter → `npm test` green
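
The "retry with exponential backoff on 429/5xx (up to 3 retries)" item above could look roughly like the following. This is a sketch under stated assumptions, not the code in `src/status-box.js`: the helper names, the 250 ms base delay, and the injectable `sleep` option are all hypothetical, and `request` is assumed to be any async function that resolves to an object with a numeric `status`.

```javascript
// Hypothetical helpers; names and the 250 ms base delay are assumptions.

// Only rate limiting (429) and server errors (5xx) are worth retrying.
function isRetryable(status) {
  return status === 429 || (status >= 500 && status <= 599);
}

// Delay doubles with each attempt: 250, 500, 1000 ms for attempts 0, 1, 2.
function backoffDelayMs(attempt, baseMs = 250) {
  return baseMs * 2 ** attempt;
}

// Calls request() once, then retries up to maxRetries times on 429/5xx.
// The last response is returned even if it is still an error status.
async function withRetry(request, { maxRetries = 3, baseMs = 250, sleep } = {}) {
  const wait = sleep || ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 0; ; attempt += 1) {
    const response = await request();
    if (!isRetryable(response.status) || attempt >= maxRetries) {
      return response;
    }
    await wait(backoffDelayMs(attempt, baseMs));
  }
}
```

Passing `sleep` explicitly keeps unit tests fast, since a test can record the requested delays instead of actually waiting.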

### Phase 2: Session Monitor + Lifecycle ⏱️ 4-6h

> Parallelizable: no | Dependencies: Phase 1

- [ ] 2.1: Create `src/session-monitor.js` — polls sessions.json every 2s:
@@ -268,7 +278,7 @@ Plan ready. Awaiting approval.
  - Resolves channelId from session key (format: `agent:main:mattermost:channel:{id}:...`)
  - Resolves rootPostId from session key (format: `...thread:{id}`)
  - Falls back to DEFAULT_CHANNEL for non-MM sessions (or null to skip)
    → integration tested with mock sessions.json writes
- [ ] 2.2: Persist session offsets to disk — on each status update, write { sessionKey: bytesRead } to `/tmp/status-watcher-offsets.json`; on startup, load and restore existing sessions → restart recovery working
- [ ] 2.3: Post recovery on restart — on startup, for each restored session, search channel history for status post with marker comment `<!-- sw:{sessionKey} -->`; if found, resume updating it; if not, create new post → tested by killing and restarting daemon mid-session
- [ ] 2.4: Create `src/watcher-manager.js` — top-level orchestrator:
@@ -280,10 +290,11 @@ Plan ready. Awaiting approval.
  - Registers SIGTERM/SIGINT handlers:
    - On signal: mark all active status boxes "interrupted", flush all pending updates, remove PID file, exit 0
  - CLI: `node watcher-manager.js start|stop|status` → process management
    → smoke tested end-to-end
- [ ] 2.5: Integration test suite (`test/integration/`) — lifecycle events, restart recovery → `npm run test:integration` green
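
The channel and thread resolution described in 2.1 can be sketched as a small pure function. The plan only quotes two fragments of the key format (`agent:main:mattermost:channel:{id}:...` and `...thread:{id}`), so the regular expressions below, the function name `resolveTarget`, and the example IDs are assumptions for illustration.

```javascript
// Hypothetical resolver; the full session-key layout is an assumption beyond
// the two fragments quoted in the plan.
function resolveTarget(sessionKey, defaultChannel = null) {
  // Channel ID: the segment following "mattermost:channel:".
  const channelMatch = /(?:^|:)mattermost:channel:([^:]+)/.exec(sessionKey);
  // Root post ID: the final segment, when the key ends in "thread:{id}".
  const threadMatch = /(?:^|:)thread:([^:]+)$/.exec(sessionKey);

  const channelId = channelMatch ? channelMatch[1] : defaultChannel;
  if (channelId === null) {
    return null; // non-MM session and no DEFAULT_CHANNEL: skip it
  }
  return { channelId, rootPostId: threadMatch ? threadMatch[1] : null };
}
```

For example, `resolveTarget('agent:main:subagent:uuid')` returns `null`, which corresponds to the plan's "null to skip" behaviour for non-Mattermost sessions.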

### Phase 3: Sub-Agent Support ⏱️ 3-4h

> Parallelizable: no | Dependencies: Phase 2

- [ ] 3.1: Sub-agent detection — session-monitor detects entries with `spawnedBy` field; links child SessionState to parent via `parentSessionKey` → linked correctly
@@ -293,6 +304,7 @@ Plan ready. Awaiting approval.
- [ ] 3.5: Integration test — spawn mock sub-agent transcript, verify parent status box shows nested child progress → manual verification in Mattermost

### Phase 4: Hook Integration ⏱️ 1h

> Parallelizable: no | Dependencies: Phase 2 (watcher-manager CLI working)

- [ ] 4.1: Create `hooks/status-watcher-hook/HOOK.md` — events: ["gateway:startup"], description, required env vars listed → OpenClaw discovers hook
@@ -301,6 +313,7 @@ Plan ready. Awaiting approval.
- [ ] 4.4: Test: gateway restart -> watcher starts, PID file written, health endpoint responds → verified

### Phase 5: Polish + Deployment ⏱️ 3-4h

> Parallelizable: yes (docs, deploy scripts, skill rewrite are independent) | Dependencies: Phase 4

- [ ] 5.1: Rewrite `skill/SKILL.md` — 10-line file: "Live status updates are automatic. You do not need to call live-status manually. Focus on your task." → no protocol injection
@@ -314,6 +327,7 @@ Plan ready. Awaiting approval.
- [ ] 5.9: Run `make check` → zero lint/format errors; `npm test` → green

### Phase 6: Remove v1 Injection from AGENTS.md ⏱️ 30min

> Parallelizable: no | Dependencies: Phase 5 fully verified + watcher confirmed running
> SAFETY: Do not execute this phase until watcher has been running successfully for at least 1 hour

@@ -324,67 +338,67 @@ Plan ready. Awaiting approval.

## 8. Testing Strategy

| What                | Type        | How                                                 | Success Criteria                                              |
| ------------------- | ----------- | --------------------------------------------------- | ------------------------------------------------------------- |
| config.js           | Unit        | Env var injection, missing var detection            | Throws on missing required vars; correct defaults             |
| logger.js           | Unit        | Log output format                                   | JSON output, levels respected                                 |
| circuit-breaker.js  | Unit        | Simulate N failures, verify state transitions       | open after threshold, half-open after cooldown                |
| tool-labels.js      | Unit        | 30+ tool names (exact, prefix, regex, unmapped)     | Correct labels returned; default for unknown                  |
| status-formatter.js | Unit        | Various SessionState inputs                         | Correct compact output; MAX_LINES enforced                    |
| status-box.js       | Unit        | Mock HTTP server                                    | create/update called correctly; throttle works; circuit fires |
| session-monitor.js  | Integration | Write test sessions.json; verify events emitted     | session-added/removed within 2s                               |
| status-watcher.js   | Integration | Append to JSONL file; verify Mattermost update      | Update within 1.5s of new line                                |
| Idle detection      | Integration | Stop writing; verify complete after IDLE_TIMEOUT+5s | Status box marked done                                        |
| Session compaction  | Integration | Truncate JSONL file mid-session                     | No crash; offset reset; no duplicate events                   |
| Restart recovery    | Integration | Kill daemon mid-session; restart                    | Existing post updated, not new post created                   |
| Sub-agent nesting   | Integration | Mock parent + child transcripts                     | Child visible in parent status box                            |
| Cascade completion  | Integration | Child completes; verify parent waits                | Parent marks done after last child                            |
| Health endpoint     | Manual      | curl localhost:9090/health                          | JSON with correct metrics                                     |
| E2E smoke test      | Manual      | Real agent task in Mattermost                       | Real-time updates; no spam; done on completion                |
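
The circuit-breaker row above expects the breaker to open after a failure threshold and become half-open after a cooldown. A minimal sketch of such a CLOSED/OPEN/HALF_OPEN state machine follows. It is not the contents of `src/circuit-breaker.js`: the class shape, the option names (`threshold`, `cooldownMs`, `now`) and their default values are assumptions, and the state-change callbacks mentioned in the commit message are left out for brevity.

```javascript
// Hypothetical sketch; option names and defaults are assumptions.
class CircuitBreaker {
  constructor({ threshold = 5, cooldownMs = 30000, now = Date.now } = {}) {
    this.threshold = threshold; // consecutive failures before opening
    this.cooldownMs = cooldownMs; // time to stay OPEN before probing
    this.now = now; // injectable clock, useful for tests
    this.state = 'CLOSED';
    this.failures = 0;
    this.openedAt = 0;
  }

  // Returns true if a call may proceed; moves OPEN -> HALF_OPEN after cooldown.
  canRequest() {
    if (this.state === 'OPEN' && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = 'HALF_OPEN';
    }
    return this.state !== 'OPEN';
  }

  // Any success closes the circuit and clears the failure count.
  recordSuccess() {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  // A failed probe in HALF_OPEN, or reaching the threshold, opens the circuit.
  recordFailure() {
    this.failures += 1;
    if (this.state === 'HALF_OPEN' || this.failures >= this.threshold) {
      this.state = 'OPEN';
      this.openedAt = this.now();
    }
  }
}
```

Injecting the clock through `now` lets a unit test "simulate N failures, verify state transitions" without waiting for a real cooldown.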

## 9. Risks & Mitigations

| Risk                                                | Impact | Mitigation                                                                                           |
| --------------------------------------------------- | ------ | ---------------------------------------------------------------------------------------------------- |
| fs.watch recursive not reliable on this kernel      | High   | Detect at startup; fall back to polling if watch fails (setInterval 2s on directory listing)         |
| sessions.json write race causes parse error         | Medium | Try/catch on JSON.parse; retry next poll cycle; log warning                                          |
| Mattermost rate limit (10 req/s default)            | Medium | Throttle to max 2 req/s per session; circuit breaker; exponential backoff on 429                     |
| Session compaction truncates JSONL                  | Medium | Detect stat.size < bytesRead on each read; reset offset; dedup by tracking last processed line index |
| Multiple gateway restarts create duplicate watchers | Medium | PID file check + SIGTERM old process before spawning new one                                         |
| Non-MM sessions (hook, cron) generate noise         | Low    | Channel resolver returns null; watcher skips session gracefully                                      |
| pino dependency unavailable                         | Low    | If npm install fails, fallback to console.log (degrade gracefully, log warning)                      |
| Status box exceeds Mattermost post size limit       | Low    | Hard truncate at MAX_MESSAGE_CHARS (15000); tested with message size guard                           |
| JSONL format changes in future OpenClaw             | Medium | Abstract parser behind EventParser interface; version check on session record                        |
| Daemon crashes mid-session                          | Medium | Health check via systemd/Docker; restart policy; offset persistence enables recovery                 |
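
The compaction mitigation above (detect `stat.size < bytesRead`, then reset the offset) can be sketched with Node's synchronous `fs` calls. The function name `readNewLines` and its return shape are hypothetical, and the line-index dedup mentioned in the table is not shown; the sketch only covers offset tracking and truncation detection.

```javascript
const fs = require('node:fs');

// Hypothetical helper: read complete JSONL lines appended since `bytesRead`.
// If the file is now smaller than the stored offset, it was truncated
// (for example by session compaction), so reading restarts from byte 0.
function readNewLines(filePath, bytesRead) {
  const { size } = fs.statSync(filePath);
  const truncated = size < bytesRead;
  const start = truncated ? 0 : bytesRead;
  const length = size - start;
  if (length === 0) return { lines: [], bytesRead: start, truncated };

  const buffer = Buffer.alloc(length);
  const fd = fs.openSync(filePath, 'r');
  try {
    fs.readSync(fd, buffer, 0, length, start);
  } finally {
    fs.closeSync(fd);
  }

  // Consume only complete lines; a partial trailing line waits for the next read.
  const lastNewline = buffer.lastIndexOf(0x0a);
  if (lastNewline === -1) return { lines: [], bytesRead: start, truncated };
  const text = buffer.toString('utf8', 0, lastNewline);
  return {
    lines: text.split('\n').filter(Boolean),
    bytesRead: start + lastNewline + 1,
    truncated,
  };
}
```

The caller would store the returned `bytesRead` per session and log a warning whenever `truncated` is true.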

## 10. Effort Estimate

| Phase                                  | Time       | Can Parallelize?                          | Depends On       |
| -------------------------------------- | ---------- | ----------------------------------------- | ---------------- |
| Phase 0: Repo + Env Verification       | 30min      | No                                        | —                |
| Phase 1: Core Components               | 8-12h      | Partially (config/logger/circuit-breaker) | Phase 0          |
| Phase 2: Session Monitor + Lifecycle   | 4-6h       | No                                        | Phase 1          |
| Phase 3: Sub-Agent Support             | 3-4h       | No                                        | Phase 2          |
| Phase 4: Hook Integration              | 1h         | No                                        | Phase 2+3        |
| Phase 5: Polish + Deployment           | 3-4h       | Yes (docs, deploy, skill)                 | Phase 4          |
| Phase 6: Remove v1 AGENTS.md Injection | 30min      | No                                        | Phase 5 verified |
| **Total**                              | **20-28h** |                                           |                  |

## 11. Open Questions

All questions have defaults that allow execution to proceed without answers.

- [ ] **Q1 (informational): Idle timeout tuning.** 60s default may still cause premature completion for very long exec calls (e.g., a 3-minute build). Smart heuristic (pendingToolCalls tracking) should handle this correctly, but production data may reveal edge cases.
  **Default:** Use smart heuristic (pendingToolCalls + IDLE_TIMEOUT_S=60). Log false-positives for tuning.

- [ ] **Q2 (informational): Non-MM session behavior.** Hook sessions, cron sessions, and xen sessions don't have a Mattermost channel. Currently skipped.
  **Default:** Skip non-MM sessions (no status box). Log at debug level. Can revisit for Phase 7.

- [ ] **Q3 (informational): Status box per-request vs per-session.** Currently: one status box per user message (reset on new user turn). This is the most natural UX.
  **Default:** Per-request. New user message starts new status cycle. Works correctly with smart idle detection.

- [ ] **Q4 (informational): Compaction dedup strategy.** When JSONL is truncated, we reset offset and re-read. We may re-process events already posted to Mattermost.
  **Default:** Track last processed line count (not just byte offset). Skip lines already processed on re-read. OR: detect compaction and do not re-append old events (since they were already shown). Simplest: mark box as "session compacted - continuing" and reset the visible lines in the status box.

- [ ] **Q5 (blocking if no): AGENTS.md modification scope.** Phase 6 removes Live Status Protocol section from all agent AGENTS.md files. Confirm Rooh wants all instances removed (not just main agent).
  **Default if not answered:** Remove from all agents. This is the stated goal — removing v1 injection everywhere.
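
The idle heuristic from Q1 (no pending tool calls and no new lines for `IDLE_TIMEOUT_S`) reduces to a small predicate. The sketch below is an assumption about how it might be expressed: `pendingToolCalls` comes from the plan, while the function name `isIdle` and the `lastLineAtMs` field are hypothetical.

```javascript
// Hypothetical predicate; `lastLineAtMs` is an assumed SessionState field
// holding the timestamp (ms) of the most recent JSONL line.
function isIdle(state, nowMs, idleTimeoutS = 60) {
  if (state.pendingToolCalls > 0) {
    return false; // a tool call is still running, e.g. a long build
  }
  return nowMs - state.lastLineAtMs >= idleTimeoutS * 1000;
}
```

Checking `pendingToolCalls` first is what keeps a 3-minute build from being marked complete after 60 seconds of silence.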