Bug Report & Fix Log — 2026-03-09 Session #8

Closed
opened 2026-03-09 20:01:12 +01:00 by sol · 2 comments
Collaborator

Overview

This issue documents all bugs diagnosed and fixed during the 2026-03-09 session. All fixes are infrastructure-level (Node.js daemon + Go plugin). No behavioral/prompt changes.


Bug 1: Token Burn — 13 Sessions Heartbeating Simultaneously

Symptom: Anthropic returning overloaded_error. High token usage. Agent sessions staying alive for hours.

Root cause: Multiple agent sessions each triggering 30-minute heartbeats with full context. 13 concurrent sessions × full context = massive parallel LLM load.

Fix: Set heartbeat.lightContext: true in gateway config. Reduces input tokens per heartbeat by ~80%. Real work turns unaffected. heartbeat.every kept at 30m per user preference.


Bug 2: Status Box Stayed Active 60s After Turn Complete

Symptom: Status box showed "Agent session active" long after reply was sent.

Root cause: Daemon used a 60-second file inactivity timer to detect turn completion. No infrastructure signal was used.

Fix 1 (commit b7c5124): Detect openclaw.cache-ttl custom record in JSONL stream. Gateway writes this after every completed assistant turn. On detection with pendingToolCalls === 0, fire 3-second grace timer instead of waiting 60s.

Fix 2 (commit cdef7a1): Watch .jsonl.lock file deletion. Gateway deletes the lock file the instant the final reply is sent. On deletion → call triggerIdle() immediately. Zero wait, no timer. This is now the primary done signal.


Bug 3: Status Box Not Appearing on New Messages (Reactivation Failure)

Symptom: After a session completed, sending a new message did not bring up a new status box.

Root cause (layer 1): _completedSessions cooldown — 5-minute file-staleness check prevented re-detection. Changed to use sessions.json updatedAt > completedAt timestamp comparison (infrastructure signal, no AI involvement).

Root cause (layer 2, commit f545cb0): Ghost watch — after removeSession(), the fileToSession mapping was deleted so fs.watch fired but had nothing to match. Fixed: keep mapping as \x00ghost: prefix entry. On ghost file change → emit session-reactivateclearCompleted + pollNow.

Root cause (layer 3, commit b0cb6db): _knownSessions was replaced wholesale (this._knownSessions = currentSessions) at end of every _poll(). This meant forgetSession() had zero effect — next poll immediately re-added the key as "known" without firing _onSessionAdded. Fixed: maintain _knownSessions incrementally (.set() / .delete()).

Root cause (layer 4, commit cc65f9b): clearCompleted() removed from _completedSessions but left key in _knownSessions. So isKnown=true suppressed _onSessionAdded even after ghost watch fired. Fixed: clearCompleted() now deletes from both maps.


Bug 4: Stale File Offset — No Content Updates Sent to Plugin

Symptom: Status box created but never updated with any content. Lines stayed at 0.

Root cause: Persisted offset file (/tmp/status-watcher-offsets.json) had lastOffset: 1,718,497 but JSONL file was only 1,715,183 bytes (context compaction had shrunk the file). Every _readFile call returned 0 bytes silently because fileSize <= lastOffset.

Fix (commit b0cb6db): When fileSize < lastOffset (stale or post-compaction), reset lastOffset to fileSize (current end of file). Previously reset to 0, which would re-parse gigabytes of old content. Resetting to fileSize means only new bytes from this point forward are read.


Bug 5: Status Box Not Triggered Until First Reply Token

Symptom: Status box appeared only when the first JSONL content was written, not when user message was received. Turns with no tool calls showed nothing.

Root cause: Daemon watched only .jsonl file changes. First write happens after the model starts generating, not when message is received.

Fix (commit 0bdfaaa): Watch .jsonl.lock file creation. Gateway writes this the instant it starts processing a user message — before any JSONL content. On lock creation → clearCompleted + pollNow → status box created immediately. Latency: <100ms from message receipt.


Bug 6: Orphan KV Sessions With Zero Timestamps

Symptom: Multiple "Agent session active" posts in channels with no corresponding running session. Plugin health showed active sessions that were not real.

Root cause: CleanStaleSessions() in store.go had a continue statement for sessions where LastUpdateMs == 0 && StartTimeMs == 0 (sessions created before cleanup was implemented). These were skipped forever.

Fix: Rebuilt Go plugin. Non-active sessions with zero timestamps → deleted immediately. Active sessions with zero timestamps → interrupted immediately, then deleted on next cleanup cycle. Binary rebuilt via Docker (golang:1.22-alpine) and deployed.


Bug 7: Reactivated Session Reusing Stale Plugin Post

Symptom: After reactivation, no new status box appeared. Old completed post was silently reused.

Root cause: session-added handler called findExistingPost() (REST search) even in plugin mode. It found the old completed/done plugin post via Mattermost search API and reused the post ID, so pluginClient.createSession() was never called.

Fix (commit cc65f9b): Skip findExistingPost entirely when in plugin mode. Plugin always creates a fresh custom_livestatus post via pluginClient.createSession(). Plugin KV entry is replaced atomically so no duplicate boxes.


Summary Table

# Symptom Root Cause Fix Commit
1 Token burn / overloaded_error 13 sessions heartbeating with full context lightContext: true in gateway config config change
2a Status box stays 60s after done 60s inactivity timer Detect openclaw.cache-ttl → 3s grace b7c5124
2b Status box stays after done No infrastructure done signal Watch .jsonl.lock deletion → triggerIdle() cdef7a1
3a No new box on reactivation 5min staleness cooldown Use sessions.json updatedAt comparison b320bcf
3b No new box on reactivation Ghost watch lost mapping after removeSession Keep \x00ghost: entry in fileToSession f545cb0
3c No new box on reactivation _knownSessions replaced wholesale each poll Incremental map updates b0cb6db
3d No new box on reactivation clearCompleted left key in _knownSessions Delete from both maps in clearCompleted cc65f9b
4 Status box shows no content Stale offset > file size after compaction Reset offset to fileSize not 0 b0cb6db
5 Box appears late (after first token) Watched JSONL only, not lock file Watch .jsonl.lock creation 0bdfaaa
6 Orphan KV sessions Zero-timestamp sessions skipped in cleanup Delete/interrupt immediately in Go Go rebuild
7 Reactivation reuses old post findExistingPost ran in plugin mode Skip REST search in plugin mode cc65f9b
## Overview This issue documents all bugs diagnosed and fixed during the 2026-03-09 session. All fixes are infrastructure-level (Node.js daemon + Go plugin). No behavioral/prompt changes. --- ## Bug 1: Token Burn — 13 Sessions Heartbeating Simultaneously **Symptom:** Anthropic returning `overloaded_error`. High token usage. Agent sessions staying alive for hours. **Root cause:** Multiple agent sessions each triggering 30-minute heartbeats with full context. 13 concurrent sessions × full context = massive parallel LLM load. **Fix:** Set `heartbeat.lightContext: true` in gateway config. Reduces input tokens per heartbeat by ~80%. Real work turns unaffected. `heartbeat.every` kept at `30m` per user preference. --- ## Bug 2: Status Box Stayed Active 60s After Turn Complete **Symptom:** Status box showed "Agent session active" long after reply was sent. **Root cause:** Daemon used a 60-second file inactivity timer to detect turn completion. No infrastructure signal was used. **Fix 1 (commit `b7c5124`):** Detect `openclaw.cache-ttl` custom record in JSONL stream. Gateway writes this after every completed assistant turn. On detection with `pendingToolCalls === 0`, fire 3-second grace timer instead of waiting 60s. **Fix 2 (commit `cdef7a1`):** Watch `.jsonl.lock` file deletion. Gateway deletes the lock file the instant the final reply is sent. On deletion → call `triggerIdle()` immediately. Zero wait, no timer. This is now the primary done signal. --- ## Bug 3: Status Box Not Appearing on New Messages (Reactivation Failure) **Symptom:** After a session completed, sending a new message did not bring up a new status box. **Root cause (layer 1):** `_completedSessions` cooldown — 5-minute file-staleness check prevented re-detection. Changed to use `sessions.json updatedAt > completedAt` timestamp comparison (infrastructure signal, no AI involvement). **Root cause (layer 2, commit `f545cb0`):** Ghost watch — after `removeSession()`, the `fileToSession` mapping was deleted so `fs.watch` fired but had nothing to match. Fixed: keep mapping as `\x00ghost:` prefix entry. On ghost file change → emit `session-reactivate` → `clearCompleted` + `pollNow`. **Root cause (layer 3, commit `b0cb6db`):** `_knownSessions` was replaced wholesale (`this._knownSessions = currentSessions`) at end of every `_poll()`. This meant `forgetSession()` had zero effect — next poll immediately re-added the key as "known" without firing `_onSessionAdded`. Fixed: maintain `_knownSessions` incrementally (`.set()` / `.delete()`). **Root cause (layer 4, commit `cc65f9b`):** `clearCompleted()` removed from `_completedSessions` but left key in `_knownSessions`. So `isKnown=true` suppressed `_onSessionAdded` even after ghost watch fired. Fixed: `clearCompleted()` now deletes from both maps. --- ## Bug 4: Stale File Offset — No Content Updates Sent to Plugin **Symptom:** Status box created but never updated with any content. Lines stayed at 0. **Root cause:** Persisted offset file (`/tmp/status-watcher-offsets.json`) had `lastOffset: 1,718,497` but JSONL file was only `1,715,183` bytes (context compaction had shrunk the file). Every `_readFile` call returned 0 bytes silently because `fileSize <= lastOffset`. **Fix (commit `b0cb6db`):** When `fileSize < lastOffset` (stale or post-compaction), reset `lastOffset` to `fileSize` (current end of file). Previously reset to 0, which would re-parse gigabytes of old content. Resetting to `fileSize` means only new bytes from this point forward are read. --- ## Bug 5: Status Box Not Triggered Until First Reply Token **Symptom:** Status box appeared only when the first JSONL content was written, not when user message was received. Turns with no tool calls showed nothing. **Root cause:** Daemon watched only `.jsonl` file changes. First write happens after the model starts generating, not when message is received. **Fix (commit `0bdfaaa`):** Watch `.jsonl.lock` file creation. Gateway writes this the instant it starts processing a user message — before any JSONL content. On lock creation → `clearCompleted` + `pollNow` → status box created immediately. Latency: <100ms from message receipt. --- ## Bug 6: Orphan KV Sessions With Zero Timestamps **Symptom:** Multiple "Agent session active" posts in channels with no corresponding running session. Plugin health showed active sessions that were not real. **Root cause:** `CleanStaleSessions()` in `store.go` had a `continue` statement for sessions where `LastUpdateMs == 0 && StartTimeMs == 0` (sessions created before cleanup was implemented). These were skipped forever. **Fix:** Rebuilt Go plugin. Non-active sessions with zero timestamps → deleted immediately. Active sessions with zero timestamps → interrupted immediately, then deleted on next cleanup cycle. Binary rebuilt via Docker (`golang:1.22-alpine`) and deployed. --- ## Bug 7: Reactivated Session Reusing Stale Plugin Post **Symptom:** After reactivation, no new status box appeared. Old completed post was silently reused. **Root cause:** `session-added` handler called `findExistingPost()` (REST search) even in plugin mode. It found the old completed/done plugin post via Mattermost search API and reused the post ID, so `pluginClient.createSession()` was never called. **Fix (commit `cc65f9b`):** Skip `findExistingPost` entirely when in plugin mode. Plugin always creates a fresh `custom_livestatus` post via `pluginClient.createSession()`. Plugin KV entry is replaced atomically so no duplicate boxes. --- ## Summary Table | # | Symptom | Root Cause | Fix | Commit | |---|---------|-----------|-----|--------| | 1 | Token burn / overloaded_error | 13 sessions heartbeating with full context | `lightContext: true` in gateway config | config change | | 2a | Status box stays 60s after done | 60s inactivity timer | Detect `openclaw.cache-ttl` → 3s grace | `b7c5124` | | 2b | Status box stays after done | No infrastructure done signal | Watch `.jsonl.lock` deletion → `triggerIdle()` | `cdef7a1` | | 3a | No new box on reactivation | 5min staleness cooldown | Use `sessions.json updatedAt` comparison | `b320bcf` | | 3b | No new box on reactivation | Ghost watch lost mapping after removeSession | Keep `\x00ghost:` entry in `fileToSession` | `f545cb0` | | 3c | No new box on reactivation | `_knownSessions` replaced wholesale each poll | Incremental map updates | `b0cb6db` | | 3d | No new box on reactivation | `clearCompleted` left key in `_knownSessions` | Delete from both maps in `clearCompleted` | `cc65f9b` | | 4 | Status box shows no content | Stale offset > file size after compaction | Reset offset to `fileSize` not 0 | `b0cb6db` | | 5 | Box appears late (after first token) | Watched JSONL only, not lock file | Watch `.jsonl.lock` creation | `0bdfaaa` | | 6 | Orphan KV sessions | Zero-timestamp sessions skipped in cleanup | Delete/interrupt immediately in Go | Go rebuild | | 7 | Reactivation reuses old post | `findExistingPost` ran in plugin mode | Skip REST search in plugin mode | `cc65f9b` | <!-- openclaw-agent -->
Author
Collaborator

Triaged. This is a documentation/tracking issue logging 7 infrastructure bugs diagnosed and fixed during the 2026-03-09 session.

All fixes are already applied (commits referenced inline). No action items remain open. Marking for archival/reference.

Summary:

  • Bugs 1-7: All root causes identified, fixes committed, Go plugin rebuilt and deployed
  • No behavioral or prompt changes
  • No open follow-ups noted

This issue can be closed once the team confirms nothing was missed.

Triaged. This is a documentation/tracking issue logging 7 infrastructure bugs diagnosed and fixed during the 2026-03-09 session. All fixes are already applied (commits referenced inline). No action items remain open. Marking for archival/reference. Summary: - Bugs 1-7: All root causes identified, fixes committed, Go plugin rebuilt and deployed - No behavioral or prompt changes - No open follow-ups noted This issue can be closed once the team confirms nothing was missed. <!-- openclaw-agent -->
sol closed this issue 2026-03-09 20:52:26 +01:00
Author
Collaborator

Acknowledged. Fix log received and recorded for issue #8 (2026-03-09 session). 7 bugs documented with commits b7c5124, cdef7a1, b320bcf, f545cb0, b0cb6db, cc65f9b, 0bdfaaa plus Go plugin rebuild. Closing as complete.

Acknowledged. Fix log received and recorded for issue #8 (2026-03-09 session). 7 bugs documented with commits b7c5124, cdef7a1, b320bcf, f545cb0, b0cb6db, cc65f9b, 0bdfaaa plus Go plugin rebuild. Closing as complete. <!-- openclaw-agent -->
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: ROOH/MATTERMOST_OPENCLAW_LIVESTATUS#8