Live Status: fix false reactivation thrash while preserving fresh bottom-of-chat boxes and correct Mattermost routing #11

Open
opened 2026-03-12 08:13:43 +01:00 by sol · 0 comments

Problem

Live Status is close, but it still has a correctness bug in session lifecycle handling that makes the user-visible behavior unstable in Mattermost.

The current failure mode is neither primarily a process-startup problem nor a full daemon crash. The live-status daemon process often remains alive, but the UI behavior becomes incorrect because completed sessions are being reactivated by stale transcript file events.

This issue needs to be handled as a routing + lifecycle/state-machine correctness issue, not as a small debounce bug.

What we want

We want Live Status to satisfy all of these constraints at the same time:

  1. Correct Mattermost routing

    • A live-status box must always map to the correct Mattermost context.
    • Thread sessions must post into the correct thread root.
    • Channel sessions must stay in the correct channel.
    • DM sessions must resolve to the correct DM channel.
    • Live-status must never spill across unrelated channels/threads/DMs.
  2. Fresh box at the bottom of chat on real reactivation

    • We do not want a single permanent post reused forever.
    • When a real new turn happens, the old buried status box should be deleted and a fresh one should be created at the bottom so the current status is visible where the user is looking.
    • This is intentional product behavior, not a bug.
  3. No false reactivation after completion

    • After a turn completes, trailing transcript writes / ghost fs.watch events must not resurrect the session and create a new bottom box.
    • A fresh bottom box should appear only for a real new activation, not stale file noise.
  4. Single startup owner

    • Live Status should have one authoritative startup/lifecycle owner.
    • The current preferred direction is the live-status-daemon plugin service, not multiple startup mechanisms competing or drifting.
  5. Reliable startup and runtime behavior

    • After gateway restart, live status should come back automatically.
    • If the watcher dies, it should be restarted by the one owning lifecycle manager.
    • But runtime session correctness must still hold.

Symptoms observed

Observed behavior from the user side:

  • Live status works for a few minutes, then appears to stop / die / behave incorrectly.
  • Status boxes get deleted and recreated repeatedly.
  • Behavior looks like the feature “crashed”, but the daemon process may still be running.

Observed behavior from logs:

  • The daemon process remained alive.
  • Session completion happened successfully.
  • Immediately afterward, a completed session was reactivated by later transcript file changes.
  • Old box was deleted and a new box was created repeatedly.

Representative log pattern:

  • Lock file deleted — turn complete, marking session done immediately
  • Session complete via plugin
  • fs.watch: file change on completed session — triggering reactivation
  • Ghost watch triggered reactivation
  • Deleted old buried status box on reactivation
  • Created status box via plugin

This means the process is often alive but logically wrong.
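
The pattern above can be checked mechanically when triaging logs. The following is a minimal sketch, not code from the repo: the function name `countFalseReactivations` is hypothetical, and it assumes that the message fragments quoted above appear verbatim in the log lines and that the lines passed in belong to a single session.

```javascript
// Hypothetical triage helper (not part of the repo).
// Counts completions that are followed by a ghost-watch reactivation.
// Assumption: the quoted message fragments appear verbatim in each line,
// and all lines belong to one session.
function countFalseReactivations(lines) {
  let completedSeen = false;
  let count = 0;
  for (const line of lines) {
    if (line.includes('Session complete via plugin')) {
      completedSeen = true;
    } else if (completedSeen && line.includes('Ghost watch triggered reactivation')) {
      count += 1;
      completedSeen = false; // wait for the next completion before counting again
    }
  }
  return count;
}
```

A non-zero count for a session with no new user turn in the same window would be consistent with the thrash described here.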

Important product requirement clarified during diagnosis

A fresh status box at the bottom of the thread/channel is required.

That means:

  • Deleting old buried boxes and creating a fresh one on a real new turn is correct.
  • The bug is not “it creates a new box.”
  • The bug is that it creates a new box on false reactivation from ghost/stale transcript events.

Any proposed fix that preserves a single box forever is the wrong UX for this project.

Current code/design understanding

From the repo, the current design already includes routing-aware logic:

  • session-monitor.js

    • parses channelId
    • parses rootPostId from :thread:<rootPostId>
    • resolves DM channels for :direct:<userId> sessions via Mattermost API
  • watcher-manager.js

    • keys active/completed boxes by full sessionKey
    • creates boxes using channelId and rootPostId
    • suppresses heartbeat/internal sessions
    • has displacement/suppression logic so thread sessions can take priority over bare parent sessions
    • intentionally deletes old buried completed box and creates a fresh one on reactivation
  • status-watcher.js

    • watches transcript changes
    • watches lock-file lifecycle
    • currently appears able to reactivate completed sessions from ghost transcript file change events

So the current system already understands routing and the fresh-bottom-box UX goal, but the session lifecycle/reactivation boundary is still not strong enough.
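
To make the routing invariant concrete, here is a minimal sketch of deriving a routing target from a session key. Only the `:thread:<rootPostId>` and `:direct:<userId>` markers come from this issue. The `:channel:<channelId>` marker, the function name `parseRoute`, and the returned object shape are assumptions for illustration and may not match what `session-monitor.js` actually does.

```javascript
// Hypothetical sketch (not the repo's implementation).
// Known from this issue: ':thread:<rootPostId>' and ':direct:<userId>'.
// Assumed for illustration: ':channel:<channelId>'.
function parseRoute(sessionKey) {
  const thread = sessionKey.match(/:thread:([^:]+)/);
  const direct = sessionKey.match(/:direct:([^:]+)/);
  const channel = sessionKey.match(/:channel:([^:]+)/);

  if (direct) {
    // The DM channel id is not in the key; it has to be resolved separately
    // (the issue says this happens via the Mattermost API).
    return { kind: 'dm', userId: direct[1], channelId: null, rootPostId: null };
  }
  if (channel && thread) {
    return { kind: 'thread', channelId: channel[1], rootPostId: thread[1] };
  }
  if (channel) {
    return { kind: 'channel', channelId: channel[1], rootPostId: null };
  }
  return null; // unknown key shape: refuse to guess a destination
}
```

Returning `null` for an unrecognized key, rather than falling back to some default channel, is one way to honor the "never spill across unrelated channels/threads/DMs" requirement.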

Root problem as currently understood

The likely root problem is:

session lifecycle truth is split too loosely between lock-file events and transcript file changes, allowing completed sessions to be resurrected by stale/ghost file activity.

In practice:

  • lock removal marks session complete
  • completed box is moved into completed state
  • stale fs.watch events still arrive
  • those events are treated as reactivation-worthy
  • a fresh bottom box is created even though there was no real new user turn

That creates complete → reactivate → recreate thrash.

Proposed solution direction found during diagnosis

Core principle

Keep the fresh-bottom-box UX, but tighten what qualifies as a real reactivation.

Proposed architecture direction

  1. Use a single lifecycle owner for the process

    • live-status-daemon plugin service should be the one authoritative startup/runtime owner.
    • Legacy startup hook/script paths should not be competing lifecycle owners.
  2. Use an explicit session state machine

    • Suggested states:
      • inactive
      • active
      • completing
      • completed_guarded
    • Transition out of completed_guarded should require a real activation signal.
  3. Preserve fresh-bottom-box behavior only for real reactivation

    • On a real new turn:
      • delete old buried completed box
      • create fresh box at bottom
    • This behavior should remain.
  4. Tighten reactivation signals

    • A completed session should be allowed to reactivate only on an authoritative new-turn signal, for example:
      • lock file creation
      • or another explicit session-start/new-turn marker if one exists in OpenClaw session lifecycle
    • Plain transcript writes after completion must not resurrect a session.
  5. Downgrade transcript file changes to content updates, not lifecycle authority

    • While a session is active: transcript writes can update the existing box.
    • After completion: transcript writes alone should never create a new box.
  6. Keep routing correctness as a hard invariant

    • Any reactivation must preserve exact sessionKey → channel/thread/DM mapping.
    • Fresh box creation must occur in the same correct Mattermost context.
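
The proposed direction above can be sketched as a small transition table. The state names are the ones suggested in this issue. The event names (`lock_created`, `lock_deleted`, `transcript_write`, `flush_done`), the action names, and the exact transitions are assumptions for discussion, not the current behavior of `status-watcher.js` or `watcher-manager.js`.

```javascript
// Sketch of the proposed session state machine.
// State names come from this issue; event and action names are assumptions.
const TRANSITIONS = {
  inactive:          { lock_created: 'active' },
  active:            { transcript_write: 'active', lock_deleted: 'completing' },
  completing:        { flush_done: 'completed_guarded' },
  completed_guarded: { lock_created: 'active' }, // only an authoritative new-turn signal
};

// Returns { state, action }. Events with no transition leave the state unchanged.
function step(state, event) {
  const table = TRANSITIONS[state] || {};
  const next = Object.prototype.hasOwnProperty.call(table, event) ? table[event] : undefined;
  if (!next) {
    return { state, action: 'ignore' }; // e.g. transcript noise after completion
  }
  if (state === 'completed_guarded' && next === 'active') {
    return { state: next, action: 'replace_box' }; // delete buried box, create fresh one
  }
  if (state === 'inactive' && next === 'active') {
    return { state: next, action: 'create_box' };
  }
  if (event === 'transcript_write') {
    return { state: next, action: 'update_box' }; // content update only
  }
  return { state: next, action: 'none' };
}
```

In this sketch, `step('completed_guarded', 'transcript_write')` yields `action: 'ignore'`, while `step('completed_guarded', 'lock_created')` yields `action: 'replace_box'`. It deliberately leaves open what should happen if a new turn starts while a session is still `completing`; that case needs an explicit decision in the plan.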

What the manager should do

Please review this issue as a manager/research task, not as an implementation task yet.

Requested output

Provide an implementation plan that:

  1. Reviews the current diagnosis and proposed solution above.
  2. Challenges it with your own repo reading and external research.
  3. Compares alternative solutions, including whether there is a better authoritative lifecycle signal than the current lock/transcript combination.
  4. Recommends the best final architecture for:
    • startup ownership
    • runtime supervision
    • session lifecycle source of truth
    • reactivation rules
    • thread/channel/DM routing correctness
    • preserving the fresh-bottom-box UX
  5. Lists exact code areas likely to change.
  6. Lists risks/regressions to test.
  7. Produces a plan suitable for explicit human approval before any implementation.

Specific questions for the manager to answer

  1. Is the best source of truth really the lock file, or does OpenClaw expose a better authoritative lifecycle signal?
  2. Should fs.watch ever be allowed to trigger reactivation on a completed session?
  3. Is the current completed cooldown / ghost-watch design fundamentally flawed and worth replacing?
  4. Are thread/channel/DM ownership rules complete, or are there remaining edge cases where status boxes can still leak into the wrong Mattermost context?
  5. What is the cleanest way to preserve “fresh box at bottom” without complete/reactivate thrash?

Non-goal for this issue

Do not implement yet.

This issue is for:

  • documenting the problem precisely
  • aligning on requirements
  • getting a better manager-reviewed solution
  • waiting for approval before coding

Acceptance criteria for the future implementation

The eventual implementation should satisfy all of these:

  1. After gateway restart, live status starts automatically.
  2. Only one lifecycle owner is responsible for starting/stopping/restarting the watcher.
  3. Live-status boxes always appear in the correct Mattermost context (thread/channel/DM).
  4. On a real new turn, the old buried box is replaced with a fresh box at the bottom.
  5. On trailing transcript noise after completion, no false new box is created.
  6. No complete → reactivate → recreate thrash.
  7. Logs clearly distinguish:
    • real activation
    • real completion
    • ignored post-completion file noise
    • real reactivation
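
One possible way to meet criterion 7 is to give each of the four outcomes its own stable event name. The sketch below is hypothetical: the event names, the `formatLifecycleLog` helper, and the JSON line format are assumptions, not the existing logging in the repo.

```javascript
// Hypothetical log helper (not part of the repo): one stable event name per
// lifecycle outcome, so the four cases can be filtered apart in the logs.
const LIFECYCLE_EVENTS = Object.freeze({
  ACTIVATION: 'session.activation',
  COMPLETION: 'session.completion',
  NOISE_IGNORED: 'session.post_completion_noise_ignored',
  REACTIVATION: 'session.reactivation',
});

// Builds one JSON log line; rejects event names outside the fixed set.
function formatLifecycleLog(event, sessionKey, detail = {}) {
  if (!Object.values(LIFECYCLE_EVENTS).includes(event)) {
    throw new Error(`unknown lifecycle event: ${event}`);
  }
  return JSON.stringify({ event, sessionKey, ...detail });
}
```

For example, `formatLifecycleLog(LIFECYCLE_EVENTS.NOISE_IGNORED, key, { source: 'fs.watch' })` would record an ignored post-completion write without it being mistaken for a reactivation.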

Repo areas likely involved

Based on current reading, likely areas include:

  • src/watcher-manager.js
  • src/status-watcher.js
  • src/session-monitor.js
  • plugin service lifecycle integration

Why this should be planned, not patched ad hoc

This bug touches:

  • session state model
  • Mattermost routing correctness
  • startup ownership
  • UX requirements
  • daemon/runtime behavior

So a one-off patch is likely to create regressions unless the lifecycle model is made explicit first.

Reference: ROOH/MATTERMOST_OPENCLAW_LIVESTATUS#11