Spec review: design issues and edge cases in README #5

Open
opened 2026-02-19 22:42:53 +01:00 by clawbot · 0 comments
Collaborator

After reviewing the README spec thoroughly, here are potential design issues and oversights:

1. Watcher startup context bug

In watcher.go, the OnStart hook passes startCtx to context.WithCancel(), but startCtx is the fx startup context which expires after startup completes. The child context derived from it will be cancelled immediately. Should use context.Background() instead and cancel via the stored cancel func.

2. Per-nameserver state creates unbounded growth

Hostname records are stored per-nameserver. If NS records change over time (NS decommissioned, new NS added), old NS entries in state are never cleaned up. State file grows indefinitely. Need a pruning mechanism that removes NS entries no longer in the delegation.
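One possible pruning pass, run after each delegation refresh. The state layout here (NS → hostname → records) is an assumption for illustration:

```go
package main

import "fmt"

// pruneStaleNS removes per-nameserver entries whose NS is no longer in the
// current delegation, bounding state growth as NSes come and go.
func pruneStaleNS(state map[string]map[string][]string, currentNS []string) {
	keep := make(map[string]bool, len(currentNS))
	for _, ns := range currentNS {
		keep[ns] = true
	}
	for ns := range state {
		if !keep[ns] {
			delete(state, ns)
		}
	}
}

func main() {
	state := map[string]map[string][]string{
		"ns1.example.net.": {"www.example.com.": {"192.0.2.1"}},
		"old.example.net.": {"www.example.com.": {"192.0.2.9"}},
	}
	pruneStaleNS(state, []string{"ns1.example.net."})
	fmt.Println(len(state)) // prints 1: old.example.net. was pruned
}
```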

3. Port check IP↔hostname association is lossy

Port state keys are ip:port with a single hostname field. If multiple hostnames resolve to the same IP (shared hosting, CDN), only one hostname is associated. When that hostname removes the IP from DNS, the port state is orphaned but the IP is still valid for other hostnames. Need a set of associated hostnames, not a single one.
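A sketch of the set-based association, with hypothetical names. The state is only orphaned once the last hostname stops resolving to the IP:

```go
package main

import "fmt"

// PortState tracks an ip:port together with the set of hostnames that
// currently resolve to it, instead of a single hostname field.
type PortState struct {
	Open      bool
	Hostnames map[string]struct{}
}

// Associate records that a hostname currently resolves to this ip:port.
func (p *PortState) Associate(hostname string) {
	if p.Hostnames == nil {
		p.Hostnames = make(map[string]struct{})
	}
	p.Hostnames[hostname] = struct{}{}
}

// Dissociate removes one hostname and reports whether the state is now
// orphaned (no hostname points here anymore) and safe to prune.
func (p *PortState) Dissociate(hostname string) (orphaned bool) {
	delete(p.Hostnames, hostname)
	return len(p.Hostnames) == 0
}

func main() {
	ps := &PortState{Open: true}
	ps.Associate("a.example.com")
	ps.Associate("b.example.com")
	// a.example.com drops the IP from DNS, but b still points here:
	fmt.Println(ps.Dissociate("a.example.com")) // prints false: keep state
	fmt.Println(ps.Dissociate("b.example.com")) // prints true: prune now
}
```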

4. TLS cert key collision

Cert state key is ip:port:hostname. If the same IP serves different certs for different SNI hostnames (SNI routing), this works. But if two configured hostnames share the same IP AND the same SNI target (e.g. CNAME to CDN), we get duplicate checks. Minor inefficiency, not a bug.
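The inefficiency is cheap to remove: because the key is a comparable tuple, duplicate checks can be collapsed before dialing. A sketch with illustrative names:

```go
package main

import "fmt"

// certCheck identifies one TLS check; a comparable struct key makes
// deduplication trivial.
type certCheck struct {
	ip   string
	port int
	sni  string
}

// dedupeCertChecks collapses checks sharing ip, port, and SNI target, so
// two hostnames CNAMEd to the same CDN endpoint cost one handshake.
func dedupeCertChecks(checks []certCheck) []certCheck {
	seen := make(map[certCheck]bool, len(checks))
	out := make([]certCheck, 0, len(checks))
	for _, c := range checks {
		if !seen[c] {
			seen[c] = true
			out = append(out, c)
		}
	}
	return out
}

func main() {
	checks := []certCheck{
		{"203.0.113.1", 443, "cdn.example.net"},
		{"203.0.113.1", 443, "cdn.example.net"}, // duplicate via second CNAME
		{"203.0.113.1", 443, "other.example.net"},
	}
	fmt.Println(len(dedupeCertChecks(checks))) // prints 2
}
```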

5. No backoff on persistent NS failures

If a nameserver is consistently unreachable, the spec says to notify on every check (failure and recovery). With 1h intervals, a dead NS generates 24 notifications/day. Should consider exponential backoff or "remind every N hours" for persistent failures.

6. Race between DNS and port/TLS checks

Port and TLS checks use IPs from the most recent DNS check. If DNS changes mid-cycle, port/TLS checks may target stale IPs. The spec doesn't define ordering — should DNS complete before port/TLS starts? Recommend: DNS first, then port/TLS using freshly resolved IPs.

7. No DNSSEC validation

The spec doesn't mention DNSSEC. Iterative resolution without DNSSEC validation is vulnerable to off-path response spoofing and on-path tampering. Not critical for a monitoring tool (it observes results rather than relying on them for security decisions), but worth noting.

8. State file format migration

State has a version field but no migration logic. If the format changes, old state files will fail to parse and monitoring restarts from scratch (triggering false-positive notifications for everything).

9. CNAME chain + per-NS storage interaction

If hostname A CNAMEs to hostname B, and different NSes return different CNAME targets, the per-NS record storage correctly captures this. But the IP resolution for port checks needs to follow ALL CNAME chains across ALL NSes to build the complete IP set. The spec doesn't clarify this.
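If the spec adopts the union interpretation, the merge is straightforward once each NS's chain has been followed. A sketch (the per-NS result type is an assumption):

```go
package main

import (
	"fmt"
	"sort"
)

// unionIPs merges the A/AAAA results from every nameserver (each NS's
// CNAME chain already followed) into one deduplicated, sorted IP set for
// the port and TLS checks.
func unionIPs(perNS map[string][]string) []string {
	seen := make(map[string]bool)
	out := make([]string, 0)
	for _, ips := range perNS {
		for _, ip := range ips {
			if !seen[ip] {
				seen[ip] = true
				out = append(out, ip)
			}
		}
	}
	sort.Strings(out) // deterministic order for stable state diffs
	return out
}

func main() {
	perNS := map[string][]string{
		"ns1.example.net.": {"192.0.2.1", "192.0.2.2"},
		"ns2.example.net.": {"192.0.2.2", "192.0.2.3"}, // different CNAME target
	}
	fmt.Println(unionIPs(perNS)) // prints [192.0.2.1 192.0.2.2 192.0.2.3]
}
```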

10. Notification delivery guarantees

Notifications are fire-and-forget (goroutines, no retry). If Slack/ntfy is temporarily down, changes are silently lost. Consider: retry with backoff, or at minimum log notification failures prominently.

sneak was assigned by clawbot 2026-02-19 22:42:53 +01:00
clawbot added the merge-ready label 2026-02-20 09:28:56 +01:00
Reference: sneak/dnswatcher#5