Spec review: design issues and edge cases in README #5
After reviewing the README spec thoroughly, here are potential design issues and oversights:
1. Watcher startup context bug
In `watcher.go`, the `OnStart` hook passes `startCtx` to `context.WithCancel()`, but `startCtx` is the fx startup context, which expires after startup completes. The child context derived from it will be cancelled immediately. Should use `context.Background()` instead and cancel via the stored `cancel` func.
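A minimal sketch of the suggested fix, assuming the watcher is started from an fx lifecycle hook as described (the `Watcher` type, its fields, and `RegisterHooks` are illustrative names, not the real ones):

```go
package watcher

import (
	"context"

	"go.uber.org/fx"
)

// Watcher is an illustrative stand-in for the real watcher type.
type Watcher struct {
	cancel context.CancelFunc
}

func (w *Watcher) run(ctx context.Context) {
	// Main check loop; returns when ctx is cancelled.
	<-ctx.Done()
}

// RegisterHooks wires the watcher into the fx lifecycle. The run context is
// derived from context.Background(), NOT from the OnStart context, which fx
// cancels once startup completes.
func RegisterHooks(lc fx.Lifecycle, w *Watcher) {
	lc.Append(fx.Hook{
		OnStart: func(startCtx context.Context) error {
			// startCtx is only valid for the duration of startup; use it for
			// startup work (e.g. loading state), never as the run context.
			runCtx, cancel := context.WithCancel(context.Background())
			w.cancel = cancel
			go w.run(runCtx)
			return nil
		},
		OnStop: func(context.Context) error {
			w.cancel()
			return nil
		},
	})
}
```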
2. Per-nameserver state creates unbounded growth
Hostname records are stored per-nameserver. If NS records change over time (NS decommissioned, new NS added), old NS entries in state are never cleaned up and the state file grows indefinitely. Need a pruning mechanism that removes NS entries no longer in the delegation.
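One possible pruning step, sketched with a hypothetical state layout (`HostState` and `RecordsByNS` are made-up names), that could run after each delegation check:

```go
package state

// HostState is a hypothetical slice of the on-disk state for one hostname,
// with records keyed by nameserver host (e.g. "ns1.example.net.").
type HostState struct {
	RecordsByNS map[string][]string
}

// pruneStaleNS drops per-nameserver entries for nameservers that are no
// longer part of the hostname's current delegation, and reports which ones
// were removed so the caller can notify about them instead of letting the
// state file grow forever.
func pruneStaleNS(st *HostState, currentNS []string) (removed []string) {
	live := make(map[string]bool, len(currentNS))
	for _, ns := range currentNS {
		live[ns] = true
	}
	for ns := range st.RecordsByNS {
		if !live[ns] {
			delete(st.RecordsByNS, ns)
			removed = append(removed, ns)
		}
	}
	return removed
}
```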
3. Port check IP↔hostname association is lossy
Port state keys are `ip:port` with a single `hostname` field. If multiple hostnames resolve to the same IP (shared hosting, CDN), only one hostname is associated. When that hostname removes the IP from DNS, the port state is orphaned but the IP is still valid for other hostnames. Need a set of associated hostnames, not a single one.
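A sketch of what the port-state entry could look like with a hostname set instead of a single field (type and field names are illustrative):

```go
package state

// PortState tracks one ip:port target along with every configured hostname
// that currently resolves to that IP.
type PortState struct {
	IP        string
	Port      int
	Open      bool
	Hostnames map[string]struct{} // set of associated hostnames
}

// Associate records that hostname currently resolves to this IP.
func (p *PortState) Associate(hostname string) {
	if p.Hostnames == nil {
		p.Hostnames = make(map[string]struct{})
	}
	p.Hostnames[hostname] = struct{}{}
}

// Dissociate removes a hostname that no longer resolves to this IP.
func (p *PortState) Dissociate(hostname string) {
	delete(p.Hostnames, hostname)
}

// Orphaned reports whether no configured hostname still points at this IP,
// i.e. the entry can safely be garbage-collected.
func (p *PortState) Orphaned() bool {
	return len(p.Hostnames) == 0
}
```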
4. TLS cert key collision
Cert state key is `ip:port:hostname`. If the same IP serves different certs for different SNI hostnames (SNI routing), this works. But if two configured hostnames share the same IP AND the same SNI target (e.g. CNAME to CDN), we get duplicate checks. Minor inefficiency, not a bug.
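If the duplicate checks ever become a concern, deduplicating probe targets on the full `ip:port:sni` key before dialing is cheap; a sketch with illustrative names:

```go
package checks

import "fmt"

// CertTarget identifies one TLS probe: connect to IP:Port with SNI set to
// Hostname.
type CertTarget struct {
	IP       string
	Port     int
	Hostname string // used as SNI
}

// dedupeCertTargets drops probes that share the same (IP, port, SNI)
// combination, e.g. two configured hostnames that CNAME to the same CDN.
func dedupeCertTargets(targets []CertTarget) []CertTarget {
	seen := make(map[string]bool, len(targets))
	out := targets[:0]
	for _, t := range targets {
		key := fmt.Sprintf("%s:%d:%s", t.IP, t.Port, t.Hostname)
		if seen[key] {
			continue
		}
		seen[key] = true
		out = append(out, t)
	}
	return out
}
```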
5. No backoff on persistent NS failures
If a nameserver is consistently unreachable, the spec says to notify on every check (failure and recovery). With 1h intervals, a dead NS generates 24 notifications/day. Should consider exponential backoff or "remind every N hours" for persistent failures.
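One way to express the "remind every N hours" idea, using hypothetical per-NS failure state:

```go
package notify

import "time"

// NSFailure tracks a nameserver's persistent failure state. Field names are
// illustrative; the real state layout may differ.
type NSFailure struct {
	ConsecutiveFails int
	LastNotified     time.Time
}

// shouldNotify suppresses repeat notifications for an NS that stays down:
// always notify on the first failure and on recovery, then only remind once
// per remindEvery (e.g. 24h) while the failure persists.
func shouldNotify(f *NSFailure, failed bool, now time.Time, remindEvery time.Duration) bool {
	if !failed {
		// Recovery: notify only if a failure had previously been reported.
		notify := f.ConsecutiveFails > 0
		f.ConsecutiveFails = 0
		return notify
	}
	f.ConsecutiveFails++
	if f.ConsecutiveFails == 1 || now.Sub(f.LastNotified) >= remindEvery {
		f.LastNotified = now
		return true
	}
	return false
}
```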
6. Race between DNS and port/TLS checks
Port and TLS checks use IPs from the most recent DNS check. If DNS changes mid-cycle, port/TLS checks may target stale IPs. The spec doesn't define ordering — should DNS complete before port/TLS starts? Recommend: DNS first, then port/TLS using freshly resolved IPs.
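A sketch of the recommended ordering inside one check cycle; `resolveAllNS`, `checkPorts`, and `checkTLS` are placeholders for whatever the real functions are called:

```go
package watcher

import "context"

// Placeholders for the real resolver and checkers; illustrative only.
func resolveAllNS(ctx context.Context, hostname string) ([]string, error) { return nil, nil }
func checkPorts(ctx context.Context, hostname, ip string)                 {}
func checkTLS(ctx context.Context, hostname, ip string)                   {}

// runCycle resolves DNS first, then runs port and TLS checks against the
// freshly resolved IPs rather than against IPs stored by a previous cycle.
func runCycle(ctx context.Context, hostnames []string) error {
	for _, h := range hostnames {
		ips, err := resolveAllNS(ctx, h)
		if err != nil {
			return err
		}
		for _, ip := range ips {
			checkPorts(ctx, h, ip) // uses this cycle's IPs, never stale state
			checkTLS(ctx, h, ip)
		}
	}
	return nil
}
```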
7. No DNSSEC validation
The spec doesn't mention DNSSEC. Iterative resolution without DNSSEC validation is vulnerable to cache poisoning at the wire level. Not critical for a monitoring tool (it's observing, not relying on results for security decisions), but worth noting.
8. State file format migration
State has a `version` field but no migration logic. If the format changes, old state files will fail to parse and monitoring restarts from scratch (triggering false-positive notifications for everything).
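A sketch of version-gated migrations, assuming the state file is JSON with a top-level `version` field (the migration table and version numbers are illustrative):

```go
package state

import (
	"encoding/json"
	"fmt"
)

// currentVersion is illustrative; bump it whenever the on-disk format changes.
const currentVersion = 2

// migrations[v] upgrades a decoded state map from version v to v+1.
var migrations = map[int]func(map[string]any) error{
	1: func(st map[string]any) error {
		// Illustrative v1 -> v2 step; a real migration would reshape fields here.
		st["version"] = 2
		return nil
	},
}

// Load parses the state file and applies step-wise migrations instead of
// failing outright (and re-notifying for everything) when the version is old.
func Load(data []byte) (map[string]any, error) {
	var hdr struct {
		Version int `json:"version"`
	}
	if err := json.Unmarshal(data, &hdr); err != nil {
		return nil, fmt.Errorf("parse state: %w", err)
	}
	st := map[string]any{}
	if err := json.Unmarshal(data, &st); err != nil {
		return nil, fmt.Errorf("parse state: %w", err)
	}
	for v := hdr.Version; v < currentVersion; v++ {
		migrate, ok := migrations[v]
		if !ok {
			return nil, fmt.Errorf("no migration path from state version %d", v)
		}
		if err := migrate(st); err != nil {
			return nil, fmt.Errorf("migrate v%d: %w", v, err)
		}
	}
	return st, nil
}
```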
9. CNAME chain + per-NS storage interaction
If hostname A CNAMEs to hostname B, and different NSes return different CNAME targets, the per-NS record storage correctly captures this. But the IP resolution for port checks needs to follow ALL CNAME chains across ALL NSes to build the complete IP set. The spec doesn't clarify this.
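A sketch of building the union of IPs across all delegated nameservers before running port checks; `resolveViaNS` is a placeholder for the per-NS resolution (which would follow that NS's CNAME chain):

```go
package resolver

import "context"

// Placeholder for per-NS iterative resolution that follows the CNAME chain
// as returned by that nameserver; illustrative only.
func resolveViaNS(ctx context.Context, hostname, ns string) ([]string, error) { return nil, nil }

// allIPs builds the union of A/AAAA results across every delegated
// nameserver, so port checks cover every IP any NS (via its own CNAME chain)
// can hand out, not just the answer from one NS.
func allIPs(ctx context.Context, hostname string, nameservers []string) map[string]bool {
	ips := make(map[string]bool)
	for _, ns := range nameservers {
		addrs, err := resolveViaNS(ctx, hostname, ns)
		if err != nil {
			continue // an unreachable NS is reported by the NS check, not here
		}
		for _, a := range addrs {
			ips[a] = true
		}
	}
	return ips
}
```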
10. Notification delivery guarantees
Notifications are fire-and-forget (goroutines, no retry). If Slack/ntfy is temporarily down, changes are silently lost. Consider: retry with backoff, or at minimum log notification failures prominently.
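A sketch of retry with exponential backoff plus prominent logging on final failure; `send` stands in for the actual Slack/ntfy delivery call:

```go
package notify

import (
	"context"
	"log"
	"time"
)

// send stands in for the real Slack/ntfy delivery call.
func send(ctx context.Context, message string) error { return nil }

// deliver retries a notification with exponential backoff instead of firing
// it from a bare goroutine and forgetting about it; if every attempt fails,
// the loss is at least logged prominently.
func deliver(ctx context.Context, message string) {
	const maxAttempts = 5
	backoff := 5 * time.Second
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := send(ctx, message)
		if err == nil {
			return
		}
		log.Printf("notification attempt %d/%d failed: %v", attempt, maxAttempts, err)
		select {
		case <-ctx.Done():
			return
		case <-time.After(backoff):
			backoff *= 2
		}
	}
	log.Printf("NOTIFICATION DROPPED after %d attempts: %q", maxAttempts, message)
}
```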