# dnswatcher

> ⚠️ Pre-1.0 software. APIs, configuration, and behavior may change without notice.

dnswatcher is a production DNS and infrastructure monitoring daemon written in
Go. It watches configured DNS domains and hostnames for changes, monitors TCP
port availability, tracks TLS certificate expiry, and delivers real-time
notifications via Slack, Mattermost, and/or ntfy webhooks.

It performs all DNS resolution itself via iterative (non-recursive) queries,
tracing from root nameservers to authoritative servers directly, never relying
on upstream recursive resolvers.

State is persisted to a local JSON file so that monitoring survives restarts
without requiring an external database.

---
## Features
### DNS Domain Monitoring (Apex Domains)
- Accepts a list of DNS domain names (apex domains, identified via the
  [Public Suffix List](https://publicsuffix.org/)).
- Every **1 hour** (the default `DNSWATCHER_DNS_INTERVAL`), performs a full
  iterative trace from root servers to discover all authoritative
  nameservers (NS records) for each domain.
- Queries **every** discovered authoritative nameserver independently.
- Stores the NS record set as observed by the delegation chain.
- Any change triggers a notification:
  - NS added to or removed from the delegation.
  - NS IP address changed (glue record change).
### DNS Hostname Monitoring (Subdomains)
- Accepts a list of DNS hostnames (subdomains, distinguished from apex
  domains via the Public Suffix List).
- Every **1 hour** (the default `DNSWATCHER_DNS_INTERVAL`), performs a full
  iterative trace to discover the authoritative nameservers for the
  hostname's parent domain.
- Queries **each** authoritative nameserver independently for **all**
  monitored record types: A, AAAA, CNAME, MX, TXT, SRV, CAA, NS.
- Stores results **per nameserver**. The state for a hostname is not a
  merged view; it is a map from nameserver to record set.
- Any observable change in any nameserver's response triggers a
  notification. This includes:
  - **Record change**: A nameserver returns different records than it
    did on the previous check (additions, removals, value changes).
  - **NS query failure**: A nameserver that previously responded
    becomes unreachable (timeout, SERVFAIL, REFUSED, network error).
    This is distinct from "responded with no records."
  - **NS recovery**: A previously-unreachable nameserver starts
    responding again.
  - **Inconsistency detected**: Two nameservers that previously agreed
    now return different record sets for the same hostname.
  - **Inconsistency resolved**: Nameservers that previously disagreed
    are now back in agreement.
  - **Empty response**: A nameserver that previously returned records
    now returns an authoritative negative response (NODATA or NXDOMAIN).
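
The per-nameserver comparison described above can be sketched as follows. This is a minimal illustration, not dnswatcher's actual code: the `RecordSet` type and the `fingerprint`, `consistent`, and `changedNameservers` helpers are hypothetical names, and the sketch assumes that record order within a type is not significant.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// RecordSet maps a record type ("A", "AAAA", ...) to its values.
// Hypothetical type, used only for this sketch.
type RecordSet map[string][]string

// fingerprint returns a canonical string for a record set so that two sets
// with the same contents compare equal regardless of ordering.
func fingerprint(rs RecordSet) string {
	types := make([]string, 0, len(rs))
	for t := range rs {
		types = append(types, t)
	}
	sort.Strings(types)
	var b strings.Builder
	for _, t := range types {
		vals := append([]string(nil), rs[t]...) // copy before sorting
		sort.Strings(vals)
		b.WriteString(t + "=" + strings.Join(vals, ",") + ";")
	}
	return b.String()
}

// consistent reports whether every nameserver returned the same records.
func consistent(byNS map[string]RecordSet) bool {
	first := true
	want := ""
	for _, rs := range byNS {
		fp := fingerprint(rs)
		if first {
			want, first = fp, false
			continue
		}
		if fp != want {
			return false
		}
	}
	return true
}

// changedNameservers lists (sorted) the nameservers whose answer is new or
// differs from the previous check.
func changedNameservers(prev, cur map[string]RecordSet) []string {
	var out []string
	for ns, rs := range cur {
		old, ok := prev[ns]
		if !ok || fingerprint(old) != fingerprint(rs) {
			out = append(out, ns)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	prev := map[string]RecordSet{
		"ns1.example.com.": {"A": {"93.184.216.34"}},
		"ns2.example.com.": {"A": {"93.184.216.34"}},
	}
	cur := map[string]RecordSet{
		"ns1.example.com.": {"A": {"93.184.216.34"}},
		"ns2.example.com.": {"A": {"203.0.113.10"}},
	}
	fmt.Println("changed:", changedNameservers(prev, cur))
	fmt.Println("was consistent:", consistent(prev), "now consistent:", consistent(cur))
}
```

Comparing the previous and current consistency flags is one way to tell "inconsistency detected" apart from "inconsistency resolved".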
### TCP Port Monitoring
- For every configured domain and hostname, constructs a deduplicated list
  of all IPv4 and IPv6 addresses resolved via A, AAAA, and CNAME chain
  resolution across all authoritative nameservers.
- Checks TCP connectivity on ports **80** and **443** for each IP address.
- Every **1 hour** (on the same `DNSWATCHER_DNS_INTERVAL` schedule as the
  DNS checks), re-checks all ports.
- Any change in port availability triggers a notification:
  - Port transitioned from open to closed (or vice versa).
  - New IP appeared (from DNS change) and its port state was recorded.
  - IP disappeared (from DNS change): noted in the DNS change
    notification; port state for that IP is removed.
### TLS Certificate Monitoring
- Every **12 hours** (the default `DNSWATCHER_TLS_INTERVAL`), for each IP
  address listening on port 443, connects via TLS using the monitored
  hostname as the SNI server name.
- Records the certificate's Subject CN, SANs, issuer, and expiry date.
- Any change triggers a notification:
  - Certificate is expiring within **7 days** (the default
    `DNSWATCHER_TLS_EXPIRY_WARNING`; the warning is repeated on each
    check until the certificate is renewed or has expired).
  - Certificate CN, issuer, or SANs changed (replacement detected,
    reports old and new values).
  - TLS connection failure to a previously-reachable IP:443 (handshake
    error, timeout, connection refused after previously succeeding).
  - TLS recovery: a previously-failing IP:443 now completes a
    handshake again.
### Notifications
**Every observable state change produces a notification.** dnswatcher is
designed as a real-time change feed: degradations, failures, recoveries,
and routine changes are all reported equally.

Supported notification backends:

| Backend        | Configuration                              | Payload Format               |
|----------------|--------------------------------------------|------------------------------|
| **Slack**      | Incoming Webhook URL                       | Attachments with color       |
| **Mattermost** | Incoming Webhook URL                       | Slack-compatible attachments |
| **ntfy**       | Topic URL (e.g. `https://ntfy.sh/mytopic`) | Title + body + priority      |

All configured endpoints receive every notification. Notification content
includes:

- **DNS record changes**: Which hostname, which nameserver, what record
type, old values, new values.
- **DNS NS changes**: Which domain, which nameservers were added/removed.
- **NS query failures**: Which nameserver failed, error type (timeout,
SERVFAIL, REFUSED, network error), which hostname/domain affected.
- **NS recoveries**: Which nameserver recovered, which hostname/domain.
- **NS inconsistencies**: Which nameservers disagree, what each one
returned, which hostname affected.
- **Port changes**: Which IP:port, old state, new state, associated
hostname.
- **TLS expiry warnings**: Which certificate, days remaining, CN,
issuer, associated hostname and IP.
- **TLS certificate changes**: Old and new CN/issuer/SANs, associated
hostname and IP.
- **TLS connection failures/recoveries**: Which IP:port, error details,
associated hostname.
### State Management
- All monitoring state is kept in memory and persisted to a JSON file on
  disk (`state.json` inside the directory set by `DNSWATCHER_DATA_DIR`).
- State is loaded on startup to resume monitoring without triggering
false-positive change notifications.
- State is written atomically (write to temp file, then rename) to prevent
corruption.
### HTTP API
dnswatcher exposes a lightweight HTTP API for operational visibility:

| Endpoint                | Description                     |
|-------------------------|---------------------------------|
| `GET /health`           | Health check (JSON)             |
| `GET /api/v1/status`    | Current monitoring state        |
| `GET /api/v1/domains`   | Configured domains and status   |
| `GET /api/v1/hostnames` | Configured hostnames and status |
| `GET /metrics`          | Prometheus metrics (optional)   |

---
## Architecture
```
cmd/dnswatcher/main.go          Entry point (uber/fx bootstrap)
internal/
  config/config.go              Viper-based configuration
  globals/globals.go            Build-time variables (version, arch)
  logger/logger.go              slog structured logging (TTY detection)
  healthcheck/healthcheck.go    Health check service
  middleware/middleware.go      HTTP middleware (logging, CORS, metrics auth)
  handlers/handlers.go          HTTP request handlers
  server/
    server.go                   HTTP server lifecycle
    routes.go                   Route definitions
  state/state.go                JSON file state persistence
  resolver/resolver.go          Iterative DNS resolution engine
  portcheck/portcheck.go        TCP port connectivity checker
  tlscheck/tlscheck.go          TLS certificate inspector
  notify/notify.go              Notification service (Slack, Mattermost, ntfy)
  watcher/watcher.go            Main monitoring orchestrator and scheduler
```
### Design Principles
- **No recursive resolvers**: All DNS resolution is performed iteratively,
tracing from root nameservers through the delegation chain to
authoritative servers.
- **No external database**: State is persisted as a single JSON file.
- **Dependency injection**: All components are wired via
[uber/fx](https://github.com/uber-go/fx).
- **Structured logging**: All logs use `log/slog` with JSON output in
production (TTY detection for development).
- **Graceful shutdown**: All background goroutines respect context
cancellation and the fx lifecycle.
---
## Configuration
Configuration is loaded via [Viper](https://github.com/spf13/viper) with
the following precedence (highest to lowest):
1. Environment variables (prefixed with `DNSWATCHER_`, except `PORT`)
2. `.env` file (loaded via godotenv)
3. Config file: `/etc/dnswatcher/dnswatcher.yaml`,
`~/.config/dnswatcher/dnswatcher.yaml`, or `./dnswatcher.yaml`
4. Defaults
### Environment Variables
| Variable | Description | Default |
|---------------------------------|--------------------------------------------|-------------|
| `PORT` | HTTP listen port | `8080` |
| `DNSWATCHER_DEBUG` | Enable debug logging | `false` |
| `DNSWATCHER_DATA_DIR` | Directory for state file | `./data` |
| `DNSWATCHER_TARGETS` | Comma-separated DNS names (auto-classified via PSL) | `""` |
| `DNSWATCHER_SLACK_WEBHOOK` | Slack incoming webhook URL | `""` |
| `DNSWATCHER_MATTERMOST_WEBHOOK` | Mattermost incoming webhook URL | `""` |
| `DNSWATCHER_NTFY_TOPIC` | ntfy topic URL | `""` |
| `DNSWATCHER_DNS_INTERVAL` | DNS check interval | `1h` |
| `DNSWATCHER_TLS_INTERVAL` | TLS check interval | `12h` |
| `DNSWATCHER_TLS_EXPIRY_WARNING` | Days before expiry to warn | `7` |
| `DNSWATCHER_SENTRY_DSN` | Sentry DSN for error reporting | `""` |
| `DNSWATCHER_MAINTENANCE_MODE` | Enable maintenance mode | `false` |
| `DNSWATCHER_METRICS_USERNAME` | Basic auth username for /metrics | `""` |
| `DNSWATCHER_METRICS_PASSWORD` | Basic auth password for /metrics | `""` |
### Example `.env`
```sh
PORT=8080
DNSWATCHER_DEBUG=false
DNSWATCHER_DATA_DIR=./data
DNSWATCHER_TARGETS=example.com,example.org,www.example.com,api.example.com,mail.example.org
DNSWATCHER_SLACK_WEBHOOK=https://hooks.slack.com/services/T.../B.../xxx
DNSWATCHER_MATTERMOST_WEBHOOK=https://mattermost.example.com/hooks/xxx
DNSWATCHER_NTFY_TOPIC=https://ntfy.sh/my-dns-alerts
```
---
## DNS Resolution Strategy
dnswatcher never uses the system's configured recursive resolver. Instead,
it performs full iterative resolution:

1. **Root servers**: Starts from the IANA root nameserver list (hardcoded,
   with periodic refresh).
2. **TLD delegation**: Queries root servers for the TLD NS records.
3. **Domain delegation**: Queries TLD nameservers for the domain's NS
   records.
4. **Authoritative query**: Queries all discovered authoritative
   nameservers directly for the requested records.

This approach ensures:

- Independence from any upstream resolver's cache or filtering.
- Ability to detect split-horizon or inconsistent responses across
  authoritative servers.
- Visibility into the full delegation chain.

For hostname monitoring, the resolver follows CNAME chains (with a
depth limit to prevent loops) before collecting terminal A/AAAA records.

---
## State File Format
The state file (`state.json` inside the directory set by
`DNSWATCHER_DATA_DIR`) contains the complete monitoring snapshot. Hostname
records are stored **per authoritative nameserver**, not as a merged view,
to enable inconsistency detection.

```json
{
  "version": 1,
  "lastUpdated": "2026-02-19T12:00:00Z",
  "domains": {
    "example.com": {
      "nameservers": ["ns1.example.com.", "ns2.example.com."],
      "lastChecked": "2026-02-19T12:00:00Z"
    }
  },
  "hostnames": {
    "www.example.com": {
      "recordsByNameserver": {
        "ns1.example.com.": {
          "records": {
            "A": ["93.184.216.34"],
            "AAAA": ["2606:2800:220:1:248:1893:25c8:1946"]
          },
          "status": "ok",
          "lastChecked": "2026-02-19T12:00:00Z"
        },
        "ns2.example.com.": {
          "records": {
            "A": ["93.184.216.34"],
            "AAAA": ["2606:2800:220:1:248:1893:25c8:1946"]
          },
          "status": "ok",
          "lastChecked": "2026-02-19T12:00:00Z"
        }
      },
      "lastChecked": "2026-02-19T12:00:00Z"
    }
  },
  "ports": {
    "93.184.216.34:80": {
      "open": true,
      "hostname": "www.example.com",
      "lastChecked": "2026-02-19T12:00:00Z"
    },
    "93.184.216.34:443": {
      "open": true,
      "hostname": "www.example.com",
      "lastChecked": "2026-02-19T12:00:00Z"
    }
  },
  "certificates": {
    "93.184.216.34:443:www.example.com": {
      "commonName": "www.example.com",
      "issuer": "DigiCert TLS RSA SHA256 2020 CA1",
      "notAfter": "2027-01-15T23:59:59Z",
      "subjectAlternativeNames": ["www.example.com"],
      "status": "ok",
      "lastChecked": "2026-02-19T06:00:00Z"
    }
  }
}
```
The `status` field on each per-nameserver entry and certificate entry
records the outcome of the most recent check:

| Status     | Meaning                                         |
|------------|-------------------------------------------------|
| `ok`       | Query succeeded, records are current            |
| `error`    | Query failed (timeout, SERVFAIL, network error) |
| `nxdomain` | Authoritative NXDOMAIN response                 |
| `nodata`   | Authoritative empty response (NODATA)           |
---
## Building
```sh
make build # Build binary to bin/dnswatcher
make test # Run tests with race detector
make lint # Run golangci-lint
make fmt # Format code
make check # Run all checks (format, lint, test, build)
make clean # Remove build artifacts
```
### Build-Time Variables
Version and architecture are injected via `-ldflags`:
```sh
go build -ldflags "-X main.Version=$(git describe --tags --always) \
-X main.Buildarch=$(go env GOARCH)" ./cmd/dnswatcher
```
---
## Docker
```sh
docker build -t dnswatcher .
docker run -d \
-p 8080:8080 \
-v dnswatcher-data:/var/lib/dnswatcher \
-e DNSWATCHER_TARGETS=example.com,www.example.com \
-e DNSWATCHER_NTFY_TOPIC=https://ntfy.sh/my-alerts \
dnswatcher
```
---
## Monitoring Lifecycle
1. **Startup**: Load state from disk. If no state file exists, start
   with empty state (first check will establish baseline without
   triggering change notifications).
2. **Initial check**: Immediately perform all DNS, port, and TLS checks
   on startup.
3. **Periodic checks**:
   - DNS and port checks: every `DNSWATCHER_DNS_INTERVAL` (default 1h).
   - TLS checks: every `DNSWATCHER_TLS_INTERVAL` (default 12h).
4. **On change detection**: Send notifications to all configured
   endpoints, update in-memory state, persist to disk.
5. **Shutdown**: Persist final state to disk, complete in-flight
   notifications, stop gracefully.
---
## Project Structure
Follows the conventions defined in `CONVENTIONS.md`, adapted from the
[upaas](https://git.eeqj.de/sneak/upaas) project template. Uses uber/fx
for dependency injection, go-chi for HTTP routing, slog for logging, and
Viper for configuration.