# dnswatcher

dnswatcher is a production DNS and infrastructure monitoring daemon written in
Go. It watches configured DNS domains and hostnames for changes, monitors TCP
port availability, tracks TLS certificate expiry, and delivers real-time
notifications via Slack, Mattermost, and/or ntfy webhooks.

It performs all DNS resolution itself via iterative (non-recursive) queries,
tracing from root nameservers to authoritative servers directly, never relying
on upstream recursive resolvers.

State is persisted to a local JSON file so that monitoring survives restarts
without requiring an external database.

---

## Features

### DNS Domain Monitoring (Apex Domains)

- Accepts a list of DNS domain names (apex domains, identified via the
  [Public Suffix List](https://publicsuffix.org/)).
- Every **1 hour**, performs a full iterative trace from root servers to
  discover all authoritative nameservers (NS records) for each domain.
- Queries **every** discovered authoritative nameserver independently.
- Stores the NS record set as observed by the delegation chain.
- Any change triggers a notification:
  - NS added to or removed from the delegation.
  - NS IP address changed (glue record change).

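The NS change detection described above amounts to a set difference between the stored and the freshly observed delegation. A minimal sketch follows; the function name `diffNS` and the sorted output are illustrative assumptions, not taken from the dnswatcher source.

```go
package main

import "sort"

// diffNS compares the previously stored NS set with the freshly observed one
// and returns the names that were added and removed, each sorted.
// (Hypothetical helper for illustration.)
func diffNS(previous, current []string) (added, removed []string) {
	prev := make(map[string]struct{}, len(previous))
	for _, ns := range previous {
		prev[ns] = struct{}{}
	}
	cur := make(map[string]struct{}, len(current))
	for _, ns := range current {
		cur[ns] = struct{}{}
	}
	for ns := range cur {
		if _, ok := prev[ns]; !ok {
			added = append(added, ns)
		}
	}
	for ns := range prev {
		if _, ok := cur[ns]; !ok {
			removed = append(removed, ns)
		}
	}
	sort.Strings(added)
	sort.Strings(removed)
	return added, removed
}
```

A notification would be sent whenever either returned slice is non-empty.
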
### DNS Hostname Monitoring (Subdomains)

- Accepts a list of DNS hostnames (subdomains, distinguished from apex
  domains via the Public Suffix List).
- Every **1 hour**, performs a full iterative trace to discover the
  authoritative nameservers for the hostname's parent domain.
- Queries **each** authoritative nameserver independently for **all**
  record types: A, AAAA, CNAME, MX, TXT, SRV, CAA, NS.
- Stores results **per nameserver**. The state for a hostname is not a
  merged view; it is a map from nameserver to record set.
- Any observable change in any nameserver's response triggers a
  notification. This includes:
  - **Record change**: A nameserver returns different records than it
    did on the previous check (additions, removals, value changes).
  - **NS query failure**: A nameserver that previously responded
    becomes unreachable (timeout, SERVFAIL, REFUSED, network error).
    This is distinct from "responded with no records."
  - **NS recovery**: A previously-unreachable nameserver starts
    responding again.
  - **Inconsistency detected**: Two nameservers that previously agreed
    now return different record sets for the same hostname.
  - **Inconsistency resolved**: Nameservers that previously disagreed
    are now back in agreement.
  - **Empty response**: A nameserver that previously returned records
    now returns an authoritative empty response (NODATA/NXDOMAIN).

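Because state is kept per nameserver, inconsistency detection can be reduced to comparing an order-independent fingerprint of each nameserver's record set. The sketch below shows one way to do that; `RecordSet`, `fingerprint`, and `consistent` are hypothetical names, and treating an empty value list as "no records of that type" is an assumption.

```go
package main

import (
	"sort"
	"strings"
)

// RecordSet maps a record type ("A", "AAAA", ...) to its values.
type RecordSet map[string][]string

// fingerprint renders a record set as a canonical string so that two sets
// with the same content compare equal regardless of ordering.
func fingerprint(rs RecordSet) string {
	types := make([]string, 0, len(rs))
	for t := range rs {
		types = append(types, t)
	}
	sort.Strings(types)

	var b strings.Builder
	for _, t := range types {
		if len(rs[t]) == 0 {
			continue // assumption: an empty list equals an absent type
		}
		vals := append([]string(nil), rs[t]...) // copy before sorting
		sort.Strings(vals)
		b.WriteString(t + "=" + strings.Join(vals, ",") + ";")
	}
	return b.String()
}

// consistent reports whether every nameserver returned the same record set.
func consistent(byNS map[string]RecordSet) bool {
	first := true
	var want string
	for _, rs := range byNS {
		fp := fingerprint(rs)
		if first {
			want, first = fp, false
			continue
		}
		if fp != want {
			return false
		}
	}
	return true
}
```

Comparing the result of `consistent` with the value from the previous check yields the "inconsistency detected" and "inconsistency resolved" transitions.
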
### TCP Port Monitoring

- For every configured domain and hostname, constructs a deduplicated list
  of all IPv4 and IPv6 addresses resolved via A, AAAA, and CNAME chain
  resolution across all authoritative nameservers.
- Checks TCP connectivity on ports **80** and **443** for each IP address.
- Every **1 hour**, re-checks all ports.
- Any change in port availability triggers a notification:
  - Port transitioned from open to closed (or vice versa).
  - New IP appeared (from DNS change) and its port state was recorded.
  - IP disappeared (from DNS change): noted in the DNS change
    notification; port state for that IP is removed.

### TLS Certificate Monitoring

- Every **12 hours**, for each IP address listening on port 443, connects
  via TLS using the correct SNI hostname.
- Records the certificate's Subject CN, SANs, issuer, and expiry date.
- The following events trigger a notification:
  - Certificate is expiring within **7 days** (warning, repeated each
    check until renewed or expired).
  - Certificate CN, issuer, or SANs changed (replacement detected,
    reports old and new values).
  - TLS connection failure to a previously-reachable IP:443 (handshake
    error, timeout, connection refused after previously succeeding).
  - TLS recovery: a previously-failing IP:443 now completes a
    handshake again.

### Notifications

**Every observable state change produces a notification.** dnswatcher is
designed as a real-time change feed: degradations, failures, recoveries,
and routine changes are all reported equally.

Supported notification backends:

| Backend        | Configuration                              | Payload Format               |
|----------------|--------------------------------------------|------------------------------|
| **Slack**      | Incoming Webhook URL                       | Attachments with color       |
| **Mattermost** | Incoming Webhook URL                       | Slack-compatible attachments |
| **ntfy**       | Topic URL (e.g. `https://ntfy.sh/mytopic`) | Title + body + priority      |

All configured endpoints receive every notification. Notification content
includes:

- **DNS record changes**: Which hostname, which nameserver, what record
  type, old values, new values.
- **DNS NS changes**: Which domain, which nameservers were added/removed.
- **NS query failures**: Which nameserver failed, error type (timeout,
  SERVFAIL, REFUSED, network error), which hostname/domain affected.
- **NS recoveries**: Which nameserver recovered, which hostname/domain.
- **NS inconsistencies**: Which nameservers disagree, what each one
  returned, which hostname affected.
- **Port changes**: Which IP:port, old state, new state, associated
  hostname.
- **TLS expiry warnings**: Which certificate, days remaining, CN,
  issuer, associated hostname and IP.
- **TLS certificate changes**: Old and new CN/issuer/SANs, associated
  hostname and IP.
- **TLS connection failures/recoveries**: Which IP:port, error details,
  associated hostname.

### State Management

- All monitoring state is kept in memory and persisted to a JSON file on
  disk (`DATA_DIR/state.json`).
- State is loaded on startup to resume monitoring without triggering
  false-positive change notifications.
- State is written atomically (write to temp file, then rename) to prevent
  corruption.

### HTTP API

dnswatcher exposes a lightweight HTTP API for operational visibility:

| Endpoint                | Description                     |
|-------------------------|---------------------------------|
| `GET /health`           | Health check (JSON)             |
| `GET /api/v1/status`    | Current monitoring state        |
| `GET /api/v1/domains`   | Configured domains and status   |
| `GET /api/v1/hostnames` | Configured hostnames and status |
| `GET /metrics`          | Prometheus metrics (optional)   |

---

## Architecture

```
cmd/dnswatcher/main.go          Entry point (uber/fx bootstrap)

internal/
  config/config.go              Viper-based configuration
  globals/globals.go            Build-time variables (version, arch)
  logger/logger.go              slog structured logging (TTY detection)
  healthcheck/healthcheck.go    Health check service
  middleware/middleware.go      HTTP middleware (logging, CORS, metrics auth)
  handlers/handlers.go          HTTP request handlers
  server/
    server.go                   HTTP server lifecycle
    routes.go                   Route definitions
  state/state.go                JSON file state persistence
  resolver/resolver.go          Iterative DNS resolution engine
  portcheck/portcheck.go        TCP port connectivity checker
  tlscheck/tlscheck.go          TLS certificate inspector
  notify/notify.go              Notification service (Slack, Mattermost, ntfy)
  watcher/watcher.go            Main monitoring orchestrator and scheduler
```

### Design Principles

- **No recursive resolvers**: All DNS resolution is performed iteratively,
  tracing from root nameservers through the delegation chain to
  authoritative servers.
- **No external database**: State is persisted as a single JSON file.
- **Dependency injection**: All components are wired via
  [uber/fx](https://github.com/uber-go/fx).
- **Structured logging**: All logs use `log/slog` with JSON output in
  production (TTY detection for development).
- **Graceful shutdown**: All background goroutines respect context
  cancellation and the fx lifecycle.

---

## Configuration

Configuration is loaded via [Viper](https://github.com/spf13/viper) with
the following precedence (highest to lowest):

1. Environment variables (prefixed with `DNSWATCHER_`, except `PORT`)
2. `.env` file (loaded via godotenv)
3. Config file: `/etc/dnswatcher/dnswatcher.yaml`,
   `~/.config/dnswatcher/dnswatcher.yaml`, or `./dnswatcher.yaml`
4. Defaults

### Environment Variables

| Variable                        | Description                          | Default  |
|---------------------------------|--------------------------------------|----------|
| `PORT`                          | HTTP listen port                     | `8080`   |
| `DNSWATCHER_DEBUG`              | Enable debug logging                 | `false`  |
| `DNSWATCHER_DATA_DIR`           | Directory for state file             | `./data` |
| `DNSWATCHER_DOMAINS`            | Comma-separated list of apex domains | `""`     |
| `DNSWATCHER_HOSTNAMES`          | Comma-separated list of hostnames    | `""`     |
| `DNSWATCHER_SLACK_WEBHOOK`      | Slack incoming webhook URL           | `""`     |
| `DNSWATCHER_MATTERMOST_WEBHOOK` | Mattermost incoming webhook URL      | `""`     |
| `DNSWATCHER_NTFY_TOPIC`         | ntfy topic URL                       | `""`     |
| `DNSWATCHER_DNS_INTERVAL`       | DNS check interval                   | `1h`     |
| `DNSWATCHER_TLS_INTERVAL`       | TLS check interval                   | `12h`    |
| `DNSWATCHER_TLS_EXPIRY_WARNING` | Days before expiry to warn           | `7`      |
| `DNSWATCHER_SENTRY_DSN`         | Sentry DSN for error reporting       | `""`     |
| `DNSWATCHER_MAINTENANCE_MODE`   | Enable maintenance mode              | `false`  |
| `DNSWATCHER_METRICS_USERNAME`   | Basic auth username for /metrics     | `""`     |
| `DNSWATCHER_METRICS_PASSWORD`   | Basic auth password for /metrics     | `""`     |

### Example `.env`

```sh
PORT=8080
DNSWATCHER_DEBUG=false
DNSWATCHER_DATA_DIR=./data
DNSWATCHER_DOMAINS=example.com,example.org
DNSWATCHER_HOSTNAMES=www.example.com,api.example.com,mail.example.org
DNSWATCHER_SLACK_WEBHOOK=https://hooks.slack.com/services/T.../B.../xxx
DNSWATCHER_MATTERMOST_WEBHOOK=https://mattermost.example.com/hooks/xxx
DNSWATCHER_NTFY_TOPIC=https://ntfy.sh/my-dns-alerts
```


---

## DNS Resolution Strategy

dnswatcher never uses the system's configured recursive resolver. Instead,
it performs full iterative resolution:

1. **Root servers**: Starts from the IANA root nameserver list (hardcoded,
   with periodic refresh).
2. **TLD delegation**: Queries root servers for the TLD NS records.
3. **Domain delegation**: Queries TLD nameservers for the domain's NS
   records.
4. **Authoritative query**: Queries all discovered authoritative
   nameservers directly for the requested records.

This approach ensures:
- Independence from any upstream resolver's cache or filtering.
- Ability to detect split-horizon or inconsistent responses across
  authoritative servers.
- Visibility into the full delegation chain.

For hostname monitoring, the resolver follows CNAME chains (with a
depth limit to prevent loops) before collecting terminal A/AAAA records.

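A depth-limited CNAME walk might look like the sketch below. The `cnameLookup` callback stands in for a real authoritative query, and both the names and the error on exceeding the limit are assumptions rather than the project's actual behavior.

```go
package main

import "errors"

// cnameLookup returns the CNAME target for name, or "" if name has none.
type cnameLookup func(name string) (string, error)

// followCNAME follows the alias chain starting at name and returns the
// terminal name whose A/AAAA records should be collected. It gives up
// after maxDepth aliases so that CNAME loops cannot run forever.
func followCNAME(name string, lookup cnameLookup, maxDepth int) (string, error) {
	current := name
	for depth := 0; depth <= maxDepth; depth++ {
		target, err := lookup(current)
		if err != nil {
			return "", err
		}
		if target == "" {
			return current, nil // no further alias: this is the terminal name
		}
		current = target
	}
	return "", errors.New("CNAME chain exceeds depth limit")
}
```
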

---

## State File Format

The state file (`DATA_DIR/state.json`) contains the complete monitoring
snapshot. Hostname records are stored **per authoritative nameserver**,
not as a merged view, to enable inconsistency detection.

```json
{
  "version": 1,
  "lastUpdated": "2026-02-19T12:00:00Z",
  "domains": {
    "example.com": {
      "nameservers": ["ns1.example.com.", "ns2.example.com."],
      "lastChecked": "2026-02-19T12:00:00Z"
    }
  },
  "hostnames": {
    "www.example.com": {
      "recordsByNameserver": {
        "ns1.example.com.": {
          "records": {
            "A": ["93.184.216.34"],
            "AAAA": ["2606:2800:220:1:248:1893:25c8:1946"]
          },
          "status": "ok",
          "lastChecked": "2026-02-19T12:00:00Z"
        },
        "ns2.example.com.": {
          "records": {
            "A": ["93.184.216.34"],
            "AAAA": ["2606:2800:220:1:248:1893:25c8:1946"]
          },
          "status": "ok",
          "lastChecked": "2026-02-19T12:00:00Z"
        }
      },
      "lastChecked": "2026-02-19T12:00:00Z"
    }
  },
  "ports": {
    "93.184.216.34:80": {
      "open": true,
      "hostname": "www.example.com",
      "lastChecked": "2026-02-19T12:00:00Z"
    },
    "93.184.216.34:443": {
      "open": true,
      "hostname": "www.example.com",
      "lastChecked": "2026-02-19T12:00:00Z"
    }
  },
  "certificates": {
    "93.184.216.34:443:www.example.com": {
      "commonName": "www.example.com",
      "issuer": "DigiCert TLS RSA SHA256 2020 CA1",
      "notAfter": "2027-01-15T23:59:59Z",
      "subjectAlternativeNames": ["www.example.com"],
      "status": "ok",
      "lastChecked": "2026-02-19T06:00:00Z"
    }
  }
}
```

The `status` field for each per-nameserver entry and certificate entry
tracks reachability:

| Status     | Meaning                                                  |
|------------|----------------------------------------------------------|
| `ok`       | Query succeeded, records are current                     |
| `error`    | Query failed (timeout, SERVFAIL, REFUSED, network error) |
| `nxdomain` | Authoritative NXDOMAIN response                          |
| `nodata`   | Authoritative empty response (NODATA)                    |

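The JSON layout above maps naturally onto Go structs with `encoding/json` tags. The following is a sketch derived only from the example file; the type names, constant names, and `parseState` helper are hypothetical and may not match `internal/state`.

```go
package main

import (
	"encoding/json"
	"time"
)

// Status values as listed in the table above.
const (
	StatusOK       = "ok"
	StatusError    = "error"
	StatusNXDomain = "nxdomain"
	StatusNoData   = "nodata"
)

type NameserverRecords struct {
	Records     map[string][]string `json:"records"`
	Status      string              `json:"status"`
	LastChecked time.Time           `json:"lastChecked"`
}

type DomainState struct {
	Nameservers []string  `json:"nameservers"`
	LastChecked time.Time `json:"lastChecked"`
}

type HostnameState struct {
	RecordsByNameserver map[string]NameserverRecords `json:"recordsByNameserver"`
	LastChecked         time.Time                    `json:"lastChecked"`
}

type PortState struct {
	Open        bool      `json:"open"`
	Hostname    string    `json:"hostname"`
	LastChecked time.Time `json:"lastChecked"`
}

type CertificateState struct {
	CommonName              string    `json:"commonName"`
	Issuer                  string    `json:"issuer"`
	NotAfter                time.Time `json:"notAfter"`
	SubjectAlternativeNames []string  `json:"subjectAlternativeNames"`
	Status                  string    `json:"status"`
	LastChecked             time.Time `json:"lastChecked"`
}

// State mirrors the top-level object of state.json.
type State struct {
	Version      int                         `json:"version"`
	LastUpdated  time.Time                   `json:"lastUpdated"`
	Domains      map[string]DomainState      `json:"domains"`
	Hostnames    map[string]HostnameState    `json:"hostnames"`
	Ports        map[string]PortState        `json:"ports"`
	Certificates map[string]CertificateState `json:"certificates"`
}

// parseState decodes the contents of a state file.
func parseState(data []byte) (*State, error) {
	var s State
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, err
	}
	return &s, nil
}
```

The RFC 3339 timestamps in the file decode directly into `time.Time` fields.
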

---

## Building

```sh
make build     # Build binary to bin/dnswatcher
make test      # Run tests with race detector
make lint      # Run golangci-lint
make fmt       # Format code
make check     # Run all checks (format, lint, test, build)
make clean     # Remove build artifacts
```

### Build-Time Variables

Version and architecture are injected via `-ldflags`:

```sh
go build -ldflags "-X main.Version=$(git describe --tags --always) \
  -X main.Buildarch=$(go env GOARCH)" ./cmd/dnswatcher
```

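On the Go side, `-X` can only set package-level string variables, so the receiving declarations would look roughly like this. The variable names come from the command above; the default values and the `buildInfo` helper are assumptions added for illustration.

```go
package main

// Version and Buildarch are overwritten at link time via
// -ldflags "-X main.Version=... -X main.Buildarch=...".
// The defaults below (assumed) apply to a plain `go build` or `go run`.
var (
	Version   = "dev"
	Buildarch = "unknown"
)

// buildInfo formats the build-time variables for logs or a version flag.
func buildInfo() string {
	return "dnswatcher " + Version + " (" + Buildarch + ")"
}
```
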

---

## Docker

```sh
docker build -t dnswatcher .
docker run -d \
  -p 8080:8080 \
  -v dnswatcher-data:/var/lib/dnswatcher \
  -e DNSWATCHER_DOMAINS=example.com \
  -e DNSWATCHER_HOSTNAMES=www.example.com \
  -e DNSWATCHER_NTFY_TOPIC=https://ntfy.sh/my-alerts \
  dnswatcher
```

---

## Monitoring Lifecycle

1. **Startup**: Load state from disk. If no state file exists, start
   with empty state (first check will establish baseline without
   triggering change notifications).
2. **Initial check**: Immediately perform all DNS, port, and TLS checks
   on startup.
3. **Periodic checks**:
   - DNS and port checks: every `DNSWATCHER_DNS_INTERVAL` (default 1h).
   - TLS checks: every `DNSWATCHER_TLS_INTERVAL` (default 12h).
4. **On change detection**: Send notifications to all configured
   endpoints, update in-memory state, persist to disk.
5. **Shutdown**: Persist final state to disk, complete in-flight
   notifications, stop gracefully.


---

## Project Structure

Follows the conventions defined in `CONVENTIONS.md`, adapted from the
[upaas](https://git.eeqj.de/sneak/upaas) project template. Uses uber/fx
for dependency injection, go-chi for HTTP routing, slog for logging, and
Viper for configuration.