Reduce DNS query timeout and limit root server fan-out #29

Closed
opened 2026-02-22 12:33:45 +01:00 by clawbot · 0 comments

Problem

The resolver uses a 5s query timeout (queryTimeoutDuration), which is far too high: the measured median RTT to root servers from GU is ~19ms. With a 5s timeout and 2 retries, a single failed hop burns 10s+.

Additionally, rootServerList() returns all 13 root servers but several code paths try them sequentially (e.g. resolveNSRecursive tries first 3, resolveARecord tries first 3), while followDelegation → queryServers tries ALL servers in the list if earlier ones fail.

Changes needed

  1. Reduce queryTimeoutDuration from 5s to 2s. This is still >100x median RTT to roots and gives plenty of headroom for slower auth/TLD servers worldwide.

  2. Limit root server queries to 3 NSes. The root is run correctly — if 3 of 13 root servers are unreachable, something is wrong with our network, not the root. Shuffle the list and pick 3 to avoid always hitting the same ones.

  3. Keep maxRetries = 2. That gives 3 attempts per server, which is fine with the lower timeout.
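
The three changes above could look roughly like the following. This is a minimal sketch, not the resolver's actual code: pickRootServers and maxRootServers are hypothetical names, and the hostnames in main stand in for whatever rootServerList() really returns (it may well return IP addresses).

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	queryTimeoutDuration = 2 * time.Second // reduced from 5s
	maxRetries           = 2               // unchanged: 3 attempts per server
	maxRootServers       = 3               // hypothetical name for the new cap
)

// pickRootServers returns up to n servers chosen at random from all.
// It shuffles a copy so the caller's slice keeps its original order.
func pickRootServers(all []string, n int) []string {
	shuffled := make([]string, len(all))
	copy(shuffled, all)
	rand.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})
	if n < 0 {
		n = 0
	}
	if n > len(shuffled) {
		n = len(shuffled)
	}
	return shuffled[:n]
}

func main() {
	// Placeholder input: the 13 root server hostnames, a through m.
	var roots []string
	for c := 'a'; c <= 'm'; c++ {
		roots = append(roots, string(c)+".root-servers.net")
	}
	fmt.Println(pickRootServers(roots, maxRootServers))
}
```

Shuffling a copy and slicing the first 3 means every code path that goes through this helper queries a different random subset on each call. Note that the top-level math/rand functions are only seeded randomly by default from Go 1.20 onward; on older toolchains the generator would need explicit seeding.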

Impact

This should significantly reduce test suite time (currently 39s in internal/resolver alone, mostly from timeout-triggered retries). More importantly, production queries will fail fast instead of hanging for 10s on a single bad hop.

clawbot self-assigned this 2026-02-22 12:33:45 +01:00
sneak closed this issue 2026-02-28 12:07:21 +01:00
Reference: sneak/dnswatcher#29