refactor: bounded worker pool with DB-mediated retry fallback

Replace unbounded goroutine-per-delivery fan-out with a fixed-size
worker pool (10 workers). Channels serve as bounded queues (10,000
buffer). Workers are the only goroutines doing HTTP delivery.

When the retry channel overflows, timers are dropped instead of re-armed.
The delivery stays in 'retrying' status in the DB and a periodic sweep
(every 60s) recovers orphaned retries. The database is the durable
fallback — same path used on startup recovery.

Addresses owner feedback on circuit breaker recovery goroutine flood.
clawbot
2026-03-01 22:52:27 -08:00
parent 9b4ae41c44
commit 10db6c5b84
2 changed files with 503 additions and 304 deletions


@@ -496,9 +496,11 @@ External Service
┌──────────────┐
│ Delivery │◄── retry timers
│ Engine │ (backoff)
│ (worker │
│ pool) │
└──────┬───────┘
┌─── parallel goroutines (fan-out) ──┐
┌── bounded worker pool (N workers) ──┐
▼ ▼ ▼
┌────────────┐ ┌────────────┐ ┌────────────┐
│ HTTP Target│ │Retry Target│ │ Log Target │
@@ -508,28 +510,56 @@ External Service
└────────────┘
```
### Parallel Fan-Out Delivery
### Bounded Worker Pool
When the delivery engine receives a batch of tasks for an event, it
fans out **all targets in parallel** — each `DeliveryTask` is dispatched
in its own goroutine immediately. An HTTP target, a retry target, and
a log target for the same event all start delivering simultaneously
with no sequential bottleneck.
The delivery engine uses a **fixed-size worker pool** (default: 10
workers) to process all deliveries. At most N deliveries are in-flight
at any time, preventing goroutine explosions regardless of queue depth.
**Architecture:**
- **Channels as queues:** Two buffered channels serve as bounded queues:
a delivery channel (new tasks from the webhook handler) and a retry
channel (tasks from backoff timers). Both are buffered to 10,000.
- **Fan-out via channel, not goroutines:** When an event arrives with
multiple targets, each `DeliveryTask` is sent to the delivery channel.
Workers pick them up and process them — no goroutine-per-target.
- **Worker goroutines:** A fixed number of worker goroutines select from
both channels. Each worker processes one task at a time, then picks up
the next. Workers are the ONLY goroutines doing actual HTTP delivery.
- **Retry backpressure with DB fallback:** When a retry timer fires and
the retry channel is full, the timer is dropped — the delivery stays
in `retrying` status in the database. A periodic sweep (every 60s)
scans for these "orphaned" retries and re-queues them. No blocked
goroutines, no unbounded timer chains.
- **Bounded concurrency:** At most N deliveries (N = number of workers)
are in-flight simultaneously. Even if a circuit breaker is open for
hours and thousands of retries queue up in the channels, the workers
drain them at a controlled rate when the circuit closes.
This means:
- **No head-of-line blocking** — a slow HTTP target doesn't delay the
log target or other targets.
- **Maximum throughput** — all targets receive the event as quickly as
possible.
- **Independent results** — each goroutine records its own delivery
result in the per-webhook database without coordination.
- **Fire-and-forget** — the engine doesn't wait for all goroutines to
finish; each delivery is completely independent.
- **No goroutine explosion** — even with 10,000 queued retries, only
N worker goroutines exist.
- **Natural backpressure** — if workers are busy, new tasks wait in the
channel buffer rather than spawning more goroutines.
- **Independent results** — each worker records its own delivery result
in the per-webhook database without coordination.
- **Graceful shutdown** — cancel the context, workers finish their
current task and exit. `WaitGroup.Wait()` ensures clean shutdown.
The same parallel fan-out applies to crash recovery: when the engine
restarts and finds pending deliveries in per-webhook databases, it
recovers them and fans them out in parallel just like fresh deliveries.
**Recovery paths:**
1. **Startup recovery:** When the engine starts, it scans all per-webhook
databases for `pending` and `retrying` deliveries. Pending deliveries
are sent to the delivery channel; retrying deliveries get backoff
timers scheduled.
2. **Periodic retry sweep (DB-mediated fallback):** Every 60 seconds the
engine scans for `retrying` deliveries whose backoff period has
elapsed. This catches "orphaned" retries — ones whose in-memory timer
was dropped because the retry channel was full. The database is the
durable fallback that ensures no retry is permanently lost, even under
extreme backpressure.
### Circuit Breaker (Retry Targets)