refactor: bounded worker pool with DB-mediated retry fallback

Replace unbounded goroutine-per-delivery fan-out with a fixed-size
worker pool (10 workers). Channels serve as bounded queues (10,000
buffer). Workers are the only goroutines doing HTTP delivery.

When the retry channel overflows, timers are dropped instead of re-armed.
The delivery stays in 'retrying' status in the DB and a periodic sweep
(every 60s) recovers orphaned retries. The database is the durable
fallback — same path used on startup recovery.

Addresses owner feedback on circuit breaker recovery goroutine flood.
clawbot
2026-03-01 22:52:27 -08:00
parent 9b4ae41c44
commit 10db6c5b84
2 changed files with 503 additions and 304 deletions


@@ -496,9 +496,11 @@ External Service
┌──────────────┐
│ Delivery │◄── retry timers
│ Engine │ (backoff)
│ (worker │
│ pool) │
└──────┬───────┘
┌─── parallel goroutines (fan-out) ──┐
┌── bounded worker pool (N workers) ──┐
▼ ▼ ▼
┌────────────┐ ┌────────────┐ ┌────────────┐
│ HTTP Target│ │Retry Target│ │ Log Target │
@@ -508,28 +510,56 @@ External Service
└────────────┘
```
### Parallel Fan-Out Delivery
### Bounded Worker Pool
When the delivery engine receives a batch of tasks for an event, it
fans out **all targets in parallel** — each `DeliveryTask` is dispatched
in its own goroutine immediately. An HTTP target, a retry target, and
a log target for the same event all start delivering simultaneously
with no sequential bottleneck.
The delivery engine uses a **fixed-size worker pool** (default: 10
workers) to process all deliveries. At most N deliveries are in-flight
at any time, preventing goroutine explosions regardless of queue depth.
**Architecture:**
- **Channels as queues:** Two buffered channels serve as bounded queues:
a delivery channel (new tasks from the webhook handler) and a retry
channel (tasks from backoff timers). Both are buffered to 10,000.
- **Fan-out via channel, not goroutines:** When an event arrives with
multiple targets, each `DeliveryTask` is sent to the delivery channel.
Workers pick them up and process them — no goroutine-per-target.
- **Worker goroutines:** A fixed number of worker goroutines select from
both channels. Each worker processes one task at a time, then picks up
the next. Workers are the ONLY goroutines doing actual HTTP delivery.
- **Retry backpressure with DB fallback:** When a retry timer fires and
the retry channel is full, the timer is dropped — the delivery stays
in `retrying` status in the database. A periodic sweep (every 60s)
scans for these "orphaned" retries and re-queues them. No blocked
goroutines, no unbounded timer chains.
- **Bounded concurrency:** At most N deliveries (N = number of workers)
are in-flight simultaneously. Even if a circuit breaker is open for
hours and thousands of retries queue up in the channels, the workers
drain them at a controlled rate when the circuit closes.
This means:
- **No head-of-line blocking** — a slow HTTP target doesn't delay the
log target or other targets.
- **Maximum throughput** — all targets receive the event as quickly as
possible.
- **Independent results** — each goroutine records its own delivery
result in the per-webhook database without coordination.
- **Fire-and-forget** — the engine doesn't wait for all goroutines to
finish; each delivery is completely independent.
- **No goroutine explosion** — even with 10,000 queued retries, only
N worker goroutines exist.
- **Natural backpressure** — if workers are busy, new tasks wait in the
channel buffer rather than spawning more goroutines.
- **Independent results** — each worker records its own delivery result
in the per-webhook database without coordination.
- **Graceful shutdown** — cancel the context, workers finish their
current task and exit. `WaitGroup.Wait()` ensures clean shutdown.
The same parallel fan-out applies to crash recovery: when the engine
restarts and finds pending deliveries in per-webhook databases, it
recovers them and fans them out in parallel just like fresh deliveries.
**Recovery paths:**
1. **Startup recovery:** When the engine starts, it scans all per-webhook
databases for `pending` and `retrying` deliveries. Pending deliveries
are sent to the delivery channel; retrying deliveries get backoff
timers scheduled.
2. **Periodic retry sweep (DB-mediated fallback):** Every 60 seconds the
engine scans for `retrying` deliveries whose backoff period has
elapsed. This catches "orphaned" retries — ones whose in-memory timer
was dropped because the retry channel was full. The database is the
durable fallback that ensures no retry is permanently lost, even under
extreme backpressure.
### Circuit Breaker (Retry Targets)