refactor: bounded worker pool with DB-mediated retry fallback
All checks were successful
check / check (push) Successful in 58s
Replace unbounded goroutine-per-delivery fan-out with a fixed-size worker pool (10 workers). Channels serve as bounded queues (10,000 buffer). Workers are the only goroutines doing HTTP delivery. When retry channel overflows, timers are dropped instead of re-armed. The delivery stays in 'retrying' status in the DB and a periodic sweep (every 60s) recovers orphaned retries. The database is the durable fallback — same path used on startup recovery. Addresses owner feedback on circuit breaker recovery goroutine flood.
66 README.md
@@ -496,9 +496,11 @@ External Service
 ┌──────────────┐
 │   Delivery   │◄── retry timers
 │   Engine     │    (backoff)
+│   (worker    │
+│    pool)     │
 └──────┬───────┘
        │
-┌─── parallel goroutines (fan-out) ───┐
+┌── bounded worker pool (N workers) ──┐
       ▼              ▼              ▼
 ┌────────────┐ ┌────────────┐ ┌────────────┐
 │ HTTP Target│ │Retry Target│ │ Log Target │
|
||||
@@ -508,28 +510,56 @@ External Service
 └────────────┘ └────────────┘ └────────────┘
 ```

-### Parallel Fan-Out Delivery
+### Bounded Worker Pool

-When the delivery engine receives a batch of tasks for an event, it
-fans out **all targets in parallel** — each `DeliveryTask` is dispatched
-in its own goroutine immediately. An HTTP target, a retry target, and
-a log target for the same event all start delivering simultaneously
-with no sequential bottleneck.
+The delivery engine uses a **fixed-size worker pool** (default: 10
+workers) to process all deliveries. At most N deliveries are in flight
+at any time, preventing goroutine explosions regardless of queue depth.
+
+**Architecture:**
+
+- **Channels as queues:** Two buffered channels serve as bounded queues:
+  a delivery channel (new tasks from the webhook handler) and a retry
+  channel (tasks from backoff timers). Both are buffered to 10,000.
+- **Fan-out via channel, not goroutines:** When an event arrives with
+  multiple targets, each `DeliveryTask` is sent to the delivery channel.
+  Workers pick them up and process them — no goroutine-per-target.
+- **Worker goroutines:** A fixed number of worker goroutines select from
+  both channels. Each worker processes one task at a time, then picks up
+  the next. Workers are the ONLY goroutines doing actual HTTP delivery.
+- **Retry backpressure with DB fallback:** When a retry timer fires and
+  the retry channel is full, the timer is dropped — the delivery stays
+  in `retrying` status in the database. A periodic sweep (every 60s)
+  scans for these "orphaned" retries and re-queues them. No blocked
+  goroutines, no unbounded timer chains.
+- **Bounded concurrency:** At most N deliveries (N = number of workers)
+  are in flight simultaneously. Even if a circuit breaker is open for
+  hours and thousands of retries queue up in the channels, the workers
+  drain them at a controlled rate when the circuit closes.

 This means:

-- **No head-of-line blocking** — a slow HTTP target doesn't delay the
-  log target or other targets.
-- **Maximum throughput** — all targets receive the event as quickly as
-  possible.
-- **Independent results** — each goroutine records its own delivery
-  result in the per-webhook database without coordination.
-- **Fire-and-forget** — the engine doesn't wait for all goroutines to
-  finish; each delivery is completely independent.
+- **No goroutine explosion** — even with 10,000 queued retries, only
+  N worker goroutines exist.
+- **Natural backpressure** — if workers are busy, new tasks wait in the
+  channel buffer rather than spawning more goroutines.
+- **Independent results** — each worker records its own delivery result
+  in the per-webhook database without coordination.
+- **Graceful shutdown** — cancel the context, workers finish their
+  current task and exit. `WaitGroup.Wait()` ensures clean shutdown.

-The same parallel fan-out applies to crash recovery: when the engine
-restarts and finds pending deliveries in per-webhook databases, it
-recovers them and fans them out in parallel just like fresh deliveries.
+**Recovery paths:**
+
+1. **Startup recovery:** When the engine starts, it scans all per-webhook
+   databases for `pending` and `retrying` deliveries. Pending deliveries
+   are sent to the delivery channel; retrying deliveries get backoff
+   timers scheduled.
+2. **Periodic retry sweep (DB-mediated fallback):** Every 60 seconds the
+   engine scans for `retrying` deliveries whose backoff period has
+   elapsed. This catches "orphaned" retries — ones whose in-memory timer
+   was dropped because the retry channel was full. The database is the
+   durable fallback that ensures no retry is permanently lost, even under
+   extreme backpressure.

 ### Circuit Breaker (Retry Targets)