feat: parallel fan-out delivery + circuit breaker for retry targets
All checks were successful
check / check (push) Successful in 1m52s

- Fan out all targets for an event in parallel goroutines (fire-and-forget)
- Add per-target circuit breaker for retry targets (closed/open/half-open)
- Circuit breaker trips after 5 consecutive failures, 30s cooldown
- Open circuit skips delivery and reschedules after cooldown
- Half-open allows one probe delivery to test recovery
- HTTP/database/log targets unaffected (no circuit breaker)
- Recovery path also fans out in parallel
- Update README with parallel delivery and circuit breaker docs
This commit is contained in:
clawbot
2026-03-01 22:20:33 -08:00
parent 32bd40b313
commit 9b4ae41c44
3 changed files with 407 additions and 92 deletions

View File

@@ -498,14 +498,89 @@ External Service
│ Engine │ (backoff) │ Engine │ (backoff)
└──────┬───────┘ └──────┬───────┘
┌────────────────────┼────────────────────┐ ┌─── parallel goroutines (fan-out) ───┐
▼ ▼ ▼ ▼ ▼ ▼
┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐
│ HTTP Target│ │Retry Target│ │ Log Target │ │ HTTP Target│ │Retry Target│ │ Log Target │
│ (1 attempt)│ │ (backoff) │ │ (stdout) │ │ (1 attempt)│ │ (backoff + │ │ (stdout) │
└────────────┘ └────────────┘ └────────────┘ └────────────┘ │ circuit │ └────────────┘
│ breaker) │
└────────────┘
``` ```
### Parallel Fan-Out Delivery
When the delivery engine receives a batch of tasks for an event, it
fans out **all targets in parallel** — each `DeliveryTask` is dispatched
in its own goroutine immediately. An HTTP target, a retry target, and
a log target for the same event all start delivering simultaneously
with no sequential bottleneck.
This means:
- **No head-of-line blocking** — a slow HTTP target doesn't delay the
log target or other targets.
- **Maximum throughput** — all targets receive the event as quickly as
possible.
- **Independent results** — each goroutine records its own delivery
result in the per-webhook database without coordination.
- **Fire-and-forget** — the engine doesn't wait for all goroutines to
finish; each delivery is completely independent.
The same parallel fan-out applies to crash recovery: when the engine
restarts and finds pending deliveries in per-webhook databases, it
recovers them and fans them out in parallel just like fresh deliveries.
### Circuit Breaker (Retry Targets)
Retry targets are protected by a **per-target circuit breaker** that
prevents hammering a down target with repeated failed delivery attempts.
The circuit breaker is in-memory only and resets on restart (which is
fine — startup recovery rescans the database anyway).
**States:**
| State | Behavior |
| ----------- | -------- |
| **Closed** | Normal operation. Deliveries flow through. Consecutive failures are counted. |
| **Open** | Target appears down. Deliveries are skipped and rescheduled for after the cooldown. |
| **Half-Open** | Cooldown expired. One probe delivery is allowed to test if the target has recovered. |
**Transitions:**
```
success ┌──────────┐
┌────────────────────► │ Closed │ ◄─── probe succeeds
│ │ (normal) │
│ └────┬─────┘
│ │ N consecutive failures
│ ▼
│ ┌──────────┐
│ │ Open │ ◄─── probe fails
│ │(tripped) │
│ └────┬─────┘
│ │ cooldown expires
│ ▼
│ ┌──────────┐
└──────────────────────│Half-Open │
│ (probe) │
└──────────┘
```
**Defaults:**
- **Failure threshold:** 5 consecutive failures before opening
- **Cooldown:** 30 seconds in open state before probing
**Scope:** Circuit breakers only apply to **retry** target types. HTTP
targets (fire-and-forget), database targets (local operations), and log
targets (stdout) do not use circuit breakers.
When a circuit is open and a new delivery arrives, the engine marks the
delivery as `retrying` and schedules a retry timer for after the
remaining cooldown period. This ensures no deliveries are lost — they're
just delayed until the target is healthy again.
### Rate Limiting ### Rate Limiting
Global rate limiting middleware (e.g., per-IP throttling applied at the Global rate limiting middleware (e.g., per-IP throttling applied at the
@@ -606,7 +681,8 @@ webhooker/
│ ├── globals/ │ ├── globals/
│ │ └── globals.go # Build-time variables (appname, version, arch) │ │ └── globals.go # Build-time variables (appname, version, arch)
│ ├── delivery/ │ ├── delivery/
│ │ ── engine.go # Event-driven delivery engine (channel + timer based) │ │ ── engine.go # Event-driven delivery engine (channel + timer based)
│ │ └── circuit_breaker.go # Per-target circuit breaker for retry targets
│ ├── handlers/ │ ├── handlers/
│ │ ├── handlers.go # Base handler struct, JSON helpers, template rendering │ │ ├── handlers.go # Base handler struct, JSON helpers, template rendering
│ │ ├── auth.go # Login, logout handlers │ │ ├── auth.go # Login, logout handlers
@@ -764,6 +840,11 @@ linted, tested, and compiled.
Large bodies (≥16KB) are fetched from the per-webhook DB on demand. Large bodies (≥16KB) are fetched from the per-webhook DB on demand.
- [x] Database target type marks delivery as immediately successful - [x] Database target type marks delivery as immediately successful
(events are already in the per-webhook DB) (events are already in the per-webhook DB)
- [x] Parallel fan-out: all targets for an event are delivered
simultaneously in separate goroutines
- [x] Circuit breaker for retry targets: tracks consecutive failures
per target, opens after 5 failures (30s cooldown), half-open
probe to test recovery
### Remaining: Core Features ### Remaining: Core Features
- [ ] Per-webhook rate limiting in the receiver handler - [ ] Per-webhook rate limiting in the receiver handler

View File

@@ -0,0 +1,162 @@
package delivery
import (
"sync"
"time"
)
// CircuitState represents the current state of a circuit breaker.
type CircuitState int
const (
// CircuitClosed is the normal operating state. Deliveries flow through.
CircuitClosed CircuitState = iota
// CircuitOpen means the circuit has tripped. Deliveries are skipped
// until the cooldown expires.
CircuitOpen
// CircuitHalfOpen allows a single probe delivery to test whether
// the target has recovered.
CircuitHalfOpen
)
const (
// defaultFailureThreshold is the number of consecutive failures
// before a circuit breaker trips open.
defaultFailureThreshold = 5
// defaultCooldown is how long a circuit stays open before
// transitioning to half-open for a probe delivery.
defaultCooldown = 30 * time.Second
)
// CircuitBreaker implements the circuit breaker pattern for a single
// delivery target. It tracks consecutive failures and prevents
// hammering a down target by temporarily stopping delivery attempts.
//
// States:
// - Closed (normal): deliveries flow through; consecutive failures
// are counted.
// - Open (tripped): deliveries are skipped; a cooldown timer is
// running. After the cooldown expires the state moves to HalfOpen.
// - HalfOpen (probing): one probe delivery is allowed. If it
// succeeds the circuit closes; if it fails the circuit reopens.
type CircuitBreaker struct {
mu sync.Mutex
state CircuitState
failures int
threshold int
cooldown time.Duration
lastFailure time.Time
}
// NewCircuitBreaker creates a circuit breaker with default settings.
func NewCircuitBreaker() *CircuitBreaker {
return &CircuitBreaker{
state: CircuitClosed,
threshold: defaultFailureThreshold,
cooldown: defaultCooldown,
}
}
// Allow checks whether a delivery attempt should proceed. It returns
// true if the delivery should be attempted, false if the circuit is
// open and the delivery should be skipped.
//
// When the circuit is open and the cooldown has elapsed, Allow
// transitions to half-open and permits exactly one probe delivery.
func (cb *CircuitBreaker) Allow() bool {
cb.mu.Lock()
defer cb.mu.Unlock()
switch cb.state {
case CircuitClosed:
return true
case CircuitOpen:
// Check if cooldown has elapsed
if time.Since(cb.lastFailure) >= cb.cooldown {
cb.state = CircuitHalfOpen
return true
}
return false
case CircuitHalfOpen:
// Only one probe at a time — reject additional attempts while
// a probe is in flight. The probe goroutine will call
// RecordSuccess or RecordFailure to resolve the state.
return false
default:
return true
}
}
// CooldownRemaining returns how much time is left before an open circuit
// transitions to half-open. Returns zero if the circuit is not open or
// the cooldown has already elapsed.
func (cb *CircuitBreaker) CooldownRemaining() time.Duration {
cb.mu.Lock()
defer cb.mu.Unlock()
if cb.state != CircuitOpen {
return 0
}
remaining := cb.cooldown - time.Since(cb.lastFailure)
if remaining < 0 {
return 0
}
return remaining
}
// RecordSuccess records a successful delivery and resets the circuit
// breaker to closed state with zero failures.
func (cb *CircuitBreaker) RecordSuccess() {
cb.mu.Lock()
defer cb.mu.Unlock()
cb.failures = 0
cb.state = CircuitClosed
}
// RecordFailure records a failed delivery. If the failure count reaches
// the threshold, the circuit trips open.
func (cb *CircuitBreaker) RecordFailure() {
cb.mu.Lock()
defer cb.mu.Unlock()
cb.failures++
cb.lastFailure = time.Now()
switch cb.state {
case CircuitClosed:
if cb.failures >= cb.threshold {
cb.state = CircuitOpen
}
case CircuitHalfOpen:
// Probe failed — reopen immediately
cb.state = CircuitOpen
}
}
// State returns the current circuit state. Safe for concurrent use.
func (cb *CircuitBreaker) State() CircuitState {
cb.mu.Lock()
defer cb.mu.Unlock()
return cb.state
}
// String returns the human-readable name of a circuit state.
func (s CircuitState) String() string {
switch s {
case CircuitClosed:
return "closed"
case CircuitOpen:
return "open"
case CircuitHalfOpen:
return "half-open"
default:
return "unknown"
}
}

View File

@@ -104,6 +104,11 @@ type EngineParams struct {
// DeliveryTask so retries also avoid unnecessary DB reads. The database // DeliveryTask so retries also avoid unnecessary DB reads. The database
// stores delivery status for crash recovery only; on startup the engine // stores delivery status for crash recovery only; on startup the engine
// scans for interrupted deliveries and re-queues them. // scans for interrupted deliveries and re-queues them.
//
// All targets for a single event are delivered in parallel — each
// DeliveryTask is dispatched in its own goroutine for maximum fan-out
// speed. Retry targets are protected by a per-target circuit breaker
// that stops hammering a down target after consecutive failures.
type Engine struct { type Engine struct {
database *database.Database database *database.Database
dbManager *database.WebhookDBManager dbManager *database.WebhookDBManager
@@ -113,6 +118,11 @@ type Engine struct {
wg sync.WaitGroup wg sync.WaitGroup
notifyCh chan []DeliveryTask notifyCh chan []DeliveryTask
retryCh chan DeliveryTask retryCh chan DeliveryTask
// circuitBreakers stores a *CircuitBreaker per target ID. Only used
// for retry targets — HTTP, database, and log targets do not need
// circuit breakers because they either fire once or are local ops.
circuitBreakers sync.Map
} }
// New creates and registers the delivery engine with the fx lifecycle. // New creates and registers the delivery engine with the fx lifecycle.
@@ -347,10 +357,12 @@ func (e *Engine) recoverWebhookDeliveries(ctx context.Context, webhookID string)
} }
// processDeliveryTasks handles a batch of delivery tasks from the webhook // processDeliveryTasks handles a batch of delivery tasks from the webhook
// handler. In the happy path (body ≤ 16KB), the engine delivers without // handler. Each task is dispatched in its own goroutine for parallel
// reading from any database — it trusts the handler's inline data and // fan-out — all targets for a single event start delivering simultaneously.
// only touches the DB to record results. For large bodies (body > 16KB), // In the happy path (body ≤ 16KB), the engine delivers without reading
// the body is fetched from the per-webhook database on demand. // from any database — it trusts the handler's inline data and only touches
// the DB to record results. For large bodies (body > 16KB), the body is
// fetched once and shared across all goroutines in the batch.
func (e *Engine) processDeliveryTasks(ctx context.Context, tasks []DeliveryTask) { func (e *Engine) processDeliveryTasks(ctx context.Context, tasks []DeliveryTask) {
if len(tasks) == 0 { if len(tasks) == 0 {
return return
@@ -367,10 +379,25 @@ func (e *Engine) processDeliveryTasks(ctx context.Context, tasks []DeliveryTask)
return return
} }
// For the large-body case, we may need to fetch the event body once // For the large-body case, pre-fetch the event body once before
// for all tasks sharing the same event. Cache it here. // fanning out so all goroutines share the same data.
var fetchedBody *string var fetchedBody *string
if tasks[0].Body == nil {
var dbEvent database.Event
if err := webhookDB.Select("body").
First(&dbEvent, "id = ?", tasks[0].EventID).Error; err != nil {
e.log.Error("failed to fetch event body from database",
"event_id", tasks[0].EventID,
"error", err,
)
return
}
fetchedBody = &dbEvent.Body
}
// Fan out: spin up a goroutine per task for parallel delivery.
// Each goroutine is independent (fire-and-forget) and records its
// own result. No need to wait for all goroutines to finish.
for i := range tasks { for i := range tasks {
select { select {
case <-ctx.Done(): case <-ctx.Done():
@@ -378,8 +405,17 @@ func (e *Engine) processDeliveryTasks(ctx context.Context, tasks []DeliveryTask)
default: default:
} }
task := &tasks[i] task := tasks[i] // copy for goroutine closure safety
go func() {
e.deliverTask(ctx, webhookDB, &task, fetchedBody)
}()
}
}
// deliverTask prepares and executes a single delivery task. Called from
// a dedicated goroutine for parallel fan-out.
func (e *Engine) deliverTask(ctx context.Context, webhookDB *gorm.DB, task *DeliveryTask, fetchedBody *string) {
// Build Event from task data // Build Event from task data
event := database.Event{ event := database.Event{
Method: task.Method, Method: task.Method,
@@ -389,24 +425,17 @@ func (e *Engine) processDeliveryTasks(ctx context.Context, tasks []DeliveryTask)
event.ID = task.EventID event.ID = task.EventID
event.WebhookID = task.WebhookID event.WebhookID = task.WebhookID
if task.Body != nil { switch {
// Happy path: body inline, no DB read needed case task.Body != nil:
event.Body = *task.Body event.Body = *task.Body
} else { case fetchedBody != nil:
// Large body path: fetch from per-webhook DB (once per batch)
if fetchedBody == nil {
var dbEvent database.Event
if err := webhookDB.Select("body").
First(&dbEvent, "id = ?", task.EventID).Error; err != nil {
e.log.Error("failed to fetch event body from database",
"event_id", task.EventID,
"error", err,
)
continue
}
fetchedBody = &dbEvent.Body
}
event.Body = *fetchedBody event.Body = *fetchedBody
default:
e.log.Error("no body available for delivery task",
"delivery_id", task.DeliveryID,
"event_id", task.EventID,
)
return
} }
// Build Target from task data (no main DB query needed) // Build Target from task data (no main DB query needed)
@@ -430,7 +459,6 @@ func (e *Engine) processDeliveryTasks(ctx context.Context, tasks []DeliveryTask)
e.processDelivery(ctx, webhookDB, d, task) e.processDelivery(ctx, webhookDB, d, task)
} }
}
// processRetryTask handles a single delivery task fired by a retry timer. // processRetryTask handles a single delivery task fired by a retry timer.
// The task carries all data needed for delivery (same as the initial // The task carries all data needed for delivery (same as the initial
@@ -562,11 +590,15 @@ func (e *Engine) processWebhookPendingDeliveries(ctx context.Context, webhookID
targetMap[t.ID] = t targetMap[t.ID] = t
} }
// Fan out recovered deliveries in parallel — same as the normal
// delivery path, each task gets its own goroutine.
for i := range deliveries { for i := range deliveries {
select { select {
case <-ctx.Done(): case <-ctx.Done():
return return
default: default:
}
target, ok := targetMap[deliveries[i].TargetID] target, ok := targetMap[deliveries[i].TargetID]
if !ok { if !ok {
e.log.Error("target not found for delivery", e.log.Error("target not found for delivery",
@@ -579,7 +611,7 @@ func (e *Engine) processWebhookPendingDeliveries(ctx context.Context, webhookID
// Build task from DB data for the recovery path // Build task from DB data for the recovery path
bodyStr := deliveries[i].Event.Body bodyStr := deliveries[i].Event.Body
task := &DeliveryTask{ task := DeliveryTask{
DeliveryID: deliveries[i].ID, DeliveryID: deliveries[i].ID,
EventID: deliveries[i].EventID, EventID: deliveries[i].EventID,
WebhookID: webhookID, WebhookID: webhookID,
@@ -595,8 +627,10 @@ func (e *Engine) processWebhookPendingDeliveries(ctx context.Context, webhookID
AttemptNum: 1, AttemptNum: 1,
} }
e.processDelivery(ctx, webhookDB, &deliveries[i], task) d := deliveries[i] // copy for goroutine closure safety
} go func() {
e.processDelivery(ctx, webhookDB, &d, &task)
}()
} }
} }
@@ -683,6 +717,26 @@ func (e *Engine) deliverRetry(_ context.Context, webhookDB *gorm.DB, d *database
return return
} }
// Check the circuit breaker for this target before attempting delivery.
cb := e.getCircuitBreaker(task.TargetID)
if !cb.Allow() {
// Circuit is open — skip delivery, mark as retrying, and
// schedule a retry for after the cooldown expires.
remaining := cb.CooldownRemaining()
e.log.Info("circuit breaker open, skipping delivery",
"target_id", task.TargetID,
"target_name", task.TargetName,
"delivery_id", d.ID,
"cooldown_remaining", remaining,
)
e.updateDeliveryStatus(webhookDB, d, database.DeliveryStatusRetrying)
retryTask := *task
// Don't increment AttemptNum — this wasn't a real attempt
e.scheduleRetry(retryTask, remaining)
return
}
attemptNum := task.AttemptNum attemptNum := task.AttemptNum
// Attempt delivery immediately — backoff is handled by the timer // Attempt delivery immediately — backoff is handled by the timer
@@ -698,10 +752,14 @@ func (e *Engine) deliverRetry(_ context.Context, webhookDB *gorm.DB, d *database
e.recordResult(webhookDB, d, attemptNum, success, statusCode, respBody, errMsg, duration) e.recordResult(webhookDB, d, attemptNum, success, statusCode, respBody, errMsg, duration)
if success { if success {
cb.RecordSuccess()
e.updateDeliveryStatus(webhookDB, d, database.DeliveryStatusDelivered) e.updateDeliveryStatus(webhookDB, d, database.DeliveryStatusDelivered)
return return
} }
// Delivery failed — record failure in circuit breaker
cb.RecordFailure()
maxRetries := d.Target.MaxRetries maxRetries := d.Target.MaxRetries
if maxRetries <= 0 { if maxRetries <= 0 {
maxRetries = 5 // default maxRetries = 5 // default
@@ -727,6 +785,20 @@ func (e *Engine) deliverRetry(_ context.Context, webhookDB *gorm.DB, d *database
} }
} }
// getCircuitBreaker returns the circuit breaker for the given target ID,
// creating one if it doesn't exist yet. Circuit breakers are in-memory
// only and reset on restart (startup recovery rescans the DB anyway).
func (e *Engine) getCircuitBreaker(targetID string) *CircuitBreaker {
if val, ok := e.circuitBreakers.Load(targetID); ok {
cb, _ := val.(*CircuitBreaker) //nolint:errcheck // type is guaranteed by LoadOrStore below
return cb
}
fresh := NewCircuitBreaker()
actual, _ := e.circuitBreakers.LoadOrStore(targetID, fresh)
cb, _ := actual.(*CircuitBreaker) //nolint:errcheck // we only store *CircuitBreaker values
return cb
}
// deliverDatabase handles the database target type. Since events are already // deliverDatabase handles the database target type. Since events are already
// stored in the per-webhook database (that's the whole point of per-webhook // stored in the per-webhook database (that's the whole point of per-webhook
// databases), the database target simply marks the delivery as successful. // databases), the database target simply marks the delivery as successful.