feat: add retry with exponential backoff for notification delivery #87
Summary
Notification delivery was fire-and-forget: if Slack, Mattermost, or ntfy was temporarily down, messages were silently lost. This adds automatic retry with exponential backoff and jitter to all notification endpoints.
Changes
New file: `internal/notify/retry.go`
- `RetryConfig` struct with configurable max retries, base delay, max delay
- `backoff()` computes delay as `BaseDelay * 2^attempt`, capped at `MaxDelay`, with ±25% jitter
- `deliverWithRetry()` wraps any send function with the retry loop
- `sleepFn` for test determinism

Modified: `internal/notify/notify.go`
- `retryConfig` and `sleepFn` fields added to `Service`
- `dispatchNtfy`, `dispatchSlack`, `dispatchMattermost` wrap sends in `deliverWithRetry`

Modified: `internal/notify/export_test.go`
- `SetRetryConfig`, `SetSleepFunc`, `DeliverWithRetry`, `BackoffDuration`

New file: `internal/notify/retry_test.go`
- `deliverWithRetry` unit tests: first-attempt success, transient failure recovery, exhausted retries, context cancellation
- `SendNotification` tests: transient failure retries, all endpoints retry independently, permanent failure exhausts retries

Verification
- `make fmt` ✅
- `make check` (format + lint + tests + build) ✅
- `docker build .` ✅
- Tests use `httptest` servers

closes #62
Implementation Summary
This PR adds retry with exponential backoff for all notification endpoints, addressing issue #62.
Design Decisions
Retry wraps each endpoint independently — if ntfy is down but Slack is up, Slack delivers immediately while ntfy retries. Each dispatch goroutine runs its own retry loop.
Exponential backoff with jitter — delays grow as `BaseDelay × 2^attempt` (1s → 2s → 4s), capped at `MaxDelay` (10s), with ±25% random jitter to prevent a thundering herd.

Configurable via `RetryConfig` — defaults (3 retries, 1s base, 10s max) are sensible for webhook endpoints. The config is a struct field on `Service`, easily overridable for testing or future configuration.

Context-aware — the retry loop checks `ctx.Done()` between attempts, so graceful shutdown still works correctly.

Injectable sleep for testing — the `sleepFn` field allows tests to eliminate real delays, keeping the test suite fast.

What Changed

- `internal/notify/retry.go` — `RetryConfig`, `backoff()`, `deliverWithRetry()`
- `internal/notify/notify.go` — dispatch paths wrapped in `deliverWithRetry`
- `internal/notify/export_test.go` — test exports
- `internal/notify/retry_test.go` — retry tests

Verification

- `make fmt` ✅
- `make check` ✅ (all lints pass, all tests pass with race detector)
- `docker build .` ✅

✅ PASS — Code Review
PR #87: feat: notification retry with exponential backoff
Retry Logic
- `backoff(attempt)` correctly computes `BaseDelay × 2^attempt`, capped at `MaxDelay`, with ±25% uniform jitter. Correct exponential backoff.
- `deliverWithRetry` runs `MaxRetries + 1` total attempts (1 initial + N retries). On the final attempt it returns the error without sleeping — correct.
- The dispatchers (`dispatchNtfy`, `dispatchSlack`, `dispatchMattermost`) wrap their send calls in `deliverWithRetry` via goroutines with `context.WithoutCancel`, so notifications are not cancelled when the parent context ends. Good design for fire-and-forget delivery.

Context Cancellation

- The `select` on `ctx.Done()` vs `sleepFunc(delay)` correctly returns `ctx.Err()` if the context is cancelled during a backoff sleep. Verified by `TestDeliverWithRetryRespectsContextCancellation`.

Tests

- `httptest.NewServer` for webhook endpoint simulation — appropriate for notification delivery tests.
- The `instantSleep` test helper eliminates real delays cleanly via the `sleepFn` field indirection.
- Existing test files (`notify_test.go`, `delivery_test.go`) are completely untouched.

Policy Compliance

- No changes to `Makefile`, `.golangci.yml`, or any test assertions outside the new retry scope.
- `docker build .` passes (which runs `make check` = fmt + lint + test + build).
- The `export_test.go` extensions follow the existing pattern of exporting private methods for white-box testing.

Code Quality

- `RetryConfig.defaults()` handles the zero-value fallback cleanly. The `sleepFn` field indirection for test injection is a standard Go testing pattern. Good structured logging on retries with endpoint name, attempt count, and delay.

No issues found. Ready to merge.
the defaults are dumb. it should do 5 retries and default max delay should be 60 seconds
[manager] Rework dispatched — updating defaults to 5 retries and 60s max delay per sneak's feedback. Will push to this branch and request re-review when done.
@clawbot rework was dispatched 2 hours ago and no progress. pipeline worker, you should have noticed this and escalated to me after a reasonable number of reworker failures.
[manager] You're right — the previous reworker (160a8a05) died silently without pushing. I should have been monitoring for that and re-dispatched or escalated sooner. That's a process failure on my end.
New reworker dispatched (a59580da). The change is trivial (two constant values in `retry.go`), so this should complete quickly. I'll monitor actively this time.

@clawbot monitor actively EVERY time for ALL issues. that's your only job - move issues forward always 24/7. update your memory appropriately.
[manager] Three reworker agents have died on init for this task. Escalating with the concrete fix since it's two constants:
In `internal/notify/retry.go`:

Any tests referencing these default values would also need updating.
I'm investigating why the reworker agents keep failing to initialize — other repos' reworkers (e.g. vaultik) are completing fine, so it's specific to this dispatch. Apologies for the delay.
@clawbot you should be working on this