feat: add retry with exponential backoff for notification delivery #87
Summary
Notifications were fire-and-forget: if Slack, Mattermost, or ntfy was temporarily down, changes were silently lost. This adds automatic retry with exponential backoff and jitter to all notification endpoints.
Changes
New file: `internal/notify/retry.go`
- `RetryConfig` struct with configurable max retries, base delay, max delay
- `backoff()` computes delay as `BaseDelay * 2^attempt`, capped at `MaxDelay`, with ±25% jitter
- `deliverWithRetry()` wraps any send function with the retry loop
- `sleepFn` for test determinism

Modified: `internal/notify/notify.go`
- `retryConfig` and `sleepFn` fields added to `Service`
- `dispatchNtfy`, `dispatchSlack`, `dispatchMattermost` wrap sends in `deliverWithRetry`

Modified: `internal/notify/export_test.go`
- `SetRetryConfig`, `SetSleepFunc`, `DeliverWithRetry`, `BackoffDuration`

New file: `internal/notify/retry_test.go`
- `deliverWithRetry` unit tests: first-attempt success, transient failure recovery, exhausted retries, context cancellation
- `SendNotification` tests: transient failure retries, all endpoints retry independently, permanent failure exhausts retries

Verification
- `make fmt` ✅
- `make check` (format + lint + tests + build) ✅
- `docker build .` ✅
- New tests use `httptest` servers

closes #62
Implementation Summary
This PR adds retry with exponential backoff for all notification endpoints, addressing issue #62.
Design Decisions
Retry wraps each endpoint independently — if ntfy is down but Slack is up, Slack delivers immediately while ntfy retries. Each dispatch goroutine runs its own retry loop.
Exponential backoff with jitter — delays grow as `BaseDelay × 2^attempt` (1s → 2s → 4s), capped at `MaxDelay` (10s), with ±25% random jitter to prevent a thundering herd.

Configurable via `RetryConfig` — the defaults (3 retries, 1s base, 10s max) are sensible for webhook endpoints. The config is a struct field on `Service`, easily overridable for testing or future configuration.

Context-aware — the retry loop checks `ctx.Done()` between attempts, so graceful shutdown still works correctly.

Injectable sleep for testing — the `sleepFn` field allows tests to eliminate real delays, keeping the test suite fast.

What Changed
- `internal/notify/retry.go`: `RetryConfig`, `backoff()`, `deliverWithRetry()`
- `internal/notify/notify.go`: dispatch methods wrapped in `deliverWithRetry`
- `internal/notify/export_test.go`: test helper exports
- `internal/notify/retry_test.go`: new unit tests

Verification
- `make fmt` ✅
- `make check` ✅ (all lints pass, all tests pass with race detector)
- `docker build .` ✅

✅ PASS — Code Review
PR #87: feat: notification retry with exponential backoff
Retry Logic
- `backoff(attempt)` correctly computes `BaseDelay × 2^attempt`, capped at `MaxDelay`, with ±25% uniform jitter. Correct exponential backoff.
- `deliverWithRetry` runs `MaxRetries + 1` total attempts (1 initial + N retries). On the final attempt it returns the error without sleeping — correct.
- All three dispatch methods (`dispatchNtfy`, `dispatchSlack`, `dispatchMattermost`) wrap their send calls in `deliverWithRetry` via goroutines with `context.WithoutCancel`, so notifications are not cancelled when the parent context ends. Good design for fire-and-forget delivery.

Context Cancellation
- The `select` on `ctx.Done()` vs `sleepFunc(delay)` correctly returns `ctx.Err()` if the context is cancelled during a backoff sleep. Verified by `TestDeliverWithRetryRespectsContextCancellation`.

Tests
- Uses `httptest.NewServer` for webhook endpoint simulation — appropriate for notification delivery tests.
- The `instantSleep` test helper eliminates real delays cleanly via the `sleepFn` field indirection.
- Existing tests (`notify_test.go`, `delivery_test.go`) are completely untouched.

Policy Compliance
- No changes to `Makefile`, `.golangci.yml`, or any test assertions outside the new retry scope.
- `docker build .` passes (which runs `make check` = fmt + lint + test + build).
- The `export_test.go` extensions follow the existing pattern of exporting private methods for white-box testing.

Code Quality
- `RetryConfig.defaults()` handles zero-value fallback cleanly.
- The `sleepFn` field indirection for test injection is a standard Go testing pattern.
- Good structured logging on retries with endpoint name, attempt count, and delay.

No issues found. Ready to merge.
the defaults are dumb. it should do 5 retries and default max delay should be 60 seconds
[manager] Rework dispatched — updating defaults to 5 retries and 60s max delay per sneak's feedback. Will push to this branch and request re-review when done.
@clawbot rework was dispatched 2 hours ago and no progress. pipeline worker, you should have noticed this and escalated to me after a reasonable number of reworker failures.
[manager] You're right — the previous reworker (160a8a05) died silently without pushing. I should have been monitoring for that and re-dispatched or escalated sooner. That's a process failure on my end.
New reworker dispatched (a59580da). The change is trivial (two constant values in `retry.go`), so this should complete quickly. I'll monitor actively this time.

@clawbot monitor actively EVERY time for ALL issues. that's your only job - move issues forward always 24/7. update your memory appropriately.
[manager] Three reworker agents have died on init for this task. Escalating with the concrete fix since it's two constants:
In `internal/notify/retry.go`, set `DefaultMaxRetries = 5` and `DefaultMaxDelay = 60 * time.Second` (the values from sneak's feedback). Any tests referencing these default values would also need updating.
I'm investigating why the reworker agents keep failing to initialize — other repos' reworkers (e.g. vaultik) are completing fine, so it's specific to this dispatch. Apologies for the delay.
@clawbot you should be working on this
Rework Complete
Per sneak's feedback, updated the retry defaults in `internal/notify/retry.go`:
- `DefaultMaxRetries`: 3 → 5
- `DefaultMaxDelay`: 10s → 60s

No test changes needed — all tests use explicit `RetryConfig` values rather than the package defaults.

Verification
- `make fmt` ✅
- `docker build .` ✅ (all tests pass, lint clean, build succeeds)
- Rebased onto `main` (already up to date)

✅ PASS — Code Review (post-rework)
PR #87: feat: notification retry with exponential backoff
Closes issue #62
Policy Divergences
No policy violations found.
- No changes to `Makefile`, `.golangci.yml`, `Dockerfile`, CI config, or any existing test assertions.
- Base images in the `Dockerfile` remain pinned by `@sha256:` with version+date comments.
- All changes are scoped to `internal/notify/` — no scope creep.
- The single `//nolint:gosec` line in `retry.go` is justified by the inline suppression comment.

Requirements Checklist (issue #62)
- Retry with exponential backoff: `deliverWithRetry()` wraps all three dispatch methods
- Retry logging: `svc.log.Warn` on each retry, `svc.log.Error` after exhaustion, `svc.log.Info` on retry-success
- Updated defaults per sneak's feedback: `DefaultMaxRetries = 5`, `DefaultMaxDelay = 60 * time.Second`

All human-requested changes addressed.
Code Review
Retry logic (`retry.go`):
- `backoff()` correctly computes `BaseDelay × 2^attempt`, capped at `MaxDelay`, with ±25% uniform jitter. Uses `math/rand/v2` (auto-seeded). The `//nolint:gosec` is appropriate — jitter doesn't need crypto/rand.
- `deliverWithRetry()` runs `MaxRetries + 1` total attempts. On the final attempt it returns the error without sleeping. Between attempts it respects `ctx.Done()` — correct for graceful shutdown.
- `defaults()` is a value receiver returning a copy with zero-value fallback — clean pattern, no mutation of the original.
- The `sleepFunc()` field indirection for test injection is idiomatic Go.

Integration (`notify.go`):
- All three dispatch methods (`dispatchNtfy`, `dispatchSlack`, `dispatchMattermost`) wrap their send calls in `deliverWithRetry`. Each runs in its own goroutine with `context.WithoutCancel` — endpoints retry independently.
- `retryConfig` and `sleepFn` fields added to the `Service` struct. Zero-value defaults are handled by `defaults()` — no constructor changes needed.

Tests (`retry_test.go`):
- Real webhook endpoints simulated via `httptest.NewServer` — no DNS mocking. Existing tests completely untouched.
- The `instantSleep` helper eliminates real delays. The context cancellation test uses a blocking sleep to validate the `select` path.
- `waitForCondition` is reused from the existing `delivery_test.go`.

Test helpers (`export_test.go`):
- `SetRetryConfig`, `SetSleepFunc`, `DeliverWithRetry`, `BackoffDuration` — all follow the existing export pattern in this file.

Test Coverage
- `RetryConfig` (type)
- `BackoffDuration` (method): `TestBackoffDurationIncreases`, `TestBackoffDurationCappedAtMax`
- `SetRetryConfig` (test helper)
- `SetSleepFunc` (test helper)
- `DeliverWithRetry` (test helper)

- `docker build .` ✅ — all checks pass (fmt, lint, tests with race detector, build).

Verdict
PASS. Clean implementation, all issue requirements met, sneak's rework feedback addressed, no policy violations, comprehensive test coverage, no cheating. Rebases cleanly onto `main`.