Implementing Webhook Retries
Delivering a webhook is an optimistic act. You're sending an HTTP request to a server you don't control, over a network you can't predict, and hoping it arrives intact. Most of the time it works. But when it doesn't — and it will eventually not work — your retry strategy determines whether that failed delivery is a minor blip or a lost event that silently breaks your customer's integration.
A well-designed retry system turns an inherently unreliable transport mechanism into something your users can depend on. A poorly designed one compounds the original failure with duplicate processing, stampeded servers, and frustrated developers digging through logs. This guide covers what it takes to get retries right, from the algorithmic foundations to the operational details that separate production-grade webhook infrastructure from best-effort fire-and-forget delivery.
Why Retries Exist
HTTP is not a guaranteed delivery protocol. Requests fail for dozens of reasons: a consumer's server is mid-deploy, a load balancer is rotating, a cloud region is experiencing degraded networking, or the receiving application simply crashed. These are transient conditions. The event itself is perfectly valid, but it arrived at a bad moment.
Without retries, every transient failure becomes permanent data loss. Your customer misses a payment confirmation, a shipping update, or a user provisioning event. They don't find out until something downstream breaks, and by then the trust damage is done.
Retries exist to bridge the gap between at-least-once delivery guarantees and the messy reality of distributed systems. They give the receiving system time to recover and give the event another chance to land. But the way you retry matters enormously, both in terms of the reliability of delivery and the health of the systems on both ends of the connection.
Exponential Backoff: Giving Endpoints Room to Breathe
When a delivery attempt fails, the natural impulse is to resend right away. For the very first retry, that makes sense: a brief network interruption may have already cleared. But if the second and third attempts also bounce, firing requests at the same pace starts doing more harm than good. An endpoint that's struggling under load doesn't benefit from a barrage of duplicate traffic.
Exponential backoff addresses this by spacing attempts further and further apart as failures accumulate. A representative schedule might look like:
| Attempt | Delay After Failure |
|---|---|
| 1 | Immediate |
| 2 | 5 seconds |
| 3 | 5 minutes |
| 4 | 30 minutes |
| 5 | 2 hours |
| 6 | 5 hours |
| 7 | 10 hours |
| 8 | 10 hours |
The early attempts catch transient glitches quickly. The later attempts accommodate longer outages (like a server migration, an expired TLS certificate, or a cloud provider incident) without burning through all your retries in the first few minutes. The total window in this example spans roughly 27 hours, giving the consumer a full business day to notice and fix whatever went wrong.
The key design choice is where to set the ceiling on your backoff interval. Too low and you exhaust your attempts before the consumer has time to respond. Too high and urgent events sit in limbo for an unreasonable period. Most production webhook systems cap individual retry intervals somewhere between 6 and 12 hours, with a total retry window spanning 1 to 3 days.
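To make this concrete, here is a minimal sketch of how such a schedule might be encoded on the provider side: an explicit lookup table mirroring the intervals above. The function name and the exhaustion behavior are illustrative, not prescriptive.

```typescript
// Retry delays in milliseconds, mirroring the schedule in the table above.
const RETRY_SCHEDULE_MS = [
  0,                   // attempt 1: immediate
  5 * 1000,            // attempt 2: 5 seconds
  5 * 60 * 1000,       // attempt 3: 5 minutes
  30 * 60 * 1000,      // attempt 4: 30 minutes
  2 * 60 * 60 * 1000,  // attempt 5: 2 hours
  5 * 60 * 60 * 1000,  // attempt 6: 5 hours
  10 * 60 * 60 * 1000, // attempt 7: 10 hours
  10 * 60 * 60 * 1000, // attempt 8: 10 hours
];

// Returns the delay before the given attempt (1-indexed), or null when the
// schedule is exhausted and the event should be dead-lettered instead.
function delayForAttempt(attempt: number): number | null {
  if (attempt < 1 || attempt > RETRY_SCHEDULE_MS.length) return null;
  return RETRY_SCHEDULE_MS[attempt - 1];
}
```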
Adding Jitter to Break Retry Collisions
However, deterministic backoff has a blind spot. Picture a consumer's server going offline at 2:00 PM while 10,000 webhook events are queued for delivery. Every one of those events fails its first attempt simultaneously, and because the backoff formula is the same for all of them, every one will retry at the exact same future moments: 2:00:05, then 2:05:05, then 2:35:05. The server comes back up and immediately absorbs a synchronized wall of requests that can push it right back into failure.
This stampede effect is why production retry systems introduce jitter — a controlled dose of randomness applied to each computed delay. If the backoff algorithm says "retry in 5 minutes," jitter might adjust that to anywhere between 3 minutes 30 seconds and 6 minutes 30 seconds for any given event. The result is that retries trickle in over a window instead of arriving as a coordinated burst.
The three most common approaches to jitter differ in how aggressively they randomize:
- Full jitter picks a random value anywhere between zero and the calculated delay. This maximizes spread but means some retries happen sooner than you might expect.
- Equal jitter uses the calculated delay as an anchor, randomizing only within the upper half of the range. You get reliable minimum spacing with enough variation to avoid collisions.
- Decorrelated jitter derives each interval from the one before it rather than from the attempt number, which naturally desynchronizes events that started failing at different times.
In practice, any jitter approach dramatically outperforms a bare backoff curve. The specific strategy matters less than the principle: no two events should retry on the same clock.
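As a sketch, each of the three variants can be expressed in a few lines. These follow the commonly cited formulations rather than any particular library's API, and the multipliers are illustrative.

```typescript
// Full jitter: retry anywhere between now and the full computed delay.
function fullJitter(delayMs: number): number {
  return Math.random() * delayMs;
}

// Equal jitter: keep half the delay as a guaranteed floor,
// randomize only within the upper half.
function equalJitter(delayMs: number): number {
  return delayMs / 2 + Math.random() * (delayMs / 2);
}

// Decorrelated jitter: derive each interval from the previous one
// (starting at baseMs), bounded below by the base and above by a cap.
function decorrelatedJitter(prevDelayMs: number, baseMs: number, capMs: number): number {
  const next = baseMs + Math.random() * (prevDelayMs * 3 - baseMs);
  return Math.min(capMs, next);
}
```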
Using Response Codes to Drive Retry Decisions
The HTTP status code a consumer returns is your single best signal for deciding what to do next. Treating all non-2xx responses identically creates avoidable problems.
The key distinction is between failures that are likely to resolve on their own and failures that won't change no matter how many times you resend. A 503 Service Unavailable almost always clears up within minutes; the server is alive but temporarily overloaded or under maintenance. A 401 Unauthorized, on the other hand, means the consumer's credentials are wrong or revoked. Sending the same request an hour later won't fix that.
In practical terms, 5xx status codes and connection-level failures (timeouts, refused connections, DNS errors) warrant retries. Most 4xx codes do not, with two important exceptions. A 408 Request Timeout suggests the server was simply too slow and may succeed on a subsequent attempt. A 429 Too Many Requests is the consumer actively telling you to back off; you should retry, but only after honoring whatever delay they specify.
When a consumer returns 429 with a Retry-After header, that value should override your backoff algorithm entirely. If the header says to wait 120 seconds, wait at least 120 seconds — even if your exponential schedule would have retried sooner. Ignoring explicit backpressure signals risks getting your traffic blocked at the consumer's infrastructure level.
Redirect responses (3xx) are a special case. Following redirects in webhook delivery opens security risks and complicates signature verification. The safer default is to treat them as non-retriable failures and surface the redirect URL to the consumer for endpoint reconfiguration.
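Taken together, the decision logic might look something like the sketch below. The classification mirrors the rules above; the type and function names are placeholders, and the Retry-After parsing handles only the delta-seconds form for brevity.

```typescript
type DeliveryOutcome =
  | { kind: "success" }
  | { kind: "retry"; overrideDelayMs?: number }
  | { kind: "dead_letter" };

// Maps an HTTP status (or a transport-level error) to a retry decision.
// retryAfterHeader is the raw Retry-After value, if the consumer sent one.
function classifyResponse(status: number | null, retryAfterHeader?: string): DeliveryOutcome {
  // Connection-level failures (timeout, refused connection, DNS) arrive as null.
  if (status === null) return { kind: "retry" };

  if (status >= 200 && status < 300) return { kind: "success" };

  // 429: retry, but honor the consumer's explicit backpressure signal.
  if (status === 429) {
    const seconds = retryAfterHeader ? parseInt(retryAfterHeader, 10) : NaN;
    return {
      kind: "retry",
      overrideDelayMs: Number.isFinite(seconds) ? seconds * 1000 : undefined,
    };
  }

  // 408: the server was too slow; a later attempt may succeed.
  if (status === 408) return { kind: "retry" };

  // 5xx: transient server-side failures are worth retrying.
  if (status >= 500) return { kind: "retry" };

  // Redirects and the remaining 4xx codes won't improve with repetition.
  return { kind: "dead_letter" };
}
```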
Handling Timeouts
Before any response code enters the picture, you need to decide how long to wait for a response at all. If the consumer's server accepts the connection but takes 60 seconds to process the webhook synchronously before responding, you're tying up resources on your end for the duration.
Most webhook providers set aggressive timeout windows — commonly between 5 and 15 seconds. This is deliberate. It encourages consumers to adopt asynchronous processing patterns: accept the webhook, persist it to a queue, return a 200 immediately, and handle the business logic in a background worker.
Whatever timeout you choose, though, document it clearly. If your consumers don't know they have 5 seconds to respond, they'll build synchronous handlers that intermittently time out under load, triggering retries for events they actually received.
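For consumers, the asynchronous pattern described above might look roughly like the Express handler below, where enqueue() is a stand-in for whatever queue or job system you actually use.

```typescript
import express from "express";

const app = express();
app.use(express.json());

// Hypothetical stand-in for your real queue (SQS, Redis, a jobs table, ...).
async function enqueue(payload: unknown): Promise<void> {
  /* persist the event for a background worker */
}

app.post("/webhooks", async (req, res) => {
  try {
    // Persist first, acknowledge immediately; do the real work elsewhere.
    await enqueue(req.body);
    res.status(200).end();
  } catch {
    // Signal a transient failure so the provider retries later.
    res.status(500).end();
  }
});

app.listen(3000);
```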
Dead-Letter Queues: A Safety Net
Every retry schedule has a last attempt. When it fails, the event has to go somewhere. Silently discarding it is the worst outcome — your customer loses data and has no way of knowing until something downstream breaks in a confusing way.
A dead-letter queue stores events that could not be delivered after all retry attempts, preserving the original payload alongside metadata about each failed attempt (timestamps, response codes, error messages). The value of a dead-letter queue depends entirely on what you build around it. A queue that events enter but never leave is functionally equivalent to dropping them.
Three capabilities turn a dead-letter queue from passive storage into an actual recovery tool. First, a replay mechanism that lets consumers re-trigger delivery for individual events or bulk-replay everything from a given date forward once they've fixed the underlying problem. Second, a retention policy that keeps dead-lettered events long enough to be useful — 30 days is a reasonable floor, though financial or compliance-sensitive integrations may need longer. Third, proactive notifications that alert the consumer as events accumulate in the queue, rather than waiting for them to notice missing data on their own.
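A rough sketch of what a dead-lettered event might carry, along with a replay hook, is shown below. The field names and the redeliver callback are placeholders for your own delivery pipeline.

```typescript
interface DeliveryAttempt {
  attemptedAt: string;         // ISO timestamp
  statusCode: number | null;   // null for connection-level failures
  error?: string;              // e.g. "connection timed out"
}

interface DeadLetteredEvent {
  eventId: string;
  endpointUrl: string;
  payload: unknown;            // the original, untouched body
  attempts: DeliveryAttempt[]; // full failure history
  deadLetteredAt: string;
}

// Hypothetical replay: push selected events back onto the live delivery queue.
async function replay(
  events: DeadLetteredEvent[],
  redeliver: (e: DeadLetteredEvent) => Promise<void>
): Promise<void> {
  for (const event of events) {
    await redeliver(event);
  }
}
```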
Endpoint Disabling: Protecting Both Sides
If a consumer's endpoint fails consistently over an extended period — say, every delivery attempt over several days — continuing to send traffic to it is wasteful for you and potentially harmful to them. Some failure modes, like a misconfigured firewall, won't resolve without human intervention.
Production webhook systems typically implement automatic endpoint disabling. After a sustained period of failures (often 3 to 5 days), the endpoint is marked inactive and no further deliveries are attempted until the consumer takes action.
The key to doing this well is communication. When you disable an endpoint, fire an operational notification (an email, a dashboard alert, a separate operational webhook, something!) so the consumer knows what happened and can re-enable the endpoint once they've fixed the issue. Silent disabling is nearly as bad as silent data loss.
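One way to implement this, sketched below under the assumption of a per-endpoint health record, is to track the start of the current failure streak and disable once it exceeds a threshold. The three-day cutoff and the notify callback are illustrative.

```typescript
interface EndpointHealth {
  lastSuccessAt: Date | null;
  firstFailureAt: Date | null; // start of the current failure streak
  disabled: boolean;
}

const DISABLE_AFTER_MS = 3 * 24 * 60 * 60 * 1000; // e.g. 3 days of continuous failures

function recordFailure(health: EndpointHealth, now: Date, notify: () => void): void {
  if (health.firstFailureAt === null) health.firstFailureAt = now;

  const failingFor = now.getTime() - health.firstFailureAt.getTime();
  if (!health.disabled && failingFor >= DISABLE_AFTER_MS) {
    health.disabled = true;
    notify(); // tell the consumer: email, dashboard alert, or an operational webhook
  }
}

function recordSuccess(health: EndpointHealth, now: Date): void {
  health.lastSuccessAt = now;
  health.firstFailureAt = null; // streak broken
}
```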
Idempotency: Accounting for Duplicates
Retries inherently produce a risk of duplicate delivery. The most common scenario: your system sends a webhook, the consumer processes it successfully, but their response is lost due to a network issue. From your perspective, the delivery failed, so you retry. From the consumer's perspective, they've now received the same event twice.
This is why idempotent processing is an essential companion to any retry strategy. As a webhook provider, you can support idempotency by including a unique event identifier in every delivery (typically as a header). Consumers can then deduplicate on their end by tracking which event IDs they've already processed.
Make this identifier stable across retries — the same event should carry the same ID whether it's the first attempt or the fifth. This gives consumers a reliable key to detect and discard duplicates.
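On the consumer side, deduplication can be as simple as keying processed work on that identifier. The sketch below uses an in-memory set for brevity; a real handler would persist processed IDs, for example in a database column with a unique constraint.

```typescript
// Illustrative only: a production implementation would persist processed IDs
// rather than keep them in memory.
const processedEventIds = new Set<string>();

async function handleWebhook(eventId: string, payload: unknown): Promise<void> {
  if (processedEventIds.has(eventId)) {
    // Duplicate delivery caused by a retry: acknowledge and skip the work.
    return;
  }
  await processEvent(payload); // your actual business logic
  processedEventIds.add(eventId);
}

async function processEvent(payload: unknown): Promise<void> {
  /* ... */
}
```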
Letting Consumers Own Their Retry Policy
No single retry schedule fits every use case. A payment processing integration might need aggressive retries over a short window because stale transaction data is worthless. A CRM sync might prefer gentler retries spread across a longer period because the data remains valid for days.
Where possible, expose retry configuration to your consumers. This might include the number of retry attempts, the backoff multiplier, the maximum retry interval, or the total retry window. Even offering a choice between two or three preset policies (e.g., "aggressive," "standard," "relaxed") is more useful than a single hardcoded schedule.
This flexibility signals to your consumers that you've thought seriously about the diverse ways webhooks get used in production — and it reduces the support burden of fielding requests from users whose needs don't match your defaults.
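Presets could be expressed as a handful of named configurations. The numbers below are illustrative placeholders, not recommendations.

```typescript
interface RetryPolicy {
  maxAttempts: number;
  initialDelayMs: number;
  backoffMultiplier: number;
  maxDelayMs: number;
}

// Illustrative presets: the exact values should reflect your own delivery SLAs.
const RETRY_PRESETS: Record<"aggressive" | "standard" | "relaxed", RetryPolicy> = {
  aggressive: { maxAttempts: 10, initialDelayMs: 1_000,  backoffMultiplier: 2, maxDelayMs: 10 * 60 * 1000 },
  standard:   { maxAttempts: 8,  initialDelayMs: 5_000,  backoffMultiplier: 4, maxDelayMs: 10 * 60 * 60 * 1000 },
  relaxed:    { maxAttempts: 6,  initialDelayMs: 60_000, backoffMultiplier: 5, maxDelayMs: 24 * 60 * 60 * 1000 },
};
```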
Circuit Breakers: Stop the Bleeding
Per-event retries treat each webhook delivery as an independent problem. That works well for isolated failures, but falls apart when an endpoint is genuinely down. If a consumer's server has been unreachable for an hour, queuing up hundreds of individual retry chains for every event during that window wastes delivery workers and memory without moving anything closer to success.
Circuit breakers operate at the endpoint level rather than the event level. They track the recent failure rate for a given destination and, when that rate crosses a configurable threshold, short-circuit all pending and new deliveries to that endpoint. Instead of attempting delivery and waiting for a timeout, the system immediately diverts events to error recovery and queues them for later replay.
After a cooldown window — typically 30 to 120 seconds — the circuit breaker allows a single probe request through to test whether the endpoint has recovered. A successful probe closes the breaker and resumes normal delivery. A failed probe restarts the cooldown.
The critical design decision is scoping. Circuit breakers must be implemented per consumer endpoint. A global circuit breaker that trips when any single customer's endpoint degrades would halt delivery to every healthy customer on your platform — the opposite of fault isolation. Per-endpoint breakers contain the blast radius so that one broken integration can't starve delivery capacity for everyone else.
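A per-endpoint breaker reduces to a small state machine. The sketch below is one possible shape; the failure threshold and cooldown are configuration knobs, and the values shown are placeholders.

```typescript
type BreakerState = "closed" | "open" | "half_open";

class EndpointCircuitBreaker {
  private state: BreakerState = "closed";
  private recentFailures = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold = 20, // consecutive failures before tripping
    private cooldownMs = 60_000    // e.g. 60 seconds before allowing a probe
  ) {}

  // Should we attempt delivery to this endpoint right now?
  allowRequest(now: number): boolean {
    if (this.state === "closed") return true;
    if (this.state === "open" && now - this.openedAt >= this.cooldownMs) {
      this.state = "half_open"; // allow a single probe request through
      return true;
    }
    // Still cooling down, or a probe is already in flight.
    return false;
  }

  recordSuccess(): void {
    this.state = "closed";
    this.recentFailures = 0;
  }

  recordFailure(now: number): void {
    this.recentFailures += 1;
    const tripped =
      this.state === "half_open" || this.recentFailures >= this.failureThreshold;
    if (tripped) {
      this.state = "open"; // stop delivering; divert events for later replay
      this.openedAt = now;
      this.recentFailures = 0;
    }
  }
}
```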
Documenting Your Retry Behavior
Retry policies are only useful if your consumers know about them. Developers integrating your webhooks need clear answers to specific questions:
- What is the exact retry schedule? Not "we retry several times" — the actual intervals.
- Which response codes are treated as failures? Are redirects followed or treated as errors?
- How long is the total retry window before an event is dead-lettered?
- What happens to the endpoint after prolonged failures?
- Is there a way to manually replay failed events?
- What timeout window does the consumer have to respond?
Ambiguity here forces consumers to reverse-engineer your behavior through trial and error. Detailed documentation, by contrast, lets them build their handlers correctly from the start and reduces support tickets on both sides.
Monitoring and Observability
A retry system you can't observe is a retry system you can't trust. Instrument your retry infrastructure to track at minimum:
- First-attempt success rate. What percentage of deliveries succeed without any retries? A declining rate may indicate a systemic issue with a consumer's endpoint or with your own delivery infrastructure.
- Retry distribution. How many events require 1, 2, 3, or more retries? A spike in events needing many retries suggests a consumer experiencing sustained problems.
- Dead-letter rate. How many events exhaust all retry attempts? This is your most important reliability signal.
- Time to successful delivery. For events that eventually succeed after retries, how long does the full cycle take? This tells you how much latency your retry system introduces.
Surface these metrics both internally (for your ops team) and externally (for your consumers via a dashboard or API). Consumers who can see their own delivery success rates are far better equipped to diagnose and fix integration issues.
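One lightweight way to capture these signals is to record a small set of counters per delivery attempt. The structure and names below are illustrative and would normally feed whatever metrics backend you already run.

```typescript
interface DeliveryMetrics {
  totalEvents: number;
  firstAttemptSuccesses: number;
  retriesPerEvent: Map<string, number>; // eventId -> retries needed so far
  deadLettered: number;
  deliveryLatenciesMs: number[];        // time from first attempt to success
}

function recordAttempt(
  m: DeliveryMetrics,
  eventId: string,
  attempt: number,
  succeeded: boolean,
  elapsedSinceFirstAttemptMs: number
): void {
  if (attempt === 1) m.totalEvents += 1;
  if (succeeded && attempt === 1) m.firstAttemptSuccesses += 1;
  if (attempt > 1) m.retriesPerEvent.set(eventId, attempt - 1);
  if (succeeded) m.deliveryLatenciesMs.push(elapsedSinceFirstAttemptMs);
}
```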
Putting It All Together
Each of the patterns described here solves a specific failure mode. But their real power comes from the way they interlock. Backoff without jitter still creates traffic spikes. Jitter without response-code awareness wastes retries on permanent failures. And even the smartest delivery logic still loses events if there is no dead-letter queue to catch what slips through.
Building a resilient retry system means thinking about these patterns as a single interconnected mechanism rather than a checklist of independent features. When one layer fails to catch a problem, the next layer should. When the automated systems are finally exhausted, the operational tooling should hand the problem to a human in a state they can act on.
The goal isn't perfection. Distributed systems will always find new ways to fail. The goal is to make failures recoverable, visible, and bounded, so that a momentary network issue stays a momentary network issue instead of becoming a customer-facing incident.
Gain control over your webhooks
Try Hookdeck to handle your webhook security, observability, queuing, routing, and error recovery.