Gareth Wilson

The Recovery Surge: A webhook failure mode worth planning for

When it comes to webhooks, Shopify is a beast. During peak periods like Black Friday, it can push 500,000 webhooks per second.

That's a lot at the best of times, but it gets even wilder when things go wrong. With just a little slowdown, huge volumes of events soon back up, and they arrive as a tidal wave when things recover. We saw this firsthand with yesterday's webhook pile-up.

On April 28, Shopify experienced a webhook delivery latency incident. Shopify's baseline latency normally sits around 2 seconds, with p90 between 3 and 5 seconds and occasional p99 spikes over 10 seconds. Yesterday, from around 11:00am UTC, we saw delays of minutes, and in some cases over an hour, lasting for around 8 hours.

Shopify webhook latency chart showing the April 28 incident

The Shopify community forums lit up. Posts on X started appearing. Merchants began noticing the symptoms before they understood the cause: orders not flowing into their fulfillment systems, inventory updates lagging, automations that should have fired sitting silently idle.

The bigger problem, though, is what happens next.

The shape of a recovery surge

When a webhook producer at Shopify's scale falls behind, it doesn't lose events, it queues them. Shopify's infrastructure is designed not to drop webhooks if it can avoid it. So when service was restored and the backlog started flushing, every downstream consumer received a tidal wave of events. Things have improved since prior outages: Shopify no longer sends the entire backlog at once. Even so, downstream consumers were soon getting slammed with huge volumes of events.

From our vantage point ingesting Shopify webhooks for thousands of merchants, the recovery looked nothing like normal traffic. We saw sustained volumes of events up to 3x higher than normal, peaking around 6PM as retries came tumbling in.

For systems built to handle Shopify's normal throughput, that recovery curve is the moment things break.

Why recovery surges break webhook handlers

A few patterns we've seen consistently across large-producer incidents explain why the recovery is harder to handle than the outage itself.

  • Webhooks have no flow control. HTTP doesn't have a backpressure mechanism. The producer doesn't know whether your consumer can keep up. When Shopify's queue drains, it drains at Shopify's pace, not yours.

  • Most handlers do real work in the HTTP request. If your endpoint queries a database, calls a third-party API, or writes to multiple downstream systems before returning 200, your tail latency under load goes up sharply. Under a recovery surge, request times balloon, your web tier saturates, and Shopify starts retrying webhooks you haven't even processed yet.

  • Retries compound the storm. When delivery fails because your endpoint is overloaded, Shopify retries with backoff, and those retries land on top of a backlog that is still flushing. Now you have the original surge plus a layer of retries stacked on it. The shape of the curve gets worse before it gets better.

  • Auto-scaling lags reality. If you're on serverless or autoscaling containers, your infrastructure responds to load, but with a delay measured in seconds to minutes. Recovery surges are often over before scaling catches up.

  • The events you do process may be out of order. A backlog flushing in roughly chronological order can still produce out-of-sequence delivery to your handler if your processing layer parallelises work. Order updates can arrive before the order itself, or cancellations can land before charges.
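
One defensive pattern, sketched below as a minimal TypeScript example, is last-write-wins keyed on the payload's updated_at timestamp: a stale update that arrives late gets dropped instead of clobbering newer state. The types and the in-memory map are illustrative assumptions, not part of Shopify's API.

```typescript
interface OrderEvent {
  id: number;         // Shopify order ID
  updated_at: string; // ISO 8601 timestamp from the payload
}

// Order ID -> last applied updated_at, in milliseconds. In production this
// version would be persisted alongside the record it guards.
const lastApplied = new Map<number, number>();

function applyOrderUpdate(event: OrderEvent): boolean {
  const incoming = Date.parse(event.updated_at);
  const current = lastApplied.get(event.id) ?? 0;

  // Drop anything older than what we've already applied: a stale update
  // arriving after a newer one is common when a backlog flushes in parallel.
  if (incoming <= current) return false;

  lastApplied.set(event.id, incoming);
  // ...apply the update to your own datastore here...
  return true;
}
```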

The result is a class of failures that look nothing like an outage in your monitoring: handlers timing out, queues backing up downstream of the webhook endpoint, partially-processed events scattered across logs. And because the problem starts when Shopify recovers, your dashboards are reporting "all systems normal" right as your business workflows are silently degrading.

What you can do about it

Most of what makes a system resilient to recovery surges is architectural. If you're handling Shopify webhooks yourself, here's what we'd recommend prioritising.

  • Decouple ingestion from processing. Your webhook endpoint should do one thing: validate the request signature, write the payload to a durable queue, and return 200. Everything else (database writes, third-party calls, business logic) happens asynchronously. This single change is the highest-leverage fix you can make. It's also the one most often skipped because the simple version (process inline, return 200) works fine in steady state. Recovery surges are when steady-state assumptions break.

  • Build for idempotency from day one. During a recovery, you will receive duplicate webhooks. Shopify will retry events that were delivered but timed out on your end. You will replay events from your own queue. Every webhook handler should be safe to run multiple times for the same event ID. Use the X-Shopify-Webhook-Id header as your idempotency key and reject duplicates explicitly.

  • Add backpressure to downstream systems. Decoupling ingestion only helps if your processing layer can refuse work it can't handle yet. A bounded queue, rate-limited consumers, or a worker pool that respects concurrency limits will smooth a 10x spike into a longer-but-survivable processing window. (A sketch follows after this list.)

  • Monitor webhook freshness, not just success rate. "We're returning 200 on every webhook" is a useless metric if the events are thirty minutes stale. Track the gap between created_at on the event and the time you processed it. Alert when that gap exceeds a threshold. This is how you detect upstream incidents before your customers do. (A sketch follows after this list.)

  • Reconcile critical workflows against the API. Webhooks are notifications. Shopify's REST and GraphQL APIs are the source of truth. For events where missing or delayed delivery has business impact (like orders, payments, inventory thresholds), run a periodic reconciliation job that compares what you've processed against what Shopify says happened. (A sketch follows after this list.)

  • Cap retry concurrency. If your handler is failing, retries should not be 1:1 with incoming events. Set a maximum concurrent retry count and a sensible backoff schedule. Otherwise your retry traffic alone can keep your endpoint saturated long after the original surge has passed.

  • Plan for the surge, not just the steady state. Load testing webhook handlers against normal volume isn't enough. Test them at 10x or 50x normal. Test them with a sudden onset, not a gradual ramp. The behaviour under burst load is what matters, and it's almost always different from sustained throughput.
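
To make the backpressure point concrete, here's a minimal sketch of a bounded in-process queue drained by a fixed worker pool. In production the queue would be durable (SQS, Kafka, or similar); MAX_QUEUE, CONCURRENCY, and processEvent are illustrative assumptions.

```typescript
const MAX_QUEUE = 10_000; // refuse work beyond this depth
const CONCURRENCY = 8;    // workers running at once

const queue: unknown[] = [];

// Bounded enqueue: the caller sees the refusal and can 429 or park the
// event, instead of letting unbounded work pile up in memory.
function enqueue(event: unknown): boolean {
  if (queue.length >= MAX_QUEUE) return false;
  queue.push(event);
  return true;
}

// Stand-in for your business logic: database writes, third-party calls, etc.
async function processEvent(event: unknown): Promise<void> {
  void event;
}

async function worker(): Promise<void> {
  while (true) {
    const event = queue.shift();
    if (event === undefined) {
      await new Promise<void>((resolve) => setTimeout(resolve, 100)); // idle poll
      continue;
    }
    try {
      await processEvent(event);
    } catch {
      // Re-queue failures only while there's headroom, so retry traffic
      // can't keep the queue saturated on its own.
      if (queue.length < MAX_QUEUE / 2) queue.push(event);
    }
  }
}

// A fixed pool turns a 10x spike into a longer-but-survivable window.
for (let i = 0; i < CONCURRENCY; i++) void worker();
```

The headroom check in the catch block is the same idea as capping retry concurrency: failed work re-enters the queue only when there's room for it.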
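
For freshness monitoring, a minimal sketch, assuming the payload carries a created_at field and that your metrics pipeline consumes structured logs; the threshold and metric name are placeholders to tune.

```typescript
const FRESHNESS_THRESHOLD_MS = 5 * 60 * 1000; // 5 minutes; tune per workflow

function recordFreshness(createdAt: string): void {
  const lagMs = Date.now() - Date.parse(createdAt);

  // Emit lag as a metric; watching p90/p99 drift upward catches an upstream
  // incident long before deliveries start failing outright.
  console.log(JSON.stringify({ metric: "webhook.freshness_ms", value: lagMs }));

  if (lagMs > FRESHNESS_THRESHOLD_MS) {
    console.error(`Webhook freshness breach: ${Math.round(lagMs / 1000)}s stale`);
    // ...page, or surface on a status dashboard, as appropriate...
  }
}
```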
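
And for reconciliation, a sketch of a periodic job against the REST Admin API. The orders.json listing with updated_at_min is Shopify's documented endpoint, but treat the API version string and findLocalOrder as assumptions to adapt to your own setup.

```typescript
const SHOP = process.env.SHOPIFY_SHOP!;   // e.g. "example.myshopify.com"
const TOKEN = process.env.SHOPIFY_TOKEN!; // Admin API access token

// Hypothetical lookup into your own datastore.
declare function findLocalOrder(id: number): Promise<boolean>;

async function reconcileOrders(sinceIso: string): Promise<void> {
  // Orders Shopify says changed since the window start.
  // (Pagination via Link headers omitted for brevity.)
  const url =
    `https://${SHOP}/admin/api/2024-01/orders.json` +
    `?status=any&updated_at_min=${encodeURIComponent(sinceIso)}`;

  const res = await fetch(url, {
    headers: { "X-Shopify-Access-Token": TOKEN },
  });
  const { orders } = (await res.json()) as { orders: { id: number }[] };

  for (const order of orders) {
    // Anything Shopify says happened that we never saw is a dropped or
    // still-delayed webhook; re-fetch it and run it through the normal path.
    if (!(await findLocalOrder(order.id))) {
      console.warn(`Missed order ${order.id}; scheduling backfill`);
    }
  }
}
```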

If you do nothing else from this list, do the first two. Decoupled ingestion plus idempotency handles not just recovery surges but the long tail of webhook problems: most outages and partial failures you'll encounter.
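
As a concrete sketch of those first two recommendations, here's what a decoupled, idempotent endpoint can look like, assuming an Express app. queue.publish stands in for a durable queue client (SQS, Pub/Sub, or similar), and the in-memory seen set would be Redis or a database in production.

```typescript
import crypto from "node:crypto";
import express from "express";

// Hypothetical durable queue client; swap in SQS, Pub/Sub, Kafka, etc.
declare const queue: { publish(payload: Buffer): Promise<void> };

const SECRET = process.env.SHOPIFY_WEBHOOK_SECRET!;
const seen = new Set<string>(); // use Redis or your DB in production
const app = express();

app.post(
  "/webhooks/shopify",
  express.raw({ type: "application/json" }),
  async (req, res) => {
    // 1. Validate the signature against the raw body.
    const digest = crypto
      .createHmac("sha256", SECRET)
      .update(req.body)
      .digest("base64");
    const header = req.get("X-Shopify-Hmac-Sha256") ?? "";
    const valid =
      header.length === digest.length &&
      crypto.timingSafeEqual(Buffer.from(digest), Buffer.from(header));
    if (!valid) return res.status(401).send("invalid signature");

    // 2. Reject duplicates using the webhook ID as the idempotency key.
    const webhookId = req.get("X-Shopify-Webhook-Id") ?? "";
    if (seen.has(webhookId)) return res.status(200).send("duplicate");

    // 3. Persist to the durable queue and return. No database writes,
    //    no third-party calls, no business logic in this request.
    await queue.publish(req.body);
    seen.add(webhookId); // mark seen only after the enqueue succeeds
    return res.status(200).send("ok");
  }
);

app.listen(3000);
```

Everything downstream then pulls from the queue at its own pace, which is where the bounded worker pool sketched above takes over.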

If you'd rather not manage this yourself

The architecture above may be well-understood, but it's also a meaningful amount of infrastructure to maintain, and the failure modes are subtle enough that most teams only get it right after an incident or two.

This is the gap Hookdeck closes. Our ingestion layer is built for exactly the surge pattern we saw yesterday: durable queueing absorbs the spike, retries are managed centrally, idempotency is enforced by default, and your processing layer receives traffic at a rate it can actually handle. Yesterday, our customers' Shopify webhooks were delayed only by however long Shopify's queue took to drain. None of them had to scramble through their own infrastructure during the recovery.

The other half of the problem is visibility. Most merchants didn't know there was a Shopify incident yesterday until customers complained. Status pages lag. Forum threads only surface if you happen to be looking. We built Hookdeck Radar for Shopify for this exact reason: it tracks Shopify webhook health continuously and shows you, in real time, when delivery is degrading. It's free, you don't need a Hookdeck account, and it would have flagged yesterday's incident the moment it started rather than when your support inbox filled up.

If you want to test how your own handler responds to a recovery surge before the next incident, the Hookdeck CLI is the easiest way to replay a burst of historical Shopify events into your local environment. It's the same workflow our customers use to validate their handlers under load.

A note for platforms that send webhooks

If you're a platform that sends webhooks to your own customers, this incident is also a lesson in how to recover. Smoothing recovery delivery (rate-limiting per consumer, spreading the backlog over a longer window, exposing queue depth so customers can prepare) is one of the things Hookdeck Outpost does for the platforms that use it. It's the kind of detail that's easy to miss until you're the producer in someone else's incident retrospective.

What yesterday flagged

The merchants and dev teams that came through yesterday cleanly weren't lucky; they had infrastructure designed for the burst, not just the baseline.

The architectural fixes are well within reach for any team willing to invest in them. The shortcut is using infrastructure that handles the pattern by default. Either way, the lesson from April 28 is the same one webhooks have been teaching for years: the protocol gives you no help when the producer falls behind. Whatever you put in place has to assume the next surge is coming.



If you handle Shopify webhooks and want a real-time view of delivery health, Hookdeck Radar is free and requires no signup. If you'd rather skip managing surge-tolerant ingestion, you can get started with Hookdeck in minutes.