Common Mistakes with Outbound Webhooks (And How to Avoid Them)

Webhooks look simple from the provider side. An event happens, you fire an HTTP request, and the consumer handles it.

But "simple" webhook implementations have a way of becoming the most expensive support burden on your platform. Every shortcut you take (skipping signatures, ignoring retries, using inconsistent payloads) lands directly on the desks of the developers integrating with your system. They file tickets. They write angry forum posts. They build brittle workarounds that break on the next deploy. And eventually, they evaluate your competitors.

We've spent years studying and documenting webhook implementations across 60+ platforms, from Stripe, Shopify, and GitHub to GitLab, Adyen, PayPal, and more. The same problems come up over and over. This post catalogs the most common mistakes we see webhook providers make and offers concrete guidance on what to do instead.

Aggressive timeouts that punish real-world infrastructure

The most universal complaint from webhook consumers is timeout windows that are too short for anything beyond the most trivial processing. Shopify and Paddle give endpoints just 5 seconds to respond. GitLab, Adyen, and Bitbucket allow 10 seconds. These windows include network latency, TLS handshake time, and any cold-start delay if the consumer runs on serverless infrastructure.

Five seconds sounds generous until you consider that an AWS Lambda cold start can eat 2–3 seconds before your code even executes. Add a database write and a downstream API call, and you're well past the limit. The webhook fails, retries kick in, and if enough retries fail, the endpoint gets disabled — silently, in many cases.

The fix isn't necessarily longer timeouts; providers have legitimate reasons to protect their infrastructure from slow consumers. The fix is designing your system so that consumers don't need to do heavy processing synchronously.

What to do instead: If you enforce strict timeouts, pair them with a generous retry policy and clear documentation about expected response times. Give consumers explicit guidance to acknowledge receipt quickly and process asynchronously. Better yet, support delivery to message queues (SQS, Pub/Sub, RabbitMQ) so consumers don't need an always-on HTTP endpoint at all.
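
To make the "acknowledge fast, process asynchronously" guidance concrete, here is a minimal consumer-side sketch in TypeScript using Express. The in-memory queue and polling loop are placeholders for whatever durable mechanism a consumer actually uses (SQS, a job queue, a database outbox).

```typescript
import express from "express";

const app = express();
app.use(express.json());

// Stand-in for a durable queue (SQS, BullMQ, a database outbox, etc.).
const queue: unknown[] = [];

app.post("/webhooks", (req, res) => {
  queue.push(req.body);   // persist the event somewhere durable
  res.status(200).end();  // acknowledge well inside the provider's timeout
});

// Heavy processing happens outside the request/response cycle.
setInterval(() => {
  const event = queue.shift();
  if (event) {
    // database writes, downstream API calls, etc.
  }
}, 100);

app.listen(3000);
```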

Automatic disabling that silently breaks integrations

This is arguably the most damaging pattern in the webhook ecosystem. A consumer's endpoint goes down for a routine deployment. Retries accumulate. The provider hits its failure threshold and disables the webhook subscription entirely. The consumer's system keeps running, unaware that it's no longer receiving events. Hours or days later, someone notices a data gap: orders missing, payments unreconciled, users out of sync.

The thresholds vary. Shopify removes webhook subscriptions after just 8 consecutive failures over 4 hours. WooCommerce disables after 5 failures. GitLab has a two-tier system: temporary disabling after 4 consecutive failures or permanent disabling after 40 total failures. Stripe is more forgiving, waiting 3 days of continuous failures before disabling, but even that window can be exceeded during extended incidents.

The real problem isn't the disabling itself but the lack of visibility. Most providers send an email notification at best. Many don't surface the disabled state prominently in their dashboards. Consumers who don't check their webhook configuration regularly won't know anything is wrong until the downstream effects become visible.

What to do instead: If you must auto-disable, make the threshold generous and the notifications impossible to miss. Provide a dedicated health status endpoint or dashboard that consumers can monitor programmatically. Consider a "degraded" state that slows delivery rather than stopping it entirely. And always provide a way to bulk-replay events that were missed during the disabled period.
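
One way to think about the "degraded" state is as an intermediate step in a per-endpoint health machine. The sketch below is illustrative only; the thresholds are placeholders, not recommendations.

```typescript
type EndpointState = "healthy" | "degraded" | "disabled";

interface EndpointHealth {
  consecutiveFailures: number;
  state: EndpointState;
}

// Illustrative provider-side transition logic: slow delivery down long
// before giving up, and notify the consumer at every state change.
function recordDelivery(health: EndpointHealth, succeeded: boolean): EndpointHealth {
  if (succeeded) return { consecutiveFailures: 0, state: "healthy" };

  const failures = health.consecutiveFailures + 1;
  if (failures >= 500) return { consecutiveFailures: failures, state: "disabled" };
  if (failures >= 20) return { consecutiveFailures: failures, state: "degraded" };
  return { consecutiveFailures: failures, state: "healthy" };
}
```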

Retry policies that guarantee data loss

Some providers barely retry at all. Bitbucket Cloud attempts only 2 retries with no exponential backoff, then triggers circuit-breaking after 5 failures. SparkPost and Mailgun retry over an 8-hour window and then the event is gone forever.

Compare this to Stripe, which retries with exponential backoff over 3 days in live mode, or Adyen, which queues events for up to 30 days. The difference in reliability for consumers is enormous.

When retries are exhausted, most providers simply discard the event. There's no dead letter queue, no archive, no way to retrieve what was missed. The data is gone. Consumers who experience even a brief outage during a period of low retry tolerance lose events permanently, with no way to reconcile except by polling the provider's API (if it even supports the right queries) and manually backfilling.

What to do instead: Retry with exponential backoff over a period measured in days, not hours. Provide a way for consumers to manually retry or replay events after failures, either through a dashboard, an API, or both. Maintain an event log that consumers can query to identify gaps. Never silently discard events without giving the consumer a recovery path.
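
As a rough sketch, a retry schedule along these lines (exponential backoff with jitter, capped so the total window spans days) is straightforward to implement. The constants here are illustrative, not prescriptive.

```typescript
const BASE_DELAY_MS = 30_000;             // 30 seconds
const MAX_DELAY_MS = 6 * 60 * 60 * 1000;  // cap individual waits at 6 hours
const MAX_ATTEMPTS = 20;                  // roughly a 3-day total window

// Returns the delay before the next attempt, or null when retries are
// exhausted and the event should go to a dead letter store instead.
function nextRetryDelay(attempt: number): number | null {
  if (attempt >= MAX_ATTEMPTS) return null;
  const exponential = BASE_DELAY_MS * 2 ** attempt;
  const capped = Math.min(exponential, MAX_DELAY_MS);
  const jitter = Math.random() * 0.2 * capped; // spread retries to avoid thundering herds
  return capped + jitter;
}
```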

Duplicate delivery without idempotency support

Every webhook system delivers duplicates. Retries after network timeouts, infrastructure failovers, race conditions in distributed systems — at-least-once delivery is a fact of life. The problem isn't that duplicates happen; it's that many providers don't give consumers the tools to handle them.

The minimum requirement is a stable, globally unique event ID that remains the same across retries. Consumers store these IDs and skip events they've already processed. Without this, duplicates cause real damage: double charges, duplicate emails, repeated inventory adjustments, corrupted state.
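
A consumer-side handler built on that event ID can be as simple as the sketch below. In production the "seen" store would be a database table or cache with a unique constraint rather than an in-memory set, so concurrent deliveries can't slip past the check.

```typescript
const processedEventIds = new Set<string>(); // stand-in for a durable store

async function handleWebhook(event: { id: string; type: string; data: unknown }) {
  if (processedEventIds.has(event.id)) {
    return; // already handled on a previous delivery attempt
  }
  // ...apply the event: charge, email, inventory adjustment, etc.
  processedEventIds.add(event.id);
}
```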

Some providers make this worse than it needs to be. WooCommerce, for instance, fires multiple order.updated webhooks for a single logical change, generating genuinely distinct events (not retries) that contain the same effective information. GitLab fires duplicate events when both group-level and project-level webhooks are configured for the same repository. These aren't retry duplicates; they're architectural duplicates that no amount of event-ID tracking will catch.

What to do instead: Include a stable event ID in every payload. Send a separate delivery ID (as a header) that changes per attempt, so consumers can distinguish retries from genuinely distinct events. Document your at-least-once semantics clearly, and provide guidance on implementing idempotent handlers. Avoid architectural patterns that generate multiple events for the same logical change.

No guaranteed ordering (and no tools to cope)

Webhook providers overwhelmingly do not guarantee event ordering. Stripe says so explicitly. Shopify, Paddle, and most others do the same. This means subscription.deleted can arrive before subscription.created, order.updated before order.created, and CAPTURE before AUTHORISATION.

The underlying reason is sound: distributed systems with retry queues and multiple delivery workers can't maintain strict ordering without significant performance costs. But most providers stop at "we don't guarantee ordering" without providing the metadata consumers need to reconstruct order on their end.

At minimum, every event should include a timestamp reflecting when the event actually occurred (not when it was dispatched or delivered). Ideally, update events should include a version number or sequence counter so consumers can detect and discard stale updates. Some providers go further; Stripe, for instance, includes a pending_webhooks count that helps consumers understand event backlogs.

What to do instead: Include an accurate occurred_at timestamp in every event. For resources that go through state transitions, consider including a version or sequence number that increments with each change. Document the ordering semantics so consumers know to build with eventual consistency in mind. Provide enough metadata that consumers can implement their own ordering logic when they need it.
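
If the provider does include a version or sequence number, stale-update detection on the consumer side is a few lines of bookkeeping. The field names below are illustrative.

```typescript
interface ResourceEvent {
  resourceId: string;
  version: number;     // increments with each change to the resource
  occurredAt: string;  // ISO 8601 timestamp of when the change happened
  data: unknown;
}

const latestVersionSeen = new Map<string, number>();

// Apply an update only if it's newer than anything already processed.
function applyIfNewer(event: ResourceEvent): boolean {
  const current = latestVersionSeen.get(event.resourceId) ?? 0;
  if (event.version <= current) return false; // out-of-order or duplicate; discard
  latestVersionSeen.set(event.resourceId, event.version);
  // ...update local state from event.data
  return true;
}
```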

Limited logging and no replay capability

Debugging webhook integrations is difficult in the best of circumstances. It becomes nearly impossible when the provider's logging is ephemeral and there's no way to replay events.

Bitbucket's webhook request history is available for only 12 hours, and you have to manually enable it before the problem occurs. GitLab retains logs for 2 days in the UI and 7 days via API. Shopify keeps 7 days in the Partner Dashboard. After that, the delivery history is gone.

Even more critically, most providers offer no replay mechanism. If a consumer discovers a bug in their handler that caused events to be processed incorrectly, they can't re-request those events. If an outage caused missed deliveries, there's no "resend everything from the last 6 hours" button. Some platforms offer limited manual retry of recent events through their dashboards, and GitLab allows retrying individual deliveries from the UI, but most offer nothing.

This forces consumers to build their own reconciliation systems: periodic API polling jobs that compare local state against the provider's current state and backfill any gaps. It's expensive, error-prone, and defeats much of the purpose of having webhooks in the first place.

What to do instead: Retain event logs for at least 30 days. Provide an API that lets consumers list events with filters (by type, by time range, by delivery status). And support replay, both of individual events and in bulk across a time range.
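
From the consumer's perspective, such an events API turns reconciliation into a small job rather than a bespoke system. The endpoint, query parameters, and response shape below are hypothetical, just to show the kind of query that should be possible.

```typescript
interface EventSummary {
  id: string;
  type: string;
  occurredAt: string;
  deliveryStatus: "delivered" | "failed" | "pending";
}

// Hypothetical reconciliation job: list events that failed delivery since a
// given time, then replay or backfill them.
async function findMissedEvents(since: string): Promise<EventSummary[]> {
  const url =
    `https://api.example-provider.com/v1/events` +
    `?created_after=${encodeURIComponent(since)}&delivery_status=failed`;
  const res = await fetch(url, {
    headers: { Authorization: `Bearer ${process.env.PROVIDER_API_KEY}` },
  });
  const body = (await res.json()) as { data: EventSummary[] };
  return body.data;
}
```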

Missing or weak signature verification

Webhook endpoints are just URLs. Anyone who discovers or guesses the URL can send fabricated events. Signature verification is the defense against this, and it's table stakes for production systems, yet some providers still don't offer it.

Even providers that offer HMAC signing sometimes undermine it by using a single shared signing key across all consumers. This means Consumer A could receive a legitimately signed webhook and forward it to Consumer B's endpoint, and it would pass verification. Per-consumer signing keys are essential.

Secret rotation is another common gap. When a signing key needs to be rotated (because it leaked, an employee left, or compliance requires it), providers that don't support a rotation window force consumers to choose between a period of failed signature verification (breaking deliveries) and skipping verification temporarily (breaking security).

What to do instead: Sign every webhook with HMAC-SHA256 computed over the payload body. Use per-destination signing keys. Support secret rotation with a grace period where both old and new keys are valid simultaneously. Include the timestamp in the signed content to prevent replay attacks.
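
On the consumer side, verification against that scheme looks roughly like the sketch below: HMAC-SHA256 over "<timestamp>.<body>", checked against every currently valid secret so a rotation grace period doesn't break deliveries. The header names and tolerance window are assumptions for illustration.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const TOLERANCE_SECONDS = 300; // reject signatures older than 5 minutes

function verifySignature(
  rawBody: string,          // the exact bytes received, before JSON parsing
  timestampHeader: string,  // e.g. an X-Webhook-Timestamp header (illustrative name)
  signatureHeader: string,  // hex-encoded HMAC, e.g. X-Webhook-Signature
  secrets: string[]         // old and new secrets during a rotation window
): boolean {
  const age = Math.abs(Date.now() / 1000 - Number(timestampHeader));
  if (!Number.isFinite(age) || age > TOLERANCE_SECONDS) return false;

  const signedPayload = `${timestampHeader}.${rawBody}`;
  const received = Buffer.from(signatureHeader, "hex");

  return secrets.some((secret) => {
    const expected = createHmac("sha256", secret).update(signedPayload).digest();
    return expected.length === received.length && timingSafeEqual(expected, received);
  });
}
```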

Inconsistent payloads and breaking changes

When order.created nests the order under data.order but order.updated puts it directly under data, consumers can't write generic handling logic. When your API uses camelCase but your webhooks use snake_case, developers waste time on needless translation layers. When a resource webhook sometimes includes related objects and sometimes doesn't, consumers can't predict what data they'll have available.

These inconsistencies compound. Each one is minor in isolation, but a consumer integrating with dozens of event types builds up a thicket of special-case parsing logic that's expensive to maintain and fragile in the face of changes.

Breaking changes are even worse. Renaming a field from user_id to customer_id, changing an integer to a string, or restructuring nested objects can all silently break consumer integrations. Webhook consumers typically don't have sophisticated deserialization monitoring. The breakage might go unnoticed for hours or days, causing data loss that's painful to recover from.

What to do instead: Define an envelope structure (event ID, event type, timestamp, and a data object) and use it for every event type without exception. Match the naming conventions of your REST API. Implement a versioning strategy, ideally one where consumers pin to a version and receive that schema until they explicitly upgrade. Never ship breaking changes without a deprecation period and migration path.
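
Expressed as a type, one possible envelope looks like this. The field names are illustrative; what matters is that every event type uses the same wrapper.

```typescript
interface WebhookEnvelope<T> {
  id: string;          // stable event ID, unchanged across retries
  type: string;        // e.g. "order.created"
  apiVersion: string;  // the schema version the consumer has pinned to
  occurredAt: string;  // ISO 8601 timestamp of the underlying change
  data: T;             // the resource payload, nested the same way for every type
}
```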

No event filtering

When providers fire every event type to every registered endpoint, consumers drown in irrelevant traffic. A consumer that only cares about payment.succeeded shouldn't have to receive and discard hundreds of inventory.adjusted events per hour.

Adyen enables all default event codes with no way to disable them. Bitbucket offers no content-based filtering. Several smaller providers offer no filtering capability at all.

The cost isn't just wasted bandwidth. Every irrelevant event consumes compute on the consumer's side (even if it's just to parse and discard), increases the surface area for timeout-related failures, and adds noise to logging and monitoring systems. For high-volume providers, the irrelevant traffic can be substantial.

What to do instead: Let consumers subscribe to specific event types. If your platform generates hundreds of event types, consider a topic hierarchy (payments.*, inventory.*) that allows wildcard subscriptions. Make subscription management available through both your API and dashboard.
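
Wildcard matching over a dot-separated topic hierarchy is cheap to implement on the provider side. The syntax below (a trailing ".*") is one possible convention.

```typescript
function topicMatches(subscription: string, eventType: string): boolean {
  if (subscription === eventType) return true;
  if (subscription.endsWith(".*")) {
    const prefix = subscription.slice(0, -1); // keep the trailing dot
    return eventType.startsWith(prefix);
  }
  return false;
}

// topicMatches("payments.*", "payments.succeeded") -> true
// topicMatches("payments.*", "inventory.adjusted") -> false
```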

How Outpost helps you avoid these problems

The shortcomings described above share a common thread: they're infrastructure problems that every webhook provider eventually encounters, and building robust solutions for each one is a significant engineering investment. Retry logic, signature management, event logging, replay capability, per-consumer configuration — these are complex systems to build and operate correctly.

Outpost is Hookdeck's open-source webhook delivery infrastructure, purpose-built to handle exactly this set of problems. Rather than building delivery plumbing from scratch, you publish events to Outpost and it handles the rest.

Delivery is guaranteed at-least-once with automatic retries using exponential backoff. Failed events aren't silently discarded but are tracked, and consumers can manually retry them through the API or Outpost's built-in user portal. That portal gives your consumers direct visibility into delivery status, event history, and destination health, solving the observability gap that plagues most webhook providers.

Signing is per-destination with HMAC-SHA256, and the signature is computed over the timestamp and body together, which protects against replay attacks. Secret rotation is built in: when a consumer triggers a rotation, both old and new keys are valid during a configurable grace period, so there's no moment where deliveries break or verification must be skipped.

Topic-based subscriptions mean consumers only receive the event types they've subscribed to. Each destination is independently tracked, so one consumer's endpoint being down doesn't affect delivery to others.

Perhaps most importantly, Outpost isn't limited to webhooks. It supports delivery to AWS SQS, Google Cloud Pub/Sub, RabbitMQ, Azure Service Bus, AWS Kinesis, and AWS S3 — so consumers who don't want to operate an HTTP endpoint (or who prefer native cloud messaging) can receive events through the infrastructure they already run. This sidesteps many of the timeout and reliability problems that make webhook delivery so fragile in the first place.

Outpost is open source under Apache 2.0, written in Go, and designed for minimal operational overhead. You can self-host it with Docker or Kubernetes, or use Hookdeck's managed service. Either way, you get the delivery infrastructure that took platforms like Stripe years to build (without having to build it yourself).

Ship webhooks that developers trust

The bar for webhook quality keeps rising. A few years ago, consumers tolerated flaky delivery because they had no alternative. Today, developers expect reliable delivery, cryptographic verification, replay capability, and filtering as baseline features. Platforms that don't offer these things lose integrations to platforms that do.

The good news is that you don't need to solve every problem from scratch. Define a consistent envelope. Sign every payload. Retry generously. Keep logs. Let consumers filter and replay. And if building all of that sounds like a distraction from your core product, then off-the-shelf solutions like Outpost exist.