Why Relying on Webhook Providers Is Not The Answer

In the previous article, we discussed how the delegation of common application functions like payment through integration with third-party providers has led to a rise in the use of webhooks across modern applications.

We also emphasized the need to have a resilient webhook infrastructure with observability features to ensure operational stability.

In this article, I’ll explain why, despite that need for resiliency and observability, you should not depend on your webhook provider to supply them.

The role of webhook providers in your integrations

When it comes to the integration of third-party services like e-commerce or payments, SaaS applications are the webhook providers. SaaS applications use webhooks to notify and transfer information about an event to your application.

One fact helps clarify how much responsibility you need to take for your webhooks: providers have no idea which webhooks are crucial to you, so they optimize for sending all of them to you as fast as possible.

They often follow a fire-and-forget communication style, allowing you to collect and process webhooks according to how each one is important to you. This behavior helps providers to optimize for speed and delivery. At the same time, you take up the responsibility of prioritization, temporary persistence for asynchronous processing, retries, and other custom activities on your webhooks.

Because only you know how important a webhook is to your operations, it’s in your hands to capture the ones you need; the provider’s job is simply to get every event out the door.
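As a rough illustration of taking on prioritization and temporary persistence yourself, the sketch below acknowledges each webhook immediately and stores it in a priority queue for a background worker. The event type names, the priority values, and the `ingest`/`next_event` helpers are hypothetical, and the in-memory queue only stands in for a durable store.

```python
import itertools
import queue

# Hypothetical priority map: a lower number is processed first.
PRIORITY = {"payment.succeeded": 0, "order.created": 1}
DEFAULT_PRIORITY = 9

_sequence = itertools.count()   # tie-breaker: keeps arrival order within a priority
inbox = queue.PriorityQueue()   # in-memory stand-in for a durable queue


def ingest(event: dict) -> int:
    """Store the webhook and return the HTTP status to send back right away."""
    priority = PRIORITY.get(event.get("type"), DEFAULT_PRIORITY)
    inbox.put((priority, next(_sequence), event))
    return 200


def next_event() -> dict:
    """Called by a background worker: returns the most important pending event."""
    _, _, event = inbox.get_nowait()
    return event
```

With this shape, the endpoint does almost no work before responding, and the order in which events are processed is decided by your own rules rather than by arrival order.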

In the next section, we look at how much responsibility webhook providers take in cases of webhook failure. This will help us understand the type of problems our webhook infrastructure should aim to solve.

How much responsibility does your webhook provider take when:

Your servers are overloaded

When your servers are overloaded, the time taken to process each webhook increases. Webhooks can then be dropped if your response exceeds the time limit set by the provider. If the overload persists, your server(s) may eventually crash.

Providers continue to optimize for delivery and want to ensure you get your webhooks in the fastest time possible; thus, they will keep sending webhooks even when your server is overloaded. Some providers throttle delivery when they notice high response times; however, their main goal remains to get the webhooks to you as fast as possible.
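One way to protect an overloaded server is to bound the number of webhooks waiting to be processed and reject the rest quickly. The sketch below is a minimal illustration under assumptions: `make_receiver` and `max_pending` are hypothetical names, and whether a provider retries after a 503 response depends on that provider.

```python
import queue


def make_receiver(max_pending: int):
    """Build a receive function backed by a bounded in-memory buffer."""
    buffer = queue.Queue(maxsize=max_pending)

    def receive(payload: bytes) -> int:
        try:
            buffer.put_nowait(payload)   # accept only while there is room
        except queue.Full:
            return 503                   # shed load instead of slowing down further
        return 200

    return receive, buffer
```

Rejecting fast keeps response times low for the webhooks you do accept, at the cost of relying on the provider's retry behavior (if any) for the rejected ones.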

You need to do a seamless migration

During a migration, your servers may need to be offline for a period, and any webhooks sent in that window will be missed. Because providers cannot determine how important each webhook is, they will keep optimizing for delivery during this period.

Your webhooks are timing out

Webhooks that time out are retried a few times (by providers that offer automatic retries) and then dropped if they keep timing out; providers without retries drop them immediately. As we learned in the previous article, dropped webhooks lead to a loss of event information that you need to act on, which can negatively impact your business.

While some providers offer automatic or manual retries (or both) for dropped webhooks, others simply keep sending new webhooks to your servers.

Your servers go down

When your server crashes, it can't receive any more webhooks until it is back up. During the outage, every webhook sent to your webhook URL will fail with an error and eventually be dropped.

Since the core responsibility of webhook providers is to optimize for delivery, they will keep sending webhooks during the outage, and it's your responsibility to capture them so they can be processed when the server is back up.

You need to retry a failed webhook

When a webhook fails, it needs to be captured and retried to reconcile the event information with your application. Not retrying failed events can cause data integrity issues within your application and lead to loss of business.

While some providers automatically retry a failed webhook a couple of times after it drops, the core responsibility of the provider remains to optimize for quick delivery of webhooks to their destination.

For example, Twilio will only retry failed webhooks up to 5 times, which may or may not be enough for the issue to be resolved.

Stripe provides a way to retry a failed webhook manually. While this is a handy tool, it does not scale for retrying a large batch of failed webhooks. Imagine you need to retry a thousand webhooks that failed due to a server crash; going into the provider's dashboard to retry these webhooks one by one is impractical.

You still require a webhook infrastructure that can retry a large number of failed requests without having to retry them one by one.
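A minimal sketch of what batch retrying could look like on your side, assuming failed events have already been captured somewhere. The function names, the exponential backoff schedule, and the default values are illustrative assumptions, not any provider's API.

```python
import time
from typing import Callable, Iterable, List


def retry_with_backoff(handler: Callable, event, max_attempts: int = 5,
                       base_delay: float = 1.0, sleep: Callable = time.sleep) -> bool:
    """Call handler(event) until it succeeds or max_attempts is reached."""
    for attempt in range(max_attempts):
        try:
            handler(event)
            return True
        except Exception:
            # Wait 1x, 2x, 4x, ... base_delay between attempts, not after the last one.
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return False


def retry_batch(handler: Callable, failed_events: Iterable, **options) -> List:
    """Retry every captured event and return the ones that still fail."""
    return [event for event in failed_events
            if not retry_with_backoff(handler, event, **options)]
```

Calling `retry_batch(process, failed_events)` after an outage replays the whole backlog in one go, and the returned list tells you which events still need attention.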

For more information on the features available on different webhook providers, check out our webhook platform guides.

You will need resiliency and observability

The scenarios described above are not an exhaustive list but are amongst the most common issues you will face when running webhooks in a production environment. For more information on webhook problems, you can check out our article series to learn all about common issues and recommended solutions.

The scenarios above point to the two properties your webhook infrastructure needs most: resiliency and observability.

So what are resiliency and observability in the context of webhooks?

Resiliency in webhooks

Resiliency is a measure of the toughness of your webhook infrastructure against load pressures and failures. Functions covered under resiliency include:

  • Fault-tolerance
  • Error/Failure recovery
  • Scalability
  • Self-healing
  • Configurability

Observability of webhook activities

Observability complements resiliency by giving you visibility into the activities of your webhooks and helping you learn patterns over time in order to make informed decisions when it comes to strengthening your system against faults and load pressures.

Activities covered under observability include:

  • Monitoring
  • Alerting
  • Automation of custom failure-recovery and autoscaling activities

By developing a resilient webhook infrastructure with observability features, you reduce the error rate on webhooks, self-heal common webhook failures, debug faster and maintain an efficient turnaround time in fixing critical webhook issues.
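To make monitoring and alerting concrete, here is a small sketch that tracks the failure rate over the most recent webhooks and triggers an alert callback when it crosses a threshold. The `FailureRateMonitor` class, its window size, and its threshold are hypothetical choices for illustration; a production setup would more likely feed a metrics system.

```python
from collections import deque


class FailureRateMonitor:
    """Track success/failure of recent webhooks and alert on a high failure rate."""

    def __init__(self, window: int = 100, threshold: float = 0.2, on_alert=print):
        self.results = deque(maxlen=window)   # keeps only the most recent outcomes
        self.threshold = threshold
        self.on_alert = on_alert              # e.g. a function that pages someone

    def failure_rate(self) -> float:
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def record(self, succeeded: bool) -> None:
        self.results.append(succeeded)
        rate = self.failure_rate()
        if rate > self.threshold:
            self.on_alert(f"webhook failure rate {rate:.0%} exceeds {self.threshold:.0%}")
```

The same callback could also start an automated recovery step, such as replaying captured events, instead of only notifying a person.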

Shopify is one webhook provider that clearly spells out the consumer’s responsibilities, on its webhook best practices page. There, it explains the webhook consumer’s responsibility to:

  • Respond quickly to avoid timeouts
  • Track failures with observability tools like the metrics on their Shopify dashboard
  • Recover failed webhooks from temporary persistence stores
  • Avoid duplicates by adding unique identifiers and making your endpoints idempotent
  • Use timestamps to avoid spoofed webhooks
  • Implement reconciliation jobs to re-fetch and retry webhooks
  • Build a scalable and reliable webhook infrastructure with a battle-tested messaging system

Shopify best practice
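Two of these responsibilities, avoiding duplicates and checking timestamps, can be sketched in a few lines. This is an illustrative sketch only: the `IdempotentReceiver` class, its parameters, and the five-minute freshness window are assumptions rather than Shopify's API, and the in-memory set stands in for a persistent store.

```python
import time


class IdempotentReceiver:
    """Process each event ID at most once and reject events that are too old."""

    def __init__(self, handler, max_age_seconds: float = 300, now=time.time):
        self.handler = handler
        self.max_age_seconds = max_age_seconds
        self.now = now
        self.seen_ids = set()   # stand-in for a persistent store of processed IDs

    def receive(self, event_id: str, sent_at: float, payload: dict) -> str:
        if self.now() - sent_at > self.max_age_seconds:
            return "rejected_stale"
        if event_id in self.seen_ids:
            return "duplicate_ignored"
        self.handler(payload)
        # Record the ID only after the handler succeeds, so a failed event can be retried.
        self.seen_ids.add(event_id)
        return "processed"
```

A timestamp check only helps when the timestamp is covered by the provider's signature, so in practice it would sit alongside signature verification, which is left out here.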

Event-driven architecture solves the need

Fortunately, an architectural pattern that has matured over the years fits well as a solution to the issues that come with webhooks: event-driven architecture.

Event-driven architecture is a very popular distributed asynchronous architectural pattern for building highly scalable and performant applications. In one of our earlier articles, we already stated the importance of processing your webhooks asynchronously (more on this in the following article), and event-driven architecture is designed to handle this type of processing style.

The pattern shines when dealing with decoupled event-processing components, such as those in the communication channel between webhook provider and webhook consumer. One of its significant advantages is that it can be used as a standalone architecture style or embedded within an existing architecture. In the following articles, we will dive deeper into how the attributes of this architecture help solve our day-to-day webhook issues.
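The decoupling described here can be illustrated with a tiny in-process event bus: producers publish events to a channel without knowing who consumes them, and consumers are invoked later when the channel is drained. The `EventBus` class and the event names are hypothetical, and a real deployment would typically use a message broker rather than an in-memory queue.

```python
import queue
from collections import defaultdict


class EventBus:
    """Minimal event channel: publishing and handling happen at different times."""

    def __init__(self):
        self.channel = queue.Queue()
        self.subscribers = defaultdict(list)

    def subscribe(self, event_type: str, handler) -> None:
        self.subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        # The producer returns immediately; no handler runs at this point.
        self.channel.put((event_type, payload))

    def drain(self) -> int:
        """Deliver all queued events to their subscribers; return how many were handled."""
        processed = 0
        while True:
            try:
                event_type, payload = self.channel.get_nowait()
            except queue.Empty:
                return processed
            for handler in self.subscribers.get(event_type, []):
                handler(payload)
            processed += 1
```

In the webhook setting, the endpoint that receives a webhook plays the producer role by calling `publish`, while workers call `drain` (or its broker equivalent) at their own pace.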

Conclusion

In this article, I’ve explained that we cannot depend on webhook providers to always reliably deliver our webhooks or help us recover from failures.

A resilient system with built-in observability features is the way to go; fortunately, the event-driven architecture style is well suited to this purpose.

The following article dives into the best practices for processing webhooks and how the event-driven architecture's asynchronous style fits.

Happy coding!