Introduction to Webhook Problems

Webhooks are HTTP requests fired from one remote application to another when an event occurs. This communication process is fully discussed in our article “What Are Webhooks and How They Work”. Essentially, webhooks provide a simple communication channel between different applications in a distributed system. But as with every “simple” technology, the devil is in the details.

When it comes to actual implementation, webhooks are highly prone to failures. Even with thorough testing practices, certain failures arise as a result of complex scenarios that only manifest themselves in production environments.

Common causes of failure include, but are not limited to, single points of failure, unreliable networks, slow processes, and unexpected loads. While we have discovered that most of these issues can be mitigated by asynchronously processing your webhooks, this series is focused on the different types of problems that you might face when working with webhooks.

This article is the first in a series that aims to give you deep insight into the problems encountered when using webhooks in production environments.

We will begin by looking at why these problems exist. Then, I will categorize the different types of problems faced when using webhooks, discuss how most teams are (currently) tackling these problems, and recommend solutions and best practices for prevention and effective remediation.

From development to production

So you have decided to use webhooks. You have a provider that allows you to enter the URL of your application’s endpoint where you will be receiving your webhooks. You set up your webhook successfully, receive your first webhook, smile, and walk away. Job well done… too easy.

Later that day you check on your webhooks and notice that your application has been spammed with hundreds of irrelevant events, and junk data is filling up your database. Apparently, anyone who has access to your webhook URL can send requests to your application. You realize that, unlike in that test you ran with ngrok in your development environment, webhook endpoints in production need authentication to protect them.

So you implement one of your provider’s authentication strategies and once again breathe.
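One common authentication strategy is for the provider to sign each request body with a shared secret and send the signature in a header. Below is a minimal sketch of the receiving side in Python, assuming a hex-encoded HMAC-SHA256 signature. The function name and the encoding are assumptions for illustration; your provider's documentation defines the actual header, algorithm, and encoding.

```python
import hashlib
import hmac


def is_valid_signature(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Return True if received_signature matches the HMAC-SHA256 of the raw body.

    Assumption: the provider sends a hex-encoded HMAC-SHA256 digest.
    """
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest takes time independent of where the inputs differ,
    # which avoids leaking information through response timing.
    return hmac.compare_digest(expected.encode(), received_signature.encode())
```

The check must run against the raw request bytes, before any JSON parsing, because re-serializing the payload can change the bytes and invalidate the signature.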

But it doesn’t take long before you start getting emails about database exceptions from the error monitoring service you have instrumented in your code. It turns out that beyond a certain number of concurrent webhooks, your database connection pool is depleted and your database starts rejecting requests. You didn’t notice this before moving to production because you only tested with a few webhooks. So you fix that, and once again, all is well… or is it?

A couple of weeks pass by and nothing happens, and then a holiday rolls around. You have a big sale on your website, which leads thousands of customers to use it within the same period. Your webhook volume surges, drastically increasing the number of concurrent webhooks your endpoint needs to process. Yes, you’ve got a spike!

This quickly eats up your server’s resources, saturates your connection pool, and eventually causes your server to crash. You restart the server, but it doesn’t take long for requests to spike again, because the holiday frenzy keeps traffic at peak levels all day. Now time is not on your side: business is being lost, and customer frustration increases each second the issue persists.
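Spikes like this are the usual motivation for processing webhooks asynchronously: the endpoint only enqueues the event and acknowledges it, while a fixed number of workers drain the queue at a pace the database can sustain. The sketch below illustrates the idea with an in-memory queue. The class name `WebhookBuffer`, the status codes, and the default sizes are assumptions for illustration; a production setup would typically use a durable queue instead.

```python
import queue
import threading
from typing import Callable


class WebhookBuffer:
    """Accept events quickly and process them on a fixed pool of workers."""

    def __init__(self, handler: Callable[[dict], None], workers: int = 4, maxsize: int = 1000):
        # Assumption: keep `workers` at or below your database pool size.
        self._handler = handler
        self._queue = queue.Queue(maxsize=maxsize)
        self.failed = []  # events whose handler raised; candidates for retry
        self._threads = [
            threading.Thread(target=self._run, daemon=True) for _ in range(workers)
        ]
        for thread in self._threads:
            thread.start()

    def receive(self, event: dict) -> int:
        """Called by the HTTP endpoint; returns the status code to send back."""
        try:
            self._queue.put_nowait(event)
        except queue.Full:
            return 503  # buffer is full: signal the provider to retry later
        return 202  # accepted for later processing

    def _run(self) -> None:
        while True:
            event = self._queue.get()
            try:
                self._handler(event)
            except Exception:
                self.failed.append(event)
            finally:
                self._queue.task_done()

    def wait_until_drained(self) -> None:
        """Block until every queued event has been handled."""
        self._queue.join()
```

Because the worker count is fixed, a burst of incoming webhooks lengthens the queue instead of opening more concurrent database connections.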

From the scenarios I’ve described so far, I’m sure it’s becoming pretty clear to you that webhooks are fragile and require resilience built around them to survive production environments. The scenarios above do not even scratch the surface of the types of problems that arise when using webhooks in production. However, there is no reason to panic or dump webhooks altogether. This series is dedicated to equipping you with the knowledge and tools necessary to mitigate these webhook problems.

Common problems with webhooks

From our years of experience in the webhook space, common problems we have had engineers and stakeholders bring up include statements such as:

I am not sure my webhooks are working
I don’t know which webhooks failed
I am receiving multiple webhooks for the same event
Why are my webhooks slow?
My webhook provider only allows one URL
Why am I getting too many timeout errors?
…and more
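To make one of these complaints concrete: duplicate deliveries of the same event are commonly handled by making the consumer idempotent, so that a repeated event has no additional effect. Here is a minimal sketch, assuming every event carries a unique `id` field (a hypothetical field name) and using an in-memory set where a real system would use a durable store.

```python
class IdempotentHandler:
    """Wrap a handler so each event ID is processed at most once."""

    def __init__(self, handler):
        self._handler = handler
        # Assumption: in production this would be a durable store with a
        # retention window, not an in-memory set.
        self._seen = set()

    def handle(self, event: dict) -> bool:
        """Process the event; return False if it was a duplicate."""
        event_id = event["id"]  # hypothetical field name
        if event_id in self._seen:
            return False
        self._handler(event)
        # Record the ID only after success so a failed event can be retried.
        self._seen.add(event_id)
        return True
```

This sketch is not thread-safe; with concurrent workers, the check and the insert would need to be a single atomic operation in the backing store.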

We have broken down these webhook problems into four major categories, which are introduced in the sections below; I’ll also link to each of their respective articles in this webhook problem/solution series.

Each section below covers one problem category. We will look into the different types of problems in each category, discuss each problem to understand why it happens in the first place, and then recommend solutions.

Testing

Testing is the practice of making sure you catch bugs early enough, in other words before your webhooks go into production. The longer it takes to detect a bug, the more expensive it is to fix it. Thus, testing is your first line of defense against webhook problems.

To learn more about problems that arise from missing or inadequate testing, and their corresponding solutions, check out our “Webhook Testing Problems and Solutions” article.

Managing

Webhooks need to be managed effectively to ensure that they properly integrate with your existing infrastructure. Oftentimes, you’re working with multiple webhooks and need to harmonize their activities. Authentication schemes and payloads are also different from provider to provider. Thus, you need to coordinate all your webhooks to achieve the desired impact for your business.

To learn more about problems that arise as a result of management issues within your webhook infrastructure, as well as their corresponding solutions, check out our “Webhook Management Problems and Solutions” article.

Monitoring

It is obvious that problems are inevitable when working with webhooks. One way to ensure that you’re focused on fixing the right problems is by keeping tabs on your webhook activities through effective monitoring.

Monitoring is used to detect failures that affect users and can be used to trigger alerts or automated fixes. It can also give you a high-level overview of the overall health of your webhooks.
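As a rough illustration of how monitoring can trigger an alert, the sketch below tracks the outcome of the most recent deliveries and flags when the failure rate crosses a threshold. The class name, window size, and threshold are assumptions chosen for the example, not recommended values.

```python
from collections import deque


class FailureRateMonitor:
    """Track recent delivery outcomes and decide when to alert."""

    def __init__(self, window: int = 100, threshold: float = 0.2):
        # Only the last `window` outcomes are kept; older ones fall off.
        self._outcomes = deque(maxlen=window)  # True = success, False = failure
        self._threshold = threshold

    def record(self, success: bool) -> None:
        self._outcomes.append(success)

    def failure_rate(self) -> float:
        if not self._outcomes:
            return 0.0
        return self._outcomes.count(False) / len(self._outcomes)

    def should_alert(self) -> bool:
        """True when the recent failure rate exceeds the threshold."""
        return self.failure_rate() > self._threshold
```

A sliding window keeps the signal focused on current health, so a burst of failures from last week does not mask or inflate today's numbers.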

To learn more about webhook monitoring problems, and their corresponding solutions, check out our “Webhook Monitoring Problems and Solutions” article.

Error recovery

As mentioned in the previous section, failure is inevitable. This is why you need to pay as much attention to recovering from failures as you do to preventing them in the first place.

Our problem/solution article on error recovery lists common errors and error recovery scenarios encountered with webhooks in production. I then go into detail on how to recover from these errors with self-healing mechanisms and troubleshooting tools that can be used in production environments.
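One widely used self-healing mechanism is retrying a failed delivery with exponential backoff. The sketch below is an illustration under stated assumptions rather than the approach from the linked article: `deliver` is a hypothetical callable that raises on failure, and the attempt count, base delay, and cap are arbitrary defaults.

```python
import time


def deliver_with_retries(deliver, event, max_attempts=5, base=1.0, cap=60.0, sleep=time.sleep):
    """Call deliver(event), retrying with exponential backoff on failure.

    Returns True on success and False once all attempts are exhausted.
    """
    for attempt in range(max_attempts):
        try:
            deliver(event)
            return True
        except Exception:
            if attempt == max_attempts - 1:
                break  # no point waiting after the final attempt
            # Delay doubles each time (base, 2*base, 4*base, ...) up to cap.
            sleep(min(cap, base * 2 ** attempt))
    return False
```

When the function returns False, the event would typically be stored (for example, in a dead-letter queue) for inspection and manual replay. Real implementations often add random jitter to the delay so that many failed deliveries do not all retry at the same moment.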

The current solution

Webhook problems are as old as webhooks, and over the years various solutions have been designed to mitigate these pain points. Solutions exist to deal with issues such as timeouts, code errors, and traffic spikes.

Below is a table showing the usual “plumbing” that goes into mitigating the problems around webhooks.

Tool	Problem it fixes	Examples
Ingestion runtime	Handle authentication, transform payloads, TLS termination, etc.	AWS Lambda, Cloudflare Workers, NGINX, Kubernetes, VMs, etc.
Queues	Ingest webhooks for asynchronous processing, provider timeout limit issues, load leveling and routing to consumers, etc.	Google Cloud Pub/Sub, Kafka, Amazon SQS, RabbitMQ, Azure EventBus, etc.
Consumer runtime	Process webhooks, horizontal scaling to prevent spikes, etc.	AWS Lambda, Cloudflare Workers, NGINX, Kubernetes, VMs, etc.
Storage	Persist webhook events for the purpose of error recovery, troubleshooting, rate-limiting, and audit trail	Postgres, S3, DynamoDB, Redis, etc.
Alert and logging	Monitoring and alerting, error recovery	ELK stack, Datadog, New Relic, Prometheus, Grafana, etc.
Custom scripts	Retries from dead-lettered messages, troubleshooting, rate-limiting, etc.
Development and testing	Unit and end-to-end testing of webhooks	ngrok, Postman

As you may have observed, it is a lot of work to set up and maintain all these tools in order to have resilient webhooks. And even though engineering teams have done their best with these tools, this approach is not optimal, for the following reasons:

The burden of maintaining tools from different vendors
No standardized workflow
You get best-effort resiliency
The quality of your webhook infrastructure is directly proportional to the expertise of your team
A vendor can introduce a change that breaks other dependencies
The inertia of teams in creating or maintaining documentation for the webhook infrastructure
Difficulty in onboarding new employees (new team members have to learn a bunch of tools)

These are some of the drawbacks to this legacy strategy we have all gotten used to over the years. But there is something much better than a composite solution for managing webhooks, and that is what we will discuss next.

The recommended solution

When looking for a solution for webhooks, the focus should be on developer experience. You don’t want your best minds wasting time on making a bunch of tools work together and fixing issues between them when they can be focusing on more customer-facing issues.

You don’t need multiple tools; you only need one: a webhook-focused, well-documented, constantly maintained, and reliable tool that provides just the features necessary for making your webhooks resilient in production environments.

You need a webhook infrastructure that sits between your producers and consumers to decouple them and broker all interactions between the two. This infrastructure handles all the production requirements for your webhooks, such as:

Authentication
Monitoring
Error recovery
Alerting
Troubleshooting
Coordination of workflows for one or multiple webhook providers and consumers
…and many more

Curious as to what this tool is? In comes Hookdeck.

Hookdeck’s mission is simple: to ensure that you never miss a webhook. Hookdeck is built solely for handling webhooks, so its features are tailored toward ensuring that your webhooks fulfill their purpose as they travel from point to point.

Hookdeck handles all the responsibilities of a reliable webhook solution listed above and even more. You can work with multiple providers, route your webhooks to as many endpoints as you desire, transform webhook payloads before they hit their destination, get out-of-the-box authentication, automatically retry failed webhooks, pause failing webhooks, and design a workflow that suits your mode of operation.

And, just as you might have guessed, Hookdeck is thoroughly documented to ensure that you have answers to all your webhook questions and can onboard new team members quickly and easily.

To learn more about Hookdeck, check out our documentation.

Conclusion

The simplicity of webhooks can be deceiving, making engineers and development teams oblivious to their fragility in production environments. Understanding the problems webhooks encounter in production, and how to mitigate them, helps you be better prepared for the inevitable. Though we cannot possibly cover every problem webhooks face in this series, we do hope that the information you get helps you be more proactive in dealing with these issues.

Or you can simply sign up for a Hookdeck account today and never have to worry about missing a webhook.
