What Are The Costs of Managing Webhooks and What Is a Scalable Solution?
Webhooks have become one of the most popular ways for distributed applications to communicate. SaaS applications provide webhooks for integration with custom applications. One factor that makes webhooks popular is their ease of use; however, a hidden cost is attached.
In this series, we will zoom in on webhooks and their operational costs in the production environment. We will start by exploring potential failures, look into the impact of these failures, and then discuss the frustrations involved in reconciling failed webhooks. We will then look into solution options, calculate the cost of maintaining the solutions, and propose a better, more scalable solution.
At the end of this multi-part series, you will have a complete understanding of the issues faced by teams using webhooks, how to measure the risk, and, most importantly, how to use Hookdeck to handle webhooks in a reliable, scalable, and maintainable manner that offers improved developer experience.
How webhooks work
Webhooks are a one-way communication channel between an event source and an event consumer. When an event occurs in the source application (the webhook producer), the application fires a webhook in the form of an HTTP request to an API endpoint on the consumer application.
This HTTP request signals the event's occurrence and transfers information about the event to the consumer, if necessary.
This communication allows the receiving application to take action based on the event that occurred in the webhook producer. This action ranges from simple record-keeping to orchestrating a string of tasks to be fulfilled due to the event.
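As a minimal sketch of the consumer side, the snippet below uses only Python's standard library to accept a webhook as an HTTP POST, parse its JSON body, and record the event. The `/webhooks` path, the `type` field, and the `handle_webhook` and `serve` names are assumptions made for illustration. Real producers define their own payload formats and usually expect signature verification, which is omitted here.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory record of received events (a stand-in for real storage).
received_events: list[dict] = []


def handle_webhook(body: bytes) -> tuple[int, dict]:
    """Parse a webhook body and record it; return (HTTP status, response body)."""
    try:
        event = json.loads(body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return 400, {"error": "invalid JSON"}
    if not isinstance(event, dict) or "type" not in event:
        return 400, {"error": "missing event type"}
    received_events.append(event)  # simple record-keeping
    return 200, {"received": event["type"]}


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        if self.path != "/webhooks":  # hypothetical endpoint path
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_webhook(self.rfile.read(length))
        data = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port: int = 8000) -> None:
    """Start the consumer; blocks until interrupted."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Calling `serve()` starts the endpoint locally. Keeping the parsing logic in `handle_webhook`, separate from the HTTP plumbing, makes it possible to exercise the handler without opening a socket.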
To learn more about how webhooks work, you can start with the article “What Are Webhooks and How They Work”.
Different stages in the lifecycle of a missed webhook
Problems arise when a webhook fails. A failure can occur at the point of firing, for example when the producer never sends the webhook. It can also occur while the webhook is in transit (network failures, server timeouts, the producer's timeout limit being reached, etc.) or while it is being processed (server errors due to bad/buggy releases, memory exhaustion, etc.).
These issues break the communication pipeline and lead to inconsistencies in the state of the receiving application.
While the impact of a webhook failure has been summarized above as an "inconsistency in the state of the receiving application", the consequences for the customer are far more tangible.
An inconsistency in the state of the application means a lack of data integrity in the application, which in real life means a customer’s job has been compromised.
Thus, a failed webhook can be why customers do not get their purchased goods delivered, their e-wallet credited, or their concert tickets issued.
These impacts lead to customer frustration and can destroy the credibility of a business.
Dev remediation workflow
So, what do we do when a webhook fails? We fix it. Here is what fixing a single webhook failure looks like:
- Find the failed/missing webhook
- Identify the issue
- Fix the problem
- Resolve the issue by reconciling the failed/missing webhook
Easy, right? Not quite.
The frustration comes from repeating this process for multiple failed webhooks. This cumbersome routine can drain the motivation of your development team and waste person-hours that could be devoted to more essential aspects of your business.
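To make the first step of that workflow concrete, here is a rough sketch of finding missed webhooks by comparing what the producer sent with what your application processed. It assumes the producer offers some way to list recent events and that every event carries a unique ID. Both are assumptions, and the `find_missing_events` name and the sample IDs are hypothetical.

```python
def find_missing_events(produced_ids: list[str], processed_ids: set[str]) -> list[str]:
    """Return produced event IDs that were never processed, in original order."""
    return [event_id for event_id in produced_ids if event_id not in processed_ids]


# Hypothetical data: IDs listed by the producer vs. IDs our application handled.
produced = ["evt_1", "evt_2", "evt_3", "evt_4"]
processed = {"evt_1", "evt_3"}

missing = find_missing_events(produced, processed)
print(missing)  # ['evt_2', 'evt_4']
```

Each ID in `missing` still has to be diagnosed and replayed individually, which is where the person-hours go.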
Learn more about the missed webhook problem in the article “Why Webhooks Are Difficult to Manage in Production”.
Destiny is (mostly) out of your hands
Failures can come from any entity involved in the webhook communication pipeline. However, only some of these entities are within your control. While you fully control your destination application, the producer and network layer will most likely be out of your control.
This is because, in most webhook use cases, the webhook producer is a third-party SaaS application like Shopify or Stripe. Also, the producer dictates how the HTTP transaction will be conducted over the network.
This situation makes managing failures from request timeouts, retry policies, data formats, etc., more challenging.
Also, with more success comes more webhooks and, thus, a higher potential for failure. As your business continues to grow, you begin to face issues of scale, and a higher volume of webhooks will intensify the pressure on your infrastructure. If your infrastructure is not set up to adjust to the expanding scale, it will buckle and crash under the load.
Due to the event-based nature of webhooks, the infrastructure will need to scale up to the standards of an event-based architecture. Thus, the team facing this situation will require a good understanding of event-based architectures.
To better understand the scope of webhook problems, read the article "How an Asynchronous Approach to Managing Webhooks Mitigates Scalability Concerns".
The “Pandora’s box” of webhook solutions
The previous section pointed out the need to be knowledgeable about event-driven architectures to tackle the potential problems that webhooks will face in production environments. So, let's talk about solutions. Let's assume the team has a good understanding of event-based architectures.
To build resiliency into our webhooks, here is what the standard solution stack for an event-driven infrastructure requires:
| Solution category | Tool options |
| --- | --- |
| Webhook developer tools | ngrok, Postman |
| Alerting and logging | ELK, Datadog, PagerDuty, Sentry |
| Ingestion runtime | Nginx, Lambda, Cloudflare Workers, Kubernetes, VMs, etc. |
| Queues | Kafka, AWS SQS, RabbitMQ, Azure EventBus, AWS EventBridge, GCP Pub/Sub |
| Consumer runtime | Nginx, Lambda, Cloudflare Workers, Kubernetes, VMs, etc. |
| Storage engines | Postgres, AWS S3, MySQL, AWS RDS, GCS, DynamoDB, etc. |
| Custom scripts | Dead-lettering recovery, rate-limiters, etc. |
Maybe it's just me, but the table above feels like we have unintentionally opened Pandora's box by seeking a solution to our webhook problems. Having to deal with so much tooling screams maintenance issues. Also, mastering these tools and stitching them together effectively has a steep learning curve. This type of stitched-together tooling can lead to poor usability.
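To give a feel for how the pieces fit together, here is a deliberately small, in-process sketch of the ingestion, queue, consumer, and dead-letter flow. It is not a production design: a `deque` stands in for a durable queue such as the ones listed in the table, there is no delay between retries, and `MAX_ATTEMPTS`, `ingest`, `drain`, and `process_event` are hypothetical names chosen for illustration.

```python
from collections import deque
from typing import Callable

MAX_ATTEMPTS = 3  # hypothetical retry limit


def ingest(queue: deque, event: dict) -> int:
    """Ingestion step: enqueue the event and acknowledge immediately."""
    queue.append({"event": event, "attempts": 0})
    return 200  # the producer gets a fast response before any processing runs


def drain(queue: deque, process: Callable[[dict], None]) -> list[dict]:
    """Consumer step: process queued events, retrying failures.

    Returns the dead-letter list: events that failed MAX_ATTEMPTS times.
    """
    dead_letters: list[dict] = []
    while queue:
        item = queue.popleft()
        try:
            process(item["event"])
        except Exception:
            item["attempts"] += 1
            if item["attempts"] >= MAX_ATTEMPTS:
                dead_letters.append(item["event"])  # keep for manual recovery
            else:
                queue.append(item)  # retry later, behind other events
    return dead_letters


handled: list[str] = []


def process_event(event: dict) -> None:
    if event["id"] == "evt_bad":  # simulate a buggy handler
        raise ValueError("cannot process event")
    handled.append(event["id"])


queue: deque = deque()
ingest(queue, {"id": "evt_ok"})
ingest(queue, {"id": "evt_bad"})
dead = drain(queue, process_event)
print(handled)  # ['evt_ok']
print(dead)     # [{'id': 'evt_bad'}]
```

Even this toy version needs a retry policy and a place to park failed events. Each row of the table replaces one of these stand-ins with a real service that has to be provisioned, monitored, and upgraded.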
For a comprehensive look into what a standard solution for a webhook infrastructure looks like and the cost of building one, check out the article "Why Fixing Webhooks Might Be Harder Than You Thought".
What to do?
When dealing with webhook problems, you can either routinely troubleshoot and fix every problem that comes your way or build a standard solution that will asynchronously process your webhooks in a fault-tolerant manner.
This section looks at both approaches to better understand their costs.
Death by a million cuts
You can live with the problem and optimize your workflows to go through the routine steps described above for each webhook issue. However, keep in mind that you will experience the following:
- A ton of time wasted troubleshooting
- Routine operation problems with missed webhooks
- Having to manage each problem reported by end users
- Combing through logs to diagnose each problem
- Setting up for testing and debugging using a myriad of developer tools (ngrok, Postman, dev environment scripts)
- Having to replay on a case-by-case basis with the provider platform
Rebuild the wheel
The other option is to build the standard solution described in the table in the previous section. This solution considers all the moving parts of an event-driven architecture and is tailored to meet all its requirements.
With this approach, you can be confident that your webhooks will be handled reliably, though at the price of ongoing maintenance. Just as with the previous approach, there are a few things to keep in mind when choosing this option:
- 2 to 6 months of implementation and testing
- Requires constant monitoring and maintenance (you may need to upgrade software versions each year)
- Writing of tools and custom scripts for new functionalities
- The recurring problem of scale
- A steep learning curve and knowledge drain when the solution builder leaves
Choosing a solution strategy depends on several factors relating to your use case and your situation with your webhooks. For more information on what to consider when choosing a solution strategy, check out the article “How to Choose a Strategy for Managing Webhooks”.
Why not use Hookdeck?
Hookdeck is a webhook infrastructure that helps you receive webhooks reliably even after server outages. With it comes a suite of features like webhook retry (manual, automatic, and bulk), alerts, throttled delivery, pausing of delivery, and an intuitive dashboard that gives you and your team total visibility over the activities of your webhooks.
With Hookdeck, you can stop worrying about your webhooks' reliability and focus more on building your product. Learn more about how Hookdeck ensures resilience in handling webhooks in this article.
This article introduces you to the cost of using webhooks in production and what you need to consider as a team seeking a solution. The following articles in this series will expand on each section in this article to give you all the details that will help you choose the right strategy to scale and run your webhooks reliably.
Get started with the first article in the series: "Why Webhooks Are Difficult to Manage in Production".