How to Choose a Strategy for Managing Webhooks
Throughout this series, we have emphasized that facing problems with webhooks in production is inevitable. We have also learned that fixing those problems takes a sizeable amount of effort. We must therefore strike a balance so that our engineering effort goes toward the most critical issues.
This article discusses how to detect webhook problems that affect your business the most, as well as the amount of effort you should invest based on the nature of the issues you are experiencing.
Assessing the risk of webhook failure
Before you decide on your strategy to tackle webhook problems, you must first define what you consider a problem. While every issue within your webhook infrastructure screams for attention and can eventually cause harm to a business, not all problems are equally severe. In this section, we will discuss how to determine how critical each potential failure is.
SLIs and SLOs
To determine how critical your webhook problems are, you have to be able to evaluate your system's performance based on your SLIs and SLOs.
An SLI (service level indicator) is a metric that measures one aspect of the level of service a system provides to its users. Simply put, an SLI measures a performance characteristic of your webhook processing service.
For a webhook processing service, some of the main performance characteristics are:
- Response time
- Error rate
An SLI is calculated as the ratio of “good” events to the total number of events. For example, if we hope to keep the response time at or below 100 milliseconds for a thousand concurrent webhooks, an SLI for the response time is:
Number of webhooks with response time ≤ 100ms / Total number of webhooks
Let's say 990 of those webhooks got a response in 100ms or less, while 10 webhooks had response times greater than 100ms. The SLI is 990/1000 = 0.99. Put another way, 100ms is the 99th percentile response time.
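The calculation above can be sketched in a few lines of Python. This is a minimal illustration rather than a production metric pipeline: the function name `response_time_sli` and the sample response times are assumptions made up to mirror the 990/1000 example.

```python
def response_time_sli(response_times_ms, threshold_ms=100.0):
    """Return the fraction of webhooks answered at or below the threshold."""
    if not response_times_ms:
        raise ValueError("need at least one measurement")
    good = sum(1 for t in response_times_ms if t <= threshold_ms)
    return good / len(response_times_ms)


# Made-up sample: 990 fast webhooks and 10 slow ones.
samples = [80.0] * 990 + [250.0] * 10
sli = response_time_sli(samples)
print(sli)  # 0.99
```

In practice the response times would come from your monitoring or logging system instead of a hard-coded list.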
On the other hand, an SLO (service level objective) defines the range of acceptable values for an SLI within which your webhook processing service is considered to be healthy.
For example, 100ms at the 99th percentile is more acceptable than 100ms at the 45th percentile. If more than half of our webhooks respond slower than our threshold, the system is not very healthy.
In summary, an SLI gives us a measure of performance characteristics, while an SLO helps determine whether or not the system is healthy.
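As a rough sketch of how the two concepts fit together, the Python snippet below treats an SLO as a minimum acceptable value for an SLI. The `Slo` class and the 0.99 target are hypothetical and chosen only to match the percentile example; your own targets will depend on your business.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A minimum acceptable value for one SLI (hypothetical helper)."""
    name: str
    target: float  # e.g. 0.99 means 99% of webhooks must be "good"

    def is_met(self, sli: float) -> bool:
        return sli >= self.target


# Assumed target: 99% of webhooks answered within 100ms.
latency_slo = Slo(name="response time <= 100ms", target=0.99)

print(latency_slo.is_met(0.99))  # True: 100ms is the 99th percentile
print(latency_slo.is_met(0.45))  # False: 100ms is only the 45th percentile
```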
For more details on setting performance thresholds for your webhook infrastructure, check our article, “Webhook Infrastructure Performance Monitoring, Scalability Tuning, and Resource Estimation”.
Determining your error budget
While defining SLIs and SLOs that reflect the optimal performance levels of your webhook infrastructure, you need to consider how much degradation can be tolerated over a given period without severe consequences for your business.
For example, an SLO could define that 97% of webhooks should have a response time below 150ms measured over a period of 7 days.
This SLO definition means that it is acceptable for 3% of webhooks to have a latency higher than 150ms within this same period.
That 3% is the error budget: the share of webhooks that can miss the target over the specified period while the system is still considered healthy.
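To make the arithmetic concrete, here is a minimal Python sketch of the error budget for the 97% SLO above. The weekly volume of 200,000 webhooks and the count of 4,500 slow webhooks are made-up figures for illustration, and the function names are hypothetical.

```python
def error_budget(total_webhooks, slo_percent):
    """Webhooks allowed to miss the latency threshold within the window."""
    return total_webhooks * (100 - slo_percent) / 100


def budget_consumed(slow_webhooks, total_webhooks, slo_percent):
    """Fraction of the error budget already used (1.0 means fully spent)."""
    return slow_webhooks / error_budget(total_webhooks, slo_percent)


# Assumed figures for one 7-day window (not from a real system).
total = 200_000
slow = 4_500

allowed = error_budget(total, 97)
print(allowed)                           # 6000.0
print(budget_consumed(slow, total, 97))  # 0.75
```

Note that a 100% SLO leaves no budget at all, so `budget_consumed` would divide by zero; real code would need to handle that case.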
These figures help administrators set alerts and determine when and how much effort to put into fixing a particular webhook problem.
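One way to turn these figures into alerts is to compare the failures seen so far against the error budget. The sketch below is only an illustration: the `alert_level` function, the "warn at 50%" and "page at 100%" thresholds, and the sample volumes are all assumptions, not recommendations.

```python
def alert_level(slow_webhooks, total_webhooks, slo_percent,
                warn_at=0.5, page_at=1.0):
    """Map error budget consumption to a coarse alert level."""
    allowed = total_webhooks * (100 - slo_percent) / 100
    if allowed == 0:
        # No budget at all: any failure breaches the SLO.
        return "page" if slow_webhooks > 0 else "ok"
    consumed = slow_webhooks / allowed
    if consumed >= page_at:
        return "page"   # budget spent: fix this now
    if consumed >= warn_at:
        return "warn"   # budget draining: plan remediation
    return "ok"         # within budget: no action needed


# Assumed volume of 200,000 webhooks in the window, with a 97% SLO.
print(alert_level(1_000, 200_000, 97))  # ok
print(alert_level(4_500, 200_000, 97))  # warn
print(alert_level(6_000, 200_000, 97))  # page
```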
How to approach solving webhook problems
If it ain't broke, don't fix it.
If you're within your error budget, you can hold off on taking any serious steps or holistic approaches to fix the webhooks that fall outside your performance thresholds. The effort needed to resolve the small number of issues is also manageable.
However, this situation can change quickly if webhook volume grows beyond what your calculations accounted for. The remediation process will also still involve a frustrating cycle of routine activities, such as:
- Wasting a ton of time troubleshooting the problem;
- Managing problems reported by end users;
- Combing through logs to diagnose problems;
- Setting up a manual development environment for diagnosis (ngrok, Postman, dev configuration, etc.);
and many more.
Rebuild the wheel
The asynchronous processing model used in the design will help you solve most of the problems webhooks face in production, take control of several aspects of the webhook data pipeline, and have complete visibility.
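For readers who want a picture of what an asynchronous processing model means in code, here is a deliberately tiny Python sketch. It is not the design from the earlier articles: the in-memory `queue.Queue` stands in for a real message broker, and `receive_webhook` and `worker` are hypothetical names.

```python
import queue
import threading

events = queue.Queue()  # stand-in for a real message broker
processed = []


def receive_webhook(payload):
    """Ingestion step: store the event and acknowledge immediately."""
    events.put(payload)
    return 200  # acknowledged before any processing happens


def worker():
    """Consumer step: process queued events at its own pace."""
    while True:
        payload = events.get()
        if payload is None:  # sentinel value: stop the worker
            events.task_done()
            break
        processed.append(payload["id"])
        events.task_done()


consumer = threading.Thread(target=worker, daemon=True)
consumer.start()

for i in range(3):
    receive_webhook({"id": i})

events.join()     # wait until the worker has drained the queue
events.put(None)  # ask the worker to stop
consumer.join()
print(processed)  # [0, 1, 2]
```

The point of the split is that the receiving side only has to enqueue and acknowledge, so a slow or failing processing step does not cause the webhook provider to see a timeout.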
However, there is no off-the-shelf solution for this design, meaning that you will have to reinvent the same wheel other development teams have built to solve webhook problems. Going this route opens the can of worms described in the previous article, summarized below:
- 2 to 6 months of implementation to get everything right;
- Monitoring and maintenance of software tools and versions;
- Writing scripts for custom functionalities;
- Recurring problems of scale; and
- A complicated system that is still difficult to grasp for an average developer.
Use Hookdeck to handle your webhooks.
As a sole developer, development team, administrator, or DevOps engineer who wants to process webhooks effectively, do you want to deal with the full burden of architectures, SLI and SLO definitions, exhaustive tooling, and the learning curve of event-driven architectures?
My guess is NO.
You’d prefer to offload all that responsibility to a service that guarantees reliable webhook delivery. That's where Hookdeck comes in.
Hookdeck is a webhook infrastructure built for resilience that helps you receive webhooks reliably even after server outages.
Along with this comes a suite of features like automatic replies, rate-limiting, pausing of webhooks, and webhook fan-out, as well as an intuitive dashboard that gives you and your team total control and visibility over the activities of your webhooks.
With Hookdeck, you can stop worrying about your webhooks' reliability, enjoy a better developer experience when handling webhooks, and focus more on building your product.
In this article, we went over how to approach your webhook problems scientifically and strategize to solve them. From measuring performance attributes to defining SLOs and an error budget, you can determine how your webhook problems affect your business and how much effort and attention to spend on them.
We then introduced Hookdeck, a webhook handling service that takes the burden of webhook reliability off your hands while giving you complete control and visibility over your webhooks. In the next article, we dive deeper into why the Hookdeck approach is our recommended way of handling your webhooks.