How an Asynchronous Approach to Managing Webhooks Mitigates Scalability Concerns
In the previous article, we discussed the problems development teams experience when using webhooks in production. We also emphasized how fixing these problems can be cumbersome and repetitive, and can negatively impact the team's effectiveness.
In this article, we focus on understanding the scope of webhook problems, the contributing factors, and how many of these factors we can control.
Finally, we will explore an alternative approach to handling webhooks in production environments to mitigate the causes.
Truth Bomb 💣: You’re not “really” in control
A webhook communication setup consists of two major players: the webhook provider that sends the webhook and the webhook consumer that receives the webhook.
In most cases, the provider is a SaaS application like Shopify, Stripe, or Twilio. The consumer is your application that needs to receive the webhook and use it to perform an operation in response to an event in the provider.
You own one side of this communication channel; the other belongs entirely to a third party. So while either player can be responsible for a failure in the setup, you can only control things on your end.
Providers are also in charge of defining the attributes of the webhook. Features like the timeout limit, message format, request method, request protocol, authentication strategy, and retry behavior are all within the provider's control. The best you can do is adapt (and this is not optional) to the provider's defined ruleset and find a way to make up for any of its shortcomings.
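Authentication is a typical example of adapting to a provider's ruleset. Many providers sign each delivery with an HMAC over the request body using a shared secret, though the exact header name, secret format, and algorithm vary by provider. The sketch below assumes a hypothetical HMAC-SHA256 scheme and an illustrative secret:

```python
import hashlib
import hmac

def verify_signature(payload: bytes, secret: str, received_signature: str) -> bool:
    """Recompute the HMAC-SHA256 digest of the payload and compare it with
    the signature the provider sent alongside the request."""
    expected = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    # Constant-time comparison guards against timing attacks.
    return hmac.compare_digest(expected, received_signature)

# Hypothetical example: a consumer checking an incoming webhook body.
secret = "whsec_example"  # illustrative secret; real ones are issued by the provider
body = b'{"event": "order.created"}'
signature = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
print(verify_signature(body, secret, signature))  # True
print(verify_signature(body, secret, "tampered"))  # False
```

Consult your provider's documentation for the actual signing scheme; some also include a timestamp in the signed content to prevent replay attacks.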
This arrangement is not good enough if you plan to build a scalable and resilient webhook system. You can't afford to rely on the provider to act in the best interest of your system; that will only be a recipe for disaster.
With more success comes more problems
Even if you somehow find a way to cope with the webhook provider's shortcomings, a lot can still go wrong in a production environment.
Let's discuss what exactly could go wrong.
Low awareness of the issue at hand
The first issue is the team's level of understanding of the problems webhooks are prone to in production. Because webhooks are simple to set up, engineers who have never dealt with a high volume of webhooks in production often just set them up and expect them to perform as intended.
Webhooks are used within an event-driven architecture. If you're new to event-driven architectures, you will need to learn about the challenges of managing concurrency and resilience, and the engineering effort that goes into setting one up correctly.
The growing number of webhooks
Being the sole ticket attendant at a cinema on a Wednesday in the middle of August is easy, but you would need more help on Christmas Day.
One of the biggest problems you will experience in production comes from an increase in the volume of webhooks, as business success brings increased traction. This increase multiplies the error rate and the time spent debugging and fixing failures. The size of batched operations, like bulk updates, also grows and can build up enough operational pressure to crash your server. You will need to design fault tolerance and elasticity into your system to manage failures and withstand the pressure of an increased webhook load.
Concurrency bursts and the pressures of reliability
Working with the cinema attendant analogy from the previous section, it would be much easier to attend to large groups of customers if they could just form a straight line and be patient.
Sadly, just like the rowdy cinema customers, webhook requests don't work that way. As the total number of webhooks sent within a period grows, so does the number of concurrent webhooks your consumer needs to process.
Just as ticket attendants only have two hands, your consuming server has limited resources and will burn out when under pressure to process more webhooks than it can conveniently handle at a point in time.
Webhook volume is also a fault factor entirely out of your control. You can't control how many webhooks arrive at any point in time, when spikes will happen, or how many you will have to handle simultaneously…or can you?
A different approach: asynchronous processing
The communication channel created by a webhook between a provider and a consumer is synchronous. A provider sends a webhook to a consumer and waits until the consumer finishes processing it. If the consumer does not finish within a set period (the timeout limit), the provider flags the delivery as failed.
This sort of arrangement attaches a default RIGHT NOW priority label on all webhooks, which puts a lot of pressure on the (limited) consumer as the number of webhooks continues to increase.
But what if we could take control of time away from the provider? That way, we could decide when to process each webhook and how many webhooks we want to process at a time.
This style of handling webhooks is called asynchronous processing.
Asynchronous processing mitigates the time constraints placed on consumers to process webhooks immediately by introducing a component between the provider and consumer. This component aims to ingest webhooks from providers quickly, so we don't have to worry about timeout limits. It then buffers the webhooks and serves the consumer(s) at a rate within their capacity.
This strategy helps relieve backpressure in the webhook pipeline and removes concurrency bottlenecks on your server's end.
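A minimal sketch of this buffering component, using Python's standard-library queue and a single worker thread, is shown below. In production this role is usually played by a message broker or queue service rather than in-process threads; the names and payloads here are illustrative:

```python
import queue
import threading

buffer: queue.Queue = queue.Queue()

def ingest(webhook: dict) -> int:
    """Fast path: acknowledge the provider immediately by enqueueing
    the webhook, well within any timeout limit."""
    buffer.put(webhook)
    return 202  # HTTP 202 Accepted: received, processing deferred

def worker(processed: list) -> None:
    """Slow path: drain the buffer at the consumer's own pace."""
    while True:
        item = buffer.get()
        if item is None:  # sentinel value signals shutdown
            break
        processed.append(item)  # stand-in for real business logic
        buffer.task_done()

processed: list = []
t = threading.Thread(target=worker, args=(processed,))
t.start()

# The provider can burst deliveries; ingestion stays fast regardless.
for i in range(5):
    ingest({"id": i, "event": "order.created"})

buffer.put(None)  # stop the worker once the buffer is drained
t.join()
print(len(processed))  # 5
```

Because ingestion and processing are decoupled, a spike in deliveries fills the buffer instead of overwhelming the consumer, and you can scale the number of workers to match your actual capacity.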
To learn more about asynchronously processing your webhooks, check out our article "Why Implement Asynchronous Processing for Your Webhooks".
In this article, we discussed how much control we have in dealing with the problems our webhooks will (inevitably) face in production. We explored the different reasons we experience these problems and introduced an alternative approach to handling webhooks that helps mitigate some of their causes.
In the next article in this series, we will take a deeper look at solution options and measure the workload required to fix these problems.