Why Webhooks Are Difficult to Manage in Production
When teams start using webhooks, it is often not apparent how much work it takes to keep them reliable. These lessons are usually learned painfully, after the first production deployment.
In this article, we follow a development team that has just deployed webhooks to production. We focus on the issues the team encounters, the impact on users, and the cumbersome remediation process needed to keep the system running.
Production = problems
Webhooks' sole purpose is to integrate two remote services/applications by creating a one-way communication channel where a source can send a message (webhook) to a destination application.
This channel helps orchestrate the responsibilities shared by these services/applications in a workflow. Therefore, the workflow becomes compromised if there is a fault in this communication channel (or any component).
Failure, however, is inevitable in production environments.
Webhooks in production are exposed to more (and more varied) external factors from their dependencies than webhooks in a controlled development environment.
Take a look at this table illustrating how failures can occur.
| Reason for webhook failure | Explanation |
|---|---|
| Server downtime | This can happen due to completely using up the server resources when put under loads beyond capacity. It can also be an intentional downtime for server upgrades or migration. |
| Server returning errors | This occurs when there is a bad release causing the server to throw errors from the code's logic or I/O operations. |
| Server timeouts | Faults can also arise from the server dropping webhooks after timing out due to long-running processes hogging the connection pool or using up memory. |
| Network errors | Sometimes, the fault comes from the network link between the services. Bandwidth exhaustion and high latencies causing webhook providers to drop webhooks are issues that can arise from the network side. |
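To make these categories concrete, here is a minimal sketch of how a webhook sender might sort a failed delivery attempt into one of the four rows above. The `error` labels, the choice of status code 503 as a downtime signal, and all names are assumptions made for illustration; real providers report failures in their own ways.

```python
from enum import Enum
from typing import Optional


class FailureReason(Enum):
    SERVER_DOWNTIME = "server downtime"
    SERVER_ERROR = "server returning errors"
    SERVER_TIMEOUT = "server timeout"
    NETWORK_ERROR = "network error"


def classify_failure(status_code: Optional[int],
                     error: Optional[str]) -> Optional[FailureReason]:
    """Map one delivery attempt to a failure category, or None on success.

    `status_code` is the HTTP status returned by the destination, if any.
    `error` is a hypothetical label set by the sender when no response
    arrived, e.g. "connection_refused" or "timeout".
    """
    if status_code is not None:
        if 200 <= status_code < 300:
            return None  # delivered successfully
        if status_code == 503:
            # Assumption: 503 signals an overloaded or offline server.
            return FailureReason.SERVER_DOWNTIME
        # Any other non-2xx response is treated as an error from the server.
        return FailureReason.SERVER_ERROR
    # No HTTP response at all: decide from the transport-level label.
    if error == "timeout":
        return FailureReason.SERVER_TIMEOUT
    if error == "connection_refused":
        return FailureReason.SERVER_DOWNTIME
    return FailureReason.NETWORK_ERROR
```

In practice the boundaries are blurrier than this: a timeout, for example, may come from a slow server or from a congested network link, which is part of why diagnosing failures takes time.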
Impact of webhook problems on your users
In the previous section, we talked about how webhooks coordinate the responsibilities of components in a single workflow. These workflows are often tasks a user wants to achieve by using your application.
To an end user, this workflow can be:
- Receiving an e-ticket in the mail after payment;
- Processing a delivery after paying for cart items;
- Updating an e-wallet after purchasing items from an online store; or
- Getting a breakdown of results after finishing an online test.
For us to understand the impact of a failed webhook operation, we will break down these example scenarios in the table below.
| User workflow | Webhook flow | Impact of failure |
|---|---|---|
| Receiving an e-ticket in the mail after payment | Payment Service → Webhook → Ticketing Service | Money is deducted from the user, but the ticket is not sent. The customer does not get value for the money paid. |
| Processing a delivery after paying for cart items | Payment Service → Webhook → Delivery Service | Money is taken from the customer, but the delivery is not processed. Thus, the customer does not get delivery of the items purchased. |
| Updating an e-wallet after purchasing items from an online store | Shopping Service → Webhook → Wallet Service | Items are successfully purchased, but the update on the e-wallet fails. The customer gets the goods without payment. Wallet service is compromised. |
| Getting a breakdown of results after finishing an online test | Test Service → Webhook → Results Service | The test is complete, but data transfer for result computation fails. The student doesn't get the test result, and the test data may be lost forever. |
We have learned two major things here. When a webhook fails, the workflow is compromised, and the end user (and/or system) is negatively impacted.
The endless cycle of resolving webhook issues
We now understand that failure is inevitable in production environments, and unfortunately, the impact of these failures is non-trivial. So let's look at how development teams have been solving these issues.
Dev workflow pain points
To fix a webhook problem, teams have to go through the steps outlined below.
1. Find the missing webhooks. These are the webhooks that were fired but could not make it to their destination due to a failure.
2. Identify the issue. Observe the failed webhooks to find the root cause of the problem.
3. Fix the problem. Find and apply the solution to the problem.
4. Resolve the issue. Reconcile all failed/missing webhooks to get the system back to a consistent state.
Do you see the problem with this approach? This process has to be repeated for every issue that arises from using webhooks in production.
Because failures will always occur, repeating this remediation process over and over is the central pain point for development teams.
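The remediation steps above can be sketched in code to show how much of the work is mechanical. This is only an illustration: the `Delivery` record, the `failure_reason` labels, and the `resend` callback are all hypothetical, and the actual fix (step three) still has to be done by a person.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Delivery:
    """One entry in a hypothetical delivery log kept by the sender."""
    webhook_id: str
    delivered: bool
    failure_reason: str = ""


def find_failed(log: Iterable[Delivery]) -> List[Delivery]:
    """Step 1: find the webhooks that never reached their destination."""
    return [d for d in log if not d.delivered]


def summarize_causes(failed: Iterable[Delivery]) -> Counter:
    """Step 2: count failures per reason to point at the root cause."""
    return Counter(d.failure_reason for d in failed)


def reconcile(failed: Iterable[Delivery],
              resend: Callable[[Delivery], bool]) -> List[Delivery]:
    """Step 4: resend each failed webhook once the fix is deployed.

    Returns the deliveries that are still failing after the resend.
    """
    still_failing = []
    for delivery in failed:
        if resend(delivery):
            delivery.delivered = True
        else:
            still_failing.append(delivery)
    return still_failing
```

Even in this simplified form, every incident means running the whole loop again, and anything left in the list returned by `reconcile` starts another round.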
The need for a better developer experience
You can't have developers and administrators constantly sifting through thousands of webhooks to find the ones that failed. Solving these issues repeatedly costs many development hours that could instead go toward improving your solution. Even if you keep adding more developers to speed up the process, there will be a breaking point. With millions of webhooks and a failure rate of at least 1%, there are tens of thousands of failures to triage, and adding developers does not scale to that volume.
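A quick back-of-the-envelope calculation shows why manual triage breaks down. The one-minute-per-webhook figure below is purely an assumption for illustration; only the 1% failure rate comes from the discussion above.

```python
def failed_webhooks(total: int, failure_rate: float) -> int:
    """Number of webhooks expected to fail at the given rate."""
    return round(total * failure_rate)


def triage_hours(failed: int, minutes_per_webhook: float) -> float:
    """Hours of manual work if each failure takes a fixed time to handle."""
    return failed * minutes_per_webhook / 60


failed = failed_webhooks(1_000_000, 0.01)             # 1% of one million
hours = triage_hours(failed, minutes_per_webhook=1)   # assumed 1 minute each
print(f"{failed} failed webhooks, about {hours:.0f} hours of triage")
```

Under these assumptions, one million webhooks produce 10,000 failures and roughly 167 hours of manual work, which is about a month of one developer's time.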
Also, many failures are typical and follow a pattern, so the same solution can be applied to a whole group of similar failures; this is one area where automation can come in handy.
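As a rough sketch of what such automation could look like, the snippet below groups failures by a shared signature and looks up a predefined action for each group. The signatures, the action names, and the `PLAYBOOK` mapping are hypothetical placeholders, not part of any particular tool.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# A failure is a (webhook_id, signature) pair; the signature is a
# hypothetical label such as "timeout" or "http_503".
Failure = Tuple[str, str]

# Hypothetical playbook: known failure patterns and their automated action.
PLAYBOOK: Dict[str, str] = {
    "timeout": "retry_with_backoff",
    "http_503": "retry_after_downtime",
}


def group_by_signature(failures: Iterable[Failure]) -> Dict[str, List[str]]:
    """Collect webhook IDs that failed in the same way."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for webhook_id, signature in failures:
        groups[signature].append(webhook_id)
    return dict(groups)


def plan_actions(failures: Iterable[Failure]) -> Dict[str, Tuple[str, List[str]]]:
    """Pair each group with its playbook action, or escalate if unknown."""
    return {
        signature: (PLAYBOOK.get(signature, "escalate_to_developer"), ids)
        for signature, ids in group_by_signature(failures).items()
    }
```

With a mapping like this, only failures that match no known pattern need a developer's attention; the rest can be handled the same way every time.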
In summary, you need to avoid spending time fixing the same set of problems over and over. Such experiences kill the development team’s morale and lead to the loss of valuable person-hours.
A solution that accelerates the remediation process and ensures that developers spend the least amount of time possible combating production failures should be the standard.
In this article, we have gone through the diary of a development team using webhooks in production. We discussed failures typical with production environments and their impact on users, including the growing pains of fixing webhook issues. In the following article in this series, we will discuss the scope of the problems and how the issues compound with scale.