Why Fixing Webhooks Might Be Harder Than You Thought
So far in this series, we have discussed webhook difficulties in production environments and the pain of fixing them. We also examined the problems to understand their scope and the factors within our control, and introduced a new proposal for dealing with them.
This article will look at standard solutions used today to solve webhook problems. It begins by categorizing the solutions we should implement to build resiliency into our webhooks, and is followed by a list of tools that can help us implement each solution.
Finally, we zoom out to observe the cost of maintaining each new solution.
Ignorance is bliss…until you have to deal with the problem
In the previous article, I mentioned that teams without production experience with webhooks often underestimate the amount of work needed to keep them functioning reliably. This can turn out to be very costly.
Lack of awareness of webhook production issues often leads to putting out fires every day without ever knowing the root cause. Teams can also find themselves on a wild goose chase, fixing the wrong webhook problems or compounding simple issues.
The point here is that while webhook problems are inevitable, approaching them with a clear understanding of the issues they are prone to in production is the first step toward running your webhooks like a well-oiled machine.
Understanding the different types of issues to be solved
We now agree that it's imperative to understand the type of problems our webhooks are prone to in production environments. We have also concluded from the previous article that an asynchronous approach is best for handling our webhooks. So let's discuss what these problems are and their corresponding solutions.
Problems and solution categories
| Problem | Solution category |
| --- | --- |
| We need to test webhooks before deployment. | Webhook developer tools |
| Failure is inevitable, and we need to know when failure happens in order to troubleshoot it. | Alerting and logging |
| To avoid timeouts, we need to respond quickly to webhooks from the provider. | Ingestion runtime |
| We need to process webhooks asynchronously. | Queues |
| We need to scale by distributing the webhook load across multiple servers. | Consumer runtime |
| We need to persist webhooks until we confirm that they have been processed. | Storage engines |
| We need to perform custom functions like retrying failed webhooks and automating quick fixes. | Custom scripts |
The above list, while not exhaustive, covers the most crucial problems we must solve to keep our webhooks resilient to faults.
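To make the first two runtime concerns concrete, here is a minimal sketch of the "respond quickly, process asynchronously" pattern. It uses Python's in-process `queue.Queue` purely as a stand-in for a managed broker (Kafka, SQS, etc.), and the function names `ingest_webhook` and `consume_forever` are hypothetical, not from any specific framework:

```python
import json
import queue

# In-process queue as a stand-in for a managed broker (Kafka, SQS, etc.).
webhook_queue = queue.Queue()

def ingest_webhook(raw_body: str) -> int:
    """Ingestion step: validate minimally, enqueue, and acknowledge fast.

    Returning 200 before any business logic runs keeps us well inside the
    provider's timeout window; the heavy lifting happens in the consumer.
    """
    try:
        event = json.loads(raw_body)
    except json.JSONDecodeError:
        return 400  # reject malformed payloads immediately
    webhook_queue.put(event)
    return 200  # acknowledged; processing happens asynchronously

def consume_forever(handler):
    """Consumer step: drain the queue on a separate worker/server."""
    while True:
        event = webhook_queue.get()
        handler(event)
        webhook_queue.task_done()
```

The key design choice is that the ingestion path does nothing expensive: it parses, enqueues, and returns, so the provider sees a fast acknowledgment even when downstream processing is slow or temporarily broken.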
Now let's look at the tools we have at our disposal today to solve these problems.
Tools required for implementing solutions
| Solution category | Tool options |
| --- | --- |
| Webhook developer tools | ngrok, Postman |
| Alerting and logging | ELK, Datadog, PagerDuty, Sentry |
| Ingestion runtime | Nginx, Lambda, Cloudflare Workers, Kubernetes, VMs, etc. |
| Queues | Kafka, AWS SQS, RabbitMQ, Azure Service Bus, AWS EventBridge, GCP Pub/Sub |
| Consumer runtime | Nginx, Lambda, Cloudflare Workers, Kubernetes, VMs, etc. |
| Storage engines | Postgres, AWS S3, MySQL, AWS RDS, GCS, DynamoDB, etc. |
| Custom scripts | Dead-lettering recovery, rate limiters, etc. |
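To illustrate the "custom scripts" category, here is a sketch of a retry helper with exponential backoff that parks repeatedly failing events in a dead-letter list for later inspection and replay. All names (`process_with_retry`, `MAX_ATTEMPTS`, the injected `sleep`) are hypothetical, and a real version would write to a durable dead-letter queue rather than an in-memory list:

```python
import time

MAX_ATTEMPTS = 3
BASE_DELAY = 0.1  # seconds; tiny here, typically seconds-to-minutes in practice

def process_with_retry(event, handler, dead_letters, sleep=time.sleep):
    """Retry a failing handler with exponential backoff, then dead-letter.

    `handler` raises on failure. After MAX_ATTEMPTS the event is appended to
    `dead_letters` for later inspection/replay instead of being lost.
    Returns True if processed, False if dead-lettered.
    """
    for attempt in range(MAX_ATTEMPTS):
        try:
            handler(event)
            return True
        except Exception:
            sleep(BASE_DELAY * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...
    dead_letters.append(event)
    return False
```

Injecting `sleep` as a parameter keeps the script testable; in production you would also log each failed attempt so the alerting layer can pick it up.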
Below is a diagram of how the new setup will look:
To learn more about this webhook infrastructure architecture, the decisions that led to it, and its many benefits, check out our article “Webhook Infrastructure Requirements and Architecture”.
Is it just me, or did life seem a lot easier when the setup was simply Provider -> Webhook -> Consumer? Now it seems like we've opened a can of worms.
You will also need a solid grounding in, and experience with, event-driven architectures to understand how these tools operate together.
If there is one thing we have learned so far, it's that simplicity doesn't necessarily mean stability; achieving reliability for our webhooks in production means taking on this added complexity.
However, is this the best solution? That’s what we will be discussing in the next section.
The brittle nature of the resulting solution
No solution is perfect. However, the one described in the previous section solves all the problems we highlighted... if we get it right.
We measure the cost of a solution by its maintenance requirements and not its benefits. Since the benefits are already apparent, I'd like to talk about the caveats.
The level of expertise required for the solution
While this solution is well thought out and designed to fix all the concerns highlighted in previous sections, it's a challenging infrastructure to build.
Whoever is responsible for implementing the solution needs sound knowledge of and experience with event-driven architectures. They also need to be proficient in at least one of the tools in each solution category in order to integrate them all to work harmoniously.
As seen in the Solution/Tool table, you will need to integrate one tool from each section into the standard webhook infrastructure. That would be super easy if the same vendor made all the tools, but sadly that's not the case. It’s quite a learning curve to understand how all these tools fit together, courtesy of the varying and often opinionated APIs they all expose.
Even when you get the integrations right, you will need to deal with the compatibility of different software versions.
Orchestrating these tools the wrong way leads to poor usability and a frustrating experience for the team.
Ongoing maintenance and upgrades
This concern is a continuation of the previous one. Integration work is continuous: you still have to deal with updates and upgrades from vendors from time to time.
Some of these updates/upgrades can lead to breaking changes you need to reconcile within the infrastructure. While you can ignore some updates/upgrades to maintain compatibility, it's detrimental to ignore the ones that address critical issues like security vulnerabilities within the software.
There is also the task of adding new components to the infrastructure to harden or expand its feature set, or swapping an existing component for a better one. For example, you might discover that Splunk offers more benefits than the ELK stack. These types of changes are non-trivial.
Onboarding and handoff issues
Without proper documentation, onboarding a new team member to manage this webhook infrastructure is almost as complex as building it.
And even when there is documentation, there is a chance that the tools the new member is familiar with in each solution category are different from the ones used by the team. This situation may not be as severe as not having proper documentation, but it can still add substantial friction to the onboarding process.
Handing off the infrastructure can also be problematic if the departing team member is the only one who knows how the entire infrastructure, or certain parts of it, works.
This article has shown that solving webhook problems in production is no walk in the park. There is still a lot to consider, even with the right solutions in place, and understanding this will help you prepare for what lies ahead.
In the following article, we will use all the information we have gathered so far to decide on the best strategy for handling our webhooks in production.