Why Build Resilience to Mitigate Webhook Performance Issues

Imagine you receive webhooks for payments and respond to them by issuing concert tickets to users who have just paid. If your payment webhooks fail, users cannot get their tickets and the integrity of your ticketing application is compromised.

Now imagine that your ticket delivery webhook endpoint has security issues. Attackers can spoof payment webhooks to your webhook URL, causing your application to issue concert tickets to users who never paid.

Also, if your webhooks are not built to recover from failures, you may not be able to resolve all the payment issues that pile up while your server is down, for example after a traffic spike.

In summary, webhook failures bring nothing good. Webhooks communicate with your application and if you cannot respond to them, your application is compromised.

For webhooks to perform optimally in production, there are best practices you should follow.

In this article, you will learn about the problems related to webhooks. We will go through why you need resilient webhooks that can withstand the workload and adverse conditions common in production environments, and that can recover from failure.

What does it mean for a webhook to be resilient?

A webhook in this context refers to your webhook URL endpoint residing on your application server. This is the server that receives webhook requests from external applications (mostly SaaS applications like Stripe, Shopify, and Okta).

A resilient webhook is built with reliability in mind. This means that the developer or architect proactively predicts failure scenarios and designs the webhooks to avoid or withstand them. To ensure this, the webhook is properly tested before it is deployed to production.

Now, we all understand that failure is inevitable in production environments, so one of the major attributes of resilient webhooks is the ability to recover from failure. One example of this is the ability to restore data lost during a failed operation.

Resilient webhooks also ensure that the intention of the webhook request is consistent with the effect it has on your application. A webhook must not cause the wrong effect, as that could compromise the state of the application.

Why it’s important to make your webhooks resilient

Downtime issues due to webhook infrastructure scale

One of the main reasons you want to improve the reliability of your webhooks is to enable them to withstand pressure. In production environments, you can suddenly experience spikes in traffic on your webhook endpoint. If your server is not properly scaled to handle such traffic, it can shut down once its resources are exhausted. A server that is down cannot receive webhook requests until it is back up, and much can be lost in this downtime.

Data integrity

Data is the heart and soul of (almost) every modern-day application. What makes distributed application architectures possible today is the ability to share and synchronize data across all entities in the system.

A webhook is one of the strategies used to transport data. If the destination application fails to consume a webhook, data can easily be lost. If there is no mechanism to retransmit the data to the destination application, the application state never reflects that data, which leads to data inconsistencies.

Data integrity can also be compromised when you don't have well-tested transactional logic in your database operations. If you have a financial system that debits user A and credits user B upon receiving a webhook, you have to implement proper transactional logic in your code or data layer so that both changes are applied together or not at all. A partially applied operation, such as debiting A without crediting B, can have damaging effects on your system.
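As a minimal sketch of the debit/credit case, the example below uses Python's built-in sqlite3 module. The accounts table, the account names, and the transfer function are hypothetical and only serve the illustration; a production system would apply the same idea with its own database and schema.

```python
import sqlite3


def transfer(conn: sqlite3.Connection, sender: str, receiver: str, amount: int) -> None:
    """Debit sender and credit receiver as one atomic unit."""
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if an exception is raised inside it.
    with conn:
        cur = conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ? AND balance >= ?",
            (amount, sender, amount),
        )
        if cur.rowcount != 1:
            raise ValueError("unknown sender or insufficient funds")
        cur = conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, receiver),
        )
        if cur.rowcount != 1:
            raise ValueError("unknown receiver")


# Hypothetical schema and data for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 0)])
conn.commit()

transfer(conn, "A", "B", 40)  # both updates are committed together

try:
    # The debit succeeds, the credit fails, so the debit is rolled back.
    transfer(conn, "A", "missing", 10)
except ValueError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The second call shows why the transaction matters: without it, user A would lose 10 units that nobody receives.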

Webhook security loopholes

An unsecured webhook URL can be taken advantage of by ill-intentioned individuals and/or bots. Attackers can take advantage of vulnerabilities in your application to spoof webhook requests, thereby sending corrupt data into your application or causing an undesired effect.

An attacker in control of a botnet can easily launch a DDoS attack on your unsecured endpoint, causing your endpoint to be out of service.

Data in webhook requests can also be intercepted using several types of man-in-the-middle attacks. This can leak sensitive data to unauthorized parties.

Authentication failure

For security purposes, most webhook providers require an authentication step before the destination application can consume the webhook. This step usually relies on authentication tokens or API secrets sent in special request headers.

If webhook authentication breaks on your application, you will start dropping webhooks and in effect will not be able to respond to them.

Strategies to prevent performance issues

As explained earlier, you have to be proactive about failures to build resilience into your webhooks. In your design, take time to imagine every possible situation that can cause your webhooks to fail, and then design your webhooks to withstand them.

Let's take a look at some of the preventive solutions for ensuring that your webhooks withstand harsh conditions encountered in production environments (note that this is not an exhaustive list).

Webhook queueing (asynchronous processing)

This is a scalability strategy that removes the pressure on your server to process webhooks immediately. By using a message queue, webhooks can be buffered and then transmitted to the destination application at a rate that your server can handle. You can use message queueing technologies like Apache Kafka and RabbitMQ.
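The idea can be sketched with Python's standard library alone. The in-process queue below is only a stand-in for a real broker such as Kafka or RabbitMQ, and receive_webhook, worker, and the payload shape are hypothetical names chosen for this illustration.

```python
import queue
import threading

# Bounded buffer between the HTTP endpoint and the slower processing code.
webhook_queue = queue.Queue(maxsize=1000)
processed = []
failed = []


def receive_webhook(payload: dict) -> int:
    """Endpoint handler: enqueue the webhook and acknowledge right away."""
    try:
        webhook_queue.put_nowait(payload)
    except queue.Full:
        return 503  # buffer is full, so signal the sender to try again later
    return 202  # accepted, will be processed asynchronously


def worker() -> None:
    """Consume webhooks one at a time, at whatever pace processing allows."""
    while True:
        payload = webhook_queue.get()
        try:
            processed.append(payload["id"])  # stand-in for the real work
        except Exception:
            failed.append(payload)  # keep the payload so it can be retried
        finally:
            webhook_queue.task_done()


threading.Thread(target=worker, daemon=True).start()

statuses = [receive_webhook({"id": i}) for i in range(5)]
webhook_queue.join()  # wait until the worker has drained the queue
```

The endpoint only does the cheap part (enqueue and acknowledge), so a burst of requests fills the buffer instead of exhausting the server.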

Adding a queueing system adds an extra layer of complexity to your system. One way to avoid this is by using a queueing service like Hookdeck.

Horizontal scaling

When it comes to avoiding downtime, horizontal scaling is a widely recommended approach.

Using a load-balancer to distribute your webhook requests, you can share the load across multiple clones of your application server. This makes it easy to increase capacity by adding more clones when traffic spikes. You can also downscale by removing instances from your pool of servers when the traffic is low.

Rate limiting

As a complement to asynchronous processing, you can also attach a rate-limiting system to your queues. This will help enforce rules around how your webhooks are delivered to your destination application. You can set rules on how many webhooks can hit your endpoint at a time, or the total number of webhooks that can be consumed within a certain interval.
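One common way to express such a rule is a token bucket, sketched below. This is an illustration of the general technique, not how any particular product implements rate limiting; the TokenBucket class and its parameters are hypothetical, and the clock is injectable only so that the demonstration is deterministic.

```python
import time


class TokenBucket:
    """Allow up to `capacity` webhooks in a burst, refilled at `rate` per second."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        # Refill according to the time elapsed since the last call.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: leave the webhook in the queue for now


# Deterministic demonstration with a fake clock.
fake_now = [0.0]
bucket = TokenBucket(rate=2, capacity=3, clock=lambda: fake_now[0])

burst = [bucket.allow() for _ in range(5)]  # only the first 3 fit in the bucket
fake_now[0] += 1.0  # one second later, 2 tokens have been refilled
after_wait = [bucket.allow() for _ in range(3)]
```

A consumer would call allow() before delivering each queued webhook and hold back delivery whenever it returns False.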

Do note that adding a rate-limiting component introduces extra complexity to your system. To make it easier, you can use Hookdeck which has a rate-limiting component built in.

Webhook URL idempotency

One common scenario when receiving webhooks is that you may receive a duplicate. Duplicates are worrisome because they can cause you to perform a single operation twice, which can lead to data inconsistencies.

The solution to this is to make sure your endpoint is idempotent. An idempotent endpoint detects that a webhook has been received before and does not process it a second time. The provider cannot do this for you: the developer of the webhook endpoint has to implement it.
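A minimal sketch of duplicate detection follows. It assumes each webhook carries a unique event identifier (many providers include one, but check your provider's documentation). The table names, the payload fields id and buyer, and the handle_webhook function are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The primary key makes the database reject an event ID it has already seen.
conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE tickets (event_id TEXT NOT NULL, buyer TEXT NOT NULL)")


def handle_webhook(event: dict) -> str:
    """Issue a ticket at most once per event ID."""
    try:
        # Recording the event and issuing the ticket share one transaction,
        # so a duplicate insert rolls back without touching the tickets table.
        with conn:
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event["id"],),
            )
            conn.execute(
                "INSERT INTO tickets (event_id, buyer) VALUES (?, ?)",
                (event["id"], event["buyer"]),
            )
    except sqlite3.IntegrityError:
        return "duplicate"
    return "processed"


event = {"id": "evt_1", "buyer": "ada"}
results = [handle_webhook(event), handle_webhook(event)]
ticket_count = conn.execute("SELECT COUNT(*) FROM tickets").fetchone()[0]
```

Letting a uniqueness constraint do the detection avoids the race condition of a separate "check, then insert" step when two copies of the same webhook arrive at the same time.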

Automated testing

Testing helps simulate conditions that webhooks will face in production in order to get useful information on how they fare under these conditions. You can also use testing to check your logic to ensure that it is free of bugs.

Load tests can be added to your workflow to test your webhooks under stressful conditions similar to those caused by traffic spikes. This will help determine the load your current infrastructure can handle and inform you on how to scale it.

The importance of testing cannot be emphasized enough, and it is very important to do this before going to production. Because webhooks need a URL that the provider can reach, and usually a secure one, they can be a bit cumbersome to test on a local machine. To make your testing process easier, you can use the Hookdeck CLI, which is specifically designed for testing webhooks locally.

Security checks

Your webhook endpoints should not be accessible to everyone. You need to apply security checks to ensure that your webhooks originate from the right source and do not contain corrupt data. Find an exhaustive and well-explained list of security checks for your webhooks here.
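One widely used check is verifying an HMAC signature that the provider computes over the request body with a shared secret. The sketch below shows the generic idea only: each provider defines its own header name, encoding, and signing scheme, so the secret, the payload, and the function names here are assumptions made for illustration.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Compute the hex HMAC-SHA256 signature of a raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def is_authentic(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Return True only if the received signature matches the body."""
    expected = sign(secret, body)
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of a forged signature were correct.
    return hmac.compare_digest(expected.encode(), signature_header.encode())


# Hypothetical secret and payload for the demonstration.
SECRET = b"whsec_example"
body = b'{"id": "evt_1", "type": "payment.succeeded"}'
good_signature = sign(SECRET, body)

accepted = is_authentic(SECRET, body, good_signature)
tampered = is_authentic(SECRET, body + b" ", good_signature)
forged = is_authentic(SECRET, body, "not-a-real-signature")
```

The signature must be computed over the raw bytes of the request, before any JSON parsing, because re-serializing the payload can change its bytes and break the comparison.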

Strategies to correct performance issues

Despite all the preventive strategies listed above, failure is, like rain and taxes, inevitable. Good engineering practice requires a recovery strategy for every such event so that things can be restored to normal. So let's discuss some corrective measures that can be taken when webhooks fail (this is also not an exhaustive list).

Webhook retries

There should be a mechanism to retry any webhook the receiving application fails to consume. Using the ticketing example above, if payment webhooks fail due to server downtime, they should be retried when the server is back up so that paying customers can get their tickets.

Retry systems can be designed alongside webhook queues.
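A common retry policy is exponential backoff, where the wait doubles after each failed attempt. The sketch below illustrates that policy under stated assumptions: deliver_with_retries, flaky_consumer, and the delay values are hypothetical, and the sleep function is injectable only so that the demonstration does not actually wait.

```python
import time


def deliver_with_retries(deliver, payload, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call deliver(payload), retrying with exponential backoff on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return deliver(payload)
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))


# A consumer that is down for the first two attempts, then recovers.
calls = {"count": 0}


def flaky_consumer(payload):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("server still down")
    return "ticket issued for " + payload["buyer"]


delays = []  # record the waits instead of sleeping
result = deliver_with_retries(flaky_consumer, {"buyer": "ada"}, sleep=delays.append)
```

Retries can deliver the same webhook more than once, so they work best together with an idempotent endpoint.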

Logging and monitoring

Visibility into the state of webhook operations is also key to making webhooks resilient. This enables engineers to make informed decisions on improving the reliability of the webhook.

It also helps you to quickly detect errors and trace what the causes are. This leads to quicker debugging operations and allows you to easily fix bugs and bottlenecks.
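As one possible starting point, each webhook can produce a structured log line that is easy to search and aggregate. The field names, the "webhooks" logger name, and the log_webhook helper below are hypothetical choices made for this sketch.

```python
import json
import logging

logger = logging.getLogger("webhooks")
logger.setLevel(logging.INFO)


def log_webhook(event_id: str, source: str, status: str, duration_ms: float) -> str:
    """Emit one JSON log line per webhook and return it."""
    line = json.dumps(
        {
            "event_id": event_id,
            "source": source,
            "status": status,
            "duration_ms": duration_ms,
        },
        sort_keys=True,
    )
    # Failures are logged at ERROR so that alerts can be attached to them.
    level = logging.INFO if status == "ok" else logging.ERROR
    logger.log(level, line)
    return line


ok_line = log_webhook("evt_1", "payments", "ok", 12.5)
failed_line = log_webhook("evt_2", "payments", "failed", 3010.0)
```

Logging the event ID with every line makes it possible to trace a single webhook from receipt through processing and retries.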

Debugging

Debugging is the process of finding and fixing the causes of bugs. When failure happens, this is the first corrective step you want to take. Debugging helps you track down and eliminate the causes of webhook failure.

As mentioned in the previous section, a good logging and monitoring system makes debugging a lot easier.

Data backups

Database backups help you to easily roll back to a state before the system's data was compromised.

Let's assume a set of webhooks fired and you ended up with inconsistent data. Having backups allows you to go back to the state before the webhooks caused the inconsistency. You can then fix your issue(s) and retry the webhooks.

Conclusion

Building resilience into your webhooks is not a one-time fix but a continuous process. You have to keep monitoring the system to find more ways to improve its reliability, withstand pressure, and remove vulnerabilities.

Having the right tools for the job makes it easier, and as mentioned throughout the article, Hookdeck has a lot of built-in features to help build resilient webhooks.