Best Practices When Deploying Webhooks in Production

So you have introduced webhooks into your architecture and you're now ready to move your setup into a production environment. It's a known fact that things often don't go as expected in production the way they do in development: a development environment is far more predictable and controllable. Production brings unpredictable traffic, unwanted data, wrong data types, hosting provider downtime, and security attacks, among many other challenges.

This article introduces some recommended steps you should consider taking before shipping your webhooks to production.

Webhooks setup

Local webhooks troubleshooting and testing

The first step you want to take is to ensure that you have performed proper testing of your webhooks in your development environment.

Webhook requests fired from applications you do not control (which is the case most of the time) will most likely not be fired again. Thus, if you do not properly consume a webhook when it's fired, you may not get a second chance, and your system might suffer losses.

Through testing, you can fire dummy webhook requests and debug any errors in your request-processing code. Once everything looks good, you can then proceed to deploy your webhooks to production.
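As a minimal sketch of what that looks like, the snippet below fires a dummy webhook at a local handler using Node's built-in fetch (Node 18+). The endpoint URL, payload shape, and event fields are all hypothetical; substitute whatever your handler and provider actually use.

```typescript
// Sketch: fire a dummy webhook at a local handler to exercise your
// processing code before deploying. The URL and payload are placeholders.
const dummyEvent = {
  id: "evt_test_001",
  type: "payment.succeeded",
  created_at: new Date().toISOString(),
  data: { amount: 4200, currency: "usd" },
};

async function fireDummyWebhook(): Promise<void> {
  const res = await fetch("http://localhost:3000/webhooks", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(dummyEvent),
  });
  console.log(`Handler responded with ${res.status}`);
}

fireDummyWebhook().catch((err) => console.error("Webhook test failed:", err));
```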

One of the sore points of local development with webhooks is that webhook providers need a secure endpoint to target, and that endpoint must be publicly accessible, which is not something locally running servers offer out of the box.

Secure your webhook URL

Webhook requests are regular HTTP requests, which makes them prone to the same security vulnerabilities common to HTTP operations. Attackers can take advantage of an unsecured endpoint to inject corrupt or malicious data into your webhooks and cause undesired effects on your servers.

Below is a summary of the security checks you should be performing:

  • Encrypt your webhook requests by always using an SSL certificate on your webhook URLs.
  • Verify where your webhooks are coming from to ensure that they are originating from the right source, for example by checking a signature (see the sketch after this list).
  • Webhook providers should also verify the consumers of the webhook using strategies like API secrets or authentication tokens.
  • Verify that the payload does not contain corrupt data.
  • Prevent replay attacks by verifying the message's timestamp.

This is not an exhaustive list and you're advised to take more precautions for vulnerabilities you have identified outside of this list.
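To make the source and timestamp checks concrete, here is a minimal sketch of signature verification using Node's crypto module. It assumes a common scheme in which the provider signs `${timestamp}.${rawBody}` with a shared secret and sends the result and the timestamp in custom headers; the header names, signing format, and tolerance window are assumptions, so adapt them to whatever your provider documents.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumed scheme: the provider signs `${timestamp}.${rawBody}` with a shared
// secret and sends an `x-webhook-signature` header (hex HMAC-SHA256) plus an
// `x-webhook-timestamp` header. Names and format vary by provider.
const SIGNING_SECRET = process.env.WEBHOOK_SECRET ?? "";
const TOLERANCE_SECONDS = 300; // reject messages older than 5 minutes (replay protection)

export function verifyWebhook(
  rawBody: string,
  signatureHeader: string,
  timestampHeader: string
): boolean {
  const timestamp = Number(timestampHeader);
  if (!Number.isFinite(timestamp)) return false;

  // Reject stale messages to prevent replay attacks.
  const ageSeconds = Math.abs(Date.now() / 1000 - timestamp);
  if (ageSeconds > TOLERANCE_SECONDS) return false;

  // Recompute the signature and compare in constant time.
  const expected = createHmac("sha256", SIGNING_SECRET)
    .update(`${timestamp}.${rawBody}`)
    .digest("hex");

  const a = Buffer.from(expected, "hex");
  const b = Buffer.from(signatureHeader, "hex");
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Note that the comparison uses timingSafeEqual rather than a plain string comparison, which avoids leaking information about the expected signature through response timing.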

Webhook scalability

Asynchronous processing with webhook queues

Another recommended setup to have when you need to work with webhooks in production is a message queue.

It's very common to reach capacity limits when trying to process all your webhook requests immediately. Once server capacity is exceeded, your server can crash or start dropping requests, and you can no longer process webhooks. We've written an introduction to asynchronous processing if you'd like to learn more.

A message queue helps avoid this situation by buffering your webhooks and serving them to your server at a rate that will not cause the server to crash. A message queue is placed in between the webhook provider and your server and acts as a proxy for your webhook requests.

You can set up message queues using open-source technologies like RabbitMQ and Apache Kafka. However, message queues add an extra layer to your architecture, and at some traffic levels they too need to be scaled up to avoid downtime. Message queues also require that you properly understand their inner workings to use them the right way.
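As a rough sketch of the pattern, the snippet below buffers webhooks through RabbitMQ using the amqplib package, assuming a broker running on localhost; the queue name and connection URL are arbitrary choices, and a real service would reuse a single connection and channel rather than opening one per request.

```typescript
import amqp from "amqplib";

// Sketch only: assumes a local RabbitMQ broker and the `amqplib` package.
const QUEUE = "webhooks";

// Producer: the HTTP handler acknowledges the provider immediately and pushes
// the raw payload onto the queue instead of processing it inline.
// (In a real service you would reuse one connection/channel, not reconnect per call.)
export async function enqueueWebhook(rawBody: string): Promise<void> {
  const conn = await amqp.connect("amqp://localhost");
  const channel = await conn.createChannel();
  await channel.assertQueue(QUEUE, { durable: true });
  channel.sendToQueue(QUEUE, Buffer.from(rawBody), { persistent: true });
  await channel.close();
  await conn.close();
}

// Consumer: a separate worker drains the queue at its own pace, so traffic
// spikes pile up in RabbitMQ instead of overwhelming your server.
export async function startWorker(): Promise<void> {
  const conn = await amqp.connect("amqp://localhost");
  const channel = await conn.createChannel();
  await channel.assertQueue(QUEUE, { durable: true });
  channel.prefetch(10); // process at most 10 webhooks concurrently

  await channel.consume(QUEUE, async (msg) => {
    if (msg === null) return;
    try {
      const event = JSON.parse(msg.content.toString());
      // ...process the webhook event here...
      channel.ack(msg);
    } catch {
      channel.nack(msg, false, true); // requeue on failure
    }
  });
}
```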

For more information, you can read Intro to Message Queues and Message Queues Deep Dive.

Webhooks rate limiting to avoid issues of scale

Every server is limited by its resources, and these resources determine the amount of web traffic the server can handle. Rate limiting allows you to cap the number of requests that hit your server at the rate it can handle at any given point. If your server can only process a hundred requests within a second, then you should consider setting a rate limit of 80-90 requests/second (not 100, because you need to leave some headroom to ensure dependability).

You can also limit the total number of requests that are processed within a given period. For example, you can set a hard hourly limit that ensures your server only processes 2,000 requests each hour.

Rate limiting is most useful when you're not ready to scale up your servers and you can afford a little delay while buffered requests wait for the active ones to finish processing.

Setting up a rate-limiting system can be done using timers and cron jobs, but it's often not that simple: the logic needed to keep it stable and efficient can get complex.
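A minimal sketch of the idea, using an in-memory token bucket sized for the 80-90 requests/second headroom discussed above. This is a single-process illustration with made-up numbers; a deployment spanning multiple servers would need a shared store (for example Redis) or a gateway-level limiter instead.

```typescript
// Minimal in-memory token bucket: allow roughly `capacity` webhooks per second
// and shed (or buffer) the rest. Single-process sketch only.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private capacity: number, private refillPerSecond: number) {
    this.tokens = capacity;
  }

  tryConsume(): boolean {
    const now = Date.now();
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond
    );
    this.lastRefill = now;

    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Leave headroom: ~90 requests/second for a server that tops out around 100.
const limiter = new TokenBucket(90, 90);

export function handleWebhook(rawBody: string): number {
  if (!limiter.tryConsume()) {
    return 429; // over the limit: ask the provider to retry later
  }
  // ...process or enqueue the webhook...
  return 200;
}
```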

Horizontal scaling

When your webhook traffic is getting to a point where you need to process tens of thousands of webhooks daily, you should start considering scaling your servers horizontally.

Horizontal scaling allows you to distribute your web traffic to multiple clones of your server. A load balancer is placed between your webhook provider and your server pool. The load balancer uses different types of algorithms (round-robin, least connection, weighted round-robin, etc.) to effectively distribute your webhook traffic to more than one copy of your server.

This reduces the burden of having only one server process all the requests. Horizontal scaling can also be combined with message queues, where the load balancer distributes traffic to multiple queue-server pairs.
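To illustrate just the distribution logic, here is a round-robin sketch. In practice you would use nginx, HAProxy, or a cloud load balancer rather than writing this yourself, and the upstream addresses shown are placeholders.

```typescript
// Illustration of the round-robin selection a load balancer performs.
const upstreams = [
  "http://10.0.0.11:3000",
  "http://10.0.0.12:3000",
  "http://10.0.0.13:3000",
];

let cursor = 0;

// Pick the next server clone in rotation.
function nextUpstream(): string {
  const target = upstreams[cursor];
  cursor = (cursor + 1) % upstreams.length;
  return target;
}

// Each incoming webhook is forwarded to a different clone in turn.
export async function forwardWebhook(rawBody: string): Promise<number> {
  const res = await fetch(`${nextUpstream()}/webhooks`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: rawBody,
  });
  return res.status;
}
```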

Webhook error recovery

Webhooks logging for audit trails

One highly recommended setup you want to have for your webhooks in a production environment is logging. You need to be able to trace your webhook requests and be aware of their status at all times, especially in situations where they fail.

When webhooks fail, information about the failure helps in debugging and fixing them. You can perform logging using different strategies, from writing logs to simple flat files to using a standard logging service.
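A simple sketch of the flat-file approach: each processed webhook is recorded as one JSON line, giving you a searchable audit trail. The field names and file path are illustrative, and a hosted logging service would replace the file write.

```typescript
import { appendFile } from "node:fs/promises";

// Append one JSON line per webhook so the outcome of every request is traceable.
interface WebhookLogEntry {
  eventId: string;
  source: string;
  receivedAt: string;
  status: "success" | "failure";
  error?: string;
}

export async function logWebhook(entry: WebhookLogEntry): Promise<void> {
  await appendFile("webhooks.log", JSON.stringify(entry) + "\n");
}

// Usage: record the outcome of every webhook you process.
logWebhook({
  eventId: "evt_test_001",
  source: "payments-provider",
  receivedAt: new Date().toISOString(),
  status: "failure",
  error: "database timeout",
}).catch(console.error);
```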

Webhooks retries

Sometimes, no matter how many precautions you take, webhooks still fail. What is important is that you have a way to recover from the failure, which can be achieved by retrying the webhook requests.

Some webhook providers do this for you automatically, but it's not advisable to leave this responsibility to them. You should be in control of your retry system.

One way most engineers achieve this is by saving events to the database as they are received, before they are processed. This way, if processing fails, the event persisted in the database can be retried until it is successfully processed.

Your retry system must also ensure that events are cleared from the store once they are successfully processed. You can go further by adding a configurable maximum number of retries, alerting when that maximum is exceeded so you can investigate, and other features that make the retry system more flexible and reliable.
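Here is a minimal sketch of that persist-then-process pattern. An in-memory Map stands in for the database table, and the retry cap and alert hook are illustrative defaults rather than prescribed values.

```typescript
// Persist-then-process with retries. A Map stands in for a real database.
interface StoredEvent {
  id: string;
  payload: string;
  attempts: number;
}

const MAX_RETRIES = 5;
const eventStore = new Map<string, StoredEvent>();

// Step 1: save the event before doing any work, so nothing is lost if
// processing fails.
export function saveEvent(id: string, payload: string): void {
  if (!eventStore.has(id)) {
    eventStore.set(id, { id, payload, attempts: 0 });
  }
}

// Step 2: try to process; clear the event on success, count failures, and
// alert once the retry budget is exhausted.
export async function processWithRetry(
  id: string,
  handler: (payload: string) => Promise<void>,
  alert: (event: StoredEvent) => void
): Promise<void> {
  const event = eventStore.get(id);
  if (!event) return;

  try {
    await handler(event.payload);
    eventStore.delete(id); // success: clear it from the store
  } catch {
    event.attempts += 1;
    if (event.attempts >= MAX_RETRIES) {
      alert(event); // give up and notify someone to investigate
      eventStore.delete(id);
    }
    // otherwise the event stays in the store and a later pass retries it
  }
}
```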

How Hookdeck can help

Here is how Hookdeck addresses each of the best practices covered above:

  • Troubleshooting locally: By using the Hookdeck CLI, you can quickly create a publicly accessible secure URL that targets your local server.
  • Asynchronous processing: If you need to set up a message queue quickly to process your webhooks asynchronously and avoid all the maintenance costs, you can use Hookdeck to create connections between your webhook providers and your destination server.
  • Rate limiting: Rate-limiting features have been built into Hookdeck's infrastructure, with user-friendly controls for setting the limits on your destination server.
  • Logging and audit trails: Hookdeck's logging features allow you to inspect properties like status (success or failure), time of delivery, webhook source, request headers, payload data, and more.
  • Retries: Hookdeck uses configurable Rulesets and Rules that allow you to define retry settings for the built-in retry system on your connections. Aside from configuring the total number of retries and the retry intervals, you can also set alerts to be informed when a webhook fails or when the retries have been exhausted.

Conclusion

Everything becomes critical once you move from a development environment to a production environment, because the tolerance for failure in production is far lower than in development.

By following the recommendations in this article, you can sleep peacefully knowing that your webhooks will perform optimally in production. For a more in-depth breakdown of surviving production environments, take a look at this article.