Error Recovery

Errors are inevitable when working with webhooks. That is why it is essential to build resilience into webhooks by making them fault-tolerant. Fault tolerance for webhooks is the ability of the consumer application to return to its desired state despite the occurrence of errors.

In this article, I will go through the most common error recovery problems you might run into when working with webhooks and explain how to solve them with Hookdeck. Each problem addressed will include a small discussion section.

What are webhook errors?

A webhook error is any situation that inhibits a webhook from achieving its intended purpose in the destination application. Webhook errors can come from the webhook producer, the webhook consumer, the network link between them, and/or any intermediary component within the communication process.

Here are some examples of webhook errors:

  • Network errors: These can be network failures between the producer and consumer, request timeouts, etc.
  • Protocol errors: Authentication errors, bad requests, invalid formats, unauthorized clients, expired access keys, etc. Typically 4xx errors.
  • Consumer errors: Code errors (bugs), database timeouts, etc. Typically 5xx errors.
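As a rough illustration, the categories above can be told apart from the outcome of a delivery attempt. The sketch below is a simplification: the function name `classify_webhook_error` and the convention of passing `None` when no HTTP response arrived are assumptions made for this example, not part of any particular tool.

```python
from typing import Optional


def classify_webhook_error(status_code: Optional[int]) -> str:
    """Map a delivery outcome to one of the error categories listed above.

    status_code is None when no HTTP response was received at all
    (for example a connection failure or a request timeout).
    """
    if status_code is None:
        return "network"
    if 200 <= status_code < 300:
        return "ok"
    if 400 <= status_code < 500:
        return "protocol"   # authentication errors, bad requests, etc.
    if 500 <= status_code < 600:
        return "consumer"   # bugs, database timeouts, etc.
    return "other"          # redirects and anything else not covered above
```

In practice the boundaries are less clean than this (a 5xx can come from an intermediary rather than the consumer), so treat the result as a first guess for triage.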

Recovering from webhook errors (webhook resiliency)

Let’s assume you’re currently dropping webhooks due to an error. How do you pause the communication link between the producer(s) and consumer(s) to ensure that you don’t continue dropping webhooks?

Also, how do you retrieve and retry all failed webhooks after applying a fix? What if you only want to test one failed webhook to confirm the fix and then test the remaining failed webhooks after?

The process of recovering from webhook errors can be divided into three stages, each discussed briefly below.

1) Tracking

First is the tracking stage. This involves monitoring your webhooks to catch errors and raise alerts. Tools like Bugsnag are specially built for this purpose, while standard monitoring tools like Datadog can be set up to watch and alert on errors.

2) Debugging

Next is the debugging stage. This involves investigating the error to determine the root cause. You want a tool that provides just enough data to determine what caused the error and not so much information that it becomes noisy and overwhelming.

3) Recovery

Then comes recovery. As obvious as the need may seem, there isn’t any standard tool for it when it comes to webhooks. Recovery is the ability to return your application to its intended state after a failure occurs.

In the next section, we’ll take a look at how Hookdeck helps solve these (and many other) error-recovery problems.

How Hookdeck helps you recover from errors

You are dropping webhooks because your server crashed during a spike

Problem

You need to rate limit incoming webhooks.

Solution

Hookdeck allows you to set a rate limit per destination, capping how many webhooks your destination can receive in a given period of time.

Discussion

Sudden spikes in traffic can push a server past its maximum throughput. When that happens and the resources available for network connections and request processing are used up, the server will crash.

To fix this problem, you need to control the rate at which providers are sending webhooks to your servers. This is done by adding a rate-limiting component between your provider(s) and your consumer(s). This way, you can tune the webhook request rate below the server’s throughput to avoid server crashes.
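One common way to build such a rate-limiting component is a token bucket. The sketch below is a generic illustration, not how Hookdeck implements its rate limit; the class name `TokenBucket`, its parameters, and the injectable `clock` (used so the behaviour can be checked without sleeping) are all assumptions for this example.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow roughly `rate` deliveries per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)   # start with a full bucket
        self.last = clock()

    def try_acquire(self) -> bool:
        """Return True if a webhook may be forwarded now, False if it must wait."""
        now = self.clock()
        # Refill in proportion to the time elapsed, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A webhook for which `try_acquire()` returns `False` should be held and retried later rather than discarded; otherwise the limiter itself becomes a source of dropped webhooks.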

You need to manually rate-limit when reconciling a fix

Problem

You need to configure rate-limiting parameters for troubleshooting.

Solution

Hookdeck’s rate-limiting controls allow you to manually configure the number of webhooks you would like to receive per second, per minute, or per hour. You can also set the time interval between each webhook sent.

Discussion

When applying a fix, you need to recreate the problem to see that the fix has worked. For webhooks, this involves simulating the rate at which webhook requests were received leading to the error.

To achieve this, you need a rate-limiting component that is manually configurable. These rate-limiting knobs will allow you to set the rate of webhooks received per second to the exact amount you want for your troubleshooting activities.
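The relationship between a "webhooks per period" setting and the gap between individual deliveries is simple arithmetic. The helper below is a hypothetical sketch (the name `delivery_interval` and the period labels are assumptions for this example) showing how one setting converts to the other.

```python
PERIOD_SECONDS = {"second": 1, "minute": 60, "hour": 3600}


def delivery_interval(limit: int, period: str) -> float:
    """Seconds to wait between deliveries when allowing `limit` webhooks per `period`."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    if period not in PERIOD_SECONDS:
        raise ValueError(f"unknown period: {period!r}")
    return PERIOD_SECONDS[period] / limit
```

For example, 30 webhooks per minute corresponds to one delivery every 2 seconds.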

You are dropping webhooks because your server crashed

Problem

You need to pause your webhooks.

Solution

Hookdeck’s connections come with a Pause feature that can be used to temporarily stop the delivery of your webhooks from a source to your destination server. This allows you to prevent the dropping of webhooks while the server outage persists.

Discussion

When a webhook destination server is unavailable, webhooks hitting that server will be automatically dropped. This causes information to be lost and puts the application in an inconsistent state.

To avoid dropping webhooks, you need to stop sending webhooks to your server and hold them in temporary storage until the server is back up. This can be achieved by using a message queue to hold the webhook data. You also need to make sure that the retention period set on the queue is long enough to cover the duration of the outage.
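The hold-and-release behaviour described here can be sketched in a few lines. This is a minimal in-memory illustration, not Hookdeck's Pause feature; the class `PausableDelivery` and its methods are hypothetical names, and a production setup would use a durable queue with a retention period instead of a Python `deque`.

```python
from collections import deque
from typing import Callable, Deque, Dict


class PausableDelivery:
    """Forward events to `deliver`, or hold them in order while paused."""

    def __init__(self, deliver: Callable[[Dict], None]) -> None:
        self.deliver = deliver
        self.paused = False
        self.held: Deque[Dict] = deque()

    def receive(self, event: Dict) -> None:
        if self.paused:
            self.held.append(event)      # keep the event instead of dropping it
        else:
            self.deliver(event)

    def pause(self) -> None:
        self.paused = True

    def resume(self) -> None:
        self.paused = False
        while self.held:                 # flush held events in arrival order
            self.deliver(self.held.popleft())
```

Because held events live only in memory in this sketch, a crash of the queueing process would lose them, which is exactly why the retention and durability of the real queue matter.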

You are dropping webhooks because your server is going through downtime

Problem

You need to pause the delivery of webhooks while your server is down and resume when your server is back up.

Solution

Pause the delivery of your webhooks to your destination server. Unpause when it’s safe to receive webhooks.

Discussion

Server downtime is often required when migrating data, applying updates, or upgrading your server. Any webhook sent to the server during this period will be automatically dropped.

To avoid dropping webhooks, you need to stop sending webhooks to your server and hold them in temporary storage until the server is back up. This can be achieved by using a message queue to hold the webhook data. You also need to make sure that the retention period set on the queue is long enough to cover the downtime.

You are dropping webhooks because you triggered the rate limit of the API

Problem

You need to control the pace of delivery.

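Solution

Set a rate limit on the destination in Hookdeck to cap how many webhooks it receives per second, per minute, or per hour, keeping the pace of delivery below the API’s limit.

Discussion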

If webhook requests from a provider to a consumer exceed the rate that the consuming server can handle, subsequent webhooks will be dropped. Webhooks lost during this period will cause data inconsistencies within the consuming application.

To fix this, you need to throttle the rate at which webhook requests are sent to your application. This is achieved by adding a rate-limiting component between the producer and consumer to adjust the pace of webhook delivery to a frequency that does not exceed the consuming API’s limit.

You are dropping webhooks because your server is down after a spike

Problem

You need to retry all webhooks that failed as a result of the spike.

Solution

Use Filters to search for Events that returned HTTP status code 503, then use the Bulk Retry feature to redeliver the dropped webhooks.

Discussion

When spikes cause your server to shut down, a number of webhooks will have failed before you find a solution. After a fix is applied, you need to find and retry all the webhooks that failed during the downtime.

The recommended solution is to proactively persist your webhooks in a data store and only discard them when you’re sure they have been reconciled with your server. This way, when there is a server downtime, you don’t lose your webhook information. When your server is back up, you can retry the webhooks that failed.
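The persist-then-retry approach can be sketched as follows. This is a generic illustration rather than Hookdeck's Bulk Retry: `StoredEvent`, `retry_failed`, and the `send` callback (assumed to deliver a payload and return the HTTP status code) are hypothetical names, and a list stands in for the data store.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class StoredEvent:
    event_id: str
    payload: Dict
    last_status: Optional[int] = None   # None until a delivery is attempted


def retry_failed(store: List[StoredEvent],
                 send: Callable[[Dict], int],
                 status: int = 503) -> List[str]:
    """Re-send every stored event whose last attempt returned `status`.

    Returns the ids of the events that succeeded (2xx) on retry.
    """
    recovered = []
    for event in store:
        if event.last_status == status:
            event.last_status = send(event.payload)
            if 200 <= event.last_status < 300:
                recovered.append(event.event_id)
    return recovered
```

Events are kept in the store after a successful retry here; discarding them is a separate step to take once you are sure they have been reconciled with your server.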

You are dropping webhooks because your server returns errors after a bad release

Problem

You need to identify dropped webhooks and replay them.

Solution

Hookdeck Issues categorizes failures by connection and status code. You can then browse an issue to identify all the impacted events. Once the problem is fixed, you can retry all the affected webhooks with Bulk Retry and mark the issue as resolved.

Discussion

When webhooks fail due to server errors, you need an audit trail of the webhook transactions to figure out why your webhooks are failing.

The standard way of tracing a webhook’s activities is by logging at different points of the webhook’s journey from source to destination. Because the errors are coming from your server, the server’s logs are the source of truth for determining what went wrong with your webhooks.

Once the bug is detected and a fix is applied, you then need to retry all the webhooks that failed as a result of the bug. One of the ways to achieve this is by storing failed webhook requests in dead-letter queues (of a message broker) and retrieving them to be retried once the issue is fixed.
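A dead-letter queue also makes it easy to replay a single failed webhook first to confirm the fix, and then replay the rest. The sketch below is an in-memory stand-in for a message broker's dead-letter queue; the class `DeadLetterQueue`, its `replay` method, and the `handler` callback (assumed to return `True` on successful processing) are assumptions made for this example.

```python
from collections import deque
from typing import Callable, Deque, Dict, Optional


class DeadLetterQueue:
    """Hold failed webhook events until they can be replayed."""

    def __init__(self) -> None:
        self.failed: Deque[Dict] = deque()

    def add(self, event: Dict) -> None:
        self.failed.append(event)

    def replay(self, handler: Callable[[Dict], bool],
               limit: Optional[int] = None) -> int:
        """Replay up to `limit` events (all of them if limit is None).

        Events whose handler returns False are put back on the queue.
        Returns the number of events replayed successfully.
        """
        attempts = len(self.failed) if limit is None else min(limit, len(self.failed))
        replayed = 0
        for _ in range(attempts):
            event = self.failed.popleft()
            if handler(event):
                replayed += 1
            else:
                self.failed.append(event)   # still failing: keep for later
        return replayed
```

Calling `replay(handler, limit=1)` tests the fix against one event; if it returns 1, calling `replay(handler)` processes the remainder.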

Conclusion

Hookdeck’s feature set is built around making webhooks reliable and resilient to the pressures encountered in production environments. Its error recovery tools allow you to track webhook errors and receive a notification when one occurs, debug the error to find its cause, and manage how you want to recover from it.