What Causes Webhook Downtime and How to Handle It

All web applications, including APIs that consume webhooks, run in a host environment that is responsible for serving the application. Application downtime is a state in which an application is not available to serve requests. It is always undesirable, because it can severely degrade the user experience and often leads to lost revenue. The responsibility is therefore on application engineers to design the system to avoid server downtime, and to limit its effects, as much as possible.

In this article, we look at the different causes of server downtime and discuss recommended ways of handling each scenario.

What scenarios cause downtime for webhooks?

Traffic spikes

The capacity of a server environment depends on the computing resources available to it. Memory and disk space are examples of finite resources. The more connections a server has to open and the more requests it has to process concurrently, the more memory and other system resources it consumes.

Therefore, if your webhook consumer receives more webhooks than its server has the capacity to handle, the server can become overloaded and crash or stop accepting requests. During this downtime, webhooks sent to your endpoint will fail, typically with a 503 (Service Unavailable) error or a timeout.

Until the server is back up, it won't be able to process any webhooks, and a lot of important data can be lost within that period.

To learn how to resolve traffic spikes, check out this article.

Server migration

Sometimes you need to migrate your application to a different server, for example to increase capacity. During the migration, your service will experience downtime until you're fully set up on the new server and have successfully redirected your traffic to the new environment.

This downtime period will affect your webhooks, as you most likely will not have the luxury of pausing your webhooks or getting your users to stop using your application during the migration.

Data integrity fixes

Data integrity issues occur when mishandling of webhook processing leaves your application data in an inconsistent state. When this is noticed, you may need to lock writes to your database until you are able to restore it to a consistent state.

During this lock period, your webhooks will not be able to perform operations that modify data in the database, and they will time out after a while.

Sometimes servers are intentionally shut down to fix issues relating to data compromise. During this period, webhooks will be dropped, which can lead to more data-related issues.

Server update downtimes

Software running on application servers often requires updates. Unless some core functionality depends on an older version of the server software, it is highly recommended to keep that software up to date.

However, this update operation can sometimes hinder the smooth running of the application or require that the application go down while the update is in progress. Due to this, webhook requests targeting your webhook URLs will fail during the update.

Authentication failure

For security purposes, webhook providers often require an authentication step before the information in a webhook can be consumed. If the webhook consumer (i.e., your server) fails to meet the authentication requirements, webhooks will be dropped with a 401 (Unauthorized) error.
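
One common authentication scheme, used here purely as an illustration, has the provider sign each request body with a shared secret and send the signature alongside the request. The consumer recomputes the signature and answers with a 401 if the two do not match. The sketch below assumes an HMAC-SHA256 signature encoded as hex. The function names and the secret are hypothetical, and your provider's documentation defines the actual scheme.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Return True if signature_hex is the HMAC-SHA256 of body under secret."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, so it does not leak how many
    # leading characters of the signature were correct.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


def status_for(secret: bytes, body: bytes, signature_hex: str) -> int:
    """Map the verification result to an HTTP status code."""
    return 200 if verify_signature(secret, body, signature_hex) else 401
```

With a scheme like this, a stale or misconfigured secret on the consumer makes every delivery fail the check, which is how an authentication problem ends up looking like an outage.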

Your server may not be down, but this will have the same effect on your webhook operations as server downtime would.

How to manage webhook downtime scenarios for reliability

The main issue with server downtime is that webhooks continue to be sent to your endpoint while your server cannot process them. This leads to information being lost, and if there is no way of getting the webhooks again, that information may be lost forever.

The recommended solution is to ingest and buffer your webhooks during the downtime period. The component that makes this possible is a message queue.

A message queue sits between your webhook provider and your webhook consumer and decouples the two. Because providers no longer depend directly on consumers, the system becomes more robust and fault-tolerant.

Because the consumer's availability no longer affects the provider, you can stop webhook processing at any time. This lets you perform maintenance, fix issues, install updates, and run deployments on your servers whenever you need to.

In any scenario where your server goes down, a message queue holds all the webhooks fired during that period inside its internal queue. When your server is back up, your webhooks are served for processing from the message queue. This is why it is important for your message queue component to include a retry system that resends webhooks after they fail.
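
The buffering and retry behavior described here can be sketched with an in-memory stand-in for a message queue. This is a minimal illustration under stated assumptions, not how any particular broker works: the `WebhookBuffer` class, its `max_attempts` parameter, and the `deliver` callback (which returns True on success) are all hypothetical names introduced for the example.

```python
from collections import deque
from typing import Callable, Deque, List


class WebhookBuffer:
    """In-memory stand-in for a queue between provider and consumer."""

    def __init__(self, max_attempts: int = 5) -> None:
        self.pending: Deque[dict] = deque()
        self.dead: List[dict] = []  # webhooks that used up all their attempts
        self.max_attempts = max_attempts

    def ingest(self, payload: dict) -> None:
        # Always accept the webhook, whether or not the consumer is up.
        self.pending.append({"payload": payload, "attempts": 0})

    def drain(self, deliver: Callable[[dict], bool]) -> int:
        """Try each currently queued webhook once; return how many succeeded."""
        delivered = 0
        for _ in range(len(self.pending)):
            item = self.pending.popleft()
            item["attempts"] += 1
            if deliver(item["payload"]):
                delivered += 1
            elif item["attempts"] < self.max_attempts:
                self.pending.append(item)  # keep it for a later retry
            else:
                self.dead.append(item)  # give up and set aside for inspection
        return delivered
```

Calling `drain` with a delivery function that returns False (the consumer is down) leaves every webhook queued. Calling it again once delivery succeeds hands the webhooks over in their original order. A real broker would also persist messages and space out retries, which this in-memory sketch does not do.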

Message queues can be built using open source tools such as RabbitMQ and Apache Kafka.

How Hookdeck helps

Adding a message queue to your infrastructure for webhooks and making it work effectively requires a good amount of expertise in message queue development. Using a service like Hookdeck makes it simple. It is built with features to handle the planned and unplanned downtime scenarios described above, allowing you to ingest, queue, and retry webhooks, and ultimately avoid the effects of server downtimes.

Feature	Description	How it Helps with Downtime
Rate limiting	Set a rate limit by destination to limit the pace at which webhooks can be delivered.	Avoid downtime when webhook traffic spikes. Only receive webhooks at a pace your server can handle.
Alert	Receive alerts as soon as there are failures.	Allows you to take action quickly to mitigate the issues.
Pause	Hookdeck pauses the delivery of webhooks to your destination by queuing them, and resumes delivery of queued up webhooks when unpaused.	When you detect a downtime (or plan one) you can control the stream of events.
Reporting Dashboard	Dashboard with all your incoming webhooks. Includes status, header, and body information.	Monitor the status of your webhooks to identify and track failed webhooks.
Webhook Retries	Retry (resend) webhooks from Hookdeck to your destination.	When a downtime causes a lot of webhooks to fail, you can use manual and bulk retries to replay webhooks after fixing the issue.
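
The rate limiting row describes delivering webhooks at a pace the destination can handle. As a generic illustration of that idea (not Hookdeck's implementation), a token bucket is one common way to cap a delivery rate. The class name, its parameters, and the explicit `now` argument (seconds, passed in so the example stays deterministic) are assumptions made for this sketch.

```python
class TokenBucket:
    """Allow at most `rate` deliveries per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill in proportion to the time elapsed, never above capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0  # spend one token for this delivery
            return True
        return False  # over the limit: keep the webhook queued for now
```

A delivery loop would call `allow` before each send and leave the webhook in the queue whenever it returns False.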

For a step-by-step guide on how to use Hookdeck to resolve downtime issues, check out our Problems/Solutions series.

Conclusion

Server downtime is sometimes inevitable. During a down period, you want to make sure you're ingesting and queueing your webhooks to be processed later; this strategy is known as asynchronous processing. This way, you can be at peace knowing that in the event of server downtime, your application will not miss any webhooks and all requests will (eventually) be served.
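
A minimal sketch of asynchronous processing, assuming an in-process queue and a single worker thread: the ingest step stores the webhook and acknowledges it immediately, and a separate worker does the actual processing whenever it is running. The names `receive_webhook`, `worker`, and `processed` are hypothetical, and a production setup would use a durable queue rather than process memory.

```python
import queue
import threading

events: queue.Queue = queue.Queue()
processed = []  # ids handled so far; stands in for real business logic


def receive_webhook(payload: dict) -> int:
    """Ingest step: store the webhook and acknowledge right away."""
    events.put(payload)
    return 202  # Accepted: the work itself happens later


def worker() -> None:
    """Processing step: runs independently of ingestion."""
    while True:
        payload = events.get()
        if payload is None:  # sentinel used to stop the worker
            events.task_done()
            break
        processed.append(payload["id"])
        events.task_done()


# Webhooks are accepted even though no worker is running yet.
receive_webhook({"id": 1})
receive_webhook({"id": 2})

# Once the worker starts, it catches up on everything that was queued.
t = threading.Thread(target=worker, daemon=True)
t.start()
events.join()  # block until every queued webhook has been processed
```

Because ingestion never waits on processing, the worker can be stopped for maintenance and restarted later without the sender seeing a failure.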
