What Are the Implementation Considerations for Message Queues When Processing Webhooks

If you’re processing a large number of webhooks, it is not uncommon to experience sudden spikes in webhook traffic. These spikes can quickly exhaust server resources, which can cause the webhook processing server to shut down.

Another issue that can arise from traffic spikes is an increase in response time for each webhook. Almost all major webhook producers (Shopify, Stripe, GitHub) set a hard limit on the response time for each webhook. If this response time is exceeded, the producer will assume that the webhook has failed.

To avoid these types of issues, it is recommended that high-volume webhook processing be done asynchronously. One strategy for processing webhooks asynchronously is to use a message broker.

In this article, we take a look at how asynchronous processing is done with a message broker with a focus on the implementation and performance considerations for deploying this component in production environments.

^Why asynchronously process webhooks with a message broker?

Before we begin, let’s get familiar with what a message broker is and how it operates.

The main function of a message broker is to buffer and distribute messages from message producers to message consumers. Message routing to consumers is done by using routing rules defined when the message is sent to the broker. The broker uses these rules to place the message in a queue where the message is buffered before it is later picked up by one or more consumers subscribed to the queue.

There are some variations in how different message brokers operate, with some brokers using the concept of topics to define a publish/subscribe system. However, the concept is much the same: buffer messages sent by producers to be picked up by one or more consumers.

Consumers pick up messages at the rate they can handle, thus reducing the likelihood of a consumer being overloaded with requests and shutting down.
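The buffering behavior described above can be sketched with Python's standard library. This is only an in-memory stand-in for a broker queue (the names publish and consume_batch are hypothetical, not part of any broker's API); it shows a spike of messages being absorbed while the consumer takes only what it can handle per cycle.

```python
import queue
from typing import List

# In-memory stand-in for a broker queue (an assumption for illustration only).
broker_queue: "queue.Queue[str]" = queue.Queue()


def publish(message: str) -> None:
    """Producer side: enqueue the message and return immediately."""
    broker_queue.put(message)


def consume_batch(max_messages: int) -> List[str]:
    """Consumer side: pull at most max_messages, leaving the rest buffered."""
    batch: List[str] = []
    while len(batch) < max_messages:
        try:
            batch.append(broker_queue.get_nowait())
        except queue.Empty:
            break  # nothing left to pick up right now
    return batch


# A spike of 10 webhooks arrives at once...
for i in range(10):
    publish(f"webhook-{i}")

# ...but the consumer only takes what it can handle in one cycle.
first_batch = consume_batch(3)
still_buffered = broker_queue.qsize()
```

The remaining messages stay in the queue until the consumer asks for more, which is what protects the consumer from being overloaded.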

Most message brokers do not accept webhook HTTP requests directly; they expect messages to arrive over their own messaging protocols. Therefore, to use a message broker to ingest webhooks, you need to proxy the webhook requests through an API gateway.

The role of the API gateway is to act as protocol-conversion middleware, transforming webhook HTTP requests into a format that the message broker supports (e.g. AMQP or STOMP). Once translated, the message is added to the message broker (which is a very fast operation), and the producer is sent a response as soon as the broker confirms that the message has been successfully added.
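A minimal sketch of that gateway step follows, assuming JSON webhook bodies. The handle_webhook function, the envelope layout, and the publish callable are all hypothetical; in a real deployment publish would wrap your broker client and return only after the broker has confirmed the message.

```python
import json
import queue
from typing import Callable, Dict, Tuple


def handle_webhook(
    body: bytes,
    headers: Dict[str, str],
    publish: Callable[[bytes], bool],
) -> Tuple[int, str]:
    """Translate one HTTP webhook into a broker message.

    Returns (status_code, reason). `publish` is a hypothetical callable that
    returns True once the broker has confirmed the message.
    """
    try:
        payload = json.loads(body)
    except ValueError:
        return 400, "invalid JSON"

    # Wrap headers and payload in one envelope so consumers keep the context.
    envelope = json.dumps({"headers": headers, "payload": payload}).encode("utf-8")

    if publish(envelope):
        # Respond right away; the real processing happens later in a consumer.
        return 202, "accepted"
    return 503, "broker unavailable"


# In-memory stand-in for the broker (an assumption for illustration only).
fake_broker: "queue.Queue[bytes]" = queue.Queue()


def publish_to_fake_broker(message: bytes) -> bool:
    fake_broker.put(message)
    return True


status, reason = handle_webhook(
    b'{"event": "order.created"}',
    {"X-Request-Id": "abc"},
    publish_to_fake_broker,
)
```

The sketch returns 202 on success, but the status code is a design choice: check which success codes your webhook producer expects.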

Because producers send webhooks using the secure version of HTTP (HTTPS), gateways are often tasked with the responsibility of TLS termination.

Now let’s take a look at some of the important factors you need to take into consideration when deploying your API gateway and message broker to ensure reliability and failure recovery.

^What are the implementation considerations for the API gateway?

Scalability and performance considerations

As stated earlier, producers often enforce a hard limit on the response time for each webhook. This response time limit means that the API gateway has a request timeout limit it mustn't exceed. Adding messages to a message broker is a pretty fast process, but while you might not have any worries there, other activities (decryption, setting a trace id for monitoring, etc.) being performed by the gateway must also be factored into the response time.

Also, every task the gateway performs adds to the overall latency of the process, and that cost becomes more noticeable as webhook traffic grows.

Imagine that you are serving responses in 3 seconds, within a 5-second limit, for 1,000 concurrent webhooks. If the number of concurrent webhooks suddenly doubles and response time grows roughly in proportion to load, responses will take about 6 seconds and exceed the limit. To handle the new load, you will need to scale up your API gateway.
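The arithmetic behind this example can be written out as a short sketch. It assumes response time grows linearly with concurrent load, which is a simplification (real systems often degrade faster as they approach saturation), and the function names are hypothetical.

```python
def estimated_response_time(
    baseline_seconds: float, baseline_load: int, new_load: int
) -> float:
    """Estimate response time at new_load, assuming linear scaling with load."""
    return baseline_seconds * new_load / baseline_load


def max_load_within_limit(
    baseline_seconds: float, baseline_load: int, limit_seconds: float
) -> int:
    """Largest concurrent load that stays within the limit under the same assumption."""
    return int(baseline_load * limit_seconds / baseline_seconds)


# Numbers from the example: 3 s at 1000 concurrent webhooks, 5 s limit.
doubled = estimated_response_time(3.0, 1000, 2000)
ceiling = max_load_within_limit(3.0, 1000, 5.0)
exceeds_limit = doubled > 5.0
```

Under this model a single gateway stops meeting the limit well before traffic doubles, which is why scaling needs to happen ahead of the spike rather than after it.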

You can choose to add more processing power and memory to your gateway server, but there is always a performance ceiling with that approach.

The recommended way is to horizontally scale your API gateway by deploying multiple instances of it into a worker pool and putting a load balancer in front of them.

The load balancer will help distribute the webhook traffic across the pool of gateways and each instance can quickly process and add the webhook to the message broker, resulting in faster response times.
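As an illustration of how a load balancer spreads traffic across the pool, here is a minimal round-robin sketch. Round-robin is only one possible strategy, and the class and instance names are hypothetical; production load balancers typically add health checks and weighting on top of this.

```python
import itertools
from collections import Counter
from typing import Iterable


class RoundRobinBalancer:
    """Hand out gateway instances in a fixed rotating order."""

    def __init__(self, instances: Iterable[str]) -> None:
        pool = list(instances)
        if not pool:
            raise ValueError("at least one gateway instance is required")
        self._cycle = itertools.cycle(pool)

    def next_instance(self) -> str:
        return next(self._cycle)


balancer = RoundRobinBalancer(["gateway-1", "gateway-2", "gateway-3"])

# Route nine incoming webhooks and count how many each instance received.
assignments = Counter(balancer.next_instance() for _ in range(9))
```

Each instance ends up with an equal share of the requests, so no single gateway has to absorb the whole spike.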

Managed services like Amazon API Gateway often offer scalability out of the box, so you might not need to worry about scaling up as traffic increases.

^What are the implementation and scalability considerations for the message broker?

Scalability

Message brokers often have built-in functionality for horizontal scaling, but there are still some limitations. One of the major limitations is the total throughput of a single queue, since messages passing through the queue need to be delivered to all connected subscribers.

While brokers like RabbitMQ and ActiveMQ can easily process tens of thousands of messages per second, if you plan on processing hundreds of thousands (or millions) of messages per second, you may need to add custom sharding mechanisms into your application to spread the load amongst multiple broker instances.

One of the solutions to this problem is Apache Kafka’s concept of topic partitions. Topic partitions in Kafka can be used to split the work of storing messages, writing new messages, and processing existing messages among many nodes in a broker cluster.
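The idea behind partitioning can be sketched as a function that maps a message key to one of N partitions, so that messages with the same key always land on the same partition. This is an illustration only: the partition_for function is hypothetical, and Kafka's own default partitioner uses a different hash function than the SHA-256 stand-in below.

```python
import hashlib
from collections import defaultdict
from typing import DefaultDict, List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index using a stable hash."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Group webhooks by a hypothetical account key across 6 partitions.
partitions: DefaultDict[int, List[str]] = defaultdict(list)
for account in ["shop-1", "shop-2", "shop-3", "shop-1", "shop-2"]:
    partitions[partition_for(account, 6)].append(account)
```

Because the hash is stable, all webhooks for one account go to one partition, while different accounts are spread across the cluster.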

Message recovery and durability

One of the operational requirements for webhook processing in production environments is failure recovery. In the context of message brokers, failure can come from a consumer suddenly disconnecting from the broker or rejecting a message due to an error in processing. We need to make sure that these messages are recycled back into the processing pipeline.

These days, most brokers support the concept of dead letter queues. A dead letter queue holds rejected messages. This queue can then be subscribed to in order to inspect issues, trigger alerts, and perform corrective actions. It is common in webhook architectures to see a retry component that makes use of this queue to ensure that failed messages are reconciled.

To ensure that you’re properly handling failure situations, it is recommended that consumers explicitly acknowledge webhook messages consumed from the broker after successfully processing them. If an error occurs in the process, a consumer should reject the message so that it can be re-queued or sent to the dead letter queue.
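The acknowledge, reject, and dead-letter flow can be sketched with in-memory queues. Everything here is a stand-in: the drain function, the MAX_ATTEMPTS retry budget, and the attempt counter carried with each message are assumptions for illustration, not the behavior of any specific broker.

```python
import queue
from typing import Callable, List, Tuple

MAX_ATTEMPTS = 3  # hypothetical retry budget before dead-lettering


def drain(
    work: "queue.Queue[Tuple[str, int]]",
    dead_letters: "queue.Queue[str]",
    process: Callable[[str], None],
) -> List[str]:
    """Process every message; return the ones that were acknowledged."""
    acked: List[str] = []
    while not work.empty():
        message, attempts = work.get()
        try:
            process(message)
        except Exception:
            attempts += 1
            if attempts >= MAX_ATTEMPTS:
                dead_letters.put(message)  # reject: park it for inspection
            else:
                work.put((message, attempts))  # reject: re-queue for a retry
            continue
        acked.append(message)  # acknowledge only after successful processing
    return acked


def process_webhook(message: str) -> None:
    if message.startswith("bad"):
        raise ValueError("cannot process " + message)


work_queue: "queue.Queue[Tuple[str, int]]" = queue.Queue()
dead_letter_queue: "queue.Queue[str]" = queue.Queue()
for name in ["ok-1", "bad", "ok-2"]:
    work_queue.put((name, 0))

acknowledged = drain(work_queue, dead_letter_queue, process_webhook)
```

The failing message is retried until its budget runs out and then moves to the dead letter queue, where a retry or alerting component can pick it up.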

Messages also have to be explicitly configured to survive broker failures. In RabbitMQ, for example, durability and persistence are set in two different places: a queue is declared durable (by setting its durable parameter to true) so that the queue itself survives a broker restart, and each message is published as persistent (delivery mode 2) so that it is written to disk. A message is only recovered after a restart or crash if it is persistent and sits in a durable queue.

Performance

To ensure that your broker is performant, you need to be collecting and monitoring usage metrics. These metrics provide actionable data that can help you refine your broker configuration for increased efficiency. Below are some metrics and message broker features to take into account when deciding on which message broker to use:

Number of messages published per second
Average message size
Number of messages consumed per second
Number of concurrent publishers
Number of concurrent consumers
Support for persistent messages
Support for message acknowledgment
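Several of the metrics above can be derived from simple counters sampled over a time window. The sketch below assumes you can read such counters from your broker's monitoring interface; the BrokerSample fields and the summarize function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class BrokerSample:
    """Counters collected over one measurement window (hypothetical shape)."""

    seconds: float
    published: int
    consumed: int
    bytes_published: int


def summarize(sample: BrokerSample) -> Dict[str, float]:
    """Turn raw counters into per-second rates and an average message size."""
    if sample.seconds <= 0:
        raise ValueError("window length must be positive")
    publish_rate = sample.published / sample.seconds
    consume_rate = sample.consumed / sample.seconds
    avg_bytes = sample.bytes_published / sample.published if sample.published else 0.0
    return {
        "publish_rate": publish_rate,
        "consume_rate": consume_rate,
        "avg_message_bytes": avg_bytes,
        # Positive growth means consumers are falling behind publishers.
        "backlog_growth_rate": publish_rate - consume_rate,
    }


summary = summarize(
    BrokerSample(seconds=60, published=12000, consumed=9000, bytes_published=24_000_000)
)
```

A backlog growth rate that stays positive is a signal to add consumers or revisit the broker configuration.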

To learn more about measuring and analyzing a message broker’s performance, check out our article on performance metrics for a webhooks solution.

^Conclusion

In this article, we have taken a look at how a message broker can be used to asynchronously process webhooks to ensure that traffic spikes or webhook volume increases do not cause the webhook processing server to shut down. We also dove deeply into implementation and performance considerations for the message broker and the API gateway required to proxy webhook messages to the broker.

This article is a break-out article from our comprehensive series on building a production standard webhook solution. If you want a top-to-bottom analysis, including more in-depth details on each aspect of the webhook solution, you can start the series here.
