What to Monitor in a Webhook Infrastructure
In the previous article in this series, we walked through our webhook solution to examine each component, its function, and different options for deploying it. In this article, we zoom into each component to determine how we can measure and monitor its health, availability, and performance.
Each component in our webhook solution is in itself an independent entity with different moving parts. In production, it is important to be able to track how these components’ resources are used, and generally monitor their health and performance.
Information derived from this activity serves as a diagnostic aid: it helps you detect and correct issues, spot potential problems, and take action before those problems occur.
Monitoring is also essential to ensure that you’re meeting the performance targets for the webhook solution.
^Application monitoring scenarios for webhooks
When it comes to monitoring, there are tons of data that can be collected from individual components and the system as a whole. Knowing the right type of information to collect, where to get the information, relationships between the data collected, and how much of the information to collect is key to building an efficient monitoring setup.
Collecting too much information, or the wrong information, can produce confusing and overwhelming insights, making it impossible to make sense of the data being collected.
To avoid this anti-pattern, we start by scoping the scenarios we want to monitor and diagnose. Scoping these scenarios can be done by identifying the reasons why we want to monitor the webhook solution.
The core 3
1) Ensure that the system remains healthy (it is up all the time).
2) Track the availability of the entire system and all its individual components down to instances within a cluster (e.g. broker instances in a message broker cluster).
3) Maintain performance to ensure that the throughput of webhook processing does not degrade as the volume of webhooks increases (the response-time target for webhook producers is not exceeded, webhook processing does not time out, etc.).
In this article, we will focus on these 3 scenarios and how they relate to each of the 3 types of monitoring (health, availability, and performance) because each item on this list is core to ensuring that the webhook solution keeps running.
Keep in mind that this list is not exhaustive, and you can keep adding scenarios to it based on your monitoring requirements. For example:
- Daily monitoring of system usage and detection of patterns that might lead to problems if not addressed.
- Issue tracking, analysis of possible causes, and rectification of webhook failures.
- Ensuring that the webhook solution meets any service-level agreements established with users.
In the next section, I break down the correlation between each of the first 3 core scenarios and the type of monitoring it describes. This will lead us to the different types of monitoring and the purpose for implementing each of them, followed by a review of exactly which metrics you should be monitoring.
^Types of webhook monitoring
The table below shows the core 3 application scenarios listed in the section above and the type of monitoring they describe.
|Scenario|Type of monitoring
|Ensure that the system remains healthy (it is up all the time).|Health monitoring
|Track the availability of the entire system and all its individual components down to instances within a cluster (e.g. broker instances in a message broker cluster).|Availability monitoring
|Maintain performance to ensure that the throughput of webhook processing does not degrade as the volume of webhooks increases.|Performance monitoring
The aim of health monitoring is to provide instant feedback on the health status of the system.
This status is often represented using the traffic light system with the color green indicating good health, orange for partial availability, and red for total system downtime. A system is healthy when it is running and capable of processing requests.
Health information can be viewed for the overall system or per component. This information is often supplied either through health check endpoints exposed by each component or a metrics collector.
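As a minimal sketch of the aggregation a health check endpoint might perform, the function below maps per-component health checks onto the traffic light status described above. The component names and the aggregation rule are illustrative assumptions, not part of the solution's actual API:

```python
from typing import Dict


def overall_status(components: Dict[str, bool]) -> str:
    """Map per-component health checks onto the traffic light system:
    green = all components healthy, orange = partial availability,
    red = total system downtime."""
    healthy = sum(1 for up in components.values() if up)
    if healthy == len(components):
        return "green"
    if healthy == 0:
        return "red"
    return "orange"
```

For example, `overall_status({"gateway": True, "broker": True, "consumer": False})` reports partial availability ("orange") because one component is down.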
Check out our article on health monitoring for more details.
Availability monitoring is closely related to health monitoring. However, while health monitoring seeks to provide an immediate, real-time view of the health of the system, availability monitoring is concerned with keeping track of the availability of the system and its components over time.
An availability monitoring system captures availability data that correspond to low-level factors (CPU usage, memory utilization, etc.) and aggregates them to give an overall picture of the system.
Unlike health data, availability data can be queried over a period of time. This helps monitor trends in infrastructure usage and prevent issues that might occur or that have occurred in the past. This is why availability data such as timeouts, network connectivity issues, and connection retry attempts are timestamped.
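Because availability samples are timestamped, they can be queried over a window to compute an availability percentage. A sketch of that query, assuming samples are stored as simple (timestamp, up/down) pairs (a real system would pull them from a metrics store):

```python
from datetime import datetime, timedelta


def availability_pct(samples, window_start, window_end):
    """Percentage of timestamped (timestamp, is_up) samples within
    [window_start, window_end] that reported the component as available.
    Returns None when the window contains no samples."""
    in_window = [up for ts, up in samples if window_start <= ts <= window_end]
    if not in_window:
        return None
    return 100.0 * sum(in_window) / len(in_window)
```

Querying different windows over the same data is what lets you spot trends, e.g. availability degrading week over week.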
Check out our article on availability monitoring for a webhook infrastructure for more details.
In production environments, it is not uncommon to witness a gradual surge in the volume of webhooks that our infrastructure needs to process. As this volume increases, the number of concurrent webhooks to process and the volume of data moving through our infrastructure grow, increasing the likelihood of one or more components failing.
These failures are often preceded by a dip in the overall performance of the webhook infrastructure. To track system performance and avoid such degradation, you need to set Key Performance Indicators (KPIs) that serve as a benchmark for the webhook infrastructure's expected throughput.
With your KPIs defined, you can then load test your infrastructure and fix any bottlenecks, add more resources, and/or tweak configurations until you meet your KPIs.
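A sketch of how load-test results might be checked against KPIs. The specific KPIs (95th-percentile latency and sustained throughput) and the nearest-rank percentile method are illustrative choices, not prescriptions:

```python
import math


def p95_ms(latencies_ms):
    """95th-percentile latency (nearest-rank method) from load-test samples."""
    ranked = sorted(latencies_ms)
    return ranked[max(0, math.ceil(0.95 * len(ranked)) - 1)]


def meets_kpis(latencies_ms, observed_throughput_per_s,
               kpi_p95_ms, kpi_throughput_per_s):
    """True when the load test stayed within the latency KPI and
    sustained at least the target throughput."""
    return (p95_ms(latencies_ms) <= kpi_p95_ms
            and observed_throughput_per_s >= kpi_throughput_per_s)
```

Re-running this check after each configuration tweak tells you whether the change moved you closer to, or further from, your targets.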
For more information on performance monitoring for a webhook infrastructure, check out this article.
^Webhook processing metrics: What you should be monitoring
There are tons of metrics you can monitor when measuring performance. Oftentimes, the metrics you focus on depend largely on the performance targets you have set for your webhook processing operations.
In this section, we will take a look at some of the metrics (direct or derived) that you should take into account in order to properly track the performance of your webhook infrastructure.
For more clarity on how you should handle these metrics, I have separated them into two groups:
- Metrics you should watch (these appear in the default text color)
- Metrics you should alert on, especially when they cross a set threshold (these appear highlighted)
Metrics for the API gateway
|Metric|Description
|4xx error count|Can indicate an authentication error or missing required parameters
|5xx error count|Can be timeout errors in your gateway, webhook request transformation errors, errors adding webhooks to the queue, etc.
|Gateway request count|Total number of requests made to your API gateway
|Request latency in API gateway|Amount of time between when your API gateway receives a webhook request and when it responds to the producer. High latencies can be an indication of a bug.
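Several of these gateway metrics can be derived from access-log entries. A sketch, assuming each request is reduced to a (status_code, latency_ms) pair (the log format and field names are assumptions):

```python
from collections import Counter


def gateway_stats(requests):
    """Derive request count, 4xx/5xx error counts, and average latency
    from a list of (status_code, latency_ms) tuples parsed from
    gateway access logs."""
    codes = Counter(status // 100 for status, _ in requests)
    latencies = [ms for _, ms in requests]
    return {
        "request_count": len(requests),
        "4xx_count": codes.get(4, 0),
        "5xx_count": codes.get(5, 0),
        "avg_latency_ms": sum(latencies) / len(latencies) if latencies else 0.0,
    }
```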
Metrics for the message broker
|Metric|Description
|Unroutable messages|Count of messages not routed to a queue
|File descriptors used|Count of file descriptors used by broker processes
|File descriptors used as sockets|Count of file descriptors used as network sockets by broker processes
|Disk space used|Bytes of disk used by a broker node
|Memory used|Bytes in RAM used by a broker node (categorized by use)
|Consumer utilisation|Proportion of time that the queue can deliver messages to consumers
|Number of consumers|Count of consumers for a given queue
|Messages published in|Messages published to an exchange/queue (as a count and a rate per second) - throughput in
|Messages published out|Messages that have left an exchange/queue (as a count and a rate per second) - throughput out
|Network I/O|Number of octets sent/received within a TCP connection per second
|Queue depth|Count of all messages in the queue
|Unacknowledged messages|Count of messages a queue has delivered without receiving acknowledgment from a consumer
|Messages ready|Count of messages available to the consumer
|Message rates|Messages that move in or out of a queue per second, whether unacknowledged, delivered, acknowledged, or redelivered
|Message bytes in RAM|Sum in bytes of messages stored in memory
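Assuming the broker is RabbitMQ, several of the queue metrics above can be read from the JSON its management HTTP API returns for a queue (`GET /api/queues/<vhost>/<name>`). The sketch below extracts them from one such payload; field names will differ for other brokers:

```python
def queue_metrics(payload: dict) -> dict:
    """Extract per-queue metrics from one queue's JSON payload, assuming
    the field names used by RabbitMQ's management HTTP API."""
    stats = payload.get("message_stats", {})
    return {
        "depth": payload.get("messages", 0),        # all messages in the queue
        "ready": payload.get("messages_ready", 0),  # available to consumers
        "unacked": payload.get("messages_unacknowledged", 0),
        "consumers": payload.get("consumers", 0),
        "consumer_utilisation": payload.get("consumer_utilisation", 0.0),
        "publish_rate": stats.get("publish_details", {}).get("rate", 0.0),
    }
```

A scraper would fetch the payload on a schedule and push these values into your metrics store with a timestamp.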
Metrics for the webhook consumers
|Metric|Description
|Requests per minute (or per second)|Number of webhook requests hitting the consumer per minute (or second)
|Errors per minute|Total number of exceptions generated by the consumer within a minute
|Average and maximum latency|Average and maximum time it takes the consumer to return a response after receiving a webhook
|API usage growth|A measure of the increase in requests hitting the webhook consumer
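These consumer metrics are easy to derive from raw observations. A sketch, assuming you have per-request latencies and an error count for a known observation window (the function and its inputs are illustrative):

```python
def consumer_metrics(latencies_ms, error_count, minutes):
    """Derive consumer metrics from raw observations: per-request
    latencies (ms), total exception count, and the length of the
    observation window in minutes."""
    return {
        "requests_per_minute": len(latencies_ms) / minutes,
        "errors_per_minute": error_count / minutes,
        "avg_latency_ms": sum(latencies_ms) / len(latencies_ms),
        "max_latency_ms": max(latencies_ms),
    }
```

Comparing `requests_per_minute` across successive windows gives you the API usage growth figure.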
It is often emphasized as a best practice that a webhook solution should be thoroughly tested before it’s deployed to production. We know this because we have also mentioned it numerous times.
However, we have realized that you cannot test everything, and sometimes you fall into the delusion that writing enough tests will catch all bugs. This is sadly untrue. Testing ahead of deployment also requires that you predict issues without the data to back up your predictions, which can lead to spending time on issues that are not of the utmost importance.
What is important is ensuring that you write just enough tests to confirm your assertions about your design, then have monitoring tools in place that help you catch bugs in production.
These monitoring tools serve as a great feedback mechanism for issues in your infrastructure design and often inform you about the problems you should focus more of your time and resources on.
This way, you’re spending your valuable time fixing the most important flaws in your webhook solution.
In the next article in this series, we will be taking a look at different technological stack options that can be used to deploy our webhook infrastructure.