Introduction to Health Monitoring for Webhooks
Uptime is one of the key metrics that infrastructure administrators always need to keep tabs on. An issue within a service or component in a distributed system can quickly cascade to other services and components, and can eventually degrade the overall health of the system. This is why it is important for administrators to be able to query the health status of the system at any point in time in order to detect, diagnose, and rectify health issues.
In this article, we take a quick look at how developers can monitor the health of their webhook infrastructure. Note that this article is a branch of our series “Building a Standard Infrastructure Solution for Managing Webhooks.”
If you’re only interested in health monitoring, you can start reading this article right away. However, if you need more context on what a standard webhook solution looks like as well as details about its implementation, I recommend that you read the whole series.
What is the purpose of monitoring the health status of a webhook solution?
A webhook solution is considered healthy if it continues to run and is capable of processing webhooks. This health requirement also applies to each component that makes up the webhook solution.
The purpose of monitoring the health of a webhook solution is to be able to generate a snapshot at any given time that shows the current health status of the system. Having this information helps us verify that all components of the system are functioning as expected.
Oftentimes, the health status is indicated using a traffic light system: green for good health, orange for partial availability, and red for downtime.
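As a minimal sketch of how a traffic light snapshot could be derived, the example below rolls per-component statuses up into one system-level status. The component names and the roll-up rule (all green means green, all red means red, any mix means orange) are assumptions made for illustration; your own solution may treat some components as more critical than others.

```python
from enum import Enum


class Health(Enum):
    GREEN = "green"    # good health
    ORANGE = "orange"  # partial availability
    RED = "red"        # downtime


def overall_health(components: dict[str, Health]) -> Health:
    """Roll component statuses up into a single system-level status."""
    statuses = set(components.values())
    # Assumption: with no reports at all, nothing is confirmed healthy.
    if not statuses or statuses == {Health.RED}:
        return Health.RED
    if statuses == {Health.GREEN}:
        return Health.GREEN
    # Any mix of statuses is reported as partial availability.
    return Health.ORANGE


# Hypothetical component names, used only for this example.
snapshot = {
    "api_gateway": Health.GREEN,
    "message_broker": Health.GREEN,
    "retry_system": Health.RED,
}
print(overall_health(snapshot).value)  # prints "orange"
```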
What is necessary in a webhook health monitoring setup?
- The ability to see/know the current health status of the webhook solution.
- The ability to ascertain the parts of the solution that are functioning and the ones that are experiencing problems.
- The ability to alert the administrator or trigger a corrective operation whenever a component or sub-component goes down.
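The requirements above could be sketched roughly as follows. `ComponentWatcher` is a hypothetical helper, not part of any library: it remembers the last reported state of each component, lists the ones that are currently failing, and calls a callback once when a component goes from up to down. The callback stands in for whatever alerting or corrective action you actually use.

```python
from typing import Callable


class ComponentWatcher:
    """Tracks component state and fires a callback on an up -> down change."""

    def __init__(self, on_down: Callable[[str], None]) -> None:
        self._on_down = on_down
        self._is_up: dict[str, bool] = {}

    def report(self, component: str, is_up: bool) -> None:
        # Components are assumed to be up until the first report says otherwise.
        was_up = self._is_up.get(component, True)
        self._is_up[component] = is_up
        if was_up and not is_up:
            self._on_down(component)

    def failing(self) -> list[str]:
        """Names of the components whose last report was 'down'."""
        return sorted(name for name, up in self._is_up.items() if not up)


alerts: list[str] = []
watcher = ComponentWatcher(on_down=alerts.append)
watcher.report("ingestion", True)
watcher.report("retry_system", False)
watcher.report("retry_system", False)  # still down, so no duplicate alert
print(alerts)             # prints ['retry_system']
print(watcher.failing())  # prints ['retry_system']
```

Alerting only on the state change, rather than on every failed report, keeps a long outage from flooding the administrator with identical notifications.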
Where can I find the information needed to monitor my webhook solution’s health?
To properly monitor the health of the webhook solution, you need to collect raw data from individual components. This data can be generated from, but is not limited to, the following operations:
Tracing webhook requests
This information can help determine which requests succeeded, which ones failed, and how long it takes to process each webhook.
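One way to gather this kind of trace data is to wrap the webhook handler and record the outcome and duration of every call. The sketch below assumes an in-memory list as the trace store and a hypothetical `handle` function; a real setup would more likely ship traces to a tracing or metrics backend.

```python
import time
from dataclasses import dataclass


@dataclass
class Trace:
    webhook_id: str
    succeeded: bool
    duration_ms: float


def trace_call(webhook_id, handler, payload, traces):
    """Run the handler and record whether it succeeded and how long it took."""
    start = time.perf_counter()
    try:
        handler(payload)
        succeeded = True
    except Exception:
        # The failure is only recorded here; real code may also re-raise.
        succeeded = False
    duration_ms = (time.perf_counter() - start) * 1000
    traces.append(Trace(webhook_id, succeeded, duration_ms))


def summarize(traces):
    """Answer: how many requests, which ones failed, and the average time."""
    total = len(traces)
    failed = [t.webhook_id for t in traces if not t.succeeded]
    avg_ms = sum(t.duration_ms for t in traces) / total if total else 0.0
    return {"total": total, "failed": failed, "avg_ms": avg_ms}


def handle(payload: dict) -> None:
    # Hypothetical handler: rejects payloads without an "event" field.
    if "event" not in payload:
        raise ValueError("missing event")


traces: list[Trace] = []
trace_call("wh_1", handle, {"event": "order.created"}, traces)
trace_call("wh_2", handle, {}, traces)
summary = summarize(traces)
print(summary["total"], summary["failed"])  # prints 2 ['wh_2']
```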
Capturing exceptions, faults, and warnings in component logs
These logs can be drawn from each component and the event logs of any service that the system references (for example when using a cloud provider to provide an API gateway). Log statements can also be embedded in the code of custom components to capture this information (for example the webhook retry system).
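To illustrate embedding log statements in a custom component, here is a minimal sketch of a retry loop that uses Python's standard `logging` module. The function name, the logger name, and the `send` callable are assumptions, and the loop omits the backoff delay a real retry system would normally include.

```python
import logging

logger = logging.getLogger("webhooks.retry")  # hypothetical logger name


def deliver_with_retries(send, payload, max_attempts: int = 3) -> bool:
    """Try to deliver a webhook, logging each failed attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            send(payload)
            logger.info("delivered on attempt %d", attempt)
            return True
        except Exception:
            # exc_info=True attaches the traceback to the log record.
            logger.warning(
                "attempt %d of %d failed", attempt, max_attempts, exc_info=True
            )
    logger.error("giving up after %d attempts", max_attempts)
    return False


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    attempts = {"count": 0}

    def flaky_send(payload):
        # Fails twice, then succeeds, to exercise the warning path.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ConnectionError("destination unavailable")

    print(deliver_with_retries(flaky_send, {"event": "order.created"}))
```

The warning and error records are the raw data a monitoring system can count or alert on.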
Simulating webhook actions for monitoring purposes
This involves simulating the steps performed by a webhook to check if a service/component in the webhook processing pipeline is responding as expected. Collecting data from this activity can help update the health status information of each service/component.
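A simulated check of this kind might look like the sketch below. The `probe` function, the synthetic payload, and the 500 ms latency budget are all assumptions; `send_test_webhook` is a placeholder for whatever actually delivers a request to the component under test and returns its HTTP status code.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    component: str
    healthy: bool
    latency_ms: float
    detail: str


def probe(
    component: str,
    send_test_webhook: Callable[[dict], int],
    max_latency_ms: float = 500.0,
) -> ProbeResult:
    """Send one synthetic webhook and judge the component by its response."""
    # Marking the payload as synthetic lets downstream code skip side effects.
    payload = {"event": "health.probe", "synthetic": True}
    start = time.perf_counter()
    try:
        status = send_test_webhook(payload)
    except Exception as exc:
        latency = (time.perf_counter() - start) * 1000
        return ProbeResult(component, False, latency, f"error: {exc}")
    latency = (time.perf_counter() - start) * 1000
    if not 200 <= status < 300:
        return ProbeResult(component, False, latency, f"unexpected status {status}")
    if latency > max_latency_ms:
        return ProbeResult(component, False, latency, "too slow")
    return ProbeResult(component, True, latency, "ok")


# Stub senders stand in for real HTTP calls in this example.
print(probe("ingestion", lambda payload: 200).detail)   # prints "ok"
print(probe("ingestion", lambda payload: 503).healthy)  # prints False
```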
Collecting ambient statistics within components
Within each component, collecting resource usage information such as CPU utilization and network and/or disk I/O activity can be critical for troubleshooting and for making sure a component does not shut down.
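As a rough sketch, a component could sample a few ambient statistics using only the Python standard library. The function names and the 90% disk threshold are assumptions, and this covers just a small subset of what a dedicated metrics agent would collect (network and disk I/O rates, for instance, are not included).

```python
import os
import shutil
import time


def collect_stats(path: str = ".") -> dict:
    """Sample a few resource statistics for the current host and process."""
    disk = shutil.disk_usage(path)
    stats = {
        "timestamp": time.time(),
        "cpu_count": os.cpu_count(),
        # CPU time consumed by this process so far, in seconds.
        "process_cpu_seconds": time.process_time(),
        "disk_used_pct": round(disk.used / disk.total * 100, 1) if disk.total else 0.0,
    }
    # os.getloadavg is not available on every platform (e.g. Windows).
    if hasattr(os, "getloadavg"):
        try:
            stats["load_avg_1m"] = os.getloadavg()[0]
        except OSError:
            pass
    return stats


def disk_is_low(stats: dict, threshold_pct: float = 90.0) -> bool:
    """Flag the component when disk usage crosses the threshold."""
    return stats["disk_used_pct"] >= threshold_pct


stats = collect_stats()
print(stats["disk_used_pct"], disk_is_low(stats))
```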
Monitoring the health of third-party services
Sometimes the health of a component is tied to another component, which might be a third-party service such as an external message broker like Amazon SQS or a webhook producer like Shopify. It is important to retrieve and analyze the health data that these services provide and factor it into your monitoring information.
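One simple way to factor dependency health into a component's status is to report the worst status among the component and everything it depends on. The sketch below assumes the third-party statuses have already been retrieved and mapped to traffic light colors; how you obtain them depends on each provider, and no specific provider API is assumed here.

```python
# Higher number means worse health.
SEVERITY = {"green": 0, "orange": 1, "red": 2}


def effective_status(own: str, dependencies: dict[str, str]) -> str:
    """Return the worst status among a component and its dependencies."""
    return max([own, *dependencies.values()], key=SEVERITY.__getitem__)


# Hypothetical statuses for two third-party dependencies.
third_party = {"message_broker": "orange", "webhook_producer": "green"}
print(effective_status("green", third_party))  # prints "orange"
```

With this rule, a component that looks healthy on its own is still reported as degraded while a service it relies on is having problems.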
In this article, I have introduced some key things to keep in mind when monitoring the health of webhooks in a distributed system. Health monitoring is just one of the different types of monitoring you can set up on your webhook solution. If you’re interested in other types of monitoring activities, check out our articles on performance monitoring for webhook infrastructure and availability monitoring for webhook infrastructure.