Introduction to Availability Monitoring for Webhook Infrastructure

When monitoring a distributed webhook solution (or really any type of distributed system), three types of monitoring are important: health monitoring, availability monitoring, and performance monitoring.

In this article, we briefly cover how to perform availability monitoring. We discuss what availability monitoring is, how it differs from health monitoring, what it requires to succeed, and what data to analyze.

Note that this article is part of our series on building a standard infrastructure solution for managing webhooks.

If you’re only interested in availability monitoring, you can start reading this article right away. However, if you need more context on what a standard webhook solution looks like and how it is implemented, we recommend reading the series.

What is the role of availability monitoring?

While health monitoring gives an immediate view of the current health of the webhook solution, availability monitoring tracks the availability of the system and its components over time in order to generate uptime statistics.

Let me explain this using an example. Imagine you have a pool of consumers processing webhooks, with a message broker load-balancing the workload. If one consumer goes down, the health of the system is still intact because the broker simply ignores the failed consumer and distributes webhooks to the other consumers in the pool. Ideally, the users of the system don’t notice any problem.

In availability monitoring, however, it’s necessary to gather information about the failed consumer because it factors into the overall availability of the system. A failed consumer reduces the redundancy and reliability of the entire system: although the system is still in good health, it is one consumer closer to a complete shutdown.
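The consumer-pool example can be made concrete with a minimal sketch. The consumer names, the min_required threshold, and the pool_status helper are hypothetical; they only illustrate the difference between "is the pool healthy?" and "how much redundancy is left?".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolStatus:
    healthy: bool        # can the pool still process webhooks at all?
    live_consumers: int  # consumers currently up
    headroom: int        # further failures tolerated before an outage


def pool_status(consumer_up: dict, min_required: int = 1) -> PoolStatus:
    """Summarize a consumer pool from a {consumer name: is_up} mapping."""
    live = sum(1 for up in consumer_up.values() if up)
    return PoolStatus(
        healthy=live >= min_required,
        live_consumers=live,
        headroom=max(live - min_required, 0),
    )


# Hypothetical pool of three consumers, before and after one of them fails.
before = pool_status({"consumer-1": True, "consumer-2": True, "consumer-3": True})
after = pool_status({"consumer-1": True, "consumer-2": False, "consumer-3": True})
print(before)
print(after)
```

Both snapshots report a healthy pool, which is all a health check would show. The headroom value drops from 2 to 1, and that drop is the signal availability monitoring is meant to capture.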

Availability monitoring also helps administrators determine the cause of failures and take corrective action before their effects spread to other parts of the system. It also provides information that helps administrators put measures in place to prevent the failure from recurring.

A standard availability monitoring system should be able to capture availability data from all parts of the system and aggregate it to give an overall picture of the state of the system.

What are the requirements for an availability monitoring system to succeed?

Let’s review what we are trying to achieve: what does the system need? The attributes below are essential for a monitoring system to properly monitor the availability of a webhook solution.

  • The ability to view the historical availability of each component and sub-component.
  • The ability to sort by date and other key indicators.
  • The ability to spot trends that cause one or more components to periodically fail.
  • Intuitive visualizations and information that answer questions like:
    • Do services start to fail at a particular time of day?
    • What is the failure rate during peak processing hours?
    • How does one component’s downtime affect the availability of another?
  • The ability to quickly alert administrators when one or more services fail or webhooks can’t be ingested, just as with health monitoring.

This is not an exhaustive list. The main aim is to collect as much data as possible from the components so that it can be analyzed into useful, actionable information that helps keep all components and sub-components available.
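One of the questions listed above, whether services start to fail at a particular time of day, could be answered by grouping check results by hour. The sketch below assumes each check has been stored as a (timestamp, ok) pair; the function name and the sample data are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone


def failure_rate_by_hour(checks):
    """Map each hour of day (0-23) to the fraction of checks that failed in it."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for checked_at, ok in checks:
        totals[checked_at.hour] += 1
        if not ok:
            failures[checked_at.hour] += 1
    return {hour: failures[hour] / totals[hour] for hour in sorted(totals)}


utc = timezone.utc
# Illustrative data only: two days of checks for one hypothetical component.
checks = [
    (datetime(2024, 1, 1, 2, 0, tzinfo=utc), True),
    (datetime(2024, 1, 1, 2, 30, tzinfo=utc), False),
    (datetime(2024, 1, 1, 14, 0, tzinfo=utc), True),
    (datetime(2024, 1, 2, 2, 0, tzinfo=utc), False),
    (datetime(2024, 1, 2, 14, 0, tzinfo=utc), True),
]
rates = failure_rate_by_hour(checks)
for hour, rate in rates.items():
    print(f"{hour:02d}:00  failure rate {rate:.0%}")
```

In this sample, two of the three checks made during the 02:00 hour failed while none failed at 14:00, which is the kind of recurring pattern the requirements above ask the monitoring system to surface.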

Availability monitoring strategies and what to monitor

Availability monitoring can be seen as an aggregated, historical version of health monitoring. Instead of at-the-moment snapshots, health data is collected over time for every component and sub-component in the system.
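As a rough illustration of turning snapshots into history, the sketch below collapses a series of health snapshots for one component into outage windows. The (timestamp, is_up) snapshot shape and the outage_windows helper are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone


def outage_windows(snapshots):
    """Collapse (timestamp, is_up) snapshots into (down_from, up_again) windows.

    The end of a window is the first snapshot that reports the component up
    again, or None if it was still down at the last snapshot.
    """
    windows = []
    down_since = None
    for checked_at, is_up in sorted(snapshots, key=lambda s: s[0]):
        if not is_up and down_since is None:
            down_since = checked_at          # an outage starts here
        elif is_up and down_since is not None:
            windows.append((down_since, checked_at))
            down_since = None                # the outage is over
    if down_since is not None:
        windows.append((down_since, None))   # outage still ongoing
    return windows


start = datetime(2024, 1, 1, tzinfo=timezone.utc)


def minute(n):
    """Timestamp n minutes after the (illustrative) start of the series."""
    return start + timedelta(minutes=n)


# One snapshot per minute: up, down, down, up, up, down.
snapshots = [
    (minute(0), True), (minute(1), False), (minute(2), False),
    (minute(3), True), (minute(4), True), (minute(5), False),
]
windows = outage_windows(snapshots)
for down_from, up_again in windows:
    print(down_from, "->", up_again or "still down")
```

Because the window boundaries come from snapshot times, the precision of the result is limited by how often the snapshots are taken.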

As with health monitoring, the raw data required for availability monitoring can come from simulated webhook actions and from logs of the exceptions, faults, and warnings that occur in the system.

Another important monitoring strategy, which also applies to health monitoring, is endpoint monitoring.

Endpoint monitoring is used to monitor the overall health and functions of a component. To achieve this, the component needs to expose one or more health endpoints, each testing access to a functional area within the system.

The monitoring system can then ping each endpoint at configured intervals and collect results (success or failure). Data related to timeouts, network connectivity issues, retry attempts, etc. should also be recorded.
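A single probe of a health endpoint might look like the sketch below, which uses only Python's standard library. The ProbeResult fields, the default timeout and retry counts, and the injectable opener parameter (which makes it possible to exercise the function without a network) are choices made for this example, not part of any particular monitoring product.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ProbeResult:
    url: str
    checked_at: datetime   # when the probe finished (UTC)
    ok: bool               # True if any attempt returned a 2xx status
    latency_ms: float      # total time spent, across all attempts
    attempts: int          # how many attempts were made
    error: Optional[str]   # last error seen, or None on success


def probe(url, timeout=3.0, retries=2, opener=urllib.request.urlopen):
    """Call one health endpoint, retrying on failure, and describe the outcome."""
    error = None
    attempt = 0
    started = time.monotonic()
    for attempt in range(1, retries + 2):
        try:
            with opener(url, timeout=timeout) as response:
                if 200 <= response.status < 300:
                    error = None
                    break
                error = f"HTTP {response.status}"
        except urllib.error.HTTPError as exc:
            error = f"HTTP {exc.code}"               # endpoint answered with an error
        except OSError as exc:
            error = f"{type(exc).__name__}: {exc}"   # timeout or connectivity problem
    latency_ms = (time.monotonic() - started) * 1000
    return ProbeResult(url, datetime.now(timezone.utc), error is None,
                       latency_ms, attempt, error)
```

A scheduler would call probe for each health endpoint at the configured interval and persist every ProbeResult, including the failed ones, since the failures are exactly what the availability history is built from.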

All information collected should be timestamped because availability monitoring systems need to provide historical information that can be sorted by time and date.
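To show why timestamps matter, here is a minimal sketch that stores check results in an in-memory SQLite database and reads them back in time order for a chosen window. The table name, column names, and the record and history helpers are hypothetical; a production system would more likely use a dedicated time-series or metrics store.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE availability_events (
        component  TEXT NOT NULL,
        checked_at TEXT NOT NULL,     -- ISO 8601 in UTC, so text order is time order
        ok         INTEGER NOT NULL,
        detail     TEXT
    )
""")
conn.execute(
    "CREATE INDEX idx_events_time ON availability_events (component, checked_at)"
)


def record(component, ok, detail=None, checked_at=None):
    """Store one timestamped check result (timestamps must be UTC-aware)."""
    checked_at = checked_at or datetime.now(timezone.utc)
    conn.execute(
        "INSERT INTO availability_events VALUES (?, ?, ?, ?)",
        (component, checked_at.isoformat(timespec="microseconds"), int(ok), detail),
    )


def history(component, start, end):
    """Return a component's results with start <= checked_at < end, oldest first."""
    rows = conn.execute(
        "SELECT checked_at, ok, detail FROM availability_events "
        "WHERE component = ? AND checked_at >= ? AND checked_at < ? "
        "ORDER BY checked_at",
        (component,
         start.isoformat(timespec="microseconds"),
         end.isoformat(timespec="microseconds")),
    )
    return [(datetime.fromisoformat(ts), bool(ok), detail) for ts, ok, detail in rows]


day = datetime(2024, 1, 1, tzinfo=timezone.utc)
record("broker", True, checked_at=day + timedelta(hours=1))
record("broker", False, "timeout", checked_at=day + timedelta(hours=2))
record("consumer-1", True, checked_at=day + timedelta(hours=2))
record("broker", True, checked_at=day + timedelta(days=1, hours=1))

rows = history("broker", day, day + timedelta(days=1))
for checked_at, ok, detail in rows:
    print(checked_at, "up" if ok else f"down ({detail})")
```

Storing every timestamp in UTC with a fixed format is what allows the plain text comparison in the query to behave like a date-range filter.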

Analyzing the availability of the webhook solution

To properly analyze the availability of each component and the overall availability of a webhook solution, the data collected and aggregated should support the following types of analysis:

  • The availability and failure rates of each component within the system.
  • The ability to correlate failures with specific events (what happened when the system failed) and the time of the event.
  • Reasons for unavailability of the system or component. Reasons might include loss of connectivity, a dependent service not running, timeouts, a connected service returning errors, etc.
  • The ability to calculate percentage availability over a period of time (this can be useful for SLA/SLO purposes).
  • A historical view of the failure rates of the system or a component, relative to the load on the system, across any specified period of time.
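The percentage availability mentioned in the list above is commonly computed as uptime divided by total time in the period. The sketch below assumes outages are available as non-overlapping (start, end) windows; the function name and the sample outages are illustrative.

```python
from datetime import datetime, timedelta, timezone


def availability_percent(period_start, period_end, outages):
    """Return percentage uptime for a period, given (start, end) outage windows.

    Assumes the outage windows do not overlap one another.
    """
    total = (period_end - period_start).total_seconds()
    if total <= 0:
        raise ValueError("period_end must be after period_start")
    downtime = 0.0
    for start, end in outages:
        # Count only the part of each outage that falls inside the period.
        clipped_start = max(start, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            downtime += (clipped_end - clipped_start).total_seconds()
    return 100.0 * (total - downtime) / total


period_start = datetime(2024, 1, 1, tzinfo=timezone.utc)
period_end = period_start + timedelta(days=30)
# Illustrative outages: 40 minutes on day 3, then 3 minutes 12 seconds on day 20.
outages = [
    (period_start + timedelta(days=3),
     period_start + timedelta(days=3, minutes=40)),
    (period_start + timedelta(days=20),
     period_start + timedelta(days=20, minutes=3, seconds=12)),
]
availability = availability_percent(period_start, period_end, outages)
print(f"{availability:.3f}% available")
```

Here 43 minutes and 12 seconds of downtime in a 30-day period works out to 99.9% availability, which is the kind of figure an SLA or SLO target would be compared against.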

Conclusion

Availability monitoring helps administrators view the uptime history of their infrastructure and detect failing components even when redundancy keeps the system healthy. It also shows how the health of a system changes over time as the workload and other factors within the system change.

With availability monitoring, administrators can anticipate how the system will behave over time and address issues before they cause outages.