Webhook Infrastructure Requirements and Architecture

TL;DR: When starting an online service, launching an application is the easy part. Keeping the application running and users satisfied is the true evidence that an application has indeed established itself in the world of online service delivery.

Online services tend to grow pretty fast when they deliver value that helps users ease certain pain points. As the saying goes, “if you build it, they will come,” and when users begin to flood your online service, the once capable infrastructure that supported your service may start to buckle.

As traffic grows, every component in your infrastructure needs to be scaled (a term used broadly here to cover any strategy, activity, culture, or check employed) to keep up with the growing pressure.

Webhooks, an HTTP-based communication technology used to pass information between networked applications, are no exception to this demand for scale.

Webhooks heavily depend on the network and thus come with the usual baggage: networks not being 100% reliable, latency issues, limited bandwidth, insecure networks, and the cost of data transport (amongst others).

This article takes you through a series of steps to find a solution for sustaining webhooks in a production environment. We begin with a problem statement by domain experts, and break it down to identify the core features our solution needs to possess. Finally, we propose a high-level architectural design that ensures optimal performance of webhooks in production.

Understanding the domain problem: reliability for webhooks in production

To begin, let’s set aside the numerous issues that come with using webhooks in production and look at what an optimal webhook setup should look like. To that end, I spoke to domain experts in the webhooks space about the problems they have faced and noted what they would like to see in an optimal webhook infrastructure setup.

After a couple of interesting discussions, I was able to condense the information I received into the problem statement below:

The proposed webhook management system aims to serve as a centralized point for managing webhooks from all third-party providers, as well as to be reliable enough to never miss a webhook. We want to be able to handle varying amounts of webhook loads by designing a system that is elastic enough to adjust to business traffic. The system also needs to be fault-tolerant and provide visibility into the entire lifecycle of a webhook. Peculiar quirks of different third-party webhook providers' operations should be abstracted into a single webhook workflow, centralized webhook verification system, and a unified payload format.

Workflows related to failure recovery (alerts, retries, etc.) should also be customizable based on the importance of the webhook to the success of the system.

In the end, we want a reliable and resilient system where webhook information is easily distilled for every stakeholder, whether they're onboarding new team members, the support team is trying to solve an issue, or developers want to know the status of a webhook by taking a quick glance at the reports.

There is definitely a lot to unpack here, and some of the topics may end up in the wish list. However, we are not going to be making any immediate decisions or drawing fast conclusions; instead we will let the data speak. In the next section, I break down this statement and map the requirements to architectural terms that can be better understood by the architects, engineers, and developers that will take action.

Translating webhook domain concerns to architectural characteristics

Now that we have our problem statement, let’s dig in to extract what the key requirements are for the proposed solution. Reading through the statement, the following features stand out:

  • Never miss a webhook
  • Handle varying webhook load
  • Have visibility over the complete webhook lifecycle
  • Receive alerts when there are problems
  • Replay failed webhooks
  • Centralize the management of webhooks from every third-party service by developing a single workflow for each webhook
  • Verification in a single platform
  • Unified payload format
  • Configure webhook behavior (alerts, retries) based on its importance to the success of the business

The list above makes it easier to see the requirements than reading paragraphs of statements. We can now tackle each requirement to determine how feasible it is for us to support it.

When it comes to proposing solutions, stakeholders and architects/engineers speak different languages. While stakeholders use terms like “user satisfaction,” “time to market,” and “time and budget,” architects and engineers talk about architectural characteristics like scalability, availability, testability, and deployability.

So, our next step is to distill these requirements down to the architectural characteristic(s) they fall into. Do note that one requirement can fall under one or more architectural characteristics.

System Requirements

The breakdown of requirements into characteristics can be found in the table below:

  • Operational Characteristics appear in Black
  • Structural/Implicit Characteristics are highlighted

| System Requirement | Architecture Characteristic(s) |
| --- | --- |
| Never miss a webhook | Reliability, Performance, Availability |
| Handle varying webhook load | Scalability, Reliability |
| Have visibility over the complete webhook lifecycle | Supportability/Monitoring |
| Receive alerts when there are problems | Supportability/Monitoring, Recoverability, Availability |
| Replay failed webhooks | Recoverability, Robustness, Continuity, Fault Tolerance |
| Centralize the management of webhooks from every third-party service by developing a single workflow for each webhook | Simplicity, Adaptability, Usability |
| Verification in a single platform | Security, Simplicity |
| Unified payload format | Simplicity |
| Configure webhook behavior (alerts, retries) based on its importance to the success of the business | Configurability, Fault Tolerance |

Architectural characteristic definitions

| Design attribute | Details |
| --- | --- |
| Availability | How long the system will need to be available (if 24/7, steps need to be in place to allow the system to be up and running quickly in case of any failure). |
| Continuity | Disaster recovery capability. |
| Performance | Measurement of efficiency relative to the number of resources used under known conditions. Includes stress testing, peak analysis, analysis of the frequency of functions used, capacity required, and response times. |
| Recoverability | Business continuity requirements (e.g., in case of a disaster, how quickly is the system required to be online again?). This will affect the backup strategy and requirements for duplicated hardware. |
| Reliability/Safety | Assess if the system is fail-safe (can revert to a safe condition in the event of a breakdown or malfunction). Or, if it is mission-critical in a way that affects the business negatively; for example, will the business lose large sums of money? |
| Robustness | Ability to handle error and boundary conditions while running. For example, a network error, failing remote services, or power outage. |
| Scalability | Ability for the systems to perform and operate as the number of users or requests increases. |
| Security | Does the data need to be encrypted in the database (data-at-rest encryption)? Encrypted for network communication between systems (data-in-transit encryption)? What type of authentication needs to be in place for remote access, etc.? |
| Supportability/Monitoring | The level of technical support needed by the application. What level of logging and other facilities are required to debug errors in the system? |
| Configurability | Ability for the end-users to easily change aspects of the software’s configuration (through usable interfaces). |
| Simplicity | Ease of use of the system in relation to developer experience. |
| Fault Tolerance | A system’s ability to continue operating uninterrupted despite the failure of one or more of its components. |
| Adaptability | Can developers effectively and efficiently adapt the software for different evolving hardware, software, or other operational or usage environments? |
| Usability | Users can use the system effectively, efficiently, and satisfactorily for its intended purpose. |

With the above table, we can now visualize the problem in technical terms and can begin to discuss the components required in our architecture as well as the design patterns best suited for our proposed solution. But before that, we need to prioritize.

Trying to support every feature, though desired, is not always feasible. Too many architecture characteristics lead to generic solutions being used to solve every known problem in the domain. These architectures rarely work and often lead to over-engineering.

In the next section, we will walk through the requirements to determine the features that are most critical to the performance of our webhooks in production.

Shortlisting and prioritizing architectural requirements

When you’re in the process of proposing a solution to a problem, one of the greatest skills you can have is knowing how to pick your battles.

As mentioned earlier, trying to support every desired feature is not always feasible. Each supported architectural characteristic requires design effort and perhaps structural support. A bigger problem is that each architectural characteristic has an impact on the others. For example, introducing a load balancer between a client and a pool of servers helps scale the processing of client requests but adds latency, because every request now passes through an extra middleman component.

You need to understand what features are most important for webhooks in your infrastructure and make trade-offs for others.

To achieve this, it is important to first prioritize the business requirements of the application. Start by asking yourself these questions for each identified requirement:

What will be the cost of not having this requirement implemented? Will the system break down? Would we be losing users or money?

To help with the breakdown, we have four options for answering the above question. Based on the answer, each requirement will either be categorized as a “Must have” or put in a “Wishlist” (nice to have):

  • YES - This is a core operational feature whose absence will lead to a system collapse.
  • REQUIRED - The absence of this feature will not cause the system to collapse but is required to support, troubleshoot or diagnose problems that will cause (or have caused) a system to collapse.
  • MAYBE - The absence of this feature will not cause the system to collapse, but the feature can be used to further prevent a collapse.
  • NO - This feature is a luxury.

A YES gets put into a “Must have” bucket. A REQUIRED is ranked lower than one with a YES in the “Must have” bucket. A MAYBE feature is ranked lower than one with a REQUIRED in the “Must have” bucket, while requirements with a NO go into a “Wishlist” bucket.
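The bucketing rule above can be sketched in a few lines of Python. This is only an illustration of the ranking logic; the feature names, the `RANK` table, and the `prioritize` function are hypothetical and not part of any tool.

```python
# Rank of each answer inside the "Must have" bucket (lower sorts first).
# NO is deliberately absent: those requirements go to the wishlist.
RANK = {"YES": 0, "REQUIRED": 1, "MAYBE": 2}


def prioritize(answers: dict[str, str]) -> tuple[list[str], list[str]]:
    """Split requirements into (must_have, wishlist).

    must_have is ordered YES, then REQUIRED, then MAYBE. Python's sort is
    stable, so requirements with the same answer keep their input order.
    """
    must_have = [name for name, answer in answers.items() if answer != "NO"]
    must_have.sort(key=lambda name: RANK[answers[name]])
    wishlist = [name for name, answer in answers.items() if answer == "NO"]
    return must_have, wishlist


# Hypothetical features, used only to exercise the rule.
must_have, wishlist = prioritize(
    {"feature-a": "MAYBE", "feature-b": "YES", "feature-c": "NO", "feature-d": "REQUIRED"}
)
```

The rule only ranks by answer; how to order two requirements with the same answer remains a judgment call.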

So, let’s do this for our list of requirements to see what our buckets look like.

System Requirements Must Haves

| System Requirement | Answer | Bucket |
| --- | --- | --- |
| Never miss a webhook | YES | Must have |
| Handle varying webhook load | YES | Must have |
| Have visibility over the complete webhook lifecycle | REQUIRED | Must have |
| Receive alerts when there are problems | REQUIRED | Must have |
| Replay failed webhooks | YES | Must have |
| Centralize the management of webhooks from every third-party service by developing a single workflow for each webhook | NO | Wishlist |
| Verification in a single platform | NO | Wishlist |
| Unified payload format | NO | Wishlist |
| Configure webhook behavior (alerts, retries) based on its importance to the success of the business | MAYBE | Must have |

You now have a good idea of the main problems you’re trying to solve. To summarize, the following list shows our shortlisted requirements in order of priority:

  • Never miss a webhook
  • Replay failed webhooks
  • Handle varying webhook load
  • Have visibility over the complete webhook lifecycle
  • Receive alerts when there are problems
  • Configure webhook behavior (alerts, retries) based on its importance to the success of the business

These are the problems we will be trying to solve in our proposed design for a standard webhook infrastructure.

Proposing a design for optimizing webhook performance

Now that we have our feature shortlist, let’s map them back to the architectural characteristics they fall under.

| System Requirement | Architecture Characteristic(s) |
| --- | --- |
| Never miss a webhook | Reliability, Performance, Availability |
| Replay failed webhooks | Recoverability, Robustness, Continuity, Fault Tolerance |
| Handle varying webhook load | Scalability, Reliability |
| Have visibility over the complete webhook lifecycle | Supportability/Monitoring |
| Receive alerts when there are problems | Supportability/Monitoring, Recoverability, Availability |
| Configure webhook behavior (alerts, retries) based on its importance to the success of the business | Configurability, Fault Tolerance |

This table will serve as a reference point throughout the course of our design and implementation of the solution, including all iterations of the design, in this series.

Before we go into analyzing each requirement to discover the best component or group of components for the job, one thing to keep in mind is that we want to design our architecture to be as iterative as possible.

If we have a design that we can change easily, we can stress less about getting everything exactly right on the first attempt.

Never miss a webhook

Architecture characteristics:

  • Reliability: We want our webhooks to always fulfill their purpose. No webhook should go missing, and every webhook should (eventually) have its intended effect on the destination application. To achieve this, we need a logging system to trace our webhooks from point to point and a temporary store to hold our webhook information until its purpose is fulfilled.
  • Performance: Webhooks need to be attended to in time to avoid timeouts. Webhook processing should also make efficient use of system resources. To achieve this, we need to optimize for quick response times by avoiding long-running tasks or having to wait for them to complete. We also need to have a distributed pool of workers to process webhooks speedily.
  • Availability: Consumers should never be unavailable to attend to webhooks. We need to design for redundancy.

So, from this requirement we have the following components:

  • Logging system
  • Temporary store to hold webhook information
  • Messaging systems for asynchronously processing webhooks
  • Load balancing webhook consumption from the message broker
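To make the "respond quickly, process later" idea concrete, here is a minimal sketch in Python. It assumes an in-memory `queue.Queue` as a stand-in for a durable message broker and a small pool of threads as the workers; the names `receive_webhook`, `worker`, and `run_demo`, as well as the choice of a 202 status code, are illustrative assumptions rather than part of the proposed design.

```python
import queue
import threading

# Hypothetical in-memory stand-in for a durable message broker.
webhook_queue: "queue.Queue" = queue.Queue()
processed: list[str] = []
processed_lock = threading.Lock()


def receive_webhook(payload: dict) -> int:
    """Store the webhook and acknowledge immediately; no long-running work here."""
    webhook_queue.put(payload)
    return 202  # assumed status code: accepted for asynchronous processing


def worker() -> None:
    """Process queued webhooks until a None shutdown signal arrives."""
    while True:
        item = webhook_queue.get()
        if item is None:
            break
        with processed_lock:
            processed.append(item["id"])  # placeholder for real business logic


def run_demo(num_workers: int = 3, num_webhooks: int = 10) -> list[str]:
    """Start a worker pool, accept some webhooks, then shut the pool down."""
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    statuses = [receive_webhook({"id": f"wh-{i}"}) for i in range(num_webhooks)]
    assert all(status == 202 for status in statuses)
    for _ in threads:
        webhook_queue.put(None)  # one shutdown signal per worker
    for thread in threads:
        thread.join()
    return sorted(processed)
```

In a real deployment the queue would need to survive restarts, which is why the component list calls for a temporary store and a message broker rather than process memory.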

Replay failed webhooks

The main theme for all the architectural characteristics under this requirement is fault tolerance: the ability to recover from webhook failure, to continue operating despite failure, and to keep application state consistent.

To achieve this, first, we want to ensure that webhook consumers are stateless (state should always be kept in external stores and not in the application) and our webhook processing is idempotent. Proper error handling should also be implemented in consumers to ensure that errors are properly reported back to the handlers within the webhook infrastructure.
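As a rough illustration of stateless, idempotent processing, the sketch below keeps all state in a separate store object and uses the webhook's ID to skip duplicates. The `InMemoryStore` class is a hypothetical stand-in for an external database or cache, and the `id`, `account`, and `amount` fields are assumed payload fields chosen for the example.

```python
class InMemoryStore:
    """Hypothetical stand-in for an external store shared by all consumers."""

    def __init__(self) -> None:
        self.seen_ids: set[str] = set()
        self.balances: dict[str, int] = {}


def handle_webhook(store: InMemoryStore, webhook: dict) -> str:
    """Apply a webhook at most once; the consumer itself holds no state."""
    webhook_id = webhook["id"]
    if webhook_id in store.seen_ids:
        return "duplicate"  # already applied, so a replay changes nothing
    account = webhook["account"]
    store.balances[account] = store.balances.get(account, 0) + webhook["amount"]
    store.seen_ids.add(webhook_id)
    return "processed"


store = InMemoryStore()
event = {"id": "wh-1", "account": "acct-42", "amount": 50}
first = handle_webhook(store, event)
second = handle_webhook(store, event)  # a retry of the same webhook
```

Because replaying `event` leaves the balance unchanged, a retry system can safely resend a webhook whose outcome is unknown. A production version would also need the duplicate check and the update to happen atomically in the external store.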

Next, we need a component and workflow that collects failed webhooks and passes them to a retry system where they can be replayed.

So, from this requirement we have the following components:

  • Error handlers for webhook failure (errors are reported by consumers)
  • Dead-letter queues in the message broker
  • Retry system for failed webhooks
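These three components can be sketched together as follows. The retry policy shown (a fixed number of attempts with exponentially growing delays) is an assumption made for the example, as are the `RetrySystem` class and its field names; the dead-letter queue is modeled as a plain list.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RetrySystem:
    max_attempts: int = 3
    base_delay_seconds: float = 1.0
    dead_letters: list = field(default_factory=list)  # stand-in for a dead-letter queue

    def backoff_delay(self, attempt: int) -> float:
        """Delay after a failed attempt: base, 2x base, 4x base, and so on."""
        return self.base_delay_seconds * (2 ** (attempt - 1))

    def deliver(
        self,
        webhook: dict,
        consumer: Callable[[dict], None],
        sleep: Callable[[float], None] = time.sleep,
    ) -> bool:
        """Call the consumer until it succeeds or attempts run out."""
        for attempt in range(1, self.max_attempts + 1):
            try:
                consumer(webhook)  # the consumer reports failure by raising
                return True
            except Exception:
                if attempt < self.max_attempts:
                    sleep(self.backoff_delay(attempt))
        self.dead_letters.append(webhook)  # keep it so it can be replayed later
        return False
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. Webhooks that land in `dead_letters` are not lost; they remain available for a manual or scheduled replay.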

Handle varying webhook load

This requirement screams scalability.

So, as mentioned before, we need a load balancer that distributes traffic amongst a pool of uniform consumers. This pool can be expanded when there is an increase in traffic by adding more consumer application instances. It can also be scaled down by removing instances when traffic returns to its usual numbers.
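A simple way to picture this is a round-robin pool that can grow and shrink. The sketch below is an assumption-laden illustration: round-robin is just one possible balancing strategy, and `ConsumerPool` and the consumer names are hypothetical.

```python
class ConsumerPool:
    """Round-robin distribution over a resizable pool of uniform consumers."""

    def __init__(self, consumers: list[str]) -> None:
        self.consumers = list(consumers)
        self._next = 0

    def scale_up(self, consumer: str) -> None:
        self.consumers.append(consumer)

    def scale_down(self, consumer: str) -> None:
        self.consumers.remove(consumer)

    def pick(self) -> str:
        """Return the consumer that should receive the next webhook."""
        consumer = self.consumers[self._next % len(self.consumers)]
        self._next += 1
        return consumer


pool = ConsumerPool(["consumer-a", "consumer-b"])
before = [pool.pick() for _ in range(4)]
pool.scale_up("consumer-c")  # traffic increased, so add an instance
after = [pool.pick() for _ in range(3)]
```

Because the consumers are uniform and stateless, any instance can take any webhook, which is what makes adding and removing instances safe.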

We will also require a rate limiter to throttle traffic that goes to the consumers in order to serve webhooks at a rate that matches the consumer’s capacity.

So, from this requirement we have the following components:

  • Load balancing webhook consumption from the message broker
  • Rate limiter
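One common way to implement such a rate limiter is a token bucket, sketched below. The article does not prescribe an algorithm, so the token bucket itself, the `TokenBucket` class, and its parameters are assumptions made for illustration.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow at most `capacity` webhooks in a burst, refilled at `rate_per_second`."""

    def __init__(
        self,
        rate_per_second: float,
        capacity: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_second
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last_refill = clock()

    def allow(self) -> bool:
        """Return True if a webhook may be forwarded to the consumers now."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A webhook that is not allowed should stay queued in the broker and be retried shortly, rather than being dropped. The injectable `clock` exists only so the behavior can be checked without real waiting.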

Have visibility over the complete webhook lifecycle

It is expected that our webhook will pass through a number of components in our design before reaching its destination. Thus, the design should include a mechanism to trace a webhook from source to destination. For this, we need to introduce a monitoring tool and a trace ID for our webhooks.

This will allow us to track the webhook’s state and status from point to point within the infrastructure.

So, from this requirement we have the following components:

  • Monitoring tool
  • Trace ID
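As a sketch of how a trace ID ties the lifecycle together, the example below assigns an ID when a webhook enters the system and records an event at each component. The `LifecycleLog` class and the component and status names are hypothetical; a real setup would send these events to a monitoring tool instead of an in-memory list.

```python
import uuid


def new_trace_id() -> str:
    """Generate a unique ID to attach to a webhook when it is first received."""
    return uuid.uuid4().hex


class LifecycleLog:
    """Hypothetical in-memory stand-in for a monitoring tool's event store."""

    def __init__(self) -> None:
        self.events: list[dict] = []

    def record(self, trace_id: str, component: str, status: str) -> None:
        self.events.append(
            {"trace_id": trace_id, "component": component, "status": status}
        )

    def history(self, trace_id: str) -> list[tuple[str, str]]:
        """Return the (component, status) steps for one webhook, in order."""
        return [
            (event["component"], event["status"])
            for event in self.events
            if event["trace_id"] == trace_id
        ]


log = LifecycleLog()
trace_id = new_trace_id()
log.record(trace_id, "gateway", "received")
log.record(trace_id, "broker", "queued")
log.record(trace_id, "consumer", "processed")
```

Calling `log.history(trace_id)` then reconstructs the path a single webhook took, which is the kind of at-a-glance status the problem statement asks for.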

Receive alerts when there are problems

We need to be alerted when there are failures or when any of the system’s resource usage moves close to a threshold that could lead to a fault within the system. To achieve this, we need a component that collects metrics, a component that reports metrics, and alerts configured on any metric that needs a close watch.

We also need a messaging service (email, push messages, etc.) that sends the alert to the appropriate individuals who need to take action.

So, from this requirement we have the following components:

  • Metrics collection component
  • Metrics reporting
  • Alerting system
  • Messaging service for delivering alerts to administrators
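The relationship between these components can be sketched as a counter with thresholds and a notification callback. Everything here is illustrative: the `AlertingMetrics` class, the metric name, and the message format are assumptions, and the `notify` callback stands in for an email or push messaging service.

```python
from collections import Counter
from typing import Callable


class AlertingMetrics:
    """Collect counters and send one alert when a metric reaches its threshold."""

    def __init__(self, thresholds: dict[str, int], notify: Callable[[str], None]) -> None:
        self.counts: Counter = Counter()
        self.thresholds = thresholds
        self.notify = notify  # stand-in for the messaging service
        self.fired: set[str] = set()

    def increment(self, metric: str, amount: int = 1) -> None:
        self.counts[metric] += amount
        limit = self.thresholds.get(metric)
        reached = limit is not None and self.counts[metric] >= limit
        if reached and metric not in self.fired:
            self.fired.add(metric)  # avoid repeating the same alert
            self.notify(
                f"ALERT: {metric} reached {self.counts[metric]} (threshold {limit})"
            )


sent: list[str] = []
metrics = AlertingMetrics({"failed_webhooks": 3}, notify=sent.append)
for _ in range(4):
    metrics.increment("failed_webhooks")
```

Setting the threshold below the level at which the system actually fails gives administrators time to act before a fault occurs.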

Configure webhook behavior (alerts, retries) based on its importance to the success of the business

This is one of those requirements for which there isn’t really any off-the-shelf software that fills the need. We may have to build a custom component for this. This component will work in conjunction with the retry and alerting systems to serve as a configurable scheduler for the alerting and retry operations.

So, from this requirement we have just one last required component:

  • Configurable scheduling system for alerting issues and retrying webhooks
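One possible shape for such a component is a table of per-webhook policies that the retry and alerting systems consult. The sketch below is hypothetical throughout: the `WebhookPolicy` fields, the event type names, and the numbers are invented for illustration and are not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WebhookPolicy:
    max_retries: int
    retry_interval_seconds: int
    alert_after_failures: int


DEFAULT_POLICY = WebhookPolicy(max_retries=3, retry_interval_seconds=60, alert_after_failures=3)

# Hypothetical event types: business-critical webhooks get more retries and earlier alerts.
POLICIES = {
    "payment.succeeded": WebhookPolicy(max_retries=10, retry_interval_seconds=30, alert_after_failures=1),
    "newsletter.opened": WebhookPolicy(max_retries=1, retry_interval_seconds=600, alert_after_failures=50),
}


def policy_for(event_type: str) -> WebhookPolicy:
    return POLICIES.get(event_type, DEFAULT_POLICY)


def next_action(event_type: str, failed_attempts: int) -> dict:
    """Decide what to do after `failed_attempts` failures (first delivery included)."""
    policy = policy_for(event_type)
    return {
        "retry": failed_attempts <= policy.max_retries,
        "retry_in_seconds": policy.retry_interval_seconds,
        "alert": failed_attempts >= policy.alert_after_failures,
    }
```

With this configuration, a single failed `payment.succeeded` webhook triggers both a retry and an alert, while a failed `newsletter.opened` webhook is retried once and alerts only after many failures.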

Proposed architecture diagram

The proposed architecture is shown in the diagram below:

[Figure: standard webhook infrastructure solution]

Do note that this is a tentative architecture that is subject to changes as we discover new information during the course of this series.

Let’s walk through this architecture and give a brief overview of the components involved.

Breaking down the design flow

Now that we know what the proposed architecture for our webhook solution looks like, let’s discuss how it works.

Webhooks from different sources pass through an API gateway that performs any pre-processing required on the webhook message. This is where activities like verification, TLS termination, payload transformation, etc., can be carried out.

As a gateway proxy, it also helps convert the HTTP messages (the format in which webhooks are produced) into a protocol that is supported by the message broker.
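Verification at the gateway is often done by checking a signature that the provider computes over the request body with a shared secret. The sketch below assumes an HMAC-SHA256 hex signature, which is a common scheme but not one the article specifies; real providers differ in algorithm, encoding, and header name, so the secret and payload here are purely illustrative.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Compute the hex HMAC-SHA256 signature a provider would send (assumed scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Check the signature against the raw body using a constant-time comparison."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected, received_signature)


secret = b"example-shared-secret"  # hypothetical; never hard-code real secrets
body = b'{"event": "payment.succeeded", "id": "wh-1"}'
signature = sign(secret, body)

is_valid = verify_signature(secret, body, signature)
is_tampered_valid = verify_signature(secret, body + b" ", signature)
```

The check must run on the raw request bytes, before any payload transformation, because even a whitespace change produces a different signature.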

Next, webhook messages are ingested and queued in the message broker. They are then routed based on the routing rules attached to them when ingested.
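Routing can be pictured as matching a key attached to each message against a list of rules. The sketch below is a simplified assumption: it uses shell-style patterns from Python's `fnmatch` module, whereas real message brokers have their own routing syntax, and the rule and queue names are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical routing rules: (pattern for the routing key, destination queue).
ROUTING_RULES = [
    ("payments.*", "payments-queue"),
    ("orders.*", "orders-queue"),
]
DEFAULT_QUEUE = "default-queue"


def route(routing_key: str) -> str:
    """Return the queue for the first rule whose pattern matches the key."""
    for pattern, queue_name in ROUTING_RULES:
        if fnmatchcase(routing_key, pattern):
            return queue_name
    return DEFAULT_QUEUE


def ingest(queues: dict, message: dict) -> str:
    """Append the message to its destination queue and return the queue name."""
    queue_name = route(message["routing_key"])
    queues.setdefault(queue_name, []).append(message)
    return queue_name


queues: dict = {}
ingest(queues, {"routing_key": "payments.refund", "id": "wh-1"})
ingest(queues, {"routing_key": "unknown.event", "id": "wh-2"})
```

Keeping a default queue means a webhook with an unrecognized routing key is still stored rather than dropped, in line with the "never miss a webhook" requirement.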

These messages are pushed to a rate-limiter that relays the messages at a rate that the pool of consumers can handle. A load balancer is placed between the rate limiter and the consumers to distribute the webhook load.

When an error occurs while a consumer is processing a webhook, the consumer reports the error in its response to the load balancer, which then passes the failed webhook information to the retry system so the webhook can be re-queued.

For the purpose of monitoring, we pull logs and metrics from every component between the producers and consumers, using a trace ID to track a webhook as it moves through each component.

Information from these logs and metrics will then be used to configure alerts for administrators.

In the next article in this series, we will dissect each component by taking a look at its purpose, setup, use case considerations, deployment options, and more important information on how it handles webhooks within the system.

Conclusion

In this article, we have been able to go from understanding what a standard webhook solution should have to proposing a solution. This was achieved by doing a step-by-step analysis of the problem to figure out what the technical requirements are and what is needed to reliably handle webhooks.

We will proceed from here by using our solution’s architectural blueprint to explore the components involved and different strategies for implementing the solution.

See you in the next chapter on Infrastructure components and their functions.