Webhook Infrastructure Requirements and Architecture

TL;DR: Launching an online service is the easy part. Keeping it running and its users satisfied is the real proof that the service has established itself.

Online services tend to grow pretty fast when they deliver value that helps users ease certain pain points. As the saying goes, “if you build it, they will come,” and when users begin to flood your online service, the once capable infrastructure that supported your service may start to buckle.

As traffic grows, every component in your infrastructure needs to be scaled to keep up with the growing pressure. “Scaled” is used broadly here to cover any strategy, activity, culture, or check employed to that end.

Webhooks, an HTTP-based communication technology used to pass information between networked applications, are no exception to this demand for scale.

Webhooks heavily depend on the network and thus come with the usual baggage: networks not being 100% reliable, latency issues, limited bandwidth, insecure networks, and the cost of data transport (amongst others).

This article takes you through a series of steps to find a solution for sustaining webhooks in a production environment. We begin with a problem statement by domain experts, and break it down to identify the core features our solution needs to possess. Finally, we propose a high-level architectural design that ensures optimal performance of webhooks in production.

Understanding the domain problem: reliability for webhooks in production

To begin, let’s set aside the numerous issues that come with using webhooks in production and look at what an optimal webhook setup should look like. To get a clear picture, I spoke to domain experts in the webhooks space about the problems they have faced and noted what they would like to see in an optimal webhook infrastructure setup.

After a couple of interesting discussions, I was able to condense the information I received into the problem statement below:

The proposed webhook management system aims to serve as a centralized point for managing webhooks from all third-party providers, as well as to be reliable enough to never miss a webhook. We want to be able to handle varying webhook loads by designing a system that is elastic enough to adjust to business traffic. The system also needs to be fault-tolerant and provide visibility into the entire lifecycle of a webhook. The quirks of different third-party webhook providers' operations should be abstracted into a single webhook workflow, a centralized webhook verification system, and a unified payload format.

Workflows related to failure recovery (alerts, retries, etc.) should also be customizable based on the importance of the webhook to the success of the system.

In the end, we want a reliable and resilient system where webhook information is easily distilled for every stakeholder, whether new team members are being onboarded, the support team is trying to solve an issue, or developers want to check the status of a webhook with a quick glance at the reports.

There is definitely a lot to unpack here, and some of the topics may end up on the wish list. However, we are not going to make any immediate decisions or jump to conclusions; instead we will let the data speak. In the next section, I break down this statement and map the requirements to architectural terms that can be better understood by the architects, engineers, and developers who will take action.

Translating webhook domain concerns to architectural characteristics

Now that we have our problem statement, let’s dig in to extract what the key requirements are for the proposed solution. Reading through the statement, the following features stand out:

Never miss a webhook
Handle varying webhook load
Have visibility over the complete webhook lifecycle
Receive alerts when there are problems
Replay failed webhooks
Centralize the management of webhooks from every third-party service by developing a single workflow for each webhook
Verification in a single platform
Unified payload format
Configure webhook behavior (alerts, retries) based on its importance to the success of the business

The list above makes it easier to see the requirements than reading paragraphs of statements. We can now tackle each requirement to determine how feasible it is for us to support it.

When it comes to proposing solutions, stakeholders and architects/engineers speak different languages. While stakeholders use terms like “user satisfaction,” “time to market,” and “time and budget,” architects and engineers talk about architectural characteristics like scalability, availability, testability, and deployability.

So, our next step is to distill these requirements down to the architectural characteristic(s) they fall into. Do note that one requirement can fall under one or more architectural characteristics.

System Requirements

The breakdown of requirements into characteristics can be found in the table below:

Operational characteristics appear in plain text.
Structural/implicit characteristics are highlighted (shown here in backticks).

System Requirement	Architecture Characteristic(s)
Never miss a webhook	Reliability, Performance, Availability
Handle varying webhook load	Scalability, Reliability
Have visibility over the complete webhook lifecycle	Supportability/Monitoring
Receive alerts when there are problems	Supportability/Monitoring, Recoverability, Availability
Replay failed webhooks	Recoverability, Robustness, Continuity, `Fault Tolerance`
Centralize the management of webhooks from every third-party service by developing a single workflow for each webhook	`Simplicity`, `Adaptability`, `Usability`
Verification in a single platform	Security, `Simplicity`
Unified payload format	`Simplicity`
Configure webhook behavior (alerts, retries) based on its importance to the success of the business	`Configurability`, `Fault Tolerance`

Architectural characteristic definitions

Design attribute	Details
Availability	How long the system will need to be available (if 24/7, steps need to be in place to allow the system to be up and running quickly in case of any failure).
Continuity	Disaster recovery capability.
Performance	Measurement of efficiency relative to the number of resources used under known conditions. Includes stress testing, peak analysis, analysis of the frequency of functions used, capacity required, and response times.
Recoverability	Business continuity requirements (e.g., in case of a disaster, how quickly is the system required to be online again?). This will affect the backup strategy and requirements for duplicated hardware.
Reliability/Safety	Assess if the system is fail-safe (can revert to a safe condition in the event of a breakdown or malfunction). Or, if it is mission-critical in a way that affects the business negatively; for example, will the business lose large sums of money?
Robustness	Ability to handle error and boundary conditions while running. For example, a network error, failing remote services, or power outage.
Scalability	Ability for the systems to perform and operate as the number of users or requests increases.
Security	Does the data need to be encrypted in the database (data-at-rest encryption)? Does it need to be encrypted for network communication between systems (data-in-transit encryption)? What type of authentication needs to be in place for remote access?
Supportability/Monitoring	The level of technical support needed by the application. What level of logging and other facilities are required to debug errors in the system?
Configurability	Ability for the end-users to easily change aspects of the software’s configuration (through usable interfaces).
Simplicity	Ease of use of the system in relation to developer experience.
Fault Tolerance	A system’s ability to continue operating uninterrupted despite the failure of one or more of its components.
Adaptability	Can developers effectively and efficiently adapt the software for different evolving hardware, software, or other operational or usage environments?
Usability	Users can use the system effectively, efficiently, and satisfactorily for its intended purpose.

With the above table, we can now visualize the problem in technical terms and can begin to discuss the components required in our architecture as well as the design patterns best suited for our proposed solution. But before that, we need to prioritize.

Trying to support every feature, though desired, is not always feasible. Too many architecture characteristics lead to generic solutions being used to solve every known problem in the domain. These architectures rarely work and often lead to over-engineering.

In the next section, we will walk through the requirements to determine the features that are most critical to the performance of our webhooks in production.

Shortlisting and prioritizing architectural requirements

When you’re in the process of proposing a solution to a problem, one of the greatest skills you can have is knowing how to pick your battles.

As mentioned earlier, trying to support every desired feature is not always feasible. Each supported architectural characteristic requires design effort and perhaps structural support. A bigger problem is that each architectural characteristic has an impact on others. For example, introducing a load balancer between a client and a pool of servers helps scale the processing of client requests but adds latency, because every request now passes through an extra middleman component.

You need to understand which features are most important for webhooks in your infrastructure and make trade-offs on the others.

To achieve this, it is important to first prioritize the business requirements of the application. Start by asking yourself these questions for each identified requirement:

What will be the cost of not having this requirement implemented? Will the system break down? Would we be losing users or money?

To help with the breakdown, we have four options to answer the above question. Based on the answer, each requirement will either be categorized as a “Must have” or put in a “Wishlist” (nice to have):

YES - This is a core operational feature whose absence will lead to a system collapse.
REQUIRED - The absence of this feature will not cause the system to collapse but is required to support, troubleshoot or diagnose problems that will cause (or have caused) a system to collapse.
MAYBE - The absence of this feature will not cause the system to collapse, but having it can help further prevent a collapse.
NO - This feature is a luxury.

A requirement with a YES goes into the “Must have” bucket. A requirement with a REQUIRED also goes into the “Must have” bucket but ranks below one with a YES, and a requirement with a MAYBE ranks below one with a REQUIRED. Requirements with a NO go into the “Wishlist” bucket.

So, let’s do this for our list of requirements to see what our buckets look like.

System Requirements Must Haves

System Requirement	Answer	Bucket
Never miss a webhook	YES	Must have
Handle varying webhook load	YES	Must have
Have visibility over the complete webhook lifecycle	REQUIRED	Must have
Receive alerts when there are problems	REQUIRED	Must have
Replay failed webhooks	YES	Must have
Centralize the management of webhooks from every third-party service by developing a single workflow for each webhook	NO	Wishlist
Verification in a single platform	NO	Wishlist
Unified payload format	NO	Wishlist
Configure webhook behavior (alerts, retries) based on its importance to the success of the business	MAYBE	Must have
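
The bucketing rule can also be sketched in a few lines of Python. This is only an illustration: the names `Requirement`, `ANSWER_RANK`, and `shortlist` are hypothetical, and the answers are copied from the table above.

```python
from dataclasses import dataclass

# Lower rank means higher priority, following the rule described in the text.
ANSWER_RANK = {"YES": 0, "REQUIRED": 1, "MAYBE": 2, "NO": 3}


@dataclass(frozen=True)
class Requirement:
    name: str
    answer: str  # one of the keys in ANSWER_RANK

    @property
    def bucket(self) -> str:
        # Only a NO lands in the wishlist; every other answer is a must-have.
        return "Wishlist" if self.answer == "NO" else "Must have"


def shortlist(requirements):
    """Return the must-have requirements, highest-ranked answers first."""
    must_haves = [r for r in requirements if r.bucket == "Must have"]
    # sorted() is stable, so requirements with the same answer keep input order.
    return sorted(must_haves, key=lambda r: ANSWER_RANK[r.answer])


REQUIREMENTS = [
    Requirement("Never miss a webhook", "YES"),
    Requirement("Handle varying webhook load", "YES"),
    Requirement("Have visibility over the complete webhook lifecycle", "REQUIRED"),
    Requirement("Receive alerts when there are problems", "REQUIRED"),
    Requirement("Replay failed webhooks", "YES"),
    Requirement("Centralize the management of webhooks", "NO"),
    Requirement("Verification in a single platform", "NO"),
    Requirement("Unified payload format", "NO"),
    Requirement("Configure webhook behavior", "MAYBE"),
]

if __name__ == "__main__":
    for position, req in enumerate(shortlist(REQUIREMENTS), start=1):
        print(f"{position}. {req.name} ({req.answer})")
```

The rule only orders requirements by answer. How requirements with the same answer are ordered relative to each other remains a judgment call; the sketch simply keeps them in table order.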

You now have a good idea of the main problems you’re trying to solve. To summarize, the following list shows our shortlisted requirements in order of priority:

Never miss a webhook
Replay failed webhooks
Handle varying webhook load
Have visibility over the complete webhook lifecycle
Receive alerts when there are problems
Configure webhook behavior (alerts, retries) based on its importance to the success of the business

These are the problems we will be trying to solve in our proposed design for a standard webhook infrastructure.

Proposing a design for optimizing webhook performance

Now that we have our feature shortlist, let’s map each feature back to the architectural characteristics it falls under.

System Requirement	Architecture Characteristic(s)
Never miss a webhook	Reliability, Performance, Availability
Replay failed webhooks	Recoverability, Robustness, Continuity, Fault Tolerance
Handle varying webhook load	Scalability, Reliability
Have visibility over the complete webhook lifecycle	Supportability/Monitoring
Receive alerts when there are problems	Supportability/Monitoring, Recoverability, Availability
Configure webhook behavior (alerts, retries) based on its importance to the success of the business	Configurability, Fault Tolerance

This table will serve as a reference point throughout the course of our design and implementation of the solution, including all iterations of the design, in this series.
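
To make the table easier to query during design, the mapping can be held in a small data structure. The sketch below is one possible way to do it, not a prescribed tool: `SHORTLIST`, `characteristic_counts`, and `requirements_for` are hypothetical names, and the data is copied from the table above. It counts how many shortlisted requirements reference each characteristic and looks up the requirements behind a given characteristic.

```python
from collections import Counter

# Shortlisted requirements and their characteristics, copied from the table.
SHORTLIST = {
    "Never miss a webhook": ["Reliability", "Performance", "Availability"],
    "Replay failed webhooks": [
        "Recoverability", "Robustness", "Continuity", "Fault Tolerance",
    ],
    "Handle varying webhook load": ["Scalability", "Reliability"],
    "Have visibility over the complete webhook lifecycle": [
        "Supportability/Monitoring",
    ],
    "Receive alerts when there are problems": [
        "Supportability/Monitoring", "Recoverability", "Availability",
    ],
    "Configure webhook behavior": ["Configurability", "Fault Tolerance"],
}


def characteristic_counts(mapping):
    """Count how many requirements reference each characteristic."""
    return Counter(c for chars in mapping.values() for c in chars)


def requirements_for(mapping, characteristic):
    """List the requirements that fall under a given characteristic."""
    return [req for req, chars in mapping.items() if characteristic in chars]


if __name__ == "__main__":
    for name, count in characteristic_counts(SHORTLIST).most_common():
        print(f"{count}  {name}")
    print(requirements_for(SHORTLIST, "Fault Tolerance"))
```

A characteristic referenced by several requirements (Reliability, for instance, backs two of them) is a reasonable candidate for early attention, though the count is only a rough signal and does not replace the priority order above.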

Before we go into analyzing each requirement to discover the best component or group of components for the job, one thing to keep in mind is that we want to design our architecture to be as iterative as possible.

If we have a design that is easy to change, we can worry less about getting everything exactly right on the first attempt.