Comparing Open Source and Cloud Services When Building a Webhook Management Infrastructure

So far in our webhook management solution series, we have taken a look at architecting a design blueprint. We also took a deep dive into each component that makes up our proposed webhook solution, and spent time discussing how our webhook infrastructure can be monitored to improve performance, detect faults and avoid failures.

We have now arrived at the fun part: building! In this article, we will implement our proposed design twice: with open source technology (RabbitMQ) and with cloud services (AWS). For each solution, we will take a look at the pros and cons of the approach.

I’ll also share my own experience building a lean version. I left out the rate limiter and the retry system to keep the design lean; as much as they add to the infrastructure’s fault tolerance and reliability, the system can still operate without them. Also, the rate limiter and retry system are custom components with varying ways of operation and implementation. For example, a rate limiter can be implemented with different algorithms (such as token bucket, leaky bucket, and sliding window counter), each with its own merits and demerits.
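To give a feel for what one of these algorithms involves, here is a minimal token bucket sketch. It is not part of the project described in this article; the class name, its options, and the injectable clock are assumptions made purely for illustration.

```javascript
"use strict";

// Token bucket: the bucket holds up to `capacity` tokens and is refilled at
// `refillPerSecond`. Each delivered webhook spends one token.
class TokenBucket {
  constructor({ capacity, refillPerSecond, now = Date.now }) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now; // injectable clock (hypothetical option, eases testing)
    this.tokens = capacity; // start with a full bucket
    this.lastRefill = now();
  }

  // Add the tokens earned since the last call, capped at capacity.
  refill() {
    const current = this.now();
    const elapsedSeconds = (current - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond,
    );
    this.lastRefill = current;
  }

  // Returns true and spends a token if one is available, otherwise false.
  tryRemoveToken() {
    this.refill();
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

A dispatcher would call tryRemoveToken() before pushing each message to a consumer and hold the message back when it returns false. The capacity sets how large a burst is tolerated, while the refill rate sets the sustained delivery rate.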

In the end, you will be furnished with useful information to employ when making decisions about how you want to implement the architecture for your specific use case.

What to take into consideration when deciding between open source and cloud services

Proposed solution

Webhook producers trigger webhook requests and send them to an API gateway.
The API gateway adds the webhook messages into a queue (or queues) in the message broker and adds a Trace ID for tracking purposes.
The message broker buffers the webhook messages to process them asynchronously and routes them to a pool of consumers.
A rate limiter sits between the message broker and the consumers, and pushes webhook messages to consumers at a rate that they can handle.
When processing fails, dead letter queues collect failed messages to be replayed by a retry system.
Metrics and logs are collected throughout the system components.
Metrics and logs are used to set up alerts for cases where administrators need to take action.
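The flow above can be sketched in a few lines with in-memory arrays standing in for the broker. This is only an illustration of the ingest, consume, dead-letter, and replay steps; the function names, the message shape, and the maxAttempts parameter are assumptions, and a real broker adds persistence, acknowledgments, and routing.

```javascript
"use strict";

const { randomUUID } = require("node:crypto");

// In-memory stand-ins for the broker's main queue and dead letter queue.
const mainQueue = [];
const deadLetterQueue = [];

// Gateway step: wrap the incoming payload with a trace ID and enqueue it.
function ingest(payload) {
  const message = { traceId: randomUUID(), payload, attempts: 0 };
  mainQueue.push(message);
  return message.traceId;
}

// Consumer step: drain the queue; failed messages go to the dead letter queue.
function drain(handler) {
  while (mainQueue.length > 0) {
    const message = mainQueue.shift();
    message.attempts += 1;
    try {
      handler(message.payload);
    } catch (error) {
      deadLetterQueue.push({ ...message, lastError: error.message });
    }
  }
}

// Retry step: move dead-lettered messages back to the main queue unless they
// have already used up `maxAttempts`; exhausted messages stay dead-lettered.
// Returns the number of messages now waiting in the main queue.
function replayDeadLetters(maxAttempts) {
  const pending = deadLetterQueue.splice(0);
  for (const message of pending) {
    if (message.attempts < maxAttempts) {
      mainQueue.push(message);
    } else {
      deadLetterQueue.push(message);
    }
  }
  return mainQueue.length;
}
```

Because the trace ID travels with the message through every step, a failed delivery found in the dead letter queue can be matched to the request that the gateway originally accepted.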

standard webhook infrastructure solution

Factors to keep in mind when building

Factor	Definition
Ease of implementation	The system should demand the least amount of effort and time possible from your development team.
Scalability	The ability of the system to expand relative to pressures of increasing webhook load so that performance is not degraded.
Configuration options	The ability to tweak the system’s settings to control the webhook workflow and to improve the reliability and performance of the system.
Technical knowledge required	The learning curve should not be too steep and documentation should be available to all. The system should also not rely on knowledge silos.
Cost ($$)	The system should be designed to be cost-effective for development, maintenance, and scale.
Reliability/Performance	The system should be fault-tolerant and have high throughput for webhooks.
Flexibility of choice	Components of the system should be easily replaceable with better or preferred options.
Extendibility	It should be easy to add new features to the system without breaking existing features.
Feature set	The system should have the core features to get the job done, but also the ability to accommodate new features that improve performance and ease of use.

Building the solution using open source technology RabbitMQ

My first go at building out the architecture was with the use of open source technologies. The design consists of different components, so I had to make the important decision of picking the open source technology I would use for each component. After much thought, based mainly on my experience with the technology and ease of use, I settled on the following:

Component/Service	Technology
API gateway	Node.js server application
Message queue	RabbitMQ
Consumers	Node.js worker apps
Metrics collector	StatsD and Graphite
Metrics visualization	Grafana

With my component technologies selected, the next decision to make was about the infrastructure orchestration tool. I could decide to start and stop the services within my infrastructure manually or by using fancy bash scripts, but using an orchestration tool makes the implementation experience a lot easier.

By the way, in case you haven’t guessed it already, all my services will be running within Docker containers. This makes it easy to spin up and tear down components.

Docker Compose stood out as the best option as I was building this demo in my local development environment. However, for production environments, docker-compose will not be applicable. A more robust orchestration tool like Kubernetes or Docker Swarm is more appropriate for production environments.

After several hours of dealing with issues from Docker, the components within the architecture, and my code itself, I was able to implement the architecture blueprint for the webhook solution.

Now, I could go on and on about how I built this and fixed that, but I believe my docker-compose.yaml file (shown below) tells the story more eloquently:

version: "3.9"

services:
  api-gateway:
    build: gateway/.
    ports:
      - "1337:1337"
    depends_on:
      - message-queue
      - metrics-collector
    restart: always
  message-queue:
    image: "rabbitmq:3.9-management"
    ports:
      - "5672:5672"
      - "15672:15672"
      - "15692:15692"
  worker:
    build: consumer/.
    deploy:
      mode: replicated
      replicas: 3
    depends_on:
      - message-queue
  metrics-collector:
    image: "graphiteapp/graphite-statsd:1.1.6-1"
    ports:
      - "8080:80"
      - "8125:8125/udp"
  graphana:
    image: "grafana/grafana:6.5.2"
    ports:
      - "8000:3000"

If you don’t speak docker-compose, you will most likely not be able to understand what is going on in this file. So let’s go through it service by service.

There are 5 services running in this file and each is described below:

api-gateway: This is the API gateway service running in a Docker container built from a local Dockerfile at the root of the Node.js application’s folder. It exposes port 1337 and uses the depends_on option in docker-compose so that it is started after the message queue and metrics collection services, which it needs to connect to in order to function. Note that depends_on in this form only controls start order; it does not wait for those services to be ready to accept connections. The restart option is also used to ensure that the service boots up again in the event of a shutdown.
message-queue: This service uses the rabbitmq:3.9-management Docker image to spin up an instance of RabbitMQ bundled with the web management interface. Port 5672 is exposed for other services to connect to the RabbitMQ instance, while port 15672 is exposed for the management interface to be accessible via a web browser. Port 15692, the default port for RabbitMQ’s Prometheus metrics endpoint, is also published, although no service in this file scrapes it.
worker: Worker services are the consumers of the webhook requests. They automatically connect to the queue and based on the consumption method, they either poll messages or have messages pushed to them from the RabbitMQ instance. I have used the mode and replicas options to deploy 3 instances of the worker in order to distribute the webhook load. These consumers also depend on the message queue to be running before they are started.
metrics-collector: This is the service responsible for collecting metrics from various components within the infrastructure. Using the graphiteapp/graphite-statsd:1.1.6-1 Docker image, an instance of Graphite and StatsD is deployed. Port 8080 is mapped to Graphite’s port 80, and UDP port 8125 for StatsD is forwarded for metrics collection. Because RabbitMQ already comes bundled with a management solution that includes a suite of built-in monitoring and visualization features, I’m only collecting metrics from the API gateway.
graphana: Grafana is the infrastructure's metrics visualization tool. The metrics collected by StatsD into Graphite are queried by Grafana. The result of these queries is translated into graphs, charts, and other types of visualization options supported by Grafana. The service is exposed on port 8000, which maps to Grafana’s port 3000 for access via a web browser at http://localhost:8000.
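As an illustration of how a service such as the API gateway can report a metric to the metrics-collector, here is a minimal sketch that sends a StatsD counter over UDP using only Node.js built-ins. The project itself may well use a StatsD client library instead; the function names and the metric name in the usage note are assumptions.

```javascript
"use strict";

const dgram = require("node:dgram");

// StatsD counters use the plain-text line format "<name>:<value>|c".
function formatCounter(name, value = 1) {
  return `${name}:${value}|c`;
}

// Fire-and-forget UDP send to the StatsD port published in docker-compose.
// The socket is closed as soon as the datagram has been handed off.
function sendCounter(name, value = 1, host = "localhost", port = 8125) {
  const socket = dgram.createSocket("udp4");
  const packet = Buffer.from(formatCounter(name, value));
  socket.send(packet, port, host, () => socket.close());
}
```

Calling sendCounter("gateway.ingest.success") after each accepted request (the metric name is hypothetical) would produce a counter that Grafana can then query from Graphite. UDP is used because losing an occasional metric is preferable to slowing down request handling.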

The entire code for this project can be found in this GitHub repository. Feel free to tweak and experiment with it.

With the services neatly orchestrated within docker-compose, running the following command will spin up the services in the order that they should start up:

docker-compose up

Note: If you’re running the above command for the first time, add the --build option after up to build the local Dockerfiles into images.
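One practical detail when all services start together: a container that has started is not necessarily ready to accept connections, so a service that connects to RabbitMQ at boot may need to retry. The sketch below shows one common way to handle this with exponential backoff. It is an assumption about how this could be done rather than what the project’s code does, and connect stands for any async function that opens the connection.

```javascript
"use strict";

// Exponential backoff: 500 ms, 1 s, 2 s, ... capped at `maxMs`.
function backoffDelay(attempt, baseMs = 500, maxMs = 10000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Calls `connect` (any async function, e.g. one that opens a RabbitMQ
// connection) until it succeeds or `maxAttempts` is reached, waiting a
// little longer after each failure. `wait` is injectable for testing.
async function connectWithRetry(connect, maxAttempts = 8, wait = sleep) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt += 1) {
    try {
      return await connect();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts - 1) {
        await wait(backoffDelay(attempt));
      }
    }
  }
  throw lastError;
}
```

With a helper like this, the gateway and workers can tolerate the broker taking a few seconds longer to boot than they do.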

The image below shows the services starting up after running the docker-compose up command.

services-startup on docker

When all services are up and running, we can get some data flowing by testing our webhook infrastructure. To achieve this, I have added a load testing module within the project. This module uses Autocannon to simulate a POST request with a payload (typical of most webhook requests) and send requests to the /ingest endpoint of the API gateway.

The script for the test is shown below and can be found in the file loadtest/index.js at the root of the webhook solution project.

"use strict";

const autocannon = require("autocannon");

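autocannon(
  {
    url: "http://localhost:1337",
    amount: 3000, // total number of requests to send
    connections: 10, // number of concurrent connections
    pipelining: 1, // one in-flight request per connection
    duration: 300, // seconds; autocannon gives `amount` precedence over this
    requests: [
      {
        method: "POST",
        path: "/ingest",
        // The body is built once, so every request carries the same name.
        body: JSON.stringify({ name: Date.now().toString(36) }),
        onResponse: (status, body, context) => {
          if (status === 200) {
          } // on error, you may abort the benchmark
        },
      },
    ],
  },
  console.log, // called with (err, result) once the run completes
);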

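This load test sends a total of 3000 requests over 10 concurrent connections to the /ingest endpoint defined for webhook ingestion on the API gateway. The script also sets a duration of 300 seconds, but autocannon gives the amount option precedence over duration, so the run ends once all 3000 requests have completed. The payload is a simple JSON object with a name parameter set to a base-36 encoding of the current timestamp.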

Feel free to add more load testing scripts to this file to simulate other webhook volume scenarios.

To run the test, use the following command at the root of the webhook solution project:

node loadtest

Below are screenshots of RabbitMQ monitors showing the rate of ingestion (green line) and consumption (orange line) of webhook requests:

rabbitmq load test

I also set up a Grafana visualization to display the successful (green lines) and failed (red lines) requests at the API gateway:

Grafana load test

The docker-compose logs on the terminal display how the queue is distributing the request to the workers in a round-robin fashion, as shown below:

consumption round robin
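To make the round-robin behavior concrete, here is a minimal sketch of the dispatch pattern. It only illustrates the idea and is not how RabbitMQ implements it internally; createRoundRobin and the worker functions are hypothetical names.

```javascript
"use strict";

// Returns a dispatcher that hands each message to the next worker in turn,
// wrapping back to the first worker after the last one.
function createRoundRobin(workers) {
  let next = 0;
  return function dispatch(message) {
    const worker = workers[next];
    next = (next + 1) % workers.length;
    worker(message);
  };
}

// Three stand-in workers that record which messages they received.
const received = [[], [], []];
const workers = received.map((inbox) => (message) => inbox.push(message));

const dispatch = createRoundRobin(workers);
["m1", "m2", "m3", "m4", "m5"].forEach(dispatch);
// received is now [["m1", "m4"], ["m2", "m5"], ["m3"]]
```

With three workers, the fourth message goes back to the first worker, which matches the alternating pattern visible in the logs.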

Building the solution using AWS cloud services

It was fun building with open source technologies, that is if you ignore all the times I was arguing with my computer and sifting through tons of documentation. The amount of control over how each component operates and the level of customization were some of the major benefits I realized from using this approach.

So, I decided to replicate the same blueprint using a cloud service provider. I had to pick a cloud service provider that offers services for each component in the webhook solution design.

Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform were my obvious options, and since anecdotally AWS is the most popular of the three platforms, I decided to go with AWS.

The next step was to pick the AWS service(s) to be used for each component in the architecture. The table below shows the services I used to build out the solution:

Component	AWS Service
API gateway	Amazon API Gateway + AWS Lambda
Message queue	Amazon Simple Queue Service (SQS)
Consumers	AWS Lambda
Monitoring and visualization	AWS CloudWatch

My strategy for building out this solution can be described in the following steps:

Webhook requests are received by the API gateway at a POST endpoint. This endpoint is connected to a Lambda function.
The Lambda function takes care of adding the webhook request to the SQS queue.
The SQS queue is created, and the previous Lambda function is configured to send its messages to this queue.
Another Lambda function is then created to poll the SQS queue for webhook messages. This Lambda function has a concurrency setting that can be used to determine the number of consumer instances that can concurrently poll messages from the queue, thus adding more parallelism to the processing of webhook messages.
The SQS queue is then added as a trigger for the consumer Lambda function in order for the function to receive new messages placed in the SQS queue.
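The two Lambda functions in these steps could look roughly like the sketch below. It is not the code I deployed: the factory functions, sendToQueue, and processWebhook are hypothetical names, and the queue write is injected as a function so that the sketch does not depend on a particular AWS SDK version. It assumes the API Gateway proxy integration event shape and the standard SQS batch event with a Records array.

```javascript
"use strict";

// Producer: invoked by API Gateway (proxy integration event assumed).
// `sendToQueue` is injected; in a real function it would wrap the SQS
// SendMessage call for the webhook queue.
function createIngestHandler(sendToQueue) {
  return async function handler(event) {
    await sendToQueue(event.body);
    return { statusCode: 202, body: JSON.stringify({ accepted: true }) };
  };
}

// Consumer: invoked by the SQS trigger with a batch in event.Records.
// Returning batchItemFailures asks Lambda to retry only the failed messages
// (this requires ReportBatchItemFailures on the event source mapping).
function createConsumerHandler(processWebhook) {
  return async function handler(event) {
    const batchItemFailures = [];
    for (const record of event.Records) {
      try {
        await processWebhook(JSON.parse(record.body));
      } catch (error) {
        batchItemFailures.push({ itemIdentifier: record.messageId });
      }
    }
    return { batchItemFailures };
  };
}
```

Responding as soon as the message is queued keeps the gateway fast for webhook producers, while the consumer function does the slower processing asynchronously.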

Once I had all of these components set up and I dealt with a couple of AWS Identity and Access Management (IAM) issues, the bane of every AWS configuration task, the solution was complete. I had an API endpoint I could send my webhook requests to and have them processed accordingly.

A major advantage of going the cloud services route is that most cloud services can scale automatically based on your webhook volume. You can also configure how your services scale up or down as the load varies.

AWS also comes with the added luxury of having a monitoring tab for each service instance. On the Monitor tab, you can view important metrics that have been selected by the AWS team as essential to your service operations. You can also create more dashboards to be added to the monitoring page.

This is powered by AWS CloudWatch so I didn’t have to set up monitoring separately. Below are screenshots illustrating the monitoring of the API Gateway Lambda function, SQS queue, and consumer Lambda function when a load test was performed on the API endpoint.

API Gateway Lambda Monitor

lambda api load test

SQS Monitor

SQS load test

Lambda Consumers Monitor

Consumer lambda load test

Most cloud providers have monitoring built in. Similar to AWS CloudWatch, Microsoft has Azure Monitor and Google has Cloud Monitoring on GCP.

Despite the seamless integration of the AWS-native services used to build out the webhook solution, one thing I did notice is the restriction of features for each component.

Unlike the open-source approach, you are limited to what AWS offers. As much as AWS tries to provide you with all the features and configuration options you need, it does not beat the flexibility and control you get with building your services from open-source technologies.

Comparing both approaches

Now that we have discussed and experimented with different approaches for implementing our webhook solution, it’s time to compare. In the table below, I will share my experience working with both approaches under the same factors we have been looking at throughout this article:

	Open Source	Cloud Service Provider
Ease of implementation	Depends on the technical know-how and size of the team. It took me 2 days to properly set up Docker and Docker Compose, and that was because I was rusty, not because I had no experience using these technologies.	A relatively faster implementation time compared to using open source technologies. I didn’t have prior experience using SQS but was able to set it up in about 30 minutes after watching some video tutorials.
Scalability	The responsibility of scaling the infrastructure falls on the shoulders of the architect or system designer. I had to run load tests, measure throughput and then use the information to determine the number of component nodes to deploy. This is not a maintainable scaling technique if not automated.	Most components have scalability features built-in that allow you to support growing traffic. My Lambda functions on AWS scale automatically with increasing load and there were configurations on both SQS and Lambda functions for improving performance based on the use case.
Configuration options	Highly configurable. Both in code and through the different web interfaces for my RabbitMQ, Grafana, and StatsD instances, I was able to apply configuration options at a granular level. For example, I was able to configure automatic acknowledgments for my message consumption and persist my queue messages to disk until they were consumed.	Limited to configuration options provided by the service provider. For example, SQS comes with a few configuration options for tasks like message retention period and dead lettering messages. Some options also exist in the client SDK; however, the configuration flexibility was not on par with RabbitMQ.
Technical knowledge required	Expert knowledge of architecture and different architectural components is required. Before using RabbitMQ, I read half of a book on it and watched a couple of courses. I also had to refer to the online documentation from time to time to check for breaking changes across versions and implementation samples.	A lot of the complexity is abstracted, making it easier for beginners to deploy powerful and reliable infrastructure. As I mentioned earlier, setting up Lambda functions and SQS was straightforward after watching a couple of YouTube videos.
Cost ($$)	Starts small, and grows with scale. With my open-source setup, free tiers of a couple of hosting companies will host my infrastructure conveniently. However, I know I will need more for real production work and it will keep growing as webhook volume grows.	Pricing grows with resource usage. Competitive pricing amongst service providers is a huge advantage. The pricing for services like Lambda, API Gateway, and SQS looked fair enough from what I saw when building on it. Everything I built and tested also cost me less than a penny, but we all know that is not the case for real production work. Huge traffic also has the potential of racking up high bills on cloud services.
Reliability/Performance	Depends on the quality of your architectural design and available computing power. My design decisions when building were rooted in a strong knowledge of architecture. There is really no manual or silver bullet for the “best” design that leads to the “best” performance.	Depends on the SLAs and SLOs advertised by the service provider. For example, the AWS SLA for Simple Queue Service (SQS) lists “Less than 99.9% but greater than or equal to 99.0%” monthly uptime as its first service credit tier, which implies a target of at least 99.9% monthly uptime.
Flexibility of choice	Highly flexible, as you can pick and choose the technologies you want in your stack. I was more familiar with RabbitMQ for message queueing so I went with that, but I could have used Kafka if I felt more comfortable with it.	Depends on the service provider’s offerings. As I was interested in using AWS only, I had to learn how to do queueing the SQS-way. Though I was able to translate my general knowledge about queueing into it, I couldn't use any RabbitMQ-specific skill to improve the experience.
Extendibility	You can always use messaging systems and gateways to integrate different types of component technologies/services. I knew if I was going to add more components, I could also use messaging and proxy servers to connect services that naturally do not use the same protocol.	Can only plug into other technologies/services supported by the cloud provider. AWS Identity and Access Management enabled me to connect all my services by simply granting the right permissions to the components.
Feature set	Every feature required needs to be consciously added to the infrastructure. I had to design and build all my monitoring features from scratch. There were no presets and very limited templates were available.	Most services provided come with added features (AWS EC2, SQS, API Gateway, etc., all come with monitoring and logging built-in). However, the monitoring was not as customizable as when I used open source technologies like StatsD and Grafana to design the visualizations I wanted.
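As a small worked example of the manual scaling exercise mentioned under Scalability, the number of workers can be estimated from load test results with simple arithmetic. The function name and the headroom parameter are assumptions for illustration, and the figures in the comments are made up rather than measurements from my tests.

```javascript
"use strict";

// Estimates how many workers are needed to keep up with peak traffic.
//   peakRatePerSecond:      webhooks arriving per second at peak
//   perWorkerRatePerSecond: webhooks one worker can process per second
//   headroom:               extra capacity fraction kept in reserve
function workersNeeded(peakRatePerSecond, perWorkerRatePerSecond, headroom = 0.25) {
  const requiredRate = peakRatePerSecond * (1 + headroom);
  return Math.ceil(requiredRate / perWorkerRatePerSecond);
}

// Hypothetical figures: a peak of 100 webhooks/s, 40 webhooks/s per worker.
// With 25% headroom: ceil(125 / 40) = ceil(3.125) = 4 workers.
const workers = workersNeeded(100, 40);
```

The result is what you would feed into a setting like the replicas option in docker-compose, and it needs to be recomputed whenever traffic or per-worker throughput changes, which is why automating this step matters.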

Do note that this is not an activity aimed at picking a “winner”. As with all decisions to be made when architecting software solutions, the answer is always “it depends”.

However, using the table above, I have provided my observations from the two main approaches under different factors. This is not an exhaustive list, but I hope that this can help serve as a guide when making decisions that best suit your scenario.

When to use a hybrid approach to develop the webhooks solution

A hybrid approach is simply the act of combining the best of both worlds (open source and cloud offerings) based on things like performance, technical know-how or familiarity, and personal/team preferences.

This strategy also encourages the achievement of the factors we considered when deciding between open source technologies and cloud services. We can use the best attributes of either option to get the best deal for each factor we are taking into consideration.

Let’s take a look at how we can use this strategy based on the factors discussed to get the best out of the two options.

Ease of implementation: For a component like a message queue which might be easy to set up but takes a high degree of proficiency to design efficiently, this is one area where you can take advantage of the stability of a cloud service like AWS SQS.
Scalability: A good number of cloud services have scalability built-in. This is an attribute you can take advantage of if you do not have a team that is skilled in scaling complex architectures. For example, using Lambda functions for your webhook workers helps you take care of scalability on the webhook consumption end.
Configuration options: Components like the API gateway where you perform custom operations which include pre-queueing activities like TLS termination, webhook payload transformation, and verification, are better off built with open source technology. This is because every team’s needs are different when it comes to these components.
Technical knowledge required: If the technical know-how for building and scaling a component is available within your team, there is more of an upside to using open source technologies to build from scratch. If not, you can plug in a cloud service with a shallow learning curve to fill the component’s role.
Cost: This can go both ways. However, from experience, I have realized that open source technologies are often cheaper at the beginning. As webhook volume grows, the cost of hosting and scaling the solution rises. Cloud services try to operate on a pay-as-you-use basis and some scale down when your traffic isn’t so high to help you save cost. The option to go with will most likely depend on the current stage of your product, your team, and the number of users.
Reliability/Performance: The point to be made here is similar to that of scalability. If your team is proficient enough to achieve the best design for this attribute for a component like monitoring, use open source technologies. If not, deploy a cloud service to take care of the component’s responsibility.
Flexibility of choice: Your message queueing component is one area where using a hybrid approach really helps you get the best fit for your design. Let’s say you started with AWS SQS and then, later on, find out that Kafka has features that support the direction your infrastructure is going, you can swap in Kafka for SQS to get the best performance possible.
Extendibility: A hybrid approach gives you the ability to use the technologies that best plug into your existing suite of applications or that make it easier for your team to extend its functionality. For example, if you need to feed data from your queues to an existing AWS service, you can use SQS as your queue. On the flip side, if you need to feed data from your queue into a central metrics monitoring system like Prometheus, you can use RabbitMQ instead.
Feature set: Depending on your team's flexibility on core features required for a component, you can use either open source or a cloud service. For example, if your queues need strong support for topic-based message separation and message streaming, you might want to use Kafka. However, if you just want a reliable and simple queueing system, using a service like AWS SQS is recommended.
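Swapping one queue technology for another, as described under Flexibility of choice, is much easier when the rest of the system only depends on a small contract rather than on a specific client library. The sketch below shows one way to express that idea; the publish/subscribe contract, the in-memory adapter, and the service wrapper are all hypothetical and only stand in for real SQS, RabbitMQ, or Kafka adapters.

```javascript
"use strict";

// Minimal contract the rest of the system depends on: publish(message) and
// subscribe(handler). Each backend (SQS, RabbitMQ, Kafka) would get its own
// adapter exposing these two methods. This in-memory adapter simply delivers
// each published message to every subscribed handler.
function createInMemoryQueue() {
  const handlers = [];
  return {
    publish(message) {
      handlers.forEach((handler) => handler(message));
    },
    subscribe(handler) {
      handlers.push(handler);
    },
  };
}

// Application code only sees the contract, never the backend behind it.
function createWebhookService(queue) {
  return {
    accept(payload) {
      queue.publish({ receivedAt: Date.now(), payload });
    },
  };
}
```

Replacing the queue then means writing a new adapter and passing it to createWebhookService, while the ingestion and consumption code stays unchanged.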

Conclusion

In this article, we have demonstrated the use of open source technologies and cloud service providers to deploy our webhook solution. We also discussed a hybrid approach that pragmatically tries to combine the two approaches to get the best of both worlds. Finally, we compared each approach under certain factors important to the sustainability of the solution.

If you are thinking about whether to build your webhook solution from scratch or buy an existing solution, you can read the build versus buy article. With this information at hand, you will not only be able to make decisions that support your infrastructural needs, but also help your business save money and provide a reliable user experience.
