At Hookdeck, we understand that reliability is important. We know that you are introducing a new dependency in your system, and for that to make any sense, the benefits of our product have to outweigh the extra work and maintenance you would incur by building it yourself. Most importantly, this means that Hookdeck should be more reliable than your current system for webhook processing. This is a valid concern, and the only way we know how to address it is to be transparent and explain what Hookdeck’s approach to reliability looks like.
Our commitment to reliability
Reliability is at the core of Hookdeck’s mission. Without it, our product would not be as compelling. We would even go as far as to say that our dependability IS our product.
Of course, we are still realistic. You would be catching us in a lie if we told you our system will never go down, because even the best and brightest have their outages. We can't make everything perfect, but we can certainly try. The key in engineering and design is to focus on the most critical parts of the platform. In our case, we have identified ingestion as the priority for increasing the reliability of our product.
In this case, ingestion refers to the service used to receive and process inbound webhooks. It's a unique challenge because we don't control the rate (throughput) at which we receive requests, and we don't control the content of the payload either. We've identified ingestion as the priority because we don't want to miss out on any data. Outages further down our stack would lead to increased delivery latency, but webhooks will eventually get delivered. If we fail to acknowledge ingestion in the first place, that leaves us in a much tougher spot.
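To make the trade-off concrete, here is a minimal sketch of the "acknowledge first, deliver later" idea. It is not Hookdeck's implementation: `IngestionQueue`, `ingest`, and `deliver_pending` are hypothetical names, and an in-memory deque stands in for a durable queue.

```python
from collections import deque
from typing import Callable, Deque


class IngestionQueue:
    """In-memory stand-in for a durable queue (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._pending: Deque[bytes] = deque()

    def ingest(self, payload: bytes) -> int:
        """Store the raw payload untouched and acknowledge with HTTP 200."""
        self._pending.append(payload)
        return 200

    def deliver_pending(self, send: Callable[[bytes], bool]) -> int:
        """Deliver queued payloads in order, stopping at the first failure.

        Returns the number of payloads delivered in this pass.
        """
        delivered = 0
        while self._pending:
            if not send(self._pending[0]):
                break  # destination unavailable: keep the payload for a later pass
            self._pending.popleft()
            delivered += 1
        return delivered

    def __len__(self) -> int:
        return len(self._pending)
```

If `send` fails because the destination is down, the payloads stay queued and a later call to `deliver_pending` picks them up, so the outage shows up as latency rather than data loss. A failure inside `ingest` has no such fallback, which is why that path gets the most protection.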
Because of this, we've implemented different measures and practices across our stack to fully isolate our ingestion service and scale it on demand. This is how we're able to provide some of our coolest features like Throttled Delivery and Event Filtering.
How is Hookdeck reliable?
Building a SaaS solution is a continuous learning experience. As our customer base and our throughput have increased, we've architected many solutions, but we ultimately came to the conclusion that in order to have the highest possible uptime on ingestion, we had to take what some might see as extreme measures. We therefore keep this service completely isolated from the rest of our stack, and have reduced the number of dependencies to a minimum.
- The http-ingestion service has its own repository and CI/CD pipeline, and does not share any code with the rest of our stack. Updates to any other part of Hookdeck do not alter, redeploy or change the service in any way.
- We've reduced its dependencies to just 3 services:
  - Cloudflare DNS
  - GCP Cloud Functions (used to run the request handling function)
  - GCP PubSub (for our ingestion queues)
- The service does not share dependencies with any part of our stack that could be impacted by an error further down the pipeline. It does not share a VM, cluster, database or any other component.
- All dependencies used by the ingestion service have unrestricted horizontal scaling capacity.
- We enforce 100% test coverage.
- We generally avoid making any changes to the service unless absolutely necessary. In the last 3 months, we've deployed it twice, whereas our core infrastructure has seen significantly more deployments.
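As an illustration of how small such a request handler can be, the sketch below accepts a request, wraps it in an envelope, and hands it to a queue. It is written under stated assumptions and is not the actual http-ingestion code: `handle_webhook`, the `Publish` callable, the envelope fields, and the choice of a 503 response when the queue is unavailable are all hypothetical. In a real deployment `publish` would wrap a queue client such as a Pub/Sub publisher.

```python
import base64
import json
from typing import Callable, Dict, Tuple

# Hypothetical publisher interface: takes the message bytes, raises on failure.
Publish = Callable[[bytes], None]


def handle_webhook(
    headers: Dict[str, str], body: bytes, publish: Publish
) -> Tuple[int, str]:
    """Enqueue an inbound webhook and return (status code, message)."""
    # The body is never parsed. It is base64-encoded so that any payload,
    # valid JSON or not, can be carried inside the envelope unchanged.
    envelope = json.dumps(
        {
            "headers": headers,
            "body_b64": base64.b64encode(body).decode("ascii"),
        }
    ).encode("utf-8")
    try:
        publish(envelope)
    except Exception:
        # Assumption: if the queue is unavailable, report a failure so the
        # sender's own retry policy can try again later.
        return 503, "queue unavailable"
    return 200, "accepted"
```

Because the handler only depends on the `publish` callable, every branch can be exercised in tests with a fake publisher, which is what makes a 100% coverage target practical for a service of this size.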
While we at Hookdeck have identified ingestion as our priority, we've also put in place industry best practices to remain highly available across our whole stack.
- We deploy our infrastructure across multiple availability zones with redundancy for all services (Postgres, Redis and Kubernetes clusters).
- All our CI/CD pipelines are fully automated and involve no manual steps. All deploys are backward compatible and use blue/green deployment for zero downtime.
- All changes are proactively QA'd on a staging server that is a full replica of our production environment.
- We proactively write unit and integration tests for each new feature; our global test coverage is at about 80% and increasing.
- We regularly load test our infrastructure to identify bottlenecks before they become actual problems.
- We use tried-and-true technologies (for example Postgres, Kubernetes, Docker, Redis).
- We are committed to hiring experienced engineers who have dealt with large-scale, highly concurrent systems.
What we will improve on
While we have strived to make Hookdeck as reliable as possible, we know there is always more we can do. Using a prominent cloud provider has many advantages, including reliability. However, we know we can't completely rely on those services always being available. We need to continue to do our part to ensure uptime.
As we continue to grow, we aim to follow the chaos engineering approach and target each piece of our technology stack to make sure it can withstand any unexpected issues or outages. This means making sure we don't have a single point of failure across systems, services, or regions.
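One common building block of chaos engineering is fault injection: deliberately making a dependency fail and checking that the caller copes. The sketch below is only an illustration of that idea, not a description of Hookdeck's tooling; `with_faults`, `call_with_retries`, and `InjectedFault` are hypothetical names.

```python
import random
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised by the wrapper to simulate a dependency outage."""


def with_faults(
    fn: Callable[[], T], failure_rate: float, rng: random.Random
) -> Callable[[], T]:
    """Wrap fn so that roughly failure_rate of calls fail before reaching it."""

    def wrapped() -> T:
        if rng.random() < failure_rate:
            raise InjectedFault("simulated outage")
        return fn()

    return wrapped


def call_with_retries(fn: Callable[[], T], attempts: int) -> T:
    """Call fn up to `attempts` times, re-raising the last injected fault."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[InjectedFault] = None
    for _ in range(attempts):
        try:
            return fn()
        except InjectedFault as exc:
            last_error = exc
    assert last_error is not None  # the loop ran at least once and always failed
    raise last_error
```

Passing a seeded `random.Random` keeps an experiment repeatable, so a failure found under injected faults can be reproduced and fixed rather than chased.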
What this means for you
By using Hookdeck, you are offloading the responsibility of successfully receiving webhooks onto us. It's not something we take lightly, and we hope our efforts will help you free yourself from the burden of webhook processing. We understand that each team has its own requirements and concerns, so feel free to reach out to us ([email protected] or via the live chat on this website) and we'd be happy to jump into the weeds, share how our infrastructure is built, and work through those concerns with you.
Alexandre Bouchard, Co-Founder & CEO