Hookdeck's Approach to Reliability
At Hookdeck, we understand the importance of reliability. Our users depend on us to ingest and deliver high volumes of webhooks, often with unpredictable spikes in traffic.
If you’re considering Hookdeck, you know that the cost of introducing a new dependency into your system must be less than the cost of building and maintaining your own custom service. But you also want to be sure that Hookdeck is more reliable than your current webhook processing solution.
As part of our commitment to transparency, we’d like to explain our approach to reliability.
We’re committed to reliability
Reliability is at the heart of Hookdeck’s mission. Without it, we would lose the trust of our users. As realists, we won’t promise that our system will never go down. Even the best and brightest have their outages. But perfection, while unattainable, remains our north star.
Since engineering always involves tradeoffs, one must identify and prioritize the most critical piece of a platform’s infrastructure. At Hookdeck we believe that critical piece is ingestion – namely, the service we use to receive and process inbound webhooks.
Simply put, our #1 goal is to never miss a single webhook.
While outages further down our stack could cause increased delivery latency, any webhooks we’ve ingested will eventually get delivered. But failing to ingest a webhook in the first place would leave us, and our users, in a much tougher spot.
Ingestion raises two unique challenges:
- We don't control the rate (throughput) at which we receive requests.
- We don't control the content of the payloads.
To overcome these obstacles, we've taken measures to fully isolate our ingestion service and scale on demand.
Hookdeck’s reliability practices
Our customer base and throughput have grown considerably since Hookdeck began, and so we’ve had many opportunities to assess and re-assess our architecture. These learnings have led us to a solution some might see as extreme. For the highest possible ingestion uptime, we keep our ingestion service completely isolated from the rest of our stack, and have reduced its dependencies to a minimum.
Here’s what that looks like:
- The http-ingestion service has its own repository and CI/CD pipeline, and shares no code with the rest of our stack. Updates to any other part of Hookdeck never alter or redeploy the service in any way.
- We've reduced its dependencies to two services:
- Cloudflare DNS
- Cloudflare Workers
- The ingestion service does not share dependencies with any part of our stack that could be impacted by an error further down the pipeline. It does not share VMs, clusters, databases, or any other components.
- All dependencies used by the ingestion service have unrestricted horizontal scaling capacity.
- We enforce 100% test coverage.
- We generally avoid making any changes to the service unless absolutely necessary – as opposed to our core infrastructure, which has seen significantly more deployments.
Our implementation of Cloudflare Workers is designed to fail gracefully if downstream services are offline. We have multiple levels of redundancy to cover cases where our ingestion processing queues might be unavailable.
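To give a feel for what "multiple levels of redundancy" can mean at ingestion time, here is a minimal sketch. The `Sink` type, the `ingest` function, and the ordering of fallbacks are illustrative assumptions, not Hookdeck's actual implementation:

```typescript
// Hypothetical sketch of graceful degradation at ingestion time.
// Each "sink" is one level of redundancy (e.g. primary queue,
// backup queue, durable object storage); the names are invented.
type Sink = (payload: string) => Promise<void>;

async function ingest(payload: string, sinks: Sink[]): Promise<boolean> {
  for (const sink of sinks) {
    try {
      // Accept the webhook as soon as any level persists it.
      await sink(payload);
      return true; // respond 200 OK to the webhook provider
    } catch {
      // This level is unavailable; fall through to the next one.
    }
  }
  // Every level failed: respond with a 5xx so the provider retries.
  return false;
}
```

The key property is that a downstream queue outage degrades into a write to the next fallback, rather than a dropped webhook.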
While ingestion is our priority, we also employ industry-accepted best practices to keep our whole stack highly available.
- We deploy our infrastructure across multiple availability zones, with redundancy for all services (Postgres, Redis and Kubernetes clusters).
- All our CI/CD pipelines are fully automated and involve no manual steps. All deploys are backward compatible and use blue/green deployment for zero downtime.
- All changes are QA'd and run through automated end-to-end tests on a staging environment that accurately replicates our production environment.
- We proactively write unit and integration tests for each new feature; our global test coverage is at about 80% and increasing.
- We regularly load test our infrastructure to identify bottlenecks before they become actual problems.
- We use tried-and-true technologies (for example Postgres, Kubernetes, Docker, and Redis).
- We are committed to hiring experienced engineers who have dealt with large-scale, highly concurrent systems.
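The blue/green cutover mentioned in the list above boils down to a simple gate: traffic only moves once the new environment proves healthy. A minimal sketch, with the `Env` shape and function name invented for illustration (this is not Hookdeck's deployment tooling):

```typescript
// Minimal blue/green cutover gate (illustrative only).
type Env = { name: string; healthy: () => Promise<boolean> };

// Route traffic to the new (green) environment only after it passes
// health checks; otherwise traffic never leaves the current (blue)
// environment, which is what makes the deploy zero-downtime.
async function cutOver(blue: Env, green: Env): Promise<Env> {
  return (await green.healthy()) ? green : blue;
}
```

In practice the old environment is also kept warm after a successful cutover, so a rollback is just switching the pointer back.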
All our services are automatically monitored for performance degradation and outages. This data is publicly accessible at status.hookdeck.com. Issues are automatically reported without manual intervention, and we never edit or delete historical outages from the public status page data.
Our ingestion services are also monitored for performance from servers on every continent.
How we plan to improve
While we’ve worked hard to make Hookdeck as close to failproof as possible, there’s always more we can do.
For example, using a prominent cloud provider has many advantages for reliability – but we aren’t guaranteed 100% availability by those services. We need to continue to do our part to ensure uptime.
We also have plans to improve our latency. In the short term, we’ve focused on increasing availability above all else. But as our data set grows, so does our understanding of how system load impacts latency. Along with this understanding come new architectural targets for operating with predictable latency.
As part of that work, we plan to provide latency data to our users so they can understand more about Hookdeck’s performance at any given time, beyond basic availability.
What this means for you
By using Hookdeck, you are choosing to trust us with your webhooks. We don’t take that responsibility lightly, and we hope that the above overview gives you a sense of our overall commitment to reliability.
We understand that each team has their own requirements and concerns, so feel free to reach out to us at firstname.lastname@example.org, or via this page’s live chat. We'd be happy to jump into the weeds, share more about how our infrastructure is built, and work through any concerns you may have.
We know how important asynchronous events are to your organization. Our hope is to free you, once and for all, from the burden of webhook processing.
Alexandre Bouchard, Co-Founder & CEO