Alexandre Bouchard

Aug 28-29th incident: Reaffirming our commitment to reliable async messaging infrastructure

Hookdeck experienced an incident on August 28th and 29th that significantly delayed event delivery. This had severe consequences for many of our customers. We've failed to meet our own high standards and our customers' expectations, and we understand that significant and concrete changes are needed to maintain trust.

The postmortem has been published publicly along with our other incident reports.

TL;DR: The incident was caused by our database vendor's storage auto-scaling feature, which was triggered five times and provisioned slower disks rather than high-performance disks. Database IO to disk slowed down drastically. A new database had to be rebuilt on a separate high-performance disk, which took 16 hours. Processing delay peaked at 1h30m for events within a project's configured throughput and 25h for events exceeding the throughput limit. No data was lost during the incident.

The incident has shown that:

  • Our communication was too slow and lacked detail. Many customers were left wondering what was happening and had to contact our support.
  • Our architecture limited our ability to scale and stabilize the system during performance degradation.
  • One specific vendor has been a recurring source of performance degradation and outages. We've been working on replacing that vendor for over six months but haven't pursued the work with sufficient urgency.

Behind the scenes, Hookdeck's infrastructure is being revamped. For over two months, about one-third of our total workload has been going through our new infrastructure, which remained bulletproof during this incident.

While this incident was unrelated to those ongoing changes, the new architecture is resilient to database performance issues and would have given us a lot more margin to remain fully operational. The infrastructure team's sole focus at this time is to roll out the remaining changes over the next few weeks.

While our SLA is normally offered only to enterprise customers, because of the scale of the incident, we've decided to credit all customers based on our SLA policy. On top of this, we'll be making several changes:

On the business:

  • Our SLA will cover all our customers on any plan for the next three months at no additional fee.
  • We will update our SLA to offer higher credit compensation. The SLA will now provide five times the incident duration in credit with a minimum of 10% of the monthly invoice. For this incident, all customers will receive ten days of credits (1/3 of their monthly invoice). The previous SLA only covered up to 10%.
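
To make the new formula concrete, here is a rough worked example in Python. The two-day incident duration is inferred from the ten-day credit figure above, and the function name and 30-day month are illustrative assumptions, not our billing implementation.

```python
# Hypothetical sketch of the updated SLA credit formula:
# credit = max(5 x incident duration, 10% of the monthly invoice),
# expressed here as a fraction of the monthly invoice (assuming a 30-day month).

def sla_credit_fraction(incident_duration_days: float, days_in_month: int = 30) -> float:
    """Return the credited fraction of the monthly invoice."""
    credited_days = 5 * incident_duration_days  # five times the incident duration
    minimum_fraction = 0.10                     # floor of 10% of the monthly invoice
    return max(credited_days / days_in_month, minimum_fraction)

# The Aug 28-29 incident lasted roughly two days:
# 5 * 2 = 10 credited days, i.e. about 1/3 of a 30-day monthly invoice.
print(f"{sla_credit_fraction(2.0):.0%}")  # -> 33%
```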

On engineering:

  • Hookdeck's real-time p99 latency will be made public.
  • We temporarily doubled the provisioned capacity of all our services while completing the ongoing changes.
  • We will work with the database vendor to ensure that their RCA and mitigation plans are sufficient. Otherwise, we will replace the vendor. The problematic auto-scaling feature has already been disabled.

On communication:

  • By the end of September, any performance degradation (p99 above 3 seconds) lasting more than 1 minute will be considered an incident, as sketched below. Incidents will be communicated on our status page within 3 minutes of being identified and updated at least every 30 minutes.
  • The system's status will be shown directly in the dashboard when degraded.
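
For illustration, here is a minimal sketch of the degradation threshold described above (p99 latency above 3 seconds sustained for more than 1 minute). It is not our actual monitoring code; the 10-second sampling interval and function name are assumptions made for the example.

```python
# Minimal sketch of the incident threshold: p99 latency above 3 seconds
# sustained for more than 60 seconds is treated as an incident.
# The 10-second sampling interval is an assumption for illustration.

P99_THRESHOLD_SECONDS = 3.0
SUSTAINED_SECONDS = 60
SAMPLE_INTERVAL_SECONDS = 10

def is_incident(p99_samples: list[float]) -> bool:
    """Return True if the latest samples show p99 above the threshold
    for longer than the sustained window."""
    window = SUSTAINED_SECONDS // SAMPLE_INTERVAL_SECONDS  # samples covering 60s
    recent = p99_samples[-(window + 1):]
    return len(recent) > window and all(s > P99_THRESHOLD_SECONDS for s in recent)

# Example: seven consecutive 10-second samples above 3s (over a minute) trip the rule.
print(is_incident([3.2, 3.5, 4.1, 3.9, 3.4, 3.6, 3.8]))  # -> True
```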

We appreciate the incredible patience our customers have shown our team during the incident. Although the problems are behind us, the team is relentless in addressing the core issues that led to this incident, and we hope this public statement reassures our present and future customers of our commitment to building highly reliable asynchronous messaging infrastructure.

– Alexandre Bouchard, CEO & Co-founder