How to Solve Webhook Data Integrity Issues
Data is the heart of every modern application, and no matter the architecture of your application (monolith or microservices), there is always a data layer. The data layer sits at the bottom of any application architecture and is therefore the foundation on which every data-driven application is built.
Its importance to the entire system makes the data layer very delicate. Any situation that causes data to be incomplete, wrongly computed, or even duplicated can compromise the integrity of the system.
Webhooks are used because you want an action to take place in the target application (the webhook consumer). Oftentimes, this action causes data to be written to or manipulated within the database. In this article, we look at different ways improper handling of webhooks can lead to data inconsistencies, and what you should be doing to avoid this.
What is data integrity?
Data integrity is the accuracy and consistency of data stored in a database or any other data store over its entire lifecycle. It is critical for any application that stores, processes, or retrieves data.
A webhook must leave the data integrity of the system it acts on intact. Because webhooks are fired automatically, it is all the more important to ensure that processing them leaves the application data in a consistent and accurate state. Otherwise, a series of webhooks fired at an endpoint can compromise the data so badly that rolling it back to a consistent and accurate state becomes almost impossible.
Causes of data integrity issues with webhooks
Webhook failures
One of the easiest ways to end up with inconsistent data is a failed webhook. Imagine you have a payment service and a ticketing service. When a user makes a payment, the payment service saves a payment record and fires a webhook to the ticketing service, which saves a ticket record and emails the customer their ticket details.
Let's assume the payment goes through and the webhook is fired, but processing fails at your ticketing endpoint, which returns an error status code. Your application is now in an inconsistent state: you have a payment record but no corresponding ticket record. If this happens repeatedly, you will soon receive a substantial number of complaints from customers who paid but never got their tickets. This is definitely not a good look for your business.
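To make that inconsistency concrete, here is a minimal sketch that lists payments with no matching ticket. It assumes, purely for illustration, that both records are reachable from a single SQLite database with hypothetical `payments` and `tickets` tables; with separate payment and ticketing services, you would compare the payment IDs known to each service instead.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical tables for the payment/ticketing example.
    conn.executescript(
        """
        CREATE TABLE payments (id TEXT PRIMARY KEY, email TEXT NOT NULL);
        CREATE TABLE tickets (
            id INTEGER PRIMARY KEY,
            payment_id TEXT NOT NULL UNIQUE REFERENCES payments (id)
        );
        """
    )


def payments_without_tickets(conn: sqlite3.Connection) -> list[str]:
    # LEFT JOIN keeps every payment; payments whose ticket webhook never
    # succeeded have no matching ticket row, so t.id is NULL for them.
    rows = conn.execute(
        """
        SELECT p.id
        FROM payments AS p
        LEFT JOIN tickets AS t ON t.payment_id = p.id
        WHERE t.id IS NULL
        ORDER BY p.id
        """
    ).fetchall()
    return [payment_id for (payment_id,) in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    conn.executemany(
        "INSERT INTO payments (id, email) VALUES (?, ?)",
        [("pay_1", "ada@example.com"), ("pay_2", "alan@example.com")],
    )
    # The webhook for pay_1 succeeded; the one for pay_2 failed.
    conn.execute("INSERT INTO tickets (payment_id) VALUES (?)", ("pay_1",))
    conn.commit()
    print(payments_without_tickets(conn))  # ['pay_2']
```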
Improper error handling
Another way to compromise the data integrity of your application when receiving webhooks is to handle errors improperly, for example by returning the wrong status code for the error that occurred.
When a webhook fails, some webhook providers retry it if they receive an error status code (4xx or 5xx) in the response. Returning a 2xx status code, which signals success, when something has actually failed is bad API design. Imagine your webhook causes data to be written to your database. The write operation fails for some reason and you catch the error, but instead of returning a 500 status code in your response, you return 200. The webhook provider or retry system then assumes that all went well and moves on to the next webhook instead of resending the failed one.
The incomplete write operation leaves your data in an inconsistent state, as data that needed to be written via the webhook operation is now missing.
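The fix comes down to which status code the handler returns when the write fails. Here is a minimal, framework-agnostic sketch of a handler that returns the HTTP status code as an integer; the `tickets` table and the payload fields are hypothetical, and SQLite stands in for your database.

```python
import sqlite3


def handle_payment_webhook(conn: sqlite3.Connection, payload: dict) -> int:
    """Write a ticket for a hypothetical payment webhook and return the HTTP status code."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "INSERT INTO tickets (payment_id, email) VALUES (?, ?)",
                (payload.get("payment_id"), payload.get("email")),
            )
    except sqlite3.Error:
        # The write failed, so report a server error. Returning 200 here would
        # tell the provider (or retry system) that the webhook was processed,
        # and the failed delivery would never be resent.
        return 500
    return 200


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tickets (payment_id TEXT NOT NULL, email TEXT NOT NULL)")
    print(handle_payment_webhook(conn, {"payment_id": "pay_1", "email": "ada@example.com"}))  # 200
    print(handle_payment_webhook(conn, {"payment_id": "pay_2"}))  # 500: email is missing
```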
Buggy database transaction logic
A transaction is a set of database operations that must either all succeed or all fail in order to keep the data consistent. If four operations need to happen in one atomic transaction and three succeed but one fails, the entire transaction must be rolled back. This pattern is very common in financial applications.
The easiest way to end up with inaccurate data is to skip transactions altogether and perform each write individually. A subtler way for inconsistencies to creep in is when you have to perform three write operations but only two of them are related, so you put those two in a transaction and leave the third one out. The two writes inside the transaction are protected, but the one outside is not: if it fails after the transaction has committed, the first two writes can no longer be rolled back.
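Here is a minimal sketch of that subtle case, using SQLite and a hypothetical order-processing webhook with illustrative `orders`, `inventory`, and `tickets` tables. The order row and the seat-inventory update share a transaction, but the ticket insert runs in a separate one. When the ticket insert fails, the first two writes have already been committed and there is nothing left to roll back.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical tables; NOT NULL on holder lets us trigger a failing third write.
    conn.executescript(
        """
        CREATE TABLE orders (id TEXT PRIMARY KEY, event TEXT NOT NULL);
        CREATE TABLE inventory (event TEXT PRIMARY KEY, seats INTEGER NOT NULL);
        CREATE TABLE tickets (order_id TEXT NOT NULL, holder TEXT NOT NULL);
        INSERT INTO inventory (event, seats) VALUES ('concert', 10);
        """
    )


def process_order_buggy(conn: sqlite3.Connection, order_id: str, event: str, holder) -> None:
    # Writes 1 and 2 are related, so they share a transaction...
    with conn:
        conn.execute("INSERT INTO orders (id, event) VALUES (?, ?)", (order_id, event))
        conn.execute("UPDATE inventory SET seats = seats - 1 WHERE event = ?", (event,))
    # ...but write 3 runs in its own transaction, after the first one has committed.
    with conn:
        conn.execute("INSERT INTO tickets (order_id, holder) VALUES (?, ?)", (order_id, holder))


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    try:
        process_order_buggy(conn, "ord_1", "concert", None)  # holder=None violates NOT NULL
    except sqlite3.IntegrityError as exc:
        print("ticket write failed:", exc)
    # The order exists and a seat was taken, but no ticket was issued.
    print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 1
    print(conn.execute("SELECT seats FROM inventory").fetchone()[0])  # 9
    print(conn.execute("SELECT COUNT(*) FROM tickets").fetchone()[0])  # 0
```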
Webhook duplication
Unlike the issues discussed so far, webhook duplication does not originate on the webhook consumer's side. For various reasons, webhook providers can send the same webhook request more than once. This can cause a database write to be performed more than once, resulting in duplicated records or repeated updates. Duplicate data might not matter much when you're counting likes on a social media post, but it is dangerous when you're crediting or debiting a user's financial account.
Preventing data issues
The most widely recommended way to prevent data integrity issues is to test your webhooks properly before deploying them to production. Make sure you test your webhook endpoints against failures and atomic write issues. You can find a detailed breakdown of how to test your webhooks for reliability here.
Atomic database operations
You also need to make sure that all database writes are consistent. One rule of thumb in handling all data manipulation operations on a webhook endpoint is to keep them all in a single transaction so that if any of the writes fail, you can roll back everything to return the data to a consistent state.
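Here is a minimal sketch of that rule of thumb in SQLite, with hypothetical `orders`, `inventory`, and `tickets` tables standing in for the writes a webhook might perform. All three writes sit in one transaction, so a failure in any of them rolls back the others.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical tables; NOT NULL on holder lets us simulate a failing write.
    conn.executescript(
        """
        CREATE TABLE orders (id TEXT PRIMARY KEY, event TEXT NOT NULL);
        CREATE TABLE inventory (event TEXT PRIMARY KEY, seats INTEGER NOT NULL);
        CREATE TABLE tickets (order_id TEXT NOT NULL, holder TEXT NOT NULL);
        INSERT INTO inventory (event, seats) VALUES ('concert', 10);
        """
    )


def process_order(conn: sqlite3.Connection, order_id: str, event: str, holder) -> None:
    # Every write the webhook performs shares one transaction: `with conn`
    # commits only if all three statements succeed and rolls back otherwise.
    with conn:
        conn.execute("INSERT INTO orders (id, event) VALUES (?, ?)", (order_id, event))
        conn.execute("UPDATE inventory SET seats = seats - 1 WHERE event = ?", (event,))
        conn.execute("INSERT INTO tickets (order_id, holder) VALUES (?, ?)", (order_id, holder))


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    process_order(conn, "ord_1", "concert", "Ada")
    try:
        process_order(conn, "ord_2", "concert", None)  # the third write fails
    except sqlite3.IntegrityError:
        pass  # in a webhook handler, this is where you would return a 5xx status
    # ord_2's order row and seat decrement were rolled back with the failed ticket.
    print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 1
    print(conn.execute("SELECT seats FROM inventory").fetchone()[0])  # 9
```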
Idempotent webhooks
Another preventive strategy is to make your webhook processing idempotent and to test it against the data duplication caused by receiving the same webhook more than once. To learn more about making your webhooks idempotent, check out this article.
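One common way to make a handler idempotent is to record the ID of every webhook you have processed and skip any ID you have already seen. The sketch below assumes the provider sends a unique event ID with each webhook (many do, but check your provider's documentation) and uses a hypothetical `processed_webhooks` table in SQLite whose primary key rejects repeats. The ID is recorded in the same transaction as the balance update, so a failed update does not leave the event marked as processed.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical tables: account balances plus a log of handled event IDs.
    conn.executescript(
        """
        CREATE TABLE accounts (id TEXT PRIMARY KEY, balance_cents INTEGER NOT NULL);
        CREATE TABLE processed_webhooks (event_id TEXT PRIMARY KEY);
        """
    )


def handle_credit_webhook(
    conn: sqlite3.Connection, event_id: str, account_id: str, amount_cents: int
) -> bool:
    """Apply a credit once per event_id. Returns True if applied, False for a duplicate."""
    with conn:  # the dedup record and the credit commit (or roll back) together
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_webhooks (event_id) VALUES (?)", (event_id,)
        )
        if cur.rowcount == 0:
            # This event ID was already recorded: acknowledge it without writing again.
            return False
        conn.execute(
            "UPDATE accounts SET balance_cents = balance_cents + ? WHERE id = ?",
            (amount_cents, account_id),
        )
    return True


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    conn.execute("INSERT INTO accounts (id, balance_cents) VALUES ('acct_1', 0)")
    conn.commit()
    # The provider delivers the same event twice.
    print(handle_credit_webhook(conn, "evt_1", "acct_1", 500))  # True
    print(handle_credit_webhook(conn, "evt_1", "acct_1", 500))  # False
    print(conn.execute("SELECT balance_cents FROM accounts").fetchone()[0])  # 500
```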
Recovering from data integrity issues
Before implementing recovery strategies, you want to make sure that you have covered as many preventive strategies as you can through proper testing and debugging.
Retry system
To recover from inevitable webhook failures, you need a retry system. A retry system detects a failed delivery when the webhook URL responds with an error status code, and then resends the webhook. To build one, you need a component that buffers the webhook requests coming from your webhook provider so that their contents persist between retries.
A message queue is the recommended way to buffer your webhooks and retry them when needed. You can build this yourself on top of an open-source message broker such as RabbitMQ or Apache Kafka and write retry logic to work with it. If you need to set up a message queue with a retry system quickly, without worrying about the maintenance overhead, you can use Hookdeck to get started immediately.
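To show the moving parts, here is a minimal in-memory sketch of a retry loop. The `deliver` callable (which would POST the buffered request to your webhook URL and return the response status), the attempt limit, the exponential backoff schedule, and the dead-letter list are all illustrative assumptions. A production version would keep the buffer in a durable queue such as RabbitMQ so that pending webhooks survive a restart.

```python
import time
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class BufferedWebhook:
    event_id: str
    body: bytes
    attempts: int = 0
    next_attempt_at: float = 0.0


class RetryQueue:
    """In-memory buffer that resends failed webhooks with exponential backoff."""

    def __init__(
        self,
        deliver: Callable[[BufferedWebhook], int],
        max_attempts: int = 5,
        base_delay: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.deliver = deliver  # hypothetical: sends the request, returns the HTTP status
        self.max_attempts = max_attempts
        self.base_delay = base_delay  # seconds to wait before the first retry
        self.clock = clock
        self.pending = deque()  # webhooks waiting for (re)delivery
        self.dead_letters = []  # webhooks that used up all their attempts

    def enqueue(self, webhook: BufferedWebhook) -> None:
        self.pending.append(webhook)

    def process_due(self) -> None:
        """Try to deliver every buffered webhook whose next attempt time has arrived."""
        now = self.clock()
        for _ in range(len(self.pending)):
            webhook = self.pending.popleft()
            if webhook.next_attempt_at > now:
                self.pending.append(webhook)  # not due yet, keep it buffered
                continue
            try:
                status = self.deliver(webhook)
            except OSError:
                status = 0  # connection errors count as failed deliveries
            webhook.attempts += 1
            if 200 <= status < 300:
                continue  # delivered, so drop it from the buffer
            if webhook.attempts >= self.max_attempts:
                self.dead_letters.append(webhook)  # stop retrying, needs attention
            else:
                # Wait base_delay, then 2x, 4x, ... between attempts.
                webhook.next_attempt_at = now + self.base_delay * 2 ** (webhook.attempts - 1)
                self.pending.append(webhook)
```

A worker would call `process_due()` periodically. Anything that ends up in `dead_letters` has exhausted its retries and needs investigating, for example by replaying it once the underlying bug is fixed.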
Database backups
Another corrective measure you can adopt is keeping near real-time database backups. These let you restore your database to a consistent state when your data has been compromised.
For example, on AWS, storage services such as DynamoDB, Relational Database Service (RDS), and Elastic Block Store (EBS) offer backup features that you can configure within each service. You can also use AWS Backup, which provides a centralized console to automate and manage backups across your AWS services.
Logging and monitoring systems
A good logging and monitoring system is also essential for finding and fixing errors in your webhooks and data, especially in production. It lets you quickly detect what went wrong with your data and where, so you can make informed decisions about how to fix it.
Still using AWS as an example, you can use Amazon CloudWatch to monitor the compute services your applications run on, as well as write operations to your database and other storage systems. This gives you visibility into your webhook operations and helps you track down faults in your system.
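On the logging side, a useful habit is to emit one structured log line per processed webhook, carrying its event ID, outcome, and status code, so you can search for a specific delivery or count failures over time. Here is a minimal sketch using Python's standard `logging` and `json` modules; the function and field names are illustrative, and JSON lines like these are easy to filter in a log aggregator such as CloudWatch Logs.

```python
import json
import logging
from typing import Optional

logger = logging.getLogger("webhooks")


def log_webhook_result(
    event_id: str,
    event_type: str,
    status: int,
    duration_ms: float,
    error: Optional[str] = None,
) -> dict:
    """Log one JSON line describing how a webhook was handled and return the record."""
    record = {
        "event_id": event_id,
        "event_type": event_type,
        "status": status,
        "outcome": "success" if 200 <= status < 300 else "failure",
        "duration_ms": round(duration_ms, 1),
    }
    if error is not None:
        record["error"] = error
    # Failures are logged at ERROR so alerts can key off the log level.
    level = logging.INFO if record["outcome"] == "success" else logging.ERROR
    logger.log(level, json.dumps(record, sort_keys=True))
    return record


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    log_webhook_result("evt_123", "payment.succeeded", 200, 12.34)
    log_webhook_result(
        "evt_124", "payment.succeeded", 500, 8.0,
        error="NOT NULL constraint failed: tickets.email",
    )
```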
Conclusion
Data inconsistencies can be very detrimental to the success of your application, as compromised data can erode your users' trust. Webhooks manipulate data in your applications by triggering the actions that change it, for example updating a user's account balance after a purchase. Handle this process with care to keep your data consistent and accurate.
For more information on how to make sure your webhooks are production-ready, check out our articles, "Deploying Webhooks in Production" and "Building Resilience Into Webhooks to Mitigate Performance Issues."