Handling Twilio Message Status Webhook Spikes with Hookdeck, Django, and Celery
If you've ever used Twilio's programmable messaging product, you know that it sends an overwhelming number of webhooks. Every time a message changes status you'll receive a webhook request, and a message can change status frequently: Twilio has 12 different statuses a message can have depending on the sending channel. That can result in a significant spike of incoming webhook requests just to send one message. For us, it was crucial to know whether a message was successfully delivered or not so we could update our database.
In this article, I'm going to walk through how Hookdeck helps us scale to handle thousands of simultaneous Twilio webhooks.
About Harvust
At Harvust, a farm labor compliance software tool, we use Twilio to send a lot more than just one message at a time. We help farms efficiently manage their seasonal workforce and comply with burdensome labor regulations. One of our most popular features is automated compliance messaging for high heat and wildfire smoke. We monitor the weather at our growers' ranches, and when the temperature or air quality exceeds a regulatory threshold we message all of that farm's employees (sometimes thousands) to let them know about the hazardous conditions.
The result is that Harvust sends thousands of messages and receives thousands of Twilio webhooks almost simultaneously. The volume of requests would be so high that the WSGI server (in our case gunicorn) would time out trying to churn through all the queued requests. Unfortunately this compounded a second problem: we were handling the webhooks on the same server as our API, so when the flood of webhooks caused timeouts, it also timed out API requests from our app's users.
Problem #1: Synchronous external HTTP requests
Our setup for receiving the webhooks looked something like this:
# urls.py
from django.urls import include, path

from .views import SMSMessageWebhookView

urlpatterns = [
    path("api/webhooks/sms-status", SMSMessageWebhookView.as_view(), name="sms-status"),
]
And the view that would handle the webhook looked like this:
# views.py
from django.utils.decorators import method_decorator
from django.views.decorators.csrf import csrf_exempt
from rest_framework.parsers import JSONParser, FormParser, MultiPartParser
from rest_framework.response import Response
from rest_framework.views import APIView
from twilio.rest import Client

from .models import Message  # our internal Message model

twilio_client = Client()


@method_decorator(csrf_exempt, name="dispatch")
class SMSMessageWebhookView(APIView):
    http_method_names = ["post"]
    parser_classes = (JSONParser, FormParser, MultiPartParser)
    authentication_classes = ()
    permission_classes = []

    def post(self, request, *args, **kwargs):
        twilio_message_sid = request.data["MessageSid"]

        # Refetch the Twilio Message resource so we always get the latest
        # status in case the webhooks arrive out of order.
        original_message = twilio_client.messages(twilio_message_sid).fetch()
        status = original_message.status

        # Update our internal Message model with the status, so our customer
        # can prove to regulators the message was delivered to their employee.
        internal_message = Message.objects.get(twilio_message_sid=twilio_message_sid)

        # Filter out statuses we don't care about
        if status not in ("accepted", "queued", "sending"):
            internal_message.status = status
            internal_message.save()

        return Response(status=200)
As you can see there are a few problems with how we're handling this request:
- We're making a synchronous HTTP request to Twilio, which means we have to wait for their web server to respond before we can respond to the webhook request. Amateur (but common!) mistake.
- No matter what happens in the business logic, we're always returning a 200 HTTP code, so why are we making Twilio wait?
To solve both of these problems we moved the business logic into a Celery task, following the best practice of processing inbound webhooks asynchronously and returning a 200 response immediately from the webhook view.
Celery is a well known Python task queue used to handle the execution of time-consuming or long-running tasks outside of the main web request/response cycle.
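If you haven't wired Celery into a Django project before, the integration is small. Here's a minimal sketch of the standard wiring; the project name, settings module, and file location are placeholders, not our actual configuration:

# myproject/celery.py: minimal Celery/Django wiring (names are placeholders)
import os

from celery import Celery

# Point Celery at the Django settings module before creating the app.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")

app = Celery("myproject")

# Read CELERY_* options (broker URL, serializer, etc.) from Django settings.
app.config_from_object("django.conf:settings", namespace="CELERY")

# Discover tasks.py modules in installed apps, including the webhook task below.
app.autodiscover_tasks()

With that in place, any function decorated with @shared_task can be queued with .delay() and executed by a worker process started with "celery -A myproject worker".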
It isn't hard to take the business logic from the original webhook handler and turn it into a Celery task, because the payload from Twilio arrives as plain serializable data and can be passed straight to the task (Celery serializes its task arguments as well).
# tasks.py
from celery import shared_task
from twilio.rest import Client

from .models import Message  # our internal Message model

twilio_client = Client()


@shared_task
def handle_twilio_status_webhook(data):
    twilio_message_sid = data["MessageSid"]

    # Refetch the Twilio Message resource so we always get the latest
    # status in case the webhooks arrive out of order.
    original_message = twilio_client.messages(twilio_message_sid).fetch()
    status = original_message.status

    # Update our internal Message model with the status, so our customer
    # can prove to regulators the message was delivered to their employee.
    internal_message = Message.objects.get(twilio_message_sid=twilio_message_sid)

    # Filter out statuses we don't care about
    if status in ("accepted", "queued", "sending"):
        return

    internal_message.status = status
    internal_message.save()
That allows us to simplify our view significantly:
# views.py
from django.utils.decorators import method_decorator
from django.views.decorators.csrf import csrf_exempt
from rest_framework.parsers import JSONParser, FormParser, MultiPartParser
from rest_framework.response import Response
from rest_framework.views import APIView

from .tasks import handle_twilio_status_webhook


@method_decorator(csrf_exempt, name="dispatch")
class SMSMessageWebhookView(APIView):
    http_method_names = ["post"]
    parser_classes = (JSONParser, FormParser, MultiPartParser)
    authentication_classes = ()
    permission_classes = []

    def post(self, request, *args, **kwargs):
        # Twilio status callbacks are form-encoded, so request.data is usually a
        # QueryDict (values stored internally as lists). Flatten it to a plain
        # dict so it serializes cleanly when handed to Celery.
        payload = request.data.dict() if hasattr(request.data, "dict") else dict(request.data)
        handle_twilio_status_webhook.delay(payload)
        return Response(status=200)
This is a huge improvement! The webhook handler is much simpler, finishes the request in a few milliseconds, and is able to consume the rest of the backlog of webhooks faster.
But it still wasn't enough: there were just so many webhooks that we couldn't process them without causing timeouts, even with Celery.
Problem #2: Thinking more resources would solve it
With that easy optimization complete, we decided that instead of spending more time on the problem we'd spend more money. During business as usual we have plenty of reserve capacity with the baseline web server configuration, so we figured a quick scale-up to handle these spikes of webhooks would be a fine solution. This helped, but now we had the DevOps burden of managing scaling rules, database connection limits, and so on, and we ended up over-provisioned immediately after each spike.
Along the same line of thinking, we could have spun up a dedicated server just for webhooks so they didn't interfere with API requests from users, but we didn't want to add even more infrastructure and maintenance complexity.
Problem #3: Irrelevant webhooks
So then our thoughts turned to reducing the number of webhooks we had to deal with in the first place. Enter Hookdeck...
You'll notice that the only data our task reads from the webhook payload is the "MessageSid", which we then use to fetch the Message resource from Twilio. So while the webhooks are still valuable, we're mostly using them as signals to go fetch the latest status, and webhooks with a status of "accepted", "queued", or "sending" are just redundant signals. So we use Hookdeck's Connection Rules Filter functionality:
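Expressed as JSON, the filter looks something like this (the MessageStatus field name comes from Twilio's status callback payload, and the $nin "not in" operator is our reading of Hookdeck's filter syntax, so verify both against the current docs before copying):

{
  "body": {
    "MessageStatus": {
      "$nin": ["accepted", "queued", "sending"]
    }
  }
}

Requests whose body matches the filter are delivered to our endpoint; everything else is filtered out by Hookdeck before it ever reaches Django.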
Cutting the number of webhooks by roughly a third reduced the timeouts significantly. It also simplifies our code slightly by offloading the if status in ("accepted", "queued", "sending") check to Hookdeck.
Problem #4: Treating these webhooks as a priority
As developers, sometimes it's hard to take a step back and view the problem from the customer's perspective. After doing so we realized that while an accurate delivery status for each message was of high value for the customer, it wasn't needed immediately. We thought the value was in "realtime transparency and records", but in reality the customer would look at the delivery status hours later, if at all -- they just assumed things worked out.
This meant we didn't have to handle these webhooks all at once, and Hookdeck had just the feature: Destination Max delivery rate, which lets us set a maximum number of requests per second that Hookdeck will send to our webhook endpoint.
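If you manage Hookdeck connections through their API rather than the dashboard, the destination update is roughly along these lines. The endpoint path and field names here are assumptions based on our reading of the Hookdeck API reference, so treat it as a sketch and check the current docs:

# sketch: cap a Hookdeck destination at 50 deliveries per second via their API
# (endpoint path and field names are assumptions; verify against Hookdeck's docs)
import requests

response = requests.put(
    "https://api.hookdeck.com/<api-version>/destinations/<destination-id>",
    headers={"Authorization": "Bearer <HOOKDECK_API_KEY>"},
    json={"rate_limit": 50, "rate_limit_period": "second"},
)
response.raise_for_status()

However you set it, the effect is the same: Hookdeck queues the incoming webhooks and drains them to your endpoint at the configured rate.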
We played around with this setting until we found a rate that didn't cause spikes. Spreading the requests out over time meant we didn't need to auto-scale the server and could make use of the excess capacity in the baseline server configuration. In fact, we've found that using the delivery rate feature on all of our Hookdeck destinations gives us a way to define the priority of each connection.
Less code and infrastructure == better outcome
To solve Harvust's Twilio webhook spike problem we initially turned to more code and more infrastructure. But once the easy wins were out of the way, we found it more efficient to use Hookdeck as our inbound webhook infrastructure, reducing both code and infrastructure complexity.