Phil Leggetter

Asynchronous AI: Why Event Callbacks Are the Future of GenAI APIs


As an industry, we have a developer experience problem when it comes to generative AI APIs. The REST API model is designed for a largely synchronous world where responses return in milliseconds. However, GenAI APIs sit in front of technologies that can take tens of seconds in most cases and, in some scenarios, minutes to respond.

We're starting to see the limits of just how far we can stretch synchronous tooling. And workarounds like extending timeouts beyond reasonable limits to support a transactional request/response paradigm can only get us so far.

So, rather than trying to shoehorn synchronous technologies and paradigms into fundamentally asynchronous products, it's time for a rethink.

Synchronous responses only work when things are fast

With transactional APIs, we've come to expect sub-second response times. Even as far back as 2008, Amazon found that every extra 100ms of latency reduced their profit by 1%. More recent research suggests that sites that load in one second have three times the conversion rate of sites that load in five seconds.

And that makes sense. A delay of 100ms is the limit at which people feel a UI is responding instantaneously. Break that barrier, and people will start to lose focus. Go from one second to three seconds of latency, and the probability of a user bouncing increases by 32%.

So, we've built APIs, infrastructure, and architectural patterns that minimize latency and, for the most part, emphasize synchronous communication. That leads us to a situation where responses from UK Open Banking providers, for example, average under 500ms.

But then along came generative AIs, and all our assumptions got upended.

How long are GenAI API latencies?

Research by PromptHub puts the response time of OpenAI's GPT-3.5 at 26ms per token. With a token averaging roughly four characters, or about three-quarters of a word, a 75-word response works out to around 100 tokens, delivered in around 2.6 seconds. Not terrible, but certainly outside the parameters of what's normal for a traditional API.

A more typical 250-word response, at around 330 tokens, would take 8.58 seconds. Switch to GPT-4, which PromptHub benchmarked at 76ms per token, and the same response would take just over 25 seconds.
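To make the arithmetic concrete, here's a back-of-the-envelope sketch of that estimate. The per-token figures are the PromptHub benchmarks quoted above; the tokens-per-word ratio is the usual rule of thumb, not an exact measure.

```typescript
// Rough latency estimate from PromptHub's per-token benchmarks.
const MS_PER_TOKEN = { "gpt-3.5": 26, "gpt-4": 76 } as const;

function estimateResponseSeconds(
  words: number,
  model: keyof typeof MS_PER_TOKEN
): number {
  const tokens = Math.round(words / 0.75); // ~0.75 words per token
  return (tokens * MS_PER_TOKEN[model]) / 1000;
}

console.log(estimateResponseSeconds(75, "gpt-3.5")); // ~2.6s
console.log(estimateResponseSeconds(250, "gpt-4"));  // ~25s
```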

With any other product category, trying to handle such response times synchronously would blast through timeouts and set alarms ringing in observability dashboards.

Async and not quite synchronous

So, we can't expect sub-second responses from most generative AI APIs. However, end users are generally willing to tolerate these longer wait times because the outcomes can be so impactful. But there's still a point at which both users and our tooling grow intolerant of longer latencies.

For the people using tools built on top of GenAI APIs, the UX is largely driven by how they interact with the product. Take chat-like interfaces, for example; they set the expectation that responses may take a while, but they keep people engaged by streaming the response, showing progress.

But when the output is a single item, such as an image, or the response from the AI is part of a broader backend process within an app we're building, it's harder to mask delays using similar tricks.

A single image generation request, for example, can take around 40 seconds to complete.
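To see what that means in practice, here's roughly what a blocking image generation call looks like with the OpenAI Node SDK. The prompt and model are illustrative; the point is that the HTTP connection stays open for the entire generation.

```typescript
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// A single synchronous request: the connection stays open until the image
// is ready, which can easily be tens of seconds.
const result = await client.images.generate({
  model: "dall-e-3",
  prompt: "An isometric illustration of an event-driven architecture",
  n: 1,
  size: "1024x1024",
});

console.log(result.data?.[0]?.url);
```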

This means we should think about two broad strategies for handling the longer response times of generative AI APIs:

  1. Decoupled and asynchronous: Many generative AI APIs should be treated as event-driven systems. Instead of relying on polling for updates or extending timeout limits, these APIs should use callback events (e.g., webhooks) to notify us when the response is ready. (A sketch of this pattern follows this list.)

  2. Mimicking real-time: For situations where users expect a real-time experience similar to ChatGPT, we can use protocols such as Server-Sent Events (SSE) or WebSocket. This helps avoid the fragility of extended timeouts while making sure that responses are delivered to users the instant they are ready.
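As a minimal sketch of the first strategy, here's what submitting a job with a callback might look like. The endpoint and its callback_url field are hypothetical stand-ins for whichever provider you use; what matters is that the request returns immediately with a job ID, and the result arrives later as an HTTP callback.

```typescript
// Submit a long-running generation job and return immediately.
// The /v1/generations endpoint and callback_url field are hypothetical;
// substitute your provider's async or batch API.
const response = await fetch("https://api.example-genai.com/v1/generations", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.GENAI_API_KEY}`,
  },
  body: JSON.stringify({
    prompt: "A product launch announcement image",
    callback_url: "https://app.example.com/webhooks/generation-complete",
  }),
});

const { job_id } = await response.json();
console.log(`Job ${job_id} accepted; the result will arrive via webhook.`);
```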

It's more than just latency

But what is it, specifically, about many generative AI APIs that makes them a poor fit for synchronous architectures?

  • They don't respond in real-time: As we've seen, large language models (LLMs) and other generative AI models can take several seconds to respond. That's past the point at which synchronous responses make sense for most use cases, both from a UX perspective and in terms of maintaining open connections.

  • Response times are variable: Response times for AI models can vary significantly not only across different models but also across different use cases. This variability makes it harder to develop robust and predictable application code.

  • Long-running connections are hard to scale and maintain: Maintaining long-running connections, especially in high-demand environments, presents scalability challenges. Open connections can be a resource drain, complicating load management and system stability. This problem compounds as user numbers grow.

While these challenges are not unique to AI APIs, they underscore why the event-driven architecture (EDA) model is particularly suited to generative AI.

What are people doing today?

How are AI APIs managing these long latencies today?

  • Synchronous manual polling: The default is still to ask developers to wait for responses, adjusting timeouts up into the tens of seconds.

  • Batching requests: For situations where a real-time response isn't needed, you can batch requests for later processing. Both OpenAI and Google's Vertex AI support batch requests but, at the time of writing, ask you to poll manually to check if a response is ready.

  • Streaming responses: Claude, OpenAI, and other providers use Server-Sent Events (SSE) to stream responses as they emerge from the LLM. This fits text-based models reasonably well but is less useful for GenAI models where the output is a singular artifact, such as an image or a video. And it still requires a persistent connection.
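As an example of the streaming approach, here's roughly how it looks with the OpenAI Node SDK, which uses SSE under the hood; the model name and prompt are illustrative.

```typescript
import OpenAI from "openai";

const client = new OpenAI();

// stream: true switches the request to SSE; tokens arrive as they're generated.
const stream = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Summarize event-driven architecture." }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
```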

Each of these approaches has its place. As we've seen, user expectations can make a synchronous response unavoidable, even if it takes longer than we'd expect from non-AI APIs. But have there been any attempts to build asynchronous responses into GenAI APIs?

The answer is "kinda". At the time of writing, asynchronous responses are relatively rare and tend to come in two types:

  • Sending responses to a data sink: Google's Vision AI, amongst others, gives you some choice over where you want the results delivered. For example, you can send them to BigQuery for analysis, which could trigger an event in Google Pub/Sub and subsequent handling.

  • Webhooks and other event callbacks: Specialized AI tools, like AssemblyAI's speech recognition and transcription, will trigger a webhook when a transcript is ready. HuggingFace lets you use webhooks for MLOps tasks, such as triggering fine-tuning when changes occur in a dataset.
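Receiving one of these event callbacks is just an ordinary HTTP endpoint. Here's a minimal Express sketch; the payload fields are hypothetical, since every provider defines its own shape, and in production you'd also verify the webhook's signature.

```typescript
import express from "express";

const app = express();
app.use(express.json());

// The provider POSTs here when the generation or transcription job finishes.
// job_id, status, and result_url are illustrative field names.
app.post("/webhooks/generation-complete", (req, res) => {
  const { job_id, status, result_url } = req.body;

  // Acknowledge quickly so the provider doesn't retry; do the real work async.
  res.status(200).send("ok");

  if (status === "completed") {
    console.log(`Job ${job_id} finished, result at ${result_url}`);
    // e.g. enqueue downstream processing or notify the user.
  } else {
    console.error(`Job ${job_id} ended with status ${status}`);
  }
});

app.listen(3000);
```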

Why Asynchronous AI is the answer for many GenAI APIs

Techniques like manual polling and stretching timeouts to the point of breaking are only temporary answers to the longer latencies we find in AI APIs. Better solutions are already beginning to appear, in particular:

  • Streaming protocols: Where you need a near-synchronous experience and the response can be broken into smaller chunks, streaming protocols like WebSocket and SSE offer a near-universally available way of getting data to clients reliably while giving the impression of a real-time response.

  • Event callbacks: In almost all other situations, especially where a streaming response would be less useful, asynchronous AI is the answer via event callbacks such as webhooks, which offer a much better technical implementation and developer experience than manual polling. They're a good fit whether you're generating a single image or video or running a batch of jobs together. Crucially, event callbacks reduce overhead, are more robust than manual polling, are easy to scale, and rely on standard web protocols.

Most of the innovation in AI-based tools focuses on the models themselves and not so much on the developer experience. And honestly, that makes sense. When things change quickly, we tend to stick with what we know. Even though the response times are slower than what we're used to, sticking with REST has helped providers get AI-based APIs to market faster.

But as those APIs take on a more business-critical role, it's time to reshape their developer experience by adopting asynchronous AI and patterns that better match how GenAI products work and how we, as developers, engage with them.



The Hookdeck event gateway enables multiple use cases for asynchronous AI: