Gareth Wilson

Building Reliable MCP Servers: Handling Async Events and Webhooks

The Model Context Protocol has changed how AI agents interact with external services. An MCP server wraps an API — Stripe, GitHub, Shopify — and exposes it as a set of tools that any compatible AI client can call. But there's a fundamental assumption baked into MCP's design that breaks down the moment you move beyond simple request-response interactions: MCP assumes the world is synchronous.

In our post about MCP Gateway, we raised an open question: when an MCP tool triggers an action that produces an asynchronous callback, does that callback get fed back to the AI agent? The short answer is: not automatically. And this gap — between triggering an action and receiving its eventual result — is one of the most important infrastructure problems in production MCP server architecture.

The Synchronous Assumption

MCP's core interaction model is straightforward. A client calls tools/call, the server executes the operation, and a result comes back in the response. This works perfectly for read operations — fetching a customer record, listing open issues, checking an account balance.
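That round trip can be pictured as a single JSON-RPC 2.0 exchange. A sketch, with an invented tool name (get_customer) and payload for illustration:

```python
# A tools/call request from the client (JSON-RPC 2.0)...
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "get_customer",  # hypothetical tool name
        "arguments": {"customer_id": "cus_123"},
    },
}

# ...and the complete result, returned in the same round trip.
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "content": [{"type": "text", "text": "Customer: Ada Lovelace"}],
        "isError": False,
    },
}
```

One request in, one finished result out — nothing about this shape leaves room for an event that arrives hours later.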

But many real-world operations don't complete synchronously. When an MCP tool initiates a Stripe payment, creates a GitHub deployment, or starts a Shopify fulfillment, the immediate response is just an acknowledgment. The actual result — payment succeeded, deployment finished, order shipped — arrives later as a webhook from the third-party service.

This creates two distinct communication patterns that MCP servers need to support:

  • Synchronous: Client → MCP Server → External API → Response → Client
  • Event-driven: Third-Party Service → Webhook → MCP Server → Notification → Client

The MCP spec handles the first pattern well. The second pattern? That's where things get complicated.

Tasks Solve Half the Problem

The 2025-11-25 MCP specification introduced Tasks — a "call-now, fetch-later" primitive designed specifically for long-running operations. When a tool call can't return immediately, the server returns a taskId and the client polls for updates. Tasks move through defined states: working, input_required, completed, failed, cancelled.

Tasks are a meaningful step forward. They acknowledge that not everything in the real world happens synchronously and give agents a structured way to track ongoing operations. But Tasks solve the problem from the agent's perspective — they give the client a way to wait for results.
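The Task lifecycle amounts to a small state machine. A minimal in-memory sketch of it — TaskStore is an invented name, not part of any MCP SDK, and a real server would persist this state rather than hold it in a dict:

```python
from enum import Enum


class TaskState(str, Enum):
    WORKING = "working"
    INPUT_REQUIRED = "input_required"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Once a task reaches a terminal state, its outcome is fixed.
TERMINAL_STATES = {TaskState.COMPLETED, TaskState.FAILED, TaskState.CANCELLED}


class TaskStore:
    """In-memory task registry; production code would persist this."""

    def __init__(self) -> None:
        self._states: dict[str, TaskState] = {}

    def create(self, task_id: str) -> None:
        # A new task starts out working; the client polls until terminal.
        self._states[task_id] = TaskState.WORKING

    def update(self, task_id: str, state: TaskState) -> None:
        if self._states[task_id] in TERMINAL_STATES:
            raise ValueError(f"task {task_id} already finished")
        self._states[task_id] = state

    def get(self, task_id: str) -> TaskState:
        return self._states[task_id]
```

The client's side of the contract is simply to poll `get` until the state lands in a terminal one.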

What Tasks don't solve is the server's infrastructure problem. When Stripe sends a payment_intent.succeeded webhook, or GitHub sends a deployment_status event, the MCP server needs to:

  1. Receive that inbound HTTP POST reliably
  2. Verify its authenticity
  3. Handle retries and deduplication
  4. Route it to the correct handler
  5. Correlate it with the original tool call
  6. Update the Task state accordingly

Tasks give you the envelope. They don't give you the mailroom.
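Steps 2 through 6 can be sketched as a single handler. This is a simplified illustration, not a production design: it assumes a generic HMAC-SHA256 signature over the raw body (each provider defines its own exact scheme), and the event field names (`id`, `type`, `object`) are invented. Step 1 — being reliably available to receive the POST at all — happens outside any handler function:

```python
import hashlib
import hmac
import json


def handle_webhook(raw_body: bytes, signature: str, secret: str,
                   seen_ids: set, routes: dict, task_by_object: dict,
                   task_states: dict) -> str:
    """Walk steps 2-6 for one inbound webhook POST."""
    # 2. Verify authenticity via an HMAC over the raw body.
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        return "rejected"

    event = json.loads(raw_body)

    # 3. Deduplicate on the provider's event ID; providers retry on failure.
    if event["id"] in seen_ids:
        return "duplicate"
    seen_ids.add(event["id"])

    # 4. Route to the handler registered for this event type.
    handler = routes.get(event["type"])
    if handler is None:
        return "unrouted"
    handler(event)

    # 5-6. Correlate with the original tool call via the object the event
    # refers to, then update that Task's state.
    task_id = task_by_object.get(event["object"])
    if task_id is not None:
        task_states[task_id] = "completed"
    return "processed"
```

Even in this toy form, notice how much of the code is plumbing rather than business logic — and that `seen_ids` and `task_by_object` would need durable storage the moment you run more than one server instance.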

Why Standard MCP Transports Don't Help

MCP currently supports stdio and Streamable HTTP as its primary transport mechanisms. Streamable HTTP replaced the earlier SSE-based transport and brought MCP closer to production readiness, but neither transport is designed for receiving inbound webhooks from third-party services.

Streamable HTTP improved on the original SSE transport's scaling challenges — persistent connections, sticky sessions, and the resource cost of maintaining thousands of open connections. But the fundamental issue remains: third-party services don't speak MCP transports. They send webhooks via HTTP POST to a URL you provide. There's a protocol mismatch between how external services want to communicate and how MCP servers are designed to receive messages.

The MCP community has recognized this gap. As one contributor noted, maintaining connections between client and server is often unnecessary and resource-intensive, particularly for MCP deployments running at scale. Proposals for native callback or webhook mechanisms exist, but they're not in the spec today — they're listed under "On the Horizon" as "Triggers and Event-Driven Updates."

The MCP 2026 roadmap does prioritize evolving Streamable HTTP to run statelessly across multiple server instances behind load balancers. That will help with horizontal scaling, but it doesn't address the fundamental challenge of reliably ingesting inbound events from third-party services.

The MCP Gateway Blind Spot

The MCP Gateway market has seen rapid growth. Kong, Lunar, TrueFoundry, and others have launched enterprise MCP gateways focused on authentication, routing, rate limiting, and observability for outbound tool calls. These solve real problems — managing dozens of MCP servers in an enterprise environment requires centralized control.

But there's a consistent blind spot across these gateway implementations: none of them address inbound event ingestion. They're built around the request-response model — an agent calls a tool, the gateway routes it, the tool returns a result. The asynchronous, event-driven half of the equation is left as an exercise for the reader.

This isn't a criticism of those products. It reflects a genuine gap in the ecosystem. The MCP spec itself doesn't have a first-class primitive for "a third-party service wants to push data to this server." Until it does, the infrastructure for handling that pattern needs to come from somewhere else.

Webhook Infrastructure Is the Missing Layer

Here's the thing: the challenges of reliably receiving, processing, and routing inbound webhooks are well-understood. They're just not MCP-specific problems. They're webhook infrastructure problems.

When you need to receive webhooks from third-party services in a production environment, you need:

  • Reliable ingestion — an always-available HTTPS endpoint that won't drop events during deployments, scaling events, or temporary outages
  • Deduplication — because webhook providers retry on failure, and you'll receive duplicates
  • Backpressure management — because a Shopify flash sale or a GitHub monorepo push can generate thousands of events in seconds
  • Retry logic — because your processing code will occasionally fail, and you need events to be redelivered
  • Observability — because when an agent's tool call produces no result, you need to see whether the webhook arrived, whether it was processed, and where it failed
  • Routing and fan-out — because different events need to reach different handlers, and sometimes multiple handlers need the same event

These are exactly the problems that webhook infrastructure — specifically, an Event Gateway — is designed to solve. Not for MCP specifically, but for any system that needs to receive webhooks reliably. MCP servers are simply the latest (and perhaps most consequential) consumer of this infrastructure pattern.
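To make one of those bullets concrete, here is the retry pattern in miniature: exponential backoff with jitter, so a struggling destination gets breathing room and retries from many events don't land in lockstep. A sketch of the general technique, not any particular gateway's implementation:

```python
import random
import time


def deliver_with_retry(deliver, event, max_attempts: int = 5,
                       base_delay: float = 0.5) -> None:
    """Attempt delivery, backing off exponentially between failures."""
    for attempt in range(max_attempts):
        try:
            deliver(event)
            return
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: dead-letter the event instead
            # 0.5s, 1s, 2s, 4s ... plus jitter to spread retries out
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

A real gateway also persists the event before the first attempt, so a process crash between receipt and delivery doesn't lose it.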

What This Looks Like in Practice

Consider a concrete scenario: you're building an MCP server that wraps Stripe's API, allowing AI agents to create payment links, manage subscriptions, and process refunds.

The synchronous operations are straightforward — your MCP tools call Stripe's API and return results. But the interesting workflows are asynchronous. An agent creates a payment link. Minutes, hours, or days later, a customer pays. Stripe sends a checkout.session.completed webhook. Your MCP server needs to receive that event and potentially notify the agent (or update a Task) so the workflow can continue.

Without dedicated webhook infrastructure, your MCP server is directly responsible for being available to receive that webhook, handling Stripe's retry logic if it misses one, deduplicating events that arrive multiple times, and maintaining enough observability to debug when something goes wrong. That's a lot of infrastructure code that has nothing to do with the MCP tool's actual business logic.
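To give a feel for just one slice of that code, here is Stripe's webhook signature scheme verified by hand (in practice you'd use Stripe's SDK, which wraps this in `stripe.Webhook.construct_event`): the `Stripe-Signature` header carries a timestamp and a `v1` signature, and the signature is an HMAC-SHA256 of `"{timestamp}.{raw body}"` keyed with your endpoint's signing secret.

```python
import hashlib
import hmac
import time


def verify_stripe_signature(payload: bytes, sig_header: str,
                            secret: str, tolerance: int = 300) -> bool:
    # Header looks like: "t=1700000000,v1=abc123..."
    parts = dict(item.split("=", 1) for item in sig_header.split(","))
    timestamp, candidate = parts["t"], parts["v1"]

    # Reject stale timestamps to limit replay attacks.
    if abs(time.time() - int(timestamp)) > tolerance:
        return False

    # Stripe signs "{timestamp}.{raw body}" with the endpoint secret.
    signed = f"{timestamp}.".encode() + payload
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, candidate)
```

And that's only verification — deduplication, queuing, retries, and observability are each another layer of the same kind of code.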

With an Event Gateway sitting between Stripe and your MCP server, the architecture shifts. The Event Gateway provides a stable, always-available endpoint for Stripe to send webhooks to. It handles verification, deduplication, queuing, and delivery. Your MCP server receives clean, deduplicated events with full delivery guarantees. If your server is temporarily unavailable, events queue up and get delivered when it's back. And you get full observability into every event — what arrived, when it was delivered, whether delivery succeeded.

We explored this pattern when integrating Event Gateway and Claude Code, showing how external webhooks can flow through an Event Gateway, get forwarded to a local channel server, and emit MCP notifications to an AI agent.

The Architectural Pattern

The pattern that emerges is:

┌─────────────┐    webhook     ┌───────────────┐     event    ┌────────────┐
│ Third-Party │ ─────────────► │ Event Gateway │ ───────────► │ MCP Server │
│  Service    │                │  (Hookdeck)   │              │            │
└─────────────┘                └───────────────┘              └─────┬──────┘
                                  • Ingestion                       │
                                  • Deduplication                   │ Task update/
                                  • Retry & backpressure            │ notification
                                  • Observability                   │
                                  • Routing                         ▼
                                                              ┌────────────┐
                                                              │ AI Agent   │
                                                              │ (Client)   │
                                                              └────────────┘

The Event Gateway isn't an MCP component. It's infrastructure that sits at the boundary between the external webhook-producing world and your MCP server, handling everything that the MCP protocol itself doesn't address on the inbound side. The MCP spec handles agent-to-tool communication. Tasks handle async state management. The Event Gateway handles reliable event ingestion.

This separation of concerns means your MCP server code can focus on what it should — mapping tool calls to API operations and translating inbound events into meaningful state changes — while the hard problems of webhook reliability are handled by purpose-built infrastructure.

Looking Ahead

The MCP ecosystem is maturing fast. As MCP servers move from prototypes to production, the async event handling gap will only become more visible. Agents won't just read data — they'll trigger workflows that span minutes, hours, or days, and they'll need to receive the results of those workflows reliably.

The MCP spec will likely evolve to better support this pattern natively. The 2026 roadmap's focus on stateless transport and enterprise readiness points in the right direction. But even as the protocol matures, the underlying infrastructure challenge remains: someone needs to reliably receive, queue, and deliver the webhooks that third-party services send.

That's a solved problem. The question for teams building production MCP servers is whether they want to solve it themselves or use infrastructure that already handles it at scale.