Webhook Gateways and Durable Runtimes: Two Tools for Reliable Agent Workflows
Here's a pattern you've probably written recently.
```javascript
app.post('/webhook', async (req, res) => {
  const result = await agentSDK.run('summarise this', req.body);
  await saveResultToDb(result);
  await sendAnEmail(result);
  await doSomethingElse(result);
  res.status(200).send('OK');
});
```
An inbound webhook fires, you run an agent against the payload, you do a few side effects, and you return 200 OK. In development it works. In production it works for a while... until one day the LLM takes 47 seconds instead of 4, your load balancer times the request out at 30, your webhook provider interprets that as a failed delivery and retries it, your agent runs a second time, and your user gets two emails.
Somewhere around that moment, you start searching for "how to run long jobs from webhooks." You'll find two kinds of answers:
- Webhook gateways — tools like Hookdeck Event Gateway that sit in front of your application and handle the ingress side of the problem. They verify signatures, dedupe events, queue them durably, retry delivery, and give you observability into what came in.
- Durable runtimes — tools like Inngest, Trigger.dev, or Hatchet that sit inside or alongside your application and handle the execution side of the problem. They run your code durably, check-point between steps, pause tasks for external events, and give you observability into what your code is doing.
They sound similar, but they're solving two different halves of the same problem. Many teams pick one when they need the other — or both.
This article walks through what each category actually does, where they shine, where they fall short, and how to decide which you need.
The problem
The naive code above has three distinct failure modes, and each one is solved by a different layer of the stack.
Failure 1 — the edge. The webhook itself might be spoofed, duplicated, or malformed. Your provider sends the same event twice during a retry. The signature header is stripped by a proxy. The payload uses an encoding your parser chokes on. These are ingress problems — they happen before your business logic runs.
Failure 2 — the execution. Your agent call takes too long. A dependency throws. The process crashes mid-run. Your serverless function hits its 30-second limit. You retry, but you retry the whole thing, including the expensive LLM call that already succeeded. These are execution problems — they happen while your business logic is running.
Failure 3 — the operation. Something went wrong in production. You need to know what. Was it the webhook? The LLM? The database write? Is there a way to replay the failed event without asking the provider to re-send it? These are observability and recoverability problems — they happen after the fact.
A webhook gateway solves failure 1 and most of failure 3. A durable runtime solves failure 2 and a different slice of failure 3. Neither solves all of it on its own.
What a webhook gateway does
A webhook gateway is a managed service that sits between the webhook producer (Stripe, Shopify, GitHub, your own upstream service) and your application. Hookdeck's Event Gateway is one; there are others with varying scope.
What it gives you:
- Signature verification for every major webhook provider, without you writing the crypto. Hookdeck ships with 120+ pre-configured sources, so "verify this Stripe webhook" is a toggle, not a library integration.
- Durable ingress queue. The moment the webhook hits Hookdeck, it's persisted. Even if your downstream endpoint is down, the event is safe and will be retried.
- Dedup and filtering at the edge. Stripe sends the same `invoice.paid` event twice? Hookdeck can drop the duplicate before it hits your code. Only care about orders over $500? Filter them at the gateway.
- Transformations. Reshape the payload before it reaches your code. Useful when you want to normalise events from five providers into one shape, or strip PII, or rename fields to match your internal schema.
- Retries with exponential backoff and provider-aware semantics. If your endpoint 502s, Hookdeck retries on a curve you control, not on the webhook provider's unforgiving 1–3 retry window.
- Replay and inspection. Every event you received is stored, searchable, and replayable. When production breaks, you don't need the provider to re-send — you just replay from the Hookdeck dashboard.
- Dev-to-prod workflow. The Hookdeck CLI forwards real webhooks to your laptop during development, preserves history between restarts, and supports multiple developers sharing the same connection.
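Hookdeck does this at the edge for you; to make just the dedup idea concrete, a hand-rolled sketch might look like the following. This is in-memory and illustrative only (`acceptEvent` is a hypothetical helper): a real gateway persists seen event IDs durably and expires them over time.

```javascript
// Illustrative edge dedup: drop events whose provider event ID has
// already been seen. Providers like Stripe attach a stable ID to every
// event, which makes duplicate deliveries detectable at the edge.
// A real gateway stores seen IDs durably; this Set is for demonstration.
const seenEventIds = new Set();

function acceptEvent(event) {
  if (seenEventIds.has(event.id)) {
    // Same event ID delivered again (e.g. a provider retry): reject it
    // before it ever reaches application code.
    return { accepted: false, reason: 'duplicate' };
  }
  seenEventIds.add(event.id);
  return { accepted: true };
}
```

The point is where the check lives: before your business logic, so a duplicated delivery never triggers a second agent run.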
Where a webhook gateway shines
- You're receiving webhooks from third-party providers, full stop. The ingress concerns are real regardless of what happens downstream.
- You want provider-specific handling without writing it per provider.
- You want a consistent operational experience across your webhooks, even if different teams own the receiving services.
- You want to be able to replay and debug without coordination with the upstream sender.
- You want the webhook infrastructure decoupled from whatever your application code and runtime happens to be this quarter.
Where a webhook gateway alone falls short
- It doesn't run your code. A gateway delivers the event to an endpoint; what happens inside that endpoint is your problem. If your handler takes too long, you still have a timeout. If your handler crashes halfway through, the gateway will retry the whole delivery — including the parts that already succeeded.
- It can't check-point between steps. If your agent call takes 20 seconds and then your database write fails, the retry replays the agent call. Durable runtimes can skip the successful step; gateways cannot.
- No human-in-the-loop primitives. If your agent needs to pause for a human to approve something (like a code review, a refund, a content moderation decision) a gateway has nowhere to put that pause.
- No scheduling or time delays. Gateways are reactive to inbound events. If your agent needs to send a follow-up three days later, that's a different tool.
- HTTP timeout ceiling. Even with generous delivery windows, there's a point at which "keep the HTTP request alive" stops being the right pattern.
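The first two points are worth making concrete. A sketch, with hypothetical step functions and in-memory counters, of what a delivery-level retry does to a two-step handler: when step 2 fails, the gateway re-sends the event and the expensive step 1 runs again even though it already succeeded.

```javascript
// Illustrative: a gateway retries the whole delivery, not the failed step.
const callCounts = { llm: 0, db: 0 };

async function expensiveLlmCall() {
  callCounts.llm += 1; // succeeds on every attempt
  return 'summary';
}

async function writeToDb(result) {
  callCounts.db += 1;
  if (callCounts.db === 1) throw new Error('db unavailable'); // fails once
}

// The delivery target: a single handler the gateway can only re-invoke whole.
async function handleDelivery() {
  const summary = await expensiveLlmCall(); // step 1
  await writeToDb(summary);                 // step 2
}

async function simulateGatewayRetries() {
  for (let attempt = 1; attempt <= 2; attempt++) {
    try {
      await handleDelivery();
      return;
    } catch {
      // gateway sees a failed delivery and re-sends the event
    }
  }
}
```

After `simulateGatewayRetries()` resolves, `callCounts.llm` is 2: the LLM call ran twice, and you paid for it twice.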
What a durable runtime does
A durable runtime is a platform for running your code in a way that survives process restarts, container failures, and step-level errors. Inngest, Trigger.dev, and Hatchet are all in this category. So is Temporal, for a different slice of the market.
What it gives you:
- Check-pointed execution. You write what looks like normal async code. The runtime wraps each `await` in a way that persists the result. When your function crashes at step 4 of 6 and restarts, steps 1–3 don't re-run.
- Per-step retries. If the LLM call fails, retry just the LLM call. If the email send fails, retry just the email send. The runtime knows which steps succeeded and which didn't.
- Long-running tasks without keeping HTTP alive. Trigger.dev's tasks can run for minutes, hours, or longer. Your HTTP handler returns immediately; the task continues executing in the runtime's workers.
- Waitpoints and human-in-the-loop. A Trigger.dev task can `wait.forToken()` indefinitely and resume when a human clicks Approve in Slack. Inngest has similar primitives. This is genuinely hard to build from scratch.
- Realtime streams back to the frontend. Stream LLM output to a browser while the task is still running, without your app having to hold a connection open.
- Agent-specific primitives. Inngest's AgentKit, Trigger.dev's v4 agent tooling, and Hatchet's task orchestration all give you framework support for multi-step agent patterns.
- Scheduling and cron. Run this task in an hour. Run this task every Monday. Webhook gateways don't do this.
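The check-pointing idea can be sketched in a few lines. This is a toy version under obvious simplifications (the `step` helper and `checkpoints` Map are illustrative; real runtimes persist results to a database and handle concurrency, timeouts, and replay semantics), but it shows the core trick: completed steps return their saved result instead of re-running.

```javascript
// Minimal sketch of check-pointed execution. step(name, fn) runs fn once;
// on a re-invocation after a failure, completed steps are replayed from
// the checkpoint store instead of executing again.
const checkpoints = new Map(); // real runtimes persist this durably

async function step(name, fn) {
  if (checkpoints.has(name)) return checkpoints.get(name); // replay: skip
  const result = await fn();
  checkpoints.set(name, result); // checkpoint before moving on
  return result;
}

async function workflow(summarise, sendEmail) {
  const summary = await step('summarise', summarise); // expensive LLM call
  await step('send-email', () => sendEmail(summary)); // flaky side effect
  return summary;
}
```

If `sendEmail` throws and the workflow is re-invoked, `summarise` does not run again — its result comes back from the checkpoint, which is exactly the property the gateway-only setup lacked.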
Where a durable runtime shines
- Multi-step workflows with real branching, fan-out, and conditional logic.
- Long-running agent work that needs to survive restarts.
- Human-in-the-loop or external-event waits.
- Complex orchestration across multiple services with compensating logic.
- Scheduled or delayed tasks.
- Any code that you want to be idempotent-by-default without writing idempotency wrappers yourself.
Where a durable runtime alone falls short
- Webhook ingress is an afterthought. Most runtimes ship a generic HTTP endpoint or "webhook trigger" with basic authentication. That's not the same as 120+ provider-specific signature verifications, replay, and edge dedup. If you're receiving webhooks from external providers and you only have a runtime, you're writing the ingress layer yourself.
- Per-execution pricing can bite at high ingress volume. Runtimes generally charge per execution or per compute-second. For a simple "receive webhook, run one-line side effect, return" pattern at 100K events/day, you'll pay runtime prices to do work that a queue could handle. Whether this matters depends on your volume and your side-effect complexity.
- Language and framework lock-in. Trigger.dev is TypeScript only. Inngest has TypeScript and Python SDKs. Hatchet covers TypeScript, Python, Go, and Ruby. Your code becomes coupled to the runtime's SDK and conventions, and changing runtimes later means rewriting.
- Execution observability is not webhook observability. A runtime's dashboard tells you what happened to your function after it ran. It doesn't tell you what the webhook provider actually sent, whether the signature was valid, or whether a duplicate event was rejected. Those are separate questions.
- Overhead for simple cases. If your "agent workflow" is one LLM call and one database write, you may not need a workflow DSL. A reliable queue and a plain HTTP handler might be enough.
How they fit together
In most mature production systems, the two categories work in sequence.
```mermaid
flowchart TB
  producer["Webhook producer"]
  gateway["<b>Webhook gateway</b><br/>• Verifies signature<br/>• Deduplicates<br/>• Filters<br/>• Transforms<br/>• Queues<br/>• Retries on failure<br/>• Replay & inspect"]
  runtime["<b>Durable runtime</b><br/>• Runs agent code<br/>• Per-step retries<br/>• Survives crashes<br/>• Waitpoints<br/>• Streaming<br/>• Scheduling"]
  effects["Side effects<br/>(DB, email, Slack, ...)"]
  producer -->|signed HTTP POST| gateway
  gateway -->|clean, verified event| runtime
  runtime --> effects
```
Concretely, with a webhook gateway in front of a runtime, the flow looks like:
- Stripe fires `invoice.paid`. The signed request hits your Hookdeck URL.
- Hookdeck verifies the Stripe signature and drops the event onto its durable queue.
- Hookdeck forwards the cleaned event to a Trigger.dev HTTP endpoint with its own auth.
- Trigger.dev accepts the event and starts a task: call LLM, write to DB, send Slack notification.
- If the LLM call fails, Trigger.dev retries just that step. If the whole task crashes, it resumes from the last check-point. If a human needs to approve the Slack message before it sends, the task waits.
- At every layer, you get observability appropriate to that layer — Hookdeck tells you what the webhook did; Trigger.dev tells you what the task did.
When you need one, the other, or both
A decision framework, with the caveat that every real system is messier than a framework.
Gateway only
You probably don't need a runtime if all of the following are true:
- Each webhook triggers a small number of discrete side effects (1–3), not a multi-step workflow.
- The total work fits comfortably in a few seconds to a minute.
- You don't need to pause for external events or human approval.
- You don't need to schedule follow-up work.
- Per-event cost matters to you more than workflow ergonomics.
Example: Stripe sends `customer.subscription.updated`, you update your internal user record, update a search index, and fire a Slack notification. That's a queue-and-handler shape, not a workflow shape. Hookdeck in front of a plain HTTP handler on your existing application is enough.
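The queue-and-handler shape itself is small. A sketch with an in-memory queue and hypothetical helpers (the gateway gives you the durable, retried version of the queueing part): acknowledge the delivery immediately, then run the handful of side effects off the request path.

```javascript
// The queue-and-handler shape: ack fast, do the work asynchronously.
// In-memory queue for illustration; a gateway provides durable queueing.
const queue = [];

function receiveWebhook(event, respond) {
  queue.push(event);   // enqueue the already-verified event
  respond(200, 'OK');  // ack immediately — nothing slow on the request path
}

async function drainQueue(handlers) {
  while (queue.length > 0) {
    const event = queue.shift();
    // A small, fixed set of side effects — not a workflow.
    for (const handle of handlers) {
      await handle(event);
    }
  }
}
```

If one of the handlers fails, the gateway's delivery retry re-runs the handful of cheap side effects — acceptable here precisely because there's no expensive multi-step workflow to protect.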
Runtime, with a gateway in front
You want a durable runtime if any of the following are true:
- You have workflows with real structure — five or more discrete steps, branching logic, fan-out/fan-in.
- You need to wait on external events, human approvals, or long-running background processes.
- Your agent work spans minutes, hours, or days.
- You want scheduled or delayed follow-ups.
- Your workload is expensive per-step and you don't want to re-run successful steps on retry.
You still want a gateway in front, because the ingress problems don't go away. You just also need the runtime to handle the execution problems. This is the common case for anything you'd call an "agent workflow" with real complexity.
When you only think you need a runtime
The trap is reaching for a workflow engine before you need one. A well-queued, well-retried, well-observed webhook delivery into a stateless handler handles a surprising amount of agent work. Workflow engines are wonderful when you have workflows; they're overhead when you have queues.
Ask yourself honestly: is your "workflow" actually a workflow, or is it three function calls you're worried will fail? If it's the latter, a good gateway with a solid handler gets you most of the way there, and you can reach for a runtime when the workflow actually emerges.
A few common misconceptions
"My runtime has a webhook trigger, so I don't need a gateway." The runtime's webhook trigger is usually a generic HTTP endpoint with basic auth. It doesn't verify Stripe's signature, it doesn't dedupe Shopify's retries, it doesn't give you a replay UI when production breaks. You can build those yourself inside your runtime tasks — you'll be rebuilding what a gateway does, one provider at a time.
"My gateway has retries, so I don't need a runtime." The gateway retries delivery. If your endpoint returns 500 halfway through a seven-step agent workflow, the gateway will re-send the event, and your handler will re-run all seven steps — including the expensive LLM calls that already succeeded. Check-pointed execution is a runtime thing.
"I'll just build it myself." The ingress layer is a lot harder than it looks. The execution layer is a lot harder than it looks. Both have been built and maintained at scale by teams who specialise in them. Build-vs-buy remains a valid decision, but the "it's just a queue" estimate is usually wrong by an order of magnitude.
Wrapping up
Running your agent inside the webhook handler works until it doesn't. When it stops working, the answer isn't usually "replace my webhook handler with a workflow engine." The answer is usually: put a gateway in front to make the ingress reliable, and reach for a runtime only when your workflow is actually a workflow.
The two categories do different things. Use both where you need both. Use just the gateway when the work is simple enough. Don't pay workflow-engine prices for work that a queue and a handler could do, and don't hand-roll webhook ingress when a gateway does it better.
If you're at the "my handler timed out again" moment and want a place to start, Hookdeck's free tier handles the ingress side in a few minutes, and the Hookdeck CLI lets you debug the whole flow locally before you ship. If you then realize you need a runtime, our Hookdeck + Trigger.dev integration guide shows the two-layer pattern end to end.