Why Redis is a great cache, but a bad webhook queue
Redis is one of the most useful pieces of infrastructure of the last fifteen years. It is a cache, a counter, a session store, a rate limiter, a leaderboard, a pub/sub bus, and a small-but-fast database when you need one. If you have a backend, you almost certainly have Redis running somewhere, and that's a defensible decision. So it's understandable that you find yourself reaching for it even when it's not quite the right tool for the job. Case in point: webhooks.
The pull toward Redis here is real. You probably already have BullMQ, Sidekiq, RQ, or a homegrown LPUSH/BRPOP loop in production for background jobs, and webhooks look, on the surface, like another background job source. So the ingestion endpoint becomes a thin HTTP handler that pushes the body onto a Redis list and returns 200. It works in staging. It works in early production, too. But eventually, it doesn't.
What "webhook traffic" actually looks like
Inbound webhook traffic differs from the work you'd normally hand to a job queue:
- Producers are not your code. Stripe, Shopify, GitHub, Twilio, your customers' systems — they decide the rate, the burst shape, and the retry behaviour. You don't.
- Bursts are spiky and correlated. A Black Friday refund window, a Shopify app install spike, a GitHub Action that fires on every push across an org — these arrive in seconds, not minutes.
- Duplicates are normal. Most providers retry on any non-2xx, and several retry on timeouts even when you did process the event. You will see the same event multiple times.
- You owe an answer in milliseconds. If you don't 2xx fast enough, the provider retries, marks your endpoint unhealthy, or both.
- Events have long tails. Three days from now, support will ask why customer X's invoice didn't sync. You need that one event out of a billion.
Hold those properties in mind. They're what each Redis pattern collides with.
Pattern 1: LPUSH and BRPOP, the job-queue shape
This is the most common pattern, and the one most teams arrive at first. An HTTP handler receives the webhook, validates the signature, LPUSHes the payload onto a list, returns 200. A worker pool does BRPOP and processes.
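A minimal sketch of that shape, for orientation. It assumes a redis-py style client (anything exposing lpush and brpop), a hypothetical list name webhooks:inbound, and a signature scheme of hex HMAC-SHA256 over the raw body. Real providers each define their own scheme, so verify_signature is illustrative only.

```python
import hashlib
import hmac
import json

QUEUE_KEY = "webhooks:inbound"  # hypothetical list name


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Assumed scheme: hex HMAC-SHA256 of the raw body. Providers differ."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)


def ingest(client, secret: bytes, body: bytes, signature_hex: str) -> int:
    """Return the HTTP status the handler would send back."""
    if not verify_signature(secret, body, signature_hex):
        return 401
    client.lpush(QUEUE_KEY, body)  # the event now lives only in Redis memory
    return 200


def work_once(client, handle, timeout: int = 5) -> bool:
    """Pop one payload and process it. Returns False if the queue stayed empty."""
    item = client.brpop(QUEUE_KEY, timeout=timeout)
    if item is None:
        return False
    _key, payload = item
    # BRPOP has already removed the payload; if handle() raises, it is gone.
    handle(json.loads(payload))
    return True
```

Even in this small form, the worker shows a gap: the payload leaves the list before processing starts, so a crash mid-handler loses the event unless you add a separate processing list.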
The first thing to be specific about is durability. Redis offers two persistence modes, and the defaults are not what most people assume.
- RDB snapshotting writes a point-in-time dump on an interval. If Redis crashes between snapshots, every event that arrived since the last snapshot is gone. With the default save policy, that window can be minutes.
- AOF (append-only file) logs every write. With appendfsync everysec (the recommended default) you can still lose up to one second of writes on a crash. With appendfsync always, you fsync on every write, which gives you durability close to a real disk-backed queue at the cost of significant throughput.
So the first question for a Redis-backed webhook queue is: which of those modes is your production Redis running, and have you measured what one second of webhook ingestion looks like during a burst? For a busy Shopify or Stripe endpoint, that one second can be hundreds of events.
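To put a rough number on that question, the sketch below turns persistence settings into a worst-case loss window and multiplies it by a burst rate. The function names are hypothetical, the inputs are assumed to be the string values you would read from CONFIG GET appendonly, appendfsync, and save (with save in its space-separated form), and the burst rate is something you have to measure yourself.

```python
from typing import Optional


def loss_window_seconds(appendonly: str, appendfsync: str, save: str) -> Optional[float]:
    """Rough window of acknowledged writes that a crash can lose.

    Returns None when the configuration gives no fixed bound.
    """
    if appendonly == "yes":
        if appendfsync == "always":
            return 0.0
        if appendfsync == "everysec":
            return 1.0
        return None  # appendfsync no: the OS decides when to flush
    # RDB only: `save` holds "<seconds> <changes>" pairs, e.g. "3600 1 300 100".
    numbers = [int(n) for n in save.split()]
    intervals = numbers[0::2]
    # A snapshot fires only when both the interval and the change count are met,
    # so the shortest interval is an optimistic estimate of the window.
    return float(min(intervals)) if intervals else None


def events_at_risk(events_per_second: float, window: Optional[float]) -> Optional[float]:
    """Events exposed to loss during the window, or None if the window is unbounded."""
    return None if window is None else events_per_second * window
```

For example, with an assumed burst of 300 events per second and everysec, events_at_risk(300, loss_window_seconds("yes", "everysec", "")) works out to 300 events exposed per crash.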
The second issue is memory. A Redis list lives in RAM. If your consumers fall behind (a downstream API is slow, a deploy paused workers, you got rate-limited), the queue grows until it hits maxmemory. What happens then depends on maxmemory-policy. If it's noeviction, writes start failing, your ingestion endpoint returns 500s, and the upstream provider starts retrying, amplifying the problem. If it's one of the eviction policies, Redis starts deleting keys to free memory: any key under the allkeys-* policies, only keys with a TTL under the volatile-* policies. Those keys can include your queue, and because eviction removes whole keys, the entire list goes at once. There is no warning, no dead-letter, no log line that says "we dropped your webhook." The queue is just gone, and the only trace is the evicted_keys counter in INFO.
This is the failure mode that surprises people most. An eviction policy chosen sensibly for the cache role of Redis is actively wrong for the queue role. The same instance cannot serve both well.
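A small check along these lines can make the mismatch visible before it bites. This is a sketch under assumptions: queue_risk and check_instance are hypothetical helpers, the client is assumed to behave like redis-py (config_get returning a dict, ttl returning a negative number when the key has no expiry), and some managed Redis services disable CONFIG entirely. Note that the volatile-* policies only consider keys that have a TTL.

```python
def queue_risk(maxmemory_policy: str, queue_key_has_ttl: bool = False) -> str:
    """Describe what memory pressure does to a queue key under a given policy."""
    if maxmemory_policy == "noeviction":
        return "writes-fail"
    if maxmemory_policy.startswith("allkeys-"):
        return "queue-can-be-evicted"
    if maxmemory_policy.startswith("volatile-"):
        # Only keys with a TTL are candidates; once none are left, writes fail.
        return "queue-can-be-evicted" if queue_key_has_ttl else "writes-fail"
    raise ValueError(f"unknown maxmemory-policy: {maxmemory_policy!r}")


def check_instance(client, queue_key: str) -> str:
    """Read the live policy and the queue key's TTL, then classify the risk."""
    policy = client.config_get("maxmemory-policy")["maxmemory-policy"]
    has_ttl = client.ttl(queue_key) >= 0  # -1: no expiry, -2: key missing
    return queue_risk(policy, has_ttl)
```

Neither outcome is good for a webhook queue: one drops events silently, the other turns memory pressure into failed ingestion and provider retries.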
Third, replication. In a typical HA setup you have a primary and one or more replicas, with Sentinel or a managed equivalent handling failover. Redis replication is asynchronous. When the primary fails over to a replica, any writes the primary acknowledged but had not yet replicated are lost. For a cache, this is fine — the value will be recomputed. For a webhook queue, this is a webhook that the upstream considers delivered and your system has no record of.
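One mitigation teams reach for is the WAIT command, which blocks until a given number of replicas have acknowledged the connection's earlier writes. The sketch below is an assumption-laden illustration, not a fix: push_with_replica_ack is a hypothetical helper, the client is assumed to be redis-py style, and a non-transactional pipeline is used so that LPUSH and WAIT travel on the same connection. WAIT narrows the loss window but does not make Redis strongly consistent.

```python
def push_with_replica_ack(client, key: str, payload: bytes,
                          min_replicas: int = 1, timeout_ms: int = 100) -> bool:
    """LPUSH, then ask how many replicas confirmed the write within timeout_ms.

    Returns True only if at least min_replicas acknowledged in time.
    """
    pipe = client.pipeline(transaction=False)  # same connection, no MULTI/EXEC
    pipe.lpush(key, payload)
    pipe.wait(min_replicas, timeout_ms)
    _list_length, acked = pipe.execute()
    return acked >= min_replicas
```

If this returns False and the handler answers with a non-2xx, the provider will retry an event that is already on the primary's list, so this approach trades lost events for duplicates and adds latency to every acknowledgement.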
None of this means Redis "can't" do it. It means the operational envelope is narrower than it looks, and the failure modes are silent.
Try it locally
Before going further, a practical aside. If you want to feel the difference between "I have a queue" and "I have a webhook pipeline," the fastest path is to point a real webhook source at your laptop. The Hookdeck CLI forwards inbound webhooks to localhost with a stable URL, retries, and an inspector, so you can replay the same event repeatedly while you change your handler. It's free and takes about thirty seconds to set up. Worth it even if you end the evaluation deciding to keep Redis.
Pattern 2: Redis Streams
Streams are the modern answer to "use Redis as a queue." They're a real append-only log with consumer groups, per-consumer pending lists, explicit XACK, and XCLAIM for stuck messages.
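For readers who haven't used them, here is a minimal sketch of the Streams shape. It assumes a redis-py style client, hypothetical stream and group names, and a consumer group created beforehand (for example with XGROUP CREATE). The handle callback is whatever your worker does with an event.

```python
STREAM = "webhooks:stream"  # hypothetical stream name
GROUP = "webhook-workers"   # hypothetical consumer group


def enqueue(client, body: bytes, max_len: int = 1_000_000):
    """XADD with approximate trimming; entries trimmed past max_len are gone."""
    return client.xadd(STREAM, {"body": body}, maxlen=max_len, approximate=True)


def consume_batch(client, consumer: str, handle,
                  count: int = 10, block_ms: int = 5000) -> int:
    """Read new entries for this consumer and ack each one after handle() succeeds."""
    response = client.xreadgroup(GROUP, consumer, {STREAM: ">"},
                                 count=count, block=block_ms)
    processed = 0
    for _stream, entries in response or []:
        for entry_id, fields in entries:
            handle(fields)  # if this raises, the entry stays in the pending list
            client.xack(STREAM, GROUP, entry_id)
            processed += 1
    return processed
```

The difference from the list pattern is that an unacknowledged entry remains in the consumer's pending list, where another worker can pick it up with XCLAIM.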
But the same primitives that make Streams a better queue than a list don't make Redis a better webhook gateway. A few specifics:
- In-memory first, with a retention ceiling. Streams live in RAM. To bound memory you call XADD ... MAXLEN ~ N or XTRIM. Once trimmed, the event is gone. Choosing N is choosing how far back you can replay, and that number competes with every other thing Redis is doing on that instance.
- No HTTP ingestion. Something still has to receive the webhook, verify the signature, and XADD it. That something is your code, on your servers, behind your load balancer, with your own backpressure story. Streams don't help with the part of the pipeline that's exposed to the internet.
- No source-aware retry. XCLAIM lets you reprocess a message a consumer failed to ack. It doesn't know that Stripe wants exponential backoff with jitter, that your internal API returns 429 with a Retry-After, or that a 4xx from the destination means "stop retrying" while a 5xx means "back off and try again."
- No dedupe across producers. If you scale ingestion horizontally and two pods receive the same retried webhook, both will XADD. You'll dedupe downstream (using, presumably, another Redis key with a TTL) and now you have a second consistency problem.
- No replay UI, no search, no observability. You can XRANGE by ID. You cannot search by "events from customer 1234 in the last 72 hours where the destination returned 502." That's a query you'll write, a UI you'll build, and an index you'll maintain.
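To give a sense of what "source-aware retry" means in code, here is a sketch of the decision a retry orchestrator has to make after each delivery attempt. The function name, parameters, and defaults (base delay, cap, maximum attempts) are all assumptions for illustration; real policies vary per destination.

```python
import random
from typing import Optional


def next_retry_delay(status: int, attempt: int,
                     retry_after: Optional[float] = None,
                     base: float = 1.0, cap: float = 3600.0,
                     max_attempts: int = 10) -> Optional[float]:
    """Seconds to wait before the next attempt, or None to stop.

    `attempt` is the number of attempts already made (1 after the first failure).
    """
    if 200 <= status < 300:
        return None  # delivered, nothing to retry
    if attempt >= max_attempts:
        return None  # give up; a real system would dead-letter here
    if status == 429:
        if retry_after is not None:
            return retry_after  # the destination told us when to come back
    elif 400 <= status < 500:
        return None  # other 4xx: retrying the same request will not help
    # 5xx, or 429 without Retry-After: exponential backoff with full jitter
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

None of this logic lives in Redis. XCLAIM only hands the message back; the scheduling, the per-destination state, and the stop conditions are yours to build and store somewhere.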
Streams are a good queue, but webhooks need more than that.
Pattern 3: Pub/Sub
Redis Pub/Sub has no durability. If no subscriber is connected when a message is published, the message is gone. It is not a queue and isn't a good fit for webhooks. I mention it since it occasionally shows up in early prototypes and needs to be removed before launch.
It's the wrong layer, not the wrong tool
A webhook ingestion path has at least five jobs:
- An HTTP edge that absorbs spikes, verifies signatures, and acknowledges to the upstream provider in milliseconds — before any downstream system is involved.
- Durable storage of the raw event that doesn't age out, doesn't evict under memory pressure, and survives a node failure without asynchronous-replication loss.
- Delivery and retries that understand the destination: backoff strategy, max attempts, response-code semantics, circuit-breaking when a destination is down, fan-out to multiple destinations.
- Idempotency keyed on the upstream event ID, so a provider retry doesn't become a duplicate side effect downstream.
- Observability good enough to find one event in a billion, see why it failed, and replay it without writing a script.
Redis is, generously, the queue part of job 3. It is not jobs 1, 2, 4, or 5, and was never built to be. Teams who put Redis at the webhook edge end up rebuilding a webhook gateway around it: an ingestion service, a dedupe layer, a retry orchestrator, a search index, a replay tool, a dashboard. Each piece is reasonable on its own. The sum is a system nobody on the team chose to build, owned by whoever last touched it, with operational characteristics that nobody fully understands.
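As one illustration of how these pieces get rebuilt, here is roughly what a homegrown dedupe layer for job 4 looks like on Redis. It is a sketch under assumptions: first_delivery is a hypothetical helper, the key format and the 72-hour TTL are made up, and the client is assumed to be redis-py style, where set(..., nx=True) returns a truthy value only for the first writer.

```python
def first_delivery(client, source: str, event_id: str,
                   ttl_seconds: int = 72 * 3600) -> bool:
    """Atomically claim an upstream event ID. True means nobody has seen it yet."""
    key = f"webhook:seen:{source}:{event_id}"  # hypothetical key format
    # SET key 1 NX EX ttl: only the first caller succeeds until the key expires.
    return bool(client.set(key, "1", nx=True, ex=ttl_seconds))
```

The claim itself is atomic, but the guarantee is only as strong as the instance holding the key: if the key expires, is evicted, or is lost in a failover, the next provider retry is treated as a new event.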
There's also the cost shape. Redis bills you in RAM. A durable, disk-backed queue bills you in disk, which is one to two orders of magnitude cheaper per gigabyte. At low volumes this doesn't matter, but as you scale it becomes a real number. Add HA Redis (Sentinel or Cluster, replicas, a managed tier) and the line items add up faster than they should for what is, structurally, a logbook.
What a webhook gateway actually does differently
A webhook gateway is the thing that should sit between the public internet and your Redis. It does the five jobs above as its primary purpose, not as a side effect.
Concretely, that means:
- An HTTP ingestion layer that acknowledges the upstream before anything is queued, so a slow downstream never causes a provider retry.
- Durable, disk-backed storage of every event with retention measured in months, not minutes of free memory.
- Source-aware retries with configurable backoff, max attempts, and per-destination rules.
- Idempotency at the gateway, keyed on the provider's event ID, so a retried Stripe webhook is one event in your system regardless of how many times it arrives.
- Full-text search across event bodies, headers, and metadata, with a replay button next to each one.
- Issues and alerts when a destination starts failing, before a customer files a ticket.
Hookdeck Event Gateway is one option in this space. There are others. The argument holds either way: the layer in front of your Redis matters more than the queue itself, and most of the failure modes teams attribute to "Redis problems" are really "we put Redis in the wrong place" problems.
Where to go from here
When you're ready to put something in front of your Redis (or replace the bit of it that's handling webhooks), try Hookdeck Event Gateway. Keep Redis where it's good: caching, counters, sessions, the work it was designed for. Move the webhook edge to something built for that job.
FAQs
Can Redis be used as a webhook queue?
Redis can hold webhook payloads in a list or stream, but it is the wrong layer for the webhook edge. Its durability depends on persistence configuration that can lose seconds of writes, an eviction policy chosen for caching can silently drop queued events, and asynchronous replication loses acknowledged writes on failover. It also provides no HTTP ingestion, signature verification, idempotency, or replay.
Are Redis Streams a good webhook queue?
Streams are a better queue than a Redis list (a real append-only log with consumer groups and explicit acknowledgement), but they are still in-memory first with a retention ceiling, still need your own HTTP ingestion and signature verification, and provide no source-aware retry, no cross-producer dedupe, and no replay UI or search.
What does a webhook gateway do that Redis doesn't?
A webhook gateway provides an HTTP edge that acknowledges producers in milliseconds, durable disk-backed storage with retention measured in months, source-aware retries, idempotency keyed on the provider's event ID, and full-text search and replay across event history. Redis is only the queue part of delivery and retries.