Webhooks at scale: what we learned at eighty million events, pPULSE blog

How we deliver, retry, and dead-letter webhooks across 2,400 customers without losing events: idempotency keys, exponential back-off, and the quietly impactful twelve-hour replay window.

By Rohan Mehta · December 5, 2025 · 12 min read

The shape of the workload

Across 2,400 customers, our webhook system delivers around eighty million events a month. Events range from new-applicant notifications to payroll-cycle-complete alerts to permission-change audit logs. Each customer subscribes to a subset of event types, and we match every event against each subscription individually.
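To make per-subscription matching concrete, here is a minimal sketch. The `Subscription` shape, the event-type names, and the endpoint URLs are hypothetical and are not taken from our production schema; the sketch only shows the idea of checking an event against each subscription on its own.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Subscription:
    # All field names here are illustrative assumptions.
    customer_id: str
    endpoint_url: str
    event_types: FrozenSet[str]


def match_subscriptions(
    customer_id: str, event_type: str, subscriptions: List[Subscription]
) -> List[Subscription]:
    """Return every subscription of this customer that wants this event type."""
    return [
        sub
        for sub in subscriptions
        if sub.customer_id == customer_id and event_type in sub.event_types
    ]


subscriptions = [
    Subscription("cust-1", "https://example.com/hr", frozenset({"applicant.created"})),
    Subscription(
        "cust-1",
        "https://example.com/audit",
        frozenset({"permission.changed", "applicant.created"}),
    ),
    Subscription("cust-2", "https://example.org/hooks", frozenset({"payroll.cycle_completed"})),
]

matched = match_subscriptions("cust-1", "applicant.created", subscriptions)
```

One event can fan out to several endpoints of the same customer, which is why the match runs per subscription rather than per customer.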

At this scale, a tail-latency spike at one customer's endpoint can cascade into delays for everyone else. We invested in isolation: every customer's queue is independent, so a slow customer cannot back up a fast one.
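The isolation idea can be sketched in a few lines. This is an in-memory illustration, not our dispatcher: the class name `IsolatedQueues`, the `pause` mechanism, and the numeric clock are all assumptions made for the example.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class IsolatedQueues:
    """One FIFO queue per customer; a paused customer never blocks the others."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[str]] = {}
        self._paused_until: Dict[str, float] = {}

    def enqueue(self, customer_id: str, event_id: str) -> None:
        self._queues.setdefault(customer_id, deque()).append(event_id)

    def pause(self, customer_id: str, until: float) -> None:
        # Called when a customer's endpoint is slow or failing (hypothetical policy).
        self._paused_until[customer_id] = until

    def take_ready(self, now: float) -> List[Tuple[str, str]]:
        """Pop at most one event from each customer whose queue is not paused."""
        ready = []
        for customer_id, events in self._queues.items():
            if not events:
                continue
            if self._paused_until.get(customer_id, 0.0) > now:
                continue  # skip only this customer; keep serving the rest
            ready.append((customer_id, events.popleft()))
        return ready

    def depth(self, customer_id: str) -> int:
        return len(self._queues.get(customer_id, ()))


queues = IsolatedQueues()
queues.enqueue("slow-customer", "evt-1")
queues.enqueue("slow-customer", "evt-2")
queues.enqueue("fast-customer", "evt-3")
queues.pause("slow-customer", until=100.0)

first_pass = queues.take_ready(now=10.0)
```

Here `first_pass` contains only the fast customer's event, while the slow customer's two events wait in their own queue until the pause expires.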

Idempotency keys and retries

Every event carries a UUID idempotency key. Customers are expected to dedupe on the key; in practice, many do not, so we have to be defensive. We retry with exponential back-off for forty-eight hours, then dead-letter the event. The dead-letter queue has a twelve-hour replay window during which customers can ask us to redeliver.
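A rough sketch of the retry decision follows. Only the forty-eight-hour window comes from this post; the thirty-second base delay, the doubling factor, and the one-hour cap are assumed values chosen for illustration, and jitter is left out to keep the example deterministic.

```python
from typing import Optional

RETRY_WINDOW_SECONDS = 48 * 60 * 60  # forty-eight hours, as described above
BASE_DELAY_SECONDS = 30              # assumed, not our production value
MAX_DELAY_SECONDS = 60 * 60          # assumed cap on a single delay


def backoff_delay(attempt: int) -> int:
    """Delay before retry number `attempt` (1-based): doubles each time, up to the cap."""
    return min(BASE_DELAY_SECONDS * 2 ** (attempt - 1), MAX_DELAY_SECONDS)


def next_retry_at(first_attempt_at: float, now: float, attempt: int) -> Optional[float]:
    """Return the time of the next retry, or None when the event should be dead-lettered."""
    candidate = now + backoff_delay(attempt)
    if candidate - first_attempt_at > RETRY_WINDOW_SECONDS:
        return None  # the retry would land outside the window: dead-letter instead
    return candidate


delays = [backoff_delay(n) for n in range(1, 9)]
```

With these assumed constants the delay grows from thirty seconds to the one-hour cap by the eighth retry, so a long outage is probed roughly hourly until the window closes.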

Twelve hours is short enough that we do not accumulate stale events forever, and long enough to cover most customer outage windows. Our multi-entity work reuses some of the same isolation patterns.
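The replay check itself is small. The sketch below assumes the twelve-hour window starts at the moment an event is dead-lettered, which this post does not state explicitly; the function name `can_replay` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

REPLAY_WINDOW = timedelta(hours=12)


def can_replay(dead_lettered_at: datetime, requested_at: datetime) -> bool:
    """True if the redelivery request arrives within the replay window.

    Assumes the window is measured from the dead-lettering time.
    """
    elapsed = requested_at - dead_lettered_at
    return timedelta(0) <= elapsed <= REPLAY_WINDOW


dead_lettered = datetime(2025, 12, 5, 0, 0, tzinfo=timezone.utc)
within = can_replay(dead_lettered, dead_lettered + timedelta(hours=11))
too_late = can_replay(dead_lettered, dead_lettered + timedelta(hours=13))
```

Under that assumption, forty-eight hours of retries plus twelve hours of replay means an event stays recoverable for up to sixty hours after the first delivery attempt.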

What customers usually get wrong

Three things. First, returning a 200 when you meant to return a 400 (we keep retrying if you do not signal otherwise). Second, processing the event synchronously inside the webhook handler and timing out (use a queue). Third, not handling out-of-order delivery (in-order delivery is best-effort, not guaranteed).
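On the receiving side, the three points above can be sketched as one small handler. This is a framework-free illustration rather than a recommended implementation: the payload fields (`idempotency_key`, `type`, `resource_id`, `occurred_at`) are hypothetical, the in-memory set and deque stand in for a durable store and a real queue, and ordering is resolved by comparing UTC ISO 8601 timestamps, which is an assumption about what the payload carries.

```python
from collections import deque
from typing import Any, Deque, Dict, Set

REQUIRED_FIELDS = ("idempotency_key", "type", "resource_id", "occurred_at")


class WebhookReceiver:
    def __init__(self) -> None:
        self.seen_keys: Set[str] = set()                # stand-in for a durable dedupe store
        self.work_queue: Deque[Dict[str, Any]] = deque()  # stand-in for a real job queue
        self.latest_applied: Dict[str, str] = {}        # resource_id -> newest occurred_at applied

    def handle(self, payload: Dict[str, Any]) -> int:
        """Validate, dedupe, enqueue, and return an HTTP status code quickly."""
        if any(name not in payload for name in REQUIRED_FIELDS):
            return 400  # say so when the payload is rejected
        key = payload["idempotency_key"]
        if key in self.seen_keys:
            return 200  # duplicate delivery: acknowledge without reprocessing
        self.seen_keys.add(key)
        self.work_queue.append(payload)  # heavy work happens later, off the request path
        return 200

    def process_next(self) -> bool:
        """Worker step: apply the next event unless a newer one was already applied."""
        event = self.work_queue.popleft()
        resource = event["resource_id"]
        # Same-format UTC ISO 8601 strings sort chronologically.
        if self.latest_applied.get(resource, "") >= event["occurred_at"]:
            return False  # stale, out-of-order event: skip it
        self.latest_applied[resource] = event["occurred_at"]
        return True


receiver = WebhookReceiver()
newer = {"idempotency_key": "k1", "type": "permission.changed",
         "resource_id": "user-1", "occurred_at": "2025-12-05T10:00:00Z"}
older = {"idempotency_key": "k0", "type": "permission.changed",
         "resource_id": "user-1", "occurred_at": "2025-12-05T09:00:00Z"}

statuses = [receiver.handle(newer), receiver.handle(newer), receiver.handle(older)]
applied = [receiver.process_next(), receiver.process_next()]
```

The handler returns as soon as the event is queued, so endpoint latency stays flat no matter how expensive the downstream processing is.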

Our docs cover all three explicitly. The help centre has the integration patterns we recommend.

What we are working on

Two things: per-event filtering at the subscription level (today most filtering happens at the event-type level), and per-customer SLAs for delivery latency on priority events (today the SLA is uniform). Both are coming in 2026. Our release notes track the timeline.
