The shape of the workload
Across 2,400 customers, our webhook system delivers around eighty million events a month. Events range from new-applicant notifications to payroll-cycle-complete alerts to permission-change audit logs. Each customer subscribes to a subset of these, and matching is done per subscription.
At this scale, tail latency at one customer's endpoint can cascade into delays for others. We invested in isolation: every customer's queue is independent, so slow customers cannot back up fast ones.
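To make the isolation idea concrete, here is a minimal in-memory sketch. It is not our production implementation: the class name, the method names, and the round-robin drain are all assumptions made for illustration. The only property it is meant to show is that a failing endpoint holds up its own queue and nobody else's.

```python
from collections import defaultdict, deque


class IsolatedQueues:
    """Hypothetical sketch: one FIFO queue per customer."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def enqueue(self, customer_id, event):
        self._queues[customer_id].append(event)

    def pending(self, customer_id):
        return len(self._queues[customer_id])

    def drain_once(self, deliver):
        """Try the head event of every customer's queue exactly once.

        `deliver(customer_id, event)` returns True on success. On failure
        the event stays at the head of that customer's queue only; other
        customers' queues keep moving.
        """
        delivered = []
        for customer_id, q in list(self._queues.items()):
            if not q:
                continue
            if deliver(customer_id, q[0]):
                delivered.append((customer_id, q.popleft()))
        return delivered
```

In this sketch a customer whose endpoint keeps failing simply accumulates a backlog in its own deque, while each pass still delivers one event for every healthy customer.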
Idempotency keys and retries
Every event carries a UUID idempotency key. Customers are expected to dedupe on the key; in practice, many do not, so we have to be defensive. We retry with exponential back-off for forty-eight hours, then dead-letter the event. The dead-letter queue has a twelve-hour replay window during which customers can ask us to redeliver.
Twelve hours is short enough that we do not accumulate stale events forever, and long enough to cover most customer outage windows. Our multi-entity work reuses some of the same isolation patterns.
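The retry and replay timing described above can be sketched as follows. The forty-eight-hour retry window and the twelve-hour replay window come from the text; the one-minute base delay, the doubling factor, and the four-hour cap are assumptions chosen purely for illustration, as are the function names.

```python
from datetime import datetime, timedelta, timezone

RETRY_WINDOW = timedelta(hours=48)   # from the text
REPLAY_WINDOW = timedelta(hours=12)  # from the text
BASE_DELAY = timedelta(minutes=1)    # assumption, for illustration only
MAX_DELAY = timedelta(hours=4)       # assumption, for illustration only


def retry_offsets(base=BASE_DELAY, max_delay=MAX_DELAY, window=RETRY_WINDOW):
    """Offsets from the first attempt at which each retry would fire.

    The delay doubles after every attempt, up to `max_delay`. Once the
    next retry would land outside `window`, the event is dead-lettered.
    """
    offsets = []
    elapsed = timedelta(0)
    delay = base
    while elapsed + delay <= window:
        elapsed += delay
        offsets.append(elapsed)
        delay = min(delay * 2, max_delay)
    return offsets


def can_replay(dead_lettered_at, now, window=REPLAY_WINDOW):
    """True while a dead-lettered event is still inside the replay window."""
    return now - dead_lettered_at <= window
```

Capping the delay keeps retries reasonably frequent late in the window, so an endpoint that recovers after a long outage is not left waiting on one very long sleep.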
What customers usually get wrong
Three things. First, returning a 200 when you meant to return a 400 (we keep retrying if you do not signal). Second, processing the event synchronously inside the webhook handler and timing out (use a queue). Third, not handling out-of-order delivery (ordering is best-effort, not guaranteed).
Our docs cover all three explicitly. The help centre has the integration patterns we recommend.
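As a rough illustration of the three points above, here is a receiver-side sketch that uses only the Python standard library. It is not taken from our docs. The payload field names (`id`, `resource_id`, `occurred_at`) are hypothetical, the in-memory set and queue stand in for a durable store and a real job queue, and comparing timestamps as strings assumes they are ISO-8601 UTC in one consistent format.

```python
import json
import queue

seen_keys = set()           # stand-in for a durable idempotency store
work_queue = queue.Queue()  # stand-in for a real job queue
latest_applied = {}         # resource_id -> occurred_at of last applied event


def handle_webhook(body: bytes) -> int:
    """Return the HTTP status to send back. No heavy work happens here."""
    try:
        event = json.loads(body)
        key = event["id"]             # hypothetical idempotency-key field
        rid = event["resource_id"]    # hypothetical field
        ts = event["occurred_at"]     # hypothetical field
    except (ValueError, KeyError, TypeError):
        return 400                    # reject explicitly, do not return 200
    if key in seen_keys:
        return 200                    # duplicate delivery: acknowledge, skip
    seen_keys.add(key)
    work_queue.put((rid, ts, event))  # defer processing to a worker
    return 200


def process_next() -> bool:
    """Worker step: apply the event unless a newer one was already applied."""
    rid, ts, _event = work_queue.get_nowait()
    if ts < latest_applied.get(rid, ""):
        return False                  # arrived out of order: drop as stale
    latest_applied[rid] = ts
    return True
```

The handler only validates, dedupes, and enqueues, so it can respond well inside a delivery timeout. Dropping stale events is one possible out-of-order policy; a receiver that needs every event would instead buffer and reorder them.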
What we are working on
Two things: per-event filtering at the subscription level (today most filtering is at the event-type level), and per-customer SLAs for delivery latency on priority events (today the SLA is uniform). Both are coming in 2026. Our release notes track the timeline.