The wall at eight hundred
Postgres was fine until customer eight hundred and a sudden surge in cross-tenant analytics queries. Dashboard load times went from a comfortable four hundred milliseconds to seven seconds. The CPU was not the bottleneck; the I/O pattern was. Analytics queries scan; OLTP queries point-lookup. The two share an index strategy badly.
We tried partitioning. We tried materialised views. Both helped at the margins. Neither closed the gap on the queries that mattered.
Why ClickHouse
ClickHouse's columnar storage is a direct fit for analytics queries: scan a few columns across millions of rows. The ratio of speed-up on our representative workloads was between fifteen and forty times, depending on query shape. The trade-off is write amplification and a different operational story.
We kept Postgres as the source of truth and stream changes via CDC into ClickHouse. The transactional tier never sees an analytics query; the analytics tier never serves a write. Both are happier.
The migration plan that worked
Six weeks, three engineers, one rollback rehearsal. We migrated dashboards in slices: the simplest first, the cross-tenant rollups last. Every slice had a parallel-run window where the dashboard read from both tiers and the engineering team compared values. The single discrepancy we found was a timezone-handling bug we would have caught in production a quarter later.
The multi-entity payroll work uses some of the same patterns: row-level isolation on writes, materialised rollups on reads.
What we would do differently
Start the migration sooner. We waited too long, and customer-facing dashboards were degraded for two months before we shipped the first slice. We thought we could optimise our way out; we could not.