System Design Lessons From Scaling to 1M Users

What I learned about caching, databases, and distributed systems while scaling a B2B platform.

Gaurav Talesara

AI Systems Engineer · Agentic Systems Architect

Feb 8, 2026 · 10 min read
Scaling work is less about isolated optimizations and more about understanding load, latency, and failure domains.

The Journey to 1M

Scaling from 1,000 to 1,000,000 users isn't 1000x harder. But it does require fundamentally different thinking about architecture decisions.

Lesson 1: Cache Everything (But Invalidate Carefully)

Caching gave us the biggest performance wins. We cached:
- Database query results
- API responses
- Computed aggregations
- Session data

The hard part isn't caching; it's invalidation. We learned to give every cached entry an explicit lifetime and to prefer TTL-based expiration, which bounds how stale an entry can get, over complex invalidation logic.
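
As an illustration only (not the platform's actual code), a TTL-based cache can be sketched in a few lines of Python. The class name `TTLCache`, the `get_or_load` method, and the injectable `clock` parameter are all assumptions made for this example; a real deployment would more likely put the same idea behind a shared cache service.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Hypothetical in-process cache: every entry expires ttl_seconds after it is stored."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: serve the cached value
        value = loader()  # missing or expired: recompute from the source of truth
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: Hashable) -> None:
        """Explicit eviction for the few writes that cannot wait for the TTL."""
        self._entries.pop(key, None)
```

The TTL is the only invalidation rule most keys need: a stale value can survive at most `ttl_seconds`, so the lifetime becomes an explicit, per-cache decision rather than logic scattered across write paths.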

Lesson 2: Your Database Is Lying to You

"The query is fast" means nothing without load testing. We had queries that took 10ms with 100 concurrent users and 10 seconds with 10,000.

Solutions that helped:
- Read replicas for read-heavy workloads
- Connection pooling (seriously, do this first)
- Query optimization based on actual production patterns
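
To make the connection-pooling point concrete, here is a minimal sketch under stated assumptions: it uses Python's built-in `sqlite3` purely as a stand-in database, and `ConnectionPool` is a hypothetical class written for this example. In practice the pool that ships with your database driver or framework is the better choice.

```python
import queue
import sqlite3
from contextlib import contextmanager
from typing import Iterator


class ConnectionPool:
    """Hypothetical fixed-size pool: connections are opened once and reused."""

    def __init__(self, database: str, size: int) -> None:
        self._idle: "queue.Queue[sqlite3.Connection]" = queue.Queue(maxsize=size)
        for _ in range(size):
            # check_same_thread=False lets a connection move between threads;
            # the pool ensures each one has only a single borrower at a time.
            self._idle.put(sqlite3.connect(database, check_same_thread=False))

    @contextmanager
    def connection(self, timeout: float = 5.0) -> Iterator[sqlite3.Connection]:
        # Blocks until a connection is free; raises queue.Empty after `timeout`.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always return the connection to the pool


pool = ConnectionPool(":memory:", size=2)
with pool.connection() as conn:
    row = conn.execute("SELECT 1").fetchone()
```

The fixed size is the point: under a traffic spike, requests wait briefly for a free connection instead of opening thousands of new ones against the database.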

Lesson 3: Async Everything

We moved as much as possible out of the request path:
- Email sending → queue
- Analytics tracking → queue
- Webhook delivery → queue

This made our APIs faster and more resilient: a request no longer waits on, or fails because of, a slow email provider or webhook endpoint.
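
The pattern can be sketched with nothing but the standard library. This is an assumption-laden illustration: `BackgroundQueue`, its `handlers` mapping, and the job names are hypothetical, and an in-process queue loses its jobs if the process dies, so a production system would normally use a durable message broker instead.

```python
import queue
import threading
from typing import Any, Callable, Dict, Optional, Tuple

Job = Tuple[str, Dict[str, Any]]


class BackgroundQueue:
    """Hypothetical in-process job queue drained by one worker thread."""

    def __init__(self, handlers: Dict[str, Callable[[Dict[str, Any]], None]]) -> None:
        self._handlers = handlers
        self._jobs: "queue.Queue[Optional[Job]]" = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def enqueue(self, kind: str, payload: Dict[str, Any]) -> None:
        """Called from the request path: returns immediately."""
        self._jobs.put((kind, payload))

    def _run(self) -> None:
        while True:
            job = self._jobs.get()
            try:
                if job is None:  # sentinel: stop the worker
                    return
                kind, payload = job
                self._handlers[kind](payload)
            except Exception as exc:
                # A failing job must not take down the worker or the API.
                print(f"job failed: {exc!r}")
            finally:
                self._jobs.task_done()

    def close(self) -> None:
        """Process everything already queued, then stop the worker."""
        self._jobs.put(None)
        self._worker.join()
```

A request handler then only calls something like `jobs.enqueue("email", {"to": address})` and responds; the slow work happens on the worker, and a handler that raises is logged without affecting the jobs behind it.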

Lesson 4: Observability Isn't Optional

At scale, you can't debug with logs alone. We invested in:
- Distributed tracing
- Real-time metrics
- Alerting on business-critical paths

The cost was worth it. We caught issues before customers noticed.
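
As a rough sketch of the metrics-plus-alerting idea, the example below records latencies per operation and checks a percentile against a threshold. `LatencyMetrics`, `timed`, and `breaches` are hypothetical names, and the whole thing is in-process only; it does not replace a metrics backend or a distributed tracing system, it just shows the shape of alerting on a tail percentile rather than an average.

```python
import math
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import DefaultDict, Iterator, List


class LatencyMetrics:
    """Hypothetical in-process recorder for per-operation latency samples."""

    def __init__(self) -> None:
        self._samples: DefaultDict[str, List[float]] = defaultdict(list)

    def record(self, name: str, seconds: float) -> None:
        self._samples[name].append(seconds)

    @contextmanager
    def timed(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even when the wrapped code raises.
            self.record(name, time.perf_counter() - start)

    def percentile(self, name: str, pct: float) -> float:
        """Nearest-rank percentile of the samples recorded under `name`."""
        ordered = sorted(self._samples.get(name, []))
        if not ordered:
            raise ValueError(f"no samples recorded for {name!r}")
        rank = max(1, math.ceil(pct * len(ordered) / 100))
        return ordered[rank - 1]

    def breaches(self, name: str, pct: float, threshold_seconds: float) -> bool:
        """True when the chosen percentile exceeds the alert threshold."""
        return self.percentile(name, pct) > threshold_seconds
```

Wrapping a business-critical path in `with metrics.timed("checkout"):` and periodically evaluating `metrics.breaches("checkout", 95, 0.5)` is one way to notice a regression in the slowest requests before it shows up in support tickets; the operation name and the 0.5-second threshold here are made up for the example.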

The Meta-Lesson

Most scaling problems aren't technical; they're about understanding your system's behavior under load. Invest in understanding before optimizing.