How to Prepare for a System Design Interview (Step-by-Step)
A practical framework for system design interviews: requirements, estimation, high-level design, deep dives, and trade-offs — with a study plan and example.
System design interviews are famously open-ended: the interviewer hands you a vague prompt — "design Twitter" or "design a URL shortener" — and watches how you think. There's no single correct answer, but there is a repeatable approach that separates candidates who ramble from those who get offers. This guide gives you that approach, plus a worked example and the common mistakes to avoid.
What system design interviews test
Interviewers aren't expecting you to have memorized the architecture of every major tech company. They want to see:
- Structured thinking: Can you break an ambiguous problem into concrete, manageable pieces?
- Trade-off awareness: Every design decision has a cost. Can you articulate what you're trading?
- Scalability intuition: Do you know when to introduce a cache, a queue, or a CDN — and why?
- Communication: Can you explain technical decisions clearly to a non-specialist?
The mistake most candidates make is jumping straight into drawing boxes. Interviewers penalize that because it signals you skip requirements — a fatal flaw in real engineering.
A repeatable framework
A useful acronym is RESHADED — it covers the full arc of a system design session:
| Step | What you do |
|---|---|
| Requirements | Clarify functional + non-functional requirements |
| Estimation | Back-of-envelope numbers: users, QPS, storage, bandwidth |
| Storage | Choose data model + database type |
| High-level design | Draw the major components and data flow |
| APIs | Define the key endpoints or interfaces |
| Deep dives | Go deep on the components that matter most |
| Edge cases | Failure modes, hotspots, security |
| Discussion | Trade-offs, alternatives you rejected, what you'd do next |
Step 1: Requirements (5 minutes)
Spend the first few minutes asking clarifying questions, even if the prompt seems clear. Every question you ask demonstrates engineering maturity.
- Functional requirements: What does the system actually do? What are the core user actions?
- Non-functional requirements: Scale (how many users/requests per second?), latency SLAs, availability target (99.9%? 99.99%?), consistency requirements (strong vs. eventual?), geographic distribution.
- Scope: "Should I design the analytics pipeline too, or focus on the write/read path?" — interviewers respect explicit scoping.
Step 2: Estimation
Estimation grounds the design in reality and shows you can reason quantitatively. A typical sketch:
- Average QPS = (DAU × actions per user) / 86,400 seconds per day
- Peak QPS ≈ 2–10× average
- Storage = daily writes × record size × retention period
- Bandwidth = QPS × average payload size
You won't get exact numbers — close is fine. State your assumptions out loud ("I'll assume 100 bytes per tweet on average").
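The formulas above can be turned into a few helper functions. This is a minimal sketch; the function names and the default peak factor of 5 are illustrative assumptions, not part of any standard toolkit.

```python
SECONDS_PER_DAY = 86_400


def average_qps(dau: int, actions_per_user: float) -> float:
    """Average requests per second from daily active users and per-user activity."""
    return dau * actions_per_user / SECONDS_PER_DAY


def peak_qps(avg_qps: float, peak_factor: float = 5.0) -> float:
    """Peak load as a multiple of the average (assumed factor, within the 2-10x range)."""
    return avg_qps * peak_factor


def storage_bytes(daily_writes: float, record_bytes: int, retention_days: int) -> float:
    """Total bytes retained: daily writes x record size x retention period."""
    return daily_writes * record_bytes * retention_days


def bandwidth_bytes_per_sec(qps: float, payload_bytes: int) -> float:
    """Bytes per second on the wire: QPS x average payload size."""
    return qps * payload_bytes


if __name__ == "__main__":
    # Assumed workload: 1M users, 10 requests each per day, 100-byte records kept for a year.
    avg = average_qps(dau=1_000_000, actions_per_user=10)
    print(f"average QPS ~ {avg:.0f}")                        # ~116
    print(f"peak QPS    ~ {peak_qps(avg):.0f}")
    print(f"storage     ~ {storage_bytes(10_000_000, 100, 365) / 1e9:.0f} GB")
    print(f"bandwidth   ~ {bandwidth_bytes_per_sec(avg, 100) / 1e3:.1f} KB/s")
```

In an interview you would do this arithmetic by hand; the value of writing it down once is seeing which input dominates each result.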
Step 3: High-level design
Draw the major system components: clients, API gateway/load balancer, application servers, databases, caches, queues, CDN. Show the data flow for the 1–2 most important use cases. Keep it simple — you'll add detail in the deep dive.
Step 4: API design
Define the key API endpoints. For a URL shortener:
POST /shorten { url: string, custom_alias?: string } → { short_url: string }
GET /{alias} → 301 Redirect to original URL
Naming, HTTP verbs, and response codes matter — they signal that you think in terms of contracts.
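To make the contract concrete, here is a hedged in-memory sketch of the two endpoints. The class name, the `https://sho.rt` domain, the placeholder alias source, and the 201/400/404/409 status codes are all assumptions for illustration; the text above only specifies the request shape and the 301 redirect.

```python
from itertools import count


class ShortenerAPI:
    """In-memory stand-in for the two endpoints; each call returns (status_code, body)."""

    def __init__(self, base_url="https://sho.rt"):  # hypothetical domain
        self.base_url = base_url
        self.store = {}            # alias -> long URL
        self._next_id = count(1)   # placeholder alias source, not a real generator

    def post_shorten(self, url, custom_alias=None):
        """POST /shorten"""
        if not url.startswith(("http://", "https://")):
            return 400, {"error": "url must start with http:// or https://"}
        alias = custom_alias or f"u{next(self._next_id)}"
        if alias in self.store:
            return 409, {"error": "alias already in use"}
        self.store[alias] = url
        return 201, {"short_url": f"{self.base_url}/{alias}"}

    def get_alias(self, alias):
        """GET /{alias}"""
        long_url = self.store.get(alias)
        if long_url is None:
            return 404, {"error": "unknown alias"}
        return 301, {"Location": long_url}
```

Stating the error cases out loud (bad input, taken alias, unknown alias) is usually worth more in the interview than the happy path.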
Step 5: Deep dives
Pick 2–3 areas to go deep on. Let the interviewer guide you if they have preferences, but if not, go deep on: the database schema + indexing strategy, the caching layer (what gets cached, TTL, invalidation), and any bottleneck you identified in estimation.
Step 6: Edge cases and failure modes
- What happens if a database node goes down? (replication, read replicas)
- What if the cache is cold or a hot entry expires? (cache stampede: use a per-key mutex or probabilistic early expiry)
- What if traffic spikes 10×? (auto-scaling, rate limiting)
- How do you handle malicious URLs? (blocklist, async scanning)
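The mutex option for cache stampedes can be sketched as a "single-flight" cache: concurrent misses for the same key wait on one lock, so only one of them hits the database. This is an in-process illustration under assumed names (`SingleFlightCache`, `loader`); it has no expiry, and coordinating across several servers would need a shared lock instead.

```python
import threading


class SingleFlightCache:
    """Cache where concurrent misses for the same key trigger only one load."""

    def __init__(self, loader):
        self._loader = loader           # slow function, e.g. a database query
        self._values = {}
        self._key_locks = {}            # per-key locks (never cleaned up in this sketch)
        self._guard = threading.Lock()  # protects the two dicts above

    def get(self, key):
        with self._guard:
            if key in self._values:
                return self._values[key]
            lock = self._key_locks.setdefault(key, threading.Lock())
        with lock:  # only one thread per key gets past this point at a time
            with self._guard:
                if key in self._values:  # filled by another thread while we waited
                    return self._values[key]
            value = self._loader(key)
            with self._guard:
                self._values[key] = value
            return value
```

The second lookup inside the per-key lock is the important part: without it, every waiting thread would still call the loader once the first one finished.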
Estimation & scaling basics
A few numbers worth internalizing:
| Quantity | Rough number |
|---|---|
| 1 million users × 10 req/day | ~120 QPS |
| Single MySQL node | ~10,000 QPS (simple reads) |
| Redis throughput | ~100,000+ ops/sec |
| SSD random read latency | ~0.1 ms |
| Spinning-disk (HDD) seek latency | ~10 ms |
| Network round-trip (same DC) | ~0.5 ms |
| CDN cache hit latency | ~10–50 ms |
When a single database can't handle your estimated QPS, the natural progression is: vertical scaling → read replicas → caching → sharding. Each step has trade-offs around consistency and operational complexity.
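Sharding partitions data across multiple database nodes by a shard key (e.g., user_id % N). Hotspots arise when one shard gets disproportionate traffic; a high-cardinality, evenly distributed shard key is the main defense, while range-based sharding needs careful key design to avoid concentrating recent or popular keys on one node. Consistent hashing addresses a different problem: it limits how many keys move when N changes, whereas plain modulo remaps most of them.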
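One reason consistent hashing comes up next to `user_id % N` is resharding cost. The sketch below compares how many keys change shards when a fifth node is added, under plain modulo and under a hash ring with virtual nodes. The node names, the 100 virtual nodes per server, and the MD5-based hash are assumptions chosen for illustration.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


def modulo_shard(key: str, num_shards: int) -> int:
    return stable_hash(key) % num_shards


class HashRing:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes: int = 100):
        self._ring = sorted(
            (stable_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect(self._points, stable_hash(key)) % len(self._ring)
        return self._ring[idx][1]


def fraction_moved(before, after, keys) -> float:
    """Share of keys whose shard assignment differs between two placement functions."""
    return sum(before(k) != after(k) for k in keys) / len(keys)


keys = [f"user-{i}" for i in range(10_000)]
mod_moved = fraction_moved(lambda k: modulo_shard(k, 4), lambda k: modulo_shard(k, 5), keys)
ring4 = HashRing(["db0", "db1", "db2", "db3"])
ring5 = HashRing(["db0", "db1", "db2", "db3", "db4"])
ring_moved = fraction_moved(ring4.node_for, ring5.node_for, keys)
print(f"modulo: {mod_moved:.0%} of keys moved; ring: {ring_moved:.0%} moved")
```

In expectation, going from 4 to 5 shards moves about four fifths of the keys under modulo and about one fifth on the ring, and on the ring every moved key lands on the new node.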
A worked example: design a URL shortener
This is one of the most common prompts — small enough to complete in 45 minutes, yet rich enough to test all the design dimensions.
Requirements
- Functional: shorten a URL; redirect short URL to original; optional custom aliases.
- Non-functional: read-heavy (100:1 read-to-write ratio); high availability; low latency redirects (<10 ms P99); URLs don't expire by default.
- Scale: 100M DAU, ~10 shortening requests per user per day, ~1000 redirects per user per day → write QPS ≈ 12k, read QPS ≈ 1.2M.
Storage
- Each URL record: alias (7 chars) + long_url (2 KB avg) + created_at + user_id ≈ 2.1 KB.
- 1B shortens/day (100M DAU × 10) × 365 days × 2.1 KB ≈ 767 TB/year → plan for tiered storage (hot SSDs for recent data, cold object storage for old).
- Key schema: an aliases table with alias as the primary key (VARCHAR(7)), plus long_url, created_at, and user_id.
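The key schema can be written out as DDL. The sketch below uses Python's built-in `sqlite3` with an in-memory database purely so it runs anywhere; the column types, the default timestamp, and the sample row are assumptions, and a production system at this scale would use a different store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE aliases (
        alias      VARCHAR(7) PRIMARY KEY,               -- lookup key for redirects
        long_url   TEXT NOT NULL,
        created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        user_id    INTEGER
    )
    """
)

insert_sql = "INSERT INTO aliases (alias, long_url, user_id) VALUES (?, ?, ?)"
conn.execute(insert_sql, ("abc1234", "https://example.com/a/very/long/path", 42))

# The redirect path is a single primary-key lookup.
row = conn.execute(
    "SELECT long_url FROM aliases WHERE alias = ?", ("abc1234",)
).fetchone()

# The primary key also enforces alias uniqueness at write time.
try:
    conn.execute(insert_sql, ("abc1234", "https://example.com/other", 7))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Because reads only ever look up by alias, the primary-key index is the only index this table needs for the redirect path.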
High-level design
Client → CDN → API Gateway → Write Service → DB (Primary)
Client → CDN → API Gateway → Read Service → Cache (Redis) → DB (Read Replica)
Write path: Client POSTs a URL. Write Service generates a 7-character Base62 alias (62^7 = 3.5 trillion unique IDs), checks for collision (rare), writes to the primary DB, and optionally pre-warms the cache.
Read path: Client follows short URL. Read Service first checks Redis (cache hit → return 301/302 instantly). Cache miss → read from replica → cache the result with TTL of 24 hours.
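The read path is the cache-aside pattern. Here is a minimal sketch in which a small `TTLCache` class stands in for Redis and a plain dict stands in for the read replica; both stand-ins and the `resolve` function name are assumptions for illustration.

```python
import time


class TTLCache:
    """Tiny in-memory stand-in for a cache with per-entry expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:  # expired: drop it and report a miss
            del self._items[key]
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._items[key] = (value, self._clock() + ttl_seconds)


TTL_SECONDS = 24 * 60 * 60  # the 24-hour TTL from the read path


def resolve(alias, cache, replica):
    """Cache-aside read: try the cache, fall back to the replica, then populate."""
    long_url = cache.get(alias)
    if long_url is not None:
        return long_url                # cache hit
    long_url = replica.get(alias)      # cache miss: read from the replica
    if long_url is not None:
        cache.set(alias, long_url, TTL_SECONDS)
    return long_url
```

Note that unknown aliases are not cached here, so repeated requests for a nonexistent alias always reach the replica.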
Alias generation
Two common approaches:
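- Counter + Base62 encoding: A global counter (or a distributed sequence such as Snowflake-style IDs) is Base62-encoded into a short string. Uniqueness is guaranteed, but a single counter is a potential bottleneck; mitigate it by letting each write server reserve a block of counter values, or by drawing from a pre-generated pool of aliases. Note that a full 64-bit ID needs up to 11 Base62 characters, so it will not fit the 7-character alias.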
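- Random Base62 string + collision check: Simpler, but each collision costs a retry. The chance that a new random alias collides equals the fraction of the 62^7 ≈ 3.5 trillion keyspace already in use: negligible early on, but at ~1B new aliases per day about 10% of the keyspace is taken after a year, so retries (or a longer alias) must be planned for.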
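Both approaches fit in a few lines. This is a sketch under stated assumptions: the alphabet ordering (digits, then lowercase, then uppercase), the zero-padding to 7 characters, and the in-memory `taken` set standing in for the database uniqueness check are all illustrative choices.

```python
import secrets
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 symbols
ALIAS_LENGTH = 7


def base62_encode(number: int) -> str:
    """Encode a non-negative integer using the 62-symbol alphabet above."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def alias_from_counter(counter: int) -> str:
    """Approach 1: encode a unique counter value, left-padded to a fixed width."""
    alias = base62_encode(counter)
    if len(alias) > ALIAS_LENGTH:
        raise OverflowError("counter exceeds the 62^7 alias space")
    return alias.rjust(ALIAS_LENGTH, ALPHABET[0])


def random_alias(taken: set, max_attempts: int = 5) -> str:
    """Approach 2: random string, retried when it collides with an existing alias."""
    for _ in range(max_attempts):
        alias = "".join(secrets.choice(ALPHABET) for _ in range(ALIAS_LENGTH))
        if alias not in taken:
            taken.add(alias)
            return alias
    raise RuntimeError("could not find a free alias")
```

For example, `alias_from_counter(125)` yields `0000021`, since 125 = 2 × 62 + 1.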
Deep dives
Cache: Use Redis with a TTL that matches the alias's read frequency. A Bloom filter on the read service can short-circuit lookups for definitely-nonexistent aliases.
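A Bloom filter answers "definitely not present" or "possibly present", which is why it can short-circuit lookups for nonexistent aliases. The sketch below is a minimal version; the bit-array size, the four hash functions, and the SHA-256-with-seed hashing scheme are assumptions, not tuned values.

```python
import hashlib


class BloomFilter:
    """Set membership with no false negatives and a small false-positive rate."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self._num_bits = num_bits
        self._num_hashes = num_hashes
        self._bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self._num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self._num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means the item was never added; True means "go check the real store".
        return all(
            self._bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )
```

The read service would call `might_contain(alias)` first and return a not-found response immediately on `False`, skipping both the cache and the database.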
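Hot aliases: A viral short URL could overwhelm a single cache node. Redis Cluster spreads keys across nodes by hash slot, but any one key still lives on a single shard, so sharding alone does not relieve a hot key. Serve hot keys from replicas as well, and for extreme cases add in-process L1 caching (e.g., a local LRU map in the service) to absorb the last mile.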
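The local LRU map mentioned above can be very small. A minimal sketch using `collections.OrderedDict` follows; the class name and capacity handling are illustrative, and a real L1 cache would likely also want a short TTL so stale entries age out.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity map that evicts the least recently used entry."""

    def __init__(self, capacity: int):
        self._capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # evict the least recently used entry
```

Each service instance checks this map before going to Redis, so a viral alias is served from process memory on almost every request.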
Custom aliases: Check uniqueness at write time. Apply a blocklist of reserved words (admin, api, login).
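The custom-alias checks can be sketched as one validation function. The 3 to 32 character limit, the case-insensitive reserved-word match, and the in-memory `existing` set are assumptions; in a real service the database primary key would remain the final arbiter of uniqueness, since a check-then-insert is racy.

```python
import re
from typing import Optional

RESERVED_ALIASES = {"admin", "api", "login"}       # from the text; extend as needed
ALIAS_PATTERN = re.compile(r"[0-9A-Za-z]{3,32}")   # hypothetical length limits


def validate_custom_alias(alias: str, existing: set) -> Optional[str]:
    """Return an error message, or None if the alias may be used."""
    if not ALIAS_PATTERN.fullmatch(alias):
        return "alias must be 3-32 Base62 characters"
    if alias.lower() in RESERVED_ALIASES:
        return "alias is reserved"
    if alias in existing:
        return "alias already taken"
    return None
```

Matching reserved words case-insensitively prevents lookalikes such as `Admin` from slipping through.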
Common mistakes
1. Diving into details before clarifying requirements. Interviewers will redirect you, but you've already burned credibility. Always start with requirements.
2. Over-engineering from the start. Design for the stated scale, not for 10× that. If the problem says 1M users, you don't need Kafka, sharding, and a global CDN. Add complexity only when estimation demands it.
3. Treating the interview as a monologue. System design is a conversation. Pause, check in, and invite the interviewer's input — "Does this caching strategy make sense, or would you like to go deeper on the DB layer?"
4. Ignoring the database schema. Many candidates sketch boxes but never define their tables or document types. A concrete schema shows you've thought through the actual data.
5. Forgetting failure modes. A design with no mention of what happens when a node crashes signals inexperience. At least mention replication, retries, and timeouts.
6. No trade-off discussion. Choosing SQL over NoSQL (or vice versa) is a decision — explain why. "I chose PostgreSQL here because the data is relational and our write QPS is within a single-node budget, but I'd consider Cassandra if writes grew 10×."
Wrap-up
System design mastery comes from breadth of exposure. Study the architectures of systems you use: How does a key-value store work? How does a CDN cache content? Read engineering blogs (Netflix Tech Blog, Uber Engineering, Discord Engineering). Practice talking through a design out loud; the verbal explanation is half the interview. Most importantly, internalize the framework: requirements → estimation → high-level design → APIs → deep dives → trade-offs. That structure will carry you through any prompt.