5 System Design Patterns Every Interviewer Asks About
Ravi Subramanian·Jan 12, 2026·12 min read
Interview Tips

Most system design interview prep starts with memorizing architectures. You study how to design a URL shortener, a chat app, a news feed — and hope the interviewer picks one you've seen before. That works until they don't.
The better approach is to learn the underlying patterns. Every system design question — whether it's "design Uber" or "design a distributed key-value store" — is built from the same five or six recurring building blocks. Identifying these patterns is what separates senior engineers from candidates who are still thinking in terms of individual problems.
Here's the thing most prep resources won't tell you: interviewers aren't scoring your architecture. They're scoring your reasoning. The "right" answer matters far less than demonstrating that you understand why you're making each decision and what you'd give up by choosing differently. Every pattern below has trade-offs, and the candidates who discuss those trade-offs unprompted are the ones who get offers.
1. Caching — The Pattern You'll Use in Every Single Answer
If there's one pattern you can't afford to be fuzzy on, it's caching. It shows up in virtually every system design question because almost every system has a read-heavy workload, and caching is the first lever you pull to handle it.
Where it appears in interviews: "Design Twitter's home timeline." "Design a URL shortener." "Design a product recommendation engine." Any question where users read data more often than they write it — which is most of them.
What the interviewer actually evaluates: Not whether you mention caching (everyone does), but whether you understand the invalidation problem and can pick the right strategy for the specific use case.
The Three Strategies You Need to Know
Cache-aside (lazy loading). The application checks the cache first. On a miss, it reads from the database, writes to the cache, and returns the result. This is the default for most read-heavy systems. The trade-off: cache misses hit the database directly, and there's a window where stale data can be served after a write.
Write-through. Every write goes to the cache and the database simultaneously. Data is always fresh in the cache, but writes are slower because they hit two systems. Good for use cases where read-after-write consistency matters — like a user updating their profile and immediately viewing it.
Write-behind (write-back). Writes go to the cache first, and the cache asynchronously flushes to the database. This is fast for writes but risky — if the cache node fails before flushing, you lose data. Use this when you can tolerate some data loss in exchange for write throughput (think analytics event ingestion, not financial transactions).
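To make cache-aside concrete, here's a minimal sketch. Plain dicts stand in for Redis and the primary database (both are hypothetical stand-ins, not a real client), and the `write` path invalidates rather than updates — which is exactly what creates the stale-read window described above:

```python
import time

class CacheAside:
    """Cache-aside sketch: `cache` stands in for Redis, `db` for the
    primary datastore. Reads fill the cache lazily; writes invalidate."""

    def __init__(self, db, ttl_seconds=60):
        self.db = db                      # source of truth
        self.cache = {}                   # key -> (value, expires_at)
        self.ttl = ttl_seconds

    def get(self, key):
        entry = self.cache.get(key)
        if entry is not None:
            value, expires_at = entry
            if time.time() < expires_at:  # cache hit, still fresh
                return value
        value = self.db[key]              # cache miss: read the database
        self.cache[key] = (value, time.time() + self.ttl)
        return value

    def write(self, key, value):
        self.db[key] = value
        self.cache.pop(key, None)         # invalidate; next read refills
```

Note what the sketch makes visible: if something else mutates `db` directly, `get` keeps serving the cached value until the TTL expires. That is the staleness window you should name in the interview.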
The Follow-Up That Trips People Up
Interviewers love to ask: "What happens when your cache fills up?" This is testing whether you understand eviction policies. LRU (least recently used) is the safe answer for most cases, but you should know when LFU (least frequently used) is better — for example, in a CDN where popular content should stay cached even if it wasn't accessed in the last few seconds.
The other common follow-up is cache stampede: when a popular cache key expires and hundreds of concurrent requests hit the database simultaneously. Solutions include staggered TTLs (adding a random jitter to expiration times), lock-based rebuilding (only one request rebuilds the cache while others wait), and early recomputation (refreshing the cache before it expires).
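Two of those stampede defenses — jittered TTLs and lock-based rebuilding — fit in one small sketch. The `loader` function, TTL numbers, and per-key `threading.Lock` here are illustrative choices; a distributed system would use a shared lock (for example, a Redis `SET NX` lock) instead of an in-process one:

```python
import random
import threading
import time

class StampedeSafeCache:
    """Sketch of two stampede defenses: staggered (jittered) TTLs, and
    lock-based rebuilding so only one caller recomputes an expired key."""

    def __init__(self, loader, base_ttl=60, jitter=10):
        self.loader = loader              # key -> value (the "database" read)
        self.base_ttl = base_ttl
        self.jitter = jitter
        self.cache = {}                   # key -> (value, expires_at)
        self.locks = {}                   # key -> per-key rebuild lock
        self.meta_lock = threading.Lock()

    def _lock_for(self, key):
        with self.meta_lock:
            return self.locks.setdefault(key, threading.Lock())

    def get(self, key):
        entry = self.cache.get(key)
        if entry and time.time() < entry[1]:
            return entry[0]
        with self._lock_for(key):         # only one thread rebuilds this key
            entry = self.cache.get(key)   # re-check: another thread may have won
            if entry and time.time() < entry[1]:
                return entry[0]
            value = self.loader(key)
            ttl = self.base_ttl + random.uniform(0, self.jitter)  # jittered expiry
            self.cache[key] = (value, time.time() + ttl)
            return value
```

The double-check inside the lock is the important detail: waiters that arrive during a rebuild find the fresh value instead of hitting the database again.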
The trade-off to articulate: Caching trades consistency for latency. Every caching decision is a bet on how stale your users can tolerate the data being. Name this trade-off explicitly and you'll stand out.
2. Data Partitioning — How You Answer "Now Scale It"
Twenty minutes into a design question, the interviewer will say something like "this needs to handle 500 million users" or "assume 100,000 writes per second." That's your cue to talk about partitioning.
Where it appears in interviews: Literally every question that involves a database. The initial design uses a single database. The scaling discussion introduces partitioning.
What the interviewer actually evaluates: Whether you can pick an appropriate partition key and reason about the downstream consequences — hot partitions, cross-partition queries, rebalancing.
Choosing a Partition Key
This is the most consequential decision in the pattern, and it's where most candidates either shine or stumble.
Hash-based partitioning. Apply a hash function to a key (like user ID) and distribute across partitions. Gives even distribution but destroys data locality — if you need to query all posts by a user and all posts in a geographic region, you can't optimize for both with a single hash key.
Range-based partitioning. Partition by value ranges (timestamps, alphabetical ranges, geographic regions). Preserves locality for range queries but creates hot spots — if you partition a social media app by creation date, the most recent partition handles all new writes.
Composite keys. Combine both approaches. For a messaging app, you might partition by hash(user_id) for even distribution, then sort within each partition by timestamp for efficient range queries. This is the DynamoDB model, and it comes up frequently in interviews targeting companies that use it at scale.
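The composite-key idea can be sketched in a few lines. This is a toy model, not DynamoDB's actual implementation: `md5` stands in for the partitioner, four in-memory lists stand in for partitions, and each partition keeps its rows sorted by timestamp so a user's range query touches exactly one partition:

```python
import bisect
import hashlib

NUM_PARTITIONS = 4                        # illustrative; real systems use many more

def partition_for(user_id: str) -> int:
    """Hash partitioning: spread users evenly across partitions."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

class MessageStore:
    """Composite-key sketch: hash(user_id) picks the partition, and rows
    inside a partition stay sorted by timestamp for cheap range queries."""

    def __init__(self):
        self.partitions = [[] for _ in range(NUM_PARTITIONS)]  # sorted (ts, user, msg)

    def put(self, user_id, ts, msg):
        p = self.partitions[partition_for(user_id)]
        bisect.insort(p, (ts, user_id, msg))   # keep the partition time-ordered

    def range_query(self, user_id, start_ts, end_ts):
        """One user's messages in a time window: a single-partition scan."""
        p = self.partitions[partition_for(user_id)]
        return [r for r in p if start_ts <= r[0] <= end_ts and r[1] == user_id]
```

Using a stable hash (`md5`) rather than Python's builtin `hash` matters even in a sketch: the builtin is salted per process, so routing would change between runs.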
The Follow-Up That Matters
"How do you handle a hot partition?" This tests operational thinking. Solutions include further splitting the hot partition, adding a random suffix to the key to distribute load (with a scatter-gather pattern for reads), or introducing a write-back cache in front of the hot partition. The worst answer is "I'd just add more partitions" without addressing why the hotspot exists.
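The key-suffix technique is easy to show concretely. In this sketch (a dict stands in for the partitioned store, and the `FANOUT` of 8 is arbitrary), writes pick a random suffix so a single hot logical key is spread across several physical partitions, and reads pay for it with a scatter-gather over all suffixes:

```python
import random

FANOUT = 8                                # split one hot key across 8 physical keys

def write_key(hot_key: str) -> str:
    """Writes pick a random suffix, spreading load across FANOUT partitions."""
    return f"{hot_key}#{random.randrange(FANOUT)}"

def read_keys(hot_key: str) -> list:
    """Reads must scatter-gather: query every suffixed key and merge."""
    return [f"{hot_key}#{i}" for i in range(FANOUT)]

# Hypothetical usage against a dict standing in for the store:
store = {}
for event in range(100):
    store.setdefault(write_key("celebrity:42"), []).append(event)

merged = [e for k in read_keys("celebrity:42") for e in store.get(k, [])]
```

This is the trade-off in miniature: write load drops by roughly a factor of `FANOUT`, and every read fans out to `FANOUT` partitions instead of one.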
The trade-off to articulate: Partitioning trades query flexibility for write scalability. Once you shard, joins and cross-partition queries become expensive or impossible. Acknowledge this, and explain how you'd handle the queries that now span partitions (denormalization, materialized views, or accepting higher latency for those specific access patterns).
3. Asynchronous Processing — The Answer to "What If This Is Slow?"
Some operations can't — or shouldn't — happen synchronously within a request. Video transcoding, email delivery, report generation, payment settlement. When the interviewer introduces a requirement that takes more than a few hundred milliseconds, asynchronous processing is the pattern.
Where it appears in interviews: "Design YouTube" (video processing pipeline). "Design an e-commerce platform" (order fulfillment). "Design a notification system" (fan-out to millions of devices). Any question involving long-running tasks or event-driven workflows.
What the interviewer actually evaluates: Whether you understand the difference between message queues and event streams, and when to use each.
Message Queues vs. Event Streams
Message queues (SQS, RabbitMQ). Point-to-point delivery. A message is consumed by exactly one consumer, then removed from the queue. Good for task distribution — "process this video," "send this email." The queue is the backpressure mechanism: if consumers fall behind, the queue grows, and you can add more consumers to catch up.
Event streams (Kafka, Kinesis). Publish-subscribe with persistence. An event is written to a log and can be consumed by multiple independent consumers. Events are retained for a configurable period, so new consumers can replay history. Good for event-driven architectures where multiple services need to react to the same event — "a user signed up" might trigger a welcome email, a CRM update, and an analytics event simultaneously.
Delivery Guarantees
This is where the interview gets interesting. There are three guarantees, and each has real consequences:
At-most-once. Fire and forget. Fast, but messages can be lost. Acceptable for non-critical analytics events.
At-least-once. Messages are retried until acknowledged. No message loss, but duplicates are possible. This is the default for most production systems, paired with idempotent consumers that can safely process the same message twice.
Exactly-once. The holy grail. Technically achievable within Kafka's transaction model, but practically very hard across service boundaries. If an interviewer asks for exactly-once delivery across services, the senior answer is: "We'd use at-least-once delivery with idempotent processing, which gives us effectively-once semantics."
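An idempotent consumer is simpler than it sounds: each message carries a unique id, and the consumer records processed ids so a redelivered duplicate becomes a no-op. In this sketch the in-memory set stands in for a durable dedupe store, and the balance-update handler is a made-up example workload:

```python
class IdempotentConsumer:
    """At-least-once delivery made safe: duplicates are detected by message
    id and skipped, giving effectively-once processing."""

    def __init__(self):
        self.processed_ids = set()        # in production: a durable store
        self.account_balance = 0

    def handle(self, message: dict) -> bool:
        """Apply the message; return False if it was a redelivered duplicate."""
        msg_id = message["id"]
        if msg_id in self.processed_ids:  # retry delivered the same message
            return False
        self.account_balance += message["amount"]
        self.processed_ids.add(msg_id)    # in a real system, apply + record
        return True                       # in one transaction, or loss returns
```

The comment about transactionality is the senior-level detail: if applying the change and recording the id aren't atomic, a crash between them reintroduces either loss or duplication.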
The trade-off to articulate: Asynchronous processing trades latency predictability for throughput and resilience. The user doesn't get an immediate result — they get an acknowledgment and eventually a result. Explain how you'd communicate processing status to the user (polling, WebSockets, email notification) and what happens if processing fails (dead letter queues, retry policies, alerting).
4. Rate Limiting and Back Pressure — The Operational Thinking Test
This pattern tests something different from the others. It's not about features — it's about protecting your system and your users when things go wrong. Interviewers use it to evaluate operational maturity, which is a strong signal for senior and staff-level candidates.
Where it appears in interviews: "Design an API gateway." "Design a cloud storage service." "How would you protect this system from abuse?" It also appears as a follow-up in almost any design: "What happens if one client sends 10x the normal traffic?"
What the interviewer actually evaluates: Whether you can reason about system protection at multiple layers and understand the user experience implications of rate limiting.
The Two Algorithms
Token bucket. A bucket fills with tokens at a fixed rate. Each request consumes a token. If the bucket is empty, the request is rejected. Simple to implement, allows for short bursts (up to the bucket size), and is the most common choice for API rate limiting. This is what AWS, Stripe, and most public APIs use.
Sliding window. Track the number of requests within a rolling time window. More precise than a fixed window counter (which has the boundary problem — a burst at the end of one window and the start of the next can double the effective rate). Slightly more complex to implement, typically using a sorted set in Redis.
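A token bucket fits in a dozen lines. This sketch refills lazily (tokens are credited based on elapsed time each time a request is checked, rather than by a background timer); the rate and capacity numbers you'd use are per-endpoint tuning decisions:

```python
import time

class TokenBucket:
    """Token bucket sketch: `rate` tokens/second refill, bursts up to
    `capacity`. Refill happens lazily on each check."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity            # start full: permits an initial burst
        self.last_refill = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                      # bucket empty: reject (e.g. HTTP 429)
```

Note that burst size and steady-state rate are independent knobs: `capacity` bounds the burst, `rate` bounds the long-run average.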
Where Candidates Go Wrong
Most candidates describe rate limiting at the API gateway and stop there. That's one layer. The interviewers who push further want to hear about:
Per-user vs. per-tenant vs. global limits. A multi-tenant system needs all three. A single user shouldn't be able to exhaust their tenant's quota, and a single tenant shouldn't be able to degrade the system for everyone else.
Backpressure propagation. What happens when a downstream service is overwhelmed? Circuit breakers (stop calling a failing service temporarily), bulkheads (isolate resources per tenant so one failure doesn't cascade), and load shedding (deliberately dropping low-priority requests to preserve capacity for high-priority ones).
Graceful degradation. The best answer isn't "reject the request." It's "return a degraded response" — serve cached data, reduce feature richness, or queue the request for later. Netflix's approach to graceful degradation during partial outages is a well-known example worth referencing.
The trade-off to articulate: Rate limiting trades availability for reliability. You're deliberately making your system less available to some users in order to keep it reliably available for everyone else. Frame it as a fairness problem, not just a protection mechanism.
5. Real-Time Updates — Answering "How Does the User See Changes?"
Any system where users expect to see live updates — chat messages, stock prices, collaborative editing, notification feeds — requires a real-time communication pattern. This is increasingly common in interviews because modern applications set high expectations for real-time behavior.
Where it appears in interviews: "Design Slack." "Design a live sports scoreboard." "Design Google Docs." "Design a notification system." Any question where the data changes server-side and the client needs to know about it without refreshing.
What the interviewer actually evaluates: Whether you understand the spectrum of solutions (not just WebSockets) and can match the right approach to the specific requirements.
The Spectrum
Short polling. Client sends a request every N seconds. Simple, works everywhere, but wasteful — most responses are "nothing changed." Acceptable for low-frequency updates (checking for new email every 30 seconds) or as a fallback when other approaches aren't supported.
Long polling. Client sends a request, server holds the connection open until there's new data or a timeout. More efficient than short polling for infrequent updates. The trade-off: each held connection consumes a server thread (or connection slot), which limits concurrency. Good for moderate-scale notification systems.
Server-Sent Events (SSE). A persistent one-way connection from server to client over HTTP. The server pushes events as they occur. Simpler than WebSockets (works over standard HTTP, automatic reconnection built into the browser API), but unidirectional — the client can't send messages back over the same connection. Ideal for live feeds, dashboards, and notification streams.
WebSockets. A persistent bidirectional connection. Both client and server can send messages at any time. The most capable option, but also the most complex — you need to handle connection management, heartbeats, reconnection logic, and load balancer configuration (sticky sessions or connection-aware routing). Use when you genuinely need bidirectional communication, like chat or collaborative editing.
The Decision Framework
Start with the simplest approach that meets requirements. If updates happen once every few minutes, polling is fine — don't introduce WebSocket complexity for a low-frequency use case. If updates are frequent but unidirectional (notifications, live scores), SSE is simpler than WebSockets. Reserve WebSockets for genuinely interactive features.
Interviewers respect candidates who start simple and justify added complexity, rather than jumping straight to WebSockets because it sounds more impressive. This is one of the most common over-engineering mistakes in design interviews.
The trade-off to articulate: Real-time updates trade server resource consumption for user experience freshness. Each persistent connection costs memory and a file descriptor on the server. At scale (millions of concurrent users), this is a significant infrastructure cost. Explain how you'd manage this — connection pooling, server-side fan-out via pub/sub, and tiered approaches (real-time for active users, batch for inactive ones).
The Pattern Behind the Patterns
Here's what connects all five of these: every system design decision is a trade-off, and the interviewer is evaluating whether you see it that way.
Caching trades consistency for speed. Partitioning trades query flexibility for write scale. Async processing trades immediacy for resilience. Rate limiting trades availability for fairness. Real-time updates trade server resources for freshness.
Candidates who memorize architectures can produce a working design. Candidates who understand trade-offs can explain why that design is the right one — and what they'd change if the requirements shifted. The second group gets the offers.
One practical observation: the recall challenge in system design interviews is real. You might know all five patterns cold during prep but struggle to articulate the specific trade-offs when the conversation is live, the interviewer is probing, and you're managing a whiteboard at the same time. This is the kind of high-pressure information recall that tools like Neothi are designed for — keeping your prep material accessible as a real-time overlay while you focus on the conversation itself.
But tools only help if you've done the preparation. Learn these five patterns deeply enough that you can explain each trade-off in your own words. Then, whether you're working from memory or with assistance, your answers will have the specificity and reasoning that interviewers are actually scoring.