System Design from First Principles
The complete guide — from first principles to full system designs.
There is no single right answer in system design — everything has tradeoffs. This guide is meant as initial exposure to the core concepts, building blocks, and patterns that underpin every distributed system.
The guide is built from the ground up. If you know nothing about system design, start at Part 1. If you're already comfortable with the basics, skip ahead to Part 3 (Patterns & Principles) or Part 5 (Putting It All Together).
What This Guide Covers
- Part 1 — First Principles: How the internet works — clients, servers, protocols, DNS, HTTP, TLS. The foundation everything else is built on.
- Part 2 — Building Blocks: The components you assemble to build systems — servers, APIs, databases, caching, CDNs, load balancers, message queues, and virtual networks.
- Part 3 — Patterns & Principles: The theoretical concepts and architectural patterns — ACID, CAP theorem, consistency models, replication, sharding, and scaling strategies.
- Part 4 — Security: Authentication, authorization, encryption, and the common vulnerabilities every system must defend against.
- Part 5 — Putting It All Together: A structured framework for designing systems end-to-end — requirements, estimation, API design, data modeling, and architecture.
- Part 6 — Common Design Problems: Worked examples of real system designs — URL shorteners, chat apps, notification systems, and more.
- Glossary: Quick-reference definitions for every term used throughout the guide.
Parts 1–4 each end with a chapter of a running story: you and your co-founder Sam are building and scaling LinkPulse, a link-sharing platform. Each chapter picks up where the last left off. The concepts are taught first — the story shows you what they feel like in practice. Look for the gold-bordered sections at the end of each Part.
Part 1: First Principles — How the Internet Works
Before we talk about system design, you need to understand what happens when a user types a URL into their browser. Every system you'll ever design sits on top of this flow.
1.1 The Client-Server Model
Everything on the internet is a conversation between two types of machines:
- Client: The thing making a request (your browser, a mobile app, another server)
- Server: The thing responding to the request (a machine running code that listens for incoming connections)
When you type https://example.com into your browser:
1. Your browser (client) asks DNS: "What IP address is example.com?"
2. DNS responds: "It's 93.184.216.34"
3. Your browser opens a TCP connection to 93.184.216.34 on port 443 (HTTPS)
4. Your browser sends an HTTP GET request: "Give me the homepage"
5. The server processes the request and sends back HTML/CSS/JS
6. Your browser renders the page
1.2 What Is a Protocol?
A protocol is a set of rules that two machines agree to follow when communicating. Without a protocol, one machine might send data as JSON while the other expects XML — chaos. Think of it like two people agreeing to speak English, take turns, and raise a hand when they have a question.
| Protocol | What It Does | Analogy |
|---|---|---|
| TCP | Reliable, ordered delivery of data | Certified mail — you get confirmation it arrived |
| UDP | Fast, unreliable delivery | Shouting across a room — fast but some words get lost |
| HTTP | Request-response format for web communication | A structured form: "I want X" → "Here's X" |
| HTTPS | HTTP + encryption (TLS) | Same form, but in a sealed envelope |
| WebSocket | Persistent two-way connection | A phone call (stays open, either side can talk) |
TCP Deep Dive
TCP (Transmission Control Protocol) guarantees that data arrives completely, in order, and without errors. It does this through three mechanisms:
- Acknowledgments (ACKs): The receiver confirms every chunk of data it gets. If the sender doesn't hear back, it resends.
- Ordering: Every chunk gets a sequence number. If chunk #3 arrives before chunk #2, TCP holds #3 and waits for #2.
- Retransmission: If a chunk is lost in transit, the sender automatically resends it after a timeout.
Before any data flows, TCP establishes a connection with a 3-way handshake: the client sends a SYN ("can we talk?"), the server replies with a SYN-ACK ("yes — can you hear me?"), and the client answers with an ACK ("loud and clear"). Only then does data start flowing.
UDP Deep Dive
UDP (User Datagram Protocol) skips the handshake entirely. It just fires data and hopes for the best — no ACKs, no ordering, no retransmission. This makes it much faster but unreliable.
Analogy: TCP is like a phone conversation ("Did you hear that?" "Yes." "OK, next thing..."). UDP is like a live radio broadcast — the station keeps transmitting whether you're listening or not.
TCP vs UDP — When does it matter?
- TCP for anything where you can't lose data: web pages, API calls, file transfers, database queries
- UDP for anything where speed matters more than perfection: video streaming, gaming, voice calls
But wait — if TCP is reliable, why would anyone use UDP? Because reliability has a cost. TCP's handshakes, ACKs, and retransmissions add latency. In a video call, a dropped frame from 50ms ago is worthless — you'd rather skip it and show the next frame. UDP lets you do that. TCP would pause the stream to re-request the lost frame, causing a visible stutter.
1.3 IP Addresses and Ports
An IP address is a machine's address on a network: 192.168.1.100
A port is like an apartment number at that address. One machine can run many services, each on a different port:
Port 80 → HTTP (web traffic)
Port 443 → HTTPS (encrypted web traffic)
Port 5432 → PostgreSQL (database)
Port 6379 → Redis (cache)
Port 3000 → Your Node.js app
But wait — how many ports can a server have? 65,536 (2^16), numbered 0–65535. Ports 0–1023 are "well-known" and reserved for standard services like HTTP and SSH. Ports 1024–49151 are "registered" for specific applications. Ports 49152–65535 are ephemeral — the OS assigns them to temporary client connections on the fly.
What if two applications try to use the same port? The OS rejects the second one with a "port already in use" error (EADDRINUSE). Each port can only be bound by one process at a time — it's like two tenants trying to rent the same apartment.
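The three ranges above can be captured in a small helper (a sketch — portRange is a hypothetical name; the boundaries follow the IANA ranges listed above):

```javascript
// Classify a port number into the three IANA ranges.
function portRange(port) {
  if (port < 0 || port > 65535) throw new RangeError('ports are 0–65535');
  if (port <= 1023) return 'well-known';   // HTTP, SSH, HTTPS…
  if (port <= 49151) return 'registered';  // PostgreSQL, Redis, your app…
  return 'ephemeral';                      // OS-assigned client ports
}

console.log(portRange(443));   // 'well-known'  (HTTPS)
console.log(portRange(5432));  // 'registered'  (PostgreSQL)
console.log(portRange(51234)); // 'ephemeral'   (temporary client connection)
```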
1.4 DNS: The Internet's Phone Book
DNS (Domain Name System) translates human-readable names to IP addresses.
example.com → 93.184.216.34
Lookup chain:
1. Check browser cache
2. Check OS cache
3. Ask your ISP's DNS resolver
4. Resolver asks root nameserver → "Who handles .com?"
5. Ask .com nameserver → "Who handles example.com?"
6. Ask example.com's nameserver → "Here's the IP: 93.184.216.34"
GeoDNS is DNS that returns different IP addresses based on where the query originates. Queries from Europe resolve to an EU load balancer IP; queries from Asia resolve to an AP load balancer IP. This is how global systems achieve low latency — a user in Frankfurt gets routed to the nearest data center, dropping response times from 280ms to 15ms.
How long does a DNS cache last? Each DNS record has a TTL (Time to Live) — typically 60–300 seconds. When the TTL expires, the resolver re-queries the authoritative nameserver. Low TTL means faster failover but more DNS traffic. High TTL means fewer lookups but slower propagation when you change an IP.
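The resolver-side behavior can be sketched as a tiny TTL cache (illustrative names only — real resolvers are far more elaborate). The clock is injectable so expiry is easy to demonstrate:

```javascript
// Minimal TTL cache: entries expire ttlMs milliseconds after being set.
function makeTtlCache(now = Date.now) {
  const entries = new Map();
  return {
    set(key, value, ttlMs) {
      entries.set(key, { value, expiresAt: now() + ttlMs });
    },
    get(key) {
      const e = entries.get(key);
      if (!e) return undefined;
      if (now() >= e.expiresAt) {  // TTL expired → treat as a miss
        entries.delete(key);
        return undefined;
      }
      return e.value;
    },
  };
}

// Simulated clock so we can "fast-forward" past the TTL.
let t = 0;
const cache = makeTtlCache(() => t);
cache.set('example.com', '93.184.216.34', 60_000); // 60-second TTL
console.log(cache.get('example.com')); // hit → '93.184.216.34'
t += 61_000;                           // 61 seconds later…
console.log(cache.get('example.com')); // miss → undefined (must re-query)
```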
1.5 HTTP: The Language of the Web
HTTP is a request-response protocol. Every HTTP message has:
Request:
GET /api/users/123 HTTP/1.1
Host: api.example.com
Authorization: Bearer eyJhbGciOiJ...
Content-Type: application/json
Response:
HTTP/1.1 200 OK
Content-Type: application/json
{
"id": 123,
"name": "John Doe",
"role": "Senior Engineer"
}
HTTP Methods
| Method | Purpose | Safe? | Idempotent? |
|---|---|---|---|
| GET | Read/retrieve data | Yes | Yes |
| POST | Create new data | No | No |
| PUT | Replace/update data completely | No | Yes |
| PATCH | Partially update data | No | No |
| DELETE | Remove data | No | Yes |
Safe = doesn't change anything on the server. Idempotent = calling it 10 times has the same effect as calling it once.
Still confused by idempotency? Think of it like a light switch vs. a doorbell. A light switch is idempotent: flipping it to "on" ten times still results in the light being on — the end state is the same no matter how many times you do it. A doorbell is NOT idempotent: pressing it ten times rings it ten times. Each press creates a new side effect.
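In code, the difference is just state-setting versus side-effect-accumulating (a minimal sketch of the analogy):

```javascript
// Idempotent: setting state. The end state is the same no matter
// how many times you call it.
let light = 'off';
const setLightOn = () => { light = 'on'; };

// NOT idempotent: each call creates a new side effect.
let rings = 0;
const pressDoorbell = () => { rings += 1; };

for (let i = 0; i < 10; i++) { setLightOn(); pressDoorbell(); }
console.log(light); // 'on' — same as after a single call
console.log(rings); // 10  — ten distinct side effects
```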
POST vs PUT vs PATCH: A Concrete Example
// Original user object on server:
{ "id": 42, "name": "John", "email": "john@example.com", "role": "editor" }
// POST /users — CREATE a new resource
// Every call creates a NEW user. Call it 3 times, get 3 users:
{ "name": "Jane", "email": "jane@example.com", "role": "viewer" }
// → 201 Created (id: 43)
// Call again → 201 Created (id: 44) ← duplicate! POST is NOT idempotent.
// PUT /users/42 — REPLACE the entire resource
// You MUST send the complete object:
{ "name": "John", "email": "john-new@example.com", "role": "editor" }
// ⚠️ If you forget "role": the field is ERASED (set to null)
// Call it 10 times → same result every time. PUT IS idempotent.
// PATCH /users/42 — UPDATE only specified fields
// Send only what changed:
{ "email": "john-new@example.com" }
// ✓ "name" and "role" are untouched
Why Idempotency Matters: The Retry Problem
Networks are unreliable. Requests fail, time out, and get retried — often automatically. The question idempotency answers is: "If this request gets sent twice by accident, will anything bad happen?"
What happens with a non-idempotent POST: Suppose a client sends a $50 charge, the response is lost, and the client retries. The server sees a brand-new request — it has no idea this is a retry — and processes a second charge. You've now been billed $100.
What happens with an idempotent operation: The server recognizes "I already processed this" and returns the original result. You're charged $50 — exactly once.
How to make non-idempotent operations safe: Use an idempotency key — a unique ID (like a UUID) the client generates and sends with the request. The server stores this key with the result. On retry, it sees the same key, skips re-processing, and returns the cached result. Stripe, PayPal, and most payment APIs require this.
Unsafe Retry (POST without idempotency key)
1. Client sends POST /charge → Server processes, charges $50
2. Response lost in transit → Client sees TIMEOUT
3. Client retries POST /charge → Server processes AGAIN, charges $50
Result: Customer charged $100
Safe Retry (POST with idempotency key)
1. Client sends POST /charge + Idempotency-Key: abc-123 → Server processes, charges $50, stores key
2. Response lost in transit → Client sees TIMEOUT
3. Client retries POST /charge + Idempotency-Key: abc-123 → Server finds key → returns cached result
Result: Customer charged $50
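A minimal sketch of the server side of this pattern, with an in-memory key store (real systems persist keys in a database with an expiry; all names here are illustrative):

```javascript
// idempotencyKey → cached result of the first successful attempt
const processedCharges = new Map();
let totalCharged = 0;

function charge(amount, idempotencyKey) {
  if (processedCharges.has(idempotencyKey)) {
    // Retry detected: skip re-processing, return the original result.
    return processedCharges.get(idempotencyKey);
  }
  totalCharged += amount;                 // the actual side effect
  const result = { status: 'charged', amount };
  processedCharges.set(idempotencyKey, result);
  return result;
}

charge(50, 'abc-123'); // first attempt: charges $50
charge(50, 'abc-123'); // retry after a timeout: recognized, NOT charged again
console.log(totalCharged); // 50
```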
Quick idempotency cheat sheet:
| Method | Idempotent? | Why |
|---|---|---|
| GET | Yes | Reading data never changes it. Read 100 times, same result. |
| PUT | Yes | "Set this resource to exactly this state." Doing it twice = same state. |
| DELETE | Yes | Deleting something that's already gone is a no-op. |
| POST | No | "Create a new thing." Each call creates another one. |
| PATCH | No | Depends on implementation — can have side effects on retry. |
HTTP Status Codes
2xx = Success
200 OK — Here's what you asked for
201 Created — I made the thing you asked me to make
204 No Content — Done, nothing to send back
3xx = Redirect
301 Moved Permanently — This resource lives somewhere else now
304 Not Modified — Use your cached version
4xx = Client Error (YOUR fault)
400 Bad Request — Your request doesn't make sense
401 Unauthorized — Who are you? Log in first
403 Forbidden — I know who you are, but you can't do this
404 Not Found — That doesn't exist
429 Too Many Requests — Slow down (rate limiting)
5xx = Server Error (SERVER's fault)
500 Internal Server Error — Something broke on our end
502 Bad Gateway — The upstream server failed
503 Service Unavailable — We're overloaded or in maintenance
504 Gateway Timeout — The upstream server took too long
502 and 504: The Gateway Codes
These two codes specifically involve a gateway — a load balancer or reverse proxy sitting between the client and the backend server. The client never talks to the backend directly; the gateway forwards requests on its behalf.
Analogy: You call a company's receptionist (gateway) who connects you to a specialist (backend). 502 = the specialist picks up but speaks gibberish. 504 = the specialist never picks up and the receptionist gives up waiting.
What about HTTP/2 and HTTP/3? HTTP/2 multiplexes multiple requests over a single TCP connection, eliminating head-of-line blocking at the HTTP layer. HTTP/3 goes further — it replaces TCP with QUIC (a UDP-based protocol), reducing connection setup from 2–3 round trips to just 1. Most CDNs and major sites use HTTP/2+ today. The progression is worth understanding, though the focus here is on HTTP/1.1 semantics (methods, headers, status codes) — those haven't changed.
1.6 HTTP is Stateless
Every HTTP request is completely independent. The server has no memory of previous requests. When you send request #2, the server has zero knowledge of request #1 — it's like talking to someone with amnesia every time.
But wait — if HTTP is stateless, how does the server know I'm logged in? Great question. The server doesn't "remember" you between requests. Instead, you send proof of identity with every single request — usually a token (a string of characters that proves your identity; think of it like a wristband at a concert — you show it at every door) or a cookie (a small piece of data that the browser automatically attaches to every request to a specific domain). We'll cover exactly how this works in Part 4: Security.
1.7 TLS/HTTPS: Encryption in Transit
TLS (Transport Layer Security) is the protocol that puts the "S" in HTTPS. It encrypts data between the client and server so that anyone intercepting the traffic sees only gibberish.
Analogy: Imagine passing notes in class. HTTP is writing the note in plain English — anyone who intercepts it can read it. HTTPS is writing the note in a secret code that only you and the recipient understand, inside a sealed envelope.
The certificate proves the server is who it claims to be (not an imposter). It's issued by a trusted Certificate Authority (CA) — an organization like Let's Encrypt or DigiCert that verifies a server's identity and issues digital certificates.
Does TLS slow things down? The handshake adds ~1 round trip on the first connection. After that, TLS session resumption makes subsequent connections nearly free. The encryption/decryption overhead is negligible on modern hardware — HTTPS is effectively free performance-wise. Never skip TLS to "improve performance."
1.8 Making a Real HTTP Request
Here's what it looks like to make an HTTP request in code:
// JavaScript (browser or Node.js)
const response = await fetch('https://api.example.com/users/123', {
method: 'GET',
headers: {
'Authorization': 'Bearer eyJhbGciOiJ...',
'Content-Type': 'application/json'
}
});
const user = await response.json();
console.log(user.name); // "John Doe"
The equivalent using curl from the command line:
# curl (command line)
curl -X GET https://api.example.com/users/123 \
-H "Authorization: Bearer eyJhbGciOiJ..." \
-H "Content-Type: application/json"
Both do the exact same thing: send an HTTP GET request with an auth header and expect JSON back. The fetch() API is what your frontend JavaScript uses; curl is what you'll use for quick testing.
The LinkPulse Story
Chapter 1: Day One
It's launch day. You and Sam deploy LinkPulse — a link-sharing platform — on a single $20/month server in Virginia. The stack is simple: a Node.js API, a PostgreSQL database, and a React frontend. All on one machine.
Your first user signs up. Let's trace exactly what happens. They type linkpulse.io into their browser. First, a DNS lookup resolves the domain to your server's IP address. Then the browser opens a TCP connection — the three-way handshake you just learned about. Finally, the browser sends an HTTP GET request for the homepage. Your server receives it, renders the page, and sends back an HTTP response. The browser paints the page. It takes 120ms. It works.
Ten users sign up by lunch. Twenty by dinner. You're watching logs in real time, grinning.
Uh Oh
Day two. A user signs up from a coffee shop. They type their email and password into your registration form and click submit. What you didn't think about: your site is served over plain HTTP on port 80. That means their credentials cross the coffee shop's Wi-Fi in plaintext. Anyone on the same network running a packet sniffer can read their password.
You scramble. You grab a free TLS certificate from Let's Encrypt, configure your server to accept HTTPS on port 443, and redirect all HTTP traffic. Now every request is encrypted with TLS — even if someone intercepts the packets, they see gibberish. Crisis averted, lesson learned: encryption isn't optional, it's step one.
A week in, Sam asks a good question: "We have a login page now. But HTTP is stateless — the server forgets who you are after every request. How does the server know a user is logged in on the next click?" You tell her you're using a session cookie for now, but you know this question goes deeper. You'll revisit it properly in Part 4.
Uh Oh
Day ten. A user in Mumbai reports that the page takes 4 seconds to load. You check your server metrics — CPU is at 5%, response time is 15ms. The server is fast. So what's slow? Physics. A round trip from Mumbai to Virginia takes roughly 200ms. Your page loads dozens of assets — HTML, CSS, JavaScript, images, API calls — each one a separate round trip. Those 200ms add up fast across a TCP connection that starts slow (remember: TCP's congestion window starts small and grows). The speed of light is a hard limit, and your server is 13,000 km away.
Key Takeaway
LinkPulse works. But it only works well for people near Virginia. Every concept from Part 1 — DNS, TCP, HTTP, TLS — happens on every single request. And every request from Mumbai crosses the ocean. You haven't hit a code problem. You've hit a distance problem. And a single server can't be close to everyone.
Part 2: The Building Blocks of a System
Now that you understand how machines communicate, let's talk about the components you'll use to build systems. Think of these as Lego pieces — you choose which pieces to use and explain why.
2.1 Servers
Frontend Servers serve your UI — HTML, CSS, JS. In modern architectures, they often serve a static SPA (React/Vue) that makes API calls to the backend.
Backend Servers run business logic: processing payments, validating data, querying databases. Examples: Node.js (Express), Python (FastAPI), Go.
What is Nginx? Nginx (pronounced "engine-X") is a high-performance web server and reverse proxy — software that sits between the internet and your backend servers. You'll see it mentioned throughout this guide. It handles two main jobs: (1) serving static files (your HTML, CSS, JS) directly to users, and (2) forwarding API requests to your backend application servers — handling load balancing, SSL termination, and caching so your app servers don't have to. Think of it as a receptionist — it greets every visitor and routes them to the right department. Most production deployments use Nginx (or a cloud equivalent like AWS ALB) as the entry point to their system.
Read vs Write Servers (CQRS — Command Query Responsibility Segregation, a pattern where reads and writes use separate models or even separate databases): In high-traffic systems, separate read and write paths. The idea: reads and writes have fundamentally different needs. Reads can be scaled easily with replicas and caching; writes need strict consistency guarantees. By separating them, you can optimize each independently — for example, the read database might store a user's profile with their 10 most recent orders already combined into one row (fast to fetch, no joins needed), while the write database keeps users and orders in separate tables to maintain clean, consistent data. We'll cover this concept fully in Section 3.8: Normalization vs Denormalization.
Serverless Functions (Lambda): Single functions that run on-demand. The cloud provider manages all the infrastructure — you just write the function code.
What is a "cold start"? When a serverless function hasn't been called recently, the cloud provider needs to spin up a new container to run it. This initial setup can add 100ms–2s of latency. It's like a car that's been sitting in the cold — it takes a moment to warm up. After the first call, subsequent calls are fast ("warm starts"). This is why serverless isn't great for latency-sensitive operations.
Use for: event-driven tasks, unpredictable traffic, simple stateless operations. Avoid for: long-running processes, WebSockets, or high-volume steady traffic (a dedicated server is cheaper).
2.2 APIs
Wait — isn't this just HTTP? What's the difference between HTTP and an API? HTTP is the transport protocol — the language that clients and servers use to talk to each other (we covered this in Section 1.5). An API is a contract — a set of rules about what URLs exist, what data you send, and what you get back. Think of it this way: HTTP is like the postal service (it delivers envelopes), and an API is like a form you fill out and mail in (it defines what information goes in the envelope and what response you'll get). REST, GraphQL, and gRPC are different styles of designing that form — but they all use HTTP (or similar protocols) to deliver it.
REST
The most common API style. REST stands for Representational State Transfer. The key idea: everything is a resource (a noun — user, project, order) identified by a URL. You use HTTP methods to perform actions on resources.
GET /api/users → List all users
GET /api/users/123 → Get user 123
POST /api/users → Create a new user
PUT /api/users/123 → Update user 123 (full replace)
PATCH /api/users/123 → Update user 123 (partial)
DELETE /api/users/123 → Delete user 123
GraphQL
Client specifies exactly what data it wants. Single endpoint. Best for complex data relationships and mobile apps (minimize data transfer). Harder to cache because every query is different.
// GraphQL query — client asks for exactly what it needs
query {
user(id: 123) {
name
email
projects {
title
status
}
}
}
// Response — no over-fetching, no under-fetching
{
"user": {
"name": "John",
"email": "john@example.com",
"projects": [
{ "title": "AI Dashboard", "status": "active" }
]
}
}
With REST, you'd need GET /users/123 then GET /users/123/projects — two round trips. GraphQL does it in one.
REST vs GraphQL: When to Pick Each
| Factor | REST Wins | GraphQL Wins |
|---|---|---|
| Data shape | Simple, predictable resources | Deeply nested, interconnected data |
| Client diversity | One client type (e.g., web only) | Many clients needing different data (web, mobile, TV) |
| Caching | HTTP caching works out of the box (CDN, browser) | Harder — all requests are POST to one endpoint |
| Bandwidth | Over-fetching is acceptable | Mobile/low-bandwidth needs only specific fields |
| Team expertise | Most developers know REST already | Requires learning schema, resolvers, and tooling |
Default to REST unless you have multiple clients with significantly different data needs. GraphQL shines when the alternative is dozens of custom REST endpoints.
gRPC
A high-performance protocol for server-to-server communication. "Binary protocol" means data is encoded in a compact binary format (Protocol Buffers) instead of JSON text — smaller payloads, faster parsing; think of the difference between a handwritten letter and a barcode: less readable to humans, but much faster for machines. "Strongly typed" means both the client and server share a .proto definition file that defines the exact shape of every message and service — if you send the wrong type, it fails at compile time, not at runtime. Use gRPC for internal microservice communication where performance matters.
WebSockets
Persistent, bidirectional connection. Unlike HTTP (client asks, server responds), WebSockets allow the server to push data to the client at any time.
How it works: the connection starts as a normal HTTP request carrying an Upgrade: websocket header; the server responds with 101 Switching Protocols, and from then on the same TCP connection stays open for two-way messages.
Use for real-time features: chat, live dashboards, collaborative editing, notifications, multiplayer games.
How do you handle API versioning? Common approaches: URL path versioning (/v1/users, /v2/users) or a custom header (API-Version: 2). Version when you make breaking changes to the contract. Keep old versions running until clients migrate — never break existing consumers.
2.3 Databases
SQL (Relational)
Tables with rows, columns, and foreign keys. PostgreSQL is the go-to. Choose when: data is relational, you need ACID transactions (Atomicity, Consistency, Isolation, Durability — guarantees that database operations are reliable; see Part 3 for the full breakdown), complex queries, or consistency is critical.
Why not just use one big database for everything? Because different data has different access patterns. Your user profiles need complex queries and consistency (SQL). Your session tokens need blazing-fast lookups (Redis). Your product images need cheap, infinite storage (S3). Using the right tool for each job is a core system design principle.
NoSQL Types
- Document Stores (MongoDB): Store data as flexible documents — self-contained, JSON-like records ({"name": "John", "age": 30}) that can have nested fields, arrays, and varying structure, unlike a rigid SQL row. Use for semi-structured data, varying schemas, content management.
- Key-Value Stores (Redis): Simple key → value lookups. Entirely in-memory, sub-millisecond reads. Use for caching, sessions, rate limiting.
- Wide-Column (Cassandra): Massive scale writes with high availability. Use for time-series, IoT data, event logging.
- Graph (Neo4j): Data stored as nodes and edges (relationships). Use for social networks, recommendations, fraud detection.
Database Indexes
An index is like a book's index — instead of reading every page (full table scan), you look up the topic and jump to the right page.
- Primary index: Automatically created on the primary key. In many databases (e.g. MySQL's InnoDB) it also determines the physical storage order of rows on disk (a clustered index). Every table has one.
- Secondary index: Manually created on frequently queried columns. A separate data structure that maps column values → row locations. Example: indexing email in a users table so WHERE email = 'john@example.com' doesn't scan all rows.
Trade-off: Indexes speed up reads (O(log n) lookup instead of O(n) scan) but slow down writes (every INSERT/UPDATE must also update the index). Don't index every column — index the columns you frequently WHERE, JOIN, or ORDER BY on.
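The trade-off can be sketched with a Map standing in for a secondary index (illustrative only — real indexes are B-trees with O(log n) lookups, but the read/write tension is the same):

```javascript
const users = [
  { id: 1, email: 'john@example.com' },
  { id: 2, email: 'jane@example.com' },
];

// Without an index: full scan — O(n).
const scanByEmail = (email) => users.find(u => u.email === email);

// Secondary "index": email → row. Lookups are O(1) here
// (a real B-tree index is O(log n)), but…
const emailIndex = new Map(users.map(u => [u.email, u]));

// …every write now pays twice: the table AND the index must be updated.
function insertUser(user) {
  users.push(user);
  emailIndex.set(user.email, user);
}

insertUser({ id: 3, email: 'sam@example.com' });
console.log(emailIndex.get('sam@example.com').id); // 3
```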
If Redis is so fast, why not use it as the main database? Redis stores everything in RAM, which is expensive and volatile. A server with 256GB RAM costs 10x more than 256GB of SSD. And if Redis crashes without persistence configured, your data is gone. Use Redis as a cache layer in front of your primary database — not as a replacement for it.
2.4 Blob / Object Storage
For large files: images, videos, PDFs. Use AWS S3 / Azure Blob. Upload file → get URL → store URL in database. Cheap, scales to exabytes, integrates with CDNs.
Pre-signed URLs
When a user needs to download a large file, don't proxy it through your server. Instead, generate a pre-signed URL — a temporary, time-limited URL that lets the user download directly from S3. Your server hands them the URL and steps out of the way.
// Generate a URL valid for 15 minutes — user downloads directly from S3
const url = await s3.getSignedUrlPromise('getObject', {
Bucket: 'my-bucket',
Key: attachment.s3Key,
Expires: 900 // seconds
});
// Your server never touches the file bytes on download
This matters enormously at scale: a 2GB video download doesn't consume your server's bandwidth or memory at all. The pattern is: store the S3 key in your database, the actual file in S3. Your database row is tiny; S3 handles the heavy lifting.
2.5 Load Balancers
A load balancer sits in front of your servers and distributes incoming requests so no single server gets overwhelmed.
Algorithms
| Algorithm | How It Works | Best For |
|---|---|---|
| Round Robin | Requests go to servers in order: A, B, C, A, B, C... | Identical servers, even distribution |
| Weighted Round Robin | More powerful servers get proportionally more requests | Servers with different capacities |
| Least Connections | Send to whichever server currently has the fewest active connections | Requests with varying processing times |
| IP Hash | Hash the client's IP to determine which server handles it — same client always goes to same server | Session stickiness without shared state |
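Two of these algorithms are simple enough to sketch in a few lines (hypothetical helper names):

```javascript
// Round robin: cycle through servers in order — A, B, C, A, B, C…
function makeRoundRobin(servers) {
  let i = 0;
  return () => servers[i++ % servers.length];
}

// Least connections: pick the server with the fewest active connections.
function leastConnections(servers) {
  return servers.reduce((min, s) => (s.active < min.active ? s : min));
}

const rr = makeRoundRobin(['A', 'B', 'C']);
console.log([rr(), rr(), rr(), rr()].join('')); // 'ABCA'

const pool = [
  { name: 'A', active: 7 },
  { name: 'B', active: 2 },
  { name: 'C', active: 5 },
];
console.log(leastConnections(pool).name); // 'B'
```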
Layer 4 vs Layer 7
Layer 4 (Transport)
Sees: IP addresses, ports
Decision: Route by IP/port
Example: All traffic on :443 → Server Pool
Layer 7 (Application)
Sees: URLs, headers, cookies, body
Decision: Route by content
Example: /api/* → Backend servers, /static/* → CDN, /admin/* → Admin servers
Layer 4 is faster (less inspection), but Layer 7 is smarter (can route based on HTTP content). Default to Layer 7 for more control.
What if the load balancer itself goes down? You always run at least two load balancers. One handles traffic while the other stands by as a backup. If the active one fails, the backup takes over automatically — usually within seconds. In practice, cloud providers (AWS, GCP, Azure) manage this for you behind the scenes. When you create a load balancer on AWS, you're never relying on a single machine — they run redundant copies automatically.
2.6 Caching
Caching stores frequently accessed data in a fast layer (RAM) so you don't have to hit the slower layer (database/disk) every time. Analogy: a sticky note of frequently called phone numbers on your desk — faster than looking them up in the phone book every time.
Redis is the industry standard in-memory cache.
Cache-Aside (Lazy Loading) — Step by Step
1. App checks the cache for the key
2. Hit → return the cached value (fast path, database untouched)
3. Miss → read the value from the database
4. Write the value into the cache (with a TTL) and return it — the next request is a hit
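A minimal sketch of the flow, with Maps standing in for Redis and the database (illustrative; a real implementation would use async clients and set a TTL):

```javascript
const cache = new Map();                              // stand-in for Redis
const db = new Map([['user:123', { name: 'John' }]]); // stand-in for the DB
let dbReads = 0;

function getUser(id) {
  const key = `user:${id}`;
  if (cache.has(key)) return cache.get(key); // cache hit → return immediately
  dbReads += 1;
  const row = db.get(key);                   // miss → load from the database
  cache.set(key, row);                       // populate the cache for next time
  return row;
}

getUser(123); // miss → reads the DB
getUser(123); // hit  → served from cache
console.log(dbReads); // 1
```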
Invalidation Strategies
- Cache-Aside (lazy loading): App checks cache first, loads from DB on miss. Most common pattern.
- Write-Through: Every write goes to both the cache and the database simultaneously. Cache is always fresh, but writes are slower (two operations).
- Write-Behind (Write-Back): Write to cache immediately, then asynchronously flush to the database later. Fastest writes, but risk of data loss if the cache crashes before flushing.
Key Concepts
TTL (Time to Live): A countdown timer on a cache entry — when it hits zero, the entry is automatically deleted. Even if your active invalidation fails, TTL ensures data refreshes eventually instead of being served stale forever.
LRU (Least Recently Used): When the cache is full, evict the item that hasn't been accessed for the longest time. The assumption: recently accessed data is more likely to be accessed again.
What happens when the cache has stale data? This is the fundamental tension of caching. If a user updates their profile, but the old version is still in the cache, other users see outdated info. Solutions: delete/update the cache on writes (cache-aside), use short TTLs for frequently changing data, or use write-through for data where freshness is critical.
What happens when the cache fills up? The cache uses an eviction policy to make room. LRU (Least Recently Used) evicts the oldest-accessed entry. LFU (Least Frequently Used) evicts the least-accessed overall. Most systems use LRU — it's simple and works well for typical access patterns. Important: Redis defaults to no eviction (it returns errors when full), so always configure a maxmemory-policy.
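A minimal LRU can be built on Map's insertion-order iteration (a sketch — production caches like Redis use approximated LRU for efficiency):

```javascript
// Re-inserting a key moves it to the "most recently used" end of the Map,
// so the first key in iteration order is always the least recently used.
class LruCache {
  constructor(capacity) { this.capacity = capacity; this.map = new Map(); }
  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    this.map.delete(key);  // refresh recency
    this.map.set(key, value);
    return value;
  }
  set(key, value) {
    this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.capacity) {
      const lru = this.map.keys().next().value; // oldest entry
      this.map.delete(lru);                     // evict it
    }
  }
}

const lru = new LruCache(2);
lru.set('a', 1);
lru.set('b', 2);
lru.get('a');            // 'a' is now most recently used
lru.set('c', 3);         // cache full → evicts 'b', the LRU entry
console.log(lru.get('b')); // undefined
console.log(lru.get('a')); // 1
```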
Distributing Cache: Consistent Hashing
A single Redis instance has a memory ceiling. When you need multiple cache nodes, you must decide which keys go where. The naive approach — hash(key) % N — breaks when you add or remove a node (N changes → ~80% of keys remap → massive cache-miss storm).
Consistent hashing solves this by placing cache nodes on a virtual ring. Adding a node only remaps ~1/N of keys instead of all of them. This is how Redis Cluster, DynamoDB, and Cassandra distribute data.
Full explanation with diagrams in Section 3.9: Hashing & Consistent Hashing.
2.7 CDN
A CDN (Content Delivery Network) is a global network of servers that caches and serves content from the location closest to the user.
An edge server is a CDN server in a specific geographic location. When a user requests a file, DNS routes them to the nearest edge server. If the edge has the file cached, it serves it immediately; if not, it fetches the file from your origin (e.g., S3), caches it, and serves every subsequent nearby request locally.
Use for: images, CSS/JS bundles, videos, fonts. Don't use for: user-specific data (dashboards) or real-time data (live stock prices).
2.8 Message Queues
A queue is a FIFO (First In, First Out — like a line at a coffee shop, where the first person in line gets served first) sequence of messages waiting to be processed. A producer adds messages; a consumer reads and processes them.
Queue vs Stream
Queues (SQS/RabbitMQ): One consumer per message. Message is deleted after processing. Think of a to-do list — once a task is done, you cross it off.
Streams (Kafka): Multiple consumers can read the same messages. Messages are retained and replayable. Think of a newspaper — many people can read the same article, and back issues are kept.
// Producer: add a job to the queue
await queue.send({
  type: 'SEND_WELCOME_EMAIL',
  userId: 123,
  email: 'john@example.com'
});

// Consumer: process jobs from the queue
while (true) {
  const message = await queue.receive();
  await sendEmail(message.email, 'Welcome!');
  await queue.delete(message.id); // Done, remove from queue
}
The power of queues: they decouple producers from consumers. Your API server says "send this email" and immediately responds to the user. The email actually sends 2 seconds later via a background worker. If the email service is down, the message stays in the queue and retries later.
What happens if a worker crashes mid-processing? The message becomes "invisible" for a timeout period (visibility timeout in SQS, or unacked in RabbitMQ). If the worker doesn't acknowledge completion before the timeout, the message reappears in the queue for another worker to pick up. After several failed attempts, it moves to a dead letter queue (DLQ) for manual inspection.
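That lifecycle can be simulated in a few lines. Real queues like SQS track receive counts and visibility timeouts server-side; this in-memory version (all names illustrative) just shows the logic of delete-on-success and DLQ-after-N-failures:

```javascript
// In-memory simulation of an SQS-style queue with a receive-count-based DLQ.
const MAX_RECEIVES = 3;
const queue = []; // pending messages
const dlq = [];   // dead letter queue for poison messages

function send(body) {
  queue.push({ body, receiveCount: 0 });
}

// One worker iteration: receive, process, delete only on success.
async function pollOnce(handler) {
  const msg = queue.shift();
  if (!msg) return;
  msg.receiveCount++;
  try {
    await handler(msg.body);
    // Success: the message is "deleted" (we simply don't requeue it).
  } catch (err) {
    if (msg.receiveCount >= MAX_RECEIVES) {
      dlq.push(msg); // poison message: park it for manual inspection
    } else {
      queue.push(msg); // reappears, as after a visibility timeout
    }
  }
}
```

The key property: a crash before acknowledgment leaves the message in the queue, so another worker eventually retries it, and a message that fails repeatedly ends up in the DLQ instead of blocking everything behind it.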
2.9 Virtual Networks
A VPC (Virtual Private Cloud) is your own isolated network in the cloud. Think of it like a gated office building — you control who gets in and what they can access.
A subnet is a subdivision of your VPC:
- Public subnet: Accessible from the internet. Only put your load balancer here.
- Private subnet: NOT accessible from the internet. Put your app servers, databases, and Redis here.
Only the load balancer is exposed to the internet (public subnet). App servers, PostgreSQL, and Redis sit in a private subnet, unreachable from outside. Attackers can't directly reach your database — they'd have to go through the load balancer, then the app server, which validates authentication and authorization.
What if an app in the private subnet needs to call an external API? Use a NAT (Network Address Translation) gateway. It sits in the public subnet and lets private instances make outbound requests — for example, calling Stripe's API to process a payment — without being directly reachable from the internet. Outbound: yes. Inbound: still blocked.
The LinkPulse Story
Chapter 2: Week Two
LinkPulse has 500 users now. Every page load — the feed, trending links, user profiles — hits PostgreSQL directly. You run htop on the server: CPU is pinned at 80%. The database is doing all the work, and most of it is redundant. The "trending links" query crunches the same numbers for every single visitor, and the results barely change minute to minute.
You add Redis as a cache. The pattern is simple: before querying PostgreSQL, check Redis. If the data is there and fresh, return it. If not, query PostgreSQL, store the result in Redis with a 60-second expiration (TTL — it auto-deletes after 60 seconds so data stays fresh), and return it. This is the cache-aside pattern you learned in Section 2.6. The trending page response drops from 200ms to 3ms. CPU falls to 20%.
Uh Oh
Users start uploading thumbnail images for their links. You store the actual image files directly inside PostgreSQL — it seemed easy at the time. Within days, the database balloons to 12 GB. Queries slow to a crawl because the database is now struggling to sift through huge image files mixed in with your regular data. Every time someone loads a feed page, the server pulls megabytes of image data out of the database and sends it all to the user's browser.
You move all images to Amazon S3 — object storage designed exactly for this. The database now stores only a URL string pointing to S3. Then you put Cloudflare CDN in front of S3. The CDN caches copies of your images on servers around the world (those "edge servers" from Section 2.7). Your user in Mumbai? Their thumbnail now loads in 5ms from an edge server in Mumbai, not 200ms from Virginia. The database shrinks back to 400 MB.
Analogy
Your single server was a restaurant where the chef also waits tables, washes dishes, and parks cars. Adding S3, Redis, and CDN is like hiring specialists — a waiter, a dishwasher, a valet. The chef can finally focus on cooking. Each component does one thing well.
Two weeks later, a tech blogger shares LinkPulse. Traffic spikes 10× overnight. Your single Node.js process can't keep up — requests pile up and start timing out. Users see 502 errors. You need more servers, fast.
You spin up 3 Node.js instances behind an Application Load Balancer (ALB). The ALB uses round-robin to distribute incoming requests across the three servers. This works because — remember — HTTP is stateless (Section 1.6): no server "remembers" your previous request. And since you store login sessions in Redis (not on the server itself), any of the three servers can check Redis and know you're logged in. The same user can hit server 1, then server 3, and both know they're logged in.
You also notice the signup flow is slow: after creating an account, the API sends a welcome email right then and there, during the request — meaning the user has to sit and wait for the email to actually send before they see the signup confirmation. If the email provider is slow (or down), the user waits. You add a message queue (Amazon SQS) — the API drops a "send welcome email" message onto the queue and returns instantly. A background worker picks up the message and sends the email on its own time. The user never waits for the email provider.
Uh Oh
You now have three app servers, but still one PostgreSQL database. You can keep adding more app servers to handle more traffic — that part scales. But the database doesn't. Database CPU climbs to 95% during peak hours. You distributed the code, but the data is still on one machine. Every single read and write from all three servers funnels into that single database.
Key Takeaway
Every building block from Part 2 — caching, CDN, object storage, load balancers, message queues — solved a specific bottleneck. But notice the pattern: you pushed the problem somewhere else each time. The hardest question is still ahead: what do you do when a single database can't keep up?
Part 3: Core Patterns & Principles
These are foundational theoretical concepts that underpin every distributed system. Understanding them deeply will make every design decision clearer.
3.1 ACID Transactions
ACID is a set of four guarantees that relational databases provide for transactions:
- Atomicity: All operations succeed or none do. Example: transferring $100 from Account A to Account B involves two operations (debit A, credit B). If the system crashes after debiting A but before crediting B, atomicity ensures the debit is rolled back — no money disappears.
- Consistency: The database always moves from one valid state to another. All constraints (unique keys, foreign keys, check constraints) are enforced. Example: if you have a constraint that account balances can't go negative, any transaction that would violate this is rejected.
- Isolation: Concurrent transactions execute as if they were sequential. Example: if two users try to book the last seat on a flight simultaneously, isolation prevents them from both succeeding — one will see "seat taken."
- Durability: Once a transaction is committed, it survives crashes, power outages, and hardware failures. The data is written to disk, not just held in memory. Example: after you see "Payment confirmed," that data persists even if the server crashes 1 second later.
But what level of isolation? Doesn't that matter? Yes — databases offer multiple isolation levels. READ COMMITTED (PostgreSQL's default) prevents dirty reads but allows non-repeatable reads. SERIALIZABLE prevents all anomalies but is the slowest. Most applications use READ COMMITTED and handle edge cases in application logic or with explicit row-level locking.
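For the flight-seat example under Isolation, explicit row-level locking looks roughly like this with a node-postgres-style client (the `seats` schema and the `client` object are assumed for illustration):

```javascript
// Book the last seat safely: lock the row so two concurrent
// transactions can't both read "available" and both book it.
async function bookSeat(client, seatId, userId) {
  await client.query('BEGIN');
  try {
    // FOR UPDATE makes other transactions wait on this row
    // until we commit or roll back.
    const { rows } = await client.query(
      'SELECT status FROM seats WHERE id = $1 FOR UPDATE', [seatId]);
    if (rows[0].status !== 'available') {
      await client.query('ROLLBACK');
      return { ok: false, reason: 'seat taken' };
    }
    await client.query(
      "UPDATE seats SET status = 'booked', user_id = $2 WHERE id = $1",
      [seatId, userId]);
    await client.query('COMMIT');
    return { ok: true };
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  }
}
```

With READ COMMITTED plus this lock, the second concurrent booking blocks until the first commits, then sees `status = 'booked'` and returns "seat taken" instead of double-booking.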
3.2 CAP Theorem
A distributed system can provide at most TWO of: Consistency (every read returns latest write), Availability (every request gets a response), Partition Tolerance (system works despite network splits). Since network partitions are inevitable in any distributed system, your real choice is CP or AP.
CP example (banking): During a network partition between two data centers, a banking system rejects transactions rather than risk processing them with stale data. Users see "Service temporarily unavailable" — annoying, but no one loses money.
AP example (social feed): During a partition, a social media feed stays up and serves content, even if some posts are a few seconds stale. Users would rather see a slightly outdated feed than an error page.
Can a system be both consistent AND available? Yes — when there's no partition. CAP only forces a trade-off during a network split (also called a "partition") — that's when two parts of your system literally can't talk to each other. Imagine you have two data centers (East and West), and the network connection between them goes down. Each side is still running, but they can't sync data. Now you have to choose: do you reject requests until the connection is restored (choosing Consistency), or keep serving requests knowing the two sides might have different data (choosing Availability)? When the network is healthy and everything can communicate normally, you can have all three. That's why CAP is about failure modes, not normal operation.
3.3 Consistency Models
When data is stored across multiple replicas — copies of the same data store (PostgreSQL, MongoDB, DynamoDB, or a Redis cache) kept in sync for read scaling and fault tolerance — the question is: how up-to-date are reads?
- Strong consistency: Every read returns the latest write, no matter which replica you ask. Like a shared Google Doc — every viewer sees the exact same version at all times.
- Eventual consistency: Replicas converge to the same value eventually — after a write, different replicas may briefly return different, stale values, but given enough time with no new writes, they'll all agree. Like updating your social media profile — your friend might see the old photo for a few seconds before it refreshes.
- Read-Your-Writes: You always see your own writes immediately; others may see stale data. The best of both worlds for user experience.
If eventual consistency means stale data, how is that acceptable? Because many use cases don't need instant consistency. A "likes" counter showing 1,002 instead of 1,003 for 2 seconds is fine. A product listing showing old stock for 5 seconds is acceptable. The trade-off is that eventual consistency lets you scale reads massively across replicas without coordinating between them on every read.
3.4–3.5 Latency & Throughput
Latency = how long one operation takes. Throughput = how many operations per second.
Key Latency Numbers (approximate)
| Operation | Time | Relative |
|---|---|---|
| L1 cache reference | 0.5 ns | — |
| RAM access | 100 ns | 200× L1 |
| SSD random read | 150 µs | 1,500× RAM |
| HDD seek | 10 ms | 67× SSD |
| Same datacenter round trip | 0.5 ms | — |
| Cross-continent round trip | 150 ms | 300× datacenter |
Key takeaway: RAM is ~1,000× faster than SSD, and a cross-continent round trip is ~300× slower than one inside a datacenter. This is why caching is so powerful — moving a lookup from disk to memory can drop response time from 50ms to 0.1ms.
Throughput example: A single server handling 1,000 requests/second at 50ms per request has high throughput and low latency. A batch pipeline that churns through millions of records per hour, but where any individual record waits minutes for its turn, has high throughput and terrible latency. The two are related but independent.
3.6 Fault Tolerance
The art of keeping a system running even when things break. Things WILL break — hard drives die, code has bugs, networks go down. The question isn't if something will fail, it's how your system responds when it does. Failures fall into three categories:
Types of Failures
1. Hardware Failures — physical components break.
- Disk dies: The hard drive that stores your data physically stops working. Mitigation: RAID (Redundant Array of Independent Disks) — instead of storing data on one disk, you spread it across multiple disks so that if one fails, the data can be rebuilt from the others. Think of it like writing the same notes in two separate notebooks — lose one, you still have a copy.
- Memory corrupts: The server's RAM (the fast, temporary storage it uses while running) has a glitch and flips a bit of data. Mitigation: ECC memory (Error-Correcting Code) — special RAM that can detect and automatically fix single-bit errors. Regular RAM doesn't catch these errors — your server would just silently use wrong data. ECC catches it and fixes it on the fly.
- Power supply fails: The physical power unit in the server dies. Mitigation: multiple power supplies — server-grade machines come with two, each plugged into a different power source.
- Entire server dies: Mitigation: multi-AZ deployment. "AZ" stands for Availability Zone — cloud providers like AWS split each region (e.g., US-East) into multiple physically separate data centers (AZs). If you run copies of your server in two AZs, a power outage or fire in one data center doesn't take you offline.
2. Software Failures — your code (or the code you depend on) misbehaves.
- Deadlock: Two parts of your program are each waiting for the other to finish, so neither ever does. Imagine two people in a hallway — each waiting for the other to step aside, so nobody moves. In code, this usually happens when two processes each hold a resource (like a database row) and need the one the other is holding. The app freezes.
- Memory leak: Your program gradually uses more and more RAM without releasing it. Like filling a bathtub with the drain closed — eventually it overflows. After hours or days, the server runs out of memory and crashes. Common in long-running servers that create objects but never clean them up.
- Out-of-memory crash: The program asks for more RAM than the server has, and the operating system kills it.
Mitigations for software failures:
- Health checks: The load balancer periodically asks each server "are you alive?" (usually by hitting an endpoint like GET /health). If a server stops responding, the load balancer takes it out of the pool.
- Auto-restart: Tools called process managers watch your app and restart it if it crashes. PM2 is a popular one for Node.js apps; systemd is built into Linux itself. They're like a babysitter that picks the kid back up when they fall.
- Circuit breaker: If your app calls an external service (say, a payment API) and that service is failing, a circuit breaker stops your app from repeatedly trying. It "opens the circuit" (like a fuse in your house) and returns a fallback response immediately, instead of making your users wait for a request that's going to fail anyway. After a cooldown period, it tries one request to see if the service recovered.
- Graceful degradation: When part of your system fails, you keep the rest working. Example: if your recommendation engine goes down, you still show the main feed — just without personalized suggestions. The user gets a slightly worse experience instead of a completely broken page.
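The circuit breaker described above can be sketched in a few lines. This is a minimal version; production libraries add a half-open probe state, metrics, and fallback handling:

```javascript
// Minimal circuit breaker: open after N consecutive failures,
// fail fast while open, allow one trial call after a cooldown.
class CircuitBreaker {
  constructor(call, { threshold = 3, cooldownMs = 10_000 } = {}) {
    this.call = call;           // the wrapped external-service call
    this.threshold = threshold; // consecutive failures before opening
    this.cooldownMs = cooldownMs;
    this.failures = 0;
    this.openedAt = null;       // timestamp when the circuit opened
  }

  async exec(...args) {
    if (this.openedAt !== null) {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        throw new Error('circuit open: failing fast'); // skip the doomed call
      }
      this.openedAt = null; // cooldown over: allow one trial request
    }
    try {
      const result = await this.call(...args);
      this.failures = 0; // success resets the counter
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures >= this.threshold) this.openedAt = Date.now();
      throw err;
    }
  }
}
```

The caller catches the "circuit open" error and serves a fallback (cached data, a default response) instead of waiting on a service that's known to be down.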
3. Network Failures — communication between parts of your system breaks.
- Packets dropped: Data is sent over the network in small chunks called "packets." Sometimes they just get lost in transit — like letters lost in the mail.
- DNS resolution fails: DNS is the "phone book" that converts domain names (like api.stripe.com) into IP addresses. If that lookup fails, your server literally doesn't know where to send the request.
- Network partition: Two parts of your system can't talk to each other at all (the "network split" from Section 3.2 CAP).
Mitigations for network failures:
- Retries with exponential backoff: If a request fails, try again — but wait a little longer each time. First retry after 1 second, then 2 seconds, then 4, then 8. This prevents all your servers from hammering a struggling service at the same time (which would make the problem worse).
- Timeouts: Never wait forever for a response. Set a deadline (e.g., 3 seconds). If the other service doesn't respond by then, give up and either retry or return an error. Without timeouts, one slow service can freeze your entire application.
- Fallback responses: If you can't reach a service, return something reasonable instead of an error. Can't reach the recommendations service? Show trending posts instead.
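Retries with exponential backoff take only a few lines. This is a sketch; real implementations also add random jitter and a cap on the maximum delay:

```javascript
// Retry with exponential backoff: wait 1s, 2s, 4s, ... between attempts.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetries(fn, { attempts = 4, baseDelayMs = 1000 } = {}) {
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      if (i === attempts - 1) throw err; // out of retries: give up
      await sleep(baseDelayMs * 2 ** i); // 1s, 2s, 4s, ...
    }
  }
}
```

Combine this with a per-attempt timeout so one slow call can't stall the whole retry loop, and the pattern covers both dropped packets and transient service errors.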
Redundancy Strategies
A well-designed system handles all three failure types simultaneously. The core idea: have backup copies of everything. And "everything" means both your app servers (the machines running your code) and your databases (the machines storing your data). In a real production system, you'll typically have multiple app servers behind a load balancer, AND multiple copies of your database — so if any single machine dies, there's another one ready to go.
These strategies apply to any component — web servers, databases, caches, queues — anywhere a single machine failing would be a problem:
- Active-Passive (failover): You run two copies of a component — say, two database servers. One (the "active") handles all the work. The other (the "passive") sits idle, continuously receiving a copy of all the data, ready to go at a moment's notice. If the active one crashes, the passive takes over its role. Analogy: an understudy in a theater production — always rehearsed, but only goes on stage if the lead can't. This is the most common pattern for databases, where you want one source of truth with a hot backup.
- Active-Active: You run two copies and both handle real traffic at the same time. If one dies, the other absorbs all traffic instantly — no switchover delay. Analogy: two cashiers at a store — if one takes a break, the other keeps the line moving. This is the most common pattern for web/app servers (and load balancers), where any copy can handle any request.
- N+1 redundancy: Run one more instance than you need. If you need 3 app servers to handle peak traffic, run 4. If you need 1 database primary, have 1 standby. That way losing any single machine doesn't put you over capacity or take you offline.
What Actually Happens During a Database Failover
This example uses a relational database (PostgreSQL), but the same failover pattern applies to Redis clusters, MongoDB replica sets, Cassandra nodes — any storage system where you can't afford downtime.
Every production database should have a synchronous standby — a second copy of the database that receives every write at the same time as the primary. "Synchronous" means the primary waits for the standby to confirm "I got it" before telling your app "write successful." This way, the standby always has an exact copy of the data.
When the primary crashes, here's the actual timeline:
T+0s: Primary DB crashes (hardware failure or network issue)
T+10s: Cloud provider detects primary is unhealthy (health checks fail)
T+10s: Failover decision made — promote the standby to become the new primary
T+30s: Standby promoted. DNS record updated to point to the new server's IP
T+60s: App servers reconnect to new primary. Writes resume.
Synchronous standby = zero data loss — every write that was confirmed exists on both machines. An asynchronous standby (where the primary doesn't wait for confirmation) is cheaper and faster, but if the primary crashes, the last few seconds of writes may be lost because they hadn't been copied yet.
With proper retry logic — where your app automatically retries failed database connections, waiting a bit longer each time (exponential backoff: 1s, 2s, 4s, 8s...) — most users just see a brief "saving..." spinner, then their operation succeeds. No human has to wake up and manually fix anything.
Chaos Engineering
Chaos engineering means deliberately breaking things in production to make sure your backups actually work. Netflix pioneered this with a tool called Chaos Monkey, which randomly kills production servers throughout the day. If your system survives random server deaths without users noticing, your redundancy is real — not just theoretical. The key insight: if you only test failover during an actual outage, you will fail spectacularly at 2am.
What about split-brain — when two servers both think they're the primary? This is the most dangerous failure mode in databases. If both accept writes, you end up with two different versions of your data that can never be automatically merged. Prevention: use a quorum (majority vote). Before a standby can be promoted to primary, a majority of nodes in the cluster have to agree: "yes, the old primary is really dead, and yes, you should take over." This is why database clusters typically have odd numbers of nodes (3, 5, 7) — with an even number, you can get a tie vote and nobody gets promoted.
3.7 Sync vs Async
Sync: caller waits. Async: fire and forget, handle later. Use async for emails, image processing, analytics. Use sync for critical operations where the user must know the result.
The Event Bus Pattern
A powerful async pattern: when a service completes an operation, it publishes an event to a message queue (Kafka, SQS, etc.). Multiple independent consumers react to that event asynchronously — the publisher doesn't know or care who's listening.
// Write service: update the database, then publish an event
await db.query('UPDATE documents SET content=$1 WHERE id=$2', [content, docId]);
await eventBus.publish('document.updated', { docId, tenantId, userId });

// --- Async consumers react independently ---

// Search indexer updates Elasticsearch
eventBus.consume('document.updated', async ({ docId }) => {
  await searchIndex.reindex(docId);
});

// Cache invalidator clears stale Redis entry
eventBus.consume('document.updated', async ({ docId, tenantId }) => {
  await redis.del(`doc:${tenantId}:${docId}`);
});

// Audit logger writes to audit trail
eventBus.consume('document.updated', async (event) => {
  await auditLog.write(event);
});
The write service finishes fast. Everything downstream — search indexing, cache invalidation, audit logging — happens asynchronously without blocking the user's response. This is a natural companion to CQRS (Command Query Responsibility Segregation): the write side publishes events, the read side consumes them to update its optimized read models.
What if an async consumer fails to process a message? Use retries with exponential backoff: retry after 1s, then 2s, then 4s, then 8s. After N failures, route the message to a dead letter queue (DLQ) for manual inspection. This prevents a single "poison message" from blocking the entire queue.
3.8 Normalization vs Denormalization
Normalization means organizing data to eliminate redundancy. Each piece of information lives in exactly one place. You connect related data using foreign keys — a column in one table that references the primary key of another (e.g., orders.user_id references users.id) — and retrieve it with JOINs, SQL operations that combine rows from related tables (e.g., SELECT * FROM orders JOIN users ON orders.user_id = users.id).
-- Normalized: user name stored once
SELECT orders.id, users.name, orders.total
FROM orders
JOIN users ON orders.user_id = users.id
WHERE orders.id = 456;
-- Denormalized: user name copied into orders table
SELECT id, user_name, total
FROM orders
WHERE id = 456; -- No JOIN needed, but user_name might be stale
Normalized: Less storage, no update anomalies (change a user's name in one place), but more JOINs (slower reads). Denormalized: Faster reads (no JOINs), but risk of inconsistency (user changes name → must update everywhere). Normalize core data; denormalize for read-heavy dashboards.
3.9 Hashing & Consistent Hashing
A hash function takes an input (any size) and produces a fixed-size output. The same input always produces the same output, but even a tiny change in input produces a completely different output. Hash functions are used everywhere: data integrity, caching, load distribution, and security.
The Modulo Problem
The simplest way to distribute data across N servers: hash(key) % N. Key "user_42" hashes to 78 → 78 % 4 = server 2. This works until you add or remove a server:
With 4 servers: hash("user_42") % 4 = 2 → Server 2
With 5 servers: hash("user_42") % 5 = 3 → Server 3 (MOVED!)
Result: ~80% of ALL keys remap to different servers
→ massive cache misses → thundering herd on the database
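You can verify the remapping disaster yourself: hash a batch of keys and count how many land on a different server when N goes from 4 to 5 (the FNV-1a hash below is just an illustrative stand-in for whatever hash your system uses):

```javascript
// Simple deterministic string hash (FNV-1a, illustrative only;
// real systems use murmur, xxhash, or similar).
function hash(str) {
  let h = 2166136261;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return h >>> 0; // force unsigned 32-bit
}

// What fraction of keys move when server count goes from fromN to toN?
function fractionMoved(numKeys, fromN, toN) {
  let moved = 0;
  for (let i = 0; i < numKeys; i++) {
    const h = hash(`user_${i}`);
    if (h % fromN !== h % toN) moved++;
  }
  return moved / numKeys;
}
```

With a uniform hash, a key survives the move from 4 to 5 servers only when its hash leaves the same remainder mod 4 and mod 5, which happens about 20% of the time, so roughly 80% of keys move.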
Consistent Hashing: The Virtual Ring
Consistent hashing solves the modulo problem by imagining all possible hash values arranged in a circle (a "ring") instead of a line. Both servers and keys get placed on this ring based on their hash value. Here's how it works, step by step:
Step 1: Place the servers on the ring. Hash each server's name to get its position on the ring. Each server "owns" the colored section of ring behind it (going counterclockwise back to the previous server). Any key that lands in a server's colored zone belongs to that server.
Step 2: Place a key on the ring. Hash the key (e.g., "user_42") to get its position. It lands in the gold zone — so it belongs to Server A.
Step 3 (the magic): Add a new server. Server D joins the ring between C and A. It takes over a slice of what was A's zone (shown in pink). Only the keys that land in that new pink slice need to move — everything else stays exactly where it was.
(Diagram: the ring before, with three servers, and after Server D is added.)
That's the whole idea: Server D took over a slice of the ring that used to belong to A. Only the keys in that slice moved — everything else stayed put. Instead of ~75% of keys remapping (like with modulo), only about 1/N of all keys need to move. The more servers you have, the less disruption each change causes.
Virtual Nodes
With only 3 physical servers on the ring, distribution can be uneven. Solution: each physical server gets multiple "virtual nodes" spread around the ring (e.g., Server A → A-1, A-2, A-3, A-4). This ensures even key distribution even with few physical servers.
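A minimal ring with virtual nodes can be sketched as follows, using the same illustrative FNV-1a hash as before (this shows the idea, not Redis Cluster's actual implementation):

```javascript
// Illustrative FNV-1a string hash (not what real systems use).
function hash(str) {
  let h = 2166136261;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return h >>> 0;
}

class HashRing {
  constructor(vnodes = 100) {
    this.vnodes = vnodes; // virtual nodes per physical server
    this.ring = [];       // sorted array of { point, server }
  }
  addServer(name) {
    for (let i = 0; i < this.vnodes; i++) {
      this.ring.push({ point: hash(`${name}#${i}`), server: name });
    }
    this.ring.sort((a, b) => a.point - b.point);
  }
  removeServer(name) {
    this.ring = this.ring.filter((n) => n.server !== name);
  }
  // A key belongs to the first virtual node at or after its hash.
  lookup(key) {
    const h = hash(key);
    for (const node of this.ring) {
      if (node.point >= h) return node.server;
    }
    return this.ring[0].server; // wrap around the ring
  }
}
```

Comparing lookups before and after adding a fourth server shows only about a quarter of keys moving, versus the ~80% that modulo hashing would remap.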
3.10 Scaling
Vertical: bigger machine (more CPU, RAM). Simple, no code changes, but has a physical ceiling. Horizontal: more machines behind a load balancer. Complex but virtually unlimited.
Real-World Server Tiers: What You're Actually Paying For
Abstract talk about "bigger machines" is hard to reason about. Here's what actual cloud servers look like, what they cost, and roughly what they can handle. These are the tiers you'll pick from when deploying a real app on AWS or Azure.
App / Web Servers (AWS EC2 & Azure equivalents)
| Tier | AWS Instance | Azure VM | vCPUs | RAM | ~Monthly Cost | Handles (typical web app) |
|---|---|---|---|---|---|---|
| Hobby / Dev | t3.micro | B1s | 2 | 1 GB | $8–12 | ~50–200 concurrent users. Fine for a personal project or dev/staging environment. |
| Small App | t3.small | B1ms | 2 | 2 GB | $15–22 | ~200–1,000 concurrent users. An early-stage SaaS or internal tool. |
| Startup | t3.medium | B2s | 2 | 4 GB | $30–42 | ~1,000–3,000 concurrent users. A real product with paying customers. |
| Growth | m5.large | D2s v3 | 2 | 8 GB | $70–93 | ~3,000–10,000 concurrent users per instance. Usually you're running 2+ of these behind a load balancer by now. |
| Production | m5.xlarge | D4s v3 | 4 | 16 GB | $140–185 | ~10,000–30,000 concurrent users per instance. Mid-size SaaS, e-commerce. |
| Heavy | m5.2xlarge | D8s v3 | 8 | 32 GB | $280–370 | ~30,000–80,000 concurrent users per instance. High-traffic platforms. |
| Compute-Heavy | c5.4xlarge | F16s v2 | 16 | 32 GB | $490–620 | CPU-intensive workloads (image processing, real-time analytics). Not more users — faster processing per request. |
Managed Databases (AWS RDS PostgreSQL & Azure Database for PostgreSQL)
| Tier | AWS RDS | Azure DB | vCPUs | RAM | ~Monthly Cost | Handles (simple queries) |
|---|---|---|---|---|---|---|
| Dev / Hobby | db.t3.micro | B1ms | 2 | 1 GB | $15–25 | ~100–500 queries/sec. Dev, prototyping, tiny apps. |
| Small | db.t3.small | GP B2s | 2 | 2 GB | $25–45 | ~500–1,500 queries/sec. Small SaaS with a few hundred active users. |
| Mid | db.m5.large | GP D2s v3 | 2 | 8 GB | $130–175 | ~2,000–5,000 queries/sec. This is where most startups live — enough for thousands of active users. |
| Production | db.r5.xlarge | MO E4s v3 | 4 | 32 GB | $350–460 | ~5,000–15,000 queries/sec. Large working set fits in RAM, fast reads. |
| Heavy | db.r5.2xlarge | MO E8s v3 | 8 | 64 GB | $700–920 | ~15,000–40,000 queries/sec. At this point you should also be looking at read replicas and caching. |
Practical estimation tip: When a system design scenario says "we have 10 million users," the first question should be: how many are concurrent? A product with 10M registered users might have 50,000 concurrent at peak. Looking at the tables above, that's 2–3 m5.xlarge app servers behind a load balancer, plus a db.r5.xlarge database — totally manageable. This kind of back-of-the-envelope reasoning is what separates good system designers from those who jump to over-engineered solutions.
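That back-of-the-envelope math, written out. Every constant below is an assumption to state out loud, not a fact:

```javascript
// Back-of-the-envelope sizing. The point is to write the
// assumptions down explicitly and do the arithmetic.
const registeredUsers = 10_000_000;
const concurrent = registeredUsers / 200;  // assume ~0.5% online at peak -> 50,000
const usersPerAppServer = 20_000;          // mid-range m5.xlarge figure from the table

const appServers = Math.ceil(concurrent / usersPerAppServer); // -> 3
const dbQueriesPerSec = concurrent / 5;    // assume one query per user every 5s -> 10,000
```

10,000 queries/sec lands in the db.r5.xlarge row of the database table, which is consistent with the 2-3 app servers plus one beefy database the text estimates.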
Scaling Phases
Phase 1 — Optimize one machine: vertical scaling, database indexes, and query tuning. No new moving parts — just making a single server go further.
Phase 2 — Distribute reads: Add read replicas, Redis cache for hot data, CDN for static assets, load balancer in front of multiple app servers. Handles 10-100× the original traffic.
Phase 3 — Partition data: Shard the database, introduce Kafka for event-driven processing, split into microservices for independent scaling, multi-region deployment. This is where complexity explodes — only do it when Phase 2 can't keep up.
Key insight: each phase adds complexity. Don't jump to Phase 3 when Phase 1 fixes haven't been tried. Walk through this progression to understand the trade-offs at each stage.
The Full Scaling Ladder
A more granular view of how a real system evolves from a single server to a global platform:
1. Monolith + single database — one server, one DB. Build fast, learn fast.
2. Polyglot persistence — right storage per data type (SQL for relational, document DB for flexible content, S3 for files).
3. Load balancer + horizontal app scaling — multiple app servers, zero-downtime deploys, shared sessions in Redis.
4. Read replicas + Redis cache + CDN — read load off primary, hot data in under 1ms, static assets served at the edge.
5. CQRS — separate read and write APIs with independent scaling, connected via an event bus.
6. Enterprise infrastructure — SSO (SAML/OIDC), tenant isolation, dedicated databases, SOC 2 compliance.
7. Multi-region — GeoDNS routes users to nearest region, regional read replicas, regional caches.
8. Write sharding + regional primaries + HA standby — writes sharded by tenant, synchronous standby per shard for instant failover.
9. CRDTs/OT + circuit breakers + chaos engineering — conflict-free collaborative editing, graceful degradation, regular failure drills.
The best engineers are not the ones who build Stage 9 for a Stage 1 problem. They build precisely what's needed and see one stage ahead.
Replication
Create copies of your database to distribute read traffic:
Writes go to the primary only. Reads are distributed across replicas. If the primary fails, a replica can be promoted.
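A minimal sketch of this routing rule, with placeholder connection objects standing in for a real database driver:

```javascript
// Hypothetical connection objects; a real app would hold driver connections.
const primary = { name: 'primary' };
const replicas = [{ name: 'replica-1' }, { name: 'replica-2' }];
let next = 0;

// Writes go to the primary; reads round-robin across replicas.
function routeQuery(sql) {
  const isRead = /^\s*select/i.test(sql);
  if (!isRead) return primary;
  next = (next + 1) % replicas.length;
  return replicas[next];
}
```

In production this logic usually lives in a connection pooler or proxy (e.g., PgBouncer plus application-level routing) rather than hand-rolled code.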
Sharding
Split data across multiple databases (or storage nodes) so no single machine holds everything. Sharding is used in relational databases (PostgreSQL, MySQL), NoSQL databases (MongoDB shards collections, Cassandra distributes partitions, DynamoDB splits by partition key), and even caches (Redis Cluster shards keys across nodes). The strategies are the same regardless of storage type:
- Range-based: Users A-M → Shard 1, N-Z → Shard 2. Simple, but can create hot spots (if more users have names starting with A-M).
- Hash-based: `hash(user_id) % num_shards` → even distribution. But range queries (e.g., "all users created this week") must hit every shard.
- Geographic: US users → US shard, EU users → EU shard. Low latency for local users, but cross-region queries are expensive.
Hot Spots
A hot spot is when one shard receives disproportionately more traffic than others, defeating the purpose of sharding. Three common causes:
- Sequential IDs: Range-based sharding where all new users (IDs 1M-2M) hit the same shard.
- Celebrity data: One user account (e.g., a celebrity with millions of followers) gets millions of reads on a single shard.
- Temporal patterns: Date-based sharding where today's shard handles all current traffic while historical shards sit idle.
Mitigations: Use hash-based sharding instead of range-based, split hot shards into sub-shards, cache hot keys aggressively in Redis, or add a random salt to the shard key to spread celebrity data across multiple shards.
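A sketch of the salting mitigation: writes append a random salt so one celebrity's data spreads across several sub-shards, and reads must query all sub-keys and merge. `SALT_BUCKETS` is an illustrative parameter, not a standard value:

```javascript
// Spread one hot key across N sub-shards by appending a random salt.
const SALT_BUCKETS = 8;

// Write path: pick a random sub-shard for each write.
function saltedShardKey(userId) {
  const salt = Math.floor(Math.random() * SALT_BUCKETS);
  return `${userId}:${salt}`;
}

// Read path: query every sub-shard key and merge the results.
function allShardKeys(userId) {
  return Array.from({ length: SALT_BUCKETS }, (_, salt) => `${userId}:${salt}`);
}
```

The trade-off is explicit: writes stay cheap and spread evenly, but every read becomes N lookups.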
Fan-out is a related scaling concern: one operation triggers many downstream operations, like one user posting and the system writing to 10,000 follower feeds (the "fan" shape of one input spreading to many outputs). Fan-out problems are what make systems like Twitter's news feed challenging at scale.
Fan-out-on-Write vs Fan-out-on-Read
| Approach | How It Works | Pro | Con |
|---|---|---|---|
| Fan-out-on-write (push) | When a user posts, immediately write to every follower's feed | Reads are instant — feed is pre-computed | Writes are expensive — 1M followers = 1M writes |
| Fan-out-on-read (pull) | When a user opens their feed, query all accounts they follow on the fly | Writes are cheap — just store the post | Reads are slow — must query and merge many accounts |
Most production systems use a hybrid: push for normal users (fast reads), pull for celebrities (avoid million-write storms).
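A toy sketch of that hybrid, with in-memory Maps standing in for the feed store and follower graph (the celebrity threshold is illustrative):

```javascript
const CELEBRITY_THRESHOLD = 10000; // illustrative cutoff for "celebrity"
const feeds = new Map();     // followerId -> [postIds] (precomputed feeds)
const posts = new Map();     // authorId   -> [postIds] (queried at read time)
const followers = new Map(); // authorId   -> [followerIds]

function push(map, key, value) {
  if (!map.has(key)) map.set(key, []);
  map.get(key).push(value);
}

// Write path: fan-out-on-write only for non-celebrity authors.
function publishPost(authorId, postId) {
  push(posts, authorId, postId); // always store the post itself
  const fans = followers.get(authorId) ?? [];
  if (fans.length < CELEBRITY_THRESHOLD) {
    for (const f of fans) push(feeds, f, postId); // push to each follower's feed
  }
}

// Read path: precomputed feed, plus celebrity posts merged on the fly.
function readFeed(userId, following) {
  const celebrityPosts = following
    .filter((a) => (followers.get(a) ?? []).length >= CELEBRITY_THRESHOLD)
    .flatMap((a) => posts.get(a) ?? []);
  return [...(feeds.get(userId) ?? []), ...celebrityPosts];
}
```

A real implementation would also sort the merged feed by timestamp and paginate; this sketch only shows where each post comes from.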
The LinkPulse Story
Chapter 3: Month Three
LinkPulse has 50,000 users. The single PostgreSQL database that was buckling under three app servers is now your biggest problem. You start where everyone starts: vertical scaling. You upgrade from 4 cores and 16 GB of RAM to 64 cores and 256 GB. Queries speed up. CPU drops. You buy yourself about three months of breathing room.
But you know vertical scaling has a ceiling, so you add read replicas. You set up two PostgreSQL replicas that stream changes from the primary. Since roughly 90% of your queries are reads — loading feeds, browsing profiles, viewing links — you route those to replicas. The primary handles only writes. The load drops dramatically.
Uh Oh
Sam updates her profile photo, refreshes the page, and sees her old photo. She clicks save again — now she has duplicate writes. The problem: replication lag. Her write went to the primary, but her read was routed to a replica that hasn't received the update yet. The data is consistent eventually, but "eventually" doesn't feel great when you're staring at your own stale profile picture.
You implement read-your-writes consistency: for 2 seconds after any write, that user's reads are routed to the primary instead of a replica. It's a targeted fix — everyone else still reads from replicas, but the person who just made a change always sees their own update. Lag still exists, but users never notice it on their own data.
Then you hit an ACID moment. A user saves a new link and shares it simultaneously. Sharing creates a record that references the link by its ID. But what if the share record gets written and the link write fails? The share now points to a link that doesn't exist — a broken reference. You wrap both operations in a transaction. Either both succeed or neither does. Atomicity saves you from corrupted data.
Uh Oh
Month four, 200,000 users. Read replicas handled the read load, but now write throughput is maxed out. The primary server is handling every write for every user. You need to shard — split the data across multiple databases. You shard by user_id: users 1–100K go to shard A, 100K–200K to shard B. Writes are distributed. But then product asks for a "trending links across all users" page. That query now spans every shard. What was a single SQL query becomes a cross-shard nightmare that's slow and complex.
You experiment with range-based sharding first. It works — until you launch in Japan. Japanese usernames cluster in a specific alphabetical range, and that shard becomes a hot spot handling 60% of all writes while others sit idle. You switch to hash-based sharding — hashing user_id to evenly distribute data. The hot spot disappears, but now range queries like "find all users who signed up this week" require querying every shard.
Analogy
Sharding is like splitting a library into multiple buildings. Each building is smaller and faster to search. But "find every book published in 2024" now means visiting every building, searching each one, and merging the results. You traded query simplicity for write scalability.
You also add a "follow" feature — users can follow curators and see their links in a personalized feed. This introduces the fan-out problem. When a curator posts a link, do you immediately write it to every follower's feed (fan-out-on-write), or do you wait until a follower opens their feed and assemble it on the fly (fan-out-on-read)? You go hybrid: push for normal users (most have under 100 followers), pull for popular curators (one has 80,000 followers — pushing to 80K feeds on every post would be brutal).
Key Takeaway
Every scaling decision has a cost. Replicas introduce lag. Sharding breaks cross-cutting queries. Fan-out-on-write is fast to read but expensive to write. The art isn't avoiding trade-offs — it's choosing which trade-off you can live with for each feature.
Part 4: Security
Security is critical for enterprise systems. Understanding these fundamentals is essential for building production-grade applications.
4.1 Session-Based Authentication
A session is a server-side record that tracks a logged-in user. A cookie is a small piece of data the browser automatically sends with every request to that domain.
Pros: Easy to revoke (delete the session from Redis). Cons: Stateful — requires a shared session store (Redis) that all servers can access.
Why Sessions Struggle on Mobile
Session-based auth relies on browser cookies, which creates problems for non-browser clients:
- Cookies are browser-specific: Mobile apps (iOS/Android) don't have a browser cookie jar — you'd need to manually manage cookie storage and attachment.
- Cookies are domain-bound: If your API is on `api.example.com` but your web app is on `app.example.com`, cookies don't transfer across domains without extra CORS configuration.
- Cross-platform inconsistency: Cookies behave differently across browsers, WebViews, and native HTTP clients.
This is why APIs that serve both web and mobile clients almost always use token-based auth (JWT) sent in the Authorization header — it works identically on every platform.
4.2 JWT Authentication
A JWT (JSON Web Token) is a self-contained token with three parts separated by dots:
eyJhbGciOiJIUzI1NiJ9.eyJ1c2VySWQiOjQyLCJyb2xlIjoiYWRtaW4iLCJleHAiOjE3MTA3MjAwMDB9.SflKxwRJSMeKKF2QT4fwpM
Part 1 — Header (algorithm): {"alg": "HS256"}
Part 2 — Payload (claims): {"userId": 42, "role": "admin", "exp": 1710720000}
Part 3 — Signature: HMAC-SHA256(header + payload, SECRET_KEY)
The server signs the token with a secret key. On every request, the server verifies the signature — if someone tampers with the payload, the signature won't match. No database lookup needed.
| Safe to Include | NEVER Include |
|---|---|
| User ID, role, email | Passwords, password hashes |
| Expiration time (exp) | Credit card numbers |
| Tenant ID, permissions | SSN, government IDs |
| Token ID (jti) for revocation | API keys, secret keys |
If you need to hide JWT contents, use JWE (JSON Web Encryption) — but for most apps, just don't put secrets in the payload.
If JWTs are stateless, how do you revoke them? You can't — natively. Solutions: maintain a token blocklist in Redis (checking on every request), use short-lived access tokens (15 min) paired with refresh tokens, or rotate the signing key (nuclear option — invalidates ALL tokens).
What happens when a JWT expires? Does the user get logged out? Not if you use refresh tokens. The pattern: issue a short-lived access token (15 min) and a long-lived refresh token (7 days, stored securely). When the access token expires, the client silently sends the refresh token to get a new access token — the user never notices. If the refresh token is also expired, then the user must log in again.
4.2b OAuth 2.0
OAuth lets users log in via a third party (Google, GitHub) without sharing their password with your app.
API Keys are simpler: a static string sent in request headers. Used for server-to-server communication or public API access. Unlike passwords, API keys identify the application, not the user. They don't expire automatically — you must rotate them manually.
4.3 Authorization
RBAC (Role-Based Access Control): Users are assigned roles, roles have permissions.
// RBAC middleware example (pseudocode)
const ROLES = { admin: ['read', 'write', 'delete'], viewer: ['read'] };

function authorize(requiredAction) {
  return (req, res, next) => {
    const userRole = req.user.role;            // from JWT or session
    const permissions = ROLES[userRole] ?? []; // unknown role → no permissions
    if (permissions.includes(requiredAction)) {
      next(); // allowed
    } else {
      res.status(403).json({ error: 'Forbidden' });
    }
  };
}

app.delete('/api/projects/:id', authorize('delete'), deleteProject);
ABAC (Attribute-Based Access Control): Permissions based on multiple attributes — not just role, but also resource ownership, time of day, location, etc. Example: "Editors can modify documents they created, but only during business hours, and only if the document is in 'draft' status." ABAC is more flexible than RBAC but harder to reason about.
Multi-Tenant Data Isolation
In a multi-tenant SaaS (e.g., Slack, Notion), Company A must never see Company B's data. The standard approach: include tenant_id in every table and filter every query with WHERE tenant_id = ?.
The danger: A single query that forgets the tenant_id filter leaks data across tenants — a catastrophic security bug. Defense-in-depth:
- PostgreSQL Row-Level Security (RLS): The database itself enforces tenant isolation. You define a policy like
CREATE POLICY tenant_isolation ON orders USING (tenant_id = current_setting('app.tenant_id')). Even if application code forgets the filter, the database adds theWHEREclause automatically. - Middleware injection: Extract
tenant_idfrom the JWT and automatically inject it into every database query via ORM middleware. - Automated tests: Integration tests that verify cross-tenant queries return zero results.
4.4 Encryption
Encryption transforms readable data (plaintext) into unreadable gibberish (ciphertext) that can only be decoded with the correct key.
- In transit: TLS (Transport Layer Security)/HTTPS encrypts data while it's moving between machines — see Section 1.7 for the handshake details.
- At rest: AES-256 encryption for database storage, S3 Server-Side Encryption for files. Data is encrypted on disk — even if someone steals the hard drive, they can't read it.
- Passwords: Never store plain text. Use bcrypt or argon2 — these are intentionally slow hashing algorithms.
Why hash passwords instead of encrypting them? Encryption is reversible — if an attacker gets the key, they can decrypt every password. Hashing is a one-way function: you can hash a password to check if it matches, but you can't reverse a hash back to the original password. Bcrypt adds a "salt" (random data) so even two users with the same password get different hashes.
If I encrypt data at rest, where do I store the encryption key? Never alongside the encrypted data. Use a dedicated key management service: AWS KMS, HashiCorp Vault, or Azure Key Vault. These handle key rotation, access control, and audit logging. The principle: your database stores the encrypted data, a completely separate system holds the keys.
Why SHA-256 Is Bad for Passwords
SHA-256 is a general-purpose hash — it's designed to be fast. That's great for file integrity checks, but terrible for passwords:
SHA-256 (general-purpose, fast)
Modern GPU: ~10 billion hashes/second
8-character password: cracked in hours
Bcrypt (password-specific, intentionally slow)
Modern GPU: ~30,000 hashes/second
8-character password: would take centuries
The difference: ~300,000× slower per hash
Bcrypt's "cost factor" is adjustable — as hardware gets faster, you increase the cost to keep brute-forcing impractical. This is why the algorithm choice matters more than password complexity rules.
4.5 SQL Injection
An attacker inserts malicious SQL into input fields:
// VULNERABLE — string concatenation
const query = "SELECT * FROM users WHERE email = '" + userInput + "'";
// Attacker enters: ' OR 1=1 --
// Resulting query: SELECT * FROM users WHERE email = '' OR 1=1 --'
// This returns ALL users!
// SAFE — parameterized query
const query = "SELECT * FROM users WHERE email = $1";
db.query(query, [userInput]); // Database treats userInput as data, never as SQL
Does NoSQL injection exist too? Yes. MongoDB queries accept objects — if user input flows directly into a query object, an attacker can inject operators like $gt, $ne, or $regex to bypass filters. Prevention is the same principle: never build queries from raw user input. Use your driver's built-in sanitization and validate input types.
4.6 XSS and CSRF
XSS (Cross-Site Scripting): An attacker injects malicious JavaScript into your page (e.g., via a comment field). When other users view the page, the script runs in their browser and can steal their cookies/tokens. Prevention: sanitize all user-generated content, use Content Security Policy headers, encode output.
CSRF (Cross-Site Request Forgery): An attacker tricks a user's browser into making an authenticated request to your site. Example: a hidden form on a malicious page that transfers money from your bank. Prevention: include a unique, unpredictable CSRF token in every form — the attacker's malicious page can't include the correct token.
4.7 DDoS, Rate Limiting & Zero Trust
DDoS (Distributed Denial-of-Service): An attack where thousands of machines flood your system with traffic to overwhelm it. Mitigations: rate limiting, CDN/WAF (absorb traffic at the edge), auto-scaling, IP blacklisting.
Rate limiting caps how many requests a client can make per time period. The token bucket algorithm works like this:
Bucket capacity: 10 tokens | Refill rate: 2 tokens/second
- Request arrives → take 1 token from bucket
- Bucket empty? → reject request (`429 Too Many Requests`)
- Bucket refills at steady rate → allows short bursts up to capacity
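The algorithm above fits in a few lines. Timestamps are passed in explicitly here to make the refill arithmetic visible; a production limiter would use the clock (or, distributed, an atomic Redis operation):

```javascript
// Token bucket with the numbers from the example: capacity 10, refill 2/sec.
class TokenBucket {
  constructor(capacity = 10, refillPerSec = 2, now = Date.now()) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.tokens = capacity; // bucket starts full
    this.lastRefill = now;
  }

  allow(now = Date.now()) {
    // Refill in proportion to elapsed time, never above capacity.
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true; // request allowed
    }
    return false;  // reject with 429 Too Many Requests
  }
}
```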
Zero Trust: A security model that assumes breach. Don't trust anything just because it's "inside the network." Principles: authenticate every request, enforce least privilege access, encrypt everything (even internal traffic), log everything, and continuously verify. Critical for enterprise systems with complex networks and many employees.
4.8 IDOR (Insecure Direct Object Reference)
A user changes /documents/1042 to /documents/1043 in the URL. Does your API check that document 1043 belongs to their tenant and that they have read permission? If not, you have an IDOR vulnerability — one of the OWASP top 10, and especially dangerous in multi-tenant systems where it can leak data across tenants.
The fix is a centralized authorization check that runs before every data access:
async function assertCanAccess(user, resourceType, resourceId, action) {
const resource = await db.findResource(resourceType, resourceId);
// Check 1: Tenant isolation — never cross tenant boundaries
if (resource.tenantId !== user.tenantId)
throw new ForbiddenError('Cross-tenant access denied');
// Check 2: Role-based permission (RBAC)
if (!await rbac.check(user.role, resourceType, action))
throw new ForbiddenError('Insufficient role permissions');
// Check 3: Resource-level ownership
if (action === 'delete' && resource.ownerId !== user.id && user.role !== 'admin')
throw new ForbiddenError('Can only delete your own resources');
return resource;
}
// Every route handler uses it
app.delete('/documents/:id', auth, async (req, res) => {
const doc = await assertCanAccess(req.user, 'document', req.params.id, 'delete');
await deleteDocument(doc.id);
});
The pattern checks three layers: tenant isolation → RBAC → resource ownership. Centralizing it means every endpoint gets the same protection — no developer can accidentally skip an access check.
The LinkPulse Story
Chapter 4: Month Four
100,000 users. You wake up to an urgent email: "Someone is posting links from my account." You dig into the logs. Your session system stores a user ID in a cookie — an unsigned cookie. The attacker didn't hack anything. They just edited their cookie from user_id=12847 to user_id=12846 and became someone else. You were trusting the client to tell the truth about who they were.
You rip out the unsigned cookie system and implement JWT with proper cryptographic signing. Now the server signs each token with a secret key. If anyone tampers with the payload, the signature check fails and the request is rejected. You also like that JWTs work for the mobile app you're building — mobile doesn't have browser cookies, but it can send a JWT in the Authorization header just fine.
Uh Oh
A security researcher emails you: they typed ' OR 1=1 -- into the search bar and got back every link in the database — including private ones. Your search endpoint was building SQL queries with string concatenation: "SELECT * FROM links WHERE title LIKE '%" + query + "%'". Classic SQL injection. The attacker didn't need a password. They needed a single quote and basic SQL knowledge.
You audit every database query in the codebase. String concatenation becomes parameterized queries everywhere: WHERE title LIKE $1 with the search term passed as a parameter. The database treats it as data, never as executable SQL. You also add input validation on the API layer as a second line of defense.
A month later, you launch a Teams feature — companies can create shared workspaces. An intern at Company A navigates to a link that belongs to Company B, and the page loads. No error, no access denied — your API checked authentication (are they logged in?) but not authorization (do they have permission to see this?). You add RBAC: every team member has a role — admin, editor, or viewer — and every API endpoint checks the role before returning data.
Uh Oh
A developer writes a quick analytics query for the admin dashboard and forgets a WHERE tenant_id = ? clause. The dashboard now shows aggregated data from all companies mixed together. One missing filter, and tenant isolation is broken. This isn't a hack — it's a bug that any developer can introduce on any query.
You implement Row-Level Security in PostgreSQL. Policies on each table automatically filter rows by tenant_id. Your middleware sets the tenant context on every database connection. Now even if a developer forgets the filter, the database itself refuses to return rows from other tenants. Defense in depth — the database enforces what the application might forget.
The final test comes on a Monday morning: a DDoS attack. 100,000 requests per second, all hitting the login endpoint. Your servers buckle. You implement rate limiting with a token bucket algorithm — each IP gets 20 requests per minute to the login endpoint. Excess requests get a 429 Too Many Requests response. You also put Cloudflare's WAF in front of everything, blocking obviously malicious patterns at the edge before they even reach your servers.
Analogy
Security isn't a feature you bolt on at the end. It's the locks on every door, the ID check at every entrance, the camera in every hallway. LinkPulse didn't get hacked by geniuses — you left the front door open with unsigned cookies, built a SQL injection target with string concatenation, and forgot to check permissions. Every vulnerability was a missing basic.
Key Takeaway
Every vulnerability LinkPulse hit was a consequence of not thinking about security from the start. Unsigned cookies, raw SQL, missing authorization, no tenant isolation — none of these were sophisticated attacks. They were missing fundamentals. Mentioning security proactively shows the difference between thinking like a junior developer and thinking like a senior engineer.
Part 5
System Design Walkthroughs
Nine common system design challenges, each with a detailed walkthrough of the key decisions and architecture.
URL Shortening Service
Design a URL shortening service like bit.ly. Generate short unique URLs and redirect at massive scale. Expect ~100:1 read-to-write ratio.
Key Decisions
Base62 encoding: Convert an auto-incrementing ID to a short string using 62 characters (a-z, A-Z, 0-9). ID 12345 → dnh. This guarantees uniqueness (IDs are unique) and produces short codes — a 7-character base62 string can represent ~3.5 trillion unique values, more than enough for a URL shortener.
Why not hash the URL? You'd need collision handling. Auto-increment ID → base62 is simpler and guarantees uniqueness.
Two users shorten the same URL? Design choice: (1) separate short codes per user (simpler, unique analytics) or (2) deduplicate (saves storage, shared analytics). Most services choose option 1.
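The ID-to-code conversion is plain base conversion. Using the a-z, A-Z, 0-9 alphabet, ID 12345 produces "dnh" as in the example above:

```javascript
const ALPHABET = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';

// Numeric ID -> short base62 code (repeated divide-by-62).
function encodeBase62(id) {
  if (id === 0) return ALPHABET[0];
  let code = '';
  while (id > 0) {
    code = ALPHABET[id % 62] + code;
    id = Math.floor(id / 62);
  }
  return code;
}

// Short code -> numeric ID, for the redirect lookup.
function decodeBase62(code) {
  let id = 0;
  for (const ch of code) {
    id = id * 62 + ALPHABET.indexOf(ch);
  }
  return id;
}
```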
Redirect Flow
Critical tech decision → Redis: Caching IS the design. A 100:1 read ratio means almost every request hits the cache. Without Redis, you'd need 100× the database capacity.
Real-Time Chat System
Design a real-time chat system like Slack or WhatsApp. Support 1:1 messaging, group chats, online presence indicators, and message history. Target 500K DAU.
Message Delivery
WebSocket servers handle real-time delivery. A WebSocket is a persistent, bidirectional connection between client and server — unlike HTTP's request-response model, either side can send messages at any time (see Section 2.2). When User A sends "Hello" to User B:
- User A's client sends the message via their WebSocket connection
- Server persists the message to PostgreSQL/Cassandra
- Server publishes event to Kafka
- User B's WebSocket server receives the event and pushes it to User B's connection
Presence
Tracked via Redis. Each connected client sends a heartbeat — a periodic "I'm still here" signal — every 30 seconds. Redis stores `user:42:last_seen` → timestamp with a TTL. If the TTL expires (the server stops receiving heartbeats), the user is shown offline.
Group sizing strategy: Small groups (<100): push messages via WebSocket to all members. Large groups (>100): pull model — client fetches messages on demand.
Critical tech decision → WebSockets: Real-time IS the product. Without persistent connections, you'd fall back to polling, which adds latency and wastes bandwidth.
Social Media News Feed
Design a social media news feed like Twitter or Instagram. When a user opens their feed, efficiently assemble posts from everyone they follow. Handle the "celebrity problem" — users with millions of followers.
Fan-out Strategies
Fan-out-on-write (push model): When User A posts, immediately write that post to every follower's precomputed feed table. Reads are instant, but writes are expensive — 1M followers = 1M writes.
Fan-out-on-read (pull model): When User B opens their feed, query all accounts they follow and assemble on the fly. Writes are cheap, but reads are slow.
Hybrid approach (what Twitter/Instagram actually use): Normal users (few followers) use fan-out-on-write — pre-compute feeds for fast reads. Celebrities (millions of followers) use fan-out-on-read — their posts are fetched on demand and merged. This limits write amplification while keeping reads fast for most users.
Critical tech decision → Kafka: Fan-out IS the hard problem. You need durable event processing to handle writing to millions of follower feeds without losing messages.
File Storage & Sync Service
Design a file storage and sync service like Google Drive or Dropbox. Users upload/download files, sync across devices, and share with others. Handle files from KB to multi-GB.
Architecture
Metadata in PostgreSQL, files in S3. The key insight: don't store files as single blobs.
Chunking: Split each file into ~4MB blocks. Why?
- Efficient sync: If one byte in a 1GB file changes, you only re-upload the affected 4MB chunk, not the entire file.
- Resumability: If upload fails, resume from the last incomplete chunk.
- Parallelism: Multiple chunks can upload simultaneously.
Deduplication via hashing: Each chunk is hashed (SHA-256). Before storing a chunk, check if that hash already exists. If it does, just reference the existing chunk. 100 users upload the same PDF → stored once with 100 references.
Critical tech decision → S3: Infinite, cheap storage IS the requirement. You can't store petabytes of user files on application servers or in a relational database.
Distributed Rate Limiter
Design a distributed rate limiter. Protect APIs from abuse by limiting requests per user, per IP, or per endpoint. Must work across multiple server instances.
Algorithms
| Algorithm | How It Works | Best For |
|---|---|---|
| Token Bucket | Tokens added at steady rate, each request consumes one. Bucket has max capacity. | Allowing short bursts above the steady rate |
| Sliding Window Log | Store timestamp of each request. Count requests in the last N seconds. | Precise rate counting, but memory-heavy |
| Sliding Window Counter | Hybrid: split time into fixed windows, weight current + previous window by overlap. | Balance of precision and memory efficiency |
| Fixed Window | Count requests per fixed time window (e.g., per minute). Reset at boundary. | Simple, but allows burst at window edges |
Distributed Rate Limiting
A single-server counter doesn't work when you have N API servers. Solutions:
- Centralized Redis counter: All servers increment the same Redis key (`INCR rate:{user_id}:{minute}` with a TTL). Simple, but adds a Redis round-trip to every request (~1ms).
- Per-server local counters + sync: Each server tracks locally and periodically syncs to Redis. Less accurate but faster — good when approximate limiting is acceptable.
Multi-level limiting: Apply different limits at different scopes — per-IP (prevent DDoS), per-user (prevent abuse), per-endpoint (protect expensive operations like /search). Return 429 Too Many Requests with a Retry-After header.
Critical tech decision → Redis: Atomic INCR with TTL provides the distributed counter you need. Fast enough to check on every request without meaningful latency impact.
Notification System
Design a notification system supporting push, email, SMS, and in-app channels. Users can configure preferences. Must handle millions of notifications/day with delivery guarantees.
Architecture
Event-driven design with Kafka at the center. When an event occurs (new message, order shipped, friend request):
- Event producer publishes to a Kafka topic (e.g., `notification.events`)
- Notification service consumes the event, checks user preferences in Redis/PostgreSQL
- Router fans out to the appropriate channel workers:
- Push: Firebase Cloud Messaging (FCM) for Android, APNs for iOS
- Email: Amazon SES or SendGrid
- SMS: Twilio
- In-app: WebSocket push or write to notification table for polling
Key Design Decisions
Delivery guarantees: Use Kafka consumer groups with at-least-once delivery. Each channel worker acknowledges after successful send. Failed sends go to a dead-letter queue for retry.
User preferences: Store in PostgreSQL, cache in Redis. Schema: user_id, channel, event_type, enabled, quiet_hours_start, quiet_hours_end. Check before every send.
Rate limiting per user: Don't spam — aggregate similar notifications ("You have 5 new messages" instead of 5 separate pushes). Use a short delay window (30s) to batch.
Critical tech decision → Kafka: Decouples event producers from notification delivery. Handles backpressure, retries, and multiple consumers (one per channel) naturally.
Search Autocomplete
Design a search autocomplete (typeahead) system. As the user types, suggest the top completions in under 100ms. Handle billions of search queries for ranking data.
Data Structure: Trie
A trie (prefix tree) stores strings character by character. Each node represents a prefix; children represent the next character. To find completions for "sys", traverse to the "s" → "y" → "s" node and return all descendants.
Precomputed Top-K
At each trie node, store the top K results (e.g., top 10 completions ranked by search frequency). This avoids traversing the entire subtree at query time — just return the precomputed list. Trade-off: more storage, but queries are O(prefix length) instead of O(subtree size).
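A sketch of a trie with the precomputed top-K list maintained on every prefix node at insert time (K=3 here for illustration):

```javascript
const K = 3; // completions kept per node; production systems might use ~10

function makeNode() {
  return { children: new Map(), topK: [] }; // topK: [{ term, freq }] sorted desc
}

function insert(root, term, freq) {
  let node = root;
  for (const ch of term) {
    if (!node.children.has(ch)) node.children.set(ch, makeNode());
    node = node.children.get(ch);
    // Maintain the precomputed list on every prefix node along the path.
    node.topK = [...node.topK.filter((e) => e.term !== term), { term, freq }]
      .sort((a, b) => b.freq - a.freq)
      .slice(0, K);
  }
}

function complete(root, prefix) {
  let node = root;
  for (const ch of prefix) {
    node = node.children.get(ch);
    if (!node) return [];              // no completions for this prefix
  }
  return node.topK.map((e) => e.term); // O(prefix length), no subtree walk
}
```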
Offline Pipeline
Ranking data comes from an offline analytics pipeline:
- Collect search query logs → aggregate frequency counts (MapReduce/Spark job, runs hourly or daily)
- Build/update the trie with new frequency data
- Serialize the trie and push to the serving layer
Serving
- CDN caching: Cache the top completions for common prefixes (1-2 characters) at the CDN edge. "s", "sy", "sys" are queried millions of times — serve from CDN, not your servers.
- In-memory trie servers: For longer/rarer prefixes, query a cluster of servers holding the trie in RAM.
- Client-side optimization: Debounce keystrokes (wait 100-200ms after the user stops typing before querying). Cache recent responses locally.
Critical tech decision → Precomputed trie with CDN caching: Sub-100ms latency requires avoiding any database lookup on the hot path. The trie is the data structure; CDN + in-memory serving is the delivery mechanism.
Web Crawler
Design a web crawler that indexes billions of pages. Handle politeness (don't overload sites), deduplication (don't re-crawl the same page), and distributed crawling across many machines.
Core Components
URL Frontier: A priority queue of URLs to crawl. Organized by domain with per-domain rate limiting. High-priority pages (news sites, frequently updated) are crawled more often.
Crawl Loop
- Pick URL from the frontier (respecting per-domain delays)
- Check robots.txt — honor the site's crawling rules. Cache robots.txt per domain.
- Fetch the page via HTTP. Handle redirects, timeouts, and errors gracefully.
- Parse HTML — extract text content for indexing + extract outgoing links for the frontier.
- Deduplicate: Before adding to the index, check if we've seen this content before.
- Store the parsed content in the search index. Add new URLs back to the frontier.
Deduplication
- URL dedup: Use a Bloom filter — a probabilistic set that answers "definitely not seen" or "probably seen." Memory-efficient for billions of URLs. The occasional false positive (skipping a genuinely new URL) is acceptable; Bloom filters produce no false negatives, so an already-crawled URL is never mistaken for new and re-crawled.
- Content dedup: Use SimHash or MinHash to detect near-duplicate pages (e.g., same article with different ad placements). Compute a content fingerprint and compare against existing fingerprints.
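A toy Bloom filter shows the "definitely not / probably yes" behavior. This sketch uses a simple FNV-1a-style hash for illustration; production crawlers use billions of bits and stronger hash functions:

```javascript
// Minimal Bloom filter: k seeded hashes set/check k bits in a bit array.
class BloomFilter {
  constructor(bits = 1024, hashes = 3) {
    this.bits = new Uint8Array(bits);
    this.k = hashes;
  }
  // Seeded FNV-1a-style string hash — illustrative, not production-grade.
  hash(str, seed) {
    let h = 2166136261 ^ seed;
    for (let i = 0; i < str.length; i++) {
      h ^= str.charCodeAt(i);
      h = Math.imul(h, 16777619);
    }
    return (h >>> 0) % this.bits.length;
  }
  add(url) {
    for (let s = 0; s < this.k; s++) this.bits[this.hash(url, s)] = 1;
  }
  mightContain(url) {
    // false → definitely never added; true → probably added (rare false positives)
    for (let s = 0; s < this.k; s++) if (!this.bits[this.hash(url, s)]) return false;
    return true;
  }
}

const seen = new BloomFilter();
seen.add("https://example.com/a");
console.log(seen.mightContain("https://example.com/a")); // true — no false negatives
```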
Distributed Crawling
Use consistent hashing by domain to assign each domain to a specific crawler node. This ensures per-domain rate limiting is enforced by a single machine (no coordination needed) and improves cache locality for robots.txt and DNS lookups.
Critical tech decision → URL frontier design: The frontier determines crawl order, politeness, and freshness. A poorly designed frontier either overloads target sites (getting you blocked) or wastes time on low-value pages.
Video Streaming Platform
Design a video streaming platform like YouTube or Netflix. Handle video upload, transcoding to multiple resolutions, and adaptive bitrate streaming to millions of concurrent viewers.
Upload → Transcode → Serve Pipeline
- Upload: Client uploads the original video to S3 (chunked upload for large files, resumable). Metadata (title, description, uploader) stored in PostgreSQL.
- Transcode: A message is published to a queue (SQS/Kafka). Worker fleet transcodes into multiple resolutions (1080p, 720p, 480p, 360p) and formats. Each resolution is split into small segments (2-10 seconds each).
- Store: Transcoded segments stored in S3, organized by video ID and resolution.
- Serve: CDN caches and delivers video segments to viewers globally.
Adaptive Bitrate Streaming (HLS/DASH)
The client requests a manifest file that lists all available resolutions and their segment URLs. The video player monitors network bandwidth in real-time and switches between resolutions seamlessly:
- Fast WiFi → stream 1080p segments
- Bandwidth drops → seamlessly switch to 480p segments
- Bandwidth recovers → switch back to 1080p
This happens segment-by-segment (every 2-10 seconds), so the user rarely sees a buffering stall — they just see slightly lower quality during slow periods.
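The player's selection logic reduces to picking the best rendition that fits the measured bandwidth, re-evaluated before each segment. A sketch (the rendition table and safety margin are illustrative):

```javascript
// Adaptive bitrate sketch: pick the highest rendition whose bitrate fits
// within a safety margin of the measured bandwidth.
const renditions = [
  { name: "1080p", kbps: 5000 },
  { name: "720p",  kbps: 2800 },
  { name: "480p",  kbps: 1400 },
  { name: "360p",  kbps: 800 },
];

function pickRendition(measuredKbps, safety = 0.8) {
  const budget = measuredKbps * safety; // leave headroom for variance
  // renditions are sorted high→low, so the first fit is the best fit.
  return renditions.find((r) => r.kbps <= budget) ?? renditions[renditions.length - 1];
}

console.log(pickRendition(10000).name); // "1080p" — fast WiFi
console.log(pickRendition(2000).name);  // "480p"  — bandwidth dropped
```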
Scale Considerations
- Storage: A 10-minute video in 4 resolutions ≈ 2-4GB. At 500K uploads/day, that's ~1-2 PB/day — hundreds of petabytes per year. S3's durability and cost structure handle this.
- CDN is critical: Popular videos ("head" content) should be cached at every CDN edge. Long-tail content is fetched from origin on demand.
- Transcoding fleet: Use spot/preemptible instances — transcoding is embarrassingly parallel and fault-tolerant (just restart the segment).
Critical tech decision → CDN + segment-based delivery: Streaming IS delivery. Without a global CDN caching video segments at the edge, you'd need impossible origin bandwidth. Segment-based architecture enables adaptive bitrate and CDN caching simultaneously.
Critical Tech Decision Summary
Each system design challenge has a single most important technology choice. The critical decision is the one the core challenge depends on:
- URL Shortener → Redis: Caching IS the design (100:1 read ratio).
- Chat System → WebSockets: Real-time IS the product.
- News Feed → Kafka: Fan-out IS the hard problem.
- File Storage → S3: Infinite, cheap storage IS the requirement.
- Rate Limiter → Redis: Distributed atomic counters with TTL.
- Notifications → Kafka: Decouples events from multi-channel delivery.
- Autocomplete → Precomputed trie + CDN: Sub-100ms requires no DB on the hot path.
- Web Crawler → URL frontier: Controls politeness, priority, and freshness.
- Video Streaming → CDN + segments: Delivery IS the product.
Part 6
Technology Cheat Sheet
Technologies ranked by popularity within each category. Each tool includes pros, cons, and when to use it. The #1 in each category is the standard default — know when you'd pick an alternative.
Relational Databases
PostgreSQL
Open-source RDBMS with JSONB support, full-text search, row-level security, and a rich extension ecosystem. The default choice for most applications.
Pros
- JSONB for flexible schemas
- Advanced indexing (GIN, GiST, BRIN)
- Strong community & ecosystem
Cons
- Write-heavy sharding is complex
- Vertical scaling ceiling
Use When
- Relational data with ACID needs
- Complex queries & JOINs
- Default choice for most apps
MySQL
Mature RDBMS with strong replication. Simpler than PostgreSQL, widely used in PHP/WordPress ecosystems and legacy systems.
Pros
- Simple replication setup
- Massive ecosystem & hosting support
Cons
- Fewer advanced features than PG
- Weaker JSON/full-text support
Use When
- Legacy PHP/WordPress apps
- Simple CRUD applications
CockroachDB
Distributed SQL database that scales horizontally while maintaining strong consistency and PostgreSQL wire compatibility.
Pros
- Horizontal scaling with SQL
- Multi-region by default
- PostgreSQL compatible
Cons
- Higher latency than single-node PG
- Expensive at scale
Use When
- Need SQL + global distribution
- Outgrowing single-node PostgreSQL
Document Databases
MongoDB
Flexible JSON-like documents with dynamic schemas. Strong developer experience and aggregation pipeline.
Pros
- Schema flexibility
- Rich query language
- Good for rapid prototyping
Cons
- Multi-document ACID transactions only arrived in v4.0
- JOINs are expensive
Use When
- Semi-structured/varying schemas
- Content management, catalogs
DynamoDB
Fully managed, serverless key-value/document store by AWS. Predictable single-digit ms latency at any scale.
Pros
- Zero ops, auto-scaling
- Predictable performance
- Built-in global tables
Cons
- Limited query patterns (partition key required)
- Expensive at high volume
- AWS lock-in
Use When
- Simple key-value access patterns
- Serverless architectures on AWS
Cache
Redis
In-memory data structure store supporting strings, lists, sets, sorted sets, hashes, streams, and pub/sub. The industry standard cache.
Pros
- Rich data structures
- Persistence options (RDB/AOF)
- Pub/sub, Lua scripting, streams
Cons
- Single-threaded (for commands)
- RAM-limited (expensive at scale)
Use When
- Caching, sessions, rate limiting
- Leaderboards, real-time counters
- Default cache choice
Memcached
Simple, multi-threaded in-memory key-value cache. No data structures, no persistence.
Pros
- Multi-threaded (better on multi-core)
- Simpler, less overhead
Cons
- No persistence
- No data structures
- No pub/sub
Use When
- Pure key-value caching only
- Multi-threaded workloads
Message Queue / Streaming
Kafka
Distributed event streaming platform. Durable, replayable log with multi-consumer support. High throughput for event-driven architectures.
Pros
- Event replay & audit trail
- Multiple consumers per topic
- Extremely high throughput
Cons
- Operational complexity
- Overkill for simple queues
- Higher latency than SQS/RabbitMQ
Use When
- Event-driven architectures
- Multiple services consuming same events
- Data pipelines, analytics
Amazon SQS
Fully managed message queue by AWS. Zero ops, auto-scaling, pay-per-message. FIFO mode supports exactly-once processing.
Pros
- Zero ops, fully managed
- FIFO + deduplication
- Dead-letter queues built-in
Cons
- No event replay
- One consumer per message
- AWS lock-in
Use When
- Simple job/task queues
- Process-each-once workloads
RabbitMQ
Open-source message broker with advanced routing, priority queues, and message acknowledgment patterns.
Pros
- Complex routing patterns
- Priority queues
- Low latency delivery
Cons
- Doesn't scale like Kafka
- No built-in event replay
Use When
- Complex routing logic needed
- Priority-based processing
Search
Elasticsearch
Distributed full-text search engine built on Lucene. Inverted indexes, fuzzy matching, faceted search, relevance tuning.
Pros
- Powerful full-text search
- Horizontal scaling
- Rich analytics (ELK stack)
Cons
- Resource-heavy (RAM-hungry)
- Complex to operate at scale
Use When
- Search IS the feature
- 10M+ documents
- Fuzzy/faceted search needs
PostgreSQL Full-Text Search
Built-in tsvector/tsquery with GIN indexes. No extra infrastructure needed.
Pros
- No extra infrastructure
- SQL integration
- Good enough for <1M docs
Cons
- Limited fuzzy/relevance features
- Doesn't scale horizontally
Use When
- Search is a minor feature
- Small dataset, simple queries
Algolia
Hosted search-as-a-service with instant results. Sub-50ms latency, typo tolerance, and great developer UX out of the box.
Pros
- Zero ops, instant setup
- Built-in typo tolerance
- Great frontend SDKs
Cons
- Expensive at scale
- Less customizable than ES
Use When
- Need great search fast
- Small-to-medium dataset
Meilisearch / Typesense
Open-source, lightweight search engines. Easy to self-host with good defaults. Good Algolia alternatives.
Pros
- Easy to deploy & operate
- Typo tolerance built-in
- Open-source (free)
Cons
- Less mature than ES
- Smaller community
Use When
- Want Algolia-like UX, self-hosted
- Moderate scale
Object Storage
Amazon S3
Unlimited blob storage with 11 nines (99.999999999%) durability. The industry standard for file/object storage.
Pros
- Near-infinite scale
- Lifecycle policies, versioning
- Deep AWS integration
Cons
- Egress costs add up
- AWS lock-in
Use When
- Images, videos, backups, static assets
- Default choice for file storage
Azure Blob Storage
Microsoft's equivalent to S3. Tight integration with the Azure ecosystem and Active Directory.
Pros
- Azure ecosystem integration
- Tiered storage (hot/cool/archive)
Cons
- Azure lock-in
Use When
- Already on Azure
Google Cloud Storage
Google's object storage with strong integration into BigQuery and ML services.
Pros
- BigQuery integration
- Multi-region by default
Cons
- GCP lock-in
Use When
- Already on GCP
- ML/analytics pipelines
Load Balancer
AWS Application Load Balancer (ALB)
Managed Layer 7 load balancer with path-based routing, WebSocket support, and auto-scaling.
Pros
- Zero ops, auto-scaling
- Path/host-based routing
- Native AWS integration
Cons
- AWS-only
- Less configurable than Nginx
Use When
- AWS infrastructure
- Default choice on AWS
Nginx
High-performance web server and reverse proxy. Can also serve static files, terminate SSL, and cache responses.
Pros
- Extremely configurable
- Cloud-agnostic
- Also serves static content
Cons
- Self-managed
- Config can be complex
Use When
- Self-hosted / multi-cloud
- Need fine-grained control
HAProxy
High-performance TCP/HTTP load balancer. Industry standard for extreme throughput requirements.
Pros
- Very high performance
- Advanced health checks
Cons
- Steeper learning curve
- No static file serving
Use When
- Maximum throughput needed
- Complex health check requirements
CDN
Cloudflare
Global CDN with built-in WAF, DDoS protection, DNS, and Workers (edge compute). Generous free tier.
Pros
- Free tier, easy setup
- Built-in security (WAF, DDoS)
- Edge compute (Workers)
Cons
- Less AWS-native
- Enterprise features are pricey
Use When
- Default CDN choice
- Need WAF + CDN in one
Amazon CloudFront
AWS's CDN. Deep integration with S3, ALB, and Lambda@Edge for running custom logic at edge locations.
Pros
- Native S3/ALB integration
- Lambda@Edge for custom logic
Cons
- AWS lock-in
- Config is more complex
Use When
- All-AWS stack
- Need Lambda@Edge
Akamai
Enterprise CDN with the largest global network. Used by Fortune 500 companies for mission-critical delivery.
Pros
- Largest edge network
- Enterprise SLAs
Cons
- Expensive
- Complex contracts
Use When
- Enterprise / Fortune 500
- Need guaranteed global SLAs
Monitoring
Datadog
Full-stack observability: logs, metrics, traces, APM, and dashboards in one platform.
Pros
- All-in-one platform
- Great integrations
- Easy setup
Cons
- Expensive at scale
- Vendor lock-in on queries
Use When
- Want one tool for everything
- Team velocity matters more than cost
Prometheus + Grafana
Open-source metrics collection (Prometheus) + visualization (Grafana). The Kubernetes-native monitoring stack.
Pros
- Free, open-source
- Kubernetes-native
- PromQL is powerful
Cons
- Self-managed infrastructure
- No built-in log aggregation
Use When
- Cost-sensitive
- Already on Kubernetes
New Relic
Full-stack observability with a generous free tier. Strong APM and distributed tracing capabilities.
Pros
- Generous free tier (100GB/mo)
- Strong APM
Cons
- Per-user pricing adds up
- Query language learning curve
Use When
- Small team, budget-conscious
- Need APM focus
Containers / Orchestration
Kubernetes
Container orchestration for deploying, scaling, and managing containerized applications. The industry standard for microservices at scale.
Pros
- Cloud-agnostic
- Auto-scaling, self-healing
- Massive ecosystem
Cons
- Steep learning curve
- Operational overhead
- Overkill for simple apps
Use When
- Microservices at scale
- Multi-cloud / hybrid
Amazon ECS / Fargate
AWS's container orchestration. ECS manages containers; Fargate removes the need to manage servers entirely.
Pros
- Simpler than K8s on AWS
- Fargate = serverless containers
- Native AWS integration
Cons
- AWS lock-in
- Less flexible than K8s
Use When
- All-AWS, smaller team
- Don't want K8s complexity
Docker Compose
Multi-container orchestration for development and simple deployments. Define services in a YAML file.
Pros
- Dead simple
- Great for dev/staging
Cons
- Single-host only
- No auto-scaling or self-healing
Use When
- Local development
- Small single-server deployments
Auth
Auth0 / Okta
Managed identity platform with SSO, MFA, OAuth/OIDC, social logins, and user management. Auth0 was acquired by Okta.
Pros
- Full-featured (SSO, MFA, social)
- Great SDKs
- Enterprise-grade security
Cons
- Expensive at scale
- Vendor lock-in
Use When
- Enterprise SSO/MFA needed
- Don't want to build auth
Firebase Authentication
Google's auth service. Easy social login, email/password, and phone auth. Great for mobile apps.
Pros
- Free for most use cases
- Great mobile SDKs
- Easy setup
Cons
- Limited enterprise features
- Google lock-in
Use When
- Mobile apps, MVPs
- Simple auth needs
Amazon Cognito
AWS's managed auth service. User pools for authentication, identity pools for AWS resource access.
Pros
- AWS-native integration
- Generous free tier (50K MAU)
Cons
- Poor developer experience
- Limited customization
Use When
- All-AWS serverless stack
- Need AWS resource access control
Real-time
WebSockets + Redis Pub/Sub
Persistent bidirectional connections (WebSockets) with cross-server broadcasting (Redis Pub/Sub). The standard real-time architecture.
Pros
- True bidirectional
- Low latency
- Cross-server broadcast via Redis
Cons
- Stateful connections (harder to scale)
- Need sticky sessions or Pub/Sub
Use When
- Chat, gaming, collaboration
- Either side initiates messages
Server-Sent Events (SSE)
The server pushes data to the client over a standard HTTP connection. Simpler than WebSockets for one-way updates.
Pros
- Simple (just HTTP)
- Auto-reconnect built in
- Works through proxies/firewalls
Cons
- Server-to-client only
- No binary data
Use When
- Live feeds, notifications
- Server-to-client push only
Graph Databases
Neo4j
Native graph database. Nodes and edges (relationships) as first-class citizens. Cypher query language.
Pros
- Fast graph traversals
- Intuitive Cypher query language
- Great visualization tools
Cons
- Doesn't scale horizontally easily
- Niche use cases
Use When
- Social networks, recommendations
- Fraud detection, knowledge graphs
Amazon Neptune
Managed graph database by AWS. Supports both property graph (Gremlin) and RDF (SPARQL) models.
Pros
- Fully managed
- Multi-model (property + RDF)
Cons
- AWS lock-in
- Expensive
Use When
- AWS-native graph needs
- Knowledge graphs, identity graphs
Vector Databases
Pinecone
Managed vector database purpose-built for similarity search. Core infrastructure for RAG (Retrieval-Augmented Generation) and AI applications.
Pros
- Zero ops, fully managed
- Fast similarity search
- Metadata filtering
Cons
- Expensive at scale
- Vendor lock-in
Use When
- RAG applications
- Semantic search, recommendations
pgvector
PostgreSQL extension for vector similarity search. Store embeddings alongside relational data in a single database.
Pros
- No extra infrastructure
- SQL + vectors together
- Free, open-source
Cons
- Slower than dedicated vector DBs
- Scaling limitations
Use When
- Small-to-medium vector workloads
- Already using PostgreSQL
Qdrant / Weaviate
Open-source vector databases with rich filtering, hybrid search (vector + keyword), and self-hosting options.
Pros
- Open-source, self-hostable
- Hybrid search
- Active development
Cons
- Newer, less battle-tested
- Self-hosting complexity
Use When
- Need vector DB without vendor lock-in
- Want hybrid keyword + vector search
Workflow Orchestration
Temporal
Durable execution platform for long-running workflows. Code-first approach — write workflows as regular functions.
Pros
- Code-first (no YAML/JSON DSL)
- Built-in retries, timeouts, versioning
- Durable execution
Cons
- Operational complexity
- Learning curve
Use When
- Complex, multi-step business processes
- Long-running workflows (hours/days)
AWS Step Functions
Managed workflow service. Visual state machine with JSON-based definition. Integrates natively with Lambda, SQS, and DynamoDB.
Pros
- Zero ops, fully managed
- Visual workflow designer
- Native AWS integration
Cons
- JSON DSL is verbose
- AWS lock-in
- Expensive at high volume
Use When
- AWS-native orchestration
- Serverless workflows
Apache Airflow
Python-based DAG scheduler. The standard for data pipeline orchestration (ETL/ELT).
Pros
- Python-native
- Huge plugin ecosystem
- Great for data pipelines
Cons
- Not designed for real-time
- Scheduler bottleneck at scale
Use When
- Data/ML pipelines
- Scheduled batch processing
Time-Series DB
InfluxDB
Purpose-built for time-stamped data. Optimized for high-frequency writes with automatic downsampling and retention policies.
Pros
- Built for metrics/IoT
- Auto downsampling & retention
- High write throughput
Cons
- Custom query language (Flux)
- Not SQL-compatible
Use When
- Monitoring, IoT sensors
- High-frequency time-stamped data
TimescaleDB
PostgreSQL extension for time-series. Full SQL compatibility with time-series optimizations (hypertables, compression).
Pros
- Full SQL support
- Runs on PostgreSQL
- JOINs with relational data
Cons
- Less optimized than InfluxDB for pure metrics
Use When
- Need SQL + time-series
- Already using PostgreSQL
API Gateway
Kong
Open-source API gateway with plugins for auth, rate limiting, logging, and transformation. Cloud-agnostic.
Pros
- Rich plugin ecosystem
- Cloud-agnostic
- Open-source core
Cons
- Self-managed (unless Kong Cloud)
- Learning curve
Use When
- Microservices with complex API needs
- Multi-cloud environments
Amazon API Gateway
Managed API gateway by AWS. REST and WebSocket APIs. Native Lambda integration for serverless backends.
Pros
- Zero ops
- Native Lambda integration
- Usage plans & API keys
Cons
- AWS lock-in
- Latency overhead
- Expensive at high volume
Use When
- Serverless AWS backends
- Public API with usage plans
Serverless
AWS Lambda
Event-driven functions. Pay-per-invocation, auto-scales to thousands of concurrent executions. 15-min max runtime.
Pros
- Zero server management
- Pay only for usage
- Scales automatically
Cons
- Cold starts (100ms–2s)
- 15-min timeout
- Vendor lock-in
Use When
- Event-driven processing
- Bursty, unpredictable traffic
Azure Functions
Microsoft's serverless platform. Supports Durable Functions for long-running workflows with state management.
Pros
- Durable Functions (orchestration)
- Azure ecosystem integration
Cons
- Azure lock-in
Use When
- Azure ecosystem
- Need durable/stateful functions
Google Cloud Functions
Google's serverless platform. Tight integration with Firebase, Pub/Sub, and Cloud Run.
Pros
- Firebase integration
- Cloud Run for containers
Cons
- GCP lock-in
- Smaller ecosystem than Lambda
Use When
- GCP/Firebase ecosystem
When Scale Changes the Answer
The #1 in each category is the right default — but at certain scale thresholds, you need specialized tools. A common question is "why not just use PostgreSQL for X?"
Search: PostgreSQL Full-Text vs Elasticsearch
PostgreSQL's built-in tsvector and tsquery work well for <1M documents with simple keyword search. At 10M+ documents, Elasticsearch wins because:
- Inverted indexes are purpose-built for full-text search (faster than PG's GIN indexes at scale)
- Horizontal scaling — shard the search index across nodes
- Rich features — fuzzy matching, faceted search, relevance tuning, synonyms, autocomplete
Rule of thumb: If search is a minor feature (e.g., searching your own notes), PostgreSQL is fine. If search IS the feature (e.g., a product catalog, documentation site), use Elasticsearch.
Time-Series: PostgreSQL vs Specialized Stores
PostgreSQL can store timestamped data, but at high-frequency writes (metrics, IoT sensors) it struggles because:
- Standard INSERT overhead on B-tree indexes doesn't suit append-heavy workloads
- No built-in downsampling (aggregating old data: per-second → per-minute → per-hour)
- No automatic retention policies (purging data older than 90 days)
TimescaleDB is a PostgreSQL extension that adds time-series optimizations while keeping full SQL compatibility — a good middle ground. InfluxDB is purpose-built for metrics and monitoring with its own query language.
Part 7
Glossary
Quick-fire definitions for key system design terms used throughout this guide.
| Term | Definition |
|---|---|
| Microservices | Each feature is its own deployable service |
| Monolith | Single deployable unit containing all features |
| API Gateway | Single entry point that routes, authenticates, and rate-limits API traffic |
| Circuit Breaker | Stops calling a failing service to let it recover. 3 states: Closed (normal — requests pass through), Open (tripped — requests immediately fail without calling the service), Half-Open (testing — allows one request through to check if the service has recovered). |
| Saga Pattern | Sequence of local transactions across services with compensating actions on failure. E-commerce example: (1) reserve inventory → (2) charge payment → (3) ship order. If step 3 fails: compensate by refunding payment (undo step 2) and releasing inventory (undo step 1). Each step can be rolled back independently. |
| Idempotency | Safe to retry — same result no matter how many times called |
| Backpressure | Consumer tells producer to slow down when overwhelmed. Analogy: a sink filling with water faster than it drains — backpressure is turning down the faucet instead of letting the sink overflow. Implementations: return 429 status, use bounded queues, or use reactive streams. |
| Blue-Green | Two identical environments; switch traffic from old to new instantly |
| Canary | Roll out new version to small % of users first, then gradually increase |
| Feature Flag | Toggle to enable/disable features without deploying code |
| Event Sourcing | Store events (what happened) instead of current state. Bank account example: instead of storing "balance = $500," store every event: [+$1000 deposit, -$200 withdrawal, -$300 payment]. Current balance = replay all events. Benefits: complete audit trail, time-travel debugging, can rebuild state at any point. |
| CQRS | Separate read and write models/paths. See Part 2. |
| SLA | Contractual uptime commitment (99.9% = 8.7 hrs downtime/year) |
| SLO | Internal target — stricter than SLA |
| P99 Latency | 99% of requests complete within this time |
| Thundering Herd | Many requests hit an expired cache entry simultaneously. See Part 2. |
| Hot Spot | One shard/server receiving disproportionate traffic. See Part 3. |
Concepts Expanded
Several terms in the table above deserve deeper explanation. Here's the depth you need.
Thundering Herd Prevention
When a popular cache entry expires, thousands of requests slam the database simultaneously. Three mitigations:
- Cache locking: The first request acquires a short-lived lock and fetches from the DB. All other requests wait for the lock to release and then read the freshly cached value.
- Staggered TTLs: Add random jitter to expiration times (e.g., base TTL 5 min + random 0-60 sec). Entries expire at different times, spreading the load.
- Background refresh: A background job re-populates the cache before TTL expires, so users never see a cache miss. Best for data that's expensive to compute.
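The staggered-TTL mitigation above is essentially a one-liner. A sketch (the `cache.set` call in the comment is a placeholder API, not a specific library):

```javascript
// Staggered TTLs: base TTL plus random jitter, so popular keys written at
// the same moment don't all expire in the same instant.
function jitteredTtlSeconds(baseSeconds = 300, jitterSeconds = 60) {
  return baseSeconds + Math.floor(Math.random() * jitterSeconds);
}

// e.g. cache.set(key, value, { ttl: jitteredTtlSeconds() })  // placeholder API
```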
Cross-reference: caching fundamentals in Part 2.
Idempotency Key Pattern
Critical for payment APIs and any operation where retries could cause duplicates. The client generates a UUID and sends it in a header; the server checks before executing:
POST /payments
Headers: Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000
Server logic:
1. Check DB: has this idempotency key been processed?
├── YES → return the cached result (no re-execution)
└── NO → process the payment
2. Store: idempotency_key → result (with 24hr TTL)
3. Return result to client
This guarantees that network timeouts and retries never cause double-charges.
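The server-side check can be sketched in a few lines. Here an in-memory Map stands in for the database table (a real system would persist it with the 24hr TTL), and `execute` is a placeholder for the actual charge logic:

```javascript
// Idempotency sketch: keyed operations execute once; retries with the same
// key return the cached result instead of re-executing.
const processed = new Map(); // idempotencyKey -> result (real systems add a TTL)

async function chargePayment(idempotencyKey, amount, execute) {
  if (processed.has(idempotencyKey)) {
    return processed.get(idempotencyKey); // retry: return cached result, no re-execution
  }
  const result = await execute(amount);   // actually perform the charge
  processed.set(idempotencyKey, result);  // record before replying
  return result;
}
```

A retried request with the same key hits the Map and never reaches `execute` a second time — which is exactly why timeouts and client retries can't double-charge.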
Circuit Breaker in Practice
A circuit breaker monitors calls to a dependency. If the failure rate exceeds a threshold, it "opens" — subsequent calls return a fallback immediately without actually calling the failing service. The system degrades gracefully rather than collapsing:
// Sketch using the Node.js 'opossum' circuit breaker library:
const CircuitBreaker = require('opossum');
const breaker = new CircuitBreaker(fetchFromDependency, {
timeout: 3000, // fail after 3s instead of hanging
errorThresholdPercentage: 50, // open after 50% failure rate
resetTimeout: 30000, // try again after 30s
});
breaker.fallback(() => ({
data: null,
message: 'Service temporarily unavailable'
}));
// If the dependency is down, users get a degraded response
// instead of a timeout or crash. The system keeps running.
Cross-reference: circuit breaker definition in the glossary table above.
Saga Pattern vs. Distributed Transactions (2PC)
Two-phase commit (2PC) locks all participants until everyone agrees to commit — strong consistency, but a single slow participant blocks everything and the coordinator is a single point of failure. Sagas break the work into a sequence of independent local transactions, each with a compensating action for rollback. Sagas trade consistency for availability: you get eventual consistency but no global locks. Default to sagas for microservices; reserve 2PC for cases where you truly need atomic cross-service consistency (rare).
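A saga executor is small: run each step in order, and on failure run the compensations of the completed steps in reverse. A sketch (the step objects here are illustrative stand-ins for real service calls):

```javascript
// Saga sketch: each step has an action and a compensating action.
async function runSaga(steps) {
  const done = [];
  try {
    for (const step of steps) {
      await step.action();   // local transaction in one service
      done.push(step);
    }
    return { ok: true };
  } catch (err) {
    // Undo completed steps in reverse order.
    for (const step of done.reverse()) await step.compensate();
    return { ok: false, error: err.message };
  }
}

// Illustrative demo mirroring the e-commerce example: ship fails, so the
// charge and the reservation are compensated in reverse.
(async () => {
  const log = [];
  const step = (name, fail = false) => ({
    action: async () => { if (fail) throw new Error(`${name} failed`); log.push(name); },
    compensate: async () => log.push(`undo ${name}`),
  });
  const r = await runSaga([step("reserve"), step("charge"), step("ship", true)]);
  console.log(r.ok, log); // false [ 'reserve', 'charge', 'undo charge', 'undo reserve' ]
})();
```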
Feature Flags — Why Use Them
- Instant kill switch: If a new feature causes errors, disable it in seconds — no rollback or redeploy needed.
- Gradual rollout: Release to 5% of users → monitor → 25% → 100%. Catches issues before they affect everyone.
- A/B testing: Show version A to half your users, version B to the other half. Measure which performs better.
- Unblock deploys: Merge code behind a flag even if the feature isn't ready. The flag keeps it invisible until you flip it on.