Design an API rate limiter

Question

Accepted Answer

An API rate limiter is a gatekeeping mechanism that caps how many requests a client — identified by API key, user ID, IP address, or endpoint — can make within a time window. Without rate limiting, a single misbehaving client can exhaust your infrastructure, degrade service for everyone else, or bypass business model constraints (e.g., a free-tier user calling a paid inference endpoint millions of times). Designing a rate limiter well requires choosing the right algorithm , picking the right storage backend , placing the limiter in the right location in the stack, and handling distributed correctness across multiple gateway nodes. Rate Limiting Algorithms Fixed Window Counter — maintain a counter per time bucket (e.g., "100 requests per minute"). Simple to implement with a Redis INCR + EXPIRE . Fatal flaw: a client can send 100 requests at 00:59 and 100 more at 01:00, effectively 200 requests in 2 seconds at the window boundary — 2x burst amplification . Sliding Window Log — store a timestamp for every request in a sorted set; count timestamps within the last N seconds. Most accurate; zero boundary artifacts. Cost: O(requests) memory per user — a client making 1M requests per hour would store 1M timestamps, which is prohibitive. Sliding Window Counter (hybrid) — the practical sweet spot. Maintain the current window count and the previous window count, then compute a weighted sum: rate = prev_count × (1 - elapsed_fraction) + current_count . This approximates the sliding window behavior with O(1) memory per key, and the error is at most a few percent in practice. Token Bucket — each client has a bucket that holds up to N tokens and refills at R tokens/second. Each request consumes one token; if the bucket is empty, the request is rejected. Allows short bursts up to the bucket size while enforcing a long-run rate of R/second. This is the most popular production algorithm because it matches real traffic patterns — legitimate clients do send bursts, and penalizing them for it is bad UX. Leaky Bucket — requests enter a queue that drains at a constant rate. Enforces a perfectly smooth output rate, eliminating all bursts. Ideal for downstream services that cannot handle spikes. The cost: legitimate burst traffic is delayed rather than served immediately, increasing latency for the client. Storage Backend: Redis as the Central Counter A rate limiter running on multiple API gateway nodes must share state — otherwise each node maintains its own counter and a client can bypass limits by round-robining requests across nodes. Redis is the standard shared counter store. A fixed-window limiter uses INCR key followed by EXPIRE key 60 ; to make the check-and-increment atomic (avoiding race conditions between the check and the increment), use a Redis Lua script that executes both operations as a single atomic unit. A token bucket is implemented by storing (token_count, last_refill_timestamp) per key and computing the refill in a Lua script atomically. Redis operates at sub-millisecond latency, so the limiter adds at most 1–2ms to each request's end-to-end latency. Placement: Where in the Stack to Enforce The limiter belongs at the API gateway or edge layer (NGINX, Envoy, AWS API Gateway, Kong, Cloudflare Workers) so requests are rejected before reaching any backend service — saving compute on every rejected call. A second, lighter limit at the per-service level provides defense in depth against internal misuse. The rate limit key choice matters: by API key (for authenticated clients), by user ID (for logged-in users), by IP address (for unauthenticated traffic — but be careful about NAT and CDN shared IPs that lump thousands of users onto one IP). For high-value endpoints (LLM inference, payments, authentication), apply tighter limits than for inexpensive read endpoints. Key trade-off — precision vs performance: A centralized Redis counter is perfectly accurate but adds one network round trip per request. A local in-memory counter (per gateway node) adds zero latency but allows each node's clients to independently hit the limit, meaning a client using N gateway nodes can send N× the allowed rate. The practical resolution is to use local counters for coarse limits (e.g., 10,000 req/min) where 20% over-counting is acceptable, and centralized Redis for strict limits (auth endpoints, expensive APIs). Many production systems tolerate a small amount of over-counting in exchange for not making a Redis call on every request. Client Communication and Graceful Handling Return HTTP 429 Too Many Requests with a Retry-After header indicating when the client can try again. Include informational headers on every response: X-RateLimit-Limit (the ceiling), X-RateLimit-Remaining (requests left in the window), X-RateLimit-Reset (Unix timestamp when the window resets). Graceful degradation — if Redis is unavailable, decide per-endpoint: fail open (allow all traffic) for non-critical limits, or fail closed (reject all traffic) for expensive or se