An Introduction to Rate Limiting Algorithms

May 21, 2026 · 8 min read

Rate limiting protects a service from being overwhelmed, enforces fair use across clients, and caps the cost of abusive or buggy callers. The mechanism is simple to state — "allow at most N requests per window" — but the algorithm you pick determines how bursts are handled, how much memory you spend, and how accurate the limit is under load. This article walks through the five algorithms you'll actually encounter, their tradeoffs, where to put the enforcement, and how to extend them across many machines.

The five algorithms

Token bucket

A bucket holds up to capacity tokens and refills at a steady refillRate tokens per second. Each request removes one token; if the bucket is empty, the request is rejected (or queued). The key property is that unused capacity accumulates up to the bucket size, so a client that has been idle can spend a burst all at once, then is throttled down to the sustained refill rate.

function allow(key):
    b = store.get(key) or { tokens: capacity, last: now() }
    elapsed = now() - b.last
    b.tokens = min(capacity, b.tokens + elapsed * refillRate)
    b.last   = now()
    if b.tokens >= 1:
        b.tokens -= 1
        store.set(key, b)
        return ALLOW
    store.set(key, b)
    return DENY

Token bucket is the workhorse default. It allows configurable bursts (via capacity) while bounding the long-run average, and it stores only two numbers per client. Note that you do not need a background timer to add tokens — you lazily compute the refill from the elapsed time on each request.

Leaky bucket

The leaky bucket models a queue that drains at a fixed rate. Requests enter a FIFO queue; the queue "leaks" (processes) at most leakRate per second. If the queue is full, new requests are dropped. Where token bucket allows bursts to pass through immediately, leaky bucket smooths output to a constant rate — useful when the thing you are protecting (a payment processor, a downstream API) cannot tolerate spikes.

function allow(key):
    q = store.get(key) or { level: 0, last: now() }
    leaked = (now() - q.last) * leakRate
    q.level = max(0, q.level - leaked)
    q.last  = now()
    if q.level < capacity:
        q.level += 1
        store.set(key, q)
        return ALLOW
    store.set(key, q)
    return DENY

Implemented this way (as a counter rather than a real queue), leaky bucket is nearly identical in cost to token bucket. The practical difference is intent: token bucket permits a burst up to the bucket size; leaky bucket enforces a steadier outflow and rejects rather than absorbs sudden spikes.

Fixed-window counter

Divide time into fixed windows (e.g. each calendar minute) and keep one counter per client per window. Increment on each request; reject once the counter exceeds the limit; the counter resets when the window rolls over.

function allow(key):
    window = floor(now() / windowSize)
    k = key + ":" + window
    count = store.incr(k)
    if count == 1:
        store.expire(k, windowSize)
    return count <= limit ? ALLOW : DENY

This is the cheapest algorithm to run — a single integer and an atomic increment per client. Its weakness is the boundary burst: a client can send limit requests in the last instant of one window and another limit at the start of the next, achieving up to 2 × limit requests in a span shorter than one window. For coarse, forgiving limits this is fine; for tight ones it is a real loophole.

Sliding-window log

Keep a timestamp for every request in a sorted structure. On each request, drop timestamps older than one window, then allow only if the remaining count is below the limit. This is exact — the boundary burst disappears entirely because the window genuinely slides with each request.

function allow(key):
    log = store.get(key) or []
    cutoff = now() - windowSize
    log = [t for t in log if t > cutoff]   # drop expired
    if len(log) < limit:
        log.append(now())
        store.set(key, log)
        return ALLOW
    store.set(key, log)
    return DENY

The cost is memory: you store one timestamp per request in the window, so a client allowed 10,000 requests/minute needs room for 10,000 entries. Accuracy is perfect; scalability is the problem.

Sliding-window counter

This is the pragmatic compromise. Keep fixed-window counters (cheap) but approximate a sliding window by weighting the previous window's count by how much of it still overlaps the current sliding window.

function allow(key):
    now_w   = floor(now() / windowSize)
    prev    = store.get(key + ":" + (now_w - 1)) or 0
    curr    = store.get(key + ":" + now_w) or 0
    # fraction of the previous window still inside the sliding window
    overlap = 1 - (now() mod windowSize) / windowSize
    estimate = prev * overlap + curr
    if estimate < limit:
        store.incr(key + ":" + now_w)
        return ALLOW
    return DENY

It uses two counters per client instead of a full log, eliminates most of the fixed-window boundary burst, and is accurate enough for almost all production traffic. The estimate assumes the previous window's requests were spread evenly, so it can be slightly off, but the error is small and bounded. This is what many API gateways use by default.

Choosing: burst tolerance vs. memory

Token bucket — O(1) memory; deliberate, configurable bursts; great general default for APIs.
Leaky bucket — O(1) memory; smooths output, no bursts; use when downstream needs a steady rate.
Fixed window — cheapest; allows a 2x boundary burst; fine for coarse limits and quotas.
Sliding-window log — exact, no boundary burst; O(requests) memory; reserve for low-volume, high-precision limits.
Sliding-window counter — O(1) memory; near-exact; the best accuracy-per-byte and a strong default.

Where to enforce

The same algorithm behaves very differently depending on where it runs.

Client — A self-imposed limit (often token bucket) prevents your own app from hammering an upstream and respecting its quota. It improves manners and latency but is purely advisory: never trust the client for protection, since it can be bypassed or buggy.
Gateway / load balancer / API proxy — The usual home for rate limiting. It sits in front of every service, sheds load before it reaches your code, and centralizes policy (per-API-key, per-IP, per-route). This is where you stop abuse cheaply.
Service — In-process limits guard a specific expensive resource — a database connection pool, a third-party billing call — that a gateway cannot see. Use it for fine-grained, resource-aware limits the edge does not know about.

A mature system layers these: a generous gateway limit for blanket protection, plus targeted service-level limits on the expensive paths.

Distributed rate limiting

An in-memory counter only limits one process. With N gateway instances behind a load balancer, a per-instance limit of L silently becomes N×L globally. To enforce a single global limit you need shared state, typically Redis, with atomic operations so concurrent instances don't race.

The critical detail is atomicity. A naive GET, check, then SET has a race window where two instances both read "9 of 10 used" and both allow a request. Solve it by making the read-modify-write atomic — an INCR for fixed window, or a Lua script for token/sliding-window logic so the whole computation runs in one round trip under Redis's single-threaded execution.

-- atomic fixed-window in Redis (Lua)
local current = redis.call('INCR', KEYS[1])
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])  -- windowSize
end
if current > tonumber(ARGV[2]) then          -- limit
    return 0   -- DENY
end
return 1       -- ALLOW

Other distributed concerns worth planning for:

Latency — every check is now a network hop. Batch where possible, and keep the shared store close to the gateways.
Availability — decide the failure mode up front. Fail open (allow when the store is down) favors availability; fail closed favors protection. Most public APIs fail open with a local fallback limit.
Hot keys — a single popular key (one big customer, one shared IP) can bottleneck on one Redis shard. Consider local pre-checks or sharded counters.
Clock skew — prefer the store's clock or monotonic elapsed time over each instance's wall clock for window boundaries.

Practical takeaway

Start with token bucket or sliding-window counter at the gateway: both are O(1) per client, handle bursts sensibly, and cover the vast majority of needs. Reach for leaky bucket only when a downstream demands a smooth output rate, and for the sliding-window log only when you need exactness on a low-volume path. The moment you run more than one instance, move the state into a shared store and make every update atomic — and decide deliberately whether you fail open or closed. Finally, always return 429 Too Many Requests with a Retry-After header so well-behaved clients can back off instead of retrying blindly.

rate-limitingsystem-designdistributed-systemsapibackend

← All articles