A Practical Guide to Database Indexing
Indexes are the single highest-leverage tool for query performance, and also one of the easiest to get wrong. This guide covers how the default B-tree index actually works, when it pays off, when it quietly costs you, and how to use EXPLAIN to confirm what is really happening instead of guessing.
How a B-tree index works
When people say "index" without qualification, they almost always mean a B-tree (more precisely a B+tree). It is a balanced, sorted tree structure. Internal nodes hold separator keys that route you toward a leaf; the leaf nodes hold the actual indexed values in sorted order, along with a pointer back to the table row (a row ID, or in some engines the primary key).
Two properties matter for everything that follows:
- It is sorted. Because keys are ordered, the engine can binary-search down the tree and find any value in roughly
log(n)steps. A billion-row table is reachable in about 30 hops, not a billion. - The leaves are a sorted, linked sequence. Once you locate a starting key, you can walk the leaves to read a contiguous range without revisiting the tree.
That second property is why a B-tree serves so many access patterns from one structure: equality lookups (WHERE id = 42), range scans (WHERE created_at >= '2026-01-01'), prefix matches (WHERE name LIKE 'sam%'), and even ORDER BY / GROUP BY, because the data is already in order and the engine can skip the sort.
The lookup itself is two steps
For a non-covering index, finding a row is two operations: traverse the index to get the row pointer, then fetch the actual row from the table (a "heap fetch" or "bookmark lookup"). That second fetch is a random I/O. It is cheap for a handful of rows and ruinous for millions — which is the core reason indexes help selective queries and hurt unselective ones.
When an index helps vs. when it hurts
The decisive concept is selectivity: the fraction of rows a predicate keeps. An index is a win when a query touches a small slice of the table.
- High selectivity (few matching rows): an index on a unique or near-unique column (user ID, email, order number) turns a full scan into a handful of page reads. Big win.
- Low selectivity (many matching rows): indexing a boolean
is_activeor astatuscolumn where 90% of rows share one value is usually pointless. The planner estimates it would do millions of random heap fetches and correctly chooses a sequential scan instead — reading the table front to back is faster than scattered lookups once you exceed roughly 5–20% of the table.
An index can also actively hurt:
- Writes pay a tax (see write amplification below).
- The planner can be misled by stale statistics into using a bad index, or skewed data can make an "average" selectivity estimate wrong for a specific value.
- Redundant indexes waste disk and buffer-pool memory and add write cost without serving any query that an existing index doesn't already cover.
Composite indexes and column order
A composite (multi-column) index sorts by the first column, then by the second within each first-column group, and so on — exactly like sorting a spreadsheet by column A, then B, then C. Column order is therefore not cosmetic; it determines which queries the index can serve.
The key rule is the leftmost prefix: an index on (a, b, c) can satisfy predicates on a, on (a, b), and on (a, b, c), but not a query that filters only on b or only on c. There is no usable starting point in the sort order without the leading column.
-- Index: (tenant_id, status, created_at)
WHERE tenant_id = 7 -- uses index (prefix)
WHERE tenant_id = 7 AND status = 'open' -- uses index (prefix)
WHERE tenant_id = 7 AND status = 'open'
AND created_at > now() - interval '7 days' -- uses index fully
WHERE status = 'open' -- cannot use this index
Practical ordering guidance:
- Put columns used with equality predicates before columns used with ranges. Once a range column is used, columns after it in the index can no longer narrow the scan — a range "opens up" the remaining sort order. So
(status, created_at)beats(created_at, status)forstatus = 'open' AND created_at > X. - Lead with the column that appears in the most queries, so a single index serves several access patterns via its prefixes.
- If a column always carries an equality filter (like
tenant_idin a multi-tenant app), it is usually a good leading column.
Covering indexes
Recall the two-step lookup: index traversal, then heap fetch. A covering index eliminates the second step by including every column the query needs, so the engine answers entirely from the index. The plan shows an "index-only scan."
If a query reads customer_id and total while filtering on customer_id, an index on just customer_id still requires a heap fetch for total. Extend it to cover:
-- PostgreSQL: payload columns via INCLUDE (not part of the sort key)
CREATE INDEX idx_orders_customer
ON orders (customer_id) INCLUDE (total);
-- MySQL/InnoDB: append columns to the key
CREATE INDEX idx_orders_customer
ON orders (customer_id, total);
Use INCLUDE (PostgreSQL) for columns you only need to return, not filter or sort on — they ride along in the leaves without bloating the tree's search keys. Covering indexes are powerful for hot read paths, but every extra column makes the index larger and slower to maintain, so cover deliberately, not reflexively.
The write-amplification cost
Indexes are not free storage that only helps. Every INSERT must add an entry to each index on the table; every DELETE must remove them; every UPDATE to an indexed column must move the entry to its new sorted position. A table with eight indexes turns one row insert into nine structural writes.
This shows up as:
- Slower write throughput and higher I/O on write-heavy tables.
- Index bloat and page splits as random-key inserts fragment the tree (a reason monotonically increasing keys like timestamps or sequences are friendlier to insert than random UUIDs).
- More work for the buffer pool and the write-ahead log.
The takeaway: each index is a standing tax on writes that should be justified by reads it actually serves. Audit for unused indexes periodically — Postgres exposes pg_stat_user_indexes.idx_scan, and a count near zero after a representative period is a strong removal candidate.
Reading EXPLAIN and EXPLAIN ANALYZE
Never trust your mental model over the planner. EXPLAIN shows the plan and the optimizer's estimates; EXPLAIN ANALYZE actually runs the query and reports measured times and row counts.
EXPLAIN ANALYZE
SELECT id, total FROM orders
WHERE customer_id = 4242 AND status = 'open';
What to look for:
- Scan type.
Index Scan/Index Only Scanmeans your index is used;Seq Scan(or "full table scan") on a large table for a selective query is a red flag. - Estimated vs. actual rows. In
ANALYZEoutput, compare the planner'srows=estimate toactual rows. A large gap (say 50 estimated, 2,000,000 actual) means statistics are stale — runANALYZEon the table to refresh them, because bad estimates cause bad plan choices. - Cost vs. time. Cost numbers are unitless and only comparable within a plan;
ANALYZEgives real milliseconds. Read the plan from the most-indented node outward — the innermost operations run first. - Buffers.
EXPLAIN (ANALYZE, BUFFERS)reports pages read; it distinguishes a query that is slow from disk I/O versus one that is genuinely doing too much work.
Common pitfalls
Function-wrapped predicates
An index stores the column's raw values in sorted order. The moment you wrap the column in a function or expression, that sort order no longer matches what you are searching for, and the index is bypassed in favor of a full scan:
-- Index on lower(email) is NOT usable here, because the
-- column is wrapped on the left-hand side:
WHERE lower(email) = 'sam@example.com' -- ok IF you indexed the expression
WHERE email = lower('Sam@Example.com') -- ok: column is bare
-- Classic offender: a function on the column kills the index
WHERE date(created_at) = '2026-06-28' -- full scan
WHERE created_at >= '2026-06-28'
AND created_at < '2026-06-29' -- range, uses the index
Implicit type casts (comparing a string column to a number, or vice versa) trigger the same problem because the engine inserts a hidden conversion. The fixes are to keep the column bare on one side, rewrite to a range, or create an expression index (e.g. CREATE INDEX ON users (lower(email))) that matches the exact expression you query.
Leading wildcards
LIKE 'sam%' can use a B-tree (it is a prefix range). LIKE '%sam' cannot — there is no sorted prefix to anchor on, so it scans. Reach for a trigram index or full-text search for substring matching.
Low-selectivity columns and OR
Indexing a column with two or three distinct values rarely helps. And a predicate joined by OR across different columns often defeats a single composite index; sometimes two separate indexes plus a bitmap/index-merge plan, or a rewrite to UNION, performs far better. Always confirm with EXPLAIN.
Practical takeaway
Index the columns your selective queries filter, join, and sort on. Order composite-index columns equality-first, range-last, leading with the most-queried column. Add covering columns only for proven hot paths. Keep predicates index-friendly by leaving columns unwrapped and ranges explicit. Then verify every assumption with EXPLAIN ANALYZE rather than intuition — and periodically delete indexes nothing uses, because each one taxes every write.