Database Indexing
An index is the difference between scanning a million rows and reading three. Get good at indexes and your queries get 1000x faster. Get bad at them and your writes get 10x slower.
The book analogy that actually works
Imagine a 1000-page book without an index. You want to find every mention of "consistent hashing". Your only option is to read every page. With an index at the back, you flip to "C", find "consistent hashing → page 412", and you're done. The book itself didn't change. The index is a separate, sorted lookup that points into the book.
A database index is exactly that. A separate data structure, sorted on one or more columns, that points to the actual rows. Queries that use the index go from O(N) (read every row) to O(log N) (binary search in a tree).
The B-tree (the default index)
Most relational databases use a B-tree (specifically a B+ tree). It is a balanced tree where each node holds many keys and points to children. Lookups, range scans, insertions, and deletions are all O(log N).
For a table with 100M rows, a B-tree is about 4-5 levels deep. A point lookup is 4-5 reads. Without the index, it would be 100M reads. The factor is roughly twenty million. That is what an index buys you.
Other index types
- Hash index. O(1) point lookups. Cannot do range queries. Great for caches, useless for "WHERE age > 30".
- Bitmap index. Compact for low-cardinality columns (gender, status flag). Heavy on writes; common in data warehouses.
- Inverted index. Used by full-text search. Maps each word to the documents that contain it. Elasticsearch's bread and butter.
- GiST / GIN. Postgres specialty indexes for geospatial, JSONB, arrays.
Composite indexes and column order
An index can cover multiple columns. CREATE INDEX ON orders (user_id, created_at). The order matters. This index helps queries on user_id alone or user_id + created_at, but not created_at alone. Why? Because the tree is sorted by the leftmost column first.
Rule of thumb: put the most-filtered column first, then the next-most-filtered, etc. Equality columns before range columns.
The cost of indexes
Indexes are not free. Every index speeds up reads on its columns and slows down every write to the table. INSERTs, UPDATEs, and DELETEs must update every relevant index. A table with twelve indexes will write thirteen times (the row plus all indexes) on every change.
This means: index for the queries you actually run. Stale or unused indexes are pure cost. Most databases have tools to identify them (Postgres has pg_stat_user_indexes).
Reading EXPLAIN like a senior
The EXPLAIN command shows the query plan. Three things to look for:
- Sequential scan on a big table is usually a problem. Means no useful index.
- Index scan is good. Index range scan even better for ranges.
- Rows estimate vs actual. If they diverge, the optimizer is making bad choices. Run ANALYZE.
EXPLAIN is the single most useful database skill. Spend an afternoon learning it for your database and you will pay back the time within a month.
Covering indexes
If an index includes every column the query needs (in the index or as INCLUDE columns), the database does not even need to read the row. The whole query is satisfied from the index. This can be 10x faster.
Example: SELECT user_id, status FROM orders WHERE user_id = 42. An index on (user_id) INCLUDE (status) serves the whole query without touching the table.
Indexing is the single highest-leverage skill in database performance work. The good news: 80% of the value comes from understanding B-trees, composite indexes, and EXPLAIN. The other 20% takes years. Start with the 80%.