SQL Practice Logo

SQLPractice Online

SQL Indexes: The Complete Guide to B-Tree, Hash, GIN, GiST, Composite, Covering and Partial Indexes

An index is the difference between 50ms and 50 seconds. Pick the wrong type and the planner ignores it; pick the wrong column order and you index the wrong half of your queries; index too eagerly and writes slow to a crawl. This guide covers every index family — B-tree, hash, GIN, GiST, BRIN, full-text — with the rules for composite column order, covering / INCLUDE columns, partial indexes, and exactly when adding an index makes things slower.

Last updated · SQL Practice Online team

What is a SQL index, and why does it work?

An index is a separate, ordered data structure that the database maintains alongside a table so the planner can find rows without scanning every page. Conceptually it is the back-of-book index of a textbook: instead of reading every page to find a topic, you read a small sorted list and jump to the right page.

Mechanically, the most common index — the B-tree — is a balanced search tree where each leaf holds a sorted batch of (key, row-locator) pairs. Looking up a single value takes O(log N) page reads instead of O(N); range scans walk the leaves in order; and because B-trees are balanced, performance does not degrade as the table grows.

The index types every SQL developer should know

  • B-tree — sorted, balanced, supports equality and range (=, <, >, BETWEEN, LIKE 'prefix%'). The default everywhere; 95% of all indexes are B-trees.
  • Hash — equality-only, slightly faster than B-tree for exact lookups, useless for range or sort. PostgreSQL hash indexes are crash-safe since v10; MySQL InnoDB has the adaptive hash index built in.
  • GIN (PostgreSQL) — Generalized Inverted Index. The right choice for arrays, JSONB key/value lookups and full-text search (tsvector). Builds slowly, queries blazingly.
  • GiST (PostgreSQL) — Generalized Search Tree. Geometric / nearest-neighbour queries, range types, full-text. Faster updates than GIN, slower lookups.
  • BRIN (PostgreSQL) — Block Range Index. A few KB of metadata per gigabyte of table; only useful when data is physically sorted on disk (timestamp-ordered append-only tables). 1000× smaller than B-tree, only ~3× slower for the right workloads.
  • SP-GiST, BRIN_inclusion (PostgreSQL) — specialised; reach for them when GIN/GiST are not the right shape.
  • Full-text — tsvector + GIN (Postgres), FULLTEXT (MySQL), CONTAINS / FREETEXT (SQL Server). Built for natural-language search, not LIKE '%word%'.
  • Bitmap (Oracle, partial in others) — extremely small for low-cardinality columns; great in data warehouses, terrible under high write rates.

B-tree indexes: the workhorse

A B-tree leaf node holds (sorted-key, row-id) pairs. The index supports four query shapes natively: equality (`=`), inequality range (`<`, `<=`, `>`, `>=`, `BETWEEN`), prefix matching (`LIKE 'abc%'`), and ordered output (`ORDER BY indexed_column`).

  • Equality and range queries become O(log N) tree descents that land at the right leaf, then scan forward.
  • Prefix matches work because sorted keys cluster together: `LIKE 'abc%'` is a range scan from 'abc' to 'abd'.
  • Suffix and middle matches do NOT use the index: `LIKE '%xyz'` and `LIKE '%xyz%'` force a full scan. Use a trigram index (Postgres pg_trgm) or full-text index instead.
  • ORDER BY on an indexed column avoids the sort step entirely — the planner just walks the B-tree leaves.

Composite indexes and column order

A composite (multi-column) index sorts rows by the first column, then by the second within each first-column value, and so on. The order of columns in the index definition is the most consequential decision in indexing — and the one that is most often wrong.

The leftmost-prefix rule: an index on (a, b, c) supports queries that filter on (a), (a, b) or (a, b, c) — but NOT queries that filter only on (b), (c) or (b, c). The first column must appear in the WHERE clause for the index to be considered.

-- Index: (tenant_id, created_at, status)

-- ✅ Uses index (leftmost prefix)
WHERE tenant_id = 42

-- ✅ Uses index (full prefix; range on tail is fine)
WHERE tenant_id = 42 AND created_at >= '2026-01-01'

-- ✅ Uses index for tenant_id + created_at, then filters status in heap
WHERE tenant_id = 42 AND created_at >= '2026-01-01' AND status = 'paid'

-- ❌ Does NOT use the index (no tenant_id)
WHERE created_at >= '2026-01-01' AND status = 'paid'

-- ⚠️ Uses index for tenant_id only; status filter happens after
WHERE tenant_id = 42 AND status = 'paid'

Covering indexes and INCLUDE columns

A covering index supplies every column the query needs — including columns in the SELECT list — without going back to the heap. The planner reports this as an "index-only scan" (Postgres) or "covering index" (MySQL/SQL Server). It is the single biggest speedup available for read-heavy hotspots.

-- Query: SELECT user_id, score FROM events WHERE tenant_id = 42 AND occurred_at >= '...';

-- B-tree without covering: index lookup → heap fetch for user_id, score
CREATE INDEX ix_events_tenant_time
  ON events (tenant_id, occurred_at);

-- Covering index (Postgres 11+, SQL Server): no heap fetch
CREATE INDEX ix_events_tenant_time_covering
  ON events (tenant_id, occurred_at) INCLUDE (user_id, score);

-- MySQL: just put covering columns at the end of the key
CREATE INDEX ix_events_tenant_time_covering
  ON events (tenant_id, occurred_at, user_id, score);
  • Postgres and SQL Server have INCLUDE — the column is stored in the leaf but is NOT part of the sort key, so it does not bloat tree branches.
  • MySQL has no INCLUDE; you append covering columns to the key, which makes the tree slightly larger but still avoids heap fetches.
  • For a covering index to help, the query must touch ONLY columns in the index. A SELECT * defeats covering on every wide table.

Partial / filtered indexes

A partial index (Postgres terminology) or filtered index (SQL Server) is a B-tree built only over rows that match a WHERE predicate. It is often the right answer when one value of a column dominates the table — soft-deleted rows, archived events, the 99% of orders that are already shipped.

-- Postgres / SQL Server: index ONLY active orders (3% of the table)
CREATE INDEX ix_orders_active_customer
  ON orders (customer_id, created_at)
  WHERE status IN ('open', 'pending');

-- Now the index is ~3% of the size and serves the dashboard query
-- WHERE customer_id = ? AND status IN ('open', 'pending')
-- without indexing the 97% of orders that are already shipped or cancelled.

Clustered vs non-clustered indexes

  • A clustered index defines the physical row order on disk: looking up a key directly returns the data row, no second hop. Each table can have at most one clustered index.
  • A non-clustered (secondary) index is a separate structure that maps key → row-locator. Looking up a key returns a pointer that the engine then dereferences to fetch the row.
  • SQL Server: PRIMARY KEY defaults to clustered; you choose explicitly. The clustered key is appended to every non-clustered index as the row-locator, so a wide clustered key bloats every other index.
  • MySQL InnoDB: every table is a clustered index on the primary key. There is no heap. Non-clustered indexes store the PK as the row-locator. UUID PKs hurt because random PK inserts fragment the entire table.
  • PostgreSQL: heap-organised by default; CLUSTER reorders the heap to match an index but does not maintain the order. PG primary keys are non-clustered B-trees.
  • Oracle: index-organised tables (IOT) are the equivalent of clustered indexes; heap tables are the default.

Index-only scans, visibility maps and the heap-fetch cost

In Postgres, even a covering index sometimes requires a heap fetch — to check row visibility for the current MVCC snapshot. The visibility map tracks pages where every row is visible to every transaction; if your query lands on those pages, the heap fetch is skipped. This is why VACUUM (which updates the visibility map) is the key to keeping index-only scans fast.

In MySQL InnoDB, the equivalent gotcha is REPEATABLE READ + secondary index lookups: visibility checks may force the engine to consult the clustered index even when a secondary index appears to cover the query. Set isolation level to READ COMMITTED for read-heavy analytical workloads when consistency permits.

When NOT to add an index

  • Tables under ~10K rows. The planner will sequentially scan in less time than it takes to traverse an index.
  • Columns with very low cardinality (status with 3 values, gender, boolean flags). The index will return more than ~10% of the table, which is slower than a sequential scan. Use a partial index that filters on the rare value instead.
  • Columns rewritten by every UPDATE. An indexed `last_seen_at` column updated on every API call doubles or triples the write cost.
  • Queries that always run with `LIKE '%term%'` or `<> value` — B-trees cannot help. Use trigram or full-text indexes for the first; redesign the query for the second.
  • Tables that are bulk-loaded then read once. DROP INDEX before the load and CREATE INDEX after; building once is far cheaper than maintaining during the load.

Cardinality, selectivity and the planner

The planner decides whether to use an index based on selectivity — the estimated fraction of rows that the predicate will match. The estimate comes from statistics (Postgres ANALYZE / pg_statistic, MySQL ANALYZE TABLE, SQL Server UPDATE STATISTICS). When statistics are stale or skewed, the planner picks badly.

  • High cardinality (most values are unique): indexes shine. user_email, order_uuid.
  • Medium cardinality (10s to 1000s of distinct values): indexes help filtered queries; column order in composite indexes matters most here.
  • Low cardinality (< 10 distinct values): plain index hurts. Use a partial index on the rare value, or skip indexing entirely.
  • Skewed distributions (one value covers 80% of rows): use multivariate / extended statistics so the planner sees the skew, or add a partial index on the rare values.

Reading EXPLAIN: did my index actually help?

-- PostgreSQL: EXPLAIN (ANALYZE, BUFFERS) is the gold standard
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM events WHERE tenant_id = 42 AND occurred_at >= '2026-05-01';

-- Look for:
--   Index Scan / Index Only Scan / Bitmap Index Scan → index is being used
--   Seq Scan → index NOT being used (or planner thinks it's cheaper)
--   "Rows Removed by Filter: N" → predicate ran in the heap, not the index
--   "Heap Fetches: N" → covering scan with N visibility-driven heap reads
  • Postgres: Index Only Scan is the goal. Bitmap Index Scan is good for medium-selective queries. Seq Scan on a small table is fine.
  • MySQL: EXPLAIN shows `key` = the index used (or NULL = none); `rows` = estimated rows examined; `Extra: Using index` = covering.
  • SQL Server: SET STATISTICS IO ON; the actual execution plan shows logical reads — covering index queries minimise these.
  • Common surprise: a query "should" use an index but does not. Causes: stale statistics, implicit type conversion (`varchar_col = 123`), function on the column (`WHERE LOWER(email) = ...` requires an expression index), low estimated selectivity.

Expression / functional indexes

When a query filters on a function of a column rather than the column itself, a regular B-tree on the column is useless. Build the index on the expression instead.

-- Case-insensitive email lookup
CREATE INDEX ix_users_lower_email ON users (LOWER(email));
-- Now: WHERE LOWER(email) = 'alice@example.com' uses the index.

-- JSON key extraction (Postgres)
CREATE INDEX ix_events_action ON events ((payload->>'action'));
-- Now: WHERE payload->>'action' = 'click' uses the index.

-- Date truncation
CREATE INDEX ix_events_day ON events (DATE_TRUNC('day', occurred_at));
-- Now: WHERE DATE_TRUNC('day', occurred_at) = '2026-05-21' uses the index.
  • Postgres: native CREATE INDEX ON t (expression).
  • MySQL 8.0+: functional indexes via CREATE INDEX ix ON t ((expr)).
  • SQL Server: persisted computed column + ordinary index over it.
  • Oracle: function-based indexes with QUERY_REWRITE_ENABLED.

GIN, JSONB and full-text indexes

-- Postgres GIN over JSONB: any-key, any-value search
CREATE INDEX ix_events_payload ON events USING GIN (payload jsonb_path_ops);
SELECT * FROM events WHERE payload @> '{"action": "purchase"}';

-- Postgres full-text: tsvector + GIN
ALTER TABLE articles ADD COLUMN search tsvector
  GENERATED ALWAYS AS (to_tsvector('english', title || ' ' || body)) STORED;
CREATE INDEX ix_articles_search ON articles USING GIN (search);
SELECT * FROM articles WHERE search @@ websearch_to_tsquery('english', 'sql indexing');

-- Trigram index: substring / fuzzy search
CREATE EXTENSION pg_trgm;
CREATE INDEX ix_users_name_trgm ON users USING GIN (full_name gin_trgm_ops);
SELECT * FROM users WHERE full_name ILIKE '%ander%';

Common index anti-patterns

  • One index per column. The planner usually combines a covering composite better than two single-column indexes via a Bitmap AND.
  • Indexing UUIDv4 primary keys on InnoDB. Random PK inserts fragment the clustered table, halving write throughput. Use UUIDv7 (timestamp-prefixed) or BIGINT identities.
  • Putting low-cardinality column first in a composite. Index on (status, customer_id) where status has 4 values: every range scan reads 25% of the index.
  • Adding an index to "fix slowness" without checking EXPLAIN. The slow query may be unrelated to indexing (lock contention, network, big result set).
  • Forgetting to ANALYZE after a large data shift. Bad statistics → bad plans even with the right indexes.
  • Ignoring write amplification. Every additional index on a write-heavy table multiplies the WAL/redo-log volume; on hot tables, fewer larger covering indexes beat many small narrow ones.

Practice: hands-on indexing lessons

Each topic above maps to a hands-on lesson in the Query Optimization & Performance module — runnable EXPLAIN walk-throughs, before/after indexing exercises, and adversarial cases where the index does not help.

Practice this in the editor

Frequently asked questions

What is a SQL index?

A SQL index is an auxiliary data structure (most commonly a B-tree) that the database maintains alongside a table to make it fast to find rows by particular column values. Without an index, a query that filters on a column must scan every row in the table; with an index, the engine descends a balanced tree in O(log N) page reads. Indexes speed up SELECT and UPDATE/DELETE that match by indexed columns, but they slow every INSERT and any UPDATE that touches indexed columns.

When should I add an index?

Add an index when (1) a query that filters or sorts by a column is too slow, (2) the column has reasonable cardinality (more than a handful of distinct values), and (3) the column is not rewritten by most UPDATEs. The strongest signal is EXPLAIN reporting Seq Scan on a large table for a frequent query: a B-tree on the WHERE column will usually fix it. Index foreign-key columns by default — they are referenced by every cascading delete and most join predicates.

What is the difference between a B-tree and a hash index?

A B-tree is a sorted, balanced tree that supports equality, range, prefix-match and ordered-output queries — the all-purpose default. A hash index hashes each key into a bucket and supports only equality lookups; it is slightly faster than B-tree for exact-match-only workloads but useless for ranges, ORDER BY or LIKE 'prefix%'. PostgreSQL hash indexes have been crash-safe since v10. MySQL InnoDB has the adaptive hash index baked in. In practice, almost every index you create will be a B-tree.

How do I choose the column order in a composite index?

Three rules in order: (1) put equality-filtered columns before range-filtered columns — the leftmost-prefix rule means the index is only useful as far as you can extend the WHERE clause one column at a time; (2) among equality columns, the most selective (highest cardinality) goes first; (3) if the query has an ORDER BY that survives the WHERE filter, place those columns last to enable an ordered index scan. An index on (tenant_id, created_at, status) supports queries that filter on tenant_id alone, on tenant_id + created_at, or on all three — but not status alone.

What is a covering index?

A covering index contains every column referenced by a query — both in the WHERE clause and the SELECT list — so the engine can answer the query entirely from the index without fetching from the heap. PostgreSQL and SQL Server expose this explicitly via INCLUDE columns: CREATE INDEX ... ON t (a, b) INCLUDE (c, d). MySQL achieves the same effect by appending the SELECT-list columns to the key. Covering indexes turn many queries from O(rows × heap-fetches) into O(rows in index) — the single biggest indexing speedup for read-heavy workloads.

What is the difference between a clustered and non-clustered index?

A clustered index defines the physical order of rows on disk — the leaf pages of the index ARE the table data. A non-clustered (secondary) index is a separate B-tree mapping key → row-locator. SQL Server lets you choose; the PRIMARY KEY is clustered by default. MySQL InnoDB always clusters on the primary key (every table is a clustered index). PostgreSQL is heap-organised — there are no clustered indexes, only heap + secondary indexes (CLUSTER reorders once but does not maintain). The clustered key is implicitly part of every secondary index, so a wide or random-UUID clustered key bloats every other index too.

When does an index actually slow things down?

Indexes hurt in three situations: (1) very small tables (under ~10K rows) where a Seq Scan beats any index lookup; (2) write-heavy workloads — every additional index multiplies WAL/redo-log volume and INSERT cost; (3) low-cardinality columns where the planner correctly decides the index would return too many rows and falls back to a Seq Scan anyway. The last one is the trickiest: the index sits there consuming write cost on every INSERT but is never used for reads. Audit pg_stat_user_indexes.idx_scan periodically and drop indexes that have zero scans over a representative window.

What is a partial / filtered index?

A partial index (PostgreSQL) or filtered index (SQL Server) is a B-tree built only over rows that match a WHERE predicate in the index definition. The classic use case is a soft-delete column where 99% of rows are deleted: indexing only the 1% that are active gives you an index 100× smaller and a query that targets the active set without the planner having to filter a huge index. Other natural fits: open vs closed orders, recent vs historical events, NOT NULL filters where most rows are NULL. MySQL has no native syntax; emulate with a generated boolean column and an ordinary index.