Database Indexing: Efficient Data Lookup

Understanding Database Indexing

The primary goal of a database system is to store data efficiently and provide quick access to it. Instead of relying on filesystem hierarchies, database systems use specialized file organization for three key reasons:

Storage efficiency: Minimize storage overhead per record
Access efficiency: Locate records in the smallest number of steps
Update efficiency: Minimize disk changes when updating records

How Data is Organized

Database systems store data as records (rows with multiple fields) in tables. Each record can be looked up using a search key. To do this efficiently, databases use indexes: a data structure technique that improves the speed of data retrieval O(log N) or even O(1) by minimizing disk accesses without scanning an entire table on every access.

Index Data Structure: B-Tree (Balanced Tree)

The most common index structure is a B-Tree, the default index type (always on primary keys by default). It maintains a sorted tree structure where the “Root” points to “Branches,” which point to “Leaves” (where the actual data pointers are).

Best For: Exact matches (=) and range queries (>, <, BETWEEN).

Complexity: O(log N).

Why Range Queries Work: If you want IDs > 7, the database finds 20, then just reads the “next” pointers (11, 13, 17, …) sequentially without traversing the tree again.

Used by: Most relational databases (MySQL, PostgreSQL, Oracle) as the default index structure.

Two Data Storage Approaches

1. Index-Organized Storage (Clustered Index)

In this approach, data lives inside the index itself. Leaf nodes of the B+ tree store full rows, so no separate heap file exists.

B+ Tree Structure:
  leaf: key → full row

Query execution: Traverse the index and the row is already there (single lookup).

Advantages: Fast primary-key lookups, excellent range queries, good cache locality.
Used by: MySQL InnoDB (primary key), Oracle IOT, SQL Server clustered index.

2. Heap-Organized Table + Separate Index (Most Common)

Rows are stored unsorted in a heap, and the index is a separate structure that points to them.

Index (B+ Tree):    key → (page, slot)
Heap Data File:     page → record

Query execution: (1) Search the index, (2) Follow the pointer, (3) Read the data page (two lookups).

Advantages: Fast inserts, fast updates, flexible storage.
Disadvantages: Extra I/O overhead, poor range query locality.
Used by: PostgreSQL, SQL Server (non-clustered indexes).

When to Use Each Approach

No single strategy is optimal for all workloads:

Workload	Best Choice
Heavy inserts	Heap
Frequent updates	Heap
Range queries	Clustered
Read-heavy OLTP	Clustered
Mixed workload	Heap + indexes

Index Storage Approaches

Visual comparison of index-organized storage (left) where data is stored within the index structure itself, versus heap-organized storage with a separate index file pointing to a data file (right).

Secondary Indexes: When the Primary Index Isn’t Enough

Even if data is stored in a clustered index, queries often search on non-clustered columns. For example:

SELECT * FROM users WHERE email = 'a@gmail.com';

If the data is clustered (sorted) by user_id, the clustered index is useless for this query. The database needs a secondary index on the email column.

Without a Secondary Index

The database must perform a Clustered Index Scan (or table scan): start at the first leaf page, read every row, check each email value, and stop only at the end. This is O(N) complexity—very slow.

With a Secondary Index

A secondary index on email stores:

email → user_id (or reference to row)

Query execution:

Search the secondary index (B+ tree) for 'a@gmail.com' → get user_id = 3
Use the primary/clustered index on user_id to fetch the row

This Index Seek + Key Lookup is approximately O(log N)—much faster.

Key insight: Only ONE index can define physical storage (the clustered index). All other indexes are secondary indexes that must somehow reference the data.

Two Approaches for Secondary Index Implementation

Secondary indexes can reference data in two fundamentally different ways, each with different performance trade-offs:

Approach A: Direct References (RID-based)

Secondary indexes store direct pointers to data:

Secondary index: key → (PageID, SlotID)

Lookup path: Secondary Index → Data Page (direct jump, single lookup).

Advantages: Faster reads (fewer lookups), fewer tree traversals.

Disadvantages: If rows move (due to page splits, compaction, or vacuum operations), ALL secondary indexes must be updated. This becomes very expensive with many indexes.

Used by: PostgreSQL, older database designs, systems optimized for read performance.

Non-Clustered Index with Direct References

How a non-clustered B+ tree index uses direct data pointers (RowIDs) to locate rows in the heap file. Each leaf node contains index keys and direct pointers to the corresponding row locations, enabling single-step data access after finding the key.

Approach B: Indirection via Primary Index (PK-based)

Secondary indexes store references to the primary key:

Secondary index: secondary_key → primary_key
Primary index:   primary_key → actual row

Lookup path: Secondary Index → Primary Index → Data (two lookups).

Why this indirection? When rows move (due to page splits, reorganization, or background maintenance), only the primary index is updated—secondary indexes stay untouched. This massively reduces write amplification in write-heavy systems.

Used by: MySQL InnoDB (the most common modern approach).

Secondary Index with Primary Key Indirection

How secondary indexes reference the primary index instead of data directly. Secondary indexes point to primary key values, which then point to actual data. This reduces write amplification when rows move.

The Trade-off (The Key Insight)

Strategy	Read Cost	Write Cost
Direct pointer	Low	High
PK indirection	Higher	Lower

Read-heavy workload? Direct pointers are acceptable.
Write-heavy workload or many indexes? PK indirection wins by avoiding expensive updates to every index.

Concrete MySQL InnoDB Example

SELECT * FROM users WHERE email = 'a@gmail.com';

InnoDB execution (using PK indirection):

Search secondary index on email → find 'a@gmail.com' → get user_id = 3
Search the primary (clustered) index on user_id
Fetch the row

This demonstrates how secondary indexes point to primary keys, not directly to data. The indirection adds a lookup but greatly reduces write overhead when rows move.

Alternative Index Structures

While B-Trees are the default, different scenarios require different index structures for specialized use cases:

Hash Index

Concept: Uses a mathematical formula (hash function) to calculate a specific memory slot (bucket) for a key.

Best For: Exact lookups only (WHERE name = ‘Alice’).

Limitation: No range queries. The physical storage order is random.

Example:

Query: WHERE name = 'Alice'
Hash Function: Hash('Alice') = Bucket 003

Key        Hash Calculation    Storage Bucket
'Alice'    func('Alice')       Bucket 003
'Bob'      func('Bob')         Bucket 052
'Charlie'  func('Charlie')     Bucket 001

Memory Layout:
[Bucket 001] → Charlie's Data
[Bucket 002] → (Empty)
[Bucket 003] → Alice's Data  ← Direct jump here! (O(1))
...
[Bucket 052] → Bob's Data

Why Ranges Fail: Notice that Alice (A) is in Bucket 3, but Bob (B) is in Bucket 52. They are not adjacent, so the database cannot scan for “Names from A to B.”

Used by: MySQL MEMORY table engine, some NoSQL databases for exact lookups.

C. Bitmap Index

Concept: Uses bit arrays (0s and 1s) for columns with low cardinality (few unique values).

Best For: Gender, Marital Status, Is_Active, or any column with limited distinct values.

Pros: Ultra-fast bitwise operations (AND, OR, NOT) with minimal memory overhead.

Example:

Table Data (4 Rows):
1. Male, Single
2. Female, Married
3. Female, Single
4. Male, Married

Bitmap Index Representation:
Row ID    Gender='Male'    Gender='Female'    Status='Single'    Status='Married'
1         1                0                  1                  0
2         0                1                  0                  1
3         0                1                  1                  0
4         1                0                  0                  1

Finding "Female AND Single":
CPU compares bits:
0 1 1 0 (Female Vector)
1 0 1 0 (Single Vector)
AND -------
0 0 1 0 (Result: Only Row 3 matches)

Performance: Bitmap indexes excel in data warehouse and OLAP scenarios with low-cardinality columns.

Index Types Comparison: When to Use Each

Index Type	Exact Match	Range Queries	Cardinality	Speed	Use Case
B-Tree	✓	✓	Any	O(log N)	General purpose, default choice
Hash	✓✓ Fast	✗	Any	O(1)	Exact lookups, in-memory tables
Bitmap	✓	Limited	Low	Very Fast	Data warehouses, low-cardinality columns