Fundamentals

What is a Database?

A database is software that imposes structure and rules on raw, messy disk storage so data can be accessed efficiently and reliably.

Databases are modular systems consisting of multiple software components:

Transport layer: Accepts requests
Query processor: Determines the most efficient way to run queries
Execution engine: Carries out the operations
Storage engine: Manages data persistence

Client/Server Model

Database instances (servers): Long-running processes (postgres, mysqld, mongod) that include:
- CPU threads
- Memory (buffer cache)
- Open files (data files, WAL, indexes)
- Network listeners (ports like 5432, 3306)
Application instances (clients): Request data from the database (Node.js APIs, mobile apps, etc.)

Architectural Layers of Database instances

DBMS Architecture Layers

The layered architecture of a database management system, showing how queries flow from transport through query processor, execution engine, to storage engine.

Transport / Communication Subsystem

Handles connection protocol (e.g., TCP/IP, Unix Sockets)
Manages handshake and receives the raw SQL string
Receives queries from clients
Hands off to the query processor
Performs access control after parsing, ensuring security based on the interpreted request

Query Processor

Query Parser:
- Syntax analysis: Checks grammar (e.g., missing semicolons, misspelled SELECT)
- Semantic analysis: Checks validity (e.g., does the table exist? Does the column belong to it?)
- Authorization check: Checks permissions (e.g., does the user have READ access?)
Query Optimizer:
- Statistics analysis: Examines data distribution and index information
- Plan generation: Decides the most efficient execution strategy
  - Should we use an index or scan the table?
  - In what order should we join tables?
- Redundancy elimination: Removes impossible and redundant query parts
- Optimization techniques:
  - Index ordering selection
  - Cardinality estimation
  - Access method selection
  - Cost-based decision making (considers transfer costs across cluster nodes)
- Output: Physical execution plan (sequence of operations)

Execution Engine

Receives the physical execution plan from the Query Optimizer
Executes operators in the plan:
- Table scan: Sequentially read all tuples from a table
- Index scan: Use an index to find specific tuples
- Join: Combine data from multiple tables (Nested Loop / Hash / Merge)
- Filter: Apply WHERE conditions
- Projection: Select specific columns
- Aggregation: Compute GROUP BY, SUM, COUNT, etc.
Controls execution flow:
- Tuple-by-tuple execution: Process one row at a time (traditional iterator model)
- Batch-based execution: Process multiple rows together for better performance
Requests data from the Storage & Transaction Subsystem
Coordinates with:
- Transaction manager: Ensures proper isolation levels
- Lock manager: Manages concurrent access to data
Streams result tuples upward to the communication subsystem

Storage & Transaction Subsystem

Stores, retrieves, and manages data in memory and on disk
Acts as the persistent, long-term memory of each node
Provides a simple API for creating, updating, deleting, and retrieving records
DBMSs build on storage engines to add features like schemas, query languages, indexing, and transactions

Transaction/Lock Managers:
- Ensure ACID properties (Atomicity, Consistency, Isolation, Durability)
- Handle concurrency so two people don't overwrite the same data simultaneously
Buffer Manager:
- Keeps frequently accessed data in RAM for performance, reducing slow disk reads
Access Methods:
- Code that knows how to read specific file formats (like B-Trees or Heap files)
Recovery Manager:
- Maintains logs and restores system state

Query Flow

Client
   |
   | SQL Query
   v
Transport / Communication Subsystem
   |
   v
Query Parser
   |  - Syntax analysis
   |  - Semantic analysis
   |  - Authorization check
   v
Query Optimizer
   |  - Statistics analysis
   |  - Cost-based planning
   |  - Physical plan generation
   v
Execution Engine (Execution Layer)
   |  - Receives physical plan
   |  - Executes operators (scan, join, filter, etc.)
   |  - Coordinates with transaction/lock managers
   |  - Requests data from the Storage & Transaction Subsystem
   | 
   v
Storage & Transaction Subsystem
   |  - Transaction Manager
   |  - Lock Manager
   |  - Access Methods
   |  - Buffer Manager
   |  - Recovery Manager
   |
   | (data pages/tuples)
   v
Execution Engine
   |  - Aggregates results
   |  - Streams result tuples upward
   v
Transport Subsystem
   |
   v
Client