Skip to main content
Bytes & Beyond

Fundamentals

Understanding the architecture, components, and storage mechanisms that power modern databases.

What is a Database?

A database is software that imposes structure and rules on raw, messy disk storage so data can be accessed efficiently and reliably.

Databases are modular systems consisting of multiple software components:

  1. Transport layer: Accepts requests
  2. Query processor: Determines the most efficient way to run queries
  3. Execution engine: Carries out the operations
  4. Storage engine: Manages data persistence

Client/Server Model

  • Database instances (servers): Long-running processes (postgres, mysqld, mongod) that include:
    • CPU threads
    • Memory (buffer cache)
    • Open files (data files, WAL, indexes)
    • Network listeners (ports like 5432, 3306)
  • Application instances (clients): Request data from the database (Node.js APIs, mobile apps, etc.)

Architectural Layers of Database instances

DBMS Architecture Layers

The layered architecture of a database management system, showing how queries flow from transport through query processor, execution engine, to storage engine.

Transport / Communication Subsystem

  • Handles connection protocol (e.g., TCP/IP, Unix Sockets)
  • Manages handshake and receives the raw SQL string
  • Receives queries from clients
  • Hands off to the query processor
  • Performs access control after parsing, ensuring security based on the interpreted request

Query Processor

  1. Query Parser:
    • Syntax analysis: Checks grammar (e.g., missing semicolons, misspelled SELECT)
    • Semantic analysis: Checks validity (e.g., does the table exist? Does the column belong to it?)
    • Authorization check: Checks permissions (e.g., does the user have READ access?)
  2. Query Optimizer:
    • Statistics analysis: Examines data distribution and index information
    • Plan generation: Decides the most efficient execution strategy
      • Should we use an index or scan the table?
      • In what order should we join tables?
    • Redundancy elimination: Removes impossible and redundant query parts
    • Optimization techniques:
      • Index ordering selection
      • Cardinality estimation
      • Access method selection
      • Cost-based decision making (considers transfer costs across cluster nodes)
    • Output: Physical execution plan (sequence of operations)

Execution Engine

  • Receives the physical execution plan from the Query Optimizer

  • Executes operators in the plan:

    • Table scan: Sequentially read all tuples from a table
    • Index scan: Use an index to find specific tuples
    • Join: Combine data from multiple tables (Nested Loop / Hash / Merge)
    • Filter: Apply WHERE conditions
    • Projection: Select specific columns
    • Aggregation: Compute GROUP BY, SUM, COUNT, etc.
  • Controls execution flow:

    • Tuple-by-tuple execution: Process one row at a time (traditional iterator model)
    • Batch-based execution: Process multiple rows together for better performance
  • Requests data from the Storage & Transaction Subsystem

  • Coordinates with:

    • Transaction manager: Ensures proper isolation levels
    • Lock manager: Manages concurrent access to data
  • Streams result tuples upward to the communication subsystem

Storage & Transaction Subsystem

  • Stores, retrieves, and manages data in memory and on disk
  • Acts as the persistent, long-term memory of each node
  • Provides a simple API for creating, updating, deleting, and retrieving records
  • DBMSs build on storage engines to add features like schemas, query languages, indexing, and transactions
  1. Transaction/Lock Managers:
    • Ensure ACID properties (Atomicity, Consistency, Isolation, Durability)
    • Handle concurrency so two people don't overwrite the same data simultaneously
  2. Buffer Manager:
    • Keeps frequently accessed data in RAM for performance, reducing slow disk reads
  3. Access Methods:
    • Code that knows how to read specific file formats (like B-Trees or Heap files)
  4. Recovery Manager:
    • Maintains logs and restores system state

Query Flow

Client
   |
   | SQL Query
   v
Transport / Communication Subsystem
   |
   v
Query Parser
   |  - Syntax analysis
   |  - Semantic analysis
   |  - Authorization check
   v
Query Optimizer
   |  - Statistics analysis
   |  - Cost-based planning
   |  - Physical plan generation
   v
Execution Engine (Execution Layer)
   |  - Receives physical plan
   |  - Executes operators (scan, join, filter, etc.)
   |  - Coordinates with transaction/lock managers
   |  - Requests data from the Storage & Transaction Subsystem
   | 
   v
Storage & Transaction Subsystem
   |  - Transaction Manager
   |  - Lock Manager
   |  - Access Methods
   |  - Buffer Manager
   |  - Recovery Manager
   |
   | (data pages/tuples)
   v
Execution Engine
   |  - Aggregates results
   |  - Streams result tuples upward
   v
Transport Subsystem
   |
   v
Client