PostgreSQL - Complete Guide

PostgreSQL is an open-source relational database. It stores data in tables with rows and columns, enforces strict types and constraints, and lets you query that data with SQL. It has been in active development since 1986 and is one of the most feature-complete databases available.

Why PostgreSQL Over the Alternatives

SQLite

SQLite is a file-based database. Great for local development and mobile apps. Not designed for multiple concurrent writers or network access. Not suitable for production web servers.

MySQL / MariaDB

MySQL is fast for simple read-heavy workloads. PostgreSQL beats it on:

Standards compliance (PostgreSQL follows SQL standards closely)
Advanced data types (JSONB, arrays, ranges, custom types)
Complex queries (CTEs, window functions, lateral joins)
MVCC implementation (MySQL’s is less robust under heavy concurrency)

NoSQL (MongoDB, etc.)

NoSQL gives you flexibility: no schema, nested documents. The tradeoff is you lose joins, transactions, and referential integrity. Most apps that start with NoSQL for “flexibility” end up reimplementing what relational databases give you for free.

PostgreSQL supports JSONB columns, so you can store schemaless data inside a relational database when you genuinely need it.

Installation

macOS

brew install postgresql@16
brew services start postgresql@16

Ubuntu / Debian

sudo apt install postgresql postgresql-contrib
sudo systemctl start postgresql
sudo systemctl enable postgresql

Docker (recommended for dev)

docker run --name pg-dev \
  -e POSTGRES_USER=myuser \
  -e POSTGRES_PASSWORD=mypassword \
  -e POSTGRES_DB=mydb \
  -p 5432:5432 \
  -d postgres:16

Connect

# Connect as postgres superuser
psql -U postgres

# Connect to specific DB
psql -U myuser -d mydb -h localhost

Core Concepts

Databases, Schemas, Tables

PostgreSQL Server
└── Database (mydb)
    └── Schema (public)       <- default schema
        ├── Table (users)
        ├── Table (posts)
        └── Table (comments)

A single PostgreSQL server can host many databases. Each database has schemas (namespaces). The default schema is public. Tables live inside schemas.

Creating a Database

CREATE DATABASE mydb;
\c mydb   -- connect to it in psql

Data Types

Common Types

Type	Use case
`INTEGER` / `INT`	Whole numbers
`BIGINT`	Large whole numbers (IDs at scale)
`SERIAL`	Auto-incrementing integer (shorthand)
`BIGSERIAL`	Auto-incrementing bigint
`NUMERIC(p, s)`	Exact decimal (money, measurements)
`FLOAT` / `REAL`	Approximate decimal (scientific data)
`VARCHAR(n)`	Variable-length string, max n chars
`TEXT`	Unlimited-length string
`BOOLEAN`	`true` / `false`
`DATE`	Date only (2025-01-15)
`TIMESTAMP`	Date + time, no timezone
`TIMESTAMPTZ`	Date + time + timezone (store everything as this)
`UUID`	128-bit unique identifier
`JSONB`	Binary JSON (indexed, queryable)
`JSON`	Raw JSON text (no indexing)
`ARRAY`	Array of any type

Type Notes

Use TIMESTAMPTZ over TIMESTAMP. It stores UTC internally and converts to local time on read. TIMESTAMP has no timezone info, which causes bugs.

Use TEXT over VARCHAR(255). PostgreSQL stores them identically. VARCHAR without a limit is the same as TEXT. The limit only makes sense if you genuinely need to enforce a max length.

Use NUMERIC for money, never FLOAT. Floats have rounding errors:

SELECT 0.1 + 0.2;          -- 0.30000000000000004 (FLOAT)
SELECT 0.1::NUMERIC + 0.2; -- 0.3 (NUMERIC)

Tables

Creating Tables

CREATE TABLE users (
  id         BIGSERIAL PRIMARY KEY,
  email      TEXT NOT NULL UNIQUE,
  name       TEXT,
  role       TEXT NOT NULL DEFAULT 'user',
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE posts (
  id          BIGSERIAL PRIMARY KEY,
  title       TEXT NOT NULL,
  content     TEXT,
  published   BOOLEAN NOT NULL DEFAULT FALSE,
  author_id   BIGINT NOT NULL REFERENCES users(id) ON DELETE CASCADE,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

Constraints

-- NOT NULL: column must have a value
email TEXT NOT NULL

-- UNIQUE: no two rows can have the same value
email TEXT UNIQUE

-- PRIMARY KEY: NOT NULL + UNIQUE, identifies each row
id BIGSERIAL PRIMARY KEY

-- FOREIGN KEY: must reference an existing row in another table
author_id BIGINT REFERENCES users(id)

-- CHECK: custom condition must be true
age INT CHECK (age >= 0)
price NUMERIC CHECK (price > 0)

-- Table-level constraints (more readable for multi-column)
CONSTRAINT unique_user_email UNIQUE (email)
CONSTRAINT valid_price CHECK (price > 0 AND price < 100000)

ON DELETE behavior for foreign keys

-- CASCADE: delete posts when user is deleted
author_id BIGINT REFERENCES users(id) ON DELETE CASCADE

-- SET NULL: set author_id to NULL when user is deleted
author_id BIGINT REFERENCES users(id) ON DELETE SET NULL

-- RESTRICT: prevent deleting user if they have posts (default)
author_id BIGINT REFERENCES users(id) ON DELETE RESTRICT

Modifying Tables

-- Add column
ALTER TABLE users ADD COLUMN avatar TEXT;

-- Drop column
ALTER TABLE users DROP COLUMN avatar;

-- Rename column
ALTER TABLE users RENAME COLUMN name TO full_name;

-- Change type
ALTER TABLE users ALTER COLUMN age TYPE BIGINT;

-- Add constraint
ALTER TABLE users ADD CONSTRAINT check_age CHECK (age >= 0);

-- Drop table
DROP TABLE users;
DROP TABLE IF EXISTS users;  -- no error if doesn't exist

CRUD Operations

INSERT

-- Single row
INSERT INTO users (email, name) VALUES ('alice@example.com', 'Alice');

-- Multiple rows
INSERT INTO users (email, name) VALUES
  ('bob@example.com', 'Bob'),
  ('carol@example.com', 'Carol');

-- Return the created row (very useful)
INSERT INTO users (email, name)
VALUES ('dave@example.com', 'Dave')
RETURNING id, email, created_at;

-- Upsert (insert or update on conflict)
INSERT INTO users (email, name)
VALUES ('alice@example.com', 'Alice Updated')
ON CONFLICT (email)
DO UPDATE SET name = EXCLUDED.name, updated_at = NOW();

-- Ignore duplicates
INSERT INTO users (email, name)
VALUES ('alice@example.com', 'Alice')
ON CONFLICT (email) DO NOTHING;

SELECT

-- All columns
SELECT * FROM users;

-- Specific columns
SELECT id, email, name FROM users;

-- With alias
SELECT id, email AS user_email FROM users;

-- Filter
SELECT * FROM users WHERE role = 'admin';
SELECT * FROM users WHERE created_at > '2025-01-01';
SELECT * FROM users WHERE name IS NULL;
SELECT * FROM users WHERE name IS NOT NULL;
SELECT * FROM users WHERE id IN (1, 2, 3);
SELECT * FROM users WHERE email LIKE '%@example.com';
SELECT * FROM users WHERE email ILIKE '%example%'; -- case-insensitive

-- Sort
SELECT * FROM users ORDER BY created_at DESC;
SELECT * FROM users ORDER BY name ASC, created_at DESC;

-- Limit and offset (pagination)
SELECT * FROM users ORDER BY id LIMIT 10 OFFSET 20;

-- Distinct
SELECT DISTINCT role FROM users;

UPDATE

-- Update one field
UPDATE users SET name = 'Alice Smith' WHERE id = 1;

-- Update multiple fields
UPDATE users
SET name = 'Alice Smith', updated_at = NOW()
WHERE id = 1;

-- Update with condition based on another table
UPDATE posts
SET published = TRUE
WHERE author_id IN (SELECT id FROM users WHERE role = 'admin');

-- Return updated rows
UPDATE users SET name = 'Alice Smith' WHERE id = 1
RETURNING id, name, updated_at;

DELETE

DELETE FROM users WHERE id = 1;
DELETE FROM posts WHERE published = FALSE;

-- Return deleted rows
DELETE FROM users WHERE id = 1 RETURNING *;

-- Delete all rows (faster: TRUNCATE)
TRUNCATE TABLE users;
TRUNCATE TABLE users RESTART IDENTITY; -- also resets SERIAL counter

Joins

-- INNER JOIN: rows that match in both tables
SELECT u.name, p.title
FROM users u
INNER JOIN posts p ON p.author_id = u.id;

-- LEFT JOIN: all users, including those with no posts
SELECT u.name, p.title
FROM users u
LEFT JOIN posts p ON p.author_id = u.id;

-- Multiple joins
SELECT u.name, p.title, c.content AS comment
FROM users u
JOIN posts p ON p.author_id = u.id
JOIN comments c ON c.post_id = p.id
WHERE p.published = TRUE;

-- Self join (e.g., employees and their managers)
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;

Aggregations

-- COUNT, SUM, AVG, MIN, MAX
SELECT COUNT(*) FROM users;
SELECT COUNT(*) FROM users WHERE role = 'admin';
SELECT AVG(price) FROM products;
SELECT SUM(amount) FROM orders WHERE status = 'paid';

-- GROUP BY
SELECT role, COUNT(*) AS total
FROM users
GROUP BY role;

-- HAVING (filter after grouping, like WHERE but for aggregates)
SELECT author_id, COUNT(*) AS post_count
FROM posts
GROUP BY author_id
HAVING COUNT(*) > 5;

-- GROUP BY with JOIN
SELECT u.name, COUNT(p.id) AS post_count
FROM users u
LEFT JOIN posts p ON p.author_id = u.id
GROUP BY u.id, u.name
ORDER BY post_count DESC;

Indexes

Indexes speed up reads by letting PostgreSQL find rows without scanning the whole table. The tradeoff is slower writes and extra disk space.

Creating Indexes

-- Basic index (B-tree, the default)
CREATE INDEX idx_posts_author_id ON posts(author_id);

-- Unique index (also enforces uniqueness)
CREATE UNIQUE INDEX idx_users_email ON users(email);

-- Composite index (useful when filtering by multiple columns together)
CREATE INDEX idx_posts_author_published ON posts(author_id, published);

-- Partial index (index only a subset of rows)
CREATE INDEX idx_posts_published ON posts(created_at) WHERE published = TRUE;

-- Index on expression
CREATE INDEX idx_users_lower_email ON users(LOWER(email));

When to Add Indexes

Add indexes on:

Foreign key columns (author_id, user_id, etc.) - PostgreSQL does NOT auto-index these
Columns you filter on frequently (WHERE status = ..., WHERE email = ...)
Columns you sort on frequently (ORDER BY created_at)
Columns used in joins

Do not index:

Small tables (full scan is faster)
Columns with very low cardinality (e.g., a boolean column - not worth it usually)
Columns rarely used in queries

Checking Index Usage

-- See all indexes on a table
\d posts

-- Check if a query uses an index
EXPLAIN ANALYZE SELECT * FROM posts WHERE author_id = 1;

Look for Index Scan in the output. Seq Scan means no index was used.

Transactions

A transaction groups multiple SQL statements into one atomic unit. Either all succeed or all fail.

BEGIN;

UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;

COMMIT;   -- apply changes
-- or
ROLLBACK; -- undo all changes

ACID Properties

Atomicity: all statements in a transaction succeed or none do
Consistency: database stays in a valid state before and after
Isolation: concurrent transactions do not interfere with each other
Durability: committed data survives crashes

Savepoints

BEGIN;

INSERT INTO orders (user_id, total) VALUES (1, 99.99);
SAVEPOINT after_order;

INSERT INTO order_items (order_id, product_id) VALUES (1, 999);
-- this fails - product 999 doesn't exist

ROLLBACK TO after_order;  -- undo only the failed part

INSERT INTO order_items (order_id, product_id) VALUES (1, 5);
COMMIT;

Advanced Features

CTEs (Common Table Expressions)

CTEs let you break complex queries into named parts. Think of them as temporary named result sets.

-- Basic CTE
WITH active_users AS (
  SELECT id, name
  FROM users
  WHERE last_login > NOW() - INTERVAL '30 days'
)
SELECT * FROM active_users WHERE name ILIKE 'a%';

-- Multiple CTEs
WITH
  published_posts AS (
    SELECT author_id, COUNT(*) AS count
    FROM posts
    WHERE published = TRUE
    GROUP BY author_id
  ),
  prolific_authors AS (
    SELECT author_id FROM published_posts WHERE count > 10
  )
SELECT u.name, pp.count
FROM users u
JOIN published_posts pp ON pp.author_id = u.id
WHERE u.id IN (SELECT author_id FROM prolific_authors);

Window Functions

Window functions compute values across a set of rows related to the current row, without collapsing them into groups like GROUP BY does.

-- Row number within each group
SELECT
  name,
  department,
  salary,
  ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rank
FROM employees;

-- Running total
SELECT
  order_date,
  amount,
  SUM(amount) OVER (ORDER BY order_date) AS running_total
FROM orders;

-- Lag/lead (access previous/next row)
SELECT
  date,
  revenue,
  LAG(revenue) OVER (ORDER BY date) AS prev_revenue,
  revenue - LAG(revenue) OVER (ORDER BY date) AS change
FROM daily_revenue;

JSONB

JSONB stores JSON as binary and supports indexing and querying inside the document.

-- Store JSON
CREATE TABLE events (
  id   BIGSERIAL PRIMARY KEY,
  data JSONB NOT NULL
);

INSERT INTO events (data) VALUES
  ('{"type": "click", "user_id": 1, "element": "button"}'),
  ('{"type": "view", "user_id": 2, "page": "/home"}');

-- Access fields
SELECT data->>'type' AS event_type FROM events;
SELECT data->>'user_id' AS user_id FROM events WHERE data->>'type' = 'click';

-- Nested access
SELECT data->'meta'->>'source' FROM events;

-- Filter by JSON field
SELECT * FROM events WHERE data->>'type' = 'click';
SELECT * FROM events WHERE (data->>'user_id')::INT = 1;

-- Index a JSONB field
CREATE INDEX idx_events_type ON events((data->>'type'));

-- GIN index for all keys (slower writes, fast for any-key lookups)
CREATE INDEX idx_events_data ON events USING GIN(data);

Arrays

CREATE TABLE posts (
  id   BIGSERIAL PRIMARY KEY,
  tags TEXT[]
);

INSERT INTO posts (tags) VALUES ('{"javascript", "web", "tutorial"}');

-- Query array contains
SELECT * FROM posts WHERE 'javascript' = ANY(tags);

-- Array overlap
SELECT * FROM posts WHERE tags && ARRAY['javascript', 'python'];

-- Unnest array into rows
SELECT id, UNNEST(tags) AS tag FROM posts;

Full-Text Search

-- Basic text search
SELECT * FROM posts
WHERE to_tsvector('english', title || ' ' || content) @@ to_tsquery('english', 'postgresql & indexing');

-- With index (much faster)
ALTER TABLE posts ADD COLUMN search_vector TSVECTOR;

UPDATE posts SET search_vector =
  to_tsvector('english', title || ' ' || COALESCE(content, ''));

CREATE INDEX idx_posts_search ON posts USING GIN(search_vector);

SELECT title FROM posts WHERE search_vector @@ to_tsquery('english', 'postgresql');

How PostgreSQL Works Under the Hood

Storage: Pages and Tuples

PostgreSQL stores all data in 8KB pages on disk. Each table is a file of pages. Each row is called a tuple. Pages also store indexes, free space maps, and visibility maps.

Table file on disk:
[ Page 0 (8KB) ] [ Page 1 (8KB) ] [ Page 2 (8KB) ] ...
     |
     v
[ Header | Item pointers | ... free space ... | Tuples ]

MVCC (Multi-Version Concurrency Control)

MVCC is how PostgreSQL lets multiple transactions read and write concurrently without blocking each other.

When you UPDATE a row, PostgreSQL does not overwrite the old row. It:

Marks the old row as “deleted” (sets xmax to the current transaction ID)
Inserts a new row with the updated values (sets xmin to the current transaction ID)

Before UPDATE:
Row: { id: 1, name: "Alice", xmin: 100, xmax: 0 }

After UPDATE name to "Alice Smith" in transaction 200:
Row: { id: 1, name: "Alice",       xmin: 100, xmax: 200 }  <- old version
Row: { id: 1, name: "Alice Smith", xmin: 200, xmax: 0   }  <- new version

Readers see different versions depending on when their transaction started. A reader that started before transaction 200 still sees “Alice”. A reader that started after sees “Alice Smith”. No locks needed for reads.

This is why reads never block writes and writes never block reads in PostgreSQL.

Dead Tuples and VACUUM

Because old row versions stick around after updates and deletes, tables accumulate dead tuples. VACUUM cleans them up so disk space can be reused.

-- Manual vacuum (usually automatic)
VACUUM posts;

-- Vacuum + reclaim disk space (locks table briefly)
VACUUM FULL posts;

-- Check dead tuple count
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

autovacuum runs in the background automatically. Do not disable it.

WAL (Write-Ahead Log)

Before PostgreSQL writes any change to the actual data files, it writes a record of the change to the WAL (a sequential log file). If the server crashes mid-write, PostgreSQL replays the WAL on restart to recover to a consistent state.

WAL is also the mechanism behind replication. Replicas stream the WAL from the primary and apply the same changes locally.

Write path:
Client INSERT -> WAL entry written -> ACK to client
                                  -> Data page written (async)

The WAL write is fast (sequential). The data page write happens later. This is why PostgreSQL can be fast and durable at the same time.

Query Planner

When you run a query, PostgreSQL does not execute it directly. It:

Parses the SQL into a query tree
Rewrites the query (views, rules expansion)
Plans multiple possible execution strategies and estimates their cost
Executes the cheapest plan

-- See the query plan
EXPLAIN SELECT * FROM posts WHERE author_id = 1;

-- See plan + actual runtime stats
EXPLAIN ANALYZE SELECT * FROM posts WHERE author_id = 1;

The planner uses table statistics (row counts, column value distributions) to estimate costs. You can update statistics manually:

ANALYZE posts;

PostgreSQL runs ANALYZE automatically via autovacuum, but after large bulk inserts it can help to run it manually.

Connection Model

PostgreSQL uses a process-per-connection model. Each client connection spawns a separate OS process. This is heavier than thread-per-connection (like MySQL). The practical limit is around 100-300 connections before performance degrades from context switching and memory overhead.

For production, put a connection pooler (PgBouncer) in front of PostgreSQL. PgBouncer maintains a small pool of actual DB connections and multiplexes many application connections onto them.

Useful psql Commands

\l              -- list databases
\c mydb         -- connect to database
\dt             -- list tables
\d users        -- describe table (columns, indexes, constraints)
\di             -- list indexes
\du             -- list users/roles
\timing         -- toggle query timing
\e              -- open query in editor
\q              -- quit

-- Run a file
\i /path/to/file.sql

Best Practices

1. Always Use TIMESTAMPTZ

Store all timestamps with timezone. Always store in UTC, display in local time in the application layer.

2. Use BIGSERIAL for IDs, Not SERIAL

SERIAL is a 32-bit integer (max ~2 billion rows). BIGSERIAL is 64-bit. Start with BIGSERIAL and you will never hit the limit.

Or use UUIDs for distributed systems where IDs are generated across multiple servers:

CREATE EXTENSION IF NOT EXISTS "pgcrypto";

CREATE TABLE users (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  ...
);

3. Index Foreign Keys

PostgreSQL does not auto-create indexes on foreign key columns. Add them manually or every join and cascade operation does a full table scan.

CREATE INDEX idx_posts_author_id ON posts(author_id);
CREATE INDEX idx_comments_post_id ON comments(post_id);

4. Use RETURNING Instead of a Second Query

-- Instead of INSERT then SELECT
INSERT INTO users (email) VALUES ('x@x.com') RETURNING id;
UPDATE users SET name = 'Bob' WHERE id = 1 RETURNING *;

5. Use Connection Pooling in Production

Do not let your app connect to PostgreSQL directly with hundreds of connections. Use PgBouncer or a managed pooler (like Supabase’s PgBouncer or Neon’s connection pooling).

6. Avoid SELECT *

Fetch only what you need. SELECT * wastes bandwidth and breaks code when columns are added or reordered.

7. Use Transactions for Multi-Step Writes

Any operation that touches multiple tables should be wrapped in a transaction. Otherwise a crash halfway through leaves the database in a partially updated state.

8. EXPLAIN ANALYZE Slow Queries

When a query is slow, always check the plan first before adding indexes blindly.

EXPLAIN ANALYZE SELECT * FROM posts WHERE published = TRUE ORDER BY created_at DESC LIMIT 10;

Look for: Seq Scan on large tables, high actual rows vs estimated rows mismatch, sort operations without an index.

9. Avoid Long-Running Transactions

Long transactions hold locks and prevent VACUUM from cleaning dead tuples. Keep transactions short. Do not do external API calls inside a transaction.

10. Use CHECK Constraints for Data Integrity

Put invariants in the database, not just in application code:

ALTER TABLE products ADD CONSTRAINT price_positive CHECK (price > 0);
ALTER TABLE users ADD CONSTRAINT valid_email CHECK (email LIKE '%@%.%');