Masterclass

Databases on ZFS Masterclass

This guide goes deep on running production databases on ZFS — PostgreSQL, MySQL, Redis, MongoDB, and SQLite, each tuned to eliminate redundant I/O protection layers, exploit ZFS copy-on-write semantics, and integrate with the snapshot and replication workflows that make ZFS the best filesystem for databases in existence. If you have read the ZFS Masterclass and the Snapshots Guide, this is the natural next step: you understand ZFS; now make your databases run faster and safer on top of it.

What this page covers: recordsize tuning for every major database, elimination of redundant crash-protection features (doublewrite, full_page_writes, torn-page checksums), ARC vs buffer-pool cache strategies, atomic multi-dataset snapshots for pre-migration safety, zero-disk-cost clones for test environments, and syncoid-based database replication — all grounded in the kldload ZFS stack you already have running.

Prerequisites: a kldload system with ZFS on root and at least one data pool. Familiarity with the database you are running. Root access to configure both ZFS datasets and database configuration files.

1. ZFS Changes How Databases Work

Every database has crash protection features — WAL, doublewrite buffers, checksums, journal modes. These exist because traditional filesystems cannot guarantee atomic writes. A power failure mid-write on ext4 can leave a partially written block: the first half of a database page at new data, the second half at old data. This is a torn page, and it corrupts the database. So databases invented elaborate machinery to survive it: write the old data twice (doublewrite), write the new data to a sequential log first (WAL), verify blocks with checksums, pre-allocate files to avoid metadata updates during writes.

ZFS can guarantee atomic writes. Copy-on-write means ZFS never modifies data in place. It writes the new version to a new location, updates the pointer, and only then releases the old block. A power failure leaves either the old pointer or the new pointer intact — never a half-written block. There are no torn pages on ZFS. The machinery databases built to survive torn pages becomes redundant overhead.

Disabling that overhead reduces I/O by 30-50%, while getting better data integrity from ZFS's end-to-end checksums — which detect corruption at the hardware level, not just at the database page level. And you gain capabilities that databases cannot provide on their own: instant snapshots in milliseconds regardless of database size, zero-disk-cost clones for test environments, and incremental block-level replication to DR sites.

This is the page that database administrators need to read. The standard advice — "just use ext4, tune your buffer pool, set innodb_flush_method" — ignores everything ZFS brings. The recordsize tuning alone can double your IOPS on PostgreSQL. The full_page_writes=off change cuts WAL volume in half. The snapshot workflow eliminates the 4am pg_dump window. The clone workflow means your developers have a full copy of production for every test run, at zero additional disk cost. None of this is exotic. It is standard ZFS feature use, applied to databases.

2. The recordsize Story — The Single Biggest Performance Win

ZFS stores data in fixed-size blocks called records. The recordsize property controls the maximum size of these records. The default is 128K. For general file storage — media files, logs, backups — 128K is optimal: large sequential I/O, high throughput. For databases, it is catastrophic.

Every database has its own internal page size: the unit at which it reads and writes data. PostgreSQL uses 8K pages. MySQL InnoDB uses 16K pages. When a database writes an 8K page to a dataset with recordsize=128K, ZFS must read the existing 128K block into memory, modify the 8K portion, and write the full 128K block back out. That is 16x write amplification for every database page write. Your database thinks it is doing 8K writes. ZFS is doing 128K writes.

The fix is exact: set recordsize to match the database's page size. An 8K PostgreSQL write to a dataset with recordsize=8K is a clean 8K ZFS write. No amplification. No unnecessary reads. No wasted I/O.

Database	Engine page size	Optimal recordsize (data)	Optimal recordsize (WAL/logs)
PostgreSQL	8K	8K	64K
MySQL / MariaDB (InnoDB)	16K (default)	16K	64K
MongoDB (WiredTiger)	64K (internal block)	64K	64K
Redis (RDB)	N/A — dump file	64K	64K
SQLite	4K (default), 8K (recommended)	8K	8K
Generic OLAP / columnar	Large sequential scans	1M	128K

Setting recordsize on an existing dataset does not rewrite existing data — it applies to new writes only. To fully benefit, you must dump, recreate the dataset, and reload. For new installations, set it before first use:

# PostgreSQL data directory
zfs create -o recordsize=8k   rpool/data/postgres

# PostgreSQL WAL directory
zfs create -o recordsize=64k  rpool/data/postgres-wal

# MySQL data directory
zfs create -o recordsize=16k  rpool/data/mysql

# MongoDB data directory
zfs create -o recordsize=64k  rpool/data/mongodb

This one property change — recordsize=8k for PostgreSQL — can double your database IOPS overnight on a write-heavy workload. It is the most impactful single tuning parameter in the ZFS-database stack. If you do nothing else from this guide, do this. Measure before and after with zpool iostat -v 1 and watch write amplification disappear.

3. PostgreSQL on ZFS

Dataset Layout

Separate datasets for data and WAL. Different recordsize, different compression settings, different caching behavior — they should not share a dataset.

# Create pool datasets
zfs create rpool/data
zfs create -o recordsize=8k   \
           -o compression=lz4  \
           -o atime=off        \
           -o xattr=sa         \
           rpool/data/postgres

zfs create -o recordsize=64k  \
           -o compression=lz4  \
           -o atime=off        \
           rpool/data/postgres-wal

# Point PostgreSQL at them
mkdir -p /data/postgres /data/postgres-wal
chown postgres:postgres /data/postgres /data/postgres-wal
chmod 700 /data/postgres /data/postgres-wal

Initialize the cluster with WAL on its own dataset:

initdb -D /data/postgres --waldir=/data/postgres-wal

postgresql.conf — Tuned for ZFS

The key changes versus a default PostgreSQL install on ext4:

# ── Data integrity ────────────────────────────────────────────
# ZFS CoW means no torn pages. Full-page writes write the entire
# 8K page to WAL after a checkpoint to allow recovery from a
# partial write. On ZFS there are no partial writes. Turn it off.
full_page_writes = off

# ── WAL tuning ────────────────────────────────────────────────
# ZFS handles sparse files efficiently. Pre-allocating WAL
# segments buys nothing and creates unnecessary write I/O.
# (PostgreSQL 15+ uses wal_init_zero instead of wal_init_flags)
wal_init_zero = off

# WAL compression: ZFS already compresses WAL on the dataset.
# Enabling wal_compression as well is double work. Leave it off.
# wal_compression = off  (already the default)

# WAL segment size: default 16MB is fine. Larger reduces WAL
# directory churn on busy systems.
# wal_segment_size = 16MB  (set at initdb time, not here)

# ── Checksums ─────────────────────────────────────────────────
# data_checksums is set at initdb time (--data-checksums).
# ZFS already checksums every block end-to-end. Database
# checksums are redundant CPU and I/O overhead. However, keep
# them on if you want pg_verify_checksums / pg_basebackup
# integrity verification — the cost is low and the tooling
# compatibility is worth it.
# Recommended: keep data_checksums ON (set at initdb).

# ── Memory ────────────────────────────────────────────────────
# On a dedicated database server, set shared_buffers to 25% of
# RAM. The ZFS ARC gets the rest. Both cache — the question is
# who caches what. See section 11 for the full ARC discussion.
shared_buffers = 4GB           # 25% of 16GB server
effective_cache_size = 10GB    # hint to the query planner
work_mem = 64MB                # per sort / hash, can multiply

# ── I/O ───────────────────────────────────────────────────────
# ZFS manages its own I/O scheduler. Tell PostgreSQL the disk is
# fast (it is — ZFS ARC makes random reads near-instant).
random_page_cost = 1.1         # treat disk like SSD
effective_io_concurrency = 200 # for parallel index scans

# ── Autovacuum ────────────────────────────────────────────────
# More aggressive autovacuum prevents table bloat.
autovacuum_vacuum_scale_factor = 0.02
autovacuum_analyze_scale_factor = 0.01

# ── Logging ───────────────────────────────────────────────────
log_min_duration_statement = 1000   # log queries over 1 second
log_checkpoints = on
log_temp_files = 0

Snapshot Before Schema Migrations

The workflow every DBA wishes they had on ext4: an instant, consistent snapshot before running a risky migration, with a one-command rollback if it goes wrong.

# 1. Take a consistent snapshot (checkpoint first for clean state)
psql -c "CHECKPOINT;"
zfs snapshot rpool/data/postgres@pre-migration-2026-04-02
zfs snapshot rpool/data/postgres-wal@pre-migration-2026-04-02

# 2. Run the migration
psql -f migrate_schema_v42.sql

# 3a. Migration succeeded — destroy the snapshot (optional, saves space)
zfs destroy rpool/data/postgres@pre-migration-2026-04-02
zfs destroy rpool/data/postgres-wal@pre-migration-2026-04-02

# 3b. Migration failed — roll back instantly
systemctl stop postgresql
zfs rollback rpool/data/postgres@pre-migration-2026-04-02
zfs rollback rpool/data/postgres-wal@pre-migration-2026-04-02
systemctl start postgresql

Zero-Cost Clone for Testing

# Clone production data for a test instance
zfs clone rpool/data/postgres@pre-migration-2026-04-02 \
          rpool/data/postgres-test
zfs clone rpool/data/postgres-wal@pre-migration-2026-04-02 \
          rpool/data/postgres-wal-test

# Start a test PostgreSQL instance on port 5433
pg_ctl -D /data/postgres-test \
       -o "-p 5433" \
       -l /var/log/postgres-test.log start

# Test your migration against real production data
psql -p 5433 -f migrate_schema_v42.sql

# Destroy the clone when done — reclaims space immediately
pg_ctl -D /data/postgres-test stop
zfs destroy rpool/data/postgres-wal-test
zfs destroy rpool/data/postgres-test

Replication with syncoid

# Send the PostgreSQL datasets to a DR host
syncoid --recursive rpool/data/postgres dr-host:rpool/replica/postgres
syncoid --recursive rpool/data/postgres-wal dr-host:rpool/replica/postgres-wal

PostgreSQL on ZFS with proper tuning outperforms PostgreSQL on ext4 by 30-50% on write-heavy workloads in every benchmark I have run. full_page_writes=off alone cuts WAL volume in half — that is half the WAL I/O, half the WAL disk space, less WAL archiving time. recordsize=8k eliminates write amplification. Together they transform a write-heavy OLTP workload. The snapshot workflow is the part that changes operations: a 200GB production database snapshots in under a second. A pg_dump of the same database takes 15 minutes and blocks vacuum. There is no comparison.

4. MySQL / MariaDB on ZFS

Dataset Layout

# InnoDB data directory — 16K matches InnoDB's default page size
zfs create -o recordsize=16k  \
           -o compression=lz4  \
           -o atime=off        \
           rpool/data/mysql

# InnoDB redo logs / binary logs — sequential, larger blocks better
zfs create -o recordsize=64k  \
           -o compression=lz4  \
           -o atime=off        \
           rpool/data/mysql-logs

mkdir -p /data/mysql /data/mysql-logs
chown mysql:mysql /data/mysql /data/mysql-logs

my.cnf — Tuned for ZFS

[mysqld]
# ── Data directories ─────────────────────────────────────────
datadir         = /data/mysql
log_bin         = /data/mysql-logs/mysql-bin
innodb_log_group_home_dir = /data/mysql-logs

# ── The critical ZFS setting ──────────────────────────────────
# InnoDB's doublewrite buffer exists to protect against torn
# pages on ext4 — it writes every page twice (once to a special
# area, once to its final location) so a power failure mid-write
# leaves the original intact. ZFS writes are atomic. There are no
# torn pages. The doublewrite buffer is pure overhead: disable it.
innodb_doublewrite = OFF

# ── Checksums ─────────────────────────────────────────────────
# Keep InnoDB's CRC32 checksums. They are cheap (hardware CRC32
# instruction), and they catch in-memory corruption that ZFS
# end-to-end checksums cannot see (data corrupt before it leaves
# the CPU). ZFS catches disk/controller corruption. InnoDB CRC32
# catches RAM/software corruption. Use both.
innodb_checksum_algorithm = crc32

# ── I/O method ────────────────────────────────────────────────
# O_DIRECT bypasses the Linux page cache so InnoDB talks directly
# to the block layer. The ZFS ARC handles caching. Without this,
# data passes through both page cache and ARC — double buffering,
# wasted RAM.
innodb_flush_method = O_DIRECT

# ── Buffer pool ───────────────────────────────────────────────
# 50-70% of RAM for a dedicated MySQL server. The rest goes to
# ZFS ARC. On a 16GB server:
innodb_buffer_pool_size = 8G
innodb_buffer_pool_instances = 8   # one per GB, up to 64

# ── Log size ─────────────────────────────────────────────────
# Larger redo logs = fewer checkpoints = less write I/O.
innodb_redo_log_capacity = 4G      # MySQL 8.0.30+

# ── Flush behavior ────────────────────────────────────────────
# 2 = fsync log, O_DSYNC data. Safe on ZFS. If you want maximum
# throughput at the cost of up to 1 second of data on crash, use 0.
innodb_flush_log_at_trx_commit = 1

# ── Concurrency ───────────────────────────────────────────────
innodb_io_capacity     = 2000
innodb_io_capacity_max = 4000

MySQL's doublewrite buffer is a 2x write tax on every InnoDB page write. It exists entirely to survive the filesystem scenario that ZFS eliminates at the architecture level. Disabling it with innodb_doublewrite=OFF on ZFS is not a risk — it is the correct configuration. On a write-heavy MySQL workload you will see the difference immediately in iostat: half the write I/O to the data directory.

5. Redis on ZFS

Redis is memory-first. Every key lives in RAM. Disk is persistence only — RDB snapshots and the append-only file (AOF). ZFS's value here is not throughput (Redis is in-memory), it is persistence reliability and operational simplicity.

Dataset Layout

# RDB dump files — large sequential writes, no need for small records
zfs create -o recordsize=64k  \
           -o compression=off  \
           -o atime=off        \
           rpool/data/redis-rdb

# AOF — append-only, sequential writes
zfs create -o recordsize=64k  \
           -o compression=lz4  \
           -o atime=off        \
           rpool/data/redis-aof

Why compression=off on RDB? Redis compresses its RDB dump with LZF by default. The data arriving at ZFS is already compressed. ZFS compression on already-compressed data wastes CPU for near-zero benefit. The AOF is plain text commands — lz4 compresses it well.

redis.conf — Tuned for ZFS

# RDB snapshots
dir /data/redis-rdb
dbfilename dump.rdb

# Save policy: snapshot every 60 seconds if 1000 keys changed,
# every 300 seconds if 10 keys changed
save 60 1000
save 300 10

# AOF
appendonly yes
appendfilename appendonly.aof
appenddirname /data/redis-aof

# Fsync policy:
#   always  — fsync after every write. Safest, slowest.
#   everysec — fsync every second. Good balance.
#   no      — let the OS/ZFS decide. Fastest, relies on ZFS CoW.
# On ZFS, 'everysec' is the right default. ZFS CoW means an
# unsynced write that crashes will leave the previous consistent
# state intact — you lose at most 1 second of data, not a corrupt
# file.
appendfsync everysec
no-appendfsync-on-rewrite yes

# RDB compression (LZF)
rdbcompression yes

# ARC considerations: Redis data is already in RAM. ZFS ARC for
# Redis datasets is mostly metadata. Prevent ARC from evicting
# other datasets' hot pages to cache Redis data:
# (set per-dataset in ZFS, not in redis.conf)
# zfs set primarycache=metadata rpool/data/redis-rdb
# zfs set primarycache=metadata rpool/data/redis-aof

# Tell ZFS not to cache Redis data (it's in Redis RAM already)
zfs set primarycache=metadata rpool/data/redis-rdb
zfs set primarycache=metadata rpool/data/redis-aof

Snapshot-Based Redis Backup

# Trigger a background save to get a fresh RDB
redis-cli BGSAVE
# Wait for it to complete
redis-cli LASTSAVE

# Snapshot the RDB dataset — atomic, instantaneous
zfs snapshot rpool/data/redis-rdb@$(date +%Y%m%d-%H%M)

# Replicate to DR
syncoid rpool/data/redis-rdb dr-host:rpool/replica/redis-rdb

Redis on ZFS is less about performance and more about persistence reliability. A ZFS snapshot of the RDB directory is a byte-for-byte atomic backup — no need to BGSAVE and copy the file, no race condition between the snapshot and an ongoing save. The real win is operations: Redis AOF replay after a crash is fast, but restoring from an RDB snapshot followed by AOF replay is faster. ZFS gives you both, instantly, with no extra tooling.

6. MongoDB on ZFS

MongoDB's WiredTiger storage engine uses 64K internal pages by default (configurable). The dataset recordsize should match.

Dataset Layout

# WiredTiger data — 64K pages
zfs create -o recordsize=64k  \
           -o compression=off  \
           -o atime=off        \
           rpool/data/mongodb

# Journal (write-ahead log equivalent)
zfs create -o recordsize=64k  \
           -o compression=lz4  \
           -o atime=off        \
           rpool/data/mongodb-journal

Why compression=off on MongoDB data? WiredTiger uses snappy compression by default. Running ZFS lz4 compression on top of snappy-compressed data wastes CPU for essentially no space savings — compressed data has near-random byte distribution. Either let WiredTiger compress (and set ZFS compression=off), or disable WiredTiger compression and let ZFS lz4 handle it. Do not run both.

mongod.conf — Tuned for ZFS

storage:
  dbPath: /data/mongodb
  journal:
    enabled: true
    commitIntervalMs: 100
  directoryPerDB: true      # separate subdirectory per database
                             # enables per-database ZFS datasets
  wiredTiger:
    engineConfig:
      cacheSizeGB: 8         # 50% of RAM on a dedicated server
      journalCompressor: snappy
    collectionConfig:
      blockCompressor: none  # Let ZFS handle compression instead
                             # OR: keep snappy and set zfs compression=off
    indexConfig:
      prefixCompression: true

With directoryPerDB: true, each MongoDB database gets its own subdirectory. You can create per-database ZFS datasets for finer-grained snapshot control:

# After creating databases
zfs create -o recordsize=64k -o compression=off rpool/data/mongodb/appdb
zfs create -o recordsize=64k -o compression=off rpool/data/mongodb/analyticsdb

# Snapshot just the application database before a risky operation
zfs snapshot rpool/data/mongodb/appdb@pre-index-rebuild

MongoDB's WiredTiger engine has its own compression. If both WiredTiger and ZFS compress, you waste CPU for near-zero benefit on the second pass — compressed data does not compress again. Pick one. My recommendation: use WiredTiger snappy compression (it is optimized for WiredTiger's access patterns) and set ZFS compression=off on the MongoDB dataset. The journal dataset can still use lz4 because journal data is commands, not compressed documents.

7. SQLite on ZFS

SQLite is the most widely deployed database in the world, running inside every application that needs embedded state storage: Plex, Grafana, Home Assistant, Nextcloud, Gitea, and hundreds more. Most administrators never think about tuning the filesystem for SQLite — and they are leaving reliability and performance on the table.

Dataset Setup

# SQLite uses 4K pages by default. 8K is the recommended page size
# for SQLite databases larger than a few MB. Use 8K ZFS recordsize.
zfs create -o recordsize=8k   \
           -o compression=lz4  \
           -o atime=off        \
           rpool/data/plex-db   # or whatever application

Application-Level Tuning

Configure SQLite to use WAL mode (where you can control the application's PRAGMA settings). WAL mode is safer on any filesystem and plays better with ZFS CoW:

-- Enable WAL mode for concurrent reads and crash safety
PRAGMA journal_mode=WAL;

-- Increase page size to match ZFS recordsize (set before any tables exist)
PRAGMA page_size=8192;

-- Tell SQLite the filesystem is reliable (ZFS CoW)
-- PRAGMA synchronous=NORMAL reduces fsync calls; ZFS handles atomicity
PRAGMA synchronous=NORMAL;

-- Keep SQLite's own page cache small — let ZFS ARC handle caching
PRAGMA cache_size=-8192;    -- 8MB SQLite cache

Snapshot-Based SQLite Backup

SQLite databases can be corrupted by a naive file copy while the database is in use. ZFS snapshots solve this: a snapshot is crash-consistent (equivalent to the state after a clean shutdown), which SQLite's WAL recovery handles correctly on next open.

# Snapshot the application database — crash-consistent
zfs snapshot rpool/data/plex-db@before-upgrade

# Roll back if the upgrade corrupts the database
zfs rollback rpool/data/plex-db@before-upgrade

SQLite on ZFS is underappreciated. Every application that uses SQLite — Plex, Grafana, Home Assistant, Nextcloud, Gitea — benefits from ZFS's atomic writes and snapshots without any application configuration changes. The crash-consistency guarantee means a power failure leaves the database in a recoverable state. The snapshot workflow means you can upgrade, break something, and roll back in seconds. No backup script, no dump, no restore process — just zfs rollback.

8. Snapshot Workflows for Databases

ZFS snapshots are the most powerful operational tool you have for database management. They are instantaneous (milliseconds regardless of database size), space-efficient (only store changed blocks), and consistent (atomic across multiple datasets).

Pre-migration snapshot

Take a snapshot before every schema migration, data migration, or major upgrade. If the operation fails or produces unexpected results, roll back instantly. No restore window. No restore anxiety.

// pg_dump of 100GB: 15+ minutes // zfs snapshot: < 1 second // zfs rollback: < 5 seconds

Clone for testing

Clone a snapshot to create a full copy of production data at zero disk cost. Run migrations, load tests, or schema experiments against real data. Destroy the clone when done.

// cp -r /data/postgres /data/postgres-test: 30 minutes, 100GB used // zfs clone: 2 seconds, 0 bytes used (until divergence)

Point-in-time recovery

Keep hourly snapshots. If someone drops a table at 2pm, roll back to the 1pm snapshot and replay WAL from there to 1:59pm. Faster than restoring from backup tape. No data loss beyond the WAL replay point.

// Backup restore: hours. WAL replay: minutes. // ZFS rollback + WAL replay: the fastest PITR path.

Atomic multi-dataset snapshot

PostgreSQL data and WAL are on separate datasets. Snapshot both atomically with a single command so the snapshot pair is always consistent.

// zfs snapshot -r rpool/data/postgres@tag // snapshots all child datasets atomically

Atomic Multi-Dataset Snapshot

# Atomic snapshot of data + WAL — both at the same transaction boundary
# PostgreSQL: checkpoint first for cleanest state
psql -c "CHECKPOINT;"
zfs snapshot rpool/data/postgres@$(date +%Y%m%d-%H%M)
zfs snapshot rpool/data/postgres-wal@$(date +%Y%m%d-%H%M)

# Or use -r for recursive snapshot of a parent dataset
# (requires data and wal to be children of the same parent)
zfs create rpool/data/pgcluster
zfs create -o recordsize=8k  rpool/data/pgcluster/data
zfs create -o recordsize=64k rpool/data/pgcluster/wal

# Now snapshot both atomically:
zfs snapshot -r rpool/data/pgcluster@$(date +%Y%m%d-%H%M)

Automated Hourly Snapshots with sanoid

# /etc/sanoid/sanoid.conf
[rpool/data/pgcluster]
  use_template = databases
  recursive = yes

[template_databases]
  hourly = 24
  daily = 30
  monthly = 6
  autosnap = yes
  autoprune = yes

ZFS snapshots replace traditional database backup tools for many use cases. pg_dump takes minutes to hours on large databases and locks activity. A ZFS snapshot takes milliseconds and does not block the database. The snapshot is crash-consistent — exactly equivalent to the state after a clean shutdown, which every database's WAL recovery handles. For point-in-time recovery to an arbitrary transaction, you still need WAL archiving (pg_archivemode, MySQL binlog). ZFS gives you the base snapshot; WAL gives you the fine-grained replay. Together they provide a complete PITR story with minimal operational complexity.

9. Clone Workflows for Development

On ext4, copying a 500GB production database for a test environment takes 30 minutes and consumes 500GB of additional disk. Developers avoid it. Tests run against stale data. Production issues surface in production because nobody ran the migration against real data first. ZFS clones eliminate this constraint entirely.

The Clone Lifecycle

# Step 1: Snapshot production (after a CHECKPOINT for consistency)
psql -p 5432 -c "CHECKPOINT;"
zfs snapshot rpool/data/pgcluster@for-testing-$(date +%Y%m%d)

# Step 2: Clone for testing — instantaneous, zero disk until divergence
zfs clone rpool/data/pgcluster/data@for-testing-$(date +%Y%m%d) \
          rpool/data/pgcluster-test/data
zfs clone rpool/data/pgcluster/wal@for-testing-$(date +%Y%m%d)  \
          rpool/data/pgcluster-test/wal

# Step 3: Start a test PostgreSQL instance on port 5433
pg_ctl -D /data/pgcluster-test/data \
       -o "-p 5433 -c hba_file=/etc/postgresql/pg_hba_test.conf" \
       -l /var/log/postgres-test.log start

# Step 4: Run the migration against real production data
psql -p 5433 -f migrate_schema_v43.sql

# Step 5: Verify
psql -p 5433 -c "\d application_users"

# Step 6a: Migration looks good — apply to production, destroy test clone
pg_ctl -D /data/pgcluster-test/data stop
zfs destroy -r rpool/data/pgcluster-test
zfs destroy rpool/data/pgcluster@for-testing-$(date +%Y%m%d)

# Step 6b: Migration failed — destroy clone, fix the migration, try again
pg_ctl -D /data/pgcluster-test/data stop
zfs destroy -r rpool/data/pgcluster-test
# Edit migrate_schema_v43.sql, repeat from Step 2

Parallel Test Environments

# Create 5 parallel test environments from the same snapshot
# for concurrent integration tests
SNAP=rpool/data/pgcluster@for-testing-$(date +%Y%m%d)

for i in 1 2 3 4 5; do
  zfs clone ${SNAP}/data rpool/data/pgtest-${i}/data
  zfs clone ${SNAP}/wal  rpool/data/pgtest-${i}/wal
  pg_ctl -D /data/pgtest-${i}/data \
         -o "-p $((5433 + i))" \
         -l /var/log/postgres-test-${i}.log start
done

# Run test suites in parallel, each against its own clone
# Each clone uses zero disk until its test writes diverge

CI Pipeline Clone Pattern

#!/bin/bash
# ci-db-setup.sh — called at start of each CI run
# Requires: production snapshot taken daily by sanoid

SNAP=$(zfs list -t snapshot -o name -s creation rpool/data/pgcluster/data \
       | grep "@daily-" | tail -1)
JOB_ID=${CI_JOB_ID:-$$}

zfs clone ${SNAP} rpool/data/ci-${JOB_ID}/data
# CI test suite runs against /data/ci-${JOB_ID}/data

# ci-db-teardown.sh — called at end of each CI run
zfs destroy -r rpool/data/ci-${JOB_ID}

On ext4, copying a 500GB production database for testing takes 30 minutes and 500GB of disk. On ZFS, cloning takes 2 seconds and uses zero additional disk until the test writes diverge from the snapshot. You can run 10 parallel test environments from one production snapshot. The CI pattern above gives every test job its own isolated production-data database with no shared state, no flaky tests from database pollution, and no 30-minute setup time. This is what changes the culture around testing with real data.

10. Replication for DR

syncoid replicates ZFS datasets incrementally, block by block. It does not know or care what is inside the dataset — PostgreSQL pages, MySQL tablespaces, Redis RDB dumps. The mechanism is identical for all of them. No database-specific replication configuration needed.

syncoid-Based Database Replication

# Initial full replication (slow — sends everything)
syncoid --recursive rpool/data/pgcluster dr-host:rpool/replica/pgcluster

# Subsequent runs are incremental — only changed blocks since last sync
syncoid --recursive rpool/data/pgcluster dr-host:rpool/replica/pgcluster

Replication Over WireGuard

syncoid uses SSH. If your DR site is connected via a WireGuard tunnel (which kldload sets up automatically), syncoid traffic flows through the encrypted tunnel with no additional configuration:

# WireGuard provides the encrypted channel
# syncoid sees a normal SSH target on the WireGuard IP
syncoid --recursive \
        --sshoption="-i /etc/wireguard/syncoid_ed25519" \
        rpool/data/pgcluster \
        10.100.0.2:rpool/replica/pgcluster

systemd Timer for Hourly Replication

# /etc/systemd/system/db-replicate.service
[Unit]
Description=Replicate database datasets to DR site
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
User=root
ExecStart=/usr/sbin/syncoid --recursive --no-privilege-elevation \
  rpool/data/pgcluster dr-host:rpool/replica/pgcluster
ExecStart=/usr/sbin/syncoid --recursive --no-privilege-elevation \
  rpool/data/mysql dr-host:rpool/replica/mysql
ExecStart=/usr/sbin/syncoid --recursive --no-privilege-elevation \
  rpool/data/redis-rdb dr-host:rpool/replica/redis-rdb
TimeoutStartSec=3600

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/db-replicate.timer
[Unit]
Description=Hourly database replication to DR site

[Timer]
OnCalendar=hourly
RandomizedDelaySec=120
Persistent=true

[Install]
WantedBy=timers.target

systemctl enable --now db-replicate.timer

# Check status
systemctl status db-replicate.timer
journalctl -u db-replicate.service -n 50

RPO (Recovery Point Objective) is your replication interval — one hour in this configuration. RTO (Recovery Time Objective) is the time to receive the dataset on the DR host and start the database — typically under 5 minutes for PostgreSQL or MySQL, seconds for Redis.

syncoid does not know or care what is in the dataset. A PostgreSQL dataset, a MySQL dataset, a Redis dataset — the same command, the same mechanism, the same incremental block-level transfer. No database-specific replication configuration. No logical replication slots to manage. No binlog position tracking. ZFS tracks what changed since the last snapshot; syncoid sends only those changed blocks. For a 200GB PostgreSQL database with 2GB of daily writes, the hourly replication job sends roughly 80MB per run. Over WireGuard, that is negligible.

11. ARC Tuning for Database Workloads

ZFS maintains an Adaptive Replacement Cache (ARC) in RAM. Every database also maintains its own in-process buffer pool (PostgreSQL shared_buffers, MySQL innodb_buffer_pool_size). When both are active on the same server, data can be cached in two places simultaneously — wasting RAM. The question is who should cache, and what.

Strategy A: Database caches, ZFS does metadata

Set primarycache=metadata on database datasets. The database buffer pool caches all data pages. ZFS ARC caches only ZFS metadata (directory entries, block pointers). No double-caching. Best when the database buffer pool is large and well-tuned.

// zfs set primarycache=metadata rpool/data/postgres // shared_buffers = 12GB (large — database owns the cache)

Strategy B: ARC caches everything

Keep primarycache=all (default), keep database buffer pool small. ZFS ARC caches both data and metadata using its ARC algorithm, which adapts between recency and frequency. Best for mixed workloads or multiple databases sharing one host.

// primarycache=all (default) // shared_buffers = 1GB (small — ARC owns the cache) // ARC adapts across all databases dynamically

Strategy C: Hybrid — ARC for reads, database for hot writes

Moderate buffer pool (25% of RAM) + ARC gets the rest. The buffer pool absorbs write churn and hot OLTP pages. The ARC caches cold reads and metadata. Best for OLTP with mixed read/write.

// shared_buffers = 4GB (25% of 16GB) // ARC = ~10GB (rest of RAM after OS + buffers) // effective_cache_size = 14GB (planner hint)

Multiple databases on one host

If you run PostgreSQL + MySQL + Redis on one server, primarycache=metadata per dataset plus a large ARC is usually better. The ARC adapts across all workloads; individual buffer pools cannot coordinate.

// Each DB: primarycache=metadata // ARC: manages the shared pool adaptively // No individual DB can monopolize the cache

Setting ARC Size

# Check current ARC size
arc_summary | grep -E "^(ARC size|Maximum)"
# or
cat /proc/spl/kstat/zfs/arcstats | grep -E "^(c |size)"

# Set ARC maximum (persistent across reboots)
# Example: cap ARC at 10GB on a 16GB server running PostgreSQL
echo "options zfs zfs_arc_max=10737418240" \
  > /etc/modprobe.d/zfs.conf

# Apply without reboot
echo 10737418240 > /sys/module/zfs/parameters/zfs_arc_max

Per-Dataset Cache Control

# Strategy A: database owns the cache
zfs set primarycache=metadata rpool/data/postgres
zfs set primarycache=metadata rpool/data/postgres-wal

# Strategy B: ARC owns the cache (default — nothing to set)
# zfs set primarycache=all rpool/data/postgres  # already the default

# Redis: data is in Redis RAM, not on disk — ARC caching disk
# blocks is wasteful. Always use metadata-only for Redis.
zfs set primarycache=metadata rpool/data/redis-rdb
zfs set primarycache=metadata rpool/data/redis-aof

The conventional wisdom is "set primarycache=metadata for databases." This is often right but not always. If your database's buffer pool is well-tuned and large, let it cache. The database knows its access patterns better than ZFS ARC does — it knows which pages are hot, which are read-ahead, which are from temporary sorts. But if you are running multiple databases on one server, a large ARC with primarycache=metadata per dataset is usually better, because the ARC algorithm (ARC adapts between recency and frequency across all datasets) outperforms multiple independent fixed-size buffer pools competing for the same physical RAM without coordination.

12. Connection Pooling and ZFS

Connection pooling (PgBouncer for PostgreSQL, ProxySQL for MySQL) sits between applications and the database, multiplexing many application connections onto few database connections. This is a database concern, not a ZFS concern — but it has ZFS implications.

PgBouncer on PostgreSQL

PgBouncer reduces the number of active PostgreSQL backends, which reduces shared memory pressure and context switching. Fewer, larger transactions mean larger write batches to ZFS — better alignment with ZFS's CoW model. Transaction-mode pooling is most aggressive; session-mode is safest with prepared statements.

// 1000 app connections → 20 PostgreSQL backends // 20 backends write in batched transactions // ZFS sees larger, more efficient writes

ProxySQL on MySQL

ProxySQL provides query routing, caching, and connection multiplexing. Query caching in ProxySQL reduces read I/O entirely — queries answered from ProxySQL's cache never reach MySQL or ZFS. This leaves ARC headroom for write path and metadata.

// ProxySQL query cache hit: 0 I/O // ProxySQL miss → MySQL → ZFS ARC → disk // Cache hit rate is your best I/O reduction

sync=standard vs sync=disabled

ZFS sync=standard (default): ZFS honors fsync() calls. Database durability guarantees work as expected. sync=disabled: ZFS ignores fsync, batches writes, massive throughput gain, but you can lose up to the last transaction group on power failure. Only for dev/test or when you accept that data loss.

// sync=standard: safe for production databases // sync=disabled: dev/test benchmarks only // Never sync=disabled on production database datasets

Connection pooling is a database concern, not a ZFS concern. But it affects ZFS: fewer, larger writes are better for ZFS's CoW model than many tiny writes. 1000 applications each making 1-row inserts every 10ms creates 100 writes per second of tiny transactions. PgBouncer collapses those into 20 backend connections that batch commits — ZFS sees fewer, more efficient write operations. The interaction is indirect but real: profile your write I/O before and after enabling connection pooling and you will see the difference in zpool iostat.

13. Monitoring Database Performance on ZFS

ZFS I/O Monitoring

# Overall pool I/O — watch for high write amplification
# (writes >> what the database thinks it is writing)
zpool iostat -v 1

# Per-dataset I/O (requires ZFS >= 2.1)
zfs iostat -v 1

# ARC statistics — cache hit rate should be > 95% for database workloads
arc_summary

# Watch ARC hit rate live
watch -n1 'cat /proc/spl/kstat/zfs/arcstats | \
  awk "/^hits /{h=\$3} /^misses /{m=\$3} END{printf \"ARC hit rate: %.1f%%\n\", h*100/(h+m)}"'

Database-Specific Metrics

# PostgreSQL: queries causing most I/O
SELECT query, calls, total_exec_time, blk_read_time, blk_write_time
FROM pg_stat_statements
ORDER BY blk_read_time + blk_write_time DESC
LIMIT 10;

# PostgreSQL: buffer hit rate (should be > 99% with good shared_buffers)
SELECT
  sum(heap_blks_hit) /
  (sum(heap_blks_hit) + sum(heap_blks_read) + 0.001) AS hit_rate
FROM pg_statio_user_tables;

# MySQL: InnoDB buffer pool hit rate
SHOW ENGINE INNODB STATUS\G
# Look for: Buffer pool hit rate N/1000

# MySQL: check doublewrite is actually off
SHOW VARIABLES LIKE 'innodb_doublewrite';

eBPF: I/O Latency Distribution

# biolatency: histogram of block I/O latency
# Is the database slow because of disk, or is it the query plan?
biolatency -D 10 1

# bpftrace: trace PostgreSQL write syscalls and their latency
bpftrace -e '
  tracepoint:syscalls:sys_enter_pwrite64
  /comm == "postgres"/
  { @start[tid] = nsecs; }

  tracepoint:syscalls:sys_exit_pwrite64
  /comm == "postgres" && @start[tid]/
  {
    @latency_us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
  }
' --duration 10

Grafana Dashboard Combination

Combine ZFS metrics (via node_exporter's ZFS collector) with database metrics (postgres_exporter, mysqld_exporter) in Grafana. Key panels:

ARC hit rate over time — drops indicate working set exceeds RAM
Pool write amplification — iostat writes / database writes should be near 1.0
Database buffer hit rate — should be > 99%
pg_stat_statements top I/O queries — find the queries that actually touch disk
I/O latency percentiles (p50, p95, p99) from biolatency

14. Troubleshooting

Slow queries: database or disk?

Run biolatency -D 10 1 while the slow query runs. If I/O latency is low (sub-millisecond), the problem is the query plan, not the disk. If latency spikes, the working set is not in ARC — either the ARC is too small or the working set is larger than RAM.

// p99 < 1ms: disk is not the problem // p99 > 10ms: ARC miss rate too high // p99 > 100ms: disk contention or failing drive

ARC thrashing

Symptom: ARC hit rate drops from 99% to 70%, queries slow down. Cause: working set no longer fits in RAM — either the database grew, or another process consumed RAM. Fix: increase zfs_arc_max, add RAM, or set primarycache=metadata and increase the database buffer pool.

// arc_summary: watch ARC eviction rate // High eviction rate = ARC too small for working set

Write amplification

Symptom: zpool iostat shows far more bytes written than the database reports. Cause: wrong recordsize. A 128K recordsize with 8K database pages gives 16x amplification. Fix: check zfs get recordsize on the dataset and correct it. Requires a dump + recreate + reload to apply to existing data.

// zpool iostat: 800MB/s writes // pg_stat_bgwriter: 50MB/s checkpoints // ratio 16:1 = recordsize mismatch

ZFS checksum errors

Symptom: zpool status shows CKSUM errors. ZFS detected data corruption — either disk, controller, or RAM. The database is protected: ZFS will return an error rather than silently serving corrupt data. Fix: run a scrub, identify the failing device, replace it, restore from the most recent snapshot if necessary.

// zpool scrub rpool // zpool status -v rpool // ZFS never silently serves corrupt data

Diagnosis Runbook

# Step 1: Is it the disk?
biolatency -D 5 1
# If p99 < 2ms: disk is fine, look at the query plan
# If p99 > 10ms: ARC miss or disk contention

# Step 2: Is the ARC working?
arc_summary | grep -E "(hit rate|size)"
# Hit rate < 95%: ARC too small or primarycache=metadata set
# Check: zfs get primarycache rpool/data/postgres

# Step 3: Is there write amplification?
# Compare: zpool iostat (actual writes) vs database write stats
zpool iostat -v 5
# vs
psql -c "SELECT sum(buffers_checkpoint) * 8 / 1024 || 'MB' FROM pg_stat_bgwriter;"

# Step 4: Any checksum errors?
zpool status -v rpool
# CKSUM errors: hardware problem — scrub immediately
zpool scrub rpool

# Step 5: Snapshot health check
zfs list -t snapshot -o name,used,creation -s creation rpool/data/postgres
# No recent snapshots: sanoid/cron not running

ZFS Masterclass — pool design, ARC deep dive, recordsize mechanics
Snapshots Guide — sanoid configuration, snapshot lifecycle
ZFS Zero to Hero — foundational ZFS concepts
ZFS Tuning for Workloads — ARC, I/O scheduler, recordsize reference
Memory & ARC — ARC internals, sizing, eviction policy
Build: Databases on ZFS — step-by-step setup guide
Observability Intermediate — Grafana dashboards for ZFS + database metrics
Observability Masterclass — eBPF tracing, biolatency, bpftrace for I/O analysis

← Security (nftables) Observability →

Databases on ZFS Masterclass

1. ZFS Changes How Databases Work

2. The recordsize Story — The Single Biggest Performance Win

3. PostgreSQL on ZFS

Dataset Layout

postgresql.conf — Tuned for ZFS

Snapshot Before Schema Migrations

Zero-Cost Clone for Testing

Replication with syncoid

4. MySQL / MariaDB on ZFS

Dataset Layout

my.cnf — Tuned for ZFS

5. Redis on ZFS

Dataset Layout

redis.conf — Tuned for ZFS

Snapshot-Based Redis Backup

6. MongoDB on ZFS

Dataset Layout

mongod.conf — Tuned for ZFS

7. SQLite on ZFS

Dataset Setup

Application-Level Tuning

Snapshot-Based SQLite Backup

8. Snapshot Workflows for Databases

Pre-migration snapshot

Clone for testing

Point-in-time recovery

Atomic multi-dataset snapshot

Atomic Multi-Dataset Snapshot

Automated Hourly Snapshots with sanoid

9. Clone Workflows for Development

The Clone Lifecycle

Parallel Test Environments

CI Pipeline Clone Pattern

10. Replication for DR

syncoid-Based Database Replication

Replication Over WireGuard

systemd Timer for Hourly Replication

11. ARC Tuning for Database Workloads

Strategy A: Database caches, ZFS does metadata

Strategy B: ARC caches everything

Strategy C: Hybrid — ARC for reads, database for hot writes

Multiple databases on one host

Setting ARC Size

Per-Dataset Cache Control

12. Connection Pooling and ZFS

PgBouncer on PostgreSQL

ProxySQL on MySQL

sync=standard vs sync=disabled

13. Monitoring Database Performance on ZFS

ZFS I/O Monitoring

Database-Specific Metrics

eBPF: I/O Latency Distribution

Grafana Dashboard Combination

14. Troubleshooting

Slow queries: database or disk?

ARC thrashing

Write amplification

ZFS checksum errors

Diagnosis Runbook

Related Pages