Databases on ZFS — checksums, snapshots, and clones replace half your DBA toolchain.
Every database has a backup strategy, a corruption detection strategy, and a test-copy strategy. On ext4 or XFS, these are three separate systems with three separate failure modes. On ZFS, they are properties of the filesystem. Checksums catch silent corruption that no database can detect on its own. Snapshots are instant recovery points with near-zero I/O overhead — and, replicated off-box, real backups. Clones give you a full copy of production for testing in under a second. Compression saves 30-50% on database files without the database knowing or caring.
The catch: ZFS is copy-on-write, and databases have their own write-ahead mechanisms that assume in-place writes. You need to turn off the redundant safety features in the database and let ZFS handle them instead. The configs below do exactly that.
1. Why ZFS for databases
Checksums catch silent corruption
Traditional filesystems trust the disk. ZFS verifies every block on read. A single bit flip in a B-tree page can cascade into data loss that the database only discovers when a query returns wrong results — weeks later. ZFS catches it on the next read and auto-repairs from redundancy.
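Corruption is caught on read, but cold data may not be read for months. A periodic scrub forces ZFS to re-verify every block — a minimal sketch, assuming the pool is named rpool as in the rest of this page:

```shell
# Re-read and verify every block in the pool against its checksum
zpool scrub rpool

# CKSUM counters > 0 mean corrupted blocks were detected — and, with
# mirror/raidz redundancy, repaired from a good copy
zpool status -v rpool
```

Schedule it monthly from cron; scrubs run at low I/O priority and coexist with production traffic.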
Snapshots replace backup windows
A pg_basebackup on a 500GB database takes 20 minutes and hammers I/O.
A ZFS snapshot takes milliseconds and costs zero I/O. The database never notices.
You can snapshot every 15 minutes without impacting production.
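The zero-cost claim is easy to verify: a fresh snapshot references the dataset's existing blocks without copying anything, so its USED column starts near zero and grows only as production diverges from it. A sketch using the raw zfs commands:

```shell
# Snapshot, then check what it actually costs on disk
zfs snapshot rpool/srv/pgdata@demo
zfs list -t snapshot rpool/srv/pgdata@demo
# USED ~0 at creation; REFER = size of the data the snapshot pins
```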
Clones give you instant test copies
kclone /srv/pgdata /srv/pgdata-test creates a full copy of your production
database in 0.1 seconds. It shares all unchanged blocks with the original.
Run destructive queries, test migrations, benchmark schema changes — then destroy the clone.
Compression saves 30-50%
Database files are full of repeated structures, padding, and null bytes. zstd compression on ZFS typically achieves 1.3x-2x compression on database datasets. The compression happens transparently — the database reads and writes uncompressed data. ZFS handles the rest at the block layer.
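ZFS tracks the achieved ratio per dataset, so the 30-50% figure is checkable rather than taken on faith:

```shell
# compressratio covers all data written to the dataset;
# logicalused vs used shows the same thing as raw byte counts
zfs get compressratio,logicalused,used rpool/srv/pgdata

# Convert a ratio to space saved: saved = 1 - 1/ratio
awk -v r=1.5 'BEGIN{printf "%.0f%% saved\n", (1 - 1/r) * 100}'
```

The awk line is plain arithmetic — a 1.5x ratio means one third of the raw bytes never hit disk.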
Replication is disaster recovery
syncoid rpool/srv/pgdata root@dr-node:tank/srv/pgdata replicates the entire
database dataset — including all snapshots — to a remote node. Incrementals only send changed blocks.
If the primary dies, promote the replica and point the app at the new node.
CoW eliminates double-writes
Databases protect against torn pages with page-level double writes: PostgreSQL copies each page into the WAL the first time it changes after a checkpoint (full_page_writes), and InnoDB stages pages in a doublewrite buffer before updating the data file in place. But ZFS is copy-on-write — it never modifies data in place, so after a crash a block is either wholly old or wholly new, never a torn mix. Disabling the database's double-write protection cuts that write amplification in half.
2. PostgreSQL on ZFS
PostgreSQL is the best-documented database for ZFS tuning. The key insight: separate datasets for data, WAL, and temp. Each has different I/O patterns and different optimal recordsizes. Disable PostgreSQL's own checksums and full_page_writes — ZFS does both better.
Create PostgreSQL datasets
# Data directory — 16k recordsize matches PostgreSQL's 8k page size
# (16k allows some coalescing without wasting space on partial blocks).
# Leave primarycache at its default "all": PostgreSQL does buffered
# I/O, so the ARC serves as its second cache tier (see section 7).
zfs create -o recordsize=16k -o compression=zstd -o atime=off \
  -o logbias=throughput \
  -o mountpoint=/srv/pgdata rpool/srv/pgdata
# WAL (Write-Ahead Log) — sequential writes, larger records are fine
zfs create -o recordsize=128k -o compression=zstd -o atime=off \
-o logbias=latency \
-o mountpoint=/srv/pgwal rpool/srv/pgwal
# Temp tablespace — throwaway data, optimize for throughput
zfs create -o recordsize=128k -o compression=zstd -o atime=off \
-o mountpoint=/srv/pgtemp rpool/srv/pgtemp
# Set ownership
chown -R postgres:postgres /srv/pgdata /srv/pgwal /srv/pgtemp
PostgreSQL configuration for ZFS
# postgresql.conf — key ZFS-specific settings
# SKIP PostgreSQL data checksums — ZFS checksums every block already,
# so PG checksums add CPU overhead for redundant protection.
# (Checksums are an initdb-time choice: just don't pass
# initdb --data-checksums when creating the cluster.)
# DISABLE full_page_writes — ZFS is copy-on-write, so torn pages
# are impossible. This cuts WAL volume by ~30-50%.
full_page_writes = off
# WAL on a separate dataset — note there is no "wal_directory" setting
# in postgresql.conf; set it at cluster creation with
# initdb --waldir=/srv/pgwal, or stop the server and replace pg_wal
# with a symlink to /srv/pgwal
# Temp tablespace on separate dataset
temp_tablespaces = 'pgtemp'
# Shared buffers: 25% of RAM is the PostgreSQL recommendation.
# With ZFS ARC, you have two caches. Start with 25% for PG,
# let ARC take another 25-50%. Monitor and adjust.
shared_buffers = 4GB # for a 16GB RAM system
# Effective cache size: tell the planner about total cache (PG + ARC)
effective_cache_size = 12GB # shared_buffers + estimated ARC
# wal_compression only compresses full-page images, which
# full_page_writes=off already eliminates — leave it off
wal_compression = off # ZFS compresses the WAL dataset anyway
wal_buffers = 64MB
# Checkpoints — less frequent since ZFS handles consistency
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
max_wal_size = 4GB
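One step the config above implies but doesn't show: temp_tablespaces names a tablespace that has to exist before PostgreSQL can use it. A one-time setup sketch (the pgtemp name matches the setting above):

```shell
# Register the temp dataset as a tablespace, then reload the config
psql -U postgres -c "CREATE TABLESPACE pgtemp LOCATION '/srv/pgtemp';"
psql -U postgres -c "SELECT pg_reload_conf();"
```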
Connection pooling with PgBouncer
# PgBouncer sits between the app and PostgreSQL.
# It reuses database connections instead of opening new ones.
# 500 app connections become 20 database connections.
# /etc/pgbouncer/pgbouncer.ini
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction # best for most apps
max_client_conn = 500
default_pool_size = 20
reserve_pool_size = 5
reserve_pool_timeout = 3
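Once PgBouncer is up, the application connects to port 6432 instead of 5432 and nothing else changes. The admin console (assuming your user is listed in PgBouncer's admin_users) shows live pool state:

```shell
# App traffic goes through the pooler
psql -h 127.0.0.1 -p 6432 -U myuser mydb

# Admin console: one row per database/user pool, with active and
# waiting client counts
psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c "SHOW POOLS;"
```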
Point-in-time recovery via ZFS snapshots
# Traditional PITR: pg_basebackup + WAL archiving.
# Recovery = restore base backup + replay WAL. Slow.
# ZFS PITR: snapshot + rollback. Fast.
# Snapshot before any risky operation. Two back-to-back ksnap calls
# are not atomic with each other — for a crash-consistent pair, take
# a recursive snapshot of the common parent instead
# (zfs snapshot -r rpool/srv@pre-migration), or pause writes briefly.
ksnap /srv/pgdata && ksnap /srv/pgwal
# Something went wrong? Rollback.
systemctl stop postgresql
ksnap rollback /srv/pgdata # interactive — pick the snapshot
ksnap rollback /srv/pgwal
systemctl start postgresql
# Automated: sanoid snapshots every 15 minutes
# You can recover to any 15-minute boundary
# That's 96 recovery points per day, zero I/O cost
Clone production for testing in 0.1 seconds
# Need to test a migration against real data?
# Don't pg_dump + pg_restore (hours). Clone (milliseconds).
# Snapshot current production data
ksnap /srv/pgdata
# Clone it
kclone /srv/pgdata /srv/pgdata-staging
# Start a second PostgreSQL instance on the clone.
# Caveat: if pg_wal is a symlink to the separate /srv/pgwal dataset,
# clone that too and repoint the symlink inside the staging copy first.
pg_ctl -D /srv/pgdata-staging start -o "-p 5433"
# Run your migration against staging
psql -p 5433 -f migration.sql
# Test it, verify it, benchmark it.
# When done, destroy the clone — zero cost.
pg_ctl -D /srv/pgdata-staging stop
zfs destroy rpool/srv/pgdata-staging
3. MySQL / MariaDB on ZFS
MySQL's InnoDB engine has its own double-write buffer and checksum mechanism — both redundant on ZFS.
Turning them off cuts write amplification and frees CPU cycles. The key settings are
innodb_doublewrite=OFF and recordsize=16k to match InnoDB's page size.
Create MySQL datasets
# InnoDB data — 16k recordsize matches InnoDB's 16k page size exactly
zfs create -o recordsize=16k -o compression=zstd -o atime=off \
-o primarycache=metadata \
-o mountpoint=/srv/mysql rpool/srv/mysql
# Redo logs — sequential writes, larger records
zfs create -o recordsize=128k -o compression=zstd -o atime=off \
-o mountpoint=/srv/mysql-redo rpool/srv/mysql-redo
# Set ownership
chown -R mysql:mysql /srv/mysql /srv/mysql-redo
MySQL configuration for ZFS
# /etc/my.cnf.d/zfs.cnf
[mysqld]
# DISABLE double-write buffer — ZFS CoW makes torn pages impossible
innodb_doublewrite = OFF
# DISABLE InnoDB checksums — ZFS checksums every block
# (the "none" value is deprecated in MySQL 8.0; if your version
# rejects it, crc32 is nearly free)
innodb_checksum_algorithm = none
# Data and redo on separate datasets
datadir = /srv/mysql
innodb_log_group_home_dir = /srv/mysql-redo
# Buffer pool: 25-50% of RAM, same tradeoff as PG with ARC
innodb_buffer_pool_size = 4G # for a 16GB system
# Flush method: O_DIRECT bypasses OS cache (ZFS ARC is the cache)
innodb_flush_method = O_DIRECT
# Reduce checkpoint frequency
innodb_log_file_size = 1G
innodb_flush_log_at_trx_commit = 1 # keep ACID, ZFS handles the rest
# Let ZFS handle compression — don't compress in InnoDB
# (innodb_compression_default is MariaDB-specific; on Oracle MySQL,
# just avoid ROW_FORMAT=COMPRESSED tables)
innodb_compression_default = OFF
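It's worth confirming after a restart that the settings actually took — a deployment typo here silently reinstates the redundant double work:

```shell
mysql -e "SHOW VARIABLES WHERE Variable_name IN
  ('innodb_doublewrite','innodb_checksum_algorithm','innodb_flush_method');"
```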
4. Redis on ZFS
Redis is an in-memory database that persists to disk via RDB snapshots or AOF (append-only file). Both persistence mechanisms have tradeoffs: RDB can lose the last few minutes of data, AOF can grow huge and slow down recovery. ZFS can take over most of this: keep Redis persistence minimal, dump an RDB on demand, and let ZFS snapshots provide the point-in-time history and replication.
Create Redis dataset
# Redis writes small records — use 4k-8k recordsize
zfs create -o recordsize=8k -o compression=zstd -o atime=off \
-o mountpoint=/srv/redis rpool/srv/redis
chown -R redis:redis /srv/redis
Redis configuration for ZFS persistence
# /etc/redis/redis.conf
# OPTION 1: Disable automatic Redis persistence.
# Durability then depends on the scheduled BGSAVE + ZFS snapshot
# strategy shown further down — without it, a restart loses everything.
save ""
appendonly no
# Data directory (ZFS dataset)
dir /srv/redis
# OPTION 2: Keep RDB for Redis-level recovery, but rely on
# ZFS snapshots for point-in-time rollback.
# save 900 1
# save 300 10
# save 60 10000
# ZFS-based Redis persistence strategy:
# 1. Tell Redis to dump RDB on demand
redis-cli BGSAVE
# 2. Snapshot the dataset (captures the RDB file)
ksnap /srv/redis
# 3. Automate with sanoid — every 15 minutes
# [rpool/srv/redis]
# use_template = production
# Recovery: rollback + restart
systemctl stop redis
ksnap rollback /srv/redis
systemctl start redis
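BGSAVE forks and returns immediately, so a snapshot taken right after it may capture a half-written RDB file. A polling sketch that waits for the background save to finish before snapshotting (assumes redis-cli on the default port):

```shell
redis-cli BGSAVE
# INFO persistence reports rdb_bgsave_in_progress:1 while the forked
# child is still writing, 0 once the dump is on disk
while [ "$(redis-cli INFO persistence | tr -d '\r' \
           | awk -F: '/^rdb_bgsave_in_progress/{print $2}')" = "1" ]; do
  sleep 1
done
ksnap /srv/redis
```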
Instant Redis clone for testing
# Clone production Redis for testing.
# Note: BGSAVE is asynchronous — sleep 1 is a guess. Poll
# rdb_bgsave_in_progress in INFO persistence if freshness matters.
redis-cli BGSAVE && sleep 1
ksnap /srv/redis
kclone /srv/redis /srv/redis-test
# Start test instance on a different port
redis-server --dir /srv/redis-test --port 6380 --daemonize yes
# Run destructive tests against port 6380
redis-cli -p 6380 FLUSHALL # production is untouched
# Clean up
redis-cli -p 6380 SHUTDOWN
zfs destroy rpool/srv/redis-test
5. Backup strategy — sanoid + syncoid
Every database dataset gets the same treatment: sanoid snapshots locally, syncoid replicates to a backup node. Local snapshots protect against operator error (dropped a table). Remote replication protects against hardware failure (disk died). Together, they cover everything short of a simultaneous fire in both data centers.
Sanoid + syncoid for databases
# /etc/sanoid/sanoid.conf
[rpool/srv/pgdata]
use_template = database
recursive = no
[rpool/srv/pgwal]
use_template = database
recursive = no
[rpool/srv/mysql]
use_template = database
recursive = no
[rpool/srv/redis]
use_template = database
recursive = no
[template_database]
frequently = 4 # every 15 minutes, keep 4 (1 hour)
hourly = 48 # keep 48 hours
daily = 30 # keep 30 days
monthly = 12 # keep 12 months
yearly = 1
autosnap = yes
autoprune = yes
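To see the policy in action without waiting for the timer, run a pass by hand and list what it created:

```shell
# Take/prune snapshots per the config above, then inspect the results
sanoid --cron --verbose
zfs list -t snapshot -o name,creation -s creation -r rpool/srv | tail
```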
# Replicate to backup node every hour
cat > /etc/cron.d/db-replication <<'EOF'
SHELL=/bin/bash
PATH=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
0 * * * * root syncoid --no-sync-snap rpool/srv/pgdata root@backup:tank/srv/pgdata 2>&1 | logger -t syncoid-pgdata
5 * * * * root syncoid --no-sync-snap rpool/srv/pgwal root@backup:tank/srv/pgwal 2>&1 | logger -t syncoid-pgwal
10 * * * * root syncoid --no-sync-snap rpool/srv/mysql root@backup:tank/srv/mysql 2>&1 | logger -t syncoid-mysql
15 * * * * root syncoid --no-sync-snap rpool/srv/redis root@backup:tank/srv/redis 2>&1 | logger -t syncoid-redis
EOF
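Trust but verify: after the cron jobs have run, the newest snapshot name on the backup node should match the newest on the primary. A quick parity check, using the same host and pool names as the cron entries above:

```shell
# Newest snapshot on each side — the names should agree
zfs list -t snapshot -o name -s creation -r rpool/srv/pgdata | tail -1
ssh root@backup 'zfs list -t snapshot -o name -s creation -r tank/srv/pgdata | tail -1'
```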
6. Docker databases on ZFS datasets
Running databases in Docker? Each container's data volume should be a ZFS dataset, not a Docker volume on the default overlay. This gives you snapshots, compression, and replication for every containerized database individually.
Docker Compose with ZFS-backed volumes
# Create datasets first
zfs create -o recordsize=16k -o compression=zstd -o atime=off \
-o mountpoint=/srv/docker-pgdata rpool/srv/docker-pgdata
zfs create -o recordsize=16k -o compression=zstd -o atime=off \
-o mountpoint=/srv/docker-mysql rpool/srv/docker-mysql
zfs create -o recordsize=8k -o compression=zstd -o atime=off \
-o mountpoint=/srv/docker-redis rpool/srv/docker-redis
# docker-compose.yml
version: "3.8"
services:
  postgres:
    image: postgres:16
    container_name: postgres
    restart: unless-stopped
    environment:
      POSTGRES_PASSWORD: changeme
      POSTGRES_DB: myapp
    ports:
      - "5432:5432"
    volumes:
      - /srv/docker-pgdata:/var/lib/postgresql/data
    command: >
      postgres
      -c full_page_writes=off
      -c shared_buffers=2GB
      -c effective_cache_size=6GB
      -c wal_compression=off
      -c checkpoint_timeout=15min

  mysql:
    image: mysql:8
    container_name: mysql
    restart: unless-stopped
    environment:
      MYSQL_ROOT_PASSWORD: changeme
    ports:
      - "3306:3306"
    volumes:
      - /srv/docker-mysql:/var/lib/mysql
    command: >
      --innodb-doublewrite=OFF
      --innodb-checksum-algorithm=none
      --innodb-flush-method=O_DIRECT
      --innodb-buffer-pool-size=2G

  redis:
    image: redis:7
    container_name: redis
    restart: unless-stopped
    ports:
      - "6379:6379"
    volumes:
      - /srv/docker-redis:/data
    command: redis-server --save "" --appendonly no
# Snapshot all database volumes before upgrading containers
ksnap /srv/docker-pgdata && ksnap /srv/docker-mysql && ksnap /srv/docker-redis
docker compose pull && docker compose up -d
7. Performance tuning — ARC vs database buffer pool
Every database has its own buffer pool (PostgreSQL's shared_buffers, InnoDB's
innodb_buffer_pool_size). ZFS has ARC. Both cache data in RAM. You need to balance them
so they don't fight for memory. The rule: give the database 25% of RAM, give ARC 50%,
leave 25% for the OS and application code. Monitor and adjust.
Memory allocation on a 32GB system
Database buffer pool: 8GB (25%)
ZFS ARC: 16GB (50%)
OS + apps: 8GB (25%)
# Set ARC max to 16GB
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max
# Persist across reboots
echo 'options zfs zfs_arc_max=17179869184' \
> /etc/modprobe.d/zfs.conf
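The magic number is just 16 GiB in bytes — worth computing rather than copy-pasting when sizing for a different machine:

```shell
# 16 GiB expressed in bytes, for zfs_arc_max
echo $((16 * 1024 * 1024 * 1024))
# → 17179869184
```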
L2ARC for read-heavy databases
If your working set exceeds RAM but reads dominate writes, an L2ARC on a fast SSD extends the cache to flash. L2ARC shines for databases with hot datasets larger than RAM — think analytics queries scanning months of data.
# Add an L2ARC device
zpool add rpool cache /dev/disk/by-id/nvme-SSD
# Monitor L2ARC effectiveness
arc_summary | grep -A5 "L2ARC"
# Key question: is L2ARC hit rate > 50%?
# If not, the working set doesn't fit even on SSD.
# Remove it — L2ARC headers consume ARC RAM.
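If monitoring says the cache device isn't earning its ARC overhead, removal is online and safe — L2ARC holds only copies of data that still lives in the pool (same placeholder device path as when it was added):

```shell
# Detach the cache device from the pool
zpool remove rpool /dev/disk/by-id/nvme-SSD
```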
Monitor ARC vs database cache
# ZFS ARC hit rate
awk '/^hits /{h=$3} /^misses /{m=$3} END{printf "ARC hit rate: %.1f%%\n",h/(h+m)*100}' \
/proc/spl/kstat/zfs/arcstats
# PostgreSQL buffer cache hit rate
psql -c "SELECT round(100.0 * sum(blks_hit) / (sum(blks_hit) + sum(blks_read)), 1)
AS pg_cache_hit_percent FROM pg_stat_database;"
# MySQL InnoDB buffer pool hit rate
mysql -e "SHOW STATUS LIKE 'Innodb_buffer_pool_read%';" | \
  awk '$1=="Innodb_buffer_pool_read_requests"{r=$2}
       $1=="Innodb_buffer_pool_reads"{d=$2}
       END{printf "InnoDB hit rate: %.1f%%\n",(r-d)/r*100}'
# Target: both hit rates above 95% for OLTP workloads.
# If PG cache is low, increase shared_buffers.
# If ARC is low, increase zfs_arc_max or add L2ARC.
# If both are low, you need more RAM.
Every database deserves a filesystem that takes data integrity seriously. Traditional filesystems trust the disk and hope for the best. ZFS verifies every block, compresses it, snapshots it, and replicates it — all below the database layer, all transparent, all automatic.
The configs on this page are not theoretical. They are the settings you deploy when the data matters — when you need to recover to 15 minutes ago without a 3-hour pg_basebackup, when you need a full test copy of production without waiting for a dump, when you need to know that a silent bit flip won't corrupt your customer table and go undetected for months.
recordsize=16k. full_page_writes=off. innodb_doublewrite=OFF.
These aren't optimizations. They are the correct settings for a filesystem that already handles what
the database was trying to handle on its own.