Databases on ZFS — checksums, snapshots, and clones replace half your DBA toolchain.
Every database has a backup strategy, a corruption detection strategy, and a test-copy strategy. On ext4 or XFS, these are three separate systems with three separate failure modes. On ZFS, they are properties of the filesystem. Checksums catch silent corruption that no database can detect on its own. Snapshots are instant recovery points with near-zero I/O overhead — and, replicated off-box, real backups. Clones give you a full copy of production for testing in under a second. Compression saves 30-50% on database files without the database knowing or caring.
The catch: ZFS is copy-on-write, and databases have their own write-ahead mechanisms that assume in-place writes. You need to turn off the redundant safety features in the database and let ZFS handle them instead. The configs below do exactly that.
1. Why ZFS for databases
Checksums catch silent corruption
Traditional filesystems trust the disk. ZFS verifies every block on read. A single bit flip in a B-tree page can cascade into data loss that the database only discovers when a query returns wrong results — weeks later. ZFS catches it on the next read and auto-repairs from redundancy.
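Corruption is caught on read, but cold data may not be read for months. A periodic scrub forces ZFS to re-verify every block — a minimal sketch, assuming the pool is named rpool as in the rest of this page:

```shell
# Re-read and verify every block in the pool against its checksum
zpool scrub rpool

# CKSUM counters > 0 mean corrupted blocks were detected — and, with
# mirror/raidz redundancy, repaired from a good copy
zpool status -v rpool
```

Schedule it monthly from cron; scrubs run at low I/O priority and coexist with production traffic.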
Snapshots replace backup windows
A pg_basebackup on a 500GB database takes 20 minutes and hammers I/O.
A ZFS snapshot takes milliseconds and costs zero I/O. The database never notices.
You can snapshot every 15 minutes without impacting production.
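The zero-cost claim is easy to verify: a fresh snapshot references the dataset's existing blocks without copying anything, so its USED column starts near zero and grows only as production diverges from it. A sketch using the raw zfs commands:

```shell
# Snapshot, then check what it actually costs on disk
zfs snapshot rpool/srv/pgdata@demo
zfs list -t snapshot rpool/srv/pgdata@demo
# USED ~0 at creation; REFER = size of the data the snapshot pins
```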
Clones give you instant test copies
kclone /srv/pgdata /srv/pgdata-test creates a full copy of your production
database in 0.1 seconds. It shares all unchanged blocks with the original.
Run destructive queries, test migrations, benchmark schema changes — then destroy the clone.
Compression saves 30-50%
Database files are full of repeated structures, padding, and null bytes. zstd compression on ZFS typically achieves 1.3x-2x compression on database datasets. The compression happens transparently — the database reads and writes uncompressed data. ZFS handles the rest at the block layer.
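ZFS tracks the achieved ratio per dataset, so the 30-50% figure is checkable rather than taken on faith:

```shell
# compressratio covers all data written to the dataset;
# logicalused vs used shows the same thing as raw byte counts
zfs get compressratio,logicalused,used rpool/srv/pgdata

# Convert a ratio to space saved: saved = 1 - 1/ratio
awk -v r=1.5 'BEGIN{printf "%.0f%% saved\n", (1 - 1/r) * 100}'
```

The awk line is plain arithmetic — a 1.5x ratio means one third of the raw bytes never hit disk.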
Replication is disaster recovery
syncoid rpool/srv/pgdata root@dr-node:tank/srv/pgdata replicates the entire
database dataset — including all snapshots — to a remote node. Incrementals only send changed blocks.
If the primary dies, promote the replica and point the app at the new node.
CoW eliminates double-writes
Databases protect against torn pages with page-level double writes: PostgreSQL copies each page into the WAL the first time it changes after a checkpoint (full_page_writes), and InnoDB stages pages in a doublewrite buffer before updating the data file in place. But ZFS is copy-on-write — it never modifies data in place, so after a crash a block is either wholly old or wholly new, never a torn mix. Disabling the database's double-write protection cuts that write amplification in half.
2. PostgreSQL on ZFS
PostgreSQL is the best-documented database for ZFS tuning. The key insight: separate datasets for data, WAL, and temp. Each has different I/O patterns and different optimal recordsizes. Disable PostgreSQL's own checksums and full_page_writes — ZFS does both better.
Create PostgreSQL datasets
# Data directory — 16k recordsize matches PostgreSQL's 8k page size
# (16k allows some coalescing without wasting space on partial blocks).
# Leave primarycache at its default "all": PostgreSQL does buffered
# I/O, so the ARC serves as its second cache tier (see section 7).
zfs create -o recordsize=16k -o compression=zstd -o atime=off \
  -o logbias=throughput \
  -o mountpoint=/srv/pgdata rpool/srv/pgdata
# WAL (Write-Ahead Log) — sequential writes, larger records are fine
zfs create -o recordsize=128k -o compression=zstd -o atime=off \
-o logbias=latency \
-o mountpoint=/srv/pgwal rpool/srv/pgwal
# Temp tablespace — throwaway data, optimize for throughput
zfs create -o recordsize=128k -o compression=zstd -o atime=off \
-o mountpoint=/srv/pgtemp rpool/srv/pgtemp
# Set ownership
chown -R postgres:postgres /srv/pgdata /srv/pgwal /srv/pgtemp
PostgreSQL configuration for ZFS
# postgresql.conf — key ZFS-specific settings
# SKIP PostgreSQL data checksums — ZFS checksums every block already,
# so PG checksums add CPU overhead for redundant protection.
# (Checksums are an initdb-time choice: just don't pass
# initdb --data-checksums when creating the cluster.)
# DISABLE full_page_writes — ZFS is copy-on-write, so torn pages
# are impossible. This cuts WAL volume by ~30-50%.
full_page_writes = off
# WAL on a separate dataset — note there is no "wal_directory" setting
# in postgresql.conf; set it at cluster creation with
# initdb --waldir=/srv/pgwal, or stop the server and replace pg_wal
# with a symlink to /srv/pgwal
# Temp tablespace on separate dataset
temp_tablespaces = 'pgtemp'
# Shared buffers: 25% of RAM is the PostgreSQL recommendation.
# With ZFS ARC, you have two caches. Start with 25% for PG,
# let ARC take another 25-50%. Monitor and adjust.
shared_buffers = 4GB # for a 16GB RAM system
# Effective cache size: tell the planner about total cache (PG + ARC)
effective_cache_size = 12GB # shared_buffers + estimated ARC
# wal_compression only compresses full-page images, which
# full_page_writes=off already eliminates — leave it off
wal_compression = off # ZFS compresses the WAL dataset anyway
wal_buffers = 64MB
# Checkpoints — less frequent since ZFS handles consistency
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
max_wal_size = 4GB
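One step the config above implies but doesn't show: temp_tablespaces names a tablespace that has to exist before PostgreSQL can use it. A one-time setup sketch (the pgtemp name matches the setting above):

```shell
# Register the temp dataset as a tablespace, then reload the config
psql -U postgres -c "CREATE TABLESPACE pgtemp LOCATION '/srv/pgtemp';"
psql -U postgres -c "SELECT pg_reload_conf();"
```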
Connection pooling with PgBouncer
# PgBouncer sits between the app and PostgreSQL.
# It reuses database connections instead of opening new ones.
# 500 app connections become 20 database connections.
# /etc/pgbouncer/pgbouncer.ini
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction # best for most apps
max_client_conn = 500
default_pool_size = 20
reserve_pool_size = 5
reserve_pool_timeout = 3
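Once PgBouncer is up, the application connects to port 6432 instead of 5432 and nothing else changes. The admin console (assuming your user is listed in PgBouncer's admin_users) shows live pool state:

```shell
# App traffic goes through the pooler
psql -h 127.0.0.1 -p 6432 -U myuser mydb

# Admin console: one row per database/user pool, with active and
# waiting client counts
psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c "SHOW POOLS;"
```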
Point-in-time recovery via ZFS snapshots
# Traditional PITR: pg_basebackup + WAL archiving.
# Recovery = restore base backup + replay WAL. Slow.
# ZFS PITR: snapshot + rollback. Fast.
# Snapshot before any risky operation. Two back-to-back ksnap calls
# are not atomic with each other — for a crash-consistent pair, take
# a recursive snapshot of the common parent instead
# (zfs snapshot -r rpool/srv@pre-migration), or pause writes briefly.
ksnap /srv/pgdata && ksnap /srv/pgwal
# Something went wrong? Rollback.
systemctl stop postgresql
ksnap rollback /srv/pgdata # interactive — pick the snapshot
ksnap rollback /srv/pgwal
systemctl start postgresql
# Automated: sanoid snapshots every 15 minutes
# You can recover to any 15-minute boundary
# That's 96 recovery points per day, zero I/O cost
Clone production for testing in 0.1 seconds
# Need to test a migration against real data?
# Don't pg_dump + pg_restore (hours). Clone (milliseconds).
# Snapshot current production data
ksnap /srv/pgdata
# Clone it
kclone /srv/pgdata /srv/pgdata-staging
# Start a second PostgreSQL instance on the clone.
# Caveat: if pg_wal is a symlink to the separate /srv/pgwal dataset,
# clone that too and repoint the symlink inside the staging copy first.
pg_ctl -D /srv/pgdata-staging start -o "-p 5433"
# Run your migration against staging
psql -p 5433 -f migration.sql
# Test it, verify it, benchmark it.
# When done, destroy the clone — zero cost.
pg_ctl -D /srv/pgdata-staging stop
zfs destroy rpool/srv/pgdata-staging
3. MySQL / MariaDB on ZFS
MySQL's InnoDB engine has its own double-write buffer and checksum mechanism — both redundant on ZFS.
Turning them off cuts write amplification and frees CPU cycles. The key settings are
innodb_doublewrite=OFF and recordsize=16k to match InnoDB's page size.
Create MySQL datasets
# InnoDB data — 16k recordsize matches InnoDB's 16k page size exactly
zfs create -o recordsize=16k -o compression=zstd -o atime=off \
-o primarycache=metadata \
-o mountpoint=/srv/mysql rpool/srv/mysql
# Redo logs — sequential writes, larger records
zfs create -o recordsize=128k -o compression=zstd -o atime=off \
-o mountpoint=/srv/mysql-redo rpool/srv/mysql-redo
# Set ownership
chown -R mysql:mysql /srv/mysql /srv/mysql-redo
MySQL configuration for ZFS
# /etc/my.cnf.d/zfs.cnf
[mysqld]
# DISABLE double-write buffer — ZFS CoW makes torn pages impossible
innodb_doublewrite = OFF
# DISABLE InnoDB checksums — ZFS checksums every block
# (the "none" value is deprecated in MySQL 8.0; if your version
# rejects it, crc32 is nearly free)
innodb_checksum_algorithm = none
# Data and redo on separate datasets
datadir = /srv/mysql
innodb_log_group_home_dir = /srv/mysql-redo
# Buffer pool: 25-50% of RAM, same tradeoff as PG with ARC
innodb_buffer_pool_size = 4G # for a 16GB system
# Flush method: O_DIRECT bypasses OS cache (ZFS ARC is the cache)
innodb_flush_method = O_DIRECT
# Reduce checkpoint frequency
innodb_log_file_size = 1G
innodb_flush_log_at_trx_commit = 1 # keep ACID, ZFS handles the rest
# Let ZFS handle compression — don't compress in InnoDB
# (innodb_compression_default is MariaDB-specific; on Oracle MySQL,
# just avoid ROW_FORMAT=COMPRESSED tables)
innodb_compression_default = OFF
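It's worth confirming after a restart that the settings actually took — a deployment typo here silently reinstates the redundant double work:

```shell
mysql -e "SHOW VARIABLES WHERE Variable_name IN
  ('innodb_doublewrite','innodb_checksum_algorithm','innodb_flush_method');"
```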
4. Redis on ZFS
Redis is an in-memory database that persists to disk via RDB snapshots or AOF (append-only file). Both persistence mechanisms have tradeoffs: RDB can lose the last few minutes of data, AOF can grow huge and slow down recovery. ZFS can take over most of this: keep Redis persistence minimal, dump an RDB on demand, and let ZFS snapshots provide the point-in-time history and replication.
Create Redis dataset
# Redis writes small records — use 4k-8k recordsize
zfs create -o recordsize=8k -o compression=zstd -o atime=off \
-o mountpoint=/srv/redis rpool/srv/redis
chown -R redis:redis /srv/redis
Redis configuration for ZFS persistence
# /etc/redis/redis.conf
# OPTION 1: Disable automatic Redis persistence.
# Durability then depends on the scheduled BGSAVE + ZFS snapshot
# strategy shown further down — without it, a restart loses everything.
save ""
appendonly no
# Data directory (ZFS dataset)
dir /srv/redis
# OPTION 2: Keep RDB for Redis-level recovery, but rely on
# ZFS snapshots for point-in-time rollback.
# save 900 1
# save 300 10
# save 60 10000
# ZFS-based Redis persistence strategy:
# 1. Tell Redis to dump RDB on demand
redis-cli BGSAVE
# 2. Snapshot the dataset (captures the RDB file)
ksnap /srv/redis
# 3. Automate with sanoid — every 15 minutes
# [rpool/srv/redis]
# use_template = production
# Recovery: rollback + restart
systemctl stop redis
ksnap rollback /srv/redis
systemctl start redis
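BGSAVE forks and returns immediately, so a snapshot taken right after it may capture a half-written RDB file. A polling sketch that waits for the background save to finish before snapshotting (assumes redis-cli on the default port):

```shell
redis-cli BGSAVE
# INFO persistence reports rdb_bgsave_in_progress:1 while the forked
# child is still writing, 0 once the dump is on disk
while [ "$(redis-cli INFO persistence | tr -d '\r' \
           | awk -F: '/^rdb_bgsave_in_progress/{print $2}')" = "1" ]; do
  sleep 1
done
ksnap /srv/redis
```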
Instant Redis clone for testing
# Clone production Redis for testing.
# Note: BGSAVE is asynchronous — sleep 1 is a guess. Poll
# rdb_bgsave_in_progress in INFO persistence if freshness matters.
redis-cli BGSAVE && sleep 1
ksnap /srv/redis
kclone /srv/redis /srv/redis-test
# Start test instance on a different port
redis-server --dir /srv/redis-test --port 6380 --daemonize yes
# Run destructive tests against port 6380
redis-cli -p 6380 FLUSHALL # production is untouched
# Clean up
redis-cli -p 6380 SHUTDOWN
zfs destroy rpool/srv/redis-test
5. Backup strategy — sanoid + syncoid
Every database dataset gets the same treatment: sanoid snapshots locally, syncoid replicates to a backup node. Local snapshots protect against operator error (dropped a table). Remote replication protects against hardware failure (disk died). Together, they cover everything short of a simultaneous fire in both data centers.
Sanoid + syncoid for databases
# /etc/sanoid/sanoid.conf
[rpool/srv/pgdata]
use_template = database
recursive = no
[rpool/srv/pgwal]
use_template = database
recursive = no
[rpool/srv/mysql]
use_template = database
recursive = no
[rpool/srv/redis]
use_template = database
recursive = no
[template_database]
frequently = 4 # every 15 minutes, keep 4 (1 hour)
hourly = 48 # keep 48 hours
daily = 30 # keep 30 days
monthly = 12 # keep 12 months
yearly = 1
autosnap = yes
autoprune = yes
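To see the policy in action without waiting for the timer, run a pass by hand and list what it created:

```shell
# Take/prune snapshots per the config above, then inspect the results
sanoid --cron --verbose
zfs list -t snapshot -o name,creation -s creation -r rpool/srv | tail
```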
# Replicate to backup node every hour
cat > /etc/cron.d/db-replication <<'EOF'
SHELL=/bin/bash
PATH=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
0 * * * * root syncoid --no-sync-snap rpool/srv/pgdata root@backup:tank/srv/pgdata 2>&1 | logger -t syncoid-pgdata
5 * * * * root syncoid --no-sync-snap rpool/srv/pgwal root@backup:tank/srv/pgwal 2>&1 | logger -t syncoid-pgwal
10 * * * * root syncoid --no-sync-snap rpool/srv/mysql root@backup:tank/srv/mysql 2>&1 | logger -t syncoid-mysql
15 * * * * root syncoid --no-sync-snap rpool/srv/redis root@backup:tank/srv/redis 2>&1 | logger -t syncoid-redis
EOF
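Trust but verify: after the cron jobs have run, the newest snapshot name on the backup node should match the newest on the primary. A quick parity check, using the same host and pool names as the cron entries above:

```shell
# Newest snapshot on each side — the names should agree
zfs list -t snapshot -o name -s creation -r rpool/srv/pgdata | tail -1
ssh root@backup 'zfs list -t snapshot -o name -s creation -r tank/srv/pgdata | tail -1'
```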
6. Docker databases on ZFS datasets
Running databases in Docker? Each container's data volume should be a ZFS dataset, not a Docker volume on the default overlay. This gives you snapshots, compression, and replication for every containerized database individually.
Docker Compose with ZFS-backed volumes
# Create datasets first
zfs create -o recordsize=16k -o compression=zstd -o atime=off \
-o mountpoint=/srv/docker-pgdata rpool/srv/docker-pgdata
zfs create -o recordsize=16k -o compression=zstd -o atime=off \
-o mountpoint=/srv/docker-mysql rpool/srv/docker-mysql
zfs create -o recordsize=8k -o compression=zstd -o atime=off \
-o mountpoint=/srv/docker-redis rpool/srv/docker-redis
# docker-compose.yml
version: "3.8"
services:
  postgres:
    image: postgres:16
    container_name: postgres
    restart: unless-stopped
    environment:
      POSTGRES_PASSWORD: changeme
      POSTGRES_DB: myapp
    ports:
      - "5432:5432"
    volumes:
      - /srv/docker-pgdata:/var/lib/postgresql/data
    command: >
      postgres
      -c full_page_writes=off
      -c shared_buffers=2GB
      -c effective_cache_size=6GB
      -c wal_compression=off
      -c checkpoint_timeout=15min

  mysql:
    image: mysql:8
    container_name: mysql
    restart: unless-stopped
    environment:
      MYSQL_ROOT_PASSWORD: changeme
    ports:
      - "3306:3306"
    volumes:
      - /srv/docker-mysql:/var/lib/mysql
    command: >
      --innodb-doublewrite=OFF
      --innodb-checksum-algorithm=none
      --innodb-flush-method=O_DIRECT
      --innodb-buffer-pool-size=2G

  redis:
    image: redis:7
    container_name: redis
    restart: unless-stopped
    ports:
      - "6379:6379"
    volumes:
      - /srv/docker-redis:/data
    command: redis-server --save "" --appendonly no
# Snapshot all database volumes before upgrading containers
ksnap /srv/docker-pgdata && ksnap /srv/docker-mysql && ksnap /srv/docker-redis
docker compose pull && docker compose up -d
7. Performance tuning — ARC vs database buffer pool
Every database has its own buffer pool (PostgreSQL's shared_buffers, InnoDB's
innodb_buffer_pool_size). ZFS has ARC. Both cache data in RAM. You need to balance them
so they don't fight for memory. The rule: give the database 25% of RAM, give ARC 50%,
leave 25% for the OS and application code. Monitor and adjust.
Memory allocation on a 32GB system
Database buffer pool: 8GB (25%)
ZFS ARC: 16GB (50%)
OS + apps: 8GB (25%)
# Set ARC max to 16GB
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max
# Persist across reboots
echo 'options zfs zfs_arc_max=17179869184' \
> /etc/modprobe.d/zfs.conf
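The magic number is just 16 GiB in bytes — worth computing rather than copy-pasting when sizing for a different machine:

```shell
# 16 GiB expressed in bytes, for zfs_arc_max
echo $((16 * 1024 * 1024 * 1024))
# → 17179869184
```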
L2ARC for read-heavy databases
If your working set exceeds RAM but reads dominate writes, an L2ARC on a fast SSD extends the cache to flash. L2ARC shines for databases with hot datasets larger than RAM — think analytics queries scanning months of data.
# Add an L2ARC device
zpool add rpool cache /dev/disk/by-id/nvme-SSD
# Monitor L2ARC effectiveness
arc_summary | grep -A5 "L2ARC"
# Key question: is L2ARC hit rate > 50%?
# If not, the working set doesn't fit even on SSD.
# Remove it — L2ARC headers consume ARC RAM.
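If monitoring says the cache device isn't earning its ARC overhead, removal is online and safe — L2ARC holds only copies of data that still lives in the pool (same placeholder device path as when it was added):

```shell
# Detach the cache device from the pool
zpool remove rpool /dev/disk/by-id/nvme-SSD
```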
Monitor ARC vs database cache
# ZFS ARC hit rate
awk '/^hits /{h=$3} /^misses /{m=$3} END{printf "ARC hit rate: %.1f%%\n",h/(h+m)*100}' \
/proc/spl/kstat/zfs/arcstats
# PostgreSQL buffer cache hit rate
psql -c "SELECT round(100.0 * sum(blks_hit) / (sum(blks_hit) + sum(blks_read)), 1)
AS pg_cache_hit_percent FROM pg_stat_database;"
# MySQL InnoDB buffer pool hit rate
mysql -e "SHOW STATUS LIKE 'Innodb_buffer_pool_read%';" | \
  awk '$1=="Innodb_buffer_pool_read_requests"{r=$2}
       $1=="Innodb_buffer_pool_reads"{d=$2}
       END{printf "InnoDB hit rate: %.1f%%\n",(r-d)/r*100}'
# Target: both hit rates above 95% for OLTP workloads.
# If PG cache is low, increase shared_buffers.
# If ARC is low, increase zfs_arc_max or add L2ARC.
# If both are low, you need more RAM.
Every database deserves a filesystem that takes data integrity seriously. Traditional filesystems trust the disk and hope for the best. ZFS verifies every block, compresses it, snapshots it, and replicates it — all below the database layer, all transparent, all automatic.
The configs on this page are not theoretical. They are the settings you deploy when the data matters — when you need to recover to 15 minutes ago without a 3-hour pg_basebackup, when you need a full test copy of production without waiting for a dump, when you need to know that a silent bit flip won't corrupt your customer table and go undetected for months.
recordsize=16k. full_page_writes=off. innodb_doublewrite=OFF.
These aren't optimizations. They are the correct settings for a filesystem that already handles what
the database was trying to handle on its own.