ZFSBootMenu (a bootloader, like GRUB) has great introductory material: https://docs.zfsbootmenu.org.
Note that in Debian, ZFS-related packages live in contrib.
Ubuntu-based distros (which? versions?) use Ubiquity in their live images, which (experimentally?) supports installing to ZFS. This (sometimes?) seems to require ZFS-related packages to be installed manually before starting Ubiquity:
sudo apt install zfsutils-linux zfs-initramfs zfs-zed
Debian Live uses Calamares, which has a ZFS module; non-Live uses Debian-Installer (ZFS support?).
Red Hat/Fedora-related distros use Anaconda (ZFS support?).
sudo zpool history "$pool"
might be handy for figuring out what zpool commands an installer ran. Obviously, reading the installers' source is better for determining what algorithm they used to arrive at those commands.
apt-cache search -- -initramfs
clevis: automated encryption framework
clevis-initramfs
clevis-dracut
zfs-initramfs
zfs-dracut
https://docs.zfsbootmenu.org/en/v2.3.x/guides/ubuntu/noble-uefi.html#configure-efi-boot-entries
man kernel-install
mkinitcpio (Arch only) (TODO: Is this not an initrd?)
initrd generators
initramfs-tools
dracut
Debian-based: debootstrap, chroot (see the sketch after this list)
Fedora-based: dnf --installroot, systemd-nspawn -b
Arch-based: pacstrap, arch-chroot
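A minimal sketch of the Debian-based route (the suite and the /mnt target are placeholders; it assumes the target root dataset is already created and mounted there):
# Bootstrap a base system into the mounted root dataset.
sudo debootstrap bookworm /mnt
# Enter it; bind-mounting /dev, /proc and /sys is usually needed for
# anything beyond trivial package work.
sudo chroot /mnt /bin/bash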
(The examples assume a dataset layout that we have not yet standardized on.)
$HOME
host="$(hostname)"
user="$USER"
dataset="tank/$host/HOME/$user"
sudo zfs allow -u "$user" create,mount,rollback,snapshot "$dataset"
sudo zfs allow -u "$user" -d destroy "$dataset"
Linux restricts mount operations to root, which results in the following message when the delegated (non-root) user tries to zfs create a dataset under $dataset:
filesystem successfully created, but it may only be mounted by root
As long as the dataset is not mounted, zfs destroy does in fact destroy the dataset.
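A sketch of the resulting workflow for the delegated user, assuming the allows above (the child dataset name is made up):
# As the delegated (non-root) user:
zfs create "$dataset/scratch"     # prints the "may only be mounted by root" note
sudo zfs mount "$dataset/scratch"
zfs snapshot "$dataset/scratch@before"
zfs rollback "$dataset/scratch@before"
sudo zfs unmount "$dataset/scratch"
zfs destroy "$dataset/scratch"    # works once the dataset is unmounted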
host="some-remote"
user="backup-$host"
dataset="tank/$host"
# `-d` flag untested
sudo zfs allow -u "$user" -d receive "$dataset"
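One way the remote host might then push backups (a sketch; the backup server name and source dataset are made up, and note that zfs-allow(8) lists create and mount as additional requirements for receive):
# Run on the remote host ("$host"); "backup.example" stands in for the backup machine.
snap="rpool/HOME@$(date -I)"
sudo zfs snapshot "$snap"
sudo zfs send -w "$snap" \
    | ssh "$user@backup.example" zfs receive -u "$dataset/HOME"
# -w sends raw (still-encrypted) blocks; -u avoids mounting on the backup host.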
TODO: Threat model (we have one written down somewhere).
Since ZFS does not give strong guarantees about the effectiveness of key rotation with zfs change-key (see the man page), key rotation should be handled outside of ZFS. Make sure that during the creation of a new key, it is not temporarily stored on ZFS encrypted with the old key! Since we want all our important data to be stored on encrypted ZFS, flash storage with verified working secure erase or a hardware token are the only viable places to store (and rotate) keys (potentially wrapping the ZFS “wrapping key”). If any translation is required in order to get the secret provided by a hardware token into the correct format for zfs load-key, that translation should be stored with the pool/dataset to prevent unintended loss of access. One option is a zvol (Ubuntu) or a dataset property (pivy-zfs). Since this translation is stored on ZFS, it should only provide translation and not be relied on for security!
zvols have the advantage that they can be formatted with LUKS, which has good integration with other tools, such as initramfs-tools and hardware tokens. We don’t want to customize things regarding early boot and encryption too much.
Another advantage of zvols is that it’s very obvious where the keys live. They show up directly in zfs list; you don’t have to look at e.g. individual properties. It’s also obvious whether the keys are being replicated or not.
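A minimal sketch of the zvol-plus-LUKS idea (all names, sizes and mount points are made up; this is not a settled design):
pool="tank"                      # hypothetical pool name
keyvol="$pool/KEYS/$(hostname)"  # hypothetical dataset name
# A small zvol holding a LUKS container for ZFS wrapping keys.
sudo zfs create -V 64M "$keyvol"
sudo cryptsetup luksFormat "/dev/zvol/$keyvol"
sudo cryptsetup open "/dev/zvol/$keyvol" zfskeys
sudo mkfs.ext4 /dev/mapper/zfskeys
sudo mount /dev/mapper/zfskeys /mnt
# A raw 32-byte wrapping key for a new encrypted dataset.
sudo dd if=/dev/urandom of=/mnt/home.key bs=32 count=1
sudo zfs create -o encryption=on -o keyformat=raw \
    -o keylocation=file:///mnt/home.key "$pool/HOME"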
Too many snapshots?
zfs list -o name,usedbydataset,usedbysnapshots -S usedbysnapshots
zfs list -o name,used,creation -t snapshot "$dataset"
# zfs destroy -vn "$dataset@$snap1%$snap2"
Feeling tempted to enable deduplication? (Probably this will put your mind at ease and confirm that, no, enabling dedup would not help. Either way, consider using symlinks/hardlinks or some other manual way of handling duplicated files; dedup is expensive.)
sudo zdb -S "$pool"
Wishlist:
zpool or $(hostname).
Questions:
Does a $hostname-pool/BACK hierarchy gain us anything?
How do we make sure that the backup host does not try to mount received datasets (specifically, on reboot)?
Work in progress suggestion:
os="$(
if [ -e /etc/lsb-release ]
then
. /etc/lsb-release
printf '%s' "$DISTRIB_DESCRIPTION"
elif [ -e /etc/os-release ]
then
. /etc/os-release;
printf '%s' "$PRETTY_NAME"
fi \
| tr '[:upper:]' '[:lower:]' \
| tr -c '[:graph:]' '-'
)"
key="$(printf '%s' "$dataset" | tr '/' '_')"
$hostname-pool/KEYS/$hostname/$key
$hostname-pool/ROOT/$hostname/$os
$hostname-pool/HOME/$hostname/$user
$hostname-pool/DATA/$hostname/$data
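A sketch of what creating that layout could look like (purely illustrative; the names, and even the "$hostname-pool" pool name, are still open questions):
hostname="$(hostname)"
pool="$hostname-pool"
# -p creates any missing intermediate datasets (KEYS, ROOT, HOME, DATA).
sudo zfs create -p "$pool/KEYS/$hostname/$key"
sudo zfs create -p "$pool/ROOT/$hostname/$os"
sudo zfs create -p "$pool/HOME/$hostname/$user"
sudo zfs create -p "$pool/DATA/$hostname/$data"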
Talks:
There are two ways to “waste space” in RAIDZ (they show up as left and skip in the script below):
We only consider a single vdev. Some definitions:
data: Number of “usable data” disks.
parity: Number of “parity” disks.
disks = data + parity: Total number of disks.
sectorsize = pow(2, ashift): Size in bytes of chunks in which pools ask disks for usable space. This should, but does not have to, correspond to actual disk sector sizes. Valid: 512-64K, default depends on disks used on pool creation. We assume 4K, i.e. an ashift of 12.
recordsize: (Maximum) size in bytes of chunks in which filesystem datasets ask pools for usable space. Valid: powers of 2 in 512-128K, default: 128K.
volblocksize: Size in bytes of chunks in which volume datasets ask pools for usable space. Valid: powers of 2 in 512-128K, default: 8K.
recordsectors = recordsize / sectorsize: The number of sectors in a record.
data disks
A dataset asks the underlying pool to give it usable space in blocks of size (at most) recordsize. For each started “stripe” (a row of sectors across the disks), RAIDZ adds parity parity sectors. So to fully take advantage of that parity, we want recordsectors to be evenly divisible by data. Since recordsectors is a power of 2 (i.e. has 2 as its only prime factor), data needs to also be a power of 2 (i.e. have 2 as its only prime factor). In addition, RAIDZ pads every allocation up to a multiple of parity + 1 sectors, so the total allocation (data plus parity sectors) should also be a multiple of parity + 1 to avoid skip/padding sectors:
recordsectors % data == 0
recordsectors % (disks - parity) == 0
data = 2^n, for some integer n
(recordsectors / data * disks) % (parity + 1) == 0
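Worked example with the assumptions above (4K sectors, 128K records): a 6-wide RAIDZ2 has data = 4 and recordsectors = 128K / 4K = 32, so 32 % 4 == 0; the full allocation is 32 / 4 * 6 = 48 sectors, and 48 % (2 + 1) == 0, so both conditions hold.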
Jim Salter’s “blessed” topologies (see also Jim Salter’s “cursed” topologies):
2^N+1, 1 <= N <= 1
, i.e. 3
(redundancy ratios 1/3
).2^N+2, 1 <= N <= 3
, i.e. 4, 6, 10
(redundancy ratios 1/2, 1/3, 1/5
).You want data
to evenly divide into recordsize_sectors == recordsize / sector
. Since both recordsize
and sector
are powers of two, i.e. the only prime factor is 2
, data
needs to also have 2
as its only prime factor.
Why the big difference in the range of N between RAIDZ levels? The probability of any single drive failing scales linearly with the number of drives, while the probability of enough drives failing at once to exceed the parity drops off much faster (roughly exponentially in the parity level), so higher parity levels can afford wider vdevs. TODO: Expand on this.
Note that this gets less straightforward when compression=on (which you probably want) and you have compressible data; see “ZFS RAIDZ stripe width, or: How I Learned to Stop Worrying and Love RAIDZ” (2014-06-05) by Matthew Ahrens on the Delphix blog.
6-wide RAIDZ2 “complies” with both these recommendations.
When changing recordsize/volblocksize, which is mostly relevant for databases on ZFS filesystems or for ZFS volumes (neither of which is very interesting to us), see [RAIDZ on-disk format vs. small blocks][] (2019-06-25) by Mike Gerdts at Joyent Technical Discussion and the accompanying comment.
Note that recordsize/volblocksize is part of zfs send streams! If you’re going to do something fancy, make sure it makes sense on all (current and future!) replication targets.
(“ashift is an implementation detail, the user tunable should be called sectorsize and expressed in bytes” — Allan Jude)
from math import *

sectorsize = 4 * 1024    # "Advanced Format Drive"
recordsize = 128 * 1024  # default
recordsectors = recordsize // sectorsize

for parity in [1, 2, 3]:
    for disks in range(parity+2, 4**(2**(parity-1))+1):
    # for disks in range(parity+2, 64+1):
        data = disks - parity
        ratio = parity / disks
        alloc = parity + 1
        # left: data sectors of the last stripe that a record leaves unused.
        left = (data - recordsectors % data) % data
        # skip: padding sectors needed to round the allocation up to a
        # multiple of parity + 1.
        skip = (alloc - (recordsectors // data * disks) % alloc) % alloc
        if left != 0 or skip != 0:
            continue
        # if left != 0:
        #     continue
        # if left == 0 or skip == 0:
        #     continue
        print(f"RAIDZ{parity} {disks:>2} disks: ratio={ratio:.2f} left={left} skip={skip}")
        # for recordsize in range(recordsize_max, -sector, -sector):
        #     skip = sector * ((alloc - (recordsize // sector + parity) % alloc) % alloc)
        #     aligned = recordsize % (sector * (disks - parity)) == 0
        #     if skip == 0 and aligned:
        #         break
        # print(f"RAIDZ{parity} {disks:>2} disks: ratio={ratio:.2f} recordsize={recordsize//1024}K/{recordsize//sector}s")
RAIDZ on-disk format vs. small blocks
TODO: Ubuntu linuxmint-21.3-cinnamon-64bit.iso linuxmint-22-cinnamon-64bit.iso TrueNAS-SCALE-24.04.2.3.iso Zorin-OS-17.2-Core-64-bit.iso cachyos-desktop-linux-241003.iso
Is there a benefit of having pool names be unique (i.e. not zpool, bpool, rpool, etc.)?
https://github.com/rlaager/zfscrypt
https://old.reddit.com/r/zfs/comments/18hcjam/upgrading_zfs_on_2004_from_083_from_apt_to_2x_i/
ZFS on Linux mailing list (was https://list.zfsonlinux.org/pipermail/zfs-discuss/?)
TechSNAP podcast:
Companies:
People:
Chris Siebenmann (cks)
Pull requests
v2.2.0
ZED_EMAIL_{ADDR,PROG,OPTS} in /etc/zfs/zed.d/zed.rc.
zed_notify_email() in /etc/zfs/zed.d/zed-functions.sh.
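A sketch of the relevant zed.rc settings (values are examples; check the zed.rc shipped with your OpenZFS version for the exact defaults):
# In /etc/zfs/zed.d/zed.rc:
ZED_EMAIL_ADDR="root"
ZED_EMAIL_PROG="mail"
ZED_EMAIL_OPTS="-s '@SUBJECT@' @ADDRESS@"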
Talks:
Articles.
Agenda and notes on Google Docs.
With notes from “Agenda and notes”.
send --no-encryption for -p/-R/etc without preserving encryption https://github.com/openzfs/zfs/pull/15310 (Rich E)
@ notation on the properties to distinguish different keys (keyformat@key1, keyformat@key2, …)
send/recv has several known bugs.
send/recv) has few bugs.
zfs change-key has an open issue
< 0.6.4 versions of OpenZFS. Related PR #13014.
With notes from me.
man zstream
The encryptionroot dataset property specifies which other dataset owns the wrapping key. Different datasets will always have different master keys, except if one is a clone of the other. I think the rationale was to try as hard as possible to never share keys, but when cloning that’s impossible, since the datasets literally share encrypted blocks (so obviously they need the same key to read those blocks).
https://www.youtube.com/watch?v=fTIJgGpV3HE&t=38m25s
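A quick way to inspect this on a live system (a sketch; dataset names are hypothetical):
# Children under an encrypted dataset inherit its wrapping key, so their
# encryptionroot points at the ancestor that owns the key.
zfs get -r encryptionroot,keystatus tank/enc
# zfs change-key on a child would make it its own encryption root;
# zfs change-key -i makes it inherit from its parent again.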
Terminology (somewhat unofficial):
my entire backup regime is all built on encryption and replication
— Rob Norris (robn), ZFS Developer, 2023-11-03
im just not really an active member of the project at the moment and I don’t have the bandwidth to get back into it. I would be more than happy to talk to anyone / review any work someone makes towards a PR though.
— Tom Caputi (tcaputi), ZFS Native Encryption author, 2023-11-19
Pull requests, and possible workarounds:
$(hostname) (or $(hostname | cut -d- -f1))?
- {tank,???}
  - ROOT
    - "$(hostname)"
      - "$(. "/etc/os-release"; echo "$ID-$VERSION_ID")" ...
  - HOME
    - "$(hostname)"
      - "$USER" ...