ZFS

TODO

Documentation

Ubuntu 24.04 Desktop supports (optionally encrypted) ZFS on root; for server, see https://discourse.ubuntu.com/t/zfs-root-in-24-04/42274/9. It still uses zsys with the bpool/rpool split. Jim Salter suggests doing a manual debootstrap with ZFSBootMenu (which supports zpool compatibility=openzfs-2.1-linux) instead of GRUB (compatibility=grub2).

If we don’t go the ZFSBootMenu path, Dracut may be an improvement over Debian’s default initramfs-tools. Presumably ZFSBootMenu can only boot straight into ZFS and can only deal with native ZFS encryption, so if we want something like a YubiKey to work, ZFSBootMenu might be out. Actually, ZFSBootMenu boots into a ZFS filesystem that does not need to be the final root; it could hand over to an initrd?

Actually, there is support for different initrds in zfs/contrib:

Installers (probably do not use directly, but look at the code)

Encryption (keys, and send/recv of encrypted datasets/snapshots)

zfs allow

Swap/hibernation

# List the READMEs and man pages shipped by the Debian/Ubuntu ZFS packages
# (requires apt-file; -x treats the package name as a regex).
for package in zfs-dkms zfs-initramfs zfs-zed zfsutils-linux
do
    echo "$package"
    apt-file list -x "$package" \
    | awk '{print $2}' \
    | sed -n \
        -e 's#^/usr/share/\(doc\)/[^/]*/\(.*README.*\)#  \1 \2#p' \
        -e 's#^/usr/share/\(man\)/[^/]*/\([^.]*\)\.\(.*\)\.gz#  \1 \2(\3)#p' \
    | sort
done

find '/usr/share/man' -name 'z*concepts*':

Concepts

  • zpoolconcepts(8)
  • zfsconcepts(8)

Options

Partitioning

TODO: GRUB lacks support for many desirable ZFS features, which more or less forces a separate boot pool and complicates things. ZFSBootMenu (which is also a full-fledged, UEFI-only boot manager/loader) does not require this and may be a very nice alternative. It also allows easily booting many separate operating systems (distros) without the prohibitive overhead (administration, and to some degree storage) of several boot pools, i.e. several partitions that have to be carved out ahead of time.

ZFSBootMenu has a guide for Debian Bookworm (the latest one on openzfs-docs is only for Bullseye at the time of writing): https://docs.zfsbootmenu.org/en/latest/guides/debian/bookworm-uefi.html

The guide is a little unclear (to the uninitiated) about where the encryption passphrase file is stored (is it on the unencrypted EFI partition!? no). See https://old.reddit.com/r/zfs/comments/lkd10u/still_confused_as_to_how_zfsbootmenu_handles/ for a clarification.

The guide suggests dmesg | grep -i efivars to detect EFI support, but test -d /sys/firmware/efi is less brittle (the dmesg ring buffer might have rotated past the boot message, or some other unrelated log line might contain the grepped string).
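A minimal check (the directory only exists when the kernel was booted via UEFI):

test -d /sys/firmware/efi && echo 'booted via UEFI' || echo 'booted via legacy BIOS'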

The guide mentions a hostid, which was unfamiliar to me. As far as I can tell, ZFS uses it to mark which system currently “owns” a pool; the mark is cleared on pool export. gethostid is standardized by POSIX. ZFS uses it through the Solaris Porting Layer (now part of OpenZFS), and it is documented in the spl-module-parameters man page.
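To inspect and pin the hostid (zgenhostid ships with OpenZFS; whether to reuse the current value or generate a new one depends on the setup, so treat this as a sketch):

hostid                    # print the current hostid
zgenhostid "$(hostid)"    # write it to /etc/hostid so it stays stable across boots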

The ZFSBootMenu testing scripts could be a good reference when setting up a system: https://github.com/zbm-dev/zfsbootmenu/tree/master/testing, in particular the helpers directory.

TODO: It would probably be good to write a guide starting from the beginning:

  • BIOS vs UEFI
  • Different boot managers/loader and their respective ZFS support (GRUB, systemd-boot, ZFSBootMenu, rEFInd, the Linux EFI Stub Loader)
  • Partitioning requirements
    • “Self contained (i.e. include OS)” vs “dumb storage”
    • Flexibility (switch boot managers/loaders down the line?)
    • Swap? Encrypted swap? That supports hibernation?
    • Fast OS disk(s)?

The below assumes GRUB:

When whole disks are given to zpool create, it automatically partitions them to leave some slack for replacing disks with slightly smaller ones.

https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Workload%20Tuning.html#whole-disks-versus-partitions

Boot partition

https://wiki.archlinux.org/title/ZFS#GRUB-compatible_pool_creation

Root partition

Note that you cannot e.g. export and import a pool which contains the dataset that is mounted on / (or other important locations). If there is a pool that is expected to receive a lot of administrative action, maybe putting the root dataset on it is not a good idea.

Tuning

Sector size:

TLER:

Cache:

  • SLOG (“write cache”, a separate log vdev) can be removed with zpool remove, but if it fails while in use (together with a crash or power loss) the pending synchronous writes are lost. SLOG without redundancy is dangerous! (Example commands after this list.)
  • L2ARC (“read cache”) can be added and removed while in use.
    • Tweak l2arc_write_max?
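Example commands for adding and removing log/cache devices (device names are placeholders):

zpool add    "$pool" log mirror /dev/disk/by-id/nvme-A /dev/disk/by-id/nvme-B    # mirrored SLOG
zpool add    "$pool" cache /dev/disk/by-id/nvme-C                                # L2ARC
zpool remove "$pool" /dev/disk/by-id/nvme-C                                      # log and cache vdevs can be removed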

Note that a root dataset (with the same name as the pool) is created automatically when a pool is created.

Encryption

  • zfs get -p pbkdf2iters,pbkdf2salt
  • zfs get -p encryptionroot
  • encryption=off|on|aes-128-ccm|aes-192-ccm|aes-256-ccm|aes-128-gcm|aes-192-gcm|aes-256-gcm Cannot be changed after dataset creation (see the example after this list).
  • keyformat=raw|hex|passphrase Must be provided when encryption is set. Can be changed later with zfs-change-key
  • keylocation=prompt|file://</absolute/file/path> Defaults to prompt. Can be changed later with zfs-change-key
  • pbkdf2iters=iterations
  • encryptionroot
  • keystatus
  • https://www.youtube.com/watch?v=RmJMqacoPw4
  • https://www.youtube.com/watch?v=XOm9aLqb0x4
  • OpenZFS Leadership Meeting
  • openzfs pull requests
    • #9819: Document zfs change-key caveats
  • openzfs issues
  • #6624: Filesystem created via zfs send --raw between encrypted roots fails to mount
    • #12000: Repair encryption hierarchy of ‘send -Rw | recv -d’ datasets that do not bring their encryption root
    • #12123: After replicating the encrypted dataset and perform key inheritance on the target dataset (change-key -i), next incremental snapshot will break the dataset \ volume.
    • #12614: Replicating encrypted child dataset + change-key + incremental receive overwrites master key of replica, causes permission denied on remount
    • #12649: Encryption keys/roots management tools needed
    • #14011: OpenZFS cannot write/create/delete on parent dataset
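A minimal example of creating a natively encrypted dataset and inspecting its key properties (the dataset name is an assumption):

zfs create -o encryption=on -o keyformat=passphrase -o keylocation=prompt "$pool/secret"
zfs get encryption,encryptionroot,keystatus,keyformat,keylocation,pbkdf2iters "$pool/secret"
zfs unmount    "$pool/secret"
zfs unload-key "$pool/secret"
zfs load-key   "$pool/secret"    # prompts again, since keylocation=prompt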

April 2022 OpenZFS Leadership Meeting:

encryptionroot only refers to the wrapping key.

zfs-change-key(8):

If the user’s key is compromised, zfs change-key does not necessarily protect existing or newly-written data from attack. Newly-written data will continue to be encrypted with the same master key as the existing data. The master key is compromised if an attacker obtains a user key and the corresponding wrapped master key. Currently, zfs change-key does not overwrite the previous wrapped master key on disk, so it is accessible via forensic analysis for an indeterminate length of time.

In the event of a master key compromise, ideally the drives should be securely erased to remove all the old data (which is readable using the compromised master key), a new pool created, and the data copied back. This can be approximated in place by creating new datasets, copying the data (e.g. using zfs send | zfs recv), and then clearing the free space with zpool trim --secure if supported by your hardware, otherwise zpool initialize.
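A sketch of that in-place approximation (dataset names are assumptions; read zfs-send(8), zfs-recv(8) and zpool-trim(8) before trying this on real data):

# Copy the data into a new encrypted dataset via a non-raw send, so that it is
# re-encrypted with a new master key.
zfs create -o encryption=on -o keyformat=passphrase "$pool/new"
zfs snapshot "$pool/old@migrate"
zfs send "$pool/old@migrate" | zfs recv "$pool/new/copy"
# Destroy the old dataset and try to clear its blocks from the free space.
zfs destroy -r "$pool/old"
zpool trim --secure "$pool"    # if the devices support secure TRIM,
# zpool initialize "$pool"     # otherwise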

zpool-trim(8):

-d, --secure  Causes a secure TRIM to be initiated. When performing a secure TRIM, the device guarantees that data stored on the trimmed blocks has been erased. This requires support from the device and is not supported by all SSDs.

Matt Ahrens:

Kind of the point of encryption is to protect you if they have [physical] access to your disk.

Let’s not give the users the false idea that changing your passphrase actually does anything. Changing your passphrase is for “I don’t like typing that old thing anymore, I’m going to type some new thing”, not for “somebody knows my old passphrase, let me change it to one that people don’t know”.

Someone else coming to the same conclusion: https://old.reddit.com/r/zfs/comments/wk4t14/safely_remove_old_encryption_keys_and_some_other/

Compression

  • zpool create -O compression=on / zfs set compression=on (look at the lz4_compress pool feature)

Maybe worth looking at zstd for slow pools of spinning disks on file servers? https://github.com/openzfs/zfs/pull/10278 https://github.com/openzfs/zfs/commit/10b3c7f5e424f54b3ba82dbf1600d866e64ec0a0 https://news.ycombinator.com/item?id=23210491 lz4 implements an early abort which is currently not implemented for zstd, so the performance difference may be larger than expected for incompressible data. This only matters during writes, though, so if those are infrequent enough the difference may not matter. Also, if the data is compressed with lz4 specifically, it stays compressed in the ARC (decompression is fast enough that this is not a problem), effectively giving you a bigger ARC, which is very nice. Standardizing on one compression algorithm also avoids (is this implemented?) recompressing data on compressed send/receive.
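Switching algorithm only affects newly written blocks; a quick way to experiment (the dataset name is an assumption):

zfs set compression=zstd "$pool/archive"          # or e.g. zstd-9 for colder data
zfs get compression,compressratio "$pool/archive"

compressratio shows the ratio achieved for the data written so far.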

Auto-expand

  • zpool create -o autoexpand=on

Note that default filesystem properties can be set when creating a pool with -O (pool properties are supplied with -o).

TRIM

  • autotrim=on pool option (device support is detected automatically)
  • l2arc_trim_ahead kernel module option

Access time

  • zfs create -o atime=off (is relatime=on better?)

Other filesystem options

  • mountpoint=none
  • canmount=off
  • devices=off
  • acltype=posix
  • xattr=sa (recommended when acltype=posix)
  • dnodesize=legacy (default on OpenZFS 2.0.6)
  • normalization=formD

  • quota=10GB, reservation=1GB

Names

# set -- raidz2 sda sdb
codename="$(lsb_release -sc)"
pool='zpool'

options_pool='ashift=12 autotrim=on'
# TODO: Some of these are default?
# NOTE: sync=disabled trades the safety of synchronous writes for speed.
options_dataset='compression=on acltype=posix xattr=sa relatime=on normalization=formD dnodesize=auto sync=disabled'

zpool create "$pool" \
    $(echo "$options_pool"    | xargs -n1 printf '-o %s ') \
    $(echo "$options_dataset" | xargs -n1 printf '-O %s ') \
    "$@"

zpool export "$pool"
zpool import -d '/dev/disk/by-id/' "$pool"

Special datasets (creation sketch after the list):

  • /home/
  • /var/log/
  • /var/cache/
  • Media: zfs create -o secondarycache=metadata -o recordsize=1m
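A sketch of creating the datasets listed above (names and the exact option split are assumptions):

zfs create -o canmount=off -o mountpoint=/var          "$pool/var"
zfs create                                             "$pool/var/log"
zfs create                                             "$pool/var/cache"
zfs create -o mountpoint=/home                         "$pool/home"
zfs create -o recordsize=1M -o secondarycache=metadata "$pool/media"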

Permissions:

Look at zfs allow -d/-l and -c flags.

# NOTE: zfs allow operates on a dataset; if "$HOME" (a path) is not accepted,
# use the dataset name instead (e.g. rpool/USERDATA/${USER}_xxxx).
sudo zfs allow -u "$USER" snapshot,send  "$HOME"

# Also allow receiving backups and changing selected properties.
properties='compression'
sudo zfs allow -u "$USER" create,receive "$HOME"
sudo zfs allow -u "$USER" "$properties"  "$HOME"

Hibernation

https://www.combustible.me/blog/linux-mint-zfs-root-full-disk-encryption-hibernation-encrypted-swap.html

Encryption keys

libpam-zfs?

Backup / replication

People often like to use a third party tool to manage their snapshots, some sort of automated thing. Do you have favorite such tool?

I don’t have any favorite snapshot or replication management tool.

Matt Ahrens (Founding member of OpenZFS, one of the main architects of ZFS), 2018-04-04. Note that he doesn’t question the need for such a tool (and indeed their Delphix software may be one such tool).

Cold storage

Amazon Glacier?

https://jrs-s.net/2016/09/15/zfs-snapshots-and-cold-storage/

Articles

Booting

UEFI

This looks like a good reference: https://www.rodsbooks.com/efi-bootloaders/index.html.

${esp:-/boot/efi}/EFI/BOOT/BOOTX64.EFI is used for “ad-hoc” booting (e.g. from removable storage).

UEFI (Unified Extensible Firmware Interface)

The UEFI firmware decides which boot manager on the ESP (EFI system partition) to load.

UEFI variables are stored in firmware NVRAM and can be accessed from an operating system with the following programs:

  • Linux
    • efibootmgr
    • efivar --list
  • Windows
    • bcdedit /enum FIRMWARE (bcd stands for “Boot Configuration Data”)
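For example, on Linux (the entry number is just an illustration):

efibootmgr           # list BootCurrent, BootOrder and the Boot#### entries
efibootmgr -v        # ...including the device paths on the ESP
efibootmgr -n 0003   # boot entry Boot0003 on the next boot only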

Debian/Ubuntu

distro="$(  lsb_release --short --id)"
codename="$(lsb_release --short --codename)"
case "$distro" in

Debian)
# https://openzfs.github.io/openzfs-docs/Getting%20Started/Debian/index.html
sudo sh -c "cat > /etc/apt/sources.list.d/$codename-backports.list" << EOF
deb http://deb.debian.org/debian $codename-backports main contrib
EOF
sudo sh -c "cat > /etc/apt/preferences.d/90_zfs" << EOF
Package: src:zfs-linux
Pin: release n=$codename-backports
Pin-Priority: 990
EOF
sudo apt update
sudo apt install dpkg-dev linux-headers-generic linux-image-generic
sudo apt install zfs-dkms zfsutils-linux
;;

Ubuntu)
# https://openzfs.github.io/openzfs-docs/Getting%20Started/Ubuntu/index.html
sudo sh -c "cat > /etc/apt/sources.list.d/$codename-universe.list" << EOF
deb http://archive.ubuntu.com/ubuntu $codename main universe
EOF
sudo apt-get update
sudo apt-get install zfsutils-linux
;;

esac
sudo apt install zfs-zed
sudo modprobe zfs
sudo systemctl start zfs-zed.service

# TODO: Check systemd targets at <https://www.youtube.com/watch?v=ELdvHS3jtoY&t=6m48s>.

# TODO: Also install `samba`?
# sudo zfs set sharesmb=on "$pool/$dataset"
# sudo smbpasswd -a "$USER"

Partitioning

TODO: Look at zsys-setup for inspiration.

We want:

  • GPT
  • On “boot disk”
    • EFI partition: 512M, fat
    • Boot partition: 1G, ext4 or zfs-member (bpool) with some features turned off
    • Swap: encrypted?
  • On “storage disks”
    • ZFS: whole disk rounded and reduced for slack, zfs-member (rpool)
# https://www.youtube.com/watch?v=7F7Ch-ZkiQU
# https://en.wikipedia.org/wiki/GUID_Partition_Table#Partition_type_GUIDs
EFI_SYSTEM_PARTITION='C12A7328-F81F-11D2-BA4B-00A0C93EC93B'
LINUX_BOOT='BC13C2FF-59E6-4262-A352-B275FD6F7172'
LINUX_ROOT_X86_64='4F68BCE3-E8CD-4DB1-96E7-FBCAF984B709'
LINUX_FILESYSTEM_DATA='0FC63DAF-8483-4772-8E79-3D69D8477DE4'
SOLARIS_BOOT='6A82CB45-1DD2-11B2-99A6-080020736631'
SOLARIS_ROOT='6A85CF4D-1DD2-11B2-99A6-080020736631'
# TODO: Set the types for all partitions (`fdisk` can list the types with `L`
# at the `t` prompt).
# The heredoc below: create a GPT (g); partition 1: 512M and set its type
# (t, then alias 1 = "EFI System"); partition 2: 1G (boot); partition 3: 2G
# (swap). Empty lines accept the default partition number and first sector.
# Comments cannot go inside the heredoc, fdisk would read them as input.
sudo fdisk "$dev" << EOF
g
n


+512M
t
1
n


+1G
n


+2G
EOF
# NOTE: For NVMe devices the first partition is e.g. "${dev}p1".
mkfs.fat -F32 "${dev}1"

Persistent naming:
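Kernel names (sda, nvme0n1, ...) are not stable across reboots or controller changes; prefer the symlinks under /dev/disk/. A quick way to see the mapping:

ls -l /dev/disk/by-id/      # serial/WWN based, good for zpool commands
ls -l /dev/disk/by-path/    # based on the physical port
lsblk -o NAME,SERIAL,WWN,SIZE,TYPE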

SMART

Installation

Topology:

  • dataset (filesystem, can be mounted), zvol (device, can contain another filesystem), multiple of which live in a
    • zpool, which is a collection of
      • vdev, which is any of
        • single drive
        • mirror
        • RAIDZ1 (N drives, 1 drive parity)
        • RAIDZ2 (N drives, 2 drive parity)
        • RAIDZ3 (N drives, 3 drive parity)
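The resulting pool/vdev/device hierarchy can be inspected with:

zpool status                     # health plus the vdev tree
zpool list -v                    # capacity per vdev and per device
zfs list -t filesystem,volume    # datasets and zvols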

Maintenance considerations:

  • zpools are only as redundant as their least redundant vdev.
  • vdevs can generally not be removed from a zpool (zpool remove can evacuate single-disk and mirror top-level vdevs, but not RAIDZ).
  • vdevs can (only) be grown by replacing all the drives in them, one at a time (see the example after this list).
  • A RAIDZ vdev cannot grow in number of drives (unlike traditional md RAID); RAIDZ expansion only landed in OpenZFS 2.3.
  • Use /dev/disk/by-id/ names when adding drives.
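Growing a vdev by replacing its drives, sketched (device names are placeholders; wait for the resilver to finish before replacing the next drive):

zpool set autoexpand=on "$pool"    # let the pool grow into the new space
zpool replace "$pool" /dev/disk/by-id/OLD /dev/disk/by-id/NEW
zpool status "$pool"               # watch the resilver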

RAIDZ best practice:

  • Avoid RAIDZ1 (the same best practice as for RAID5: when a drive fails there is no redundancy left, and healing a RAID(Z) is an intense task).
  • The number of drives \(n\) in a RAIDZ\(N\) should be roughly \(2N < n \le 3N\) (e.g. 5 or 6 drives for RAIDZ2).
  • Remember that the parity math (and thus the performance cost) of RAIDZ\(N\) increases as \(N\) increases.

https://openzfs.github.io/openzfs-docs/Getting%20Started/Ubuntu/Ubuntu%2022.04%20Root%20on%20ZFS.html

lsblk -p -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT

Layout conventions (as used in e.g. Ubuntu 20.04 LTS):

id="$(lsb_release --id --short | tr '[:upper:]' '[:lower:]')" or id="$(. '/etc/os-release'; echo "$ID")"

  • bpool
    • BOOT
      • ${id}_${rand1}: /boot
  • rpool
    • ROOT
      • ${id}_${rand1}: /
        • $dir: /$dir (there are quite a few, nested)
    • USERDATA
      • root_${rand2}: /root
      • ${user}_${rand2}: /home/${user}

/etc/fstab:

  • UUID=... (EFI partition) -> /boot/efi (vfat)
  • /boot/efi/grub -> /boot/grub (bind)
  • /dev/mapper/cryptoswap -> none (swap)

/etc/crypttab:

  • /dev/nvme0n1p2 -> cryptoswap (/dev/urandom, swap,initramfs)

??

  • /dev/zvol/rpool/keystore -> /dev/zd0 -> /dev/mapper/keystore-rpool -> /run/keystore/rpool

References:

zsys

TODO: Move stuff here from other places in this document?

TODO: Make zsys-layout more robust, test it, and create a repo for it:

#!/bin/sh
set -euC

# https://web.archive.org/web/20200922112105/https://didrocks.fr/2020/06/16/zfs-focus-on-ubuntu-20.04-lts-zsys-dataset-layout/#why-so-many-system-datasets

# Usage: zsys-layout server|desktop
action="${1:?usage: $0 server|desktop}"

host_id="$(TODO)"

# System datasets that zsys splits out under rpool/ROOT/${id}_${rand}
# (see the article above).
datasets="$(printf "%s\n" \
    "srv" \
    "usr/local" \
    "var/games" \
    "var/lib" \
    "var/log" \
    "var/mail" \
    "var/snap" \
    "var/spool" \
    "var/www" \
)"

rename() {
    # Relies on word splitting of the newline-separated list, so do not quote.
    # TODO: The intermediate parents (e.g. "$2/usr" and "$2/var" with
    # canmount=off) must already exist.
    for dataset in $datasets
    do
        zfs rename "$1/$dataset" "$2/$dataset"
    done
}

case "$action" in
    'server')  rename "rpool/ROOT/$host_id" "rpool" ;;
    'desktop') rename "rpool" "rpool/ROOT/$host_id" ;;
esac

Encrypted backups

Debian 11

Linux Mint 21 (based on Ubuntu 22.04 LTS)

Monitoring

  • zed (ZFS Event Daemon)

Backups

  • zrepl
  • httm

Threat model

You are the “family admin”, tasked (by yourself, let’s face it) with setting up a storage solution. You cross-backup to/from several locations that are not adversarial but (technically) untrusted.

There are several ways in which the disks can leave your control:

  1. Controlled.

    Decommissioning. The drives have reached end of life and you want to get rid of them in a secure manner. This should include being able to sell them to recoup the cost of upgraded storage. You are not prepared to trust simply overwriting the bulk storage.

    Mitigation: Encryption, with the keys stored on media other than the bulk storage; media that either can be trusted to secure-erase or are cheap enough that physical destruction is acceptable (they can be reused for the same task with the new storage until one of those two options is ultimately carried out).

  2. Uncontrolled.

    Slipping through the cracks. Non-technical people tend to associate physical drives entirely with data availability and forget about data confidentiality. Once the data is copied to some upgraded storage, the old drives are simply (insecurely) discarded. There are plenty of ways in a family setting for drives to slip out of the hands of the security-conscious admin (“I thought I didn’t need them anymore, so I sold them”).

    Theft.

    Mitigation: Passphrases on keys. Preferably strong, even more preferably hardware-based (on hardware unlikely to be lost together with hardware holding data). User ergonomics matter here.

  3. Destruction.

    Natural disasters or other events that leave data/key/passphrase-holding hardware unrecoverable.

    Mitigation: Off-site backups. Including intermediary keys and passphrases (and their key derivation parameters). To offline media such as paper or USB-sticks. If hardware is used for passphrases, these can additionally be backed up to identical hardware for faster recovery.

See also the Linux fscrypt threat model.