ZFS

ZFS Boot Menu (a bootloader, like GRUB) has great introductory material: https://docs.zfsbootmenu.org.

Note that in Debian, ZFS-related packages live in contrib.

Distros with ZFS in installer

Ubuntu-based distros (which? versions?) use Ubiquity in their live images, which has (experimental?) support for installing to ZFS. This (sometimes?) seems to require ZFS-related packages to be installed manually before starting Ubiquity:

sudo apt install zfsutils-linux zfs-initramfs zfs-zed

Debian Live uses Calamares, which has a ZFS module; non-Live uses Debian-Installer (ZFS support?).

Red Hat/Fedora-related distros use Anaconda (ZFS support?).

sudo zpool history "$pool" might be handy for figuring out what zpool commands an installer ran. Obviously, reading the source of the installers is better when determining what algorithm they used to arrive at those commands.

Software

apt-cache search -- -initramfs

  • clevis: automated encryption framework
    • clevis-initramfs
    • clevis-dracut
  • ZFS
    • zfs-initramfs
    • zfs-dracut

Booting

Bootstrapped install

  • Debian-based: debootstrap, chroot (see the sketch below)
  • Fedora-based: dnf --installroot, systemd-nspawn -b
  • Arch-based: pacstrap, arch-chroot
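
A minimal sketch of the Debian-based flow onto an existing pool (untested; the pool, dataset, and suite names are assumptions, and tank/ROOT is assumed to exist):

sudo zpool import -R /mnt tank                 # altroot keeps all mounts under /mnt
sudo zfs create -o mountpoint=/ -o canmount=noauto tank/ROOT/debian-12
sudo zfs mount tank/ROOT/debian-12
sudo debootstrap bookworm /mnt
# Bind-mount /dev, /proc and /sys into /mnt, enable contrib (see the note above), then:
sudo chroot /mnt apt install zfsutils-linux zfs-initramfs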

Delegation

(The examples assume a dataset layout that we have not yet standardized on.)

Descendants of $HOME

host="$(hostname)"
user="$USER"
dataset="tank/$host/HOME/$user"
sudo zfs allow -u "$user"    create,mount,rollback,snapshot "$dataset"
sudo zfs allow -u "$user" -d destroy                        "$dataset"

Linux restricts mount operations to root, which results in the following message when trying to zfs create a dataset under $dataset:

filesystem successfully created, but it may only be mounted by root

As long as the dataset is not mounted, zfs destroy does in fact destroy the dataset.
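
With that delegation in place, the user can manage descendants without sudo (a sketch; names follow the example above):

zfs create "$dataset/scratch"         # prints the “only be mounted by root” note
zfs snapshot "$dataset/scratch@empty"
zfs rollback "$dataset/scratch@empty"
zfs destroy -r "$dataset/scratch"     # works since the dataset was never mounted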

Backups

host="some-remote"
user="backup-$host"
dataset="tank/$host"
# `-d` flag untested
sudo zfs allow -u "$user" -d receive "$dataset"
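
Replication from the source host could then look roughly like this (a sketch; “backup-server” and the dataset names are assumptions):

# On the host being backed up (whose hostname is $host above):
snap="tank/$(hostname)/HOME/alice@$(date -u +%F)"
zfs snapshot "$snap"
# -w sends raw (still-encrypted) blocks; -u keeps the received dataset unmounted.
# (Delegated receive may also need create and mount permissions on the target.)
zfs send -w "$snap" | ssh "backup-$(hostname)@backup-server" \
  zfs receive -u "tank/$(hostname)/HOME/alice"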

Encryption

TODO: Threat model (we have one written down somewhere).

Since ZFS does not give strong guarantees of the effectiveness of key rotation with zfs change-key (see the man page), key rotation should be handled outside of ZFS. Make sure that during the creation of a new key, it is not temporarily stored on ZFS encrypted with the old key! Since we want all our important data to be stored on encrypted ZFS, flash storage with verified working secure erase or a hardware token are the only viable places to store (and rotate) keys (potentially wrapping the ZFS “wrapping key”). If any translation is required in order to get the secret provided by a hardware token into the correct format for zfs load-key, that translation should be stored with the pool/dataset to prevent unintended loss of access. One option is a zvol (Ubuntu), another is a dataset property (pivy-zfs). Since this translation is stored on ZFS, it should only provide translation and must not be relied on for security!

zvols have the advantage that they can be formatted with LUKS, which has good integration with other tools, such as initramfs-tools and hardware tokens. We don’t want to customize things regarding early boot and encryption too much.

Another advantage of zvols is that it’s very obvious where the keys live: they show up directly in zfs list, so you don’t have to look at e.g. individual properties. It’s also obvious whether or not the keys are being replicated.
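
A rough, untested sketch of the zvol approach (the zvol name, size, and mount path are assumptions; the key zvol itself must of course not be encrypted with the key it stores):

sudo zfs create -V 32M tank/KEYS/tank_ROOT
sudo cryptsetup luksFormat /dev/zvol/tank/KEYS/tank_ROOT
sudo cryptsetup open /dev/zvol/tank/KEYS/tank_ROOT zfskeys
sudo mkfs.ext4 /dev/mapper/zfskeys
sudo mkdir -p /run/zfskeys && sudo mount /dev/mapper/zfskeys /run/zfskeys
# Store the raw wrapping key on it and point the encrypted dataset at it, e.g.:
# sudo zfs set keylocation=file:///run/zfskeys/tank_ROOT.key tank/ROOT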

Where did my free space go?

Too many snapshots?

zfs list -o name,usedbydataset,usedbysnapshots -S usedbysnapshots
zfs list -o name,used,creation -t snapshot "$dataset"
# zfs destroy -vn "$dataset@$snap1%$snap2"

Feeling tempted to enable deduplication? (Running the simulation below will probably put your mind at ease and confirm that, no, enabling dedup would not help. Either way, consider using symlinks/hardlinks or some other manual way of handling duplicated files; dedup is expensive.)

sudo zdb -S "$pool"

Pool/dataset layout

Wishlist:

  • Avoid pool names that could be confused with other things. We want commands to error out if we misunderstand the syntax instead of “reinterpreting” what we mean. Do not name pools zpool or $(hostname).
  • Datasets that are replicated should have globally unique names.
  • We can do better than Ubuntu’s randomly generated dataset suffixes.

Questions:

  • Where do we store backups? Does a separate $hostname-pool/BACK hierarchy gain us anything? How do we make sure that the backup host does not try to mount received datasets (specifically, on reboot)?
  • Should key datasets be named after the key (hardware token name, hardware token owner, etc) or the dataset it unlocks?

Work in progress suggestion:

os="$(
  if [ -e /etc/lsb-release ]
  then
    . /etc/lsb-release
    printf '%s' "$DISTRIB_DESCRIPTION"
  elif [ -e /etc/os-release ]
  then
    . /etc/os-release
    printf '%s' "$PRETTY_NAME"
  fi \
  | tr    '[:upper:]' '[:lower:]' \
  | tr -c '[:alnum:]_.:' '-'  # only keep characters valid in dataset names
)"
key="$(printf '%s' "$dataset" | tr '/' '_')"
$hostname-pool/KEYS/$hostname/$key
$hostname-pool/ROOT/$hostname/$os
$hostname-pool/HOME/$hostname/$user
$hostname-pool/DATA/$hostname/$data
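
As an illustration only (a hypothetical host “alpha” running Ubuntu 24.04.1 LTS with user “alice”), this expands to roughly:

alpha-pool/KEYS/alpha/alpha-pool_ROOT_alpha_ubuntu-24.04.1-lts
alpha-pool/ROOT/alpha/ubuntu-24.04.1-lts
alpha-pool/HOME/alpha/alice
alpha-pool/DATA/alpha/media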

Learning

Talks:

RAIDZ space efficiency

There are two ways to “waste space” in RAIDZ:

  1. Not maxing out the number of data sectors for each parity sector.
  2. Forcing ZFS to add padding (“skip” sectors) so that freed space cannot end up as holes too small to store any useful data.

We only consider a single vdev. Some definitions:

  • data: Number of “usable data” disks.
  • parity: Number of “parity” disks.
  • disks = data + parity: Total number of disks.
  • sectorsize = pow(2, ashift): Size in bytes of chunks in which pools ask disks for usable space. This should, but does not have to, correspond to actual disk sector sizes. Valid: 512-64K, default depends on disks used on pool creation. We assume 4K, i.e. an ashift of 12.
  • recordsize: (Maximum) size in bytes of chunks in which filesystem datasets ask pools for usable space. Valid: powers of 2 in 512-128K, default: 128K.
  • volblocksize: Size in bytes of chunks in which volume datasets ask pools for usable space. Valid: powers of 2 in 512-128K, default: 8K.
  • recordsectors = recordsize / sectorsize: The number of sectors in a record.

Number of data disks

A dataset asks the underlying pool to give it usable space in blocks of size (at most) recordsize. For each started “stripe” row of (up to data) data sectors, RAIDZ adds parity parity sectors. So to fully take advantage of that parity, we want recordsectors to be evenly divisible by data. Since recordsectors is a power of 2 (i.e. has 2 as its only prime factor), data also needs to be a power of 2 (i.e. have 2 as its only prime factor):

recordsectors % data == 0
recordsectors % (disks - parity) == 0

data = 2^n, for some integer n

(recordsectors / data * disks) % (parity + 1) == 0
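
Worked example (4K sectors and the 128K default recordsize, so recordsectors = 32): a 6-wide RAIDZ2 has data = 4 and parity + 1 = 3; 32 % 4 == 0, and the allocation of 32 / 4 * 6 = 48 sectors satisfies 48 % 3 == 0, so a full record loses nothing to partial rows or padding. A 4-wide RAIDZ1 (data = 3) already fails the first check, since 32 % 3 == 2.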

Misc

Jim Salter’s “blessed” topologies (see also Jim Salter’s “cursed” topologies):

  • Mirrors: Fast, easy, great.
  • RAIDz1: 2^N+1, 1 <= N <= 1, i.e. 3 (redundancy ratios 1/3).
  • RAIDz2: 2^N+2, 1 <= N <= 3, i.e. 4, 6, 10 (redundancy ratios 1/2, 1/3, 1/5).

You want data to evenly divide recordsectors == recordsize / sectorsize. Since both recordsize and sectorsize are powers of two, i.e. their only prime factor is 2, data also needs to have 2 as its only prime factor.

Why the big difference in range of N in RAIDz levels? Probability of any single drive failing scales linearly with number of drives. Probability of multiple drives failing decreases exponentially. TODO: Expand on this.

Note that this gets less straightforward when compression=on (which you probably want) and you have compressible data, see ZFS RAIDZ stripe width, or: How I Learned to Stop Worrying and Love RAIDZ 2014-06-05 by Matthew Ahrens on the Delphix blog.

6-wide RAIDZ2 “complies” with both these recommendations.

When changing recordsize/volblocksize, which is mostly relevant for databases on ZFS filesystems or for ZFS volumes (neither of which is very interesting to us), see “RAIDZ on-disk format vs. small blocks” (2019-06-25, Mike Gerdts, Joyent Technical Discussion) and the accompanying comment.

Note that recordsize/volblocksize is part of zfs send streams! If you’re going to do something fancy, make sure it makes sense on all (current and future!) replication targets.

(“ashift is an implementation detail, the user tunable should be called sectorsize and expressed in bytes” — Alan Jude)

sectorsize = 4 * 1024  # "Advanced Format Drive", i.e. ashift=12
recordsize = 128 * 1024  # default
recordsectors = recordsize // sectorsize
for parity in [1, 2, 3]:
    for disks in range(parity+2, 4**(2**(parity-1))+1):
    # for disks in range(parity+2, 64+1):
        data  = disks - parity
        ratio = parity / disks
        alloc = parity + 1
        # Data-sector slots left unfilled in the last (partial) stripe row:
        left = (data  -  recordsectors                  % data ) % data
        # Padding ("skip") sectors needed to round the allocation up to a multiple of parity + 1:
        skip = (alloc - (recordsectors // data * disks) % alloc) % alloc
        if left != 0 or skip != 0:
            continue
        # if left != 0:
        #     continue
        # if left == 0 or skip == 0:
        #     continue
        print(f"RAIDZ{parity} {disks:>2} disks: ratio={ratio:.2f} left={left} skip={skip}")
        # for recordsize in range(recordsize_max, -sector, -sector):
        #     skip = sector * ((alloc - (recordsize // sector + parity) % alloc) % alloc)
        #     aligned = recordsize % (sector * (disks - parity)) == 0
        #     if skip == 0 and aligned:
        #         break
        # print(f"RAIDZ{parity} {disks:>2} disks: ratio={ratio:.2f} recordsize={recordsize//1024}K/{recordsize//sector}s")

RAIDZ on-disk format vs. small blocks

TODO

TODO:

  • Ubuntu
  • linuxmint-21.3-cinnamon-64bit.iso
  • linuxmint-22-cinnamon-64bit.iso
  • TrueNAS-SCALE-24.04.2.3.iso
  • Zorin-OS-17.2-Core-64-bit.iso
  • cachyos-desktop-linux-241003.iso

Is there a benefit of having pool names be unique (i.e. not zpool, bpool, rpool, etc)?

https://github.com/rlaager/zfscrypt

ZFS on Linux mailing list (was https://list.zfsonlinux.org/pipermail/zfs-discuss/?)

TechSNAP podcast:

  • 2019-10-18 Rooting for ZFS We dive into Ubuntu 19.10’s experimental ZFS installer and share our tips for making the most of ZFS on root.
  • 2018-05-02 Catching up with Allan We catch up with Allan Jude and he shares […] some classic ZFS updates.
  • 2017-04-18 Tales of FileSystems Dan dives deep into the wonderful world of ZFS and FreeBSD jails & shows us how he is putting them to use in his latest server build.
  • 2017-03-22 Check Yo Checksum […] a few more reasons you should already be using ZFS.
  • 2015-07-09 ZFS does not prevent Stupidity

Companies:

  • Nexenta
  • Delphix
  • Joyent
  • Datto

People:

Pull requests

ZED_EMAIL_{ADDR,PROG,OPTS} in /etc/zfs/zed.d/zed.rc.

zed_notify_email() in /etc/zfs/zed.d/zed-functions.sh.
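
For reference, enabling email notifications boils down to something like this in zed.rc (the values here mirror the commented-out defaults; adjust to taste):

ZED_EMAIL_ADDR="root"
ZED_EMAIL_PROG="mail"
ZED_EMAIL_OPTS="-s '@SUBJECT@' @ADDRESS@"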

Talks:

Articles:

Leadership meetings

Agenda and notes on Google Docs.

Encryption

With notes from “Agenda and notes”.

  • 2024-06-18
  • 2024-03-26
    • As the ZFS encryption layer developer isn’t around anymore, is ZFS native encryption in its current state (as of 2.1.15 and 2.2.3) usable for production systems, and if not, when might it be?
  • 2023-01-31
    • I recently submitted PR#14249, adding Chacha20-Poly1305 as an encryption option. I’d like to talk to someone about next steps towards getting this PR reviewed and merged. (Rob N)
  • 2023-01-03
    • I recently submitted PR#14249, adding Chacha20-Poly1305 as an encryption option. I’d like to talk to someone about next steps towards getting this PR reviewed and merged. (Rob N)
    • Should the documentation have warnings about using native encryption, similar to dedup? (Rich E)
      • Several people who got burned by native encryption recently asked me why there were no warnings around it if it has known bad failure modes, and I didn’t really have a good answer.
      • Klara investigating: https://github.com/openzfs/zfs/issues/12775
  • 2022-09-13
  • 2022-06-21
    • Multiple user-keys for encryption (Jonathan Waldrep)
      • UX proposal: use @ notation on the properties to distinguish different keys (keyformat@key1, keyformat@key2, …)
  • 2022-04-26
    • Michael Dexter: Can OpenZFS native encryption be recommended with confidence or specific caveats, not unlike deduplication?
      • send/recv has several known bugs.
      • Standalone (= without send/recv) has few bugs.
        • Illumos has an open bug related to projectquota upgrades.
        • Alex Motin: some TrueNAS / FreeNAS users are using native encryption already, plan is to only support it, instead of GELI, in future releases.
        • zfs change-key has an open issue
  • 2022-02-01
    • Encryption Bugs (Rich Ercolani)
      • (Was not able to make it to the meeting, summary written ahead of time.)
      • One set of bugs (#12981 et al) got a workaround from George Amanakis - thanks George!
      • Another bug (#12720) manifested as an error during raw send/recv. The underlying cause is faulty on-disk dnodes with contradicting bonuslength/spill pointer flag in < 0.6.4 versions of OpenZFS. Related PR #13014.
      • WIP PR #12943 for issue #11679 had an issue reported, going to try and ameliorate with even more locking, but could use someone with familiarity with send/recv a/o the ARC code to help, as I’m pretty sure this is just papering over something being done incorrectly.
        • Issue #11679 has drawn lots of attention. Someone familiar with the DMU should take a look.
      • WIP I should circulate for extending FORCE_INHERIT/FORCE_NEW_KEY to allow you to escape situations like #12614, need to write more tests, feel free to ping me if you’re in this situation and want to try it. (Also trying to figure out what a reasonable thing to do in most cases when you receive a change-key in an incremental send is - so far, all of the options seem to violate POLP sometimes.)
        • Downside: not insignificantly sized foot-gun to allow you to change-key -f
        • Storing last N copies of the wrapped key and trying them all would help you avoid that, but then you have N ways to unlock the key…
      • Someone reported issues with receiving unencrypted under an encrypted and not unlocked parent (#13033; not data loss or anything, just mentioning for completeness)
  • 2022-01-04
  • 2021-12-07
    • Native encryption needs some work (Rich Ercolani)
    • Spreadsheet of encryption bugs
    • Biggest and scariest: incorrect dnode refcounting
      • Panic stacks look very different, trapping at different ASSERTs
      • Not reproducible on x86(_64), but on SPARC and on PowerPC
      • Reproduces within 24 hours on a PowerPC 64 KVM VM (but not qemu-emulated)
    • Earliest occurrence dates all the way back to introduction of native encryption
  • 2021-10-12
    • Encryption Incompatible On-Disk Format Change for illumos and OSX - (Jason King)
  • 2020-01-07
    • Any further comments on this Ubuntu developer’s proposal to enable encryption by default with a fixed passphrase: https://bugs.launchpad.net/bugs/1857398 (rlaager)
      • Is there any additional feedback that we haven’t already provided them? Feel free to add in the thread.
      • Tangent on secure-erase and changing encryption keys:
        • Secure erase and not having dataset’s being partially secure were two of the main motivators on the current encryption design.
        • The latter is ensured by setting encryption on at creation but not guaranteed as the encryption key may later be changed and will affect only new blocks.
        • Even if encryption key (user/wrapping key) is changed, new blocks can be read/manipulated if the old (compromised/public) key is known, and the old-key-wrapped master key is found (e.g. by forensic analysis of the disk).
        • The trade-off being usability and practicality varies wildly between cases.
      • Given the above tangent, we should really understand what the Canonical folks want to do and try to come up with the best practices and design. This includes communicating best practices, and potentially implementing something different than what we currently have.
      • We need to be more clear about the security implications of “zfs change-key”
    • Change encryption=on from aes-256-ccm to aes-256-gcm? See especially the comments starting here: https://github.com/zfsonlinux/zfs/pull/9749#issuecomment-568633557 (rlaager)
      • The two main motivators of this proposal are security and performance.
        • From a security standpoint, Mozilla and TLS default to gcm.
        • According to Richard’s estimates, performance could get a ~3x improvement with gcm.
      • There seems to be an overall consensus on this but we should really check with Tom Caputi.
    • encryption: from_ivset_guid check missing on resumed recv? (Christian Schwarz)
      • turned into issue post-meeting, do not include in upcoming agenda
  • 2019-12-10
    • Feature Request: Encryption to work with dedup across multiple datasets - Tom:
      • Today different “clone families” have different master keys (the key actually used to encrypt the blocks), so blocks with the same plaintext will have different ciphertext if they are in different clone families - even if they have the same wrapping (user) key (i.e. same/inherited keysource property)
      • Want to add a mechanism to have the same master key for different clone families
      • Need to design user interface to make it clear what’s going on.
      • Suggestion: use a property to indicate that all children have the same master key
        • Need to work out the details of how this would interact with things like rename (into / out of the “same master key hierarchy”)
  • 2019-05-28
    • Request from sef! The On-Disk Format document is out of date, and doesn’t reflect recent changes (e.g., encryption). It would be really helpful to update this and keep it up-to-date going forward (Perhaps in a semi-automated way?)
      • Tom: The info about how things look today exists in various places, so it is just a question of pulling it together into a single place.
      • Matt: We only have a pdf for that, and we probably can’t just copy-paste it somewhere else. It would be good to have a document with this info. Let’s put out a call for a volunteer to do this.
        • Paul pointed out that it is licensed under a Berkeley License
  • 2019-01-29
    • Mixing raw/non-raw send streams (Tom)
      • Last(?) encryption issue (that we know of!)
      • Tom is working on the PR
  • 2018-11-06
    • Status update on encryption ports
      • Update from Jason King: continuing to try to reproduce reported issue
      • Update from Sean Fagan & Matt Macy: still seeing some ztest failures, need to investigate

Partitioning

With notes from me.

  • 2022-09-13
    • On Linux and Illumos, but not FreeBSD, if you give ZFS a whole disk it will partition it with GPT with a 1MB-aligned partition and internally mark it as “whole disk” (Mark it how? Can that be checked after the fact? See the sketch below.). One advantage of partitioning is that other tools will recognize that ZFS is using the disk and not accidentally clobber it.
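
A guess (untested) at how to check this after the fact: the whole_disk flag is recorded in the vdev label, which zdb can print.

sudo zdb -l /dev/sda1 | grep whole_disk   # member partition name is an example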

Encryption, how does it work??

The encryptionroot dataset property specifies which other dataset owns the wrapping key. Different datasets will always have different master keys, except if one is a clone of the other. I think the rationale was to try as hard as possible to never share keys, but when cloning that’s impossible since the datasets literally share encrypted blocks (so obviously they need the same key to read those blocks).

https://www.youtube.com/watch?v=fTIJgGpV3HE&t=38m25s
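
To see how this looks on a real system, the relevant properties can be listed per dataset (the pool name is an example):

zfs get -r encryptionroot,keystatus,keyformat,keylocation tank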

Terminology (somewhat unofficial):

  • Clone family: Consists of a dataset and clones of snapshots of that dataset. Datasets within a clone family are the only datasets that share encryption master keys. For encrypted datasets, deduplication can therefore only happen between datasets in the same clone family.

my entire backup regime is all built on encryption and replication

Rob Norris robn, ZFS Developer, 2023-11-03

im just not really an active member of the project at the moment and I don’t have the bandwidth to get back into it. I would be more than happy to talk to anyone / review any work someone makes towards a PR though.

Tom Caputi tcaputi, ZFS Native Encryption author, 2023-11-19

Pull requests, and possible workarounds:

Pool layout

  • Do pool names need to be unique? Being able to import every pool under our care into the same system at once might be useful in disaster recovery. In that case, prefix pool names with $(hostname) (or $(hostname | cut -d- -f1))?
-   {tank,???}
    -   ROOT
        -   "$(hostname)"
            -   "$(. "/etc/os-release"; echo "$ID-$VERSION_ID")" ...
    -   HOME
        -   "$(hostname)"
            -   "$USER" ...