
ZFS File System

ZFS is a 128-bit file system developed by Sun Microsystems.

Design principles: 

1. It does for storage what virtual memory does for memory.
2. It always keeps the data on the disk(s) consistent.

History of File systems 

1. (Beginning) The file system managed a single hard disk.
2. (Volumes) Software (e.g. SVM) is inserted between the FS and the physical disk(s); it manages all physical disks, creates volumes and presents each volume to the FS as one reliable disk.
3. (ZFS) One or more ZFS file systems sit above a pool of hard disks.
	There are no partitions.
	Pool size grows automatically when new disks are added (just as RAM grows when a new memory module is added) - see the zpool add example after this list.
	All storage in the pool is shared among all ZFS file systems.
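For example, growing an existing pool is a single command (the pool and disk names here are just an illustration); every ZFS file system in the pool sees the extra space right away:
# zpool add pool-1 c0t4d0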

FS/Volume I/O Stack characteristics: 

A loss of power leaves inconsistent data on the disks, so fsck is a frequently used tool to correct the inconsistency.
The solution to this problem is journaling - logging changes to a journal before committing them to the FS. After a crash, recovery involves replaying the changes from the journal until the FS is consistent again.

1. Write to the journal: "I will rename file-1 to file-2".
2. Perform the I/O on disk to accomplish the renaming.
3. After renaming, go back to the journal and record "Renaming successful".
4. If power is lost in the middle of step 2, there is an inconsistency on the disk.
5. Recovery checks the journal, sees that a rename was in progress when the power went off, and so knows exactly where to go to fix the inconsistency.

ZFS I/O Stack characteristics:

ZFS is a transactional FS (just as an online banking payment succeeds or fails as a whole, any operation on ZFS succeeds or fails as a whole).
Live data is never overwritten (the initial block with consistent data stays intact - it can become a snapshot - while the new blocks with new data become the consistent state); see the snapshot sketch after the list below.

1. There is a DMU (Data Management Unit) between ZFS and the disk pool that performs transactions for ZFS.
2. ZFS tells the DMU: "I want to perform this list of operations in order to rename the file. Do all of them; if you cannot do all of them, then do none of them."
3. The DMU takes the list of operations (steps) and creates a transaction group.
4. The DMU commits the transaction group to the pool, so the operations from step 2 are done "all or nothing".
5. The DMU also doesn't overwrite existing data, so the FS is always consistent.
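Because live blocks are never overwritten, taking a snapshot is almost free - it simply keeps referencing the old blocks. A minimal sketch (the snapshot name is made up, and the file system is one of those created further down this page):
# zfs snapshot pool-1/home/user1@monday
# zfs rollback pool-1/home/user1@monday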

ZFS also does disk scrubbing - it scans the disks in the pool, validates checksums and, if there is an error, corrects it.
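A scrub can also be started by hand, and its progress watched with zpool status (the pool name matches the one created in the examples below):
# zpool scrub pool-1
# zpool status pool-1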

Creating zpool and zfs (and working with them) 
 
Just run zpool or zfs with no arguments and the OS tells you that a command is missing and shows the usage.
# zpool ( or zfs)
missing command
usage: zpool (or zfs) command args ...
where 'command' is one of the following:
If you are not sure what a command would do, use option -n to see what would happen without actually doing anything (add -f to force the operation and ignore possible errors).
# zpool create -fn pool-1 c0t2d0 c0t3d0
would create 'pool-1' with the following layout:

        pool-1
          c0t2d0
          c0t3d0
Let's create a pool of 2 disks, each 146G in size.
# zpool create -f pool-1 c0t2d0 c0t3d0
The pool is now created, along with a ZFS file system on top of it (already mounted).
# zpool list
NAME     SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
pool-1   272G     90K    272G     0%  ONLINE     -

# zfs list
NAME     USED  AVAIL  REFER  MOUNTPOINT
pool-1    87K   268G  24.5K  /pool-1
ZFS file systems are cheap - easily created and deleted.
# zfs create pool-1/home
# zfs create pool-1/home/user1
# zfs create pool-1/home/user2
# zfs list
NAME                USED  AVAIL  REFER  MOUNTPOINT
pool-1              180K   268G  25.5K  /pool-1
pool-1/home        76.5K   268G  27.5K  /pool-1/home
pool-1/home/user1  24.5K   268G  24.5K  /pool-1/home/user1
pool-1/home/user2  24.5K   268G  24.5K  /pool-1/home/user2
Destroy one ZFS file system (note there is no leading / in front of pool-1; the leading / belongs to the mount point only).
# zfs destroy pool-1/home/user2
Destroy a parent ZFS file system together with its children (use -r).
# zfs destroy -r pool-1/home
See the results - all the cheap ZFS file systems are gone.
# zfs list
NAME     USED  AVAIL  REFER  MOUNTPOINT
pool-1    90K   268G  24.5K  /pool-1
Basically, ZFS is much easier to manage than UFS. Mounting ZFS doesn't require editing the /etc/vfstab file.
# zfs set mountpoint=/mnt pool-1
Get info about the quota and mountpoint properties.
# zfs get quota,mountpoint pool-1
NAME    PROPERTY    VALUE       SOURCE
pool-1  quota       none        default
pool-1  mountpoint  /mnt        local
Set quota.
# zfs set quota=10g pool-1

# zfs get quota,mountpoint pool-1
NAME    PROPERTY    VALUE       SOURCE
pool-1  quota       10G         local
pool-1  mountpoint  /mnt        local
Get ALL info.
# zfs get all pool-1
NAME    PROPERTY       VALUE                  SOURCE
pool-1  type           filesystem             -
pool-1  creation       Fri Nov 13  9:50 2009  -
pool-1  used           97K                    -
pool-1  available      10.0G                  -
pool-1  referenced     24.5K                  -
pool-1  compressratio  1.00x                  -
pool-1  mounted        yes                    -
pool-1  quota          10G                    local
pool-1  reservation    none                   default
pool-1  recordsize     128K                   default
pool-1  mountpoint     /mnt                   local
pool-1  sharenfs       off                    default
pool-1  checksum       on                     default
pool-1  compression    off                    default
pool-1  atime          on                     default
pool-1  devices        on                     default
pool-1  exec           on                     default
pool-1  setuid         on                     default
pool-1  readonly       off                    default
pool-1  zoned          off                    default
pool-1  snapdir        hidden                 default
pool-1  aclmode        groupmask              default
pool-1  aclinherit     secure                 default
pool-1  canmount       on                     default
pool-1  shareiscsi     off                    default
pool-1  xattr          on                     default
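Most of these properties can be changed the same way as quota and mountpoint. For example, compression (off by default above) can be turned on per file system, and compressratio then shows the savings:
# zfs set compression=on pool-1
# zfs get compression,compressratio pool-1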
See the available space, which is limited by the 10G quota.
# zfs list
NAME     USED  AVAIL  REFER  MOUNTPOINT
pool-1    97K  10.0G  24.5K  /mnt
But we have much more space left in the pool.
# zpool list
NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
pool-1                  272G    103K    272G     0%  ONLINE     -
Create a new ZFS file system, setting its quota at creation time.
# zfs create -o quota=5g pool-1/jumpstart
# zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
pool-1             126K  10.0G  24.5K  /mnt
pool-1/jumpstart  24.5K  5.0G  24.5K  /mnt/jumpstart
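A quota caps how much a file system can use; the opposite is a reservation, which guarantees it space. It is set the same way (the 2g value is just an illustration):
# zfs set reservation=2g pool-1/jumpstart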
Sharing a file system is also easy - no need to work with the /etc/dfs/dfstab file.
# zfs set sharenfs=on pool-1/jumpstart

# showmount -e my-hostname
export list for my-hostname:
/mnt/jumpstart (everyone)

# zfs get sharenfs
NAME              PROPERTY  VALUE             SOURCE
pool-1            sharenfs  off               default
pool-1/jumpstart  sharenfs  on                local
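From an NFS client the shared file system mounts as usual (the client-side mount point /mnt/js is just an example):
# mount -F nfs my-hostname:/mnt/jumpstart /mnt/js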
Managing disks in zpool

Although slices can be members of a zpool, I do not think many people use them - just go with whole disks. ZFS makes one slice over the whole disk and formats the disk using an EFI label:
Total disk sectors available: 286722910 + 16384 (reserved sectors)
Part      Tag    Flag     First Sector         Size         Last Sector
  0        usr    wm                34      136.72GB          286722910
  1 unassigned    wm                 0           0               0
  2 unassigned    wm                 0           0               0
  3 unassigned    wm                 0           0               0
  4 unassigned    wm                 0           0               0
  5 unassigned    wm                 0           0               0
  6 unassigned    wm                 0           0               0
  8   reserved    wm         286722911        8.00MB          286739294
The EFI (Extensible Firmware Interface) label takes 34 sectors (17KB = 34 x 512B), and slice 8 holds some additional system information. There is no information about disk cylinders. EFI supports physical disks and virtual volumes larger than 2 terabytes.
NOTE: Compare this with a disk formatted using an SMI (Sun Microsystems, Inc.) label - some people call it VTOC (Volume Table Of Contents). (Example from a SunFire X4200.)
Total disk cylinders available: 8921 + 2 (reserved cylinders)
Part      Tag    Flag     Cylinders        Size            Blocks
  0       root    wm     524 - 1046        4.01GB    (523/0/0)    8401995
  1       swap    wu       1 -  523        4.01GB    (523/0/0)    8401995
  2     backup    wm       0 - 8920       68.34GB    (8921/0/0) 143315865
  3        var    wm    1047 - 1569        4.01GB    (523/0/0)    8401995
  4 unassigned    wm    1570 - 4180       20.00GB    (2611/0/0)  41945715
  5 unassigned    wm    4181 - 8887       36.06GB    (4707/0/0)  75617955
  6 unassigned    wu       0               0         (0/0/0)            0
  7 unassigned    wm    8888 - 8920      258.86MB    (33/0/0)      530145
  8       boot    wu       0 -    0        7.84MB    (1/0/0)        16065
  9 unassigned    wu       0               0         (0/0/0)            0  
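Either label can be inspected without entering the format utility, for example with prtvtoc (the device name is just an example; s2 is the conventional whole-disk slice on an SMI-labeled disk):
# prtvtoc /dev/rdsk/c1t0d0s2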