ZFS File System
This is the 128-bit file system by Sun Microsystems.
Design principles:
1. It does for storage what virtual memory does for memory.
2. It keeps data always consistent on the disk(s).
History of File systems
1. (Beginning) A file system managed a single hard disk.
2. (Volumes) Insert software (e.g. SVM) between the FS and the physical disk(s); it manages all physical disks, creates volumes, and presents each volume to the FS as one reliable disk.
3. (ZFS) ZFS sits above a pool of hard disks (one or more ZFS file systems per pool).
There are no partitions.
Pool size grows automatically when new disks are added (just as RAM grows when a new memory module is added).
All storage in the pool is shared among all ZFS file systems.
FS/Volume I/O Stack characteristics:
A loss of power leaves inconsistent data on the disks, so fsck was a frequently used tool to correct the inconsistency.
The solution for this problem is journaling - logging any change to a journal before committing it to the FS. After a crash, recovery involves replaying changes from the journal until the FS is consistent again.
1. Write to the journal: "I will rename file-1 to file-2"
2. Perform the I/O on disk to accomplish the renaming
3. After renaming, go back to the journal and say "Renaming successful"
4. In case of a power outage in the middle of step 2, there is an inconsistency on the disk.
5. Recovery checks the journal, sees that renaming was in progress when the power went off, and so knows exactly where to go and fix the inconsistency.
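The steps above can be sketched in Python - an illustration of the write-ahead journaling idea only, not actual file-system code (the class and field names are made up for the example):

```python
# Minimal write-ahead journaling sketch (illustration, not real FS code).
# The intent is logged BEFORE the disk is touched, so recovery after a
# crash can finish or replay any operation that was left half-done.

class JournaledFS:
    def __init__(self):
        self.disk = {}      # filename -> contents ("on disk" state)
        self.journal = []   # list of {"op": ..., "done": ...} entries

    def rename(self, old, new, crash_midway=False):
        # 1. Write the intent to the journal first
        entry = {"op": ("rename", old, new), "done": False}
        self.journal.append(entry)
        # 2. Perform the I/O on "disk"
        self.disk[new] = self.disk.pop(old)
        if crash_midway:
            return          # power lost before the journal was updated
        # 3. Go back to the journal and mark the operation successful
        entry["done"] = True

    def recover(self):
        # 5. Check the journal for operations not marked done
        for entry in self.journal:
            if not entry["done"]:
                op, old, new = entry["op"]
                # If the rename already reached the disk, just complete the entry
                if new in self.disk and old not in self.disk:
                    entry["done"] = True

fs = JournaledFS()
fs.disk["file-1"] = "data"
fs.rename("file-1", "file-2", crash_midway=True)  # power outage mid-operation
fs.recover()
print(fs.disk)                 # {'file-2': 'data'}
print(fs.journal[0]["done"])   # True
```

Recovery knows exactly which operation was in flight because the intent record was written before any disk I/O started.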
ZFS I/O Stack characteristics:
ZFS is a transactional FS (just as an online banking payment succeeds or fails as a whole, any operation on ZFS succeeds or fails as a whole).
Live data is never overwritten (the initial blocks with consistent data stay intact - they can serve as a snapshot - while the new blocks with new data become the consistent state).
1. The DMU (Data Management Unit) sits between ZFS and the pool of disk(s) and performs transactions for ZFS.
2. ZFS says to the DMU: I want to perform this list of operations in order to rename the file. Do all of them; if you cannot do all of them, then do none of them.
3. The DMU takes the list of all operations (steps) and creates a transaction group.
4. The DMU performs the transaction group on the pool, so the operations from 2) are done "all or nothing".
5. The DMU also doesn't overwrite existing data, so the FS is always consistent.
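The "all or nothing" copy-on-write behavior can be modeled in a few lines of Python - a conceptual sketch, not DMU code (the `Pool` class and step functions are invented for the example):

```python
# Copy-on-write transaction sketch (illustration, not real DMU code).
# New data goes into fresh blocks; the live root pointer switches only
# after every step succeeds, so existing consistent data is never touched.

import copy

class Pool:
    def __init__(self, state):
        self.root = state          # the live, consistent view

    def transaction(self, operations):
        # Work on a copy: the old blocks stay intact (could be a snapshot)
        shadow = copy.deepcopy(self.root)
        try:
            for op in operations:
                op(shadow)         # apply each step to the new blocks
        except Exception:
            return False           # any failure: do none of them
        self.root = shadow         # one atomic pointer switch: do all of them
        return True

pool = Pool({"file-1": "data"})

def step_copy(s): s["file-2"] = s["file-1"]
def step_delete(s): del s["file-1"]
def step_fail(s): raise IOError("disk error")

# A failed transaction leaves the pool exactly as it was
pool.transaction([step_copy, step_fail])
print(pool.root)   # {'file-1': 'data'}

# A successful transaction applies every step
pool.transaction([step_copy, step_delete])
print(pool.root)   # {'file-2': 'data'}
```

Because the old root is only replaced after the whole group commits, a crash at any point leaves the previous consistent state in place - no fsck needed.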
ZFS also does disk scrubbing - it scans the disks in the pool, validates checksums and, if there is an error, corrects it.
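The scrubbing idea - recompute each block's checksum and repair bad blocks from a redundant copy - can be sketched like this (an illustration with a CRC32 stand-in checksum and invented block layout, not ZFS's actual on-disk format):

```python
# Disk-scrubbing sketch (illustration only). Each "disk" maps a block
# number to (data, stored_checksum); a mirror provides the redundant copy.

import zlib

def checksum(data):
    return zlib.crc32(data)

disk_a = {0: (b"hello", checksum(b"hello")), 1: (b"world", checksum(b"world"))}
disk_b = {0: (b"hello", checksum(b"hello")), 1: (b"world", checksum(b"world"))}

# Simulate silent corruption on one side of the mirror
disk_a[1] = (b"w0rld", disk_a[1][1])

def scrub(disk, mirror):
    repaired = []
    for block, (data, stored) in disk.items():
        if checksum(data) != stored:             # checksum mismatch: bad block
            good_data, good_sum = mirror[block]
            if checksum(good_data) == good_sum:  # mirror copy verifies clean
                disk[block] = (good_data, good_sum)
                repaired.append(block)
    return repaired

print(scrub(disk_a, disk_b))   # [1]
print(disk_a[1][0])            # b'world'
```

On a real pool the equivalent is `zpool scrub pool-1`, with progress and repair results shown by `zpool status`.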
Creating zpool and zfs (and working with them)
Just run zpool or zfs with no arguments and the OS tells you that you are missing a command and shows the usage.
# zpool ( or zfs)
missing command
usage: zpool (or zfs) command args ...
where 'command' is one of the following:
|
If you are not sure what would happen, use the -n option to show what would happen without actually doing anything (add -f to force and ignore possible errors).
# zpool create -fn pool-1 c0t2d0 c0t3d0
would create 'pool-1' with the following layout:
pool-1
c0t2d0
c0t3d0
|
Let's create a pool of 2 disks, each 146G in size.
# zpool create -f pool-1 c0t2d0 c0t3d0
|
So the pool is now created, as well as a ZFS file system (already mounted).
# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
pool-1 272G 90K 272G 0% ONLINE -
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
pool-1 87K 268G 24.5K /pool-1
|
ZFS file systems are cheap - easily created and deleted.
# zfs create pool-1/home
# zfs create pool-1/home/user1
# zfs create pool-1/home/user2
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
pool-1 180K 268G 25.5K /pool-1
pool-1/home 76.5K 268G 27.5K /pool-1/home
pool-1/home/user1 24.5K 268G 24.5K /pool-1/home/user1
pool-1/home/user2 24.5K 268G 24.5K /pool-1/home/user2
|
Destroy one ZFS file system (there is no leading / in front of pool-1; a leading / would refer to the mount point)
# zfs destroy pool-1/home/user2
|
Destroy a parent ZFS file system together with its children (use -r)
# zfs destroy -r pool-1/home
|
See the results: all the cheap ZFS file systems are gone.
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
pool-1 90K 268G 24.5K /pool-1
|
Basically, ZFS is much easier to manage than UFS.
Mounting a ZFS file system doesn't require working with the /etc/vfstab file.
# zfs set mountpoint=/mnt pool-1
|
Get info about the quota and mountpoint properties.
# zfs get quota,mountpoint pool-1
NAME PROPERTY VALUE SOURCE
pool-1 quota none default
pool-1 mountpoint /mnt local
|
Set quota.
# zfs set quota=10g pool-1
# zfs get quota,mountpoint pool-1
NAME PROPERTY VALUE SOURCE
pool-1 quota 10G local
pool-1 mountpoint /mnt local
|
Get ALL info.
# zfs get all pool-1
NAME PROPERTY VALUE SOURCE
pool-1 type filesystem -
pool-1 creation Fri Nov 13 9:50 2009 -
pool-1 used 97K -
pool-1 available 10.0G -
pool-1 referenced 24.5K -
pool-1 compressratio 1.00x -
pool-1 mounted yes -
pool-1 quota 10G local
pool-1 reservation none default
pool-1 recordsize 128K default
pool-1 mountpoint /mnt local
pool-1 sharenfs off default
pool-1 checksum on default
pool-1 compression off default
pool-1 atime on default
pool-1 devices on default
pool-1 exec on default
pool-1 setuid on default
pool-1 readonly off default
pool-1 zoned off default
pool-1 snapdir hidden default
pool-1 aclmode groupmask default
pool-1 aclinherit secure default
pool-1 canmount on default
pool-1 shareiscsi off default
pool-1 xattr on default
|
See the available space, which is limited by the 10G quota.
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
pool-1 97K 10.0G 24.5K /mnt
|
But we have much more space left in the pool.
# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
pool-1 272G 103K 272G 0% ONLINE -
|
Create a new ZFS file system, setting its quota at creation time.
# zfs create -o quota=5g pool-1/jumpstart
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
pool-1 126K 10.0G 24.5K /mnt
pool-1/jumpstart 24.5K 5.0G 24.5K /mnt/jumpstart
|
Sharing a file system is also easy; there is no need to work with the /etc/dfs/dfstab file.
# zfs set sharenfs=on pool-1/jumpstart
# showmount -e my-hostname
export list for my-hostname:
/mnt/jumpstart (everyone)
# zfs get sharenfs
NAME PROPERTY VALUE SOURCE
pool-1 sharenfs off default
pool-1/jumpstart sharenfs on local
|
Managing disks in a zpool
Although slices can be members of a zpool, I do not think many people use them. Just go with whole disks.
ZFS puts one slice over the whole disk (formatting the disk with an EFI label).
Total disk sectors available: 286722910 + 16384 (reserved sectors)
Part Tag Flag First Sector Size Last Sector
0 usr wm 34 136.72GB 286722910
1 unassigned wm 0 0 0
2 unassigned wm 0 0 0
3 unassigned wm 0 0 0
4 unassigned wm 0 0 0
5 unassigned wm 0 0 0
6 unassigned wm 0 0 0
8 reserved wm 286722911 8.00MB 286739294
|
The EFI (Extensible Firmware Interface) label takes 34 sectors (17KB = 34 x 512B), and slice 8 holds some additional system information.
There is no info about disk cylinders.
EFI supports physical disks and virtual volumes greater than 2 terabytes.
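The label and reserved-slice sizes from the format output above can be checked with simple sector arithmetic:

```python
# Verifying the EFI numbers shown in the format output (512 B sectors).

SECTOR = 512

# Label: 34 sectors
label_bytes = 34 * SECTOR
print(label_bytes)   # 17408 bytes, i.e. 17 KB

# Slice 8: first sector 286722911, last sector 286739294
reserved_sectors = 286739294 - 286722911 + 1
print(reserved_sectors)                             # 16384 reserved sectors
print(reserved_sectors * SECTOR // (1024 * 1024))   # 8 MB, matching slice 8's size
```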
NOTE:
See that this is different from a disk formatted using an SMI (Sun Microsystems Inc.) label - some people call it VTOC (Volume Table Of Contents).
(example from SunFire X4200).
Total disk cylinders available: 8921 + 2 (reserved cylinders)
Part Tag Flag Cylinders Size Blocks
0 root wm 524 - 1046 4.01GB (523/0/0) 8401995
1 swap wu 1 - 523 4.01GB (523/0/0) 8401995
2 backup wm 0 - 8920 68.34GB (8921/0/0) 143315865
3 var wm 1047 - 1569 4.01GB (523/0/0) 8401995
4 unassigned wm 1570 - 4180 20.00GB (2611/0/0) 41945715
5 unassigned wm 4181 - 8887 36.06GB (4707/0/0) 75617955
6 unassigned wu 0 0 (0/0/0) 0
7 unassigned wm 8888 - 8920 258.86MB (33/0/0) 530145
8 boot wu 0 - 0 7.84MB (1/0/0) 16065
9 unassigned wu 0 0 (0/0/0) 0
|