Discussion:
Pool performance when nearly full
sol
2012-12-20 17:25:54 UTC
Hi

I know some of this has been discussed in the past but I can't quite find the exact information I'm seeking
(and I'd check the ZFS wikis but the websites are down at the moment).

Firstly, which is correct, free space shown by "zfs list" or by "zpool iostat" ?

zfs list:
used 50.3 TB, free 13.7 TB, total = 64 TB, free = 21.4%

zpool iostat:
used 61.9 TB, free 18.1 TB, total = 80 TB, free = 22.6%

(That's a big difference, and the percentage doesn't agree)

Secondly, there's 8 vdevs each of 11 disks.
6 vdevs show used 8.19 TB, free 1.81 TB, free = 18.1%
2 vdevs show used 6.39 TB, free 3.61 TB, free = 36.1%

I've heard that 
a) performance degrades when free space is below a certain amount
b) data is written to different vdevs depending on free space

So a) how do I determine the exact value when performance degrades and how significant is it?
b) has that threshold been reached (or exceeded?) in the first six vdevs?
and if so are the two emptier vdevs being used exclusively to prevent performance degrading
so it will only degrade when all vdevs reach the magic 18.1% free (or whatever it is)?

Presumably there's no way to identify which files are on which vdevs in order to delete them and recover the performance?

Thanks for any explanations!
Cindy Swearingen
2012-12-20 18:18:13 UTC
Hi Sol,

You can review the Solaris 11 ZFS best practices info here:

http://docs.oracle.com/cd/E26502_01/html/E29007/practice-1.html#scrolltoc

The above section also provides info about the full pool performance
penalty.

For S11 releases, we're going to increase the 80% pool capacity
recommendation to 90%.
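
As a quick check against that threshold, the CAP column of "zpool list" shows how full the pool currently is. A sketch only - the pool name "tank" is assumed and the exact columns vary by release:

# zpool list tank
NAME   SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
tank    80T  61.9T  18.1T  77%  1.00x  ONLINE  -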

Pool/file system space accounting depends on the type of pool;
you can read about the different types here:

http://docs.oracle.com/cd/E26502_01/html/E29007/gbbti.html#scrolltoc

Thanks,

Cindy
Post by sol
Hi
I know some of this has been discussed in the past but I can't quite
find the exact information I'm seeking
(and I'd check the ZFS wikis but the websites are down at the moment).
Firstly, which is correct, free space shown by "zfs list" or by "zpool iostat" ?
used 50.3 TB, free 13.7 TB, total = 64 TB, free = 21.4%
used 61.9 TB, free 18.1 TB, total = 80 TB, free = 22.6%
(That's a big difference, and the percentage doesn't agree)
Secondly, there's 8 vdevs each of 11 disks.
6 vdevs show used 8.19 TB, free 1.81 TB, free = 18.1%
2 vdevs show used 6.39 TB, free 3.61 TB, free = 36.1%
I've heard that
a) performance degrades when free space is below a certain amount
b) data is written to different vdevs depending on free space
So a) how do I determine the exact value when performance degrades and
how significant is it?
b) has that threshold been reached (or exceeded?) in the first six vdevs?
and if so are the two emptier vdevs being used exclusively to prevent performance degrading
so it will only degrade when all vdevs reach the magic 18.1% free (or whatever it is)?
Presumably there's no way to identify which files are on which vdevs in
order to delete them and recover the performance?
Thanks for any explanations!
Jim Klimov
2012-12-20 20:32:50 UTC
Post by sol
Hi
I know some of this has been discussed in the past but I can't quite
find the exact information I'm seeking
(and I'd check the ZFS wikis but the websites are down at the moment).
Firstly, which is correct, free space shown by "zfs list" or by "zpool
iostat" ? (...)
(That's a big difference, and the percentage doesn't agree)
I believe zpool iostat (and zpool list) report raw storage accounting -
essentially the number of HDD sectors available and consumed, including
redundancy and metadata (so the available space also includes the
not-yet-used redundancy overhead), plus reserved space such as the 1/64
of the pool size set aside for system use - partly an attempt to counter
the very performance degradation on full pools discussed here.

zfs list displays user-data figures - what is available after redundancy
and system reservations, and in general subject to any "(ref)reservation"
and "(ref)quota" set on datasets in the pool. When cloning and dedup come
into play, as well as compression, this accounting becomes tricky.
Overall, there is one number you can trust: the used space of a dataset
tells you how much user data (including directory structures, and after
compression) is referenced in that filesystem - the end-user value of
your service, if you limit or bill by consumption. This does not mean
that only this filesystem references those blocks, though, and the other
numbers are vaguer (e.g. with good dedup+compression ratios the summed
used space of all datasets can exceed the raw pool size).
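
If you want to see where a dataset's USED figure comes from, "zfs list
-o space" breaks it down into components. A minimal sketch, assuming a
dataset named "tank/data"; the split between the columns is made up for
illustration:

# zfs list -o space tank/data
NAME       AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
tank/data  13.7T  50.3T     1.2T   49.1T              0          0

Snapshots, refreservations and child datasets each land in their own
column there, which is often where "missing" space hides.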
Post by sol
Secondly, there's 8 vdevs each of 11 disks.
6 vdevs show used 8.19 TB, free 1.81 TB, free = 18.1%
2 vdevs show used 6.39 TB, free 3.61 TB, free = 36.1%
How did you look that up? ;)
Post by sol
I've heard that
a) performance degrades when free space is below a certain amount
Basically, the "mechanics" of the degradation is that ZFS writes new
data into available space "bubbles" within a range called a "metaslab".
It tries to keep writes sequential so they complete faster. If your pool
has seen lots of writes and deletions, its free space may have become
fragmented, so the search for suitable "bubbles" takes longer, and the
bubbles found may be too small to fit the whole incoming transaction -
leading to more HDD seeks and thus more write latency. In the extreme
case ZFS cannot even find a hole big enough for a single block, so it
splits the block's data into several pieces and writes "gang blocks",
using many tiny IOs with many mechanical HDD seeks.

The numbers - how full a pool has to be before it shows these problems -
are highly individual. Some pools saw it after filling to 60%, 80-90% is
typical, and a write-only pool might never see the problem because
nothing gets deleted (well, except maybe metadata during updates, all of
which usually consumes 1-3% of the total allocation).
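
If you want to gauge how chopped-up the free space already is, zdb can
dump the per-metaslab picture. A rough sketch, assuming a pool named
"tank"; output heavily abbreviated, and the exact layout varies by
release:

# zdb -m tank
Metaslabs:
        vdev          0
        metaslabs   174   offset                spacemap          free
        ---------------   -------------------   ---------------   -----
        metaslab      0   offset            0   spacemap     36   412M
        metaslab      1   offset   2000000000   spacemap     38   1.71G
        ...

Lots of metaslabs with small "free" values (and, with "zdb -mm", many
tiny segments inside each spacemap) are the signature of the
fragmentation described above.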
Post by sol
b) data is written to different vdevs depending on free space
There are several rules that influence the choice of a top-level vdev,
and of a metaslab region inside it; they probably include free space,
the known presence of large "bubbles" to write into, and the location
on the disk (faster vs. slower LBA tracks).
Post by sol
So a) how do I determine the exact value when performance degrades and
how significant is it?
b) has that threshold been reached (or exceeded?) in the first six vdevs?
and if so are the two emptier vdevs being used exclusively to prevent performance degrading
so it will only degrade when all vdevs reach the magic 18.1% free (or whatever it is)?
Hopefully, this was answered above :)
Post by sol
Presumably there's no way to identify which files are on which vdevs in
order to delete them and recover the performance?
It is possible, but not simple, and is not guaranteed to get the
result you want (though there is little harm in trying).

You can use "zdb" to extract information about an inode on a dataset
as a listing of block pointer entries which form a tree for this file.

For example:
# ls -lani /lib/libnsl.so.1
9239 -rwxr-xr-x 1 0 2 649720 Jun 8 2012 /lib/libnsl.so.1

# df -k /lib/libnsl.so.1
Filesystem kbytes used avail capacity Mounted on
rpool/ROOT/oi_151a4 61415424 452128 24120824 2% /

Here the first number from "ls -i" gives us the inode of the file,
and "df" confirms the dataset name. So we can walk it with zdb:

# zdb -ddddd -bbbbbb rpool/ROOT/oi_151a4 9239
Dataset rpool/ROOT/oi_151a4 [ZPL], ID 5299, cr_txg 1349648, 442M,
8213 objects, rootbp DVA[0]=<0:a6921d600:200> DVA[1]=<0:2ffc7b400:200>
[L0 DMU objset] fletcher4 lzjb LE contiguous unique double
size=800L/200P birth=4682209L/4682209P fill=8213
cksum=16f122cb05:77d20eea7b8:155c69ed5a6ce:2b90104e19641f

Object lvl iblk dblk dsize lsize %full type
9239 2 16K 128K 642K 640K 100.00 ZFS plain file
168 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED
dnode maxblkid: 4
path /lib/libnsl.so.1
uid 0
gid 2
atime Fri Jun 8 00:22:17 2012
mtime Fri Jun 8 00:22:17 2012
ctime Fri Jun 8 00:22:17 2012
crtime Fri Jun 8 00:22:17 2012
gen 1349746
mode 100755
size 649720
parent 25
links 1
pflags 40800000104
Indirect blocks:
0 L1 DVA[0]=<0:940298000:400> DVA[1]=<0:263234a00:400>
[L1 ZFS plain file] fletcher4 lzjb LE contiguous unique double
size=4000L/400P birth=1349746L/1349746P fill=5
cksum=682d4fda0b:3cc1aa306094:13ebb22837cf14:4c5c67e522dbca8

0 L0 DVA[0]=<0:95f337000:20000> [L0 ZFS plain file]
fletcher4 uncompressed LE contiguous unique single size=20000L/20000P
birth=1349746L/1349746P fill=1
cksum=23fce6aa160b:5ab11e5fcbc6c2e:5b38f230e01d508d:12cf92941e4b2487

20000 L0 DVA[0]=<0:95f357000:20000> [L0 ZFS plain file]
fletcher4 uncompressed LE contiguous unique single size=20000L/20000P
birth=1349746L/1349746P fill=1
cksum=3f0ac207affd:f8ed413113d6bdd:24e36c7682cfc297:2549c866ab61e464

40000 L0 DVA[0]=<0:95f377000:20000> [L0 ZFS plain file]
fletcher4 uncompressed LE contiguous unique single size=20000L/20000P
birth=1349746L/1349746P fill=1
cksum=3d40bf3329f0:f459bc876303dd7:2230ee348b7b08c5:3a65d1ebbf52c9dc

60000 L0 DVA[0]=<0:95f397000:20000> [L0 ZFS plain file]
fletcher4 uncompressed LE contiguous unique single size=20000L/20000P
birth=1349746L/1349746P fill=1
cksum=19e01b53eb67:956b52d1df6ecd4:38ff9bd1302bf879:e4661798dd1ae8a0

80000 L0 DVA[0]=<0:95f3b7000:20000> [L0 ZFS plain file]
fletcher4 uncompressed LE contiguous unique single size=20000L/20000P
birth=1349746L/1349746P fill=1
cksum=361e6fd03d40:d0903e491fa09e9:7a2e453ed28baa92:28562c53af3c0495

segment [0000000000000000, 00000000000a0000) size 640K

After one or more layers of indirect pointers (just L1 in the example
above), you reach the "L0" entries, whose DVA fields point to the
actual data blocks.

The example file above fits in five 128K blocks at level L0.

The first component of a DVA address is the top-level vdev ID, followed
by the offset and the allocation size (which includes raidzN redundancy).
Depending on your pool's history, however, larger files may have been
striped over several top-level vdevs, and relocating them (copying them
over and deleting the originals) may or may not free up a particular
top-level vdev - upon rewrite they will be striped again, although ZFS
may make different decisions for the new write and prefer the emptier
devices.
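
For instance, in the first L0 line of the listing above,
DVA[0]=<0:95f337000:20000> reads as top-level vdev 0, offset 0x95f337000
within that vdev, allocated size 0x20000 (128K) - and since all five L0
entries start with "0:", the whole file sits on top-level vdev 0. On
your 8-vdev pool that first field is the one to look at.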

Also, if the file's blocks are referenced via snapshots, clones,
dedup or hardlinks, they won't actually be released when you delete
a particular copy of the file.

HTH,
//Jim Klimov
Timothy Coalson
2012-12-20 22:40:41 UTC
Post by sol
Secondly, there's 8 vdevs each of 11 disks.
6 vdevs show used 8.19 TB, free 1.81 TB, free = 18.1%
2 vdevs show used 6.39 TB, free 3.61 TB, free = 36.1%
Post by Jim Klimov
How did you look that up? ;)
"zpool iostat -v" or "zpool list -v"

Tim
