Discussion:
zfs defragmentation via resilvering?
Jim Klimov
2012-01-07 14:50:18 UTC
Permalink
Hello all,

I understand that relatively high fragmentation is inherent
to ZFS due to its COW and possible intermixing of metadata
and data blocks (of which metadata path blocks are likely
to expire and get freed relatively quickly).

I believe it was sometimes implied on this list that such
fragmentation for "static" data can be currently combatted
only by zfs send-ing existing pools data to other pools at
some reserved hardware, and then clearing the original pools
and sending the data back. This is time-consuming, disruptive
and requires lots of extra storage idling for this task (or
at best - for backup purposes).

I wonder how resilvering works, namely - does it write
blocks "as they were" or in an optimized (defragmented)
fashion, in two use cases:
1) Resilvering from a healthy array (vdev) onto a spare drive
in order to replace one of the healthy drives in the vdev;
2) Resilvering a degraded array from existing drives onto a
new drive in order to repair the array and make it redundant
again.

Also, are these two modes different at all?
I.e. if I were to ask ZFS to replace a working drive with
a spare in the case (1), can I do it at all, and would its
data simply be copied over, or reconstructed from other
drives, or some mix of these two operations?

Finally, what would the gurus say - does fragmentation
pose a heavy problem on nearly-filled-up pools made of
spinning HDDs (I believe so, at least judging from those
performance degradation problems writing to 80+%-filled
pools), and can fragmentation be effectively combatted
on ZFS at all (with or without BP rewrite)?

For example, can (does?) metadata live "separately"
from data in some "dedicated" disk areas, while data
blocks are written as contiguously as they can?

Many Windows defrag programs group files into several
"zones" on the disk based on their last-modify times, so
that old WORM files remain defragmented for a long time.
There are thus some empty areas reserved for new writes
as well as for moving newly discovered WORM files to
the WORM zones (free space permitting)...

I wonder if this is viable with ZFS (COW and snapshots
involved) when BP-rewrites are implemented? Perhaps such
zoned defragmentation can be done based on block creation
date (TXG number) and the knowledge that some blocks in
certain order comprise at least one single file (maybe
more due to clones and dedup) ;)

What do you think? Thanks,
//Jim Klimov
Edward Ned Harvey
2012-01-07 15:34:26 UTC
Permalink
Post by Jim Klimov
I understand that relatively high fragmentation is inherent
to ZFS due to its COW and possible intermixing of metadata
and data blocks (of which metadata path blocks are likely
to expire and get freed relatively quickly).
I believe it was sometimes implied on this list that such
fragmentation for "static" data can be currently combatted
only by zfs send-ing existing pools data to other pools at
some reserved hardware, and then clearing the original pools
and sending the data back. This is time-consuming, disruptive
and requires lots of extra storage idling for this task (or
at best - for backup purposes).
Can be combated by sending & receiving. But that's not the only way. You
can defrag, (or apply/remove dedup and/or compression, or any of the other
stuff that's dependent on BP rewrite) by doing any technique which
sequentially reads the existing data, and writes it back to disk again. For
example, if you "cp -p file1 file2 && mv file2 file1" then you have
effectively defragged file1 (or added/removed dedup or compression). But of
course it's requisite that file1 is sufficiently "not being used" right now.
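In script form, that per-file trick is roughly this (a minimal Python
sketch; the path is hypothetical, the files must not be in use while it
runs, and any existing snapshots will of course still hold the old blocks):

import os
import shutil

def rewrite_in_place(path):
    # Read the file sequentially into a copy (the copy is written with the
    # pool's current settings: compression, dedup, etc.), then atomically
    # rename it over the original -- same idea as "cp -p f1 f2 && mv f2 f1".
    tmp = path + ".rewrite-tmp"
    shutil.copy2(path, tmp)
    os.rename(tmp, path)

# e.g. walk a quiescent tree of "static" data (hypothetical mountpoint):
for root, dirs, files in os.walk("/tank/static"):
    for name in files:
        rewrite_in_place(os.path.join(root, name))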
Post by Jim Klimov
I wonder how resilvering works, namely - does it write
blocks "as they were" or in an optimized (defragmented)
Resilvering proceeds in temporal order. While this might sometimes yield
a slightly better organization (if a whole bunch of small writes were
previously spread out over a large period of time on a largely idle system,
they will now be write-aggregated into sequential blocks), usually resilvering
recreates fragmentation similar to the pre-existing fragmentation.

In fact, even if you zfs send | zfs receive while preserving snapshots,
you're still recreating the data in roughly temporal order, because it will
do all the blocks of the oldest snapshot, and then all the blocks of the
second oldest snapshot, etc. So by preserving the old snapshots, you might
sometimes be recreating a significant amount of fragmentation anyway.
Post by Jim Klimov
1) Resilvering from a healthy array (vdev) onto a spare drive
in order to replace one of the healthy drives in the vdev;
2) Resilvering a degraded array from existing drives onto a
new drive in order to repair the array and make it redundant
again.
Same behavior either way. Unless... If your old disks are small and very
full, and your new disks are bigger, then sometimes in the past you may have
suffered fragmentation due to lack of available sequential unused blocks.
So resilvering onto new *larger* disks might make a difference.
Post by Jim Klimov
Finally, what would the gurus say - does fragmentation
pose a heavy problem on nearly-filled-up pools made of
spinning HDDs
Yes. But that's not unique to ZFS or COW. No matter what your system, if
your disk is nearly full, you will suffer from fragmentation.
Post by Jim Klimov
and can fragmentation be effectively combatted
on ZFS at all (with or without BP rewrite)?
With BP rewrite, yes you can effectively combat fragmentation.
Unfortunately it doesn't exist. :-/

Without BP rewrite... Define "effectively." ;-) I have successfully
defragged, compressed, enabled/disabled dedup on pools before, by using zfs
send | zfs receive... Or by asking users, "Ok, we're all in agreement, this
weekend, nobody will be using the "a" directory. Right?" So then I sudo rm
-rf a, and restore from the latest snapshot. Or something along those
lines. Next weekend, we'll do the "b" directory...
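As a sketch of that weekend procedure (hypothetical dataset, snapshot and
directory names; note the old blocks stay allocated until the snapshots
holding them are destroyed, so this only helps once those rotate out):

import shutil
import subprocess

# Filesystem tank/home mounted at /tank/home, directory "a" inside it.
subprocess.run(["zfs", "snapshot", "tank/home@before-rewrite"], check=True)
shutil.rmtree("/tank/home/a")
# Copying back out of the read-only snapshot directory rewrites the data
# with fresh (hopefully more contiguous) allocations.
shutil.copytree("/tank/home/.zfs/snapshot/before-rewrite/a",
                "/tank/home/a", symlinks=True)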
Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D.
2012-01-07 16:10:28 UTC
Permalink
It seems that S11 shadow migration can help :-)
Post by Jim Klimov
Hello all,
I understand that relatively high fragmentation is inherent
to ZFS due to its COW and possible intermixing of metadata
and data blocks (of which metadata path blocks are likely
to expire and get freed relatively quickly).
[...]
What do you think? Thanks,
//Jim Klimov
--
Hung-Sheng Tsao Ph D.
Founder& Principal
HopBit GridComputing LLC
cell: 9734950840

http://laotsao.blogspot.com/
http://laotsao.wordpress.com/
http://blogs.oracle.com/hstsao/
Bob Friesenhahn
2012-01-08 19:40:04 UTC
Permalink
Post by Jim Klimov
I understand that relatively high fragmentation is inherent
to ZFS due to its COW and possible intermixing of metadata
and data blocks (of which metadata path blocks are likely
to expire and get freed relatively quickly).
To put things in proper perspective, with 128K filesystem blocks, the
worst case file fragmentation as a percentage is 0.39%
(100*1/((128*1024)/512)). On a Microsoft Windows system, the
defragger might suggest that defragmentation is not warranted for this
percentage level.
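In Python, that arithmetic is simply:

def worst_case_frag_pct(recordsize, sector=512):
    # One potential discontinuity per filesystem block, expressed as a
    # percentage of the sectors read.
    return 100.0 / (recordsize / sector)

print(worst_case_frag_pct(128 * 1024))   # 0.390625  (128K records)
print(worst_case_frag_pct(8 * 1024))     # 6.25      (8K records)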
Post by Jim Klimov
Finally, what would the gurus say - does fragmentation
pose a heavy problem on nearly-filled-up pools made of
spinning HDDs (I believe so, at least judging from those
performance degradation problems writing to 80+%-filled
pools), and can fragmentation be effectively combatted
on ZFS at all (with or without BP rewrite)?
There are different types of fragmentation. The fragmentation which
causes a slowdown when writing to an almost full pool is fragmentation
of the free-list/area (causing zfs to take longer to find free space
to write to) as opposed to fragmentation of the files themselves.
The files themselves will still not be fragmented any more severely
than the zfs blocksize. However, there are seeks and there are
*seeks* and some seeks take longer than others so some forms of
fragmentation are worse than others. When the free space is
fragmented into smaller blocks, there is necessarily more file
fragmentation when the file is written.

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Edward Ned Harvey
2012-01-09 13:44:59 UTC
Permalink
Post by Bob Friesenhahn
To put things in proper perspective, with 128K filesystem blocks, the
worst case file fragmentation as a percentage is 0.39%
(100*1/((128*1024)/512)). On a Microsoft Windows system, the
defragger might suggest that defragmentation is not warranted for this
percentage level.
I don't think that's correct...
Suppose you write a 1G file to disk. It is a database store. Now you start
running your db server. It starts performing transactions all over the
place. It overwrites the middle 4k of the file, and it overwrites 512b
somewhere else, and so on. Since this is COW, each one of these little
writes in the middle of the file will actually get mapped to unused sectors
of disk. Depending on how quickly they're happening, they may be aggregated
as writes... But that's not going to help the sequential read speed of the
file, later when you stop your db server and try to sequentially copy your
file for backup purposes.

In the pathological worst case, you would write a file that takes up half of
the disk. Then you would snapshot it, and overwrite it in random order,
using the smallest possible block size. Now your disk is 100% full, and if
you read that file, you will be performing worst case random IO spanning 50%
of the total disk space. Granted, this is not a very realistic case, but it
is the worst case, and it's really really really bad for read performance.
Richard Elling
2012-01-09 14:03:22 UTC
Permalink
Post by Edward Ned Harvey
Post by Bob Friesenhahn
To put things in proper perspective, with 128K filesystem blocks, the
worst case file fragmentation as a percentage is 0.39%
(100*1/((128*1024)/512)). On a Microsoft Windows system, the
defragger might suggest that defragmentation is not warranted for this
percentage level.
I don't think that's correct...
Suppose you write a 1G file to disk. It is a database store. Now you start
running your db server. It starts performing transactions all over the
place. It overwrites the middle 4k of the file, and it overwrites 512b
somewhere else, and so on.
It depends on the database, but many (eg Oracle database) are COW and
write fixed block sizes so your example does not apply.
Post by Edward Ned Harvey
Since this is COW, each one of these little
writes in the middle of the file will actually get mapped to unused sectors
of disk. Depending on how quickly they're happening, they may be aggregated
as writes... But that's not going to help the sequential read speed of the
file, later when you stop your db server and try to sequentially copy your
file for backup purposes.
Those who expect to get sequential performance out of HDDs usually end up
sad :-( Interestingly, if you run Oracle database on top of ZFS on top of
SSDs, then you have COW over COW over COW. Now all we need is a bull! :-)
-- richard
--
ZFS and performance consulting
http://www.RichardElling.com
illumos meetup, Jan 10, 2012, Menlo Park, CA
http://www.meetup.com/illumos-User-Group/events/41665962/
Bob Friesenhahn
2012-01-09 15:14:51 UTC
Permalink
Post by Edward Ned Harvey
I don't think that's correct...
But it is! :-)
Post by Edward Ned Harvey
Suppose you write a 1G file to disk. It is a database store. Now you start
running your db server. It starts performing transactions all over the
place. It overwrites the middle 4k of the file, and it overwrites 512b
somewhere else, and so on. Since this is COW, each one of these little
writes in the middle of the file will actually get mapped to unused sectors
of disk. Depending on how quickly they're happening, they may be aggregated
Oops. I see an error in the above. Other than tail blocks, or due to
compression, zfs will not write a COW data block smaller than the zfs
filesystem blocksize. If the blocksize was 128K, then updating just
one byte in that 128K block results in writing a whole new 128K block.
This is pretty significant write-amplification but the resulting
fragmentation is still limited by the 128K block size. Remember that
any fragmentation calculation needs to be based on the disk's minimum
read (i.e. sector) size.

However, it is worth remembering that it is common to set the block
size to a much smaller value than default (e.g. 8K) if the filesystem
is going to support a database. In that case it is possible for there
to be fragmentation for every 8K of data. The worst case
fragmentation percentage for 8K blocks (and 512-byte sectors) is 6.25%
((100*1/((8*1024)/512))). That would be a high enough percentage that
Microsoft Windows defrag would recommend defragging the disk.

Metadata chunks can not be any smaller than the disk's sector size
(e.g. 512 bytes or 4K bytes). Metadata can be seen as contributing to
fragmentation, which is why it is so valuable to cache it. If the
metadata is not conveniently close to the data, then it may result in
a big ugly disk seek (same impact as data fragmentation) to read it.

In summary, with zfs's default 128K block size, data fragmentation is
not a significant issue. If the zfs filesystem block size is reduced
to a much smaller value (e.g. 8K), then it can become a significant
issue. As Richard Elling points out, a database layered on top of zfs
may already be fragmented by design.

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Jim Klimov
2012-01-09 15:50:13 UTC
Permalink
Post by Bob Friesenhahn
In summary, with zfs's default 128K block size, data fragmentation is
not a significant issue, If the zfs filesystem block size is reduced to
a much smaller value (e.g. 8K) then it can become a significant issue.
As Richard Elling points out, a database layered on top of zfs may
already be fragmented by design.
I THINK there is some fallacy in your discussion: I've seen 128K
referred to as the maximum filesystem block size, i.e. for large
"streaming" writes. For smaller writes ZFS adapts with smaller
blocks. I am not sure how it would rewrite a few bytes inside
a larger block - split it into many smaller ones or COW all 128K.

Intermixing variable-sized indivisible blocks can in turn lead
to more fragmentation than would otherwise be expected/possible ;)

Fixed block sizes are used (only?) for volume datasets.
Post by Bob Friesenhahn
If the metadata is not conveniently close to the data, then it may
result in a big ugly disk seek (same impact as data fragmentation)
to read it.
Also I'm not sure about this argument. If VDEV prefetch does not
slurp in data blocks, then by the time metadata is discovered in
read-from-disk blocks and data block locations are determined,
the disk may have rotated away from the head, so at least one
rotational delay is incurred even if metadata is immediately
followed by its referred data... no?

//Jim
Edward Ned Harvey
2012-01-13 04:15:37 UTC
Permalink
Post by Bob Friesenhahn
Post by Edward Ned Harvey
Suppose you write a 1G file to disk. It is a database store. Now you start
running your db server. It starts performing transactions all over the
place. It overwrites the middle 4k of the file, and it overwrites 512b
somewhere else, and so on. Since this is COW, each one of these little
writes in the middle of the file will actually get mapped to unused sectors
of disk. Depending on how quickly they're happening, they may be
aggregated
Oops. I see an error in the above. Other than tail blocks, or due to
compression, zfs will not write a COW data block smaller than the zfs
filesystem blocksize. If the blocksize was 128K, then updating just
one byte in that 128K block results in writing a whole new 128K block.
Before anything else, let's define what "fragmentation" means in this
context, or more importantly, why anyone would care.

Fragmentation, in this context, is a measurement of how many blocks exist
sequentially aligned on disk, such that a sequential read will not suffer a
seek/latency penalty. So the reason somebody would care is a function of
performance - disk work payload versus disk work wasted overhead time. But
wait! There are different types of reads. If you read using a scrub or a
zfs send, then it will read the blocks in temporal order, so anything which
was previously write coalesced (even from many different files) will again
be read-coalesced (which is nice). But if you read a file using something
like tar or cp or cat, then it reads the file in sequential file order,
which would be different from temporal order unless the file was originally
written sequentially and never overwritten by COW.

Suppose you have a 1G file open, and a snapshot of this file is on disk from
a previous point in time.
for ( i=0 ; i<1trillion ; i++ ) {
seek(random integer in range[0 to 1G]);
write(4k);
}

Something like this would quickly try to write a bunch of separate and
scattered 4k blocks at different offsets within the file. Every 32 of these
4k writes would be write-coalesced into a single 128k on-disk block.

Sometime later, you read the whole file sequentially such as cp or tar or
cat. The first 4k come from this 128k block... The next 4k come from
another 128k block... The next 4k come from yet another 128k block...
Essentially, the file has become very fragmented and scattered about on the
physical disk. Every 4k read results in a random disk seek.
Post by Bob Friesenhahn
The worst case
fragmentation percentage for 8K blocks (and 512-byte sectors) is 6.25%
((100*1/((8*1024)/512))).
You seem to be assuming that reading a 512b disk sector and its neighboring
512b sector counts as contiguous blocks. And since there are guaranteed to
be exactly 256 sectors in every 128k filesystem block, then there is no
fragmentation for 256 contiguous sectors, guaranteed. Unfortunately, the
512b sector size is just an arbitrary number (and variable, actually 4k on
modern disks), and the resultant percentage of fragmentation is equally
arbitrary.

To produce a number that actually matters, what you need to do is calculate
the percentage of time the disk is able to deliver payload, versus the
percentage of time the disk is performing time-wasting "overhead" operations
- seek and latency.

Suppose your disk streams at 1Gbit/sec while the head is actively engaged,
and suppose the average random access (seek & latency) is 10ms. Suppose you
wish for 99% efficiency. The 10ms must be 1% of the time, so the head must
be engaged for the other 99%, which is 990ms, which is very nearly 1Gbit, or
approximately 123MB of sequential data payload for every random disk access.

That's 944 times larger than the largest 128k block size currently in zfs,
and obviously larger still compared to what you mentioned - 4k or 8k
recordsizes or 512b disk sectors...

Suppose you have 128k blocks written to disk, and all scattered about in
random order. Your disk must seek & rotate for 10ms, then it will be
engaged for about 1ms reading the 128k, and then it will seek & rotate again
for 10ms... I would call that roughly a 10% payload and 90% wasted time.
Fragmentation at this level hurts you really badly.
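A quick back-of-envelope calculator for those two numbers (assuming the
1Gbit/sec streaming rate and 10ms random access used above):

def payload_fraction(chunk_bytes, seek_s=0.010, stream_bps=1e9):
    # Fraction of time spent transferring data, with one random access
    # per contiguous chunk read.
    xfer_s = chunk_bytes * 8 / stream_bps
    return xfer_s / (xfer_s + seek_s)

def chunk_for_efficiency(target, seek_s=0.010, stream_bps=1e9):
    # Contiguous bytes needed per random access to hit a target fraction.
    return target / (1 - target) * seek_s * stream_bps / 8

print(payload_fraction(128 * 1024))       # ~0.095 -- roughly the 10% above
print(chunk_for_efficiency(0.99) / 1e6)   # ~123.75 -- MB per random access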

Suppose there is a TXG flush every 5 seconds. You write a program, which
will write a single byte to disk once every 5.1 seconds. Then you leave
that program running for a very very long time. You now have millions of
128k blocks written on disk scattered about in random order. You start a
scrub. It will read 128k, and then random seek, and then read 128k, etc.

I would call that 100% fragmentation, because there are no contiguously
aligned sequential blocks on disk anywhere. But again, any measure of
"percent fragmentation" is purely arbitrary, unless you know (a) which type
of read behavior is being measured (temporal or file order) and you know (b)
the sequential engaged disk speed, and you know (c) the average random
access time.
Bob Friesenhahn
2012-01-13 20:30:10 UTC
Permalink
Post by Edward Ned Harvey
Suppose you have a 1G file open, and a snapshot of this file is on disk from
a previous point in time.
for ( i=0 ; i<1trillion ; i++ ) {
seek(random integer in range[0 to 1G]);
write(4k);
}
Something like this would quickly try to write a bunch of separate and
scattered 4k blocks at different offsets within the file. Every 32 of these
4k writes would be write-coalesced into a single 128k on-disk block.
Sometime later, you read the whole file sequentially such as cp or tar or
cat. The first 4k come from this 128k block... The next 4k come from
another 128k block... The next 4k come from yet another 128k block...
Essentially, the file has become very fragmented and scattered about on the
physical disk. Every 4k read results in a random disk seek.
Are you talking about some other filesystem or are you talking about
zfs? Because zfs does not work like that ...

However, I did ignore the additional fragmentation due to using raidz
type formats. These break the 128K block into smaller chunks and so
there can be more fragmentation.
Post by Edward Ned Harvey
Post by Bob Friesenhahn
The worst case
fragmentation percentage for 8K blocks (and 512-byte sectors) is 6.25%
((100*1/((8*1024)/512))).
You seem to be assuming that reading 512b disk sector and its neighboring
512b sector count as contiguous blocks. And since there are guaranteed to
be exactly 256 sectors in every 128k filesystem block, then there is no
fragmentation for 256 contiguous sectors, guaranteed. Unfortunately, the
512b sector size is just an arbitrary number (and variable, actually 4k on
modern disks), and the resultant percentage of fragmentation is equally
arbitrary.
Yes, I am saying that zfs writes its data in contiguous chunks
(filesystem blocksize in the case of mirrors).
Post by Edward Ned Harvey
To produce a number that actually matters - What you need to do is calculate
the percentage of time the disk is able to deliver payload, versus the
percentage of time the disk is performing time-wasting "overhead" operations
- seek and latency.
Yes, latency is the critical factor.
Post by Edward Ned Harvey
That's 944 times larger than the largest 128k block size currently in zfs,
and obviously larger still compared to what you mentioned - 4k or 8k
recordsizes or 512b disk sectors...
Yes, fragmentation is still important even with 128K chunks.
Post by Edward Ned Harvey
I would call that 100% fragmentation, because there are no contiguously
aligned sequential blocks on disk anywhere. But again, any measure of
"percent fragmentation" is purely arbitrary, unless you know (a) which type
I agree that the notion of percent fragmentation is arbitrary. I used
one that I invented, and which is based on underlying disk sectors
rather than filesystem blocks.

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Edward Ned Harvey
2012-01-16 02:12:36 UTC
Permalink
Post by Bob Friesenhahn
Post by Edward Ned Harvey
Suppose you have a 1G file open, and a snapshot of this file is on disk from
a previous point in time.
for ( i=0 ; i<1trillion ; i++ ) {
seek(random integer in range[0 to 1G]);
write(4k);
}
Something like this would quickly try to write a bunch of separate and
scattered 4k blocks at different offsets within the file. Every 32 of these
4k writes would be write-coalesced into a single 128k on-disk block.
Sometime later, you read the whole file sequentially such as cp or tar or
cat. The first 4k come from this 128k block... The next 4k come from
another 128k block... The next 4k come from yet another 128k block...
Essentially, the file has become very fragmented and scattered about on the
physical disk. Every 4k read results in a random disk seek.
Are you talking about some other filesystem or are you talking about
zfs? Because zfs does not work like that ...
In what way? I've only described behavior of COW and write coalescing.
Which part are you saying is un-ZFS-like?

Before answering, let's do some test work:

Create a new pool, with a single disk, no compression or dedup or anything,
called "junk"

run this script. All it does is generate some data in a file sequentially,
and then randomly overwrite random pieces of the file in random order,
creating snapshots all along the way... Many times, until the file has been
completely overwritten many times over. This should be a fragmentation
nightmare.
http://dl.dropbox.com/u/543241/fragmenter.py
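Roughly, the script boils down to something like the following -- an
illustrative Python sketch of the same idea, not necessarily the exact
file at that URL; the pass count and per-pass volume are made up:

import os
import random
import subprocess

POOL, PATH = "junk", "/junk/out.txt"
FILESIZE, CHUNK = 2 * 1024**3, 4096          # 2GB file, 4K overwrites

# 1. Write the file sequentially, then snapshot it.
with open(PATH, "wb") as f:
    for _ in range(FILESIZE // CHUNK):
        f.write(os.urandom(CHUNK))
subprocess.run(["zfs", "snapshot", POOL + "@sequential-before"], check=True)

# 2. Overwrite random 4K pieces of the file, snapshotting along the way,
#    until the file has been completely rewritten many times over.
with open(PATH, "r+b") as f:
    for i in range(1400):
        for _ in range(FILESIZE // CHUNK // 100):    # ~1% of the file per pass
            f.seek(random.randrange(0, FILESIZE, CHUNK))
            f.write(os.urandom(CHUNK))
        subprocess.run(["zfs", "snapshot", "%s@random%d" % (POOL, i)],
                       check=True)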

Then reboot, to ensure cache is clear.
And see how long it takes to sequentially read the original sequential file,
as compared to the highly fragmented one:
cat /junk/.zfs/snapshot/sequential-before/out.txt | pv > /dev/null
cat /junk/.zfs/snapshot/random1399/out.txt | pv > /dev/null

While I'm waiting for this to run, I'll make some predictions:
The file is 2GB (16 Gbit) and the disk reads around 1Gbit/sec, so reading
the initial sequential file should take ~16 sec
After fragmentation, it should be essentially random 4k fragments (32768
bits). I figure each time the head is able to find useful data, it takes
32us to read the 4kb, followed by 10ms random access time... disk is doing
useful work 0.3% of the time and wasting 99.7% of the time doing random
seeks. Should be about 300x longer to read the fragmented file.
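For the record, the arithmetic behind that prediction:

chunk_bits = 4096 * 8                      # one 4K fragment, in bits
t_read = chunk_bits / 1e9                  # ~33us at the assumed 1Gbit/sec
print(t_read / (t_read + 0.010))           # ~0.003 -> ~0.3% useful work
print((t_read + 0.010) / t_read)           # ~306   -> ~300x slower than sequential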

... (Ding!) ... Test is done. Thank you for patiently waiting during this
time warp. ;-)

Actual result: 15s and 45s. So it was 3x longer, not 300x. Either way it
proves the point - but I want to see results that are at least 100x worse
due to fragmentation, to REALLY drive home the point that fragmentation
matters.

I hypothesize, that the mere 3x performance degradation is because I have
only a single 2G file in a 2T pool, and no other activity, and no other
files. So all my supposedly randomly distributed data might reside very
close to each other on platter... The combination of short stroke & read
prefetcher could be doing wonders in this case. So now I'll repeat that
test, but this time... I allow the sequential data to be written
sequentially again just like before, but after it starts the random
rewriting, I'll run a couple of separate threads writing and removing other
junk to the pool, so the write coalescing will include other files, spread
more across a larger percentage of the total disk, getting closer to the
worst case random distribution on disk...

(destroy & recreate the pool in between test runs...)

Actual result: 15s and 104s. So it's only 6.9x performance degradation.
That's the worst I can do without hurting myself. It proves the point, but
not to the magnitude that I expected.
Bob Friesenhahn
2012-01-16 04:39:45 UTC
Permalink
Post by Edward Ned Harvey
The file is 2GB (16 Gbit) and the disk reads around 1Gbit/sec, so reading
the initial sequential file should take ~16 sec
After fragmentation, it should be essentially random 4k fragments (32768
bits). I figure each time the head is able to find useful data, it takes
The 4k fragments are the part I don't agree with. Zfs does not do
that. If you were to run raidzN over a wide enough array of disks you
could end up with 4K fragments (distributed across the disks), but
then you would always have 4K fragments.

Zfs writes linear strips of data in units of the zfs blocksize, unless
it is sliced-n-diced by raidzN for striping across disks. If part of
a zfs filesystem block is overwritten, then the underlying block is
read, modified in memory, and then the whole block written to a new
location. The need to read the existing block is a reason why the zfs
ARC is so vitally important to write performance.
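A toy model of that cost, ignoring compression, tail blocks, and writes
that straddle a record boundary:

import math

def physical_write_bytes(logical_bytes, recordsize=128 * 1024):
    # Whole records are rewritten, so a sub-record update is rounded up
    # to the full recordsize on disk (plus the read of the old record if
    # it is not already cached in the ARC).
    return math.ceil(logical_bytes / recordsize) * recordsize

print(physical_write_bytes(1))                        # 131072 -- one byte costs a full 128K record
print(physical_write_bytes(4096, recordsize=8192))    # 8192   -- 4K update into an 8K record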

If the filesystem has compression enabled, then the blocksize is still
the same, but the data written may be shorter (due to compression).
File tail blocks may also be shorter.

There are dtrace tools you can use to observe low level I/O and see
the size of the writes.

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Jim Klimov
2012-01-16 13:02:27 UTC
Permalink
Bob Friesenhahn
2012-01-16 15:13:03 UTC
Permalink
Post by Jim Klimov
I think that in order to create a truly fragmented ZFS layout,
Edward needs to do sync writes (without a ZIL?) so that every
block and its metadata go to disk (coalesced as they may be)
and no two blocks of the file would be sequenced on disk together.
Although creating snapshots should give that effect...
Creating snapshots does not in itself cause fragmentation since COW
would cause that level of fragmentation to exist anyway. However,
snapshots cause old blocks to be maintained so the disk becomes more
full, fresh blocks may be less appropriately situated, and the disk
seeks may become more expensive due to needing to seek over more
tracks.

In my experience, most files on Unix systems are re-written from
scratch. For example, when one edits a file in an editor, the editor
loads the file into memory, performs the edit, and then writes out the
whole file. Given sufficient free disk space, these files are
unlikely to be fragmented.

Slowly written log files and random-access databases are the worst
cases for causing fragmentation.

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Gary Mills
2012-01-16 15:49:32 UTC
Permalink
Post by Bob Friesenhahn
Post by Jim Klimov
I think that in order to create a truly fragmented ZFS layout,
Edward needs to do sync writes (without a ZIL?) so that every
block and its metadata go to disk (coalesced as they may be)
and no two blocks of the file would be sequenced on disk together.
Although creating snapshots should give that effect...
In my experience, most files on Unix systems are re-written from
scratch. For example, when one edits a file in an editor, the editor
loads the file into memory, performs the edit, and then writes out
the whole file. Given sufficient free disk space, these files are
unlikely to be fragmented.
Slowly written log files and random-access databases are the worst
cases for causing fragmentation.
The case I've seen was with an IMAP server with many users. E-mail
folders were represented as ZFS directories, and e-mail messages as
files within those directories. New messages arrived randomly in the
INBOX folder, so that those files were written all over the place on
the storage. Users also deleted many messages from their INBOX
folder, but the files were retained in snapshots for two weeks. On
IMAP session startup, the server typically had to read all of the
messages in the INBOX folder, making this portion slow. The server
also had to refresh the folder whenever new messages arrived, making
that portion slow as well. Performance degraded when the storage
became 50% full. It would improve markedly when the oldest snapshot
was deleted.
--
-Gary Mills- -refurb- -Winnipeg, Manitoba, Canada-