Discussion:
zfs sata mirror slower than single disk
Michael Hase
2012-07-16 09:43:59 UTC
Hello list,

I ran some bonnie++ benchmarks for different zpool configurations
consisting of one or two 1 TB SATA disks (Hitachi HDS721010CLA332, 512
bytes/sector, 7.2k rpm) and got some strange results. Please see the
attachments for the exact numbers and pool config:

           seq write   factor   seq read   factor
             MB/sec               MB/sec
single          123        1         135        1
raid0           114        1         249        2
mirror           57        0.5       129        1

Each of the disks is capable of about 135 MB/sec sequential reads and
about 120 MB/sec sequential writes, and iostat -En shows no defects. The
disks are 100% busy in all tests and show normal service times. This is on
OpenSolaris b130; rebooting with an OpenIndiana 151a live CD gives the
same results, and dd tests give the same results, too. The storage
controller is an LSI 1068 using the mpt driver. The pools are newly
created and empty. atime on/off doesn't make a difference.
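
The pool layouts and the benchmark run looked roughly like this (device
names match the iostat output later in the thread; the bonnie++ size is
illustrative, chosen to be well above RAM):

zpool create tank mirror c13t4d0 c13t5d0       # mirror case
# stripe ("raid0") case: zpool create tank c13t4d0 c13t5d0
# single-disk case:      zpool create tank c13t4d0
bonnie++ -d /tank -s 32768 -u root             # size in MB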

Is there an explanation why

1) in the raid0 case the write speed is more or less the same as for a
single disk, and

2) in the mirror case the write speed is cut in half and the read
speed is the same as for a single disk? I'd expect about twice the
performance for both reading and writing, maybe a bit less, but
definitely more than measured.

For comparison I did the same tests with two old 2.5" 36 GB SAS 10k disks,
which max out at about 50-60 MB/sec on the outer tracks.

           seq write   factor   seq read   factor
             MB/sec               MB/sec
single           38        1          50        1
raid0            89        2         111        2
mirror           36        1          92        2

Here we get the expected behaviour: raid0 gives about double the
performance for reading and writing, and the mirror gives about the same
performance for writing and double the speed for reading, compared to a
single disk. An old SCSI system with 4x2 mirror pairs also shows these
scaling characteristics: about 450-500 MB/sec sequential read and 250
MB/sec write, with each disk capable of 80 MB/sec. I don't care about the
absolute numbers, I just don't get why the SATA system is so much slower
than expected, especially for a simple mirror. Any ideas?

Thanks,
Michael
--
Michael Hase
http://edition-software.de
Richard Elling
2012-07-16 15:40:25 UTC
Post by Michael Hase
Hello list,
did some bonnie++ benchmarks for different zpool configurations
consisting of one or two 1tb sata disks (hitachi hds721010cla332, 512
bytes/sector, 7.2k), and got some strange results, please see
           seq write   factor   seq read   factor
             MB/sec               MB/sec
single          123        1         135        1
raid0           114        1         249        2
mirror           57        0.5       129        1
Each of the disks is capable of about 135 MB/sec sequential reads and
about 120 MB/sec sequential writes, iostat -En shows no defects. Disks
are 100% busy in all tests, and show normal service times.
For 7,200 rpm disks, average service times should be on the order of 10ms
for writes and 13ms for reads. If you see averages > 20ms, then you are
likely running into scheduling issues.
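
A quick way to watch this is iostat; asvc_t is the average service time in
milliseconds and actv is the queue depth at the device (illustrative
invocation):

iostat -xnz 10        # asvc_t = avg service time (ms), actv = active queue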
-- richard
Post by Michael Hase
This is on
opensolaris 130b, rebooting with openindiana 151a live cd gives the
same results, dd tests give the same results, too. Storage controller
is an lsi 1068 using mpt driver. The pools are newly created and
empty. atime on/off doesn't make a difference.
Is there an explanation why
1) in the raid0 case the write speed is more or less the same as a
single disk.
2) in the mirror case the write speed is cut by half, and the read
speed is the same as a single disk. I'd expect about twice the
performance for both reading and writing, maybe a bit less, but
definitely more than measured.
For comparison I did the same tests with 2 old 2.5" 36gb sas 10k disks
maxing out at about 50-60 MB/sec on the outer tracks.
           seq write   factor   seq read   factor
             MB/sec               MB/sec
single           38        1          50        1
raid0            89        2         111        2
mirror           36        1          92        2
Here we get the expected behaviour: raid0 with about double the
performance for reading and writing, mirror about the same performance
for writing, and double the speed for reading, compared to a single
disk. An old scsi system with 4x2 mirror pairs also shows these
scaling characteristics, about 450-500 MB/sec seq read and 250 MB/sec
write, each disk capable of 80 MB/sec. I don't care about absolute
numbers, just don't get why the sata system is so much slower than
expected, especially for a simple mirror. Any ideas?
Thanks,
Michael
--
Michael Hase
http://edition-software.de <sata.txt> <sas.txt>
--
ZFS Performance and Training
***@RichardElling.com
+1-760-896-4422
Stefan Ring
2012-07-16 15:42:02 UTC
Post by Michael Hase
2) in the mirror case the write speed is cut by half, and the read
speed is the same as a single disk. I'd expect about twice the
performance for both reading and writing, maybe a bit less, but
definitely more than measured.
I wouldn't expect mirrored read to be faster than single-disk read,
because the individual disks would need to read small chunks of data
with holes in between. Whether the holes are read or skipped, the
disk still spins at the same speed.
Bob Friesenhahn
2012-07-16 15:49:55 UTC
Post by Stefan Ring
I wouldn't expect mirrored read to be faster than single-disk read,
because the individual disks would need to read small chunks of data
with holes in-between. Regardless of the holes being read or not, the
disk will spin at the same speed.
It is normal for reads from mirrors to be faster than for a single
disk because reads can be scheduled from either disk, with different
I/Os being handled in parallel.

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Stefan Ring
2012-07-16 16:41:56 UTC
It is normal for reads from mirrors to be faster than for a single disk
because reads can be scheduled from either disk, with different I/Os being
handled in parallel.
That assumes that there *are* outstanding requests to be scheduled in
parallel, which would only happen with multiple readers or a large
read-ahead buffer.
Bob Friesenhahn
2012-07-16 16:47:38 UTC
Post by Stefan Ring
It is normal for reads from mirrors to be faster than for a single disk
because reads can be scheduled from either disk, with different I/Os being
handled in parallel.
That assumes that there *are* outstanding requests to be scheduled in
parallel, which would only happen with multiple readers or a large
read-ahead buffer.
That is true. ZFS tries to detect sequential reads and then requests
more data than the application has asked for so far. In that case the
data may be prefetched from the other disk before the application has
requested it.
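
For reference, the prefetch tunable can be inspected or toggled at runtime
with mdb on Solaris-derived kernels (a quick experiment, not a persistent
setting):

echo "zfs_prefetch_disable/D" | mdb -k       # show current value (0 = prefetch enabled)
echo "zfs_prefetch_disable/W0t1" | mdb -kw   # disable prefetch for testing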

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Michael Hase
2012-07-16 17:22:03 UTC
Post by Stefan Ring
It is normal for reads from mirrors to be faster than for a single disk
because reads can be scheduled from either disk, with different I/Os being
handled in parallel.
That assumes that there *are* outstanding requests to be scheduled in
parallel, which would only happen with multiple readers or a large
read-ahead buffer.
That is true. Zfs tries to detect the case of sequential reads and requests
to read more data than the application has already requested. In this case
the data may be prefetched from the other disk before the application has
requested it.
This is my understanding of zfs: it should load-balance read requests even
for a single sequential reader. zfs_prefetch_disable is at its default of 0.
And I can see exactly this scaling behaviour with SAS disks and with SCSI
disks, just not on this SATA pool.

zfs_vdev_max_pending is already tuned down to 3 as recommended for SATA
disks; iostat -Mxnz 2 looks something like

    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
  507.1    0.0   63.4    0.0  0.0  2.9    0.0    5.8   1  99 c13t5d0
  477.6    0.0   59.7    0.0  0.0  2.8    0.0    5.8   1  94 c13t4d0

when reading from the zfs mirror. The default zfs_vdev_max_pending=10
leads to much higher service times in the 20-30 msec range, while the
throughput remains roughly the same.
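
(For completeness, that tuning was applied along these lines; the mdb form
takes effect immediately, the /etc/system form persists across reboots:)

echo "zfs_vdev_max_pending/W0t3" | mdb -kw   # runtime change
set zfs:zfs_vdev_max_pending = 3             # /etc/system entry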

I can read from the dsk or rdsk devices in parallel with real platter
speeds:

dd if=/dev/dsk/c13t4d0s0 of=/dev/null bs=1024k count=8192 &
dd if=/dev/dsk/c13t5d0s0 of=/dev/null bs=1024k count=8192 &

                         extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
 2467.5    0.0  134.9    0.0  0.0  0.9    0.0    0.4   1  87 c13t5d0
 2546.5    0.0  139.3    0.0  0.0  0.8    0.0    0.3   1  84 c13t4d0

So I think there is no problem with the disks.

Maybe it's a corner case which doesn't matter for real-world applications?
The random seek values in my bonnie output show the expected performance
boost when going from one disk to a mirrored configuration. It's just the
sequential read/write case that differs between the SATA and SAS disks.

Michael
Bob
--
Bob Friesenhahn
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Bob Friesenhahn
2012-07-16 18:08:10 UTC
Post by Michael Hase
This is my understanding of zfs: it should load balance read requests even
for a single sequential reader. zfs_prefetch_disable is the default 0. And I
can see exactly this scaling behaviour with sas disks and with scsi disks,
just not on this sata pool.
Is the BIOS configured to use AHCI mode or is it using IDE mode?

Are the disks 512 byte/sector or 4K?
Post by Michael Hase
Maybe it's a corner case which doesn't matter in real world applications? The
random seek values in my bonnie output show the expected performance boost
when going from one disk to a mirrored configuration. It's just the
sequential read/write case, that's different for sata and sas disks.
I don't have a whole lot of experience with SATA disks, but it is my
impression that you might see this sort of performance if the BIOS was
configured so that the drives were used as IDE disks. If not that,
then there must be a bottleneck in your hardware somewhere.
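
Both questions can be checked from the running system; these are
illustrative commands using the device names seen earlier in the thread:

prtvtoc /dev/rdsk/c13t4d0s0 | head       # logical bytes/sector as seen by the OS
prtconf -D | egrep -i 'mpt|ahci|ata'     # which HBA/disk driver actually attached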

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Michael Hase
2012-07-16 20:24:01 UTC
Post by Bob Friesenhahn
Post by Michael Hase
This is my understanding of zfs: it should load balance read requests even
for a single sequential reader. zfs_prefetch_disable is the default 0. And
I can see exactly this scaling behaviour with sas disks and with scsi
disks, just not on this sata pool.
Is the BIOS configured to use AHCI mode or is it using IDE mode?
Not relevant here: the disks are connected to an onboard SAS HBA (LSI 1068,
see first post), and the hardware is a Primergy RX330 with 2 quad-core
Opterons.
Post by Bob Friesenhahn
Are the disks 512 byte/sector or 4K?
512 byte/sector, HDS721010CLA330
Post by Bob Friesenhahn
Post by Michael Hase
Maybe it's a corner case which doesn't matter in real world applications?
The random seek values in my bonnie output show the expected performance
boost when going from one disk to a mirrored configuration. It's just the
sequential read/write case, that's different for sata and sas disks.
I don't have a whole lot of experience with SATA disks but it is my
impression that you might see this sort of performance if the BIOS was
configured so that the drives were used as IDE disks. If not that, then
there must be a bottleneck in your hardware somewhere.
With early Nevada releases I did indeed have the IDE/AHCI problem, albeit on
different hardware. Solaris only ran in IDE mode and the disks were 4 times
slower than on Linux, see
http://www.oracle.com/webfolder/technetwork/hcl/data/components/details/intel/sol_10_05_08/2999.html

Wouldn't a hardware bottleneck show up in raw dd tests as well? I can
stream > 130 MB/sec from each of the two disks in parallel. dd reading
from more than these two disks at the same time results in a slight
slowdown, but there we are talking about nearly 400 MB/sec of aggregate
bandwidth through the onboard HBA (the box has 6 disk slots):

                         extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
   94.5    0.0   94.5    0.0  0.0  1.0    0.0   10.5   0 100 c13t6d0
   94.5    0.0   94.5    0.0  0.0  1.0    0.0   10.6   0 100 c13t1d0
   93.0    0.0   93.0    0.0  0.0  1.0    0.0   10.7   0 100 c13t2d0
   94.5    0.0   94.5    0.0  0.0  1.0    0.0   10.5   0 100 c13t5d0

Don't know why this is a bit slower, maybe some PCIe bottleneck. Or
something with the mpt driver: intrstat shows that only one CPU handles all
mpt interrupts. Or even the slow CPUs? These are 1.8 GHz Opterons.

During sequential reads from the zfs mirror I see > 1000 interrupts/sec on
one CPU. So it could really be a bottleneck somewhere, triggered by the
"smallish" 128k I/O requests from the zfs side. I think I'll benchmark
again on a Xeon box with faster CPUs; my tests with the SAS disks were done
on that other box.
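
Something like this, run during the sequential read, should show whether a
single CPU is saturating on mpt interrupts (illustrative):

intrstat 5        # per-CPU interrupt rates broken down by driver (e.g. mpt)
mpstat 5          # look for one CPU busy with intr/sys while the others idle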

Michael
Post by Bob Friesenhahn
Bob
--
Bob Friesenhahn
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Edward Ned Harvey
2012-07-16 19:50:06 UTC
Post by Michael Hase
got some strange results, please see
           seq write   factor   seq read   factor
             MB/sec               MB/sec
single          123        1         135        1
raid0           114        1         249        2
mirror           57        0.5       129        1
I agree with you these look wrong. Here is what you should expect:

          seq W   seq R
single     1.0     1.0
stripe     2.0     2.0
mirror     1.0     2.0

You have three things wrong:
(a) stripe should write 2x
(b) mirror should write 1x
(c) mirror should read 2x

I would have simply said "for some reason your drives are unable to operate
concurrently" but you have the stripe read 2x.

I cannot think of a single reason that the stripe should be able to read 2x,
and the mirror only 1x.
Michael Hase
2012-07-16 22:40:54 UTC
Post by Edward Ned Harvey
Post by Michael Hase
got some strange results, please see
           seq write   factor   seq read   factor
             MB/sec               MB/sec
single          123        1         135        1
raid0           114        1         249        2
mirror           57        0.5       129        1
          seq W   seq R
single     1.0     1.0
stripe     2.0     2.0
mirror     1.0     2.0
(a) stripe should write 2x
(b) mirror should write 1x
(c) mirror should read 2x
I would have simply said "for some reason your drives are unable to operate
concurrently" but you have the stripe read 2x.
I cannot think of a single reason that the stripe should be able to read 2x,
and the mirror only 1x.
Yes, I think so too. In the meantime I switched the two disks to another
box (HP xw8400, 2 Xeon 5150 CPUs, 16 GB RAM); this is the machine where I
did the previous SAS tests. The OS is now OpenIndiana 151a (vs OpenSolaris
b130 before), the mirror pool was upgraded from version 22 to 28, and the
raid0 pool was newly created. The results look quite different:

           seq write   factor   seq read   factor
             MB/sec               MB/sec
raid0           236        2         330        2.5
mirror          111        1         128        1

Now the raid0 case shows excellent performance; the 330 MB/sec is a bit
on the optimistic side, maybe some ARC cache effects (file size 32 GB,
16 GB RAM). iostat during the sequential read shows about 115 MB/sec from
each disk, which is great.

The (really desired) mirror case still has a problem with sequential
reads. Sequential writes to the mirror are twice as fast as before, and
show the expected single-disk performance.

So only one thing left: mirror should read 2x

I suspect the difference is not the hardware; both boxes should have
enough horsepower to easily do sequential reads at well over 200
MB/sec. In all tests CPU time (user and system) remained quite low. I
think it's an OS issue: OpenSolaris b130 is over 2 years old, OI 151a
dates from 11/2011.

Could someone please send me some bonnie++ results for a 2 disk mirror or
a 2x2 disk mirror pool with sata disks?

Michael
--
Michael Hase
http://edition-software.de
Bob Friesenhahn
2012-07-16 23:09:24 UTC
Post by Michael Hase
So only one thing left: mirror should read 2x
I don't think that mirror should necessarily read 2x faster even
though the potential is there to do so. Last I heard, zfs did not
include a special read scheduler for sequential reads from a mirrored
pair. As a result, 50% of the time, a read will be scheduled for a
device which already has a read scheduled. If this is indeed true,
the typical performance would be 150%. There may be some other
scheduling factor (e.g. estimate of busyness) which might still allow
zfs to select the right side and do better than that.

If you were to add a second vdev (i.e. stripe) then you should see
very close to 200% due to the default round-robin scheduling of the
writes.

It is really difficult to measure zfs read performance due to caching
effects. One way to do it is to write a large file (containing random
data such as returned from /dev/urandom) to a zfs filesystem, unmount
the filesystem, remount the filesystem, and then time how long it
takes to read the file once. The reason why this works is because
remounting the filesystem restarts the filesystem cache.

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Michael Hase
2012-07-17 11:33:28 UTC
sorry to insist, but still no real answer...
Post by Michael Hase
So only one thing left: mirror should read 2x
I don't think that mirror should necessarily read 2x faster even though the
potential is there to do so. Last I heard, zfs did not include a special
read scheduler for sequential reads from a mirrored pair. As a result, 50%
of the time, a read will be scheduled for a device which already has a read
scheduled. If this is indeed true, the typical performance would be 150%.
There may be some other scheduling factor (e.g. estimate of busyness) which
might still allow zfs to select the right side and do better than that.
If you were to add a second vdev (i.e. stripe) then you should see very close
to 200% due to the default round-robin scheduling of the writes.
My expectation would be > 200%, as 4 disks are involved. It may not be the
perfect 4x scaling, but imho it should be (and is for a scsi system) more
than half of the theoretical throughput. This is solaris or a solaris
derivative, not linux ;-)
It is really difficult to measure zfs read performance due to caching
effects. One way to do it is to write a large file (containing random data
such as returned from /dev/urandom) to a zfs filesystem, unmount the
filesystem, remount the filesystem, and then time how long it takes to read
the file once. The reason why this works is because remounting the
filesystem restarts the filesystem cache.
Ok, I did a zpool export/import cycle between the dd write and read tests.
This really empties the ARC; I checked this with arc_summary.pl. The test
even uses two processes in parallel (it doesn't make a difference). The
result is still the same:

dd write: 2x 58 MB/sec --> perfect, each disk does > 110 MB/sec
dd read: 2x 68 MB/sec --> imho too slow, about 68 MB/sec per disk
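
(The procedure was roughly the following; the pool and file names are
illustrative, not the exact ones used:)

zpool export tank && zpool import tank       # empties the ARC for this pool
dd if=/tank/bench/f1 of=/dev/null bs=1024k &
dd if=/tank/bench/f2 of=/dev/null bs=1024k &
wait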

For writes each disk gets 900 128k I/O requests/sec with asvc_t in the 8-9
msec range. For reads each disk only gets 500 I/O requests/sec, with asvc_t
at 18-20 msec under the default zfs_vdev_max_pending=10. When reducing
zfs_vdev_max_pending the asvc_t drops accordingly; the I/O rate remains at
500/sec per disk, and the throughput stays the same. I think the iostat
values should be reliable here. These high IOPS numbers make sense as we
work on empty pools, so there aren't very high seek times.

All benchmarks (dd, bonnie, will try iozone) lead to the same result: on
the SATA mirror pair, read performance is in the range of a single disk.
For the SAS disks (only two available for testing) and for the SCSI system
there is quite good throughput scaling.

Here, for comparison, is a table for 1-4 36 GB 15k U320 SCSI disks on an
old SXDE box (Nevada b130):

             seq write   factor   seq read   factor
               MB/sec               MB/sec
single             82        1          78        1
mirror             79        1         137        1.75
2x mirror         120        1.5       251        3.2

This is exactly what imho is to be expected from mirrors and striped
mirrors. It just doesn't happen for my SATA pool. I still have no reference
numbers for other SATA pools, just one with the 4k/512-byte sector problem,
which is even slower than mine. It seems the zfs performance people just
use SAS disks and are done with it.

Michael
Bob
--
Bob Friesenhahn
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Bob Friesenhahn
2012-07-17 14:01:38 UTC
Post by Michael Hase
If you were to add a second vdev (i.e. stripe) then you should see very
close to 200% due to the default round-robin scheduling of the writes.
My expectation would be > 200%, as 4 disks are involved. It may not be the
perfect 4x scaling, but imho it should be (and is for a scsi system) more
than half of the theoretical throughput. This is solaris or a solaris
derivative, not linux ;-)
Here are some results from my own machine based on the 'virgin mount'
test approach. The results show less boost than is reported by a
benchmark tool like 'iozone' which sees benefits from caching.

I get an initial sequential read speed of 657 MB/s on my new pool
which has 1200 MB/s of raw bandwidth (if mirrors could produce 100%
boost). Reading the file a second time reports 6.9 GB/s.

The below is with a 2.6 GB test file but with a 26 GB test file (just
add another zero to 'count' and wait longer) I see an initial read
rate of 618 MB/s and a re-read rate of 8.2 GB/s. The raw disk can
transfer 150 MB/s.

% zpool status
pool: tank
state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
pool will no longer be accessible on older software versions.
scan: scrub repaired 0 in 0h10m with 0 errors on Mon Jul 16 04:30:48 2012
config:

NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
c7t50000393E8CA21FAd0p0 ONLINE 0 0 0
c11t50000393D8CA34B2d0p0 ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
c8t50000393E8CA2066d0p0 ONLINE 0 0 0
c12t50000393E8CA2196d0p0 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
c9t50000393D8CA82A2d0p0 ONLINE 0 0 0
c13t50000393E8CA2116d0p0 ONLINE 0 0 0
mirror-3 ONLINE 0 0 0
c10t50000393D8CA59C2d0p0 ONLINE 0 0 0
c14t50000393D8CA828Ed0p0 ONLINE 0 0 0

errors: No known data errors
% pfexec zfs create tank/zfstest
% pfexec zfs create tank/zfstest/defaults
% cd /tank/zfstest/defaults
% pfexec dd if=/dev/urandom of=random.dat bs=128k count=20000
20000+0 records in
20000+0 records out
2621440000 bytes (2.6 GB) copied, 36.8133 s, 71.2 MB/s
% cd ..
% pfexec zfs umount tank/zfstest/defaults
% pfexec zfs mount tank/zfstest/defaults
% cd defaults
% dd if=random.dat of=/dev/null bs=128k count=20000
20000+0 records in
20000+0 records out
2621440000 bytes (2.6 GB) copied, 3.99229 s, 657 MB/s
% pfexec dd if=/dev/rdsk/c7t50000393E8CA21FAd0p0 of=/dev/null bs=128k count=2000
2000+0 records in
2000+0 records out
262144000 bytes (262 MB) copied, 1.74532 s, 150 MB/s
% bc
scale=8
657/150
4.38000000

It is very difficult to benchmark with a cache which works so well:

% dd if=random.dat of=/dev/null bs=128k count=20000
20000+0 records in
20000+0 records out
2621440000 bytes (2.6 GB) copied, 0.379147 s, 6.9 GB/s

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Michael Hase
2012-07-17 16:11:40 UTC
Post by Michael Hase
If you were to add a second vdev (i.e. stripe) then you should see very
close to 200% due to the default round-robin scheduling of the writes.
My expectation would be > 200%, as 4 disks are involved. It may not be the
perfect 4x scaling, but imho it should be (and is for a scsi system) more
than half of the theoretical throughput. This is solaris or a solaris
derivative, not linux ;-)
Here are some results from my own machine based on the 'virgin mount' test
approach. The results show less boost than is reported by a benchmark tool
like 'iozone' which sees benefits from caching.
I get an initial sequential read speed of 657 MB/s on my new pool which has
1200 MB/s of raw bandwidth (if mirrors could produce 100% boost). Reading
the file a second time reports 6.9 GB/s.
The below is with a 2.6 GB test file but with a 26 GB test file (just add
another zero to 'count' and wait longer) I see an initial read rate of 618
MB/s and a re-read rate of 8.2 GB/s. The raw disk can transfer 150 MB/s.
To work around these caching effects just use a file > 2 times the size
of RAM; iostat then shows the numbers really coming from disk. I always
test like this. A re-read rate of 8.2 GB/s is really just memory
bandwidth, but quite impressive ;-)
% pfexec zfs create tank/zfstest/defaults
% cd /tank/zfstest/defaults
% pfexec dd if=/dev/urandom of=random.dat bs=128k count=20000
20000+0 records in
20000+0 records out
2621440000 bytes (2.6 GB) copied, 36.8133 s, 71.2 MB/s
% cd ..
% pfexec zfs umount tank/zfstest/defaults
% pfexec zfs mount tank/zfstest/defaults
% cd defaults
% dd if=random.dat of=/dev/null bs=128k count=20000
20000+0 records in
20000+0 records out
2621440000 bytes (2.6 GB) copied, 3.99229 s, 657 MB/s
% pfexec dd if=/dev/rdsk/c7t50000393E8CA21FAd0p0 of=/dev/null bs=128k count=2000
2000+0 records in
2000+0 records out
262144000 bytes (262 MB) copied, 1.74532 s, 150 MB/s
% bc
scale=8
657/150
4.38000000
% dd if=random.dat of=/dev/null bs=128k count=20000
20000+0 records in
20000+0 records out
2621440000 bytes (2.6 GB) copied, 0.379147 s, 6.9 GB/s
This is not my point; I'm pretty sure I did not measure any ARC effects,
maybe with the one exception of the raid0 test on the SCSI array. Don't
know why the ARC had this effect, the file size was 2x RAM. The point is:
I'm searching for an explanation for the relative slowness of a mirror
pair of SATA disks, or some tuning knobs, or something like "the disks are
plain crap", or maybe: zfs throttles SATA disks in general (I don't know
the internals).

In the range of > 600 MB/s other issues may show up (PCIe bus contention,
HBA contention, CPU load). And performance at this level could be just
good enough, not requiring any further tuning. Could you recheck with only
4 disks (2 mirror pairs)? If you only get some 350 MB/s it could be the
same problem as with my boxes. All SATA disks?

Michael
Bob
--
Bob Friesenhahn
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Bob Friesenhahn
2012-07-17 16:41:30 UTC
The below is with a 2.6 GB test file but with a 26 GB test file (just add
another zero to 'count' and wait longer) I see an initial read rate of 618
MB/s and a re-read rate of 8.2 GB/s. The raw disk can transfer 150 MB/s.
To work around these caching effects just use a file > 2 times the size of
ram, iostat then shows the numbers really coming from disk. I always test
like this. a re-read rate of 8.2 GB/s is really just memory bandwidth, but
quite impressive ;-)
Yes, in the past I have done benchmarking with a file size 2X the size
of memory. This does not necessarily erase all caching because the ARC
is smart enough not to toss everything.

At the moment I have an iozone benchmark run going from 8 GB up to 256 GB
file size. I see that it has started the 256 GB size now. It may be
a while. Maybe a day.
In the range of > 600 MB/s other issues may show up (pcie bus contention, hba
contention, cpu load). And performance at this level could be just good
enough, not requiring any further tuning. Could you recheck with only 4 disks
(2 mirror pairs)? If you just get some 350 MB/s it could be the same problem
as with my boxes. All sata disks?
Unfortunately, I have already put my pool into use and cannot conveniently
destroy it now.

The disks I am using are SAS (7200 RPM, 1 TB) but return similar
per-disk data rates as the SATA disks I use for the boot pool.

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Bob Friesenhahn
2012-07-17 20:18:26 UTC
To work around these caching effects just use a file > 2 times the size of
ram, iostat then shows the numbers really coming from disk. I always test
like this. a re-read rate of 8.2 GB/s is really just memory bandwidth, but
quite impressive ;-)
Ok, the iozone benchmark finally completed. The results do suggest
that reading from mirrors substantially improves the throughput.
This is interesting since the results differ from (are better than) my
'virgin mount' test approach:

Command line used: iozone -a -i 0 -i 1 -y 64 -q 512 -n 8G -g 256G

KB reclen write rewrite read reread
8388608 64 572933 1008668 6945355 7509762
8388608 128 2753805 2388803 6482464 7041942
8388608 256 2508358 2331419 2969764 3045430
8388608 512 2407497 2131829 3021579 3086763
16777216 64 671365 879080 6323844 6608806
16777216 128 1279401 2286287 6409733 6739226
16777216 256 2382223 2211097 2957624 3021704
16777216 512 2237742 2179611 3048039 3085978
33554432 64 933712 699966 6418428 6604694
33554432 128 459896 431640 6443848 6546043
33554432 256 444490 430989 2997615 3026246
33554432 512 427158 430891 3042620 3100287
67108864 64 426720 427167 6628750 6738623
67108864 128 419328 422581 6666153 6743711
67108864 256 419441 419129 3044352 3056615
67108864 512 431053 417203 3090652 3112296
134217728 64 417668 55434 759351 760994
134217728 128 409383 400433 759161 765120
134217728 256 408193 405868 763892 766184
134217728 512 408114 403473 761683 766615
268435456 64 418910 55239 768042 768498
268435456 128 408990 399732 763279 766882
268435456 256 413919 399386 760800 764468
268435456 512 410246 403019 766627 768739

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Paul Kraus
2013-02-26 17:37:47 UTC
Be careful when testing ZFS with iozone. I ran a bunch of stats many years ago that produced results that did not pass a basic sanity check. There was *something* about the iozone test data that ZFS either did not like or liked very much, depending on the specific test.

I eventually wrote my own very crude tool to test exactly what our workload was, and started getting results that matched the reality we saw.
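
A crude check of that shape can be as simple as a timed write/read pass
sized to the workload; this sketch only illustrates the idea (pool name and
size are made up):

#!/bin/ksh
F=/tank/bench/testfile
time dd if=/dev/zero of=$F bs=1024k count=32768   # ~32 GB sequential write
zpool export tank && zpool import tank            # drop the ARC before reading
time dd if=$F of=/dev/null bs=1024k               # sequential re-read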
Post by Bob Friesenhahn
To work around these caching effects just use a file > 2 times the size of ram, iostat then shows the numbers really coming from disk. I always test like this. a re-read rate of 8.2 GB/s is really just memory bandwidth, but quite impressive ;-)
Command line used: iozone -a -i 0 -i 1 -y 64 -q 512 -n 8G -g 256G
KB reclen write rewrite read reread
8388608 64 572933 1008668 6945355 7509762
8388608 128 2753805 2388803 6482464 7041942
8388608 256 2508358 2331419 2969764 3045430
8388608 512 2407497 2131829 3021579 3086763
16777216 64 671365 879080 6323844 6608806
16777216 128 1279401 2286287 6409733 6739226
16777216 256 2382223 2211097 2957624 3021704
16777216 512 2237742 2179611 3048039 3085978
33554432 64 933712 699966 6418428 6604694
33554432 128 459896 431640 6443848 6546043
33554432 256 444490 430989 2997615 3026246
33554432 512 427158 430891 3042620 3100287
67108864 64 426720 427167 6628750 6738623
67108864 128 419328 422581 6666153 6743711
67108864 256 419441 419129 3044352 3056615
67108864 512 431053 417203 3090652 3112296
134217728 64 417668 55434 759351 760994
134217728 128 409383 400433 759161 765120
134217728 256 408193 405868 763892 766184
134217728 512 408114 403473 761683 766615
268435456 64 418910 55239 768042 768498
268435456 128 408990 399732 763279 766882
268435456 256 413919 399386 760800 764468
268435456 512 410246 403019 766627 768739
Bob
--
Bob Friesenhahn
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
--
Paul Kraus
Deputy Technical Director, LoneStarCon 3
Sound Coordinator, Schenectady Light Opera Company
Edward Ned Harvey
2012-07-16 23:40:36 UTC
Sent: Monday, July 16, 2012 6:41 PM
So only one thing left: mirror should read 2x
That is still weird -
But all your numbers so far are coming from bonnie. Why don't you do a test
like this? (below)

Write a big file to the mirror. Reboot (or something) to clear the cache.
Now time reading the file back.
Sometimes you'll get a different result with dd versus cat.
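
Roughly (using a pool export/import in place of a reboot; names are
illustrative):

zpool export tank && zpool import tank
time dd if=/tank/bigfile of=/dev/null bs=1024k
zpool export tank && zpool import tank
time cat /tank/bigfile > /dev/null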
Could someone please send me some bonnie++ results for a 2 disk mirror or
a 2x2 disk mirror pool with sata disks?
I don't have bonnie, but I have certainly confirmed mirror performance on
solaris before with sata disks. I've generally done iozone, benchmarking
the N-way mirror, and the stripe-of-mirrors. So I know the expectation in
this case is correct.
hagai
2013-02-26 13:41:23 UTC
For what it's worth:
I had the same problem and found the answer here -
http://forums.freebsd.org/showthread.php?t=27207
Bob Friesenhahn
2013-02-27 02:53:45 UTC
Post by hagai
For what it's worth:
I had the same problem and found the answer here -
http://forums.freebsd.org/showthread.php?t=27207
Given enough sequential I/O requests, zfs mirrors behave very much
like RAID-0 for reads. Sequential prefetch is very important in order
to avoid the latencies.

While this script may not work perfectly as-is on FreeBSD, it was
very good at discovering a zfs performance bug (since corrected) and
is still an interesting exercise to see how ZFS ARC caching
helps with re-reads. See
"http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh".
The script exercises an initial uncached read from the disks, and then
a (hopefully) cached re-read. I think that it serves as a useful
benchmark.

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/