Discussion:
Very slow ZFS write speed to raw zvol
Leon Verrall
2009-07-08 15:07:55 UTC
Guys,

Have an OpenSolaris x86 box running:

SunOS thsudfile01 5.11 snv_111b i86pc i386 i86pc Solaris

This has 2 old QLA2200 1Gbit FC cards attached. Each bus is connected to an old Transtec FC RAID array. These hold a couple of large LUNs that form a single large zpool:

***@thsudfile01:~# zpool status bucket
pool: bucket
state: ONLINE
scrub: none requested
config:

NAME STATE READ WRITE CKSUM
bucket ONLINE 0 0 0
c5t0d0 ONLINE 0 0 0
c8t3d0 ONLINE 0 0 0

errors: No known data errors
***@thsudfile01:~# zfs list bucket
NAME USED AVAIL REFER MOUNTPOINT
bucket 2.69T 5.31T 22K /bucket

This is being used as an iSCSI target for an ESX 4.0 development environment. I found the performance to be really poor, and the culprit seems to be write performance to the raw zvol. For example, on this dataset allocated as a volume:

***@thsudfile01:~# zfs list bucket/iSCSI/lun1
NAME USED AVAIL REFER MOUNTPOINT
bucket/iSCSI/lun1 250G 5.55T 3.64G -

***@thsudfile01:~# dd if=/dev/zero of=/dev/zvol/rdsk/bucket/iSCSI/lun1 bs=65536 count=102400
^C7729+0 records in
7729+0 records out
506527744 bytes (507 MB) copied, 241.707 s, 2.1 MB/s

Some output from "zpool iostat 1 1000":

bucket 2.44T 5.68T 0 203 0 2.73M
bucket 2.44T 5.68T 0 216 0 2.83M
bucket 2.44T 5.68T 0 120 63.4K 1.58M
bucket 2.44T 5.68T 2 350 190K 16.9M
bucket 2.44T 5.68T 0 123 0 1.64M
bucket 2.44T 5.68T 0 230 0 3.02M

Read performance from that zvol (assuming /dev/null behaves properly) is fine:

***@thsudfile01:/bucket/transtec# dd of=/dev/null if=/dev/zvol/rdsk/bucket/iSCSI/lun1 bs=65536 count=204800
204800+0 records in
204800+0 records out
13421772800 bytes (13 GB) copied, 47.0256 s, 285 MB/s

That figure is somewhat optimistic... but iostat shows roughly 100MB/s.
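The device-level rate can be watched on the physical disks directly, e.g.:

# iostat -xn 1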

Writing to a ZFS filesystem from that zpool is also fine, here with a write big enough to exhaust the machine's 12GB of memory:

***@thsudfile01:/bucket/transtec# dd if=/dev/zero of=FILE bs=65536 count=409600
^C
336645+0 records in
336645+0 records out
22062366720 bytes (22 GB) copied, 176.369 s, 125 MB/s

and "zpool iostat" shows bursts as the cache flushes:

bucket 2.44T 5.68T 0 342 0 38.7M
bucket 2.44T 5.68T 0 1.47K 0 188M
bucket 2.44T 5.68T 0 240 0 21.3M
bucket 2.44T 5.68T 0 1.54K 0 191M
bucket 2.44T 5.68T 0 1.49K 0 191M
bucket 2.44T 5.68T 0 434 0 44.2M

So we seem to be able to get data down to disk via the cache at a reasonable rate and read from a raw zvol OK, but writes are horribly slow.

Am I missing something obvious? Let me know what info would be diagnostic and I'll post it...

Cheers,

Leon
Jim Klimov
2009-07-09 06:15:13 UTC
Do you have any older benchmarks for these cards and arrays (from their pre-ZFS life)? Perhaps this is not a ZFS regression but a hardware config issue?
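For a quick baseline it may be worth reading straight off one of the LUNs, to see what the FC path itself sustains (a sketch; the device name is taken from your zpool status above, and the p0 whole-device path is my assumption):

# dd if=/dev/rdsk/c5t0d0p0 of=/dev/null bs=1048576 count=1024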

Perhaps there's some caching (like the per-disk write cache) not enabled on the arrays? As you may know, the availability (and reliability) of such caching often depends on the presence of battery back-up (BBU) modules on an array controller.

It is also often pointed out that ZFS prefers raw disks over hardware vendors' RAID arrays: errors can creep into data bits undetected by the firmware's algorithms unless the whole disk breaks, or - in more elaborate conspiracy theories - can be caused by bugs in the firmware itself.

Because of this, it is often recommended to use external RAID implementations as vdevs in a redundant ZFS pool (such as a mirror of two equivalent arrays).

//Jim
Jim Klimov
2009-07-09 07:08:10 UTC
Hmm, scratch that. Maybe.

At first I missed the point that your writes to a filesystem dataset work quickly. Perhaps the filesystem is indeed (better) cached, i.e. *maybe* zvol writes are synchronous while filesystem writes are cached and thus async? Try playing around with the relevant dataset attributes...
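For instance, comparing the caching- and block-size-related properties of the zvol and the filesystem might show a difference (a sketch; the dataset names are the ones from the posts above):

# zfs get primarycache,secondarycache,volblocksize,compression bucket/iSCSI/lun1
# zfs get primarycache,secondarycache,recordsize,compression bucket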

I'm running a test on my system (a snv_114 Thumper, 16GB RAM, used for other purposes as well); the CPU is mostly idle now (2.5-3.2% kernel time, that's about it). It seems I have results not unlike yours. Not cool, because I wanted to play with COMSTAR iSCSI - and now I'm not sure it will perform well ;)

I'm dd'ing 30GB to an uncompressed test zvol with the same 64KB block size (maybe it is too small?), and zpool iostat goes like this: a hundred IOPS at 7MBps for a minute, then a one-second burst of 100-170MBps at 20-25K IOPS:

pond 5.79T 4.41T 0 106 0 7.09M
pond 5.79T 4.41T 0 1.93K 0 20.7M
pond 5.79T 4.41T 0 13.3K 0 106M
pond 5.79T 4.41T 0 116 0 7.76M
pond 5.79T 4.41T 0 108 0 7.23M
pond 5.79T 4.41T 0 107 0 7.16M
pond 5.79T 4.41T 0 107 0 7.16M

or

pond 5.79T 4.41T 0 117 0 7.83M
pond 5.79T 4.41T 0 5.61K 0 49.7M
pond 5.79T 4.41T 0 19.0K 504 149M
pond 5.79T 4.41T 0 104 0 6.96M

Weird indeed.

It wrote 10GB (according to "zfs get usedbydataset pond/test") in roughly 30 minutes, after which I killed it.

Now, writing to an uncompressed filesystem dataset (although very far from
what's trumpeted as Thumper performance) yields quite different numbers:

pond 5.80T 4.40T 1 3.64K 1022 457M
pond 5.80T 4.40T 0 866 967 75.7M
pond 5.80T 4.40T 0 4.65K 0 586M
pond 5.80T 4.40T 6 802 33.4K 69.2M
pond 5.80T 4.40T 29 2.44K 1.10M 301M
pond 5.80T 4.40T 32 691 735K 25.0M
pond 5.80T 4.40T 56 1.59K 2.29M 184M
pond 5.80T 4.40T 150 768 4.61M 10.5M
pond 5.80T 4.40T 2 0 25.5K 0
pond 5.80T 4.40T 0 2.75K 0 341M
pond 5.80T 4.40T 7 3.96K 339K 497M
pond 5.80T 4.39T 85 740 3.57M 59.0M
pond 5.80T 4.39T 67 0 2.22M 0
pond 5.80T 4.39T 9 4.67K 292K 581M
pond 5.80T 4.39T 4 1.07K 126K 137M
pond 5.80T 4.39T 27 333 338K 9.15M
pond 5.80T 4.39T 5 0 28.0K 3.99K
pond 5.82T 4.37T 1 5.42K 1.67K 677M
pond 5.83T 4.37T 3 1.69K 8.36K 173M
pond 5.83T 4.37T 2 0 5.49K 0
pond 5.83T 4.37T 0 6.32K 0 790M
pond 5.83T 4.37T 2 290 7.95K 27.8M
pond 5.83T 4.37T 0 9.64K 1.23K 1.18G

The numbers are jumpy (maybe due to fragmentation, other processes, etc.) but
there are often spikes in excess of 500MBps.

The whole test took relatively little time:

# time dd if=/dev/zero of=/pond/tmpnocompress/test30g bs=65536 count=500000
500000+0 records in
500000+0 records out

real 1m27.657s
user 0m0.302s
sys 0m46.976s

# du -hs /pond/tmpnocompress/test30g
30G /pond/tmpnocompress/test30g

Some details about the pool:

The pool is on a Sun X4500 with 48 250GB SATA drives. It was created as a 9x5
set (9 stripes made of 5-disk raidz1 vdevs) spread across different controllers,
with the command:

# zpool create -f pond \
raidz1 c0t0d0 c1t0d0 c4t0d0 c6t0d0 c7t0d0 \
raidz1 c0t1d0 c1t2d0 c4t3d0 c6t5d0 c7t6d0 \
raidz1 c1t1d0 c4t1d0 c5t1d0 c6t1d0 c7t1d0 \
raidz1 c0t2d0 c4t2d0 c5t2d0 c6t2d0 c7t2d0 \
raidz1 c0t3d0 c1t3d0 c5t3d0 c6t3d0 c7t3d0 \
raidz1 c0t4d0 c1t4d0 c4t4d0 c6t4d0 c7t4d0 \
raidz1 c0t5d0 c1t5d0 c4t5d0 c5t5d0 c7t5d0 \
raidz1 c0t6d0 c1t6d0 c4t6d0 c5t6d0 c6t6d0 \
raidz1 c1t7d0 c4t7d0 c5t7d0 c6t7d0 c7t7d0 \
spare c0t7d0

Alas, while there were many blogs, I couldn't find a definitive answer last year as
to which Thumper layout is optimal in performance and/or reliability (in regard to
6 controllers of 8 disks each, with 2 disks on one of the controllers reserved for
booting).

As a result, we spread each raidz1 across 5 controllers, so the loss of a single controller should, on average, have minimal impact in terms of data loss. Since the system layout is not symmetrical, some controllers are more important than others (say, the boot one).

//Jim
Jim Klimov
2009-07-09 08:22:59 UTC
To tell the truth, I expected zvols to be faster than filesystem datasets. They seem to have less overhead, with no inodes, POSIX semantics, ACLs and so on. So I'm puzzled by the test results.

I'm now considering the dd I/O block size, and it matters a lot indeed, especially compared to the zvol results with small blocks like 64KB.

I ran a number of tests with the zvol recreated by these commands before each run (this may, however, cause varying fragmentation, affecting the results of different runs):

# zfs destroy -r pond/test; zfs create -V 30G pond/test; \
  zfs set compression=off pond/test; sync; \
  dd if=/dev/zero of=/dev/zvol/rdsk/pond/test count=1000 bs=512; sync

and tests going like

# time dd if=/dev/zero of=/dev/zvol/rdsk/pond/test count=1024 bs=1048576
1024+0 records in
1024+0 records out

real 0m42.442s
user 0m0.006s
sys 0m4.292s

The test progress was quite jumpy (with "zpool iostat pond 1" values varying from 30 to 70 MBps, and reads coming in sometimes).

So I'd stick to the overall result - the rounded wallclock time it takes to write 1024 records of varying size, and the resulting average end-user MBps. I also note the "sys" time, since that's what is consumed by the kernel and the disk subsystem, after all. I don't record the zpool iostat speeds, since they vary too much and I can't be bothered with a spreadsheet right now. But the reported values stay about halfway between the "wallclock MBps" and "sys MBps" calculations on the perceived average, peaking at about 350MBps for large block sizes (>4MB).
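(As a sanity check of the arithmetic for the first line below: 1024 records x 1 MiB = 1 GiB, so 1073741824 bytes / 42 s / 1048576 ~= 24 MBps wallclock, and 1073741824 / 4 / 1048576 = 256 MBps counting sys time only.)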

1 MB (bs=1048576): 42s (4s), 24MBps
4 MB (bs=4194304): 42s (15s), 96MBps
16MB (bs=16777216): 129s-148s (62-64s), 127-110MBps
32MB (bs=33554432, 40Gb zvol): 303s (127s), 108MBps

Similar results for writing a file to a filesystem; "zpool iostat" values again jumped anywhere between single MBps and GBps. A simple cleanup was used, like:

# rm /pool/test30g; sync; time dd if=/dev/zero of=/pool/test30g \
  count=1024 bs=33554432

Values remain somewhat consistent (in the same league, at least):
1 MB (bs=1048576, 10240 blocks): 20-21s (7-8s), 512-487MBps

1 MB (bs=1048576): 2.3s (0.6s), 445MBps
4 MB (bs=4194304): 8s (3s), 512MBps
16MB (bs=16777216): 37s (15s), 442MBps
32MB (bs=33554432): 74-103s (32-42s), 442-318MBps

64Kb (bs=65536, 545664 blocks): 94s (47s), 362MBps

All in all, for more precise results these tests should be run in greater numbers and averaged. But here we have some figures to think about...

On a side note, I'll now pay more attention to tuning suggestions that involve multi-megabyte buffers for network sockets, etc. They can actually affect performance many times over!

On another note,

For some reason I occasionally got results like this:
write: File too large
1+0 records in
1+0 records out

I think the zvol was not yet fully created at that point. After about 10-15 sec I was able to commence the test run. Perhaps it helped that I "initialized" the zvol with a small write after creation:
# dd if=/dev/zero of=/dev/zvol/rdsk/pond/test count=1000 bs=512
Strange...
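If that's the cause, one workaround (a sketch based on my assumption, unverified) would be to poll the device node with a tiny write until it stops failing before starting the timed run:

# until dd if=/dev/zero of=/dev/zvol/rdsk/pond/test count=1 bs=512 2>/dev/null; do sleep 1; done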
Ross Walker
2009-07-09 14:05:41 UTC
Post by Jim Klimov
[Jim's previous post quoted in full - snipped]
When running throughput tests, the block sizes to be concerned about are 4k, 8k, 16k and 64k. These are the sizes that most filesystems and databases use.

If you get 4k to perform well, chances are the others will fall into line.
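
Something like this loop would sweep those sizes in one go (just a sketch - the zvol path is the one from the tests above, and each run writes the same 1GiB total):

for bs in 4096 8192 16384 65536; do
  count=`expr 1073741824 / $bs`   # 1 GiB total per block size
  echo "bs=$bs count=$count"
  time dd if=/dev/zero of=/dev/zvol/rdsk/pond/test bs=$bs count=$count
done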

-Ross
Jim Klimov
2009-07-09 19:52:33 UTC
After reading many, many threads on ZFS performance today (top of the list in the forum, plus some chains of references), I applied a bit of tuning to the server.

In particular, I've set zfs_write_limit_override to 384MB, so the cache is spooled to disk more frequently (when streaming lots of writes) and in smaller increments.

* runtime change via mdb:
# echo zfs_write_limit_override/W0t402653184 | mdb -kw
* the equivalent /etc/system line (persistent across reboots):
set zfs:zfs_write_limit_override = 0x18000000
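
To read the current value back (from memory, so treat the syntax as a sketch):

# echo zfs_write_limit_override/E | mdb -k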

The system seems to be working more smoothly (vs. jerky), and "zpool iostat" values are not quite as jumpy (e.g. 320MBps to 360MBps for a certain test). The results also seem faster and more consistent.

With this tuning applied, I'm writing to a 40G zvol, 1M records (count=1048576) of:
4k (bs=4096): 17s (12s), 241MBps
8k (bs=8192): 29s (18s), 282MBps
16k (bs=16384): 54s (30s), 303MBps
32k (bs=32768): 113s (56s), 290MBps
64k (bs=65536): 269s (104s), 243MBps

And 10240 larger records of:
1 MB (bs=1048576): 33s (8s), 310MBps
2 MB (bs=2097152): 74s (23s), 276MBps

And 1024 yet larger records:
1 MB (bs=1048576): 4s (1s), 256MBps
4 MB (bs=4194304): 12s (5s), 341MBps
16MB (bs=16777216): 71s (18s), 230MBps
32MB (bs=33554432): 150s (36s), 218MBps

So the zvol picture is quite a bit better now (albeit not perfect - e.g. no values are near the 1GBps noted previously in "zpool iostat"), for both small and large blocks.

For the filesystem dataset the new values are very similar (down to tenths of a second at smaller block sizes!), but as the block size grows, filesystems start losing to the zvols. Overall the results seem lower than those achieved before I tried tuning.

1M records (count=1048576) of:
4k (bs=4096): 17s (12s), 241MBps
8k (bs=8192): 29s (18s), 282MBps
16k (bs=16384): 67s (30s), 245MBps
32k (bs=32768): 144s (55s), 228MBps
64k (bs=65536): 275s (98s), 238MBps

And 10240 larger records go better:
1 MB (bs=1048576): 33s (9s), 310MBps
2 MB (bs=2097152): 70s (21s), 292MBps

And 1024 yet larger records:
1 MB (bs=1048576): 2.8s (0.8s), 366MBps
4 MB (bs=4194304): 12s (4s), 341MBps
16MB (bs=16777216): 55s (17s), 298MBps
32MB (bs=33554432): 140s (36s), 234MBps

Occasionally I did reruns; user time for the same setups can vary significantly
(like 65s vs 84s) while the system time stays pretty much the same.

"zpool iostat" shows larger values (like 320MBps typically) but I think that can be
attributed to writing parity stripes on raidz vdevs.
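(With 5-disk raidz1 vdevs, every 4 data blocks carry 1 parity block, so the pool-level rate should run at roughly 5/4 of the user-visible rate: 256 MBps of user data works out to 256 * 5/4 = 320 MBps at the pool, matching the iostat figures.)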

//Jim

PS: for completeness, I'll try smaller blocks without tuning in a future post.
Eric C. Taylor
2009-07-10 14:46:49 UTC
Writes using the character interface (/dev/zvol/rdsk) are synchronous.
If you want caching, you can go through the block interface
(/dev/zvol/dsk) instead.
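
For example, rerunning the very first test against the block device (lun path from the first post) should then show the cached behaviour:

# dd if=/dev/zero of=/dev/zvol/dsk/bucket/iSCSI/lun1 bs=65536 count=102400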

- Eric