Discussion:
Performance Testing
Paul Kraus
2010-08-11 19:40:52 UTC
I know that performance has been discussed often here, but I
have just gone through some testing in preparation for deploying a
large configuration (120 drives is a large configuration for me) and I
wanted to share my results, both for their own sake and to see if
anyone spots anything wrong in either my methodology or the results.

First the hardware and requirements. We have an M4000
production server and a T2000 test server. The storage resides in five
J4400 arrays, dual-attached to the T2000 (and soon to be connected to the
M4000 as well). The drives are all 750 GB SATA disks. So we have 120
drives. The data is currently residing on other storage and will be
migrated to the new storage as soon as we are happy with the
configuration. There is about 20 TB of data today, and we need growth
to at least 40 TB. We also need a small set of drives for testing. My
plan is to use 80 to 100 drives for production and 20 drives for test.
The I/O pattern is a small number of large sequential writes (to load
the data) followed by lots of random reads and some random writes (5%
sequential writes, 10% random writes, 85% random reads). The files are
relatively small, as they are scans (TIFF) of documents; the median size
is 23 KB. The data is divided into projects, each of which varies in
size between almost nothing up to almost 50 million objects. We
currently have multiple zpools (based on department) and multiple
datasets in each (based on project). The plan for the new storage is
to go with one zpool, and then have a dataset per department, and
datasets within the departments for each project.
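In other words, something along these lines (the pool and dataset names
here are just placeholders for illustration, not the real ones):

    zfs create tank/imaging                 # one dataset per department
    zfs create tank/imaging/project-00123   # one dataset per project
    zfs create tank/records
    zfs create tank/records/project-00456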

Based on recommendations from our local Sun / Oracle staff, we
are planning on using raidz2 for recoverability reasons over mirroring
(to get a comparable level of fault tolerance with mirrors would
require three-way mirrors, and that does not get us the capacity we
need). I have been testing various raidz2 configurations to confirm
the data I have found regarding performance vs. number of vdevs and
size of raidz2 vdevs. I used 40 of the 120 disks, and used the same 40
disks for every configuration (after culling any that showed unusual asvc_t via iostat).
I used filebench for the testing as it seemed to generate real
differences based on zpool configuration (other tools I tried show no
statistical difference between zpool configurations).

See https://spreadsheets.google.com/pub?key=0AtReWsGW-SB1dFB1cmw0QWNNd0RkR1ZnN0JEb2RsLXc&hl=en&output=html
for a summary of the results. The random read numbers agree with what
is expected (performance scales linearly with the number of vdevs).
The random write numbers also agree with the expected result, except
for the 4 vdevs of 10 disk raidz2 which showed higher performance than
expected. The sequential write performance actually was fairly
consistent and even showed a slight improvement with fewer vdevs of
more disks. Based on these results, and our capacity needs, I am
planning to go with 5 disk raidz2 vdevs. Since we have five J4400, I
am considering using one disk in each of the five arrays per vdev, so
that a complete failure of a J4400 does not cause any loss of data.
What is the general opinion of that approach and does anyone know how
to map the MPxIO device name back to a physical drive ?
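To make the vdev layout concrete, here is roughly what I have in mind
(the device names are placeholders standing in for the real MPxIO names,
one disk from each J4400 per vdev; with 80 to 100 production drives at
750 GB this works out to roughly 36 to 45 TB of data capacity before ZFS
overhead, which covers the 40 TB growth target):

    zpool create tank \
        raidz2 J1-s00 J2-s00 J3-s00 J4-s00 J5-s00 \
        raidz2 J1-s01 J2-s01 J3-s01 J4-s01 J5-s01 \
        ... and so on, one vdev per slot, 16 to 20 vdevs in all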

Does anyone see any issues with either the results or the
tentative plan for the new storage layout ? Thanks, in advance, for
your feedback.

P.S. Let me know if you want me to post the filebench workloads I
used, they are the defaults with a few small tweaks (random workload
ran 64 threads, for example).
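For the random-read runs, the change amounted to roughly the following
interactive session (the directory is a placeholder, and the exact
variable names in the stock .f workloads may differ slightly):

    filebench> load randomread
    filebench> set $dir=/testpool/fbtest
    filebench> set $nthreads=64
    filebench> run 60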
--
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
Marion Hakanson
2010-08-11 20:24:02 UTC
Post by Paul Kraus
Based on these results, and our capacity needs, I am planning to go with 5
disk raidz2 vdevs.
I did similar tests with a Thumper in 2008, with X4150/J4400 in 2009,
and more recently comparing X4170/J4400 and X4170/MD1200:
http://acc.ohsu.edu/~hakansom/thumper_bench.html
http://acc.ohsu.edu/~hakansom/j4400_bench.html
http://acc.ohsu.edu/~hakansom/md1200_loadbal_bench.html

On the Thumper, we went with 7x(4D+2P) raidz2, and as a general-purpose
NFS server performance has been fantastic except (as expected without
any NV ZIL) for the very rare "lots of small synchronous I/O" workloads
(like extracting a tar archive via an NFS client).

In fact, our experience with the above has led us to go with 6x(5D+2P)
on our new X4170/J4400 NFS server. The difference between this config
and 7x(4D+2P) on the same hardware is pretty small, and both are faster
than the Thumper.
Post by Paul Kraus
Since we have five J4400, I am considering using one disk
in each of the five arrays per vdev, so that a complete failure of a J4400
does not cause any loss of data. What is the general opinion of that approach
We did something like this on the Thumper, with one disk on each of
the internal HBA's. Since our new system has only two J4400's, we
didn't try to cover this type of failure.
Post by Paul Kraus
and does anyone know how to map the MPxIO device name back to a physical
drive ?
You can use CAM to view the mapping of physical drives to device names
(with or without MPxIO enabled). That's the most human-friendly way
that I've found.

If you're using Oracle/Sun LSI HBA's (mpt), a "raidctl -l" will list
out device names like 0.0.0, 0.1.0, and so on. That middle digit does
seem to correspond with the physical slot number in the J4400's, at least
initially. Unfortunately (for this purpose), if you move drives around,
the "raidctl" names follow the drives to their new locations, as do the
Solaris device names (verified by "dd if=/dev/dsk/... of=/dev/null" and
watching the blinkenlights). Also, with multiple paths, devices will
show up with two different names in "raidctl -l", so it's a bit of a
pain to make sense of it all.
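For example, something along these lines (the device name below is only
a placeholder for whichever MPxIO name you are trying to locate):

    # list the HBA's view of the drives; the middle digit often matches the slot
    raidctl -l
    # hammer the suspect device with reads and watch for the solidly lit LED
    dd if=/dev/dsk/c0tXXXXXXXXXXXXXXXXd0s0 of=/dev/null bs=1024k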

So, just use CAM....

Regards,

Marion
Marion Hakanson
2012-03-21 17:40:18 UTC
Post by Paul Kraus
Without knowing the I/O pattern, saying 500 MB/sec. is meaningless.
Achieving 500MB/sec. with 8KB files and lots of random accesses is really
hard, even with 20 HDDs. Achieving 500MB/sec. of sequential streaming of
100MB+ files is much easier.
. . .
For ZFS, performance is proportional to the number of vdevs NOT the
number of drives or the number of drives per vdev. See
https://docs.google.com/spreadsheet/ccc?key=0AtReWsGW-SB1dFB1cmw0QWNNd0RkR1ZnN0JEb2RsLXc
for some testing I did a while back. I did not test sequential read as
that is not part of our workload.
. . .
I understand why the read performance scales with the number of vdevs,
but I have never really understood _why_ it does not also scale with the
number of drives in each vdev. When I did my testing with 40 drives, I
expected similar READ performance regardless of the layout, but that was NOT
the case.
In your first paragraph you make the important point that "performance"
is too ambiguous in this discussion. Yet in the 2nd & 3rd paragraphs above,
you go back to using "performance" in its ambiguous form. I assume that
by "performance" you are mostly focussing on random-read performance....

My experience is that sequential read performance _does_ scale with the number
of drives in each vdev. Both sequential and random write performance also
scales in this manner (note that ZFS tends to save up small, random writes
and flush them out in a sequential batch).

Small, random read performance does not scale with the number of drives in each
raidz[123] vdev because of the dynamic striping. In order to read a single
logical block, ZFS has to read all the segments of that logical block, which
have been spread out across multiple drives, in order to validate the checksum
before returning that logical block to the application. This is why a single
vdev's random-read performance is equivalent to the random-read performance of
a single drive.
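To put rough, illustrative numbers on it: if each SATA drive does on the
order of 100 random read IOPS, a single 10-disk raidz2 vdev delivers
roughly 100 small-random-read IOPS by this model, while the same 40 drives
arranged as 8 vdevs of 5-disk raidz2 deliver roughly 800, which matches
the linear-with-vdevs scaling you saw in your 40-drive tests.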
Post by Paul Kraus
The recommendation is to not go over 8 or so drives per vdev, but that is
a performance issue NOT a reliability one. I have also not been able to
duplicate others observations that 2^N drives per vdev is a magic number (4,
8, 16, etc). As you can see from the above, even a 40 drive vdev works and is
reliable, just (relatively) slow :-)
Again, the "performance issue" you describe above is for the random-read
case, not sequential. If you rarely experience small-random-read workloads,
then raidz* will perform just fine. We often see 2000 MBytes/sec sequential
read (and write) performance on a raidz3 pool consisting of three 12-disk vdevs
(using 2TB drives).

However, when a disk fails and must be resilvered, that's when you will
run into the slow performance of the small, random read workload. This
is why I use raidz2 or raidz3 on vdevs consisting of more than 6-7 drives,
especially of the 1TB+ size. That way if it takes 200 hours to resilver,
you've still got a lot of redundancy in place.

Regards,

Marion
Jim Klimov
2012-03-21 18:26:52 UTC
Post by Marion Hakanson
Small, random read performance does not scale with the number of drives in each
raidz[123] vdev because of the dynamic striping. In order to read a single
logical block, ZFS has to read all the segments of that logical block, which
have been spread out across multiple drives, in order to validate the checksum
before returning that logical block to the application. This is why a single
vdev's random-read performance is equivalent to the random-read performance of
a single drive.
True, but if the stars align so nicely that all the sectors
related to the block are read simultaneously in parallel
from several drives of the top-level vdev, so there is no
(substantial) *latency* incurred by waiting between the first
and last drives to complete the read request, then the
*aggregate bandwidth* of the array is (should be) similar
to performance (bandwidth) of a stripe.

This gain would probably be hidden by caches and averages,
unless the stars align so nicely for many blocks in a row,
such as a sequential uninterrupted read of a file written
out sequentially - so that component drives would stream
it off the platter track by track in a row... Ah, what a
wonderful world that would be! ;)

Also, after the sector is read by the disk and passed to
the OS, it is supposedly cached until all sectors of the
block arrive into the cache and the checksum matches.
During this time the HDD is available to do other queued
mechanical tasks. I am not sure which cache that might be:
too early for ARC - no block yet, and the vdev-caches now
drop non-metadata sectors. Perhaps it is just a variable
buffer space in the instance of the reading routine which
tries to gather all pieces of the block together and pass
it to the reader (and into ARC)...

//Jim
Richard Elling
2012-03-21 18:53:29 UTC
comments below...
Post by Marion Hakanson
Without knowing the I/O pattern, saying 500 MB/sec. is meaningless.
Achieving 500MB/sec. with 8KB files and lots of random accesses is really
hard, even with 20 HDDs. Achieving 500MB/sec. of sequential streaming of
100MB+ files is much easier.
. . .
For ZFS, performance is proportional to the number of vdevs NOT the
number of drives or the number of drives per vdev. See
https://docs.google.com/spreadsheet/ccc?key=0AtReWsGW-SB1dFB1cmw0QWNNd0RkR1ZnN0JEb2RsLXc
for some testing I did a while back. I did not test sequential read as
that is not part of our workload.
Actually, few people have sequential workloads. Many think they do, but I say
prove it with iopattern.
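(iopattern is one of the scripts in the DTraceToolkit; running something
like "./iopattern 10" while the real workload is active prints the %RAN
vs. %SEQ split per interval.)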
Post by Marion Hakanson
. . .
I understand why the read performance scales with the number of vdevs,
but I have never really understood _why_ it does not also scale with the
number of drives in each vdev. When I did my testing with 40 drives, I
expected similar READ performance regardless of the layout, but that was NOT
the case.
In your first paragraph you make the important point that "performance"
is too ambiguous in this discussion. Yet in the 2nd & 3rd paragraphs above,
you go back to using "performance" in its ambiguous form. I assume that
by "performance" you are mostly focussing on random-read performance....
My experience is that sequential read performance _does_ scale with the number
of drives in each vdev. Both sequential and random write performance also
scales in this manner (note that ZFS tends to save up small, random writes
and flush them out in a sequential batch).
Yes.

I wrote a small, random read performance model that considers the various caches.
It is described here:
http://info.nexenta.com/rs/nexenta/images/tech_brief_nexenta_performance.pdf

The spreadsheet shown in figure 3 is available for the asking (and it works on your
iphone or ipad :-)
Post by Marion Hakanson
Small, random read performance does not scale with the number of drives in each
raidz[123] vdev because of the dynamic striping. In order to read a single
logical block, ZFS has to read all the segments of that logical block, which
have been spread out across multiple drives, in order to validate the checksum
before returning that logical block to the application. This is why a single
vdev's random-read performance is equivalent to the random-read performance of
a single drive.
It is not as bad as that. The actual worst case number for a HDD with zfs_vdev_max_pending
of one is:
average IOPS * ((D+P) / D)
where,
D = number of data disks
P = number of parity disks (1 for raidz, 2 for raidz2, 3 for raidz3)
total disks per set = D + P
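For example, a 6-disk raidz2 set (D = 4, P = 2) of drives that each do
about 100 random read IOPS works out to roughly 100 * (6/4) = 150 IOPS
for the whole set in this worst case: better than the single-drive
equivalence described above, but still nowhere near 6 drives' worth
(the 100 IOPS figure is only illustrative).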

We did many studies that verified this. More recent studies show zfs_vdev_max_pending
has a huge impact on average latency of HDDs, which I also described in my talk at
OpenStorage Summit last fall.
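For anyone who wants to experiment, the tunable can be changed on a live
system with mdb, or set persistently in /etc/system (the value 4 is just
an example):

    # live, takes effect immediately
    echo zfs_vdev_max_pending/W0t4 | mdb -kw
    # persistent, in /etc/system, takes effect at next boot
    set zfs:zfs_vdev_max_pending = 4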
Post by Marion Hakanson
The recommendation is to not go over 8 or so drives per vdev, but that is
a performance issue NOT a reliability one. I have also not been able to
duplicate others observations that 2^N drives per vdev is a magic number (4,
8, 16, etc). As you can see from the above, even a 40 drive vdev works and is
reliable, just (relatively) slow :-)
Paul, I have a considerable amount of data that refutes your findings. Can we agree
that YMMV and varies dramatically, depending on your workload?
Post by Marion Hakanson
Again, the "performance issue" you describe above is for the random-read
case, not sequential. If you rarely experience small-random-read workloads,
then raidz* will perform just fine. We often see 2000 MBytes/sec sequential
read (and write) performance on a raidz3 pool consisting of three 12-disk vdevs
(using 2TB drives).
Yes, this is relatively easy to see. I've seen 6GByes/sec for large configs, but
that begins to push the system boundaries in many ways.
Post by Marion Hakanson
However, when a disk fails and must be resilvered, that's when you will
run into the slow performance of the small, random read workload. This
is why I use raidz2 or raidz3 on vdevs consisting of more than 6-7 drives,
especially of the 1TB+ size. That way if it takes 200 hours to resilver,
you've still got a lot of redundancy in place.
Regards,
Marion
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
--
DTrace Conference, April 3, 2012, http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
***@RichardElling.com
+1-760-896-4422
Jim Klimov
2012-03-22 10:03:30 UTC
2012-03-21 22:53, Richard Elling wrote:
...
Post by Richard Elling
Post by Marion Hakanson
This is why a single
vdev's random-read performance is equivalent to the random-read performance of
a single drive.
It is not as bad as that. The actual worst case number for a HDD with zfs_vdev_max_pending
average IOPS * ((D+P) / D)
where,
D = number of data disks
P = number of parity disks (1 for raidz, 2 for raidz2, 3 for raidz3)
total disks per set = D + P
I wrote in this thread that AFAIK for small blocks (i.e. 1-sector
size worth of data) there would be P+1 sectors used to store the
block, which is an even worse case at least capacity-wise, as well
as impacting fragmentation => seeks, but might occasionally allow
parallel reads of different objects (tasks running on disks not
involved in storage of the one data sector and maybe its parities
when required).

Is there any truth to this picture?

Were there any research or tests regarding storage of many small
files (1-sector sized or close to that) on different vdev layouts?
I believe that such files would use a single-sector-sized set of
indirect blocks (dittoed at least twice), so one single-sector
sized file would use at least 9 sectors in raidz2.

Thanks :)
Post by Richard Elling
We did many studies that verified this. More recent studies show zfs_vdev_max_pending
has a huge impact on average latency of HDDs, which I also described in my talk at
OpenStorage Summit last fall.
What about drives without (a good implementation of) NCQ/TCQ/whatever?
Does ZFS in-kernel caching, queuing and sorting of pending requests
provide a similar service? Is it controllable with the same switch?

Or, alternatively, is it a kernel-only feature which does not depend
on hardware *CQ? Are there any benefits to disks with *CQ then? :)
Richard Elling
2012-03-22 16:52:12 UTC
Post by Jim Klimov
...
Post by Richard Elling
Post by Marion Hakanson
This is why a single
vdev's random-read performance is equivalent to the random-read performance of
a single drive.
It is not as bad as that. The actual worst case number for a HDD with
zfs_vdev_max_pending
average IOPS * ((D+P) / D)
where,
D = number of data disks
P = number of parity disks (1 for raidz, 2 for raidz2, 3 for raidz3)
total disks per set = D + P
I wrote in this thread that AFAIK for small blocks (i.e. 1-sector
size worth of data) there would be P+1 sectors used to store the
block, which is an even worse case at least capacity-wise, as well
as impacting fragmentation => seeks, but might occasionally allow
parallel reads of different objects (tasks running on disks not
involved in storage of the one data sector and maybe its parities
when required).
Is there any truth to this picture?
Yes, but it is a rare case for 512b sectors. It could be more common for 4KB
sector disks when ashift=12. However, in that case the performance increases
to the equivalent of mirroring, so there are some benefits.

FWIW, some people call this "RAID-1E"
Post by Jim Klimov
Were there any research or tests regarding storage of many small
files (1-sector sized or close to that) on different vdev layouts?
It is not a common case, so why bother?
Post by Jim Klimov
I believe that such files would use a single-sector-sized set of
indirect blocks (dittoed at least twice), so one single-sector
sized file would use at least 9 sectors in raidz2.
No. You can't account for the metadata that way. Metadata space is not 1:1 with
data space. Metadata tends to get written in 16KB chunks, compressed.
Post by Jim Klimov
Thanks :)
Post by Richard Elling
We did many studies that verified this. More recent studies show zfs_vdev_max_pending
has a huge impact on average latency of HDDs, which I also described in my talk at
OpenStorage Summit last fall.
What about drives without (a good implementation of) NCQ/TCQ/whatever?
All HDDs I've tested suck. The form of the suckage is that the number of IOPS
stays relatively constant, but the average latency increases dramatically. This
makes sense, due to the way elevator algorithms work.
Post by Jim Klimov
Does ZFS in-kernel caching, queuing and sorting of pending requests
provide a similar service? Is it controllable with the same switch?
There are many caches at play here, with many tunables. The analysis doesn't
fit in an email.
Post by Jim Klimov
Or, alternatively, is it a kernel-only feature which does not depend
on hardware *CQ? Are there any benefits to disks with *CQ then? :)
Yes, SSDs with NCQ work very well.
-- richard

--
DTrace Conference, April 3, 2012, http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
***@RichardElling.com
+1-760-896-4422
Jim Klimov
2012-03-22 18:01:31 UTC
Post by Richard Elling
Yes, but it is a rare case for 512b sectors.
It could be more common for 4KB sector disks when ashift=12.
...
Post by Richard Elling
Post by Jim Klimov
Were there any research or tests regarding storage of many small
files (1-sector sized or close to that) on different vdev layouts?
It is not a common case, so why bother?
I think that a certain Bob F. would disagree, especially
when larger native sectors and ashift=12 come into play.
Namely, one scenario where this is important is automated
storage of thumbnails for websites, or some similar small
objects in vast amounts.

I agree that hordes of 512b files would be rare; 4kb-sized
files (or a bit larger - 2-3 userdata sectors) are a lot
more probable ;)
Post by Richard Elling
Post by Jim Klimov
I believe that such files would use a single-sector-sized set of
indirect blocks (dittoed at least twice), so one single-sector
sized file would use at least 9 sectors in raidz2.
No. You can't account for the metadata that way. Metadata space is not 1:1 with
data space. Metadata tends to get written in 16KB chunks, compressed.
I deliberately chose the example of single-sector-sized files.
The way I get it (maybe wrong though), the tree of indirect
blocks (dnode?) for a file is stored separately from other
similar objects. While different L0 blkptr_t objects (BPs)
"parented" by the same L1 object are stored as a single
block on disk (128 BPs of 128 bytes each = 16 KB), with raidz redundancy
and ditto copies applied on top, I believe that L0 BPs from
different files are stored in separate blocks - as well
as L0 BPs parented by different L1 BPs from different
byterange stretches of the same file. Likewise for other
layers of L(N+1) pointers if the file is sufficiently
large (in amount of blocks used to write it).

The BP tree for a file is itself an object for a ZFS dataset,
individually referenced (as inode number) and there's a
pointer to its root from the DMU dnode of the dataset.

If the above rant is true, then the single-block file should
have a single L0 blkptr serving as its whole indirect tree
of block pointers, and that L0 would be stored in a dedicated
block (not shared with other files' BPs), inflated by ditto
copies=2 and raidz/mirror redundancy.

Right/wrong?
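(One way to check, I suppose, would be to write a one-sector file on a
scratch raidz2 pool and dump its on-disk tree with something like
"zdb -ddddd pool/fs <object-number>", the object number being what
"ls -i" reports for the file, and count what actually got allocated.
I have not gotten around to that yet.)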

Thanks,
//Jim
Bob Friesenhahn
2012-03-22 20:33:45 UTC
Post by Jim Klimov
I think that a certain Bob F. would disagree, especially
when larger native sectors and ashift=12 come into play.
Namely, one scenario where this is important is automated
storage of thumbnails for websites, or some similar small
objects in vast amounts.
I don't know about that Bob F. but this Bob F. just took a look and
noticed that thumbnail files for full-color images are typically 4KB
or a bit larger. Low-color thumbnails can be much smaller.

For a very large photo site, it would make sense to replicate just the
thumbnails across a number of front-end servers and put the larger
files on fewer storage servers because they are requested much less
often and stream out better. This would mean that those front-end
"thumbnail" servers would primarily contain small files.

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Jeff Bacon
2012-03-24 23:33:17 UTC
I have been running ZFS in a mission critical application since
zpool version 10 and have not seen any issues with some of the vdevs
in a zpool full while others are virtually empty. We have been running
commercial Solaris 10 releases. The configuration was that each
Thanks for sharing some real-life data from larger deployments,
as you often did. That's something I don't often have access
to nowadays, with a liberty to tell :)
Here's another datapoint, then:

I'm using sol10u9 and u10 on a number of supermicro boxes,
mostly X8DTH boards with LSI 9211/9208 controllers and E5600 CPUs.
Application is NFS file service to a bunch of clients, and
we also have an in-house database application written in Java
which implements a column-oriented db in files. Just about all
of it is raidz2, much of it running gzip-compressed.

Since I can't find anything saying not to other than some common
wisdom about not putting your eggs all in one basket that I'm
choosing to reject in some cases, I just keep adding vdevs to
the pool. started with 2TB barracudas for dev/test/archive
usage and constellations for prod, now 3TB drives, have just
added some of the new Pipeline drives with nothing particularly
of interest to note therefrom.

You can create a startlingly large pool this way:

ny-fs7(68)% zpool list
NAME   SIZE  ALLOC   FREE  CAP  HEALTH  ALTROOT
srv    177T   114T  63.3T  64%  ONLINE  -

most pools are smaller. this is an archive box that's also
the guinea pig, 12 vdevs of 7 drives raidz2. the largest prod
one is 130TB in 11 vdevs of 8 drives raidz2. I won't guess
at the mix of 2TB and 3TB. these are both sol10u9.

Another box has 150TB in 6 pools, raidz2/gzip using 2TB
constellations, dual X5690s with 144GB RAM running 20-30
Java db workers. We do manage to break this box on the
odd occasion - there's a race condition in the ZIO code
where a buffer can be freed while the block buffer is in
the process of being "loaned" out to the compression code.
However, it takes 600 zpool threads plus another 600-900
java threads running at the same time with a backlog of
80000 ZIOs in queue, so it's not the sort of thing that
anyone's likely to run across much. :) It's fixed
in sol11, I understand; however, our intended fix is
to split the whole thing so that the workload (which
for various reasons needs to be on one box) is moved
to a 4-socket Westmere, and all of the data pools
are served via NFS from other boxes.

I did lose some data once, long ago, using LSI 1068-based
controllers on older kit, but pretty much I can attribute
that to something between me-being-stupid and the 1068s
really not being especially friendly towards the LSI
expander chips in the older 3Gb/s SMC backplanes when used
for SATA-over-SAS tunneling. The current arrangements
are pretty solid otherwise.

The SATA-based boxes can be a little cranky when a drive
toasts, of course - they sit and hang for a while until they
finally decide to offline the drive. We take that as par
for the course; for the application in question (basically,
storing huge amounts of data on the odd occasion that someone
has a need for it), it's not exactly a showstopper.


I am curious as to whether there is any practical upper-limit
on the number of vdevs, or how far one might push this kind of
configuration in terms of pool size - assuming a sufficient
quantity of RAM, of course.... I'm sure I will need to
split this up someday but for the application there's just
something hideously convenient about leaving it all in one
filesystem in one pool.


-bacon
Richard Elling
2012-03-25 00:06:15 UTC
Thanks for sharing, Jeff!
Comments below...
Post by Jeff Bacon
I have been running ZFS in a mission critical application since
zpool version 10 and have not seen any issues with some of the vdevs
in a zpool full while others are virtually empty. We have been running
commercial Solaris 10 releases. The configuration was that each
Thanks for sharing some real-life data from larger deployments,
as you often did. That's something I don't often have access
to nowadays, with a liberty to tell :)
I'm using sol10u9 and u10 on a number of supermicro boxes,
mostly X8DTH boards with LSI 9211/9208 controllers and E5600 CPUs.
Application is NFS file service to a bunch of clients, and
we also have an in-house database application written in Java
which implements a column-oriented db in files. Just about all
of it is raidz2, much of it running gzip-compressed.
Since I can't find anything saying not to other than some common
wisdom about not putting your eggs all in one basket that I'm
choosing to reject in some cases, I just keep adding vdevs to
the pool. started with 2TB barracudas for dev/test/archive
usage and constellations for prod, now 3TB drives, have just
added some of the new Pipeline drives with nothing particularly
of interest to note therefrom.
ny-fs7(68)% zpool list
NAME   SIZE  ALLOC   FREE  CAP  HEALTH  ALTROOT
srv    177T   114T  63.3T  64%  ONLINE  -
most pools are smaller. this is an archive box that's also
the guinea pig, 12 vdevs of 7 drives raidz2. the largest prod
one is 130TB in 11 vdevs of 8 drives raidz2. I won't guess
at the mix of 2TB and 3TB. these are both sol10u9.
Another box has 150TB in 6 pools, raidz2/gzip using 2TB
constellations, dual X5690s with 144GB RAM running 20-30
Java db workers. We do manage to break this box on the
odd occasion - there's a race condition in the ZIO code
where a buffer can be freed while the block buffer is in
the process of being "loaned" out to the compression code.
However, it takes 600 zpool threads plus another 600-900
java threads running at the same time with a backlog of
80000 ZIOs in queue, so it's not the sort of thing that
anyone's likely to run across much. :) It's fixed
in sol11, I understand; however, our intended fix is
to split the whole thing so that the workload (which
for various reasons needs to be on one box) is moved
to a 4-socket Westmere, and all of the data pools
are served via NFS from other boxes.
I did lose some data once, long ago, using LSI 1068-based
controllers on older kit, but pretty much I can attribute
that to something between me-being-stupid and the 1068s
really not being especially friendly towards the LSI
expander chips in the older 3Gb/s SMC backplanes when used
for SATA-over-SAS tunneling. The current arrangements
are pretty solid otherwise.
In general, mixing SATA and SAS directly behind expanders (e.g. without
SAS/SATA interposers) seems to be bad juju that an OS can't fix.
Post by Jeff Bacon
The SATA-based boxes can be a little cranky when a drive
toasts, of course - they sit and hang for a while until they
finally decide to offline the drive. We take that as par
for the course; for the application in question (basically,
storing huge amounts of data on the odd occasion that someone
has a need for it), it's not exactly a showstopper.
I am curious as to whether there is any practical upper-limit
on the number of vdevs, or how far one might push this kind of
configuration in terms of pool size - assuming a sufficient
quantity of RAM, of course.... I'm sure I will need to
split this up someday but for the application there's just
something hideously convenient about leaving it all in one
filesystem in one pool.
I've run pools with > 100 top-level vdevs. It is not uncommon to see
40+ top-level vdevs.
-- richard

--
DTrace Conference, April 3, 2012, http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
***@RichardElling.com
+1-760-896-4422
Jeff Bacon
2012-03-25 13:26:14 UTC
Post by Richard Elling
In general, mixing SATA and SAS directly behind expanders (e.g. without
SAS/SATA interposers) seems to be bad juju that an OS can't fix.
In general I'd agree. Just mixing them on the same box can be problematic,
I've noticed - though I think as much as anything that the firmware
on the 3G/s expanders just isn't as well-tuned as the firmware
on the 6G/s expanders, and for all I know there's a firmware update
that will make things better.

SSDs seem to be an exception, however. Several boxes have a mix of
Crucial C300, OCZ Vertex Pro, and OCZ Vertex-3 SSDs for the usual
purposes on the expander with the constellations, or in one case,
Cheetah 15ks. One box has SSDs and Cheetah 15ks/constellations on the same
expander under massive loads - the aforementioned box suffering from
80k ZIO queues - with nary a blip. (The SSDs are swap drives, and
we were force-swapping processes out to disk as part of task management.
Meanwhile, the Java processes are doing batch import processing using
the Cheetahs as staging area, so those two expanders are under constant
heavy load. Yes that is as ugly as it sounds, don't ask, and don't do
this yourself. This is what happens when you develop a database without
clear specs and have to just throw hardware underneath it guessing
all the way. But to give you an idea of the load they were/are under.)

The SSDs were chosen with an eye towards expander-friendliness, and tested
relatively extensively before use. YMMV of course, and this is not the place
to skimp by going with a-data or Kingston; buy what Anand says to buy and you
seem to do very well.

I would say, never do it on LSI 3G/s expanders. Be careful with using
SATA spindles. Test the hell out of any SSD you use first. But you seem
to be able to get away with the better consumer-class SATA SSDs.

(I realize that many here would say that if you are going to use
SSD in an enterprise config, you shouldn't be messing with anything
short of Deneva or the SAS-based SSDs. I'd say there are simply
a bunch of caveats with the consumer MLC SSDs in such situations
to consider and if you are very clear about them up front, then
they can be just fine.

I suspect the real difficulty in these situations is in having
a management chain that is capable of both grokking the caveats up
front and remembering that they agreed to them when something
does go wrong. :) As in this case I am the management chain,
it's not an issue. This is of course not the usual case.)

-bacon
Richard Elling
2012-03-25 16:55:22 UTC
Post by Jeff Bacon
Post by Richard Elling
In general, mixing SATA and SAS directly behind expanders (e.g. without
SAS/SATA interposers) seems to be bad juju that an OS can't fix.
In general I'd agree. Just mixing them on the same box can be problematic,
I've noticed - though I think as much as anything that the firmware
on the 3G/s expanders just isn't as well-tuned as the firmware
on the 6G/s expanders, and for all I know there's a firmware update
that will make things better.
I haven't noticed a big difference in the expanders, does anyone else see
issues with 6G expanders?
Post by Jeff Bacon
SSDs seem to be an exception, however. Several boxes have a mix of
Crucial C300, OCZ Vertex Pro, and OCZ Vertex-3 SSDs for the usual
purposes on the expander with the constellations, or in one case,
Cheetah 15ks. One box has SSDs and Cheetah 15ks/constellations on the same
expander under massive loads - the aforementioned box suffering from
80k ZIO queues - with nary a blip. (The SSDs are swap drives, and
we were force-swapping processes out to disk as part of task management.
Meanwhile, the Java processes are doing batch import processing using
the Cheetahs as staging area, so those two expanders are under constant
heavy load. Yes that is as ugly as it sounds, don't ask, and don't do
this yourself. This is what happens when you develop a database without
clear specs and have to just throw hardware underneath it guessing
all the way. But to give you an idea of the load they were/are under.)
Sometime over beers we can trade war stories... many beers... :-)
Post by Jeff Bacon
The SSDs were chosen with an eye towards expander-friendliness, and tested
relatively extensively before use. YMMV of course, and this is not the place
to skimp by going with a-data or Kingston; buy what Anand says to buy and you
seem to do very well.
Yes. Be aware that companies like Kingston rebadge drives from other,
reputable suppliers. And some reputable suppliers have less-than-perfect
models.
Post by Jeff Bacon
I would say, never do it on LSI 3G/s expanders. Be careful with using
SATA spindles. Test the hell out of any SSD you use first. But you seem
to be able to get away with the better consumer-class SATA SSDs.
(I realize that many here would say that if you are going to use
SSD in an enterprise config, you shouldn't be messing with anything
short of Deneva or the SAS-based SSDs. I'd say there are simply
a bunch of caveats with the consumer MLC SSDs in such situations
to consider and if you are very clear about them up front, then
they can be just fine.
I suspect the real difficulty in these situations is in having
a management chain that is capable of both grokking the caveats up
front and remembering that they agreed to them when something
does go wrong. :) As in this case I am the management chain,
it's not an issue. This is of course not the usual case.)
We'd like to think that given the correct information, reasonable people will
make the best choice. And then there are PHBs.
-- richard

--
DTrace Conference, April 3, 2012, http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
***@RichardElling.com
+1-760-896-4422
