Discussion: # devices in raidz.
ozan s. yigit
2006-11-03 14:57:00 UTC
for s10u2, documentation recommends 3 to 9 devices in raidz. what is the
basis for this recommendation? i assume it is performance and not failure
resilience, but i am just guessing... [i know, recommendation was intended
for people who know their raid cold, so it needed no further explanation]

thanks... oz
--
ozan s. yigit | ***@somanetworks.com | 416 977 1414 x 1540
I have a hard time getting enough time to do even trivial
blogging: being truly thoughtful takes a lot of time. -- james gosling
Robert Milkowski
2006-11-03 15:02:58 UTC
Hello ozan,

Friday, November 3, 2006, 3:57:00 PM, you wrote:

osy> for s10u2, documentation recommends 3 to 9 devices in raidz. what is the
osy> basis for this recommendation? i assume it is performance and not failure
osy> resilience, but i am just guessing... [i know, recommendation was intended
osy> for people who know their raid cold, so it needed no further explanation]

Performance reason for random reads.

ps. however, the bigger the raid-z group, the riskier it could be - but
this is obvious.
--
Best regards,
Robert mailto:***@task.gda.pl
http://milek.blogspot.com
Richard Elling - PAE
2006-11-03 17:44:17 UTC
Post by ozan s. yigit
for s10u2, documentation recommends 3 to 9 devices in raidz. what is the
basis for this recommendation? i assume it is performance and not failure
resilience, but i am just guessing... [i know, recommendation was intended
for people who know their raid cold, so it needed no further explanation]
Both actually.
The small, random read performance will approximate that of a single disk.
The probability of data loss increases as you add disks to a RAID-5/6/Z/Z2
volume.

For example, suppose you have 12 disks and insist on RAID-Z.
Given
1. small, random read iops for a single disk is 141 (eg. 2.5" SAS
10k rpm drive)
2. MTBF = 1.4M hours (0.63% AFR) (so says the disk vendor)
3. no spares
4. service time = 24 hours, resync rate 100 GBytes/hr, 50% space
utilization
5. infinite service life

Scenario 1: 12-way RAID-Z
performance = 141 iops
MTTDL[1] = 68,530 years
space = 11 * disk size

Scenario 2: 2x 6-way RAID-Z+0
performance = 282 iops
MTTDL[1] = 150,767 years
space = 10 * disk size

[1] Using MTTDL = MTBF^2 / (N * (N-1) * MTTR)
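
To make the arithmetic easy to check, here is a rough sketch of the model
in Python. The drive size is not stated above, so 146 GBytes (a common
2.5" 10k rpm SAS size) is assumed; with that, MTTR works out to about
24.7 hours and the two MTTDL figures come out within a fraction of a
percent of the numbers quoted.

  HOURS_PER_YEAR = 8760
  mtbf = 1.4e6                        # hours, per the vendor MTBF above
  disk_gb = 146.0                     # assumed drive size (not stated above)
  resync_h = disk_gb * 0.50 / 100.0   # 50% space used, resynced at 100 GBytes/hr
  mttr = 24.0 + resync_h              # 24 h service time + resync time

  def mttdl_years(n, groups=1):
      # MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR) per single-parity group;
      # with several independent groups, divide by the number of groups.
      per_group = mtbf ** 2 / (n * (n - 1) * mttr)
      return per_group / groups / HOURS_PER_YEAR

  print(round(mttdl_years(12)))       # scenario 1: ~68,500 years
  print(round(mttdl_years(6, 2)))     # scenario 2: ~150,800 years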

-- richard
Al Hopper
2006-11-03 23:08:04 UTC
Post by Richard Elling - PAE
Post by ozan s. yigit
for s10u2, documentation recommends 3 to 9 devices in raidz. what is the
basis for this recommendation? i assume it is performance and not failure
resilience, but i am just guessing... [i know, recommendation was intended
for people who know their raid cold, so it needed no further explanation]
Both actually.
The small, random read performance will approximate that of a single disk.
The probability of data loss increases as you add disks to a RAID-5/6/Z/Z2
volume.
For example, suppose you have 12 disks and insist on RAID-Z.
Given
1. small, random read iops for a single disk is 141 (eg. 2.5" SAS
10k rpm drive)
2. MTBF = 1.4M hours (0.63% AFR) (so says the disk vendor)
3. no spares
4. service time = 24 hours, resync rate 100 GBytes/hr, 50% space
utilization
5. infinite service life
Scenario 1: 12-way RAID-Z
performance = 141 iops
MTTDL[1] = 68,530 years
space = 11 * disk size
Scenario 2: 2x 6-way RAID-Z+0
performance = 282 iops
MTTDL[1] = 150,767 years
space = 10 * disk size
[1] Using MTTDL = MTBF^2 / (N * (N-1) * MTTR)
But ... I'm not sure I buy into your numbers given the probability that
more than one disk will fail inside the service window - given that the
disks are identical? Or ... a disk failure occurs at 5:01 PM (quitting
time) on a Friday and won't be replaced until 8:00AM on Monday morning.
Does the failure data you have access to support my hypothesis that
failures of identical mechanical systems tend to occur in small clusters
within a relatively small window of time?

Call me paranoid, but I'd prefer to see a product like thumper configured
with 50% of the disks manufactured by vendor A and the other 50%
manufactured by someone else.

This paranoia is based on a personal experience, many years ago (before we
had smart fans etc), where we had a rack full of expensive custom
equipment cooled by (what we thought was) a highly redundant group of 5
fans. One fan suffered infant mortality and its failure went unnoticed,
leaving 4 fans running. Two of the fans died on the same extended weekend
(public holiday). It was an expensive and embarrassing disaster.

Regards,

Al Hopper Logical Approach Inc, Plano, TX. ***@logical-approach.com
Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006
Richard Elling - PAE
2006-11-03 23:46:05 UTC
Post by Al Hopper
Post by Richard Elling - PAE
[1] Using MTTDL = MTBF^2 / (N * (N-1) * MTTR)
But ... I'm not sure I buy into your numbers given the probability that
more than one disk will fail inside the service window - given that the
disks are identical? Or ... a disk failure occurs at 5:01 PM (quitting
time) on a Friday and won't be replaced until 8:00AM on Monday morning.
Does the failure data you have access to support my hypothesis that
failures of identical mechanical systems tend to occur in small clusters
within a relatively small window of time?
Separating the right-hand side:
MTTDL = (MTBF/N) * (MTBF/((N-1) * MTTR))

The right-most factor is the inverse of the probability that one of the
remaining N-1 disks fails during the recovery window for the first disk's
failure. As the MTTR increases, the probability of a 2nd disk failure also
increases.
RAIDoptimizer calculates the MTTR as:
MTTR = service response time + resync time
where
resync time = size * space used (%) / resync rate

Incidentally, since ZFS schedules the resync iops itself, then it can
really move along on a mostly idle system. You should be able to resync
at near the media speed for an idle system. By contrast, a hardware
RAID array has no knowledge of the context of the data or the I/O scheduling,
so they will perform resyncs using a throttle. Not only do they end up
resyncing unused space, but they also take a long time (4-18 GBytes/hr for
some arrays) and thus expose you to a higher probability of second disk
failure.
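
To put numbers on that, here is a rough sketch (Python) of how the resync
rate feeds into the chance of a second failure. The 500 GByte drive and
50% utilization are illustrative assumptions, and the 24 hour service
time is carried over from the earlier example; none of these are
measurements:

  mtbf = 1.4e6            # hours
  n = 12                  # disks in the group
  size_gb = 500.0         # illustrative drive size
  used = 0.50             # space utilization

  def p_second_failure(resync_rate_gb_per_h, resync_whole_disk=False):
      # MTTR = service response time + resync time (as above); a dumb
      # array resyncs the whole disk, ZFS only the space in use.
      data_gb = size_gb if resync_whole_disk else size_gb * used
      mttr = 24.0 + data_gb / resync_rate_gb_per_h
      # chance that one of the remaining N-1 disks fails during the MTTR
      return (n - 1) * mttr / mtbf

  print(p_second_failure(100))                        # ~2.1e-4 (100 GBytes/hr)
  print(p_second_failure(4, resync_whole_disk=True))  # ~1.2e-3 (4 GBytes/hr)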
Post by Al Hopper
Call me paranoid, but I'd prefer to see a product like thumper configured
with 50% of the disks manufactured by vendor A and the other 50%
manufactured by someone else.
Diversity is usually a good thing. Unfortunately, this is often impractical
for a manufacturer.
Post by Al Hopper
This paranoia is based on a personal experience, many years ago (before we
had smart fans etc), where we had a rack full of expensive custom
equipment cooled by (what we thought was) a highly redundant group of 5
fans. One fan suffered infant mortality and its failure went unnoticed,
leaving 4 fans running. Two of the fans died on the same extended weekend
(public holiday). It was an expensive and embarrassing disaster.
Modelling such as this assumes independence of failures. Common cause or
bad lots are not that hard to model, but you may never find any failure rate
data for them. You can look at the MTBF sensitivities, though that is an
opening to another set of results. I prefer to ignore the absolute values
and judge competing designs by their relative results. To wit, I fully
expect to be beyond dust in 150,767 years, and the expected lifetime of
most disks is 5 years. But given two competing designs using the same
model, a design predicting an MTTDL of 150,767 years will very likely demonstrate
better MTTDL than a design predicting 68,530 years.
-- richard
Torrey McMahon
2006-11-06 21:23:46 UTC
Post by Richard Elling - PAE
Incidentally, since ZFS schedules the resync iops itself, then it can
really move along on a mostly idle system. You should be able to resync
at near the media speed for an idle system. By contrast, a hardware
RAID array has no knowledge of the context of the data or the I/O scheduling,
so they will perform resyncs using a throttle. Not only do they end up
resyncing unused space, but they also take a long time (4-18 GBytes/hr for
some arrays) and thus expose you to a higher probability of second disk
failure.
Just as another data point: it is true that the array doesn't know the
context of the data or the I/O scheduling, but some arrays do watch the
incoming data rate and throttle accordingly. (The T3 used to, for example.)
Robert Milkowski
2006-11-07 12:32:13 UTC
Hello Richard,

Saturday, November 4, 2006, 12:46:05 AM, you wrote:


REP> Incidentally, since ZFS schedules the resync iops itself, then it can
REP> really move along on a mostly idle system. You should be able to resync
REP> at near the media speed for an idle system. By contrast, a hardware
REP> RAID array has no knowledge of the context of the data or the I/O scheduling,
REP> so they will perform resyncs using a throttle. Not only do they end up
REP> resyncing unused space, but they also take a long time (4-18 GBytes/hr for
REP> some arrays) and thus expose you to a higher probability of second disk
REP> failure.


However some mechanism to slow or freeze scrub/resilvering would be
useful. Especially in cases where server does many other things and
not only file serving - and scrub/resilver can take much CPU power on
slower servers.

Something like 'zpool scrub -r 10 pool' - which would mean 10% of
speed.
--
Best regards,
Robert mailto:***@task.gda.pl
http://milek.blogspot.com
Richard Elling - PAE
2006-11-07 16:19:07 UTC
Post by Robert Milkowski
REP> Incidentally, since ZFS schedules the resync iops itself, then it can
REP> really move along on a mostly idle system. You should be able to resync
REP> at near the media speed for an idle system. By contrast, a hardware
REP> RAID array has no knowledge of the context of the data or the I/O scheduling,
REP> so they will perform resyncs using a throttle. Not only do they end up
REP> resyncing unused space, but they also take a long time (4-18 GBytes/hr for
REP> some arrays) and thus expose you to a higher probability of second disk
REP> failure.
However some mechanism to slow or freeze scrub/resilvering would be
useful. Especially in cases where server does many other things and
not only file serving - and scrub/resilver can take much CPU power on
slower servers.
Something like 'zpool scrub -r 10 pool' - which would mean 10% of
speed.
I think this has some merit for scrubs, but I wouldn't suggest it for resilver.
If your data is at risk, there is nothing more important than protecting it.
While that sounds harsh, in reality there is a practical limit determined by
the ability of a single LUN to absorb a (large, sequential?) write workload.
For JBODs, that would be approximately the media speed.

The big question, though, is "10% of what?" User CPU? iops?
-- richard
Daniel Rock
2006-11-07 16:47:53 UTC
Post by Richard Elling - PAE
The big question, though, is "10% of what?" User CPU? iops?
Maybe something like the "slow" parameter of VxVM?

slow[=iodelay]
Reduces the system performance impact of copy
operations. Such operations are usually per-
formed on small regions of the volume (nor-
mally from 16 kilobytes to 128 kilobytes).
This option inserts a delay between the
recovery of each such region. A specific
delay can be specified with iodelay as a
number of milliseconds; otherwise, a default
is chosen (normally 250 milliseconds).



Daniel
Richard Elling - PAE
2006-11-07 17:12:44 UTC
Post by Daniel Rock
Post by Richard Elling - PAE
The big question, though, is "10% of what?" User CPU? iops?
Maybe something like the "slow" parameter of VxVM?
slow[=iodelay]
Reduces the system performance impact of copy
operations. Such operations are usually per-
formed on small regions of the volume (nor-
mally from 16 kilobytes to 128 kilobytes).
This option inserts a delay between the
recovery of each such region. A specific
delay can be specified with iodelay as a
number of milliseconds; otherwise, a default
is chosen (normally 250 milliseconds).
For modern machines, which *should* be the design point, the channel
bandwidth is underutilized, so why not use it?

NB. At 4 128kByte iops per second, it would take 11 days and 8 hours
to resilver a single 500 GByte drive -- feeling lucky? In the bad old
days when disks were small, and the systems were slow, this made some
sense. The better approach is for the file system to do what it needs
to do as efficiently as possible, which is the current state of ZFS.
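
The arithmetic behind that figure, as a quick sketch (ignoring the time
spent actually copying each region, and taking 128 kBytes as decimal):

  bytes_per_region = 128e3        # 128 kBytes copied per region
  regions_per_sec = 1 / 0.250     # one region per 250 ms default delay
  drive_bytes = 500e9             # 500 GByte drive

  days = drive_bytes / (bytes_per_region * regions_per_sec) / 86400
  print(days)                     # ~11.3 days -- roughly the 11 days 8 hours above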
-- richard
Torrey McMahon
2006-11-07 17:49:37 UTC
Post by Richard Elling - PAE
The better approach is for the file system to do what it needs
to do as efficiently as possible, which is the current state of ZFS.
This implies that the filesystem has exclusive use of the channel - SAN
or otherwise - as well as the storage array front end controllers,
cache, and the raid groups that may be behind it. What we really need in
this case, and a few others, is the filesystem and backend storage
working together... but I'll save that rant for another day. ;)
Daniel Rock
2006-11-07 19:36:06 UTC
Post by Richard Elling - PAE
For modern machines, which *should* be the design point, the channel
bandwidth is underutilized, so why not use it?
And what about encrypted disks? Simply create a zpool with checksum=sha256,
fill it up, then scrub. I'd be happy if I could use my machine during
scrubbing. A throttling of scrubbing would help. Maybe also running the
scrubbing with a "high nice level" in kernel.
Post by Richard Elling - PAE
NB. At 4 128kByte iops per second, it would take 11 days and 8 hours
to resilver a single 500 GByte drive -- feeling lucky?
250ms is the Veritas default. It doesn't have to be the ZFS default also.


Daniel
Victor Latushkin
2006-11-13 19:30:51 UTC
Post by Richard Elling - PAE
Post by Daniel Rock
Maybe something like the "slow" parameter of VxVM?
slow[=iodelay]
Reduces the system performance impact of copy
operations. Such operations are usually per-
formed on small regions of the volume (nor-
mally from 16 kilobytes to 128 kilobytes).
This option inserts a delay between the
recovery of each such region. A specific
delay can be specified with iodelay as a
number of milliseconds; otherwise, a default
is chosen (normally 250 milliseconds).
For modern machines, which *should* be the design point, the channel
bandwidth is underutilized, so why not use it?
NB. At 4 128kByte iops per second, it would take 11 days and 8 hours
to resilver a single 500 GByte drive -- feeling lucky? In the bad old
days when disks were small, and the systems were slow, this made some
sense. The better approach is for the file system to do what it needs
to do as efficiently as possible, which is the current state of ZFS.
Well, we are trying to balance the impact of resilvering on running
applications against the speed of resilvering.

I think that having an option to tell the filesystem to postpone
full-throttle resilvering until some quieter period of time may help.
This may be combined with some throttling mechanism, so that during a
quiet period resilvering runs at full speed, and during a busy period it
continues at reduced speed. Such an arrangement may be useful for
customers with, e.g., well-defined SLAs.
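
Purely to illustrate the kind of policy being described -- nothing like
this exists in ZFS today, and the quiet window, IOPS threshold and rates
below are made-up numbers -- a resilver throttle could pick its rate from
the time of day and the observed foreground load:

  import time

  def resilver_rate(foreground_iops):
      # Fraction of full resilver speed to use right now.  The quiet
      # window (22:00-07:00) and the 100 iops idle threshold are
      # illustrative assumptions only.
      hour = time.localtime().tm_hour
      quiet = hour >= 22 or hour < 7
      if quiet or foreground_iops < 100:
          return 1.00               # full speed when the box is quiet
      return 0.10                   # reduced speed during busy hours

  # A resilver loop could then sleep between batches in proportion to
  # (1 / rate - 1), so daytime load sees roughly 10% of full speed.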

Wbr,
Victor
Mike Seda
2006-11-13 20:38:26 UTC
Hi All,
From reading the docs, it seems that you can add devices (non-spares)
to a zpool, but you cannot take them away, right?
Best,
Mike
Post by Victor Latushkin
Post by Richard Elling - PAE
Post by Daniel Rock
Maybe something like the "slow" parameter of VxVM?
slow[=iodelay]
Reduces the system performance impact of copy
operations. Such operations are usually per-
formed on small regions of the volume (nor-
mally from 16 kilobytes to 128 kilobytes).
This option inserts a delay between the
recovery of each such region. A specific
delay can be specified with iodelay as a
number of milliseconds; otherwise, a default
is chosen (normally 250 milliseconds).
For modern machines, which *should* be the design point, the channel
bandwidth is underutilized, so why not use it?
NB. At 4 128kByte iops per second, it would take 11 days and 8 hours
to resilver a single 500 GByte drive -- feeling lucky? In the bad old
days when disks were small, and the systems were slow, this made some
sense. The better approach is for the file system to do what it needs
to do as efficiently as possible, which is the current state of ZFS.
Well, we are trying to balance impact of resilvering on running
applications with a speed of resilvering.
I think that having an option to tell filesystem to postpone
full-throttle resilvering till some quieter period of time may help.
This may be combined with some throttling mechanism so during quieter
period resilvering is done with full speed, and during busy period it
may continue with reduced speed. Such arrangement may be useful for
customers with e.g. well-defined SLAs.
Wbr,
Victor
Cindy Swearingen
2006-11-13 23:35:06 UTC
Hi Mike,

Yes, outside of the hot-spares feature, you can detach, offline, and
replace existing devices in a pool, but you can't remove devices, yet.

This feature work is being tracked under this RFE:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4852783

Cindy
Post by Mike Seda
Hi All,
From reading the docs, it seems that you can add devices (non-spares)
to a zpool, but you cannot take them away, right?
Best,
Mike
Post by Victor Latushkin
Post by Richard Elling - PAE
Post by Daniel Rock
Maybe something like the "slow" parameter of VxVM?
slow[=iodelay]
Reduces the system performance impact of copy
operations. Such operations are usually per-
formed on small regions of the volume (nor-
mally from 16 kilobytes to 128 kilobytes).
This option inserts a delay between the
recovery of each such region. A specific
delay can be specified with iodelay as a
number of milliseconds; otherwise, a default
is chosen (normally 250 milliseconds).
For modern machines, which *should* be the design point, the channel
bandwidth is underutilized, so why not use it?
NB. At 4 128kByte iops per second, it would take 11 days and 8 hours
to resilver a single 500 GByte drive -- feeling lucky? In the bad old
days when disks were small, and the systems were slow, this made some
sense. The better approach is for the file system to do what it needs
to do as efficiently as possible, which is the current state of ZFS.
Well, we are trying to balance impact of resilvering on running
applications with a speed of resilvering.
I think that having an option to tell filesystem to postpone
full-throttle resilvering till some quieter period of time may help.
This may be combined with some throttling mechanism so during quieter
period resilvering is done with full speed, and during busy period it
may continue with reduced speed. Such arrangement may be useful for
customers with e.g. well-defined SLAs.
Wbr,
Victor
Mike Seda
2007-04-10 19:27:06 UTC
I noticed that there is still an open bug regarding removing devices
from a zpool:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4852783
Does anyone know if or when this feature will be implemented?
Post by Cindy Swearingen
Hi Mike,
Yes, outside of the hot-spares feature, you can detach, offline, and
replace existing devices in a pool, but you can't remove devices, yet.
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4852783
Cindy
Post by Mike Seda
Hi All,
From reading the docs, it seems that you can add devices
(non-spares) to a zpool, but you cannot take them away, right?
Best,
Mike
Post by Victor Latushkin
Post by Richard Elling - PAE
Post by Daniel Rock
Maybe something like the "slow" parameter of VxVM?
slow[=iodelay]
Reduces the system performance impact of copy
operations. Such operations are usually per-
formed on small regions of the volume (nor-
mally from 16 kilobytes to 128 kilobytes).
This option inserts a delay between the
recovery of each such region. A specific
delay can be specified with iodelay as a
number of milliseconds; otherwise, a default
is chosen (normally 250 milliseconds).
For modern machines, which *should* be the design point, the channel
bandwidth is underutilized, so why not use it?
NB. At 4 128kByte iops per second, it would take 11 days and 8 hours
to resilver a single 500 GByte drive -- feeling lucky? In the bad old
days when disks were small, and the systems were slow, this made some
sense. The better approach is for the file system to do what it needs
to do as efficiently as possible, which is the current state of ZFS.
Well, we are trying to balance impact of resilvering on running
applications with a speed of resilvering.
I think that having an option to tell filesystem to postpone
full-throttle resilvering till some quieter period of time may help.
This may be combined with some throttling mechanism so during
quieter period resilvering is done with full speed, and during busy
period it may continue with reduced speed. Such arrangement may be
useful for customers with e.g. well-defined SLAs.
Wbr,
Victor
C***@Sun.COM
2007-04-11 22:30:50 UTC
Mike,

This RFE is still being worked and I have no ETA on completion...

cs
Post by Mike Seda
I noticed that there is still an open bug regarding removing devices
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4852783
Does anyone know if or when this feature will be implemented?
Post by Cindy Swearingen
Hi Mike,
Yes, outside of the hot-spares feature, you can detach, offline, and
replace existing devices in a pool, but you can't remove devices, yet.
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4852783
Cindy
Post by Mike Seda
Hi All,
From reading the docs, it seems that you can add devices
(non-spares) to a zpool, but you cannot take them away, right?
Best,
Mike
Post by Victor Latushkin
Post by Richard Elling - PAE
Post by Daniel Rock
Maybe something like the "slow" parameter of VxVM?
slow[=iodelay]
Reduces the system performance impact of copy
operations. Such operations are usually per-
formed on small regions of the volume (nor-
mally from 16 kilobytes to 128 kilobytes).
This option inserts a delay between the
recovery of each such region. A specific
delay can be specified with iodelay as a
number of milliseconds; otherwise, a default
is chosen (normally 250 milliseconds).
For modern machines, which *should* be the design point, the channel
bandwidth is underutilized, so why not use it?
NB. At 4 128kByte iops per second, it would take 11 days and 8 hours
to resilver a single 500 GByte drive -- feeling lucky? In the bad old
days when disks were small, and the systems were slow, this made some
sense. The better approach is for the file system to do what it needs
to do as efficiently as possible, which is the current state of ZFS.
Well, we are trying to balance impact of resilvering on running
applications with a speed of resilvering.
I think that having an option to tell filesystem to postpone
full-throttle resilvering till some quieter period of time may help.
This may be combined with some throttling mechanism so during
quieter period resilvering is done with full speed, and during busy
period it may continue with reduced speed. Such arrangement may be
useful for customers with e.g. well-defined SLAs.
Wbr,
Victor
Robert Milkowski
2006-11-09 22:50:58 UTC
Hello Richard,
Post by Robert Milkowski
REP> Incidentally, since ZFS schedules the resync iops itself, then it can
REP> really move along on a mostly idle system. You should be able to resync
REP> at near the media speed for an idle system. By contrast, a hardware
REP> RAID array has no knowledge of the context of the data or the I/O scheduling,
REP> so they will perform resyncs using a throttle. Not only do they end up
REP> resyncing unused space, but they also take a long time (4-18 GBytes/hr for
REP> some arrays) and thus expose you to a higher probability of second disk
REP> failure.
However some mechanism to slow or freeze scrub/resilvering would be
useful. Especially in cases where server does many other things and
not only file serving - and scrub/resilver can take much CPU power on
slower servers.
Something like 'zpool scrub -r 10 pool' - which would mean 10% of
speed.
REP> I think this has some merit for scrubs, but I wouldn't suggest it for resilver.
REP> If your data is at risk, there is nothing more important than protecting it.
REP> While that sounds harsh, in reality there is a practical limit determined by
REP> the ability of a single LUN to absorb a (large, sequential?) write workload.
REP> For JBODs, that would be approximately the media speed.

I can't agree. I have some performance sensitive environments and I
know that during the day I do not want to lose performance even if it
means longer resilvering times. That's exactly what I do on HW RAID
controller. In other environments I do want to resilver ASAP, you're
right.

Also scrub can consume all CPU power on smaller and older machines and
that's not always what I would like.


REP> The big question, though, is "10% of what?" User CPU? iops?

Just slow the rate at which resilvering/scrub is done.
Insert some kind of delay, as someone else suggested.
--
Best regards,
Robert mailto:***@task.gda.pl
http://milek.blogspot.com
Al Hopper
2006-11-10 13:21:38 UTC
Post by Robert Milkowski
Hello Richard,
Post by Robert Milkowski
REP> Incidentally, since ZFS schedules the resync iops itself, then it can
REP> really move along on a mostly idle system. You should be able to resync
REP> at near the media speed for an idle system. By contrast, a hardware
REP> RAID array has no knowledge of the context of the data or the I/O scheduling,
REP> so they will perform resyncs using a throttle. Not only do they end up
REP> resyncing unused space, but they also take a long time (4-18 GBytes/hr for
REP> some arrays) and thus expose you to a higher probability of second disk
REP> failure.
However some mechanism to slow or freeze scrub/resilvering would be
useful. Especially in cases where server does many other things and
not only file serving - and scrub/resilver can take much CPU power on
slower servers.
Something like 'zpool scrub -r 10 pool' - which would mean 10% of
speed.
REP> I think this has some merit for scrubs, but I wouldn't suggest it for resilver.
REP> If your data is at risk, there is nothing more important than protecting it.
REP> While that sounds harsh, in reality there is a practical limit determined by
REP> the ability of a single LUN to absorb a (large, sequential?) write workload.
REP> For JBODs, that would be approximately the media speed.
I can't agree. I have some performance sensitive environments and I
know that during the day I do not want to lose performance even if it
means longer resilvering times. That's exactly what I do on HW RAID
controller. In other environments I do want to resilver ASAP, you're
right.
Also scrub can consume all CPU power on smaller and older machines and
that's not always what I would like.
REP> The big question, though, is "10% of what?" User CPU? iops?
Probably N% of I/O Ops/Second would work well.
Post by Robert Milkowski
Just slow the rate at which resilvering/scrub is done.
Insert some kind of delay, as someone else suggested.
Al Hopper Logical Approach Inc, Plano, TX. ***@logical-approach.com
Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006
Robert Milkowski
2006-11-10 14:51:27 UTC
Hello Al,
Post by Robert Milkowski
Hello Richard,
Post by Robert Milkowski
REP> Incidentally, since ZFS schedules the resync iops itself, then it can
REP> really move along on a mostly idle system. You should be able to resync
REP> at near the media speed for an idle system. By contrast, a hardware
REP> RAID array has no knowledge of the context of the data or the I/O scheduling,
REP> so they will perform resyncs using a throttle. Not only do they end up
REP> resyncing unused space, but they also take a long time (4-18 GBytes/hr for
REP> some arrays) and thus expose you to a higher probability of second disk
REP> failure.
However some mechanism to slow or freeze scrub/resilvering would be
useful. Especially in cases where server does many other things and
not only file serving - and scrub/resilver can take much CPU power on
slower servers.
Something like 'zpool scrub -r 10 pool' - which would mean 10% of
speed.
REP> I think this has some merit for scrubs, but I wouldn't suggest it for resilver.
REP> If your data is at risk, there is nothing more important than protecting it.
REP> While that sounds harsh, in reality there is a practical limit determined by
REP> the ability of a single LUN to absorb a (large, sequential?) write workload.
REP> For JBODs, that would be approximately the media speed.
I can't agree. I have some performance sensitive environments and I
know that during the day I do not want to lose performance even if it
means longer resilvering times. That's exactly what I do on HW RAID
controller. In other environments I do want to resilver ASAP, you're
right.
Also scrub can consume all CPU power on smaller and older machines and
that's not always what I would like.
REP> The big question, though, is "10% of what?" User CPU? iops?
AH> Probably N% of I/O Ops/Second would work well.

Or, if 100% means full speed, then 10% means that the expected time should
be approximately 10x longer (instead of 1h, make it 10h).

It would be more intuitive than specifying some numbers like IOPS,
etc.

Additionally, setting it to 0 would mean freeze, and setting it above 0%
would mean continue (continue, not start from the beginning).
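
To make those semantics concrete -- this is only a sketch of the proposal,
not an existing interface -- the percentage could be turned into a delay
inserted after each resilver/scrub batch, where batch_seconds is whatever
one unthrottled batch happens to take:

  def throttle_delay(speed_pct, batch_seconds):
      # Pause so the whole job takes roughly (100 / speed_pct) times
      # longer than at full speed; 0% (freeze) is handled by the caller.
      if speed_pct >= 100:
          return 0.0
      return batch_seconds * (100.0 / speed_pct - 1.0)

  print(throttle_delay(10, 0.5))    # 10% speed: 4.5 s pause per 0.5 s batch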
--
Best regards,
Robert mailto:***@task.gda.pl
http://milek.blogspot.com
Torrey McMahon
2006-11-10 22:31:31 UTC
Post by Robert Milkowski
Post by Robert Milkowski
Also scrub can consume all CPU power on smaller and older machines and
that's not always what I would like.
REP> The big question, though, is "10% of what?" User CPU? iops?
AH> Probably N% of I/O Ops/Second would work well.
Or if 100% means full speed, then 10% means that expected time should
be approximately 10x more (instead 1h make it 10h).
It would be more intuitive than specifying some numbers like IOPS,
etc.
In any case you're still going to have to provide a tunable for this
even if the resulting algorithm works well on the host side. Keep in
mind that a scrub can also impact the array(s) your filesystem lives
on. If all my ZFS systems started scrubbing at full speed - Because they
thought they weren't busy - at the same time it might cause issues with
other I/O on the array itself.
Robert Milkowski
2006-11-11 08:44:48 UTC
Hello Torrey,
Post by Robert Milkowski
Post by Robert Milkowski
Also scrub can consume all CPU power on smaller and older machines and
that's not always what I would like.
REP> The big question, though, is "10% of what?" User CPU? iops?
AH> Probably N% of I/O Ops/Second would work well.
Or if 100% means full speed, then 10% means that expected time should
be approximately 10x more (instead 1h make it 10h).
It would be more intuitive than specifying some numbers like IOPS,
etc.
TM> In any case you're still going to have to provide a tunable for this
TM> even if the resulting algorithm works well on the host side. Keep in
TM> mind that a scrub can also impact the array(s) your filesystem lives
TM> on. If all my ZFS systems started scrubbing at full speed - Because they
TM> thought they weren't busy - at the same time it might cause issues with
TM> other I/O on the array itself.

A tunable in the form of a pool property, with a default of 100%.

On the other hand, maybe the simple algorithm Veritas has used is good
enough - a simple delay between scrubbing/resilvering some data.
--
Best regards,
Robert mailto:***@task.gda.pl
http://milek.blogspot.com
Torrey McMahon
2006-11-13 04:07:02 UTC
Post by Robert Milkowski
Hello Torrey,
Post by Robert Milkowski
Post by Robert Milkowski
Also scrub can consume all CPU power on smaller and older machines and
that's not always what I would like.
REP> The big question, though, is "10% of what?" User CPU? iops?
AH> Probably N% of I/O Ops/Second would work well.
Or if 100% means full speed, then 10% means that expected time should
be approximately 10x more (instead 1h make it 10h).
It would be more intuitive than specifying some numbers like IOPS,
etc.
TM> In any case you're still going to have to provide a tunable for this
TM> even if the resulting algorithm works well on the host side. Keep in
TM> mind that a scrub can also impact the array(s) your filesystem lives
TM> on. If all my ZFS systems started scrubbing at full speed - Because they
TM> thought they weren't busy - at the same time it might cause issues with
TM> other I/O on the array itself.
Tunable in a form of pool property, with default 100%.
On the other hand maybe simple algorithm Veritas has used is good
enough - simple delay between scrubbing/resilvering some data.
I think a not-too-convoluted algorithm as people have suggested would be
ideal and then let people override it as necessary. I would think a 100%
default might be a call generator but I'm up for debate. ("Hey my array
just went crazy. All the lights are blinking but my application isn't
doing any I/O. What gives?")
Robert Milkowski
2006-11-13 07:12:56 UTC
Hello Torrey,
Post by Robert Milkowski
Hello Torrey,
Post by Robert Milkowski
Post by Robert Milkowski
Also scrub can consume all CPU power on smaller and older machines and
that's not always what I would like.
REP> The big question, though, is "10% of what?" User CPU? iops?
AH> Probably N% of I/O Ops/Second would work well.
Or if 100% means full speed, then 10% means that expected time should
be approximately 10x more (instead 1h make it 10h).
It would be more intuitive than specifying some numbers like IOPS,
etc.
TM> In any case you're still going to have to provide a tunable for this
TM> even if the resulting algorithm works well on the host side. Keep in
TM> mind that a scrub can also impact the array(s) your filesystem lives
TM> on. If all my ZFS systems started scrubbing at full speed - Because they
TM> thought they weren't busy - at the same time it might cause issues with
TM> other I/O on the array itself.
Tunable in a form of pool property, with default 100%.
On the other hand maybe simple algorithm Veritas has used is good
enough - simple delay between scrubbing/resilvering some data.
TM> I think a not-too-convoluted algorithm as people have suggested would be
TM> ideal and then let people override it as necessary. I would think a 100%
TM> default might be a call generator but I'm up for debate. ("Hey my array
TM> just went crazy. All the lights are blinking but my application isn't
TM> doing any I/O. What gives?")

You've got the same behavior with any LVM when you replace a disk.
So it's not something unexpected for admins. Also most of the time
they expect LVM to resilver ASAP. With a default setting that is not 100%,
you'll definitely see people complaining that ZFS is slooow, etc.
--
Best regards,
Robert mailto:***@task.gda.pl
http://milek.blogspot.com
Torrey McMahon
2006-11-13 18:08:20 UTC
Howdy Robert.
Post by Robert Milkowski
You've got the same behavior with any LVM when you replace a disk.
So it's not something unexpected for admins. Also most of the time
they expect LVM to resilver ASAP. With default setting not being 100%
you'll definitely see people complaining ZFS is slooow, etc.
It's quite possible that I've only seen the other side of the coin but
in my past I've had support calls where
customers complained that they {replaced a drive, resilvered a mirror,
... } and it knocked down the performance of other things. My fave was a set
of A5200s on a hub: after they cranked the I/O rate up on the mirror,
it caused some other app - methinks it was Oracle - to get too slow, think
there was a disk problem, crash(!), and then initiate a cluster
failover. Given that the disk group was not in perfect health... oh, the fun
we had.

In any case the key is documenting the behavior well enough so people
can see what is going on, how to tune it slower or faster on the fly,
etc. I'm more concerned with that than the actual algorithm or method used.
Richard Elling - PAE
2006-11-14 05:55:18 UTC
Post by Torrey McMahon
Post by Robert Milkowski
Hello Torrey,
Post by Robert Milkowski
Post by Robert Milkowski
Also scrub can consume all CPU power on smaller and older machines and
that's not always what I would like.
REP> The big question, though, is "10% of what?" User CPU? iops?
AH> Probably N% of I/O Ops/Second would work well.
Or if 100% means full speed, then 10% means that expected time should
be approximately 10x more (instead 1h make it 10h).
It would be more intuitive than specifying some numbers like IOPS,
etc.
TM> In any case you're still going to have to provide a tunable for this
TM> even if the resulting algorithm works well on the host side. Keep in
TM> mind that a scrub can also impact the array(s) your filesystem lives
TM> on. If all my ZFS systems started scrubbing at full speed - Because they
TM> thought they weren't busy - at the same time it might cause issues with
TM> other I/O on the array itself.
Tunable in a form of pool property, with default 100%.
On the other hand maybe simple algorithm Veritas has used is good
enough - simple delay between scrubbing/resilvering some data.
I think a not-too-convoluted algorithm as people have suggested would
be ideal and then let people override it as necessary. I would think a
100% default might be a call generator but I'm up for debate. ("Hey my
array just went crazy. All the lights are blinking but my application
isn't doing any I/O. What gives?")
I'll argue that *any* random % is bogus. What you really want to
do is prioritize activity where resources are constrained. From a RAS
perspective, idle systems are the devil's playground :-). ZFS already
does prioritize I/O that it knows about. Prioritizing on CPU might have
some merit, but to integrate into Solaris' resource management system
might bring some added system admin complexity which is unwanted.
-- richard
Torrey McMahon
2006-11-15 01:25:16 UTC
Post by Richard Elling - PAE
Post by Torrey McMahon
Post by Robert Milkowski
Hello Torrey,
[SNIP]
Tunable in a form of pool property, with default 100%.
On the other hand maybe simple algorithm Veritas has used is good
enough - simple delay between scrubbing/resilvering some data.
I think a not-too-convoluted algorithm as people have suggested would
be ideal and then let people override it as necessary. I would think
a 100% default might be a call generator but I'm up for debate. ("Hey
my array just went crazy. All the lights are blinking but my
application isn't doing any I/O. What gives?")
I'll argue that *any* random % is bogus. What you really want to
do is prioritize activity where resources are constrained. From a RAS
perspective, idle systems are the devil's playground :-). ZFS already
does prioritize I/O that it knows about. Prioritizing on CPU might have
some merit, but to integrate into Solaris' resource management system
might bring some added system admin complexity which is unwanted.
I agree, but the problem as I see it is that nothing has an overview of
the entire environment. ZFS knows what I/O is coming in and what it's
sending out, but that's it. Even if we had an easy-to-use resource
management framework across all the Sun applications and devices, we'd
still run into non-Sun bits that place demands on shared components like
networking, SAN, arrays, etc. Anything that can be auto-tuned is great,
but I'm afraid we're still going to need manual tuning in some cases.
Richard Elling - PAE
2006-11-15 21:00:26 UTC
Post by Torrey McMahon
Post by Richard Elling - PAE
Post by Torrey McMahon
Post by Robert Milkowski
Hello Torrey,
[SNIP]
Tunable in a form of pool property, with default 100%.
On the other hand maybe simple algorithm Veritas has used is good
enough - simple delay between scrubbing/resilvering some data.
I think a not-to-convoluted algorithm as people have suggested would
be ideal and then let people override it as necessary. I would think
a 100% default might be a call generator but I'm up for debate.
("Hey my array just went crazy. All the lights are blinking but my
application isn't doing any I/O. What gives?")
I'll argue that *any* random % is bogus. What you really want to
do is prioritize activity where resources are constrained. From a RAS
perspective, idle systems are the devil's playground :-). ZFS already
does prioritize I/O that it knows about. Prioritizing on CPU might have
some merit, but to integrate into Solaris' resource management system
might bring some added system admin complexity which is unwanted.
I agree, but the problem as I see it is that nothing has an overview of
the entire environment. ZFS knows what I/O is coming in and what it's
sending out, but that's it. Even if we had an easy-to-use resource
management framework across all the Sun applications and devices, we'd
still run into non-Sun bits that place demands on shared components
like networking, SAN, arrays, etc. Anything that can be auto-tuned is
great, but I'm afraid we're still going to need manual tuning in some
cases.
I think this is reason #7429 why I hate SANs: no meaningful QoS
related to reason #85823 why I hate SANs: sdd_max_throttle is a
butt-ugly hack
:-)
-- richard