Discussion:
RAID Failure Calculator (for 8x 2TB RAIDZ)
Matthew Angelo
2011-02-07 02:45:36 UTC
Permalink
I require a new high-capacity 8-disk zpool.  The disks I will be
purchasing (Samsung or Hitachi) have a non-recoverable read error rate
of 1 in 10^14 bits read and will be 2TB.  I'm staying clear of WD
because they have the new 4096-byte sectors which don't play nice with ZFS
at the moment.

My question is, how do I determine which of the following zpool and
vdev configuration I should run to maximize space whilst mitigating
rebuild failure risk?

1. 2x RAIDZ(3+1) vdev
2. 1x RAIDZ(7+1) vdev
3. 1x RAIDZ2(7+1) vdev


I just want to prove I shouldn't run a plain old RAID5 (RAIDZ) with 8x
2TB disks.

Cheers
Ian Collins
2011-02-07 04:18:01 UTC
Permalink
Post by Matthew Angelo
I require a new high-capacity 8-disk zpool. The disks I will be
purchasing (Samsung or Hitachi) have a non-recoverable read error rate
of 1 in 10^14 bits read and will be 2TB. I'm staying clear of WD
because they have the new 4096-byte sectors which don't play nice with ZFS
at the moment.
My question is, how do I determine which of the following zpool and
vdev configuration I should run to maximize space whilst mitigating
rebuild failure risk?
1. 2x RAIDZ(3+1) vdev
2. 1x RAIDZ(7+1) vdev
3. 1x RAIDZ2(7+1) vdev
I assume 3 was 6+2.

A bigger issue than drive error rates is how long a new 2TB drive will
take to resilver if one dies. How long are you willing to run without
redundancy in your pool?
--
Ian.
Edward Ned Harvey
2011-02-07 04:48:20 UTC
Permalink
Post by Matthew Angelo
My question is, how do I determine which of the following zpool and
vdev configuration I should run to maximize space whilst mitigating
rebuild failure risk?
1. 2x RAIDZ(3+1) vdev
2. 1x RAIDZ(7+1) vdev
3. 1x RAIDZ2(6+2) vdev
I just want to prove I shouldn't run a plain old RAID5 (RAIDZ) with 8x
2TB disks.
(Corrected typo, 6+2 for you).
Sounds like you made up your mind already. Nothing wrong with that. You
are apparently uncomfortable running with only 1 disk worth of redundancy.
There is nothing fundamentally wrong with the raidz1 configuration, but the
probability of failure is obviously higher.

The question is how do you calculate the probability? Because if we're talking
about 5e-21 versus 3e-19 then you probably don't care about the difference...
They're both essentially zero probability... Well... There's no good
answer to that.

With the cited bit error rate, you're just representing the probability
of a bit error. You're not representing the probability of a
failed drive. And you're not representing the probability of a drive
failure within a specified time window. What you really care about is the
probability of two drives (or 3 drives) failing concurrently... In which
case, you need to model the probability of any one drive failing within a
specified time window. And even if you want to model that probability, in
reality it's not linear. The probability of a drive failing between 1yr and
1yr+3hrs is smaller than the probability of it failing between 3yr and
3yr+3hrs, because after 3 yrs the failure rate will be higher. So
after 3 yrs, the probability of multiple simultaneous failures is higher.

I recently saw some Seagate data sheets which specified the annual disk
failure rate to be 0.3%. Again, this is a linear model, representing a
nonlinear reality.
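
To put a number on that (purely illustrative; the sketch below assumes a
constant failure rate, which, as noted, real drives don't have):

import math

AFR = 0.003                       # assumed annual failure rate (0.3%)
RATE = -math.log(1 - AFR) / 8760  # constant per-hour failure rate

def p_fail_within(hours, n_drives=1):
    """Probability that at least one of n_drives fails within `hours`."""
    return 1 - math.exp(-RATE * hours * n_drives)

# Chance that one of the 7 surviving drives dies during a 48-hour resilver:
print(p_fail_within(48, n_drives=7))   # ~1.2e-4 with AFR = 0.3%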

Suppose one disk fails... How many weeks does it take to get a replacement
onsite under the 3yr limited mail-in warranty?

But then again after 3 years, you're probably considering this your antique
hardware, and all the stuff you care about is on a newer server. Etc.

There's no good answer to your question.

You are obviously uncomfortable with a single disk worth of redundancy. Go
with your gut. Sleep well at night. It only costs you $100. You probably
have a cell phone with no backups worth more than that in your pocket right
now.
Matthew Angelo
2011-02-07 06:22:51 UTC
Permalink
Yes, I did mean 6+2. Thank you for fixing the typo.

I'm actually more leaning towards running a simple 7+1 RAIDZ1.
Running this with 1TB is not a problem but I just wanted to
investigate at what TB size the "scales would tip". I understand
RAIDZ2 protects against failures during a rebuild process. Currently,
my RAIDZ1 takes 24 hours to rebuild a failed disk, so with 2TB disks,
assuming a worst case of 2 days, that is my 'exposure' time.

For example, I would hazard a confident guess that 7+1 RAIDZ1 with 6TB
drives wouldn't be a smart idea. I'm just trying to extrapolate down.

I will be running a hot (or maybe cold) spare, so I don't need to
factor in the time it takes for a manufacturer to replace the drive.



Peter Jeremy
2011-02-07 21:07:27 UTC
Permalink
Post by Matthew Angelo
I'm actually more leaning towards running a simple 7+1 RAIDZ1.
Running this with 1TB is not a problem but I just wanted to
investigate at what TB size the "scales would tip".
It's not that simple. Whilst resilver time is proportional to device
size, it's far more impacted by the degree of fragmentation of the
pool. And there's no 'tipping point' - it's a gradual slope so it's
really up to you to decide where you want to sit on the probability
curve.
Post by Matthew Angelo
I understand
RAIDZ2 protects against failures during a rebuild process.
This would be its current primary purpose.
Post by Matthew Angelo
Currently,
my RAIDZ1 takes 24 hours to rebuild a failed disk, so with 2TB disks,
assuming a worst case of 2 days, that is my 'exposure' time.
Unless this is a write-once pool, you can probably also assume that
your pool will get more fragmented over time, so by the time your
pool gets to twice its current capacity, it might well take 3 days
to rebuild due to the additional fragmentation.

One point I haven't seen mentioned elsewhere in this thread is that
all the calculations so far have assumed that drive failures were
independent. In practice, this probably isn't true. All HDD
manufacturers have their "off" days - where whole batches or models of
disks are cr*p and fail unexpectedly early. The WD EARS is simply a
demonstration that it's WD's turn to turn out junk. Your best
protection against this is to have disks from enough different batches
that a batch failure won't take out your pool.

PSU, fan and SATA controller failures are likely to take out multiple
disks but it's far harder to include enough redundancy to handle this
and your best approach is probably to have good backups.
Post by Matthew Angelo
I will be running a hot (or maybe cold) spare, so I don't need to
factor in the time it takes for a manufacturer to replace the drive.
In which case, the question is more whether 8-way RAIDZ1 with a
hot spare (7+1+1) is better than 9-way RAIDZ2 (7+2). In the latter
case, your "hot spare" is already part of the pool so you don't
lose the time-to-notice plus time-to-resilver before regaining
redundancy. The downside is that actively using the "hot spare"
may increase the probability of it failing.
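
As a back-of-envelope illustration of that trade-off (reusing the 0.3%
AFR and 1 in 10^14 error rate quoted earlier in the thread; the 72-hour
notice-plus-resilver window below is just a placeholder):

import math

AFR, BER, DRIVE_TB, WINDOW_H = 0.003, 1e-14, 2, 72   # assumptions, not measurements
RATE = -math.log(1 - AFR) / 8760

def p_drive_dies(hours, n):   # at least one of n drives dies within `hours`
    return 1 - math.exp(-RATE * hours * n)

def p_ure(drives_read):       # at least one URE while reading `drives_read` full drives
    return -math.expm1(drives_read * DRIVE_TB * 1e12 * 8 * math.log1p(-BER))

# 7+1 raidz1 plus hot spare: after the first failure, any second whole-drive
# failure or any URE among the 7 survivors during the window loses data.
p_raidz1_spare = 1 - (1 - p_drive_dies(WINDOW_H, 7)) * (1 - p_ure(7))

# 7+2 raidz2: after the first failure one parity remains, so loss also needs
# a second whole-drive failure *and then* a URE or third failure.
p_raidz2 = p_drive_dies(WINDOW_H, 8) * (1 - (1 - p_drive_dies(WINDOW_H, 7)) * (1 - p_ure(7)))

print("raidz1+spare P(loss | first failure) ~ %.2g" % p_raidz1_spare)   # ~0.67
print("raidz2 (7+2) P(loss | first failure) ~ %.2g" % p_raidz2)         # ~1.3e-4
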
--
Peter Jeremy
Richard Elling
2011-02-08 00:53:35 UTC
Permalink
Post by Peter Jeremy
Post by Matthew Angelo
I'm actually more leaning towards running a simple 7+1 RAIDZ1.
Running this with 1TB is not a problem but I just wanted to
investigate at what TB size the "scales would tip".
It's not that simple. Whilst resilver time is proportional to device
size, it's far more impacted by the degree of fragmentation of the
pool. And there's no 'tipping point' - it's a gradual slope so it's
really up to you to decide where you want to sit on the probability
curve.
The "tipping point" won't occur for similar configurations. The tip
occurs for different configurations. In particular, if the size of the
N+M parity scheme is very large and the resilver times become
very, very large (weeks) then an (M+1)-way mirror scheme can provide
better performance and dependability. But I consider these to be
extreme cases.
Post by Peter Jeremy
Post by Matthew Angelo
I understand
RAIDZ2 protects against failures during a rebuild process.
This would be its current primary purpose.
Post by Matthew Angelo
Currently,
my RAIDZ1 takes 24 hours to rebuild a failed disk, so with 2TB disks,
assuming a worst case of 2 days, that is my 'exposure' time.
Unless this is a write-once pool, you can probably also assume that
your pool will get more fragmented over time, so by the time your
pool gets to twice it's current capacity, it might well take 3 days
to rebuild due to the additional fragmentation.
One point I haven't seen mentioned elsewhere in this thread is that
all the calculations so far have assumed that drive failures were
independent. In practice, this probably isn't true. All HDD
manufacturers have their "off" days - where whole batches or models of
disks are cr*p and fail unexpectedly early. The WD EARS is simply a
demonstration that it's WD's turn to turn out junk. Your best
protection against this is to have disks from enough different batches
that a batch failure won't take out your pool.
The problem with treating failures as interdependent (correlated) is that
you cannot get that failure rate information from the vendors. You could
guess, or use your own data, but it would not always help you make a better
design decision.
Post by Peter Jeremy
PSU, fan and SATA controller failures are likely to take out multiple
disks but it's far harder to include enough redundancy to handle this
and your best approach is probably to have good backups.
The top 4 items that fail most often, in no particular order, are: fans,
power supplies, memory, and disks. This is why you will see enterprise-class
servers use redundant fans, multiple high-quality power supplies,
ECC memory, and some sort of RAID.
Post by Peter Jeremy
Post by Matthew Angelo
I will be running a hot (or maybe cold) spare, so I don't need to
factor in the time it takes for a manufacturer to replace the drive.
In which case, the question is more whether 8-way RAIDZ1 with a
hot spare (7+1+1) is better than 9-way RAIDZ2 (7+2).
In this case, raidz2 is much better for dependability because the "spare"
is already "resilvered." It also performs better, though the dependability
gains tend to be bigger than the performance gains.
Post by Peter Jeremy
In the latter
case, your "hot spare" is already part of the pool so you don't
lose the time-to-notice plus time-to-resilver before regaining
redundancy. The downside is that actively using the "hot spare"
may increase the probability of it failing.
No. The disk failure rate data does not conclusively show that activity
causes premature failure. Other failure modes dominate.
-- richard
Paul Kraus
2011-02-14 12:55:03 UTC
Permalink
Post by Richard Elling
Post by Peter Jeremy
Post by Matthew Angelo
I'm actually more leaning towards running a simple 7+1 RAIDZ1.
Running this with 1TB is not a problem but I just wanted to
investigate at what TB size the "scales would tip".
It's not that simple.  Whilst resilver time is proportional to device
size, it's far more impacted by the degree of fragmentation of the
pool.  And there's no 'tipping point' - it's a gradual slope so it's
really up to you to decide where you want to sit on the probability
curve.
The "tipping point" won't occur for similar configurations. The tip
occurs for different configurations. In particular, if the size of the
N+M parity scheme is very large and the resilver times become
very, very large (weeks) then a (M-1)-way mirror scheme can provide
better performance and dependability. But I consider these to be
extreme cases.
Empirically it seems that resilver time is related to the number of
objects as much as (if not more than) the amount of data. zpools (mirrors)
with similar amounts of data but radically different numbers of
objects take very different amounts of time to resilver. I have NOT
(yet) started actually measuring and tracking this, but the above is
based on casual observation.

P.S. I am measuring the number of objects via `zdb -d` as that is faster
than trying to count files and directories, and I expect it is a much
better measure of what the underlying zfs code is dealing with (a
particular dataset may have lots of snapshot data that does not
(easily) show up).
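
For reference, something like the following tallies those per-dataset
object counts (it assumes the "..., N objects" summary lines my zdb -d
prints; adjust the pattern if your build's output differs):

import re, subprocess, sys

def total_objects(pool):
    """Sum the per-dataset object counts reported by `zdb -d <pool>`."""
    out = subprocess.run(["zdb", "-d", pool], capture_output=True,
                         text=True, check=True).stdout
    counts = [int(m.group(1)) for m in re.finditer(r"(\d+) objects", out)]
    return len(counts), sum(counts)

if __name__ == "__main__":
    datasets, objects = total_objects(sys.argv[1])
    print("%d datasets, %d objects" % (datasets, objects))
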
--
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
Nico Williams
2011-02-14 13:12:46 UTC
Permalink
Post by Paul Kraus
P.S. I am measuring number of objects via `zdb -d` as that is faster
than trying to count files and directories and I expect is a much
better measure of what the underlying zfs code is dealing with (a
particular dataset may have lots of snapshot data that does not
(easily) show up).
It's faster because: (a) no atime updates, and (b) no ZPL overhead.

Nico
--

Richard Elling
2011-02-07 07:01:44 UTC
Permalink
Post by Matthew Angelo
I require a new high-capacity 8-disk zpool. The disks I will be
purchasing (Samsung or Hitachi) have a non-recoverable read error rate
of 1 in 10^14 bits read and will be 2TB. I'm staying clear of WD
because they have the new 4096-byte sectors which don't play nice with ZFS
at the moment.
My question is, how do I determine which of the following zpool and
vdev configuration I should run to maximize space whilst mitigating
rebuild failure risk?
The MTTDL[2] model will work.
http://blogs.sun.com/relling/entry/a_story_of_two_mttdl
As described, this model doesn't scale well for N > 3 or 4, but it will get
you in the ballpark.

You will also need to know the MTBF from the data sheet, but if you
don't have that info, that is ok because you are asking the right question:
given a single drive type, what is the best configuration for preventing
data loss. Finally, to calculate the raidz2 result, you need to know the
mean time to recovery (MTTR) which includes the logistical replacement
time and resilver time.

Basically, the model calculates the probability of a data loss event during
reconstruction. This is different for ZFS than for most other LVMs because
ZFS only resilvers live data, and the total data resilvered is <= the disk size.
1. 2x RAIDZ(3+1) vdev
2. 1x RAIDZ(7+1) vdev
3. 1x RAIDZ2(7+1) vdev
I just want to prove I shouldn't run a plain old RAID5 (RAIDZ) with 8x
2TB disks.
Double parity will win over single parity. Intuitively, when you add parity you
multiply by the MTBF. When you add disks to a set, you change the denominator
by a few digits. Obviously multiplication is a good thing, dividing not so much.
In short, raidz2 is the better choice.
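
For a rough feel, a simplified sketch in the spirit of that calculation
(not the exact formulas from the blog post; the MTBF, MTTR and error-rate
inputs below are assumptions):

import math

MTBF_H = 1.0e6          # assumed drive MTBF from a typical data sheet, hours
MTTR_H = 48             # assumed logistical replacement + resilver time, hours
BER = 1e-14             # non-recoverable errors per bit read
DRIVE_BITS = 2e12 * 8   # 2TB drive, worst case: completely full

def p_ure(drives_read):
    """Chance of at least one URE while reading `drives_read` full drives."""
    return -math.expm1(drives_read * DRIVE_BITS * math.log1p(-BER))

def mttdl_raidz1(n):    # one death, then a URE while reconstructing from n-1 drives
    return MTBF_H / (n * p_ure(n - 1))

def mttdl_raidz2(n):    # two deaths within one repair window, then a URE
    return MTBF_H**2 / (n * (n - 1) * MTTR_H * p_ure(n - 2))

YEAR = 24 * 365
print("8-wide raidz1: MTTDL ~ %.0f years" % (mttdl_raidz1(8) / YEAR))   # ~21
print("8-wide raidz2: MTTDL ~ %.0f years" % (mttdl_raidz2(8) / YEAR))   # ~69,000
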
-- richard
Sandon Van Ness
2011-02-07 13:23:07 UTC
Permalink
I think data integrity problems and complete volume loss are most likely
in the following order (most likely first):

1. 1x Raidz(7+1)
2. 2x RaidZ(3+1)
3. 1x Raidz2(6+2)

Simple raidz certainly is an option with only 8 disks (8 is about the
maximum I would go), but to be honest I would feel safer going raidz2.
The 2x raidz (3+1) would probably perform the best, but I would prefer
the third option (raidz2) as it is better for redundancy. With raidz2
any two disks can fail, and if you hit some unrecoverable read errors
during a scrub or rebuild, the double parity gives you a much better
chance of avoiding corruption on the same set of data.
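
Roughly, using the 1 in 10^14 rate and 2TB drives from the original post
(worst case, full pool), the exposure during a single-disk rebuild looks
something like:

import math

def p_ure(drives_read, drive_bits=2e12 * 8, ber=1e-14):
    return -math.expm1(drives_read * drive_bits * math.log1p(-ber))

print("1x raidz1 (7+1): rebuild reads 7 drives, P(URE) ~ %.2f, no parity left" % p_ure(7))
print("2x raidz1 (3+1): rebuild reads 3 drives, P(URE) ~ %.2f, no parity left" % p_ure(3))
print("1x raidz2 (6+2): rebuild reads 7 drives, P(URE) ~ %.2f, one parity still left" % p_ure(7))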