Discussion:
Raidz - what is stored in parity?
Peter Taps
2010-08-10 22:40:18 UTC
Permalink
Hi,

I am going through understanding the fundamentals of raidz. From the man pages, a raidz configuration of P disks and N parity provides (P-N)*X storage space where X is the size of the disk. For example, if I have 3 disks of 10G each and I configure it with raidz1, I will have 20G of usable storage. In addition, I continue to work even if 1 disk fails.

First, I don't understand why parity takes so much space. From what I know about parity, there is typically one parity bit per byte. Therefore, the parity should be taking 1/8 of storage, not 1/3 of storage. What am I missing?

Second, if one disk fails, how is my lost data reconstructed? There is no duplicate data as this is not a mirrored configuration. Somehow, there should be enough information in the parity disk to reconstruct the lost data. How is this possible?

Thank you in advance for your help.

Regards,
Peter
--
This message posted from opensolaris.org
Eric D. Mudama
2010-08-10 22:47:24 UTC
Permalink
Post by Peter Taps
Hi,
First, I don't understand why parity takes so much space. From what
I know about parity, there is typically one parity bit per
byte. Therefore, the parity should be taking 1/8 of storage, not 1/3
of storage. What am I missing?
Think of it as 1 bit of parity per N-wide RAID'd bit stored on your
data drives, which is why it occupies 1/N size.

With 3 disks it's 1/3, with 8 disks it's 1/8, and with 10983 disks it
would be 1/10983, because you're generating parity across the "width"
of your stripe, not as a tail to each stored byte on individual
devices.
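That arithmetic can be sketched in a few lines of plain Python (illustration only, nothing from ZFS itself; the function name is made up):

```python
# Usable capacity of an N-disk raidz-style array: parity costs one disk's
# worth of space per parity level, so overhead is parity/N, not a fixed 1/8.

def usable_space(num_disks, disk_size, num_parity=1):
    """(num_disks - num_parity) * disk_size, per the zpool man page formula."""
    return (num_disks - num_parity) * disk_size

print(usable_space(3, 10))   # Peter's example: 3 x 10G disks, raidz1 -> 20
print(usable_space(8, 10))   # 8 disks -> parity overhead is 1/8 -> 70
```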
--
Eric D. Mudama
***@mail.bounceswoosh.org
Peter Taps
2010-08-11 04:57:49 UTC
Permalink
Hi Eric,

Thank you for your help. At least one part is clear now.

I still am confused about how the system is still functional after one disk fails.

Consider my earlier example of 3 disks zpool configured for raidz-1. To keep it simple let's not consider block sizes.

Let's say I send a write value "abcdef" to the zpool.

As the data gets striped, we will have 2 characters per disk.

disk1 = "ab" + some parity info
disk2 = "cd" + some parity info
disk3 = "ef" + some parity info

Now, if disk2 fails, I lost "cd." How will I ever recover this? The parity info may tell me that something is bad but I don't see how my data will get recovered.

The only good thing is that any newer data will now be striped over two disks.

Perhaps I am missing some fundamental concept about raidz.

Regards,
Peter
--
This message posted from opensolaris.org
Erik Trimble
2010-08-11 05:46:07 UTC
Permalink
Post by Peter Taps
Hi Eric,
Thank you for your help. At least one part is clear now.
I still am confused about how the system is still functional after one disk fails.
Consider my earlier example of 3 disks zpool configured for raidz-1. To keep it simple let's not consider block sizes.
Let's say I send a write value "abcdef" to the zpool.
As the data gets striped, we will have 2 characters per disk.
disk1 = "ab" + some parity info
disk2 = "cd" + some parity info
disk3 = "ef" + some parity info
Now, if disk2 fails, I lost "cd." How will I ever recover this? The parity info may tell me that something is bad but I don't see how my data will get recovered.
The only good thing is that any newer data will now be striped over two disks.
Perhaps I am missing some fundamental concept about raidz.
Regards,
Peter
Parity is not intended to tell you *if* something is bad (well, it's not
*designed* for that). It tells you how to RECONSTRUCT something should
it be bad. ZFS uses checksums of the data (which are themselves stored
as data) to tell if some data is bad and needs to be re-written (which
virtually no other filesystem does). Parity is used at a lower level to
reconstruct data on devices after a device failure. It is not directly
used to determine whether a device (or block of data) is bad.


To simplify, let's assume we're talking about raidz1 (the principles
generally apply to raidz2 and raidz3, but the details differ slightly).


Parity is constructed using mathematical XOR, which has the following
property:

if A XOR B = C
then
A XOR C = B and also B XOR C = A

(XOR is also fully commutative, so A XOR B = B XOR A )


So, in your case, what we have is some data "abcdef", and three disks.
Assuming we have a stripe set up so that 1 BYTE (i.e. character)
gets stored on each device, then what you have is this:

Stripe    Device 1    Device 2    Device 3
  1          A           B         A XOR B
  2        C XOR D       C            D
  3          E         E XOR F        F


(where X XOR Y means the binary value computed by XOR-ing X with Y)

In any case, if I lose one of the devices above, I simply XOR the
corresponding values from the other two devices to reconstruct what I need.
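That XOR property is easy to verify directly. A tiny sketch in Python (illustration only, not ZFS code):

```python
# parity = A XOR B; if either data byte is lost, XOR-ing the surviving
# byte with the parity gives the lost byte back.
a, b = ord('A'), ord('B')    # data bytes on devices 1 and 2
p = a ^ b                    # parity byte on device 3

assert a ^ p == b            # device 2 lost: rebuild B from A and parity
assert b ^ p == a            # device 1 lost: rebuild A from B and parity
assert a ^ b == b ^ a        # XOR is fully commutative
```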



For raidz2 and raidz3, there are two or three parity calculations (it's
not a straight XOR; I forget the exact algorithm), but the process is
the same: you use the data from the remaining devices to recompute the
lost device or devices. Because the parity blocks for a stripe are
distributed across all devices (there is no dedicated parity-only
device), it is simpler to recover data while retaining performance.
--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Marty Scholes
2010-08-11 13:52:46 UTC
Permalink
Post by Peter Taps
Hi Eric,
Thank you for your help. At least one part is clear now.
I still am confused about how the system is still functional after one disk fails.
Consider my earlier example of 3 disks zpool configured for raidz-1. To keep it simple let's not consider block sizes.
Let's say I send a write value "abcdef" to the zpool.
As the data gets striped, we will have 2 characters per disk.
disk1 = "ab" + some parity info
disk2 = "cd" + some parity info
disk3 = "ef" + some parity info
Now, if disk2 fails, I lost "cd." How will I ever recover this? The parity info may tell me that something is bad but I don't see how my data will get recovered.
The only good thing is that any newer data will now be striped over two disks.
Perhaps I am missing some fundamental concept about raidz.
Regards,
Peter
Post by Erik Trimble
Parity is not intended to tell you *if* something is bad (well, it's not *designed* for that). It tells you how to RECONSTRUCT something should it be bad. ZFS uses Checksums of the data (which are stored as data themselves) to tell if some data is bad, and thus needs to be re-written
To follow up Erik's post, parity is used both to detect and correct errors in a string of equal-sized numbers; each parity value is the same size as each of the numbers. In the old serial protocols, one parity bit was used to detect an error in a string of 7 bits, so each "number" in the string was a single bit. In the case of ZFS, each "number" in the string is a disk block. The length of the string of numbers is completely arbitrary.

I am rusty on parity math, but Reed-Solomon is used (of which XOR is a degenerate case) such that each parity is independent of the other parities. RAIDZ can support up to three parities per stripe.

Generally, a single parity can either detect a single corrupt number in a string or if it is known which number is corrupt, a single parity can correct that number. Traditional RAID5 makes the assumption that it knows which number (i.e. block) is bad because the disk failed and therefore can use the parity block to reconstruct it. RAID5 cannot reconstruct a random bit-flip.

RAIDZ takes a different approach where the checksum for the number string (i.e. stripe) exists in a different, already validated stripe. With that checksum in hand, ZFS knows when a stripe is corrupt but not which block. ZFS will then reconstruct each data block in the stripe using the parity block, one data block at a time until the checksum matches. At that point ZFS knows which block is bad and can rebuild it and write it to disk. A scrub does this for all stripes and all parities in each stripe.
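That try-each-block reconstruction can be sketched as a toy in Python (single-XOR parity and CRC32 standing in for ZFS's real parity math and checksums; none of this is actual ZFS code):

```python
# Toy model: a stripe of data blocks plus one XOR parity block, with the
# stripe's checksum stored "elsewhere". When the checksum mismatches, try
# rebuilding each data block from parity in turn until the checksum matches.
import functools
import zlib

data = [b'ab', b'cd']                           # data blocks on disks 1 and 2
parity = bytes(x ^ y for x, y in zip(*data))    # XOR parity block on disk 3
checksum = zlib.crc32(b''.join(data))           # stored in another, valid stripe

corrupt = [b'ab', b'XX']                        # disk 2 silently returns junk

for i in range(len(corrupt)):
    # Rebuild candidate block i from parity plus the other blocks, re-check.
    others = [blk for j, blk in enumerate(corrupt) if j != i]
    rebuilt = bytes(functools.reduce(lambda x, y: x ^ y, col)
                    for col in zip(parity, *others))
    candidate = corrupt[:i] + [rebuilt] + corrupt[i + 1:]
    if zlib.crc32(b''.join(candidate)) == checksum:
        print(f"block {i} was bad; recovered {rebuilt!r}")
        break
```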

Using the example above, the disk layout would look more like the following for a single stripe, and as Erik mentioned, the location of the data and parity blocks will change from stripe to stripe:
disk1 = "ab"
disk2 = "cd"
disk3 = parity info

Again using the example above, if disk 2 fails, or even stays online but produces bad data, the information can be reconstructed from disk 3.

The beauty of ZFS is that it does not depend on parity to detect errors; your stripes can be as wide as you want (up to 100-ish devices) and you can choose 1, 2 or 3 parity devices.

Hope that makes sense,
Marty
--
This message posted from opensolaris.org
Haudy Kazemi
2010-08-11 06:48:27 UTC
Permalink
Post by Peter Taps
Hi Eric,
Thank you for your help. At least one part is clear now.
I still am confused about how the system is still functional after one disk fails.
Consider my earlier example of 3 disks zpool configured for raidz-1. To keep it simple let's not consider block sizes.
Let's say I send a write value "abcdef" to the zpool.
As the data gets striped, we will have 2 characters per disk.
disk1 = "ab" + some parity info
disk2 = "cd" + some parity info
disk3 = "ef" + some parity info
Now, if disk2 fails, I lost "cd." How will I ever recover this? The parity info may tell me that something is bad but I don't see how my data will get recovered.
The only good thing is that any newer data will now be striped over two disks.
Perhaps I am missing some fundamental concept about raidz.
Regards,
Peter
It's done via math and numbers. :) In a computer, everything is
numbers, stored in base 2 (binary)...there are no letters or other
symbols. Your sample value of 'abcdef' will be represented as a
sequence of numbers, probably using the ASCII equivalent numbers, which
are in turn represented as a binary sequence.

A simplified view of how you can protect multiple independent pieces of
information with one piece of parity is as follows.
(Note: this simplified view is not exactly how RAID5 or RAIDZ work, as
they actually make use of XOR at a bitwise level.)

Consider an equation with variables (unrelated to your sample value) A,
B, and P, where A + B = P. P is the parity value.
A and B are numbers representing your data; they were indirectly chosen
by you when you created your data. P is the generated parity value.

If A=97, and B=98, then P=97+98=195.

Each of the three variables is stored on a different disk. If any one
variable is lost (the disk failed), the missing variable can be
recalculated by rearranging the formula and using the known values.

Assuming 'A' was lost, then A=P-B
P-B=195-98
195-98=97
A=97. Data recovered.

In this simplified example, one piece of parity data P is generated for
every pair of A and B values that are written. Special cases handle
things when only one value needs to be written (zero padding). For more
than 3 disks, the formula can expand to variations of A+B+C+D+E+F=P
where P is the parity. Additional levels of parity require using more
complex techniques to generate the needed parity values.
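Here is that additive model as a short sketch (again, real RAID5/RAIDZ use bitwise XOR rather than integer addition, but the recovery algebra has the same shape):

```python
# Simplified additive parity: P = A + B, so any single lost value can be
# recovered by rearranging the equation.
a, b = 97, 98
p = a + b                  # parity value, 195, stored on the third disk

recovered_a = p - b        # disk holding A fails: A = P - B
assert recovered_a == 97   # data recovered

# The same idea scales to wider stripes: P = A + B + C + ... and any one
# missing term is P minus the sum of the surviving terms.
vals = [10, 20, 30, 40]
p_wide = sum(vals)
missing = 2                # pretend the disk holding vals[2] failed
recovered = p_wide - sum(v for j, v in enumerate(vals) if j != missing)
assert recovered == vals[missing]
```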

There are lots of other explanations online that might help you out as
well: http://www.google.com/#hl=en&q=how+raid+works
Thomas Burgess
2010-08-11 12:00:00 UTC
Permalink
Post by Peter Taps
Hi Eric,
Thank you for your help. At least one part is clear now.
I still am confused about how the system is still functional after one disk fails.
Consider my earlier example of 3 disks zpool configured for raidz-1. To
keep it simple let's not consider block sizes.
Let's say I send a write value "abcdef" to the zpool.
As the data gets striped, we will have 2 characters per disk.
disk1 = "ab" + some parity info
disk2 = "cd" + some parity info
disk3 = "ef" + some parity info
Now, if disk2 fails, I lost "cd." How will I ever recover this? The parity
info may tell me that something is bad but I don't see how my data will get
recovered.
The only good thing is that any newer data will now be striped over two disks.
Perhaps I am missing some fundamental concept about raidz.
Regards,
Peter
I find the best way to understand how parity works is to think back to your
algebra class when you'd have something like

1x +2 = 3

and you could solve for x....it's not EXACTLY like that but solving the
parity stuff is similar to solving for x
Post by Peter Taps
--
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Peter Taps
2010-08-11 17:53:18 UTC
Permalink
Thank you all for your help. It appears my understanding of parity was rather limited. I kept thinking of memory parity, where the extra bit ensures that the number of 1 bits across all 9 bits is always even.

In the case of zfs, that type of checking is actually handled by the checksum. What zfs calls parity is much more than a simple check. No wonder it takes more space.

One question though. Marty mentioned that raidz parity is limited to 3. But in my experiment, it seems I can get parity to any level.

You create a raidz zpool as:

# zpool create mypool raidzx disk1 disk2 ...

Here, x in raidzx is a numeric value indicating the desired parity.

In my experiment, the following command seems to work:

# zpool create mypool raidz10 disk1 disk2 ...

In my case, it gives an error that I need at least 11 disks (which I don't) but the point is that raidz parity does not seem to be limited to 3. Is this not true?

Thank you once again for your help.

Regards,
Peter
--
This message posted from opensolaris.org
Marty Scholes
2010-08-11 19:13:04 UTC
Permalink
Post by Peter Taps
One question though. Marty mentioned that raidz
parity is limited to 3. But in my experiment, it
seems I can get parity to any level.
# zpool create mypool raidzx disk1 diskk2 ....
Here, x in raidzx is a numeric value indicating the
desired parity.
In my experiment, the following command seems to work:
# zpool create mypool raidz10 disk1 disk2 ...
In my case, it gives an error that I need at least 11
disks (which I don't) but the point is that raidz
parity does not seem to be limited to 3. Is this not
true?
You have my curiosity. I was asking for that feature in these forums last year.

What OS, version and ZFS version are you running?
--
This message posted from opensolaris.org
Peter Taps
2010-08-11 20:26:46 UTC
Permalink
I am running ZFS file system version 5 on Nexenta.

Peter
--
This message posted from opensolaris.org
Adam Leventhal
2010-08-11 19:31:07 UTC
Permalink
Post by Peter Taps
In my case, it gives an error that I need at least 11 disks (which I don't) but the point is that raidz parity does not seem to be limited to 3. Is this not true?
RAID-Z is limited to 3 parity disks. The error message is giving you false hope and that's a bug. If you had plugged in 11 disks or more in the example you provided you would have simply gotten a different error.

- ahl

Eric D. Mudama
2010-08-11 17:45:01 UTC
Permalink
Post by Peter Taps
Hi Eric,
Thank you for your help. At least one part is clear now.
I still am confused about how the system is still functional after one disk fails.
The data for any given sector striped across all drives can be thought
of as:

A+B+C = P

where A..C represent the contents of sector N on devices a..c, and P
is the parity located on device p.

From that, you can do some simple algebra to convert it to:

A+B+C-P = 0

If any of A,B,C or P are unreadable (assume B), from simple algebra,
you can solve for any single unknown (x) to recreate it:

A+x+C = P
A+x+C-A-C = P-A-C
x = P-A-C

and voila, you now have your original B contents, since B=x.
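The same algebra, checked in a couple of lines of Python (a toy with small integers standing in for sector contents):

```python
a, b, c = 5, 7, 11    # contents of sector N on devices a, b, c
p = a + b + c         # parity sector on device p, so A + B + C - P = 0

x = p - a - c         # device b unreadable: solve for the single unknown
assert x == b         # the original B contents are recovered
```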

--eric
--
Eric D. Mudama
***@mail.bounceswoosh.org
Peter Taps
2010-08-11 20:29:18 UTC
Permalink
Thank you, Eric. Your explanation is easy to understand.

Regards,
Peter
--
This message posted from opensolaris.org
Arne Schwabe
2010-08-10 22:44:05 UTC
Permalink
Post by Peter Taps
Hi,
I am going through understanding the fundamentals of raidz. From the man pages, a raidz configuration of P disks and N parity provides (P-N)*X storage space where X is the size of the disk. For example, if I have 3 disks of 10G each and I configure it with raidz1, I will have 20G of usable storage. In addition, I continue to work even if 1 disk fails.
First, I don't understand why parity takes so much space. From what I know about parity, there is typically one parity bit per byte. Therefore, the parity should be taking 1/8 of storage, not 1/3 of storage. What am I missing?
Second, if one disk fails, how is my lost data reconstructed? There is no duplicate data as this is not a mirrored configuration. Somehow, there should be enough information in the parity disk to reconstruct the lost data. How is this possible?
Thank you in advance for your help.
Nah, it is more like disk3 = disk2 XOR disk1. You can read about it under
RAID5 (raidz is more complicated, but the basic idea stays the same). The
parity you describe is only for error checking; it's more like a zfs
checksum, which also takes very little additional space.

Arne