Discussion:
Corrupted LUN in RAIDZ group -- How to repair?
David Smith
2006-09-10 05:28:51 UTC
Background: We have a ZFS pool built from LUNs presented by a SAN-attached StorageTek/Engenio Flexline 380 storage system. This past Friday the storage environment went down, taking the system down with it.

After looking at the storage environment, we had several volume groups that needed to be carefully put back together to prevent corruption. One of the volume groups, and the volumes/LUNs coming from it, got corrupted anyway. Since our ZFS pool is set up with only one LUN from each volume group, we effectively ended up with a single-disk loss in our RAIDZ group, so I believe we should be able to recover from this.

My question is how to replace this disk (LUN). The LUN itself is fine again, but the data on it is not.

I tried a zpool replace, but ZFS recognizes that the disk/LUN is the same device. Using -f (force) didn't work either. How does one replace a LUN with ZFS?
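
For reference, the replace attempts looked roughly like this (the device name below is only a placeholder for the affected LUN, not its actual name):

# zpool replace mypool <corrupted-LUN>
# zpool replace -f mypool <corrupted-LUN>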

I'm currently running a "scrub", but I don't know whether that will help.

At first I only had read errors on one LUN in the raidz group, but just tonight I noticed that I now have a checksum error on another LUN as well (see the zpool status output below).

Below is a zpool status -x output. Can anyone advise how to recover from this?

# zpool status -x
pool: mypool
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-9P
scrub: scrub in progress, 66.00% done, 10h45m to go
config:

NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
raidz ONLINE 0 0 0
c10t600A0B800011730E000066C544C5EBB8d0 ONLINE 0 0 0
c10t600A0B800011730E000066CA44C5EBEAd0 ONLINE 0 0 0
c10t600A0B800011730E000066CF44C5EC1Cd0 ONLINE 0 0 0
c10t600A0B800011730E000066D444C5EC5Cd0 ONLINE 0 0 0
c10t600A0B800011730E000066D944C5ECA0d0 ONLINE 0 0 0
c10t600A0B800011652E0000E5C144C5ECDFd0 ONLINE 0 0 0
c10t600A0B800011730E000066E244C5ED2Cd0 ONLINE 0 0 0
c10t600A0B800011652E0000E5C644C5ED87d0 ONLINE 0 0 0
c10t600A0B800011730E000066EB44C5EDD8d0 ONLINE 0 0 0
c10t600A0B800011652E0000E5CB44C5EE29d0 ONLINE 0 0 0
c10t600A0B800011730E000066F444C5EE7Ed0 ONLINE 0 0 9
c10t600A0B800011652E0000E5D044C5EEC9d0 ONLINE 0 0 0
c10t600A0B800011730E000066FD44C5EF1Ad0 ONLINE 50 0 0
c10t600A0B800011652E0000E5D544C5EF63d0 ONLINE 0 0 0
raidz ONLINE 0 0 0
c10t600A0B800011652E0000E5B844C5EBCBd0 ONLINE 0 0 0
c10t600A0B800011652E0000E5BA44C5EBF5d0 ONLINE 0 0 0
c10t600A0B800011652E0000E5BC44C5EC2Dd0 ONLINE 0 0 0
c10t600A0B800011652E0000E5BE44C5EC6Bd0 ONLINE 0 0 0
c10t600A0B800011730E000066DB44C5ECB4d0 ONLINE 0 0 0
c10t600A0B800011652E0000E5C344C5ECF9d0 ONLINE 0 0 0
c10t600A0B800011730E000066E444C5ED5Ad0 ONLINE 0 0 0
c10t600A0B800011652E0000E5C844C5EDA1d0 ONLINE 0 0 0
c10t600A0B800011730E000066ED44C5EDFAd0 ONLINE 0 0 0
c10t600A0B800011652E0000E5CD44C5EE47d0 ONLINE 0 0 0
c10t600A0B800011730E000066F644C5EE96d0 ONLINE 0 0 6
c10t600A0B800011652E0000E5D244C5EEE7d0 ONLINE 0 0 0
c10t600A0B800011730E000066FF44C5EF32d0 ONLINE 70 0 0
c10t600A0B800011652E0000E5D744C5EF7Fd0 ONLINE 0 0 0

This system is running Solaris 10 U2.

Thank you,

David


James Dickens
2006-09-10 08:12:15 UTC
Post by David Smith
Background: We have a ZFS pool built from LUNs presented by a SAN-attached StorageTek/Engenio Flexline 380 storage system. This past Friday the storage environment went down, taking the system down with it.
After looking at the storage environment, we had several volume groups that needed to be carefully put back together to prevent corruption. One of the volume groups, and the volumes/LUNs coming from it, got corrupted anyway. Since our ZFS pool is set up with only one LUN from each volume group, we effectively ended up with a single-disk loss in our RAIDZ group, so I believe we should be able to recover from this.
My question is how to replace this disk (LUN). The LUN itself is fine again, but the data on it is not.
I tried a zpool replace, but ZFS recognizes that the disk/LUN is the same device. Using -f (force) didn't work either. How does one replace a LUN with ZFS?
I'm currently running a "scrub", but I don't know whether that will help.
At first I only had read errors on one LUN in the raidz group, but just tonight I noticed that I now have a checksum error on another LUN as well (see the zpool status output below).
Below is a zpool status -x output. Can anyone advise how to recover from this?
# zpool status -x
pool: mypool
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
Okay, ZFS noticed that there was an error and is now trying to fix it.
Post by David Smith
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-9P
scrub: scrub in progress, 66.00% done, 10h45m to go
It's scrubbing the drives and repairing bad data; in 10 hours and 45 minutes
it should be done.
Post by David Smith
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
raidz ONLINE 0 0 0
Since everything is still online and the pool itself shows no errors, no further
action should be required. After the scrub is done, should those
messages change, you can use zpool replace or come back here and ask
for more help, but at this time there is nothing to worry about.
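
For example, if one of those LUNs does turn out to need replacement once the scrub finishes, the commands would look something like this (the device name is just taken from your output as an illustration; an unused spare LUN could be given as a second argument to replace onto):

# zpool replace mypool c10t600A0B800011730E000066FD44C5EF1Ad0
# zpool status -x

And if the devices check out fine and you only want to zero the per-device error counters:

# zpool clear mypool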

James Dickens
uadmin.blogspot.com
Post by David Smith
c10t600A0B800011730E000066C544C5EBB8d0 ONLINE 0 0 0
c10t600A0B800011730E000066CA44C5EBEAd0 ONLINE 0 0 0
c10t600A0B800011730E000066CF44C5EC1Cd0 ONLINE 0 0 0
c10t600A0B800011730E000066D444C5EC5Cd0 ONLINE 0 0 0
c10t600A0B800011730E000066D944C5ECA0d0 ONLINE 0 0 0
c10t600A0B800011652E0000E5C144C5ECDFd0 ONLINE 0 0 0
c10t600A0B800011730E000066E244C5ED2Cd0 ONLINE 0 0 0
c10t600A0B800011652E0000E5C644C5ED87d0 ONLINE 0 0 0
c10t600A0B800011730E000066EB44C5EDD8d0 ONLINE 0 0 0
c10t600A0B800011652E0000E5CB44C5EE29d0 ONLINE 0 0 0
c10t600A0B800011730E000066F444C5EE7Ed0 ONLINE 0 0 9
c10t600A0B800011652E0000E5D044C5EEC9d0 ONLINE 0 0 0
c10t600A0B800011730E000066FD44C5EF1Ad0 ONLINE 50 0 0
c10t600A0B800011652E0000E5D544C5EF63d0 ONLINE 0 0 0
raidz ONLINE 0 0 0
c10t600A0B800011652E0000E5B844C5EBCBd0 ONLINE 0 0 0
c10t600A0B800011652E0000E5BA44C5EBF5d0 ONLINE 0 0 0
c10t600A0B800011652E0000E5BC44C5EC2Dd0 ONLINE 0 0 0
c10t600A0B800011652E0000E5BE44C5EC6Bd0 ONLINE 0 0 0
c10t600A0B800011730E000066DB44C5ECB4d0 ONLINE 0 0 0
c10t600A0B800011652E0000E5C344C5ECF9d0 ONLINE 0 0 0
c10t600A0B800011730E000066E444C5ED5Ad0 ONLINE 0 0 0
c10t600A0B800011652E0000E5C844C5EDA1d0 ONLINE 0 0 0
c10t600A0B800011730E000066ED44C5EDFAd0 ONLINE 0 0 0
c10t600A0B800011652E0000E5CD44C5EE47d0 ONLINE 0 0 0
c10t600A0B800011730E000066F644C5EE96d0 ONLINE 0 0 6
c10t600A0B800011652E0000E5D244C5EEE7d0 ONLINE 0 0 0
c10t600A0B800011730E000066FF44C5EF32d0 ONLINE 70 0 0
c10t600A0B800011652E0000E5D744C5EF7Fd0 ONLINE 0 0 0
This system is running Solaris 10 U2.
Thank you,
David
David Smith
2006-09-10 14:58:57 UTC
James,

Thanks for the reply.

It looks like the scrub has now completed. Should I clear these warnings now?

bash-3.00# zpool status -x
pool: mypool
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-9P
scrub: scrub completed with 0 errors on Sun Sep 10 07:44:36 2006
config:

NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
raidz ONLINE 0 0 0
c10t600A0B800011730E000066C544C5EBB8d0 ONLINE 0 0 0
c10t600A0B800011730E000066CA44C5EBEAd0 ONLINE 0 0 0
c10t600A0B800011730E000066CF44C5EC1Cd0 ONLINE 0 0 0
c10t600A0B800011730E000066D444C5EC5Cd0 ONLINE 0 0 0
c10t600A0B800011730E000066D944C5ECA0d0 ONLINE 0 0 0
c10t600A0B800011652E0000E5C144C5ECDFd0 ONLINE 0 0 0
c10t600A0B800011730E000066E244C5ED2Cd0 ONLINE 0 0 0
c10t600A0B800011652E0000E5C644C5ED87d0 ONLINE 0 0 0
c10t600A0B800011730E000066EB44C5EDD8d0 ONLINE 0 0 0
c10t600A0B800011652E0000E5CB44C5EE29d0 ONLINE 0 0 0
c10t600A0B800011730E000066F444C5EE7Ed0 ONLINE 0 0 15
c10t600A0B800011652E0000E5D044C5EEC9d0 ONLINE 0 0 0
c10t600A0B800011730E000066FD44C5EF1Ad0 ONLINE 50 0 0
c10t600A0B800011652E0000E5D544C5EF63d0 ONLINE 0 0 0
raidz ONLINE 0 0 0
c10t600A0B800011652E0000E5B844C5EBCBd0 ONLINE 0 0 0
c10t600A0B800011652E0000E5BA44C5EBF5d0 ONLINE 0 0 0
c10t600A0B800011652E0000E5BC44C5EC2Dd0 ONLINE 0 0 0
c10t600A0B800011652E0000E5BE44C5EC6Bd0 ONLINE 0 0 0
c10t600A0B800011730E000066DB44C5ECB4d0 ONLINE 0 0 0
c10t600A0B800011652E0000E5C344C5ECF9d0 ONLINE 0 0 0
c10t600A0B800011730E000066E444C5ED5Ad0 ONLINE 0 0 0
c10t600A0B800011652E0000E5C844C5EDA1d0 ONLINE 0 0 0
c10t600A0B800011730E000066ED44C5EDFAd0 ONLINE 0 0 0
c10t600A0B800011652E0000E5CD44C5EE47d0 ONLINE 0 0 0
c10t600A0B800011730E000066F644C5EE96d0 ONLINE 0 0 14
c10t600A0B800011652E0000E5D244C5EEE7d0 ONLINE 0 0 0
c10t600A0B800011730E000066FF44C5EF32d0 ONLINE 70 0 0
c10t600A0B800011652E0000E5D744C5EF7Fd0 ONLINE 0 0 0


David


Jeff Bonwick
2006-09-10 21:41:21 UTC
Post by David Smith
It looks like the scrub has now completed. Should I clear these warnings now?
Yep. You survived the Unfortunate Event unscathed. You're golden.
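
Clearing them is just (pool name as in your output):

# zpool clear mypool

after which zpool status -x should come back clean, assuming no new errors show up.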

Jeff
David Smith
2006-09-14 15:09:07 UTC
I have run zpool scrub again, and I now see checksum errors again. Wouldn't the checksum errors have been fixed by the first zpool scrub?

Can anyone recommend what actions I should take at this point?
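
For reference, the sequence I followed was roughly this (same pool as before):

# zpool clear mypool     (after the first scrub completed)
# zpool scrub mypool
# zpool status -x        (checksum errors are back on a couple of LUNs)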

Thanks,

David


Bill Moore
2006-09-14 20:55:33 UTC
Post by David Smith
I have run zpool scrub again, and I now see checksum errors again.
Wouldn't the checksum errors have been fixed by the first zpool scrub?
Can anyone recommend what actions I should take at this point?
After running the first scrub, did you run "zpool clear <pool>" to zero
out the error counts? If not, you will still be seeing the error counts
from the first scrub. Could you send the output of "zpool status -v"?
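
In other words, the cycle should look something like this (pool name taken from your earlier mail):

# zpool scrub mypool
  (wait for the scrub to complete)
# zpool clear mypool
# zpool status -v mypool

Without the clear, zpool status keeps reporting the counts left over from the earlier scrub.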


--Bill
David W. Smith
2006-09-14 21:08:29 UTC
Post by Bill Moore
Post by David Smith
I have run zpool scrub again, and I now see checksum errors again.
Wouldn't the checksum errors have been fixed by the first zpool scrub?
Can anyone recommend what actions I should take at this point?
After running the first scrub, did you run "zpool clear <pool>" to zero
out the error counts? If not, you will still be seeing the error counts
from the first scrub. Could you send the output of "zpool status -v"?
--Bill
Bill,

Yes, I cleared the errors after the first scrub.

Here is the output (pool name changed to protect the innocent):


bash-3.00# zpool status -x
pool: mypool
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-9P
scrub: scrub completed with 0 errors on Thu Sep 14 11:30:18 2006
config:

NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
raidz ONLINE 0 0 0
c10t600A0B800011730E000066C544C5EBB8d0 ONLINE 0 0 0
c10t600A0B800011730E000066CA44C5EBEAd0 ONLINE 0 0 0
c10t600A0B800011730E000066CF44C5EC1Cd0 ONLINE 0 0 0
c10t600A0B800011730E000066D444C5EC5Cd0 ONLINE 0 0 0
c10t600A0B800011730E000066D944C5ECA0d0 ONLINE 0 0 0
c10t600A0B800011652E0000E5C144C5ECDFd0 ONLINE 0 0 0
c10t600A0B800011730E000066E244C5ED2Cd0 ONLINE 0 0 0
c10t600A0B800011652E0000E5C644C5ED87d0 ONLINE 0 0 0
c10t600A0B800011730E000066EB44C5EDD8d0 ONLINE 0 0 0
c10t600A0B800011652E0000E5CB44C5EE29d0 ONLINE 0 0 0
c10t600A0B800011730E000066F444C5EE7Ed0 ONLINE 0 0 13
c10t600A0B800011652E0000E5D044C5EEC9d0 ONLINE 0 0 0
c10t600A0B800011730E000066FD44C5EF1Ad0 ONLINE 0 0 0
c10t600A0B800011652E0000E5D544C5EF63d0 ONLINE 0 0 0
raidz ONLINE 0 0 0
c10t600A0B800011652E0000E5B844C5EBCBd0 ONLINE 0 0 0
c10t600A0B800011652E0000E5BA44C5EBF5d0 ONLINE 0 0 0
c10t600A0B800011652E0000E5BC44C5EC2Dd0 ONLINE 0 0 0
c10t600A0B800011652E0000E5BE44C5EC6Bd0 ONLINE 0 0 0
c10t600A0B800011730E000066DB44C5ECB4d0 ONLINE 0 0 0
c10t600A0B800011652E0000E5C344C5ECF9d0 ONLINE 0 0 0
c10t600A0B800011730E000066E444C5ED5Ad0 ONLINE 0 0 0
c10t600A0B800011652E0000E5C844C5EDA1d0 ONLINE 0 0 0
c10t600A0B800011730E000066ED44C5EDFAd0 ONLINE 0 0 0
c10t600A0B800011652E0000E5CD44C5EE47d0 ONLINE 0 0 0
c10t600A0B800011730E000066F644C5EE96d0 ONLINE 0 0 16
c10t600A0B800011652E0000E5D244C5EEE7d0 ONLINE 0 0 0
c10t600A0B800011730E000066FF44C5EF32d0 ONLINE 0 0 0
c10t600A0B800011652E0000E5D744C5EF7Fd0 ONLINE 0 0 0
raidz ONLINE 0 0 0
c10t600A0B800011730E000066C844C5EBD8d0 ONLINE 0 0 0
c10t600A0B800011730E000066CD44C5EC02d0 ONLINE 0 0 0
c10t600A0B800011730E000066D244C5EC40d0 ONLINE 0 0 0
c10t600A0B800011730E000066D744C5EC7Cd0 ONLINE 0 0 0
c10t600A0B800011652E0000E5C044C5ECC1d0 ONLINE 0 0 0
c10t600A0B800011730E000066E044C5ED0Ad0 ONLINE 0 0 0
c10t600A0B800011652E0000E5C544C5ED67d0 ONLINE 0 0 0
c10t600A0B800011730E000066E944C5EDB4d0 ONLINE 0 0 0
c10t600A0B800011652E0000E5CA44C5EE09d0 ONLINE 0 0 0
c10t600A0B800011730E000066F244C5EE5Cd0 ONLINE 0 0 0
c10t600A0B800011652E0000E5CF44C5EEA7d0 ONLINE 0 0 13
c10t600A0B800011730E000066FB44C5EEFAd0 ONLINE 0 0 0
c10t600A0B800011652E0000E5D444C5EF3Fd0 ONLINE 0 0 0
c10t600A0B800011730E0000670444C5EF92d0 ONLINE 0 0 0
raidz ONLINE 0 0 0
c10t600A0B800011652E0000E5B944C5EBDDd0 ONLINE 0 0 0
c10t600A0B800011652E0000E5BB44C5EC0Dd0 ONLINE 0 0 0
c10t600A0B800011652E0000E5BD44C5EC4Bd0 ONLINE 0 0 0
c10t600A0B800011652E0000E5BF44C5EC8Dd0 ONLINE 0 0 0
c10t600A0B800011730E000066DD44C5ECD0d0 ONLINE 0 0 0
c10t600A0B800011652E0000E5C444C5ED19d0 ONLINE 0 0 0
c10t600A0B800011730E000066E644C5ED7Ad0 ONLINE 0 0 0
c10t600A0B800011652E0000E5C944C5EDC7d0 ONLINE 0 0 0
c10t600A0B800011730E000066EF44C5EE1Cd0 ONLINE 0 0 0
c10t600A0B800011652E0000E5CE44C5EE6Bd0 ONLINE 0 0 0
c10t600A0B800011730E000066F844C5EEBAd0 ONLINE 0 0 16
c10t600A0B800011652E0000E5D344C5EF07d0 ONLINE 0 0 0
c10t600A0B800011730E0000670144C5EF52d0 ONLINE 0 0 0
c10t600A0B800011652E0000E5D844C5EFA3d0 ONLINE 0 0 0

errors: No known data errors

Thanks,

David
