Discussion:
pool metadata has duplicate children
John Giannandrea
2013-01-08 18:05:59 UTC
I seem to have managed to end up with a pool that is confused about its children disks. The pool is faulted with corrupt metadata:

pool: d
state: FAULTED
status: The pool metadata is corrupted and the pool cannot be opened.
action: Destroy and re-create the pool from
a backup source.
see: http://illumos.org/msg/ZFS-8000-72
scan: none requested
config:

        NAME                     STATE     READ WRITE CKSUM
        d                        FAULTED      0     0     1
          raidz1-0               FAULTED      0     0     6
            da1                  ONLINE       0     0     0
            3419704811362497180  OFFLINE      0     0     0  was /dev/da2
            da3                  ONLINE       0     0     0
            da4                  ONLINE       0     0     0
            da5                  ONLINE       0     0     0

But if I look at the labels on all the online disks I see this:

# zdb -ul /dev/da1 | egrep '(children|path)'
        children[0]:
            path: '/dev/da1'
        children[1]:
            path: '/dev/da2'
        children[2]:
            path: '/dev/da2'
        children[3]:
            path: '/dev/da3'
        children[4]:
            path: '/dev/da4'
...

But the offline disk (da2) shows the older, correct label:

        children[0]:
            path: '/dev/da1'
        children[1]:
            path: '/dev/da2'
        children[2]:
            path: '/dev/da3'
        children[3]:
            path: '/dev/da4'
        children[4]:
            path: '/dev/da5'

zpool import -F doesn't help, because none of the labels on the unfaulted disks appear to be correct. And unless I can import the pool, I can't replace the bad drive.
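In case specifics help, the invocations I have been trying or plan to try look roughly like this (a -n dry run paired with -F to see whether a rewind is even possible, and a plain read-only import):

# zpool import -f -F -n d
# zpool import -f -o readonly=on d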

Also, zpool seems very reluctant to import a raidz1 pool with one faulted drive, even though that should still be readable. I have read about the undocumented -V option, but I don't know whether it would help.

I got into this state when I noticed the pool was DEGRADED and was trying to replace the bad disk. I am debugging it under FreeBSD 9.1.

Suggestions of things to try are welcome; I'm more interested in learning what went wrong than in restoring the pool. I don't think I should have been able to go from one offline drive to an unrecoverable pool this easily.

-jg
Gregg Wonderly
2013-01-08 18:33:15 UTC
Have you tried importing the pool with that drive completely unplugged? Which HBA are you using? How many of these disks are on the same or separate HBAs?

Gregg Wonderly
John Giannandrea
2013-01-09 05:30:57 UTC
Post by Gregg Wonderly
Have you tried importing the pool with that drive completely unplugged?
Thanks for your reply. I just tried that. zpool import now says:

pool: d
id: 13178956075737687211
state: FAULTED
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
The pool may be active on another system, but can be imported using
the '-f' flag.
see: http://illumos.org/msg/ZFS-8000-72
config:

        d                        FAULTED  corrupted data
          raidz1-0               FAULTED  corrupted data
            da1                  ONLINE
            3419704811362497180  OFFLINE
            da2                  ONLINE
            da3                  ONLINE
            da4                  ONLINE

Notice that, in the absence of the faulted da2, the OS has renamed what used to be da3 to da2, and so on. I suspect this renumbering was part of the original problem and is how the label ended up with two da2 entries.
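To keep the import from scanning devices I don't care about while the names are shuffled like this, one thing I may try is pointing it at a directory containing only the disks I want considered, roughly like this (untested):

# mkdir /tmp/zdevs
# ln -s /dev/da1 /dev/da2 /dev/da3 /dev/da4 /tmp/zdevs/
# zpool import -d /tmp/zdevs d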

zdb still reports that the label has two da2 children:

    vdev_tree:
        type: 'raidz'
        id: 0
        guid: 11828532517066189487
        nparity: 1
        metaslab_array: 23
        metaslab_shift: 36
        ashift: 9
        asize: 9999920660480
        is_log: 0
        children[0]:
            type: 'disk'
            id: 0
            guid: 13697627234083630557
            path: '/dev/da1'
            whole_disk: 0
            DTL: 78
        children[1]:
            type: 'disk'
            id: 1
            guid: 3419704811362497180
            path: '/dev/da2'
            whole_disk: 0
            DTL: 71
            offline: 1
        children[2]:
            type: 'disk'
            id: 2
            guid: 6790266178760006782
            path: '/dev/da2'
            whole_disk: 0
            DTL: 77
        children[3]:
            type: 'disk'
            id: 3
            guid: 2883571222332651955
            path: '/dev/da3'
            whole_disk: 0
            DTL: 76
        children[4]:
            type: 'disk'
            id: 4
            guid: 16640597255468768296
            path: '/dev/da4'
            whole_disk: 0
            DTL: 75
Post by Gregg Wonderly
Which HBA are you using? How many of these disks are on the same or separate HBAs?
All of the disks are on the same HBA:

twa0: <3ware 9000 series Storage Controller>
twa0: INFO: (0x15: 0x1300): Controller details:: Model 9500S-8, 8 ports, Firmware FE9X 2.08.00.006
da0 at twa0 bus 0 scbus0 target 0 lun 0
da1 at twa0 bus 0 scbus0 target 1 lun 0
da2 at twa0 bus 0 scbus0 target 2 lun 0
da3 at twa0 bus 0 scbus0 target 3 lun 0
da4 at twa0 bus 0 scbus0 target 4 lun 0

-jg
Peter Jeremy
2013-01-10 08:13:12 UTC
Post by John Giannandrea
Notice that, in the absence of the faulted da2, the OS has renamed what used to be da3 to da2, and so on. I suspect this renumbering was part of the original problem and is how the label ended up with two da2 entries.
The primary vdev identifier is the guid. The path is of secondary importance (ZFS should automatically recover from juggled disks without an issue - and has for me).

Try running "zdb -l" on each of your pool disks and verify that
each has 4 identical labels, and that the 5 guids (one on each
disk) are unique and match the vdev_tree you got from zdb.
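Something along these lines should show them all at once (untested, from memory; adjust the device names to whatever they currently are):

for d in da1 da2 da3 da4 da5; do
  echo "=== ${d}"
  zdb -l /dev/${d} | egrep '(guid|path)'
done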

My suspicion is that you've somehow "lost" the disk with the guid
3419704811362497180.
Post by John Giannandrea
twa0: <3ware 9000 series Storage Controller>
twa0: INFO: (0x15: 0x1300): Controller details:: Model 9500S-8, 8 ports, Firmware FE9X 2.08.00.006
da0 at twa0 bus 0 scbus0 target 0 lun 0
da1 at twa0 bus 0 scbus0 target 1 lun 0
da2 at twa0 bus 0 scbus0 target 2 lun 0
da3 at twa0 bus 0 scbus0 target 3 lun 0
da4 at twa0 bus 0 scbus0 target 4 lun 0
Are these all JBOD devices?
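If tw_cli is installed, the controller's own view of the units should confirm whether they are exported as JBOD/single-disk units or as RAID units; from memory the syntax is roughly:

# tw_cli /c0 show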
--
Peter Jeremy