HELP! RPool problem

Karl Wagner

2013-02-16 20:33:42 UTC

I have a small problem.

I have a development fileserver box running Solaris 11 Express. The rpool
is mirrored between an SSD and a hard drive. Today, the SSD developed a
fault for some reason. While trying to diagnose the problem, the system
panicked and rebooted.

The SSD was the first boot drive, and every time it tried to boot it
panicked and rebooted, ending up in a loop. I tried to change to the second
rpool drive, but either I forgot to install grub on it or it has become
corrupted (probably the first, I can be that stupid at times).

Can anyone give me any advice on how to get this system back? Can I trick
grub, installed on the SSD, to boot from the HDD's rpool mirror? Is
something more sinister going on?

By the way, whatever the error message is when booting, it disappears so
quickly I can't read it, so I am only guessing that this is the reason.

PLEASE HELP!

Thanks
Karl

John D Groenveld

2013-02-16 20:49:43 UTC

Post by Karl Wagner
The SSD was the first boot drive, and every time it tried to boot it
panicked and rebooted, ending up in a loop. I tried to change to the second
rpool drive, but either I forgot to install grub on it or it has become
corrupted (probably the first, I can be that stupid at times).
Can anyone give me any advice on how to get this system back? Can I trick
grub, installed on the SSD, to boot from the HDD's rpool mirror? Is
something more sinister going on?

Remove the broken drive, boot installation media, import the
mirror drive.
If it imports, you will be able to installgrub(1M).
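
A minimal sketch of that sequence, run from a shell on the installation media. The alternate root /a, the device name c0t1d0s0, and the helper names run and reinstall_grub are assumptions for illustration only; check the output of zpool status for the real device of the surviving mirror half. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: prints each command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# reinstall_grub POOL DISK
# POOL is the root pool name; DISK is the slice of the surviving mirror
# half (hypothetical here, e.g. c0t1d0s0).
reinstall_grub() {
    pool=$1
    disk=$2
    # Force the import, since the pool was last in use by the installed system.
    run zpool import -f -R /a "$pool" || return 1
    # Show the pool layout so the device name can be confirmed.
    run zpool status "$pool" || return 1
    # Write the GRUB stage1 and stage2 files to the surviving disk.
    run installgrub /boot/grub/stage1 /boot/grub/stage2 "/dev/rdsk/$disk" || return 1
    # Export cleanly before rebooting from the hard drive.
    run zpool export "$pool"
}
```

Calling reinstall_grub rpool c0t1d0s0 prints the four commands for review; setting DRY_RUN=0 runs them instead.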

Post by Karl Wagner
By the way, whatever the error message is when booting, it disappears so
quickly I can't read it, so I am only guessing that this is the reason.

Boot with kernel debugger so you can see the panic.

John
***@acm.org

Sašo Kiselkov

2013-02-16 20:54:44 UTC

Post by John D Groenveld
Boot with kernel debugger so you can see the panic.

Sadly, though, without access to the source code, all he can do at that
point is log a support ticket with Oracle (assuming he has paid his
support fees) and hope it will get picked up by somebody there. People
on this list have few, if any, ways of helping out.

Cheers,
--
Saso

James C. McPherson

2013-02-16 21:47:36 UTC

Post by Sašo Kiselkov

Post by John D Groenveld
Boot with kernel debugger so you can see the panic.

You're missing the point. Booting with kmdb enabled
is The Way(tm) to get anything remotely resembling
a paused screen so you can see what the message is.

Whether that message winds up being something you need
to talk with Oracle about is entirely different.

The OP mentioned that he was running S11 Express, for
which, iirc, you can dig through source on a non-Oracle
site and investigate.

Really, though, just adding

-k

to the kernel$ line in the grub menu prior to booting
should be enough for him to make significant progress.

James C. McPherson
--
Oracle
Systems / Solaris / Core
http://www.jmcpdotcom.com/blog

Sašo Kiselkov

2013-02-16 22:48:15 UTC

Post by James C. McPherson

Post by Sašo Kiselkov

Post by John D Groenveld
Boot with kernel debugger so you can see the panic.

He got a kernel panic on a completely legitimate operation (booting with
one half of the root mirror faulted). There's a good chance that the
only thing he'll see is something like BAD TRAP and a stack trace.
Without source, that's where the investigation ends.

Post by James C. McPherson
The OP mentioned that he was running S11 Express, for
which, iirc, you can dig through source on a non-Oracle
site and investigate.

And once he's found the problem, what then? Can he build a new ZFS
kernel module? Can he submit a patch?

Post by James C. McPherson
Really, though, just adding
-k
to the kernel$ line in the grub menu prior to booting
should be enough for him to make significant progress.

If by "significant progress" you mean sending a stack trace to Oracle,
then yes.

Look, I'm not accusing you or anybody else of not trying to help - there
are some wonderful people around here who both care deeply for their
users and are proud of their work. I fully applaud that stance.
All I'm doing is pointing out the facts of the matter - take from
that what you will.

Cheers,
--
Saso

James C. McPherson

2013-02-16 22:54:56 UTC

...

Post by Sašo Kiselkov

Post by James C. McPherson
Whether that message winds up being something you need
to talk with Oracle about is entirely different.

There is significant information provided in a panic message
which does NOT require that you go and ask Oracle for help.

As I pointed out, too, there is a non-Oracle source repo which
does contain the code which went into the release and build
which the OP is running. He's running Solaris 11 Express, which
we published/delivered as build snv_151b. One would hope that
there are sufficient hints in the previous 2 sentences to enable
debugging if that is required.

Post by Sašo Kiselkov

Post by James C. McPherson
The OP mentioned that he was running S11 Express, for
which, iirc, you can dig through source on a non-Oracle
site and investigate.

And once he's found the problem, what then? Can he build a new ZFS
kernel module? Can he submit a patch?

You're assuming that he's found a bug which is unfixed,
and not related to failed hardware. Big assumption.

Post by Sašo Kiselkov

Post by James C. McPherson
Really, though, just adding
-k
to the kernel$ line in the grub menu prior to booting
should be enough for him to make significant progress.

If by "significant progress" you mean sending a stack trace to Oracle,
then yes.

I think you are insulting the OP by assuming that he has
insufficient understanding of how to use a search engine.

Post by Sašo Kiselkov
Look, I'm not accusing you or anybody else of not trying to help - there
are some wonderful people around here who both care deeply for their
users and are proud of their work. I fully applaud that stance.
All I'm doing is pointing out the facts of the matter - take from
that what you will.

Your opinion is no doubt coloured by the recent announcement
re opensolaris.org.

I have corresponded privately with the OP on this matter. I
will not respond further to this thread.

James C. McPherson
--
Oracle
Systems / Solaris / Core
http://www.jmcpdotcom.com/blog

Ian Collins

2013-02-17 04:16:12 UTC

Post by Sašo Kiselkov

Post by John D Groenveld
Boot with kernel debugger so you can see the panic.

If he can boot from recent install media and import the pool, that's a
pretty good indicator that the problem has been fixed. He can then
upgrade to whatever he booted with (which could be OI or Solaris 11.1)
and recover his data.

--
Ian.

Jim Klimov

2013-02-16 23:26:55 UTC

Post by John D Groenveld

Post by Karl Wagner
By the way, whatever the error message is when booting, it disappears so
quickly I can't read it, so I am only guessing that this is the reason.

Boot with kernel debugger so you can see the panic.

And that would be done like so:
1) In the boot loader (GRUB), edit the boot options (press "e",
select the "kernel" line, press "e" again), and add "-kd" to the
kernel boot line. Maybe also "-v" for verbosity.

2) Press Enter to save the change and "b" to boot.

3) The kmdb prompt should pop up; enter ":c" to continue execution.
The bootup should start, throw the kernel panic and pause.
It is likely that there will be so much info that it doesn't
fit on screen - I can only suggest a serial console in this case.

However, the end of the dump info should point you in the right
direction. For example, an error in "mount_vfs_root" is popular,
and usually means either corrupt media or simply an unexpected device
name for the root pool (e.g. the disk is plugged into a different port,
or the BIOS was switched between SATA and IDE modes, etc.)

The device name changes should go away if you can boot from anything
that can import your rpool (livecd, installer cd, failsafe boot image)
and just "zpool import -f rpool; zpool export rpool" - this should
clear the dependency on exact device names, and next bootup should
work.
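
The import/export step above could be wrapped as follows, for use from the live or failsafe environment. The helper names run and refresh_pool_paths are made up for this sketch, and the pool name is passed in rather than assumed. It only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: prints each command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# refresh_pool_paths POOL
# Import the pool under whatever device names the current boot
# environment sees, then export it again straight away.
refresh_pool_paths() {
    pool=$1
    # -f forces the import, since the pool was not exported cleanly.
    run zpool import -f "$pool" || return 1
    run zpool export "$pool"
}
```

Calling refresh_pool_paths rpool prints the two commands; with DRY_RUN=0 it runs them, and the export is skipped if the import fails.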

And yes, I think it is a bug for such a fixable problem to behave so
inconveniently - the official docs go as far as to suggest an OS
reinstallation in this case.

//Jim