[rescue] Drive Replacement Question

Ahmed Ewing aewing at gmail.com
Fri Sep 7 15:35:12 CDT 2007


On 9/7/07, Brian Deloria <bdeloria at gmail.com> wrote:
> I had also read that I may have had to use the metaclear and metainit
> command.  I don't believe that this would have solved my problem, plus I was
> fairly unclear on the syntax and usage and was quite concerned I'd kill the
> mirror.  It seemed like the examples wanted me to drop the good submirror
> and recreate the mirror and attach each submirror again.  I was also
> concerned over the vagueness of the examples as to which submirror would
> overwrite the other.  The last thing that I wanted to have happen is for
> the good submirror to be overwritten by the blank disk.

If I'm understanding correctly, you're alluding to removing
DiskSuite/SVM from the disks altogether. I've always considered that
horrible overkill, especially within the constraints of a limited
maintenance window on a production box; it's simply not necessary and
adds extra confusion to an already busy procedure.

Note, though, that metaclear (and metainit) come into play in my
recommendation too--but only for clearing and re-creating the
submirrors on the failed/replaced disk, *not* the top-level mirror.
This means the config runs as a "1-way mirror" (the top-level mirror
with a single submirror attached) during the replacement. There are no
changes necessary to /etc/system or /etc/vfstab, and reboots aren't
inherently required.
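
Roughly, the sequence looks something like the sketch below. The names
are made up for illustration--assume d10 is the mirror, d11 the healthy
submirror on c0t0d0, d12 the failed submirror on c0t1d0, and state
database replicas on slice 7 of each disk--so double-check everything
against your own metastat and metadb output before running anything:

  metastat d10                 # confirm which submirror needs maintenance
  metadetach -f d10 d12        # force-detach the errored submirror
  metaclear d12                # clear that submirror metadevice
  metadb -d c0t1d0s7           # drop any replicas on the failed disk

  # physically swap the disk (cfgadm/devfsadm as your hardware requires),
  # then copy the partition table over from the surviving disk:
  prtvtoc /dev/rdsk/c0t0d0s2 | fmthard -s - /dev/rdsk/c0t1d0s2

  metadb -a -c 2 c0t1d0s7      # re-create replicas on the new disk
  metainit d12 1 1 c0t1d0s0    # re-create the submirror on the new slice
  metattach d10 d12            # reattach; the resync runs online
  metastat d10                 # watch the resync percentage climb

The resync happens with the filesystems mounted, which is the whole
point--no downtime beyond the physical swap if the hardware allows it.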

In any event, the DiskSuite/SVM documentation freely available on
http://docs.sun.com is some of the best OEM-produced stuff I've come
across. Clear, concise, and best of all, task-oriented (see here:
http://docs.sun.com/app/docs/doc/816-4519/6manoju18?l=en&a=view).  It
has plenty of syntax examples. You should check it out to help clear
up any uncertainty you might have in that regard.

> [insert "KS" anecdotes here]

IMHO, those horror stories are all the more reason to stick with the
tried and true methods that have documentation to back them up.
Sometimes that's the only thing available to CYA when faced with an
overzealous and hardheaded colleague who has more influence on
management than you do.

> Ah well, thanks again everyone for their input.  I too prefer to 'break'
> things and 'prove' that replacement failover / raid reattachments do in fact
> work and do so properly.  You end up with a better understanding of how a
> repair is supposed to go and the timeframe for it to take place.  I
> unfortunately have walked into a situation where there are many legacy
> systems to consider, the dependencies are ridiculous at times, and the
> documentation is non-existent.
>
I agree, there's not much worse than having to inherit legacy systems
with inexplicable configs. And the most fun part is being blamed when
you can't get it squared away in short order after a failure.

I came across a great one where a customer couldn't figure out why he
couldn't restore his RAID5 metadevice from its "Needs Maintenance"
state after a proper replacement of the offending disk. At first
glance at the metastat, I missed it too. Turns out, the guy's
predecessor saw fit to use *two slices per disk* to make the RAID5,
so when a single disk failed the RAID5 lost two members and all data
was immediately lost. There was a hot spare pool configured, but they
never associated it with any volumes, so it sat idly by. Not that it
would have been able to do anything, unless there were read/write
errors on only one of the two slices and the reconstruction had time
to complete before errors were detected on the other slice... A
further check of servers at the site showed that several others were
set up with the same ticking time bomb of a configuration... it was the
definition of a hot sticky mess.
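
For the record, the contrast between that layout and a sane one is
small on paper, which is probably how it slipped through. A rough
sketch with made-up device names:

  # the time bomb: two slices on the same spindle in one RAID5, so a
  # single dead disk takes out two columns at once
  metainit d20 -r c1t1d0s0 c1t1d0s1 c1t2d0s0 c1t3d0s0

  # one slice per spindle, plus a hot spare pool actually associated
  # with the volume so it can kick in
  metainit d20 -r c1t1d0s0 c1t2d0s0 c1t3d0s0 c1t4d0s0
  metainit hsp001 c1t5d0s0
  metaparam -h hsp001 d20
  metastat -p d20              # "-h hsp001" should show up in the output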

This has all been a good reminder for me to start finishing some long
overdue server bibles... :-)

-A


