[geeks] Murphy, instantiated

Wed Jun 3 22:54:19 CDT 2009

On Tue, Jun 2, 2009 at 7:22 AM, Phil Stracchino <alaric at metrocast.net> wrote:
> Jonathan Groll wrote:
>> On Tue, Jun 02, 2009 at 09:37:44AM -0400, Phil Stracchino wrote:
>>> Jonathan Groll wrote:
>>> Two identical controllers, 12 disks.  I don't think it's a simple write
>>> corruption issue, because the three failed disks didn't even come back
>>> up on the bus after reboot.  They're dead as doornails.
>>>
>> So, the odds of all three failed disks sitting on one particular
>> controller are:
>> 1/2 * 1/2 * 1/2 = 1/8
>>
>> or 1 in 4 that all three failed disks belong to the same
>> controller, whichever one it is!
>>
>> At the very least, it is worthwhile decommissioning the 'bad'
>> controller (wouldn't trust it), and trying the 'failed' disks in
>> another box altogether...
>
> If I take down that controller, I don't have enough channels to run all
> the disks.  I'm going to test the disks elsewhere once I pull them, but
> I don't hold out much hope for them.  I trust the controller more than I
> trust the disks; the disks already had two strikes against them -
> they're Maxtor disks, and they'd already seen several years of use
> before I got them.
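
For anyone who wants to check the 1-in-4 figure quoted above, here is a
rough sketch in Python. It assumes the 12 disks are split evenly, 6 per
controller (the thread only says "two identical controllers, 12 disks"),
and that every disk is equally likely to fail. The function name is made
up for illustration. Treating each failure as an independent coin flip
gives 1/4; counting actual disks without replacement gives a somewhat
lower number.

```python
from fractions import Fraction
from math import comb


def same_controller_probability(disks_per_controller, controllers, failures):
    """Chance that `failures` randomly chosen dead disks all sit on one controller.

    Assumes every disk is equally likely to fail and that disks are
    split evenly across controllers (an assumption, not stated in the thread).
    """
    total_disks = disks_per_controller * controllers
    # Ways to pick all the failed disks from a single controller's disks.
    ways_on_one = comb(disks_per_controller, failures)
    # Any of the controllers could be the unlucky one.
    return Fraction(controllers * ways_on_one, comb(total_disks, failures))


# Coin-flip approximation from the thread: 2 * (1/2)^3.
independent = 2 * Fraction(1, 2) ** 3

# Exact count for the assumed 6 + 6 layout and 3 failed disks.
exact = same_controller_probability(6, 2, 3)

print("coin-flip estimate:", independent)
print("exact for 6+6 disks:", exact)
```

Either way the odds are low enough that the coincidence is worth a
second look, but not so low that it proves the controller is at fault.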

I've taken to scrubbing all my zpools every Sunday night as a
precaution.  After reading the Google white paper on disks, any early
errors on the new 1TB disks mean they get the fsck out of the zpool
and are relegated to sneakernet duty.

I trust my 500GB disks slightly more, because they are around 2-3 yrs
old, and have not shown any errors at all.  I'm still scrubbing their
non-redundant pool, though (they are the backup device).

Bill posted the link to the full PDF some time ago on the list--if
you're an IT geek and haven't read it, you should!  What they found
from the stats on the 80000 disks they analyzed is quite
counter-intuitive.

=Nadine=