[geeks] Murphy, instantiated
velociraptor
velociraptor at gmail.com
Wed Jun 3 22:54:19 CDT 2009
On Tue, Jun 2, 2009 at 7:22 AM, Phil Stracchino <alaric at metrocast.net> wrote:
> Jonathan Groll wrote:
>> On Tue, Jun 02, 2009 at 09:37:44AM -0400, Phil Stracchino wrote:
>>> Jonathan Groll wrote:
>>> Two identical controllers, 12 disks. I don't think it's a simple write
>>> corruption issue, because the three failed disks didn't even come back
>>> up on the bus after reboot. They're dead as doornails.
>>>
>> So, the odds of all three disks failing on one particular controller are:
>> 1/2 * 1/2 * 1/2 = 1/8
>>
>> and since either controller could be the one, that's 2 * 1/8, or 1 in 4
>> that all three failed disks will belong to the same controller!
>>
>> At the very least, it is worthwhile decommissioning the 'bad'
>> controller (wouldn't trust it), and trying the 'failed' disks in
>> another box altogether...
>
> If I take down that controller, I don't have enough channels to run all
> the disks. I'm going to test the disks elsewhere once I pull them, but
> I don't hold out much hope for them. I trust the controller more than I
> trust the disks; the disks already had two strikes against them -
> they're Maxtor disks, and they've already seen several years of use
> before I got them.
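The 1-in-4 figure treats each failed disk as independently equally likely to sit on either controller. Here is a small Python sketch of that arithmetic, shown next to the exact count for drawing three failed disks without replacement. It assumes (the thread doesn't say) that the 12 disks are split evenly, 6 per controller. The function names are hypothetical helpers made up for this illustration:

```python
from fractions import Fraction
from math import comb


def p_same_controller_independent(controllers: int, failed: int) -> Fraction:
    """Approximation used in the thread: each failed disk independently
    lands on any given controller with probability 1/controllers."""
    p_one = Fraction(1, controllers) ** failed  # all on one particular controller
    return controllers * p_one                  # ...on any one of the controllers


def p_same_controller_exact(disks_per_controller: int, controllers: int,
                            failed: int) -> Fraction:
    """Exact count: the failed disks are distinct disks drawn without
    replacement, assuming an even split of disks across controllers."""
    total = disks_per_controller * controllers
    favourable = controllers * comb(disks_per_controller, failed)
    return Fraction(favourable, comb(total, failed))


print(p_same_controller_independent(2, 3))  # 1/4
print(p_same_controller_exact(6, 2, 3))     # 2/11
```

Under the 6-and-6 assumption the exact chance works out to 2/11 (40 of the 220 possible sets of three disks), roughly 18%. That is a bit lower than the 1-in-4 approximation.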

I've taken to scrubbing all my zpools every Sunday night as a
precaution. After reading the Google white paper on disks, I've decided
that any early errors on the new 1TB disks mean they get the fsck out of
the zpool and are relegated to sneakernet duty.

I trust my 500GB disks slightly more, because they are around 2-3 years
old and have not shown any errors at all. I'm still scrubbing their
non-redundant pool, though (they are the backup device).
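A weekly scrub of every pool can be scripted along these lines. This is a minimal Python sketch, not Nadine's actual setup; how she schedules it isn't stated. It assumes the script is run as a user allowed to scrub, from cron on Sunday nights. A crontab schedule field of `0 23 * * 0`, for example, means 23:00 every Sunday. It relies only on the standard `zpool list -H -o name` and `zpool scrub <pool>` commands. The helper names `parse_pool_names` and `scrub_commands` are hypothetical:

```python
import shutil
import subprocess
import sys


def parse_pool_names(zpool_list_output: str) -> list[str]:
    """Turn `zpool list -H -o name` output (one pool per line) into a list."""
    return [line.strip() for line in zpool_list_output.splitlines() if line.strip()]


def scrub_commands(pools: list[str]) -> list[list[str]]:
    """Build one `zpool scrub <pool>` command per pool."""
    return [["zpool", "scrub", pool] for pool in pools]


def main() -> None:
    if shutil.which("zpool") is None:
        print("zpool not found; nothing to scrub", file=sys.stderr)
        return
    # -H: scripted mode (no header); -o name: print only the pool name column.
    listing = subprocess.run(["zpool", "list", "-H", "-o", "name"],
                             capture_output=True, text=True, check=True)
    for cmd in scrub_commands(parse_pool_names(listing.stdout)):
        # `zpool scrub` returns once the scrub has started; the scrub itself
        # runs in the background, so all pools end up scrubbing concurrently.
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"{' '.join(cmd)} failed: {result.stderr.strip()}", file=sys.stderr)


if __name__ == "__main__":
    main()
```

Results land in `zpool status` once each scrub finishes. Checking those error counters is the point at which a young disk would get pulled.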

Bill posted the link to the full PDF on the list some time ago. If
you're an IT geek and haven't read it, you should! What they found,
based on statistics from the 80,000 disks they analyzed, is quite
counter-intuitive.

=Nadine=