[geeks] Murphy, instantiated

Shannon Hendrix shannon at widomaker.com
Thu Jun 4 00:13:49 CDT 2009


On Jun 3, 2009, at 23:54, velociraptor wrote:

> On Tue, Jun 2, 2009 at 7:22 AM, Phil Stracchino  
> <alaric at metrocast.net> wrote:
>> Jonathan Groll wrote:
>>> On Tue, Jun 02, 2009 at 09:37:44AM -0400, Phil Stracchino wrote:
>>>> Jonathan Groll wrote:
>>>> Two identical controllers, 12 disks.  I don't think it's a simple write
>>>> corruption issue, because the three failed disks didn't even come back
>>>> up on the bus after reboot.  They're dead as doornails.
>>>>
>>>>
>>> So, the odds of all three disks failing on one controller are:
>>> 1/2 * 1/2 * 1/2 = 1/8
>>>
>>> or 1 in 4 that all three failed disks will belong to the same
>>> controller!
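[Spelling out the arithmetic in Jonathan's estimate: the chance that all
three failures land on one particular controller is (1/2)^3 = 1/8, and
either of the two controllers counts, so the combined odds are
2 * 1/8 = 1/4.]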
>>>
>>> At the very least, it is worthwhile decommissioning the 'bad'
>>> controller (wouldn't trust it), and trying the 'failed' disks in
>>> another box altogether...
>>
>> If I take down that controller, I don't have enough channels to run all
>> the disks.  I'm going to test the disks elsewhere once I pull them, but
>> I don't hold out much hope for them.  I trust the controller more than I
>> trust the disks; the disks already had two strikes against them -
>> they're Maxtor disks, and they've already seen several years of use
>> before I got them.
>
> I've taken to scrubbing all my zpools every Sunday night as a
> precaution.  After reading the Google white paper on disks, any early
> errors on the new 1TB disks mean they get the fsck out of the zpool
> and get delegated to sneakernet duty.
>
> I trust my 500GB disks slightly more, because they are around 2-3 yrs
> old, and have not shown any errors at all.  I'm still scrubbing their
> non-redundant pool, though (they are the backup device).
>
> Bill posted the link to the full pdf some time ago on the list--if
> you're an IT geek and haven't read it, you should!  It's quite
> counter-intuitive what they found based on stats on 80000 disks they
> analyzed.

If you mean the Google report on drive reliability, I am not sure I
would believe everything they said.  Some of their methods are flawed,
and they lumped together a lot of drives in their reporting even though
those drives were not used in the same way, plus other little mistakes
like that.

I guess they don't really track this like a test shop would, so maybe  
that's forgivable, but I have not had the same experience, and neither  
have quite a few other people I know in large storage shops.

I can partially agree on their SCSI versus SATA numbers, but that's an
apples-and-oranges comparison.  The SCSI drives were latest-generation,
maximum-performance parts run in very hot arrays, while the SATA drives
were 5-10 year old tech under far less stress.

Were the SATA drives just as good, or simply not pushed as hard?

In my own experience, if you keep them in like conditions, the SCSI  
drives are slightly better, and a lot faster.  However, it is true  
that SATA drives in general seem better than ATA drives.

Temperature: I can't agree with them there.  They seem to be saying
that temperature does not matter, and I have never been in a storage
shop, or talked to anyone in one, that did not see higher failure
rates when the heat went up.

Some other recent observations: I used to avoid WD drives completely,
due to years of less-than-great failure rates both personally and at
work.  Not terrible like Maxtor was, but not the best either.  Seagate
and Fujitsu always seemed to work, and Hitachi was OK as long as you
avoided certain models.

Now WD drives are doing a little better.  I still see failures when
the heat goes up in RAID setups, but otherwise they seem to have
improved.  Their new super-quiet drives are too new to judge, but so
far so good at six months in several arrays, including one of my own.

Seagate: for about the last two years I've noticed that a lot of their
drives come from China, and in almost every case they vibrate enough
to feel it even when brand new.  The Seagates made in Singapore are
always vibration-free until they start dying.

I see higher failure rates from Seagate in general, but have not  
narrowed it down to Chinese drives, though there are more of them now  
so that seems likely.

Fujitsu still seems OK, and I wish they made SATA drives.

Just some thoughts on recent observations.  I am worried about
increasing drive failure rates across every manufacturer, and wonder
what the industry is going to do to counter it.

We have tons of storage, no really good way to back it up, and besides  
ZFS there is almost no move to try and counter the situation in any way.
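About the only proactive habit that seems to help right now is the kind
of weekly scrub mentioned in the quote above.  If anyone wants to
automate it, here is a rough sketch (nothing clever: it just asks the
stock zpool utility for the pool names and starts a scrub on each one;
the cron line and script path are only examples):

    #!/usr/bin/env python
    # Rough sketch: start a scrub on every imported zpool.
    # Example cron entry for Sunday night:
    #   0 2 * * 0 /usr/local/bin/scrub-all.py
    import subprocess

    def list_pools():
        # 'zpool list -H -o name' prints one pool name per line, no header
        out = subprocess.Popen(["zpool", "list", "-H", "-o", "name"],
                               stdout=subprocess.PIPE).communicate()[0]
        return out.split()

    def scrub_all():
        for pool in list_pools():
            # 'zpool scrub' starts the scrub and returns right away;
            # check on it later with 'zpool status'
            subprocess.call(["zpool", "scrub", pool])

    if __name__ == "__main__":
        scrub_all()

Nothing fancy, but it makes sure every pool gets read end to end once a
week, so latent errors turn up early instead of at restore time.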

I figure eventually something has to change to goad the industry into  
doing something about it.

-- 
"Where some they sell their dreams for small desires."


