[geeks] Murphy, instantiated

Nadine Miller velociraptor at gmail.com
Thu Jun 4 02:07:28 CDT 2009


On Jun 3, 2009, at 10:13 PM, Shannon Hendrix wrote:

> If you mean the Google report on drive reliability, I am not sure I  
> would believe everything they said.  Some of their methods are  
> flawed, and they also reported a lot of drives in the same way, even  
> though they were not used in the same way, and other little mistakes  
> like that.

Not having access to the raw stats, nor the stats background to dig
into them even if I did, I can only say that a set of 80K disks is
more data than I have from my 10+ years of using consumer and
enterprise disks.

> Temperature: I can't agree with them there.  They seem to be saying  
> that temperature does not matter, and I have never been in a storage  
> shop or talked to anyone in a storage shop that did not see higher  
> failures when the heat went up.

My personal experience disagrees with theirs as well, but who is to
say it's not other environmental factors?  I don't recall whether
their temperature measurement was ambient or actual disk temp.  None
of the DCs I've worked in have ever monitored anything other than
temperature, and only that in the broadest sense.  I presume that
$work's hosting (a large multi-national telco's hosting) monitors
more than temperature, because, surprisingly, it's kept at a constant
72F versus the 68F or lower of their client workspace downstairs.

I don't really worry about heat; if it gets too hot in my house, I'll
turn off the computer.  What interested me in the report was a) that
disks showing errors early have a higher incidence of failure, and b)
related to that, the lack of correlation between SMART reporting and
failure.  Which implies, if you are using ZFS, that you should be
scrubbing very regularly early in a disk's life, to catch early
errors sooner rather than later.  After 12 months you can probably
ease off.
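
Something like the following is roughly what I mean (just a sketch;
the pool name "tank" is a placeholder, the status-string matching is
only as good as your zpool's output format, and it needs to run as
root).  Kick it off from cron weekly for the first year, then back
off:

#!/usr/bin/env python
# Sketch of an "early life" scrub job: start a scrub, wait for it to
# finish, and complain if zpool reports anything wrong.  "tank" is a
# placeholder pool name.
import subprocess
import sys
import time

POOL = "tank"

subprocess.check_call(["zpool", "scrub", POOL])

# "zpool scrub" returns immediately, so poll status until it's done.
while True:
    status = subprocess.check_output(["zpool", "status", POOL]).decode()
    if "scrub in progress" not in status:
        break
    time.sleep(60)

if "errors: No known data errors" not in status:
    sys.stderr.write(status)     # something is wrong -- go look at it
    sys.exit(1)
print("scrub of %s came back clean" % POOL)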

I liken this to something else I've noticed lately: almost everyone I
know (or have read opinions from on various PC hardware forums) tells
you to stress test your RAM out of the gate with at least a 24 hr
memtest run.  I never saw anyone recommending that outside of builds
for data centers 10 years ago.  The intent is to ferret out marginal
components early, so you can get them swapped via RMA or at least
replaced before the warranty runs out.  Doing the same sort of stress
test on disks seems just as logical.
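
The disk equivalent of that 24 hr memtest run doesn't have to be
fancy; even a dumb full read pass will surface marginal sectors while
the drive is still returnable.  Something like this (a sketch only;
/dev/sdb is a placeholder, run it as root, and point it at the disk
you actually mean):

#!/usr/bin/env python
# Crude whole-disk read pass: touch every block once so weak sectors
# show up while the RMA window is still open.  Read-only, but the
# device path is a placeholder -- double-check it before running.
import sys

DEV = "/dev/sdb"         # placeholder device
CHUNK = 1024 * 1024      # read 1 MB at a time

errors = 0
with open(DEV, "rb", buffering=0) as disk:
    while True:
        try:
            block = disk.read(CHUNK)
        except IOError as err:
            errors += 1
            pos = disk.tell()
            sys.stderr.write("read error near byte %d: %s\n" % (pos, err))
            disk.seek(pos + CHUNK)   # step over the bad spot and keep going
            continue
        if not block:
            break

print("%d read errors on %s" % (errors, DEV))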

<snip>

> Just some thoughts on recent observations.  I am worried about  
> increasing drive failure rates from everyone, and wonder what the  
> industry is going to do to counter it.

My observation: whatever the hell EMC uses in their arrays, those
disks fail a lot.  Yes, we beat those disks to death, but we have
similar (if fewer) file systems in use on the NetApp, and I've seen
no disks fail in it since I started; nor have we had more than 1-2
disk failures in our crappy Penguin boxes that get beat to death
(MTAs, for example), which I am sure have non-"enterprise" disks.

> We have tons of storage, no really good way to back it up, and  
> besides ZFS there is almost no move to try and counter the situation  
> in any way.
>
> I figure eventually something has to change to goad the industry  
> into doing something about it.

People keep buying bigger disks, so I doubt it will change.

In the enterprise space, "de-duplication" is the new hotness.
NetApp and EMC are fighting over Data Domain, even though both
already have de-dup systems.  Greenbytes has a modified OpenSolaris
ZFS ("ZFS+") thing they are selling for the same purpose (a Thumper
with the software pre-installed).
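
The core idea behind all of these, as I understand it, is simple
enough: hash each block, and only store a block the first time you
see that hash.  A toy version of the concept (nothing like what Data
Domain or ZFS actually ship):

#!/usr/bin/env python
# Toy block-level de-dup: index blocks by their SHA-256 hash and
# store each unique block only once.  Purely illustrative.
import hashlib

BLOCK = 4096
store = {}     # hash -> block data (each unique block kept once)
layout = []    # ordered list of hashes describing the "file"

def write_stream(data):
    for i in range(0, len(data), BLOCK):
        chunk = data[i:i + BLOCK]
        key = hashlib.sha256(chunk).hexdigest()
        store.setdefault(key, chunk)   # duplicate blocks are not re-stored
        layout.append(key)

def read_stream():
    return b"".join(store[key] for key in layout)

if __name__ == "__main__":
    write_stream(b"A" * 8192 + b"B" * 4096 + b"A" * 4096)
    print("blocks referenced: %d" % len(layout))  # 4
    print("blocks stored:     %d" % len(store))   # 2 -- the win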

I think ZFS has gotten everyone to start looking at file systems in a  
new way.  People are thinking more outside of the box now.

=Nadine=


