[geeks] Murphy, instantiated

Phil Stracchino alaric at metrocast.net
Tue Jun 2 07:11:00 CDT 2009


Lionel Peterson wrote:
> On Jun 1, 2009, at 2:36 PM, Phil Stracchino <alaric at metrocast.net> wrote:
> 
>> A RAIDZ2 pool without hot spares can survive simultaneous failure of up
>> to two devices and still continue operating in degraded mode.
>>
>> So, naturally, between when we left for the elementary school this
>> morning at 0755 and when we got back home at about 1015, apparently
>> *THREE* of the twelve disks in babylon4's main storage pool failed.
> 
> Sounds like you didn't have a choice, but I like to buy my drives from
> multiple sources, hoping to get drives from different lots and so avoid
> just this kind of problem (of course there is no guarantee, but
> diversity might reduce the likelihood of a triple-drive failure inside
> of 3 hours!).


Some after-the-fact forensics made it pretty clear what happened in this
case.  It's all used hardware, so all the disks were unknown quantities.
A network-wide full backup started at 03:10, putting heavy load on the
array.  At 04:29:55, c1t7d0, evidently the weakest disk, buckled under
the load, increasing the load on the remaining disks.  Then at 08:49:29,
c1t6d0 folded as well, and the array went into fully degraded mode,
increasing the load on the remaining disks even further.  Eight minutes
later at 08:57:06, c1t4d0 gave up and the entire array went down.  The
fact that it was three drives on the same controller is coincidence, I
think.
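
As a rough sketch, the timeline above can be replayed in a few lines of
Python to show the gap between events and the point at which the pool
runs out of parity.  The disk names and timestamps are the ones quoted
above; the function and variable names, the state labels, and the
assumption that all three failures fall on the same calendar day are
illustrative only and are not taken from the actual logs.

```python
from datetime import datetime

# Events as described in the post; times are assumed to share one day.
BACKUP_START = "03:10:00"
FAILURES = [
    ("c1t7d0", "04:29:55"),
    ("c1t6d0", "08:49:29"),
    ("c1t4d0", "08:57:06"),
]
PARITY = 2  # RAIDZ2 can lose up to two devices and keep running


def parse(hms):
    """Parse an HH:MM:SS string into a datetime (date part is a dummy)."""
    return datetime.strptime(hms, "%H:%M:%S")


def timeline(failures, parity=PARITY, start=BACKUP_START):
    """Return (disk, time since previous event, pool state) per failure."""
    rows = []
    previous = parse(start)
    for count, (disk, hms) in enumerate(failures, start=1):
        now = parse(hms)
        if count < parity:
            state = "degraded"
        elif count == parity:
            state = "degraded, no redundancy left"
        else:
            state = "pool down"
        rows.append((disk, now - previous, state))
        previous = now
    return rows


if __name__ == "__main__":
    for disk, gap, state in timeline(FAILURES):
        print(f"{disk}: +{gap} since previous event -> {state}")
```

The first gap is measured from the start of the backup, the later ones
from the preceding failure, so the last line reflects the roughly eight
minutes between the second and third disks dropping out.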



-- 
  Phil Stracchino, CDK#2     DoD#299792458     ICBM: 43.5607, -71.355
  alaric at caerllewys.net   alaric at metrocast.net   phil at co.ordinate.org
         Renaissance Man, Unix ronin, Perl hacker, Free Stater
                 It's not the years, it's the mileage.


