[geeks] Murphy, instantiated

Phil Stracchino alaric at metrocast.net
Tue Jun 2 08:37:44 CDT 2009


Jonathan Groll wrote:
> On Tue, Jun 02, 2009 at 08:11:00AM -0400, Phil Stracchino wrote:
>> Some after-the-fact forensics made it pretty clear what happened in this
>> case.  It's all used hardware, so all the disks were unknown quantities.
>> A network-wide full backup started at 03:10, putting heavy load on the
>> array.  At 04:29:55, c1t7d0, evidently the weakest disk, buckled under
>> the load, increasing the load on the remaining disks.  Then at 08:49:29,
>> c1t6d0 folded as well, and the array went into fully degraded mode,
>> increasing the load on the remaining disks even further.  Eight minutes
>> later at 08:57:06, c1t4d0 gave up and the entire array went down.  The
>> fact that it was three drives on the same controller is coincidence, I
>> think.
> 
> A more likely hypothesis is that the controller was corrupting disk
> writes though, don't you think? (My first thought was to ask if they
> were on different controllers). How many controllers do you have in
> this box?

Two identical controllers, 12 disks.  I don't think it's a simple write
corruption issue, because the three failed disks didn't even come back
up on the bus after reboot.  They're dead as doornails.
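
For anyone who wants to check the spacing of the failures in the timeline
quoted above, here is a rough Python sketch. The timestamps are the ones
from the quoted message; the backup start is assumed to be exactly 03:10:00
since no seconds were given, and the names EVENTS and gaps are just made up
for illustration.

```python
from datetime import datetime

# Timestamps as quoted in the thread. The seconds on the backup start
# are an assumption (only "03:10" was given).
EVENTS = [
    ("backup start", "03:10:00"),
    ("c1t7d0 failed", "04:29:55"),
    ("c1t6d0 failed", "08:49:29"),
    ("c1t4d0 failed", "08:57:06"),
]


def gaps(events):
    """Return (label, time since the previous event) for each event after the first."""
    parsed = [(label, datetime.strptime(t, "%H:%M:%S")) for label, t in events]
    return [(label, cur - prev) for (_, prev), (label, cur) in zip(parsed, parsed[1:])]


for label, delta in gaps(EVENTS):
    print(f"{label}: +{delta}")
```

By that arithmetic the first disk dropped out about 1h20m into the backup,
the second roughly 4h20m after the first, and the third 7m37s after the
second, which matches the "eight minutes" in the quoted text.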


-- 
  Phil Stracchino, CDK#2     DoD#299792458     ICBM: 43.5607, -71.355
  alaric at caerllewys.net   alaric at metrocast.net   phil at co.ordinate.org
         Renaissance Man, Unix ronin, Perl hacker, Free Stater
                 It's not the years, it's the mileage.


