[rescue] Oracle making just a little harder to keep old machines in use

Patrick Giagnocavo patrick at zill.net
Thu May 6 21:39:30 CDT 2010


Patrick Finnegan wrote:
> 
> ECC and Chip-kill avert most memory-related problems.
> 
> If you're developing uncorrectable ECC errors in the main data pathways 
> of your system, there's no reason to expect the machine to function 
> properly.  What happens on Solaris if the RAM that the scheduler or other 
> kernel code lives in develops double-bit errors that ECC can't 
> correct?
> 
> Pat


That is a good question.  I assume that Solaris would either crash outright
or panic and write a crash dump, sending a message to the fault
management daemon if it could.  If it could not, then hopefully you turned
on ECC memory testing at boot.

I have seen single-bit errors on one system; the system marked as failed
the part of each module that had the error, forcing the OS to reduce its
RAM from 8192K to 8000K.  The machine kept running for 300 more days or so
before it was rebooted.

Note that some newer SPARC systems can recover from double-bit ECC errors;
however, that is a hardware design point and has little to do with the
software.
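
For anyone wondering why a single-bit error is usually correctable while a
double-bit error is usually only detectable, here is a minimal sketch of a
SECDED (single error correct, double error detect) code.  It is a toy: it
assumes a Hamming(7,4) code plus one overall parity bit protecting 4 data
bits, whereas real memory controllers use much wider codes.  The function
names encode and decode are made up for illustration and do not correspond
to any Solaris or hardware interface.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of 0/1)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    # Data bits sit at the non-power-of-two positions 3, 5, 6, 7.
    code[3], code[5], code[6], code[7] = d
    # Hamming parity bits sit at positions 1, 2, 4.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    # Position 0 holds an overall parity bit over positions 1..7.
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, nibble); nibble is None if the error is uncorrectable."""
    code = list(code)
    # The syndrome is the XOR of the positions of all set bits.
    # It is 0 for a valid codeword and equals the position of a single flip.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2  # 0 when the overall parity is consistent

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: treat as a single-bit error and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with consistent overall parity: two bits flipped.
        # The code can tell that something is wrong but not where.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

Flipping any one bit of encode(n) makes decode return "corrected" along with
the original value; flipping any two bits makes it return "uncorrectable",
which is the situation where the OS has nothing left to do but panic or
retire the affected memory.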

What would Linux do in such a case?

--Patrick


