[SunRescue] Logging memory errors

Dave Reader rescue at sunhelp.org
Tue Dec 5 02:54:51 CST 2000


On Tue, 5 Dec 2000, Paul Khoury wrote:

> On Mon, 04 Dec 2000 21:12:14 -0800, Paul Theodoropoulos wrote:
> 
> >Dec  4 20:40:31 e4500a unix:  Corrected MemMod Board 0 J3800
> >Dec  4 20:40:31 e4500a unix:    ECC Data Bit 11 was corrected
> >
> >I refuse to use anything but SPARC running Solaris for core 
> >infrastructure. Nothing is as reliable.
> 
> How do the memory errors work, BTW?  Does Solaris just map around them
> in realtime? I'm sure Linux would have a fit if it encountered that.

It's ECC memory - Error Checking and Correcting.

It is possible for the ECC memory to seamlessly "heal" single-bit errors
and raise an alert that an error has occurred.
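
If you want to keep an eye on how often those alerts turn up, something
like the following rough Python sketch could tally them. It assumes the
messages always look exactly like the two lines quoted above (a
"Corrected MemMod" line followed by an "ECC Data Bit" line); the regular
expressions and the count_corrected name are my own invention for
illustration, not anything Solaris provides, and other releases or boards
may word things differently.

```python
import re
from collections import Counter

# Sample taken from the two syslog lines quoted earlier in this thread.
SAMPLE = """\
Dec  4 20:40:31 e4500a unix:  Corrected MemMod Board 0 J3800
Dec  4 20:40:31 e4500a unix:    ECC Data Bit 11 was corrected
"""

# Assumed patterns, derived only from the sample above.
MODULE_RE = re.compile(r"Corrected MemMod Board (\d+) (J\d+)")
BIT_RE = re.compile(r"ECC Data Bit (\d+) was corrected")


def count_corrected(log_text):
    """Count corrected errors keyed by (board, module, data bit)."""
    counts = Counter()
    current = None  # most recently seen (board, module) pair
    for line in log_text.splitlines():
        module = MODULE_RE.search(line)
        if module:
            current = (int(module.group(1)), module.group(2))
            continue
        bit = BIT_RE.search(line)
        if bit and current is not None:
            counts[current + (int(bit.group(1)),)] += 1
    return counts


if __name__ == "__main__":
    for key, hits in count_corrected(SAMPLE).items():
        print(key, hits)
```

A module whose count keeps climbing is the one to think about replacing.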

When this happens, it means "your memory has started to degrade and
introduce errors; I'm correcting single-bit errors, but you'd better
replace it before it gets worse". ECC can only correct a single-bit error
in a given memory word, so it is there only to give you time to swap out
the memory before the machine crashes horribly.
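
To show why one flipped bit can be repaired but two cannot, here is a toy
Python sketch of a single-error-correcting, double-error-detecting
Hamming code. It is only an illustration: it protects 4 data bits with 4
check bits, whereas real memory controllers work on much wider words, and
I am not claiming this is the exact code any particular Sun board uses.
The encode and decode names are made up for the example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit codeword (list index = bit position)."""
    code = [0] * 8
    # Data bits live at positions 3, 5, 6 and 7.
    code[3] = nibble & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    # Hamming check bits at positions 1, 2 and 4.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    # Overall parity bit at position 0 makes the total number of 1s even.
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (nibble, status); nibble is None if the word is uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd parity: treat as a single flipped bit; the syndrome is its
        # position (0 means the overall parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity but non-zero syndrome: two bits flipped.
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1                 # one bad bit: repaired and reported
    print(decode(word))
    word[6] ^= 1                 # a second bad bit: detected, not repaired
    print(decode(word))
```

The "corrected" case is the moral equivalent of the log message above; the
"uncorrectable" case is the one you are trying to avoid by replacing the
module early.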

With Linux, at least on x86 hardware (I've not seen ECC errors under
Linux on SPARC), it will say something like "Received NMI - Maybe you have
a memory problem? .. continuing anyway". Okay, so that's a little vague
(perhaps because Linux is still only just breaking out into the market
where ECC is the norm), but it does detect it, report it, and continue.

dave.




