[SunRescue] Logging memory errors

rescue at sunhelp.org
Tue Dec 5 09:10:46 CST 2000


It's more a problem of how Intel boxes report ECC errors.  As I understand
it, they aren't very verbose, and an NMI could mean other things.
	Nick

On Tue, 5 Dec 2000, Dave Reader wrote:

> 
> 
> On Tue, 5 Dec 2000, Paul Khoury wrote:
> 
> > On Mon, 04 Dec 2000 21:12:14 -0800, Paul Theodoropoulos wrote:
> > 
> > >Dec  4 20:40:31 e4500a unix:  Corrected MemMod Board 0 J3800
> > >Dec  4 20:40:31 e4500a unix:    ECC Data Bit 11 was corrected
> > >
> > >I refuse to use anything but SPARC running Solaris for core 
> > >infrastructure. Nothing is as reliable.
> > 
> > How do the memory errors work, BTW?  Does Solaris just map around them
> > in real time? I'm sure Linux would have a fit if it encountered that.
> 
> It's ECC memory - Error Checking and Correcting.
> 
> It is possible for the ECC memory to seamlessly "heal" single-bit errors
> and raise an alert that an error has occurred.
> 
> When this happens, it means "your memory has started to degrade and
> introduce errors; I'm correcting single-bit errors, but you'd better
> replace it before it gets worse" (ECC only corrects single-bit errors,
> and is there only to give you time to swap out the memory before the
> machine crashes horribly).
> 
> With Linux, at least on x86 hardware (I've not seen ECC errors under
> Linux on SPARC), it will say something like "Received NMI - Maybe you have
> a memory problem? .. continuing anyway". Okay, so that's a little vague
> (perhaps because Linux is still only just breaking into the market
> where ECC is the norm), but it does detect it, report it, and continue.
> 
> dave.
> 
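To make the "heal single-bit errors and raise an alert" behaviour quoted
above concrete, here is a minimal sketch of single-error-correcting,
double-error-detecting (SECDED) coding in Python. It is an illustration
only: it protects 4 data bits with an extended Hamming code, whereas real
memory controllers protect much wider words in hardware, and nothing here
is taken from the E4500 or Solaris. The function names encode() and
decode() are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    # Data bits live at positions 3, 5, 6, 7; parity bits at 1, 2, 4.
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    # Position 0 holds an overall parity bit so the whole word has even parity.
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    # XOR of the positions of all set bits is 0 for a valid codeword.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd overall parity: exactly one bit flipped, at position `syndrome`
        # (0 means the overall parity bit itself). Flip it back and report it.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even overall parity but non-zero syndrome: two bits flipped.
        return None, "uncorrectable"
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return nibble, status


word = encode(0b1011)
word[6] ^= 1                 # simulate a single flipped bit in memory
value, status = decode(word)
print(value == 0b1011, status)
```

A single flipped bit comes back as the original value with status
"corrected", which is the point at which a system would log a message;
two flipped bits in the same word can only be detected, not repaired.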

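Since corrected errors are meant as an early warning, it can help to tally
them per memory module. The sketch below does that in Python for the two
message shapes quoted in this thread. The regular expressions are an
assumption based only on those two lines; other wordings may exist and
would need their own patterns. The name tally_corrected() is hypothetical.

```python
import re
from collections import Counter

# Patterns assumed from the two log lines quoted in the thread.
MODULE_RE = re.compile(r"unix:\s+Corrected MemMod Board (\d+) (\S+)")
BIT_RE = re.compile(r"unix:\s+ECC Data Bit (\d+) was corrected")


def tally_corrected(lines):
    """Count corrected-error reports per (board, module) and per data bit."""
    modules, bits = Counter(), Counter()
    for line in lines:
        m = MODULE_RE.search(line)
        if m:
            modules[(int(m.group(1)), m.group(2))] += 1
            continue
        b = BIT_RE.search(line)
        if b:
            bits[int(b.group(1))] += 1
    return modules, bits


SAMPLE = [
    "Dec  4 20:40:31 e4500a unix:  Corrected MemMod Board 0 J3800",
    "Dec  4 20:40:31 e4500a unix:    ECC Data Bit 11 was corrected",
]

modules, bits = tally_corrected(SAMPLE)
print(modules)
print(bits)
```

A module whose count keeps climbing is the one to schedule for replacement.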


