[SunRescue] Logging memory errors

Maarten Deen rescue at sunhelp.org
Tue Dec 5 09:37:31 CST 2000


> It's more a problem of how Intel boxes report ECC errors.  As I understand
> it, they aren't very verbose, and NMIs could mean other things.

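Specifically, NMI stands for Non-Maskable Interrupt, AFAIK: an interrupt
the CPU cannot mask off, which PC hardware raises for several kinds of
fault, memory errors being only one of them.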
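
On the ECC side of the thread: the single-bit "healing" Dave describes
below can be illustrated with a toy code. This is only a minimal sketch,
assuming a Hamming(7,4) code plus one overall parity bit (SECDED) over a
4-bit value; real memory controllers use much wider words, and nothing
here is taken from the E4500 or from Solaris. The names encode and decode
are made up for the example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED word (Hamming(7,4) + overall parity)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8  # index 0: overall parity; indices 1..7: Hamming positions
    code[3], code[5], code[6], code[7] = d
    # Check bit p (p = 1, 2, 4) covers every position whose index has bit p set.
    for p in (1, 2, 4):
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    # XOR of the indices of all set bits is 0 for a valid word; after a single
    # flip in positions 1..7 it equals the index of the flipped bit.
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i
    overall = sum(code) % 2  # 0 when the overall parity still holds
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: treat it as one flipped bit and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "uncorrectable"
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return nibble, status


word = encode(0b1011)
word[6] ^= 1          # simulate one bit going bad in the memory cell
print(decode(word))   # prints (11, 'corrected')
```

The "corrected" status is the point where a real system would log a
message like the ones Paul posted; the "uncorrectable" case is the one
you are trying to avoid by replacing the module in time.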

Maarten

> 
> On Tue, 5 Dec 2000, Dave Reader wrote:
> 
> > On Tue, 5 Dec 2000, Paul Khoury wrote:
> > 
> > > On Mon, 04 Dec 2000 21:12:14 -0800, Paul Theodoropoulos wrote:
> > > 
> > > >Dec  4 20:40:31 e4500a unix:  Corrected MemMod Board 0 J3800
> > > >Dec  4 20:40:31 e4500a unix:    ECC Data Bit 11 was corrected
> > > >
> > > >I refuse to use anything but SPARC running Solaris for core 
> > > >infrastructure. Nothing is as reliable.
> > > 
> > > How do the memory errors work, BTW?  Does Solaris just map around them
> > > in realtime? I'm sure Linux would have a fit if it encountered that.
> > 
> > It's ECC memory - Error Checking and Correcting.
> > 
> > It is possible for the ECC memory to seamlessly "heal" single-bit errors
> > and raise an alert that an error has occurred.
> > 
> > When this happens, it means "your memory has started to degrade and
> > introduce errors; I'm correcting single-bit errors, but you'd better
> > replace it before it gets worse" (typical ECC can only correct a
> > single-bit error per word, and is there to give you time to swap out the
> > memory before the machine crashes horribly).
> > 
> > With Linux, at least on x86 hardware (I've not seen ECC errors under
> > Linux on SPARC), it will say something like "Received NMI - Maybe you have
> > a memory problem? .. continuing anyway". Okay, so that's a little vague
> > (perhaps because Linux is still only just breaking out into the market
> > where ECC is the norm), but it does detect it, report it, and continue.
> > 
> > dave.
> > 
> > 
> > _______________________________________________
> > Rescue maillist  -  Rescue at sunhelp.org
> > http://www.sunhelp.org/mailman/listinfo/rescue
> > 
> 


-- 
"I announced to the spectators that he [Ivanisevic] could not carry on
because of lack of appropriate equipment. That was something that came
to my mind on the spur of the moment."
(Gerry Armstrong at the 2000 Samsung Open in Brighton after Ivanisevic
 broke all his rackets in frustration)


