[SunRescue] Logging memory errors

Dave Reader rescue at sunhelp.org
Tue Dec 5 10:37:07 CST 2000


On Tue, 5 Dec 2000, Maarten Deen wrote:

> > It's more a problem of how Intel boxes report ECC errors.  As I understand
> > it they aren't very verbose, and NMIs could mean other things.
> 
> Specifically, NMI stands for Non Maskable Interrupt, AFAIK.

Yes.

The question is: after the memory system raises an NMI on an x86 system to
alert the OS to an ECC error, is there a mechanism by which the OS can
determine the reason for the NMI, or must it just guess? (The Linux
behaviour would seem to suggest that it guesses, though I haven't read
the source to confirm :)

Clearly, the Sun architecture has what it needs to report precisely to the
OS what the problem is/was. This isn't surprising of course - it's really a
requirement, since you don't want to spend all day swapping out each memory
board in turn trying to locate the culprit (and it may be an intermittent
fault anyway).

Of course, the Suns have always been built as servers, and in the x86
world what you get are, after all, just over-specced PeeCees.

I do wonder if perhaps the reporting mechanism for ECC errors on the x86
platform is not standardised, and different vendors implement different
means of supplying details to the OS. All the NMI can do, after all, is
guarantee to alert the OS to the fact that /something/ happened.
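
For illustration only: on PC/AT-compatible machines the convention, as far
as I know, is that the NMI handler reads system control port B (I/O port
0x61), where bit 7 flags a memory parity/system error and bit 6 flags an
I/O channel check. Anything else leaves the OS guessing. The sketch below
only decodes a value that has already been read from that port; the enum
and the helpers classify_nmi() and nmi_cause_name() are hypothetical names
made up for this example, not taken from any kernel source.

```c
#include <stdint.h>

/* Assumed PC/AT convention: system control port B lives at I/O port 0x61. */
#define NMI_REASON_PORT   0x61
#define NMI_REASON_PARITY 0x80  /* bit 7: memory parity / system error */
#define NMI_REASON_IOCHK  0x40  /* bit 6: I/O channel check */

/* Hypothetical classification of what the reason byte can tell us. */
enum nmi_cause { NMI_MEMORY, NMI_IO_CHECK, NMI_UNKNOWN };

/* Classify a byte previously read from NMI_REASON_PORT.
 * The memory bit is checked first, so it wins if both bits are set. */
static enum nmi_cause classify_nmi(uint8_t reason)
{
    if (reason & NMI_REASON_PARITY)
        return NMI_MEMORY;
    if (reason & NMI_REASON_IOCHK)
        return NMI_IO_CHECK;
    return NMI_UNKNOWN;   /* neither bit set: the OS can only guess */
}

/* Human-readable name for logging purposes. */
static const char *nmi_cause_name(enum nmi_cause cause)
{
    switch (cause) {
    case NMI_MEMORY:   return "memory parity/system error";
    case NMI_IO_CHECK: return "I/O channel check";
    default:           return "unknown";
    }
}
```

Note that even in the best case this only says "memory" and not which
module or address was at fault; that detail, if available at all, would
have to come from chipset-specific registers.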

To get back to the original point, I wonder if Linux knows about whatever
reporting mechanism Suns use for ECC errors. Linux does boldly try to
continue on x86, but that doesn't mean that the relevant code is there for
SPARC. Does anyone know? I'm only running it on old Suns so it's not
really an issue for me :)


dave.




