[rescue] mysterious hard hangs, two different sun4m's

James Lockwood james at foonly.com
Tue Sep 3 02:06:43 CDT 2002


On Mon, 2 Sep 2002, Skeezics Boondoggle wrote:

> anyway, i'm at a loss.  i suppose i should try enabling the deadman code
> in the kernel, see if i can get ANY kind of debugging info out of it...
> i'm not sure what advice y'all might have that i haven't already thought
> of or tried; in 12 years banging on sun hardware these kinds of hard hangs
> are so rare that i'm just *mystified* that i've now moved the problem from
> one machine to another by just swapping their places on the desk.  i think
> it's gremlins.  a cia plot.  sunspots.  or i'm just going mad.

How hard of a lock is it?  Does the system still respond to L1-A?

If so, boot with kadb and try to reproduce it.  Once it hangs, break into
the debugger and get a backtrace.  First-order analysis: suspect hardware
if the hang point wanders dramatically, suspect software if it stays in a
relatively small number of places.  Using the SX stresses some weird parts
of the memory controller.
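
As a rough illustration of that first-order rule, here is a minimal sketch
in Python.  It assumes you have written down, by hand, the function name or
address at the top of the backtrace for each hang.  The function name
classify_hangs, the sample labels, and the 0.5 cutoff are all hypothetical
choices for illustration, not anything kadb provides.

```python
from collections import Counter


def classify_hangs(hang_points, max_distinct_ratio=0.5):
    """Apply the 'wandering vs. clustered' rule of thumb to recorded hangs.

    hang_points: one label per hang (e.g. the top backtrace frame).
    max_distinct_ratio: assumed cutoff; above it the hang point is
    considered to wander, which points at hardware.
    """
    if len(hang_points) < 2:
        raise ValueError("need at least two recorded hangs to compare")

    counts = Counter(hang_points)
    # Fraction of hangs that landed somewhere new.
    distinct_ratio = len(counts) / len(hang_points)

    verdict = "hardware" if distinct_ratio > max_distinct_ratio else "software"
    return verdict, counts.most_common()


if __name__ == "__main__":
    # Hypothetical labels, not real kernel symbols.
    samples = ["func_a", "func_a", "func_a", "func_a", "func_b"]
    print(classify_hangs(samples))
```

The more hangs you record, the more the verdict means; with only two or
three samples treat it as a hint at best.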

Is the watchdog reset enabled?  If it is a "hard" hang, does it respond to
a keyboard replug event (which enters the kernel at a higher interrupt
priority than L1-A)?

Try pulling one CPU and see what happens.  If you still get the hangs,
swap it for the other.  Drop down to a single DIMM.  You know, standard
problem isolation techniques.

Unless your desk has a Tesla coil directly underneath it I wouldn't worry
about the problem migrating with position.  :)

-James


