[rescue] mysterious hard hangs, two different sun4m's

Skeezics Boondoggle skeezics at q7.com
Mon Sep 2 04:56:27 CDT 2002


the problem:

i have both an ss4 and an ss20 that are locking up tight, sporadically,
for no apparent reason.  when they go they're completely frozen up; won't
respond to L1-A, pointer freezes, keyboard won't respond, network dead.

the hardware:

ss4, 128mb, solaris 2.6, one internal disk, cdrom+floppy, onboard tcx
video.  only locks up during interactive use, but often runs fine for
weeks or months with daily usage with no problems - other times it hangs
two or three times a day.

ss20, 128mb, solaris 7, one internal disk, cdrom+floppy, 4mb vsimm.  was 
running headless with zero problems.  now it locks up too!

both machines are in my basement cave, which is comfortably cool all year 
'round.

the background:

the ss4 has been my primary machine at home for many years.  i was too
lazy to do a proper reinstall from scratch, so it is a bit of a mongrel -
running solaris 2.6 (upgraded from 2.5 or 2.5.1), and it's been through a
whole series of configuration changes and tinkering. :-)  for some time it
had a fore atm card in it, because i was feeling masochistic and wanted to
fling some bits over the fiber at my old asx-100... i soon got over that,
then installed an fsbe (later a sunswift) and ipfilter, and it was my dsl nat
gateway and firewall, with a three-slot and-a-taco disk enclosure hanging
off it.  then i found a different ss20 to be the firewall machine and run 
some services and stuff, and got those off the ss4.

later i found another ss20 and intended to make it my new desktop and put
nextstep/openstep on the ss4 instead, because building three-architecture
(m68k/sparc/hppa) fat binaries is cool.  (quad fat is cooler, except that 
would mean having to own an x86 box...)

the symptoms and diagnosis:

at first i thought the hangs were hardware related; the 110mhz ss4 clocks
its sbus at 22.5mhz and that seems to make some sbus cards unhappy.  in
all the hangs, though, i never found ANYTHING in any of the system logs,
on the console, or during any of the post runs or self tests.  even after
i'd stripped out all the extra hardware, pulled off the serial splitter
cable, removed the external disks and uninstalled disksuite, and fscked
the internal drive twice, pkgchk'ed and rpm -vV'ed all my installed
software, and found absolutely NOTHING to indicate a hardware problem, i
finally just got fed up - and a couple of power outages last week were the
last straw.

so i decided to move the ss4 off my desk and get off my lazy butt and
update the internal network. :-)  the headless 4-way ss20 firewall machine
is now on ups power with the F330.  the dual-75 ss20 now has the ss4's 20"
monitor, type5 and mouse.  and the ss4 was finally going to get a fresh
nextstep 3.3 or openstep 4.2 install...


then after a day of getting my recently jumpstarted and patched solaris 7
machine updated with all my latest rpm builds (get paid to build and
deploy 'em at work, saves a TON of time just downloading 'em and
installing 'em at home :-) i was quite pleased with how zippy it was with
the sx video enabled... for giggles, before i nuked and repaved the ss4, i
stuck another monitor on it and a spare type6 keyboard, and decided to run
a few tests.  (for posterity, an informal benchmark shows that the
dual-75mhz supersparc-ii can render 98 jolene blalock .jpgs with 'xv' 2:10
faster than the 110mhz ss4, while the 110mhz sparcbook 3gx, sadly, takes a
full 5:13 longer, largely because it had to scale down most of the images,
so it wasn't a fair test. :-)

another test was running a two player computer vs. computer game on the
latest 'xconq' cvs snapshot, which i'd just rebuilt today... imagine my
chagrin when the ss20 locked up tight, while the ss4 sat there patiently,
wondering why the zimbabwean mplayer had suddenly stopped talking to it...

okay, i thought, these are *totally* separate machines, but now the ss4 is
*fine* and the ss20 is hanging.  i'm using my old type5 and my old mouse
on the new machine... could THAT be the problem?  dug up a spare type5,
still in bubble wrap, which turned out to have that godawful "pc layout"  
(anyone actually LIKE those?  i'll gladly trade for a unix layout type5,
with the control key in the right place, and NO, thank you, screwing with
xmodmap is annoying and i refuse to do that on principle) and swapped out
the keyboard.  still froze.  swapped out the mouse.  still froze.  
checked the power cord with a meter:  120V.  opened the case and blew out
a tiny bit of dust, but this was a fairly new machine and had been running
just fine, headless, for a couple of months.  checked that everything was 
seated, all cables fine, etc.

at least i can now reproduce the problem with some sort of regularity:  
power cycle and boot up, log back in, start up a two-player xconq game and
about 12-18 turns in, bingo, the 20 locks up and the 4 is fine... (i'm
composing this on the 4 right now.)

okay, so the only other thing in common across the two machines is that
i'm running olvwm on top of openwin - yeah, i know, but i think the folks
here can appreciate how convenient it is to have the same login name and
uid and windowing environment over six jobs in the last 11 years...  :-)
with all the gyrations sun has done with their X environment, it's nice to
just set up a new home directory by unpacking a small tarball and getting
to work. :-)  cde?  please.  gnome?  yet another conspiracy to sell ram 
upgrades...

but having checked or swapped out the only hardware components that were
common between the two machines, the only thing i can think of now is that
either my window manager or something in my login environment or some bit
of opensource software that i've built is causing it.  or, perhaps this
particular spot on my desk is the sun4m VORTEX OF DOOM, and any
unfortunate sparc v8 pizzabox i put there is bombarded with weird cosmic
rays or something...

anyway, i'm at a loss.  i suppose i should try enabling the deadman code 
in the kernel, see if i can get ANY kind of debugging info out of it... 
i'm not sure what advice y'all might have that i haven't already thought 
of or tried; in 12 years banging on sun hardware these kinds of hard hangs 
are so rare that i'm just *mystified* that i've now moved the problem from 
one machine to another by just swapping their places on the desk.  i think 
it's gremlins.  a cia plot.  sunspots.  or i'm just going mad.

"maybe it's time to move..." :-)

ah, well.  graphics are overrated anyway.  i'll just put them back in a 
stack (oooh, it's a 'tower of power' :-) and run them all headless, pull 
the vt320 off the netapp and get a serial port switch.  80x24 is fine for 
'pine' and 'lynx'.  hell, Real Men read their mail with 'cat' and flow 
control...

sigh.

thanks for letting me rant.  if anyone is running olvwm on a sun4m and 
suffering sporadic hangs, maybe we have a culprit... any other advice or 
suggestions from the peanut gallery gladly accepted.

-- skeez


