[Sunhelp] My take on the system reboots

Fri Jan 14 02:35:08 CST 2000

Ok ladies and gents.  Be advised, this is long, dry, and insanely boring.  I
suggest a Bailey's and Coffee to make it go down a little smoother.

I've been watching this system reboot issue, and with some snips courtesy of
S. Condit and Melinda Taylor, I've found some consistency, although I'm not
exactly sure what it means yet.

It appears there's a bad trap occurring somewhere on these systems.  Both
systems are running Solaris Release 5.5; one machine is a Sparc 5, the other
a Sparc 4 (if my notes serve me correctly).

The first snip from S. Condit reads as follows:

Jan 10 02:47:45 halebopp unix: fp=fc002c08, args=3 e9258 198 ef6d1898 40 effffbac
Jan 10 02:47:45 halebopp unix: Called from 1ac8c, fp=effffc78, args=69aa0 8 1 effffce0 effffcf0 8
Jan 10 02:47:45 halebopp unix: End traceback...
Jan 10 02:47:45 halebopp unix: panic: Data fault
Jan 10 02:47:45 halebopp unix: syncing file systems... [3] 2 [3] [3] [3] [3]
[3]
  [3] [3] [3] [3] [3] [3] [3] [3] [3] [3]
  [3] [3] [3] [3] [3] [3] [3] [3] [3] [3] [3] [3] [3] [3] [3] [3] [3] [3]
[3] [3]
  [3] [3] [3] [3] [3] [3] [3] [3] [3] [3]
  [3] [3] [3] [3] [3] [3] [3] [3] [3] [3] [3] [3] [3] [3] [3] cannot sync --
giving up
Jan 10 02:47:45 halebopp unix:  4276 static and sysmap kernel pages
Jan 10 02:47:45 halebopp unix:    67 dynamic kernel data pages
Jan 10 02:47:45 halebopp unix:   496 kernel-pageable pages
Jan 10 02:47:45 halebopp unix:     0 segkmap kernel pages
Jan 10 02:47:45 halebopp unix:     0 segvn kernel pages
Jan 10 02:47:45 halebopp unix:   520 current user process pages
Jan 10 02:47:45 halebopp unix:  5359 total pages (5359 chunks)
Jan 10 02:47:45 halebopp unix: dumping to vp f5affe1c, offset 93896
Jan 10 02:47:45 halebopp unix:  5359 total pages, dump succeeded

The machine had been running approximately 20 minutes prior to this, and
there was no other chatter in the syslog.  A little later, we see this:

Jan 10 02:47:49 halebopp unix: BAD TRAP: type=1 rp=fbe2eda4 addr=0 mmu_fsr=164 rw=3
Jan 10 02:47:49 halebopp unix: sched: Text fault
Jan 10 02:47:49 halebopp unix: kernel read fault at addr=0x0, pme=0x0
Jan 10 02:47:49 halebopp unix: MMU sfsr=164: Invalid Address on supv instr fetch at level 1
Jan 10 02:47:49 halebopp unix: pte addr = 0xf597a000, level = 1
Jan 10 02:47:49 halebopp unix: pid=0, pc=0x0, sp=0xfbe2edf0, psr=0x44000c5, context=0
Jan 10 02:47:49 halebopp unix: g1-g7: 0, f026b4e4, 44010e6, 600, 0, 1, fbe2eec0
Jan 10 02:47:49 halebopp unix: Begin traceback... sp = fbe2edf0
Jan 10 02:47:49 halebopp unix: Called from f00588b8, fp=fbe2ee60, args=0 f028a350 f627c080 f028a434 0 f028963c
Jan 10 02:47:49 halebopp unix: Called from f00dd980, fp=0, args=0 0 0 0 0 0
Jan 10 02:47:49 halebopp unix: End traceback...
Jan 10 02:47:49 halebopp unix: panic: Text fault
Jan 10 02:47:49 halebopp unix: syncing file systems... [60] 3 [60] [60] 
[60] [60] [60] [60] [60] [60] [60] [60] [60] [60] [60] [60] [60] [60] [60]
[60] [60] [60] [60] [60] [60] [60] [60] [60] 
[60] [60] [60] [60] [60] [60] [60] [60] [60] [60] [60] [60] [60] [60] [60]
[60] [60] [60] [60] [60] [60] [60] [60] [60] 
[60] [60] [60] [60] [60] [60] [60] [60] [60] [60] cannot sync -- giving up
Jan 10 02:47:50 halebopp unix:  3676 static and sysmap kernel pages
Jan 10 02:47:50 halebopp unix:    51 dynamic kernel data pages
Jan 10 02:47:50 halebopp unix:   442 kernel-pageable pages
Jan 10 02:47:50 halebopp unix:     0 segkmap kernel pages
Jan 10 02:47:50 halebopp unix:     0 segvn kernel pages
Jan 10 02:47:50 halebopp unix:     0 current user process pages
Jan 10 02:47:50 halebopp unix:  4169 total pages (4169 chunks)
Jan 10 02:47:50 halebopp unix: dumping to vp f5affe1c, offset 103416
Jan 10 02:47:50 halebopp unix:  4169 total pages, dump succeeded

The noteworthy bit I see here is in the BAD TRAP sequence.
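
For anyone who wants to pick the mmu_fsr value apart by hand, here is a
rough Python sketch.  It ASSUMES the fault status register layout of the
SPARC Reference MMU (SPARC V8 manual, Appendix H), which is what I believe
these sun4m boxes use; the function and table names are my own invention,
not anything from Solaris.  Treat it as a sanity check against the message
the kernel already printed, nothing more.

```python
# Sketch only: decode a fault status value such as mmu_fsr=164.
# ASSUMPTION: SPARC Reference MMU layout (OW bit 0, FAV bit 1,
# FT bits 2-4, AT bits 5-7, L bits 8-9). Names here are hypothetical.

FAULT_TYPES = {
    0: "none",
    1: "invalid address",
    2: "protection",
    3: "privilege violation",
    4: "translation",
    5: "access bus",
    6: "internal",
    7: "reserved",
}

ACCESS_TYPES = {
    0: "user data load",
    1: "supervisor data load",
    2: "user instruction load/execute",
    3: "supervisor instruction load/execute",
    4: "user data store",
    5: "supervisor data store",
    6: "user instruction store",
    7: "supervisor instruction store",
}


def decode_sfsr(value):
    """Split a fault status value into its assumed fields."""
    return {
        "overwrite": bool(value & 0x1),
        "fault_address_valid": bool((value >> 1) & 0x1),
        "fault_type": FAULT_TYPES[(value >> 2) & 0x7],
        "access_type": ACCESS_TYPES[(value >> 5) & 0x7],
        "level": (value >> 8) & 0x3,
    }


if __name__ == "__main__":
    # The value logged by halebopp, read as hexadecimal.
    for name, field in decode_sfsr(0x164).items():
        print(name, "=", field)
```

Under that assumed layout, 0x164 comes out as an invalid address error on
a supervisor instruction access at level 1, which lines up with the
"Invalid Address on supv instr fetch at level 1" text in the log.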

Here's the snip from a crash on Melinda's machine:

Dec 15 08:25:47 ham unix: zs1 is /obio/zs at 0,0
Dec 15 08:25:47 ham unix: BAD TRAP: type=9 rp=f02415ac addr=18 mmu_fsr=126 rw=1
Dec 15 08:25:47 ham unix: : Data fault
Dec 15 08:25:47 ham unix: kernel read fault at addr=0x18, pme=0x0
Dec 15 08:25:47 ham unix: MMU sfsr=126: Invalid Address on supv data fetch at level 1
Dec 15 08:25:47 ham unix: pte addr = 0xf5979000, level = 1
Dec 15 08:25:47 ham unix: pid=0, pc=0xf0082e7c, sp=0xf02415f8, psr=0x44000c1, context=0
Dec 15 08:25:47 ham unix: g1-g7: f5a7b290, f026b4e4, a00, c00, 0, 1, f0242020
Dec 15 08:25:47 ham unix: Begin traceback... sp = f02415f8
Dec 15 08:25:47 ham unix: Called from f0082444, fp=f0241670, args=f593dcb0 f59256a0 f593dcf0 f593dce6 e f593c248
Dec 15 08:25:47 ham unix: Called from f0100c8c, fp=f02416d0, args=f59245d8 230 f593dcb0 f02728e0 f593dcb8 0
Dec 15 08:25:47 ham unix: Called from f01007ac, fp=f0241730, args=f59245d8 1 f0108a2c 0 0 f028a350
Dec 15 08:25:47 ham unix: Called from f0100168, fp=f0241790, args=f59245d8 1 f02417f4 f596f650 f590c180 f5974698
Dec 15 08:25:47 ham unix: Called from f0101b48, fp=f02417f8, args=f027fb2c f596f650 3 f5927000 8 f5974690
Dec 15 08:25:47 ham unix: Called from f00e7004, fp=f0241858, args=32 f0278a4c f027bf3c 960 f5970e10 f596f650
Dec 15 08:25:48 ham unix: Called from f00e88cc, fp=f02418b8, args=32 70 f5927960 f024191c 32 f592796c
Dec 15 08:25:48 ham unix: Called from f5a55a38, fp=f0241a38, args=f5a67b08 0 f5a67b08 3b 32 ffffffff
Dec 15 08:25:48 ham unix: Called from f007c844, fp=f0241ae8, args=1 879af934 0 f026df20 4d85b4 f5a56ce8
Dec 15 08:25:48 ham unix: Called from f008f5b8, fp=f0241b48, args=1 879af934 0 13d88 4d85b4 f028a40c
Dec 15 08:25:48 ham unix: Called from f00b38ec, fp=f0241ba8, args=0 ffffffff 28 0 f005db94 0
Dec 15 08:25:48 ham unix: Called from f00411c4, fp=f0241c10, args=4400fe0 4400fe0 f0240000 f0271aa0 f02719a0 f025dc68
Dec 15 08:25:48 ham unix: Called from 105070, fp=0, args=121800 d6 6df0 ffffe000 98000 2000
Dec 15 08:25:48 ham unix: End traceback...
Dec 15 08:25:48 ham unix: panic: Data fault
Dec 15 08:25:48 ham unix: syncing file systems... done

Perhaps I'm mistaken, but the BAD TRAP here looks very similar, especially
this section:

Dec 15 08:25:47 ham unix: kernel read fault at addr=0x18, pme=0x0
Dec 15 08:25:47 ham unix: MMU sfsr=126: Invalid Address on supv data fetch at level 1
Dec 15 08:25:47 ham unix: pte addr = 0xf5979000, level = 1
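
If anyone else wants to line up their own crashes against these two, a
small Python sketch like the one below pulls the fields out of the BAD
TRAP header so they can be compared side by side.  The regular expression
is just my reading of the message format as it shows up in these two logs,
and the function names are made up for illustration.  It also assumes each
syslog entry sits on a single line, so mail-wrapped lines have to be
rejoined first.

```python
import re

# Sketch only: pattern inferred from the two BAD TRAP lines in this post.
BAD_TRAP = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ [\d:]{8}) (?P<host>\S+) unix: BAD TRAP: "
    r"type=(?P<type>[0-9a-f]+) rp=(?P<rp>[0-9a-f]+) "
    r"addr=(?P<addr>[0-9a-f]+) mmu_fsr=(?P<mmu_fsr>[0-9a-f]+) "
    r"rw=(?P<rw>[0-9a-f]+)"
)


def parse_bad_traps(lines):
    """Return one dict of header fields per BAD TRAP line found."""
    traps = []
    for line in lines:
        match = BAD_TRAP.match(line)
        if match:
            traps.append(match.groupdict())
    return traps


def shared_fields(first, second, fields=("type", "addr", "mmu_fsr", "rw")):
    """List the header fields whose values are identical in both traps."""
    return [name for name in fields if first[name] == second[name]]


# The two header lines from this post, rejoined onto single lines.
SAMPLE = [
    "Jan 10 02:47:49 halebopp unix: BAD TRAP: type=1 rp=fbe2eda4 addr=0 "
    "mmu_fsr=164 rw=3",
    "Jan 10 02:47:49 halebopp unix: sched: Text fault",
    "Dec 15 08:25:47 ham unix: BAD TRAP: type=9 rp=f02415ac addr=18 "
    "mmu_fsr=126 rw=1",
]

if __name__ == "__main__":
    traps = parse_bad_traps(SAMPLE)
    for trap in traps:
        print(trap["host"], trap["type"], trap["addr"], trap["mmu_fsr"])
    print("identical fields:", shared_fields(traps[0], traps[1]))
```

Feed it the unix: lines from /var/adm/messages (or wherever your syslog
lands) and it should make it easier to see which fields really match from
one machine to the next and which only look alike at 2:30 am.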

I looked up the patch Jim Richey posted on the 10th (105181-17), and found
that it is in fact for 5.6, not 5.5.  After surfing through all the related
links, I could find nothing that bears on this issue.  I may have overlooked
something (it _is_ 2:30 am), but I don't see anything pertinent.

Now, I'm not sure what all this means, but I was wondering if any of you
have run into similar situations.  Hopefully, one of the Code Gods
(crosses fingers, hopes Casper Dik is watching) can give us a little insight
on this.

My grep "0.02" dollar > $PAYCHECK for the day.

Hal

Hal Flynn, ICS Inc.        Senior Systems Analyst
Defense   Information   Systems   Agency
flynnh at mont.disa.mil    Commercial:  334-416-3233
DSN:  596-3233