[Sunhelp] My take on the system reboots

Fri Jan 14 04:28:43 CST 2000

I saw something similar on a cycle upgrade 300 MHz machine once.  It blew up
about three days later.  These don't happen to be 300s, are they?  I think I
have the messages file saved from it somewhere.
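
If that file turns up, or if anyone wants to compare their own logs, here is
a rough sketch for pulling the BAD TRAP fields out of a messages file so they
can be lined up side by side.  The field names come from the log lines quoted
below; the function name find_bad_traps and the sample text are only for
illustration, and I'm assuming the fields always appear in this order.

```python
import re

# Matches the "BAD TRAP:" summary line. \s+ between fields also tolerates a
# line that a mail client wrapped, as long as quote markers were stripped.
BAD_TRAP = re.compile(
    r"BAD TRAP: type=(?P<type>[0-9a-f]+)\s+rp=(?P<rp>[0-9a-f]+)\s+"
    r"addr=(?P<addr>[0-9a-f]+)\s+mmu_fsr=(?P<mmu_fsr>[0-9a-f]+)\s+"
    r"rw=(?P<rw>[0-9a-f]+)"
)


def find_bad_traps(text):
    """Return one dict of hex-string fields per BAD TRAP line found in text."""
    return [match.groupdict() for match in BAD_TRAP.finditer(text)]


# Sample input: the two BAD TRAP lines from this thread, unwrapped.
sample = (
    "Jan 10 02:47:49 halebopp unix: BAD TRAP: type=1 rp=fbe2eda4 addr=0 "
    "mmu_fsr=164 rw=3\n"
    "Dec 15 08:25:47 ham unix: BAD TRAP: type=9 rp=f02415ac addr=18 "
    "mmu_fsr=126 rw=1\n"
)

for trap in find_bad_traps(sample):
    print(trap)
```

To run it against a real log, pass in the file contents, for example
open("/var/adm/messages").read() (the usual Solaris location, but check
your syslog.conf).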

> -----Original Message-----
> From:	Flynn, Harold M. III [SMTP:Flynnh at mont.disa.mil]
> Sent:	Friday, January 14, 2000 3:35 AM
> To:	'sunhelp at sunhelp.org'
> Subject:	[Sunhelp] My take on the system reboots
> 
> Ok ladies and gents.  Be advised, this is long, dry, and insanely boring.
> I suggest a Bailey's and Coffee to make it go down a little smoother.
> 
> I've been watching this system reboot issue, and with some snips courtesy
> of S. Condit and Melinda Taylor, I've found some consistency, although I'm
> not exactly sure what it means yet.
> 
> It appears there's a bad trap occurring somewhere on these systems.  Both
> systems are running Solaris Release 5.5; one machine is a Sparc 5, the
> other a Sparc 4 (if my notes serve me correctly).
> 
> The first snip, from S. Condit, reads as follows:
> 
> Jan 10 02:47:45 halebopp unix: fp=fc002c08, args=3 e9258 198 ef6d1898 40 effffbac
> Jan 10 02:47:45 halebopp unix: Called from 1ac8c, fp=effffc78, args=69aa0 8 1 effffce0 effffcf0 8
> Jan 10 02:47:45 halebopp unix: End traceback...
> Jan 10 02:47:45 halebopp unix: panic: Data fault
> Jan 10 02:47:45 halebopp unix: syncing file systems... [3] 2 [3] [3] [3]
> [3]
> [3]
>   [3] [3] [3] [3] [3] [3] [3] [3] [3] [3]
>   [3] [3] [3] [3] [3] [3] [3] [3] [3] [3] [3] [3] [3] [3] [3] [3] [3] [3]
> [3] [3]
>   [3] [3] [3] [3] [3] [3] [3] [3] [3] [3]
>   [3] [3] [3] [3] [3] [3] [3] [3] [3] [3] [3] [3] [3] [3] [3] cannot sync -- giving up
> Jan 10 02:47:45 halebopp unix:  4276 static and sysmap kernel pages
> Jan 10 02:47:45 halebopp unix:    67 dynamic kernel data pages
> Jan 10 02:47:45 halebopp unix:   496 kernel-pageable pages
> Jan 10 02:47:45 halebopp unix:     0 segkmap kernel pages
> Jan 10 02:47:45 halebopp unix:     0 segvn kernel pages
> Jan 10 02:47:45 halebopp unix:   520 current user process pages
> Jan 10 02:47:45 halebopp unix:  5359 total pages (5359 chunks)
> Jan 10 02:47:45 halebopp unix: dumping to vp f5affe1c, offset 93896
> Jan 10 02:47:45 halebopp unix:  5359 total pages, dump succeeded
> 
> The machine had been running approximately 20 minutes prior to this, and
> there was no other chatter in the syslog.  A little later, we see this:
> 
> Jan 10 02:47:49 halebopp unix: BAD TRAP: type=1 rp=fbe2eda4 addr=0 mmu_fsr=164 rw=3
> Jan 10 02:47:49 halebopp unix: sched: Text fault
> Jan 10 02:47:49 halebopp unix: kernel read fault at addr=0x0, pme=0x0
> Jan 10 02:47:49 halebopp unix: MMU sfsr=164: Invalid Address on supv instr fetch at level 1
> Jan 10 02:47:49 halebopp unix: pte addr = 0xf597a000, level = 1
> Jan 10 02:47:49 halebopp unix: pid=0, pc=0x0, sp=0xfbe2edf0, psr=0x44000c5, context=0
> Jan 10 02:47:49 halebopp unix: g1-g7: 0, f026b4e4, 44010e6, 600, 0, 1, fbe2eec0
> Jan 10 02:47:49 halebopp unix: Begin traceback... sp = fbe2edf0
> Jan 10 02:47:49 halebopp unix: Called from f00588b8, fp=fbe2ee60, args=0 f028a350 f627c080 f028a434 0 f028963c
> Jan 10 02:47:49 halebopp unix: Called from f00dd980, fp=0, args=0 0 0 0 0 0
> Jan 10 02:47:49 halebopp unix: End traceback...
> Jan 10 02:47:49 halebopp unix: panic: Text fault
> Jan 10 02:47:49 halebopp unix: syncing file systems... [60] 3 [60] [60] 
> [60] [60] [60] [60] [60] [60] [60] [60] [60] [60] [60] [60] [60] [60] [60]
> [60] [60] [60] [60] [60] [60] [60] [60] [60] 
> [60] [60] [60] [60] [60] [60] [60] [60] [60] [60] [60] [60] [60] [60] [60]
> [60] [60] [60] [60] [60] [60] [60] [60] [60] 
> [60] [60] [60] [60] [60] [60] [60] [60] [60] [60] cannot sync -- giving up
> Jan 10 02:47:50 halebopp unix:  3676 static and sysmap kernel pages
> Jan 10 02:47:50 halebopp unix:    51 dynamic kernel data pages
> Jan 10 02:47:50 halebopp unix:   442 kernel-pageable pages
> Jan 10 02:47:50 halebopp unix:     0 segkmap kernel pages
> Jan 10 02:47:50 halebopp unix:     0 segvn kernel pages
> Jan 10 02:47:50 halebopp unix:     0 current user process pages
> Jan 10 02:47:50 halebopp unix:  4169 total pages (4169 chunks)
> Jan 10 02:47:50 halebopp unix: dumping to vp f5affe1c, offset 103416
> Jan 10 02:47:50 halebopp unix:  4169 total pages, dump succeeded
> 
> The noteworthy bit I see here is in the BAD TRAP sequence: a kernel read
> fault at addr=0x0 with pc=0x0.
> 
> Here's the snip from a crash on Melinda's machine:
> 
> Dec 15 08:25:47 ham unix: zs1 is /obio/zs at 0,0
> Dec 15 08:25:47 ham unix: BAD TRAP: type=9 rp=f02415ac addr=18 mmu_fsr=126 rw=1
> Dec 15 08:25:47 ham unix: : Data fault
> Dec 15 08:25:47 ham unix: kernel read fault at addr=0x18, pme=0x0
> Dec 15 08:25:47 ham unix: MMU sfsr=126: Invalid Address on supv data fetch at level 1
> Dec 15 08:25:47 ham unix: pte addr = 0xf5979000, level = 1
> Dec 15 08:25:47 ham unix: pid=0, pc=0xf0082e7c, sp=0xf02415f8, psr=0x44000c1, context=0
> Dec 15 08:25:47 ham unix: g1-g7: f5a7b290, f026b4e4, a00, c00, 0, 1, f0242020
> Dec 15 08:25:47 ham unix: Begin traceback... sp = f02415f8
> Dec 15 08:25:47 ham unix: Called from f0082444, fp=f0241670, args=f593dcb0 f59256a0 f593dcf0 f593dce6 e f593c248
> Dec 15 08:25:47 ham unix: Called from f0100c8c, fp=f02416d0, args=f59245d8 230 f593dcb0 f02728e0 f593dcb8 0
> Dec 15 08:25:47 ham unix: Called from f01007ac, fp=f0241730, args=f59245d8 1 f0108a2c 0 0 f028a350
> Dec 15 08:25:47 ham unix: Called from f0100168, fp=f0241790, args=f59245d8 1 f02417f4 f596f650 f590c180 f5974698
> Dec 15 08:25:47 ham unix: Called from f0101b48, fp=f02417f8, args=f027fb2c f596f650 3 f5927000 8 f5974690
> Dec 15 08:25:47 ham unix: Called from f00e7004, fp=f0241858, args=32 f0278a4c f027bf3c 960 f5970e10 f596f650
> Dec 15 08:25:48 ham unix: Called from f00e88cc, fp=f02418b8, args=32 70 f5927960 f024191c 32 f592796c
> Dec 15 08:25:48 ham unix: Called from f5a55a38, fp=f0241a38, args=f5a67b08 0 f5a67b08 3b 32 ffffffff
> Dec 15 08:25:48 ham unix: Called from f007c844, fp=f0241ae8, args=1 879af934 0 f026df20 4d85b4 f5a56ce8
> Dec 15 08:25:48 ham unix: Called from f008f5b8, fp=f0241b48, args=1 879af934 0 13d88 4d85b4 f028a40c
> Dec 15 08:25:48 ham unix: Called from f00b38ec, fp=f0241ba8, args=0 ffffffff 28 0 f005db94 0
> Dec 15 08:25:48 ham unix: Called from f00411c4, fp=f0241c10, args=4400fe0 4400fe0 f0240000 f0271aa0 f02719a0 f025dc68
> Dec 15 08:25:48 ham unix: Called from 105070, fp=0, args=121800 d6 6df0 ffffe000 98000 2000
> Dec 15 08:25:48 ham unix: End traceback...
> Dec 15 08:25:48 ham unix: panic: Data fault
> Dec 15 08:25:48 ham unix: syncing file systems... done
> 
> Perhaps I'm mistaken, but the BAD TRAP here looks very similar, especially
> this section:
> 
> Dec 15 08:25:47 ham unix: kernel read fault at addr=0x18, pme=0x0
> Dec 15 08:25:47 ham unix: MMU sfsr=126: Invalid Address on supv data fetch at level 1
> Dec 15 08:25:47 ham unix: pte addr = 0xf5979000, level = 1
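
For anyone who wants to check the similarity by hand, the two sfsr values
can be decoded bit by bit.  This is a minimal sketch that assumes the SPARC
V8 Reference MMU fault status layout as I understand it (OW in bit 0, FAV in
bit 1, FT in bits 4:2, AT in bits 7:5, L in bits 9:8) and that the kernel
prints mmu_fsr in hex.  The function name decode_sfsr is made up for
illustration.

```python
# Assumed field encodings from the SPARC V8 Reference MMU fault status
# register; verify against the manual before relying on them.
FAULT_TYPES = {
    0: "none",
    1: "invalid address",
    2: "protection",
    3: "privilege violation",
    4: "translation",
    5: "access bus",
    6: "internal",
    7: "reserved",
}

ACCESS_TYPES = {
    0: "load user data",
    1: "load supervisor data",
    2: "load/execute user instruction",
    3: "load/execute supervisor instruction",
    4: "store user data",
    5: "store supervisor data",
    6: "store user instruction",
    7: "store supervisor instruction",
}


def decode_sfsr(value):
    """Split a fault status value into its assumed bit fields."""
    return {
        "level": (value >> 8) & 0x3,                  # L, bits 9:8
        "access": ACCESS_TYPES[(value >> 5) & 0x7],   # AT, bits 7:5
        "fault": FAULT_TYPES[(value >> 2) & 0x7],     # FT, bits 4:2
        "fav": bool(value & 0x2),                     # fault address valid
        "overwrite": bool(value & 0x1),               # OW
    }


# The two values from the panics in this thread, parsed as hex.
for raw in ("164", "126"):
    print(raw, decode_sfsr(int(raw, 16)))
```

Under those assumptions both values decode to an invalid-address fault at
level 1 in supervisor mode, differing only in instruction fetch (164) versus
data load (126), which agrees with the text the kernel printed.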
> 
> I looked up the patch Jim Richey posted on the 10th (105181-17) and found
> that it is in fact for 5.6.  After surfing through all the related links,
> I could find nothing bearing on this issue.  I may have overlooked
> something (it _is_ 2:30 am), but I don't see anything pertinent.
> 
> Now, I'm not sure what all this means, but I was wondering if any of the
> rest of you have found similar situations.  Hopefully, one of the Code Gods
> (crosses fingers, hopes Casper Dik is watching) can give us a little
> insight on this.
> 
> My grep "0.02" dollar > $PAYCHECK for the day.
> 
> Hal
> 
> Hal Flynn, ICS Inc.        Senior Systems Analyst
> Defense   Information   Systems   Agency
> flynnh at mont.disa.mil    Commercial:  334-416-3233
> DSN:  596-3233