[SunHELP] Mysterious Reboot

DAUBIGNE Sebastien - BOR ( SDaubigne@bordeaux-bersol.sema.slb.com ) SDaubigne at bordeaux-bersol.sema.slb.com
Mon Jan 28 08:31:21 CST 2002


It seems you caught an external cache (e-cache) error (WP event, EDP event) on CPU 0.
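
The event name can be read back out of the AFSR value printed in the log. The following is only a rough sketch: the bit positions (WP at bit 23, EDP at bit 22, ETS in bits 19:16, P_SYND in bits 15:0) are an assumption about the UltraSPARC-I/II AFSR layout, not something the log states, although the values in the log are consistent with them. The helper names `parse_afsr` and `decode_afsr` are made up for this illustration.

```python
# Assumed AFSR bit positions (check the CPU manual before relying on them).
AFSR_FLAGS = {"WP": 1 << 23, "EDP": 1 << 22}


def parse_afsr(text):
    """Parse the 'high.low' hex notation used in the log, e.g. 0x00000000.00800010."""
    high, low = text[2:].split(".")
    return (int(high, 16) << 32) | int(low, 16)


def decode_afsr(value):
    """Split an AFSR value into the flags and fields assumed above."""
    return {
        "flags": [name for name, bit in AFSR_FLAGS.items() if value & bit],
        "ets": (value >> 16) & 0xF,   # assumed AFSR.ETS field
        "psynd": value & 0xFFFF,      # assumed AFSR.PSYND field
    }


# AFSR value taken from the quoted log.
decoded = decode_afsr(parse_afsr("0x00000000.00800010"))
print(decoded)
```

Under those assumptions the value decodes to a WP event with PSYND 0x0010 and ETS 0x00, which matches what the kernel printed next to it.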

The external cache of the Ultra-II CPUs was shipped with parity memory
instead of ECC memory.
Basically, parity memory can detect errors but cannot correct them, as ECC
does.
I don't know why Sun chose not to use ECC memory (cost?).
The kernel detects the error and tries to recover (at least with the latest
kernel patch).
If the kernel can't recover, it panics and reboots.
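
To illustrate the difference between parity and ECC, here is a toy sketch. It is not how the Ultra-II e-cache is actually organized: a single even-parity bit over four data bits stands in for parity memory, and a textbook Hamming(7,4) code stands in for ECC (real ECC memory uses wider codes). All function names are made up for the example.

```python
def parity_bit(bits):
    """Even parity: the extra bit makes the total number of 1s even."""
    return sum(bits) % 2


def hamming74_encode(d):
    """Encode 4 data bits as 7 bits, laid out p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(c):
    """Return (corrected codeword, 1-based position of the flipped bit, or 0)."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3
    fixed = list(c)
    if pos:
        fixed[pos - 1] ^= 1  # the syndrome points at the bad bit
    return fixed, pos


data = [1, 0, 1, 1]

# Parity: a single flipped bit is noticed, but nothing says which bit it was.
stored_parity = parity_bit(data)
damaged = [1, 0, 0, 1]
parity_detects = parity_bit(damaged) != stored_parity

# Hamming: the same kind of single-bit flip is located and repaired.
code = hamming74_encode(data)
corrupted = list(code)
corrupted[5] ^= 1
fixed, pos = hamming74_correct(corrupted)
print(parity_detects, pos, fixed == code)
```

With parity alone the only safe reaction to a mismatch is to discard the data or give up, which is why an uncorrectable e-cache error can end in a panic.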

We caught many (10 in one year) similar errors on our production server
(E6500, 18 CPUs).
Sun first replied that this kind of error is "usual" and that they replace
the CPU if, and only if, the same CPU catches an e-cache error several times.
We told Sun that 10 reboots in one year due to the same error is not "usual"
on a production server.
Finally, Sun offered to replace the CPUs with the latest Ultra-II CPUs, which
are shipped with ECC memory.
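
Since the replacement criterion is repeated errors on the same CPU, it helps to tally the events per CPU from the system log before calling support. A minimal sketch, assuming every event is logged in the same "[AFT1] WP event on CPU0" form as in the quoted message (other events may be formatted differently); `count_events` and the regular expression are made up for the example.

```python
import re
from collections import Counter

# Matches e.g. "[AFT1] WP event on CPU0" and captures the event name and CPU number.
AFT_EVENT = re.compile(r"\[AFT\d\]\s+(\w+) event on CPU(\d+)")


def count_events(lines):
    """Count e-cache events per (cpu, event) pair in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = AFT_EVENT.search(line)
        if match:
            event, cpu = match.group(1), int(match.group(2))
            counts[(cpu, event)] += 1
    return counts


# Two lines taken from the quoted log; only the first one announces an event.
sample = [
    "WARNING: [AFT1] WP event on CPU0, errID 0x002af4d7.2bd6e2f3",
    "Jan 25 09:08:23 magicsun unix: [ID 732233 kern.notice] [AFT1] errID "
    "0x002af4d7.2bd6e2f3 WP Error(s)",
]
counts = count_events(sample)
for (cpu, event), n in sorted(counts.items()):
    print(f"CPU{cpu} {event}: {n}")
```

On a real machine the lines would come from the syslog files (for example by passing an open file object to `count_events`), and a CPU that shows up several times is the one to ask Sun to replace.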

IMHO, Sun misdesigned the Ultra-II e-cache to save money and/or time and is
now reluctant to admit it.


---
Sebastien DAUBIGNE
sebastien.daubigne at sema.fr - (+33) (0)5.57.26.56.36
Sema Global Services - AFM/DW/Pessac

	-----Original Message-----
	From:	Markham, Richard [SMTP:RMarkham at hafeleamericas.com]
	Date:	Friday, January 25, 2002 19:31
	To:	Sunhelp (E-mail)
	Subject:	[SunHELP] Mysterious Reboot

	<snip>
	Jan 25 09:08:23 magicsun SUNW,UltraSPARC-IIi: [ID 419322 kern.warning] WARNING: [AFT1] WP event on CPU0, errID 0x002af4d7.2bd6e2f3
	Jan 25 09:08:23 magicsun     AFSR 0x00000000.00800010<WP> AFAR 0x000001fe.00000200
	Jan 25 09:08:23 magicsun     AFSR.PSYND 0x0010(Score 95) AFSR.ETS 0x00 Fault_PC 0x10093374
	Jan 25 09:08:23 magicsun     UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00
	Jan 25 09:08:23 magicsun unix: [ID 836849 kern.notice]
	Jan 25 09:08:23 magicsun ^Mpanic[cpu0]/thread=300009477c0:
	Jan 25 09:08:23 magicsun unix: [ID 732233 kern.notice] [AFT1] errID 0x002af4d7.2bd6e2f3 WP Error(s)
	<snip>

	A mysterious reboot occurred, and the cause is still unknown.  Can
	someone provide assistance on how to approach this, as this is a
	production database machine (U5)?
	_______________________________________________
	SunHELP maillist  -  SunHELP at sunhelp.org
	http://www.sunhelp.org/mailman/listinfo/sunhelp


