Patch Name: PHKL_22223

Patch Description: s700 10.20 WSIO SCSI cumulative patch

Creation Date: 00/08/17

Post Date:  00/08/25

Hardware Platforms - OS Releases:
	s700: 10.20

Products: N/A

Filesets:
	OS-Core.CORE-KRN ProgSupport.C-INC

Automatic Reboot?: Yes

Status: General Superseded

Critical:
	Yes
	PHKL_22223: PANIC OTHER
		ioscan fails to detect luns following a gap in the
		lun numbering.
	PHKL_21084: PANIC HANG
	PHKL_20686: HANG
	PHKL_19614: OTHER
		ioscan output is affected.
		LVM primary paths become unavailable.
		ServiceGuard Tocs node.
	PHKL_19130: PANIC
	PHKL_19097: PANIC
	PHKL_18917: HANG
	PHKL_18390: HANG
	PHKL_17467: HANG
	PHKL_16861: HANG
	PHKL_16926: PANIC

Path Name: /hp-ux_patches/s700/10.X/PHKL_22223

Symptoms:
	PHKL_22223:
	( SR: 8606136160 CR: JAGad05289 )
	The QUEUE FULL handling has caused performance problems at
	customer sites. Performance degradation will be seen during
	accesses to a disk after a QUEUE FULL condition has been
	received from the disk.
	QUEUE FULL conditions (a transient I/O error) are logged
	in syslog and can occur with any SCSI device on any HP-UX
	platform.

	( SR: 8606100450 CR: JAGab31847 )
	ioscan will not discover devices (LUNs) that follow a
	non-existent LUN on a given target.  I.e., ioscan will
	currently find only existent LUNs that are in sequential
	order starting from LUN 0.  Therefore, such "non-contiguous"
	LUNs will not be seen by ioscan and will not be accessible
	to the system.  This behaviour has been seen with an EMC
	disk array, but may exist with other devices.

	( SR: 8606146079 CR: JAGad15415 )
	A data page fault can occur because a NULL pointer is not
	checked correctly. Although no specific stack trace can
	be expected, this problem should be suspected when ONE of
	the following five functions appears near the top of the
	stack trace:
		c720_isrSelect()
		c720_isrDataDone()
		c720_isrExtMsgLenIn()
		c720_isrWdtrRespRcvd()
		c720_isrSdtrRespRcvd()

	PHKL_21862:
	( SR: 8606142756 CR: JAGad12108 )
	Any wide SCSI devices attached to the built-in narrow
	single-ended SCSI bus using a 50 pin to 68 pin cable will
	not function properly. The description shown by ioscan
	for the built-in narrow single-ended SCSI bus will
	incorrectly show the bus as "Wide".

	PHKL_21084:
	( SR: 5003462986 CR: JAGab17250 )
	A system with 2 ALT 8-series DLT (Quantum 4000) on the same
	card showed the following panic:
	panic: (display==0xb800, flags==0x0) Data page fault   1111
	The stack trace was:
	scsi_start+0x18
	scsi_retry+0xd8
	invoke_callouts+0x160
	softclock+0x38
	sw_service+0x154
	mp_ext_interrupt+0x2a0
	$RDB_int_patch+0x58
	mpn_splx_free_lock_ul4_brn_target+0x4
	net_callout+0x90
	netisr_netisr+0x1bc
	netisr_daemon+0x68

	( SR: 8606114227 CR: JAGac23205 )
	A PVlink switch does not occur when the bus is getting
	repeated resets. This has been seen on two system
	configurations with a shared bus that is reset regularly.
	The I/Os never time out and there is no switch to the
	alternate link.

	( SR: 8606105826 CR: JAGab74163 )
	System experienced Spinlock Deadlock panic.
	The panic string is:
	panic: (display==0xb800, flags==0x0) Spinlock deadlock!
	The stack shows the following:
	panic+0x10
	too_much_time+0x238
	wait_for_lock_spinner+0x2f4
	wait_for_lock_4way+0x2c
	mpn_splx_retry+0x24
	resume_cleanup+0x16c
	resume+0x280
	_swtch+0x138
	real_sleep+0x234
	_sleep+0x14
	unhashdaemon+0xa4
	main+0x59c

	( SR: 8606127843 CR: JAGac78644 )
	The SCSI retry policy introduced with PHKL_17467/8
	is causing a problem. This policy ensured that an
	INQUIRY command would be sent to the device after
	a disk becomes nonresponsive.
	The problem is that when the disk becomes responsive
	again, the inquiry will return the old data instead of
	the current information.

	( SR: 8606132417 CR: JAGad01566 )
	The SCSI log messages don't give the hardware path. To
	decode the device, we must translate the 'dev:' field.

	PHKL_20686:
	( SR: 8606105212 CR: JAGab73159 )
	SCSI hardware failure causes system hang with multiple
	processes waiting for I/O return. LVM expects either
	an I/O error or an EPOWERF (timeout) to continue.

	Multiple console messages are generated which read:

	SCSI:  Third party detected bus hang --
		lbolt:  xxxxxxxx, bus:  x

	PHKL_19787:
	DTS# JAGab72357 SR# 8606104808
	This patch provides new functionality to support HP
	VISUALIZE-fxe graphics.

	DTS# JAGab31968 SR# 8606100742
	A message "SCSI: C720_BMALLOC failed to allocate
	space in c720_DataSetup()" gets logged repeatedly in
	dmesg on a 735.

	PHKL_19614:
	DTS# JAGab70055 SR# 8606103392
	The command "ioscan -kfn" displays erroneous output
	with the patch PHKL_19131. A fast wide differential
	disk may be incorrectly reported as narrow single-
	ended under description column in the ioscan output.

	DTS# JAGab71209 SR# 8606103998
	Suppose a diskarray is hooked up via Fibre Channel with the
	primary paths through mux1 and the alternate paths through
	mux2.  If mux1 is powerfailed, LVM fails over to the
	alternate paths, but when mux1 is powered back up, the
	primary paths are still unavailable.

	DTS# JAGab41088  SR# 8606101027
	ServiceGuard package TOCs node because cluster lock ioctls
	take too long.

	PHKL_19130:
	DTS#: JAGaa42584 SR#: 1653281824
	System panics with "scsi unrecovered deferred error".

	DTS#: JAGaa44446 SR#: 8606101359
	Command-mode might stop working at any arbitrary time with
	respect to the application and device trying to use it.

	DTS#: JAGaa08513 SR#: 8606101473
	While doing an Inquiry command to request Unit Serial Number
	Page, one extra byte is transferred.  There are no real
	symptoms associated with this problem.

	PHKL_19097:
	System panics (Data page Fault) in scsi_start_bus_locked()

	PHKL_18917:
	LVM hangs due to I/O requests never being returned by the
	IO subsystem.  The message "Device violation of Contingent
	Allegiance" is issued to syslog.

	PHKL_18390:
	( SR: 1653300004 DTS: JAGaa47696, dup of JAGab11155)
	Slow PVlink failover after installing PHKL_17467.
	Diskinfo reports back on an unavailable disk.

	( SR: 1653300970 DTS: JAGab11365 )
	( SR: 1653290395 DTS: JAGaa47016 )
	A faulty disk can prevent the LVM mirroring from working.

	PHKL_17467:
	I/O failover hang on Fiber Channel PV_link.

	PHKL_16861:
	I/O failover hang on Fiber Channel PV_link.

	PHKL_17639:
	This patch enables new functionality that is part of the
	10.20 ACE (Additional Core Enhancements) Workstation
	bundle, which adds new I/O drivers to support the B1000,
	C3000, and J5000 systems.

	PHKL_16926:
	( SR: 5003434118 DTS: JAGaa23967 )
	System panics (Data Page Fault) in scsi_destroy_scb

	( SR: 5003429654 DTS: JAGaa40369 )
	System panics in c720_invalid_req_done

	( SR: 4701407890 DTS: JAGaa23080 )
	Unexpected Disconnect Messages when using pass through
	driver

Defect Description:
	PHKL_22223:
	( SR: 8606136160 CR: JAGad05289 )
	A QUEUE FULL condition is an error reported by a SCSI
	device which indicates that the device has reached the
	limit of the number of I/Os that it can process
	concurrently and that the rejected I/O request must be
	retried later. The defect that caused a performance
	problem was that we effectively turned the queue
	depth down to 1 and never raised it back again.

	Resolution:
	After a QUEUE FULL, we'll wait for any outstanding I/Os
	to complete and gradually increase the queue depth back
	up to the previous max queue depth in such a way as to
	minimize the likelihood of another immediate QUEUE FULL
	condition.
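
	The recovery described above can be sketched in C as follows.
	This is an illustration only: the structure qdepth_state, its
	fields, and the choice to raise the limit by one per completed
	I/O are assumptions, since the patch text does not say how the
	driver paces the increase.

```c
#include <assert.h>

/* Hypothetical per-LUN queue-depth state; not the driver's real structure. */
struct qdepth_state {
    int max_depth;    /* queue depth in effect before the QUEUE FULL */
    int cur_depth;    /* number of I/Os currently allowed outstanding */
    int outstanding;  /* I/Os sent to the device and not yet completed */
    int draining;     /* nonzero while waiting for outstanding I/Os */
};

/* Device reported QUEUE FULL: stop issuing new I/O until the queue drains. */
void on_queue_full(struct qdepth_state *q)
{
    q->draining = 1;
    q->cur_depth = 1;
}

/* An I/O completed: finish draining first, then raise the limit one step. */
void on_io_complete(struct qdepth_state *q)
{
    assert(q->outstanding > 0);
    q->outstanding--;
    if (q->draining) {
        if (q->outstanding == 0)
            q->draining = 0;
        return;
    }
    if (q->cur_depth < q->max_depth)
        q->cur_depth++;   /* gradual ramp, never above the previous max */
}

/* Returns 1 and accounts for the I/O if another one may be issued now. */
int try_issue(struct qdepth_state *q)
{
    if (q->draining || q->outstanding >= q->cur_depth)
        return 0;
    q->outstanding++;
    return 1;
}
```

	Raising the limit one step at a time is one way to reduce the
	chance of an immediate second QUEUE FULL; other pacing schemes
	would fit the description equally well.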

	( SR: 8606100450 CR: JAGab31847 )
	The ioscan algorithm comes from old SCSI-2 design
	paradigms which are no longer valid. In the old paradigm
	we only scanned each target until we found an invalid
	LUN, which caused us to miss seeing certain LUNs in
	the newer SCSI device paradigms.

	Resolution:
	scsi_probe() has been redesigned so that ioscan discovers
	all existing devices. No increase in ioscan times is
	expected.
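
	The difference between the two scanning paradigms can be
	sketched in C as follows. The callback type lun_probe_fn, the
	MAX_LUNS limit of 8, and the example_probe() device are
	assumptions made for illustration; they are not taken from
	scsi_probe().

```c
#include <assert.h>

#define MAX_LUNS 8   /* assumed LUN range per target for this sketch */

/* Hypothetical probe callback: returns nonzero if the LUN answers. */
typedef int (*lun_probe_fn)(int target, int lun);

/* Example device for the sketch: LUNs 0, 1, 3 and 5 exist on every target. */
int example_probe(int target, int lun)
{
    (void)target;
    return lun == 0 || lun == 1 || lun == 3 || lun == 5;
}

/* Old behaviour: stop at the first LUN that does not respond. */
int scan_until_gap(int target, lun_probe_fn probe, int found[MAX_LUNS])
{
    int lun, n = 0;
    for (lun = 0; lun < MAX_LUNS; lun++) {
        if (!probe(target, lun))
            break;
        found[n++] = lun;
    }
    return n;
}

/* New behaviour: probe every LUN, skipping the ones that do not respond. */
int scan_all(int target, lun_probe_fn probe, int found[MAX_LUNS])
{
    int lun, n = 0;
    for (lun = 0; lun < MAX_LUNS; lun++) {
        if (probe(target, lun))
            found[n++] = lun;
    }
    assert(n <= MAX_LUNS);
    return n;
}
```

	With the example device, scan_until_gap() reports only LUNs 0
	and 1, while scan_all() also reports LUNs 3 and 5.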

	( SR: 8606146079 CR: JAGad15415 )
	We did not check the value of a pointer before
	dereferencing it. When the pointer was NULL the system
	panicked. The pointer can be NULL for a variety of
	corner-case reasons in the operation of the driver,
	and thus checking for NULL should have been done and
	was not.

	Resolution:
	We now check the value of the pointer before dereferencing
	it. If NULL, we dump the contents of the SCSI I/O card
	registers to the syslog and continue processing.
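
	A minimal C sketch of this kind of guard is shown below. The
	structures card_state and io_request, the function
	isr_data_done(), and the counter that stands in for the syslog
	register dump are all hypothetical; the real c720 handlers are
	not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for driver structures; names are illustrative. */
struct io_request { int bytes_done; };
struct card_state {
    struct io_request *active;  /* may be NULL in corner cases */
    int register_dumps;         /* counts how often registers were logged */
};

/* Stand-in for writing the card registers to syslog. */
static void dump_registers(struct card_state *c)
{
    c->register_dumps++;
}

/*
 * Interrupt-style handler: check the pointer before dereferencing it.
 * Returns 0 if the request was updated, -1 if there was nothing to update.
 */
int isr_data_done(struct card_state *c, int bytes)
{
    assert(c != NULL);
    if (c->active == NULL) {
        dump_registers(c);   /* log diagnostic state and keep going */
        return -1;
    }
    c->active->bytes_done += bytes;
    return 0;
}
```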

	PHKL_21862:
	( SR: 8606142756 CR: JAGad12108 )
	The built-in narrow single-ended SCSI bus on workstation
	models J5600 and C3600 is incorrectly set up as a wide bus.

	Resolution:
	The model numbers J5600 and C3600 were added to conditionals
	in the c720_init() and c720_pci_attach() routines.

	PHKL_21084:
	( SR: 5003462986 CR: JAGab17250 )
	The panic occurs because the requested target pointer had
	been freed and the lun structure is no longer valid.
	scsi_retry() walks through the per-bus retry queue while
	holding only the per-bus lock. To restart an I/O it calls
	scsi_start(), which needs to take the bus lock itself (for
	lock-ordering reasons), so the bus lock is released first.
	While the bus lock is released, the state can change.

	Resolution:
	Instead of requeuing directly to the tag_q, scsi_retry()
	queues the timed-out requests to a temporary queue.
	This prevents them from being processed and started until
	the bus lock is re-acquired, at which point the requests
	are requeued from the temporary queue to the tag_q.
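
	The two-phase requeue can be sketched in C as follows. This is
	a single-threaded illustration under stated assumptions: the
	queue helpers, the struct names, and the "locked" flag that
	stands in for the real bus spinlock are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request and FIFO queue types for this sketch. */
struct req { int id; int timed_out; struct req *next; };
struct queue { struct req *head, *tail; };

static void q_push(struct queue *q, struct req *r)
{
    r->next = NULL;
    if (q->tail) q->tail->next = r; else q->head = r;
    q->tail = r;
}

static struct req *q_pop(struct queue *q)
{
    struct req *r = q->head;
    if (r) {
        q->head = r->next;
        if (!q->head) q->tail = NULL;
        r->next = NULL;
    }
    return r;
}

/* "locked" is a stand-in for the bus lock being held by the caller. */
struct bus { int locked; struct queue retry_q, tag_q; };

/* Phase 1 (bus lock held): park timed-out requests on a private queue. */
void collect_timed_out(struct bus *b, struct queue *tmp)
{
    struct queue keep = { NULL, NULL };
    struct req *r;
    assert(b->locked);
    while ((r = q_pop(&b->retry_q)) != NULL)
        q_push(r->timed_out ? tmp : &keep, r);
    b->retry_q = keep;   /* requests that have not timed out stay queued */
}

/* Phase 2 (bus lock re-acquired): only now make them startable. */
void requeue_to_tag_q(struct bus *b, struct queue *tmp)
{
    struct req *r;
    assert(b->locked);
    while ((r = q_pop(tmp)) != NULL)
        q_push(&b->tag_q, r);
}
```

	Because the parked requests are reachable only through the
	temporary queue, nothing else can start them while the bus
	lock is released between the two phases.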

	( SR: 8606114227 CR: JAGac23205 )
	The hang is due to infinite retries at the SCSI layer. The
	failure is never passed up to LVM, so LVM cannot time out
	the I/O and switch the link.

	Resolution:
	The fix was to change the callback routine from sd_retry()
	to sd_retry_check(), which checks the B_PFTIMEOUT flag and
	calls sd_nonresponse() if it is set instead of calling
	sd_retry().
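
	The decision made by the new callback can be sketched in C as
	follows. The flag value, the buf_sketch structure, and the
	retry_action enum are assumptions for illustration; only the
	rule itself (B_PFTIMEOUT set means report nonresponse instead
	of retrying) comes from the text above.

```c
#include <assert.h>

#define B_PFTIMEOUT 0x1   /* illustrative bit value, not the real definition */

enum retry_action { ACTION_RETRY, ACTION_NONRESPONSE };

/* Hypothetical stand-in for the buffer header carried by a request. */
struct buf_sketch { unsigned flags; };

/* Decide what the retry callback should do for this request. */
enum retry_action retry_check(const struct buf_sketch *bp)
{
    assert(bp != 0);
    if (bp->flags & B_PFTIMEOUT)
        return ACTION_NONRESPONSE;  /* pass the failure up so LVM can switch */
    return ACTION_RETRY;            /* otherwise keep retrying at this layer */
}
```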

	( SR: 8606105826 CR: JAGab74163 )
	The spinlock deadlock occurred because one processor owned
	the SPL lock that another processor wanted. A bus lock was
	held where the lun lock was wanted. The underlying cause is
	that the lun and bus locks were taken in a different order
	in two places in the code.

	Resolution:
	The locks were correctly reordered in sd_dump_queue().
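
	A consistent lock order can be checked mechanically, as in the
	C sketch below. The ranking used here (lun lock before bus
	lock) is an assumption chosen for the example; the patch text
	does not state which order the driver settled on. The
	lock_tracker type is hypothetical.

```c
#include <assert.h>

/* Assumed ranking for this sketch: the lun lock is taken before the bus lock. */
enum lock_rank { RANK_NONE = 0, RANK_LUN = 1, RANK_BUS = 2 };

/* Tracks the highest-ranked lock the current code path holds. */
struct lock_tracker { enum lock_rank highest_held; };

/* Returns 1 if taking `want` now respects the ordering, 0 if it inverts it. */
int may_acquire(const struct lock_tracker *t, enum lock_rank want)
{
    return want > t->highest_held;
}

/* Record an acquisition; callers must take locks in increasing rank. */
void acquire(struct lock_tracker *t, enum lock_rank want)
{
    assert(may_acquire(t, want));
    t->highest_held = want;
}

/* Record that every lock held by this code path has been dropped. */
void release_all(struct lock_tracker *t)
{
    t->highest_held = RANK_NONE;
}
```

	A code path that already holds the bus lock and then asks for
	the lun lock fails the may_acquire() check, which is the kind
	of inversion that produces a deadlock between two processors.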

	( SR: 8606127843 CR: JAGac78644 )
	The inquiry data is buffered, and the inquiry command was
	sent directly to the disk only under special circumstances,
	so stale data could be returned.

	Resolution:
	The inquiry command is always sent to the disk to get
	current data.

	( SR: 8606132417 CR: JAGad01566 )
	Change the scsi_dmesg_log_io function so that the hardware
	path is visible without decoding the 'dev:' field

	Resolution:
	Use the translation functions to add the hardware path
	to the logged information.

	PHKL_20686:
	( SR: 8606105212 CR: JAGab73159 )
	SCSI driver detects hardware failure and resets the bus.
	However, the reset operation cannot resolve the bus hang
	and the reset interrupt never occurs. Without a bus reset
	timeout, the processes hang waiting for I/O's queued for
	the bus.

	Resolution:
	Added code to timeout unsuccessful bus reset, abort the
	I/O and return an EPOWERF to the upper layer.
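
	The timeout check can be sketched in C as follows. The
	bus_reset structure, the tick-based clock, and the
	EPOWERF_SKETCH value are assumptions for illustration; the
	real EPOWERF errno value is specific to HP-UX and is not
	reproduced here.

```c
#include <assert.h>

#define EPOWERF_SKETCH 1000  /* placeholder for the HP-UX EPOWERF errno */

/* Hypothetical bookkeeping for one pending bus reset. */
struct bus_reset {
    long started_at;      /* tick at which the reset was issued */
    long timeout_ticks;   /* how long to wait for the reset interrupt */
    int  interrupt_seen;  /* set when the reset interrupt arrives */
};

/*
 * Called periodically while a reset is pending.
 * Returns 0 while still waiting or once the reset has completed, and
 * EPOWERF_SKETCH when the reset timed out and queued I/O must be aborted.
 */
int check_reset(const struct bus_reset *r, long now)
{
    assert(r->timeout_ticks > 0);
    if (r->interrupt_seen)
        return 0;
    if (now - r->started_at >= r->timeout_ticks)
        return EPOWERF_SKETCH;
    return 0;
}
```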

	PHKL_19787:
	New functionality to support HP VISUALIZE-fxe graphics.
	Resolution:
	Add new functionality.

	DTS# JAGab31968 SR# 8606100742
	A message "SCSI: C720_BMALLOC failed to allocate
	space in c720_DataSetup()" gets logged repeatedly in
	dmesg on a 735.
	Resolution:
	Made the logging of this diagnostic message conditional
	like other SCSI diagnostic messages.

	PHKL_19614:
	DTS# JAGab70055 SR#: 8606103392
	PHKL_19131 affects disk representation in ioscan output.
	This problem happens because A class system firmware returns
	4 words of info in response to a GET_INITIATOR PDC call
	instead of 6 words, and the 5th and 6th words are zeroed
	out.  The driver misinterpreted these zero values for the
	5th and 6th words and assumed that the firmware wanted it to
	configure the SCSI card as a Narrow Single Ended card.
	Resolution:
	The driver now expects 4 or 6 words from a call to
	GET_INITIATOR depending on the type of the machine.
	This has fixed the problem.

	DTS# JAGab71209 SR# 8606103998
	The root cause of the problem is bit collision between
	definition of the flags L_EPOWERF and L_DEFERRED_ERROR.
	Resolution:
	As a resolution the flag  L_DEFERRED_ERROR is redefined
	with a different value.
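
	The effect of a flag bit collision can be sketched in C as
	follows. The numeric values and the _SKETCH, _COLLIDING and
	_FIXED names are invented for illustration; the real
	definitions of L_EPOWERF and L_DEFERRED_ERROR are not given in
	this document.

```c
#include <assert.h>

/* Illustrative values only; the real definitions live in the driver. */
#define L_EPOWERF_SKETCH            0x0100u
#define L_DEFERRED_ERROR_COLLIDING  0x0100u  /* same bit: the defect */
#define L_DEFERRED_ERROR_FIXED      0x0200u  /* distinct bit: the fix */

/* Returns nonzero if two flag masks share any bit. */
int flags_collide(unsigned a, unsigned b)
{
    return (a & b) != 0;
}

/* Returns nonzero if a power-fail state appears to be set in `flags`. */
int powerfail_pending(unsigned flags)
{
    return (flags & L_EPOWERF_SKETCH) != 0;
}

/* Sanity check for a set of flag definitions: no two may share a bit. */
void check_distinct(const unsigned *masks, int n)
{
    int i, j;
    for (i = 0; i < n; i++)
        for (j = i + 1; j < n; j++)
            assert(!flags_collide(masks[i], masks[j]));
}
```

	With the colliding value, setting the deferred-error flag makes
	powerfail_pending() report a power failure that never
	happened; with the fixed value the two states are independent.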

	DTS# JAGab41088  SR# 8606101027
	ServiceGuard TOCs node because cluster lock ioctls take
	too long. This occurs with 10.20 of HP-UX and 10.10 of
	ServiceGuard.
	The root cause of the problem is that the sdisk driver
	responds with cached data when an SIOC_INQUIRY ioctl is
	issued for an open device.  This defeats LVM's attempt to
	determine if either the device or the path to it have
	failed before issuing IO requests to acquire the cluster
	lock.  When the IO request is subsequently issued, the
	improved error handling recently reintroduced to the sdisk
	driver now results in several retries being attempted before
	the error is reported back to LVM.  This does not allow LVM
	sufficient time to switch to the mirror disk or alternate
	path and successfully complete the operation before
	ServiceGuard TOCs the machine.
	Resolution:
	The resolution to this problem is to make all
	SCSI inquiries go out to the device rather than
	read the data cached in memory.

	PHKL_19130:
	DTS#: JAGaa42584 SR#: 1653281824
	If immediate reporting is enabled and a deferred error
	occurs, the system will panic with "scsi unrecovered
	deferred error".
	Resolution:
	The new deferred error check/handling method is to block
	all IO requests for the disk, when a deferred error occurs,
	until the device is closed and reopened.
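
	The blocking rule can be sketched in C as a small state
	machine. The disk_state structure and the function names are
	hypothetical, and clearing the error on the last close is an
	assumption about how "closed and reopened" is implemented.

```c
#include <assert.h>

/* Hypothetical per-disk state for this sketch. */
struct disk_state {
    int open_count;      /* number of current opens */
    int deferred_error;  /* set once a deferred error has been seen */
};

void disk_open(struct disk_state *d)
{
    d->open_count++;
}

/* Last close clears the error, so a later reopen starts clean. */
void disk_close(struct disk_state *d)
{
    assert(d->open_count > 0);
    d->open_count--;
    if (d->open_count == 0)
        d->deferred_error = 0;
}

/* Record a deferred error instead of panicking. */
void disk_deferred_error(struct disk_state *d)
{
    d->deferred_error = 1;
}

/* Returns 0 if the I/O may proceed, -1 if it is blocked. */
int disk_submit(const struct disk_state *d)
{
    if (d->open_count == 0 || d->deferred_error)
        return -1;
    return 0;
}
```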

	DTS#: JAGaa44446 SR#: 8606101359
	scsi_ctl replaces the cdevsw table entries for d_read and
	d_write when the lun is not in command-mode for performance
	improvements.  The problem is that the cdevsw table is a
	global resource and is not owned by a lun and command-mode
	might stop working at any arbitrary time.
	Resolution:
	Removed that code.

	DTS#: JAGaa08513 SR#: 8606101473
	The FC data length exceeds the maximum SCSI transfer length
	by 1 byte while performing an Inquiry for the Unit Serial
	Number Page.
	Resolution:
	Reduce the size of the scsi serial structure by 1.

	PHKL_19097:
	When sd_open() fails in scsi_lun_open, we goto recover_lck1
	which falls through to recover_lp.  recover_lp sets lp->ddsw
	to NULL but fails to set lp->scb_q_nonempty to NULL. This
	causes a data page fault panic in scsi_start_bus_locked().
	This might occur when an open() fails on a busy device.
	Resolution:
	Set lp->scb_q_nonempty to NULL in label recover_lp in
	scsi_lun_open().
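
	The shape of the fix can be sketched in C as follows. The
	lun_sketch structure, the void pointer field types, and the
	open_fails parameter are assumptions for illustration; only
	the field names ddsw and scb_q_nonempty and the label
	recover_lp come from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduced lun structure; real field types are not shown here. */
struct lun_sketch {
    void *ddsw;            /* driver switch entry for the open device */
    void *scb_q_nonempty;  /* must not outlive ddsw */
};

/* Returns 0 on success, -1 if the open failed and state was rolled back. */
int lun_open_sketch(struct lun_sketch *lp, int open_fails)
{
    static int dummy_driver, dummy_queue;  /* stand-ins for real objects */

    assert(lp != NULL);
    lp->ddsw = &dummy_driver;
    lp->scb_q_nonempty = &dummy_queue;

    if (open_fails)
        goto recover_lp;
    return 0;

recover_lp:
    lp->ddsw = NULL;
    lp->scb_q_nonempty = NULL;  /* the reset that was previously missing */
    return -1;
}
```

	Clearing both fields on the error path keeps later code from
	following a stale scb_q_nonempty value after ddsw has already
	been cleared.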

	PHKL_18917:
	When the message is issued (typically caused by a bus
	RESET during contingent allegiance condition (CAC)), the
	corresponding I/O request is then lost and never returned
	to the requestor, eventually causing a system hang.
	Resolution:
	When a bus RESET happens during a CAC, the c720 driver now
	ensures that all currently active I/O requests are posted
	as incomplete and scheduled to be retried.

	PHKL_18390:
	( SR: 1653300004 DTS: JAGaa47696, dup of JAGaa11155 )
	Slow PVlink failover or diskinfo reporting good disk status
	on an unavailable disk is due to the SCSI INQUIRY command
	returning cached data instead of sending the command down to
	the device.
	Resolution:
	We now ensure an INQUIRY command will be sent down to the
	device when the disk becomes nonresponsive.

	( SR: 1653300970 DTS: JAGab11365 )
	( SR: 1653290395 DTS: JAGaa47016 )
	A faulty disk may send a NOT_READY sense key to SCSI.  The
	current SCSI policy is to retry the request until the disk
	is ready.  This results in an I/O hang and prevents
	the LVM mirroring from working.
	Resolution:
	LVM-related NOT_READY requests will be treated as
	nonresponse from the disk and will therefore be failed back
	for LVM to handle.

	PHKL_17467:
	In a hardware configuration, mirrored disks can be accessed
	through primary/alternate Fiber Channel (FC) links. If the
	primary link and the alternate link of a disk of the
	mirrored pairs are down, the other disk should continue
	sending or receiving data.  The problem is that it fails to
	do so and causes an I/O hang.
	Resolution:
	This patch provides a fix for this hang problem. The SCSI
	layer will retry the FC requests as long as the PFTIMEOUT
	period has not expired and the request is recoverable.

	PHKL_16861:
	In a hardware configuration, mirrored disks can be accessed
	through primary/alternate Fiber Channel (FC) links. If the
	primary link and the alternate link of a disk of the
	mirrored pairs are down, the other disk should continue
	sending or receiving data.  The problem is that it fails to
	do so and causes an I/O hang.
	This patch will provide a temporary fix for this problem.
	In this fix, the SCSI layer will retry the FC request as
	long as the FC sets a flag to ask for retrying the
	request.

	PHKL_17639:
	New functionality to support the B1000, C3000, and J5000
	systems on HP-UX 10.20. New functionality adds new I/O
	drivers.
	Resolution:
	Add support for new SCSI hardware in the SCSI driver.

	PHKL_16926:
	( SR: 5003434118 DTS: JAGaa23967 )
	There is a race condition between scsi_lun_open and
	scsi_start_bus_locked.  This can be fixed by incrementing
	the in_use counter before releasing the lun lock, thereby
	ensuring the lun stays open.

	( SR: 5003429654 DTS: JAGaa40369 )
	In c720_invalid_req_done, we directly dereference scb->busp
	without assuring that this scb is a bus scb.  The busp
	pointer is NULL if the scb is a lun scb.  Thus, the fix is
	to add a check to see whether lsp->scb->busp is NULL and,
	if so, to obtain the busp from lsp->scb->lp->bus instead.
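
	The fallback can be sketched in C as follows. The reduced
	structures bus_s, lun_s and scb_s and the helper scb_bus() are
	hypothetical; only the field names busp, lp and bus follow the
	description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduced structures; only the fields named above are kept. */
struct bus_s { int id; };
struct lun_s { struct bus_s *bus; };
struct scb_s {
    struct bus_s *busp;  /* set for a bus scb, NULL for a lun scb */
    struct lun_s *lp;    /* set for a lun scb */
};

/* Return the bus for either kind of scb without dereferencing NULL. */
struct bus_s *scb_bus(const struct scb_s *scb)
{
    assert(scb != NULL);
    if (scb->busp != NULL)
        return scb->busp;
    assert(scb->lp != NULL);   /* a lun scb must carry its lun */
    return scb->lp->bus;
}
```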

	( SR: 4701407890 DTS: JAGaa23080 )
	When the pass through driver is used with the "inhibit
	Inquiry on open" option (see scsi_ctl(7)) on a device that
	is alone on its SCSI bus, and the device is repeatedly
	opened and closed to send only a single SCSI command, the
	bus is sometimes in the wrong state when the target device
	begins to transfer data.

SR:
	1653281824 1653290395 1653300004 1653300970 1653306654
	4701398263 4701407668 4701407890 4701414136 5003429654
	5003434118 5003462986 5003464297 8606100742 8606101027
	8606103392 8606103998 8606104808 8606105212 8606105826
	8606114227 8606127843 8606132417 8606136160 8606100450
	8606146079 8606142756

Patch Files:
	/usr/conf/lib/libhp-ux.a(scsi_c720.o)
	/usr/conf/lib/libhp-ux.a(scsi_ctl.o)
	/usr/conf/lib/libhp-ux.a(scsi_disk.o)
	/usr/include/sys/scsi_ctl.h

what(1) Output:
	/usr/conf/lib/libhp-ux.a(scsi_c720.o):
		scsi_c720.c   $Date: 2000/08/17 10:48:26 $ $Revision
			: 1.5.98.52 $ PATCH_10.20 (PHKL_22223)
		scsi_c720.c $Date: 2000/08/17 10:48:26 $ $Revision:
			1.5.98.52 $
	/usr/conf/lib/libhp-ux.a(scsi_ctl.o):
		scsi_ctl.c   $Date: 2000/08/17 09:15:20 $ $Revision:
			 1.9.98.47 $ PATCH_10.20 (PHKL_22223)
	/usr/conf/lib/libhp-ux.a(scsi_disk.o):
		scsi_disk.c   $Date: 2000/03/23 14:41:34 $ $Revision
			: 1.7.98.41 $ PATCH_10.20 (PHKL_21084)
	/usr/include/sys/scsi_ctl.h:
		scsi_ctl.h $Date: 2000/08/17 09:12:19 $ $Revision: 1
			.8.98.14 $ PATCH_10.20 (PHKL_22223)

cksum(1) Output:
	3487515870 98912 /usr/conf/lib/libhp-ux.a(scsi_c720.o)
	1117167513 69392 /usr/conf/lib/libhp-ux.a(scsi_ctl.o)
	48918485 20924 /usr/conf/lib/libhp-ux.a(scsi_disk.o)
	3520525129 52681 /usr/include/sys/scsi_ctl.h

Patch Conflicts: None

Patch Dependencies:
	s700: 10.20: PHKL_16750

Hardware Dependencies: None

Other Dependencies: None

Supersedes:
	PHKL_16861 PHKL_16926 PHKL_17467 PHKL_17639 PHKL_18390 PHKL_18917
	PHKL_19097 PHKL_19130 PHKL_19614 PHKL_19787 PHKL_20686 PHKL_21084
	PHKL_21862

Equivalent Patches:
	PHKL_22224:
	s800: 10.20

Patch Package Size: 310 KBytes

Installation Instructions:
	Please review all instructions and the Hewlett-Packard
	SupportLine User Guide or your Hewlett-Packard support terms
	and conditions for precautions, scope of license,
	restrictions, and limitation of liability and warranties,
	before installing this patch.
	------------------------------------------------------------
	1. Back up your system before installing a patch.

	2. Login as root.

	3. Copy the patch to the /tmp directory.

	4. Move to the /tmp directory and unshar the patch:

		cd /tmp
		sh PHKL_22223

	5a. For a standalone system, run swinstall to install the
	    patch:

		swinstall -x autoreboot=true -x match_target=true \
			-s /tmp/PHKL_22223.depot

	By default swinstall will archive the original software in
	/var/adm/sw/patch/PHKL_22223.  If you do not wish to retain a
	copy of the original software, you can create an empty file
	named /var/adm/sw/patch/PATCH_NOSAVE.

	WARNING: If this file exists when a patch is installed, the
	         patch cannot be deinstalled.  Please be careful
		 when using this feature.

	It is recommended that you move the PHKL_22223.text file to
	/var/adm/sw/patch for future reference.

	To put this patch on a magnetic tape and install from the
	tape drive, use the command:

		dd if=/tmp/PHKL_22223.depot of=/dev/rmt/0m bs=2k

Special Installation Instructions:
	This patch depends on base patch PHKL_16750.
	For successful installation, please ensure that PHKL_16750
	is in the same depot with this patch, or PHKL_16750  is
	already installed.