Patch Name: PHKL_22223

Patch Description: s700 10.20 WSIO SCSI cumulative patch

Creation Date: 00/08/17

Post Date: 00/08/25

Hardware Platforms - OS Releases:
    s700: 10.20

Products: N/A

Filesets:
    OS-Core.CORE-KRN ProgSupport.C-INC

Automatic Reboot?: Yes

Status: General Superseded

Critical:
    Yes
    PHKL_22223: PANIC OTHER
        ioscan fails to detect luns following a gap in the
        lun numbering.
    PHKL_21084: PANIC HANG
    PHKL_20686: HANG
    PHKL_19614: OTHER
        ioscan output is affected.
        LVM primary paths become unavailable.
        ServiceGuard TOCs node.
    PHKL_19130: PANIC
    PHKL_19097: PANIC
    PHKL_18917: HANG
    PHKL_18390: HANG
    PHKL_17467: HANG
    PHKL_16861: HANG
    PHKL_16926: PANIC

Path Name: /hp-ux_patches/s700/10.X/PHKL_22223

Symptoms:
    PHKL_22223:
    ( SR: 8606136160 CR: JAGad05289 )
    The QUEUE FULL handling has caused performance problems at
    customer sites. Performance degradation will be seen during
    accesses to a disk after a QUEUE FULL condition has been
    received from the disk. QUEUE FULL conditions (a transient I/O
    error) are logged in syslog and can occur with any SCSI device
    on any HP-UX platform.

    ( SR: 8606100450 CR: JAGab31847 )
    ioscan will not discover devices (LUNs) that follow a
    non-existent LUN on a given target. I.e., ioscan currently finds
    only existent LUNs that are in sequential order starting from
    LUN 0. Therefore, such "non-contiguous" LUNs will not be seen by
    ioscan and will not be accessible to the system. This behaviour
    has been seen with an EMC disk array, but may exist with other
    devices.

    ( SR: 8606146079 CR: JAGad15415 )
    A data page fault can occur due to a NULL pointer not correctly
    checked.
    Although no specific stack trace can be expected, ONE of the
    following five functions should appear near the top of the
    stack trace to suspect that this specific problem has been hit:
        c720_isrSelect()
        c720_isrDataDone()
        c720_isrExtMsgLenIn()
        c720_isrWdtrRespRcvd()
        c720_isrSdtrRespRcvd()

    PHKL_21862:
    ( SR: 8606142756 CR: JAGad12108 )
    Any wide SCSI devices attached to the built-in narrow
    single-ended SCSI bus using a 50 pin to 68 pin cable will not
    function properly. The description shown by ioscan for the
    built-in narrow single-ended SCSI bus will incorrectly show the
    bus as "Wide".

    PHKL_21084:
    ( SR: 5003462986 CR: JAGab17250 )
    A system with 2 ALT 8-series DLT (Quantum 4000) on the same
    card showed the following panic:
        panic: (display==0xb800, flags==0x0) Data page fault 1111
    The stack trace was:
        scsi_start+0x18
        scsi_retry+0xd8
        invoke_callouts+0x160
        softclock+0x38
        sw_service+0x154
        mp_ext_interrupt+0x2a0
        $RDB_int_patch+0x58
        mpn_splx_free_lock_ul4_brn_target+0x4
        net_callout+0x90
        netisr_netisr+0x1bc
        netisr_daemon+0x68

    ( SR: 8606114227 CR: JAGac23205 )
    PVlink switch does not occur when the bus is getting repeated
    resets. This has been seen on two system configurations with a
    shared bus which is getting reset regularly. The I/Os never
    time out and there is no switch to the alternate link.

    ( SR: 8606105826 CR: JAGab74163 )
    System experienced Spinlock Deadlock panic. The panic string is:
        panic: (display==0xb800, flags==0x0) Spinlock deadlock!
    The stack shows the following:
        panic+0x10
        too_much_time+0x238
        wait_for_lock_spinner+0x2f4
        wait_for_lock_4way+0x2c
        mpn_splx_retry+0x24
        resume_cleanup+0x16c
        resume+0x280
        _swtch+0x138
        real_sleep+0x234
        _sleep+0x14
        unhashdaemon+0xa4
        main+0x59c

    ( SR: 8606127843 CR: JAGac78644 )
    The SCSI retry policy introduced with PHKL_17467/8 is causing a
    problem. This policy ensured that an INQUIRY command would be
    sent to the device after a disk becomes nonresponsive.
    The problem is that when the disk becomes responsive again, the
    inquiry will return the old data instead of the current
    information.

    ( SR: 8606132417 CR: JAGad01566 )
    The SCSI log messages don't give the hardware path. To decode
    the device, we must translate the 'dev:' field.

    PHKL_20686:
    ( SR: 8606105212 CR: JAGab73159 )
    SCSI hardware failure causes system hang with multiple
    processes waiting for I/O return. LVM expects either an I/O
    error or an EPOWERF (timeout) to continue. Multiple console
    messages are generated which read:
        SCSI: Third party detected bus hang -- lbolt: xxxxxxxx, bus: x

    PHKL_19787:
    DTS# JAGab72357 SR# 8606104808
    This patch provides new functionality to support HP
    VISUALIZE-fxe graphics.

    DTS# JAGab31968 SR# 8606100742
    A message "SCSI: C720_BMALLOC failed to allocate space in
    c720_DataSetup()" gets logged repeatedly in dmesg on a 735.

    PHKL_19614:
    DTS# JAGab70055 SR# 8606103392
    The command "ioscan -kfn" displays erroneous output with the
    patch PHKL_19131. A fast wide differential disk may be
    incorrectly reported as narrow single-ended under the
    description column in the ioscan output.

    DTS# JAGab71209 SR# 8606103998
    Suppose a disk array is hooked up via Fibre Channel with the
    primary paths through mux1 and the alternate paths through
    mux2. If mux1 is powerfailed, LVM fails over to the alternate
    paths, but when mux1 is powered back up, the primary paths are
    still unavailable.

    DTS# JAGab41088 SR# 8606101027
    ServiceGuard package TOCs node because cluster lock ioctls take
    too long.

    PHKL_19130:
    DTS#: JAGaa42584 SR#: 1653281824
    System panic with "scsi unrecovered deferred error".

    DTS#: JAGaa44446 SR#: 8606101359
    Command-mode might stop working at any arbitrary time with
    respect to the application and device trying to use it.

    DTS#: JAGaa08513 SR#: 8606101473
    While doing an Inquiry command to request the Unit Serial Number
    Page, one extra byte is transferred. There are no real symptoms
    associated with this problem.

    PHKL_19097:
    System panics (Data page Fault) in scsi_start_bus_locked()

    PHKL_18917:
    LVM hangs due to I/O requests never being returned by the IO
    subsystem. The message "Device violation of Contingent
    Allegiance" is issued to syslog.

    PHKL_18390:
    ( SR: 1653300004 DTS: JAGaa47696, dup of JAGab11155 )
    Slow PVlink failover after installing PHKL_17467.
    Diskinfo reports back on an unavailable disk.

    ( SR: 1653300970 DTS: JAGab11365 )
    ( SR: 1653290395 DTS: JAGaa47016 )
    A faulty disk can prevent the LVM mirroring from working.

    PHKL_17467:
    I/O failover hang on Fiber Channel PV_link.

    PHKL_16861:
    I/O failover hang on Fiber Channel PV_link.

    PHKL_17639:
    This patch enables new functionality that is part of the 10.20
    ACE (Additional Core Enhancements) Workstation bundle, which
    adds new I/O drivers to support the B1000, C3000, and J5000
    systems.

    PHKL_16926:
    ( SR: 5003434118 DTS: JAGaa23967 )
    System panics (Data Page Fault) in scsi_destroy_scb

    ( SR: 5003429654 DTS: JAGaa40369 )
    System panics in c720_invalid_req_done

    ( SR: 4701407890 DTS: JAGaa23080 )
    Unexpected Disconnect Messages when using pass through driver

Defect Description:
    PHKL_22223:
    ( SR: 8606136160 CR: JAGad05289 )
    A QUEUE FULL condition is an error reported by a SCSI device
    which indicates that the device has reached the limit of the
    number of I/Os that it can process concurrently and that the
    rejected I/O request must be retried later. The defect that
    caused a performance problem was that we effectively turned the
    queue depth down to 1 and never raised it back again.
    Resolution:
    After a QUEUE FULL, we'll wait for any outstanding I/Os to
    complete and gradually increase the queue depth back up to the
    previous max queue depth in such a way as to minimize the
    likelihood of another immediate QUEUE FULL condition.

    ( SR: 8606100450 CR: JAGab31847 )
    The ioscan algorithm comes from old SCSI-2 design paradigms
    which are no longer valid.
    In the old paradigm we only scanned each target until we found
    an invalid LUN, which caused us to miss seeing certain LUNs in
    the newer SCSI device paradigms.
    Resolution:
    A redesign of scsi_probe() has been done which allows ioscan to
    discover all existing devices. ioscan times are not expected to
    increase.

    ( SR: 8606146079 CR: JAGad15415 )
    We did not check the value of a pointer before dereferencing
    it. When the pointer was NULL the system panicked. The pointer
    can be NULL for a variety of corner-case reasons in the
    operation of the driver, and thus checking for NULL should have
    been done and was not.
    Resolution:
    We now check the value of the pointer before dereferencing it.
    If NULL, we dump the contents of the SCSI I/O card registers to
    the syslog and continue processing.

    PHKL_21862:
    ( SR: 8606142756 CR: JAGad12108 )
    The built-in narrow single-ended SCSI bus on workstation models
    J5600 and C3600 is incorrectly set up as a wide bus.
    Resolution:
    The model numbers J5600 and C3600 were added to conditionals in
    the c720_init() and c720_pci_attach() routines.

    PHKL_21084:
    ( SR: 5003462986 CR: JAGab17250 )
    The panic occurs because the requested target pointer had been
    freed and the lun structure is no longer valid. scsi_retry()
    walks through the per-bus retry queue. During its work, only
    the per-bus lock is held. For the restart of I/O, scsi_start()
    is used, which needs to hold the bus lock (for ordering
    reasons), which is therefore released. But during the time that
    the bus lock is released the state can change.
    Resolution:
    Instead of requeuing directly to the tag_q, scsi_retry() queues
    the timed-out requests to a temporary queue. This prevents them
    from being processed and started until the bus lock is
    re-acquired, at which point the requests are requeued from the
    temporary queue to the tag_q.

    ( SR: 8606114227 CR: JAGac23205 )
    The hang is due to infinite retries at the SCSI layer: the
    request is never passed back to LVM, so LVM cannot time out and
    switch the link.
    Resolution:
    The fix was to change the callback routine from sd_retry() to
    sd_retry_check(), which checks the B_PFTIMEOUT flag and calls
    sd_nonresponse() if it is set instead of calling sd_retry().

    ( SR: 8606105826 CR: JAGab74163 )
    The spinlock deadlock occurred because one processor owns the
    SPL lock which some other processor wants. The problem is due
    to a bus lock given instead of the lun lock wanted. The real
    problem is that the lun and bus locks were locked in a
    different order in two places of the code.
    Resolution:
    The locks were correctly reordered in sd_dump_queue().

    ( SR: 8606127843 CR: JAGac78644 )
    The inquiry data are buffered and the inquiry command was sent
    directly to the disk only under special circumstances, causing
    invalid data to be returned.
    Resolution:
    The inquiry command is always sent to the disk to get current
    data.

    ( SR: 8606132417 CR: JAGad01566 )
    Change the scsi_dmesg_log_io function so that the hardware path
    is visible without decoding the 'dev:' field.
    Resolution:
    Use the translation functions to add the hardware path to the
    logged information.

    PHKL_20686:
    ( SR: 8606105212 CR: JAGab73159 )
    SCSI driver detects hardware failure and resets the bus.
    However, the reset operation cannot resolve the bus hang and
    the reset interrupt never occurs. Without a bus reset timeout,
    the processes hang waiting for I/Os queued for the bus.
    Resolution:
    Added code to time out an unsuccessful bus reset, abort the I/O
    and return an EPOWERF to the upper layer.

    PHKL_19787:
    New functionality to support HP VISUALIZE-fxe graphics.
    Resolution:
    Add new functionality.

    DTS# JAGab31968 SR# 8606100742
    A message "SCSI: C720_BMALLOC failed to allocate space in
    c720_DataSetup()" gets logged repeatedly in dmesg on a 735.
    Resolution:
    Made the logging of this diagnostic message conditional like
    other SCSI diagnostic messages.

    PHKL_19614:
    DTS# JAGab70055 SR# 8606103392
    PHKL_19131 affects disk representation in ioscan output.
    This problem happens because A class system firmware returns 4
    words of info in response to a GET_INITIATOR PDC call instead
    of 6 words, and the 5th and 6th words are zeroed out. The
    driver misinterpreted these zero values for the 5th and 6th
    words and assumed that the firmware wanted it to configure the
    SCSI card as a Narrow Single Ended card.
    Resolution:
    The driver now expects 4 or 6 words from a call to
    GET_INITIATOR depending on the type of the machine. This has
    fixed the problem.

    DTS# JAGab71209 SR# 8606103998
    The root cause of the problem is a bit collision between the
    definitions of the flags L_EPOWERF and L_DEFERRED_ERROR.
    Resolution:
    The flag L_DEFERRED_ERROR is redefined with a different value.

    DTS# JAGab41088 SR# 8606101027
    ServiceGuard TOCs node because cluster lock ioctls take too
    long. This occurs with 10.20 of HP-UX and 10.10 of
    ServiceGuard. The root cause of the problem is that the sdisk
    driver responds with cached data when an SIOC_INQUIRY ioctl is
    issued for an open device. This defeats LVM's attempt to
    determine if either the device or the path to it has failed
    before issuing IO requests to acquire the cluster lock. When
    the IO request is subsequently issued, the improved error
    handling recently reintroduced to the sdisk driver now results
    in several retries being attempted before the error is reported
    back to LVM. This does not allow LVM sufficient time to switch
    to the mirror disk or alternate path and successfully complete
    the operation before ServiceGuard TOCs the machine.
    Resolution:
    The resolution to this problem is to make all SCSI inquiries go
    out to the device rather than read the data cached in memory.

    PHKL_19130:
    DTS#: JAGaa42584 SR#: 1653281824
    If immediate reporting is enabled and a deferred error occurs,
    the system will panic with "scsi unrecovered deferred error".
    Resolution:
    The new deferred error check/handling method is to block all IO
    requests for the disk, when a deferred error occurs, until the
    device is closed and reopened.

    DTS#: JAGaa44446 SR#: 8606101359
    scsi_ctl replaces the cdevsw table entries for d_read and
    d_write when the lun is not in command-mode for performance
    improvements. The problem is that the cdevsw table is a global
    resource and is not owned by a lun, so command-mode might stop
    working at any arbitrary time.
    Resolution:
    Removed that code.

    DTS#: JAGaa08513 SR#: 8606101473
    The FC data length exceeds the maximum SCSI transfer length by
    1 byte while performing a Unit Serial Number Page inquiry.
    Resolution:
    Reduce the size of the scsi serial structure by 1.

    PHKL_19097:
    When sd_open() fails in scsi_lun_open, we goto recover_lck1
    which falls through to recover_lp. recover_lp sets lp->ddsw to
    NULL but fails to set lp->scb_q_nonempty to NULL. This causes a
    data page fault panic in scsi_start_bus_locked(). This might
    occur when an open() fails on a busy device.
    Resolution:
    Set lp->scb_q_nonempty to NULL in label recover_lp in
    scsi_lun_open().

    PHKL_18917:
    When the message is issued (typically caused by a bus RESET
    during a contingent allegiance condition (CAC)), the
    corresponding I/O request is then lost and never returned to
    the requestor, eventually causing a system hang.
    Resolution:
    When a bus RESET happens during a CAC, the c720 driver now
    ensures that all currently active I/O requests are posted as
    incomplete and scheduled to be retried.

    PHKL_18390:
    ( SR: 1653300004 DTS: JAGaa47696, dup of JAGaa11155 )
    Slow PVlink failover or diskinfo reporting good disk status on
    an unavailable disk is due to the SCSI INQUIRY command
    returning cached data instead of sending the command down to
    the device.
    Resolution:
    We now ensure an INQUIRY command will be sent down to the
    device when the disk becomes nonresponsive.
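
    The cached-INQUIRY behaviour described for PHKL_18390 can be
    illustrated with a small user-space model. This is only a sketch
    and not the sdisk driver code: struct lun_state, lun_inquiry(),
    fake_send_inquiry() and fake_device_present are hypothetical
    names invented for the illustration. It shows the policy stated
    in the resolution: cached data is returned only while the disk
    is believed healthy, and once the disk is marked nonresponsive
    the INQUIRY goes to the device, so a dead disk is reported as
    dead. (Later patches in this notice go further and send every
    inquiry to the device.)

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define INQ_LEN 36              /* minimum standard INQUIRY response size */

/* Hypothetical per-LUN state; none of these names come from the driver. */
struct lun_state {
    char cached[INQ_LEN];       /* last INQUIRY data read from the device */
    bool cache_valid;
    bool nonresponsive;         /* set when the disk has stopped answering */
    int  device_commands;       /* INQUIRY commands actually sent */
};

/* Simulated device: answers only while fake_device_present is true. */
static bool fake_device_present = true;

static int fake_send_inquiry(char *buf, size_t len)
{
    if (!fake_device_present)
        return -1;
    memset(buf, 0, len);
    memcpy(buf, "DISK", 4);
    return 0;
}

/* Return 0 and fill out[] on success, -1 if the device did not answer. */
static int lun_inquiry(struct lun_state *lp, char out[INQ_LEN])
{
    /* Serve from the cache only while the disk is known to be healthy. */
    if (lp->cache_valid && !lp->nonresponsive) {
        memcpy(out, lp->cached, INQ_LEN);
        return 0;
    }

    /* Otherwise go to the device so that a dead disk is reported as dead. */
    lp->device_commands++;
    if (fake_send_inquiry(out, INQ_LEN) != 0) {
        lp->cache_valid = false;
        return -1;
    }
    memcpy(lp->cached, out, INQ_LEN);
    lp->cache_valid = true;
    lp->nonresponsive = false;
    return 0;
}
```

    In this model a second inquiry on a healthy disk is served from
    the cache, while an inquiry issued after the disk has been
    marked nonresponsive reaches the (simulated) device and fails,
    which is what lets an upper layer such as LVM fail over
    promptly.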

    ( SR: 1653300970 DTS: JAGab11365 )
    ( SR: 1653290395 DTS: JAGaa47016 )
    A faulty disk may send a NOT_READY sense key to SCSI. The
    current SCSI policy is to retry the request until the disk is
    ready. This results in a hung I/O situation and prevents the
    LVM mirroring from working.
    Resolution:
    LVM-related NOT_READY requests will be treated as nonresponse
    from the disk and will therefore be failed back for LVM to
    handle.

    PHKL_17467:
    In a hardware configuration, mirrored disks can be accessed
    through primary/alternate Fiber Channel (FC) links. If the
    primary link and the alternate link of a disk of the mirrored
    pairs are down, the other disk should continue sending or
    receiving data. The problem is that it fails to do so and
    causes an I/O hang.
    Resolution:
    This patch provides a fix for this hang problem. The SCSI layer
    will retry the FC requests as long as the PFTIMEOUT period has
    not expired and the request is recoverable.

    PHKL_16861:
    In a hardware configuration, mirrored disks can be accessed
    through primary/alternate Fiber Channel (FC) links. If the
    primary link and the alternate link of a disk of the mirrored
    pairs are down, the other disk should continue sending or
    receiving data. The problem is that it fails to do so and
    causes an I/O hang. This patch will provide a temporary fix for
    this problem. In this fix, the SCSI layer will retry the FC
    request as long as the FC sets a flag to ask for retrying the
    request.

    PHKL_17639:
    New functionality to support the B1000, C3000, and J5000
    systems on HP-UX 10.20. New functionality adds new I/O drivers.
    Resolution:
    Add support for new SCSI hardware in the SCSI driver.

    PHKL_16926:
    ( SR: 5003434118 DTS: JAGaa23967 )
    There is a race condition between scsi_lun_open and
    scsi_start_bus_locked. This can be fixed by incrementing the
    in_use counter before releasing the lun lock, thereby ensuring
    that the lun stays open.
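
    The in_use fix above follows a common pattern: take the
    reference while the lock is still held, so that no other path
    can observe a zero count and tear the object down in the window
    after the lock is dropped. The sketch below models this in user
    space with a pthread mutex standing in for the lun lock. Apart
    from the in_use counter named in the defect description, every
    name here (struct lun, state_valid, lun_open(), lun_close(),
    lun_can_start()) is an assumption made for the illustration and
    is not taken from the driver source.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-LUN structure. */
struct lun {
    pthread_mutex_t lock;   /* stands in for the lun lock */
    int  in_use;            /* number of current opens */
    bool state_valid;       /* false once the LUN state has been torn down */
};

static void lun_open(struct lun *lp)
{
    pthread_mutex_lock(&lp->lock);
    if (lp->in_use == 0)
        lp->state_valid = true;   /* first open sets up the LUN state */
    lp->in_use++;                 /* take the reference BEFORE unlocking */
    pthread_mutex_unlock(&lp->lock);
    /* The slow part of open would run here without the lock; the
     * reference taken above keeps a concurrent close from tearing
     * the LUN down in the meantime. */
}

static void lun_close(struct lun *lp)
{
    pthread_mutex_lock(&lp->lock);
    assert(lp->in_use > 0);
    if (--lp->in_use == 0)
        lp->state_valid = false;  /* last close tears the state down */
    pthread_mutex_unlock(&lp->lock);
}

/* The I/O start path only touches the LUN while it is referenced. */
static bool lun_can_start(struct lun *lp)
{
    bool ok;

    pthread_mutex_lock(&lp->lock);
    ok = lp->in_use > 0 && lp->state_valid;
    pthread_mutex_unlock(&lp->lock);
    return ok;
}
```

    With two opens outstanding, one close leaves the LUN usable;
    only the final close invalidates it. Had the counter been
    incremented after the unlock, a close running in that gap could
    have seen in_use == 0 and released state that the open path was
    about to use.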

    ( SR: 5003429654 DTS: JAGaa40369 )
    In c720_invalid_req_done, we directly dereference scb->busp
    without assuring that this scb is a bus scb. The busp pointer
    is NULL if the scb is a lun scb. Thus, the fix is to add a
    check to see whether lsp->scb->busp is NULL and, if so, obtain
    the busp from lsp->scb->lp->bus instead.

    ( SR: 4701407890 DTS: JAGaa23080 )
    When using the pass through driver with the "inhibit Inquiry on
    open" option (see scsi_ctl(7)) and a device on a SCSI bus with
    no other devices, and repeatedly opening and closing the device
    to send but a single SCSI command, the bus is sometimes in the
    wrong state when the target device begins to transfer data.

SR:
    1653281824 1653290395 1653300004 1653300970 1653306654
    4701398263 4701407668 4701407890 4701414136 5003429654
    5003434118 5003462986 5003464297 8606100742 8606101027
    8606103392 8606103998 8606104808 8606105212 8606105826
    8606114227 8606127843 8606132417 8606136160 8606100450
    8606146079 8606142756

Patch Files:
    /usr/conf/lib/libhp-ux.a(scsi_c720.o)
    /usr/conf/lib/libhp-ux.a(scsi_ctl.o)
    /usr/conf/lib/libhp-ux.a(scsi_disk.o)
    /usr/include/sys/scsi_ctl.h

what(1) Output:
    /usr/conf/lib/libhp-ux.a(scsi_c720.o):
        scsi_c720.c $Date: 2000/08/17 10:48:26 $ $Revision: 1.5.98.52 $ PATCH_10.20 (PHKL_22223)
        scsi_c720.c $Date: 2000/08/17 10:48:26 $ $Revision: 1.5.98.52 $
    /usr/conf/lib/libhp-ux.a(scsi_ctl.o):
        scsi_ctl.c $Date: 2000/08/17 09:15:20 $ $Revision: 1.9.98.47 $ PATCH_10.20 (PHKL_22223)
    /usr/conf/lib/libhp-ux.a(scsi_disk.o):
        scsi_disk.c $Date: 2000/03/23 14:41:34 $ $Revision: 1.7.98.41 $ PATCH_10.20 (PHKL_21084)
    /usr/include/sys/scsi_ctl.h:
        scsi_ctl.h $Date: 2000/08/17 09:12:19 $ $Revision: 1.8.98.14 $ PATCH_10.20 (PHKL_22223)

cksum(1) Output:
    3487515870 98912 /usr/conf/lib/libhp-ux.a(scsi_c720.o)
    1117167513 69392 /usr/conf/lib/libhp-ux.a(scsi_ctl.o)
    48918485 20924 /usr/conf/lib/libhp-ux.a(scsi_disk.o)
    3520525129 52681 /usr/include/sys/scsi_ctl.h

Patch Conflicts: None

Patch Dependencies:
    s700: 10.20: PHKL_16750

Hardware Dependencies: None

Other Dependencies: None

Supersedes:
    PHKL_16861 PHKL_16926 PHKL_17467 PHKL_17639 PHKL_18390
    PHKL_18917 PHKL_19097 PHKL_19130 PHKL_19614 PHKL_19787
    PHKL_20686 PHKL_21084 PHKL_21862

Equivalent Patches:
    PHKL_22224: s800: 10.20

Patch Package Size: 310 KBytes

Installation Instructions:
    Please review all instructions and the Hewlett-Packard
    SupportLine User Guide or your Hewlett-Packard support terms
    and conditions for precautions, scope of license,
    restrictions, and limitation of liability and warranties,
    before installing this patch.
    ------------------------------------------------------------
    1. Back up your system before installing a patch.

    2. Login as root.

    3. Copy the patch to the /tmp directory.

    4. Move to the /tmp directory and unshar the patch:

            cd /tmp
            sh PHKL_22223

    5a. For a standalone system, run swinstall to install the
        patch:

            swinstall -x autoreboot=true -x match_target=true \
                -s /tmp/PHKL_22223.depot

    By default swinstall will archive the original software in
    /var/adm/sw/patch/PHKL_22223. If you do not wish to retain a
    copy of the original software, you can create an empty file
    named /var/adm/sw/patch/PATCH_NOSAVE.

    WARNING: If this file exists when a patch is installed, the
             patch cannot be deinstalled. Please be careful
             when using this feature.

    It is recommended that you move the PHKL_22223.text file to
    /var/adm/sw/patch for future reference.

    To put this patch on a magnetic tape and install from the
    tape drive, use the command:

            dd if=/tmp/PHKL_22223.depot of=/dev/rmt/0m bs=2k

Special Installation Instructions:
    This patch depends on base patch PHKL_16750. For successful
    installation, please ensure that PHKL_16750 is in the same
    depot with this patch, or PHKL_16750 is already installed.