Patch Name: PHNE_19070

Patch Description: s700_800 10.20 R6.10.20 SNAplus2 Link cumulative patch

Creation Date: 99/06/30

Post Date: 99/09/30

Hardware Platforms - OS Releases:
    s700: 10.20
    s800: 10.20

Products:
    SNAplus2-Link R6.10.20

Filesets:
    SNAplus2-Link.SNAP2-LINK

Automatic Reboot?: Yes

Status: General Superseded

Critical:
    Yes
    PHNE_19070: HANG
    PHNE_17819: PANIC HANG
    PHNE_17405: HANG PANIC
    PHNE_16758: HANG

Path Name: /hp-ux_patches/s700_800/10.X/PHNE_19070

Symptoms:
    PHNE_19070:
    (1) 4701429407
    LAN performance is degraded when attempting to start an SDLC
    PSI card on a T600 system.

    (2) 4701425561
    R6.11.00 on a V-Class system: After several hours of APPC
    activity (about 10 incoming allocates per second), APPC TPs
    fail to load, with error messages 512-257(0-10) logged. In
    addition, a system panic has occurred while the user APPC
    application was terminated. Although these two problems are
    very different by nature, it has been determined that they
    are closely related due to internal mechanisms in SNAplus2
    in its communication via Streams putq messages. The stack
    trace for the panic was as follows:

        panic+0x14
        report_trap_or_int_and_panic+0x80
        trap+0xa8c
        nokgdb+0x8
        putq_owned+0x2a0
        putq+0x1c
        vba_track_putq+0x4c
        vpr_stream_output_msg+0x40c
        vpr_delete_entity+0x43c
        vpr_stream_close+0x1a8
        close_wrapper+0x6c
        csq_protect+0x120
        osr_pop_subr+0x220
        osr_close_subr+0x324
        hpstreams_close_int+0x314
        hpstreams_close+0x2c
        call_open_close+0x1f8
        closed+0xb0
        spec_close+0x54
        vn_close+0x48
        vno_close+0x20
        closef+0x64
        exit+0x324
        rexit+0x28
        syscall+0x480
        $syscallrtn+0x0

    (3) 4701425355
    While running reliability tests with the new ACC driver and
    SNAplus2, the lab hit a system hang a couple of times when
    using QLLC over X.25. (NB: ACC uses its own X.25 stack,
    which uses the nli2zcom module.) When examining the TC with
    q4, we found the system stuck at the same routine,
    sna_q_v0_get_rw_lock, in libsixp.a. Here is how the q4 stack
    looks.
        sna_q_v0_get_rw_lock+0xc8
        vql_stream_read_input+0xdc
        putnext+0x50
        N2Z_F_data_ind+0x38
        N2z_iev_pass_data_up+0x114
        N2z_ReadEvent_Recvd+0x209c
        Zc_putq+0x5c
        nacc0_receive_data+0x140

    (4) 1653299073
    After upgrading from R4.4 to R6.10.20 on a T600 system, the
    SDLC link could no longer be activated and LAN performance
    is severely degraded. When the SNA resources are activated,
    the following error messages are logged:

    -----------------------
    15:32:40 WET 19 mars 1999
    SDLC Message 768 - 107, Subcode: 0 - 11
    Log category: EXCEPTION Cause Type: External System: centurix
    SDLC write timer retry limit has been exceeded.
    DLC name = SDLC0
    Port name = SDLCP0
    Port number = 0x00000000
    Cause: An attempt to transmit a frame using an SDLC port has
    timed out. This may indicate a problem with the SDLC adapter
    or with the modem and cabling. The port is stopped.
    Action: Check the modem and communications link.
    -----------------------
    15:32:40 WET 19 mars 1999
    APPN Message 512 - 60, Subcode: 0 - 10
    Log category: PROBLEM Cause Type: SNA System: centurix
    An active link station has failed.
    Port name = SDLCP0
    LS name = SDLCL0
    Adjacent CP name = 0000000000000000000000000000000000
    Cause: An active link station has failed
    -----------------------
    15:33:29 WET 19 mars 1999
    SDLC Message 768 - 17, Subcode: 1 - 11
    Log category: EXCEPTION Cause Type: External System: centurix
    DSR was not active when activating port.
    Return code = 0x0003
    Cause: An error occurred on a port. The port is configured as
    Non-switched but DSR was not present.

    Also, syslog.log fills up with the message
    'lan3_process_read_completion: Received out of sequence'

    The impact of all the above is that the LAN card becomes so
    slow that the system is unusable. The only way to recover
    LAN traffic is to reboot the system without starting SNA at
    all.

    PHNE_17819:
    (1) 5003446971
    Data page fault panic in nbm_free_buffer while running simple
    SNA tests over two LAN interfaces between the three machines
    running SNAplus2.

    (2) 4701418707
    When using CPI-C without side information, outgoing attaches
    sometimes fail because validation has been unintentionally
    turned on.

    (3) 1653305805
    One processor on a two-processor box running R6 over 10.20
    hangs, which then causes cmcld to TOC the box to preserve
    system integrity. Top of stack for the hanging process is:

        FUNC                                PC
        v0_get_rw_lock+0xb8                 0.0x3cc4a8
        vpr_route_ips_on_route+0x40         0.0x4094e0
        vds_rcv_buffers_available+0x1a0     0.0x3e1720
        vds_receive_proc+0x674              0.0x3e47fc
        nba_dispatch_input+0x298            0.0x5af050
        nba_dispatch_process+0xa4           0.0x5af184
        nba_schedule_process+0x134          0.0x5af5ec
        nba_send_ips+0x308                  0.0x5afd3c

    (4) 1653293878
    Invokable TP fails to start, with the following error
    messages logged.

    ------------- 10:52:14 GMT 10 Feb 1999 ----------------
    NODE Message 16384 - 0, Subcode: 10 - 10
    Log category: EXCEPTION Cause Type: Internal System: LR1875
    Internal system error.
    Errno = 7
    Action: Provide support services with the audit and error log
    files, and trace files if available.
    ------------- 10:52:14 GMT 10 Feb 1999 ----------------
    APPN Message 512 - 257, Subcode: 0 - 10
    Log category: PROBLEM Cause Type: Config System: LR1875
    Dynamic load of TP failed.
    Sense code = 0x07000000
    LU alias = DFKC
    TP name = lr229bci

    PHNE_17405:
    (1) 5003441717
    The snaperrlog process can be left running when the SNAplus2
    daemon is not started. Attempting to restart the SNAplus2
    software using 'snap start' will then fail (because the
    snaperrlog process is still there from a previous run).

    (2) 4701413054
    System panic - Data Page Fault at
    nsm_process_record_from_ss+130

    (3) 4701399279
    The PSI firmware header is not recognized by the snapwhat
    command.

    (4) 1653289686
    If the user of a TN3270 (not E) client hits the clear key
    while TN Server is presenting an SSCP screen, the client
    will lock up.
    The host may respond with an error message.

    (5) 1653289603
    If the user of a TN3270 (not E) client hits the clear key
    while TN Server is presenting an SSCP screen, TN Server
    forwards the clear key to the host (sends an empty RU on the
    SSCP-LU session). The host may respond with an error message.

    PHNE_16758:
    (1) 4701405316
    Updated binaries required for patching the latest R6 release
    of SNAplus2.

    (2) 4701399527
    Assert errors are produced when the host sends a USSMSG10
    screen to an LU configured for LU6.2. The ASSERTs are in fact
    benign, and will cause no problems with the integrity of the
    system. The ASSERTs only occur when the USSMSG10 screen is
    segmented and greater than around 500 bytes in size.

    (3) 1653279703
    Assert errors are logged when an RJE workstation is started,
    due to an RTM request being received. This has no effect on
    operation other than bad entries in the error log.

    (4) 1653276543
    A 3270 session can hang, even if you stop and restart the
    snap3270 emulator. If you take a trace of the problem, you
    will see that a NOTIFY is not sent when the 3270 emulator is
    stopped or started.

    (5) 1653273979
    Enhancement to allow multiple PUs to be used on a secondary
    leased link. This means that if an SDLC port is connected to
    a leased line, you can have multiple LSs active over the port
    at the same time.

    (6) 1653267179
    An application can fail to start a remote LU6.2 transaction
    because an invalid user ID is specified on the Attach when
    AP_SAME is specified on the ALLOCATE verb.

    PHNE_15937:
    (1) 5003398354
    If you issue 'snapadmin define_local_lu' for an LU which is
    already defined, only the attach routing data and the
    description field are updated. All other parameters are
    ignored, but the snapadmin command does not indicate this and
    returns a successful return message.

    (2) 4701396374
    Node fails to start if a TN Client is configured with an
    unrecognised hostname.

    (3) 4701395459
    The following assert message is logged to the console, syslog
    file and sna log files.
        Assert ips->cont_size >= MU_CONT_SIZE from vtc.c

    (4) 4701392670
    In a client/server configuration, various ASSERT messages
    from the SLIM component are recorded in the sna.err log file,
    often followed by a crash of the SLIM. This causes generally
    unpredictable behaviour of the system when the master server
    goes down and is restarted.

    (5) 1653261669
    SDLC link shows DISABLED after restart of the SNAplus2 daemon

    PHNE_14392:
    (1) 4701386227
    Unable to start SDLC EISA cards

Defect Description:
    PHNE_19070:
    (1) 4701429407
    The problem is due to a code defect in the PSI driver when
    attempting to process multiple DMA transactions.

    Resolution:
    The fix processes one DMA transaction at a time instead of
    processing queued DMA transactions. The DMA transactions are
    still queued, but the DMA engine processes only one
    transaction at a time. It does not prefetch DMA transactions
    because we force it to stop and generate an interrupt after
    processing each transaction. When the driver gets the
    interrupt for the completion of a DMA transaction, it starts
    processing the next DMA transaction in the queue.

    (2) 4701425561
    The Streams/UX subsystem on HP-UX 11.0, unlike SVR4 streams,
    does not provide any form of locking when accessing a streams
    queue. Thus, on HP-UX it is not safe to perform a putq() to
    a stream from outside its context (i.e. from the put or
    service routine of another queue).

    Resolution:
    The streams call put() does contend for ownership of a given
    queue, because HP-UX guarantees that only a single put or
    service routine for a queue will be run at one time. Thus, to
    ensure the streams queues are protected, we modify the SNA
    code to:
    - issue put() rather than putq()
    - have the put routine for the streams queue issue the putq()
      to defer processing to the service routine.

    (3) 4701425355
    The problem is that the ACC stack is calling the QLLC put
    routine from interrupt context.
    The QLLC module is not designed to cope with this: all other
    drivers/stacks queue their messages in a simple service
    routine so they can be sent upstream outside interrupt
    context. We have nevertheless agreed that we will add queuing
    to our QLLC module so that it works with the new ACC X.25
    driver.

    Resolution:
    The fundamental fix is to move all the read-side processing
    in the QLLC module from the put() routine to the service()
    routine. In practice this only affects M_PROTO messages.
    Examination of the QLLC module code suggested that the code
    processing these messages in the put() routine could simply
    be removed, provided the messages were then queued to the
    service routine, because the service routine already has to
    handle them (via an FSM) in situations of buffer shortage.
    Empirical testing bore this out. So the fix actually
    simplifies the code by removing 'special case' processing for
    data messages when there are buffers available and no control
    messages queued in front of them.

    (4) 1653299073
    The problem is caused by the psi0 DMA transaction processing
    corrupting the lan3 data structures involved in DMA
    transactions.

    Resolution:
    Changed the handling of DMA transactions. The transactions
    are still queued, but the DMA engine processes only one
    transaction at a time. It does not prefetch DMA transactions
    because we force it to stop and generate an interrupt after
    processing a transaction. When the driver gets the interrupt
    for the completion of a DMA transaction, it starts processing
    the next DMA transaction in the queue.

    PHNE_17819:
    (1) 5003446971
    Panic caused by attempting to dereference a null pointer
    while examining the posted_list LQE in the nbm_info structure
    to see whether it is empty.

    Resolution:
    Add a boolean flag to nbm_info to say whether the posted list
    is empty or not.

    (2) 4701418707
    The outgoing attach was being sent with the password from the
    previously rejected incoming attach, causing a validation
    error.

    Resolution:
    Add code to copy the password from the START_TP signal into
    the tcp_ptr in nrmsttp.c

    (3) 1653305805
    From the stack we can see that this is a deadlock in the
    kernel during snap stop processing. We grab a write lock on
    vpr_entity_lock in vpr_stream_close(), which we hold across a
    number of calls, including the one to nba_term(). It is this
    lock we are trying to acquire in vpr_route_ips_on_route()
    near the top of the stack trace.

    Resolution:
    We don't actually need to hold the vpr_entity_lock around the
    call to nba_term() in vpr_stream_close(). So the fix is just
    to release it before that call and reacquire it afterwards.

    (4) 1653293878
    The TP is failing to start because the userid under which it
    is running has been misconfigured so that it can't retrieve
    its own group name. This may be due to a problem with local
    access to the group file or with running NIS (Network
    Information Service) to share user and group IDs across more
    than one machine. There are two reasons for the cryptic error
    logs recorded by SNAplus2:
    - Failure of the getpwuid() or getgrgid() system calls was
      not logged as an error message.
    - The VSM_AS_TP_FAILURE internal error code was not being put
      in the right part of the DLOAD_RSP_ERR message sent from
      the Service Manager to the APPC Stub. This meant that the
      APPC stub was misinterpreting it as an APPC sense code.

    Resolution:
    Fixing the root cause requires correctly configuring the
    Unix user/group under which the TP is to be run. However,
    changes to SNAplus2 have been made to improve the logging in
    this area as follows:

    In vpm_build_user_info() in vr/vpmu.c we add error logs for
    the cases where the getpwuid() or getgrgid() system calls
    fail. However, failure of these system calls leads to a path
    failure.
    So to make sure these new error logs actually reach the
    sna.err log file, we also modify vlm_user_write_log() in
    vdiag/vlmuser.c so that even if we fail to open a path we
    still attempt to send the datagram containing the log (in
    addition to attempting to write it locally).

    In the error reply arm of vsm_rcv_dload_confirm() in
    vr/vsmdload.c we put the error code in the dld_status field
    rather than the ld_sense_data field of the DLOAD_RSP_ERR
    message, because this is where the vas_datagrams() routine in
    the APPC Stub expects to find it.

    We also change the exception logged in
    vsm_rcv_dload_confirm() from the generic one, with its rather
    misleading reference to errno, to a new specific error.

    Texts of the new logs are in the vdiag/*.txt files.

    PHNE_17405:
    (1) 5003441717
    If the kernel initialisation fails, it is possible that the
    snaperrlog process could hang, waiting for a signal from the
    kernel which never arrives.

    Resolution:
    A code change has been made to ensure that, if the kernel
    initialisation fails, a failure notification is sent to the
    snaperrlog process so it can exit cleanly.

    (2) 4701413054
    Small timing window when there is an empty list of LULU
    control blocks when processing SSCP_INIT_SIGNAL_NEG_RSP ISP.

    Resolution:
    Code changed to check whether the LULU list is empty before
    trying to obtain the first element of it.

    (3) 4701399279
    The PSI f/w header string was not changed with the release of
    SNAplus2, as the f/w is common to both SNAplus and SNAplus2.

    Resolution:
    - a new what string for the NIO firmware
    - a new what string and a new compilation format for the
      EISA firmware
    The ']' character has been added at the beginning of each PSI
    firmware library header line so that the header can be
    recognized by the snapwhat command.

    (4) 1653289686
    The TN3270 client was locking up when the clear key was
    entered because TN Server was passing the clear command to
    the host instead of processing it locally (as is done in the
    Motif 3270 emulator, for example).

    Resolution:
    Code changed to add a check and special handling for the
    clear key at the beginning of the TN Server SSCP inbound MU
    processing.

    (5) 1653289603
    The TN3270 client was receiving SSCP data when the clear key
    was entered because TN Server was passing the clear command
    to the host instead of processing it locally (as is done in
    the Motif 3270 emulator, for example).

    Resolution:
    Code changed to add a check and special handling for the
    clear key at the beginning of the TN Server SSCP inbound MU
    processing.

    PHNE_16758:
    (1) 4701405316
    Updated binaries provided for combined patching of the latest
    R6 release, as documented in the SR text.

    (2) 4701399527
    A code change has been made to prevent Assert errors
    occurring when a large USSMSG10 is received for an LU6.2
    session. The maximum amount of data permissible on the SSCP
    screen has been increased to 2048 bytes, to ensure segmented
    data on the SSCP screen is handled correctly.

    (3) 1653279703
    Code change made to fix a problem with Assert errors being
    logged when an RJE workstation is started. Code changed to
    correct the ASSERT: it should only be produced if an
    application has opened the SSCP session and is listening for
    RTM requests (RJE does not do this, so it should not be
    logged as an error).

    (4) 1653276543
    Code change to prevent the 3270 session hang caused by a
    NOTIFY not being sent. The fix ensures that any pending
    NOTIFY requests are flushed from the CH queue in the APPN
    node when a CLOSE_SSCP message is received (indicating that
    the emulator has been stopped).

    (5) 1653273979
    Enhancement to allow multiple PUs to be used on a secondary
    leased link. This means that if an SDLC port is connected to
    a leased line, you can have multiple LSs active over the port
    at the same time.

    (6) 1653267179
    The problem is basically that if you specify AP_SAME on the
    ALLOCATE verb but did not configure user validation, then we
    will send a user ID consisting of 10 NULLs.
    A code change has been made, and the following behavior
    applies when AP_SAME is used:
    case 1: a TP on Unix invokes a remote TP: the outgoing
            Allocate will contain a userID subfield, set to the
            Unix user ID the TP is running under;
    case 2: a TP on Unix invokes several remote TPs: see case 1;
    case 3: multiple conversations, where an INVOKED TP issues an
            ALLOCATE: in that case, the outgoing Allocate will
            include the same level of validation which was on the
            ATTACH that invoked that TP.

    PHNE_15937:
    (1) 5003398354
    Code changed to ensure that if the user specifies any other
    parameters, they match those used on the initial define.
    An error code is returned otherwise.

    (2) 4701396374
    Code changed to allow the node to start if it finds a TN
    Client configured with an unrecognised hostname. An error
    log is generated to tell the user of the failure.

    (3) 4701395459
    This was an incorrect ASSERT which has been removed. It is a
    benign problem, but produces annoying error logs and console
    messages.

    (4) 4701392670
    The LAN logger component (which handles central logging)
    incorrectly registered itself with the service manager as a
    server. This means that a server could end up twice in the
    service table (for example, once as a backup, then again as a
    master server). This led to extremely unpredictable and
    unreliable client/server operation. Code change made to
    prevent this incorrect registering.

    (5) 1653261669
    Send a signal to the host when firmware is ready (backplane
    and frontplane are initialized).
    Remove debug trace from msgbuf (opt1:).

    PHNE_14392:
    (1) 4701386227
    Fixed problems in SDLC driver and firmware.

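    For PHNE_17819 (4) above, the underlying cause is a TP user
    whose passwd or group entry cannot be resolved. A quick check
    before starting an invokable TP might look like the sketch
    below. It assumes a POSIX id(1) that returns non-zero when a
    name cannot be resolved; check_tp_user is a hypothetical
    helper name and is not part of SNAplus2.

```shell
#!/bin/sh
# check_tp_user is a hypothetical helper, not part of SNAplus2.
# It succeeds only if both the user name and the user's primary group
# name can be resolved, which roughly mirrors the getpwuid() and
# getgrgid() lookups mentioned in the defect description.
check_tp_user() {
    user=$1

    # id -un is expected to fail if the user is unknown to the
    # passwd database (local file or NIS).
    id -un "$user" >/dev/null 2>&1 || {
        echo "user '$user' cannot be resolved" >&2
        return 1
    }

    # id -gn is expected to fail if the primary GID has no entry
    # in the group database.
    group=$(id -gn "$user" 2>/dev/null) || {
        echo "primary group of '$user' cannot be resolved" >&2
        return 1
    }

    echo "$user: primary group $group"
}
```

    Running check_tp_user with the user name configured for the
    TP should print the group name; a non-zero return suggests
    the user/group configuration needs attention first.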
SR:
    5003446971 5003441717 5003398354 4701429407 4701425561
    4701425355 4701418707 4701413054 4701405316 4701399527
    4701399279 4701396374 4701395459 4701392670 4701386227
    1653305805 1653299073 1653293878 1653289686 1653289603
    1653279703 1653276543 1653273979 1653267179 1653261669

Patch Files:
    /opt/sna/conf/lib/libpsi0.a
    /opt/sna/conf/lib/libpsi1.a
    /opt/sna/conf/lib/libsixd.a
    /opt/sna/conf/lib/libsixl.a
    /opt/sna/conf/lib/libsixp.a
    /opt/sna/conf/lib/libsixs.a
    /opt/sna/sdlc.dlf
    /opt/sna/sdlc.pbs
    /opt/sna/bin/snaptnsrvr

what(1) Output:
    /opt/sna/bin/snaptnsrvr:
        HP92453-02A.10.00 HP-UX SYMBOLIC DEBUGGER (END.O)
            $Revision: 74.03 $
        ]R6.10.20.101 SNAplus2 R6 TN Server ]
            (PHNE_17405 : 99/01/11 17:27:05) ]
    /opt/sna/conf/lib/libpsi0.a:
        ]R6.10.20.101 SNAplus2 R6 NIO PSI driver ]
            (PHNE_19070 : 99/06/23 12:35:10) ]
    /opt/sna/conf/lib/libpsi1.a:
        ]R6.10.20.100 SNAplus2 R6 EISA PSI driver ]
            (10.20.R6 (DART 41): 98/07/22 10:36:54) ]
    /opt/sna/conf/lib/libsixd.a:
        ]R6.10.20.100 SNAplus2 R6 NDLC to DLPI Mapping ]
            (10.20.R6: 98/08/17 14:16:43) ]
    /opt/sna/conf/lib/libsixl.a:
        ]R6.10.20.102 SNAplus2 R6 SDLC in the Kernel ]
            (PHNE_17819 : 99/02/09 10:27:50) ]
    /opt/sna/conf/lib/libsixp.a:
        ]R6.10.20.101 SNAplus2 R6 QLLC Module ]
            (PHNE_19070 : 99/05/12 10:57:43) ]
    /opt/sna/conf/lib/libsixs.a:
        ]R6.10.20.112 SNAplus2 R6 Router in the kernel ]
            (PHNE_19070 : 99/06/09 10:26:50) ]
        ]R6.10.20.107 SNAplus2 R6 APPN kernel library routines ]
            (PHNE_19070 : 99/03/16 17:43:19) ]
    /opt/sna/sdlc.dlf:
        ]SNAplus2 EISA FW v2.5 ](99/01/07 15:26:09)
    /opt/sna/sdlc.pbs:
        ]SNAplus2 NIO FW v2.1 ](98/11/13 11:58:22)

cksum(1) Output:
    2390782579 204416 /opt/sna/bin/snaptnsrvr
    1537383816 63064 /opt/sna/conf/lib/libpsi0.a
    3528364603 46800 /opt/sna/conf/lib/libpsi1.a
    2674665038 172144 /opt/sna/conf/lib/libsixd.a
    2006556386 360448 /opt/sna/conf/lib/libsixl.a
    1631164544 142452 /opt/sna/conf/lib/libsixp.a
    3131759759 3024164 /opt/sna/conf/lib/libsixs.a
    3269251168 105244 /opt/sna/sdlc.dlf
    3918812582 172212 /opt/sna/sdlc.pbs

Patch Conflicts: None

Patch Dependencies: None

Hardware Dependencies: None

Other Dependencies: None

Supersedes:
    PHNE_14392 PHNE_15937 PHNE_16758 PHNE_17405 PHNE_17819

Equivalent Patches: None

Patch Package Size: 4260 KBytes

Installation Instructions:
    Please review all instructions and the Hewlett-Packard
    SupportLine User Guide or your Hewlett-Packard support terms
    and conditions for precautions, scope of license,
    restrictions, and limitation of liability and warranties,
    before installing this patch.
    ------------------------------------------------------------
    1. Back up your system before installing a patch.

    2. Login as root.

    3. Copy the patch to the /tmp directory.

    4. Move to the /tmp directory and unshar the patch:

        cd /tmp
        sh PHNE_19070

    5a. For a standalone system, run swinstall to install the
        patch:

        swinstall -x autoreboot=true -x match_target=true \
            -s /tmp/PHNE_19070.depot

    By default swinstall will archive the original software in
    /var/adm/sw/patch/PHNE_19070. If you do not wish to retain a
    copy of the original software, you can create an empty file
    named /var/adm/sw/patch/PATCH_NOSAVE.

    WARNING: If this file exists when a patch is installed, the
             patch cannot be deinstalled. Please be careful when
             using this feature.

    It is recommended that you move the PHNE_19070.text file to
    /var/adm/sw/patch for future reference.

    To put this patch on a magnetic tape and install from the
    tape drive, use the command:

        dd if=/tmp/PHNE_19070.depot of=/dev/rmt/0m bs=2k

Special Installation Instructions:
    Stop the SNA daemon before installing the patch (snap stop).
    After installing the patch, start the SNA daemon (snap start).
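
    After installation, the values in the cksum(1) Output field
    can be used to confirm that the patched files are in place.
    The following is a minimal sketch, assuming a POSIX shell and
    cksum(1). The function names verify_cksum and
    check_patch_files are illustrative and are not shipped with
    the patch; the expected values are copied from this patch
    text.

```shell
#!/bin/sh
# verify_cksum and check_patch_files are hypothetical helpers,
# not shipped with the patch.

# Usage: verify_cksum EXPECTED_CRC EXPECTED_SIZE FILE
# Succeeds only when cksum(1) reports the expected CRC and byte count.
verify_cksum() {
    expected_crc=$1
    expected_size=$2
    file=$3

    [ -r "$file" ] || return 1
    # Reading from stdin makes cksum print only "<crc> <size>".
    actual=$(cksum < "$file") || return 1
    set -- $actual
    [ "$1" = "$expected_crc" ] && [ "$2" = "$expected_size" ]
}

# Checks every file listed in the cksum(1) Output field of this patch.
# Prints one line per file and returns non-zero if any file differs.
check_patch_files() {
    status=0
    while read -r crc size path; do
        if verify_cksum "$crc" "$size" "$path"; then
            echo "OK       $path"
        else
            echo "MISMATCH $path"
            status=1
        fi
    done <<'EOF'
2390782579 204416 /opt/sna/bin/snaptnsrvr
1537383816 63064 /opt/sna/conf/lib/libpsi0.a
3528364603 46800 /opt/sna/conf/lib/libpsi1.a
2674665038 172144 /opt/sna/conf/lib/libsixd.a
2006556386 360448 /opt/sna/conf/lib/libsixl.a
1631164544 142452 /opt/sna/conf/lib/libsixp.a
3131759759 3024164 /opt/sna/conf/lib/libsixs.a
3269251168 105244 /opt/sna/sdlc.dlf
3918812582 172212 /opt/sna/sdlc.pbs
EOF
    return $status
}
```

    Calling check_patch_files once swinstall has completed lists
    each file as OK or MISMATCH. A mismatch may simply mean that
    a later patch has replaced the file, so it is a prompt to
    investigate rather than proof of a failed install.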