Patch Name: PHNE_20738
Patch Description: s700_800 10.20 R6.10.20 SNAplus2 Link cumulative patch
Creation Date: 99/12/22
Post Date: 00/04/26
Hardware Platforms - OS Releases:
    s700: 10.20
    s800: 10.20
Products: SNAplus2-Link R6.10.20
Filesets: SNAplus2-Link.SNAP2-LINK
Automatic Reboot?: Yes
Status: General Superseded
Critical: No (superseded patches were critical)
    PHNE_20211: PANIC
    PHNE_19407: HANG PANIC
    PHNE_19070: HANG
    PHNE_17819: PANIC HANG
    PHNE_17405: HANG PANIC
    PHNE_16758: HANG
Path Name: /hp-ux_patches/s700_800/10.X/PHNE_20738

Symptoms:

PHNE_20738:

(1) JAGab83920/8606111815
Customer's QLLC link fails to come up due to error in the XID exchange. The sna.err log shows XID 3 error during negotiation.

(2) JAGab74185/8606105839
An LUA application receives an UNBIND indication from an established LU-LU session. It responds with RSP.UNBIND and encounters the following error:
    primary return code: LUA_STATE_CHECK (0x0200)
    secondary return code: LUA_NO_RUI_SESSION(0x81000000)

(3) JAGab65399/4701429621
Dependent APPC LU does not get BIND from VTAM after the link is re-established following an outage.

(4) JAGab25510/5003467290
Token Ring connection hangs when frames exceeding the largest frame size returned in RIF are dropped by the network.

PHNE_20211:

(1) JAGab79003/8606108556
K580 system panics on reboot with the following error messages:
    System Panic:
    9245XB HP-UX (B.10.20) #1: Sun Jun 9 06:31:19 PDT 1996
    panic: (display==0xb800, flags==0x0) nio_initialize : l_io_vec.iov_base == NULL
    PC-Offset Stack Trace (read across, most recent is 1st):
    0x00276f44 0x003fdab0 0x00404968
    0x001297c4 0x00240ae8 0x002a7ea8
    0x002a6ae4 0x00215088 0x002147e0
    0x00295368 0x00295444 0x00295444
    0x00295444 0x002951c0 0x00295d20
    0x002f316c 0x000c7014 0x00183960
    End Of Stack
    It was not possible for the kernel to find a process that caused this crash.
    Dumpsys() called

(2) JAGab75469/8606106415
When starting an SDLC link the system panics with the following stack trace:
    stack trace for event 0
    crash event was a panic
    FUNC
    panic+0x10
    report_trap_or_int_and_panic+0xe8
    trap+0xa48
    $RDB_trap_patch+0x20
    sna_sdlc_vba_ips_putb+0x18
    sdl_send_set_mode_frame+0x138
    sdl_psecs+0xf88
    sdl_prtmg+0x106c
    sdl_wrxfr+0x5a8
    sdl_receive_proc+0x80
    sna_sdlc_nba_dispatch_input+0x254
    sna_sdlc_nba_dispatch_process+0xa4
    sna_sdlc_nba_scheduler+0x11c
    vsi_stream_uw_service+0x3e0
    sq_wrapper+0xb8
    str_sched_mp_daemon+0x114
    str_sched_daemon+0x1f4
    main+0x97c
    $vstart+0x34

(3) JAGab73055/8606105164
While running SNAplus2 R6 on HP-UX 11 a link failure occurred. The sna.err log file recorded message 4096 - 13 (ASSERT: File name = ../../p/vhlwr/vhshmod.c ). This caused the link to go down for one minute and disrupted the current file transfer. Following the ASSERT, the SDLC driver tries to do ten consecutive svphrx() calls. The last 9 calls fail. The SDLC driver then calls svphclos() to stop the link.

(4) JAGab71519/8606104163
This system panic occurred on a H20 running 10.20 with 128MB.
    panic+0x10
    report_trap_or_int_and_panic+0xe8
    trap+0xa48
    $call_trap+0x20
    nsm_delete_session_id+0x104
    nsm_cleanup_lu_lu_session+0x7b8
    nsm_fsm_status+0x1d4c
    nsm_process_deactivate_session+0x140
    nsm_process_record_from_rm+0x1a8
    nsm_queue_handler+0x68
    nba_dispatch_process+0x114
    nba_scheduler+0x208
    vpr_stream_uw_svc+0x58c
    sq_wrapper+0xb8
    str_sched_up_daemon+0x2b0
    str_sched_daemon+0xf4
    main+0x974
    $vstart+0x34

PHNE_19407:

(1) JAGab71689
The following message is noted periodically in sna.err file:
    10:07:29 EDT 09 Aug 1999 4096-5(0-0) P (UH2010D3) PID 3733 (snaperrlog)Log parameter mismatch. Message number = 512 - 137
The message 512 - 137 then follows with a question mark for the sense code.

(2) JAGab71162
Symptoms of problem: After installing patch, PHNE_19527, and selecting Diagnostics, node tracing, sdlc level 2 tracing in xsnapadmin, activated the node.
Then, ls went into starting status, then retry pending. Several seconds later, the system panicked. The following stack was produced when the machine crashed:
    panic+0x14
    report_trap_or_int_and_panic+0x4c
    trap+0xef4
    $RDB_trap_patch+0x38
    sdl_reset_port+0x13c
    sdl_poll_port+0x3ec
    sdl_hms_ctl_proc+0xb4
    sdl_receive_proc+0x14c
    sna_sdlc_nba_dispatch_input+0x254
    sna_sdlc_nba_dispatch_process+0xa4
    sna_sdlc_nba_scheduler+0x11c
    sna_sdlc_vsi_stream_uw_service+0x3e0
    sq_wrapper+0x90
    str_sched_up_daemon+0x440
    str_sched_daemon+0x1f0
    main+0x6e0
    $vstart+0x34
    $locore+0x90
The following asserts were also seen in the console log:
    WARNING: SNA ASSERT: 17:10:45 25 AUG 1999
    File: ../../p/vsdlc/sdlcsigi.c Line: 1495
    Condition: pcb->resetting == FALSE

    WARNING: SNA ASSERT: 17:11:15 25 AUG 1999
    File: ../../p/vsdlc/sdlcsigi.c Line: 1375
    Condition: pcb->alert != NULL

(3) JAGab68385
Code inspection has shown that the SNAplus2 Kernel drivers do not have the MGR_IS_MP flag set on the streams_info structure. This has not caused problems but may affect performance on MP systems.

(4) JAGab65528
The customer has been running SNAplus2 version R6.11.00 with PHNE_17229 for about a month with no problems. Today, after about 20 3270 users become active the system hangs. Needed to do a TOC to recover. The following stack trace yielded from TOC:
    nba_get_q_head+0x0
    nch_sscp_receive+0x9b8
    nba_dispatch_input+0x298
    nba_dispatch_process+0xa4
    nba_scheduler+0x1b0
    vpr_stream_lr_svc+0x134
    sq_wrapper+0x90
    str_sched_up_daemon+0x440
    str_sched_daemon+0x1f0
    main+0x6e4

(5) JAGab65397
SNAplus2 R6.10.20 SDLC link fails to start if DSR is not present when link station is started. Only recovery is to reboot the system and ensure that modem signals are present before starting the link. Loss of DSR causes the same problem. The following errors are logged:
    SDLC Message 768 - 17, Subcode: 1 - 11
    Log category: EXCEPTION
    Cause Type: External
    System: dqserv1
    DSR was not active when activating port.
    Return code = 0x0003
    Cause: An error occurred on a port. The port is configured as Non-switched but DSR was not present.
    Action: Check whether the configuration setting of non-switched is correct. If the port is correctly configured, check the modem and link hardware. Check other messages for further diagnostics.

    SDLC Message 768 - 94, Subcode: 0 - 11
    Log category: EXCEPTION
    Cause Type: External
    System: dqserv1
    SDLC port device driver reported an error.
    DLC name = SDLC0
    Port name = mapport
    Port number = 0x00000000
    Return code = 0x0003
    Detailed return code = 0x0020
    Cause: An error occurred on an SDLC port. The detailed return code provides more information on the error, as follows:
        0x0020 DSR failure
        0x0021 General hardware failure
        0x0022 Modem power off
        0x0023 CTS failure between frames (4 wire)
        0x0024 CD failure between frames (4 wire)
    Action: Check the detailed return code shown; check for previous exception messages providing more information about the failure.

(6) JAGab65303
The customer has problems printing a long file from a host. Gets the following error:
    ---------------- 19:34:03 SAT 21 May 1999------------------
    APPN Message 512 - 452, Subcode: 0 - 10
    Log category: EXCEPTION
    Cause Type: SNA
    System: cnbv
    LU type 0, 1, 2, or 3 session ended abnormally - protocol error.
    Sense code = 0x10020000
    LFSID = 018E0000
    Cause: An LU type 0, 1, 2, or 3 session ended abnormally (with the sense code shown) because of a protocol error.
    Action: Contact network support personnel with details of the problem.
The sense code 0x10020000 means 'RU Length Error'. The problem was not seen when similar files were printed with SNAplus.

PHNE_19070:

(1) 4701429407
Lan performance degraded when attempting to start an SDLC psi card on a T600 system.

(2) 4701425561
R6.11.00 on a V-Class system: After several hours of APPC activity, (about 10 incoming allocates per second), APPC TP's fail to load, with error messages 512-257(0-10) logged.
In addition, a system panic occurred when the user APPC application was terminated. Although these two problems are very different by nature, it has been determined that they are closely related due to internal mechanisms in SNAplus2 in its communication via Streams putq messages. The stack trace for the panic was as follows:
    panic+0x14
    report_trap_or_int_and_panic+0x80
    trap+0xa8c
    nokgdb+0x8
    putq_owned+0x2a0
    putq+0x1c
    vba_track_putq+0x4c
    vpr_stream_output_msg+0x40c
    vpr_delete_entity+0x43c
    vpr_stream_close+0x1a8
    close_wrapper+0x6c
    csq_protect+0x120
    osr_pop_subr+0x220
    osr_close_subr+0x324
    hpstreams_close_int+0x314
    hpstreams_close+0x2c
    call_open_close+0x1f8
    closed+0xb0
    spec_close+0x54
    vn_close+0x48
    vno_close+0x20
    closef+0x64
    exit+0x324
    rexit+0x28
    syscall+0x480
    $syscallrtn+0x0

(3) 4701425355
While running reliability tests with new ACC driver and SNAplus2 the lab hit a system hang a couple of times when using qllc over X25. (NB: ACC use their own X.25 stack which uses nli2zcom module.) When examining the TC with q4 we found the system stuck at the same routine, sna_q_v0_get_rw_lock, in libsixp.a. Here is how the q4 stack looks:
    sna_q_v0_get_rw_lock+0xc8
    vql_stream_read_input+0xdc
    putnext+0x50
    N2Z_F_data_ind+0x38
    N2z_iev_pass_data_up+0x114
    N2z_ReadEvent_Recvd+0x209c
    Zc_putq+0x5c
    nacc0_receive_data+0x140

(4) 1653299073
After upgrading from R4.4 to R6.10.20 on a T600 system, the SDLC link could no longer be activated and the lan performance is severely degraded. When the SNA resources are activated, the following error messages are logged:
    ----------------------- 15:32:40 WET 19 mars 1999
    SDLC Message 768 - 107, Subcode: 0 - 11
    Log category: EXCEPTION
    Cause Type: External
    System: centurix
    SDLC write timer retry limit has been exceeded.
    DLC name = SDLC0
    Port name = SDLCP0
    Port number = 0x00000000
    Cause: An attempt to transmit a frame using an SDLC port has timed out. This may indicate a problem with the SDLC adapter or with the modem and cabling. The port is stopped.
    Action: Check the modem and communications link.

    ----------------------- 15:32:40 WET 19 mars 1999
    APPN Message 512 - 60, Subcode: 0 - 10
    Log category: PROBLEM
    Cause Type: SNA
    System: centurix
    An active link station has failed.
    Port name = SDLCP0
    LS name = SDLCL0
    Adjacent CP name = 0000000000000000000000000000000000
    Cause: An active link station has failed

    ----------------------- 15:33:29 WET 19 mars 1999
    SDLC Message 768 - 17, Subcode: 1 - 11
    Log category: EXCEPTION
    Cause Type: External
    System: centurix
    DSR was not active when activating port.
    Return code = 0x0003
    Cause: An error occurred on a port. The port is configured as Non-switched but DSR was not present.
Also the syslog.log is filled up with the messages 'lan3_process_read_completion: Received out of sequence'. The impact of all the above is that the LAN card becomes very slow to the point where the system becomes unusable. The only way to recover LAN traffic is to reboot the system without starting SNA at all.

PHNE_17819:

(1) 5003446971
Data page fault panic in nbm_free_buffer while running simple SNA tests over two LAN interfaces between the three machines running SNAplus2.

(2) 4701418707
When using CPI-C without side information, outgoing attaches sometimes fail because validation has been unintentionally turned on.

(3) 1653305805
One processor on a two processor box running R6 over 10.20 hangs which then causes cmcld to TOC the box to preserve system integrity. Top of stack for hanging process is:
    FUNC                                PC
    v0_get_rw_lock+0xb8                 0.0x3cc4a8
    vpr_route_ips_on_route+0x40         0.0x4094e0
    vds_rcv_buffers_available+0x1a0     0.0x3e1720
    vds_receive_proc+0x674              0.0x3e47fc
    nba_dispatch_input+0x298            0.0x5af050
    nba_dispatch_process+0xa4           0.0x5af184
    nba_schedule_process+0x134          0.0x5af5ec
    nba_send_ips+0x308                  0.0x5afd3c

(4) 1653293878
Invokable TP failing to start with following error messages logged.
    ------------- 10:52:14 GMT 10 Feb 1999 ----------------
    NODE Message 16384 - 0, Subcode: 10 - 10
    Log category: EXCEPTION
    Cause Type: Internal
    System: LR1875
    Internal system error.
    Errno = 7
    Action: Provide support services with the audit and error log files, and trace files if available.

    ------------- 10:52:14 GMT 10 Feb 1999 ----------------
    APPN Message 512 - 257, Subcode: 0 - 10
    Log category: PROBLEM
    Cause Type: Config
    System: LR1875
    Dynamic load of TP failed.
    Sense code = 0x07000000
    LU alias = DFKC
    TP name = lr229bci

PHNE_17405:

(1) 5003441717
The snaperrlog process can be left lying around when the SNAplus2 daemon is not started. Attempting to restart the SNAplus2 software using 'snap start' will fail (because the snaperrlog process is still there from a previous run).

(2) 4701413054
System panic - Data Page Fault at nsm_process_record_from_ss+130

(3) 4701399279
The PSI firmware header is not recognized by the snapwhat command.

(4) 1653289686
If you are using a TN3270 (not E) client and hit the clear key while TN Server is presenting an SSCP screen, then the client will lock up. The host may respond with an error message.

(5) 1653289603
If you are using a TN3270 (not E) client and hit the clear key while TN Server is presenting an SSCP screen, TN Server forwards the clear key to the host (sends an empty RU on SSCP-LU session). The host may respond with an error message.

PHNE_16758:

(1) 4701405316
Updated binaries required for patching the latest R6 release of SNAplus2.

(2) 4701399527
Assert errors are produced when the host sends a USSMSG10 screen to an LU configured for LU6.2. The ASSERTs are in fact benign, and will cause no problems with the integrity of the system. The ASSERTs only occur when the USSMSG10 screen is segmented, and greater than around 500 bytes in size.

(3) 1653279703
Assert errors logged when RJE workstation is started due to RTM request being received. This has no effect on operation other than bad entries in the error log.

(4) 1653276543
3270 session can hang, even if you stop and restart the snap3270 emulator. If you take a trace of the problem, you will see that a NOTIFY is not sent when the 3270 emulator is stopped or started.

(5) 1653273979
Enhancement to allow multiple PUs to be used on a secondary leased link. This means that if an SDLC port is connected to a leased line, you can have multiple LS's active over the port at the same time.

(6) 1653267179
An Application can fail to start a remote LU62 transaction, because an invalid user ID is specified on the Attach, when AP_SAME is specified on the ALLOCATE verb.

PHNE_15937:

(1) 5003398354
If you issue 'snapadmin define_local_lu' for an LU which is already defined, only the attach routing data and the description field are updated - all other parameters are ignored - however, the snapadmin command does not indicate this but gives a successful return message.

(2) 4701396374
Node fails to start if TN Client configured with unrecognised hostname.

(3) 4701395459
The following assert message is logged to the console, syslog file and sna log files.
    Assert ips->cont_size >= MU_CONT_SIZE from vtc.c

(4) 4701392670
When in client/server configuration, various ASSERT messages are recorded in the sna.err log file from the SLIM component, often followed by a crash of the SLIM. This causes general unpredictable behaviour of the system when the master server goes down and is restarted.

(5) 1653261669
SDLC link shows DISABLED after restart of SNAplus2 daemon

PHNE_14392:

(1) 4701386227
Unable to start SDLC EISA cards

Defect Description:

PHNE_20738:

(1) JAGab83920/8606111815
The SNAplus2 node carefully checks the ABM support fields on an XID. Unlike previous versions of SNAplus, we are now being very restrictive regarding the particular combination of ABM support flags we let through. Whilst we are following the SNA Formats manual correctly, this does cause interoperability problems with some hosts.
In this circumstance we have decided to relax the checks on the ABM support flags, to allow interoperability with other hosts.
Resolution:
Remove test that compares ABM support flags on XID.

(2) JAGab74185/8606105839
The problem is caused by a STATUS_SESSION(NO_SESSION, LU_INACTIVE) followed by a CLOSE_PLU_SLU_SEC_RQ. The CLOSE causes a dummy UNBIND to be built and queued, but the CLOSE kills the session between the RUI_BID returning and the subsequent RUI_READ.
Resolution:
Check the reason for the CLOSE - if it is due to a LINK error or because the PU or LU are inactive, then simply kill the session - don't send the UNBIND down to the application.

(3) JAGab65399/4701429621
If CH (Conventional Half-session) receives a segmented logon screen to an off-line LU, for example a dependent LU6.2 LU, CH may send an invalid response or, more likely, no response at all. This could potentially cause problems when trying to use APPC over that LU.
Resolution:
The nch_sscp_receive now stores the RH from the BBIU (!EBIU) in the SSCP section of local data. When the EBIU arrives, the RH is copied in and it is turned round as a negative RSP.

(4) JAGab25510/5003467290
SNAplus2 does not negotiate down the maximum BTU size from the largest frame size received in the RIF on the TEST_RSP (route discovery frame) received from the token ring driver.
Resolution:
Correct the code to process the RIF before sending the size back to SNAP APPN.

PHNE_20211:

(1) JAGab79003/8606108556
The driver code during nio_initialize checks for NULL IOVAs returned from sio_map(). The psi0 driver will panic if sio_map returned an IOVA of 0. But 0 is a valid IOVA; therefore, psi0 should not panic for NULL IOVAs.
Resolution:
Fix is to remove the panic on NULL IOVAs after sio_map() calls. Also in the step data structure, invalid IOVAs are redefined to be -1 (void* 0xffffffff) instead of 0. Also, change all checks for 0 IOVAs to be -1 IOVAs.
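
The sentinel change described for JAGab79003 can be illustrated with a short sketch. This is a Python model, not the psi0 driver code: INVALID_IOVA, old_mapping_ok, new_mapping_ok and Step are hypothetical names chosen for illustration. The only facts taken from the text above are that 0 is a valid IOVA and that invalid IOVAs are now represented as -1 (0xffffffff).

```python
# Illustrative model only; these names are hypothetical, not psi0 driver symbols.
INVALID_IOVA = 0xFFFFFFFF  # -1 viewed as a 32-bit value, per the resolution text


def old_mapping_ok(iova):
    """Pre-fix style check: wrongly rejects the valid address 0."""
    return iova != 0


def new_mapping_ok(iova):
    """Post-fix style check: only the -1 sentinel means 'no mapping'."""
    return iova != INVALID_IOVA


class Step:
    """Hypothetical stand-in for a per-transfer record that holds an IOVA."""

    def __init__(self):
        # Unmapped until a mapping call stores a real address here.
        self.iova = INVALID_IOVA

    def is_mapped(self):
        return new_mapping_ok(self.iova)
```

With the old style of check a mapping at address 0 would be treated as a failure; with the sentinel moved to -1, address 0 is accepted like any other.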

(2) JAGab75469/8606106415
When a SNRM retry is received very quickly after the initial SNRM such that it crosses with the outgoing UA, SNAP-LINK may crash because it does not have a valid frame buffer.
Resolution:
Before sending a UA for a SNRM retry, check to see if the frame_buffer used for UA's is present or not. If not, it must be in the stub, so ignore the incoming SNRM retry.

(3) JAGab73055/8606105164
Problem is that we are accessing data shared between normal context and call back context (hence typically interrupt context) outside the correct locking. We need to keep the lock on vhs_data_lock whenever we update data in the pcb_shared_data area of the port control block. We had a window where a TX was issued at the same time as a RX completed and so both bits of code tried to set up the next frame to RX and got confused.
Resolution:
The fix is to reorder the code to make all the changes to pcb_shared_data area inside the locked section of code.

(4) JAGab71519/8606104163
In certain circumstances RM will be asked to delete an SCB that was created by another instance of RM. If this happens then we can assert or crash when trying to delete the session.
Resolution:
Fix is to store the process ID of the RM instance that creates the SCB on the SCB, then when RM receives a DEACTIVATE_CONV_GROUP (which is routed from the NOF using the LU name/alias field) it can check that the SCB was created by this instance of RM and reject the signal (with new secondary return code NAP_LUNAME_CGID_MISMATCH).

PHNE_19407:

(1) JAGab71689
From the code it appears that there is not really a parameter mismatch, but it looks as though the problem is due to us trying to log more text than we actually allow. We are limited by a 600 byte array of log text in nba_pd_print_var.
Resolution:
The log parameters are limited to 600 bytes to allow all of the log datagram to fit in a 1k buffer for use by the SLIM. Consequently, we can't change this 600 byte limit.
Instead, for the particular log mentioned in this defect, we should ensure that the sense code gets put into the buffer before the error data (that way we will ensure we get all of the sense code). This will also stop the erroneous messages in the log file as we will be logging all of the parameters.

(2) JAGab71162
Cause of problem: Having looked at the dump file it appears that the root cause of this crash is down to a timing problem:

For switched links we send a REGISTER_STATION message from SNAP LINK to the HMOD stub. The HMOD receives this message, sends a response and issues an svphopen call to the actual HMOD. Upon receiving the REGISTER_STATION response SNAP LINK sends down a DIAL PORT message. The HMOD doesn't do anything with this DIAL PORT - it simply sends a response back immediately. When the HMOD open routine comes back (via the HMOD call back) the HMOD stub sends a POLL PORT message which indicates that we have successfully raised DSR.

SNAP LINK must get the DIAL PORT response before the POLL PORT. However, if the HMOD open returns prior to the DIAL PORT response being sent, SNAP LINK will decide this is an error and issue a RESET PORT message. Normally we don't see this as the svphopen takes long enough to come back that we have processed the DIAL PORT. However, turn on tracing and the SNAP LINK/HMOD processing takes longer and there is a chance that we will get the svphopen back sooner than the DIAL PORT response (note that the customer hit this timing window with just SDLC level 2 trace - to reproduce it we had to turn on internal trace in the SDLC driver as well).

The reason the box is actually crashing is even more convoluted: The reset port message that we send down (which is built on top of an alert held on the port control block in SNAP LINK) doesn't have the correct port_handle put onto it. Consequently when we get the reset_port response we assume that the port control block has been destroyed and we simply free off the reset port message.
Unfortunately this leaves us in a bad state:
- Firstly the resetting flag on the pcb is wrong - it is left as resetting == TRUE.
- Secondly, because the reset port is freed off, this frees the alert stored on the pcb.

The first assert is occurring because SNAP LINK issues reset port again to reset the port group. At this point we spot that the resetting flag is wrong. The second assert happens when we are doing a retry of the LS - the same timing problem as before happens meaning we try and send a reset port using the alert held on the pcb. Unfortunately this has been set to NULL (hence the assert) because of our failure to reconcile the reset port response. Moreover, when we try and access this memory in the sdl_reset_port routine we crash.
Resolution:
There are two fixes here:
Fix one - ensure that the correct handle is sent on the reset_port message.
Fix two - ensure that for switched outgoing links we don't open the HMOD until we receive the DIAL PORT message.
For fix one we simply add a line to sdl_reset_port to ensure that in the reset port case (like the close port) we set the port_handle to be the pcb->pcb_handle.
For fix two we add a test into vhs_register_station so that vhs_hmod_open is only called if the port_type is not switched outgoing. We then add a call to vhs_hmod_open to vhs_dial_port.

(3) JAGab68385
Missing flag in the streams definition.
Resolution:
Add flag to streams_info section of all kernel drivers.

(4) JAGab65528
The problem occurs under load when nch_sscp_receive() has two NOTIFY messages on its normal request pending queue. The routine loops round trying to process everything on the queue. The first NOTIFY is sent out OK. However this sets the flag 'LOCAL.norm_flow_rqs_blocked = TRUE'. This means that the second NOTIFY is then placed back on the queue by nch_df_sscp_send() resulting in a continuous loop trying to empty the normal request pending queue.
Resolution:
Modify the while() test in nch_sscp_receive() that processes the normal request pending queue so that it also checks LOCAL.norm_flow_rqs_blocked and drops out when this is TRUE. The end result of this change is that the first NOTIFY is sent as at the moment, and we then drop out of the queue processing and any further NOTIFYs or other MUs on the queue are then dealt with when the NOTIFY RSP returns.

(5) JAGab65397
Firmware problem: It is possible for the state machine to remain in the CLOSE_PEND state indefinitely because it expects only timeout events when in this state. The firmware state machine should declare an error and restart itself when it remains in the CLOSE_PEND state too long.
Driver Issues: From looking at the driver code and the traces it produced, it is apparent the driver is not initiating any action with the firmware if the link unexpectedly goes down (simulated by disconnecting the cable between the 9000 and the modem). The protocol between driver and firmware requires a message exchange for the link to start up.
Resolution:
The fix is to make sure that the driver and firmware, once they become 'unsynchronized', have a way to be re-synchronized. This is done by:
A) When the firmware hits an error condition which it does not know how to handle, it will set a system error and 'jump' to the first line of the firmware code (i.e., first line of main). This is done by timing out on inactivity in the OPEN_PEND and CLOSE_PEND states of the firmware. The declaration of system error will allow the firmware code to reset.
B) To make sure that the driver will not be hung, the driver code will start out with a credit of 2 when it initiates data transfer with the card. This is to prevent situations in which both the firmware and the driver are waiting for each other to send a message. With the new credit assignment during initialization, the driver will always be able to initiate action on the card.
Since the code is designed assuming only 1 outstanding message to the firmware at a time, the driver has a credit check to make sure the credit value is not greater than 2.

(6) JAGab65303
The problem occurs under load, when a temporary shortage of internal buffers means that the software must queue an incoming message (from the host) and deal with it later, when a buffer has become available. The reason for the problem is that the calculation of the RU length in the APPN node counts the same MU twice, once when it arrives (before it gets queued because there are no BUFFERs available) and again once it has been dequeued (because a BUFFER is now available). Specifically, ntc_buffers_available() calls ntc_process_btu(), passing the dequeued buffer. This calls ntc_segment_transfer() which increments the partial_biu_size field. The same code (from ntc_process_btu onwards) is called when the MU first arrived so the size of the MU is counted twice!
Resolution:
The fix is to 'undo' the first addition of the incoming message length to our running total of RU size if it turns out that we have to queue the message for later processing. Specifically, the fix decrements tc_cb->partial_biu_size in ntc_segment_transfer() if posting is requested and this was a non-BBIU segment. The process will be called again and the partial_biu_size re-incremented when the buffer that was posted for returns.

PHNE_19070:

(1) 4701429407
The problem is due to a code defect in the psi driver when attempting to process multiple DMA transactions.
Resolution:
The fix processes 1 DMA transaction at a time instead of processing queued DMA transactions. The DMA transactions are still queued but the DMA engine processes only one transaction at a time. It does not prefetch DMA transactions because we force it to stop and generate an interrupt after having processed every transaction.
When the driver gets the interrupt related to the completion of a DMA transaction, it starts processing the next DMA transaction in the queue.

(2) 4701425561
The Streams/UX subsystem on HP-UX 11.0, unlike SVR4 streams, does not provide any form of locking when accessing a streams Q. Thus, on HP-UX it is not safe to perform a PUTQ to a stream from outside its context (i.e. from the put or service routine of another queue).
Resolution:
The streams call PUT() does contend for ownership of a given queue, because HP-UX guarantees that only a single put or service routine for a queue will be run at one time. Thus, to ensure the streams queues are protected we modify the SNA code to:
- issue put() rather than putq()
- have the put routine for the streams Q issue the putq() to defer processing to the service routine.

(3) 4701425355
Problem is that ACC stack is calling QLLC put routine from interrupt context. QLLC module is not designed to cope with this: all other drivers/stacks queue their messages in a simple service routine so they can be sent upstream outside interrupt context. We have nevertheless agreed that we will add queuing to our QLLC module so that it works with the new ACC X.25 driver.
Resolution:
Fundamental fix is to move from the put() routine to the service() routine all the read-side processing in the QLLC module. In practice this only affects M_PROTO messages. Examination of the QLLC module code suggested that code processing these messages in put() routine could simply be removed -- provided the messages were then queued to the service routine -- because the service routine already has to handle them (via an FSM) in situations of buffer shortage. Empirical testing bore this out. So fix actually simplifies code by removing 'special case' processing for data messages when there are buffers available and no control messages queued in front of them.
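
The deferral pattern used in the two resolutions above (the put routine only queues, the service routine does the work) can be sketched as follows. This is a plain Python model under stated assumptions, not the HP-UX STREAMS API or the QLLC module code: DeferredQueue, its put() and service() methods, and the handler callback are hypothetical names used only to show the shape of the change.

```python
from collections import deque


class DeferredQueue:
    """Hypothetical model of a read-side queue with put and service routines."""

    def __init__(self, handler):
        self._pending = deque()   # messages waiting for the service routine
        self._handler = handler   # the real processing, run only from service()

    def put(self, msg):
        """Put routine: may be entered from interrupt context, so it only queues."""
        self._pending.append(msg)

    def service(self):
        """Service routine: runs later, outside interrupt context, in FIFO order."""
        processed = 0
        while self._pending:
            self._handler(self._pending.popleft())
            processed += 1
        return processed
```

A usage example: create the queue with a handler that records messages, call put() several times, and observe that nothing is processed until service() runs. Because every message takes the same path, there is no separate fast path to keep consistent with the queued path.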

(4) 1653299073
The problem is caused by the corruption of the lan3 data structures involved in DMA transactions by the psi0 DMA transaction processing.
Resolution:
Changed the handling of DMA transactions. The transactions are still queued but the DMA engine processes only one transaction at a time. It does not prefetch DMA transactions because we force it to stop and generate an interrupt after having processed a transaction. When the driver gets the interrupt related to the completion of a DMA transaction, it starts processing the next DMA transaction in the queue.

PHNE_17819:

(1) 5003446971
Panic caused by attempting to dereference null pointer while examining posted_list LQE in nbm_info structure to see whether it is empty.
Resolution:
Add boolean flag to nbm_info to say whether posted list is empty or not.

(2) 4701418707
The outgoing attach was being sent with the password from the previously rejected incoming attach causing a validation error.
Resolution:
Add code to copy the password from the START_TP signal into the tcp_ptr in nrmsttp.c

(3) 1653305805
From the stack we can see that this is a deadlock in the kernel during snap stop processing. We grab a write lock on vpr_entity_lock in vpr_stream_close() which we hold across a number of calls, including the one to nba_term(). It is this lock we are trying to acquire in vpr_route_ips_on_route() near the top of the stack trace.
Resolution:
We don't actually need to hold the vpr_entity_lock round the call to nba_term() in vpr_stream_close(). So the fix is just to release it before that call and reacquire it afterwards.

(4) 1653293878
The TP is failing to start because the userid under which it is running has been misconfigured so that it can't retrieve its own group name. This may be due to local access to the group file or with running NIS (Network Information Service) to share user and group IDs across more than one machine.
There are two reasons for the cryptic error logs recorded by SNAplus2:
- Failure of the getpwuid() or getgrgid() system calls was not logged as an error message.
- The VSM_AS_TP_FAILURE internal error code was not getting put in the right part of the DLOAD_RSP_ERR message sent from the Service Manager to the APPC Stub. This meant that the APPC stub was misinterpreting it as an APPC sense code.
Resolution:
The underlying remedy is to correctly configure the Unix user/group under which the TP is to be run. However changes to SNAplus2 have been made to improve the logging in this area as follows:
In vpm_build_user_info() in vr/vpmu.c we add error logs for the cases where getpwuid() or getgrgid() system calls fail. However, failure of these system calls leads to a path failure. So to make sure these new error logs actually reach the sna.err log file, we also modify vlm_user_write_log() in vdiag/vlmuser.c so that even if we fail to open a path we still attempt to send the datagram containing the log (in addition to attempting to write it locally).
In the error reply arm of vsm_rcv_dload_confirm() in vr/vsmdload.c we put the error code in the dld_status field rather than the ld_sense_data field of the DLOAD_RSP_ERR message -- because this is where the vas_datagrams() routine in the APPC Stub expects to find it.
We also change the exception logged in vsm_rcv_dload_confirm() from the generic one, with its rather misleading reference to errno, to a new specific error. Texts of the new logs are in the vdiag/*.txt files.

PHNE_17405:

(1) 5003441717
If the kernel initialisation fails, it is possible that the snaperrlog process could hang - waiting for a signal from the kernel which never arrives.
Resolution:
A code change has been made to ensure that, if the kernel initialisation fails, a failure notification is sent to the snaperrlog process so it can exit cleanly.
    (2) 4701413054
    There is a small timing window in which the list of LULU
    control blocks is empty when processing the
    SSCP_INIT_SIGNAL_NEG_RSP ISP.

    Resolution:
    Code changed to check whether the LULU list is empty before
    trying to obtain its first element.

    (3) 4701399279
    The PSI f/w header string was not changed with the release of
    SNAplus2, as the f/w is common to both SNAplus & SNAplus2.

    Resolution:
    - a new what string for the NIO firmware
    - a new what string and a new compilation format for the EISA
      firmware
    The ']' character has been added at the beginning of each PSI
    firmware library header line so that the header can be
    recognized by the snapwhat command.

    (4) 1653289686
    TN3270 client was locking up when the clear key was entered,
    because TN Server was passing the clear command to the Host
    instead of processing it locally (as is done in the Motif 3270
    emulator, for example).

    Resolution:
    Code changed to add a check and special handling for the clear
    key at the beginning of the TN Server SSCP inbound MU
    processing.

    (5) 1653289603
    TN3270 client was receiving SSCP data when the clear key was
    entered, because TN Server was passing the clear command to
    the Host instead of processing it locally (as is done in the
    Motif 3270 emulator, for example).

    Resolution:
    Code changed to add a check and special handling for the clear
    key at the beginning of the TN Server SSCP inbound MU
    processing.

PHNE_16758:
    (1) 4701405316
    Updated binaries provided for combined patching of the latest
    R6 release, as documented in the SR text.

    (2) 4701399527
    A code change has been made to prevent Assert errors occurring
    when a large USSMSG10 is received for an LU6.2 session. The
    maximum amount of data permissible on the SSCP screen has been
    increased to 2048 bytes, to ensure that segmented data on the
    SSCP screen is handled correctly.

    (3) 1653279703
    Code change made to fix a problem with Assert errors being
    logged when an RJE workstation is started. Code changed to
    correct the ASSERT: it should only be produced if an
    application has opened the SSCP session and is listening for
    RTM requests (RJE does not do this, so it should not be logged
    as an error).

    (4) 1653276543
    Code change to prevent a 3270 session hang caused by NOTIFY not
    being sent. The fix ensures that any pending NOTIFY requests
    are flushed from the CH queue in the APPN node when a
    CLOSE_SSCP message is received (indicating that the emulator
    has been stopped).

    (5) 1653273979
    Enhancement to allow multiple PUs to be used on a secondary
    leased link. This means that if an SDLC port is connected to a
    leased line, you can have multiple LS's active over the port
    at the same time.

    (6) 1653267179
    The problem is that if you specify AP_SAME on the ALLOCATE
    verb but did not configure user validation, then we will send
    a user ID consisting of 10 NULLs. A code change has been made,
    and the following behavior applies when AP_SAME is used:
    case 1: a TP on Unix invokes a remote TP: the outgoing
            Allocate will contain a userID subfield, set to the
            Unix user ID the TP is running under;
    case 2: a TP on Unix invokes several remote TPs: see case 1;
    case 3: multiple conversations, where an INVOKED TP issues an
            ALLOCATE: in that case, the outgoing Allocate will
            include the same level of validation which was on the
            ATTACH that invoked that TP.

PHNE_15937:
    (1) 5003398354
    Code changed to ensure that if the user specifies any other
    parameters, they match those used on the initial define.
    Produce an error code otherwise.

    (2) 4701396374
    Code changed to allow the node to start if it finds a TN
    Client configured with an unrecognised hostname; it now
    generates an error log which tells the user of the failure.

    (3) 4701395459
    This was an incorrect ASSERT which has been removed. It is a
    benign problem, but produces annoying error logs and console
    messages.
    (4) 4701392670
    The LAN logger component (which handles central logging)
    incorrectly registered itself with the service manager as a
    server. This means that a server could end up twice in the
    service table (for example, once as a backup, then again as a
    master server). This led to extremely unpredictable and
    unreliable client/server operation. Code change made to
    prevent this incorrect registering.

    (5) 1653261669
    Send a signal to the host when firmware is ready (backplane
    and frontplane are initialized).
    Remove debug trace from msgbuf (opt1:).

PHNE_14392:
    (1) 4701386227
    Fixed problems in SDLC driver and firmware.

SR:
    8606111815 8606108556 8606106415 8606105839 8606105164
    8606104163 5003467290 5003446971 5003441717 5003398354
    4701429621 4701429407 4701425561 4701425355 4701418707
    4701413054 4701405316 4701399527 4701399279 4701396374
    4701395459 4701392670 4701386227 1653305805 1653299073
    1653293878 1653289686 1653289603 1653279703 1653276543
    1653273979 1653267179 1653261669

Patch Files:
    /opt/sna/conf/lib/libpsi0.a
    /opt/sna/conf/lib/libpsi1.a
    /opt/sna/conf/lib/libsixd.a
    /opt/sna/conf/lib/libsixl.a
    /opt/sna/conf/lib/libsixm.a
    /opt/sna/conf/lib/libsixp.a
    /opt/sna/conf/lib/libsixq.a
    /opt/sna/conf/lib/libsixs.a
    /opt/sna/sdlc.dlf
    /opt/sna/sdlc.pbs
    /opt/sna/bin/snaptnsrvr

what(1) Output:
    /opt/sna/bin/snaptnsrvr:
        HP92453-02A.10.00 HP-UX SYMBOLIC DEBUGGER (END.O) $Revision: 74.03 $
        ]R6.10.20.101 SNAplus2 R6 TN Server ]
         (PHNE_17405 : 99/01/11 17:27:05) ]
    /opt/sna/conf/lib/libpsi0.a:
        ]R6.10.20.105 SNAplus2 R6 NIO PSI driver ]
         (PHNE_20211 : 99/11/05 07:29:46) ]
    /opt/sna/conf/lib/libpsi1.a:
        ]B.10.20.102 SNAplus2 R6 EISA PSI driver ]
         (PHNE_19407 : 99/08/16 07:41:48) ]
    /opt/sna/conf/lib/libsixd.a:
        ]R6.10.20.102 SNAplus2 R6 NDLC to DLPI Mapping ]
         (PHNE_20738 : 99/11/30 17:33:31) ]
    /opt/sna/conf/lib/libsixl.a:
        ]R6.10.20.106 SNAplus2 R6 SDLC in the Kernel ]
         (PHNE_20211 : 99/10/18 17:42:22) ]
    /opt/sna/conf/lib/libsixm.a:
        ]R6.10.20.101 SNAplus2 R6 NDLC to DLPI Mapping ]
         (PHNE_19407 : 99/08/06 14:16:03) ]
    /opt/sna/conf/lib/libsixp.a:
        ]R6.10.20.102 SNAplus2 R6 QLLC Module ]
         (PHNE_19407 : 99/08/06 14:20:01) ]
    /opt/sna/conf/lib/libsixq.a:
        ]R6.10.20.101 SNAplus2 R6 QLLC Module ]
         (PHNE_19407 : 99/08/06 14:21:09) ]
    /opt/sna/conf/lib/libsixs.a:
        ]R6.10.20.120 SNAplus2 R6 Router in the kernel ]
         (PHNE_20738 : 99/11/26 18:15:12) ]
        ]R6.10.20.114 SNAplus2 R6 APPN kernel library routines ]
         (PHNE_20738 : 99/11/26 18:14:41) ]
    /opt/sna/sdlc.dlf:
        SNAplus2 EISA FW v2.7 (99/07/30 12:58:42)
    /opt/sna/sdlc.pbs:
        ]SNAplus2 NIO FW v2.1 ](98/11/13 11:58:22)

cksum(1) Output:
    2390782579 204416 /opt/sna/bin/snaptnsrvr
    3096725199 63500 /opt/sna/conf/lib/libpsi0.a
    3450841052 47488 /opt/sna/conf/lib/libpsi1.a
    2028436616 172652 /opt/sna/conf/lib/libsixd.a
    148381262 360532 /opt/sna/conf/lib/libsixl.a
    2260423153 2540 /opt/sna/conf/lib/libsixm.a
    656500477 142452 /opt/sna/conf/lib/libsixp.a
    1648757111 2504 /opt/sna/conf/lib/libsixq.a
    347221352 3025252 /opt/sna/conf/lib/libsixs.a
    3715532193 105228 /opt/sna/sdlc.dlf
    3918812582 172212 /opt/sna/sdlc.pbs

Patch Conflicts: None

Patch Dependencies: None

Hardware Dependencies: None

Other Dependencies: None

Supersedes:
    PHNE_14392 PHNE_15937 PHNE_16758 PHNE_17405 PHNE_17819
    PHNE_19070 PHNE_19407 PHNE_20211

Equivalent Patches: None

Patch Package Size: 4270 KBytes

Installation Instructions:
    Please review all instructions and the Hewlett-Packard
    SupportLine User Guide or your Hewlett-Packard support terms
    and conditions for precautions, scope of license,
    restrictions, and limitation of liability and warranties,
    before installing this patch.
    ------------------------------------------------------------
    1. Back up your system before installing a patch.

    2. Login as root.

    3. Copy the patch to the /tmp directory.

    4. Move to the /tmp directory and unshar the patch:

            cd /tmp
            sh PHNE_20738

    5a. For a standalone system, run swinstall to install the
        patch:

            swinstall -x autoreboot=true -x match_target=true \
                      -s /tmp/PHNE_20738.depot

    By default swinstall will archive the original software in
    /var/adm/sw/patch/PHNE_20738. If you do not wish to retain a
    copy of the original software, you can create an empty file
    named /var/adm/sw/patch/PATCH_NOSAVE.

    WARNING: If this file exists when a patch is installed, the
             patch cannot be deinstalled. Please be careful
             when using this feature.

    It is recommended that you move the PHNE_20738.text file to
    /var/adm/sw/patch for future reference.

    To put this patch on a magnetic tape and install from the
    tape drive, use the command:

            dd if=/tmp/PHNE_20738.depot of=/dev/rmt/0m bs=2k

Special Installation Instructions:
    Stop the SNA daemon before installing the patch (snap stop).
    After installing the patch, start the SNA daemon (snap start).
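
After the patch is installed, the installed files can be compared against the cksum(1) Output listing. The following Python sketch is not part of the patch; it is a minimal illustration that assumes the listed values were produced by the POSIX cksum algorithm (CRC polynomial 0x04C11DB7 with the file length appended). The helper names crc_update, posix_cksum, parse_cksum_listing and check_file are hypothetical, and only two entries from the listing are shown.

```python
# Sketch only: helper names are hypothetical and not part of SNAplus2.
# Assumption: the cksum(1) values in the patch text use the POSIX CRC.

POLY = 0x04C11DB7  # generator polynomial specified for POSIX cksum


def _make_table():
    """Build the 256-entry lookup table for an MSB-first CRC-32."""
    table = []
    for i in range(256):
        crc = i << 24
        for _ in range(8):
            crc = ((crc << 1) ^ POLY) if crc & 0x80000000 else (crc << 1)
        table.append(crc & 0xFFFFFFFF)
    return table


_TABLE = _make_table()


def crc_update(crc, data):
    """Feed bytes into the running CRC (no final complement)."""
    for byte in data:
        crc = ((crc << 8) & 0xFFFFFFFF) ^ _TABLE[(crc >> 24) ^ byte]
    return crc


def posix_cksum(data):
    """Return the POSIX cksum CRC of a bytes object."""
    crc = crc_update(0, data)
    length = len(data)
    # The length is appended least significant octet first.
    while length:
        crc = crc_update(crc, bytes([length & 0xFF]))
        length >>= 8
    return crc ^ 0xFFFFFFFF


def parse_cksum_listing(text):
    """Map path -> (crc, size) from lines of the form 'crc size path'."""
    expected = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[0].isdigit() and fields[1].isdigit():
            expected[fields[2]] = (int(fields[0]), int(fields[1]))
    return expected


def check_file(path, expected):
    """Return True if the file at path matches its listed CRC and size."""
    crc, size = expected[path]
    with open(path, "rb") as handle:
        data = handle.read()
    return len(data) == size and posix_cksum(data) == crc


# Two entries copied from the cksum(1) Output section of this patch text.
LISTING = """
2390782579 204416 /opt/sna/bin/snaptnsrvr
3096725199 63500 /opt/sna/conf/lib/libpsi0.a
"""

if __name__ == "__main__":
    expected = parse_cksum_listing(LISTING)
    for path in sorted(expected):
        try:
            status = "OK" if check_file(path, expected) else "MISMATCH"
        except OSError as exc:
            status = "unreadable ({})".format(exc)
        print(path, status)
```

Running the script on the patched system prints one status line per listed file. A MISMATCH would suggest that the file on disk is not the one delivered by PHNE_20738, for example because a later patch has replaced it.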