Recovery Subsystem

Select one of the Recovery Subsystem components below to find out more information about that component.

Reboot Option

Auto Recovery

Critical Errors

Power On Messages

Correctable Memory

Environment

Power Supply

Power Converter

Remote Communications

Integrated Management Log

Remote Insight

Reboot Option

This option allows you to initiate a reboot from the browser. You will be warned before the system allows you to continue the rebooting process. The following reboot options are available.

To reboot the device, select a reboot option and click Reboot. A text page displays notifying you that the reboot was successfully requested.

NOTE: The reboot option is not available for all devices.

Auto Recovery

This section provides Automatic Server Recovery (ASR) configuration information, tells you when the server was last reset, and allows you to modify pager settings. You can modify the Status, ASR Reset Boot Option, Pager Status, Pager Dial String, and Pager Message settings.

The following items display on this window.

General Information

If the last reset was an ASR reset, the ASR condition will be degraded.

To change the timeout setting, use the Compaq System Configuration Utility. The time you specify should give the system a reasonable period after a fault occurs before it is reset and the recovery process is activated. If the timeout period is set too low on a heavily utilized server, the timeout could occur before the software support has time to service the timer.
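
The relationship between the timeout period and how often the software services the timer can be pictured with a small watchdog simulation. This is a minimal sketch, not the ASR implementation: the class WatchdogTimer, the function count_resets, and the abstract time unit "tick" are all hypothetical names chosen for illustration.

```python
class WatchdogTimer:
    """Hypothetical model of a watchdog timer (illustration only, not ASR code)."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.last_refresh = 0  # simulated time of the last service, in ticks

    def refresh(self, now):
        """Called each time the software support services the timer."""
        self.last_refresh = now

    def expired(self, now):
        """True when the timer has gone unserviced for the full timeout period."""
        return now - self.last_refresh >= self.timeout_ticks


def count_resets(timer, service_interval, duration):
    """Count simulated resets when software services the timer every
    `service_interval` ticks over a run of `duration` ticks."""
    resets = 0
    for now in range(1, duration + 1):
        if timer.expired(now):
            resets += 1
            timer.refresh(now)  # a reset restarts the timer
        elif now % service_interval == 0:
            timer.refresh(now)  # normal servicing by the software
    return resets


# A busy server that manages to service the timer only every 12 ticks.
too_low = count_resets(WatchdogTimer(timeout_ticks=10), service_interval=12, duration=60)
adequate = count_resets(WatchdogTimer(timeout_ticks=15), service_interval=12, duration=60)
print(too_low, adequate)
```

In this model, a timeout of 10 ticks expires repeatedly because the software cannot service the timer in time, while a timeout of 15 ticks never expires.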

Reboot

Use the ASR Reset Limit feature in conjunction with the ASR Reset Count feature in the same window. The ASR Reset Count feature displays the number of times that ASR has rebooted the server. If the ASR Reset Count is approaching the reset limit, immediately investigate the server for problems by checking the Critical Error Log and running Compaq Diagnostics.

This count is reset to 0 when the system is reset manually.
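
The comparison between the ASR Reset Count and the ASR Reset Limit can be sketched as a simple check. The function name and the margin parameter below are assumptions made for illustration; the product does not define how close to the limit counts as "approaching".

```python
def approaching_reset_limit(reset_count, reset_limit, margin=1):
    """Return True when the reset count is within `margin` resets of the limit.

    `margin` is an assumed threshold, not a documented setting.
    """
    return reset_limit - reset_count <= margin


# Hypothetical values read from the Auto Recovery window.
if approaching_reset_limit(reset_count=2, reset_limit=3):
    print("Check the Critical Error Log and run Compaq Diagnostics.")
```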

Pager

The status can be:

Critical Errors

The Critical Error Log records non-correctable memory errors, as well as catastrophic hardware and software errors that cause a system to fail. This information helps you quickly identify and correct the problem, minimizing downtime.

This section displays a description of critical errors. The date and time of each error are followed by a brief description of the error. The time shown is rounded to the nearest hour.

If critical errors are marked with an exclamation point (!), indicating that corrective action is required, the log condition is degraded. To remove the exclamation point and indicate that an entry has been corrected, select the entries you want to clear and click the Correct Marked Entries button, or run Compaq Diagnostics on the device. An asterisk ( * ) indicates the log entry to which the Last Failure Message applies.


IMPORTANT: Agents must have sets enabled and you must have the correct SNMP Community string to be able to mark entries as corrected.


The following list displays errors that may be logged. If you receive any of these errors, run Compaq Diagnostics on your system or consult your software documentation.

Abnormal Program Termination - The device has detected a fatal software error resulting in a device failure.

ASR Base Memory Parity Error - The system detected a data error in base memory following a reset due to an ASR timeout.

ASR Extended Memory Parity Error - The system detected a data error in extended memory following a reset due to an ASR timeout.

ASR Memory Parity Error - The system ROM was unable to allocate enough memory to create a stack. It was unable to put a message on the screen or continue booting the server.

ASR Reset Limit Reached - The maximum number of system resets has been reached. The Compaq Utilities will be loaded.

ASR Reset Occurred - No error data is logged.

ASR Test Event - An ASR Test event was generated by the user through the system utilities. No action is required since the event was user-generated to test the ASR configuration.

ASR Timeout NMI - The server has generated an ASR NMI because the ASR timer has not been refreshed. This generally indicates a driver has not relinquished control of the processor causing a server failure. The resulting ASR NMI was generated to log this event. Note the module that was executing.

CPU Internal Corrected Error Threshold Exceeded - The system has detected that a CPU has exceeded the threshold for the number of internal ECC cache errors.

CPU Processor Power Module Failed - The system has detected that a processor's power module has failed.

Critical Temperature - The system's critical temperature has been exceeded and auto shutdown has been initiated.

Error Detected On Bootup - The system detected an error during the Power-On Self-Test.

Exception - The processor has detected a critical exception resulting in a device failure.

Fan Failure - The system or processor fan failed.

NMI - CPU Local Error - The processor experienced a fatal error resulting in a device failure.

NMI - Expansion Board Error - A board on the expansion bus indicated an error condition causing a device failure.

NMI - Expansion Bus Arbitration Error - Memory refresh cycles were delayed, potentially leading to data loss. The error results in a system failure.

NMI - Expansion Bus Master Time-out - A bus master expansion board in the indicated slot did not release the bus after its maximum time resulting in a device failure.

NMI - Expansion Bus Slave Time-out - A board on the expansion bus delayed a bus cycle beyond the maximum time resulting in a device failure.

NMI - Failsafe Timer Expiration - The software was unable to reset the system failsafe timer, resulting in a system failure.

NMI - Processor Address Error 1 - A processor internal address parity checking error occurred, resulting in a device failure.

NMI - Processor Address Error 2 - The processor detected an address parity error during an inquire cycle.

NMI - Processor Cache Parity Error - A data error occurred within the processor cache, resulting in a system failure.

NMI - Processor Internal Error 1 - A processor internal parity error occurred, resulting in a device failure.

NMI - Processor Internal Error 2 - The processor detected an internal parity error or a functional redundancy error.

NMI - Processor Parity Error - The processor detected a data error resulting in a device failure.

NMI - Software Generated Interrupt - Software indicated a system error resulting in a system failure.

NMI - System Concurrency Error - A potential error condition was detected within the Data Flow Manager, resulting in a system failure.

NMI - Uncorrectable Memory Error - The device experienced an uncorrectable memory parity error resulting in a device failure.

NMI - Unknown Error Type - The device driver does not recognize this NMI. You may need to upgrade your health driver.

Processor Failure - The processor failed during the Power-On Self-Test.

Server Manager Failure - An error occurred in the server interface with the Server Manager/R.

UPS A/C Line Failure/Shutdown or Battery Low - The device has initiated a UPS or operating system shutdown, or the battery is almost depleted after an AC line failure.

The Last Failure Message on this window displays the last failure message associated with a critical error.

Power On Messages

This section displays the Power-On messages logged when the device was turned on. Refer to your device documentation for a listing of possible Power-On error messages and their meanings. Click the Clear Power-On Message button to clear the power-on message log. This button is only available if there are messages to clear.

Correctable Memory

This alarm indicates that a block of memory in the system issuing the alarm has failed or is failing and may need to be replaced. The condition is generally non-critical because the memory controller can correct the problem, and the system continues to correct any errors it can.

Memory errors are corrected by the ECC memory subsystem when they occur. If you notice an increase in these errors, correct the problems as soon as possible. If the memory components degrade further, the errors may no longer be correctable.

Environment

This section displays details on the device environment. The following information is available.

System Information

CAUTION: Do not operate the system with the cover removed. Proper airflow is possible only when the cover is in place and properly secured.

NOTE: A Failed condition will not occur in a client PC since the power supply for the client will be cut off in the event the thermal condition reaches a permanently damaging level.

Power Supply

This section displays information about the power supplies.

The following entries may be displayed:

Power Converter

This section displays information about the power converters. The following entries may be displayed:

Remote Communications

This section displays details about the status of the Integrated Remote Console (IRC) and the Rapid Recovery communications configuration.

The following fields display.

Status indicates whether the IRC is supported and enabled. Possible values include Not Supported, Enabled, and Disabled.

  1. The COM port for which IRC is configured does not exist.

  2. The COM port for which IRC is configured is a PCI device.

  3. The IRQ for which IRC is configured does not match the COM port for which IRC is configured.

Remote PC Communications to Compaq Utilities

The following values may be displayed in this field.

If you have enabled Dial-Out Status, a dial-out connection will be attempted first. If that connection fails, then dial-in access is enabled. If the dial-out connection is successful, then dial-in is enabled after that connection is terminated.
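
The order of events described above can be sketched as follows. This is an illustrative model only: the function name and the step labels are hypothetical, and the sketch assumes that both Dial-Out Status and dial-in access are enabled in the configuration.

```python
def remote_access_sequence(dial_out_succeeds):
    """Return the ordered steps for one attempt, assuming Dial-Out Status
    and dial-in access are both enabled (an assumption of this sketch)."""
    steps = ["attempt dial-out"]
    if dial_out_succeeds:
        steps.append("dial-out session active")
        steps.append("dial-out session terminated")
    else:
        steps.append("dial-out failed")
    # In both cases dial-in is enabled only after the dial-out attempt ends.
    steps.append("enable dial-in")
    return steps


for outcome in (True, False):
    print(outcome, remote_access_sequence(outcome))
```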

After the ASR feature has attempted to deliver an alarm by means of the pager, if the Dial-Out Status is enabled and a proper Dial-Out String has been provided, ASR will dial a remote PC. When a session is established, the server administrator can use a third-party terminal emulation program to run the Compaq Utilities to diagnose the problem.

Possible values are:

After the ASR feature has attempted to deliver an alarm by means of the pager, if the Dial-Out Status is enabled and a proper Dial-Out String is provided in this field, ASR will attempt to dial a remote PC. When a session is established, the system administrator can use a third-party terminal emulation program to run the Compaq Utilities to diagnose the problem.

Integrated Management Log

The Integrated Management Log records system events, critical errors, power-on message errors, and memory errors. The log also records catastrophic hardware and software errors that typically cause a system to fail. This information helps to quickly identify and correct the problem and minimize downtime.

Each event log entry has a status to identify the severity of the event:

If any events in the log have a condition of Caution, the overall log condition will be marked as degraded. If Critical events exist in the log, the overall log condition will be marked as failed.
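
The rule for deriving the overall log condition can be sketched as a short function. The function name, the "ok" result for a log with no Caution or Critical events, and the "Informational" status used in the example are assumptions for illustration; only Caution, Critical, degraded, and failed come from the description above.

```python
def overall_log_condition(entry_statuses):
    """Derive the overall log condition from the per-entry statuses.

    Critical takes precedence over Caution. "ok" is an assumed label for a
    log that contains neither.
    """
    if "Critical" in entry_statuses:
        return "failed"
    if "Caution" in entry_statuses:
        return "degraded"
    return "ok"


# "Informational" is a hypothetical status used only to fill out the example.
print(overall_log_condition(["Informational", "Caution"]))
print(overall_log_condition(["Caution", "Critical"]))
```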

To clear a degraded or failed event log, mark the log entry as repaired after you have repaired the condition that caused a log entry to be generated. Perform the following steps.

  1. Highlight the log entries in the Integrated Management Log.

  2. Click the Mark Repaired button. This button is located at the top of the window.


IMPORTANT: Agents must have sets enabled and you must have the correct SNMP Community string to be able to mark log entries as corrected.


The description column gives a brief description of the error or event. The update time column contains the last time this log was updated. The status column contains the status of the log entry.

Refer to the Compaq Integrated Management Log User Guide for more information.

Remote Insight

Select the Remote Insight entry from the Recovery list to display a submenu containing separate entries for General Information, Network Interface Card, Event Log, and a link to the Remote Insight Board Web Interface.

General Information

Network Interface Card

Event Log

Remote Insight Board Web Interface

General Information

The General Information section displays the following information about the Remote Insight board. Not all of the listed fields are supported on every model of Remote Insight Board.

Network Interface Card

The NIC section displays the following information about the NIC in the Remote Insight Board. Not all fields are supported by all models of Remote Insight Board and/or NIC.

Event Log

The Event Log section displays the list of events stored in the Remote Insight Board event log. These events can be cleared by a user with appropriate authority. Each event includes the following information:

Remote Insight Board Web Interface

This link launches a new browser window that will contain the web interface to the Remote Insight Board. This link is only present for models of Remote Insight Board that support this functionality.