intro_sam(1M)


intro_sam -- introduce the System Availability Monitor

Description

The System Availability Monitor (SAM) monitors the cluster and its nodes and records associated failure events. SAM also monitors changes in system time and its own availability. It records events in a log file and produces reports about system availability. The Summary report contains information about each of the monitored objects. The Failures report provides detailed information about the down time of a single object. The Events report provides detailed information about the up and down time of a single object. You can annotate the event record using the samlog(1M) command to add text describing an event, to note that down time was planned, and so on. You can view the reports from the NonStop Clusters Management Suite (NCMS) graphical user interface (GUI) using the ncms(1M) command and selecting the Samview application. You can also view the reports from the command line using the samrep(1M) command.

SAM has a monitor (samd) that watches for node and cluster events and writes the log. It also has a report generator that reads the log and generates availability percentages (samrep). You can generate the availability reports using the command line interface (samrep) or the NCMS GUI.

Once installed, SAM runs indefinitely to record the cluster availability statistics. SAM should not be stopped. When SAM is not operational, new events are lost and report data loses value.

Events are recorded in /var/avail/sam/eventlog. By default the event log resides in the root filesystem. SAM cannot log new events when the filesystem that contains the event log is full. If the event log is removed (not advised), SAM creates a new one. If the event log is copied elsewhere, the copy can still be processed using the samrep -f pathname command. However, the entries from the copied event log and the current event log are not processed in the same report. The current event log contains records starting with the time it was created.

SAM is intended to handle only a low volume of logging. Significant portions of SAM are implemented in scripting languages, which execute relatively slowly. Because the event log file is intended to be kept forever, you should log as few events as necessary to limit its size.

Before using the SAM reports, read the following information to understand SAM and its reports:

About Monitored Objects

SAM monitors the up and down times of the cluster, the nodes, and itself. SAM also monitors time changes of the system clock to help with the interpretation of report data. With the samrep and samlog commands, you can specify the type and name of an object for a report or to modify a record in the log file. These SAM commands specify the type of monitored object with the -t objtype option. The specific object of that type is specified by the -n objname option of the SAM commands. For example, the options -t NODE -n 2 specify node 2 as the specific object.

The types of objects that SAM monitors are as follows:

CLUS
The cluster as a whole.
NODE
The number of a node in the cluster.
TIME
A time change event, for example, the use of the date(1M) command to change system time. In the reports, this object appears as TIME.CHANGE.
APPL
SAM by default. If the event log file has been altered to include data for other applications, this object can specify them as well. This object appears in the reports as APPL.SAM or APPL.application, where application is the application name added to the event log.
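
As a hedged illustration of the type.name labels used in the reports, the following Python sketch splits a label such as NODE.2 into the values given to the -t and -n options. The names split_object_label, as_options, and KNOWN_TYPES are assumptions made for this example; they are not part of SAM.

```python
# Hypothetical helper, not part of SAM: split a report label into type and name.
KNOWN_TYPES = {"CLUS", "NODE", "TIME", "APPL"}  # object types listed above


def split_object_label(label):
    """Return (objtype, objname) for a label such as 'NODE.2' or 'APPL.SAM'."""
    objtype, sep, objname = label.partition(".")
    if not sep or not objname:
        raise ValueError(f"expected type.name, got {label!r}")
    if objtype not in KNOWN_TYPES:
        raise ValueError(f"unknown object type {objtype!r}")
    return objtype, objname


def as_options(label):
    """Build the -t objtype -n objname option list for a label."""
    objtype, objname = split_object_label(label)
    return ["-t", objtype, "-n", objname]


if __name__ == "__main__":
    print(as_options("NODE.2"))
```

For NODE.2 the sketch yields the option pair described above, -t NODE -n 2.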

About the SAM Reports

SAM produces three human-readable reports and one programmatic report. You can use the Samview interface to view the human-readable reports. The programmatic report is strictly for programs to read and is described in the samrep(1M) manual page. The human-readable reports are the Summary, Failures, and Events reports, which are described in the sections that follow.

The reports can be annotated using the samlog command. Annotate the log file to note when and why the system time was changed, and so on. The reports cannot be annotated with the Samview NCMS interface, however. With the samlog command, you can add a short annotation to an event record, change unplanned down time to planned, or mark the state of an object as GONE. Such annotation is useful to record permanent node removals and to make the reports more meaningful. You can also add complete records to the event log for other system objects. SAM calculates availability statistics for these objects when their up and down events are properly recorded using samlog.

Although the reports include system time changes greater than five minutes, these changes do not impact the availability percentages. The changes in system time are provided as additional information to use for interpreting the reports.

View the human-readable reports with the Samview GUI by entering ncms on the command line and selecting Samview from the list of choices that appears. You can also use the samrep command to view the reports.

The Summary Report

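The Summary report is displayed by default from the Samview NCMS GUI or when you enter samrep on the command line. As the example Summary report shows, the header in the report contains the cluster name and the time the report was generated, the length of the reporting period and the times it began and ended, the log format version, and the times of the first and last events.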

Information Fields of the Summary Report

The body of the Summary report contains the following columns of information retrieved from the event log:

object type.name
The type and name for each object in the report.
down cnt
The number of times the object went down during the time period noted in the header information.
last went down when
The time of the most recent down event. This time is shown as a date if the event did not occur on the day of the report. If the down event occurred on the day of the report, the time is shown in hours, minutes, and seconds using a twenty-four hour format.
total down time unplanned
The total amount of unplanned time the object was down in hours, minutes, and seconds. By default, all down time is logged in the event log as unplanned, but you can annotate the event log file to specify that down time is planned using the samlog -i evid -p PLANNED command.
total down time planned
The total amount of planned time the object was down. The time is displayed in hours, minutes, and seconds. By default, all down time is logged in the event log as unplanned. You can annotate the event log file to specify that down time is planned using the samlog -i evid -p PLANNED command.
up time%
The percent of time the object was up. SAM calculates this time by dividing the total up time by the total up and down time for the object. This percentage can be affected by the samrep command line options, which allow you to specify that planned down time counts as up time for the percentage calculation in a report (samrep -k PLANNED | UNPLANNED).
last state
The last logged state of the object when the report ran. The state can be UP, DOWN, or GONE. The UP and DOWN states are placed in the event log file by SAM. The GONE state must be added with the samlog command, and stops SAM from accumulating down time for an object. For example, if you remove a node from the cluster, SAM considers it down. When you mark its state as GONE, SAM stops accumulating down time for it.
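
The up time% calculation described above can be sketched in Python as follows. This is a minimal illustration, not SAM's implementation: the function names and the planned_counts_as_up parameter are assumptions, and the usage example assumes that CLUS.SELF was either up or down for the whole reporting period shown in the example Summary report that follows.

```python
def to_seconds(days=0, hours=0, minutes=0, seconds=0):
    """Convert a duration such as 14d 10h 33m 40s to seconds."""
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def up_time_percent(up_s, unplanned_down_s, planned_down_s,
                    planned_counts_as_up=False):
    """Divide total up time by total up and down time, as a percentage.

    planned_counts_as_up is a hypothetical switch that mirrors the idea of
    counting planned down time as up time for the calculation.
    """
    down_s = unplanned_down_s
    if planned_counts_as_up:
        up_s += planned_down_s
    else:
        down_s += planned_down_s
    total_s = up_s + down_s
    if total_s <= 0:
        raise ValueError("no up or down time recorded")
    return 100.0 * up_s / total_s


if __name__ == "__main__":
    # CLUS.SELF figures from the example Summary report.
    period = to_seconds(days=14, hours=10, minutes=33, seconds=40)
    unplanned = to_seconds(minutes=17, seconds=12)
    planned = to_seconds(minutes=4, seconds=48)
    up = period - unplanned - planned  # assumption: never in the GONE state
    print(f"{up_time_percent(up, unplanned, planned):.4f}")
```

Under these assumptions the sketch reproduces the 99.8942 shown for CLUS.SELF in the example report. An object that spends time in the GONE state, or whose events do not span the whole period, has a smaller total of up and down time, so its up time must be taken from its events rather than from the period length.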

Example Summary Report

-------------------------------------------------------------------------------
SAM Summary Report for All Objects on cluster27 at 2000.02.04_22:06:22
 
Reporting Period: 14d 10h 33m 40s                       Log Format Version: 1.1
Period Began: 2000.01.21_11:32:42              First Event: 2000.01.21_11:32:42
Period Ended: 2000.02.04_22:06:22               Last Event: 2000.02.04_16:14:13
 
      object  down  last went  total down time  total down time  up time%  last
   type.name  cnt   down when        unplanned          planned           state
 
   CLUS.SELF  6    2000.01.31          17m 12s           4m 48s   99.8942    UP
      NODE.1  10   2000.02.03          30m 40s          14m 35s   99.7824    UP
      NODE.2  9    2000.02.01          36m 31s       3h 42m 15s   98.7556    UP
      NODE.3  11     11:09:37          46m 29s           2m  4s   99.7665    UP
 TIME.CHANGE  4    2000.01.28          31m 43s               0s   99.8475    UP
    APPL.SAM  49     16:13:56          46m 12s          19m 54s   99.6821    UP
   APPL.COMM  3    2000.01.28          23m 36s               0s   98.0987  GONE
 
Report Notes:
  + any object that was last down sometime today shows when as HH:MM:SS

-------------------------------------------------------------------------------
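
The rule for the last went down when column (a date for older events, a time of day for events on the day of the report) can be sketched in Python as follows. The function name last_down_label is an assumption for illustration. The full timestamp used for NODE.3 is also an assumption, because the example report shows only its time of day.

```python
from datetime import datetime

# Timestamp layout seen in the report headers: YYYY.MM.DD_HH:MM:SS
SAM_STAMP = "%Y.%m.%d_%H:%M:%S"


def last_down_label(event_stamp, report_stamp):
    """Return HH:MM:SS for an event on the report day, else YYYY.MM.DD."""
    event = datetime.strptime(event_stamp, SAM_STAMP)
    report = datetime.strptime(report_stamp, SAM_STAMP)
    if event.date() == report.date():
        return event.strftime("%H:%M:%S")
    return event.strftime("%Y.%m.%d")


if __name__ == "__main__":
    report_time = "2000.02.04_22:06:22"
    # NODE.3 went down on the day of the report (date assumed from the rule).
    print(last_down_label("2000.02.04_11:09:37", report_time))
    # CLUS.SELF last went down several days earlier.
    print(last_down_label("2000.01.31_10:32:09", report_time))
```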

The Failures Report

The Failures report is displayed when you select the Failures tab in the Samview NCMS GUI and when you specify the -r FAILURES option of the samrep command. This report shows a list of down events for a single object selected with the GUI or specified on the samrep command line. The header contains the same information as for the Summary report, except that the first line contains the name of the selected object in object type.name format.

Information Fields of the Failures Report

The body of the Failures report contains the following columns of information retrieved from the event log.
related object
The name of the object associated with the failure. The name has the format of object type.name. This name may be that of the selected object for which the Failures report was generated, or it may be the name of the supporting object that actually failed and caused the selected object's down event.
event id
A unique identifier for the event. You can use this id to specify an event record to alter with samlog, or as a suboption for the samrep -q EVENTS command.
went down when
The time the failure occurred. This time appears as the date when the event did not occur on the day of the report. If the down event occurred on the day of the report, the time is noted by hours, minutes, and seconds using a twenty-four hour format.
duration for object type.name
The amount of time that the object type.name was down because of this down event. SAM calculates this duration from data in the event log.
planned?
A YES- or NO-populated field that indicates whether the down event was planned or not. By default, all down time is logged in the event log as unplanned. You can annotate the event log file to specify that down time is planned using the samlog -i evid -p PLANNED command. Total planned and unplanned down time appears at the bottom of the report.
msg description
The short descriptive text associated with the event. SAM notes a basic reason for the event, if possible. You can annotate the event log file to specify more descriptive text for an event using the samlog -i evid msg command.

The Failures report can be used to locate an event id to use as the evid in the samlog -i evid command line.

Example Failures Report

-------------------------------------------------------------------------------
SAM Failures Report for NODE.2 on cluster27 at 2000.02.04_22:12:50
 
Reporting Period: 14d 10h 40m 8s                        Log Format Version: 1.1
Period Began: 2000.01.21_11:32:42              First Event: 2000.01.21_11:32:42
Period Ended: 2000.02.04_22:12:50               Last Event: 2000.02.04_16:14:13
 
     related  event  went down   duration for   plan  msg                 
      object  id          when   NODE.2         ned?  description         
 
 TIME.CHANGE  1098  2000.01.21           6m 21s   NO  changed -381 sec    
   CLUS.SELF  1101  2000.01.21       3h 40m 18s  YES  richard was testing 
      NODE.2  1122  2000.01.25           3m 59s   NO  samd home node died 
 TIME.CHANGE  1125  2000.01.25           5m 34s   NO  changed +334 sec    
      NODE.2  1127  2000.01.25           1m 57s  YES                      
   CLUS.SELF  1129  2000.01.25           6m 15s   NO  cluster died        
 TIME.CHANGE  1139  2000.01.26          10m  5s   NO  changed +605 sec    
   CLUS.SELF  1153  2000.01.28              10s   NO  cluster died        
   CLUS.SELF  1160  2000.01.28           7m 42s   NO  cluster died        
 TIME.CHANGE  1171  2000.01.28           9m 43s   NO  changed -583 sec    
   CLUS.SELF  1175  2000.01.28           7m 26s   NO  cluster died        
   CLUS.SELF  1183  2000.01.31           6m 48s   NO  cluster died        
      NODE.2  1196  2000.02.01           4m 11s   NO  samd home node died 
                               ----------------                           
               total unplanned          36m 31s                           
                 total planned       3h 42m 15s                           
 
Report Notes:
  + the exact time when objects went down can be obtained via the EVENTS report
  + TIME.CHANGE events were ignored in computing NODE.2 failure time
  + times in the report header are for all objects, not a particular object
-------------------------------------------------------------------------------
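
The totals at the bottom of the example Failures report can be checked with a short Python sketch. It sums the duration column by the planned? column and skips TIME.CHANGE rows, as the report notes describe. The helper names are assumptions for illustration, and format_duration only approximates the report's layout (it does not pad single digits).

```python
import re

UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}

# Rows copied from the example Failures report for NODE.2:
# (related object, event id, duration, planned?)
FAILURES = [
    ("TIME.CHANGE", 1098, "6m 21s", "NO"),
    ("CLUS.SELF", 1101, "3h 40m 18s", "YES"),
    ("NODE.2", 1122, "3m 59s", "NO"),
    ("TIME.CHANGE", 1125, "5m 34s", "NO"),
    ("NODE.2", 1127, "1m 57s", "YES"),
    ("CLUS.SELF", 1129, "6m 15s", "NO"),
    ("TIME.CHANGE", 1139, "10m  5s", "NO"),
    ("CLUS.SELF", 1153, "10s", "NO"),
    ("CLUS.SELF", 1160, "7m 42s", "NO"),
    ("TIME.CHANGE", 1171, "9m 43s", "NO"),
    ("CLUS.SELF", 1175, "7m 26s", "NO"),
    ("CLUS.SELF", 1183, "6m 48s", "NO"),
    ("NODE.2", 1196, "4m 11s", "NO"),
]


def parse_duration(text):
    """Convert text such as '3h 40m 18s' to seconds."""
    pairs = re.findall(r"(\d+)\s*([dhms])", text)
    return sum(int(count) * UNIT_SECONDS[unit] for count, unit in pairs)


def format_duration(total_s):
    """Render seconds in a layout close to the report's, e.g. '36m 31s'."""
    parts = []
    for unit in ("d", "h", "m"):
        count, total_s = divmod(total_s, UNIT_SECONDS[unit])
        if parts or count:
            parts.append(f"{count}{unit}")
    parts.append(f"{total_s}s")
    return " ".join(parts)


def failure_totals(rows):
    """Sum down time by planned/unplanned, ignoring TIME.CHANGE rows."""
    totals = {"unplanned": 0, "planned": 0}
    for related, _evid, duration, planned in rows:
        if related == "TIME.CHANGE":
            continue
        key = "planned" if planned == "YES" else "unplanned"
        totals[key] += parse_duration(duration)
    return totals


if __name__ == "__main__":
    for kind, seconds in failure_totals(FAILURES).items():
        print(f"total {kind}: {format_duration(seconds)}")
```

The sums agree with the report: 36m 31s unplanned and 3h 42m 15s planned.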

The Events Report

The Events report is displayed when you select the Events tab in the Samview NCMS GUI and when you specify the -r EVENTS option of the samrep(1M) command. This report shows a list of paired up and down events for a single object as selected with the Samview GUI or the samrep command line. The report is a listing of events from the log file. The information in this report is the basis for the calculations of up-time percentages and durations found in the other reports. The header contains the same information as the Failures report.

Information Fields of the Events Report

The body of the Events report contains the following columns of information retrieved from the event log.
event id
A unique identifier for the event. You can use the event id of the down event to specify an event record to alter with samlog, or as a suboption for the samrep -q EVENTS command.
related object
The name of the object associated with the failure. The name has the format of object type.name. This name can be that of the selected object for which the Events report was generated, or it can be the name of the supporting object that actually failed and caused the selected object's down event. You may need to view a separate report of the supporting object to determine its last logged state. The supporting object may be operational again despite having caused the selected object's down event. In such a case, this report shows only that the supporting object caused the selected object's down event, not the last logged state of the supporting object.
new state
The state of the related object after the event occurred. The state can be UP, DOWN, or GONE. For monitored objects, SAM places the UP and DOWN states in the event log file. The GONE state must be added with the samlog command, and stops SAM from accumulating down time for an object. For example, if you remove a node from the cluster, SAM considers it down. When you mark its state as GONE, SAM stops accumulating down time for it.
occurred when YYYY.MM.DD_HH:MM:SS
The time of the event in year, month, day, hour, minute, second format.
planned?
A YES- or NO-populated field that indicates whether the down event was planned or not. By default, all down time is logged in the event log as unplanned. You can annotate the event log file to specify that down time is planned using the samlog -i evid -p PLANNED command.
msg description
The short descriptive text associated with the event. SAM notes a basic reason for the event in the log file if possible. You can annotate the event log file to specify more descriptive text for an event using the samlog -i evid msg command.

Example Events Report

-------------------------------------------------------------------------------
SAM Events Report for NODE.2 on cluster27 at 2000.02.04_22:13:35
 
Reporting Period: 14d 10h 40m 53s                       Log Format Version: 1.1
Period Began: 2000.01.21_11:32:42              First Event: 2000.01.21_11:32:42
Period Ended: 2000.02.04_22:13:35               Last Event: 2000.02.04_16:14:13
 
 event     related  new   occurred when       plan  msg                        
 id         object  state YYYY.MM.DD_HH:MM:SS ned?  description                
 
 1088       NODE.2    UP  2000.01.21_11:32:42   NO                             
 1098  TIME.CHANGE  DOWN  2000.01.21_11:51:09   NO  changed -381 sec           
 1099  TIME.CHANGE    UP  2000.01.21_11:57:30   NO  heartbeat 10 window 300    
 1101    CLUS.SELF  DOWN  2000.01.21_12:01:18  YES  richard was testing        
 1108       NODE.2    UP  2000.01.21_15:41:36   NO                             
 1122       NODE.2  DOWN  2000.01.25_12:52:58   NO  samd home node died        
 1124       NODE.2    UP  2000.01.25_12:56:57   NO                             
 1125  TIME.CHANGE  DOWN  2000.01.25_12:58:32   NO  changed +334 sec           
 1126  TIME.CHANGE    UP  2000.01.25_13:04:06   NO  heartbeat 10 window 300    
 1127       NODE.2  DOWN  2000.01.25_13:05:59  YES                             
 1128       NODE.2    UP  2000.01.25_13:07:56   NO                             
 1129    CLUS.SELF  DOWN  2000.01.25_13:12:23   NO  cluster died               
 1131       NODE.2    UP  2000.01.25_13:18:38   NO                             
 1139  TIME.CHANGE  DOWN  2000.01.26_12:45:57   NO  changed +605 sec           
 1140  TIME.CHANGE    UP  2000.01.26_12:56:02   NO  heartbeat 10 window 300    
 1153    CLUS.SELF  DOWN  2000.01.28_13:49:14   NO  cluster died               
 1155       NODE.2    UP  2000.01.28_13:49:24   NO                             
 1160    CLUS.SELF  DOWN  2000.01.28_14:03:17   NO  cluster died               
 1164       NODE.2    UP  2000.01.28_14:10:59   NO                             
 1171  TIME.CHANGE  DOWN  2000.01.28_17:32:09   NO  changed -583 sec           
 1172  TIME.CHANGE    UP  2000.01.28_17:41:52   NO                             
 1175    CLUS.SELF  DOWN  2000.01.28_18:15:19   NO  cluster died               
 1179       NODE.2    UP  2000.01.28_18:22:45   NO                             
 1183    CLUS.SELF  DOWN  2000.01.31_10:32:09   NO  cluster died               
 1187       NODE.2    UP  2000.01.31_10:38:57   NO                             
 1196       NODE.2  DOWN  2000.02.01_10:33:03   NO  samd home node died        
 1198       NODE.2    UP  2000.02.01_10:37:14   NO                             
 
Report Notes:
  + TIME.CHANGE events were included only to indicate time changes occurred
  + times in the report header are for all objects, not a particular object
-------------------------------------------------------------------------------
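
Because the Events report is the basis for the durations in the other reports, a short Python sketch can show how a duration follows from a pair of events. The pairing rule used here (a DOWN event for any related object other than TIME.CHANGE, closed by the next UP event of the selected object) is inferred from the example reports and is an assumption, not a documented algorithm. The function name down_durations is hypothetical.

```python
from datetime import datetime

SAM_STAMP = "%Y.%m.%d_%H:%M:%S"

# First rows of the example Events report for NODE.2:
# (event id, related object, new state, occurred when)
EVENTS = [
    (1088, "NODE.2", "UP", "2000.01.21_11:32:42"),
    (1098, "TIME.CHANGE", "DOWN", "2000.01.21_11:51:09"),
    (1099, "TIME.CHANGE", "UP", "2000.01.21_11:57:30"),
    (1101, "CLUS.SELF", "DOWN", "2000.01.21_12:01:18"),
    (1108, "NODE.2", "UP", "2000.01.21_15:41:36"),
    (1122, "NODE.2", "DOWN", "2000.01.25_12:52:58"),
    (1124, "NODE.2", "UP", "2000.01.25_12:56:57"),
]


def down_durations(events, selected):
    """Map each down event id to the seconds until the selected object is up."""
    durations = {}
    pending = None  # (event id, time) of the down event not yet closed
    for evid, related, state, stamp in events:
        if related == "TIME.CHANGE":
            continue  # time changes do not count toward failure time
        when = datetime.strptime(stamp, SAM_STAMP)
        if state == "DOWN" and pending is None:
            pending = (evid, when)
        elif state == "UP" and related == selected and pending is not None:
            down_id, went_down = pending
            durations[down_id] = int((when - went_down).total_seconds())
            pending = None
    return durations


if __name__ == "__main__":
    for evid, seconds in down_durations(EVENTS, "NODE.2").items():
        print(evid, seconds)
```

For these rows the sketch gives 13218 seconds (3h 40m 18s) for event 1101 and 239 seconds (3m 59s) for event 1122, which match the durations listed for those events in the example Failures report.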

How SAM Calculates Up and Down Time

SAM records up and down time for several objects. All recording is done from an operational node after the down event occurs. Because of this after-the-fact recording, SAM can record events, such as a cluster failure, that prevent SAM from recording anything at the moment they occur. Every recorded event has a timestamp. From these timestamps SAM determines the amount of time each object has been up or down. While an object is in a GONE state, neither total up time nor total down time is incremented.

Each object's first event can have a different timestamp. However, if you specify a reporting period that begins before an object's first event, the object is treated as if its first event happened at the time you specified. This behavior increases the total up or down time of the object, depending on whether its first state was UP or DOWN. A similar situation exists if you specify a reporting period that ends after an object's last event. The amount of total up or down time is increased depending on whether its last state was UP or DOWN.
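
The effect of the reporting period on the totals can be sketched in Python as follows. This is a simplified model under stated assumptions: timestamps are plain integer seconds, the function name up_down_totals is hypothetical, and clipping events that fall outside the period is an assumption that the text above does not spell out.

```python
def up_down_totals(events, period_start, period_end):
    """Total UP and DOWN seconds for one object within a reporting period.

    events is a time-ordered list of (timestamp_seconds, state) pairs,
    where state is "UP", "DOWN", or "GONE".
    """
    totals = {"UP": 0, "DOWN": 0}
    if not events:
        return totals
    points = list(events)
    first_time, first_state = points[0]
    if period_start < first_time:
        # Treat the first event as if it happened at the period start.
        points[0] = (period_start, first_state)
    # The last state is extended to the end of the period.
    following = points[1:] + [(period_end, None)]
    for (begin, state), (end, _next_state) in zip(points, following):
        begin = max(begin, period_start)
        end = min(end, period_end)
        if end > begin and state in totals:  # GONE accumulates nothing
            totals[state] += end - begin
    return totals


if __name__ == "__main__":
    sample = [(100, "UP"), (200, "DOWN"), (250, "UP"), (400, "GONE")]
    print(up_down_totals(sample, 0, 500))
```

In the sample, the period begins 100 seconds before the object's first event, so those 100 seconds are added to its up time, while the time after the GONE event adds to neither total.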

The timestamps for up and down events are derived as follows for each object type.

CLUS
The up event of the cluster has the same timestamp as the first node to come up. The down event of the cluster is timestamped based on the value contained in the heartbeat file of samd. When the cluster fails, samd fails and does not write any new values in its heartbeat file until it is restarted. When restarted, samd reads the old value and uses it as the timestamp for the cluster down event. The accuracy of the cluster down event timestamp depends on the heartbeat period of samd, which is 10 seconds by default.
NODE
Timestamps for both up and down events are obtained from the node state transition times provided by the Cluster Membership API. SAM treats any state other than UP as DOWN.
APPL
Timestamps for application events must be supplied by the application that invokes samlog(1M). In the case of SAM, which is itself an application, the up and down events reflect the failure and restart times of samd, or an internally monitored component of samd. For samd, the up event is timestamped with the current time as soon as the samd process starts execution. The down event is timestamped using the heartbeat file in a manner similar to that used for cluster failures.
TIME
The up event of a time change event is the later event. The down event is the earlier event. When SAM detects a change in time that exceeds the threshold, one event is timestamped using the current time, and the other is timestamped using the heartbeat file. The default threshold is 300 seconds.
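
The TIME rule above can be sketched in Python as follows. This is a hedged illustration only: the function name time_change_events is hypothetical, the sketch ignores the heartbeat period, and the assignment of the two example timestamps to the heartbeat file and the current time is inferred from the "changed -381 sec" message in the example reports.

```python
from datetime import datetime

SAM_STAMP = "%Y.%m.%d_%H:%M:%S"
THRESHOLD_S = 300  # default threshold stated above


def time_change_events(heartbeat_stamp, current_stamp, threshold_s=THRESHOLD_S):
    """Return the DOWN/UP pair for a time change, or None if within threshold.

    One event uses the heartbeat value and the other the current time;
    the earlier of the two becomes the down event, the later the up event.
    """
    heartbeat = datetime.strptime(heartbeat_stamp, SAM_STAMP)
    current = datetime.strptime(current_stamp, SAM_STAMP)
    change_s = int((current - heartbeat).total_seconds())
    if abs(change_s) <= threshold_s:
        return None  # the change does not exceed the threshold
    down, up = sorted((heartbeat, current))
    return {
        "DOWN": down.strftime(SAM_STAMP),
        "UP": up.strftime(SAM_STAMP),
        "msg": f"changed {change_s:+d} sec",  # wording copied from the example
    }


if __name__ == "__main__":
    # Clock set back: the heartbeat value is later than the current time.
    print(time_change_events("2000.01.21_11:57:30", "2000.01.21_11:51:09"))
```

With these inputs the sketch yields a down event at 11:51:09 and an up event at 11:57:30, the same pair shown for events 1098 and 1099 in the example Events report.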

SAM Commands

The following commands are available for generating and annotating reports:

ncms(1M)
Starts the NonStop Clusters Management Suite (NCMS) from which the Samview SAM report browser can be run.
samrep(1M)
Provides the command-line interface to SAM reports.
samlog(1M)
Provides command-line annotation of the SAM log files.

SAM also includes samdctl(1M), which provides administrative and programmatic control over samd(1M), the SAM daemon. The samdctl command requires root permission.

Diagnostics

SAM command warning messages indicate that an unexpected but non-fatal condition was encountered, and the operation succeeded. Error messages indicate that an unexpected and fatal condition was encountered, and the operation failed. The SAM commands return a zero status on success and a non-zero status on failure.

The format of all such messages is:

command_name: error: details_of_fatal_condition
command_name: warning: details_of_non_fatal_condition
command_name: info: informational_text

In rare instances when a SAM command may be unable to display such a message, it will attempt to append the message to the file /var/avail/sam/sam.err.
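
A script that wraps the SAM commands could classify their messages with a Python sketch like the following. The names parse_sam_message and is_fatal are hypothetical, and the message text in the usage example is invented for illustration. Only the command_name: level: details layout comes from the format shown above.

```python
import re

# Layout taken from the message format above: command_name: level: details
MESSAGE_RE = re.compile(
    r"^(?P<command>[^:\s]+): (?P<level>error|warning|info): (?P<details>.*)$"
)


def parse_sam_message(line):
    """Return a dict with command, level, and details, or None if no match."""
    match = MESSAGE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    return match.groupdict()


def is_fatal(line):
    """True when the line is an error message, meaning the operation failed."""
    parsed = parse_sam_message(line)
    return parsed is not None and parsed["level"] == "error"


if __name__ == "__main__":
    # Invented message text, shown only to exercise the parser.
    print(parse_sam_message("samrep: error: cannot read event log"))
```

Checking the exit status remains the primary test for success or failure. Parsing the messages is useful mainly for telling warnings apart from informational text.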

Files

/var/avail/sam/eventlog
The SAM event log
/var/avail/sam/.state
The private SAM directory for lock and state files
/var/avail/sam/sam.err
The auxiliary message file used when a command is unable to display an error message.

References

ncms(1M), samdctl(1M), samd(1M), samlog(1M), samrep(1M), cluster(4)
03 Feb 2000
© 2000 The Santa Cruz Operation, Inc. All rights reserved.
UnixWare 7 Release 7.1.1b - 14 April 2000
© 2000 Compaq Computer Corporation.