intro_sam(1M)
intro_sam -- introduce the System Availability Monitor
Description
The System Availability Monitor (SAM) monitors the cluster
and its nodes and records associated failure events.
SAM also monitors changes in system time and
its own availability. It records events in a log file and
produces reports about system availability.
The Summary report contains information
about each of the monitored objects. The Failures
report provides detailed information about the
down time of a
single object. The Events report provides detailed
information about the up and down time of a single
object.
You can annotate the event record using the samlog(1M)
command to add text describing an
event, to note that down time was planned, and so on. You
can view the report from the NonStop Clusters Management
Suite (NCMS) graphical user interface (GUI) using the
ncms(1M) command and selecting the Samview application. You can view the reports
from the command line using the samrep(1M)
command.
SAM has a monitor (samd) that watches for node and cluster events and writes the log. It also has a report generator that reads the log and generates availability percentages (samrep). You can generate the availability reports using the command line interface (samrep) or the NCMS GUI.
Once installed, SAM runs indefinitely to record the
cluster availability statistics.
SAM should not be stopped. When SAM is not operational, new events are lost
and report data loses value.
Events are recorded in /var/avail/sam/eventlog.
By default, the event log resides in the root filesystem. SAM cannot log new events when the filesystem that contains the event log is full.
If this
file is removed (not advised), SAM creates a new one.
If the event log is copied elsewhere, it can still be processed
using the samrep -f pathname command. However, the entries from the copied event log and the current event log are not combined in the same report. The current event log contains
records starting with the time it was created.
SAM is intended to handle only a low volume of logging. Significant portions of SAM are implemented in scripting languages, which execute relatively slowly. Because the event log file is intended to be kept forever, you should log as few events as necessary to limit its size.
Before using the SAM reports, read the following information to understand SAM and its reports:
About Monitored Objects
SAM monitors the up and down times of the cluster, the nodes, and
itself. SAM also monitors
time changes of the system clock to help with the
interpretation of report data. With the
samrep and
samlog commands, you can specify the type
and name of an object
for a report or to modify a record in the log file.
These SAM commands specify the type of monitored
object with the -t objtype
option. The specific object of that type is
specified by the -n objname
option of the SAM commands. For example, the
options -t NODE -n 2
specify node 2 as the specific object.
The types of objects that SAM monitors are as follows:
- CLUS
- The cluster as a whole.
- NODE
- The number of a node in the cluster.
- TIME
- A time change event, for
example, the use of the date(1M)
command to change system time. In
the reports, this object appears as
TIME.CHANGE.
- APPL
- SAM by default. If the
event log file has been altered
to include data for other applications, this object
can specify them as well. This object appears in the
reports as APPL.SAM
or APPL.application,
where application
is the application name added to the event log.
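As an illustration, the type.name form shown in the reports can be split back into -t objtype and -n objname arguments. The following Python sketch assumes that every report name maps directly onto the two options; this page confirms that mapping only for NODE objects. The helper name object_options is hypothetical and is not part of SAM.

```python
# The four object types listed above.
KNOWN_TYPES = {"CLUS", "NODE", "TIME", "APPL"}


def object_options(spec):
    """Split a report-style 'TYPE.name' string into -t/-n arguments.

    Assumption: the name after the first dot is what -n expects.
    """
    objtype, sep, objname = spec.partition(".")
    if not sep or not objname:
        raise ValueError(f"expected TYPE.name, got {spec!r}")
    if objtype not in KNOWN_TYPES:
        raise ValueError(f"unknown object type {objtype!r}")
    return ["-t", objtype, "-n", objname]


print(object_options("NODE.2"))   # ['-t', 'NODE', '-n', '2']
```

The returned list could be appended to a samrep or samlog argument vector; whether names such as SELF or CHANGE are accepted by -n should be checked against samrep(1M) and samlog(1M).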
About the SAM Reports
SAM produces three human-readable reports and one
programmatic report. You can use the Samview interface
to view the human-readable reports. The
programmatic report is strictly for programs to read and is
described in the
samrep(1M)
manual page. The human-readable reports are the Summary,
Failures, and Events reports.
The reports can be annotated using the
samlog
command. Annotate the log file to note when and why the
system time was changed, and so on. The reports cannot be
annotated with the Samview NCMS interface, however.
With the samlog command, you can add a short annotation to an
event record, change unplanned down time to planned,
or mark the state of an object as GONE.
Such annotation is useful to record permanent node
removals and to make the reports more meaningful.
You can also use samlog to add complete records to the event log
for other system objects. SAM calculates availability statistics for
those objects when their up and down events are properly recorded.
Although the reports include system time changes greater
than five minutes, these changes do not impact the
availability percentages. The changes in system time are
provided as additional information to use for
interpreting the reports.
View the human-readable reports with the Samview
GUI by entering ncms on the command
line and selecting Samview
from the list of choices that appears.
You can also use the samrep command to view the reports.
The Summary Report
The Summary report is displayed by default from
the Samview NCMS GUI or
when you enter samrep on
the command line.
The header in the report contains
the following information:
- Cluster name
- Current time
- Reporting period, including the years, days, hours, minutes, and seconds covered by the report
- Period begin and end dates
- Times the first and last event in the report occurred
Information Fields of the Summary Report
The body of the Summary report contains the following columns of information retrieved from the event log:
- object type.name
- The type and name for each object in the report.
- down cnt
- The number of times the object went
down during the time period noted
in the header information.
- last went down when
- The time of the most recent down event.
This time is a date if the event did not occur
on the day of the report. If the down event
occurred on the day of the report, the time
is noted by hours, minutes, and seconds
using a twenty-four hour format.
- total down time unplanned
- The total amount of unplanned time the object was
down in hours, minutes, and seconds. By default, all down time is logged in the event log as unplanned, but you can annotate the event log file to specify that down time is planned using the
samlog -i evid -p PLANNED command.
- total down time planned
- The total amount of planned time the object was down. The time is displayed
in hours, minutes, and seconds. By default, all down time is logged in the event log as unplanned. You can annotate the event log file to specify that down time is planned using the samlog -i evid -p PLANNED
command.
- up time%
- The percent of time the object was up.
SAM calculates this time by dividing the total up time by the total
up and down time for the object.
This percentage can be affected by the
samrep command line options, which allow
you to specify that planned down time counts as
up time for the percentage calculation in a report
(samrep -k PLANNED | UNPLANNED).
- last state
- The last logged state of the object when the report ran.
The state can be UP, DOWN, or
GONE. The UP and
DOWN states are placed in the event
log file by SAM. The GONE
state must be added with the samlog
command, and stops SAM from accumulating down time for
an object. For example, if you remove a node from the
cluster, SAM considers it down. When you mark its
state as GONE, SAM stops accumulating down time for it.
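The display rule for the last went down when column can be sketched in Python as follows. The timestamp layout is taken from the report headers shown on this page; the function name when_column is hypothetical, and the sketch illustrates the rule only, not how samrep is implemented.

```python
from datetime import datetime

# Timestamp layout used in the SAM report headers, e.g. 2000.02.04_22:06:22
STAMP = "%Y.%m.%d_%H:%M:%S"


def when_column(event_stamp, report_stamp):
    """Return HH:MM:SS for a down event on the report day, else the date."""
    event = datetime.strptime(event_stamp, STAMP)
    report = datetime.strptime(report_stamp, STAMP)
    if event.date() == report.date():
        return event.strftime("%H:%M:%S")   # twenty-four hour format
    return event.strftime("%Y.%m.%d")


print(when_column("2000.02.04_11:09:37", "2000.02.04_22:06:22"))   # 11:09:37
print(when_column("2000.02.01_10:33:03", "2000.02.04_22:06:22"))   # 2000.02.01
```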
Example Summary Report
-------------------------------------------------------------------------------
SAM Summary Report for All Objects on cluster27 at 2000.02.04_22:06:22
Reporting Period: 14d 10h 33m 40s Log Format Version: 1.1
Period Began: 2000.01.21_11:32:42 First Event: 2000.01.21_11:32:42
Period Ended: 2000.02.04_22:06:22 Last Event: 2000.02.04_16:14:13
object down last went total down time total down time up time% last
type.name cnt down when unplanned planned state
CLUS.SELF 6 2000.01.31 17m 12s 4m 48s 99.8942 UP
NODE.1 10 2000.02.03 30m 40s 14m 35s 99.7824 UP
NODE.2 9 2000.02.01 36m 31s 3h 42m 15s 98.7556 UP
NODE.3 11 11:09:37 46m 29s 2m 4s 99.7665 UP
TIME.CHANGE 4 2000.01.28 31m 43s 0s 99.8475 UP
APPL.SAM 49 16:13:56 46m 12s 19m 54s 99.6821 UP
APPL.COMM 3 2000.01.28 23m 36s 0s 98.0987 GONE
Report Notes:
+ any object that was last down sometime today shows when as HH:MM:SS
-------------------------------------------------------------------------------
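The up time% column of the example above can be reproduced by hand. The following Python sketch assumes that the reporting period in the header is the total up and down time of the object, which holds for the objects whose last state is UP in this example but not for an object marked GONE, such as APPL.COMM. The function names and the planned_counts_as_up parameter are hypothetical; the parameter only mimics the effect described for the samrep -k option. Durations that include years are not handled.

```python
import re
from datetime import datetime

STAMP = "%Y.%m.%d_%H:%M:%S"
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def to_seconds(text):
    """Convert a report duration such as '3h 42m 15s' to seconds."""
    return sum(int(n) * UNIT_SECONDS[u]
               for n, u in re.findall(r"(\d+)([dhms])", text))


def up_time_percent(period, unplanned, planned, planned_counts_as_up=False):
    """Up time divided by total up and down time, as a percentage."""
    down = unplanned if planned_counts_as_up else unplanned + planned
    return 100.0 * (period - down) / period


# Period Began and Period Ended from the example header.
began = datetime.strptime("2000.01.21_11:32:42", STAMP)
ended = datetime.strptime("2000.02.04_22:06:22", STAMP)
period = int((ended - began).total_seconds())   # 14d 10h 33m 40s

# NODE.2 row: 36m 31s unplanned, 3h 42m 15s planned.
node2 = up_time_percent(period, to_seconds("36m 31s"),
                        to_seconds("3h 42m 15s"))
print(f"{node2:.4f}")   # 98.7556
```

The result matches the NODE.2 row of the example report, which suggests that planned down time counts as down time by default.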
The Failures Report
The Failures report is displayed when you select the
Failures tab in the Samview
NCMS GUI and when you specify
the -r FAILURES option of the
samrep command. This report
shows a list of down events for a single
object selected with the GUI or specified on
the samrep command line.
The header contains the same information as for the
Summary report, except that the first line contains the
name of the selected object in
object type.name format.
Information Fields of the Failures Report
The body of the Failures report contains the following
columns of information retrieved from the event log.
- related object
- The name of the object associated with the failure.
The name has the format of object
type.name. This name may be that of the selected object
for which the Failures report was generated, or it may be
the name of the supporting object that actually failed
and caused the selected object's down event.
- event id
- A unique identifier for the event. You can use this
id to specify an event record to alter with
samlog, or as
a suboption for the samrep -q EVENTS command.
- went down when
- The time the failure occurred. This time appears as
the date when the event did not occur on the day of the report.
If the down event occurred on the day of the report, the
time is noted by hours, minutes, and seconds using a
twenty-four hour format.
- duration for object type.name
- The amount of time that the
object type.name
was down because of this down event. SAM calculates this
duration from data in the event log.
- planned?
- A YES- or NO-populated
field that indicates whether
the down event was planned or not. By default, all down time is logged in the event log as unplanned. You can annotate the event log file to specify that down time is planned using the
samlog -i evid -p PLANNED command.
Total planned and unplanned down time appears at the
bottom of the report.
- msg description
- The short descriptive text associated with the event.
SAM notes a basic reason for the event, if possible. You
can annotate the event log file to specify more descriptive
text for an event using the samlog -i
evid msg
command.
The Failures report can be used to locate an event id
to use as the evid in the
samlog -i evid command line.
Example Failures Report
-------------------------------------------------------------------------------
SAM Failures Report for NODE.2 on cluster27 at 2000.02.04_22:12:50
Reporting Period: 14d 10h 40m 8s Log Format Version: 1.1
Period Began: 2000.01.21_11:32:42 First Event: 2000.01.21_11:32:42
Period Ended: 2000.02.04_22:12:50 Last Event: 2000.02.04_16:14:13
related event went down duration for plan msg
object id when NODE.2 ned? description
TIME.CHANGE 1098 2000.01.21 6m 21s NO changed -381 sec
CLUS.SELF 1101 2000.01.21 3h 40m 18s YES richard was testing
NODE.2 1122 2000.01.25 3m 59s NO samd home node died
TIME.CHANGE 1125 2000.01.25 5m 34s NO changed +334 sec
NODE.2 1127 2000.01.25 1m 57s YES
CLUS.SELF 1129 2000.01.25 6m 15s NO cluster died
TIME.CHANGE 1139 2000.01.26 10m 5s NO changed +605 sec
CLUS.SELF 1153 2000.01.28 10s NO cluster died
CLUS.SELF 1160 2000.01.28 7m 42s NO cluster died
TIME.CHANGE 1171 2000.01.28 9m 43s NO changed -583 sec
CLUS.SELF 1175 2000.01.28 7m 26s NO cluster died
CLUS.SELF 1183 2000.01.31 6m 48s NO cluster died
NODE.2 1196 2000.02.01 4m 11s NO samd home node died
----------------
total unplanned 36m 31s
total planned 3h 42m 15s
Report Notes:
+ the exact time when objects went down can be obtained via the EVENTS report
+ TIME.CHANGE events were ignored in computing NODE.2 failure time
+ times in the report header are for all objects, not a particular object
-------------------------------------------------------------------------------
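The totals at the bottom of the example above can be checked by summing the duration column by its planned? value and skipping the TIME.CHANGE rows, as the report notes describe. The Python sketch below copies the rows from the example; the helper names are hypothetical, and the to_text formatting is only an approximation of the layout seen in the examples.

```python
import re

UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}

# (related object, duration for NODE.2, planned?) from the example report.
ROWS = [
    ("TIME.CHANGE", "6m 21s", "NO"), ("CLUS.SELF", "3h 40m 18s", "YES"),
    ("NODE.2", "3m 59s", "NO"), ("TIME.CHANGE", "5m 34s", "NO"),
    ("NODE.2", "1m 57s", "YES"), ("CLUS.SELF", "6m 15s", "NO"),
    ("TIME.CHANGE", "10m 5s", "NO"), ("CLUS.SELF", "10s", "NO"),
    ("CLUS.SELF", "7m 42s", "NO"), ("TIME.CHANGE", "9m 43s", "NO"),
    ("CLUS.SELF", "7m 26s", "NO"), ("CLUS.SELF", "6m 48s", "NO"),
    ("NODE.2", "4m 11s", "NO"),
]


def to_seconds(text):
    """Convert a report duration such as '3h 40m 18s' to seconds."""
    return sum(int(n) * UNIT_SECONDS[u]
               for n, u in re.findall(r"(\d+)([dhms])", text))


def to_text(seconds):
    """Format seconds roughly as the example reports do."""
    parts = []
    for unit in "dhm":
        count, seconds = divmod(seconds, UNIT_SECONDS[unit])
        if count:
            parts.append(f"{count}{unit}")
    parts.append(f"{seconds}s")
    return " ".join(parts)


def totals(rows):
    """Return (unplanned, planned) seconds, ignoring TIME.CHANGE rows."""
    sums = {"NO": 0, "YES": 0}
    for related, duration, planned in rows:
        if related.startswith("TIME."):
            continue
        sums[planned] += to_seconds(duration)
    return sums["NO"], sums["YES"]


unplanned, planned = totals(ROWS)
print("total unplanned", to_text(unplanned))   # total unplanned 36m 31s
print("total planned", to_text(planned))       # total planned 3h 42m 15s
```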
The Events Report
The Events report is displayed when you select the
Events tab in the Samview NCMS GUI
and when you specify the -r EVENTS
option of the samrep(1M) command.
This report shows a list of paired
up and down events for a single object as selected
with the Samview GUI or the samrep
command line. The report is a listing of events
from the log file.
The information in this report is the basis for the
calculations of up-time percentages and durations found
in the other reports.
The
header contains the same information as the Failures report.
Information Fields of the Events Report
The body of the Events report contains the following
columns of information retrieved from the event log.
- event id
- A unique identifier for the event.
You can use the event id of the down event
to specify an event record to alter with
samlog, or as a suboption for
the samrep -q EVENTS command.
- related object
- The name of the object associated with the failure. The name
has the format of object type.name.
This
name can be that of the selected object for which the Events
report was generated, or it can be the
name of the supporting object that actually failed
and caused the selected object's down event.
You may need to view a separate report of the supporting object
to determine its last logged state. The supporting
object may be operational again despite having caused
the selected object's down event. In such a
case, this report shows only that the
supporting object caused the selected object's
down event, not the last logged state of the supporting object.
- new state
- The state of the related object after the event occurred.
The state can be UP,
DOWN, or GONE.
For monitored objects,
SAM places the UP and DOWN
states in the event log file.
The GONE state must
be added with the samlog
command, and stops SAM from accumulating down time for
an object. For example, if you remove a
node from the cluster, SAM considers it down.
When you mark its state as GONE,
SAM stops accumulating
down time for it.
- occurred when YYYY.MM.DD_HH:MM:SS
- The time of the event in year, month,
day, hour, minute, second format.
- planned?
- A YES- or NO-populated
field that indicates whether
the down event was planned or not.
By default, all down time is logged in the event log as unplanned. You can annotate the event log file to specify that down time is planned using the
samlog -i evid -p PLANNED command.
- msg description
- The short descriptive text associated with the event.
SAM notes a basic reason for the event in the log file if
possible.
You can annotate the event log file to specify more descriptive
text for an event using the
samlog -i evid msg
command.
Example Events Report
-------------------------------------------------------------------------------
SAM Events Report for NODE.2 on cluster27 at 2000.02.04_22:13:35
Reporting Period: 14d 10h 40m 53s Log Format Version: 1.1
Period Began: 2000.01.21_11:32:42 First Event: 2000.01.21_11:32:42
Period Ended: 2000.02.04_22:13:35 Last Event: 2000.02.04_16:14:13
event related new occurred when plan msg
id object state YYYY.MM.DD_HH:MM:SS ned? description
1088 NODE.2 UP 2000.01.21_11:32:42 NO
1098 TIME.CHANGE DOWN 2000.01.21_11:51:09 NO changed -381 sec
1099 TIME.CHANGE UP 2000.01.21_11:57:30 NO heartbeat 10 window 300
1101 CLUS.SELF DOWN 2000.01.21_12:01:18 YES richard was testing
1108 NODE.2 UP 2000.01.21_15:41:36 NO
1122 NODE.2 DOWN 2000.01.25_12:52:58 NO samd home node died
1124 NODE.2 UP 2000.01.25_12:56:57 NO
1125 TIME.CHANGE DOWN 2000.01.25_12:58:32 NO changed +334 sec
1126 TIME.CHANGE UP 2000.01.25_13:04:06 NO heartbeat 10 window 300
1127 NODE.2 DOWN 2000.01.25_13:05:59 YES
1128 NODE.2 UP 2000.01.25_13:07:56 NO
1129 CLUS.SELF DOWN 2000.01.25_13:12:23 NO cluster died
1131 NODE.2 UP 2000.01.25_13:18:38 NO
1139 TIME.CHANGE DOWN 2000.01.26_12:45:57 NO changed +605 sec
1140 TIME.CHANGE UP 2000.01.26_12:56:02 NO heartbeat 10 window 300
1153 CLUS.SELF DOWN 2000.01.28_13:49:14 NO cluster died
1155 NODE.2 UP 2000.01.28_13:49:24 NO
1160 CLUS.SELF DOWN 2000.01.28_14:03:17 NO cluster died
1164 NODE.2 UP 2000.01.28_14:10:59 NO
1171 TIME.CHANGE DOWN 2000.01.28_17:32:09 NO changed -583 sec
1172 TIME.CHANGE UP 2000.01.28_17:41:52 NO
1175 CLUS.SELF DOWN 2000.01.28_18:15:19 NO cluster died
1179 NODE.2 UP 2000.01.28_18:22:45 NO
1183 CLUS.SELF DOWN 2000.01.31_10:32:09 NO cluster died
1187 NODE.2 UP 2000.01.31_10:38:57 NO
1196 NODE.2 DOWN 2000.02.01_10:33:03 NO samd home node died
1198 NODE.2 UP 2000.02.01_10:37:14 NO
Report Notes:
+ TIME.CHANGE events were included only to indicate time changes occurred
+ times in the report header are for all objects, not a particular object
-------------------------------------------------------------------------------
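The durations in the Failures report can be derived from paired events such as those listed above. The Python sketch below pairs each down event, whether of the selected object or of a supporting object, with the next up event of the selected object. The pairing rule is inferred from the two example reports and is an assumption, not a description of samrep internals. The rows are a subset of the example, and the function name is hypothetical.

```python
from datetime import datetime

STAMP = "%Y.%m.%d_%H:%M:%S"

# (event id, related object, new state, occurred when), from the example.
EVENTS = [
    (1088, "NODE.2", "UP", "2000.01.21_11:32:42"),
    (1101, "CLUS.SELF", "DOWN", "2000.01.21_12:01:18"),
    (1108, "NODE.2", "UP", "2000.01.21_15:41:36"),
    (1122, "NODE.2", "DOWN", "2000.01.25_12:52:58"),
    (1124, "NODE.2", "UP", "2000.01.25_12:56:57"),
    (1153, "CLUS.SELF", "DOWN", "2000.01.28_13:49:14"),
    (1155, "NODE.2", "UP", "2000.01.28_13:49:24"),
]


def down_durations(events, selected):
    """Map each down event id to the seconds until the selected object is up."""
    durations = {}
    pending = None                       # (event id, time) of an open outage
    for evid, related, state, stamp in events:
        if related.startswith("TIME."):
            continue                     # time changes are informational only
        when = datetime.strptime(stamp, STAMP)
        if state == "DOWN" and pending is None:
            pending = (evid, when)
        elif state == "UP" and related == selected and pending is not None:
            durations[pending[0]] = int((when - pending[1]).total_seconds())
            pending = None
    return durations


print(down_durations(EVENTS, "NODE.2"))
# {1101: 13218, 1122: 239, 1153: 10}
```

These values correspond to the 3h 40m 18s, 3m 59s, and 10s durations shown for the same event ids in the example Failures report.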
How SAM Calculates Up and Down Time
SAM records up and
down time for several objects. All
recording is done from an operational node after the down event occurs. Because events are recorded after the fact, SAM can log even those failures that stop SAM itself from running at the time they occur.
Every recorded event has a timestamp. From these timestamps SAM determines the amount of time each object has been up or down. While an object is in a GONE state, neither total up time nor total down time is incremented.
Each object's first event can have a different timestamp. However, if you specify a reporting period that begins before an object's first event, the object is treated as if its first event happened at the time you specified. This behavior increases the total up or down time of the object, depending on whether its first state was UP or DOWN. A similar situation exists if you specify a reporting period that ends after an object's last event. The amount of total up or down time is increased depending on whether its last state was UP or DOWN.
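The effect of the reporting period on the totals can be sketched in Python as follows. The sketch works on plain second counts and assumes a time-ordered list of state changes for one object. The function name accumulate is hypothetical, and the code illustrates the rules described above rather than the SAM implementation.

```python
def accumulate(events, period_start, period_end):
    """Return (up_seconds, down_seconds) within the reporting period.

    events: time-ordered (seconds, state) pairs for one object,
            with state one of 'UP', 'DOWN', or 'GONE'.
    """
    if not events:
        return 0, 0
    times = [t for t, _ in events]
    states = [s for _, s in events]
    # A period that begins before the first event stretches the first state
    # back to the start of the period.
    times[0] = min(times[0], period_start)
    # Each state lasts until the next event; the last state lasts until the
    # end of the period.
    ends = times[1:] + [max(times[-1], period_end)]
    totals = {"UP": 0, "DOWN": 0, "GONE": 0}
    for begin, end, state in zip(times, ends, states):
        begin = max(begin, period_start)
        end = min(end, period_end)
        if end > begin:
            totals[state] += end - begin
    # Time spent in the GONE state counts as neither up nor down.
    return totals["UP"], totals["DOWN"]


print(accumulate([(100, "UP"), (200, "DOWN"), (260, "UP")], 0, 300))
# (240, 60)
print(accumulate([(100, "UP"), (200, "DOWN"), (250, "GONE")], 0, 1000))
# (200, 50)
```

In the first call, the object's first UP state is extended back from 100 to 0 and its last UP state forward from 260 to 300. In the second call, the 750 seconds after the GONE event add to neither total.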
The timestamps for up and down events are derived as follows for each object type.
- CLUS
- The up event of the cluster has the same timestamp as the first node to come up. The down event of the cluster is timestamped based on the value contained in the heartbeat file of samd. When the cluster fails, samd fails and does not write any new values in its heartbeat file until it is restarted. When restarted, samd reads the old value and uses it as the timestamp for the cluster down event. The accuracy of the cluster down event timestamp depends on the heartbeat period of samd, which is 10 seconds by default.
- NODE
- Timestamps for both up and down events are obtained from the node state transition times provided by the Cluster Membership API. SAM treats any state other than UP as DOWN.
- APPL
- Timestamps for application events must be supplied by the application that invokes samlog(1M). In the case of SAM, which is itself an application, the up and down events reflect the failure and restart times of samd, or an internally monitored component of samd. For samd, the up event is timestamped with the current time as soon as the samd process starts execution. The down event is timestamped using the heartbeat file in a manner similar to that used for cluster failures.
- TIME
- The up event of a time change event is the later event. The down event is the earlier event. When SAM detects a change in time that exceeds the threshold, one event is timestamped using the current time, and the other is timestamped using the heartbeat file. The default threshold is 300 seconds.
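The TIME rule above can be sketched in Python as follows. The sketch assumes that the size of the change is simply the current time minus the value read from the heartbeat file, and that a negative value means the clock was set back. Both are assumptions inferred from the "changed" messages in the example reports; the time elapsed since the last heartbeat is ignored. The function name is hypothetical.

```python
from datetime import datetime

STAMP = "%Y.%m.%d_%H:%M:%S"
THRESHOLD = 300   # default threshold in seconds, as stated above


def time_change_events(heartbeat_stamp, current_stamp, threshold=THRESHOLD):
    """Return (down_stamp, up_stamp, message), or None below the threshold."""
    heartbeat = datetime.strptime(heartbeat_stamp, STAMP)
    current = datetime.strptime(current_stamp, STAMP)
    delta = int((current - heartbeat).total_seconds())
    if abs(delta) <= threshold:
        return None
    # The down event is the earlier timestamp; the up event is the later one.
    down, up = sorted((heartbeat, current))
    return (down.strftime(STAMP), up.strftime(STAMP),
            f"changed {delta:+d} sec")


print(time_change_events("2000.01.21_11:57:30", "2000.01.21_11:51:09"))
# ('2000.01.21_11:51:09', '2000.01.21_11:57:30', 'changed -381 sec')
```

The output corresponds to event ids 1098 and 1099 in the example Events report.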
SAM Commands
The following commands are available for generating
and annotating reports:
- ncms(1M)
- Starts the NonStop Clusters Management Suite (NCMS) from which
the Samview SAM report browser can be run.
- samrep(1M)
- Provides the command line interface to SAM reports.
- samlog(1M)
- Provides command-line annotation of the SAM log files.
SAM also includes samdctl(1M), which provides administrative and programmatic control over samd(1M), the SAM
daemon. The samdctl command requires root permission.
Diagnostics
SAM command warning messages indicate that an unexpected but non-fatal condition was encountered, and the operation succeeded. Error messages indicate that an unexpected and fatal condition was encountered, and the operation failed. The SAM commands return a zero status on success and a non-zero status on failure.
The format of all such messages is:
command_name: error: details_of_fatal_condition
command_name: warning: details_of_non_fatal_condition
command_name: info: informational_text
In rare instances when a SAM command is unable to display such a message, it attempts to append the message to the file /var/avail/sam/sam.err.
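A script that wraps the SAM commands can classify their messages by the format shown above. The Python sketch below parses one message line. The sample text passed to it is made up for illustration and is not an actual SAM message; the function names are hypothetical.

```python
import re

# command_name: severity: details
MESSAGE = re.compile(
    r"^(?P<command>[^:\s]+): (?P<severity>error|warning|info): (?P<details>.*)$"
)


def parse_message(line):
    """Return a dict with command, severity, and details, or None."""
    match = MESSAGE.match(line.rstrip("\n"))
    return match.groupdict() if match else None


def is_fatal(parsed):
    """Only error messages indicate that the operation failed."""
    return parsed is not None and parsed["severity"] == "error"


sample = parse_message("samrep: warning: example text")
print(sample["severity"], is_fatal(sample))   # warning False
```

The exit status of the command remains the authoritative indication of success or failure; the parsed severity is useful mainly for logging.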
Files
- /var/avail/sam/eventlog
- The SAM event log
- /var/avail/sam/.state
- The private SAM directory for lock and state files
- /var/avail/sam/sam.err
- The auxiliary message file, used when a command is unable to display an error message.
References
ncms(1M),
samdctl(1M),
samd(1M),
samlog(1M),
samrep(1M),
cluster(4)
03 Feb 2000
© 2000 The Santa Cruz Operation, Inc. All rights
reserved.
UnixWare 7 Release 7.1.1b - 14 April 2000
© 2000 Compaq Computer Corporation.