Link Exceptions Help
Overview
A link exception occurs when the SAIL ASIC on a SPA detects a problem receiving
data on its ServerNet SAN X or Y port. If there is a transmission error
during normal operation, the receiving SPA reports the error back
to the transmitting SPA in the form of a "This Link Bad" (TLB) command
symbol. In this way, SPAs learn of transmission and reception problems.
Link exceptions can also occur for a brief period when nodes are added
to or removed from a cluster.
When a link exception occurs, the SAIL ASIC captures the cause of the
exception in a bit in one of its two link exception registers. There is
one link exception register for the X port and one for the Y port. Each
type of link exception has a corresponding bit in these registers. When
the SPA driver (SPAD) takes an interrupt for a link exception, it reads
the link exception registers, increments its count for any link exception
it finds flagged there, and then clears the link exception registers in
preparation for the next link exception.
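The read, count, and clear sequence described above can be sketched as follows. This is only an illustration, not the SPAD's actual code: the bit positions in EXCEPTION_BITS, the LinkExceptionRegister class, and the handle_interrupt function are hypothetical names invented for the example, since the real register layout is not given here.

```python
from collections import Counter

# Hypothetical bit assignments; the real SAIL ASIC register layout is not given here.
EXCEPTION_BITS = {
    "CRC": 0x01,
    "Command": 0x02,
    "Keepalive": 0x04,
    "Clock Sync FIFO": 0x08,
    "Elastic FIFO": 0x10,
    "Second": 0x20,
}


class LinkExceptionRegister:
    """Stand-in for one port's link exception register (X or Y)."""

    def __init__(self):
        self.value = 0

    def read(self):
        return self.value

    def clear(self):
        self.value = 0


def handle_interrupt(registers, counts):
    """Read each port's register, count every flagged type, then clear it."""
    for port, reg in registers.items():
        flagged = reg.read()
        for name, bit in EXCEPTION_BITS.items():
            if flagged & bit:
                counts[port][name] += 1
        reg.clear()  # ready for the next link exception


registers = {"X": LinkExceptionRegister(), "Y": LinkExceptionRegister()}
counts = {"X": Counter(), "Y": Counter()}

# Simulate a keepalive exception on the X port, then service the interrupt.
registers["X"].value |= EXCEPTION_BITS["Keepalive"]
handle_interrupt(registers, counts)
```

After the handler runs, the X port's Keepalive count is 1, the Y port's counts are unchanged, and both registers are cleared.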
The hardware domain of a link exception is limited to the ServerNet SAN
ports at each end of the link and the physical link (cable) itself. This
means that when a SPA records a link exception, the associated hardware
problems are limited to either the SPA, the ServerNet SAN cable connecting
the SPA to the node at the other end of the link, or the ServerNet SAN ports
on the node at the other end of the link. The node at the other end of
the link may be a ServerNet SAN switch (router node) or, in the case of a
two-node cluster, another end node (computer) equipped with a SPA. The accurate
recording and counting of all link exceptions detected by the hardware
also depends on proper operation of the SPAD and the firmware in the ServerNet
SAN switch.
The SPAD resets link exception statistics to zero for its SPA on every
boot of the node containing the SPA. In addition, the Link Exceptions view
can be used by those with root permission to reset the entire set of link
exception statistics for a given port (X or Y) on a SPA or for a given
fabric (X or Y) in a cluster.
NOTE: There are several types of link exceptions, as explained under
Link Exception Types. If the Second link exception count is nonzero, the
exception counts for the associated X or Y port are approximate. See
Second (link exception type) for details.
Link Exception Types
The following link exception types are tracked for the X and Y ports, on
both a per-SPA and a cluster basis, in the Link Exceptions view:
-
CRC - The CRC link exception occurs when a ServerNet SAN packet
cyclic redundancy check (CRC) fails to validate and the packet ends with
a "This Packet Good" (TPG) symbol, indicating that a failure has just occurred
over the receiving link. The packet corruption may be caused by the ServerNet
SAN port at the other end of the link or the physical link (cable) itself.
-
Command - The command link exception occurs when an invalid ServerNet
SAN command symbol is received.
-
Keepalive - The keepalive link exception occurs when a keepalive
ServerNet SAN command symbol is not received within the allotted time limit.
The cable providing the link may be damaged, loose, or missing or the node
(or ServerNet SAN switch) at the other end of the link may have a problem.
-
Clock Sync FIFO - The clock sync FIFO link exception occurs when
the FIFO used to compensate for the difference in clock rates between two
ends of a link encounters an overflow condition. This can happen if the
clock rate difference is too great or if the transmitting end of the link
is not sending the appropriate number of SKIP command symbols (SKIP symbols
are used to ensure overflow does not occur). A defective clock oscillator
is rarely the cause of this link exception, because a failing oscillator is
far more likely to stop working entirely than to run extremely fast.
-
Elastic FIFO - The elastic FIFO link exception occurs when the SAIL
ASIC's elastic FIFO (EFIFO) overflows. The EFIFO is used for flow control on
the incoming packet stream. Associated with the EFIFO are high and low
thresholds. If the EFIFO fills to the level of the high threshold, the SAIL
ASIC sends a BUSY symbol so the transmitter at the other end of the link stops
sending packets, thus preventing an overflow of the EFIFO. The contents of
the EFIFO are then processed and cleared out. When the level of the EFIFO
drops to the low threshold, the SAIL ASIC sends a READY symbol, which signals
the other end of the link to resume packet transmission. EFIFO overflow
may occur if the device at the other end of the link ignored the BUSY symbol,
if the BUSY symbol was garbled, or if the device at the other end of the
link received a spurious READY symbol. All of these scenarios are extremely
rare.
-
Second - The SAIL ASIC captures the type of
the first link exception that occurred in its X port or Y port link exception
register. If a second link exception is detected before the SPAD has a
chance to count and clear the type bit, the second link exception bit is
set. The "second" bit provides no insight into the type of the second link
exception. A third link exception before the SPAD has time to process the
first is not captured or counted at all. Therefore, a nonzero Second
link exception count means the SPAD's link exception statistics are not an
accurate count of the actual exceptions, either of a given type or in
total. In this case, treat the presence and relative volumes of particular
types of link exceptions as clues to the nature of a problem rather than as
exact figures.
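The effect of the Second bit on the counts can be illustrated with a small model. This is a sketch under stated assumptions, not a description of the actual hardware or driver: the PortRegister class and its record and drain methods are hypothetical, and the model assumes only what is stated above (the first exception's type is latched, a later one sets only the Second bit, and anything after that leaves no trace).

```python
from collections import Counter


class PortRegister:
    """Hypothetical model of one port's link exception register."""

    def __init__(self):
        self.first_type = None  # type latched for the first exception
        self.second = False     # set when another exception arrives before clearing

    def record(self, exc_type):
        if self.first_type is None:
            self.first_type = exc_type
        else:
            # The type of this exception is lost; a third one adds nothing new.
            self.second = True

    def drain(self, counts):
        """What the driver does on an interrupt: count flagged bits, then clear."""
        if self.first_type is not None:
            counts[self.first_type] += 1
        if self.second:
            counts["Second"] += 1
        self.first_type = None
        self.second = False


counts = Counter()
reg = PortRegister()

# Three exceptions arrive before the driver services the interrupt.
for exc in ["CRC", "Keepalive", "CRC"]:
    reg.record(exc)
reg.drain(counts)

actual = 3
recorded_by_type = counts["CRC"] + counts["Keepalive"]
```

In this run three exceptions occurred, but only one CRC exception and one Second exception are counted. The Keepalive exception and the later CRC exception are not attributed to any type, which is why a nonzero Second count marks the other counts as approximate.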
Viewing Statistics
The link exception statistics appear in two panes:
-
For SPA N (where N is the SPA number), which shows all link exception
type counts for the SAIL ASIC X port and Y port in separate columns.
-
For Cluster, which shows a total of all link exception type counts
for the X and Y fabrics in the cluster.
Associated with the For SPA N pane is a SPA choice box used to select
the SPA containing the statistics to be retrieved.
To view the link exception statistics for a particular SPA, select the
SPA number with the SPA choice box; the statistics for the selected SPA
are displayed. The total link exception statistics for the cluster on a
fabric basis are continuously displayed in the bottom pane.
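The relationship between the two panes can be sketched as a simple sum. The example assumes that each fabric total in the For Cluster pane is the sum of the corresponding port counts across all SPAs; the spa_counts data and the cluster_totals function are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical per-SPA counts, keyed by SPA number and then by port.
spa_counts = {
    0: {"X": Counter({"CRC": 2, "Keepalive": 1}), "Y": Counter()},
    1: {"X": Counter({"CRC": 1}), "Y": Counter({"Command": 4})},
}


def cluster_totals(spa_counts):
    """Sum each exception type over all SPAs, separately for the X and Y fabrics."""
    totals = {"X": Counter(), "Y": Counter()}
    for ports in spa_counts.values():
        for fabric in ("X", "Y"):
            totals[fabric].update(ports[fabric])  # Counter.update adds counts
    return totals


totals = cluster_totals(spa_counts)
```

With this sample data, the X fabric total shows three CRC exceptions and one Keepalive exception, and the Y fabric total shows four Command exceptions.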
Resetting Statistics
Each of the two panes has a Reset X and a Reset Y button for
users with root permission.
-
For SPA N pane - To reset all types of link exception statistics
for the SAIL ASIC X port, first view the statistics for the desired SPA,
then click on Reset X. Click on Reset Y to reset all link
exception statistics for the Y port.
-
For Cluster pane - To reset all types of link exception statistics
for all SAIL ASIC X ports in the cluster, click on Reset X.
Click on Reset Y to reset all link exception statistics for all
Y ports in the cluster.
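The difference in scope between the two sets of reset buttons can be sketched as follows. The stats structure and the reset_spa_port and reset_cluster_fabric functions are hypothetical names for illustration only, and the root permission check is not modeled.

```python
from collections import Counter

# Hypothetical store of counts: SPA number -> port -> exception type counts.
stats = {
    0: {"X": Counter({"CRC": 2}), "Y": Counter({"Keepalive": 1})},
    1: {"X": Counter({"Command": 3}), "Y": Counter({"CRC": 5})},
}


def reset_spa_port(stats, spa, port):
    """Reset X or Reset Y in the For SPA N pane: clears one port on one SPA."""
    stats[spa][port].clear()


def reset_cluster_fabric(stats, fabric):
    """Reset X or Reset Y in the For Cluster pane: clears that port on every SPA."""
    for spa in stats:
        reset_spa_port(stats, spa, fabric)


reset_spa_port(stats, 0, "Y")  # only SPA 0's Y port is cleared
y_after_spa_reset = {spa: sum(p["Y"].values()) for spa, p in stats.items()}

reset_cluster_fabric(stats, "X")  # every SPA's X port is cleared
x_after_cluster_reset = {spa: sum(p["X"].values()) for spa, p in stats.items()}
```

After the per-SPA reset, SPA 1 still holds its Y port counts. After the cluster reset, the X port counts are zero on every SPA.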