Storage Automated Diagnostic Environment 2.X

Monitoring Coverage

 

This document describes the monitoring coverage of the Storage Automated Diagnostic Environment software. It describes each monitored component, each attribute used to generate events, and the complete list of events generated for these devices. Use it together with the Event Advisor (also called the Event Grid) to get a better understanding of the health coverage.

Each device section includes an introduction explaining the monitoring technique used in the application and a detailed description of the monitoring coverage. Agents probe devices, and 'health modules' compare the probed information and generate events. An agent can probe a switch using SNMP, but it can also read entries in a Sun T3 logfile. Since the application monitors all aspects of the SAN, agents also monitor the health of HBAs and hosts. The availability map file is included with each device section. These maps are sets of rules used to convert the different FRU or component values to an availability value and an alert severity. All availability map files are included in the System/Map directory of the package. For a complete list of all events, along with the instrumentation report data used to generate them, see Appendix B: Instrumentation Report Mapping (separate document).

In general, agents extract much more information from a device than is needed to monitor its health. The extra information is available for review in the GUI. For example, configuration information extracted from a Sun T3 is only used in the exception report (Report -> General Reports -> Exception Report).

Each agent comes with a standard set of events used to create and audit a device. These are the 'Discovery', 'Audit' and 'Statistics' events; they are not covered on a device-by-device basis since they are common to all devices. See Appendix A for more information about the content of these events. In general, a Discovery or Audit event includes all information extracted from a device, including information that may not relate to health. The agents also generate CommunicationLost and CommunicationEstablished events against all devices. These events can occur both inband and out-of-band. Fibre Channel devices can also generate a linkEvent when the Fibre Channel counters available at each port of the SAN exceed predefined thresholds. These thresholds are stored in the SW_Thresholds file (Appendix C) and apply to the CRC, InvalidTransmitWords and SignalLoss counters. All events mentioned in this document are included in the Event Advisor.

The Message and T3Message agents are used to monitor logfiles. The T3Message module can report logEvents against any Sun T3 or Sun 6120 device. The Message agent monitors the /var/adm/messages file and reports logEvents against many different devices that are seen inband by the HBA drivers.

The following devices and modules will be covered in this document:

Sun A3500FC, Sun A5K, Sun A1000, Sun D2, Sun Data Service Processor (DSP), Sun T3, Sun 6120, Tape, Sun V880Disk, Sun Virtualization (Vicom), Sun Switch, Sun Switch2, Brocade Switch, McData Switch, InRange Switch, Sun 3310, Sun 3510, Sun 9900.

 

Sun A3500FC/A1000 Monitoring

 

The A3500FC agent uses RM6 commands to monitor the A3500FC storage device inband. These commands are healthck, lad, rdacutil, raidutil and drivutil. See appendix A for an example of all the components and attributes extracted by the A3500FC agent. The A1000 agent inherits from the A3500FC agent and includes the same coverage.

The A3500FC health module monitors the following information:

The host /var/adm/message file is also monitored for the following entries:

StateChange and Alarm events use a map file (located in System/Map) to map the different values of the status attributes to the availability of the component. Here is the A3500FC map file. It maps the controller state, disk status and battery status. For the controller, a value of 'Active' is considered available and a value of 'Failed' unavailable. When the status of a component goes to unavailable, an error is generated.

##################
a3500fc.map
##################
[availability]
controller.state.Active  = 1
controller.state.Failed  = 0
controller.state.Passive = 0
disk.status.Optimal      = 1
battery.status.OK        = 1
battery.status.Failed    = 0
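
The lookup described above can be sketched as follows. This is a minimal illustration, not the application's own parser: the function names parse_availability and is_available are hypothetical, the value 'Degraded' is an invented example of a status with no rule, and returning None for such a value is an assumption.

```python
# Hypothetical reader for the simple "key = 0|1" rules shown in a3500fc.map.
A3500FC_MAP = """\
[availability]
controller.state.Active  = 1
controller.state.Failed  = 0
controller.state.Passive = 0
disk.status.Optimal      = 1
battery.status.OK        = 1
battery.status.Failed    = 0
"""


def parse_availability(text):
    """Return {rule key: bool} for lines of the form 'key = 0' or 'key = 1'."""
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and the [availability] section header.
        if not line or line.startswith("#") or line.startswith("["):
            continue
        key, _, value = line.partition("=")
        rules[key.strip()] = value.strip() == "1"
    return rules


def is_available(rules, component, attribute, value):
    """Look up one probed value; None means the map has no rule for it (assumption)."""
    return rules.get(f"{component}.{attribute}.{value}")


rules = parse_availability(A3500FC_MAP)
print(is_available(rules, "controller", "state", "Active"))   # True
print(is_available(rules, "controller", "state", "Passive"))  # False
print(is_available(rules, "disk", "status", "Degraded"))      # None
```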

 

Sun A5K Monitoring

 

This device is monitored using the luxadm CLI command. The display, dump_map and rdls options are used with luxadm. See Appendix A for more details about the information saved for this device.

The A5K health module monitors the following information:

- A change in the values of the fibre channel counters (luxadm -e rdls) may generate a LinkEvent if the threshold is exceeded. LinkEvents include information about both ports involved in the fibre channel link.

The map file used to calculate availability and event severity for the A5K is as follows. It maps the status of these components: disk, power, interfaceboard, gbic, fan, backplane and mpxio. A rule starting with 'START =>' applies the very first time the component is probed, when there is no previous value to compare with yet. The operator '=>' expresses a specific transition from one status value to another. Entering a specific transition in the map file allows more control over the severity of the alert generated.

##################
a5k.map
##################
[availability]
disk.status.OK-On = 1
disk.status.OK-NotInstalled = 0
START => disk.status.OK-NotInstalled = I
status.OK = 1
power.status.OK = 1
power.status.Not Installed = 0 = W
power.status.Failed = 0
interface_board.status.OK = 1
interface_board.status.Not Installed = 0 = W
interface_board.status.Failed = 0
gbic.status.O.K. = 1
gbic.status.Failed = 0
gbic.status.Not Installed = 0 = W
START => gbic.status.Not Installed = I
fan.status.OK = 1
fan.status.NotInstalled = 0 = W
fan.status.non_critical_failure = 0 = W
START => fan.status.NotInstalled = I
backplane.status.OK = 1
backplane.status.Failed = 0
mpx.state.ONLINE = 1
mpx.state.OFFLINE = 0
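
The three rule forms (plain status, 'START =>' and a specific transition) could be interpreted along these lines. This is a sketch under stated assumptions: parse_rule and severity_for are hypothetical names, and the precedence shown (a matching START or transition rule overrides the severity on the plain status rule) is inferred from the description above rather than confirmed. The gbic lines come from a5k.map; the fruPowerBatState lines come from the T3 map later in this document.

```python
def parse_rule(line):
    """Classify one map line as a 'state', 'start' or 'transition' rule."""
    if "=>" in line:
        left, right = line.split("=>", 1)
        target, _, severity = right.rpartition("=")
        left = left.strip()
        return {
            "kind": "start" if left == "START" else "transition",
            "from": None if left == "START" else left,
            "to": target.strip(),
            "severity": severity.strip(),
        }
    # Plain rule: key = availability [= severity]
    parts = [p.strip() for p in line.split("=")]
    return {
        "kind": "state",
        "key": parts[0],
        "available": parts[1] == "1",
        "severity": parts[2] if len(parts) > 2 else None,
    }


def severity_for(rules, previous, current):
    """Pick a severity for a probe result; previous is None on the first probe."""
    # Assumed precedence: START and transition rules win over plain rules.
    for r in rules:
        if r["kind"] == "start" and previous is None and r["to"] == current:
            return r["severity"]
        if r["kind"] == "transition" and r["from"] == previous and r["to"] == current:
            return r["severity"]
    for r in rules:
        if r["kind"] == "state" and r["key"] == current:
            return r["severity"]
    return None


LINES = [
    "gbic.status.O.K. = 1",
    "gbic.status.Not Installed = 0 = W",
    "START => gbic.status.Not Installed = I",
    "fruPowerBatState.refreshing => fruPowerBatState.fault = W",
    "START => fruPowerBatState.fault = E",
    "fruPowerBatState.fault = 0 = I",
]
rules = [parse_rule(line) for line in LINES]

# First probe finds no GBIC installed: informational only.
print(severity_for(rules, None, "gbic.status.Not Installed"))                # I
# A GBIC that was O.K. and is now missing: warning.
print(severity_for(rules, "gbic.status.O.K.", "gbic.status.Not Installed"))  # W
```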

 

 

Sun D2 Monitoring

 

The Sun D2 agent uses the luxadm inquiry command along with the following CLI commands: disk_inquiry, rdbuf, identify and vpd. These commands are included with the package and monitor this device inband. See Appendix A for the complete instrumentation report.

The D2 health module monitors the following:

 Here is the D2 availability Map file:

##################
d2.map
##################
[availability]
status.drive_inserted =1
status.drive_inserted_after_power_up =1
status.drive_removed_after_power_up =0
status.no_drive_insterted =0 
power.status.operational_and_on =1
fan.status.operational =1
alarmEvent.revision = 0 = W
alarmEvent.slot_count = 0 = W
alarmEvent.temperature = 0 = W

 

Sun T3/6120 Monitoring

The Sun T3 and Sun 6120 agents use HTTP tokens to monitor the device and can also monitor the T3 logfile if this logfile is forwarded to the host. This monitoring is done primarily out-of-band. When the Message agent has access to inband information about the T3 from /var/adm/messages, this information is used to generate inband events about this same T3. Sun T3 monitoring is very detailed; see Appendix A for information about all the tokens extracted from the enclosure.

The T3 health module monitors the following:

- stateChange events when the status of these frus change.

- ComponentInsert and ComponentRemove events when the serial number changes.

- A change in the revision of these frus will generate an alarmEvent.revision.

- For the power supply, the status of the batteries, fans, powerOutput and powerTemp are also monitored.

- For the Loop, both the loopStatus and LoopIsolated state are monitored and an alarm event can be generated.

- For the LoopCards in partner pairs, the cable (interface.loopcard.cable) is also monitored.

- For disks, a change in the Port1State or Port2State will generate an alarmEvent. Alarms are also generated when the disk temperature goes over 55 degrees.
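
The serial-number comparison behind ComponentInsert and ComponentRemove can be pictured with a small sketch. It assumes that each probe yields a mapping from FRU identifier to serial number, and that a changed serial is reported as a removal of the old part followed by an insertion of the new one. The function name diff_serials, the FRU identifiers and the serial values are all hypothetical.

```python
def diff_serials(previous, current):
    """Compare two {fru_id: serial} snapshots and list the implied events."""
    events = []
    for fru in sorted(set(previous) | set(current)):
        old, new = previous.get(fru), current.get(fru)
        if old == new:
            continue  # same part still in place
        if old is not None:
            events.append(("ComponentRemove", fru, old))
        if new is not None:
            events.append(("ComponentInsert", fru, new))
    return events


# Invented snapshots: disk u1d2 was swapped between two probes.
before = {"u1d1": "SN-A", "u1d2": "SN-B"}
after = {"u1d1": "SN-A", "u1d2": "SN-C"}
for event in diff_serials(before, after):
    print(event)
```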

When the Sun T3 logfile is available, it is monitored for warnings and errors to generate logEvents. See the t3_policies file (Appendix D) for more details on T3 logfile monitoring.

 The T3 availability Map file used to evaluate the availability and severity of the different fru status and states is very detailed:

##################
t3.map
##################
[availability]
fruStatus.ready-enabled = 1
fruStatus.ready-disabled = 0
fruStatus.ready-substituted = 1
fruStatus.booting-enabled = 1
fruStatus.booting-disabled = 0
fruStatus.booting-substituted = 1
fruStatus.missing-enabled = 0
fruStatus.missing-disabled = 0
fruStatus.missing-substituted = 0
fruStatus.fault-enabled = 0
fruStatus.fault-disabled = 0
fruStatus.fault-substituted = 0
fruStatus.fault-disabled => fruStatus.fault-substituted = W
fruStatus.notInstalled-enabled = 0 = W
fruStatus.notInstalled-disabled = 0 = W
fruStatus.notInstalled-substituted = 0 = W
fruStatus.offline-enabled = 0
fruStatus.offline-disabled = 0
fruStatus.offline-substituted = 0
volStatus.mounted =1
fruPowerBatState.normal = 1
fruPowerFan1State.normal = 1
fruPowerFan2State.normal = 1
fruPowerPowOutput.normal = 1
fruPowerPowTemp.normal = 1
fruPowerBatState.refreshing => fruPowerBatState.fault = W
fruPowerBatState.fault => fruPowerBatState.unknown = W
START => fruPowerBatState.fault = E
START => fruPowerBatState.[Undefined] = W
fruPowerBatState.refreshing = 1
fruPowerFan1State.refreshing = 1
fruPowerFan2State.refreshing = 1
fruPowerPowOutput.refreshing = 1
fruPowerPowTemp.refreshing = 1
fruPowerBatState.fault = 0 = I
fruPowerBatState.unknown = 0
fruPowerFan1State.fault = 0
fruPowerFan2State.fault = 0
fruPowerPowOutput.fault = 0
fruPowerPowTemp.fault = 0
fruPowerBatState.off = 0
fruPowerFan1State.off = 0
fruPowerFan2State.off = 0
fruPowerPowOutput.off = 0
fruPowerPowTemp.off = 0
fruDiskPort2State.ready = 1
fruDiskPort2State.notReady = 0
fruDiskPort2State.bypass = 0 = W
fruDiskPort1State.ready = 1
fruDiskPort1State.notReady = 0
fruDiskPort1State.bypass = 0 = W
fruLoopCable1State.installed = 1
fruLoopCable2State.installed = 1
fruLoopCable1State.notInstalled = 0 
fruLoopCable2State.notInstalled = 0
START => fruLoopCable1State.notInstalled = W
START => fruLoopCable2State.notInstalled = W
pathstat.INVALID = 0
pathstat.OK = 1
loopStatus.available = 1
volCacheMode.writeBehind = 1
volCacheMode.writeThrough = 0 = W
volCacheMode.disabled = 0 = W
START => volCacheMode.writeThrough = I
volCacheMirror.on = 1
volCacheMirror.off = 0 = W
START => volCacheMirror.off = I
START => volCacheMode.[Undefined] = W
START => volCacheMirror.[Undefined] = W
volOper.OK = 1
volOper.reconstructing = 1 = W
volOper.reconstructingToStandby = 1 = W
volOper.copyingFromStandby = 1 = W
volOper.copyingToStandby = 1 = E
volOper.initializing = 1 = W
volOper.verifying = 1 = I
alarmEvent.time_diff = 0 = I
alarmEvent.system_reboot = 0 = W
alarmEvent.volOwner = 0 = WY
alarmEvent.cacheMode = 0 = W
alarmEvent.sysvolslice = 0 = W
alarmEvent.volCount = 0 = W
alarmEvent.lunPermission = 0 = W
alarmEvent.initiators = 0 = W

 

FC-Tape Monitoring

FC tapes are monitored using the luxadm display command. The health module for this device reports changes in the port status. Fibre Channel counters are also monitored using luxadm -e rdls; see Sun A5K for details.

This is the availability Map file for FC-Tape:

##################
tape.map
##################
[availability]
status.Ready =1
status.Not Ready =1
status.O.K. =1

 

Sun V880 Disk Monitoring

 

This device is monitored using the luxadm command. The health module monitors the following information:

 This is the availability Map file for the Sun V880Disk:

##################
v880disk.map
##################
[availability]
disk.status.OK-On = 1
disk.status.OK-NotInstalled = 0
status.OK = 1
SSC.status.OK = 1
SSC.status.NotInstalled = 0 = W
temperature.status.OK = 1
temperature.status.NotInstalled = 1

 

Sun Virtualization Engine (Vicom) Monitoring

The VE available with the Sun 6900 Storage Solution is monitored using CLI commands on the service processor. These commands are showmap, slicview, mpdrive and svstat. The health module monitors the following:

Thresholds are used when monitoring statistical counters for this device. For more details about these thresholds, see the SW_Thresholds file in Appendix C.

 

Sun 1 Gig Switch Monitoring

 

The Sun Switch agent uses the sanbox CLI command to monitor 1 Gig Qlogic switches. The commands used are sanbox version, chassis_status, port_counts, port_status, chassis_counters, get_zone, chassis_id and links. The following attributes are monitored by the health module:

Thresholds are used when monitoring statistical counters for this device. For more details about these thresholds, see the SW_Thresholds file in Appendix C.

The Sun 1Gig Switch availability map file includes the following rules:

##################
switch.map
##################
[availability]
mode.Online =1
mode.Not-logged-in =0
mode.Offline =0
mode.AdminOffline =0
operational.on =1
fan.1.status.OK =1
fan.2.status.OK =1
power.status.OK =1
temp.status.OK =1
alarmEvent.system_reboot = 0 = W
alarmEvent.zone_change = 0 = W

 

Sun 2 Gig Switch Monitoring

This agent uses SNMP to monitor Qlogic 2 Gig switches and upgraded 1 Gig switches. See the sample instrumentation report in Appendix A for details about the information extracted using SNMP.

The health module for this device monitors the following information. This health module is also used for InRange switches.

Thresholds are used when monitoring statistical counters for this device. For more details about these thresholds, see the SW_Thresholds file in Appendix C.

The Sun 2Gig Switch availability map file includes the following rules:

##################
switch2.map
##################
[availability]
state.online = 1
state.offline = 0
sensor.board.status.ok = 1
sensor.fan.status.ok = 1
sensor.power-supply.status.ok = 1
sensor.power-supply.status.failed = 0
alarmEvent.system_reboot = 0 = W

 

 

Brocade Switch Monitoring

This agent uses SNMP to monitor Brocade switches. The health module for this device monitors the following information:

Thresholds are used when monitoring statistical counters for this device. For more details about these thresholds, see the SW_Thresholds file in Appendix C.

The Brocade Switch availability map file includes the following rules:

##################
brocade.map
##################
[availability]
state.OpStatus.online =1
state.OpStatus.offline =0
state.OpStatus.faulty =0
sensor.temperature.status.nominal =1
sensor.temperature.status.absent =1
sensor.temperature.status.faulty =0
sensor.power-supply.status.nominal =1
sensor.power-supply.status.absent =0
sensor.power-supply.status.faulty =0
sensor.fan.status.nominal =1
sensor.fan.status.absent =1
sensor.fan.status.faulty =0
alarm.system_reboot = 0 = W

 

McData Switch Monitoring

This agent uses SNMP to monitor McData switches. The health module for this device monitors the following information:

Thresholds are used when monitoring statistical counters for this device. For more details about these thresholds, see the SW_Thresholds file in Appendix C.

The McData Switch availability map file includes the following rules:

##################
mcdata.map
##################
[availability]
state.online =1
state.offline =0
sensor.powerSupply.status.ok =1
sensor.powerSupply.status.absent =0
sensor.powerSupply.status.failed =0
sensor.fan.status.ok =1
sensor.fan.status.absent =1
sensor.fan.status.failed =0
alarmEvent.system_reboot = 0 = W

 

InRange Switch Monitoring

This agent uses SNMP to monitor InRange switches. The health module inherits from the Sun 2Gig Switch module.

The InRange switch uses the same availability map as the Sun 2Gig Switch.

Sun 3310/3510 Monitoring

The Sun 3310 and 3510 storage devices can be monitored both inband and out-of-band using the same CLI command. This command accepts a device path or an IP address. The application supports the discovery and monitoring of these devices either way. The commands used are sccli diag errors, sccli inquiry, sccli show events, sccli show port-wwns, sccli show lun-maps and sccli show configuration. The output of most of these commands is used by the 3310/3510 health module. The sccli show events command is used to extract events that are generated directly by the storage devices and report them as logEvents. This is done by the Message health module, not the 3310/3510 health module.

The 3310/3510 health module monitors the following information:

Thresholds are used when monitoring statistical counters for this device. For more details about these thresholds, see the SW_Thresholds file in Appendix C.

Here is the Sun 3310/3510 availability Map information:

##################
3310.map
##################
[availability]
enclosure.status.Online = 1
enclosure.status.Low Battery = 1
enclosure.status.Normal = 1
enclosure.status.Critical = 0
enclosure.status.Offline = 0
enclosure.status.Critical Rebuild = 0
enclosure.status.Non Existent = 0 = W
disk_status.Online = 1
disk_status.Used = 1
disk_status.Bad drive = 0
disk_status.Offline = 0
lun_status.Normal = 1
lun_status.Degraded = 0
enclosure.ethernet.ok = 1
enclosure.ethernet.ping_failed = 0

 

Sun 9900 (hitachi) Monitoring

This agent uses SNMP to monitor Sun 9900 storage enclosures. It does not monitor at the FRU level but only at the subsystem level. The subsystems available are the controller and the disk subsystem. The health module for this device monitors the following information:

This is the availability Map file for the Sun 9900 storage device:

##################
9900.map
##################
# noError (1),
# acute (2),
# serious (3),
# moderate(4),
# service (5)
[availability]
controller.Cache.noError = 1
controller.Cache.acute = 0
controller.Cache.serious = 0
controller.Cache.moderate = 0 = W
controller.Cache.service = 0 = W
controller.Battery.noError = 1
controller.Battery.acute = 0
controller.Battery.serious = 0
controller.Battery.moderate = 0 = W
controller.Battery.service = 0 = W
# Internal Bus
controller.CSW.noError = 1
controller.CSW.acute = 0
controller.CSW.serious = 0
controller.CSW.moderate = 0 = W
controller.CSW.service = 0 = W
controller.Environment.noError = 1
controller.Environment.acute = 0
controller.Environment.serious = 0
controller.Environment.moderate = 0 = W
controller.Environment.service = 0 = W
controller.Fan.noError = 1
controller.Fan.acute = 0
controller.Fan.serious = 0
controller.Fan.moderate = 0 = W
controller.Fan.service = 0 = W
controller.Processor.noError = 1
controller.Processor.acute = 0
controller.Processor.serious = 0
controller.Processor.moderate = 0 = W
controller.Processor.service = 0 = W
# POWER SUPPLY
controller.PS.noError = 1
controller.PS.acute = 0
controller.PS.serious = 0
controller.PS.moderate = 0 = W
controller.PS.service = 0 = W
# SHARED MEMORY
controller.SM.noError = 1
controller.SM.acute = 0
controller.SM.serious = 0
controller.SM.moderate = 0 = W
controller.SM.service = 0 = W
# DISK
disk.Drive.noError = 1
disk.Drive.acute = 0
disk.Drive.serious = 0
disk.Drive.moderate = 0 = W
disk.Drive.service = 0 = W
disk.Environment.noError = 1
disk.Environment.acute = 0
disk.Environment.serious = 0
disk.Environment.moderate = 0 = W
disk.Environment.service = 0 = W
disk.Fan.noError = 1
disk.Fan.acute = 0
disk.Fan.serious = 0
disk.Fan.moderate = 0 = W
disk.Fan.service = 0 = W
# POWER SUPPLY
disk.PS.noError = 1
disk.PS.acute = 0
disk.PS.serious = 0
disk.PS.moderate = 0 = W
disk.PS.service = 0 = W

 

Sun Data Service Processor (DSP) Monitoring

The DSP agent uses XML over an HTTP connection to monitor this device out-of-band. The information includes chassis, disks, ports, volumes and system information. The health module for this device monitors the following information:

Note: More details will be forthcoming.

[dsp.map]
portState.Online	 =1
portState.Offline	 =0
moduleState.Ready	 =1
powerState.On       =1
powerState.Off	 =1
      Valid values are:
        LOG_ALERT   - Needs immediate attention; imminent chassis failure
        LOG_CRIT    - Needs attention; potential chassis failure if
                                  left unchecked
        LOG_ERR     - Needs attention; imminent loss of service
        LOG_WARNING - Warning; potential loss of service 
        LOG_INFO    - Informational
              

 

Appendix A: Instrumentation Reports

(included in a separate document).

 

 

Appendix B: Instrumentation Report Mapping

(included in a separate document). 

 

 

Appendix C: Thresholds File (SW_Thresholds)

 

# FORMAT
#   Code       Cnt, Period, QuietPeriod, Desc
#   Period, QuietPeriod: hours/ minutes

# Thresholds for /var/adm/message driver

driver.SF_OFFLINE               = 10,24h,1h,W,socal/ifp Offline
driver.SF_OFFLALERT             = 15,24h,1h,E,socal/ifp Offline
driver.SCSI_TRAN_FAILED         = 10,4h,1h,W, SCSI transport failed
driver.SCSI_ASC                 = 10,4h,1h,W,scsi
driver.SCSI_TR_READ             = 10,4h,1h,W,scsi READ
driver.SCSI_TR_WRITE            = 10,4h,1h,W,scsi WRITE

 
driver.SSD_WARN                 =  5,24h,1h,W,SSD Warning
driver.SSD_ALERT                = 20,24h,1h,E,SSD Alert
driver.PFA                      =  1,24h,1h,E,Predictive Failure

driver.SF_CRC_WARN              = 10,24h,1h,W,CRC Warning
driver.SF_CRC_ALERT             = 15,24h,1h,E,CRC Alert

driver.SFOFFTOWARN              =  5,24h,1h,W,Offline Timeouts
driver.SF_DMA_WARN              =  1,24h,1h,W,SF DMA Warning
driver.SF_RESET                 = 10,24h,1h,W,SF Reset
driver.ELS_RETRY                = 10,24h,1h,W,ESL retries
driver.SF_RETRY                 = 10,24h,1h,W,SF Retries
driver.TOELS                    = 10,24h,1h,W,ELS Timeouts
driver.SFTOELS                  = 10,24h,1h,W,SFTOELS Timeouts
driver.DDOFFL                   = 10,24h,1h,W,Offlines
driver.LOOP_OFFLINE             =  1,5m,1h,E, Loop Offline
driver.LOOP_ONLINE              =  1,5m,1h,N, Loop Online
driver.QLC_LOOP_OFFLINE         =  1,5m,1h,E, Loop Offline
driver.QLC_LOOP_ONLINE          =  1,5m,1h,N, Loop Online
driver.LINK_DOWN                =  1,5m,1h,E, JNI Loop down
driver.LINK_UP                  =  1,5m,1h,N, JNI Loop up

 
# Thresholds with VM present

driver.VM_SF_OFFLINE               = 10,24h,1h,W,socal/ifp Offline
driver.VM_SF_OFFLALERT             = 15,24h,1h,E,socal/ifp Offline
driver.VM_SCSI_TRAN_FAILED         = 10,4h,1h,W, SCSI transport failed
driver.VM_SCSI_ASC                 = 10,4h,1h,W,scsi
driver.VM_SCSI_TR_READ             = 10,4h,1h,W,scsi READ
driver.VM_SCSI_TR_WRITE            = 10,4h,1h,W,scsi WRITE

 
driver.VM_SSD_WARN                 = 100,24h,1h,W,SSD Warning
driver.VM_SSD_ALERT                =  1,24h,1h,E,SSD Alert
driver.VM_PFA                      =  1,24h,1h,E,Predictive Failure

driver.VM_SF_CRC_WARN              = 100,24h,1h,W,CRC Warning
driver.VM_SF_CRC_ALERT             =  1,24h,1h,E,CRC Alert

driver.VM_SFOFFTOWARN              =  5,24h,1h,W,Offline Timeouts
driver.VM_SF_DMA_WARN              =  1,24h,1h,W,SF DMA Warning
driver.VM_SF_RESET                 =  1,24h,1h,W,SF Reset
driver.VM_ELS_RETRY                =  1,24h,1h,W,ESL retries
driver.VM_SF_RETRY                 =  1,24h,1h,W,SF Retries
driver.VM_TOELS                    =  1,24h,1h,W,ELS Timeouts
driver.VM_SFTOELS                  =  1,24h,1h,W,SFTOELS Timeouts
driver.VM_DDOFFL                   =  1,24h,1h,W,Offlines
driver.VM_LOOP_OFFLINE             =  2,5m,1h,E, Loop Offline
driver.VM_LOOP_ONLINE              =  2,5m,1h,N, Loop Online

# A3500

a3500.CTRL_FIRM  = 1,24h,24h,W,Controller firmware version error

# Thresholds for the Vicom
vicom.crc                         = 200,50m,10m,E, CRC
vicom.itw                         = 200,50m,10m,E, Invalid Transmit words
vicom.link                        = 200,50m,10m,E, Link fails
vicom.proto                       = 200,50m,10m,E, 
vicom.signal                      = 200,50m,10m,E, Signal losses
vicom.sync                        = 200,50m,10m,E, Sync losses

# Thresholds for the Switch

switch.LinkFails                  = 200,50m,10m,E,
switch.Total_LIP_Rcvd             = 200,50m,10m,E,
switch.InvalidTxWds               = 200,50m,10m,E,
switch.SyncLosses                 = 200,50m,10m,E,
switch.CRC_Errs                   = 200,50m,10m,E,
switch.Prim_Seq_Errs              = 200,50m,10m,E,
switch.AL_Init_Errs               = 200,50m,10m,E,
switch.AddressIdErrs              = 200,50m,10m,E,
switch.short_frame_err_cnt        = 200,50m,10m,E,
switch.long_frame_err_cnt         = 200,50m,10m,E,
switch.loss_of_signal_cnt         = 200,50m,10m,E,
switch.sync_loss                  = 200,50m,10m,E,
switch.Discards                   = 200,50m,10m,E,
switch.AL_Inits                   = 200,50m,10m,E,
switch.LIF_flow_cntrl_err_cnt     = 200,50m,10m,E,
switch.lof_timeout_els            = 200,50m,10m,E,
switch.lof_timeout                = 200,50m,10m,E,

 
# Thresholds for the Switch2

switch2.LossofSynchronization           = 200,50m,10m,E,
switch2.LinkFailures                    = 200,50m,10m,E,
switch2.PrimitiveSequenceProtocolErrors = 200,50m,10m,E,
switch2.InvalidTxWords                  = 90,60m,10m,E,
switch2.InvalidCRC                      = 200,50m,10m,E,

 
brocade.LipIns                    = 100,50m,10m,E,
brocade.LipOuts                   = 100,50m,10m,E,
brocade.McastTimedOuts            = 100,50m,10m,E,
brocade.RxBadEofs                 = 100,50m,10m,E,
brocade.RxBadOs                   = 100,50m,10m,E,
brocade.RxCrcs                    = 100,50m,10m,E,
brocade.RxEncInFrs                = 100,50m,10m,E, 
brocade.RxTooLongs                = 100,50m,10m,E,

mcdata.AddressIdErrors            = 100,50m,10m,E,
mcdata.DelimiterErrors            = 100,50m,10m,E,
mcdata.InvalidCrcs                = 100,50m,10m,E,
mcdata.InvalidTxWords             = 100,50m,10m,E,
mcdata.LinkFailures               = 100,50m,10m,E,
mcdata.LinkResetIns               = 100,50m,10m,E,
mcdata.LinkResetOuts              = 100,50m,10m,E,
mcdata.PrimSeqProtoErrors         = 100,50m,10m,E,
mcdata.SigLosses                  = 100,50m,10m,E,
mcdata.SyncLosses                 = 100,50m,10m,E,

3310.lip                          = 100,50m,10m,E,
3310.link                         = 100,50m,10m,E,
3310.sync                         = 100,50m,10m,E,
3310.signal                       = 100,50m,10m,E,
3310.seq                          = 100,50m,10m,E,
3310.itw                          = 100,50m,10m,E,
3310.crc                          = 100,50m,10m,E,

 
CRCcounters.rule1                 = 10,24h,0m,E, Host <-> switch
CRCcounters.rule3                 = 10,24h,0m,E, switch <-> switch
CRCcounters.rule8                 = 10,24h,0m,E, storage <-> host/switch

ITWcounters.rule1                 = 10,1h,6m,E, Host <-> switch
ITWcounters.rule3                 = 10,1h,6m,E, switch <-> switch
ITWcounters.rule8                 = 10,1h,6m,E, storage <-> host/switch

SIGcounters.rule1                 = 10,1h,6m,E, Host <-> switch
SIGcounters.rule3                 = 10,1h,6m,E, switch <-> switch
SIGcounters.rule8                 = 10,1h,6m,E, storage <-> host/switch

 
# Health Check thresholds
#('LINK', 'SIG', 'SEQ', 'CRC', 'SYNC', 'TXW', 'INF', 'OUTF');

 
health-switch.TXW                 = 1, 1m, 0m, E, Too many InvalidTxWords
health-switch.CRC                 = 1, 2m, 0m, E, Too many CRC
health-switch.TXW                 = 1, 2m, 0m, E, Too many InvalidTxWords
health-switch.SYNC                = 1, 2m, 0m, E, Too many SYNC
health-switch.SEQ                 = 1, 2m, 0m, E, Too many SEQ

health-switch.SIG                 = 10, 2m, 0m, E, Too many InvalidTxWords
' This is text explaining what to do with this problem
' This is the second line of text. This text can be maintained in System/SW_thresholds

health-a5k.SYNC                   = 1, 2m, 0m, E, Too many Sync
health-a5k.SEQ                    = 1, 2m, 0m, E, Too many InvalidSequence
health-a5k.TXW                    = 1, 2m, 0m, E, Too many InvalidTxWords
health-a5k.CRC                    = 1, 2m, 0m, E, Too many CRC

health-t3.SYNC                    = 1, 2m, 0m, E, Too many SYNC
health-t3.SEQ                     = 1, 2m, 0m, E, Too many SEQ
health-t3.TXW                     = 1, 2m, 0m, E, Too many InvalidTxWords
health-t3.CRC                     = 1, 2m, 0m, E, Too many CRC
  

health-tape.CRC                   = 1, 2m, 0m, E, Too many CRC
health-tape.TXW                   = 1, 2m, 0m, E, Too many InvalidTxWords
health-tape.SYNC                  = 1, 2m, 0m, E, Too many SYNC
health-tape.SEQ                   = 1, 2m, 0m, E, Too many SEQ




Appendix D: Sun T3/6120 and 3310 logfile monitoring

These policy files are used to evaluate the severity and category of entries found in logfiles. A policy is used when its pattern matches the entry in the logfile. Patterns are regular expressions.


#
# This only applies to t3.LogEvent
# Policies are executed from top to bottom in the file.
#

[policy1]
pattern=/warning temperature threshold exceeded/
egrid=temp_threshold
key=temp_threshold
severity=2

[policy2]
pattern=/u(\d)ctr ISP.+LOOP DOWN/
known=1
severity=1
action=1
egrid=controller.port
key=$PORT

[policy3]
pattern=/: W: .*, Replace battery/
egrid=power.battery.replace
key=replaceBattery
severity=2

# warning that comes from a shell are a Notice and not actionable
[policy40]
pattern=/ sh\d+\[.*: W: (.*)/
severity=0
egrid=array_warning

[policy41]
pattern=/: [EW]: u\dctr XOR:/
key=controller
severity=2
egrid=controller.XOR

[policy4]
pattern=/: W: (.*)/
severity=1
egrid=array_warning
action=1

[policy5]
pattern=/: E: (.*)/
severity=2
egrid=array_error

[policy6]
pattern=/: WARNING: /
severity=2
egrid=$comp1

[policy7]
pattern=/: N: u\dpcu\d: Refreshing battery/
severity=0
key=refreshBattery
egrid=power.battery.refresh

[policy71]
pattern=/: N: u\dctr: Enabled/
severity=0
egrid=controller

[policy8]
pattern=/: N: u\dpcu\d.*Battery not OK/
severity=2
egrid=power.battery

[policy9]
pattern=/: N: u\dpcu\d.*PCU\d hold time/
severity=1
egrid=power.battery
action=1

# runs when Sense key is present in the following 1/2 lines
[policy101]
pattern=/: [WNI]: u\dd\d/
pattern2=/Sense Key = (\w+), Asc = (\w+), Ascq = (\w+)/
pattern3=/Sense Key = (\w+), Asc = (\w+), Ascq = (\w+)/
key=senseKey
extended=senseKey
egrid=disk.senseKey

[policy10]
pattern=/: [NI]: u\dd\d/
pattern2=/Sense/
pattern3=/Sense/
key=senseKey
egrid=disk.senseKey
severity=2

[policy11]
pattern=/: [NI]: u\dd\d.*disk error/
key=disk_error
egrid=disk.error
severity=2

# generic Notice or Info about a disk is considered warning/actionable
# shortkey mean one event can have notices from different disks.
[policy12]
pattern=/: N: u\dd\d/
key=log
egrid=disk.log
severity=1
shortkey=1
action=0
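
The top-to-bottom evaluation can be sketched as follows, using four of the policies above. It assumes that the first policy whose pattern matches decides the result, which is consistent with the more specific policies being listed before the generic ': W:' and ': E:' ones. load_policies and classify are hypothetical names, and the log entries are invented strings shaped to exercise the patterns, not real T3 output.

```python
import configparser
import re

POLICIES = r"""
[policy3]
pattern=/: W: .*, Replace battery/
egrid=power.battery.replace
key=replaceBattery
severity=2

[policy40]
pattern=/ sh\d+\[.*: W: (.*)/
severity=0
egrid=array_warning

[policy4]
pattern=/: W: (.*)/
severity=1
egrid=array_warning
action=1

[policy5]
pattern=/: E: (.*)/
severity=2
egrid=array_error
"""


def load_policies(text):
    """Read the sections in file order and compile each pattern."""
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.read_string(text)
    policies = []
    for name in parser.sections():
        section = parser[name]
        # Patterns are written between slashes; strip them for Python's re.
        regex = re.compile(section["pattern"].strip("/"))
        policies.append((name, regex, section))
    return policies


def classify(policies, entry):
    """Return (policy name, egrid, severity) of the first matching policy, or None."""
    for name, regex, section in policies:
        if regex.search(entry):
            return name, section["egrid"], int(section["severity"])
    return None


policies = load_policies(POLICIES)
print(classify(policies, "x: W: example, Replace battery"))
print(classify(policies, "x sh01[1]: W: example"))
print(classify(policies, "x: W: example"))
```

With the invented entries shown, the first is caught by policy3, the second by policy40 and the third falls through to the generic policy4.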



#
#######################################
#  This only applies to 3310.LogEvent
# Policies are executed from top to bottom in the file.
#

[policy1]
pattern=/ALERT: CPU/
egrid=cpu
severity=2

[policy2]
pattern=/ALERT:/
egrid=array_error
severity=2

[policy1]
pattern=/FATAL ERROR/
egrid=array_error
severity=2

[policy3]
pattern=/controller failure detected/
egrid=controller
severity=2
actionable=1
 