Storage Automated Diagnostic Environment 2.X
Monitoring Coverage
This document describes the monitoring coverage of the Storage Automated Diagnostic Environment software. It includes a description of each component being monitored, of each attribute used to generate events, and a complete list of all events generated for these devices. This document can be used in conjunction with the Event Advisor (also called the Event Grid) to get a better understanding of the health coverage.
Each device section includes an introduction explaining the monitoring technique used by the application and a detailed description of the monitoring coverage. Agents are used to probe devices, and 'health modules' are used to compare this information and generate events. An agent can probe a switch using SNMP, but can also read entries in a Sun T3 logfile. Since the application is used to monitor all aspects of the SAN, agents are also used to monitor the health of HBAs and hosts. The availability Map file is also included with each device section. These maps are a set of rules used to convert the different fru or component values to an availability value and an alert severity. All availability Map files are included in the System/Map directory of the package. For a complete list of all events, along with the instrumentation report data used to generate them, see Appendix B: Instrumentation Report Mapping (separate document).
In general, agents extract much more information from a device than is needed to monitor its health. The extra information is available for review in the GUI. For example, configuration information extracted from a Sun T3 is only used in the exception report (Report -> General Reports -> Exception Report).
Standard with each agent is a set of events used to create and audit a device. These are the Discovery, Audit and Statistics events; they will not be covered on a device-by-device basis since they are common to all devices. See Appendix A for more information about the content of these events. In general, a Discovery or Audit event includes all information extracted from a device, including information that may not relate to health. The agents also generate CommunicationLost and CommunicationEstablished events against all devices. These events can occur both inband and out-of-band. Fibre Channel devices can also generate a linkEvent when the fibre channel counters available at each port of the SAN exceed predefined thresholds. These thresholds are stored in the SW_Thresholds file (Appendix C) and apply to the CRC, InvalidTransmitWords and SignalLoss counters. All events mentioned in this document are included in the Event Advisor.
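The linkEvent threshold check described above can be sketched as follows. This is an illustrative Python sketch, not the product's code; the counter names follow Appendix C, but the function name and the threshold values shown are placeholders (the real values live in the SW_Thresholds file).

```python
# Hypothetical sketch of the linkEvent threshold check.  The counter
# names match the SW_Thresholds file (Appendix C); the limits below are
# example values only, not the shipped defaults.
THRESHOLDS = {
    # counter name: maximum allowed increase between two polls
    "CRC": 10,
    "InvalidTransmitWords": 10,
    "SignalLoss": 10,
}

def check_link_counters(previous, current):
    """Compare two polls of a port's fibre channel counters and return
    the counters whose increase exceeds the configured threshold."""
    exceeded = []
    for name, limit in THRESHOLDS.items():
        delta = current.get(name, 0) - previous.get(name, 0)
        if delta > limit:
            exceeded.append((name, delta))
    return exceeded  # a non-empty list would trigger a linkEvent
```

A non-empty result would be turned into a linkEvent describing both ports of the link, as noted above.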
The Message and T3Message agents are used to monitor logfiles. The T3Message module can report logEvents against any Sun T3 or Sun 6120 device. The Message agent monitors the /var/adm/messages file and reports logEvents against many different devices that are seen inband by the HBA drivers.
The following devices and modules will be covered in this document:
Sun A3500FC, Sun A5K, Sun A1000, Sun D2, Sun Data Service Processor (DSP), Sun T3, Sun 6120, Tape, Sun V880Disk, Sun Virtualization (Vicom), Sun Switch, Sun Switch2, Brocade Switch, McData Switch, InRange Switch, Sun 3310, Sun 3510, Sun 9900.
Sun A3500FC/A1000 Monitoring
The A3500FC agent uses RM6 commands to monitor the A3500FC storage device inband. These commands are healthck, lad, rdacutil, raidutil and drivutil. See appendix A for an example of all the components and attributes extracted by the A3500FC agent. The A1000 agent inherits from the A3500FC agent and includes the same coverage.
The A3500FC health module monitors the following information:
The host /var/adm/messages file is also monitored for the following entries:
StateChange and Alarm events use a map file (located in System/Map) to map the different values of the status attributes to the availability of the component. Here is the A3500FC map file. It maps the controller state, disk status and battery status. For the controller, a value of 'Active' is considered available and a value of 'Failed' unavailable. When the status of a component goes to unavailable, an error is generated.
##################
a3500fc.map
##################
[availability]
controller.state.Active = 1
controller.state.Failed = 0
controller.state.Passive = 0
disk.status.Optimal = 1
battery.status.OK = 1
battery.status.Failed = 0
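The map-file syntax is simple enough to parse mechanically. The following Python sketch is an assumption about how such a file could be read, not the shipped implementation; it maps 'component.attribute.Value' keys to an availability flag and an optional severity code (the optional "= W" suffix used by other map files in this document).

```python
# Illustrative parser for the availability map syntax:
#   component.attribute.Value = availability [= severity]
# The function name and data layout are assumptions for this sketch.
def parse_map(text):
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        # skip blanks, comments and section headers like [availability]
        if not line or line.startswith("#") or line.startswith("["):
            continue
        left, _, right = line.partition("=")
        parts = [p.strip() for p in right.split("=")]
        availability = int(parts[0])
        severity = parts[1] if len(parts) > 1 else None
        rules[left.strip()] = (availability, severity)
    return rules

rules = parse_map("""
[availability]
controller.state.Active = 1
controller.state.Failed = 0
battery.status.Failed = 0
""")
```

Looking up `rules["controller.state.Failed"]` would yield availability 0 with the default severity, which is the condition described above as generating an error.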
Sun A5K Monitoring
Monitoring of this device is done using the 'luxadm' cli command. The display, dump_map and rdls options are used with luxadm. See appendix A for more details about the information saved for this device.
The A5K health module monitors the following information:
- A change in the values of the fibre channel counters (luxadm -e rdls) may generate a LinkEvent if the threshold is exceeded. LinkEvents include information about both ports involved in the fibre channel link.
The Map file used to calculate availability and event severity for the A5K is as follows. It maps the status of these components: disk, power, interfaceboard, gbic, fan, backplane and mpxio. A value of 'START =>' means that the rule applies the very first time the component is probed, when there is no previous value to compare with. The operator '=>' is used to express a specific transition from one status value to another. Entering a specific transition in the map file allows more control over the severity of the alert generated.
##################
a5k.map
##################
[availability]
disk.status.OK-On = 1
disk.status.OK-NotInstalled = 0
START => disk.status.OK-NotInstalled = I
status.OK = 1
power.status.OK = 1
power.status.Not Installed = 0 = W
power.status.Failed = 0
interface_board.status.OK = 1
interface_board.status.Not Installed = 0 = W
interface_board.status.Failed = 0
gbic.status.O.K. = 1
gbic.status.Failed = 0
gbic.status.Not Installed = 0 = W
START => gbic.status.Not Installed = I
fan.status.OK = 1
fan.status.NotInstalled = 0 = W
fan.status.non_critical_failure = 0 = W
START => fan.status.NotInstalled = I
backplane.status.OK = 1
backplane.status.Failed = 0
mpx.state.ONLINE = 1
mpx.state.OFFLINE = 0
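The 'START =>' and '=>' rules above can be read as a small precedence scheme: an explicit transition rule wins, then a START rule on the first probe, then the plain rule for the new value. A hypothetical sketch of that lookup (the function and variable names are illustrative assumptions, not the product's code):

```python
# Sketch of how transition rules might be resolved against a map file.
def rule_severity(rules, key, old_value, new_value):
    """Return the severity for a status transition, preferring the most
    specific matching rule: explicit transition first, then the START
    rule on the very first probe, then the plain rule for the new value."""
    if old_value is not None:
        specific = f"{key}.{old_value} => {key}.{new_value}"
        if specific in rules:
            return rules[specific]
    else:
        start = f"START => {key}.{new_value}"
        if start in rules:
            return rules[start]
    return rules.get(f"{key}.{new_value}")

rules = {
    "gbic.status.Not Installed": "W",            # normal transition: warning
    "START => gbic.status.Not Installed": "I",   # first probe: informational
}
```

With these two rules (taken from the a5k map above), a gbic that is 'Not Installed' on the very first probe produces only an informational alert, while a later transition into that state produces a warning.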
Sun D2 Monitoring
The Sun D2 agent uses the luxadm inquiry command along with the following cli commands: disk_inquiry, rdbuf, identify and vpd. These commands are included with the package and monitor this device inband. See appendix A for the complete instrumentation report.
The D2 health module monitors the following:
- A change in the serial# of the midplane or the esm.0 or esm.1 module will generate a ComponentRemoved or ComponentInserted event.
- A change in the revision or revision-date of these same 3 modules will generate an alarmEvent.revision event.
- A change in the status of each slot (esm.0, esm.1) will generate a stateChange event.
- An alarmEvent is generated for a change in the status of the power supply and fan components. See the D2 map file in Appendix B for the valid values of this attribute.
- A change in the total number of slots present in the device will generate an alarmEvent.
- A change in the overall temperature status of the enclosure will generate an alarmEvent.
Here is the D2 availability Map file:
##################
d2.map
##################
[availability]
status.drive_inserted =1
status.drive_inserted_after_power_up =1
status.drive_removed_after_power_up =0
status.no_drive_insterted =0
power.status.operational_and_on =1
fan.status.operational =1
alarmEvent.revision = 0 = W
alarmEvent.slot_count = 0 = W
alarmEvent.temperature = 0 = W
Sun T3/6120 Monitoring
The Sun T3 and Sun 6120 agents use http tokens to monitor the device and can also monitor the T3 logfile if it is forwarded to the host. This monitoring is done primarily out-of-band. When the Message agent has access to inband information about the T3 from /var/adm/messages, this information is used to generate inband events about the same T3. Sun T3 monitoring is very detailed; see Appendix A for information about all the tokens extracted from the enclosure.
The T3 health module monitors the following:
- A discrepancy between the device time and the host time.
- A system reboot will generate an alarmEvent.
- A change in VolAttachOwner will generate an alarmEvent.
- A change in the FC stats (InvTxWord, InvCRCCnt, LossSignalCnt) will generate an internal loop alarmEvent.
- The following events can be generated for the controllers, power-supplies, loops, loopCards and disks:
- stateChange events when the status of these frus change.
- ComponentInsert and ComponentRemove events when the serial# change.
- A change in the revision of these frus will generate an alarmEvent.revision.
- For the power supply, the status of the batteries, fans, powerOutput and powerTemp are also monitored.
- For the Loop, both the loopStatus and LoopIsolated state are monitored and an alarm event can be generated.
- For the LoopCards in partner pairs, the cable (interface.loopcard.cable) is also monitored.
- For disks, a change in the Port1State and Port2State will generate an alarmEvent. Alarms are also generated when the disk temperature goes over 55 degrees.
- Changes in the status of the volumes (volStatus and volOper) on the device will generate a stateChange event.
- A change in the cacheMode and cacheMirror values will generate an alarmEvent.
- A change in the volSlicing state will generate an alarmEvent.
- Changes in lunPermissions and initiators values will generate alarmEvents.
- For the 6120, a change in the fibre channel counters may generate a linkEvent if the threshold is exceeded. For the Sun T3, fibre channel counters are only accessible inband and require inband access to the device.
When the Sun T3 logfile is available, it is monitored for warnings and errors to generate logEvents. See the t3_policies file for more details on the T3 logfile monitoring (Appendix D).
The T3 availability Map file used to evaluate the availability and severity of the different fru status and states is very detailed:
##################
t3.map
##################
[availability]
fruStatus.ready-enabled = 1
fruStatus.ready-disabled = 0
fruStatus.ready-substituted = 1
fruStatus.booting-enabled = 1
fruStatus.booting-disabled = 0
fruStatus.booting-substituted = 1
fruStatus.missing-enabled = 0
fruStatus.missing-disabled = 0
fruStatus.missing-substituted = 0
fruStatus.fault-enabled = 0
fruStatus.fault-disabled = 0
fruStatus.fault-substituted = 0
fruStatus.fault-disabled => fruStatus.fault-substituted = W
fruStatus.notInstalled-enabled = 0 = W
fruStatus.notInstalled-disabled = 0 = W
fruStatus.notInstalled-substituted = 0 = W
fruStatus.offline-enabled = 0
fruStatus.offline-disabled = 0
fruStatus.offline-substituted = 0
volStatus.mounted =1
fruPowerBatState.normal = 1
fruPowerFan1State.normal = 1
fruPowerFan2State.normal = 1
fruPowerPowOutput.normal = 1
fruPowerPowTemp.normal = 1
fruPowerBatState.refreshing => fruPowerBatState.fault = W
fruPowerBatState.fault => fruPowerBatState.unknown = W
START => fruPowerBatState.fault = E
START => fruPowerBatState.[Undefined] = W
fruPowerBatState.refreshing = 1
fruPowerFan1State.refreshing = 1
fruPowerFan2State.refreshing = 1
fruPowerPowOutput.refreshing = 1
fruPowerPowTemp.refreshing = 1
fruPowerBatState.fault = 0 = I
fruPowerBatState.unknown = 0
fruPowerFan1State.fault = 0
fruPowerFan2State.fault = 0
fruPowerPowOutput.fault = 0
fruPowerPowTemp.fault = 0
fruPowerBatState.off = 0
fruPowerFan1State.off = 0
fruPowerFan2State.off = 0
fruPowerPowOutput.off = 0
fruPowerPowTemp.off = 0
fruDiskPort2State.ready = 1
fruDiskPort2State.notReady = 0
fruDiskPort2State.bypass = 0 = W
fruDiskPort1State.ready = 1
fruDiskPort1State.notReady = 0
fruDiskPort1State.bypass = 0 = W
fruLoopCable1State.installed = 1
fruLoopCable2State.installed = 1
fruLoopCable1State.notInstalled = 0
fruLoopCable2State.notInstalled = 0
START => fruLoopCable1State.notInstalled = W
START => fruLoopCable2State.notInstalled = W
pathstat.INVALID = 0
pathstat.OK = 1
loopStatus.available = 1
volCacheMode.writeBehind = 1
volCacheMode.writeThrough = 0 = W
volCacheMode.disabled = 0 = W
START => volCacheMode.writeThrough = I
volCacheMirror.on = 1
volCacheMirror.off = 0 = W
START => volCacheMirror.off = I
START => volCacheMode.[Undefined] = W
START => volCacheMirror.[Undefined] = W
volOper.OK = 1
volOper.reconstructing = 1 = W
volOper.reconstructingToStandby = 1 = W
volOper.copyingFromStandby = 1 = W
volOper.copyingToStandby = 1 = E
volOper.initializing = 1 = W
volOper.verifying = 1 = I
alarmEvent.time_diff = 0 = I
alarmEvent.system_reboot = 0 = W
alarmEvent.volOwner = 0 = W
alarmEvent.cacheMode = 0 = W
alarmEvent.sysvolslice = 0 = W
alarmEvent.volCount = 0 = W
alarmEvent.lunPermission = 0 = W
alarmEvent.initiators = 0 = W
FC-Tape Monitoring
FC tapes are monitored using the luxadm display command. The health module for this device reports changes in the port status. Fibre channel counters are also monitored using luxadm -e rdls; see Sun A5K for details.
This is the availability Map file for FC-Tape:
##################
tape.map
##################
[availability]
status.Ready =1
status.Not Ready =1
status.O.K. =1
Sun V880 Disk Monitoring
This device is monitored using the luxadm command. The health module monitors the following information:
- A change in the status of each disk will generate a stateChange Event.
- A change in the serial# of the disks will generate a ComponentInsert and ComponentRemove event.
- The state of the base backplane, base loop, expansion backplane and expansion loop will generate an alarm event.
- A change in the temperature status of the base and expansion unit will generate an alarm event.
- A change in the values of the fibre channel counters (luxadm -e rdls) may generate a LinkEvent if the threshold is exceeded. LinkEvents include information about both ports involved in the fibre channel link.
This is the availability Map file for the Sun V880Disk:
##################
v880disk.map
##################
[availability]
disk.status.OK-On = 1
disk.status.OK-NotInstalled = 0
status.OK = 1
SSC.status.OK = 1
SSC.status.NotInstalled = 0 = W
temperature.status.OK = 1
temperature.status.NotInstalled = 1
Sun Virtualization Engine (Vicom) Monitoring
The VE available with the Sun 6900 Storage Solution uses cli commands on the service processor. These commands are showmap, slicview, mpdrive and svstat. The health module monitors the following:
- New volumes and deleted volumes will generate alarm events.
- Any change in the logical map table (active_path, passive_path, active_uid, passive_uid) will generate an alarmEvent.
- A change in the values of the fibre channel counters may generate a LinkEvent if the threshold is exceeded.
Thresholds are used when monitoring statistical counters for this device. For more details about these thresholds, see the SW_Thresholds in Appendix C.
Sun 1 Gig Switch Monitoring
The Sun Switch agent uses the sanbox cli command to monitor 1 Gig Qlogic switches. The commands used are sanbox version, chassis_status, port_counts, port_status, chassis_counters, get_zone, chassis_id and links. The following attributes are monitored by the health module:
- A change in the status of fan.1, fan.2, the power supply and the temperature will generate an alarmEvent.
- A change in the system_reboot value will generate an alarm event.
- A change in the status of the switch ports will generate a stateChange event. The administrative status is included with the description of these events.
- Changes in the values of the internal ASIC counters will generate port statistics events.
- A change in the zoning of the switch will generate an alarmEvent.
- A change in the values of the fibre channel counters may generate a LinkEvent if the threshold is exceeded.
Thresholds are used when monitoring statistical counters for this device. For more details about these thresholds, see the SW_Thresholds in Appendix C.
The Sun 1Gig Switch availability Map file includes the following rules:
##################
switch.map
##################
[availability]
mode.Online =1
mode.Not-logged-in =0
mode.Offline =0
mode.AdminOffline =0
operational.on =1
fan.1.status.OK =1
fan.2.status.OK =1
power.status.OK =1
temp.status.OK =1
alarmEvent.system_reboot = 0 = W
alarmEvent.zone_change = 0 = W
Sun 2 Gig Switch Monitoring
This agent uses snmp to monitor Qlogic 2 Gig switches and upgraded 1 Gig switches. See the sample instrumentation report in Appendix A for details about the information extracted using snmp.
The health module for this device monitors the following information. This health module is also used for InRange switches.
- Change in the port status of each switch port generates a stateChange event. The administrative status is included with this event.
- A change in port statistics will generate an alarmEvent. Events are only generated when the ports are online. These statistics include Class2Rx, Class2Tx, Class3Rx and Class3Tx frames, InvalidCRC, InvalidTxWords, LinkFailures, SyncLoss and ProtoErrors.
- A change in the board sensor status will generate an alarm event.
- A change in the status of the fan and power-supply will generate an alarm event.
- A change in the System.uptime value of the switch will generate an alarm event.
- A change in the values of the fibre channel counters may generate a LinkEvent if the threshold is exceeded.
Thresholds are used when monitoring statistical counters for this device. For more details about these thresholds, see the SW_Thresholds in Appendix C.
The Sun 2Gig Switch availability Map file includes the following rules:
##################
switch2.map
##################
[availability]
state.online = 1
state.offline = 0
sensor.board.status.ok = 1
sensor.fan.status.ok = 1
sensor.power-supply.status.ok = 1
sensor.power-supply.status.failed = 0
alarmEvent.system_reboot = 0 = W
Brocade Switch Monitoring
This agent uses snmp to monitor Brocade switches. The health module for this device monitors the following information:
- Change in the port status of each switch port generates a stateChange event. The administrative status is included with this event.
- A change in the status of the temperature, fan and power-supply sensors will generate an alarm event.
- Port statistics are also monitored for changes.
- A change in the system last_reboot will generate an alarmEvent.
- A change in the values of the fibre channel counters may generate a LinkEvent if the threshold is exceeded.
Thresholds are used when monitoring statistical counters for this device. For more details about these thresholds, see the SW_Thresholds in Appendix C.
The Brocade Switch availability Map file includes the following rules:
##################
brocade.map
##################
[availability]
state.OpStatus.online =1
state.OpStatus.offline =0
state.OpStatus.faulty =0
sensor.temperature.status.nominal =1
sensor.temperature.status.absent =1
sensor.temperature.status.faulty =0
sensor.power-supply.status.nominal =1
sensor.power-supply.status.absent =0
sensor.power-supply.status.faulty =0
sensor.fan.status.nominal =1
sensor.fan.status.absent =1
sensor.fan.status.faulty =0
alarm.system_reboot = 0 = W
McData Switch Monitoring
This agent uses snmp to monitor McData switches. The health module for this device monitors the following information:
- Change in the port status of each switch port generates a stateChange event. The administrative status is included with this event.
- A change in the status of the temperature, fan and power-supply sensors will generate an alarm event.
- Port statistics are also monitored for changes.
- A change in the system last_reboot will generate an alarmEvent.
- A change in the values of the fibre channel counters may generate a LinkEvent if the threshold is exceeded.
Thresholds are used when monitoring statistical counters for this device. For more details about these thresholds, see the SW_Thresholds in Appendix C.
The McData Switch availability Map file includes the following rules:
##################
mcdata.map
##################
[availability]
state.online =1
state.offline =0
sensor.powerSupply.status.ok =1
sensor.powerSupply.status.absent =0
sensor.powerSupply.status.failed =0
sensor.fan.status.ok =1
sensor.fan.status.absent =1
sensor.fan.status.failed =0
alarmEvent.system_reboot = 0 = W
InRange Switch Monitoring
This agent uses snmp to monitor InRange switches. The health module inherits from the Sun 2Gig Switch module.
- A change in the values of the fibre channel counters may generate a LinkEvent if the threshold is exceeded.
The InRange switch uses the same availability map as the Sun 2Gig Switch.
Sun 3310/3510 Monitoring
The Sun 3310 and 3510 storage devices can be monitored both inband and out-of-band using the same cli command, which accepts either a device path or an ip address. The application supports the discovery and monitoring of these devices either way. The commands used are sccli diag errors, sccli inquiry, sccli show events, sccli show port-wwns, sccli show lun-maps and sccli show configuration. The output of most of these commands is used by the 3310/3510 health module. The sccli show events command is used to extract events that are generated directly by the storage device and carry them as logEvents. This is done by the Message health module, not the 3310/3510 health module.
The 3310/3510 health module monitors the following information:
- A change in the overall status of the enclosure will generate an alarm event.
- Changes in the firmware_version or Revision information will generate an alarmEvent.
- A missing channel will generate an alarmEvent.channel.
- A change in the revision or in the serial# of the frus will generate alarm events.
- A change in the status of the disks will generate a stateChange event.
- A change in the serial# of the disks will generate a ComponentInsert or ComponentRemove event.
- Alarm events are generated when a new lun is added, when the total number of partitions changes, when the raid level changes or when the effective size of a partition changes.
- A change in the status of each lun will generate a stateChange event for that lun.
Thresholds are used when monitoring statistical counters for this device. For more details about these thresholds, see the SW_Thresholds in Appendix C.
Here is the Sun 3310/3510 availability Map information:
##################
3310.map
##################
[availability]
enclosure.status.Online = 1
enclosure.status.Low Battery = 1
enclosure.status.Normal = 1
enclosure.status.Critical = 0
enclosure.status.Offline = 0
enclosure.status.Critical Rebuild = 0
enclosure.status.Non Existent = 0 = W
disk_status.Online = 1
disk_status.Used = 1
disk_status.Bad drive = 0
disk_status.Offline = 0
lun_status.Normal = 1
lun_status.Degraded = 0
enclosure.ethernet.ok = 1
enclosure.ethernet.ping_failed = 0
Sun 9900 (Hitachi) Monitoring
This agent uses snmp to monitor Sun 9900 storage enclosures. This agent does not monitor at a fru level but only at a subsystem level. The subsystems available are the controller and the disk subsystem. The health module for this device monitors the following information:
- On the controller, changes in the battery, cache, fan, environment, power-supply, shared memory and processor status will generate an alarm event.
- On the disk, changes in the drives, power-supply, environment and fan subsystem status will generate an alarm event.
- A change in the controller version (raid.DKCMainVersion) will generate an alarm event.
This is the availability Map file for the Sun 9900 storage device:
##################
9900.map
##################
# noError (1),
# acute (2),
# serious (3),
# moderate(4),
# service (5)
[availability]
controller.Cache.noError = 1
controller.Cache.acute = 0
controller.Cache.serious = 0
controller.Cache.moderate = 0 = W
controller.Cache.service = 0 = W
controller.Battery.noError = 1
controller.Battery.acute = 0
controller.Battery.serious = 0
controller.Battery.moderate = 0 = W
controller.Battery.service = 0 = W
# Internal Bus
controller.CSW.noError = 1
controller.CSW.acute = 0
controller.CSW.serious = 0
controller.CSW.moderate = 0 = W
controller.CSW.service = 0 = W
controller.Environment.noError = 1
controller.Environment.acute = 0
controller.Environment.serious = 0
controller.Environment.moderate = 0 = W
controller.Environment.service = 0 = W
controller.Fan.noError = 1
controller.Fan.acute = 0
controller.Fan.serious = 0
controller.Fan.moderate = 0 = W
controller.Fan.service = 0 = W
controller.Processor.noError = 1
controller.Processor.acute = 0
controller.Processor.serious = 0
controller.Processor.moderate = 0 = W
controller.Processor.service = 0 = W
# POWER SUPPLY
controller.PS.noError = 1
controller.PS.acute = 0
controller.PS.serious = 0
controller.PS.moderate = 0 = W
controller.PS.service = 0 = W
# SHARED MEMORY
controller.SM.noError = 1
controller.SM.acute = 0
controller.SM.serious = 0
controller.SM.moderate = 0 = W
controller.SM.service = 0 = W
# DISK
disk.Drive.noError = 1
disk.Drive.acute = 0
disk.Drive.serious = 0
disk.Drive.moderate = 0 = W
disk.Drive.service = 0 = W
disk.Environment.noError = 1
disk.Environment.acute = 0
disk.Environment.serious = 0
disk.Environment.moderate = 0 = W
disk.Environment.service = 0 = W
disk.Fan.noError = 1
disk.Fan.acute = 0
disk.Fan.serious = 0
disk.Fan.moderate = 0 = W
disk.Fan.service = 0 = W
# POWER SUPPLY
disk.PS.noError = 1
disk.PS.acute = 0
disk.PS.serious = 0
disk.PS.moderate = 0 = W
disk.PS.service = 0 = W
Sun Data Service Processor (DSP) Monitoring
The DSP agent uses XML over an HTTP connection to monitor this device out-of-band. The information includes chassis, disks, ports, volumes and system information. The health module for this device monitors the following information:
- Changes in the port status will generate a stateChange event.
- A change in the status of each module will generate a stateChange event.
- The addition or removal of a module or a change in the serial# of a module will generate a ComponentInsert or ComponentRemove event.
Note: More details will be forthcoming.
[dsp.map]
portState.Online =1
portState.Offline =0
moduleState.Ready =1
powerState.On =1
powerState.Off =1
- DSP syslog entries. The DSP syslog is monitored for notable entries, which create an alert. Initially, only the ALERT, CRIT, ERR and WARNING severities are covered. Pirus events use a subset of the Unix syslog severities.
Valid values are:
LOG_ALERT - Needs immediate attention; imminent chassis failure
LOG_CRIT - Needs attention; potential chassis failure if left unchecked
LOG_ERR - Needs attention; imminent loss of service
LOG_WARNING - Warning; potential loss of service
LOG_INFO - Informational
Appendix A: Instrumentation Reports
(included in a separate document)
Appendix B: Instrumentation Report Mapping
(included in a separate document).
Appendix C: Thresholds File (SW_Thresholds)
# FORMAT
# Code = Cnt, Period, QuietPeriod, Severity, Desc
# Period, QuietPeriod: hours/minutes
# Thresholds for /var/adm/messages driver
driver.SF_OFFLINE = 10,24h,1h,W,socal/ifp Offline
driver.SF_OFFLALERT = 15,24h,1h,E,socal/ifp Offline
driver.SCSI_TRAN_FAILED = 10,4h,1h,W, SCSI transport failed
driver.SCSI_ASC = 10,4h,1h,W,scsi
driver.SCSI_TR_READ = 10,4h,1h,W,scsi READ
driver.SCSI_TR_WRITE = 10,4h,1h,W,scsi WRITE
driver.SSD_WARN = 5,24h,1h,W,SSD Warning
driver.SSD_ALERT = 20,24h,1h,E,SSD Alert
driver.PFA = 1,24h,1h,E,Predictive Failure
driver.SF_CRC_WARN = 10,24h,1h,W,CRC Warning
driver.SF_CRC_ALERT = 15,24h,1h,E,CRC Alert
driver.SFOFFTOWARN = 5,24h,1h,W,Offline Timeouts
driver.SF_DMA_WARN = 1,24h,1h,W,SF DMA Warning
driver.SF_RESET = 10,24h,1h,W,SF Reset
driver.ELS_RETRY = 10,24h,1h,W,ESL retries
driver.SF_RETRY = 10,24h,1h,W,SF Retries
driver.TOELS = 10,24h,1h,W,ELS Timeouts
driver.SFTOELS = 10,24h,1h,W,SFTOELS Timeouts
driver.DDOFFL = 10,24h,1h,W,Offlines
driver.LOOP_OFFLINE = 1,5m,1h,E, Loop Offline
driver.LOOP_ONLINE = 1,5m,1h,N, Loop Online
driver.QLC_LOOP_OFFLINE = 1,5m,1h,E, Loop Offline
driver.QLC_LOOP_ONLINE = 1,5m,1h,N, Loop Online
driver.LINK_DOWN = 1,5m,1h,E, JNI Loop down
driver.LINK_UP = 1,5m,1h,N, JNI Loop up
# Thresholds with VM present
driver.VM_SF_OFFLINE = 10,24h,1h,W,socal/ifp Offline
driver.VM_SF_OFFLALERT = 15,24h,1h,E,socal/ifp Offline
driver.VM_SCSI_TRAN_FAILED = 10,4h,1h,W, SCSI transport failed
driver.VM_SCSI_ASC = 10,4h,1h,W,scsi
driver.VM_SCSI_TR_READ = 10,4h,1h,W,scsi READ
driver.VM_SCSI_TR_WRITE = 10,4h,1h,W,scsi WRITE
driver.VM_SSD_WARN = 100,24h,1h,W,SSD Warning
driver.VM_SSD_ALERT = 1,24h,1h,E,SSD Alert
driver.VM_PFA = 1,24h,1h,E,Predictive Failure
driver.VM_SF_CRC_WARN = 100,24h,1h,W,CRC Warning
driver.VM_SF_CRC_ALERT = 1,24h,1h,E,CRC Alert
driver.VM_SFOFFTOWARN = 5,24h,1h,W,Offline Timeouts
driver.VM_SF_DMA_WARN = 1,24h,1h,W,SF DMA Warning
driver.VM_SF_RESET = 1,24h,1h,W,SF Reset
driver.VM_ELS_RETRY = 1,24h,1h,W,ESL retries
driver.VM_SF_RETRY = 1,24h,1h,W,SF Retries
driver.VM_TOELS = 1,24h,1h,W,ELS Timeouts
driver.VM_SFTOELS = 1,24h,1h,W,SFTOELS Timeouts
driver.VM_DDOFFL = 1,24h,1h,W,Offlines
driver.VM_LOOP_OFFLINE = 2,5m,1h,E, Loop Offline
driver.VM_LOOP_ONLINE = 2,5m,1h,N, Loop Online
# A3500
a3500.CTRL_FIRM = 1,24h,24h,W,Controller firmware version error
# Thresholds for the Vicom
vicom.crc = 200,50m,10m,E, CRC
vicom.itw = 200,50m,10m,E, Invalid Transmit words
vicom.link = 200,50m,10m,E, Link fails
vicom.proto = 200,50m,10m,E,
vicom.signal = 200,50m,10m,E, Signal losses
vicom.sync = 200,50m,10m,E, Sync losses
# Thresholds for the Switch
switch.LinkFails = 200,50m,10m,E,
switch.Total_LIP_Rcvd = 200,50m,10m,E,
switch.InvalidTxWds = 200,50m,10m,E,
switch.SyncLosses = 200,50m,10m,E,
switch.CRC_Errs = 200,50m,10m,E,
switch.Prim_Seq_Errs = 200,50m,10m,E,
switch.AL_Init_Errs = 200,50m,10m,E,
switch.AddressIdErrs = 200,50m,10m,E,
switch.short_frame_err_cnt = 200,50m,10m,E,
switch.long_frame_err_cnt = 200,50m,10m,E,
switch.loss_of_signal_cnt = 200,50m,10m,E,
switch.sync_loss = 200,50m,10m,E,
switch.Discards = 200,50m,10m,E,
switch.AL_Inits = 200,50m,10m,E,
switch.LIF_flow_cntrl_err_cnt = 200,50m,10m,E,
switch.lof_timeout_els = 200,50m,10m,E,
switch.lof_timeout = 200,50m,10m,E,
# Thresholds for the Switch2
switch2.LossofSynchronization = 200,50m,10m,E,
switch2.LinkFailures = 200,50m,10m,E,
switch2.PrimitiveSequenceProtocolErrors = 200,50m,10m,E,
switch2.InvalidTxWords = 90,60m,10m,E,
switch2.InvalidCRC = 200,50m,10m,E,
brocade.LipIns = 100,50m,10m,E,
brocade.LipOuts = 100,50m,10m,E,
brocade.McastTimedOuts = 100,50m,10m,E,
brocade.RxBadEofs = 100,50m,10m,E,
brocade.RxBadOs = 100,50m,10m,E,
brocade.RxCrcs = 100,50m,10m,E,
brocade.RxEncInFrs = 100,50m,10m,E,
brocade.RxTooLongs = 100,50m,10m,E,
mcdata.AddressIdErrors = 100,50m,10m,E,
mcdata.DelimiterErrors = 100,50m,10m,E,
mcdata.InvalidCrcs = 100,50m,10m,E,
mcdata.InvalidTxWords = 100,50m,10m,E,
mcdata.LinkFailures = 100,50m,10m,E,
mcdata.LinkResetIns = 100,50m,10m,E,
mcdata.LinkResetOuts = 100,50m,10m,E,
mcdata.PrimSeqProtoErrors = 100,50m,10m,E,
mcdata.SigLosses = 100,50m,10m,E,
mcdata.SyncLosses = 100,50m,10m,E,
3310.lip = 100,50m,10m,E,
3310.link = 100,50m,10m,E,
3310.sync = 100,50m,10m,E,
3310.signal = 100,50m,10m,E,
3310.seq = 100,50m,10m,E,
3310.itw = 100,50m,10m,E,
3310.crc = 100,50m,10m,E,
CRCcounters.rule1 = 10,24h,0m,E, Host <-> switch
CRCcounters.rule3 = 10,24h,0m,E, switch <-> switch
CRCcounters.rule8 = 10,24h,0m,E, storage <-> host/switch
ITWcounters.rule1 = 10,1h,6m,E, Host <-> switch
ITWcounters.rule3 = 10,1h,6m,E, switch <-> switch
ITWcounters.rule8 = 10,1h,6m,E, storage <-> host/switch
SIGcounters.rule1 = 10,1h,6m,E, Host <-> switch
SIGcounters.rule3 = 10,1h,6m,E, switch <-> switch
SIGcounters.rule8 = 10,1h,6m,E, storage <-> host/switch
# Health Check thresholds
# ('LINK', 'SIG', 'SEQ', 'CRC', 'SYNC', 'TXW', 'INF', 'OUTF');
health-switch.TXW = 1, 1m, 0m, E, Too many InvalidTxWords
health-switch.CRC = 1, 2m, 0m, E, Too many CRC
health-switch.TXW = 1, 2m, 0m, E, Too many InvalidTxWords
health-switch.SYNC = 1, 2m, 0m, E, Too many SYNC
health-switch.SEQ = 1, 2m, 0m, E, Too many SEQ
health-switch.SIG = 10, 2m, 0m, E, Too many InvalidTxWords
' This is text explaining what to do with this problem
' This is the second line of text. This text can be maintained in System/SW_thresholds
health-a5k.SYNC = 1, 2m, 0m, E, Too many Sync
health-a5k.SEQ = 1, 2m, 0m, E, Too many InvalidSequence
health-a5k.TXW = 1, 2m, 0m, E, Too many InvalidTxWords
health-a5k.CRC = 1, 2m, 0m, E, Too many CRC
health-t3.SYNC = 1, 2m, 0m, E, Too many SYNC
health-t3.SEQ = 1, 2m, 0m, E, Too many SEQ
health-t3.TXW = 1, 2m, 0m, E, Too many InvalidTxWords
health-t3.CRC = 1, 2m, 0m, E, Too many CRC
health-tape.CRC = 1, 2m, 0m, E, Too many CRC
health-tape.TXW = 1, 2m, 0m, E, Too many InvalidTxWords
health-tape.SYNC = 1, 2m, 0m, E, Too many SYNC
health-tape.SEQ = 1, 2m, 0m, E, Too many SEQ
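Each threshold entry above reads as "fire when Cnt occurrences arrive within Period, then stay quiet for QuietPeriod". A minimal Python sketch of that logic, under the assumption that this is how the fields combine (the class and method names are illustrative, not the product's code):

```python
from collections import deque

class Threshold:
    """Sketch of a Cnt/Period/QuietPeriod threshold.  Times are seconds;
    the h/m suffixes in SW_Thresholds would be converted before use."""

    def __init__(self, count, period_s, quiet_s):
        self.count = count
        self.period_s = period_s
        self.quiet_s = quiet_s
        self.times = deque()      # timestamps inside the sliding window
        self.muted_until = 0.0    # end of the current quiet period

    def record(self, now):
        """Record one occurrence; return True when an alert should fire."""
        if now < self.muted_until:
            return False          # still inside the quiet period
        self.times.append(now)
        while self.times and now - self.times[0] > self.period_s:
            self.times.popleft()  # drop occurrences older than Period
        if len(self.times) >= self.count:
            self.times.clear()
            self.muted_until = now + self.quiet_s
            return True
        return False

# driver.SF_CRC_WARN = 10,24h,1h,W  ->  10 hits in 24h, then quiet for 1h
t = Threshold(10, 24 * 3600, 3600)
```

The quiet period keeps a noisy counter from flooding the event stream with duplicate alerts, which matches the QuietPeriod column in the entries above.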
Appendix D: Sun T3/6120 and 3310 logfile monitoring
These policy files are used to evaluate the severity and category of entries found in logfiles. A policy is applied when its pattern matches the entry in the logfile. Patterns are regular expressions.
#
# This only applies to t3.LogEvent
# Policies are executed from top to bottom in the file.
#
[policy1]
pattern=/warning temperature threshold exceeded/
egrid=temp_threshold
key=temp_threshold
severity=2

[policy2]
pattern=/u(\d)ctr ISP.+LOOP DOWN/
known=1
severity=1
action=1
egrid=controller.port
key=$PORT

[policy3]
pattern=/: W: .*, Replace battery/
egrid=power.battery.replace
key=replaceBattery
severity=2

# warnings that come from a shell are a Notice and not actionable
[policy40]
pattern=/ sh\d+\[.*: W: (.*)/
severity=0
egrid=array_warning

[policy41]
pattern=/: [EW]: u\dctr XOR:/
key=controller
severity=2
egrid=controller.XOR

[policy4]
pattern=/: W: (.*)/
severity=1
egrid=array_warning
action=1

[policy5]
pattern=/: E: (.*)/
severity=2
egrid=array_error

[policy6]
pattern=/: WARNING: /
severity=2
egrid=$comp1

[policy7]
pattern=/: N: u\dpcu\d: Refreshing battery/
severity=0
key=refreshBattery
egrid=power.battery.refresh

[policy71]
pattern=/: N: u\dctr: Enabled/
severity=0
egrid=controller

[policy8]
pattern=/: N: u\dpcu\d.*Battery not OK/
severity=2
egrid=power.battery

[policy9]
pattern=/: N: u\dpcu\d.*PCU\d hold time/
severity=1
egrid=power.battery
action=1

# runs when Sense key is present in the following 1/2 lines
[policy101]
pattern=/: [WNI]: u\dd\d/
pattern2=/Sense Key = (\w+), Asc = (\w+), Ascq = (\w+)/
pattern3=/Sense Key = (\w+), Asc = (\w+), Ascq = (\w+)/
key=senseKey
extended=senseKey
egrid=disk.senseKey

[policy10]
pattern=/: [NI]: u\dd\d/
pattern2=/Sense/
pattern3=/Sense/
key=senseKey
egrid=disk.senseKey
severity=2

[policy11]
pattern=/: [NI]: u\dd\d.*disk error/
key=disk_error
egrid=disk.error
severity=2

# generic Notice or Info about a disk is considered warning/actionable
# shortkey means one event can have notices from different disks.
[policy12]
pattern=/: N: u\dd\d/
key=log
egrid=disk.log
severity=1
shortkey=1
action=0
#
#######################################
# This only applies to 3310.LogEvent
# Policies are executed from top to bottom in the file.
#
[policy1]
pattern=/ALERT: CPU/
egrid=cpu
severity=2

[policy2]
pattern=/ALERT:/
egrid=array_error
severity=2

[policy1]
pattern=/FATAL ERROR/
egrid=array_error
severity=2

[policy3]
pattern=/controller failure detected/
egrid=controller
severity=2
actionable=1
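The policy mechanism described in this appendix, where each pattern is tried top to bottom and the first match wins, can be sketched as follows. The three policies shown are a subset of the T3 set above; the classify function itself is an illustrative assumption, not the product's code.

```python
import re

# A subset of the T3 policies from Appendix D, expressed as Python data.
# Ordering matters: more specific patterns come before generic ones.
POLICIES = [
    {"pattern": r"warning temperature threshold exceeded",
     "egrid": "temp_threshold", "severity": 2},
    {"pattern": r": W: .*, Replace battery",
     "egrid": "power.battery.replace", "severity": 2},
    {"pattern": r": W: (.*)",
     "egrid": "array_warning", "severity": 1},
]

def classify(entry):
    """Return the first policy whose regex matches the log entry,
    or None when no policy applies."""
    for policy in POLICIES:
        if re.search(policy["pattern"], entry):
            return policy
    return None
```

A battery-replacement warning matches the specific policy before the generic ": W:" rule, which is why the file is ordered from specific to generic.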