11 Monitoring and Testing the System

System monitoring involves the use of basic commands and optional utilities to obtain baselines of operating parameters, such as the CPU workload or I/O throughput. You use these baselines to monitor, record, and compare ongoing system activity and ensure that the system does not deviate too far from your operational requirements.

Monitoring the system also enables you to predict and prevent problems that might make the system or its peripherals unavailable to users. Information from monitoring utilities enables you to react quickly to unexpected events such as system panics and disk crashes so that you can quickly resolve problems and bring the system back online.

The topic of monitoring is closely related to your technical support needs. Some of the utilities described in this chapter have a dual function. Apart from realtime system monitoring, they also collect historical and event-specific data that is used by your technical support representative. This data can be critical in getting your system up and running quickly after a fault in the operating system or hardware. Therefore, it is recommended that you at least follow the monitoring guidelines in Section 11.1.

Testing involves the use of commands and utilities to exercise parts of the system or peripheral devices such as disks. The available test utilities are documented in this chapter. Your system hardware also provides test utilities that you run at the console prompt. Refer to your Owner's Manual for information on hardware test commands.

The following topics are covered in this chapter:

Section 11.1 contains basic monitoring guidelines and provides an overview of the utilities. It also provides pointers to related topics.

Section 11.2 describes some of the monitoring utilities in greater detail.

Section 11.3 describes environmental monitoring, which monitors aspects of system hardware status such as the temperature and whether the cooling fan is working. This feature depends on whether the hardware contains sensors that support such monitoring. Not all systems support this feature.

Section 11.4 describes how you use the system component test utilities. Note that your system hardware also provides test routines; refer to the Owner's Manual for more information. If you need detailed information on the characteristics of system devices (such as disks and tapes), see the hwmgr command, documented in Chapter 5.

11.1 Overview of Monitoring and Testing

This section provides some general guidelines for monitoring your system, and a brief overview of all the utilities that the operating system provides.

11.1.1 Guidelines for Monitoring Systems

Use the following procedure after you configure your system exactly as required for its intended operation:

1. Choose the utilities you will use to monitor your system on a daily basis.

Review the overview of monitoring utilities provided in this section. Based on the system configuration, select utilities that meet the requirements of the configuration and your monitoring needs. For example, if you have a graphics head terminal and you want to monitor several distributed systems, you might want to set up the SysMan Station. If you want to monitor a single local server, the dxsysinfo window might be adequate.

If applicable, set any attributes that trigger warnings and messages. For example, you might want to set a limit of 85% full on all file systems to prevent loss of data due to a full device.

Note

Many optional subsystems provide their own monitoring utilities. Familiarize yourself with these interfaces and decide whether they are more appropriate than the generic utilities.

2. Establish a baseline

Run the sys_check -all utility to:

Establish a no-load baseline.

Determine whether any system attributes need to be tuned.

If necessary, use the information from sys_check to tune system attributes. Refer to the System Configuration and Tuning guide for information on tuning your system. Store the baseline data where it can be easily accessed later, such as on another system. You might also want to print a copy of the report.

3. Run the sys_check utility under load

At an appropriate time, run the sys_check utility when the system is under a reasonable workload. Choose only those options that you want to monitor, such as -perf. This might have a small impact on system performance, so you might not want to run it during peak end-user demand.

Analyze the output from the sys_check utility and perform any additional recommended changes that meet with your operational requirements. This might involve further tuning of system attributes or configuration changes such as the reallocation of system resources using a utility such as the Class Scheduler. See Section 11.2.2 for information on using the sys_check utility.

4. Set up Event Management (EVM)

Configure the event management logging and reporting strategy for the system in conjunction with whatever monitoring strategy you employ. See Chapter 13 and Chapter 12 for information on how to configure EVM.

5. Configure monitoring utilities

Set up any other monitoring utilities that you want to use. For example:

Configure the sys_check utility to run regularly during off-peak hours by using the runsyscheck script with the cron utility as described in Section 11.2.2. In the event of a system problem, the regularly-updated report is useful when analyzing and troubleshooting the problem.

Note

Crash dump data might also be required when diagnosing system problems. See Chapter 14 for information on configuring the crash dump environment.

Install and configure any optional performance utilities, such as the Performance Manager. If supported by the target system, configure environmental monitoring, as described in Section 11.3.
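The file system limit mentioned in step 1 (for example, 85% full) can also be checked from the command line. The following is a minimal sketch, not a supplied utility: the function name fs_over_limit is hypothetical, and it assumes six-column df output with the capacity in column 5 and the mount point in column 6, as produced by df -P on POSIX systems.

```shell
# Sketch: list file systems whose usage is at or above a limit.
# Reads df-style output on stdin; the limit (percent) defaults to 85.
# fs_over_limit is a hypothetical helper, not an operating system command.
fs_over_limit() {
    limit=${1:-85}
    awk -v limit="$limit" '
        NR > 1 {                          # skip the header line
            pct = $5; sub(/%/, "", pct)   # capacity column, for example "91%"
            if (pct + 0 >= limit) print $6, pct "%"
        }'
}

# Possible use, assuming df accepts -P on the target system:
#   df -P | fs_over_limit 85
```

Because the function only reads standard input, you can also run it against saved df output when comparing the current state with an earlier baseline.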

11.1.2 Summary of Commands and Utilities

The operating system provides a number of monitoring commands and utilities. Some commands return a simple snapshot of system data in numerical format, while others have many options for selecting and filtering information. Also provided are complex graphical interfaces that filter and track system data in real time and display it on a graphics head terminal.

Choose monitoring utilities that best fit your local environment and monitoring needs and consider the following:

Using monitoring utilities can impact system performance.
- To help diagnose problems in performance, such as I/O bottlenecks, a simple command such as iostat might be adequate.
- To provide a quick visual check of resources on a single-user system, the X11 System Information interface (dxsysinfo) might be adequate.

Some utilities are restricted to the root user while others are accessible by all system users.

For enterprise-wide monitoring, the SysMan Station can display the health of many systems simultaneously on a single screen.

To track assets across an enterprise or verify what options are installed in what systems (and check whether they are functioning correctly), the web-based Insight Manager utility can be used for both UNIX servers and client PC systems.

You might need to provide output from a monitoring utility to your technical support site during problem diagnosis. It will greatly reduce your system downtime if you take a system baseline and establish a routine monitoring and data collection schedule before any problems occur.

The following sections describe the monitoring utilities.

11.1.2.1 Command-Line Utilities

Use the following commands to display a snapshot of various system statistics:

vmstat

The vmstat command displays system statistics for virtual memory, processes, trap, and CPU activity. An example of vmstat output is:

bigrig> vmstat
Virtual Memory Statistics: (pagesize = 8192)
procs   memory         pages                         intr       cpu
r  w  u act  free wire fault cow zero react pin pout in  sy  cs us sy id
2 97 20 8821  50K 4434 653K 231K 166K 1149 142K    0 76 250 194  1  1 98

See vmstat(1) for more information.

iostat

The iostat command reports input and output information for terminals and disks and the percentage of time the CPU has spent performing various operations. An example of iostat output is:

bigrig> iostat
    tty      floppy0          dsk0          cpu
 tin tout    bps    tps    bps    tps  us ni sy id
   0    1      0      0      3      0   0  0  1 98

See iostat(1) for more information.

who

The who command lists the users who are currently logged in to the system, together with the terminal line and login time for each user. An example of who output is:

bigrig> who
root        console     Jan  3 09:55
root        :0          Jan  3 09:55
root        pts/1       Jan  3 09:55
bender      pts/2       Jan  3 14:59
root        pts/3       Jan  3 15:43

See who(1) and users(1) for more information.

uptime

The uptime command reports how long the system has been running. See uptime(1) for more information.

Refer also to the netstat command and the Network Administration: Connections guide for information on monitoring your network.
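Snapshot commands such as vmstat are easy to compare against a baseline from a script. The following sketch assumes the column layout shown in the vmstat example above, where the last column (id) is the CPU idle percentage. The function names vmstat_idle and warn_if_busy are hypothetical.

```shell
# Sketch: extract the CPU idle percentage from vmstat output on stdin.
# Assumes the idle value ("id") is the last field of the last non-empty line.
vmstat_idle() {
    awk 'NF > 0 { last = $NF } END { print last }'
}

# Print a warning if the idle percentage read from stdin is below $1.
warn_if_busy() {
    idle=$(vmstat_idle)
    [ "$idle" -ge "$1" ] || echo "CPU idle ${idle}% is below ${1}%"
}

# Possible use:
#   vmstat | warn_if_busy 20
```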

11.1.2.2 SysMan Menu Monitoring and Tuning Tasks

The SysMan Menu provides options for several monitoring tasks. Refer to Chapter 1 for general information on using the SysMan Menu. The following options are provided under the Monitoring and Tuning menu item:

View Events [event_viewer]: This option invokes the EVM event viewer, which is described in Chapter 13.
Set up Insight Manager [imconfig]: Invokes the interface that enables you to configure Insight Manager and start the Insight Manager daemon. Refer to Chapter 1 for information on configuring Insight Manager.
View Virtual Memory (VM) Statistics [vmstat]: This is a SysMan Menu interface to the vmstat command, described previously in this section.
View Input/Output (I/O) Statistics [iostat]: This is a SysMan Menu interface to the iostat command, described previously in this section.
View Uptime Statistics [uptime]: This is a SysMan Menu interface to the uptime command, described previously in this section.

In addition, the following options are provided under the Support and Services menu item:

Create escalation Report [escalation]: Invokes the escalation report feature of the sys_check utility. The escalation report is used only in conjunction with diagnostic services, and is requested by your technical support organization. Refer to Section 11.2.2 for more information on using the escalation options in sys_check.
Create configuration Report [config_report]: Invokes the system configuration report feature of the sys_check utility. Use this option to create a baseline record of your system configuration and to update the baseline at regular intervals. Note that using this option creates a full default report which can take many minutes to complete and can impact system performance. Refer to Section 11.2.2 for more information on using the sys_check utility.

The SysMan Station provides a graphical view of one or more systems and also enables you to launch applications to perform administrative operations on any component. Refer to Chapter 1 for information on using the SysMan Station.

11.1.2.3 X11-Compliant Graphical Interfaces

The operating system provides System Management folders containing several graphical interfaces that are typically used under the default Common Desktop Environment (CDE) windowing environment. You can invoke these interfaces from the CDE Front Panel by clicking on the Application Manager icon to display the Application Manager folder. From this folder, select the System Admin icon, and then the MonitoringTuning icon. This folder provides icons that invoke the following SysMan Menu items:

Configuration Report: This icon invokes a graphical interface to the system configuration report feature of the sys_check utility.
Escalation Report: This icon invokes a graphical interface to the escalation report feature of the sys_check utility.
Insight Manager: This icon invokes the interface that enables you to configure Insight Manager and start the Insight Manager daemon.

The remaining applications in this folder relate to system tuning. Refer to the System Configuration and Tuning guide for information on tuning using the Process Tuner (a graphical interface to the nice command) and the Kernel Tuner (dxkerneltuner).

The Tools folder provides graphical interfaces to commands such as vmstat. Invoke these interfaces from the CDE Front Panel by clicking on the Application Manager icon to display the Application Manager folder. From this folder, select the System Admin icon, and then the Tools icon. This folder provides the following interfaces:

I/O Statistics: This is a graphical interface to the iostat command, described previously in this section.
Network Statistics: This is a graphical interface to the netstat command. Refer to the Network Administration: Connections guide for information on monitoring your network.
System Messages: This is a graphical interface to the /var/adm/messages log file, which is used to store certain system messages according to the current configuration of system event management. For information on events, the messages they generate, and the message log files, refer to Chapter 12 and Chapter 13.
Virtual Memory Statistics: This is a graphical interface to the vmstat command, described previously in this section.
Who?: This is a graphical interface to the who command, described previously in this section.

The remaining X11-compliant monitoring application is located in the Application Manager - DailyAdmin folder. Click on the System Information (dxsysinfo) icon to launch the interface. This interface provides you with a quick view of the following system resources and data:

A brief description of the number and type of processors (CPUs).

The UNIX operating system version and the amount of available system memory.

Three dials indicating the approximate amount of CPU activity, in-use memory, and in-use virtual memory (swap). This information can also be obtained using commands such as vmstat.

Two warning buttons for files and swap. These buttons are filled with color when a file system is nearly full or if the amount of swap space is too low.

The current available space status of all local and remotely-mounted file systems. You can set a percentage limit here to trigger the warning indicators if available space falls below a certain percentage. Refer to Chapter 6 and Chapter 9 for information on increasing the available file system space.

11.1.2.4 Advanced Monitoring Utilities

The following utilities provide options that enable you to view and record many different operating parameters:

Collect

The collect utility enables you to sample many different kinds of system and process data simultaneously over a predetermined sampling time. You can collect information to data files and play the files back at the terminal.

The collect utility can assist you in diagnosing performance problems and its report output might be requested by your technical support service when they are assisting you in solving system problems. Using the collect utility is described in Section 11.2.1.

The sys_check utility

The sys_check utility is a command-line interface that you use to create a permanent record of the system configuration and the current settings of many system attributes. This utility is described in detail in Section 11.2.2.

The Monitoring Performance History (MPH) Utility

The Monitoring Performance History (MPH) utility is a suite of shell scripts that gathers information on the reliability and availability of the operating system and its hardware environment such as crash data files. This utility is described in detail in Section 11.2.3.

Performance Manager

Performance Manager is an SNMP-based, user-extensible, real-time performance monitoring and management utility. It enables you to detect and correct performance problems on a single system (or a cluster). Performance Manager has a graphical user interface (GUI), and a limited command-line interface using commands such as the getone command to read and display lines of data. The GUI can be configured to display tables and graphs, showing many different system parameters and values, such as CPU performance, physical memory usage, and disk transfers.

Performance Manager comprises two primary components: Performance Manager GUI (pmgr) and Performance Manager daemon (pmgrd). Additional daemons are used in monitoring TruCluster clusters (clstrmond) and the Advanced File System (advsfd), supplied in the AdvFS Utilities subset.

The Performance Manager software subsets are included on the Associated Products, Volume 2 CD-ROM. No license is required to install and use the software. For an overview of features refer to the release notes. The PostScript file is PMGR***_RELNOTES.ps and the text file is PMGR***_RELNOTES.txt. The Performance Manager guide is provided in the Software Documentation CD-ROM.

11.1.3 Related Documentation

The following topics are closely related to system monitoring and testing:

Refer to Chapter 10 for information on administering the system accounting services, which enables you to monitor and record access to resources such as printers.

Refer to Chapter 12 for instructions on configuring and using basic system event logging by using the basic binlogd and syslogd event channels. This chapter also describes how you access system log files, where events and errors are recorded.

Refer to Chapter 13 for information on configuring and using the Event Manager (EVM), which provides sophisticated management of system events, including automated response to certain types of event.

Refer to the Network Administration: Connections guide for information on monitoring the system's networking components.

Refer to the System Configuration and Tuning guide for information on tuning your system in response to information gathered during monitoring and testing.

11.2 Configuring and Using Monitoring Utilities

This section introduces some of the monitoring utilities and describes their setup and use. Refer to the documentation and reference pages supplied with each application for more information. Refer to Chapter 1 for information on configuring and using the SysMan Station to monitor systems that have a graphics environment.

A closely related topic is event management and error logging. Refer to Chapter 12 and Chapter 13 for information on these topics.

11.2.1 Using collect to Record System Data

The /usr/sbin/collect command-line utility collects data that describes the current system status. It enables you to select from many parameters, sort them, and time the data collection period. The data is displayed in real time or recorded to a file for future analysis or playback. The collect utility has a low CPU overhead because you can focus on the exact aspects of system behavior that you need to record; therefore, it should not adversely affect system performance.

The output from the unqualified /usr/sbin/collect command is similar to the output from monitoring commands such as vmstat, iostat, or netstat.

The command synopsis is fully defined in collect(8). Important features provided by the collect utility are:

Controlling the duration of, and rate at which, data is sampled.

Sorting the output according to processor usage.

Extracting a time slice of data from a data record file. For example, if you want to look at certain system parameters during the busiest time of use, you can extract that data from the data file by using the -C option.

Specifying a particular device using its device special file name. For example, the following command specifies that data is collected from the named devices:
```
# collect -sd -Ddsk1,dsk10
```

Specifying a particular subsystem such as the CPU or the network. For example, the following command specifies that data is collected only for the CPUs, and a sample of data is shown:

# collect -e cf
CPU SUMMARY
USER SYS IDLE WAIT INTR SYSC CS RUNQ AVG5 AVG30 AVG60 FORK VFORK
  13  16   71    0  149  492 725   0 0.13 0.05  0.01 0.30 0.00
SINGLE CPU STATISTICS
  CPU USER  SYS IDLE WAIT
    0   13   16   71    0

Recording and preserving a series of data files by using the -H (history) option.

Compressing data files for economical storage.

Specifying specific users, groups, and processes for which data is to be sampled.

Using the -p option, you can specify multiple data files and use the collect utility to play them back as one stream. Using the -f option you can combine multiple binary input files into one binary output file.

The collect utility locks itself into memory by using the page locking function plock(), and cannot be swapped out by the system. It also raises its priority by using the priority function nice(). If required, page locking can be disabled by using the -ol command option and the priority setting can be disabled by using the -on command option. However, using collect should have minimal impact on a system under high load.
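When collect output is saved as text, individual values can be pulled out for comparison with a baseline. The following sketch assumes the CPU SUMMARY layout shown in the sample output earlier in this section (a header row beginning with USER, followed by one row of values). The function name collect_cpu_idle is hypothetical.

```shell
# Sketch: print the IDLE value from the CPU SUMMARY section of
# collect output read on stdin. The column position is taken from
# the header row rather than hardcoded.
collect_cpu_idle() {
    awk '
        /^CPU SUMMARY/ { in_summary = 1; next }
        in_summary && $1 == "USER" {       # header row: locate the IDLE column
            for (i = 1; i <= NF; i++) if ($i == "IDLE") col = i
            next
        }
        in_summary && col { print $col; exit }   # first data row after the header
    '
}

# Possible use:
#   collect -e cf | collect_cpu_idle
```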

11.2.2 Using the sys_check Utility

The sys_check utility provides you with the following:

The ability to establish a baseline of system configuration information, both for software and hardware and record it in an easily accessible HTML report for web browsing. You can update this report regularly or as your system configuration changes.

The opportunity to perform automated checking of many system attributes (such as tuning parameters) and receive feedback on settings that might be more appropriate to the current use of the system.
The sys_check utility also checks and reports recommended maintenance suggestions, such as installing patch kits and maintaining swap space.

The ability to generate a problem escalation report that can be used by your technical support service to diagnose and correct system problems.

In addition to recording the current hardware and software configuration, the sys_check utility produces an extensive dump of system performance parameters. This feature enables you to record many system attribute values, providing a useful baseline of system data. Such a baseline is particularly useful before you undertake major changes or perform troubleshooting procedures.

When you run the sys_check utility it produces an HTML document on standard output. Used with the -escalate flag, the script produces /var/tmp/escalate* output files by default. These files can be forwarded to your technical support organization and used for diagnosing system problems and errors.

Use the following command to obtain a complete list of command options.

# /usr/sbin/sys_check -h

The output produced by the sys_check utility typically varies between 0.5MB and 3MB in size and it can take from 30 minutes to an hour to complete the check. See sys_check(8) for more details of the various command options. You can greatly reduce the run time by excluding items from the run. For example, the sys_check utility runs setld to record the installed software. Excluding the setld operation can greatly reduce the sys_check run duration.
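Because sys_check writes its HTML report to standard output, you normally redirect the output to a file. The following sketch shows one way to keep dated baseline reports. The function name save_report and the file naming scheme are hypothetical; only the sys_check -all invocation comes from this chapter.

```shell
# Sketch: run a report command and save its standard output in a
# dated file named after the host. Prints the file name on success.
#   $1         = directory in which to store the report
#   $2 onward  = the report command and its options
save_report() {
    dir=$1; shift
    out="$dir/$(uname -n)-$(date +%Y%m%d).html"
    "$@" > "$out" && echo "$out"
}

# Possible use on the target system:
#   save_report /var/tmp /usr/sbin/sys_check -all
```

Copy the resulting file to another system if you want the baseline to remain available when the monitored system is down.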

You can also invoke standard sys_check run tasks as follows:

Using CDE, open the Application Manager from the CDE front panel. Select System_Admin and then MonitoringTuning. There are icons for two standard sys_check run tasks, Configuration Report and Escalation Report.

Using the SysMan Menu, expand the Support and Services menu item and choose from the following options:
- Create escalation report
- Create configuration report.
For information on using the SysMan Menu, refer to Chapter 1.

You can run sys_check tasks automatically by enabling an option in the root crontabs file. In the /var/spool/cron/crontabs directory, the root file contains a list of default tasks that are run by cron on a regular basis. Remove the comment character (#) from the beginning of the following line:

#0 3 * * 0 /usr/share/sysman/bin/runsyscheck

When this option is enabled the resulting report is referenced by Insight Manager and can be read from the Insight Manager Configuration Report option. See Chapter 1 for information on using Insight Manager.
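In the crontab entry, the five leading fields are minute, hour, day of month, month, and day of week, so 0 3 * * 0 runs the script at 3:00am every Sunday. The following sketch removes the comment character with sed instead of an editor. The function name enable_runsyscheck and the temporary file names are hypothetical.

```shell
# Sketch: uncomment the runsyscheck entry in crontab text read on stdin.
# Only lines that mention runsyscheck are changed.
enable_runsyscheck() {
    sed '/runsyscheck/s/^#//'
}

# Possible use as root (temporary file names are examples only):
#   crontab -l > /tmp/root.cron
#   enable_runsyscheck < /tmp/root.cron > /tmp/root.cron.new
#   crontab /tmp/root.cron.new
```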

11.2.3 Using the Monitoring Performance History Utility

The Monitoring Performance History (MPH) utility is a suite of shell scripts that gathers information on the reliability and availability of the operating system and its hardware environment such as crash data files. The information is automatically copied to your systems vendor by internet mail or DSN link, if available. Using this data, performance analysis reports are created and distributed to development and support groups. This information is only used internally by your systems vendor to improve the design of reliable and highly available systems.

The MPH run process is automatic, requiring no user intervention. Initial configuration requires approximately 10 minutes of your time. MPH will not impact or degrade your system's performance because it runs as a background task, using negligible CPU resource. The disk space required for the collected data and the application is approximately 300 blocks per system. This could be slightly higher in the case of a high number of errors and is considerably larger for the initial run, when a baseline is established (a one-time event).

The MPH utility operates as follows:

Every 10 minutes it records a timestamp indicating that the system is running.

Daily at 2:00am, it extracts any new event records from the default event log /var/adm/binary.errlog.

Every day at 3:00am it transfers the event and timestamp data and any new crashdc data files in /var/adm/crash to the system vendor. The average transfer is 150 blocks of data.

Before running MPH, review the following information:

The Standard Programmer Commands (Software Development) OSFPGMR400 subset must be installed. Use the setld -i command to verify that the subset is installed.

The MPH software kit is contained in the mandatory base software subset OSFHWBASE400. This subset is installed automatically during the operating system installation. Full documentation is located in /usr/field/mph/unix_installation_guide.ps. A text file is also supplied.

The disk space requirement for the MPH software subset is approximately 100 blocks.

To configure MPH on your system, you must be the root user and principal administrator of the target system. You need to supply your name, telephone number, and e-mail address. Complete the following steps:

Find the serial number (SN) of the target system, which is generally located on the rear of the system box. You need this number to complete the installation script.

Enter the following command to run the MPH script:
```
# /usr/field/mph/MPH_UNIX***.CSH
```
Where *** is the version number, such as 025.

Enter the information requested by the script. When the script is complete, MPH starts automatically.

If the operating system needs to be shut down for any reason, an orderly shutdown process must be followed. Otherwise, you will have to restart the MPH script as described in the MPH documentation. See mph(1) for more information.

11.3 Environmental Monitoring

On any system, thermal levels can increase because of poor ventilation, overheating conditions, or fan failure. If such a condition goes undetected, an unscheduled shutdown could ensue, causing loss of data or damage to the system itself. By using Environmental Monitoring, you can detect the thermal state of AlphaServer systems and alert users in time to recover or perform an orderly shutdown of the system.

The Environmental Monitoring framework consists of four components:

The loadable kernel module and its associated APIs.

The Server System MIB subagent daemon.

The envmond daemon.

The envconfig utility.

These components are described in the following sections.

11.3.1 Loadable Kernel Module

The loadable kernel module and its associated APIs contain the parameters needed to monitor and return status on your system's threshold levels. The kernel module exports server management attributes as described in Section 11.3.1.1 through the kernel configuration manager (CFG) interface only. It works across all platforms that support server management, and provides compatibility for other server management systems under development.

The loadable kernel module does not include platform-specific code (such as the location of status registers). It is transparent to the kernel module which options are supported by a platform. That is, the kernel module and platform are designed to return valid data if an option is supported, a fixed constant for unsupported options, or null.

11.3.1.1 Specifying Loadable Kernel Attributes

The loadable kernel module exports the parameters listed in Table 11-1 to the kernel configuration manager (CFG).

Table 11-1: Parameters Defined in the Kernel Module

Parameter	Purpose
`env_current_temp`	Specifies the current temperature of the system. If a system is configured with the KCRCM module, the temperature returned is in Celsius. If a system does not support temperature readings and a temperature threshold is not exceeded, a value of -1 is returned. If a system does not support temperature readings and a temperature threshold is exceeded, a value of -2 is returned.
`env_high_temp_thresh`	Provides a system-specific operating temperature threshold. The value returned is a hardcoded, platform-specific temperature in Celsius.
`env_fan_status`	Specifies a noncritical fan status. The value returned is a bit value of zero (0). This value will differ when the hardware support is provided for this feature.
`env_ps_status`	Provides the status of the redundant power supply. On platforms that provide interrupts for redundant power supply failures, the corresponding error status bits are read to determine the return value. A value of 1 is returned on error; otherwise, a value of zero (0) is returned.
`env_supported`	Indicates whether or not the platform supports server management and environmental monitoring.
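The return values of env_current_temp described in Table 11-1 can be interpreted as in the following sketch. The function name describe_temp is hypothetical, and the sketch assumes you have already obtained the two attribute values (for example, through the CFG interface); it does not query the kernel itself. Treating "exceeded" as strictly greater than the threshold is also an assumption.

```shell
# Sketch: describe an env_current_temp reading.
#   $1 = env_current_temp value
#   $2 = env_high_temp_thresh value (Celsius)
describe_temp() {
    case $1 in
        -1) echo "no temperature reading; threshold not exceeded" ;;
        -2) echo "no temperature reading; threshold exceeded" ;;
        *)  if [ "$1" -gt "$2" ]; then
                echo "temperature ${1}C exceeds threshold ${2}C"
            else
                echo "temperature ${1}C within threshold ${2}C"
            fi ;;
    esac
}

# Example: describe_temp 30 40
```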

11.3.1.2 Obtaining Platform-Specific Functions

The loadable kernel module must return environmental status based on the platform being queried. To obtain environmental status, the get_info() function is used. Calls to the get_info() function are filtered through the platform_callsw[] table.

The get_info() function obtains dynamic environmental data by using the function types described in Table 11-2.

Table 11-2: `get_info()` Function Types

Function Type	Use of Function
`GET_SYS_TEMP`	Reads the system's internal temperature on platforms that have a KCRCM module configured.
`GET_FAN_STATUS`	Reads fan status from error registers.
`GET_PS_STATUS`	Reads redundant power supply status from error registers.

The get_info() function obtains static data by using the HIGH_TEMP_THRESH function type, which reads the platform-specific upper threshold operational temperature.

11.3.1.3 Server System MIB Subagent

The Server System MIB Agent (which is an eSNMP subagent) is used to export a subset of the Environmental Monitoring parameters specified in the Server System MIB. The Server System MIB exports a common set of hardware-specific parameters across all server platforms, depending on the operating system installed.

Table 11-3 maps the subset of Server System MIB variables that support Environmental Monitoring to the kernel parameters described in Section 11.3.1.1.

Table 11-3: Mapping of Server Subsystem Variables

Server System MIB Variable Name	Kernel Module Parameter
`svrThSensorReading`	`env_current_temp`
`svrThSensorStatus`	`env_current_temp`
`svrThSensorHighThresh`	`env_high_temp_thresh`
`svrPowerSupplyStatus`	`env_ps_status`
`svrFanStatus`	`env_fan_status`

An SNMP MIB compiler and other utilities are used to compile the MIB description into code for a skeletal subagent daemon. Communication between the subagent daemon and the master agent eSNMP daemon, snmpd, is handled by interfaces in the eSNMP shared library (libesnmp.so). The subagent daemon must be started when the system boots and after the eSNMP daemon has started.

For each Server System MIB variable listed in Table 11-3, code is provided in the subagent daemon, which accesses the appropriate parameter from the kernel module through the CFG interface.

11.3.2 Monitoring Environmental Thresholds

To monitor the system environment, the envmond daemon is used. You can customize the daemon by using the envconfig utility. The following sections discuss the daemon and utility. See envmond(8) and envconfig(8) for more information.

11.3.2.1 Environmental Monitoring Daemon

The Environmental Monitoring daemon, envmond, checks threshold levels so that corrective action can be taken before damage occurs to your system. The envmond daemon performs the following tasks:

Queries the system for threshold levels.

When the cooling fan on an AlphaServer 1000A fails, the kernel logs the error, synchronizes the disks, then powers down the system. On all other fan failures, a hard shutdown ensues.

Notifies users when a high temperature threshold condition is resolved.

Notifies all users that an orderly shutdown is in progress if recovery is not possible.

To query the system, the envmond daemon uses the base operating system command /usr/sbin/snmp_request to obtain the current values of the environment variables specified in the Server System MIB.

To enable Environmental Monitoring, the envmond daemon must be started during the system boot, but after the eSNMP and Server System MIB agents are started. You can customize the envmond daemon by using the envconfig utility.

11.3.2.2 Customizing the envmond Daemon

You can use the envconfig utility to customize how the environment is queried by the envmond daemon. These customizations are stored in the /etc/rc.config file, which is read by the envmond daemon during startup. Use the envconfig utility to perform the following tasks:

Turn environmental monitoring on or off during the system boot.

Start or stop the envmond daemon after the system boot.

Specify the interval between queries of the system by the envmond daemon.

Set the highest threshold level that can be encountered before a temperature event is signaled by the envmond daemon.

Specify the path of a user-defined script that you want the envmond daemon to execute when a high threshold level is encountered.

Specify the grace period allotted to save data if a shutdown message is broadcast.

Display the values of the Environmental Monitoring variables.
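
The tasks in this list might translate into a sequence of envconfig commands such as the one sketched below. The sketch only prints the commands; it does not run them. The -c, -q, start, and stop forms and the ENVMON_ variable names, along with the units (seconds and minutes), are assumptions here and must be verified against envconfig(8). The function name envconfig_plan and the script path are hypothetical:

```shell
#!/bin/sh
# Dry-run sketch: print the envconfig commands for a monitoring setup
# instead of running them. Option and variable names are assumptions;
# verify them against envconfig(8) before use.
envconfig_plan() {
    period_sec=$1      # assumed: seconds between envmond queries
    grace_min=$2       # assumed: minutes allowed to save data before shutdown
    user_script=$3     # script to run when a high threshold is encountered
    echo "envconfig -c ENVMON_CONFIGURED=1"
    echo "envconfig -c ENVMON_MONITOR_PERIOD=$period_sec"
    echo "envconfig -c ENVMON_GRACE_PERIOD=$grace_min"
    echo "envconfig -c ENVMON_USER_SCRIPT=$user_script"
    echo "envconfig start"
    echo "envconfig -q"
}

# Hypothetical values and script path, for illustration only.
envconfig_plan 120 10 /usr/local/sbin/temp_alert.sh
```

Printing the commands first lets you review the settings before they are written to the /etc/rc.config file.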

11.3.3 User-Definable Messages

Messages broadcast or logged by the Environmental Monitoring utility can be modified. The messages are located in the following file:

/usr/share/sysman/envmon/EnvMon_UserDefinable_Msg.tcl

You must be root to edit this file and you can edit any message text included in braces ({}). The instructions for editing each section of the file are included in the comment fields, preceded by the # symbol.

For example, the following message provides samples of possible causes for the high temperature condition:

set EnvMon_Ovstr(ENVMON_SHUTDOWN_1_MSG) {System has reached a \
high temperature condition. Possible problem source: Clogged \
air filter or high ambient room temperature.}

You could modify this message text as follows:

set EnvMon_Ovstr(ENVMON_SHUTDOWN_1_MSG) {System \
has reached a high temperature condition. Check the air \
conditioning unit}

Note that you must not alter any data in this file other than the text strings between the braces ({}).

11.4 Using System Exercisers

The operating system provides a set of exercisers that you can use to troubleshoot your system. The exercisers test specific areas of your system, such as file systems or system memory. The following sections provide information on the system exercisers:

Running the system exercisers (Section 11.4.1)

Using exerciser diagnostics (Section 11.4.2)

Exercising file systems by using the fsx command (Section 11.4.3)

Exercising system memory by using the memx command (Section 11.4.4)

Exercising shared memory by using the shmx command (Section 11.4.5)

Exercising disk drives by using the diskx command (Section 11.4.6)

Exercising tape drives by using the tapex command (Section 11.4.7)

Exercising communications systems by using the cmx command (Section 11.4.8)

In addition to the exercisers documented in this chapter, your system might also support the DEC Verifier and Exerciser Tool (VET), which provides a similar set of exercisers. Refer to the documentation that came with your latest firmware CD-ROM for information on VET.

11.4.1 Running System Exercisers

To run a system exerciser, you must be logged in as superuser and /usr/field must be your current directory.
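
A short check of these two prerequisites might look like the following sketch. The function name exerciser_ready is hypothetical; it takes the user ID and directory as arguments so that the logic can be tried on any system:

```shell
#!/bin/sh
# Return 0 when the given user ID and directory satisfy the two
# prerequisites for running an exerciser; print the reason otherwise.
# The function name is hypothetical.
exerciser_ready() {
    uid=$1
    dir=$2
    if [ "$uid" -ne 0 ]; then
        echo "not superuser (uid $uid)"
        return 1
    fi
    if [ "$dir" != "/usr/field" ]; then
        echo "current directory is $dir, not /usr/field"
        return 1
    fi
    echo "ready"
}

# Typical call: pass the real user ID and working directory.
exerciser_ready "$(id -u)" "$(pwd)" || true
```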

The commands that invoke the system exercisers provide an option for specifying a file where diagnostic output is saved when the exerciser completes its task.

Most of the exerciser commands have an online help option that displays a description of how to use that exerciser. To access online help, use the -h option with a command. For example, to access help for the diskx exerciser, use the following command:

# diskx -h

You can run the exercisers in the foreground or the background and can cancel them at any time by pressing [Ctrl/c] in the foreground. You can run more than one exerciser at the same time; keep in mind, however, that the more processes you have running, the slower the system performs. Thus, before exercising the system extensively, make sure that no other users are on the system.

There are some restrictions when you run a system exerciser over an NFS link or on a diskless system. For exercisers such as fsx that need to write to a file system, the target file system must be writable by root. Also, the directory from which an exerciser is executed must be writable by root because temporary files are written to the directory.

These restrictions can be difficult to adhere to because NFS file systems are often mounted in a way that prevents root from writing to them. You can overcome some of these problems by copying the exerciser into another directory and running it from the new directory.

11.4.2 Using Exerciser Diagnostics

When an exerciser is halted (either by pressing [Ctrl/c] or by timing out), diagnostics are displayed and are stored in the exerciser's most recent log file. The diagnostics inform you of the test results.

Each time an exerciser is invoked, a new log file is created in the /usr/field directory. For example, when you execute the fsx command for the first time, a log file named #LOG_FSX_01 is created. The log files contain records of each exerciser's results and consist of the starting and stopping times, and error and statistical information. The starting and stopping times are also logged into the default /var/adm/binary.errlog system error log file. This file also contains information on errors reported by the device drivers or by the system.

The log files provide a record of the diagnostics. However, after reading a log file, delete it because an exerciser can have only nine log files. If you attempt to run an exerciser that has accumulated nine log files, the exerciser tells you to remove some of the old log files so that it can create a new one.
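
To see how close an exerciser is to the nine-file limit, you can count its log files before starting a run. The following sketch assumes that every exerciser names its logs in the same way as the #LOG_FSX_01 example, that is, #LOG_ followed by the exerciser name and a sequence number. The function names are hypothetical:

```shell
#!/bin/sh
# Count the log files an exerciser has left in a directory.
# Assumes logs are named "#LOG_<NAME>_<nn>", as in #LOG_FSX_01.
count_exerciser_logs() {
    name=$1                 # exerciser tag, for example FSX
    dir=${2:-/usr/field}    # directory that holds the log files
    count=0
    for f in "$dir"/"#LOG_${name}_"*; do
        # An unmatched pattern stays literal, so test that the file exists.
        if [ -e "$f" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Succeed when the nine-file limit has been reached.
logs_full() {
    [ "$(count_exerciser_logs "$1" "$2")" -ge 9 ]
}

if logs_full FSX; then
    echo "remove old fsx log files before the next run"
fi
```

The pattern is quoted because the # character would otherwise start a comment when it begins a word in the shell.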

If an exerciser finds errors, you can determine which device or area of the system has the difficulty by looking at the /var/adm/binary.errlog file, using either the dia command (preferred) or the uerf command. For information on the error logger, see Section 12.1. For the meanings of the error numbers and signal numbers, see intro(2) and sigvec(2).

11.4.3 Exercising a File System

Use the fsx command to exercise the local file systems. The fsx command exercises the specified local file system by initiating multiple processes, each of which creates, writes, closes, opens, reads, validates, and unlinks a test file of random data.

Note

Do not test NFS file systems with the fsx command.

The fsx command has the following syntax:

fsx [-fpath] [-h] [-ofile] [-pnum] [-tmin]

Refer to fsx(8) for a description of the command options.

The following example of the fsx command tests the /usr file system with five fsxr processes running for 60 minutes in the background:

# fsx -p5 -f/usr -t60 &

11.4.4 Exercising System Memory

Use the memx command to exercise the system memory. The memx command exercises the system memory by initiating multiple processes. By default, the size of each process is defined as the total system memory in bytes divided by 20. The minimum allowable number of bytes per process is 4095. The memx command runs 1s and 0s, 0s and 1s, and random data patterns in the allocated memory being tested.

The files that you need to run the memx exerciser include the following:

memx

memxr

The memx command is restricted by the amount of available swap space. The size of the swap space and the available internal memory determine how many processes can run simultaneously on your system. For example, if there are 16 MB of swap space and 16 MB of memory, all of the swap space is used if all 20 initiated processes (the default) run simultaneously. This would prevent execution of other processes. Therefore, on systems with large amounts of memory and small amounts of swap space, you must use the -p or -m option, or both, to restrict the number of memx processes or to restrict the size of the memory being tested.
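
The arithmetic behind this example can be sketched as follows. The estimate is deliberately rough: it assumes that each process consumes swap space equal to its size and ignores all other swap consumers. The function names are hypothetical:

```shell
#!/bin/sh
# Default size of each memx process: total memory in bytes divided by 20.
memx_default_size() {
    echo $(( $1 / 20 ))
}

# Rough estimate of how many processes of a given size fit in swap space.
# Simplifying assumption: each process uses swap equal to its size.
memx_max_procs() {
    swap_bytes=$1
    proc_bytes=$2
    echo $(( swap_bytes / proc_bytes ))
}

mem=$((16 * 1024 * 1024))     # 16 MB of memory
swap=$((16 * 1024 * 1024))    # 16 MB of swap space
size=$(memx_default_size "$mem")
echo "bytes per process: $size"
echo "processes that fit in swap: $(memx_max_procs "$swap" "$size")"
```

With 16 MB of memory and 16 MB of swap space, the sketch reports 838860 bytes per process and room for 20 processes, so the default run leaves no swap space for other work. Passing a smaller size (the -m option) or starting fewer processes (the -p option) restores the headroom.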

The memx command has the following syntax:

memx [-s] [-h] [-msize] [-ofile] [-pnum] [-tmin]

See memx(8) for a description of the command options.

The following example of the memx command initiates five memxr processes that test 4095 bytes of memory and run in the background for 60 minutes:

# memx -m4095 -p5 -t60 &

11.4.5 Exercising Shared Memory

Use the shmx command to exercise the shared memory segments. The shmx command spawns a background process called shmxb. The shmx command writes and reads the shmxb data in the segments, and the shmxb process writes and reads the shmx data in the segments.

Using shmx, you can test the number and the size of memory segments and shmxb processes. The shmx exerciser runs until the process is killed or until the time specified by the -t option is exhausted.

You automatically invoke the shmx exerciser when you start the memx exerciser, unless you specify the memx command with the -s option. You can also invoke the shmx exerciser manually. The shmx command has the following syntax:

/usr/field/shmx [-h] [-ofile] [-v] [-ttime] [-msize] [-sn]

See shmx(8) for a description of the command options.

The following example tests the default number of memory segments, each with a default segment size:

# shmx &

The following example runs three memory segments of 100,000 bytes for 180 minutes:

# shmx -t180 -m100000 -s3 &

11.4.6 Exercising a Disk Drive

Use the diskx command to exercise the disk drives. The main areas that are tested include the following:

Reads, writes, and seeks

Performance

Disktab entry verification

Caution

Some of the tests involve writing to the disk; for this reason, use the exerciser cautiously on disks that contain useful data that the exerciser could overwrite. Tests that write to the disk first check for the existence of file systems on the test partitions and partitions that overlap the test partitions. If a file system is found on these partitions, you are prompted to determine whether the test continues.

You can use the diskx command options to specify the tests that you want performed and to specify the parameters for the tests.

The diskx command has the following syntax:

diskx [options] [parameters] -f devname

See diskx(8) for a description of the options.

The -f devname option specifies the device special file on which to perform testing. The devname variable specifies the name of the block or character special file that represents the disk to be tested, such as /dev/disk/dsk1h. The last character of the file name can specify the disk partition to test.

If a partition is not specified, all partitions are tested. For example, if the devname variable is /dev/disk/dsk0, all partitions are tested. If the devname variable is /dev/disk/dsk0a, the a partition is tested. This parameter must be specified and can be used with all test options.
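
The naming rule can be illustrated with a small helper that reports which partitions a given devname selects. This is a sketch only; the function name diskx_partition is hypothetical, and it assumes that partition letters run from a through h:

```shell
#!/bin/sh
# Report which partition a device special file name selects.
# Assumes partition letters a through h; the function name is hypothetical.
diskx_partition() {
    case "$1" in
        *[0-9][a-h])
            prefix=${1%?}            # name without its last character
            echo "${1#"$prefix"}"    # the last character: the partition letter
            ;;
        *)
            echo all                 # no partition letter: all partitions
            ;;
    esac
}

for dev in /dev/disk/dsk0 /dev/disk/dsk0a /dev/disk/dsk1h; do
    echo "$dev: $(diskx_partition "$dev")"
done
```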

The following example performs read-only testing on the character device special file that /dev/rdisk/dsk0 represents. Because a partition is not specified, the test reads from all partitions. The default range of transfer sizes is used. Output from the exerciser program is displayed on the terminal screen:

# diskx -f /dev/rdisk/dsk0 -r

The following example runs on the a partition of /dev/disk/dsk0, and program output is logged to the diskx.out file. The program output level is set to 10 and causes additional output to be generated:

# diskx -f /dev/disk/dsk0a -o diskx.out -d -debug 10

The following example runs performance tests on the a partition of /dev/disk/dsk0, and program output is logged to the diskx.out file. The -S option causes sequential transfers for the best test results. Testing is done over the default range of transfer sizes:

# diskx -f /dev/disk/dsk0a -o diskx.out -p -S

The following command runs the read test on all partitions of the specified disks. The disk exerciser is invoked as three separate processes, which generate extensive system I/O activity. The command shown in this example can be used to test system stress:

# diskx -f /dev/rdisk/dsk0 -r & diskx -f /dev/rdisk/dsk1 -r & diskx -f /dev/rdisk/dsk2 -r &
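
For more than a few disks, the same stress test might be written as a loop that starts one exerciser for each disk and then waits for all of them to finish. In the following sketch, the DISKX variable and the stress_read function are hypothetical. By default the sketch is a dry run that only prints each command; set DISKX to diskx to run the exerciser itself:

```shell
#!/bin/sh
# Start one read-only diskx process per disk in the background, then wait
# for all of them. DISKX defaults to a dry run that only prints the command.
DISKX=${DISKX:-echo diskx}

stress_read() {
    for disk in "$@"; do
        # $DISKX is left unquoted so that "echo diskx" splits into two words.
        $DISKX -f "/dev/rdisk/$disk" -r &
    done
    wait    # block until every background process has exited
}

stress_read dsk0 dsk1 dsk2
```

Because the processes run concurrently, the order of their output is not guaranteed.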

11.4.7 Exercising a Tape Drive

Use the tapex command to exercise a tape drive. The tapex command writes, reads, and validates random data on a tape device from the beginning-of-tape (BOT) to the end-of-tape (EOT). The tapex command also performs positioning tests for records and files, and tape transportability tests.

Some tapex options perform specific tests (for example, an end-of-media (EOM) test). Other options modify the tests, for example, by enabling caching.

The tapex command has the following syntax:

tapex [options] [parameters]

See tapex(8) for a description of the command options.

The following example runs an extensive series of tests on tape device /dev/tape/tape0_d0 and sends all output to the tapex.out file:

# tapex -f /dev/tape/tape0_d0 -E -o tapex.out

The following example performs random record size tests and outputs information in verbose mode. This test runs on the default tape device /dev/tape/tape0_d0, and the output is sent to the terminal screen:

# tapex -g -v

The following example performs read and write record testing using record sizes in the range 10 K to 20 K. This test runs on the default tape device /dev/tape/tape0_d0, and the output is sent to the terminal screen:

# tapex -r -min_rs 10k -max_rs 20k

The following example performs a series of tests on tape device /dev/tape/tape0_d0, which is treated as a fixed block device in which record sizes for tests are multiples of the blocking factor 512 KB. The append-to-media test is not performed:

# tapex -f /dev/tape/tape0_d0 -fixed 512 -no_overwrite

11.4.8 Exercising the Terminal Communication System

Use the cmx command to exercise the terminal communications system. The cmx command writes, reads, and validates random data and packet lengths on the specified communications lines.

The lines you exercise must have a loopback connector attached to the distribution panel or the cable. Also, each line must be disabled in the /etc/inittab file and must be a nonmodem line; that is, the CLOCAL option must be set to on. Otherwise, the cmx command repeatedly displays error messages on the terminal screen until its time expires or until you press [Ctrl/c].

You cannot test pseudodevice lines or lta device lines. Pseudodevices have p, q, r, s, t, u, v, w, x, y, or z as the first character after tty, for example, ttyp3.
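
The naming rule for pseudodevices can be checked before you pass a line to the cmx command. The following sketch covers only the pseudodevice rule; it does not detect lta device lines. The function name is_pseudo_line is hypothetical:

```shell
#!/bin/sh
# Succeed when a line name denotes a pseudodevice, that is, when the first
# character after "tty" is p through z. The function name is hypothetical.
is_pseudo_line() {
    case "$1" in
        tty[p-z]*) return 0 ;;
        *)         return 1 ;;
    esac
}

for line in tty22 ttyp3 tty34; do
    if is_pseudo_line "$line"; then
        echo "$line: pseudodevice, cannot be tested"
    else
        echo "$line: not a pseudodevice"
    fi
done
```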

The cmx command has the following syntax:

/usr/field/cmx [-h] [-o file] [-t min] [-l line]

See cmx(8) for a description of the command options.

The following example exercises communication lines tty22 and tty34 for 45 minutes in the background:

# cmx -l 22 34 -t45 &

The following example exercises lines tty00 through tty07 until you press [Ctrl/c]:

# cmx -l 00-07