The following topics are covered in this chapter:
An overview of the basic monitoring guidelines and utilities, and pointers to related topics (Section 11.1)
A detailed discussion of some of these monitoring utilities (Section 11.2)
A discussion of environmental monitoring, which monitors aspects of system hardware status such as the temperature and whether the cooling fan is working; this feature depends on the hardware containing sensors that support such monitoring, and not all systems support it (Section 11.3)
A discussion of the use of the system component test utilities; your system hardware also provides its own test routines; see the Owner's Manual for more information (Section 11.4)
If you need to obtain detailed information on the characteristics of system devices (such as disks and tapes), see the hwmgr command, documented in the Hardware Management manual.
11.1 Overview of Monitoring and Testing
System monitoring involves the use of basic commands and optional utilities to obtain baselines of operating parameters, such as the CPU workload or I/O throughput. Use these baselines to monitor, record, and compare ongoing system activity and ensure that the system does not deviate too far from your operational requirements.
Monitoring the system also enables you to predict and prevent problems that may make the system or its peripherals unavailable to users. Information from monitoring utilities enables you to react quickly to unexpected events such as system panics and disk crashes so that you can resolve problems quickly and bring the system back on line.
The topic of monitoring is related closely to your technical support needs. Some of the utilities described in this chapter have a dual function. Apart from realtime system monitoring, they also collect historical and event-specific data that is used by your technical support representative. This data can be critical for getting your system up and running quickly after a fault in the operating system or hardware. Therefore, it is recommended that you follow the monitoring guidelines in Section 11.1.1 at the very least.
Testing involves the use of commands and utilities to exercise parts of the system or peripheral devices such as disks. The available test utilities are documented in this chapter. Your system hardware also provides test utilities that you run at the console prompt. See your Owner's manual for information on hardware test commands.
Section 11.1.1 provides general guidelines for monitoring your system, and Section 11.1.2 gives a brief overview of all the utilities that the operating system provides.
11.1.1 Guidelines for Monitoring Systems
Use the following procedure after you configure your system exactly as required for its intended operation:
Choose the utilities to monitor your system on a daily basis.
Review the overview of monitoring utilities described in Section 11.1.2. Based on the system configuration, select utilities that satisfy the requirements of the configuration and your monitoring needs. For example, if you have a graphics terminal and you want to monitor several distributed systems, consider setting up the SysMan Station. If you want to monitor a single local server, the dxsysinfo window may be adequate.
If applicable, set any attributes that trigger warnings and messages. For example, you may choose to set a limit of 85% full on all file systems to prevent loss of data because of a full device.
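For example, the following shell script fragment (the 85 percent limit and the warning format are illustrative, not defaults) reports any mounted file system that exceeds the limit and can be scheduled with the cron utility:
#!/bin/sh
# Report any mounted file system that is more than 85% full.
# The threshold and message text are illustrative; adjust to your policy.
df -k | awk 'NR > 1 && $5 + 0 > 85 {
        printf "WARNING: %s is %s full\n", $6, $5
}'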
Note
Many optional subsystems provide their own monitoring utilities. Familiarize yourself with these interfaces and decide whether they are more appropriate than the generic utilities.
Establish a baseline.
Run the sys_check utility with the -all option:
To establish a no-load baseline.
To determine whether any system attributes should be tuned.
If necessary, use the information from the sys_check utility to tune system attributes. See the System Configuration and Tuning manual for information on tuning your system.
Store the baseline data where it can be accessed easily later, such as on another system. Also, print a copy of the report.
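A minimal way to capture the no-load baseline as an HTML file is the following command (the report location is illustrative; any convenient directory will do):
# /usr/sbin/sys_check -all > /var/adm/baseline/sys_check_noload.html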
Run the sys_check utility under load.
At an appropriate time, run the sys_check utility when the system is under a reasonable workload. Choose only those options, such as -perf, that you want to monitor. This may have a small effect on system performance, so you may not want to run it during peak end-user demand.
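For example, the following command (the output file name is illustrative) records performance data while the system is under a typical load:
# /usr/sbin/sys_check -perf > /var/adm/baseline/sys_check_load.html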
Analyze the output from the sys_check utility and perform any additional recommended changes that satisfy your operational requirements. This may involve further tuning of system attributes or configuration changes, such as the reallocation of system resources using a utility like the Class Scheduler. See Section 11.2.2 for information on using the sys_check utility.
Set up the Event Manager.
Configure the event management logging and reporting strategy for the system in conjunction with whatever monitoring strategy you employ. See Chapter 13 and Chapter 12 for information on how to configure the Event Manager.
Configure monitoring utilities.
Set up any other monitoring utilities that you want to use. For example:
Configure the sys_check utility to run regularly during off-peak hours by using the runsyscheck script with the cron utility, as described in Section 11.2.2. In the event of a system problem, the regularly updated report is useful when analyzing and troubleshooting the problem.
Note
Crash dump data may be required for diagnosing system problems. See Chapter 14 for information on configuring the crash dump environment.
Install and configure any optional performance utilities. If supported by the target system, configure environmental monitoring, as described in Section 11.3.
11.1.2 Summary of Commands and Utilities
The operating system provides a number of monitoring commands and utilities. Some commands return a simple snapshot of system data in numerical format, while others have many options for selecting and filtering information. Also provided are complex graphical user interfaces that filter and track system data in real time and display it on a graphics terminal.
Choose monitoring utilities that suit your local environment and monitoring needs and consider the following:
Using monitoring utilities can affect system performance.
To help diagnose problems in performance, such as I/O bottlenecks, a simple command, like iostat, may be adequate.
To provide a quick visual examination of resources on a single-user system, the X11 System Information interface (dxsysinfo) may be adequate.
Some utilities are restricted to the root user while others are accessible by all system users.
For enterprise-wide monitoring, the SysMan Station can display the health of many systems simultaneously on a single screen.
To track assets across an enterprise or verify what options are installed in what systems (and verify whether they are functioning correctly), the web-based HP Insight Manager utility can be used for both UNIX servers and client PC systems.
You may need to provide output from a monitoring utility to your technical support site during problem diagnosis. It reduces your system downtime greatly if you take a system baseline and establish a routine monitoring and data collection schedule before any problems occur.
The following sections describe the monitoring utilities.
11.1.2.1 Command Line Utilities
Use the following commands to display a snapshot of various system statistics:
vmstat
The vmstat command displays system statistics for virtual memory, processes, trap, and CPU activity. An example of vmstat output is:
bigrig> vmstat
Virtual Memory Statistics: (pagesize = 8192)
  procs      memory          pages                               intr         cpu
 r  w  u   act  free  wire  fault  cow  zero react  pin  pout   in  sy   cs  us sy id
 2 97 20  8821   50K  4434   653K 231K  166K  1149 142K     0   76 250  194   1  1 98
See vmstat(1) for more information.
iostat
The iostat command reports input and output information for terminals and disks, and the percentage of time the CPU has spent performing various operations. An example of iostat output is:
bigrig> iostat
tty floppy0 dsk0 cpu
tin tout bps tps bps tps us ni sy id
0 1 0 0 3 0 0 0 1 98
See iostat(1) for more information.
who
The who command reports information about users and processes on the local system. An example of who output is:
bigrig> who
root     console      Jan  3 09:55
root     :0           Jan  3 09:55
root     pts/1        Jan  3 09:55
bender   pts/2        Jan  3 14:59
root     pts/3        Jan  3 15:43
There is a similar command, users, that displays a compact list of the users logged in.
uptime
The uptime command reports how long the system has been running:
bigrig> uptime
16:20  up 167 days, 14:33,  4 users,  load average: 0.23, 0.24, 0.24
See uptime(1) for more information.
There is a similar command, w, that displays the same information as uptime, but also displays information for the users logged in. See w(1) for more information.
netstat
The netstat command displays network-related statistics in various formats. See netstat(1) and the Network Administration: Connections manual for information on monitoring your network.
11.1.2.2 SysMan Menu Monitoring and Tuning Tasks
The SysMan Menu provides options for several monitoring tasks. See Chapter 1 for general information on using the SysMan Menu. The following options are provided under the Monitoring and Tuning menu item:
This option invokes the EVM event viewer, which is described in Chapter 13.
This option invokes the interface that enables you to configure HP Insight Manager and start the HP Insight Manager daemon. See Chapter 1 for information on configuring HP Insight Manager.
This is a SysMan Menu interface to the vmstat command, described in Section 11.1.2.1.
This is a SysMan Menu interface to the iostat command, described in Section 11.1.2.1.
This is a SysMan Menu interface to the uptime command, described in Section 11.1.2.1.
In addition, the following options are provided under the Support and Services menu item:
This option invokes the escalation report feature of the sys_check utility. The escalation report is used only in conjunction with diagnostic services, and is requested by your technical support organization. See Section 11.2.2 for more information on using the escalation options in the sys_check utility.
This option invokes the system configuration report feature of the sys_check utility. Use this option to create a baseline record of your system configuration and to update the baseline at regular intervals. Using this option creates a full default report, which can take many minutes to complete and can affect system performance. See Section 11.2.2 for more information on using the sys_check utility.
11.1.2.3 SysMan Station
The SysMan Station provides a graphical view of one or more systems and also enables you to launch applications to perform administrative operations on any component. See Chapter 1 for information on using the SysMan Station.
11.1.2.4 X11-Compliant Graphical User Interfaces
The operating system provides several graphical user interfaces (GUIs) that are used typically under the default Common Desktop Environment (CDE) windowing environment; they are located in the System Management folders.
You can invoke these interfaces from the SysMan Applications panel on the CDE Front Panel; Figure 1-5 shows the SysMan Applications panel. There are icons that link to the Monitoring/Tuning folder, the Tools folder, and the Daily Admin folder.
Monitoring/Tuning Folder
This folder provides icons that invoke the following SysMan Menu items:
This icon invokes a graphical user interface to the system configuration report feature of the sys_check utility.
This icon invokes a graphical user interface to the escalation report feature of the sys_check utility.
This icon invokes the interface that enables you to configure HP Insight Manager and start the HP Insight Manager daemon.
The remaining applications in this folder relate to system tuning. See the System Configuration and Tuning manual for information on tuning using the Process Tuner (a graphical user interface to the nice command) and the Kernel Tuner (dxkerneltuner).
Tools Folder
The Tools folder provides graphical user interfaces to commands such as vmstat. Invoke these interfaces from the CDE Front Panel by selecting the Application Manager icon to display the Application Manager folder. From this folder, select the System Admin icon, and then the Tools icon. This folder provides the following interfaces:
This is a graphical user interface to the iostat command, described previously in Section 11.1.2.1.
This is a graphical user interface to the netstat command. See the Network Administration: Connections manual for information on monitoring your network.
This is a graphical user interface to the /var/adm/messages log file, which stores certain system messages according to the current configuration of system event management. For information on events, the messages they generate, and the message log files, see Chapter 12 and Chapter 13.
This is a graphical user interface to the vmstat command, described in Section 11.1.2.1.
This is a graphical user interface to the who command, described in Section 11.1.2.1.
Daily Admin Folder
The remaining X11-compliant monitoring application is System Information, which is located in the Application Manager - DailyAdmin folder.
Select the System Information (dxsysinfo) icon to launch the interface. This interface provides you with a quick view of the following system resources and data:
A brief description of the number and type of processors (CPUs).
The UNIX operating system version and the amount of available system memory.
Three dials indicating the approximate amount of CPU activity, in-use memory, and in-use virtual memory (swap). This information can also be obtained by using commands such as vmstat.
Two warning indicators for files and swap. These indicators change color when a file system is nearly full or if the amount of swap space is too low.
The current available space status of all local and remotely-mounted file systems. You can set a percentage limit here to trigger the warning indicators if available space falls below a certain percentage. See Chapter 6 and Chapter 9 for information on increasing the available file system space.
11.1.2.5 Advanced Monitoring Utilities
The following utilities provide options that enable you to view and record many different operating parameters:
The collect utility enables you to sample many different kinds of system and process data simultaneously over a predetermined sampling time. You can collect information to data files and play the files back at the terminal. The collect utility can assist you in diagnosing performance problems, and its report output may be requested by your technical support service when they are assisting you in solving system problems. Using the collect utility is described in Section 11.2.1.
See the Collect User's Guide, which is included with the installation kit, for more information.
sys_check utility
The sys_check utility is a command line interface that you use to create a permanent record of the system configuration and the current settings of many system attributes. This utility is described in detail in Section 11.2.2.
The Monitoring Performance History (MPH) utility is a suite of shell scripts that gathers information on the reliability and availability of the operating system and its hardware environment such as crash data files. This utility is described in detail in Section 11.2.3.
The following topics in this manual are related closely to system monitoring and testing:
See Chapter 10 for information on administering the system accounting services, which enable you to monitor and record access to resources such as printers.
See Chapter 12 for instructions on configuring and using basic system event logging with the binlogd and syslogd event channels. This chapter also describes how you access system log files, where events and errors are recorded.
See Chapter 13 for information on configuring and using the Event Manager (EVM), which provides sophisticated management of system events, including automated response to certain types of event.
These manuals also provide additional information related to system monitoring and testing:
See the System Configuration and Tuning manual for information on tuning your system in response to information gathered during monitoring and testing.
See the Network Administration: Connections manual for information on monitoring the system's networking components.
11.2 Configuring and Using Monitoring Utilities
The following sections introduce some of the monitoring utilities and describe their setup and use. See the documentation and reference pages supplied with each application for more information. See Chapter 1 for information on configuring and using the SysMan Station to monitor systems that have a graphics environment.
A closely related topic is event management and error logging.
See Chapter 12 and Chapter 13 for information on these topics.
11.2.1 Using the collect Utility to Record System Data
The /usr/sbin/collect command line utility collects data that describes the current system status. It enables you to select from many parameters, sort them, and time the data collection period. The data is displayed in real time or recorded to a file for future analysis or playback. Using the collect utility has a low CPU overhead because you can focus on the exact aspects of system behavior that you need to record; therefore, it should not adversely affect system performance.
The output from the unqualified /usr/sbin/collect command is similar to the output from monitoring commands such as vmstat, iostat, or netstat. The command synopsis is defined fully in collect(8). The main features of the collect utility are:
Controlling the duration of, and the rate at which, data is sampled.
Sorting the output according to processor usage.
Extracting a time slice of data from a data record file. For example, if you want to look at certain system parameters during the busiest time of use, you can extract that data from the data file by using the -C option.
Specifying a particular device using its device special file name. For example, the following command specifies that data is collected from the named devices:
# collect -sd -Ddsk1,dsk10
Specifying a particular subsystem such as the CPU or the network. For example, the following command specifies that data is collected only for the CPUs, and a sample of data is shown:
# collect -e cf
CPU SUMMARY
USER SYS IDLE WAIT INTR SYSC CS RUNQ AVG5 AVG30 AVG60 FORK VFORK
13 16 71 0 149 492 725 0 0.13 0.05 0.01 0.30 0.00
SINGLE CPU STATISTICS
CPU USER SYS IDLE WAIT
0 13 16 71 0
Recording and preserving a series of data files by using the -H (history) option.
Compressing data files for economical storage.
Specifying specific users, groups, and processes for which data is to be sampled.
Playing back data files with the -p option. The -f option lets you combine multiple binary input files into one binary output file.
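For example, assuming that you previously recorded data to a file named collect.dat (an illustrative name), the following command plays the file back at the terminal:
# /usr/sbin/collect -p collect.dat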
The collect utility locks itself into memory by using the page-locking function plock(), and cannot be swapped out by the system. It also raises its priority by using the priority function nice(). If required, page locking can be disabled by using the -ol command option, and the priority setting can be disabled by using the -on command option. However, using the collect utility should have minimal effect on a system under high load.
11.2.2 Using the sys_check Utility
The sys_check utility provides you with the following:
The ability to establish a baseline of system configuration information, for both software and hardware, and record it in an easily accessible HTML report for web browsing. You can re-create this report regularly or as your system configuration changes.
The opportunity to perform automated examination of many system attributes (such as tuning parameters) and receive feedback on settings that may be more appropriate to the current use of the system.
The sys_check utility also examines the system and reports recommended maintenance suggestions, such as installing patch kits and maintaining swap space.
The ability to generate a problem escalation report that can be used by your technical support service to diagnose and correct system problems.
In addition to recording the current hardware and software configuration, the sys_check utility produces an extensive dump of system performance parameters. This feature enables you to record many system attribute values, providing a useful baseline of system data. Such a baseline is particularly useful before you undertake major changes or perform troubleshooting procedures.
When you run the sys_check utility, it produces an HTML document on standard output. Used with the -escalate flag, the script produces $TMPDIR/escalate* output files by default, where the environment variable $TMPDIR determines the temporary directory. These files can be forwarded to your technical support organization and used for diagnosing system problems and errors.
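For example, the following command generates the escalation files that your technical support organization may request:
# /usr/sbin/sys_check -escalate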
Use the following command to obtain a complete list of command options.
# /usr/sbin/sys_check -help
The output produced by the sys_check utility typically varies between 0.5 MB and 3 MB in size, and the examination can take from 30 minutes to an hour to complete. See sys_check(8) for more information. The sys_check utility runs setld to record the installed software; excluding the setld operation can greatly reduce the sys_check run duration.
You can invoke standard sys_check run tasks as follows:
Using CDE, open the Application Manager from the CDE Front Panel. Select System_Admin and then MonitoringTuning. There are icons for two standard sys_check run tasks, Configuration Report and Escalation Report.
Using the SysMan Menu, expand the Support and Services menu item and choose from the following options:
Create escalation report
Create configuration report
For information on using the SysMan Menu, see Chapter 1.
You can run sys_check tasks automatically by enabling an option in the root crontabs file. In the /var/spool/cron/crontabs directory, the root file contains a list of default tasks that are run by cron on a regular basis. Remove the comment character (#) from the following line:
#0 3 * * 0 /usr/share/sysman/bin/runsyscheck
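After you remove the comment character, the entry looks like the following, which runs the runsyscheck script at 3:00 A.M. every Sunday:
0 3 * * 0 /usr/share/sysman/bin/runsyscheck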
When this option is enabled, the resulting report is referenced by HP Insight Manager and can be read from the Tru64 UNIX Configuration Report icon on the HP Insight Manager Device Home Page. See Chapter 1 for information on using HP Insight Manager.
11.2.3 Using the Monitoring Performance History Utility
The Monitoring Performance History (MPH) utility is a suite of shell scripts that gathers information, such as crash data files, on the reliability and availability of the operating system and its hardware environment. The information is copied automatically to your systems vendor by Internet mail or a DSN link, if available. Using this data, performance analysis reports are created and distributed to development and support groups. This information is used internally only by your systems vendor to improve the design of reliable and highly available systems.
The MPH run process is automatic, requiring no user intervention. Initial configuration requires approximately ten minutes. MPH does not affect or degrade your system's performance because it runs as a background task, using negligible CPU resources. The disk space required for the collected data and the application is approximately 300 blocks per system. This could be slightly higher in the case of a high number of errors and is considerably larger for the initial run, when a baseline is established; this is a one-time event.
The MPH utility operates as follows:
Every 10 minutes it records a timestamp indicating that the system is running.
Daily at 2:00 A.M., it extracts any new event records from the default event log, /var/adm/binary.errlog.
Every day at 3:00 A.M., it transfers the event and time stamp data and any new crashdc data files in /var/adm/crash to the system vendor. The average transfer is 150 blocks of data.
Before running MPH, review the following information:
The Standard Programmer Commands (Software Development) OSFPGMR400 subset must be installed. Use the setld -i command to verify that the subset is installed.
The MPH software kit is contained in the mandatory base software subset OSFHWBASE400. This subset is installed automatically during the operating system installation. Full documentation is located in /usr/field/mph/unix_installation_guide.ps. A text file is also supplied.
The disk space requirement for the MPH software subset is approximately 100 blocks.
To configure MPH on your system, you must be the root user and the principal administrator of the target system. You need to supply your name, telephone number, and e-mail address. Complete the following steps:
Find the serial number (SN) of the target system, which is generally located on the rear of the system box. You need this number to complete the installation script.
Enter the following command to run the MPH script:
# /usr/field/mph/MPH_UNIX***.CSH
In this example, *** denotes the version number, such as 025.
Enter the remaining information requested by the script. When the script is complete, MPH starts automatically.
If the operating system needs to be shut down for any reason, an orderly shutdown process must be followed. Otherwise, you need to restart the MPH script as described in the MPH documentation.
See mph(1) for more information.
11.3 Environmental Monitoring and envmond/envconfig
On any system, thermal levels can increase because of poor ventilation, overheating, or fan failure. Without detection, an unscheduled shutdown could occur, causing a loss of data, damage to the system, or both. By monitoring the system's environment, you can be forewarned in time to recover or perform a gradual and orderly system shutdown.
Environmental Monitoring from the Command Line
You can monitor the environmental status of your system by using the sysconfig command line utility, as follows:
# /sbin/sysconfig -q envmon
This command reports the current temperature, fan status, and other general environmental information.
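You can also query a single attribute if you know its name; for example, the following command (assuming the envmon subsystem is available on your platform) displays only the current temperature attribute described in Section 11.3.1.1:
# /sbin/sysconfig -q envmon env_current_temp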
Environmental Sensor Monitoring
On a limited number of recent hardware platforms there also exists the capability to monitor thermal environment sensors for processor temperature and fan speed thresholds. Issue the following command as root to determine if your hardware platform supports these features.
# /sbin/hwmgr view hier | grep sensor
 52: sensor systhermal-cpu0
 53: sensor systhermal-cpu1
 54: sensor systhermal-cpu2
 55: sensor systhermal-pci_zone-1
 56: sensor systhermal-pci_zone-2
 57: sensor systhermal-pci_zone-3
 58: sensor sysfan-pci_zone-1/2
 59: sensor sysfan-power_supply-3/4
 60: sensor sysfan-cpu_memory-5/6
 61: sensor syspower-ps-0
 62: sensor syspower-ps-1
 63: sensor syspower-ps-2
The output returned by this command indicates the individual active sensors.
If there is no output, these features are not supported on your platform.
These sensors post events to the Event Manager when a threshold is reached. By reading the Event Manager log file and using the Hardware Manager command line program, you can determine which component is becoming critical.
See Chapter 13 for information on the Event Manager; see the Hardware Management manual and hwmgr(8) for more information.
HP Insight Manager
HP Insight Manager also provides a method for monitoring the environmental conditions through the Recovery->Environment display. See HP Management Agents for AlphaServers for Tru64 UNIX for more information.
envmond/envconfig Framework
The Environmental Monitoring framework consists of the following four components:
The loadable kernel module and its associated APIs
The Server System MIB subagent daemon
The envmond daemon
The envmond daemon is used to monitor the system environment. See envmond(8) for more information.
The envconfig utility
The envconfig utility lets you customize the envmond daemon. See envconfig(8) for more information.
These components are described in the following sections.
11.3.1 Loadable Kernel Module
The loadable kernel module and its associated APIs contain the parameters needed to monitor and return status on your system's threshold levels. The kernel module exports server management attributes as described in Section 11.3.1.1 through the kernel configuration manager (CFG) interface only. It works across all platforms that support server management, and provides compatibility for other server management systems under development.
The loadable kernel module does not include platform-specific code (such as the location of status registers). It is transparent to the kernel module which options are supported by a platform. That is, the kernel module and platform are designed to return valid data if an option is supported, a fixed constant for unsupported options, or null.
11.3.1.1 Specifying Loadable Kernel Attributes
The loadable kernel module exports the parameters listed in Table 11-1 to the kernel configuration manager (CFG).
Table 11-1: Parameters Defined in the Kernel Module
| Parameter | Purpose |
| env_current_temp | Specifies the system's current temperature. If a system is configured with the KCRCM module, the temperature returned is in Celsius. If a system does not support temperature readings and a temperature threshold is not exceeded, a value of -1 is returned. If a system does not support temperature readings and a temperature threshold is exceeded, a value of -2 is returned. |
| env_high_temp_thresh | Provides a system-specific operating temperature threshold. The value returned is a hardcoded, platform-specific temperature in Celsius. |
| env_fan_status | Specifies a noncritical fan status. The value returned is a bit value of zero (0). This value differs when the hardware support is provided for this feature. |
| env_ps_status | Provides the status of the redundant power supply. On platforms that provide interrupts for redundant power supply failures, the corresponding error status bits are read to determine the return value. A value of 1 is returned on error; otherwise, a value of zero (0) is returned. |
| env_supported | Indicates whether or not the platform supports server management and environmental monitoring. |
11.3.1.2 Obtaining Platform-Specific Functions
The loadable kernel module must return environmental status based on the platform being queried. To obtain environmental status, the get_info() function is used. Calls to the get_info() function are filtered through the platform_callsw[] table. The get_info() function obtains dynamic environmental data by using the function types described in Table 11-2.
Table 11-2: get_info() Function Types
| Function Type | Use of Function |
| GET_SYS_TEMP | Reads the system's internal temperature on platforms that have a KCRCM module configured. |
| GET_FAN_STATUS | Reads fan status from error registers. |
| GET_PS_STATUS | Reads redundant power supply status from error registers. |
The get_info() function obtains static data by using the HIGH_TEMP_THRESH function type, which reads the platform-specific upper threshold operational temperature.
11.3.2 Server System MIB Subagent
The Server System MIB Agent (an eSNMP subagent) exports a subset of the Environmental Monitoring parameters specified in the Server System MIB. The Server System MIB exports a common set of hardware-specific parameters across all server platforms, depending on the operating system installed.
Table 11-3 maps the subset of Server System MIB variables that support Environmental Monitoring to the kernel parameters described in Section 11.3.1.1.
Table 11-3: Mapping of Server Subsystem Variables
| Server System MIB Variable Name | Kernel Module Parameter |
| svrThSensorReading | env_current_temp |
| svrThSensorStatus | env_current_temp |
| svrThSensorHighThresh | env_high_temp_thresh |
| svrPowerSupplyStatus | env_ps_status |
| svrFanStatus | env_fan_status |
An SNMP MIB compiler and other utilities are used to compile the MIB description into code for a skeletal subagent daemon. Communication between the subagent daemon and the master agent eSNMP daemon, snmpd, is handled by interfaces in the eSNMP shared library (libesnmp.so). The subagent daemon must be started when the system boots and after the eSNMP daemon has started. The subagent daemon contains code for the Server System MIB variables listed in Table 11-3. The daemon accesses the appropriate parameter from the kernel module through the CFG interface.
11.3.3 Environmental Monitoring Daemon
Use the Environmental Monitoring daemon,
envmond,
to examine threshold levels and take corrective action before damage occurs
to your system.
Then the
envmond
daemon performs the following
tasks:
It queries the system for threshold levels.
It begins a system shutdown when a cooling fan fails.
On the AlphaServer 1000A, if a fan fails, the kernel logs the error and synchronizes the disks before it powers down the system.
On all other fan failures, a hard shutdown occurs.
It notifies users when a high temperature threshold condition is resolved.
It notifies all users that an orderly shutdown is in progress if recovery is not possible.
To query the system, the envmond daemon uses the base operating system command /usr/sbin/snmp_request to obtain the current values of the environment variables specified in the Server System MIB.
To enable Environmental Monitoring, the eSNMP and Server System MIB agents must be started during system boot, followed by the envmond daemon, also during system boot. See envmond(8) for more information.
11.3.4 Using envconfig to Configure the envmond Daemon
You can use the envconfig utility to customize how the envmond daemon queries the environment. These customizations are stored in the /etc/rc.config file, which is read by the envmond daemon during startup. Use the envconfig utility to perform the following tasks:
Turning environmental monitoring on or off during the system boot.
Starting or stopping the envmond daemon after the system boot.
Specifying the frequency between system queries by the envmond daemon.
Setting the highest threshold level that can be encountered before a temperature event is signaled by the envmond daemon.
Specifying the path of a user-defined script that you want the envmond daemon to execute when a high threshold level is encountered.
Specifying a grace period for saving data if a shutdown message is broadcast.
Displaying the values of the Environmental Monitoring variables.
See envconfig(8) for more information.
11.3.5 User-Definable Messages
You can modify messages broadcasted or logged by the Environmental Monitoring utility. The messages are located in the following file:
/usr/share/sysman/envmon/EnvMon_UserDefinable_Msg.tcl
You must be root to edit this file; you can edit any message text included in braces ({}). The instructions for editing each section of the file are included in the comment fields, preceded by the # symbol.
For example, the following message provides samples of possible causes for the high temperature condition:
set EnvMon_Ovstr(ENVMON_SHUTDOWN_1_MSG){System has reached a \
high temperature condition. Possible problem source: Clogged \
air filter or high ambient room temperature.}
You could modify this message text as follows:
set EnvMon_Ovstr(ENVMON_SHUTDOWN_1_MSG) {System \
has reached a high temperature condition. Check the air \
conditioning unit}
Do not alter any data in this file other than the text strings between the braces ({}).
11.4 Using System Exercisers
The operating system provides a set of exercisers that you can use to troubleshoot your system. The exercisers test specific areas of your system, such as file systems or system memory. The following sections provide information on the system exercisers:
Running the system exercisers (Section 11.4.1)
Using exerciser diagnostics (Section 11.4.2)
Exercising file systems by using the fsx command (Section 11.4.3)
Exercising system memory by using the memx command (Section 11.4.4)
Exercising shared memory by using the shmx command (Section 11.4.5)
Exercising communications systems by using the cmx command (Section 11.4.6)
Additionally, you can exercise disk drives by using the diskx command and tape drives by using the tapex command. For more information, see diskx(8) and tapex(8).
In addition to the exercisers documented in this chapter, your system may support the Verifier and Exerciser Tool (VET), which provides a similar set of exercisers. See the documentation that came with your latest firmware CD-ROM for information on VET.
11.4.1 Running System Exercisers
To run a system exerciser, you must be logged in as superuser and your current directory must be /usr/field.
The commands that invoke the system exercisers provide an option for saving the diagnostic output into a specified file when the exerciser completes its task.
Most of the exerciser commands have an online help option that displays a description of how to use that exerciser. To access online help, use the -h option with a command. For example, to access help for the diskx exerciser, use the following command:
# diskx -h
You can run the exercisers in the foreground or the background; you can cancel them at any time by entering [Ctrl/c] in the foreground. You can run more than one exerciser at the same time; keep in mind, however, that the more processes you have running, the slower the system performs. Thus, before exercising the system extensively, make sure that no other users are on the system.
There are some restrictions when you run a system exerciser over an NFS link or on a diskless system. Exercisers, such as fsx, need to write to a file system, so the target file system must be writable by root. Also, the directory from which an exerciser is executed must be writable by root because temporary files are written to the directory. These restrictions can be difficult to adhere to because NFS file systems are often mounted in a way that prevents root from writing to them. You can overcome some of these problems by copying the exerciser into another directory and running it from the new directory.
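For example, the following commands (the directory name and option values are illustrative) copy the fsx exerciser to a root-writable local directory and run it from there:
# mkdir /var/tmp/exercise
# cp /usr/field/fsx /var/tmp/exercise
# cd /var/tmp/exercise
# ./fsx -f/var/tmp/exercise -p2 -t30 &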
11.4.2 Using Exerciser Diagnostics
When an exerciser is halted (either by entering [Ctrl/c] or by timing out), diagnostics are displayed and are stored in the exerciser's most recent log file. The diagnostics inform you of the test results.
Each time an exerciser is invoked, a new log file is created in the /usr/field directory. For example, when you execute the fsx command for the first time, a log file named #LOG_FSX_01 is created. The log files contain records of each exerciser's results and consist of the starting and stopping times, and error and statistical information. The starting and stopping times are also logged into the default /var/adm/binary.errlog system error log file. This file also contains information on errors reported by the device drivers or by the system.
The log files provide a record of the diagnostics. However, be sure to delete a log file after reading it because an exerciser can have only nine log files. If you attempt to run an exerciser that has accumulated nine log files, the exerciser tells you to remove some of the old log files so that it can create a new one.
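For example, the following command removes old fsx log files from the default location after you have reviewed them:
# rm /usr/field/#LOG_FSX_*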
If an exerciser finds errors, you can determine which device or area of the system is experiencing difficulty by looking at the /var/adm/binary.errlog file, using either the dia command (this is preferred) or the uerf command. For information on the error logger, see Section 12.1. For the meanings of the error numbers and signal numbers, see intro(2) and sigvec(2).
11.4.3 Exercising a File System
Use the fsx command to exercise the local file systems. The fsx command exercises the specified local file system by initiating multiple processes, each of which creates, writes, closes, opens, reads, validates, and unlinks a test file of random data.
Note
Do not test NFS file systems with the fsx command.
The fsx command has the following syntax:
fsx [-f path] [-h] [-o file] [-p num] [-t min]
See fsx(8) for more information.
The following example of the fsx command tests the /usr file system with five fsxr processes running for 60 minutes in the background:
# fsx -p5 -f/usr -t60 &
11.4.4 Exercising System Memory
Use the memx command to exercise the system memory. The memx command exercises the system memory by initiating multiple processes. By default, the size of each process is defined as the total system memory in bytes divided by 20. The minimum allowable number of bytes per process is 4,095. The memx command writes and reads ones and zeroes, zeroes and ones, and random data patterns in the allocated memory being tested. The files that you need to run the memx exerciser include the following:
memx
memxr
The memx command is restricted by the amount of available swap space. The size of the swap space and the available internal memory determine how many processes can run simultaneously on your system. For example, if there are 16 MB of swap space and 16 MB of memory, all the swap space is used if all 20 initiated processes (the default) run simultaneously. This prevents execution of other processes. Therefore, on systems with large amounts of memory and small amounts of swap space, you must use the -p or -m option, or both, to restrict the number of memx processes or to restrict the size of the memory being tested.
The memx command has the following syntax:
memx [-s] [-h] [-m size] [-o file] [-p num] [-t min]
See memx(8) for more information.
The following example of the memx command initiates five memxr processes that test 4,095 bytes of memory and runs in the background for 60 minutes:
# memx -m4095 -p5 -t60 &
11.4.5 Exercising Shared Memory
Use the shmx command to exercise the shared memory segments. The shmx command spawns a background process called shmxb. The shmx command writes and reads the shmxb data in the segments, and the shmxb process writes and reads the shmx data in the segments. Using shmx, you can test the number and the size of memory segments and shmxb processes.
The shmx exerciser runs until the process is killed or until the time specified with the -t option is exhausted.
The shmx exerciser is invoked automatically when you start the memx exerciser, unless you specify the memx command with the -s option. You can also invoke the shmx exerciser manually.
The shmx command has the following syntax:
/usr/field/shmx [-h] [-o file] [-v] [-t time] [-m size] [-s n]
See shmx(8) for more information.
The following example tests the default number of memory segments, each with a default segment size:
# shmx &
The following example runs three memory segments of 100,000 bytes for 180 minutes:
# shmx -t180 -m100000 -s3 &
11.4.6 Exercising the Terminal Communication System
Use the cmx command to exercise the terminal communications system. The cmx command writes, reads, and validates random data and packet lengths on the specified communications lines.
The lines you exercise must have a loopback connector attached to the distribution panel or the cable. Also, the line must be disabled in the /etc/inittab file and must be a nonmodem line; that is, the CLOCAL option must be set to on. Otherwise, the cmx command repeatedly displays error messages on the terminal screen until its time expires or until you enter [Ctrl/c].
You cannot test pseudodevice lines or lta device lines. Pseudodevices have p, q, r, s, t, u, v, w, x, y, or z as the first character after tty, for example, ttyp3.
The cmx command has the following syntax:
/usr/field/cmx [-h] [-o file] [-t min] [-l line]
See cmx(8) for more information.
The following example exercises communication lines tty22 and tty34 for 45 minutes in the background:
# cmx -l 22 34 -t45 &
The following example exercises lines tty00 through tty07 until you enter [Ctrl/c]:
# cmx -l 00-07