You must gather a wide variety of performance information in order to identify performance problems or areas where performance is deficient.
Some symptoms or indications of performance problems are obvious. For example, applications complete slowly or messages appear on the console indicating that the system is out of resources. Other problems or performance deficiencies are not obvious and can be detected only by monitoring system performance.
This chapter describes how to perform the following tasks:
Understand how the system logs event messages (Section 3.1)
Set up system accounting and disk quotas to track and control resource utilization (Section 3.2)
Use tools to gather a variety of performance information (Section 3.3)
Establish a method to continuously monitor system performance (Section 3.4)
Profile and debug kernels (Section 3.5)
After you identify a performance problem or an area in which performance is deficient, you can identify an appropriate solution. See Chapter 4 for information about improving system performance.
It is recommended that you set up a routine to continuously monitor system events and to alert you when serious problems occur. In addition to helping you diagnose performance problems, periodically examining event and log files allows you to correct a problem before it impacts performance or availability.
The operating system uses the system event logging facility and the binary event logging facility to log system events. The system event logging facility uses the syslog function to log events in ASCII format. The syslogd daemon collects the messages logged from the various kernel, command, utility, and application programs. This daemon then writes the messages to a local file or forwards the messages to a remote system, as specified in the /etc/syslog.conf event logging configuration file. You should periodically monitor these ASCII log files for performance information.
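Each /etc/syslog.conf entry pairs a facility.severity selector with a destination (a local file or a remote host). The following fragment is a hedged illustration only; the file path and host name are examples, not system defaults:

```
# Illustrative /etc/syslog.conf entries; the selector and the action
# are separated by tabs. The path and host name below are examples.
kern.warning	/var/adm/syslog.dated/kern.log
*.emerg		@loghost
```

With entries like these, kernel messages of warning severity and above go to a local file, while emergency messages from any facility are forwarded to a remote log host.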
The binary event logging facility detects hardware and software events in the kernel and logs detailed information in binary format records. The binary event logging facility uses the binlogd daemon to collect various event log records. The daemon then writes these records to a local file or forwards the records to a remote system, as specified in the /etc/binlog.conf default configuration file.
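The format of /etc/binlog.conf is analogous: an event-class.severity selector followed by a destination. A hedged sketch, in which the local file path and remote host are assumed for illustration:

```
# Illustrative /etc/binlog.conf entries; the path and host name
# below are examples, not guaranteed defaults.
*.*		/usr/adm/binary.errlog
*.severe	@loghost
```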
You can examine the binary event log files by using the following methods:
The DECevent utility continuously monitors system events through the binary event logging facility, decodes events, and tracks the number and the severity of events logged by system devices. DECevent can analyze system events and provides a notification mechanism (for example, mail) that can warn of potential problems.
You must register a license to use DECevent's analysis and notification features. These features may be available as part of your service agreement. A license is not needed to use DECevent to translate the binary log file to ASCII format.
You can use the uerf command to translate binary log files to ASCII format. See uerf(8) for information.
In addition, it is recommended that you configure crash dump support into the system. Significant performance problems may cause the system to crash, and crash dump analysis tools can help you diagnose performance problems.
See the System Administration manual for more information about event logging and crash dumps.
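A periodic scan of the ASCII log files can surface the kinds of messages worth investigating. The following is a minimal sketch: the log path and the sample entries are illustrative stand-ins, and on a live system you would point LOG at a file named in /etc/syslog.conf and run the scan from cron.

```shell
#!/bin/sh
# Sketch: scan an ASCII event log for messages that suggest resource
# exhaustion or crashes. The log path and entries are illustrative.
LOG=${LOG:-/tmp/demo_syslog.log}

# Create a small sample log so the sketch is self-contained.
cat > "$LOG" <<'EOF'
Jan 10 04:12:01 host vmunix: WARNING: out of swap space
Jan 10 04:12:05 host sendmail[123]: message accepted for delivery
Jan 10 04:13:22 host vmunix: panic: simulated entry for illustration
EOF

# Report suspicious lines; a cron job could mail this output to you.
grep -iE 'warning|panic|out of (swap|memory)' "$LOG"
```

Here the scan reports the out-of-swap warning and the panic entry but ignores routine traffic such as the sendmail message.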
It is recommended that you set up system accounting, which allows you to obtain information about the amount of CPU usage and connect time, the number of processes spawned, memory and disk usage, the number of I/O operations, and the number of printing operations by each user.
In addition, you should establish Advanced File System (AdvFS) and UNIX File System (UFS) disk quotas in order to track and control disk usage. Disk quotas allow you to limit the disk space available to users and to monitor disk space usage.
See the System Administration manual for information about setting up system accounting and UFS disk quotas. See the AdvFS Administration manual for information about AdvFS quotas.
There are various commands and utilities that you can use to obtain a comprehensive understanding of your system performance. It is important that you gather statistics under a variety of conditions. Comparing sets of data will help you to diagnose performance problems.
For example, to determine how an application impacts system performance, you can gather performance statistics without the application running, start the application, and then gather the same statistics. Comparing different sets of data will enable you to identify whether the application is consuming memory, CPU, or disk I/O resources.
Be sure to gather information at different stages during the application processing to obtain comprehensive performance information. For example, an application may be I/O-intensive during one stage and CPU-intensive during another.
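The before-and-after comparison described above can be scripted by capturing timestamped snapshots of a statistics command into a file. In this sketch, STATS_CMD, the output file, and the interval are placeholders; on a live system you might substitute "vmstat 1 2" or "iostat" for the echo stand-in.

```shell
#!/bin/sh
# Sketch: capture timestamped snapshots of a statistics command so
# that runs with and without an application can be compared later.
# STATS_CMD, OUT, COUNT, and INTERVAL are illustrative placeholders.
STATS_CMD=${STATS_CMD:-"echo sample statistics"}
OUT=${OUT:-/tmp/perf_snapshots.log}
COUNT=${COUNT:-3}
INTERVAL=${INTERVAL:-1}

: > "$OUT"                      # start a fresh snapshot file
i=0
while [ "$i" -lt "$COUNT" ]; do
    echo "=== snapshot $(date)" >> "$OUT"   # mark each snapshot
    $STATS_CMD >> "$OUT"
    i=$((i + 1))
    if [ "$i" -lt "$COUNT" ]; then
        sleep "$INTERVAL"
    fi
done
```

Running the script once before starting the application and once while it runs yields two snapshot files whose corresponding sections can be compared directly.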
There are three commands that you can use to obtain some fundamental performance information:
vmstat
The primary source of performance problems is a lack of memory, which can affect response time and application completion time. A lack of CPU resources can also result in long application completion times.
Use the vmstat command to determine if the system is using a lot of memory resources or CPU cycles. The command output shows the number of free pages and the percentages of user, system, and idle CPU time. See Section 6.3.2 for information.
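The free-page and idle-CPU figures can be pulled out of the vmstat output for use in a monitoring script. In this sketch, the sample text stands in for running vmstat on a live system; the column layout (free pages in the fifth field, idle time in the last field) is an assumption for illustration and should be checked against your system's actual output.

```shell
#!/bin/sh
# Sketch: extract free pages and idle CPU percentage from
# vmstat-style output. The sample and its column layout are
# assumptions standing in for a live `vmstat` run.
SAMPLE='Virtual Memory Statistics: (pagesize = 8192)
  procs    memory         pages                          intr      cpu
  r  w  u  act free wire fault cow zero react pin pout  in sy cs us sy id
  2 66 25 6417 3497 1570  155K 38K  50K     0 42K    0   8 48 66  0  1 99'

# Line 4 is the data line: field 5 is free pages, the last field is idle %.
echo "$SAMPLE" | awk 'NR == 4 { print "free pages:", $5, "idle cpu %:", $NF }'
```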
iostat
If your disk I/O load is not spread evenly among the available disks,
bottlenecks may occur at the disks that are being excessively used.
Use the iostat command to determine if disk I/O is being evenly distributed. The command output shows which disks are being excessively used. See Section 8.2.1 for information.
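A quick way to flag an uneven distribution is to compare the per-disk transfer rates in the iostat output. In this sketch, the sample text stands in for a live iostat run; the paired bps/tps columns per device and the ten-to-one imbalance threshold are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: compare per-disk transfers per second (tps) in iostat-style
# output to spot an unevenly loaded disk. The sample and its column
# layout are assumptions standing in for a live `iostat` run.
SAMPLE='  floppy0      dsk0       dsk1      cpu
 bps  tps  bps  tps  bps  tps us ni sy id
   0    0   87   13    3    1  1  0  4 95'

# In this sample layout, fields 4 and 6 are tps for dsk0 and dsk1.
# Flag a tenfold difference as an imbalance worth investigating.
echo "$SAMPLE" | awk 'NR == 3 {
    if ($4 > 10 * $6 || $6 > 10 * $4)
        print "uneven load: dsk0 tps =", $4, "dsk1 tps =", $6
}'
```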
swapon -s
Insufficient swap space for your workload can result in poor application
performance and response time.
To check swap space, use the swapon -s command. The command output shows the total amount of swap space and the percentage of swap space that is being used. See Section 6.3.3 for information.
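The in-use percentage can also be computed from the allocation totals, which is convenient for threshold checks in a monitoring script. In this sketch, the sample text stands in for a live swapon -s run; the exact labels and field layout are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: compute the percentage of swap space in use from
# swapon -s style totals. The sample and its field layout are
# assumptions standing in for a live `swapon -s` run.
SAMPLE='Total swap allocation:
    Allocated space:      16384 pages (128MB)
    In-use space:          5000 pages ( 30%)
    Free space:           11384 pages ( 69%)'

# Field 3 holds the page counts in this sample layout.
echo "$SAMPLE" | awk '
    /Allocated space:/ { total = $3 }
    /In-use space:/    { used  = $3 }
    END { printf "swap in use: %.1f%%\n", used / total * 100 }'
```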
To obtain application performance information, use profiling tools to collect statistics on CPU usage, call counts, call cost, memory usage, and I/O operations at various levels (for example, at a procedure level or at an instruction level). Profiling identifies sections of application code that consume large portions of execution time, so that you can focus on improving code efficiency in these sections.
There are many tools that you can use to query subsystems, profile the system kernel and applications, and collect CPU statistics. See the following tables for information:
| Kernel profiling and debugging | Table 3-1 |
| Memory resource monitoring | Table 6-1 |
| CPU monitoring | Table 7-1 |
| Disk operation monitoring | Table 8-1 |
| LSM monitoring | Table 8-7 |
| AdvFS monitoring | Table 9-4 |
| UFS monitoring | Table 9-7 |
| NFS monitoring | Table 9-9 |
| Network subsystem monitoring | Table 10-1 |
| Application profiling and debugging | Table 11-1 |
You may want to set up a routine to continuously monitor system performance. Some monitoring tools alert you when serious problems occur (for example, by sending mail). It is important to choose a continuous monitoring tool that does not itself degrade system performance.
The following tools allow you to continuously monitor performance:
Performance Manager
Simultaneously monitors multiple Tru64 UNIX systems, detects performance problems, and performs event notification. See Section 3.4.1 for more information.
Performance Visualizer
Graphically displays the performance of all significant components of a parallel system. Using Performance Visualizer, you can monitor the performance of all the member systems in a cluster. See Section 3.4.2 for more information.
Digital Continuous Profiling Infrastructure
Provides continuous, low-overhead system profiling and allows you to track cycles during program execution. See Section 3.4.3 for information.
monitor
Collects a variety of performance data on a running system and either displays the information or saves it to a binary file. The monitor utility is available on the Tru64 UNIX Freeware CD-ROM. See ftp://gatekeeper.dec.com/pub/DEC for information.
top
Provides continuous reports on the state of the system, including a list of the processes using the most CPU resources. The top command is available on the Tru64 UNIX Freeware CD-ROM. See ftp://eecs.nwu.edu/pub/top for information.
tcpdump
Continuously monitors the network traffic associated with a particular network service and allows you to identify the source of a packet. See tcpdump(8) for information.
nfswatch
Continuously monitors all incoming network traffic to a Network File System (NFS) server and displays the number and percentage of packets received. See nfswatch(8) for information.
xload
Displays the system load average in a histogram that is periodically updated. See xload(1X) for information.
volstat
Provides information about activity on volumes, plexes, subdisks, and disks under LSM control. The volstat utility reports statistics that reflect the activity levels of LSM objects since boot time or since you reset the statistics. See Section 8.3.4.2 for information.
volwatch
Monitors the Logical Storage Manager (LSM) for failures in disks, volumes, and plexes, and sends mail if a failure occurs. See Section 8.3.4.4 for information.
The following sections describe the Performance Manager and Performance Visualizer products and the dcpi tool.
Performance Manager (PM) for Tru64 UNIX allows you to simultaneously monitor multiple Tru64 UNIX systems, so you can detect and correct performance problems. PM can operate in the background, alerting you to performance problems. Monitoring only a local node does not require a PM license. However, a license is required to monitor multiple nodes and clusters.
PM gathers and displays Simple Network Management Protocol (SNMP) and Extensible SNMP (eSNMP) data for the systems you choose, and allows you to detect and correct performance problems from a central location. PM has a graphical user interface (GUI) that runs locally and displays data from the monitored systems.
Use the GUI to choose the systems and data that you want to monitor. You can customize and extend PM, so you can create and save performance monitoring sessions. Graphs and charts can show hundreds of different system values, including CPU performance, memory usage, disk transfers, file-system capacity, network efficiency, database performance, and AdvFS and cluster-specific metrics. Data archives can be used for high-speed playback or long-term trend analysis.
PM provides comprehensive thresholding, rearming, and tolerance facilities for all displayed metrics. You can set a threshold on every key metric, and specify the PM reaction when a threshold is crossed. For example, you can configure PM to send mail, to execute a command, or to display a notification message.
PM also has performance analysis and system management scripts, as well as cluster-specific and AdvFS-specific scripts. Run these scripts separately to target specific problems, or run them simultaneously to check the general system performance. The PM analyses include suggestions for eliminating problems. PM can monitor both individual cluster members and an entire cluster concurrently.
See the Performance Manager online documentation for more information.
Performance Visualizer is a valuable tool for developers of parallel applications. Because it monitors the performance of several systems simultaneously, it allows you to see the impact of a parallel application on all the systems, and to ensure that the application is balanced across all systems. When problems are identified, you can change the application code and use Performance Visualizer to evaluate the effects of these changes. Performance Visualizer is a Tru64 UNIX layered product and requires a license.
Performance Visualizer also helps you identify overloaded systems, underutilized resources, active users, and busy processes.
Using Performance Visualizer, you can monitor the following:
CPU utilization by each CPU in a multiprocessing system
Load average
Use of paged memory
Paging events, which indicate how much a system is paging
Use of swap space
Behavior of individual processes
You can choose to look at all of the hosts in a parallel system or at individual hosts. See the Performance Visualizer documentation for more information.
Use the Digital Continuous Profiling Infrastructure (dcpi) tool to provide low-overhead system profiling and to track cycles during program execution. Using dcpi, you can continuously profile entire systems, including the kernel, user programs, drivers, and shared libraries, which may enable you to make coding improvements.
The dcpi tool maintains a database of profile information that is updated incrementally for every executable image that runs. A suite of profile analysis tools analyzes the profile information at various levels. For example, you can determine the percentage of CPU cycles that were used to execute the kernel and each user program. You can also determine how long a specific instruction stalled, on average, because of a D-cache miss.
The dcpi tool may not support all configurations. The tool is available from the Systems Research Center at the following World Wide Web location:
http://www.research.digital.com/SRC/dcpi
Table 3-1 describes the tools that you can use to profile and debug the kernel. Detailed information about these profiling and debugging tools is located in the Kernel Debugging manual and in the tools' reference pages.
| Name | Use | Description |
| prof | Analyzes profiling data | Analyzes profiling data and produces statistics showing which portions of code consume the most time and where the time is spent (for example, at the routine level, the basic block level, or the instruction level). |
| kprofile | Produces a program counter profile of a running kernel | Profiles a running kernel using the performance counters on the Alpha chip. You analyze the performance data collected by the tool with the prof command. See kprofile(1) for more information. |
| dbx | Debugs running kernels, programs, and crash dumps, and examines and temporarily modifies kernel variables | Provides source-level debugging for C, Fortran, Pascal, assembly language, and machine code. |
| kdbx | Debugs running kernels and crash dumps | Allows you to examine a running kernel or a crash dump. You can also use kdbx extensions to check resource usage (for example, CPU usage). |
| kdebug | Debugs kernels and applications | Debugs programs and the kernel and helps locate run-time programming errors. |