3    Monitoring Systems and Diagnosing Performance Problems

You must gather a wide variety of performance information in order to identify performance problems or areas where performance is deficient.

Some symptoms or indications of performance problems are obvious. For example, applications complete slowly or messages appear on the console indicating that the system is out of resources. Other problems or performance deficiencies are not obvious and can be detected only by monitoring system performance.

This chapter describes how to perform the following tasks:

After you identify a performance problem or an area in which performance is deficient, you can identify an appropriate solution. See Chapter 4 for information about improving system performance.

3.1    Obtaining Information About System Events

Set up a routine to continuously monitor system events and to alert you when serious problems occur. In addition to helping you diagnose performance problems, periodically examining event and log files allows you to correct a problem before it affects performance or availability.

The operating system uses the system event logging facility and the binary event logging facility to log system events. The system event logging facility uses the syslog function to log events in ASCII format. The syslogd daemon collects the messages logged from the various kernel, command, utility, and application programs. This daemon then writes the messages to a local file or forwards the messages to a remote system, as specified in the /etc/syslog.conf event logging configuration file. You should periodically monitor these ASCII log files for performance information.
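To decide which ASCII log files to review, it can help to list the destinations named in the event logging configuration file. The following Python sketch parses text in the common "selector, whitespace, action" layout of a syslog.conf file and separates local files from remote forwards. The sample configuration text, the paths in it, the function name parse_syslog_conf, and the "file", "remote", and "other" labels are illustrative assumptions. The exact syntax that syslogd accepts on your system may differ, so check the reference pages before relying on it.

```python
def parse_syslog_conf(text):
    """Return a list of (selector, action, kind) tuples.

    kind is "file" for a local path, "remote" for an @host forward,
    and "other" for anything else (for example, a user list or *).
    """
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)  # selector first, then the action
        if len(parts) != 2:
            continue  # ignore lines that do not have both fields
        selector, action = parts[0], parts[1].strip()
        if action.startswith("/"):
            kind = "file"
        elif action.startswith("@"):
            kind = "remote"
        else:
            kind = "other"
        entries.append((selector, action, kind))
    return entries


# Hypothetical configuration text, for illustration only.
SAMPLE = """\
# selector          action
kern.debug          /var/adm/syslog.dated/kern.log
mail.debug          /var/adm/syslog.dated/mail.log
*.emerg             *
daemon.notice       @loghost
"""

entries = parse_syslog_conf(SAMPLE)
local_files = [action for _, action, kind in entries if kind == "file"]
for path in local_files:
    print(path)
```

The local_files list is the set of log files on this system that a periodic review would need to cover. Entries marked "remote" are forwarded and must be examined on the receiving system instead.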

The binary event logging facility detects hardware and software events in the kernel and logs detailed information in binary format records. The binary event logging facility uses the binlogd daemon to collect various event log records. The daemon then writes these records to a local file or forwards the records to a remote system, as specified in the /etc/binlog.conf default configuration file.

You can examine the binary event log files by using the following methods:

In addition, configure crash dump support into the system. Significant performance problems can cause the system to crash, and crash dump analysis tools can help you diagnose them.

See the System Administration manual for more information about event logging and crash dumps.

3.2    Using System Accounting and Disk Quotas

Set up system accounting to obtain information, for each user, about the amount of CPU usage and connect time, the number of processes spawned, memory and disk usage, the number of I/O operations, and the number of printing operations.

In addition, you should establish Advanced File System (AdvFS) and UNIX File System (UFS) disk quotas in order to track and control disk usage. Disk quotas allow you to limit the disk space available to users and to monitor disk space usage.

See the System Administration manual for information about setting up system accounting and UFS disk quotas. See the AdvFS Administration manual for information about AdvFS quotas.

3.3    Gathering Performance Information

You can use various commands and utilities to obtain a comprehensive understanding of your system performance. Gather statistics under a variety of conditions, because comparing sets of data helps you diagnose performance problems.

For example, to determine how an application affects system performance, gather performance statistics without the application running, start the application, and then gather the same statistics again. Comparing the two sets of data enables you to identify whether the application is consuming memory, CPU, or disk I/O resources.
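The before-and-after comparison described above can be sketched in Python as follows. The metric names and values are hypothetical numbers chosen for illustration. In practice you would fill the two dictionaries from the output of whatever monitoring commands you use. The function name compare_samples is also an assumption.

```python
def compare_samples(baseline, loaded):
    """Compare two sets of statistics gathered with the same commands.

    Returns {metric: (before, after, change, percent_change)} for every
    metric present in both sets. percent_change is None when the
    baseline value is zero, because the ratio is undefined.
    """
    report = {}
    for name in sorted(baseline.keys() & loaded.keys()):
        before, after = baseline[name], loaded[name]
        change = after - before
        percent = (change / before * 100.0) if before else None
        report[name] = (before, after, change, percent)
    return report


# Hypothetical values: one set gathered without the application running,
# one gathered while it runs.
baseline = {"cpu_busy_pct": 10.0, "free_pages": 8000, "disk_xfers_per_sec": 20.0}
loaded = {"cpu_busy_pct": 15.0, "free_pages": 2000, "disk_xfers_per_sec": 22.0}

report = compare_samples(baseline, loaded)
for name, (before, after, change, percent) in report.items():
    print(f"{name}: {before} -> {after} (change {change:+})")

# The metric with the largest relative change is a reasonable place to look first.
largest = max(report, key=lambda n: abs(report[n][3] or 0.0))
print("largest relative change:", largest)
```

With these sample numbers, free memory shows the largest relative change, which would suggest that the application is primarily consuming memory. Repeating the comparison at several stages of application processing shows whether that pattern holds throughout the run.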

Be sure to gather information at different stages during the application processing to obtain comprehensive performance information. For example, an application may be I/O-intensive during one stage and CPU-intensive during another.

There are three commands that you can use to obtain some fundamental performance information:

To obtain application performance information, use profiling tools to collect statistics on CPU usage, call counts, call cost, memory usage, and I/O operations at various levels (for example, at a procedure level or at an instruction level). Profiling identifies sections of application code that consume large portions of execution time, so that you can focus on improving code efficiency in these sections.

There are many tools that you can use to query subsystems, profile the system kernel and applications, and collect CPU statistics. See the following tables for information:

Kernel profiling and debugging         Table 3-1
Memory resource monitoring             Table 6-1
CPU monitoring                         Table 7-1
Disk operation monitoring              Table 8-1
LSM monitoring                         Table 8-7
AdvFS monitoring                       Table 9-4
UFS monitoring                         Table 9-7
NFS monitoring                         Table 9-9
Network subsystem monitoring           Table 10-1
Application profiling and debugging    Table 11-1

3.4    Continuously Monitoring Performance

You may want to set up a routine to continuously monitor system performance. Some monitoring tools alert you when serious problems occur (for example, by sending mail). Choose a continuous monitoring tool that does not itself degrade system performance.

The following tools allow you to continuously monitor performance:

The following sections describe the Performance Manager and Performance Visualizer products and the dcpi tool.

3.4.1    Using Performance Manager

Performance Manager (PM) for Tru64 UNIX allows you to simultaneously monitor multiple Tru64 UNIX systems, so you can detect and correct performance problems. PM can operate in the background, alerting you to performance problems. Monitoring only a local node does not require a PM license. However, a license is required to monitor multiple nodes and clusters.

PM gathers and displays Simple Network Management Protocol (SNMP) and Extensible SNMP (eSNMP) data for the systems you choose, and allows you to detect and correct performance problems from a central location. PM has a graphical user interface (GUI) that runs locally and displays data from the monitored systems.

Use the GUI to choose the systems and data that you want to monitor. You can customize and extend PM, and you can create and save performance monitoring sessions. Graphs and charts can show hundreds of different system values, including CPU performance, memory usage, disk transfers, file-system capacity, network efficiency, database performance, and AdvFS and cluster-specific metrics. Data archives can be used for high-speed playback or long-term trend analysis.

PM provides comprehensive thresholding, rearming, and tolerance facilities for all displayed metrics. You can set a threshold on every key metric, and specify the PM reaction when a threshold is crossed. For example, you can configure PM to send mail, to execute a command, or to display a notification message.
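To illustrate how thresholding, rearming, and tolerance can interact in general, the following Python sketch models a single monitored metric. It is not PM's implementation. The class name ThresholdMonitor, the parameter names, and the sample values are hypothetical. "Tolerance" is interpreted here as the number of consecutive samples that must exceed the threshold before a reaction is triggered, and "rearm" as the level the metric must fall back to before another reaction can occur. Both interpretations are assumptions.

```python
class ThresholdMonitor:
    """Generic threshold check with rearm level and tolerance (illustrative only)."""

    def __init__(self, threshold, rearm, tolerance=1):
        self.threshold = threshold  # a reaction is considered above this value
        self.rearm = rearm          # value must fall to this level to re-enable reactions
        self.tolerance = tolerance  # consecutive over-threshold samples required
        self.armed = True
        self.over_count = 0

    def observe(self, value):
        """Record one sample; return True if it should trigger a reaction."""
        if value > self.threshold:
            self.over_count += 1
        else:
            self.over_count = 0
            if value <= self.rearm:
                self.armed = True
        if self.armed and self.over_count >= self.tolerance:
            self.armed = False  # stay quiet until the metric drops to the rearm level
            return True
        return False


# Hypothetical CPU-busy percentages sampled at regular intervals.
samples = [50, 92, 95, 97, 93, 85, 96, 60, 91, 94]
monitor = ThresholdMonitor(threshold=90, rearm=70, tolerance=2)

fired = [i for i, value in enumerate(samples) if monitor.observe(value)]
print(fired)
```

In this run the reaction triggers at sample indexes 2 and 9. The excursion at index 6 produces no reaction because the metric never fell to the rearm level after the first one, which is how a rearm level prevents a repeated flood of mail or notification messages while a metric hovers near its threshold.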

PM also has performance analysis and system management scripts, as well as cluster-specific and AdvFS-specific scripts. Run these scripts separately to target specific problems, or run them simultaneously to check the general system performance. The PM analyses include suggestions for eliminating problems. PM can monitor both individual cluster members and an entire cluster concurrently.

See the Performance Manager online documentation for more information.

3.4.2    Using Performance Visualizer

Performance Visualizer is a valuable tool for developers of parallel applications. Because it monitors the performance of several systems simultaneously, it allows you to see the impact of a parallel application on all the systems, and to ensure that the application is balanced across all systems. When problems are identified, you can change the application code and use Performance Visualizer to evaluate the effects of these changes. Performance Visualizer is a Tru64 UNIX layered product and requires a license.

Performance Visualizer also helps you identify overloaded systems, underutilized resources, active users, and busy processes.

Using Performance Visualizer, you can monitor the following:

You can choose to look at all of the hosts in a parallel system or at individual hosts. See the Performance Visualizer documentation for more information.

3.4.3    Using Digital Continuous Profiling Infrastructure

Use the Digital Continuous Profiling Infrastructure (dcpi) tool to provide low-overhead system profiling and to track cycles during program execution. Using dcpi, you can continuously profile entire systems, including the kernel, user programs, drivers, and shared libraries, which may enable you to make coding improvements.

The dcpi tool maintains a database of profile information that is updated incrementally for every executable image that runs. A suite of profile analysis tools analyzes the profile information at various levels. For example, you can determine the percentage of CPU cycles that were used to execute the kernel and each user program. You can also determine how long a specific instruction stalled, on average, because of a D-cache miss.
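The kind of per-image summary described above can be illustrated with a short Python sketch. It does not read the dcpi database or reproduce the output of any dcpi analysis tool. The sample counts and image paths are made-up input, and the function name cycle_shares is an assumption. The sketch only shows the arithmetic: sample counts are accumulated for each image and converted to a percentage of all samples.

```python
from collections import Counter


def cycle_shares(samples):
    """Aggregate (image, sample_count) pairs into percentages of the total.

    Returns a list of (image, percent) tuples, largest share first.
    """
    totals = Counter()
    for image, count in samples:
        totals[image] += count  # an image may appear in several batches
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return [(image, 100.0 * count / grand_total)
            for image, count in totals.most_common()]


# Hypothetical cycle samples attributed to the kernel, a program, and a shared library.
SAMPLES = [
    ("/vmunix", 3000),
    ("/usr/bin/app", 5000),
    ("/shlib/libc.so", 1500),
    ("/usr/bin/app", 500),
]

shares = cycle_shares(SAMPLES)
for image, percent in shares:
    print(f"{percent:6.2f}%  {image}")
```

With this input the hypothetical application accounts for 55 percent of the samples, the kernel for 30 percent, and the shared library for 15 percent. A ranking like this indicates which image to examine first at the procedure or instruction level.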

The dcpi tool may not support all configurations.

The dcpi tool is available from the Systems Research Center at the following World Wide Web location:

http://www.research.digital.com/SRC/dcpi

3.5    Profiling and Debugging Kernels

Table 3-1 describes the tools that you can use to profile and debug the kernel. Detailed information about these profiling and debugging tools is located in the Kernel Debugging manual and in the tools' reference pages.

Table 3-1:  Kernel Profiling and Debugging Tools

Name: prof
Use: Analyzes profiling data
Description: Analyzes profiling data and produces statistics showing which portions of code consume the most time and where the time is spent (for example, at the routine level, the basic block level, or the instruction level).

The prof command uses as input one or more data files generated by the kprofile, uprofile, or pixie profiling tools. The prof command also accepts profiling data files generated by programs linked with the -p switch of compilers such as cc. See prof(1) for more information.

Name: kprofile
Use: Produces a program counter profile of a running kernel
Description: Profiles a running kernel using the performance counters on the Alpha chip. You analyze the performance data collected by the tool with the prof command. See kprofile(1) for more information.

Name: dbx
Use: Debugs running kernels, programs, and crash dumps, and examines and temporarily modifies kernel variables
Description: Provides source-level debugging for C, Fortran, Pascal, assembly language, and machine code. The dbx debugger allows you to analyze crash dumps, trace problems in a program object at the source-code level or at the machine code level, control program execution, trace program logic and flow of control, and monitor memory locations.

Use dbx to debug kernels, debug stripped images, examine memory contents, debug multiple threads, analyze user code and applications, display the value and format of kernel data structures, and temporarily modify the values of some kernel variables. See dbx(8) for more information.

Name: kdbx
Use: Debugs running kernels and crash dumps
Description: Allows you to examine a running kernel or a crash dump. The kdbx debugger, a frontend to the dbx debugger, is tailored specifically to debugging kernel code and displays kernel data in a readable format. The debugger is extensible and customizable, allowing you to create commands that are tailored to your kernel debugging needs.

You can also use extensions to check resource usage (for example, CPU usage). See kdbx(8) for more information.

Name: ladebug
Use: Debugs kernels and applications
Description: Debugs programs and the kernel and helps locate run-time programming errors. The ladebug symbolic debugger is an alternative to the dbx debugger and provides both command-line and graphical user interfaces and support for debugging multithreaded programs. See the Ladebug Debugger Manual and ladebug(1) for more information.