8 Profiling Programs to Improve Performance

Profiling is a method of identifying sections of code that consume large portions of execution time. In a typical program, most execution time is spent in relatively few sections of code. To improve performance, the greatest gains result from improving coding efficiency in time-intensive sections.

8.1 Overview

Tru64 UNIX supports four approaches to performance improvement:

Automatic and profile-directed optimizations (see Section 8.4).

Manual design and code optimizations (see Section 8.5).

Minimizing system-resource usage (see Section 8.6).

Verifying the significance of test cases (see Section 8.7).

One approach might be enough, but more might be beneficial if no single approach addresses all aspects of a program's performance. This chapter describes each approach and the tools provided by Tru64 UNIX to support them. In addition, the following topics are covered in this chapter:

Selecting profiling information to display (Section 8.8)

Merging profile data files (Section 8.9)

Profiling multithreaded applications (Section 8.10)

Using monitor routines to control profiling (Section 8.11)

For more information, see the following reference pages:

Profiling: cc(1), hiprof(1), pixie(1), third(1), uprofile(1), prof(1), gprof(1)

System monitoring: ps(1), swapon(8), vmstat(1)

Performance Manager, available from the Tru64 UNIX Associated Products Volume 1 installation media: pmgr(8X)

Graphical tools, available from the Graphical Program Analysis subset of the Tru64 UNIX Associated Products Volume 1 installation media, or as part of the Compaq Enterprise Toolkit for Windows/NT desktops with the Microsoft VisualStudio97: dxheap(1), dxprof(1), mview(1), pview(1)

Visual Threads, available from the Tru64 UNIX Associated Products Volume 1 installation media: dxthreads(1). Use Visual Threads to analyze a multithreaded appplication for potential logic and performance problems.

The System Configuration and Tuning manual.

8.2 Profiling Sample Program

The examples in the remainder of this chapter refer to the program sample, whose source code is contained in the files profsample.c (main program), add_vector.c, mul_by_scalar.c, print_it.c, and profsample.h. These files are shown in Example 8-1.

Example 8-1: Profiling Sample Program

****************** profsample.c *************
#include <math.h>
#include <stdio.h>
#include "profsample.h"
 
static void mul_by_pi(double ary[])
{
    mul_by_scalar(ary, LEN/2, 3.14159);
}
 
void main()
{
    double ary1[LEN];
    double *ary2;
    double sum = 0.0;
    int i;
 
    ary2 = malloc(LEN * sizeof(double));
    for (i=0; i<LEN; i++) {
        ary1[i] = 0.0;
        ary2[i] = sqrt((double)i);
    }
    mul_by_pi(ary1);
    mul_by_scalar(ary2, LEN, 2.71828);
    for (i=0; i<100; i++)
        add_vector(ary1, ary2, LEN);
    for (i=0; i<100; i++)
	sum += ary1[i];
    if (sum < 0.0)
	print_it(0.0);
    else
	print_it(sum);
}
****************** profsample.h: ********************
 
void mul_by_scalar(double ary[], int len, double num);
void add_vector(double arya[], double aryb[], int len);
void print_it(double value);
#define LEN 100000
 
***************** add_vector.c: ********************
 
#include "profsample.h"
 
void add_vector(double arya[], double aryb[], int len)
{
    int i;
    for (i=0; i<LEN; i++) {
        arya[i] += aryb[i];
    }
}
**************** mul_by_scalar.c: ******************
 
#include "profsample.h"
 
void mul_by_scalar(double ary[], int len, double num)
{
    int i;
    for (i=0; i<LEN; i++) {
        ary[i] *= num;
    }
}
**************** print_it.c: **********************
 
#include <stdio.h>
#include "profsample.h"
 
void print_it(double value)
{
    printf("Value = %f\n", value);
}

8.3 Compilation Options for Profiling

When using debug and optimization options with the cc command, note the following considerations. They apply to all the profiling tools discussed in this chapter unless stated otherwise.

The -g1 option provides the minimum debug information needed (that is, line numbers and procedure names) and is sufficient for all profilers. The cc command default, -g0, is tolerated but provides no names for local (for example, static) procedures. The -g2 and higher options provide less than optimal code as well as unneeded information.

When doing automatic or profile-directed optimization (Section 8.4), note that the cc command's -feedback option automatically applies the -g1 option. Also, you need to experiment to find the automatic optimization level (for example, -O3) that provides the best run-time performance for your program and compiler.

When doing manual optimization, note that none of the profiling tools show inlined procedures separately by their own names. Profiling statistics for an inlined procedure are included in the statistics for the calling procedure. For example, if proc1 calls proc2 (which is inlined), the statistics for proc1 will include the time spent in proc2. Therefore, to provide some optimization but with minimal inlining when profiling, use the -O2 (or -O) option. In some cases, you may need to specify the -inline none option to eliminate all inlining. This restriction does not apply to automatic optimization.

8.4 Automatic and Profile-Directed Optimizations

The following sections discuss automatic and profile-directed optimizations.

8.4.1 Techniques

Automatic and profile-directed optimizations are the simplest approaches to improving application performance.

Some degree of automatic optimization can be achieved by using the compiler's and linker's optimization options. These can help in the generation of minimal instruction sequences that make best use of the CPU architecture and cache memory.

However, the compiler and linker can improve their optimizations if they are given information on which instructions are executed most often when the program is run with its normal input data and environment. While the default optimizations give improved performance for most common situations, the optimizers can do even better if they can tune the program in favor of the heavily used instruction sequences as determined from a sample run.

Tru64 UNIX helps you provide the optimizers with this information on processing hot spots by allowing a profiler's results to be fed back into a recompilation. This customized, profile-directed optimization can be used in conjunction with automatic optimization.

8.4.2 Tools and Examples

The following sections discuss the tools used for automatic and profile directed optimizations.

8.4.2.1 Automatic Optimization

The cc command's automatic optimization options are selected with the -O, -fast, -inline, -om, and other related options. See cc(1) for details and Chapter 10 for more information on the many options and tradeoffs available.

For example, this command selects a high degree of optimization in both the compiler and the linker:

% cc -non_shared -O3 -om *.c

8.4.2.2 Profile-Directed Optimization

The pixie profiler provides profile information that the cc command's -feedback and -om options can use to tune the generated instruction sequences to the demands placed on the program by particular sets of input data. This technique sould be used only with executables. For shared libraries, use cord as described in Section 8.4.2.3.

The following example shows the three necessary basic steps in this process, which consist of (1) preparing the program for profile-directed optimization, (2) creating an instrumented version of the program and running it to collect profiling statistics, and (3) feeding that information back to the compiler and linker to help them optimize the executable code. Later examples show how to elaborate on these steps to accommodate ongoing changes during development and data from multiple profiling runs.

% cc -feedback sample -o sample -non_shared -O3 *.c -lm [1]

% pixie -update -quiet sample [2]

% cc -feedback sample -o sample -non_shared -om -O3 *.c -lm [3]

When the program is compiled with the -feedback option for the first time, a special augmented executable file is created. It contains information used by the compiler to relate the executable to the source files. It also contains a section later used to store profiling feedback information for the compiler. This section remains empty after the first compilation, because no feedback information has yet been generated by the pixie profiler (step 2). The file name specified with the -feedback option should be the same as the executable file name, which in this example is sample (from the -o option). By default, the -feedback option applies the -g1 option, which provides optimum symbolization for profiling. (For information about the -g1 and -O3 options, see Section 8.3.) [Return to example]

The pixie command creates an instrumented version of the program (sample.pixie) and then runs it (because a prof option, -update, is specified). Execution statistics and address mapping data are automatically collected in an instruction-counts file (sample.Counts) and an instruction-addresses file (sample.Addrs). The -update option puts this profiling information in the augmented executable. The -quiet option prevents informational and progress messages from being printed. [Return to example]

In the second compilation with the -feedback option, the profiling information in the augmented executable guides the compiler and (through the -om option) the postlink optimizer. This customized feedback enhances any automatic optimization provided by the -O3 and -om options. When the -om option is used with nonshared executables, it is particularly effective because it optimizes all library routines, too. Compiler optimizations can be made even more effective by using the -ifo and/or -assume whole_program options in conjunction with the -feedback option. However, as noted in Section 10.1.1, the compiler may be unable to compile very large programs as if there were only one source file. [Return to example]

Example 8-2 shows the output for each of these steps. See pixie(1) and cc(1) for more information.

Example 8-2: Sample Profile-Directed Optimization Output

% cc -feedback sample -o sample -non_shared -O3 add_vector.c mul_by_scalar.c print_it.c profsample.c -lm
add_vector.c:
cc: Info: Feedback file sample does not exist (nofbfil)
cc: Info: Compilation will proceed without feedback optimizations (nofbopt)
mul_by_scalar.c:
cc: Info: Feedback file sample does not exist (nofbfil)
cc: Info: Compilation will proceed without feedback optimizations (nofbopt)
print_it.c:
cc: Info: Feedback file sample does not exist (nofbfil)
cc: Info: Compilation will proceed without feedback optimizations (nofbopt)
profsample.c:
cc: Info: Feedback file sample does not exist (nofbfil)
cc: Info: Compilation will proceed without feedback optimizations (nofbopt)
 
% pixie -update -quiet sample
Value = 540.930890

% cc -feedback sample -o sample -non_shared -om -O3 add_vector.c mul_by_scalar.c print_it.c profsample.c -lm
add_vector.c:
mul_by_scalar.c:
print_it.c:
profsample.c:

During a typical development process, steps 2 and 3 of the previous example are repeated as needed to reflect the impact of any changes to the source code. For example:

% cc -feedback sample -o sample -non_shared -O3 *.c -lm
% pixie -update -quiet sample
% cc -feedback sample -o sample -non_shared -O3 *.c -lm
[modify source code]
% cc -feedback sample -o sample -non_shared -O3 *.c -lm
.....
[modify source code]
% cc -feedback sample -o sample -non_shared -O3 *.c -lm
% pixie -update -quiet sample
% cc -feedback sample -o sample -non_shared -om -O3 *.c -lm

Because the profiling information in the augmented executable persists from compilation to compilation, the pixie processing step that updates the information does not have to be repeated every time a source module is modified and recompiled. But with each modification the relevance of the old feedback information to the actual code is reduced, and the potential quality of the optimization is degraded, depending on the exact modification. The pixie processing step after the last modification and recompilation guarantees that the feedback information is correctly updated for the last compilation.

The profiling information in an augmented executable file makes it larger than a normal executable (typically 3-5 percent). After development is completed, you can use the strip command to remove any profiling and symbol table information. For example:

% strip sample

You might want to run your instrumented program several times with different inputs to get an accurate picture of its profile. The following example explains how to merge profiling statistics from two runs of a program, prog, whose output varies from run to run with different sets of input:

% cc -feedback prog -o prog -non_shared -O3 *.c [1]

% pixie -pids prog [2]

% prog.pixie [3]
(input set 1)
% prog.pixie
(input set 2)

% prof -pixie -update prog prog.Counts.* [4]

% cc -feedback prog -o prog -non_shared -om -O3 *.c [5]

The first compilation produces an augmented executable, as explained in the previous example. [Return to example]

By default, each run of the instrumented program (prog.pixie) produces a profiling data file called prog.Counts. The -pids option adds the process ID of each of the instrumented program's test runs to the name of the profiling data file produced (that is, prog.Counts.pid). Thus, the data files produced by subsequent runs do not overwrite each other. [Return to example]

The instrumented program is run twice, producing a uniquely named data file each time -- for example, prog.Counts.371 and prog.Counts.422. [Return to example]

The prof -pixie command merges the two data files. The -update option updates the executable, prog, with the combined information. [Return to example]

The second compilation step uses the combined profiling information from the two runs of the program to guide the optimization. [Return to example]

8.4.2.3 Profile-Directed Reordering

You can use a different kind of feedback file (created with the pixie or prof -pixie command) as input to the cord utility. The cord utility reorders the procedures in an executable program or shared library to improve cache utilization. The following example shows how to create a feedback file and then use the -cord option to compile an executable with the feedback file as input:

% cc -O3 -o sample *.c -lm
% pixie -feedback sample.fb sample [1]
% cc -O3 -cord -feedback sample.fb -o sample *.c -lm [2]

The pixie command creates an instrumented version of the program and also runs it (because a prof option, -feedback is specified). The -feedback option creates a feedback file (sample.fb) that collects execution statistics to be used by the compiler in the next step. [Return to example]

The cc command's -feedback option accepts the feedback file as input. The -cord option invokes the cord utility. [Return to example]

Compiling a shared library with feedback is similar. Profile the shared library with one or more programs that exercise the library code that most needs optimizing. For example:

% cc -o libexample.so -shared -g1 -O3 lib*.c [1]
% cc -o exerciser -O3 exerciser.c -L. -lexample [2]
% pixie -L. -incobj libexample.so -run exerciser [3]
% prof -pixie -feedback libexample.fb libexample.so exerciser.Counts [4]
% cc -cord -feedback libexample.fb -o libexample.so -shared -g1 -O3 lib*.c [5]

The shared library is compiled with the -g1 option to give feedback data for each source line. [Return to example]

A program to exercise the important parts of the library is built. [Return to example]

The shared library and program are instrumented and run to profile them. [Return to example]

A feedback file is generated for just the shared library. [Return to example]

The shared library is recompiled, relinked, and reordered to optimize the performance of the code that the profile shows is most heavily used. [Return to example]

Use a feedback file generated with the same optimization level.

You can also use cord with the runcord utility. For more information, see pixie(1), prof(1), cord(1), and runcord(1).

8.5 Manual Design and Code Optimizations

The following sections discuss the techniques and tools used for manual design and code optimizations.

8.5.1 Techniques

The effectiveness of the automatic optimizations described in Section 8.4 is limited by the efficiency of the algorithms that the program uses. You can further improve a program's performance by manually optimizing its algorithms and data structures. Such optimizations may include reducing complexity from N-squared to log-N, avoiding copying of data, and reducing the amount of data used. It may also extend to tuning the algorithm to the architecture of the particular machine it will be run on -- for example, processing large arrays in small blocks such that each block remains in the data cache for all processing, instead of the whole array being read into the cache for each processing phase.

Tru64 UNIX supports manual optimization with its profiling tools, which identify the parts of the application that impose the highest CPU load -- CPU cycles, cache misses, and so on. By evaluating various profiles of a program, you can identify which parts of the program use the most CPU resources, and you can then redesign or recode algorithms in those parts to use less resources. The profiles also make this exercise more cost-effective by helping you to focus on the most demanding code rather than on the least demanding.

8.5.2 Tools and Examples

The following sections discuss the tools used for CPU-time profiling with call graph and for CPU-time/event profiling with source lines and instructions.

8.5.2.1 CPU-Time Profiling with Call Graph

A call-graph profile shows how much CPU time is used by each procedure, and how much is used by all the other procedures that it calls. This profile can show which phases or subsystems in a program spend most of the total CPU time, which can help in gaining a general understanding of the program's performance. This section describes two tools that provide this information:

The hiprof profiler

The cc command's -pg option

Both tools are used with the gprof tool, implicitly or explicitly, to format and display the profile.

The optional dxprof command provides a graphical display of CPU-time call-graph profiles.

Using the hiprof Profiler

The hiprof profiler (see hiprof(1)) instruments the program and generates a call graph while the instrumented program executes. This profiler does not require that the program be compiled in any particular way except as indicated in Section 8.3. The hiprof command can generate a call-graph profile for shared libraries and for program code, with moderate optimization and minimal debug information. For example:

% cc -o libsample.so -shared -g1 -O2 add_vector.c mul_by_scalar.c print_it.c [1]

% cc -o sample -g1 -O2 profsample.c -L. -lsample -lm [2]

% hiprof -numbers -L. -inc libsample.so sample [3]

A shared library, libsample.so, is created from three source modules, with debug information and optimization as indicated in Section 8.3. [Return to example]

The source module profsample is compiled and linked against the shared library libsample.so (located in the current directory) to produce the executable sample. [Return to example]

The -inc[obj] option tells hiprof to profile libsample.so in addition to the executable (sample). The hiprof command creates an instrumented version of the program (sample.hiprof). Because at least one gprof option (-numbers) is specified, hiprof then automatically runs that instrumented program to create a profile data file (sample.hiout) and then runs gprof to display the profile. The -numbers option prints each procedure's starting line number, source-file name, and library name. This helps identify any multiple static procedures with the same name. [Return to example]

The resulting sample profile is shown in Example 8-3. The call-graph profile estimates the total cost of calling a procedure, including other procedures that it calls. The estimation assumes that each call to a given procedure takes the same amount of time; this may not be true in many cases, but it is always accurate when there is only one caller.

By default, hiprof uses a low-frequency sampling technique. This can profile all the code executed by the program, including all selected libraries. It also provides a call-graph profile of all selected procedures (except those in the threads-related system libraries) and detailed profiles at the level of source lines or machine instructions (if selected).

Example 8-3: Sample hiprof Default Profile Using gprof

% cc -o libsample.so -shared -g1 -O2 add_vector.c mul_by_scalar.c print_it.c
add_vector.c:
mul_by_scalar.c:
print_it.c:
 
% cc -o sample -g1 -O2 profsample.c -L. -lsample -lm
 
% hiprof -numbers -L. -inc libsample.so sample
hiprof: info: instrumenting sample ...
 
hiprof: info: the LD_LIBRARY_PATH environment variable is not defined
hiprof: info: setting LD_LIBRARY_PATH=.:. [1]
hiprof: info: running instrumented program sample.hiprof ...
 
Value = 179804.149985
 
hiprof: info: instrumented program exited with status 0
hiprof: info: displaying profile for sample ...
 
gprof -b -scaled -all -numbers -L. sample.hiprof ./sample.hiout [2]
 
*********************** call-graph profile ******************* [3]
 
granularity: samples per 4 bytes; units: seconds*1e-3; total: 323e-3 seconds
 
                                    called /     total    parents
index  %total   self  descendents   called +      self  name      index
                                    called /     total    children
 
                     2       321         1 /         1    __start [2]
[1]    100.0         2       321         1              main [1] [4]
                   318         0       100 /       100    add_vector [3] [5]
                     3         0         2 /         2    mul_by_scalar [4]
                     0         0         1 /         1    print_it [5]
 
------------------------------------------------------
 
                   318         0       100 /       100    main [1] [6]
[3]     98.5       318         0       100              add_vector [3]
 
------------------------------------------------------
 
                     3         0         2 /         2    main [1]
[4]      0.9         3         0         2              mul_by_scalar [4]
 
------------------------------------------------------
 
                     0         0         1 /         1    main [1]
[5]      0.0         0         0         1              print_it [5]
 
------------------------------------------------------
 
*********************** timing profile section ***************
 
granularity: samples per 4 bytes; units: seconds*1e-3; total: 323e-3 seconds
 
  %   cumulative     self           self/call total/call
total     units     units     calls   seconds   seconds  name
 98.5       318       318       100   3184e-6   3184e-6 add_vector [3]
  0.9       321         3         2   1465e-6   1465e-6 mul_by_scalar [4]
  0.6       323         2         1   1953e-6 323242e-6 main [1]
  0.0       323         0         1         0         0 print_it [5]
 
*********************** index section ************************
Index by function name - 4 entries
 
   [3] add_vector         :"add_vector.c":1
   [1] main               :"profsample.c":12
   [4] mul_by_scalar      :"mul_by_scalar.c":1
   [5] print_it           :"print_it.c":4

The LD_LIBRARY_PATH environment variable is automatically set to point to the working directory, where the instrumented shared library libsample is located (see Section 8.8.4.1). [Return to example]

The automatically generated gprof command line uses the -scaled option by default, which can display profiles with CPU-cycle granularity and millisecond units if the procedures selected for display have short run times. [Return to example]

The gprof output comprises three sections: a call-graph profile, a timing profile, and an index (a concise means of identifying each procedure). In this example, the three sections have been separated by rows of asterisks (with the section names) that do not appear in the output produced by gprof. In the call-graph profile section, each routine in the program has its own subsection that is contained within dashed lines and identified by the index number in the first column. [Return to example]

This line describes the main routine, which is the subject of this portion of the call-graph profile because it is the leftmost routine in the rightmost column of this section. The index number [2] in the first column corresponds to the index number [2] in the index section at the end of the output. The percentage in the second column reports the total amount of time in the subgraph that is accounted for by main and its descendants, in this case add_vector, mul_by_scalar, and print_it. The 1 in the called column indicates the total number of times that the main routine is called. [Return to example]

This line describes the relationship of add_vector to main. Because add_vector is below main in this section, add_vector is identified as the child of main. The fraction 100/100 indicates that of the total of 100 calls to add_vector (the denominator), 100 of these calls came from main (the numerator). [Return to example]

This line describes the relationship of main to add_vector. Because main is listed above add_vector in the last column, main is identified as the parent of add_vector. [Return to example]

For nonthreaded programs, hiprof can alternatively count the number of machine cycles used or page faults suffered by the program. The cost of each parent procedure's calls is accurately calculated, making the call-graph profile much more useful than the one in the default mode, which can only estimate the costs of parents. Also, the CPU time (in nanosecond units for short tests) or page-faults count reported for the instrumented routines includes that for the uninstrumented routines that they call. This can summarize the costs and reduce the run-time overhead, but note that the machine-cycle counter wraps if no instrumented procedure is called at least every few seconds.

In the following example, the hiprof command's -cycles option is used to display a profile of the machine cycles used by the sample program:

% cc -o libsample.so -shared -g1 -O2 add_vector.c mul_by_scalar.c print_it.c
% cc -o sample -g1 -O2 profsample.c -L. -lsample -lm
% hiprof -cycles -numbers -L. -inc libsample.so sample

The resulting sample profile is shown in Example 8-4.

Example 8-4: Sample hiprof -cycles Profile Using gprof

% cc -o libsample.so -shared -g1 -O2 add_vector.c mul_by_scalar.c print_it.c
add_vector.c:
mul_by_scalar.c:
print_it.c:
 
% cc -o sample -g1 -O2 profsample.c -L. -lsample -lm
 
% hiprof -cycles -numbers -L. -inc libsample.so sample
hiprof: info: instrumenting sample ...
 
hiprof: info: the LD_LIBRARY_PATH environment variable is not defined
hiprof: info: setting LD_LIBRARY_PATH=.:.
hiprof: info: running instrumented program sample.hiprof ...
 
Value = 179804.149985
 
hiprof: info: instrumented program exited with status 0
hiprof: info: displaying profile for sample ...

gprof -b -scaled -all -numbers -L. sample ./sample.hiout
 
granularity: cycles; units: seconds*1e-9; total: 362320292e-9 seconds
 
                                    called /     total    parents 
index  %total   self  descendents   called +      self  name    	index
                                    called /     total    children
 
                                                          <spontaneous>
[1]    100.0    893310 361426982         1              __start [1]
                       361371860         1 /         1    main [2]
 
------------------------------------------------------
 
                       361371860         1 /         1    __start [1]
[2]     99.7  36316218 325055642         1              main [2]
                       321107275       100 /       100    add_vector [3]
                         3519530         2 /         2    mul_by_scalar [4]
                          428838         1 /         1    print_it [5]
 
------------------------------------------------------
 
                       321107275       100 /       100    main [2]
[3]     88.6 321107275         0       100              add_vector [3]
 
------------------------------------------------------
 
                         3519530         2 /         2    main [2]
[4]      1.0   3519530         0         2              mul_by_scalar [4]
 
------------------------------------------------------
 
                          428838         1 /         1    main [2]
[5]      0.1    428838         0         1              print_it [5]
 
------------------------------------------------------
 
granularity: cycles; units: seconds*1e-9; total: 362320292e-9 seconds
 
  %   cumulative     self           self/call total/call        
total     units     units     calls   seconds   seconds  name    
 88.6 321107275 321107275       100   3211e-6   3211e-6 add_vector [3]
 10.0 357423492  36316218         1  36316e-6 361372e-6 main [2]
  1.0 360943022   3519530         2   1760e-6   1760e-6 mul_by_scalar [4]
  0.2 361836332    893310         1 893310e-9 362320e-6 __start [1]
  0.1 362265170    428838         1 428838e-9 428838e-9 print_it [5]
 
Index by function name - 11 entries
 
   [1] __start            <sample>
   [3] add_vector         <libsample.so>:"add_vector.c":1
   [2] main               <sample>:"profsample.c":12
   [4] mul_by_scalar      <libsample.so>:"mul_by_scalar.c":1
   [5] print_it           <libsample.so>:"print_it.c":4

Using the cc Command's -pg Option

The cc command's -pg option uses the same sampling technique as hiprof, but the program needs to be instrumented by compiling with the -pg option. (The program also needs to be compiled with the debug and optimization options indicated in Section 8.3.) Only the executable is profiled (not shared libraries), and few system libraries are compiled to generate a call-graph profile; therefore, hiprof may be preferred. However, the cc command's -pg option and gprof are supported in a very similar way on different vendors' UNIX systems, so this may be an advantage. For example:

 
% cc -pg -o sample -g1 -O2 *.c -lm [1]
 
% ./sample [2]
 
% gprof -scaled -b -numbers sample [3]

The cc command's -pg call-graph profiling option creates an instrumented version of the program, sample. (You must specify the -pg option for both the compile and link phases.) [Return to example]

Running the instrumented program produces a profiling data file (named gmon.out, by default) to be used by the gprof tool. For information about working with multiple data files, see Section 8.9. [Return to example]

The gprof command extracts information from the profiling data file and displays it. The -scaled option displays CPU time in units that give the best precision without exceeding the report's column widths. The -b option suppresses the printing of a description of each field in the profile. [Return to example]

The resulting sample profile is shown in Example 8-5. The gprof tool estimates the total cost of calling a procedure (including its calls) in its call-graph profile.

Example 8-5: Sample cc -pg Profile Using gprof

% cc -pg -o sample -g1 -O2 add_vector.c mul_by_scalar.c print_it.c profsample.c -lm
add_vector.c:
mul_by_scalar.c:
print_it.c:
profsample.c:
 
% ./sample
Value = 179804.149985
 
% gprof -scaled -b -numbers sample
 
granularity: samples per 8 bytes; units: seconds*1e-3; total: 326e-3 seconds
 
                                    called /     total    parents 
index  %total   self  descendents   called +      self  name      index
                                    called /     total    children
 
                     5       321         1 /         1    __start [2]
[1]    100.0         5       321         1              main [1]
                   317         0       100 /       100    add_vector [3]
                     4         0         2 /         2    mul_by_scalar [4]
                     0         0         1 /         1    print_it [5]
 
------------------------------------------------------
 
                   317         0       100 /       100    main [1]
[3]     97.3       317         0       100              add_vector [3]
 
------------------------------------------------------
 
                     4         0         2 /         2    main [1]
[4]      1.2         4         0         2              mul_by_scalar [4]
 
------------------------------------------------------
 
                     0         0         1 /         1    main [1]
[5]      0.0         0         0         1              print_it [5]
 
------------------------------------------------------
 
granularity: samples per 8 bytes; units: seconds*1e-3; total: 326e-3 seconds
 
  %   cumulative     self           self/call total/call        
total     units     units     calls   seconds   seconds  name    
 97.3       317       317       100   3174e-6   3174e-6 add_vector [3]
  1.5       322         5         1   4883e-6 326172e-6 main [1]
  1.2       326         4         2   1953e-6   1953e-6 mul_by_scalar [4]
  0.0       326         0         1         0         0 print_it [5]

Index by function name - 4 entries
 
   [3] add_vector         <sample>:"add_vector.c":1
   [1] main               <sample>:"profsample.c":12
   [4] mul_by_scalar      <sample>:"mul_by_scalar.c":1
   [5] print_it           <sample>:"print_it.c":4

8.5.2.2 CPU-Time/Event Profiles for Sourcelines/Instructions

A good performance-improvement strategy may start with a procedure-level profile of the whole program (possibly with a call graph, to provide an overall picture), but it will often progress to detailed profiling of individual source lines and instructions. The following tools provide this information:

The uprofile profiler

The hiprof profiler

The cc command's -p option

The pixie profiler

Using the uprofile Profiler

The uprofile profiler (see uprofile(1)) uses a sampling technique to generate a profile of the CPU time or events, such as cache misses, associated with each procedure or source line or instruction. The sampling frequency depends on the processor type and the statistic being sampled, but it is on the order of a millisecond for CPU time. The profiler achieves this without modifying the application program at all, by using hardware counters that are built into the Alpha CPU. Running the uprofile command with no arguments yields a list of all the kinds of events that a particular machine can profile, depending on the nature of its architecture. The default is to profile machine cycles, resulting in a CPU-time profile. The following example shows how to display a profile of the instructions that used the top 90 percent of CPU time:

% cc -o sample -g1 -O2 *.c -lm [1]

% uprofile -asm -quit 90cum% sample [2]

The resulting sample profile, which includes explanatory text, is shown in Example 8-6.

For information about the -g1 and -O2 options, see Section 8.3. [Return to example]

The uprofile command runs the sample program, collecting the performance counter data into a profile data file (named umon.out, by default). Because prof options (-asm and -quit) are specified, uprofile then automatically runs the prof tool to display the profile.
The -asm option provides per-instruction profiling of cycles (and other CPU statistics, such as data cache misses, if specified). Because no counter statistics are specified here, uprofile displays a CPU-time profile for each instruction. The -quit 90cum% option truncates the profile after 90 percent of the whole has been printed (see Section 8.8.3). (Also available are the -heavy option, which reports the lines that executed the most instructions, and the -lines option, which reports the profile per source line within each procedure (see Section 8.8.2). [Return to example]

Example 8-6: Sample uprofile CPU-Time Profile Using prof

% cc -o sample -g1 -O2 add_vector.c mul_by_scalar.c print_it.c profsample.c -lm
add_vector.c:
mul_by_scalar.c:
print_it.c:
profsample.c:
 
% uprofile -asm -quit 90cum% sample
Value = 179804.149985
Writing umon.out
 
Displaying profile for sample:
 
Profile listing generated Thu Dec  3 10:29:25 1998 with:
   prof -asm -quit 90cum% sample umon.out
 
----------------------------------------------------------------------------
*  -a[sm] using performance counters:                                      *
*    cycles0: 1 sample every 65536 Cycles (0.000164 seconds)               *
*  sorted in descending order by total time spent in each procedure;       *
*  unexecuted procedures excluded                                          *
----------------------------------------------------------------------------
 
Each sample covers 4.00 byte(s) for 0.052% of 0.3123 seconds
 
      millisec      %  cum %      address:line  instruction
 
add_vector (add_vector.c)
           0.0   0.00   0.00  0x120001260:2     addl    zero, a2, a2
           0.0   0.00   0.00  0x120001264:5     bis     zero, zero, t0
           0.0   0.00   0.00  0x120001268:5     ble     a2, 0x12000131c
           0.0   0.00   0.00  0x12000126c:5     subl    a2, 0x3, t1
           0.0   0.00   0.00  0x120001270:5     cmple   t1, a2, t2
           0.0   0.00   0.00  0x120001274:5     beq     t2, 0x1200012f4
           0.0   0.00   0.00  0x120001278:5     ble     t1, 0x1200012f4
           0.0   0.00   0.00  0x12000127c:5     subq    a0, a1, t3
           0.0   0.00   0.00  0x120001280:5     lda     t3, 31(t3)
           0.0   0.00   0.00  0x120001284:5     cmpule  t3, 0x3e, t3
           0.0   0.00   0.00  0x120001288:5     bne     t3, 0x1200012f4
           0.0   0.00   0.00  0x12000128c:5     ldq_u   zero, 0(sp)
          20.2   6.45   6.45  0x120001290:6     ldl     zero, 128(a1)
           1.3   0.42   6.87  0x120001294:6     lds     $f31, 128(a0)
          35.6  11.39  18.26  0x120001298:6     ldt     $f0, 0(a1)
          20.0   6.40  24.66  0x12000129c:6     ldt     $f1, 0(a0)
           9.7   3.10  27.75  0x1200012a0:6     ldt     $f10, 8(a1)
          14.9   4.77  32.53  0x1200012a4:6     ldt     $f11, 8(a0)
          17.4   5.56  38.09  0x1200012a8:6     ldt     $f12, 16(a1)
           7.0   2.26  40.35  0x1200012ac:6     ldt     $f13, 16(a0)
           8.2   2.62  42.97  0x1200012b0:6     ldt     $f14, 24(a1)
          12.9   4.14  47.11  0x1200012b4:6     ldt     $f15, 24(a0)
          24.9   7.97  55.09  0x1200012b8:6     addt    $f1,$f0,$f0
          24.7   7.92  63.01  0x1200012bc:6     addt    $f11,$f10,$f10
          37.8  12.12  75.13  0x1200012c0:6     addt    $f13,$f12,$f12
          39.2  12.54  87.67  0x1200012c4:6     addt    $f15,$f14,$f14
           0.8   0.26  87.93  0x1200012c8:5     addl    t0, 0x4, t0
           0.0   0.00  87.93  0x1200012cc:5     lda     a1, 32(a1)
           8.4   2.68  90.61  0x1200012d0:6     stt     $f0, 0(a0)

By comparison, the following example shows how to display a profile of the instructions that suffered the top 90 percent of data cache misses on an EV56 Alpha system:

% cc -o sample -g1 -O2 *.c -lm
% uprofile -asm -quit 90cum% dcacheldmisses sample

The resulting sample profile is shown in Example 8-7.

Example 8-7: Sample uprofile Data-Cache-Misses Profile Using prof

% uprofile -asm -quit 90cum% dcacheldmisses sample
Value = 179804.149985
Writing umon.out
 
Displaying profile for sample:
 
Profile listing generated Thu Dec  3 10:34:25 1998 with:
   prof -asm -quit 90cum% sample umon.out
 
----------------------------------------------------------------------------
*  -a[sm] using performance counters:                                      *
*    dcacheldmiss: 1 sample every 16384 DCache LD Misses [1]                *
*  sorted in descending order by samples recorded for each procedure;      *
*  unexecuted procedures excluded                                          *
----------------------------------------------------------------------------
 
Each sample covers 4.00 byte(s) for 0.18% of 550 samples [2]
 
       samples      %  cum %      address:line  instruction
 
add_vector (add_vector.c)
           0.0   0.00   0.00  0x120001260:2     addl    zero, a2, a2
           0.0   0.00   0.00  0x120001264:5     bis     zero, zero, t0
           0.0   0.00   0.00  0x120001268:5     ble     a2, 0x12000131c
           0.0   0.00   0.00  0x12000126c:5     subl    a2, 0x3, t1
           0.0   0.00   0.00  0x120001270:5     cmple   t1, a2, t2
           0.0   0.00   0.00  0x120001274:5     beq     t2, 0x1200012f4
           0.0   0.00   0.00  0x120001278:5     ble     t1, 0x1200012f4
           0.0   0.00   0.00  0x12000127c:5     subq    a0, a1, t3
           0.0   0.00   0.00  0x120001280:5     lda     t3, 31(t3)
           0.0   0.00   0.00  0x120001284:5     cmpule  t3, 0x3e, t3
           0.0   0.00   0.00  0x120001288:5     bne     t3, 0x1200012f4
           0.0   0.00   0.00  0x12000128c:5     ldq_u   zero, 0(sp)
           1.0   0.18   0.18  0x120001290:6     ldl     zero, 128(a1)
           3.0   0.55   0.73  0x120001294:6     lds     $f31, 128(a0)
          62.0  11.27  12.00  0x120001298:6     ldt     $f0, 0(a1) [3]
          64.0  11.64  23.64  0x12000129c:6     ldt     $f1, 0(a0)
           8.0   1.45  25.09  0x1200012a0:6     ldt     $f10, 8(a1)
          47.0   8.55  33.64  0x1200012a4:6     ldt     $f11, 8(a0)
          13.0   2.36  36.00  0x1200012a8:6     ldt     $f12, 16(a1)
          38.0   6.91  42.91  0x1200012ac:6     ldt     $f13, 16(a0)
          15.0   2.73  45.64  0x1200012b0:6     ldt     $f14, 24(a1)
          51.0   9.27  54.91  0x1200012b4:6     ldt     $f15, 24(a0)
          49.0   8.91  63.82  0x1200012b8:6     addt    $f1,$f0,$f0
         142.0  25.82  89.64  0x1200012bc:6     addt    $f11,$f10,$f10
          13.0   2.36  92.00  0x1200012c0:6     addt    $f13,$f12,$f12

The stated sampling rate of 1 sample every 16384 means that each sample shown in the profile occurred after 16384 data cache misses, but not all of these occurred at the instruction indicated. [Return to example]

The total number of samples is shown, not the number of data cache misses. [Return to example]

Indicates the number of samples recorded at an instruction, which statistically implies a proportionate number of data cache misses. [Return to example]

The uprofile profiling technique has the advantage of very low run-time overhead. Also, the detailed information it can provide on the costs of executing individual instructions or source lines is essential in identifying exactly which operation in a procedure is slowing the program down.

The disadvantages of uprofile are as follows:

Only executables can be profiled. To profile code in a library, you must first link the program with the -non_shared option.

Only one program can be profiled with the hardware counters at one time.

Threads cannot be profiled individually.

The Alpha EV6 architecture's execution of instructions out of sequence can significantly reduce the accuracy of fine-grained profiles.

Using the hiprof Profiler

As noted earlier, the hiprof command's default PC-sampling technique can also generate CPU-time profiles like those of uprofile, using a sampling frequency of a little under a millisecond. The profile is not as accurate, because the call counting affects the performance, but it has some advantages:

Shared libraries can be profiled.

Threads can be individually profiled (at the cost of very large memory and data file size).

It is independent of hardware resources and architecture.

In the following example, the hiprof command's -lines option is used to display a profile of the CPU time used by each source line, grouped by procedure:

% cc -o libsample.so -shared -g1 -O2 add_vector.c mul_by_scalar.c print_it.c
% cc -o sample -g1 -O2 profsample.c -L. -lsample -lm
% hiprof -lines -numbers -L. -inc libsample.so sample

The resulting sample profile is shown in Example 8-8.

Example 8-8: Sample hiprof -lines PC-Sampling Profile

% cc -o libsample.so -shared -g1 -O2 add_vector.c mul_by_scalar.c print_it.c
add_vector.c:
mul_by_scalar.c:
print_it.c:
 
% cc -o sample -g1 -O2 profsample.c -L. -lsample -lm
 
% hiprof -lines -numbers -L. -inc libsample.so sample
hiprof: info: instrumenting sample ...
 
hiprof: info: the LD_LIBRARY_PATH environment variable is not defined
hiprof: info: setting LD_LIBRARY_PATH=.:.
hiprof: info: running instrumented program sample.hiprof ...
 
Value = 179804.149985
 
hiprof: info: instrumented program exited with status 0
hiprof: info: displaying profile for sample ...
 
gprof -b -scaled -all -lines -numbers -L. sample.hiprof ./sample.hiout
 
Milliseconds per source line, in source order within functions
 
procedure (file)                           line bytes   millisec      %  cum %
 
add_vector (add_vector.c)                     2    52        0.0   0.00   0.00
                                              5    92        1.0   0.30   0.30
                                              6    92      318.4  98.19  98.49
                                              8     4        0.0   0.00  98.49
mul_by_scalar (mul_by_scalar.c)               2    52        0.0   0.00  98.49
                                              5    72        0.0   0.00  98.49
                                              6    64        3.9   1.20  99.70
                                              8     4        0.0   0.00  99.70
main (profsample.c)                           9    32        0.0   0.00  99.70
                                             12   128        0.0   0.00  99.70
                                             16     8        0.0   0.00  99.70
                                             19    32        0.0   0.00  99.70
                                             20    36        0.0   0.00  99.70
                                             21    16        0.0   0.00  99.70
                                             22    24        0.0   0.00  99.70
                                             24     4        0.0   0.00  99.70
                                             25    36        0.0   0.00  99.70
                                             26    16        0.0   0.00  99.70
                                             27    28        1.0   0.30 100.00
                                             28    20        0.0   0.00 100.00
                                             29    40        0.0   0.00 100.00
                                             30     4        0.0   0.00 100.00
                                             31    20        0.0   0.00 100.00
                                             32     4        0.0   0.00 100.00
                                             33    20        0.0   0.00 100.00
                                             34    56        0.0   0.00 100.00

Using the cc Command's -p Option

The cc command's -p option uses a low-frequency sampling technique to generate a profile of CPU time that is similar to uprofile's but statistically less accurate. However, the -p option does offer the following advantages:

Shared libraries can be profiled

Threads can be individually profiled

Independent of hardware resources and architecture

Common to many UNIX operating systems

On Tru64 UNIX, can profile all the shared libraries used by a program

The program needs to be relinked with the -p option, but it does not need to be recompiled from source so long as the original compilation used an acceptable debug level, such as the -g1 cc command option (see Section 8.3). For example, to profile individual source lines and procedures of a program (if the program runs for long enough to generate a dense sample array):

% cc -p -o sample -g1 -O2 *.c -lm [1] 
% setenv PROFFLAGS '-all -stride 1' [2] 
% ./sample [3] 
% prof -all -proc -heavy -numbers sample [4]

The cc command's -p PC-sample-profiling option creates an instrumented version of the program, called sample. [Return to example]

The -all option specified with the PROFFLAGS environment variable asks for all shared libraries to be profiled (see Section 8.8.4). This causes sqrt (from libm.so) to show up in the profile as the second highest CPU-time user. The variable must be set before the instrumented program is run.
The -stride 1 option in PROFFLAGS asks for a separate counter to be used for each instruction, to allow accurate per-source-line profiling with prof's -heavy option. [Return to example]

Running the instrumented program produces a PC-sampling data file called mon.out, by default, to be used by the prof tool. For information about working with multiple data files, see Section 8.9. [Return to example]

The prof tool (see prof(1)) uses the PC-sampling data file to produce the profile. Because this technique works by periodic sampling of the program counter, you might see different output when you profile the same program multiple times.
When running prof manually, as in this example, you can filter which shared libraries to include in the displayed profile; the -all option tells prof to include all libraries (see Section 8.8.4). The -proc[edures] option profiles the instructions executed in each procedure and the calls to procedures. The -heavy option reports the lines that executed the most instructions. (Also, -lines shows per-line profiles and -asm shows per-instruction profiles, both grouped by procedure.) [Return to example]

The resulting sample profile is shown in Example 8-9.

Example 8-9: Sample cc -p Profile Using prof

% cc -p -o sample -g1 -O2 add_vector.c mul_by_scalar.c print_it.c profsample.c -lm
add_vector.c:
mul_by_scalar.c:
print_it.c:
profsample.c:
 
% setenv PROFFLAGS "-all -stride 1"
 
% ./sample
Value = 179804.149985
 
% prof -all -proc -heavy -numbers sample
Profile listing generated Mon Feb 23 15:33:07 1998 with: 
   prof -all -proc -heavy -numbers sample  
 
----------------------------------------------------------------------------
*  -p[rocedures] using pc-sampling;                                        *
*  sorted in descending order by total time spent in each procedure;       *
*  unexecuted procedures excluded                                          *
----------------------------------------------------------------------------
 
Each sample covers 4.00 byte(s) for 0.25% of 0.3955 seconds 
 
%time     seconds  cum %   cum sec  procedure (file)     
 
 93.1      0.3682   93.1      0.37 add_vector (<sample>:"add_vector.c":1)
  5.4      0.0215   98.5      0.39 sqrt (<libm.so>)
  1.0      0.0039   99.5      0.39 mul_by_scalar (<sample>:"mul_by_scalar.c":1)
  0.5      0.0020  100.0      0.40 main (<sample>:"profsample.c":12)
 

----------------------------------------------------------------------------
*  -h[eavy] using pc-sampling;                                             *
*  sorted in descending order by time spent in each source line;           *
*  unexecuted lines excluded                                               *
----------------------------------------------------------------------------
 
Each sample covers 4.00 byte(s) for 0.25% of 0.3955 seconds 
 
procedure (file)                           line bytes   millisec      %  cum %
 
add_vector (add_vector.c)                     6    80      363.3  91.85  91.85
add_vector (add_vector.c)                     5    96        4.9   1.23  93.09
mul_by_scalar (mul_by_scalar.c)               6    60        3.9   0.99  94.07
main (profsample.c)                          20    36        2.0   0.49  94.57

Using the pixie Profiler

The pixie tool (see pixie(1)) can also profile source-lines and instructions (including shared libraries), but note that when it displays counts of cycles, it is actually reporting counts of instructions executed, not machine cycles. Its -truecycles 2 option can estimate the number of cycles that would be used if all memory accesses were satisfied by the cache, but programs can rarely cache enough of their data for this to be accurate, and only the Alpha EV4 and EV5 families can be fully simulated in this way. For example:

% cc -o sample -g1 -O2 *.c -lm [1]

% pixie -all -proc -heavy -quit 5 sample [2]

For information about the -g1 and -O2 options, see Section 8.3. [Return to example]

The pixie command creates an instrumented version of the program (sample.pixie) and an instruction-addresses file (sample.Addrs). Because prof options (-proc, -heavy, -quit) are specified, pixie automatically runs that instrumented program to create an instructions-counts file (sample.Counts) and then runs prof to display the profile, using the .Addrs and .Counts files as input.
The -all option asks for shared libraries to be profiled. Although a pixie profile can include shared libraries, system libraries (like libm.so, which contains the sqrt function) do not include source-line numbers, so they are not included in the -heavy option's per-line profile, and static procedures get proc_at... names created for them.
The -heavy option reports lines that executed the most instructions. The -proc[edures] option profiles the instructions executed in each procedure and the calls to procedures. The -quit 5 option truncates the report after 5 lines for each mode (-heavy, -proc[edures]). [Return to example]

The resulting sample profile is shown in Example 8-10.

Example 8-10: Sample pixie Profile Using prof

% cc -o sample -g1 -O2 add_vector.c mul_by_scalar.c print_it.c profsample.c -lm
add_vector.c:
mul_by_scalar.c:
print_it.c:
profsample.c:
 
% pixie -all -proc -heavy -quit 5 sample
pixie: info: instrumenting sample ...
 
pixie: info: the LD_LIBRARY_PATH environment variable is not defined
pixie: info: setting LD_LIBRARY_PATH=.
pixie: info: running instrumented program sample.pixie ...
 
Value = 179804.149985
 
pixie: info: instrumented program exited with status 0
pixie: info: displaying profile for sample ...

Profile listing generated Mon Feb 23 15:33:55 1998 with:
   prof -pixie -all -procedures -heavy -quit 5 sample ./sample.Counts 
 
----------------------------------------------------------------------------
*  -p[rocedures] using basic-block counts;                                 *
*  sorted in descending order by the number of cycles executed in each     *
*  procedure; unexecuted procedures are excluded                           *
----------------------------------------------------------------------------
 
69089823 cycles (0.1727 seconds at 400.00 megahertz)
 
    cycles %cycles  cum %  seconds     cycles  bytes procedure (file)
                                        /call  /line
 
  60001400   86.85  86.85   0.1500     600014     48 add_vector (add_vector.c)
   7100008   10.28  97.12   0.0178         72      ? sqrt (<libm.so>)
   1301816    1.88  99.01   0.0033    1301816     27 main (profsample.c)
    675020    0.98  99.98   0.0017     337510     36 mul_by_scalar (mul_by_scalar.c)
       854    0.00  99.98   0.0000        854      ? __cvtas_t_to_a (<libc.so>)

----------------------------------------------------------------------------
*  -p[rocedures] using invocation counts;                                  *
*  sorted in descending order by number of calls per procedure;            *
*  unexecuted procedures are excluded                                      *
----------------------------------------------------------------------------
 
100504 invocations total
 
     calls  %calls    cum%    bytes procedure (file)
 
    100000   99.50   99.50      820 sqrt (<libm.so>)
       100    0.10   99.60      192 add_vector (add_vector.c)
        39    0.04   99.64      164 proc_at_0x3ff815bc1e0 (<libc.so>)
        38    0.04   99.67      144 proc_at_0x3ff815bc150 (<libc.so>)
        38    0.04   99.71       16 proc_at_0x3ff815bc140 (<libc.so>)

----------------------------------------------------------------------------
*  -h[eavy] using basic-block counts;                                      *
*  sorted in descending order by the number of cycles executed in each     *
*  line; unexecuted lines are excluded                                     *
----------------------------------------------------------------------------
 
procedure (file)                           line bytes     cycles      %  cum %
 
add_vector (add_vector.c)                     6    92   45000000  65.13  65.13
add_vector (add_vector.c)                     5    92   15001200  21.71  86.85
main (profsample.c)                          20    36     600003   0.87  87.71
main (profsample.c)                          22    24     600000   0.87  88.58
mul_by_scalar (mul_by_scalar.c)               6    64     487500   0.71  89.29

The -procedures profile includes a profile of call counts.

The optional dxprof command provides a graphical display of profiles collected by either pixie or the cc command's -p option.

You can also run the prof -pixstats command on the executable file sample to generate a detailed report on opcode frequencies, interlocks, a miniprofile, and more. For more information, see prof(1).

8.6 Minimizing System Resource Usage

The following sections describe the techniques and tools to help you minimize use of system resources by your application.

8.6.1 Techniques

The techniques described in the previous sections can improve an application's use of just the CPU. You can make further performance enhancements by improving the efficiency with which the application uses the other components of the computer system, such as heap memory, disk files, network connections, and so on.

As with CPU profiling, the first phase of a resource usage improvement process is to monitor how much memory, data I/O and disk space, elapsed time, and so on, is used. The throughput of the computer can then be increased or tuned in ways that help the program, or the program's design can be tuned to make better use of the computer resources that are available. For example:

Reduce the size of the data files that the program reads and writes.

Use memory-map files instead of regular I/O.

Allocate memory incrementally on demand instead of allocating at startup the maximum that could be required.

Fix heap leaks and do not leave allocated memory unused.

See the System Configuration and Tuning manual for a broader discussion of analyzing and tuning a Tru64 UNIX system.

8.6.2 Tools and Examples

The following sections discuss using system monitors and the Third Degree tool to minimize system resource usage.

8.6.2.1 System Monitors

The Tru64 UNIX base system commands ps u, swapon -s, and vmstat3 can show the currently active processes' usage of system resources such as CPU time, physical and virtual memory, swap space, page faults, and so on. See ps(1), swapon(8), and vmstat(3) for more information.

The optional pview command provides a graphical display of similar information for the processes that comprise an application. See pview(1).

The time commands provided by the Tru64 UNIX system and command shells provide an easy way to measure the total elapsed and CPU times for a program and it descendants. See time(1).

Performance Manager is an optional system performance monitoring and management tool with a graphical interface. See pmgr(8X).

For more information about related tools, see the System Configuration and Tuning manual.

8.6.2.2 Heap Memory Analyzers

The Third Degree tool (see third(1)) reports heap memory leaks in a program, by instrumenting it with the Third Degree memory-usage checker, running it, and displaying a log of leaks detected at program exit. For example:

% cc -o sample -g -non_shared *.c -lm [1]

% third -display sample [2]

Full debug information (that is, compiling with the -g option) is usually best for Third Degree, but if less is available, the reports will just be machine-level instead of source-level. The -g1 option is fine if you are just checking for memory leaks. [Return to example]

The third command creates an instrumented version of the program (sample.third). Because the -display option is specified, third automatically runs that instrumented program to create a log file (sample.3log) and then runs more to display the file. [Return to example]

The resulting sample log file is shown in Example 8-11.

Example 8-11: Sample third Log File

cc -o sample -g -non_shared add_vector.c mul_by_scalar.c print_it.c profsample.c -lm
add_vector.c:
mul_by_scalar.c:
print_it.c:
profsample.c:
third -display sample
third: info: instrumenting sample ...
third: info: running instrumented program sample.third ...
 
Value = 179804.149985
 
third: info: instrumented program exited with status 0
third: info: error log for sample ...
more ./sample.3log 
Third Degree version 5.0
sample.third run by jpx on frumpy.abc.dec.com at Mon Jun 21 16:59:50 1999
 
//////////////////////////////  Options  //////////////////////////////
-----------------------------------------------------------------------
 
-----------------------------------------------------------------------
 
------------------------------------------------------------------------
------------------------------------------------------------------------
New blocks in heap after program exit
 
Leaks - blocks not yet deallocated but apparently not in use:
 * A leak is not referenced by static memory, active stack frames,
   or unleaked blocks, though it may be referenced by other leaks.
 * A leak "not referenced by other leaks" may be the root of a leaked tree.
 * A block referenced only by registers, unseen thread stacks, mapped memory,
   or uninstrumented library data is falsely reported as a leak. Instrumenting
   shared libraries, if any, may reduce the number of such cases.
 * Any new leak lost its last reference since the previous heap report, if any.
 
A total of 800000 bytes in 1 leak were found:
 
800000 bytes in 1 leak (including 1 not referenced by other leaks) created at:
    malloc                         sample
    main                           sample, profsample.c, line 19
    __start                        sample
 
Objects - blocks not yet deallocated and apparently still in use:
 * An object is referenced by static memory, active stack, or other objects.
 * A leaked block may be falsely reported as an object if a pointer to it
   remains when a new stack frame or heap block reuses the pointer's memory.
   Using the option to report uninitialized stack and heap may avoid such cases.
 * Any new object was allocated since the previous heap report, if any.
 
A total of 0 bytes in 0 objects were found:
 
------------------------------------------------------------------------
------------------------------------------------------------------------
                memory layout at program exit
                           heap     933888 bytes [0x140012000-0x1400f6000]
                          stack     819504 bytes [0x11ff37ed0-0x120000000]
                    sample data      66464 bytes [0x140000000-0x1400103a0]
                    sample text     802816 bytes [0x120000000-0x1200c4000]
=============================================================================

By default, Third Degree profiles memory leaks, with little overhead.

If you are interested only in leaks occurring during the normal operation of the program, not during startup or shutdown, you can specify additional places to check for previously unreported leaks. For example, the preshutdown leak report will give this information:

% third -display -after startup -before shutdown sample

Third Degree can also detect various kinds of bugs that may be affecting the correctness or performance of a program. See Chapter 7 for more information on debugging and leak detection.

The optional dxheap command provides a graphical display of Third Degree's heap and bug reports. See dxheap(1).

The optional mview command provides a graphical analysis of heap usage over time. This view of a program's heap can clearly show the presence (if not the cause) of significant leaks or other undesirable trends such as wasted memory. See mview(1).

8.7 Verifying the Significance of Test Cases

The following sections discuss the techniques and tools used to verify the significance of test cases.

8.7.1 Techniques

Most of the profiling techniques described in the previous sections are effective only if you profile and optimize or tune the parts of the program that are executed in the scenarios whose performance is important. Careful selection of the data used for the profiled test runs is often sufficient, but you may want a quantitative analysis of which code was and was not executed in a given set of tests.

8.7.2 Tools and Examples

The pixie command's -t[estcoverage] option reports lines of code that were not executed in a given test run. For example:

% cc -o sample -g1 -O2 *.c -lm
% pixie -t sample

Similarly, the -zero option reports the names of procedures that were never executed. Conversely, pixie's -p[rocedure], -h[eavy], and -a[sm] options show which procedures, source lines, and instructions were executed.

If multiple test runs are needed to build up a typical scenario, the prof command can be run separately on a set of profile data files. For example:

% cc -o sample -g1 -O2 *.c -lm [1]

% pixie -pids sample [2]

% ./sample.pixie [3]
% ./sample.pixie 

% prof -pixie -t sample sample.Counts.* [4]

For information about the -g1 and -O2 options, see Section 8.3. [Return to example]

The pixie command creates an instrumented version of the program (sample.pixie) and an instruction-addresses file (sample.Addrs). The -pids option adds the process ID of the instrumented program's test run (item 3) to the name of the profiling data file produced, so that a unique file is retained after each run. For information about working with multiple data files, see Section 8.9. [Return to example]

The instrumented program is run twice (usually with different input data) to produce two profiling data files named sample.Counts.pid. [Return to example]

The -pixie option tells prof to use pixie mode rather than the default PC-sampling mode. The prof tool uses the sample.Addrs file and the two sample.Counts.pid files to create the profile from the two runs. [Return to example]

8.8 Selecting Profiling Information to Display

Depending on the size of the application and the type of profiling you request, profilers may generate a very large amount of output. However, you are often only interested in profiling data about a particular portion of your application. The following sections show how you can select the appropriate information by using:

prof options (with the pixie, uprofile, or prof command)

gprof options (with the hiprof or gprof command)

Many of the prof and gprof options perform similar functions and have similar names and syntax. The examples used show prof options. For complete details, see hiprof(1), pixie(1), uprofile(1), prof(1), and gprof(1).

See Section 8.11 for information on using monitor routines to control profiling.

8.8.1 Limiting Profiling Display to Specific Procedures

The prof command's -only option prints profiling information for only a specified procedure. You can use this option several times on the command line. For example:


% pixie -only mul_by_scalar -only add_vector sample

The -exclude option prints profiling information for all procedures except the specified procedure. You can use this option several times on the command line. Do not use -only and -exclude together on the same command line.

Many of the prof profiling options print output as percentages; for example, the percentage of total execution time attributed to a particular procedure.

By default, the -only and -exclude options cause prof to calculate percentages based on all of the procedures in the application even if they were omitted from the listing. You can change this behavior with the -Only and -Exclude options. They work the same as -only and -exclude, but cause prof to calculate percentages based only on those procedures that appear in the listing.

The -totals option, used with the -procedures and -invocations listings, prints cumulative statistics for the entire object file instead of for each procedure in the object.

8.8.2 Displaying Profiling Information for Each Source Line

The prof and gprof -heavy and -lines options display the number of machine cycles, instructions, page faults, cache misses, and so on for each source line in your application. The -asm option displays them for each instruction.

The -heavy option prints an entry for every source line. Entries are sorted from the line with the heaviest use to the line with the lightest. Because -heavy often prints a large number of entries, you might want to use one of the -quit, -only, or -exclude options to reduce output to a manageable size (see Section 8.8.3).

The -lines option is similar to -heavy, but sorts the output differently. This option prints the lines for each procedure in the order that they occur in the source file. Even lines that never executed are printed. The procedures themselves are sorted by decreasing total use.

8.8.3 Limiting Profiling Display by Line

The -quit option reduces the amount of profiling output displayed. It affects the output from the -procedures, -heavy, and -lines profiling modes.

The -quit option has the following three forms:

`-quit` `N`	Truncates the listing after `N` lines.
`-quit` `N%`	Truncates the listing after the first line that shows less than `N%` of the total.
`-quit` `Ncum%`	Truncates the listing after the line at which the cumulative total reaches `N%` of the total.

If you specify several modes on the same command line, -quit affects the output from each mode. The following command prints only the 20 most time-consuming procedures and the 20 most time-consuming source lines:

% pixie -procedures -heavy -quit 20 sample

Depending on the profiling mode, the total can refer to the total amount of time, the total number of executed instructions, or the total number of invocation counts.

8.8.4 Including Shared Libraries in the Profiling Information

When you are using the hiprof, pixie, third, prof, or gprof commands, the following options let you selectively display profiling information for shared libraries used by the program:

-all displays profiles for all shared libraries (if any) described in the data file(s) in addition to the executable.

-incobj lib displays the profile for the named shared library in addition to the executable.

-excobj lib does not display the profile for the named shared library, if -all was specified.

For example, the following command displays profiling information in all shared libraries except for user1.so:


% prof -all -excobj user1.so sample

When you are using the cc command's -p option to obtain a PC-sampling profile, you can use the PROFFLAGS environment variable to include or exclude profiling information for shared libraries when the instrumented program is run (as shown in Example 8-9). The PROFFLAGS variable accepts the -all, -incobj, and excobj options.

For more information specific to shared libraries, see Section 8.8.4.1.

8.8.4.1 Specifying the Location of Instrumented Shared Libraries

The LD_LIBRARY_PATH environment variable is used to tell the loader where to find instrumented shared libraries.

By default, when you run hiprof, pixie, or third, the LD_LIBRARY_PATH variable is automatically set to point to the working directory. You can set it to point to an arbitrary directory, as in the following C shell example:

% setenv LD_LIBRARY_PATH "my_inst_libraries"

8.9 Merging Profile Data Files

If the program you are profiling is fairly complicated, you may want to run it several times with different inputs to get an accurate picture of its profile. Each of the profiling tools lets you merge profile data from different runs of a program. But first you must override the default behavior whereby each run of the instrumented program overwrites the existing data file. Section 8.9.1 explains how to do that. Section 8.9.2 explains how to merge data files.

8.9.1 Data File-Naming Conventions

The default name of a profiling data file depends on the tool used to create it, as follows:

Tool	Default Name of Profiling Data File
`hiprof`	`program.hiout`
`pixie`	`program.Counts` (used with `program.Addrs`)
`uprofile`	`umon.out`
`cc -p`	`mon.out`
`cc -pg`	`gmon.out`

By default, when you make multiple profiling runs, each run overwrites the existing data file in the working directory. Each tool has options for renaming data files so they can be preserved from run to run. The hiprof, pixie, and uprofile commands have the following options:

`-dirname` `path`	Specifies a directory where the data file is to be created.
`-pids`	Adds the process ID of the instrumented program's run to the data file name.

For example, the following command sequence produces two data files of the form sample.pid.hiout in the subdirectory profdata:

% hiprof -dir profdata -pids sample
% ./sample
% ./sample

Then, when using the gprof command, you can specify a wildcard to include all the profiling data files in the directory:

% gprof -b sample profdata/*

When using the cc -p or cc -pg profiling options, you need to set the PROFFLAGS environment variable (before running the instrumented program). For example, the following command sequence would produce two data files of the form pid.sample in the subdirectory profdata (C shell example):

% cc -pg -o sample -g1 -O2 *.c -lm
% setenv PROFFLAGS "-dir profdata -pids"
% ./sample
% ./sample

Note that PROFFLAGS affects the profiling behavior of a program during its execution; it has no effect on the prof and gprof postprocessors. When you set PROFFLAGS to a null string, no profiling occurs.

For more information about file-naming conventions, see the tool reference pages. See Section 8.10 for the file-naming conventions for multithreaded programs.

8.9.2 Data File-Merging Techniques

After creating several profiling data files from multiple runs of a program, you can display profiling information based on an average of these files.

Use the prof or gprof postprocessor, depending on the profiling technique used, as follows:

If the profiling tool is:	Use this post processor:
`cc -p`, `uprofile`, or `pixie`	`prof`
`cc -pg`, or `hiprof`	`gprof`

One merge technique is to specify the name of each data file explicitly on the command line. For example, the following command prints profiling information from two profile data files generated using hiprof:

% gprof sample sample.1510.hiout sample.1522.hiout

Keeping track of many different profiling data files, however, can be difficult. Therefore, prof and gprof provide the -merge option to combine several data files into a single merged file. When prof operates in pixie mode (prof -pixie), the -merge option combines the .Counts files. When prof operates in PC-sampling mode, this switch combines the mon.out or other profile data files.

The following example combines two profile data files into a single data file named total.out:


% prof -merge total.out sample 1773.sample 1777.sample

At a later time, you can then display profiling data using the combined file, just as you would use a normal mon.out file. For example:

% prof -procedures sample total.out

The merge process is similar for -pixie mode. You must specify the executable file's name, the .Addrs file, and each .Counts file. For example:


% prof -pixie -merge total.Counts a.out a.out.Addrs \
  a.out.Counts.1866 a.out.Counts.1868

8.10 Profiling Multithreaded Applications

Note

To analyze a multithreaded appplication for potential logic and performance problems, you can use Visual Threads, which is available on the Tru64 UNIX Associated Products Volume 1 installation media. Visual Threads can be used on POSIX Threads Library applications and on Java applications. See dxthreads(1).

Profiling multithreaded applications is essentially the same as profiling nonthreaded applications. However, to profile multithreaded applications, you must compile your program with the -pthread or -threads option (as they are defined by the cc command). Other threads packages are not supported.

When using hiprof(1), pixie(1), or third(1), specify the -pthread option to enable the tool's thread-safe profiling support. The uprofile(1) command and the cc command's -p and -pg options need no extra option to become thread-safe.

The default case for profiling multithreaded applications is to provide one profile covering all threads. In this case, you get profiling across the entire process, and you get one output file of profiling data from all threads.

When using hiprof(1) or pixie(1), specify the -threads option for per-thread data.

When using the cc command's -p or -pg option, set the PROFFLAGS environment variable to -threads for per-thread profiling, as shown in the following C shell example:

setenv PROFFLAGS "-threads"

The uprofile(1) and third(1) tools do not provide per-thread data.

Per-thread profiling data files are given names that include unique thread numbers, which reflect the sequence in which the threads were created or in which their profiling was started.

If you use the monitor( ) or monstartup( ) calls (see Section 8.11) in a threaded program, you must first set PROFFLAGS to "-disable_default -threads", which gives you complete control of profiling the application.

If the application uses monitor( ) and allocates separate buffers for each thread profiled, you must first set PROFFLAGS to "-disable_default -threads" because this setting affects the file-naming conventions that are used. Without the -threads option, the buffer and address range used as a result of the first monitor or monstartup call would be applied to every thread that subsequently requests profiling. In this case, a single data file that covers all threads being profiled would be created.

Each thread in a process must call the monitor( ) or monstartup( ) routines to initiate profiling for itself.

8.11 Using monitor Routines to Control Profiling

The default behavior for the cc command's -p and -pg modes on Tru64 UNIX systems is to profile the entire text segment of your program and place the profiling data in mon.out for prof profiling or in gmon.out for gprof profiling. For large programs, you might not need to profile the entire text segment. The monitor routines provide the ability to profile portions of your program specified by the lower and upper address boundaries of a function address range.

The monitor routines are as follows:

monitor: Use this routine to gain control of explicit profiling by turning profiling on and off for a specific text range. This routine is not supported for gprof profiling.
monstartup: Similar to monitor except it specifies address range only and is supported for gprof profiling.
moncontrol: Use this routine with monitor and monstartup to turn PC sampling on or off during program execution for a specific process or thread.
monitor_signal: Use this routine to profile nonterminating programs, such as daemons.

You can use monitor and monstartup to profile an address range in each shared library as well as in the static executable.

For more information on these functions, see monitor(3).

By default, profiling begins as soon your program starts to execute. To prevent this behavior, set the PROFFLAGS environment variable to -disable_default, as shown in the following C shell example:

setenv PROFFLAGS "-disable_default"

Then, you can use the monitor routines to begin profiling after the first call to monitor or monstartup.

Example 8-12 demonstrates how to use the monstartup and monitor routines within a program to begin and end profiling.

Example 8-12: Using monstartup() and monitor()

   /*  Profile the domath( ) routine using monstartup.
    *  This example allocates a buffer for the entire program.
    *  Compile command: cc -p foo.c -o foo -lm
    *  Before running the executable, enter the following
    *  from the command line to disable default profiling support:
    *  setenv PROFFLAGS -disable_default
    */
 
    #include <stdio.h>
    #include <sys/syslimits.h>
 
    char dir[PATH_MAX];
 
    extern void _ _start( );
    extern unsigned long _etext;
 
    main( )
    {
        int i;
        int a = 1;
 
        /* Start profiling between _ _start (beginning of text)
         * and _etext (end of text).  The profiling library
         * routines will allocate the buffer.
         */
 
        monstartup(_ _start,&_etext);
 
           for(i=0;i<10;i++)
                domath( );
 
        /* Stop profiling and write the profiling output file. */
 
        monitor(0);
   }
    domath( )
   {
      int i;
      double d1, d2;
 
      d2 = 3.1415;
      for (i=0;  i<1000000;  i++)
         d1 = sqrt(d2)*sqrt(d2);
   }

The external name _etext lies just above all the program text. See end(3) for more information.

When you set the PROFFLAGS environment variable to -disable_default, you disable default profiling buffer support. You can allocate buffers within your program, as shown in Example 8-13.

Example 8-13: Allocating Profiling Buffers Within a Program

   /*  Profile the domath routine using monitor().
    *  Compile command: cc -p foo.c -o foo -lm
    *  Before running the executable, enter the following
    *  from the command line to disable default profiling support:
    *  setenv PROFFLAGS -disable_default
    */
 
   #include <sys/types.h>
   #include <sys/syslimits.h>
 
   extern char *calloc( );
 
   void domath(void);
   void nextproc(void);
 
   #define INST_SIZE 4          /* Instruction size on Alpha */
   char dir[PATH_MAX];
 
   main( )
   {
        int i;
        char *buffer;
        size_t bufsize;
 
        /*  Allocate one counter for each instruction to
         *  be sampled. Each counter is an unsigned short.
         */
 
        bufsize = (((char *)nextproc - (char *)domath)/INST_SIZE)
         * sizeof(unsigned short);
 
         /*  Use calloc( ) to ensure that the buffer is clean
          *  before sampling begins.
          */
 
         buffer = calloc(bufsize,1);
 
         /*  Start sampling.  */
          monitor(domath,nextproc,buffer,bufsize,0);
          for(i=0;i<10;i++)
                domath( );
 
         /* Stop sampling and write out profiling buffer.  */
                  monitor(0);
  }
   void domath(void)
  {
    int i;
    double d1, d2;
 
       d2 = 3.1415;
       for (i=0;  i<1000000;  i++)
           d1 = sqrt(d2)*sqrt(d2);
   }
 
   void nextproc(void)
   {}

You use the monitor_signal( ) routine to profile programs that do not terminate. Declare this routine as a signal handler in your program and build the program for prof or gprof profiling. While the program is executing, send a signal from the shell by using the kill command.

When the signal is received, monitor_signal is invoked and writes profiling data to the data file. If the program receives another signal, the data file is overwritten.

Example 8-14 shows how to use the monitor_signal routine.

Setting the PROFFLAGS environment variable to -sigdump SIGNAME provides the same capability without needing any new program code.

Example 8-14: Using monitor_signal() to Profile Nonterminating Programs

/* From the shell, start up the program in background.
 * Send a signal to the process, for example: kill -30 <pid>
 * Process the [g]mon.out file normally using gprof or prof
 */
 
#include <signal.h>
 
extern int monitor_signal();
 
main()
{
    int i;
    double d1, d2;
 
    /*
     * Declare monitor_signal() as signal handler for SIGUSR1
     */
    signal(SIGUSR1,monitor_signal);
    d2 = 3.1415;
    /*
     * Loop infinitely (absurd example of non-terminating process)
     */
    for (;;)
        d1 = sqrt(d2)*sqrt(d2);
}