Profiling is a method of identifying sections of code that consume large
portions of execution time.
In a typical program, most execution time is spent
in relatively few sections of code.
To improve performance, the greatest gains
result from improving coding efficiency in time-intensive sections.
8.1 Overview
Tru64 UNIX supports four approaches to performance improvement:
Automatic and profile-directed optimizations (see Section 8.4).
Manual design and code optimizations (see Section 8.5).
Minimizing system-resource usage (see Section 8.6).
Verifying the significance of test cases (see Section 8.7).
One approach might be enough, but more might be beneficial if no single approach addresses all aspects of a program's performance. This chapter describes each approach and the tools provided by Tru64 UNIX to support them. In addition, the following topics are covered in this chapter:
Selecting profiling information to display (Section 8.8)
Merging profile data files (Section 8.9)
Profiling multithreaded applications (Section 8.10)
Using monitor routines to control profiling (Section 8.11)
For more information, see the following reference pages:
Profiling: cc(1), hiprof(1), pixie(1), third(1), uprofile(1), prof(1),
gprof(1)
Performance Manager, available from the Tru64 UNIX Associated
Products Volume 1 installation media: pmgr(8X)
Graphical tools, available from the Graphical Program Analysis
subset of the Tru64 UNIX Associated Products Volume 1 installation media,
or as part of the Compaq Enterprise Toolkit for Windows/NT desktops
with the Microsoft VisualStudio97: dxheap(1), dxprof(1), mview(1),
pview(1)
Visual Threads, available from the Tru64 UNIX Associated
Products Volume 1 installation media: dxthreads(1).
Use Visual Threads to analyze a multithreaded application for
potential logic and performance problems.
The System Configuration and Tuning manual.
The examples in the remainder of this chapter refer to the program
sample, whose source code is contained in the files profsample.c
(main program), add_vector.c, mul_by_scalar.c, print_it.c, and
profsample.h.
These files are shown in Example 8-1.
Example 8-1: Profiling Sample Program
****************** profsample.c *************
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include "profsample.h"

static void mul_by_pi(double ary[])
{
   mul_by_scalar(ary, LEN/2, 3.14159);
}

void main()
{
   double ary1[LEN];
   double *ary2;
   double sum = 0.0;
   int i;

   ary2 = malloc(LEN * sizeof(double));
   for (i=0; i<LEN; i++) {
      ary1[i] = 0.0;
      ary2[i] = sqrt((double)i);
   }
   mul_by_pi(ary1);
   mul_by_scalar(ary2, LEN, 2.71828);

   for (i=0; i<100; i++)
      add_vector(ary1, ary2, LEN);
   for (i=0; i<100; i++)
      sum += ary1[i];
   if (sum < 0.0)
      print_it(0.0);
   else
      print_it(sum);
}

****************** profsample.h: ********************
void mul_by_scalar(double ary[], int len, double num);
void add_vector(double arya[], double aryb[], int len);
void print_it(double value);
#define LEN 100000

***************** add_vector.c: ********************
#include "profsample.h"

void add_vector(double arya[], double aryb[], int len)
{
   int i;

   for (i=0; i<len; i++) {
      arya[i] += aryb[i];
   }
}

**************** mul_by_scalar.c: ******************
#include "profsample.h"

void mul_by_scalar(double ary[], int len, double num)
{
   int i;

   for (i=0; i<len; i++) {
      ary[i] *= num;
   }
}

**************** print_it.c: **********************
#include <stdio.h>
#include "profsample.h"

void print_it(double value)
{
   printf("Value = %f\n", value);
}
8.3 Compilation Options for Profiling
When using debug and optimization options with the cc command, note
the following considerations. They apply to all the profiling tools
discussed in this chapter unless stated otherwise.
The -g1 option provides the minimum debug information needed (that is,
line numbers and procedure names) and is sufficient for all profilers.
The cc command default, -g0, is tolerated but provides no names for
local (for example, static) procedures. The -g2 and higher options
provide less than optimal code as well as unneeded information.
When doing automatic or profile-directed optimization (Section 8.4),
note that the cc command's -feedback option automatically applies the
-g1 option. Also, you need to experiment to find the automatic
optimization level (for example, -O3) that provides the best run-time
performance for your program and compiler.
When doing manual optimization, note that none of the profiling tools
show inlined procedures separately by their own names. Profiling
statistics for an inlined procedure are included in the statistics for
the calling procedure. For example, if proc1 calls proc2 (which is
inlined), the statistics for proc1 will include the time spent in
proc2. Therefore, to provide some optimization but with minimal
inlining when profiling, use the -O2 (or -O) option. In some cases,
you may need to specify the -inline none option to eliminate all
inlining. This restriction does not apply to automatic optimization.
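As an illustration of this attribution (a minimal sketch; the
procedures here are hypothetical, not part of the sample program):

/* If the compiler inlines scale_one into scale_all (likely at -O3,
 * suppressed at -O2 or with -inline none), profilers charge the time
 * spent in scale_one's body to scale_all. */
static double scale_one(double x)
{
    return x * 1.0625;
}

void scale_all(double a[], int n)
{
    int i;

    for (i = 0; i < n; i++)
        a[i] = scale_one(a[i]);   /* inlined: time reported as scale_all's */
}

In a profile of this code built at -O3, scale_one would typically not
appear at all; its cycles would be counted against scale_all.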
8.4 Automatic and Profile-Directed Optimizations
The following
sections discuss automatic and profile-directed optimizations.
8.4.1 Techniques
Automatic and profile-directed optimizations are the simplest approaches to improving application performance.
Some degree of automatic optimization can be achieved by using the compiler's and linker's optimization options. These can help in the generation of minimal instruction sequences that make best use of the CPU architecture and cache memory.
However, the compiler and linker can improve their optimizations if they are given information on which instructions are executed most often when the program is run with its normal input data and environment. While the default optimizations give improved performance for most common situations, the optimizers can do even better if they can tune the program in favor of the heavily used instruction sequences as determined from a sample run.
Tru64 UNIX helps you provide the optimizers with this information
on processing hot spots by allowing a profiler's results to be fed back into
a recompilation.
This customized, profile-directed optimization can be used
in conjunction with automatic optimization.
8.4.2 Tools and Examples
The following sections discuss the tools used for automatic and
profile-directed optimizations.
8.4.2.1 Automatic Optimization
The cc command's automatic optimization options are selected with the
-O, -fast, -inline, -om, and other related options. See cc(1) for
details and Chapter 10 for more information on the many options and
tradeoffs available.
For example, this command selects a high degree of optimization in both the compiler and the linker:
% cc -non_shared -O3 -om *.c
8.4.2.2 Profile-Directed Optimization
The pixie profiler provides profile information that the cc command's
-feedback and -om options can use to tune the generated instruction
sequences to the demands placed on the program by particular sets of
input data. This technique should be used only with executables. For
shared libraries, use cord as described in Section 8.4.2.3.
The following example shows the three necessary basic steps in this process, which consist of (1) preparing the program for profile-directed optimization, (2) creating an instrumented version of the program and running it to collect profiling statistics, and (3) feeding that information back to the compiler and linker to help them optimize the executable code. Later examples show how to elaborate on these steps to accommodate ongoing changes during development and data from multiple profiling runs.
% cc -feedback sample -o sample -non_shared -O3 *.c -lm       [1]
% pixie -update -quiet sample                                 [2]
% cc -feedback sample -o sample -non_shared -om -O3 *.c -lm   [3]
[1] When the program is compiled with the -feedback option for the
first time, a special augmented executable file is created. It
contains information used by the compiler to relate the executable to
the source files. It also contains a section later used to store
profiling feedback information for the compiler. This section remains
empty after the first compilation, because no feedback information has
yet been generated by the pixie profiler (step 2). The file name
specified with the -feedback option should be the same as the
executable file name, which in this example is sample (from the -o
option). By default, the -feedback option applies the -g1 option,
which provides optimum symbolization for profiling. (For information
about the -g1 and -O3 options, see Section 8.3.)
[2] The pixie command creates an instrumented version of the program
(sample.pixie) and then runs it (because a prof option, -update, is
specified). Execution statistics and address mapping data are
automatically collected in an instruction-counts file (sample.Counts)
and an instruction-addresses file (sample.Addrs). The -update option
puts this profiling information in the augmented executable. The
-quiet option prevents informational and progress messages from being
printed.
[3] In the second compilation with the -feedback option, the profiling
information in the augmented executable guides the compiler and
(through the -om option) the postlink optimizer. This customized
feedback enhances any automatic optimization provided by the -O3 and
-om options. When the -om option is used with nonshared executables,
it is particularly effective because it optimizes all library
routines, too. Compiler optimizations can be made even more effective
by using the -ifo and/or -assume whole_program options in conjunction
with the -feedback option. However, as noted in Section 10.1.1, the
compiler may be unable to compile very large programs as if there were
only one source file.
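For example, a command line combining these options might look like
the following (shown only to illustrate the syntax; it is not part of
the sample session):

% cc -feedback sample -o sample -non_shared -ifo -assume whole_program -O3 *.c -lm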
Example 8-2 shows the output for each of these steps. See pixie(1) and
cc(1) for more information.
Example 8-2: Sample Profile-Directed Optimization Output
% cc -feedback sample -o sample -non_shared -O3 add_vector.c mul_by_scalar.c print_it.c profsample.c -lm
add_vector.c:
cc: Info: Feedback file sample does not exist (nofbfil)
cc: Info: Compilation will proceed without feedback optimizations (nofbopt)
mul_by_scalar.c:
cc: Info: Feedback file sample does not exist (nofbfil)
cc: Info: Compilation will proceed without feedback optimizations (nofbopt)
print_it.c:
cc: Info: Feedback file sample does not exist (nofbfil)
cc: Info: Compilation will proceed without feedback optimizations (nofbopt)
profsample.c:
cc: Info: Feedback file sample does not exist (nofbfil)
cc: Info: Compilation will proceed without feedback optimizations (nofbopt)
% pixie -update -quiet sample
Value = 540.930890
% cc -feedback sample -o sample -non_shared -om -O3 add_vector.c mul_by_scalar.c print_it.c profsample.c -lm
add_vector.c:
mul_by_scalar.c:
print_it.c:
profsample.c:
During a typical development process, steps 2 and 3 of the previous example are repeated as needed to reflect the impact of any changes to the source code. For example:
% cc -feedback sample -o sample -non_shared -O3 *.c -lm
% pixie -update -quiet sample
% cc -feedback sample -o sample -non_shared -O3 *.c -lm
[modify source code]
% cc -feedback sample -o sample -non_shared -O3 *.c -lm
.....
[modify source code]
% cc -feedback sample -o sample -non_shared -O3 *.c -lm
% pixie -update -quiet sample
% cc -feedback sample -o sample -non_shared -om -O3 *.c -lm
Because the profiling information in the augmented executable persists
from compilation to compilation, the pixie processing step that
updates the information does not have to be repeated every time a
source module is modified and recompiled. But with each modification
the relevance of the old feedback information to the actual code is
reduced, and the potential quality of the optimization is degraded,
depending on the exact modification. The pixie processing step after
the last modification and recompilation guarantees that the feedback
information is correctly updated for the last compilation.
The profiling information in an augmented executable file makes it
larger than a normal executable (typically 3-5 percent). After
development is completed, you can use the strip command to remove any
profiling and symbol table information. For example:
% strip sample
You might want to run your instrumented program several times with
different inputs to get an accurate picture of its profile. The
following example explains how to merge profiling statistics from two
runs of a program, prog, whose output varies from run to run with
different sets of input:
% cc -feedback prog -o prog -non_shared -O3 *.c      [1]
% pixie -pids prog                                   [2]
% prog.pixie                                         [3]  (input set 1)
% prog.pixie                                              (input set 2)
% prof -pixie -update prog prog.Counts.*             [4]
% cc -feedback prog -o prog -non_shared -om -O3 *.c  [5]
[1] The first compilation produces an augmented executable, as
explained in the previous example.
[2] By default, each run of the instrumented program (prog.pixie)
produces a profiling data file called prog.Counts. The -pids option
adds the process ID of each of the instrumented program's test runs to
the name of the profiling data file produced (that is,
prog.Counts.pid). Thus, the data files produced by subsequent runs do
not overwrite each other.
[3] The instrumented program is run twice, producing a uniquely named
data file each time -- for example, prog.Counts.371 and
prog.Counts.422.
[4] The prof -pixie command merges the two data files. The -update
option updates the executable, prog, with the combined information.
[5] The second compilation step uses the combined profiling
information from the two runs of the program to guide the
optimization.
You can use a different kind of feedback file (created with the pixie
or prof -pixie command) as input to the cord utility. The cord utility
reorders the procedures in an executable program or shared library to
improve cache utilization. The following example shows how to create a
feedback file and then use the -cord option to compile an executable
with the feedback file as input:
% cc -O3 -o sample *.c -lm
% pixie -feedback sample.fb sample                     [1]
% cc -O3 -cord -feedback sample.fb -o sample *.c -lm   [2]
[1] The pixie command creates an instrumented version of the program
and also runs it (because a prof option, -feedback, is specified). The
-feedback option creates a feedback file (sample.fb) that collects
execution statistics to be used by the compiler in the next step.
[2] The cc command's -feedback option accepts the feedback file as
input. The -cord option invokes the cord utility.
Compiling a shared library with feedback is similar. Profile the shared library with one or more programs that exercise the library code that most needs optimizing. For example:
% cc -o libexample.so -shared -g1 -O3 lib*.c                          [1]
% cc -o exerciser -O3 exerciser.c -L. -lexample                       [2]
% pixie -L. -incobj libexample.so -run exerciser                      [3]
% prof -pixie -feedback libexample.fb libexample.so exerciser.Counts  [4]
% cc -cord -feedback libexample.fb -o libexample.so -shared -g1 -O3 lib*.c  [5]
[1] The shared library is compiled with the -g1 option to give
feedback data for each source line.
[2] A program to exercise the important parts of the library is built.
[3] The shared library and program are instrumented and run to profile
them.
[4] A feedback file is generated for just the shared library.
[5] The shared library is recompiled, relinked, and reordered to
optimize the performance of the code that the profile shows is most
heavily used.
Use a feedback file generated with the same optimization level.
You can also use cord with the runcord utility. For more information,
see pixie(1), prof(1), cord(1), and runcord(1).
8.5 Manual Design and Code Optimizations
The following sections discuss the
techniques and tools used for manual design and code optimizations.
8.5.1 Techniques
The effectiveness of the automatic optimizations described in Section 8.4 is limited by the efficiency of the algorithms that the program uses. You can further improve a program's performance by manually optimizing its algorithms and data structures. Such optimizations may include reducing complexity from N-squared to log-N, avoiding copying of data, and reducing the amount of data used. It may also extend to tuning the algorithm to the architecture of the particular machine it will be run on -- for example, processing large arrays in small blocks such that each block remains in the data cache for all processing, instead of the whole array being read into the cache for each processing phase.
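As a minimal sketch of the blocking idea just described (illustrative
only; BLOCK and the two processing phases are arbitrary choices, not
taken from the sample program):

#define BLOCK 4096   /* chosen so one block fits in the data cache */

/* Two processing phases applied block by block: each block is loaded
 * into the cache once and reused, instead of streaming the whole
 * array through the cache twice. */
void process(double *a, int n)
{
    int i, j, end;

    for (i = 0; i < n; i += BLOCK) {
        end = (i + BLOCK < n) ? i + BLOCK : n;
        for (j = i; j < end; j++)      /* phase 1 on this block */
            a[j] *= 2.0;
        for (j = i; j < end; j++)      /* phase 2, block still cached */
            a[j] += 1.0;
    }
}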
Tru64 UNIX supports manual optimization with its profiling tools,
which identify the parts of the application that impose the highest CPU load --
CPU cycles, cache misses, and so on.
By evaluating various profiles of a program, you can identify which
parts of the program use the most CPU resources, and you can then
redesign or recode algorithms in those parts to use fewer resources.
The profiles also make this exercise more cost-effective by helping
you to focus on the most demanding code rather than on the least
demanding.
8.5.2 Tools and Examples
The following sections discuss the tools used for CPU-time profiling
with call graph and for CPU-time/event profiling with source lines and instructions.
8.5.2.1 CPU-Time Profiling with Call Graph
A call-graph profile shows how much CPU time is used by each procedure, and how much is used by all the other procedures that it calls. This profile can show which phases or subsystems in a program spend most of the total CPU time, which can help in gaining a general understanding of the program's performance. This section describes two tools that provide this information:
The hiprof profiler
The cc command's -pg option
Both tools are used with the gprof tool, implicitly or explicitly, to
format and display the profile. The optional dxprof command provides a
graphical display of CPU-time call-graph profiles.
Using the hiprof Profiler
The hiprof profiler (see hiprof(1)) instruments the program and
generates a call graph while the instrumented program executes. This
profiler does not require that the program be compiled in any
particular way except as indicated in Section 8.3. The hiprof command
can generate a call-graph profile for shared libraries and for program
code, with moderate optimization and minimal debug information. For
example:
% cc -o libsample.so -shared -g1 -O2 add_vector.c mul_by_scalar.c print_it.c  [1]
% cc -o sample -g1 -O2 profsample.c -L. -lsample -lm                          [2]
% hiprof -numbers -L. -inc libsample.so sample                                [3]
[1] A shared library, libsample.so, is created from three source
modules, with debug information and optimization as indicated in
Section 8.3.
[2] The source module profsample.c is compiled and linked against the
shared library libsample.so (located in the current directory) to
produce the executable sample.
[3] The -inc[obj] option tells hiprof to profile libsample.so in
addition to the executable (sample). The hiprof command creates an
instrumented version of the program (sample.hiprof). Because at least
one gprof option (-numbers) is specified, hiprof then automatically
runs that instrumented program to create a profile data file
(sample.hiout) and then runs gprof to display the profile. The
-numbers option prints each procedure's starting line number,
source-file name, and library name. This helps identify any multiple
static procedures with the same name.
The resulting sample profile is shown in Example 8-3. The call-graph profile estimates the total cost of calling a procedure, including other procedures that it calls. The estimation assumes that each call to a given procedure takes the same amount of time; this may not be true in many cases, but it is always accurate when there is only one caller.
By default, hiprof uses a low-frequency sampling technique. This can
profile all the code executed by the program, including all selected
libraries. It also provides a call-graph profile of all selected
procedures (except those in the threads-related system libraries) and
detailed profiles at the level of source lines or machine instructions
(if selected).
Example 8-3: Sample hiprof Default Profile Using gprof
% cc -o libsample.so -shared -g1 -O2 add_vector.c mul_by_scalar.c print_it.c
add_vector.c:
mul_by_scalar.c:
print_it.c:
% cc -o sample -g1 -O2 profsample.c -L. -lsample -lm
% hiprof -numbers -L. -inc libsample.so sample
hiprof: info: instrumenting sample ...
hiprof: info: the LD_LIBRARY_PATH environment variable is not defined
hiprof: info: setting LD_LIBRARY_PATH=.:.                             [1]
hiprof: info: running instrumented program sample.hiprof ...
Value = 179804.149985
hiprof: info: instrumented program exited with status 0
hiprof: info: displaying profile for sample ...
gprof -b -scaled -all -numbers -L. sample.hiprof ./sample.hiout       [2]

*********************** call-graph profile *******************       [3]
granularity: samples per 4 bytes; units: seconds*1e-3; total: 323e-3 seconds

                                    called / total      parents
index  %total    self descendents   called + self   name        index
                                    called / total      children

                    2        321        1 / 1       __start [2]
[1]      100.0      2        321        1           main [1]               [4]
                  318          0      100 / 100         add_vector [3]     [5]
                    3          0        2 / 2           mul_by_scalar [4]
                    0          0        1 / 1           print_it [5]
------------------------------------------------------
                  318          0      100 / 100     main [1]               [6]
[3]       98.5    318          0      100           add_vector [3]
------------------------------------------------------
                    3          0        2 / 2       main [1]
[4]        0.9      3          0        2           mul_by_scalar [4]
------------------------------------------------------
                    0          0        1 / 1       main [1]
[5]        0.0      0          0        1           print_it [5]
------------------------------------------------------

*********************** timing profile section ***************
granularity: samples per 4 bytes; units: seconds*1e-3; total: 323e-3 seconds

    %  cumulative    self              self/call  total/call
total       units   units    calls      seconds     seconds  name
 98.5         318     318      100      3184e-6     3184e-6  add_vector [3]
  0.9         321       3        2      1465e-6     1465e-6  mul_by_scalar [4]
  0.6         323       2        1      1953e-6   323242e-6  main [1]
  0.0         323       0        1            0           0  print_it [5]

*********************** index section ************************
Index by function name - 4 entries

[3] add_vector :"add_vector.c":1
[1] main :"profsample.c":12
[4] mul_by_scalar :"mul_by_scalar.c":1
[5] print_it :"print_it.c":4
[1] The LD_LIBRARY_PATH environment variable is automatically set to
point to the working directory, where the instrumented shared library
libsample is located (see Section 8.8.4.1).
[2] The automatically generated gprof command line uses the -scaled
option by default, which can display profiles with CPU-cycle
granularity and millisecond units if the procedures selected for
display have short run times.
[3] The gprof output comprises three sections: a call-graph profile, a
timing profile, and an index (a concise means of identifying each
procedure). In this example, the three sections have been separated by
rows of asterisks (with the section names) that do not appear in the
output produced by gprof. In the call-graph profile section, each
routine in the program has its own subsection that is contained within
dashed lines and identified by the index number in the first column.
[4] This line describes the main routine, which is the subject of this
portion of the call-graph profile because it is the leftmost routine
in the rightmost column of this section. The index number [1] in the
first column corresponds to the index number [1] in the index section
at the end of the output. The percentage in the second column reports
the total amount of time in the subgraph that is accounted for by main
and its descendants, in this case add_vector, mul_by_scalar, and
print_it. The 1 in the called column indicates the total number of
times that the main routine is called.
[5] This line describes the relationship of add_vector to main.
Because add_vector is below main in this section, add_vector is
identified as the child of main. The fraction 100/100 indicates that
of the total of 100 calls to add_vector (the denominator), 100 of
these calls came from main (the numerator).
[6] This line describes the relationship of main to add_vector.
Because main is listed above add_vector in the last column, main is
identified as the parent of add_vector.
For nonthreaded programs, hiprof can alternatively count the number of
machine cycles used or page faults suffered by the program. The cost
of each parent procedure's calls is accurately calculated, making the
call-graph profile much more useful than the one in the default mode,
which can only estimate the costs of parents. Also, the CPU time (in
nanosecond units for short tests) or page-faults count reported for
the instrumented routines includes that for the uninstrumented
routines that they call. This can summarize the costs and reduce the
run-time overhead, but note that the machine-cycle counter wraps if no
instrumented procedure is called at least every few seconds.
In the following example, the hiprof command's -cycles option is used
to display a profile of the machine cycles used by the sample program:
% cc -o libsample.so -shared -g1 -O2 add_vector.c mul_by_scalar.c print_it.c
% cc -o sample -g1 -O2 profsample.c -L. -lsample -lm
% hiprof -cycles -numbers -L. -inc libsample.so sample
The resulting sample profile is shown in
Example 8-4.
Example 8-4: Sample hiprof -cycles Profile Using gprof
% cc -o libsample.so -shared -g1 -O2 add_vector.c mul_by_scalar.c print_it.c
add_vector.c:
mul_by_scalar.c:
print_it.c:
% cc -o sample -g1 -O2 profsample.c -L. -lsample -lm
% hiprof -cycles -numbers -L. -inc libsample.so sample
hiprof: info: instrumenting sample ...
hiprof: info: the LD_LIBRARY_PATH environment variable is not defined
hiprof: info: setting LD_LIBRARY_PATH=.:.
hiprof: info: running instrumented program sample.hiprof ...
Value = 179804.149985
hiprof: info: instrumented program exited with status 0
hiprof: info: displaying profile for sample ...
gprof -b -scaled -all -numbers -L. sample ./sample.hiout

granularity: cycles; units: seconds*1e-9; total: 362320292e-9 seconds

                                          called / total      parents
index  %total       self descendents     called + self   name       index
                                          called / total      children

                                                      <spontaneous>
[1]     100.0     893310  361426982          1       __start [1]
                361371860                    1 / 1       main [2]
------------------------------------------------------
                361371860                    1 / 1   __start [1]
[2]      99.7   36316218  325055642          1       main [2]
                321107275                  100 / 100     add_vector [3]
                  3519530                    2 / 2       mul_by_scalar [4]
                   428838                    1 / 1       print_it [5]
------------------------------------------------------
                321107275                  100 / 100  main [2]
[3]      88.6  321107275          0        100        add_vector [3]
------------------------------------------------------
                  3519530                    2 / 2    main [2]
[4]       1.0    3519530          0          2        mul_by_scalar [4]
------------------------------------------------------
                   428838                    1 / 1    main [2]
[5]       0.1     428838          0          1        print_it [5]
------------------------------------------------------

granularity: cycles; units: seconds*1e-9; total: 362320292e-9 seconds

    %  cumulative       self             self/call  total/call
total       units      units   calls      seconds     seconds  name
 88.6   321107275  321107275     100      3211e-6     3211e-6  add_vector [3]
 10.0   357423492   36316218       1     36316e-6   361372e-6  main [2]
  1.0   360943022    3519530       2      1760e-6     1760e-6  mul_by_scalar [4]
  0.2   361836332     893310       1    893310e-9   362320e-6  __start [1]
  0.1   362265170     428838       1    428838e-9   428838e-9  print_it [5]

Index by function name - 11 entries

[1] __start <sample>
[3] add_vector <libsample.so>:"add_vector.c":1
[2] main <sample>:"profsample.c":12
[4] mul_by_scalar <libsample.so>:"mul_by_scalar.c":1
[5] print_it <libsample.so>:"print_it.c":4
Using the cc Command's -pg Option
The cc command's -pg option uses the same sampling technique as
hiprof, but the program needs to be instrumented by compiling with the
-pg option. (The program also needs to be compiled with the debug and
optimization options indicated in Section 8.3.) Only the executable is
profiled (not shared libraries), and few system libraries are compiled
to generate a call-graph profile; therefore, hiprof may be preferred.
However, the cc command's -pg option and gprof are supported in a very
similar way on different vendors' UNIX systems, so this may be an
advantage. For example:
% cc -pg -o sample -g1 -O2 *.c -lm     [1]
% ./sample                             [2]
% gprof -scaled -b -numbers sample     [3]
[1] The cc command's -pg call-graph profiling option creates an
instrumented version of the program, sample. (You must specify the -pg
option for both the compile and link phases.)
[2] Running the instrumented program produces a profiling data file
(named gmon.out, by default) to be used by the gprof tool. For
information about working with multiple data files, see Section 8.9.
[3] The gprof command extracts information from the profiling data
file and displays it. The -scaled option displays CPU time in units
that give the best precision without exceeding the report's column
widths. The -b option suppresses the printing of a description of each
field in the profile.
The resulting sample profile is shown in Example 8-5. The gprof tool
estimates the total cost of calling a procedure (including its calls)
in its call-graph profile.
Example 8-5: Sample cc -pg Profile Using gprof
% cc -pg -o sample -g1 -O2 add_vector.c mul_by_scalar.c print_it.c profsample.c -lm
add_vector.c:
mul_by_scalar.c:
print_it.c:
profsample.c:
% ./sample
Value = 179804.149985
% gprof -scaled -b -numbers sample

granularity: samples per 8 bytes; units: seconds*1e-3; total: 326e-3 seconds

                                    called / total      parents
index  %total    self descendents   called + self   name        index
                                    called / total      children

                    5        321        1 / 1       __start [2]
[1]      100.0      5        321        1           main [1]
                  317          0      100 / 100         add_vector [3]
                    4          0        2 / 2           mul_by_scalar [4]
                    0          0        1 / 1           print_it [5]
------------------------------------------------------
                  317          0      100 / 100     main [1]
[3]       97.3    317          0      100           add_vector [3]
------------------------------------------------------
                    4          0        2 / 2       main [1]
[4]        1.2      4          0        2           mul_by_scalar [4]
------------------------------------------------------
                    0          0        1 / 1       main [1]
[5]        0.0      0          0        1           print_it [5]
------------------------------------------------------

granularity: samples per 8 bytes; units: seconds*1e-3; total: 326e-3 seconds

    %  cumulative    self              self/call  total/call
total       units   units    calls      seconds     seconds  name
 97.3         317     317      100      3174e-6     3174e-6  add_vector [3]
  1.5         322       5        1      4883e-6   326172e-6  main [1]
  1.2         326       4        2      1953e-6     1953e-6  mul_by_scalar [4]
  0.0         326       0        1            0           0  print_it [5]

Index by function name - 4 entries

[3] add_vector <sample>:"add_vector.c":1
[1] main <sample>:"profsample.c":12
[4] mul_by_scalar <sample>:"mul_by_scalar.c":1
[5] print_it <sample>:"print_it.c":4
8.5.2.2 CPU-Time/Event Profiles for Source Lines/Instructions
A good performance-improvement strategy may start with a procedure-level profile of the whole program (possibly with a call graph, to provide an overall picture), but it will often progress to detailed profiling of individual source lines and instructions. The following tools provide this information:
The uprofile profiler
The hiprof profiler
The cc command's -p option
The pixie profiler
Using the uprofile Profiler
The uprofile profiler (see uprofile(1)) uses a sampling technique to
generate a profile of the CPU time or events, such as cache misses,
associated with each procedure, source line, or instruction. The
sampling frequency depends on the processor type and the statistic
being sampled, but it is on the order of a millisecond for CPU time.
The profiler achieves this without modifying the application program
at all, by using hardware counters that are built into the Alpha CPU.
Running the uprofile command with no arguments yields a list of all
the kinds of events that a particular machine can profile, depending
on the nature of its architecture. The default is to profile machine
cycles, resulting in a CPU-time profile. The following example shows
how to display a profile of the instructions that used the top 90
percent of CPU time:
% cc -o sample -g1 -O2 *.c -lm         [1]
% uprofile -asm -quit 90cum% sample    [2]
The resulting sample profile, which includes explanatory text, is shown in Example 8-6.
[1] For information about the -g1 and -O2 options, see Section 8.3.
[2] The uprofile command runs the sample program, collecting the
performance counter data into a profile data file (named umon.out, by
default). Because prof options (-asm and -quit) are specified,
uprofile then automatically runs the prof tool to display the profile.
The -asm option provides per-instruction profiling of cycles (and
other CPU statistics, such as data cache misses, if specified).
Because no counter statistics are specified here, uprofile displays a
CPU-time profile for each instruction. The -quit 90cum% option
truncates the profile after 90 percent of the whole has been printed
(see Section 8.8.3). (Also available are the -heavy option, which
reports the lines that executed the most instructions, and the -lines
option, which reports the profile per source line within each
procedure; see Section 8.8.2.)
Example 8-6: Sample uprofile CPU-Time Profile Using prof

% cc -o sample -g1 -O2 add_vector.c mul_by_scalar.c print_it.c profsample.c -lm
add_vector.c:
mul_by_scalar.c:
print_it.c:
profsample.c:
% uprofile -asm -quit 90cum% sample
Value = 179804.149985
Writing umon.out
Displaying profile for sample:
Profile listing generated Thu Dec  3 10:29:25 1998 with:
   prof -asm -quit 90cum% sample umon.out

----------------------------------------------------------------------------
*  -a[sm] using performance counters:                                        *
*    cycles0: 1 sample every 65536 Cycles (0.000164 seconds)                 *
*  sorted in descending order by total time spent in each procedure;         *
*  unexecuted procedures excluded                                            *
----------------------------------------------------------------------------
Each sample covers 4.00 byte(s) for 0.052% of 0.3123 seconds

millisec     %   cum %    address:line   instruction

add_vector (add_vector.c)
     0.0   0.00   0.00   0x120001260:2   addl    zero, a2, a2
     0.0   0.00   0.00   0x120001264:5   bis     zero, zero, t0
     0.0   0.00   0.00   0x120001268:5   ble     a2, 0x12000131c
     0.0   0.00   0.00   0x12000126c:5   subl    a2, 0x3, t1
     0.0   0.00   0.00   0x120001270:5   cmple   t1, a2, t2
     0.0   0.00   0.00   0x120001274:5   beq     t2, 0x1200012f4
     0.0   0.00   0.00   0x120001278:5   ble     t1, 0x1200012f4
     0.0   0.00   0.00   0x12000127c:5   subq    a0, a1, t3
     0.0   0.00   0.00   0x120001280:5   lda     t3, 31(t3)
     0.0   0.00   0.00   0x120001284:5   cmpule  t3, 0x3e, t3
     0.0   0.00   0.00   0x120001288:5   bne     t3, 0x1200012f4
     0.0   0.00   0.00   0x12000128c:5   ldq_u   zero, 0(sp)
    20.2   6.45   6.45   0x120001290:6   ldl     zero, 128(a1)
     1.3   0.42   6.87   0x120001294:6   lds     $f31, 128(a0)
    35.6  11.39  18.26   0x120001298:6   ldt     $f0, 0(a1)
    20.0   6.40  24.66   0x12000129c:6   ldt     $f1, 0(a0)
     9.7   3.10  27.75   0x1200012a0:6   ldt     $f10, 8(a1)
    14.9   4.77  32.53   0x1200012a4:6   ldt     $f11, 8(a0)
    17.4   5.56  38.09   0x1200012a8:6   ldt     $f12, 16(a1)
     7.0   2.26  40.35   0x1200012ac:6   ldt     $f13, 16(a0)
     8.2   2.62  42.97   0x1200012b0:6   ldt     $f14, 24(a1)
    12.9   4.14  47.11   0x1200012b4:6   ldt     $f15, 24(a0)
    24.9   7.97  55.09   0x1200012b8:6   addt    $f1,$f0,$f0
    24.7   7.92  63.01   0x1200012bc:6   addt    $f11,$f10,$f10
    37.8  12.12  75.13   0x1200012c0:6   addt    $f13,$f12,$f12
    39.2  12.54  87.67   0x1200012c4:6   addt    $f15,$f14,$f14
     0.8   0.26  87.93   0x1200012c8:5   addl    t0, 0x4, t0
     0.0   0.00  87.93   0x1200012cc:5   lda     a1, 32(a1)
     8.4   2.68  90.61   0x1200012d0:6   stt     $f0, 0(a0)
By comparison, the following example shows how to display a profile of the instructions that suffered the top 90 percent of data cache misses on an EV56 Alpha system:
% cc -o sample -g1 -O2 *.c -lm
% uprofile -asm -quit 90cum% dcacheldmisses sample
The resulting sample profile is shown in
Example 8-7.
Example 8-7: Sample uprofile Data-Cache-Misses Profile Using prof
% uprofile -asm -quit 90cum% dcacheldmisses sample
Value = 179804.149985
Writing umon.out
Displaying profile for sample:
Profile listing generated Thu Dec  3 10:34:25 1998 with:
   prof -asm -quit 90cum% sample umon.out

----------------------------------------------------------------------------
*  -a[sm] using performance counters:                                        *
*    dcacheldmiss: 1 sample every 16384 DCache LD Misses                 [1] *
*  sorted in descending order by samples recorded for each procedure;        *
*  unexecuted procedures excluded                                            *
----------------------------------------------------------------------------
Each sample covers 4.00 byte(s) for 0.18% of 550 samples              [2]

 samples     %   cum %    address:line   instruction

add_vector (add_vector.c)
     0.0   0.00   0.00   0x120001260:2   addl    zero, a2, a2
     0.0   0.00   0.00   0x120001264:5   bis     zero, zero, t0
     0.0   0.00   0.00   0x120001268:5   ble     a2, 0x12000131c
     0.0   0.00   0.00   0x12000126c:5   subl    a2, 0x3, t1
     0.0   0.00   0.00   0x120001270:5   cmple   t1, a2, t2
     0.0   0.00   0.00   0x120001274:5   beq     t2, 0x1200012f4
     0.0   0.00   0.00   0x120001278:5   ble     t1, 0x1200012f4
     0.0   0.00   0.00   0x12000127c:5   subq    a0, a1, t3
     0.0   0.00   0.00   0x120001280:5   lda     t3, 31(t3)
     0.0   0.00   0.00   0x120001284:5   cmpule  t3, 0x3e, t3
     0.0   0.00   0.00   0x120001288:5   bne     t3, 0x1200012f4
     0.0   0.00   0.00   0x12000128c:5   ldq_u   zero, 0(sp)
     1.0   0.18   0.18   0x120001290:6   ldl     zero, 128(a1)
     3.0   0.55   0.73   0x120001294:6   lds     $f31, 128(a0)
    62.0  11.27  12.00   0x120001298:6   ldt     $f0, 0(a1)            [3]
    64.0  11.64  23.64   0x12000129c:6   ldt     $f1, 0(a0)
     8.0   1.45  25.09   0x1200012a0:6   ldt     $f10, 8(a1)
    47.0   8.55  33.64   0x1200012a4:6   ldt     $f11, 8(a0)
    13.0   2.36  36.00   0x1200012a8:6   ldt     $f12, 16(a1)
    38.0   6.91  42.91   0x1200012ac:6   ldt     $f13, 16(a0)
    15.0   2.73  45.64   0x1200012b0:6   ldt     $f14, 24(a1)
    51.0   9.27  54.91   0x1200012b4:6   ldt     $f15, 24(a0)
    49.0   8.91  63.82   0x1200012b8:6   addt    $f1,$f0,$f0
   142.0  25.82  89.64   0x1200012bc:6   addt    $f11,$f10,$f10
    13.0   2.36  92.00   0x1200012c0:6   addt    $f13,$f12,$f12
[1] The stated sampling rate of 1 sample every 16384 means that each
sample shown in the profile occurred after 16384 data cache misses,
but not all of these occurred at the instruction indicated.
[2] The total number of samples is shown, not the number of data cache
misses.
[3] Indicates the number of samples recorded at an instruction, which
statistically implies a proportionate number of data cache misses.
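(As a rough cross-check derived from this report rather than stated in
it, the 550 samples correspond to roughly 550 * 16384, or about 9
million, data cache misses over the whole run.)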
The uprofile profiling technique has the advantage of very low
run-time overhead. Also, the detailed information it can provide on
the costs of executing individual instructions or source lines is
essential in identifying exactly which operation in a procedure is
slowing the program down.
The disadvantages of uprofile are as follows:
Only executables can be profiled. To profile code in a library, you
must first link the program with the -non_shared option.
Only one program can be profiled with the hardware counters at one
time.
Threads cannot be profiled individually.
The Alpha EV6 architecture's execution of instructions out of sequence can significantly reduce the accuracy of fine-grained profiles.
As noted earlier, the hiprof command's default PC-sampling technique
can also generate CPU-time profiles like those of uprofile, using a
sampling frequency of a little under a millisecond. The profile is not
as accurate, because the call counting affects the performance, but it
has some advantages:
Shared libraries can be profiled.
Threads can be individually profiled (at the cost of very large memory
and data file size).
It is independent of hardware resources and architecture.
In the following example, the hiprof command's -lines option is used
to display a profile of the CPU time used by each source line, grouped
by procedure:
% cc -o libsample.so -shared -g1 -O2 add_vector.c mul_by_scalar.c print_it.c
% cc -o sample -g1 -O2 profsample.c -L. -lsample -lm
% hiprof -lines -numbers -L. -inc libsample.so sample
The resulting sample profile is shown in
Example 8-8.
Example 8-8: Sample hiprof -lines PC-Sampling Profile
% cc -o libsample.so -shared -g1 -O2 add_vector.c mul_by_scalar.c print_it.c
add_vector.c:
mul_by_scalar.c:
print_it.c:
% cc -o sample -g1 -O2 profsample.c -L. -lsample -lm
% hiprof -lines -numbers -L. -inc libsample.so sample
hiprof: info: instrumenting sample ...
hiprof: info: the LD_LIBRARY_PATH environment variable is not defined
hiprof: info: setting LD_LIBRARY_PATH=.:.
hiprof: info: running instrumented program sample.hiprof ...
Value = 179804.149985
hiprof: info: instrumented program exited with status 0
hiprof: info: displaying profile for sample ...
gprof -b -scaled -all -lines -numbers -L. sample.hiprof ./sample.hiout

Milliseconds per source line, in source order within functions

procedure (file)                  line   bytes   millisec      %   cum %

add_vector (add_vector.c)            2      52        0.0   0.00    0.00
                                     5      92        1.0   0.30    0.30
                                     6      92      318.4  98.19   98.49
                                     8       4        0.0   0.00   98.49
mul_by_scalar (mul_by_scalar.c)      2      52        0.0   0.00   98.49
                                     5      72        0.0   0.00   98.49
                                     6      64        3.9   1.20   99.70
                                     8       4        0.0   0.00   99.70
main (profsample.c)                  9      32        0.0   0.00   99.70
                                    12     128        0.0   0.00   99.70
                                    16       8        0.0   0.00   99.70
                                    19      32        0.0   0.00   99.70
                                    20      36        0.0   0.00   99.70
                                    21      16        0.0   0.00   99.70
                                    22      24        0.0   0.00   99.70
                                    24       4        0.0   0.00   99.70
                                    25      36        0.0   0.00   99.70
                                    26      16        0.0   0.00   99.70
                                    27      28        1.0   0.30  100.00
                                    28      20        0.0   0.00  100.00
                                    29      40        0.0   0.00  100.00
                                    30       4        0.0   0.00  100.00
                                    31      20        0.0   0.00  100.00
                                    32       4        0.0   0.00  100.00
                                    33      20        0.0   0.00  100.00
                                    34      56        0.0   0.00  100.00
Using the cc Command's -p Option
The cc command's -p option uses a low-frequency sampling technique to
generate a profile of CPU time that is similar to uprofile's but
statistically less accurate. However, the -p option does offer the
following advantages:
Shared libraries can be profiled.
Threads can be individually profiled.
It is independent of hardware resources and architecture.
It is common to many UNIX operating systems.
On Tru64 UNIX, it can profile all the shared libraries used by a
program.
The program needs to be relinked with the -p option, but it does not
need to be recompiled from source so long as the original compilation
used an acceptable debug level, such as the -g1 cc command option (see
Section 8.3). For example, to profile individual source lines and
procedures of a program (if the program runs for long enough to
generate a dense sample array):
% cc -p -o sample -g1 -O2 *.c -lm          [1]
% setenv PROFFLAGS '-all -stride 1'        [2]
% ./sample                                 [3]
% prof -all -proc -heavy -numbers sample   [4]
[1] The cc command's -p PC-sample-profiling option creates an
instrumented version of the program, called sample.
[2] The -all option specified with the PROFFLAGS environment variable
asks for all shared libraries to be profiled (see Section 8.8.4). This
causes sqrt (from libm.so) to show up in the profile as the second
highest CPU-time user. The variable must be set before the
instrumented program is run. The -stride 1 option in PROFFLAGS asks
for a separate counter to be used for each instruction, to allow
accurate per-source-line profiling with prof's -heavy option.
[3] Running the instrumented program produces a PC-sampling data file
called mon.out, by default, to be used by the prof tool. For
information about working with multiple data files, see Section 8.9.
[4] The prof tool (see prof(1)) uses the PC-sampling data file to
produce the profile. Because this technique works by periodic sampling
of the program counter, you might see different output when you
profile the same program multiple times. When running prof manually,
as in this example, you can filter which shared libraries to include
in the displayed profile; the -all option tells prof to include all
libraries (see Section 8.8.4). The -proc[edures] option profiles the
instructions executed in each procedure and the calls to procedures.
The -heavy option reports the lines that executed the most
instructions. (Also, -lines shows per-line profiles and -asm shows
per-instruction profiles, both grouped by procedure.)
The resulting sample profile is shown in
Example 8-9.
Example 8-9: Sample cc -p Profile Using prof
% cc -p -o sample -g1 -O2 add_vector.c mul_by_scalar.c print_it.c profsample.c -lm
add_vector.c:
mul_by_scalar.c:
print_it.c:
profsample.c:
% setenv PROFFLAGS "-all -stride 1"
% ./sample
Value = 179804.149985
% prof -all -proc -heavy -numbers sample
Profile listing generated Mon Feb 23 15:33:07 1998 with:
   prof -all -proc -heavy -numbers sample

----------------------------------------------------------------------------
*  -p[rocedures] using pc-sampling;                                          *
*  sorted in descending order by total time spent in each procedure;         *
*  unexecuted procedures excluded                                            *
----------------------------------------------------------------------------
Each sample covers 4.00 byte(s) for 0.25% of 0.3955 seconds

  %time   seconds   cum %   cum sec   procedure (file)

   93.1    0.3682    93.1      0.37   add_vector (<sample>:"add_vector.c":1)
    5.4    0.0215    98.5      0.39   sqrt (<libm.so>)
    1.0    0.0039    99.5      0.39   mul_by_scalar (<sample>:"mul_by_scalar.c":1)
    0.5    0.0020   100.0      0.40   main (<sample>:"profsample.c":12)

----------------------------------------------------------------------------
*  -h[eavy] using pc-sampling;                                               *
*  sorted in descending order by time spent in each source line;             *
*  unexecuted lines excluded                                                 *
----------------------------------------------------------------------------
Each sample covers 4.00 byte(s) for 0.25% of 0.3955 seconds

procedure (file)                   line   bytes   millisec      %   cum %

add_vector (add_vector.c)             6      80      363.3  91.85   91.85
add_vector (add_vector.c)             5      96        4.9   1.23   93.09
mul_by_scalar (mul_by_scalar.c)       6      60        3.9   0.99   94.07
main (profsample.c)                  20      36        2.0   0.49   94.57
Using the pixie Profiler
The pixie tool (see pixie(1)) can also profile source lines and
instructions (including shared libraries), but note that when it
displays counts of cycles, it is actually reporting counts of
instructions executed, not machine cycles. Its -truecycles 2 option
can estimate the number of cycles that would be used if all memory
accesses were satisfied by the cache, but programs can rarely cache
enough of their data for this to be accurate, and only the Alpha EV4
and EV5 families can be fully simulated in this way. For example:
% cc -o sample -g1 -O2 *.c -lm            [1]
% pixie -all -proc -heavy -quit 5 sample  [2]
[1] For information about the -g1 and -O2 options, see Section 8.3.
[2] The pixie command creates an instrumented version of the program
(sample.pixie) and an instruction-addresses file (sample.Addrs).
Because prof options (-proc, -heavy, -quit) are specified, pixie
automatically runs that instrumented program to create an
instruction-counts file (sample.Counts) and then runs prof to display
the profile, using the .Addrs and .Counts files as input.
The -all option asks for shared libraries to be profiled. Although a
pixie profile can include shared libraries, system libraries (like
libm.so, which contains the sqrt function) do not include source-line
numbers, so they are not included in the -heavy option's per-line
profile, and static procedures get proc_at... names created for them.
The -heavy option reports lines that executed the most instructions.
The -proc[edures] option profiles the instructions executed in each
procedure and the calls to procedures. The -quit 5 option truncates
the report after 5 lines for each mode (-heavy, -proc[edures]).
% cc -o sample -g1 -O2 add_vector.c mul_by_scalar.c print_it.c profsample.c -lm
add_vector.c:
mul_by_scalar.c:
print_it.c:
profsample.c:
% pixie -all -proc -heavy -quit 5 sample
pixie: info: instrumenting sample ...
pixie: info: the LD_LIBRARY_PATH environment variable is not defined
pixie: info: setting LD_LIBRARY_PATH=.
pixie: info: running instrumented program sample.pixie ...
Value = 179804.149985
pixie: info: instrumented program exited with status 0
pixie: info: displaying profile for sample ...
Profile listing generated Mon Feb 23 15:33:55 1998 with:
   prof -pixie -all -procedures -heavy -quit 5 sample ./sample.Counts

----------------------------------------------------------------------------
*  -p[rocedures] using basic-block counts;                                   *
*  sorted in descending order by the number of cycles executed in each       *
*  procedure; unexecuted procedures are excluded                             *
----------------------------------------------------------------------------
69089823 cycles (0.1727 seconds at 400.00 megahertz)

    cycles %cycles  cum %  seconds    cycles  bytes procedure (file)
                                       /call  /line

  60001400   86.85  86.85   0.1500    600014     48 add_vector (add_vector.c)
   7100008   10.28  97.12   0.0178        72      ? sqrt (<libm.so>)
   1301816    1.88  99.01   0.0033   1301816     27 main (profsample.c)
    675020    0.98  99.98   0.0017    337510     36 mul_by_scalar (mul_by_scalar.c)
       854    0.00  99.98   0.0000       854      ? __cvtas_t_to_a (<libc.so>)

----------------------------------------------------------------------------
*  -p[rocedures] using invocation counts;                                    *
*  sorted in descending order by number of calls per procedure;              *
*  unexecuted procedures are excluded                                        *
----------------------------------------------------------------------------
100504 invocations total

     calls  %calls   cum%  bytes procedure (file)

    100000   99.50  99.50    820 sqrt (<libm.so>)
       100    0.10  99.60    192 add_vector (add_vector.c)
        39    0.04  99.64    164 proc_at_0x3ff815bc1e0 (<libc.so>)
        38    0.04  99.67    144 proc_at_0x3ff815bc150 (<libc.so>)
        38    0.04  99.71     16 proc_at_0x3ff815bc140 (<libc.so>)

----------------------------------------------------------------------------
*  -h[eavy] using basic-block counts;                                        *
*  sorted in descending order by the number of cycles executed in each       *
*  line; unexecuted lines are excluded                                       *
----------------------------------------------------------------------------
procedure (file)                   line   bytes     cycles      %  cum %

add_vector (add_vector.c)             6      92   45000000  65.13  65.13
add_vector (add_vector.c)             5      92   15001200  21.71  86.85
main (profsample.c)                  20      36     600003   0.87  87.71
main (profsample.c)                  22      24     600000   0.87  88.58
mul_by_scalar (mul_by_scalar.c)       6      64     487500   0.71  89.29
The -procedures profile includes a profile of call counts. The
optional dxprof command provides a graphical display of profiles
collected by either pixie or the cc command's -p option.
You can also run the prof -pixstats command on the executable file
sample to generate a detailed report on opcode frequencies,
interlocks, a miniprofile, and more. For more information, see
prof(1).
8.6 Minimizing System Resource Usage
The following sections describe the
techniques and tools to help you minimize use of system resources by your
application.
8.6.1 Techniques
The techniques described in the previous sections can improve an application's use of just the CPU. You can make further performance enhancements by improving the efficiency with which the application uses the other components of the computer system, such as heap memory, disk files, network connections, and so on.
As with CPU profiling, the first phase of a resource usage improvement process is to monitor how much memory, data I/O and disk space, elapsed time, and so on, is used. The throughput of the computer can then be increased or tuned in ways that help the program, or the program's design can be tuned to make better use of the computer resources that are available. For example:
Reduce the size of the data files that the program reads and writes.
Use memory-mapped files instead of regular I/O (see the sketch after this list).
Allocate memory incrementally on demand instead of allocating at startup the maximum that could be required.
Fix heap leaks and do not leave allocated memory unused.
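As a minimal sketch of the memory-mapping idea in the second item
above (illustrative only; the function and its error handling are
reduced to the bare minimum and are not part of the sample program):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sum a file's bytes through mmap() instead of read(): the kernel
 * pages the data in on demand, avoiding a copy through a separate
 * user-space buffer. */
long sum_bytes(const char *path)
{
    struct stat st;
    char *data;
    long total = 0;
    off_t i;
    int fd = open(path, O_RDONLY);

    if (fd < 0 || fstat(fd, &st) < 0)
        return -1;
    data = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) {
        close(fd);
        return -1;
    }
    for (i = 0; i < st.st_size; i++)
        total += data[i];
    munmap(data, st.st_size);
    close(fd);
    return total;
}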
See the System Configuration and Tuning manual for a broader
discussion of analyzing and tuning a Tru64 UNIX system.
8.6.2 Tools and Examples
The following sections discuss using system monitors and the Third Degree
tool to minimize system resource usage.
8.6.2.1 System Monitors
The Tru64 UNIX base system commands
ps u
,
swapon -s
, and
vmstat3
can show the currently active processes' usage of system resources
such as CPU time, physical and virtual memory, swap space, page faults, and
so on.
See
ps
(1),
swapon
(8), and
vmstat
(3)
for more information.
The optional pview command provides a graphical display of similar
information for the processes that comprise an application. See
pview(1).
The time commands provided by the Tru64 UNIX system and command shells
provide an easy way to measure the total elapsed and CPU times for a
program and its descendants. See time(1).
Performance Manager is an optional system performance monitoring and
management tool with a graphical interface. See pmgr(8X).
For more information about related tools, see the
System Configuration and Tuning
manual.
8.6.2.2 Heap Memory Analyzers
The Third Degree tool (see third(1)) reports heap memory leaks in a
program, by instrumenting it with the Third Degree memory-usage
checker, running it, and displaying a log of leaks detected at program
exit. For example:
% cc -o sample -g -non_shared *.c -lm   [1]
% third -display sample                 [2]
[1] Full debug information (that is, compiling with the -g option) is
usually best for Third Degree, but if less is available, the reports
will just be machine-level instead of source-level. The -g1 option is
fine if you are just checking for memory leaks.
[2] The third command creates an instrumented version of the program
(sample.third). Because the -display option is specified, third
automatically runs that instrumented program to create a log file
(sample.3log) and then runs more to display the file.
% cc -o sample -g -non_shared add_vector.c mul_by_scalar.c print_it.c profsample.c -lm
add_vector.c:
mul_by_scalar.c:
print_it.c:
profsample.c:
% third -display sample
third: info: instrumenting sample ...
third: info: running instrumented program sample.third ...
Value = 179804.149985
third: info: instrumented program exited with status 0
third: info: error log for sample ...
more ./sample.3log

Third Degree version 5.0
sample.third run by jpx on frumpy.abc.dec.com at Mon Jun 21 16:59:50 1999

////////////////////////////// Options //////////////////////////////

------------------------------------------------------------------------
New blocks in heap after program exit

Leaks - blocks not yet deallocated but apparently not in use:
* A leak is not referenced by static memory, active stack frames, or
  unleaked blocks, though it may be referenced by other leaks.
* A leak "not referenced by other leaks" may be the root of a leaked tree.
* A block referenced only by registers, unseen thread stacks, mapped
  memory, or uninstrumented library data is falsely reported as a leak.
  Instrumenting shared libraries, if any, may reduce the number of such
  cases.
* Any new leak lost its last reference since the previous heap report,
  if any.

A total of 800000 bytes in 1 leak were found:

800000 bytes in 1 leak (including 1 not referenced by other leaks) created at:
    malloc           sample
    main             sample, profsample.c, line 19
    __start          sample

Objects - blocks not yet deallocated and apparently still in use:
* An object is referenced by static memory, active stack, or other
  objects.
* A leaked block may be falsely reported as an object if a pointer to it
  remains when a new stack frame or heap block reuses the pointer's
  memory. Using the option to report uninitialized stack and heap may
  avoid such cases.
* Any new object was allocated since the previous heap report, if any.

A total of 0 bytes in 0 objects were found:

------------------------------------------------------------------------
memory layout at program exit
    heap          933888 bytes [0x140012000-0x1400f6000]
    stack         819504 bytes [0x11ff37ed0-0x120000000]
    sample data    66464 bytes [0x140000000-0x1400103a0]
    sample text   802816 bytes [0x120000000-0x1200c4000]
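The 800000-byte leak reported above is the ary2 buffer that
profsample.c allocates with malloc at line 19 (LEN * sizeof(double) =
100000 * 8 bytes) and never frees. A minimal fix, sketched here, is to
release the block before main returns:

free(ary2);   /* releases the 800000-byte block Third Degree reports as leaked */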
By default, Third Degree profiles memory leaks, with little overhead.
If you are interested only in leaks occurring during the normal operation of the program, not during startup or shutdown, you can specify additional places to check for previously unreported leaks. For example, the preshutdown leak report will give this information:
% third -display -after startup -before shutdown sample
Third Degree can also detect various kinds of bugs that may be affecting the correctness or performance of a program. See Chapter 7 for more information on debugging and leak detection.
The optional dxheap command provides a graphical display of Third
Degree's heap and bug reports. See dxheap(1).
The optional mview command provides a graphical analysis of heap usage
over time. This view of a program's heap can clearly show the presence
(if not the cause) of significant leaks or other undesirable trends
such as wasted memory. See mview(1).
8.7 Verifying the Significance of Test Cases
The following sections discuss the techniques
and tools used to verify the significance of test cases.
8.7.1 Techniques
Most of the profiling techniques described in the previous sections
are effective only if you profile and optimize or tune the parts of the program
that are executed in the scenarios whose performance is important.
Careful
selection of the data used for the profiled test runs is often sufficient,
but you may want a quantitative analysis of which code was and was not executed
in a given set of tests.
8.7.2 Tools and Examples
The
pixie
command's
-t[estcoverage]
option reports lines of code that
were not executed in a given test run.
For example:
%
cc -o sample -g1 -O2 *.c -lm
%
pixie -t sample
Similarly, the
-zero
option reports the names of
procedures that were never executed.
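For example, an invocation patterned on the -t example above might be:
%
pixie -zero sample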
Conversely,
pixie
's
-p[rocedure]
,
-h[eavy]
, and
-a[sm]
options
show which procedures, source lines, and instructions were executed.
If multiple test runs are needed to build up a typical scenario, the
prof
command can be run separately on a set of profile data files.
For example:
% cc -o sample -g1 -O2 *.c -lm   [1]
% pixie -pids sample   [2]
% ./sample.pixie   [3]
% ./sample.pixie
% prof -pixie -t sample sample.Counts.*   [4]
[1] For information about the -g1 and -O2 options, see Section 8.3.
[2] The pixie command creates an instrumented version of the program (sample.pixie) and an instruction-addresses file (sample.Addrs). The -pids option adds the process ID of the instrumented program's test run (item 3) to the name of the profiling data file produced, so that a unique file is retained after each run. For information about working with multiple data files, see Section 8.9.
[3] The instrumented program is run twice (usually with different input data) to produce two profiling data files named sample.Counts.pid.
[4] The -pixie option tells prof to use pixie mode rather than the default PC-sampling mode. The prof tool uses the sample.Addrs file and the two sample.Counts.pid files to create the profile from the two runs.
8.8 Selecting Profiling Information to Display
Depending on the size of the application and the type of profiling you request, profilers may generate a very large amount of output. However, you are often interested in profiling data for only a particular portion of your application. The following sections show how you can select the appropriate information by using:
prof
options (with the
pixie
,
uprofile
, or
prof
command)
gprof
options (with the
hiprof
or
gprof
command)
Many of the
prof
and
gprof
options perform similar functions and have similar names and syntax.
The examples
in the following sections show
prof
options.
For complete details, see
hiprof
(1),
pixie
(1),
uprofile
(1),
prof
(1),
and
gprof
(1).
See
Section 8.11
for information on using
monitor
routines to control profiling.
8.8.1 Limiting Profiling Display to Specific Procedures
The
prof
command's
-only
option prints profiling information for only a specified procedure.
You can
use this option several times on the command line.
For example:
%
pixie -only mul_by_scalar -only add_vector sample
The
-exclude
option prints profiling information
for all procedures except the specified procedure.
You can use this option
several times on the command line.
Do not use
-only
and
-exclude
together on the same command line.
Many of the
prof
profiling options print output as
percentages; for example, the percentage of total execution time attributed
to a particular procedure.
By default, the
-only
and
-exclude
options cause
prof
to calculate percentages
based on all of the procedures in the application even if they were omitted
from the listing.
You can change this behavior with the
-Only
and
-Exclude
options.
They work the same
as
-only
and
-exclude
, but
cause
prof
to calculate percentages based only on those
procedures that appear in the listing.
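For example, the following command (patterned on the -only example above; the procedure name is from the sample program) might limit both the listing and the percentage calculations to one procedure:
%
pixie -Only mul_by_scalar sample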
The
-totals
option, used with the
-procedures
and
-invocations
listings,
prints cumulative statistics for the entire object file instead of for each
procedure in the object.
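For example, the following hedged sketch of the usage combines -totals with a -procedures listing:
%
pixie -procedures -totals sample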
8.8.2 Displaying Profiling Information for Each Source Line
The
prof
and
gprof
-heavy
and
-lines
options display the number of machine
cycles, instructions, page faults, cache misses, and so on for each source
line in your application.
The
-asm
option displays them for
each instruction.
The
-heavy
option prints an entry for every
source line.
Entries are sorted from the line with the heaviest use to the
line with the lightest.
Because
-heavy
often prints
a large number of entries, you might want to use one of the
-quit
,
-only
, or
-exclude
options to reduce output to a manageable size (see
Section 8.8.3).
The
-lines
option is similar to
-heavy
, but sorts the output differently.
This option prints the lines
for each procedure in the order that they occur in the source file.
Even
lines that were never executed are printed.
The procedures themselves are sorted
by decreasing total use.
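For example, the following command might be used to print the per-line statistics in source order:
%
pixie -lines sample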
8.8.3 Limiting Profiling Display by Line
The
-quit
option reduces
the amount of profiling output displayed.
It affects the output from the
-procedures
,
-heavy
, and
-lines
profiling modes.
The
-quit
option has the following three forms:
-quit N       Truncates the listing after N lines.
-quit N%      Truncates the listing after the first line that shows less than N% of the total.
-quit Ncum%   Truncates the listing after the line at which the cumulative total reaches N% of the total.
If you specify several modes on the same command line,
-quit
affects the output from each mode.
The following command prints
only the 20 most time-consuming procedures and the 20 most time-consuming
source lines:
%
pixie -procedures -heavy -quit 20 sample
Depending on the profiling mode, the total can refer to the total amount
of time, the total number of executed instructions, or the total number of
invocation counts.
8.8.4 Including Shared Libraries in the Profiling Information
When you are using the
hiprof
,
pixie
,
third
,
prof
, or
gprof
commands, the following options let you selectively display
profiling information for shared libraries used by the program:
-all
displays profiles for all shared
libraries (if any) described in the data file(s) in addition to the executable.
-incobj
lib
displays the profile for the named shared library in addition to the executable.
-excobj
lib
does not display the profile for the named shared library, if
-all
was specified.
For example, the following command displays profiling information in
all shared libraries except for
user1.so
:
%
prof -all -excobj user1.so sample
When you are using the
cc
command's
-p
option to obtain a PC-sampling profile, you can use the
PROFFLAGS
environment variable to include or exclude profiling information
for shared libraries when the instrumented program is run (as shown in
Example 8-9).
The
PROFFLAGS
variable
accepts the
-all
,
-incobj
, and
-excobj
options.
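For example, the following C shell command requests profiles for all shared libraries when the instrumented program runs:
%
setenv PROFFLAGS "-all"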
For more information specific to shared libraries, see
Section 8.8.4.1.
8.8.4.1 Specifying the Location of Instrumented Shared Libraries
The
LD_LIBRARY_PATH
environment variable is used
to tell the loader where to find instrumented shared libraries.
By default, when you run
hiprof
,
pixie
,
or
third
, the
LD_LIBRARY_PATH
variable
is automatically set to point to the working directory.
You can set it to
point to an arbitrary directory, as in the following C shell example:
%
setenv LD_LIBRARY_PATH "my_inst_libraries"
8.9 Merging Profile Data Files
If the program you are profiling is fairly complicated,
you may want to run it several times with different inputs to get an accurate
picture of its profile.
Each of the profiling tools lets you merge profile
data from different runs of a program.
But first you must override the default
behavior whereby each run of the instrumented program overwrites the existing
data file.
Section 8.9.1
explains how to do that.
Section 8.9.2
explains how to merge data files.
8.9.1 Data File-Naming Conventions
The default name of a profiling data file depends on the tool used to create it, as follows:
Tool        Default Name of Profiling Data File
hiprof      program.hiout
pixie       program.Counts (used with program.Addrs)
uprofile    umon.out
cc -p       mon.out
cc -pg      gmon.out
By default, when you make multiple profiling runs, each run overwrites
the existing data file in the working directory.
Each tool has options for
renaming data files so they can be preserved from run to run.
The
hiprof
,
pixie
, and
uprofile
commands have the following options:
-dirname path   Specifies a directory where the data file is to be created.
-pids           Adds the process ID of the instrumented program's run to the data file name.
For example, the following command sequence produces two data files
of the form
sample.
pid.hiout
in the subdirectory
profdata
:
%
hiprof -dir profdata -pids sample
%
./sample
%
./sample
Then, when using the
gprof
command, you can specify
a wildcard to include all the profiling data files in the directory:
%
gprof -b sample profdata/*
When using the
cc -p
or
cc -pg
profiling options, you need to set the
PROFFLAGS
environment
variable (before running the instrumented program).
For example, the following
command sequence would produce two data files of the form
pid.sample
in the subdirectory
profdata
(C shell example):
%
cc -pg -o sample -g1 -O2 *.c -lm
%
setenv PROFFLAGS "-dir profdata -pids"
%
./sample
%
./sample
Note that
PROFFLAGS
affects the profiling behavior
of a program during its execution; it has no effect on the
prof
and
gprof
postprocessors.
When you set
PROFFLAGS
to a null string, no profiling occurs.
For more information about file-naming conventions, see the tool reference
pages.
See
Section 8.10
for the file-naming conventions
for multithreaded programs.
8.9.2 Data File-Merging Techniques
After creating several profiling data files from multiple runs of a program, you can display profiling information based on an average of these files.
Use the
prof
or
gprof
postprocessor,
depending on the profiling technique used, as follows:
If the profiling tool is:      Use this postprocessor:
cc -p, uprofile, or pixie      prof
cc -pg or hiprof               gprof
One merge technique is to specify the name of each data file explicitly
on the command line.
For example, the following command prints profiling information
from two profile data files generated using
hiprof
:
%
gprof sample sample.1510.hiout sample.1522.hiout
Keeping track of many different profiling data files, however, can be
difficult.
Therefore,
prof
and
gprof
provide the
-merge
option to combine several data
files into a single merged file.
When
prof
operates in
pixie
mode (prof -pixie
), the
-merge
option combines the
.Counts
files.
When
prof
operates in PC-sampling mode, this switch combines
the
mon.out
or other profile data files.
The following example combines two profile data files into a single
data file named
total.out
:
%
prof -merge total.out sample 1773.sample 1777.sample
At a later time, you can then display profiling data using the combined
file, just as you would use a normal
mon.out
file.
For
example:
%
prof -procedures sample total.out
The merge process is similar for
-pixie
mode.
You must specify the executable file's name, the
.Addrs
file, and each
.Counts
file.
For example:
%
prof -pixie -merge total.Counts a.out a.out.Addrs \
a.out.Counts.1866 a.out.Counts.1868
8.10 Profiling Multithreaded Applications
Note
To analyze a multithreaded application for potential logic and performance problems, you can use Visual Threads, which is available on the Tru64 UNIX Associated Products Volume 1 installation media. Visual Threads can be used on POSIX Threads Library applications and on Java applications. See
dxthreads
(1).
Profiling multithreaded applications is essentially the same as profiling
nonthreaded applications.
However, to profile multithreaded applications,
you must compile your program with the
-pthread
or
-threads
option (as they are defined by the
cc
command).
Other thread packages are not supported.
When using
hiprof
(1),
pixie
(1), or
third
(1), specify the
-pthread
option to enable the tool's thread-safe profiling support.
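For example, the following command might be used to instrument a threaded build of the sample program:
%
hiprof -pthread sample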
The
uprofile
(1)
command and the
cc
command's
-p
and
-pg
options need no extra option to become
thread-safe.
The default case for profiling multithreaded applications is to provide one profile covering all threads. In this case, you get profiling across the entire process, and you get one output file of profiling data from all threads.
When using
hiprof
(1)
or
pixie
(1), specify the
-threads
option for per-thread data.
When using the
cc
command's
-p
or
-pg
option, set the
PROFFLAGS
environment variable to
-threads
for per-thread profiling, as shown in the following C shell example:
setenv PROFFLAGS "-threads"
The
uprofile
(1)
and
third
(1)
tools do not provide per-thread data.
Per-thread profiling data files are given names that include unique thread numbers, which reflect the sequence in which the threads were created or in which their profiling was started.
If you use the
monitor( )
or
monstartup( )
calls (see
Section 8.11) in a threaded program,
you must first set
PROFFLAGS
to
"-disable_default
-threads"
, which gives you complete control of profiling the application.
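For example, in the C shell:
%
setenv PROFFLAGS "-disable_default -threads"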
If the application uses
monitor( )
and allocates
separate buffers for each thread profiled, you must first set
PROFFLAGS
to
"-disable_default -threads"
because this
setting affects the file-naming conventions that are used.
Without the
-threads
option, the buffer and address range used as a result
of the first
monitor
or
monstartup
call
would be applied to every thread that subsequently requests profiling.
In
this case, a single data file that covers all threads being profiled would
be created.
Each thread in a process must call the
monitor( )
or
monstartup( )
routines to initiate profiling
for itself.
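The following minimal sketch (not taken from the manual) illustrates that pattern for a POSIX Threads program. It assumes the program is built with cc -p -pthread and run with PROFFLAGS set to "-disable_default -threads"; the worker( ) routine is an illustrative placeholder, and the __start and _etext bounds follow Example 8-12 below.
/* Hedged sketch: each thread starts its own profiling.
 * Assumed build: cc -p -pthread foo.c -o foo -lm
 * Assumed environment: setenv PROFFLAGS "-disable_default -threads"
 */
#include <pthread.h>

extern void __start( );        /* beginning of program text */
extern unsigned long _etext;   /* end of program text; see end(3) */

void *worker(void *arg)
{
    /* Each thread must request profiling for itself. */
    monstartup(__start, &_etext);

    /* ... the thread's real work would go here ... */

    /* Stop profiling and write this thread's data file
     * (assumed per-thread behavior; see monitor(3)).
     */
    monitor(0);
    return arg;
}

main( )
{
    pthread_t t1, t2;

    pthread_create(&t1, 0, worker, 0);
    pthread_create(&t2, 0, worker, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    return 0;
}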
8.11 Using monitor Routines to Control Profiling
The default behavior for the
cc
command's
-p
and
-pg
modes
on Tru64 UNIX systems is to profile the entire text segment of your program
and place the profiling data in
mon.out
for
prof
profiling or in
gmon.out
for
gprof
profiling.
For large
programs, you might not need to profile the entire text segment.
The
monitor
routines provide the ability to profile portions of your
program specified by the lower and upper address boundaries of a function
address range.
The
monitor
routines are as follows:
monitor
Use this routine to
gain control of explicit profiling by turning profiling on and off for a specific
text range.
This routine is not supported for
gprof
profiling.
monstartup
Similar to
monitor
except it specifies address range only and is supported for
gprof
profiling.
moncontrol
Use this routine
with
monitor
and
monstartup
to turn
PC sampling on or off during program execution for a specific
process or
thread (a sketch follows this list).
monitor_signal
Use this routine to profile nonterminating programs, such as daemons.
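The monitor( ) and monstartup( ) routines are demonstrated in Examples 8-12 and 8-13, and monitor_signal( ) in Example 8-14; moncontrol( ) has no example in this chapter, so the following minimal sketch is offered. It assumes the moncontrol(mode) interface described in monitor(3), where a nonzero mode turns PC sampling on and zero turns it off; setup( ) and measured_work( ) are illustrative placeholders.
/* Hedged sketch: suspend PC sampling during setup so that only
 * the region of interest contributes to the profile.
 * Assumed build: cc -p foo.c -o foo
 */
void setup(void)
{
    /* initialization work to be excluded from the profile */
}

void measured_work(void)
{
    /* the work whose profile we want */
}

main( )
{
    moncontrol(0);      /* turn PC sampling off during setup */
    setup();
    moncontrol(1);      /* turn PC sampling on for the measured region */
    measured_work();
    monitor(0);         /* stop profiling and write the data file */
    return 0;
}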
You can use
monitor
and
monstartup
to profile an address range in each shared library as well as in the static
executable.
For more information on these functions, see
monitor
(3).
By default, profiling begins as soon as your program starts to execute.
To prevent this behavior, set the
PROFFLAGS
environment
variable to
-disable_default
, as shown in the following
C shell example:
setenv PROFFLAGS "-disable_default"
Then,
you can use the
monitor
routines to begin profiling after
the first call to
monitor
or
monstartup.
Example 8-12
demonstrates how to use
the
monstartup
and
monitor
routines
within a program to begin and end profiling.
Example 8-12: Using monstartup() and monitor()
/* Profile the domath( ) routine using monstartup.
 * This example allocates a buffer for the entire program.
 * Compile command: cc -p foo.c -o foo -lm
 * Before running the executable, enter the following
 * from the command line to disable default profiling support:
 *     setenv PROFFLAGS -disable_default
 */
#include <stdio.h>
#include <sys/syslimits.h>

char dir[PATH_MAX];
extern void __start( );
extern unsigned long _etext;

main( )
{
    int i;
    int a = 1;

    /* Start profiling between __start (beginning of text)
     * and _etext (end of text).  The profiling library
     * routines will allocate the buffer.
     */
    monstartup(__start, &_etext);

    for (i = 0; i < 10; i++)
        domath( );

    /* Stop profiling and write the profiling output file. */
    monitor(0);
}

domath( )
{
    int i;
    double d1, d2;

    d2 = 3.1415;
    for (i = 0; i < 1000000; i++)
        d1 = sqrt(d2) * sqrt(d2);
}
The external name
_etext
lies just above all the
program text.
See
end
(3)
for more information.
When you set the
PROFFLAGS
environment variable to
-disable_default
, you disable default profiling buffer support.
You can allocate buffers within your program, as shown in
Example 8-13.
Example 8-13: Allocating Profiling Buffers Within a Program
/* Profile the domath routine using monitor().
 * Compile command: cc -p foo.c -o foo -lm
 * Before running the executable, enter the following
 * from the command line to disable default profiling support:
 *     setenv PROFFLAGS -disable_default
 */
#include <sys/types.h>
#include <sys/syslimits.h>

extern char *calloc( );
void domath(void);
void nextproc(void);

#define INST_SIZE 4     /* Instruction size on Alpha */

char dir[PATH_MAX];

main( )
{
    int i;
    char *buffer;
    size_t bufsize;

    /* Allocate one counter for each instruction to
     * be sampled.  Each counter is an unsigned short.
     */
    bufsize = (((char *)nextproc - (char *)domath) / INST_SIZE)
              * sizeof(unsigned short);

    /* Use calloc( ) to ensure that the buffer is clean
     * before sampling begins.
     */
    buffer = calloc(bufsize, 1);

    /* Start sampling. */
    monitor(domath, nextproc, buffer, bufsize, 0);

    for (i = 0; i < 10; i++)
        domath( );

    /* Stop sampling and write out the profiling buffer. */
    monitor(0);
}

void domath(void)
{
    int i;
    double d1, d2;

    d2 = 3.1415;
    for (i = 0; i < 1000000; i++)
        d1 = sqrt(d2) * sqrt(d2);
}

void nextproc(void)
{
}
You use the
monitor_signal( )
routine to profile
programs that do not terminate.
Declare this routine as a signal handler in
your program and build the program for
prof
or
gprof
profiling.
While the program is executing, send a signal from
the shell by using the
kill
command.
When the signal is received,
monitor_signal
is invoked
and writes profiling data to the data file.
If the program receives another
signal, the data file is overwritten.
Example 8-14
shows how to use the
monitor_signal
routine.
Setting the
PROFFLAGS
environment variable to
-sigdump SIGNAME
provides the same capability without needing any
new program code.
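For example, the following C shell command (with SIGUSR1 chosen only for illustration) requests a data-file dump on receipt of that signal:
%
setenv PROFFLAGS "-sigdump SIGUSR1"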
Example 8-14: Using monitor_signal() to Profile Nonterminating Programs
/* From the shell, start up the program in background.
 * Send a signal to the process, for example: kill -30 <pid>
 * Process the [g]mon.out file normally, using gprof or prof.
 */
#include <signal.h>

extern int monitor_signal();

main()
{
    int i;
    double d1, d2;

    /*
     * Declare monitor_signal() as the signal handler for SIGUSR1.
     */
    signal(SIGUSR1, monitor_signal);

    d2 = 3.1415;

    /*
     * Loop infinitely (absurd example of a nonterminating process).
     */
    for (;;)
        d1 = sqrt(d2) * sqrt(d2);
}