Optimizing an application program can involve modifying the build process, modifying the source code, or both.
In many instances, optimizing an application program can result in major improvements in run-time performance. Two preconditions should be met, however, before you begin measuring the run-time performance of an application program and analyzing how to improve the performance:
(Run the lint command with the -Q flag, or compile your program using the C compiler's -check flag (in combination with the -migrate or -newc flags), to identify possible portability problems that you may need to resolve.)
After you verify that these conditions have been met, you can begin the optimization process.
The process of optimizing an application can be divided into two separate, but complementary, activities: optimizing the build process and optimizing the source code.
The following sections provide details that relate to these two aspects of the optimization process.
Opportunities for improving an application's run-time performance exist in all phases of the build process. The following sections identify some of the major opportunities that exist in the areas of compiling, linking and loading, preprocessing and postprocessing, and library selection.
See Appendix D for additional optimization information that pertains only to the -oldc version of the C compiler. Appendix D contains information on uopt, the global optimizer (which is not used by the -migrate or -newc versions of the C compiler).
Compile your application with the highest optimization level possible,
that is, the level that produces the best performance and the correct
results.
In general, applications that conform to language-usage standards
should tolerate the highest optimization levels, and applications that
do not conform to such standards may have to be built at lower
optimization levels. For details, see
cc(1)
or
Chapter 2.
If your application will tolerate it, compile all of the source files together in a single compilation. Compiling multiple source files increases the amount of code that the compiler can examine for possible optimizations. This can have the following effects:
To take advantage of these optimizations, use the following compilation flags. For the -newc and -migrate versions of the C compiler, use -ifo and one of the following optimization-level flags: with the -newc flag, use -O3 or -O4; with the -migrate flag, use -O4 (preferred) or -O5.
(To determine whether the highest level of optimization benefits your particular program, compare the results of two separate compilations of the program, with one compilation at the highest level of optimization and the other compilation at the next lower level of optimization.)
For the -oldc version of the C compiler, use -O3.
See
cc(1)
or
Chapter 2
for information on when to use which
version of the C compiler.
Note that some routines may not tolerate a high level of optimization; such routines will have to be compiled separately.
Other compilation considerations that can have a significant impact on run-time performance include the following:
Use the -fp_reorder flag if a small difference in the result is acceptable.
If your program uses char, short, or int data items within loops, you may be able to use the C compiler's highest-level optimization flag to improve performance. (The highest-level optimization flag (-O4 with -newc and -O5 with -migrate) implements byte vectorization, among other optimizations, for Alpha systems.)
Consider using the -speculate flag. When a program compiled with this flag is executed, values associated with a variety of execution paths are precomputed so that they are immediately available if they are needed. This "work ahead" operation uses idle machine cycles, so it has no negative effect on performance. Performance is usually improved whenever a precomputed value is used.
The -speculate flag can be specified in two forms: -speculate all and -speculate by_routine.
Both options result in exceptions being dismissed: the -speculate all flag dismisses exceptions generated in all compilation units of the program, and the -speculate by_routine flag dismisses only the exceptions in the compilation unit to which it applies. If speculative execution results in a significant number of dismissed exceptions, performance will be degraded.
The -speculate all option is more aggressive and may result in greater performance improvements than the other option, especially for programs doing floating-point computations. The -speculate all flag cannot be used if any routine in the program does exception handling; however, the -speculate by_routine option can be used when exception handling occurs outside the compilation unit on which it is used. Neither -speculate option should be used if debugging is being done.
To print a count of the number of dismissed exceptions when the program does a normal termination, specify the following environment variable:
% setenv _SPECULATE_ARGS -stats
The statistics feature is not currently available with the -speculate all flag.
Use of the -speculate all and -speculate by_routine flags disables all messages about alignment fixups.
To generate alignment messages for both speculative and nonspeculative
alignment fixups, specify the following environment variable:
% setenv _SPECULATE_ARGS -alignmsg
Both options can be specified as follows:
% setenv _SPECULATE_ARGS -stats -alignmsg
You can use the following flags with the -newc, -migrate, and -oldc versions of the C compiler to improve run-time performance:
| Flag | Description |
| -ansi_alias | Specifies whether source code observes ANSI C aliasing rules. ANSI C aliasing rules allow for more aggressive optimizations. |
| -ansi_args | Specifies whether source code observes ANSI C rules about arguments. If ANSI C rules are observed, special argument-cleaning code does not have to be generated. |
| -fast | Turns on a set of optimization flags for increased performance. |
| -feedback | Specifies the name of a previously created feedback file. Information in the file can be used by the compiler when performing optimizations. |
| -fp_reorder | Specifies whether certain code transformations that affect floating-point operations are allowed. |
| -G | Specifies the maximum byte size of data items in the small data sections (sbss or sdata). |
| -inline | Specifies whether to perform inline expansion of functions. |
| -ifo | Provides improved optimization (interfile optimization) and code generation across file boundaries that would not be possible if the files were compiled separately. |
| -O | Specifies the level of optimization that is to be achieved by the compilation. |
| -Olimit | Specifies the maximum size, in basic blocks, of a routine that will be optimized by the global optimizer (uopt). (This flag can be used only with the -oldc flag.) |
| -om | Performs a variety of code optimizations for programs compiled with the -non_shared flag. |
| -preempt_module | Supports symbol preemption on a module-by-module basis. |
| -speculate | Enables work (for example, load or computation operations) to be done in running programs on execution paths before the paths are taken. |
| -tune | Selects processor-specific instruction tuning for specific implementations of the Alpha architecture. |
| -unroll | Controls loop unrolling done by the optimizer at levels -O2 and above. (This flag can be used only with the -newc or -migrate flags.) |
See cc(1) for details on these flags.
With the default exception handling mode, overflow, divide-by-zero,
and invalid-operation exceptions always signal the
SIGFPE
exception handler. Also, any use of an IEEE infinity, an IEEE NaN
(not-a-number), or an IEEE denormalized number will signal the
SIGFPE
exception handler.
By default, underflows silently produce a zero result, although the
compilers support a separate flag that allows underflows
to signal the
SIGFPE
handler.
The default exception handling mode is suitable for any portable
program that does not depend on the special characteristics of
particular floating-point formats. The default mode provides the
best exception handling performance.
With the portable IEEE exception handling mode (-ieee), floating-point exceptions do not signal unless a special call is made to enable the fault. This mode correctly produces and handles IEEE infinity, IEEE NaNs, and IEEE denormalized numbers. This mode also provides support for most of the nonportable aspects of IEEE floating point: all status flags and trap enables are supported, except for the inexact exception.
(See
ieee(3)
for information on the inexact exception feature
(-ieee_with_inexact).
Using this feature can slow down floating-point calculations by a
factor of 100 or more, and few, if any, programs have a need for
its use.)
The portable IEEE exception handling mode is suitable for any program that depends on the portable aspects of the IEEE floating-point standard. This mode is usually 10-20% slower than the default mode, depending on the amount of floating-point computation in the program. In some situations, this mode can increase execution time by more than a factor of two.
If your application does not use many large libraries, consider
linking it nonshared. This allows the linker to optimize calls into
the library, thus decreasing your application's startup time and
improving run-time performance (if calls are made frequently).
Nonshared applications, however, can use more system resources than
call-shared applications. If you are running a large number of
applications simultaneously and the applications have a set
of libraries in common (for example,
libX11
or
libc),
you may increase total system performance by linking them as
call-shared. See
Chapter 4
for details.
For applications that use shared libraries, ensure that those libraries can be quickstarted. Quickstarting is a Digital UNIX capability that can greatly reduce an application's load time. For many applications, load time is a significant percentage of the total time that it takes to start and run the application. If an object cannot be quickstarted, it still runs, but startup time is slower. See Section 4.7 for details.
You perform postlink optimizations by using
the
-om
flag on the
cc
command line. This flag must be used with the
-non_shared
flag and must be
specified when performing the final link, for example:
% cc -om -non_shared prog.c
The postlink optimizer performs the following code optimizations: removal of nop (no operation) instructions, that is, those instructions that have no effect on machine state; and removal of .lita data, that is, that portion of the data section of an executable image that holds address literals for 64-bit addressing. Using available switches, you can remove unused .lita entries after optimization and then compress the .lita section.
When you use the -om flag, you get the full range of postlink optimizations. To specify a specific postlink optimization, use the -WL compiler flag, followed by -om_option, where option can be one of the following:

compress_lita
Removes unused .lita entries after optimization, then compresses the .lita section.

dead_code
Removes dead code generated after optimizations. The .lita section is not compressed by this option.

ireorg_feedback,file
Uses the feedback files file.Counts and file.Addrs to reorganize the instructions to reduce cache thrashing.

no_inst_sched
Turns off instruction scheduling.

no_align_labels
Turns off alignment of labels. Normally, the -om flag will align the targets of all branches on quadword boundaries to improve loop performance.

Gcommon,num
Sets the size threshold of common symbols. Each common symbol whose size is less than or equal to num will be allocated close together.
For more information, see the
cc(1)
reference page.
Preprocessing options and postprocessing (run-time) options that can affect performance include the following:
KAP is especially useful for applications with the following characteristics on both symmetric multiprocessing systems (SMP) and uniprocessor systems:
To take advantage of the parallel processing capabilities of SMP systems, the KAP preprocessors support automatic and directed decomposition for C programs. KAP's automatic decomposition feature analyzes an existing program to locate loops that are candidates for parallel execution. Then, it decomposes the loops and inserts all necessary synchronization points. If more control is desired, the programmer can manually insert directives to control the parallelization of individual loops. On Digital UNIX systems, KAP uses DECthreads to implement parallel processing.
For C programs, KAP is invoked with the kapc command (which invokes separate KAP processing) or the kcc command (which invokes combined KAP processing and DEC C compilation). For information on how to use KAP on a C program, see the KAP for C for Digital UNIX User Guide.
KAP is available for Digital UNIX systems as a separately orderable layered product.
You can use the cord utility (-cord option) to improve the instruction cache behavior for C applications. This utility uses data from an actual run of your application to improve your application's use of the instruction cache.
To use the
cord
utility, you must first create a feedback file with the
pixie
and
gprof
tools.
See
pixie(5),
prof(1),
cord(1),
and
runcord(1)
for details. Also,
Chapter 8
describes how to use these tools.
(If you have produced a feedback file and you are going to
compile your program with the
-non_shared
flag, it is better to use the feedback file with the
-om
flag than with the
-cord
flag. See
Section 10.1.2.1
for details on the
om
utility.)
For the -newc and -migrate versions of the C compiler, the feedback information is most useful at the highest two levels of optimization (-O3 or -O4 for -newc and -O4 or -O5 for -migrate). (The -oldc version of the C compiler does not support the use of feedback files in its processing.)
If you are compiling a program with a feedback file and with the
-non_shared
flag, it is better to use the
-prof_use_om_feedback
flag than the
-prof_use_feedback
or
-feedback
flags. (See
Section 10.1.2.1
for details on the
om
utility.)
See Section 8.11 for information on how to create and use feedback files.
Library routine options that can affect performance include the following:
By using DXML, applications that involve numerically intensive
operations may run significantly faster on Digital UNIX systems,
especially when used with KAP. DXML routines can be called
explicitly from your program or, in certain cases, from KAP
(that is, when KAP recognizes opportunities to use the DXML routines).
You access DXML by specifying the
-ldxml
flag on the compilation command line.
For details on DXML, see the Digital Extended Mathematical Library for Digital UNIX Systems Reference Manual.
The DXML routines are written in Fortran. For information on calling Fortran routines from a C program, see the Digital UNIX user manual for the version of Fortran that you are using (DEC Fortran or DEC Fortran 90). (Information about calling DXML routines from C programs is also provided in the TechAdvantage C/C++ Getting Started Guide.)
Specifying the -D_FASTMATH flag on the compilation command causes the compiler to use faster floating-point routines at the expense of three bits of floating-point accuracy. See cc(1) for details.
Compile with the -D_INTRINSICS and -D_INLINE_INTRINSICS flags; this causes the compiler to inline calls to certain standard C library routines.
If you are willing to modify your application, use the profiler tools to determine where your application spends most of its time. Many applications spend most of their time in a few routines. Concentrate your efforts on improving the speed of those heavily used routines.
Digital provides several profiling tools that work for programs written
in C
and other languages. See
Chapter 8,
atom(1),
gprof(1),
hiprof(5),
pixie(5),
and
prof(1)
for more details.
After you identify the heavily used portions of your application, consider the algorithms used by that code. Is it possible to replace a slow algorithm with a more efficient one? Replacing a slow algorithm with a faster one often produces a larger performance gain than tweaking an existing algorithm.
When you are satisfied with the efficiency of your algorithms, consider making code changes to help the compiler optimize the object code that it generates for your application. High Performance Computing by Kevin Dowd (O'Reilly & Associates, Inc., ISBN 1-56592-032-5) is a good source of general information on how to write source code that maximizes optimization opportunities for compilers.
The following sections identify performance opportunities involving data types, cache usage and data alignment, and general coding issues.
Data type considerations that can affect performance include the following:
If performance is a critical concern, avoid using integer and logical data types that are less than 32 bits, especially for scalars that are used frequently. In C programs, consider replacing char and short declarations with int and long declarations.
Integer division operations are not native to the Alpha processor and must be emulated in software, so they can be slow. Other non-native operations include transcendental operations (for example, sine and cosine) and square root.
Cache usage patterns can have a critical impact on performance. You can use the uprofile tool to determine the number of cache collisions and their locations. See Appendix A of the Alpha Architecture Reference Manual for additional information on data cache collisions.
Data alignment can also affect performance. By default, the C compiler aligns each data item on its natural boundary; that is, it positions each data item so that its starting address is an even multiple of the size of the data type used to declare it. Data not aligned on natural boundaries is called misaligned data. Misaligned data can slow performance because it forces the software to make necessary adjustments at run time.
In C programs, misalignment can occur when you type cast a pointer variable from one data type to a larger data type; for example, type casting a char pointer (1-byte alignment) to an int pointer (4-byte alignment) and then dereferencing the new pointer may cause unaligned access.
Also in C, creating packed structures using the #pragma pack directive can cause unaligned access. (See Chapter 3 for details on the #pragma pack directive.)
To correct alignment problems in C programs,
you can use the
-align
flag or you can make necessary modifications to the source code.
If instances of misalignment are required by your program for some
reason, use the
__unaligned
data-type qualifier in any pointer definitions that involve the
misaligned data.
When data is accessed through the use of a pointer declared
__unaligned,
the compiler generates the additional code necessary to copy or store
the data without generating alignment errors.
(Alignment errors have a much more costly impact on performance than the
additional code that is generated.)
Warning messages identifying misaligned data are not issued
during the compilation of C programs by any version
of the C compiler
(-newc,
-migrate,
or
-oldc).
During execution of any program, the kernel issues warning messages
("unaligned access") for most instances of misaligned data.
The messages include the program counter (pc) value for the address
of the instruction that caused the misalignment.
You can use the machine code debugging capabilities of the
dbx
or
ladebug
debugger to determine the source code locations associated with
pc values.
For additional information on data alignment, see
Appendix A in the
Alpha Architecture Reference Manual.
See
cc(1)
for details on alignment-control flags that you can specify on
compilation command lines.
General coding considerations specific to C applications include the following:
Use the optimized libc functions (for example, strcpy, strlen, strcmp, bcopy, bzero, memset, and memcpy) instead of writing similar routines or your own loops. These functions are hand-coded for efficiency.
Use the unsigned data type for variables wherever possible, because the compiler can generate more efficient code for unsigned operations. Consider the following example:
long i;
unsigned long j;
   .
   .
   .
return i/2 + j/2;
In the example,
i/2
is an expensive expression; however,
j/2
is inexpensive.
The compiler generates three instructions for the signed
i/2
operation:

addq $1, 1, $28
cmovge $1, $1, $28
sra $28, 1, $2
The compiler generates only one instruction for the unsigned
j/2
operation:

srl $3, 1, $4
Also,
consider using the
-unsigned
flag to treat all
char
declarations as
unsigned
char.
If your application temporarily needs large amounts of data, consider allocating the data dynamically with the malloc function instead of declaring it statically. When you have finished using the memory, free it so it can be used for other data structures later in your program. Using this technique to reduce the total memory usage of your application can substantially increase the performance of applications running in an environment in which physical memory is a scarce resource.
If an application uses the
malloc
function extensively, you may be able to improve the application's
performance (processing speed, memory utilization, or both) by using
malloc's
control variables to tune memory allocation.
See
malloc(3)
for details.
For temporary storage, consider using the alloca function, which uses very few instructions and is very efficient. Storage allocated by the alloca function is automatically reclaimed when an exit is made from the routine in which the allocation is made.
The alloca function allocates space on the stack, not the heap, so you must make sure that the object being allocated does not exhaust all of the free stack space. If the object does not fit in the stack, a core dump is issued. Programs that issue calls to the alloca function should include the alloca.h header file. If the header file is not included, the program will execute properly, but it will run much slower.
Consider using the -xtaso flag. The -xtaso flag is supported by all versions of the C compiler (-newc, -migrate, and -oldc). To use the flag, you must modify your source code with a C-language pragma that controls pointer size allocations. See cc(1) and Chapter 2 for details.
Use do while instead of while or for whenever possible. With do while, the optimizer does not have to duplicate the loop condition in order to move code from within the loop to outside the loop.
Declare global variables as static unless the variable is referenced by another source file. Minimizing the use of global variables increases optimization opportunities for the compiler.
Avoid using the ++ and -- operators within an expression. When you use these operators for their values instead of their side effects, you often get bad code. For example, the following coding is not recommended:
while (n--)
{
.
.
.
}
The following coding is recommended:
while (n != 0)
{
n--;
.
.
.
}
Avoid taking and passing addresses (& values). Using & values can create aliases, make the optimizer store variables from registers to their home storage locations, and significantly reduce optimization opportunities.
Declare functions as static unless the function is referenced by another source module. Use of static functions allows the optimizer to use more efficient calling sequences.
You should also avoid aliases where possible by introducing local variables to store dereferenced results. (A dereferenced result is the value obtained from a specified address.) Dereferenced values are affected by indirect operations and calls, whereas local variables are not; local variables can be kept in registers. Example 10-1 shows how the proper placement of pointers and the elimination of aliasing enable the compiler to produce better code.
int len = 10;
char a[10];

void zero()
{
    char *p;

    for (p = a; p != a + len; )
        *p++ = 0;
}
Consider the use of pointers in Example 10-1. Because the statement *p++ = 0 might modify len, the compiler must load it from memory and add it to the address of a on each pass through the loop, instead of computing a + len in a register once outside the loop.
Two different methods can be used to increase the efficiency of the code used in Example 10-1:
Use of subscripts in the azero procedure eliminates aliasing; the compiler keeps the value of len in a register, saving two instructions, and still uses a pointer to access a efficiently, even though a pointer is not specified in the source code:
Source Code:
char a[10];
int len;

void
azero()
{
    int i;

    for (i = 0; i != len; i++)
        a[i] = 0;
}
Declaring len as a local variable or formal argument ensures that aliasing cannot take place and permits the compiler to place len in a register:
Source Code:
char a[10];

void
lpzero(len)
int len;
{
    char *p;

    for (p = a; p != a + len; )
        *p++ = 0;
}