Optimizing an application program can involve modifying the build process, modifying the source code, or both.
In many instances, optimizing an application program can result in major improvements in run-time performance. Two preconditions should be met, however, before you begin measuring the run-time performance of an application program and analyzing how to improve the performance:
Check the software on your system to ensure that you are using the latest versions of the compiler and the operating system to build your application program. Newer versions of a compiler often perform more advanced optimizations, and newer versions of the operating system often operate more efficiently.
Test your application program to ensure that it runs without errors.
Whether you are porting an application from a 32-bit system to Tru64 UNIX or developing a new application, never attempt to optimize an application until it has been thoroughly debugged and tested. (If you are porting an application written in C, compile your program using the C compiler's -message_enable questcode option, and/or use lint with the -Q option to help identify possible portability problems that you may need to resolve.)
After you verify that these conditions have been met, you can begin the optimization process.
The process of optimizing an application can be divided into two separate, but complementary, activities:
Tuning your application's build process so that you use, for example, an optimal set of automatic preprocessing and compilation optimizations (see Section 10.1).
Analyzing your application's source code to ensure that it uses efficient algorithms, and that it does not use programming language constructs that can degrade performance (see Section 10.2). This manual phase also includes the use of profiling tools to analyze performance, as explained in Chapter 8.
The following sections provide details that relate to these two aspects
of the optimization process.
10.1 Guidelines to Build an Application Program
Opportunities to automatically improve an application's run-time performance exist in all phases of the build process. The following sections identify some of the major opportunities that exist in the areas of compiling, linking and loading, preprocessing and postprocessing, and library selection. A particularly effective technique is profile-directed optimization with the spike tool (Section 10.1.3).
10.1.1 Compilation Considerations
Compile your application with the highest optimization level possible, that is, the level that produces the best performance and the correct results. In general, applications that conform to language-usage standards should tolerate the highest optimization levels, and applications that do not conform to such standards may have to be built at lower optimization levels. See cc(1) or Chapter 2 for more information.
If your application will tolerate it, compile all of the source files together in a single compilation. Compiling multiple source files increases the amount of code that the compiler can examine for possible optimizations, enabling interfile optimizations that are not possible when the files are compiled separately.
To take advantage of these optimizations, use the -ifo and either -O3 or -O4 compilation options.
To determine whether the highest level of optimization benefits your particular program, compare the results of two separate compilations of the program, with one compilation at the highest level of optimization and the other compilation at the next lower level of optimization. Some routines may not tolerate a high level of optimization; such routines will have to be compiled separately.
Other compilation considerations that can have a significant impact on run-time performance include the following:
For C applications with numerous floating-point operations, consider using the -fp_reorder option if a small difference in the result is acceptable.

If your C application uses a lot of char, short, or int data items within loops, you may be able to use the C compiler's highest-level optimization option to improve performance. (The highest-level optimization option, -O4, implements byte vectorization, among other optimizations, for Alpha systems.)
For C and Fortran applications whose performance can be characterized using one or more sample runs, consider using the -feedback option. This option is especially effective when used with the -spike option, as discussed in Section 10.1.3.2, and/or the -ifo option for even better results.
For C applications that are thoroughly debugged and that do not generate any exceptions, consider using the -speculate option. When a program compiled with this option is executed, values associated with a variety of execution paths are precomputed so that they are immediately available if they are needed. This work-ahead operation uses idle machine cycles, so it has no negative effect on performance. Performance is usually improved whenever a precomputed value is used.

The -speculate option can be specified in two forms:

-speculate all
-speculate by_routine

Both options result in exceptions being dismissed: the -speculate all option dismisses exceptions generated in all compilation units of the program, and the -speculate by_routine option dismisses only the exceptions in the compilation unit to which it applies. If speculative execution results in a significant number of dismissed exceptions, performance will be degraded.

The -speculate all option is more aggressive and may result in greater performance improvements than the other option, especially for programs doing floating-point computations. The -speculate all option cannot be used if any routine in the program does exception handling; however, the -speculate by_routine option can be used when exception handling occurs outside the compilation unit on which it is used. Neither -speculate option should be used if debugging is being done.
To print a count of the number of dismissed exceptions when the program does a normal termination, specify the following environment variable:
% setenv _SPECULATE_ARGS -stats
The statistics feature is not currently available with the -speculate all option.
Use of the -speculate all and -speculate by_routine options disables all messages about alignment fixups. To generate alignment messages for both speculative and nonspeculative alignment fixups, specify the following environment variable:
% setenv _SPECULATE_ARGS -alignmsg
You can specify both options as follows:
% setenv _SPECULATE_ARGS -stats -alignmsg
You can use the following compilation options together or individually to improve run-time performance (see cc(1) for more information):
-arch
    Specifies which version of the Alpha architecture to generate instructions for. See -arch in cc(1) for an explanation of the differences between -arch and -tune.

-ansi_alias
    Specifies whether source code observes ANSI C aliasing rules. ANSI C aliasing rules allow for more aggressive optimizations.

-ansi_args
    Specifies whether source code observes ANSI C rules about arguments. If ANSI C rules are observed, special argument-cleaning code does not have to be generated.

-fast
    Turns on a set of other optimization options for increased performance.

-feedback
    Specifies that the compiler should use the profile information contained in the specified file when performing optimizations. For more information, see Section 10.1.3.2.

-fp_reorder
    Specifies whether certain code transformations that affect floating-point operations are allowed.

-G
    Specifies the maximum byte size of data items in the small data sections (sbss or sdata).

-inline
    Specifies whether to perform inline expansion of functions.

-ifo
    Provides improved optimization (interfile optimization) and code generation across file boundaries that would not be possible if the files were compiled separately.

-O
    Specifies the level of optimization that is to be achieved by the compilation.

-om
    Performs a variety of postlink code optimizations. Most effective with programs compiled with the -non_shared option (see Appendix F). This option is being replaced with the -spike option (see Section 10.1.3).

-preempt_module
    Supports symbol preemption on a module-by-module basis.

-speculate
    Enables work (for example, load or computation operations) to be done in running programs on execution paths before the paths are taken.

-spike
    Performs a variety of postlink code optimizations (see Section 10.1.3).

-tune
    Selects processor-specific instruction tuning for specific implementations of the Alpha architecture. See -arch in cc(1) for an explanation of the differences between -tune and -arch.

-unroll
    Controls loop unrolling done by the optimizer at levels -O2 and above.
Using the preceding options may cause a reduction in accuracy and adherence to standards.
For C applications, the compilation option in effect for handling floating-point exceptions can have a significant impact on execution time as follows:
Default exception handling (no special compilation option)
With the default exception-handling mode, overflow, divide-by-zero, and invalid-operation exceptions always signal the SIGFPE exception handler. Also, any use of an IEEE infinity, an IEEE NaN (not-a-number), or an IEEE denormalized number will signal the SIGFPE exception handler. By default, underflows silently produce a zero result, although the compilers support a separate option that allows underflows to signal the SIGFPE handler.
The default exception-handling mode is suitable for any portable program that does not depend on the special characteristics of particular floating-point formats. The default mode provides the best exception-handling performance.
Portable IEEE exception handling (-ieee)

With the portable IEEE exception-handling mode, floating-point exceptions do not signal unless a special call is made to enable the fault. This mode correctly produces and handles IEEE infinity, IEEE NaNs, and IEEE denormalized numbers. This mode also provides support for most of the nonportable aspects of IEEE floating point: all status options and trap enables are supported, except for the inexact exception. (See ieee(3) for information on the inexact exception feature (-ieee_with_inexact). Using this feature can slow down floating-point calculations by a factor of 100 or more, and few, if any, programs have a need for its use.)
The portable IEEE exception-handling mode is suitable for any program that depends on the portable aspects of the IEEE floating-point standard. This mode is usually 10-20 percent slower than the default mode, depending on the amount of floating-point computation in the program. In some situations, this mode can increase execution time by more than a factor of two.
10.1.2 Linking and Loading Considerations
If your application does not use many large libraries, consider linking it nonshared. This allows the linker to optimize calls into the library, which decreases your application's startup time and improves run-time performance (if calls are made frequently). Nonshared applications, however, can use more system resources than call-shared applications. If you are running a large number of applications simultaneously and the applications have a set of libraries in common (for example, libX11 or libc), you may increase total system performance by linking them as call-shared. See Chapter 4 for details.

For applications that use shared libraries, ensure that those libraries can be quickstarted. Quickstarting is a Tru64 UNIX capability that can greatly reduce an application's load time. For many applications, load time is a significant percentage of the total time that it takes to start and run the application. If an object cannot be quickstarted, it still runs, but startup time is slower. See Section 4.7 for details.
10.1.3 Spike and Profile-Directed Optimization
This section describes use of the spike postlink optimizer.
10.1.3.1 Overview of spike
The spike tool performs code optimization after linking. Because it can operate on an entire program, spike is able to do optimizations that the compiler cannot do. spike is most effective when it uses profile information to guide optimization, as discussed in Section 10.1.3.2.
spike is new with Tru64 UNIX Version 5.1 and is intended to replace om and cord. It provides better control and more effective optimization, and it can be used with both executables and shared libraries. spike cannot be used with om or cord.
For information about om and cord, see Appendix F.
Some of the optimizations that spike performs are code layout, deleting unreachable code, and optimization of address computations. spike can process binaries that are linked on Tru64 UNIX V4.0 or later systems. Binaries that are linked on V5.1 or later systems contain information that allows spike to do additional optimization.
Note
spike does only some address optimizations on Tru64 UNIX V5.1 or later images, but om will do the optimization on V4 images. If you are using spike on pre-V5.1 binaries and you enable linker optimization (-O passed to cc in the link step), the difference in performance between om and spike is not expected to be significant.
You can use spike in two ways:

By applying the spike command to a binary file after compilation.

As part of the compilation process, by specifying the -spike option with the cc command (or with the cxx, f77, or f90 command, if the corresponding compiler is installed on your system).
The examples in this section and Section 10.1.3.2 show how to use both forms of spike. The spike command is more convenient when you do not want to relink the executable (Example 1) or when you are using profile information after compilation (Example 5 and Example 6). The -spike option is more convenient when you are not using profile information (Example 2), or when you are using profile information in the compiler, too (Example 3 and Example 4).
Example 1 and Example 2 show how to use spike without profiling information to guide the optimization. Section 10.1.3.2 explains how to use spike with feedback information from the pixie profiler.
Example 1
In this example, spike is applied to the binary my_prog, producing the optimized output file prog1.opt.

% spike my_prog -o prog1.opt
Example 2

In this example, spike is applied during compilation with the cc command's -spike option:

% cc -c file1.c
% cc -o prog3 file1.o -spike
The first command line creates the object file file1.o. The second command line links file1.o into an executable and uses spike to optimize the executable.
All of the spike command's options can be passed directly to the cc command's -spike option by using the cc command's -WS option. The following example shows the syntax:
% cc -spike -feedback prog -o prog *.c \
  -WS,-splitThresh,.999,-noaggressiveAlign
For complete information on the spike command's options and any restrictions on using spike, see spike(1).
10.1.3.2 Using spike for Profile-Directed Optimization
You can achieve some degree of automatic optimization by using the compiler's automatic optimization options that are described in the previous sections, such as -O, -fast, -inline, and so on. These options can help in the generation of minimal instruction sequences that make best use of the CPU architecture and cache memory.
However, the compiler and linker can improve on these optimizations if given information on which instructions are executed most often when a program is run with its normal input data and environment. Tru64 UNIX helps you provide this information by allowing a profiler's results to be fed back into a recompilation. This customized, profile-directed optimization can be used in conjunction with automatic optimization.
The following examples show how to use spike with the pixie profiler and various feedback techniques to tune the generated instruction sequences of a program.
Example 3
This example shows the three basic steps for profile-directed optimization with spike: (1) preparing the program for optimization, (2) creating an instrumented version of the program and running it to collect profiling statistics, and (3) feeding that information back to the compiler and linker to help them optimize the executable code. Later examples show how to elaborate on these steps to accommodate ongoing changes during development and data from multiple profiling runs.
% cc -feedback prog -o prog -O3 *.c          [1]
% pixie -update prog                         [2]
% cc -feedback prog -o prog -spike -O3 *.c   [3]
When the program is compiled with the -feedback option for the first time, a special augmented executable file is created. It contains information that the compiler uses to relate the executable to the source files. It also contains a section that is used later to store profiling feedback information for the compiler. This section remains empty after the first compilation because the pixie profiler has not yet generated any feedback information (step 2). Make sure that the file name specified with the -feedback option is the same as the executable file name, which in this example is prog (from the -o option).
By default, the -feedback option applies the -g1 option, which provides optimum symbolization for profiling.
You need to experiment with the -On option to find the level of optimization that provides the best run-time performance for your program and compiler. The compiler issues this message during the first compilation, because no feedback information is yet available:
cc: Info: Feedback file prog does not exist (nofbfil)
cc: Info: Compilation will proceed without feedback optimizations (nofbopt)
The pixie command creates an instrumented version of the program (prog.pixie) and then runs it (because a prof option, -update, is specified). Execution statistics and address mapping data are automatically collected in an instruction-counts file (prog.Counts) and an instruction-addresses file (prog.Addrs). The -update option puts this profiling information in the augmented executable.
In the second compilation with the -feedback option, the profiling information in the augmented executable guides the compiler and (through the -spike option) the postlink optimizer. This customized feedback enhances any automatic optimization that the -O3 and -spike options provide. You can make compiler optimizations even more effective by using the -ifo and/or -assume whole_program options in conjunction with the -feedback option. However, as noted in Section 10.1.1, the compiler may be unable to compile very large programs as if there were only one source file.
See pixie(1) and cc(1) for more information.
The profiling information in an augmented executable file makes it larger than a normal executable (typically 3-5 percent). After development is completed, you can use the strip command to remove any profiling and symbol table information. For example:

% strip prog
spike cannot process stripped images.
Example 4
During a typical development process, steps 2 and 3 of Example 3 are repeated as needed to reflect the impact of any changes to the source code. For example:
% cc -feedback prog -o prog -O3 *.c
% pixie -update prog
% cc -feedback prog -o prog -O3 *.c
  [modify source code]
% cc -feedback prog -o prog -O3 *.c
  .....
  [modify source code]
% cc -feedback prog -o prog -O3 *.c
% pixie -update prog
% cc -feedback prog -o prog -spike -O3 *.c
Because the profiling information in the augmented executable persists from compilation to compilation, the pixie processing step that updates the information does not have to be repeated every time that a source module is modified and recompiled. But each modification reduces the relevance of the old feedback information to the actual code and degrades the potential quality of the optimization, depending on the exact modification. The pixie processing step after the last modification and recompilation guarantees that the feedback information is correctly updated for the last compilation.
Example 5
You might want to run your instrumented program several times with different inputs to get an accurate picture of its profile. This example shows how to optimize a program by merging profiling statistics from two instrumented runs of a program, prog, whose output varies from run to run with different sets of input:
% cc -feedback prog -o prog *.c            [1]
% pixie -pids prog                         [2]
% prog.pixie (input set 1)                 [3]
% prog.pixie (input set 2)
% prof -pixie -update prog prog.Counts.*   [4]
% spike prog -feedback prog -o prog.opt    [5]
The first compilation produces an augmented executable, as explained in Example 3.
By default, each run of the instrumented program (prog.pixie) produces a profiling data file called prog.Counts. The -pids option adds the process ID of each of the instrumented program's test runs to the name of the profiling data file that is produced (prog.Counts.pid). Thus, the data files that subsequent runs produce do not overwrite each other.
The instrumented program is run twice, producing a uniquely named data file each time -- for example, prog.Counts.371 and prog.Counts.422.
The prof -pixie command merges the two data files. The -update option updates the executable, prog, with the combined information.
The spike command with the -feedback option uses the combined profiling information from the two runs of the program to guide the optimization, producing the optimized output file prog.opt.
The last step of this example could be changed to the following:
% cc -spike -feedback prog -o prog -O3 *.c
The -spike option requires that you relink the program. When using the spike command, you do not have to link the program a second time to invoke spike.
Example 6
This example differs from Example 5 in that a normal (unaugmented) executable is created, and the spike command's -fb option (rather than the -feedback option) is used:
% cc prog -o prog *.c
% pixie -pids prog
% prog.pixie (input set 1)
% prog.pixie (input set 2)
% prof -pixie -merge prog.Counts prog prog.Addrs prog.Counts.*
% spike prog -fb prog -o prog.opt
The prof -pixie -merge command merges the two data files from the two instrumented runs into one combined prog.Counts file. With this form of feedback, the -g1 option must be specified explicitly to provide optimum symbolization for profiling.

The spike -fb command uses the information in prog.Addrs and prog.Counts to produce the optimized output file prog.opt.
The method of Example 5 is preferred. The method in Example 6 is supported for compatibility and should be used only if you cannot compile with the -feedback option that uses feedback information stored in the executable.
10.1.4 Preprocessing and Postprocessing Considerations
Preprocessing options and postprocessing (run-time) options that can affect performance include the following:
Use the Kuck & Associates Preprocessor (KAP) tool to gain extra optimizations. The preprocessor uses final source code as input and produces an optimized version of the source code as output.
KAP is especially useful for applications with the following characteristics on both symmetric multiprocessing systems (SMP) and uniprocessor systems:
To take advantage of the parallel-processing capabilities of SMP systems, the KAP preprocessors support automatic and directed decomposition for C programs. KAP's automatic decomposition feature analyzes an existing program to locate loops that are candidates for parallel execution. Then, it decomposes the loops and inserts all necessary synchronization points. If more control is desired, the programmer can manually insert directives to control the parallelization of individual loops. On Tru64 UNIX systems, KAP uses the POSIX Threads Library to implement parallel processing.
For C programs, KAP is invoked with the kapc command (which invokes separate KAP processing) or the kcc command (which invokes combined KAP processing and Compaq C compilation). For information on how to use KAP on a C program, see the KAP for C for Tru64 UNIX manual.
KAP is available for Tru64 UNIX systems as a separately orderable layered product.
Use the following tools, especially with profile-directed feedback, for post-link optimization and procedure reordering:

spike (see Section 10.1.3)
om (see Appendix F)
cord (see Appendix F)
10.1.5 Library Routine Selection
Library routine options that can affect performance include the following:
Use the Compaq Extended Math Library (CXML, formerly Digital Extended Math Library -- DXML) for applications that perform numerically intensive operations. CXML is a collection of mathematical routines that are optimized for Alpha systems -- both SMP systems and uniprocessor systems. The routines in CXML are organized in the following four libraries:
BLAS -- A library of basic linear algebra subroutines
LAPACK -- A linear algebra package of linear system and eigensystem problem solvers
Sparse Linear System Solvers -- A library of direct and iterative sparse solvers
Signal Processing -- A basic set of signal-processing functions, including one-, two-, and three-dimensional fast Fourier transforms (FFTs), group FFTs, sine/cosine transforms, convolution functions, correlation functions, and digital filters
By using CXML, applications that involve numerically intensive operations may run significantly faster on Tru64 UNIX systems, especially when used with KAP. CXML routines can be called explicitly from your program or, in certain cases, from KAP (that is, when KAP recognizes opportunities to use the CXML routines). You access CXML by specifying the -ldxml option on the compilation command line.
For details on CXML, see the Compaq Extended Math Library Reference Guide.
The CXML routines are written in Fortran. For information on calling Fortran routines from a C program, see the Compaq Fortran (formerly Digital Fortran) user manual for Tru64 UNIX. (Information about calling CXML routines from C programs is also provided in the TechAdvantage C/C++ Getting Started Guide.)
If your application does not require extended-precision accuracy, you can use math library routines that are faster but slightly less accurate. Specifying the -D_FASTMATH option on the compilation command causes the compiler to use faster floating-point routines at the expense of three bits of floating-point accuracy. See cc(1) for more information.
Consider compiling your C programs with the -D_INTRINSICS and -D_INLINE_INTRINSICS options; this causes the compiler to inline calls to certain standard C library routines.
10.2 Application Coding Guidelines
If you are willing to modify your application, use the profiling tools to determine where your application spends most of its time. Many applications spend most of their time in a few routines. Concentrate your efforts on improving the speed of those heavily used routines.
Tru64 UNIX provides several profiling tools that work for programs written in C and other languages. See Chapter 7, Chapter 8, Chapter 9, prof_intro(1), gprof(1), hiprof(1), pixie(1), prof(1), third(1), uprofile(1), and atom(1) for more information.
After you identify the heavily used portions of your application, consider the algorithms used by that code. Is it possible to replace a slow algorithm with a more efficient one? Replacing a slow algorithm with a faster one often produces a larger performance gain than tweaking an existing algorithm.
When you are satisfied with the efficiency of your algorithms, consider making code changes to help the compiler optimize the object code that it generates for your application. High Performance Computing by Kevin Dowd (O'Reilly & Associates, Inc., ISBN 1-56592-032-5) is a good source of general information on how to write source code that maximizes optimization opportunities for compilers.
The following sections identify performance opportunities involving
data types, I/O handling, cache usage and data alignment, and general coding
issues.
10.2.1 Data-Type Considerations
Data-type considerations that can affect performance include the following:
The smallest unit of efficient access on Alpha systems is 32 bits. A 32- or 64-bit data item can be accessed with a single, efficient machine instruction. If your application's performance on older implementations of the Alpha architecture (processors earlier than EV56) is critical, you may want to consider the following points:
Avoid using integer and logical data types that are less than 32 bits, especially for scalars that are used frequently.
In C programs, consider replacing char and short declarations with int and long declarations.
Division of integer quantities is slower than division of floating-point quantities. If possible, consider replacing such integer operations with equivalent floating-point operations.
Integer division operations are not native to the Alpha processor and must be emulated in software, so they can be slow. Other non-native operations include transcendental operations (for example, sine and cosine) and square root.
10.2.2 Using Direct I/O on AdvFS Files
Direct I/O allows an application to use the file-system features that the Advanced File System (AdvFS) provides, such as file management, online backup, and online recovery, while eliminating the overhead of copying user data into the AdvFS cache. Direct I/O uses Direct Memory Access (DMA) commands to copy the user data directly between an application's buffer and a disk.
Normal file-system I/O maintains file pages in a cache. This allows the I/O to be completed asynchronously; once the data is in the cache and scheduled for I/O, the application does not need to wait for the data to be transferred to disk. In addition, because the data is already in the cache, subsequent accesses to this page do not need to read the data from disk. Most applications use normal file-system I/O.
Normal file-system I/O is not suited for applications that access the data on disk infrequently and manage inter-thread competition themselves. Such applications can take advantage of the reduced overhead of direct I/O. However, because data is not cached, access to a given page must be serialized among competing threads. To do this, direct I/O enforces synchronous I/O as the default. This means that when the read() routine returns to the application, the I/O has completed and the data is on disk. Any subsequent retrieval of that data will also incur an I/O operation to retrieve the data from disk.
An application can take advantage of asynchronous I/O (AIO), but still use the underlying direct I/O mechanism, by using the aio_read() and aio_write() system routines. These routines will return to the application before the data has been transferred to disk, and the aio_error() routine allows the application to poll for the completion of the I/O. (The kernel synchronizes the access to file pages so that two threads cannot concurrently write the same page.)
Threads using direct I/O to access a given file will be able to do so concurrently, provided that they do not access the same range of pages. For example, if thread A is writing pages 10 through 19 and thread B is writing pages 20 through 39, these operations will occur simultaneously. Continuing this example, if thread B attempts to write pages 15 through 39 in a single direct I/O transfer, it will be forced to wait until thread A completes its write because their page ranges overlap.
When using direct I/O, the best performance occurs when the requested transfer is aligned on a disk sector boundary and the transfer size is an even multiple of the underlying sector size. Larger transfers are generally more efficient than smaller ones, although the optimal transfer size depends on the underlying storage hardware.
Note
Direct I/O mode and the use of mapped file regions (mmap) are exclusive operations. You cannot set direct I/O mode on a file that uses mapped file regions. Mapping a file will also fail if the file is already open for direct I/O.

Direct I/O and atomic data logging modes are also mutually exclusive. If a file is open in one of these modes, subsequent attempts to open the file in the other mode will fail.
You can activate the direct I/O feature for use on an AdvFS file for both AIO and non-AIO applications.
To activate the feature, use the
open
function in an application,
setting the
O_DIRECTIO
file access flag.
For example:
open ("file", O_DIRECTIO | O_RDWR, 0644)
Direct I/O mode remains in effect until the file is closed by all users.
The
fcntl()
function with the parameter
F_GETCACHEPOLICY
can be used to return the caching policy of a file,
either
FCACHE
or
FDIRECTIO
mode.
For
example:
int fcntlarg = 0;

ret = fcntl( filedescriptor, F_GETCACHEPOLICY, &fcntlarg );
if ( ret != -1 && fcntlarg == FDIRECTIO ) {
   .
   .
   .
For details on the use of direct I/O and AdvFS, see
fcntl
(2)
and
open
(2).
10.2.3 Cache Usage and Data Alignment Considerations
Cache usage patterns can have a critical impact on performance:
If your application has a few heavily used data structures, try to allocate these data structures on cache-line boundaries in the secondary cache. Doing so can improve the efficiency of your application's use of cache. See Appendix A of the Alpha Architecture Reference Manual for more information.
Look for potential
data cache collisions between heavily used data structures.
Such collisions
occur when the distance between two data structures allocated in memory is
equal to the size of the primary (internal) data cache.
If your data structures
are small, you can avoid this by allocating them contiguously in memory.
You
can use the
uprofile
tool to determine the number of cache
collisions and their locations.
See Appendix A of the
Alpha Architecture Reference Manual
for
more information on data cache collisions.
Data alignment can also affect performance. By default, the C compiler aligns each data item on its natural boundary; that is, it positions each data item so that its starting address is an even multiple of the size of the data type used to declare it. Data not aligned on natural boundaries is called misaligned data. Misaligned data can slow performance because it forces the software to make necessary adjustments at run time.
In C programs, misalignment can occur when you type cast a pointer variable
from one data type to a larger data type; for example, type casting a
char
pointer (1-byte alignment) to an
int
pointer
(4-byte alignment) and then dereferencing the new pointer may cause unaligned
access.
Also in C, creating packed structures using the
#pragma pack
directive can cause unaligned access.
(See
Chapter 3
for details on the
#pragma pack
directive.)
To correct alignment problems in C programs, you can use the
-misalign
option (or
-assume noaligned_objects
), or you can make the necessary modifications to the source code.
If instances of misalignment
are required by your program for some reason, use the
__unaligned
data-type qualifier in any pointer definitions that involve the
misaligned data.
When data is accessed through the use of a pointer declared
__unaligned
, the compiler generates the additional code
necessary to copy or store the data without generating alignment errors.
(Alignment
errors have a much more costly impact on performance than the additional code
that is generated.)
Warning messages identifying misaligned data are not issued during the compilation of C programs. However, during execution of any program, the kernel issues warning messages ("unaligned access") for most instances of misaligned data. The messages include the program counter (PC) value for the address of the instruction that caused the misalignment.
You can use either of the following two methods to locate the code that causes the unaligned access fault:
By using a debugger to examine the PC value presented in the "unaligned access" message, you can find the routine name and line number for the instruction causing the misalignment. (In some cases, the "unaligned access" message results from a pointer passed by a calling routine. The return address register (ra) contains the address of the calling routine -- if the contents of the register have not been changed by the called routine.)
By turning off the -align option on the command line and running your program in a debugger session, you can examine your program's stack and variables at the point where the debugger stops due to the unaligned access.
For more information on data alignment, see Appendix A in the
Alpha Architecture Reference Manual.
See
cc
(1)
for information on alignment-control options that you can
specify on compilation command lines.
10.2.4 General Coding Considerations
General coding considerations specific to C applications include the following:
Use
libc
functions (for example:
strcpy
,
strlen
,
strcmp
,
bcopy
,
bzero
,
memset
,
memcpy
) instead of writing similar routines or your own loops.
These
functions are hand coded for efficiency.
Use the
unsigned
data type for variables
wherever possible because:
The variable is always greater than or equal to zero, which enables the compiler to perform optimizations that would not otherwise be possible.
The compiler generates fewer instructions for all unsigned divide operations.
Consider the following example:
long i;
unsigned long j;
   .
   .
   .
return i/2 + j/2;
In the example,
i/2
is
an expensive expression; however,
j/2
is inexpensive.
The compiler generates three instructions for the signed
i/2
operations:
addq $1, 1, $28
cmovge $1, $1, $28
sra $28, 1, $2
The compiler generates only one instruction
for the unsigned
j/2
operation:
srl $3, 1, $4
Also, consider using the
-unsigned
option to treat all
char
declarations
as
unsigned
char
.
If your application
temporarily needs large amounts of data, consider using the
malloc
function or the
alloca
built-in function instead
of declaring the data statically.
The
alloca
function allocates memory from the stack.
The memory is automatically released when the function that allocated it returns.
You must make sure that any code that uses
alloca
first
includes
alloca.h
.
If you do not do this, your code may
not work correctly.
Consider using the
malloc
function if your application
needs memory to live beyond the context of a specific function invocation.
The
malloc
function allocates memory from the process's
heap.
The memory remains available until it is explicitly released by a call
to
free
.
Using these functions can increase the performance of applications where physical memory is a scarce resource.
For multithreaded applications, note that
alloca
allocates memory from the calling thread's stack, which means that the allocating
and freeing of this memory does not incur any contention.
The
malloc
function (and associated functions) may allocate their memory from
a common pool using locking and atomic operations to control concurrent access.
See the Tuning Memory Allocation section of
malloc
(3)
for information
on simple ways to improve the performance of single and multithreaded applications
that use
malloc
.
Also for multithreaded applications, consider using the arena malloc
(amalloc
) mechanism to set up separate heaps for each thread
of a multithreaded application.
Minimize type casting, especially type conversion from integer to floating point and from a small data type to a larger data type.
To avoid cache misses, make sure that multidimensional arrays are traversed in natural storage order; that is, in row major order with the rightmost subscript varying fastest and striding by 1. Avoid column major order (which is used by Fortran).
If
your application fits in a 32-bit address space and allocates large amounts
of dynamic memory by allocating structures that contain many pointers, you
may be able to save significant amounts of memory by using the
-xtaso
option.
To use this option, you must
modify your source code with a C-language pragma that controls pointer size
allocations.
See
cc
(1)
and
Chapter 2
for information.
Do not use indirect calls in C programs (that is, calls that use routines or pointers to functions as arguments). Indirect calls introduce the possibility of changes to global variables. This effect reduces the amount of optimization that can be safely performed by the optimizer.
Use functions to return values instead of reference parameters.
Use
do while
instead of
while
or
for
whenever possible.
With
do while
, the optimizer does not have to duplicate
the loop condition to move code from within the loop to outside the loop.
Use local variables and avoid global variables.
Declare any
variable outside a function as
static
, unless that variable
is referenced by another source file.
Minimizing the use of global variables
increases optimization opportunities for the compiler.
Use value parameters instead of reference parameters or global variables. Reference parameters have the same degrading effects as pointers.
Write straightforward code.
For example, do not use
++
and
--
operators within an expression.
When you use these operators for their values instead of their side effects, the compiler often generates less efficient code.
For example, the following coding is not recommended:
while (n--) {
   .
   .
   .
}
The following coding is recommended:
while (n != 0) {
   n--;
   .
   .
   .
}
Avoid taking and passing addresses (that is,
&
values).
Using
&
values can create aliases,
make the optimizer store variables from registers to their home storage locations,
and significantly reduce optimization opportunities.
Avoid creating functions that take a variable number of arguments. A function with a variable number of arguments causes the optimizer to unnecessarily save all parameter registers on entry.
Declare functions as
static
unless the
function is referenced by another source module.
Use of
static
functions allows the optimizer to use more efficient calling sequences.
Also, avoid aliases where possible by introducing local variables to
store dereferenced results.
(A dereferenced result is the value obtained from
a specified address.) Dereferenced values are affected by indirect operations
and calls, but local variables are not; local variables can be kept in registers.
Example 10-1
shows how the proper placement of pointers
and the elimination of aliasing enable the compiler to produce better code.
Example 10-1: Pointers and Optimization
Source Code:

int len = 10;
char a[10];

void zero()
{
    char *p;
    for (p = a; p != a + len; )
        *p++ = 0;
}
Consider the use of pointers in
Example 10-1.
Because the statement
*p++ = 0
might modify
len
, the compiler must load it from memory and add it to the address
of
a
on each pass through the loop, instead of computing
a + len
in a register once outside
the loop.
You can use two different methods to increase the efficiency of the code used in Example 10-1:
Use subscripts instead of pointers.
As shown in the following
example, the use of subscripting in the
azero
procedure
eliminates aliasing; the compiler keeps the value of
len
in a register, saving two instructions, and still uses a pointer to access
a
efficiently, even though a pointer is not specified in the source
code:
Source Code:

char a[10];
int len;

void azero()
{
    int i;
    for (i = 0; i != len; i++)
        a[i] = 0;
}
Use local variables.
As shown in the following example, specifying
len
as a local variable or formal argument ensures that aliasing
cannot take place and permits the compiler to place
len
in a register:
Source Code:

char a[10];

void lpzero(len)
int len;
{
    char *p;
    for (p = a; p != a + len; )
        *p++ = 0;
}