10    Optimizing Techniques

Optimizing an application program can involve modifying the build process, modifying the source code, or both.

In many instances, optimizing an application program can result in major improvements in run-time performance. Two preconditions should be met, however, before you begin measuring the run-time performance of an application program and analyzing how to improve the performance:

After you verify that these conditions have been met, you can begin the optimization process.

The process of optimizing an application can be divided into two separate, but complementary, activities:

The following sections provide details that relate to these two aspects of the optimization process.

10.1    Guidelines to Build an Application Program

Opportunities to automatically improve an application's run-time performance exist in all phases of the build process. The following sections identify some of the major opportunities that exist in the areas of compiling, linking and loading, preprocessing and postprocessing, and library selection. A particularly effective technique is profile-directed optimization with the spike tool (Section 10.1.3).

10.1.1    Compilation Considerations

Compile your application with the highest optimization level possible, that is, the level that produces the best performance and the correct results. In general, applications that conform to language-usage standards should tolerate the highest optimization levels, and applications that do not conform to such standards may have to be built at lower optimization levels. See cc(1) or Chapter 2 for more information.

If your application will tolerate it, compile all of the source files together in a single compilation. Compiling multiple source files increases the amount of code that the compiler can examine for possible optimizations. This can have the following effects:

To take advantage of these optimizations, use the -ifo and either -O3 or -O4 compilation options.
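As a sketch (the file and program names here are hypothetical), a whole-program build that compiles all sources together with interfile optimization might look like this:

```shell
# Compile all source files in one cc invocation so that -ifo can
# optimize across file boundaries; -O3 enables the higher optimization level.
% cc -o myapp -O3 -ifo main.c util.c io.c
```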

To determine whether the highest level of optimization benefits your particular program, compare the results of two separate compilations of the program, with one compilation at the highest level of optimization and the other compilation at the next lower level of optimization. Some routines may not tolerate a high level of optimization; such routines will have to be compiled separately.

Other compilation considerations that can have a significant impact on run-time performance include the following:

Option Description
-arch Specifies which version of the Alpha architecture to generate instructions for. See -arch in cc(1) for an explanation of the differences between -arch and -tune.
-ansi_alias Specifies whether source code observes ANSI C aliasing rules. ANSI C aliasing rules allow for more aggressive optimizations.
-ansi_args Specifies whether source code observes ANSI C rules about arguments. If ANSI C rules are observed, special argument-cleaning code does not have to be generated.
-fast

Turns on the optimizations for the following options for increased performance:

-ansi_alias
-ansi_args
-assume trusted_short_alignment
-D_FASTMATH
-float
-fp_reorder
-ifo
-D_INLINE_INTRINSICS
-D_INTRINSICS
-intrinsics
-O3
-readonly_strings

-feedback Specifies that the compiler should use the profile information contained in the specified file when performing optimizations. For more information, see Section 10.1.3.2.
-fp_reorder Specifies whether certain code transformations that affect floating-point operations are allowed.
-G Specifies the maximum byte size of data items in the small data sections (sbss or sdata).
-inline Specifies whether to perform inline expansion of functions.
-ifo Provides improved optimization (interfile optimization) and code generation across file boundaries that would not be possible if the files were compiled separately.
-O Specifies the level of optimization that is to be achieved by the compilation.
-om Performs a variety of postlink code optimizations. Most effective with programs compiled with the -non_shared option (see Appendix F). This option is being replaced with the -spike option (see Section 10.1.3).
-preempt_module Supports symbol preemption on a module-by-module basis.
-speculate Enables work (for example, load or computation operations) to be done in running programs on execution paths before the paths are taken.
-spike Performs a variety of postlink code optimizations (see Section 10.1.3).
-tune Selects processor-specific instruction tuning for specific implementations of the Alpha architecture. See -arch in cc(1) for an explanation of the differences between -tune and -arch.
-unroll Controls loop unrolling done by the optimizer at levels -O2 and above.

Using the preceding options may cause a reduction in accuracy and adherence to standards.

10.1.2    Linking and Loading Considerations

If your application does not use many large libraries, consider linking it nonshared. This allows the linker to optimize calls into the library, which decreases your application's startup time and improves run-time performance (if calls are made frequently). Nonshared applications, however, can use more system resources than call-shared applications. If you are running a large number of applications simultaneously and the applications have a set of libraries in common (for example, libX11 or libc), you may increase total system performance by linking them as call-shared. See Chapter 4 for details.

For applications that use shared libraries, ensure that those libraries can be quickstarted. Quickstarting is a Tru64 UNIX capability that can greatly reduce an application's load time. For many applications, load time is a significant percentage of the total time that it takes to start and run the application. If an object cannot be quickstarted, it still runs, but startup time is slower. See Section 4.7 for details.

10.1.3    Spike and Profile-Directed Optimization

This section describes use of the spike postlink optimizer.

10.1.3.1    Overview of spike

The spike tool performs code optimization after linking. Because it can operate on an entire program, spike is able to do optimizations that the compiler cannot do. spike is most effective when it uses profile information to guide optimization, as discussed in Section 10.1.3.2.

spike is new with Tru64 UNIX Version 5.1 and is intended to replace om and cord. It provides better control and more effective optimization, and it can be used with both executables and shared libraries. spike cannot be used with om or cord. For information about om and cord, see Appendix F.

Some of the optimizations that spike performs are code layout, deleting unreachable code, and optimization of address computations.

spike can process binaries that are linked on Tru64 UNIX V4.0 or later systems. Binaries that are linked on V5.1 or later systems contain information that allows spike to do additional optimization.

Note

spike performs some of its address optimizations only on Tru64 UNIX V5.1 or later images; om can perform those optimizations on V4 images. If you are using spike on pre-V5.1 binaries and you enable linker optimization (-O passed to cc in the link step), the difference in performance between om and spike is not expected to be significant.

You can use spike in two ways:

  -  By invoking the spike command on a binary that has already been linked

  -  By specifying the -spike option on the cc command line when linking

The examples in this section and Section 10.1.3.2 show how to use both forms of spike. The spike command is more convenient when you do not want to relink the executable (Example 1) or when you are using profile information after compilation (Example 5 and Example 6). The -spike option is more convenient when you are not using profile information (Example 2), or when you are using profile information in the compiler as well (Example 3 and Example 4).

Example 1 and Example 2 show how to use spike without profiling information to guide the optimization. Section 10.1.3.2 explains how to use spike with feedback information from the pixie profiler.

Example 1

In this example, spike is applied to the binary my_prog, producing the optimized output file prog1.opt.

% spike my_prog -o prog1.opt

Example 2

In this example, spike is applied during compilation with the cc command's -spike option:

% cc -c file1.c
% cc -o prog3 file1.o -spike

The first command line creates the object file file1.o. The second command line links file1.o into an executable and uses spike to optimize the executable.

All of the spike command's options can be passed directly to the cc command's -spike option by using the cc command's -WS option. The following example shows the syntax:

% cc -spike -feedback prog -o prog *.c \
     -WS,-splitThresh,.999,-noaggressiveAlign

For complete information on the spike command's options and any restrictions on using spike, see spike(1).

10.1.3.2    Using spike for Profile-Directed Optimization

You can achieve some degree of automatic optimization by using the compiler's automatic optimization options that are described in the previous sections, such as -O, -fast, -inline, and so on. These options can help in the generation of minimal instruction sequences that make best use of the CPU architecture and cache memory.

However, the compiler and linker can improve on these optimizations if given information on which instructions are executed most often when a program is run with its normal input data and environment. Tru64 UNIX helps you provide this information by allowing a profiler's results to be fed back into a recompilation. This customized, profile-directed optimization can be used in conjunction with automatic optimization.

The following examples show how to use spike with the pixie profiler and various feedback techniques to tune the generated instruction sequences of a program.

Example 3

This example shows the three basic steps for profile-directed optimization with spike: (1) preparing the program for optimization, (2) creating an instrumented version of the program and running it to collect profiling statistics, and (3) feeding that information back to the compiler and linker to help them optimize the executable code. Later examples show how to elaborate on these steps to accommodate ongoing changes during development and data from multiple profiling runs.

% cc -feedback prog -o prog -O3 *.c [1]

% pixie -update prog [2]

% cc -feedback prog -o prog -spike -O3 *.c [3]

  1. When the program is compiled with the -feedback option for the first time, a special augmented executable file is created. It contains information that the compiler uses to relate the executable to the source files. It also contains a section that is used later to store profiling feedback information for the compiler. This section remains empty after the first compilation because the pixie profiler has not yet generated any feedback information (step 2). Make sure that the file name specified with the -feedback option is the same as the executable file name, which in this example is prog (from the -o option). By default, the -feedback option applies the -g1 option, which provides optimum symbolization for profiling. You need to experiment with the -On option to find the level of optimization that provides the best run-time performance for your program and compiler. The compiler issues this message during the first compilation, because no feedback information is yet available:

    cc: Info: Feedback file prog does not exist (nofbfil)
    cc: Info: Compilation will proceed without feedback optimizations (nofbopt)


  2. The pixie command creates an instrumented version of the program (prog.pixie) and then runs it (because a prof option, -update, is specified). Execution statistics and address mapping data are automatically collected in an instruction-counts file (prog.Counts) and an instruction-addresses file (prog.Addrs). The -update option puts this profiling information in the augmented executable.

  3. In the second compilation with the -feedback option, the profiling information in the augmented executable guides the compiler and (through the -spike option) the postlink optimizer. This customized feedback enhances any automatic optimization that the -O3 and -spike options provide. You can make compiler optimizations even more effective by using the -ifo and/or -assume whole_program options in conjunction with the -feedback option. However, as noted in Section 10.1.1, the compiler may be unable to compile very large programs as if there were only one source file.

See pixie(1) and cc(1) for more information.

The profiling information in an augmented executable file makes it larger than a normal executable (typically 3-5 percent). After development is completed, you can use the strip command to remove any profiling and symbol table information. For example:

% strip prog

spike cannot process stripped images.

Example 4

During a typical development process, steps 2 and 3 of Example 3 are repeated as needed to reflect the impact of any changes to the source code. For example:

% cc -feedback prog -o prog -O3 *.c
% pixie -update prog
% cc -feedback prog -o prog -O3 *.c
[modify source code]
% cc -feedback prog -o prog -O3 *.c
.....
[modify source code]
% cc -feedback prog -o prog -O3 *.c
% pixie -update prog
% cc -feedback prog -o prog -spike -O3 *.c

Because the profiling information in the augmented executable persists from compilation to compilation, the pixie processing step that updates the information does not have to be repeated every time that a source module is modified and recompiled. But each modification reduces the relevance of the old feedback information to the actual code and degrades the potential quality of the optimization, depending on the exact modification. The pixie processing step after the last modification and recompilation guarantees that the feedback information is correctly updated for the last compilation.

Example 5

You might want to run your instrumented program several times with different inputs to get an accurate picture of its profile. This example shows how to optimize a program by merging profiling statistics from two instrumented runs of a program, prog, whose output varies from run to run with different sets of input:

% cc -feedback prog -o prog *.c [1]

% pixie -pids prog [2]

% prog.pixie [3]
(input set 1)
% prog.pixie
(input set 2)

% prof -pixie -update prog prog.Counts.* [4]

% spike prog -feedback prog -o prog.opt [5]

  1. The first compilation produces an augmented executable, as explained in Example 3.

  2. By default, each run of the instrumented program (prog.pixie) produces a profiling data file called prog.Counts. The -pids option adds the process ID of each of the instrumented program's test runs to the name of the profiling data file that is produced (prog.Counts.pid). Thus, the data files that subsequent runs produce do not overwrite each other.

  3. The instrumented program is run twice, producing a uniquely named data file each time -- for example, prog.Counts.371 and prog.Counts.422.

  4. The prof -pixie command merges the two data files. The -update option updates the executable, prog, with the combined information.

  5. The spike command with the -feedback option uses the combined profiling information from the two runs of the program to guide the optimization, producing the optimized output file prog.opt.

The last step of this example could be changed to the following:

% cc -spike -feedback prog -o prog -O3 *.c

The -spike option requires that you relink the program. The spike command does not; it lets you invoke spike without linking the program a second time.

Example 6

This example differs from Example 5 in that a normal (unaugmented) executable is created, and the spike command's -fb option (rather than the -feedback option) is used:

% cc prog -o prog *.c
% pixie -pids prog
% prog.pixie
(input set 1)
% prog.pixie
(input set 2)
% prof -pixie -merge prog.Counts prog prog.Addrs prog.Counts.*
% spike prog -fb prog -o prog.opt

The prof -pixie -merge command merges the two data files from the two instrumented runs into one combined prog.Counts file. With this form of feedback, the -g1 option must be specified explicitly to provide optimum symbolization for profiling.

The spike -fb command uses the information in prog.Addrs and prog.Counts to produce the optimized output file prog.opt.

The method of Example 5 is preferred. The method in Example 6 is supported for compatibility and should be used only if you cannot compile with the -feedback option that uses feedback information stored in the executable.

10.1.4    Preprocessing and Postprocessing Considerations

Preprocessing options and postprocessing (run-time) options that can affect performance include the following:

10.1.5    Library Routine Selection

Library routine options that can affect performance include the following:

10.2    Application Coding Guidelines

If you are willing to modify your application, use the profiling tools to determine where your application spends most of its time. Many applications spend most of their time in a few routines. Concentrate your efforts on improving the speed of those heavily used routines.

Tru64 UNIX provides several profiling tools that work for programs written in C and other languages. See Chapter 7, Chapter 8, Chapter 9, prof_intro(1), gprof(1), hiprof(1), pixie(1), prof(1), third(1), uprofile(1), and atom(1) for more information.

After you identify the heavily used portions of your application, consider the algorithms used by that code. Is it possible to replace a slow algorithm with a more efficient one? Replacing a slow algorithm with a faster one often produces a larger performance gain than tweaking an existing algorithm.

When you are satisfied with the efficiency of your algorithms, consider making code changes to help the compiler optimize the object code that it generates for your application. High Performance Computing by Kevin Dowd (O'Reilly & Associates, Inc., ISBN 1-56592-032-5) is a good source of general information on how to write source code that maximizes optimization opportunities for compilers.

The following sections identify performance opportunities involving data types, I/O handling, cache usage and data alignment, and general coding issues.

10.2.1    Data-Type Considerations

Data-type considerations that can affect performance include the following:

10.2.2    Using Direct I/O on AdvFS Files

Direct I/O allows an application to use the file-system features that the Advanced File System (AdvFS) provides, such as file management, online backup, and online recovery, while eliminating the overhead of copying user data into the AdvFS cache. Direct I/O uses Direct Memory Access (DMA) commands to copy the user data directly between an application's buffer and a disk.

Normal file-system I/O maintains file pages in a cache. This allows the I/O to be completed asynchronously; once the data is in the cache and scheduled for I/O, the application does not need to wait for the data to be transferred to disk. In addition, because the data is already in the cache, subsequent accesses to this page do not need to read the data from disk. Most applications use normal file-system I/O.

Normal file-system I/O is not suited for applications that access the data on disk infrequently and manage inter-thread competition themselves. Such applications can take advantage of the reduced overhead of direct I/O. However, because data is not cached, access to a given page must be serialized among competing threads. To do this, direct I/O enforces synchronous I/O as the default. This means that when the read() routine returns to the application, the I/O has completed and the data is on disk. Any subsequent retrieval of that data will also incur an I/O operation to retrieve the data from disk.

An application can take advantage of asynchronous I/O (AIO), but still use the underlying direct I/O mechanism, by using the aio_read() and aio_write() system routines. These routines will return to the application before the data has been transferred to disk, and the aio_error() routine allows the application to poll for the completion of the I/O. (The kernel synchronizes the access to file pages so that two threads cannot concurrently write the same page.)

Threads using direct I/O to access a given file will be able to do so concurrently, provided that they do not access the same range of pages. For example, if thread A is writing pages 10 through 19 and thread B is writing pages 20 through 39, these operations will occur simultaneously. Continuing this example, if thread B attempts to write pages 15 through 39 in a single direct I/O transfer, it will be forced to wait until thread A completes its write because their page ranges overlap.

When using direct I/O, the best performance occurs when the requested transfer is aligned on a disk sector boundary and the transfer size is an even multiple of the underlying sector size. Larger transfers are generally more efficient than smaller ones, although the optimal transfer size depends on the underlying storage hardware.

Note

Direct I/O mode and the use of mapped file regions (mmap) are exclusive operations. You cannot set direct I/O mode on a file that uses mapped file regions. Mapping a file will also fail if the file is already open for direct I/O.

Direct I/O and atomic data logging modes are also mutually exclusive. If a file is open in one of these modes, subsequent attempts to open the file in the other mode will fail.

You can activate the direct I/O feature for use on an AdvFS file for both AIO and non-AIO applications. To activate the feature, use the open function in an application, setting the O_DIRECTIO file access flag. For example:

 open ("file", O_DIRECTIO | O_RDWR, 0644)

Direct I/O mode remains in effect until the file is closed by all users.

The fcntl() function with the parameter F_GETCACHEPOLICY can be used to return the caching policy of a file, either FCACHE or FDIRECTIO mode. For example:

int fcntlarg = 0;
ret = fcntl( filedescriptor, F_GETCACHEPOLICY, &fcntlarg );
if ( ret != -1 && fcntlarg == FDIRECTIO ) {
.
.
.

For details on the use of direct I/O and AdvFS, see fcntl(2) and open(2).

10.2.3    Cache Usage and Data Alignment Considerations

Cache usage patterns can have a critical impact on performance:

Data alignment can also affect performance. By default, the C compiler aligns each data item on its natural boundary; that is, it positions each data item so that its starting address is an even multiple of the size of the data type used to declare it. Data not aligned on natural boundaries is called misaligned data. Misaligned data can slow performance because it forces the software to make necessary adjustments at run time.

In C programs, misalignment can occur when you type cast a pointer variable from one data type to a larger data type; for example, type casting a char pointer (1-byte alignment) to an int pointer (4-byte alignment) and then dereferencing the new pointer may cause unaligned access. Also in C, creating packed structures using the #pragma pack directive can cause unaligned access. (See Chapter 3 for details on the #pragma pack directive.)

To correct alignment problems in C programs, you can use the -misalign option (or -assume noaligned_objects) or you can make necessary modifications to the source code. If instances of misalignment are required by your program for some reason, use the __unaligned data-type qualifier in any pointer definitions that involve the misaligned data. When data is accessed through the use of a pointer declared __unaligned, the compiler generates the additional code necessary to copy or store the data without generating alignment errors. (Alignment errors have a much more costly impact on performance than the additional code that is generated.)

Warning messages identifying misaligned data are not issued during the compilation of C programs. However, during execution of any program, the kernel issues warning messages ("unaligned access") for most instances of misaligned data. The messages include the program counter (PC) value for the address of the instruction that caused the misalignment.

You can use either of the following two methods to access code that causes the unaligned access fault:

For more information on data alignment, see Appendix A in the Alpha Architecture Reference Manual. See cc(1) for information on alignment-control options that you can specify on compilation command lines.

10.2.4    General Coding Considerations

General coding considerations specific to C applications include the following:

Also, avoid aliases where possible by introducing local variables to store dereferenced results. (A dereferenced result is the value obtained from a specified address.) Dereferenced values are affected by indirect operations and calls, but local variables are not; local variables can be kept in registers. Example 10-1 shows how the proper placement of pointers and the elimination of aliasing enable the compiler to produce better code.

Example 10-1:  Pointers and Optimization

Source Code:
int len = 10;
char a[10];
 
void zero()
  {
  char *p;
  for (p = a; p != a + len; ) *p++ = 0;
  }

Consider the use of pointers in Example 10-1. Because the statement *p++ = 0 might modify len, the compiler must load it from memory and add it to the address of a on each pass through the loop, instead of computing a + len in a register once outside the loop.

You can use two different methods to increase the efficiency of the code used in Example 10-1: