If you have not already done so, you must collect performance data for cachetest. See Collecting Data for the cachetest Example for instructions.
In this section we examine the effect of two optimization options, -O2 and -fast, on program performance. The transformations that the compiler has made to the code are reported in compiler commentary messages, which appear in the annotated source code.
The source code is identical to that in dgemv and dgemv2. The difference is that dgemv_opt1 and dgemv_opt2 have been compiled with the -O2 compiler option. Both functions show about the same decrease in CPU time, whether measured by User CPU time or by CPU cycles, and about the same decrease in the number of instructions executed, but the cache behavior does not improve in either function.
The source code is identical to that in dgemv_opt1 and dgemv_opt2. The difference is that dgemv_hi1 and dgemv_hi2 have been compiled with the -fast compiler option. Now both routines have the same CPU time and the same cache performance. Both the CPU time and the cache stall cycle time have decreased compared to dgemv_opt1 and dgemv_opt2. Even so, waiting for the cache to be loaded takes about 80% of the execution time.
The compiler has done much more work to optimize this function. It has interchanged the loops that were the cause of the DTLB miss problems. In addition, the compiler has created new loops that perform more floating-point add and floating-point multiply operations per loop iteration, and has inserted prefetch instructions to improve the cache behavior.
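As a rough illustration of what loop interchange means for a matrix-vector product, the following C sketch shows the same computation with the two loop orders. It is not the cachetest source: the function names, the row-major storage of the matrix, and the square n x n shape are assumptions made for this example only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration, not the cachetest source.
   a is an n x n matrix stored row-major: element (i, j) is a[i * n + j]. */

/* Strided order: the inner loop walks down a column, so consecutive
   accesses to a are n doubles apart. For large n, successive accesses
   can land on different memory pages. */
void matvec_strided(size_t n, const double *a, const double *x, double *y)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        y[i] = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            y[i] += a[i * n + j] * x[j];
}

/* Interchanged order: the inner loop walks along a row, so consecutive
   accesses to a are adjacent in memory (stride 1). */
void matvec_interchanged(size_t n, const double *a, const double *x, double *y)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < n; j++)
            sum += a[i * n + j] * x[j];
        y[i] = sum;
    }
}
```

Both functions compute the same result vector; only the order in which memory is visited differs. With -fast the compiler performs this kind of transformation itself, so a sketch like this is useful mainly for reading the compiler commentary, not as something to write by hand.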
The compiler commentary messages are the same as for dgemv_hi1 except for the loop interchange. The assembly code generated by the compiler for the two versions of the function is now essentially the same.
Compare the disassembly listing with that for dgemv_g1 or dgemv_opt1. There are many more instructions generated for dgemv_hi1, but the number of instructions executed is the smallest of the three versions of the function. Optimization can produce more instructions, but the instructions are used more efficiently and executed less frequently.
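The trade-off between code size and work executed can be sketched with a hand-unrolled loop in C. This is only an analogy under stated assumptions: the function names are hypothetical, the unroll factor of four is arbitrary, and the trip counter is a crude stand-in for loop-control instructions, not a measurement of real instruction counts, which come from the disassembly metrics.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration, not compiler output and not the cachetest source.
   *trips receives the number of loop iterations executed, used here as a
   rough proxy for loop-control overhead (compare, increment, branch). */

/* Compact code: one multiply-add per loop trip. */
double dot_rolled(size_t n, const double *a, const double *b, size_t *trips)
{
    assert(trips != NULL);
    double sum = 0.0;
    *trips = 0;
    for (size_t i = 0; i < n; i++) {
        sum += a[i] * b[i];
        (*trips)++;
    }
    return sum;
}

/* Larger code: four multiply-adds per loop trip, plus a cleanup loop
   for the elements left over when n is not a multiple of four. */
double dot_unrolled4(size_t n, const double *a, const double *b, size_t *trips)
{
    assert(trips != NULL);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    *trips = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i] * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
        (*trips)++;
    }
    double sum = (s0 + s1) + (s2 + s3);
    for (; i < n; i++) {
        sum += a[i] * b[i];
        (*trips)++;
    }
    return sum;
}
```

The unrolled version has more statements, but for a ten-element input it runs four loop trips where the rolled version runs ten. The four separate accumulators change the order of the floating-point additions, so results can differ in the last bits for general inputs.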
See also | |
---|---|
The Functions Tab | The Source Tab |
The Disassembly Tab | Finding Data Flow Problems |