Program Optimization and Performance

If you have not already done so, you must collect performance data for cachetest. See Collecting Data for the cachetest Example for instructions.

This section examines the effect of two optimization options, -O2 and -fast, on program performance. The transformations that the compiler has made to the code are indicated by compiler commentary messages, which appear in the annotated source code.

  1. If you have not already done so, do the following tasks:
    1. Choose File > Open and open cpi.er.
    2. Choose File > Add and add dcstall.er.
    3. Choose View > Set Data Presentation and ensure that the metrics for CPU Cycles as a time and for Instructions Executed are selected.
    4. Click the header of the Name column in the Functions tab.
  2. Compare the metrics for dgemv_opt1 and dgemv_opt2 with the metrics for dgemv_g1 and dgemv_g2.

    The source code for dgemv_opt1 and dgemv_opt2 is identical to that in dgemv_g1 and dgemv_g2. The difference is that dgemv_opt1 and dgemv_opt2 have been compiled with the -O2 compiler option. Both functions show about the same decrease in CPU time, whether measured by User CPU time or by CPU cycles, and about the same decrease in the number of instructions executed, but in neither function is the cache behavior improved.

  3. In the Functions tab, compare the metrics for dgemv_opt1 and dgemv_opt2 with the metrics for dgemv_hi1 and dgemv_hi2.

    The source code for dgemv_hi1 and dgemv_hi2 is identical to that in dgemv_opt1 and dgemv_opt2. The difference is that dgemv_hi1 and dgemv_hi2 have been compiled with the -fast compiler option. Now both routines have the same CPU time and the same cache performance. Both the CPU time and the cache stall cycle time have decreased compared to dgemv_opt1 and dgemv_opt2. Waiting for the cache to be loaded takes about 80% of the execution time.

  4. Click dgemv_hi1, then click the Source tab. Resize and scroll the display so that you can see the source code for all of dgemv_hi1.

    The compiler has done much more work to optimize this function. It has interchanged the loops that were the cause of the DTLB miss problems. In addition, the compiler has created new loops that have more floating-point add and floating-point multiply operations per loop cycle, and inserted prefetch instructions to improve the cache behavior.

  5. Scroll down to see the source code for dgemv_hi2.

    The compiler commentary messages are the same as for dgemv_hi1 except for the loop interchange. The assembly code generated by the compiler for the two versions of the function is now essentially the same.

  6. Click the Disassembly tab.

    Compare the disassembly listing with that for dgemv_g1 or dgemv_opt1. There are many more instructions generated for dgemv_hi1, but the number of instructions executed is the smallest of the three versions of the function. Optimization can produce more instructions, but the instructions are used more efficiently and executed less frequently.
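
The loop interchange and the "more operations per loop cycle" transformations described in steps 4 through 6 can be sketched in C. This is an illustration only, not the cachetest source and not the compiler's actual output: the function names matvec_strided, matvec_interchanged, and matvec_unrolled4 are hypothetical, the matrix is assumed to be square and stored in C row-major order, and the unroll factor of 4 is an arbitrary choice. Prefetch insertion is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* All names below are hypothetical; they do not appear in cachetest. */

/* y = A*x with the inner loop walking down a column of a row-major
   matrix. Consecutive accesses to a[] are n doubles apart, so a large
   matrix touches a new cache line (and often a new page) every time. */
void matvec_strided(size_t n, const double *a, const double *x, double *y)
{
    assert(a && x && y);
    for (size_t i = 0; i < n; i++)
        y[i] = 0.0;
    for (size_t j = 0; j < n; j++)          /* outer loop over columns */
        for (size_t i = 0; i < n; i++)      /* inner loop: stride n    */
            y[i] += a[i * n + j] * x[j];
}

/* Same computation after loop interchange: the inner loop now reads
   a[] with unit stride. */
void matvec_interchanged(size_t n, const double *a, const double *x, double *y)
{
    assert(a && x && y);
    for (size_t i = 0; i < n; i++) {        /* outer loop over rows    */
        double sum = 0.0;
        for (size_t j = 0; j < n; j++)      /* inner loop: stride 1    */
            sum += a[i * n + j] * x[j];
        y[i] = sum;
    }
}

/* Interchanged loop unrolled by 4: more source and machine
   instructions, but four multiply-adds per loop trip, so fewer
   loop-control instructions are executed overall. */
void matvec_unrolled4(size_t n, const double *a, const double *x, double *y)
{
    assert(a && x && y);
    for (size_t i = 0; i < n; i++) {
        const double *row = a + i * n;
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        size_t j = 0;
        for (; j + 4 <= n; j += 4) {        /* main unrolled loop      */
            s0 += row[j]     * x[j];
            s1 += row[j + 1] * x[j + 1];
            s2 += row[j + 2] * x[j + 2];
            s3 += row[j + 3] * x[j + 3];
        }
        for (; j < n; j++)                  /* remainder loop          */
            s0 += row[j] * x[j];
        y[i] = (s0 + s1) + (s2 + s3);
    }
}
```

All three functions compute the same product. The unrolled version sums in a different order, so with general floating-point data its result can differ from the others in the last bits; a compiler is only free to make that kind of change when its options permit reordering floating-point arithmetic.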

See also
The Functions Tab
The Source Tab
The Disassembly Tab
Finding Data Flow Problems
