This example shows how data access patterns and compiler optimization affect performance. It uses two implementations of a matrix-vector multiplication routine, dgemv, which is included in standard BLAS libraries. The program contains three copies of the two routines. The first copy is compiled without optimization, to show how the order in which array elements are accessed affects the performance of the routines. The second copy is compiled with -O2 and the third with -fast, to show the effect of compiler loop reordering and optimization.
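The following is a minimal sketch, not the source of the cachetest program, of how two implementations of the same matrix-vector product can differ only in the order in which they access array elements. The function names mv_row_order and mv_col_order and the flat row-major storage are assumptions made for illustration. With row-major storage in C, the first routine reads memory with stride 1 and the second with stride n; in a column-major language such as Fortran the roles are reversed.

```c
#include <assert.h>
#include <stddef.h>

/* Both routines compute y = A*x for an m-by-n matrix A stored
 * row-major in a flat array, so element (i,j) is a[i*n + j].
 * Names and layout are hypothetical, chosen for illustration. */

/* Inner loop walks along a row: consecutive addresses (stride 1). */
void mv_row_order(size_t m, size_t n, const double *a,
                  const double *x, double *y)
{
    assert(a && x && y);
    for (size_t i = 0; i < m; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < n; j++)
            sum += a[i * n + j] * x[j];
        y[i] = sum;
    }
}

/* Inner loop walks down a column: each step jumps n doubles (stride n). */
void mv_col_order(size_t m, size_t n, const double *a,
                  const double *x, double *y)
{
    assert(a && x && y);
    for (size_t i = 0; i < m; i++)
        y[i] = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < m; i++)
            y[i] += a[i * n + j] * x[j];
}
```

Both routines produce the same result and perform the same number of floating-point operations; only the memory access order differs. For matrices much larger than the cache, the stride-n version is likely to incur more cache misses, and an optimizing compiler may interchange its loops to recover the stride-1 pattern.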
This example also illustrates the use of hardware counters and compiler commentary for performance analysis.
To work through this example, you need hardware based on the UltraSPARC™ III processor family or later. For instructions on collecting performance data for this example, see Collecting Data for the cachetest Example. After you have collected the performance data, you can work through the tutorial, which is divided into three parts that are intended to be done in sequence:
Execution Speed
Program Structure and Cache Behavior
Program Optimization and Performance