If you have not already done so, you must collect performance data for cachetest. See Collecting Data for the cachetest Example for instructions.
In this part of the example, we examine why dgemv_g2 performs better than dgemv_g1.
For dgemv_g1, the clock-based time metric and the cycle-count metric differ because of DTLB (data translation lookaside buffer) misses. The system clock keeps running while the CPU waits for a DTLB miss to be resolved, but the cycle counter is turned off. For dgemv_g2 the difference is negligible, indicating that there are few DTLB misses.
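The gap between the two metrics can be put into numbers with simple arithmetic. The following is a minimal sketch, not part of cachetest or the analyzer: the function name uncounted_seconds and its parameters are hypothetical, and it assumes you know the processor clock rate so that a raw cycle count can be converted to seconds. If your tool already reports the cycle count as a time, the subtraction alone is enough.

```c
#include <assert.h>

/* Hypothetical helper: estimate the CPU time during which the cycle
 * counter was not advancing (for example, while waiting on DTLB misses).
 *
 *   cpu_seconds : clock-based CPU time for the function
 *   cycles      : CPU cycles counted for the same function
 *   clock_hz    : processor clock rate in Hz (assumed to be known)
 */
double uncounted_seconds(double cpu_seconds, double cycles, double clock_hz)
{
    assert(clock_hz > 0.0);
    /* Convert counted cycles to seconds, then subtract from clock time. */
    return cpu_seconds - cycles / clock_hz;
}
```

With made-up inputs of 10 seconds of CPU time and 6e9 cycles on a 1 GHz clock, the function returns 4 seconds. A result near zero corresponds to the dgemv_g2 case described above.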
dgemv_g2 spends less time than dgemv_g1 waiting for the cache to be reloaded, because its data access pattern makes more efficient use of the cache.
To see why, we examine the annotated source code. First, to limit the data in the display, we remove most of the metrics.
The two routines have different loop structures. Because the code is not optimized, dgemv_g1 accesses the array data by rows, with a large stride (in this case, 6000). This is the cause of the DTLB and cache misses. dgemv_g2 accesses the data by column, with a unit stride. Because the data for each loop iteration is contiguous, a large segment can be mapped and loaded into cache, and cache misses occur only when that segment has been used up and another is required.
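The effect of loop order on stride can be sketched in a few lines of C. This is an illustration only, not the cachetest source: the names matvec_strided and matvec_unit and the flat-array layout (element (i, j) stored at a[j * n + i]) are assumptions made for the example. Both functions compute the same result and differ only in which index the inner loop advances.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernels, not the cachetest code.
 * a is an n x n matrix in one flat array; element (i, j) is a[j * n + i]. */

/* Large stride: the inner loop advances j, so consecutive reads of a
 * are n elements apart in memory. */
void matvec_strided(size_t n, const double *a, const double *x, double *y)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < n; j++)
            sum += a[j * n + i] * x[j];
        y[i] = sum;
    }
}

/* Unit stride: the inner loop advances i, so consecutive reads of a
 * are adjacent in memory. */
void matvec_unit(size_t n, const double *a, const double *x, double *y)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        y[i] = 0.0;
    for (size_t j = 0; j < n; j++) {
        double xj = x[j];
        for (size_t i = 0; i < n; i++)
            y[i] += a[j * n + i] * xj;
    }
}
```

With n = 6000 and 8-byte doubles, each inner-loop step in matvec_strided jumps 48,000 bytes, so nearly every access lands on a different cache line. In matvec_unit the inner loop walks through adjacent elements, so each cache line that is loaded is fully used before the next one is needed.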