How Data Management Affects Cache Performance

If you have not already done so, you must collect performance data for mttest and open the two experiments, mttest.1.er and mttest.2.er, in separate instances of the Performance Analyzer. See Collecting Data for the mttest Example for instructions.

  1. Find ComputeA() and ComputeB() in the Functions tab of both Performance Analyzer windows.

    In the one-CPU experiment, mttest.2.er, the inclusive user CPU time for ComputeA() is almost the same as for ComputeB().

    In the four-CPU experiment, mttest.1.er, ComputeB() uses much more inclusive user CPU time than ComputeA().

    The remaining instructions apply to the four-CPU experiment, mttest.1.er.

  2. Click ComputeA(), then click the Source tab. Scroll down so that the source for both ComputeA() and ComputeB() is displayed.

    The code for these functions is identical: a loop adding one to a variable. All the user CPU time is spent in this loop. To find out why ComputeB() uses more time than ComputeA(), you must examine the code that calls these two functions.

  3. Use the Find tool to find cache_trash. Repeat the search until the source code for cache_trash() is displayed.

    Both ComputeA() and ComputeB() are called indirectly through a function pointer, so their names do not appear in the source code of cache_trash().

    You can verify that cache_trash() is the caller of ComputeB() by selecting ComputeB() in the Functions tab and then clicking the Callers-Callees tab.

  4. Compare the calls to ComputeA() and ComputeB().

    ComputeA() is called with a pointer to a double in the thread's own work block as its argument (&array->list[0]). That double can be read and written directly without any danger of contention with other threads.

    ComputeB(), however, is called with a pointer into a series of doubles that occupy successive words in memory (&element[array->index]), so the doubles used by different threads share cache lines. Whenever a thread writes to one of these addresses, every other thread that holds that cache line must invalidate its copy, which is now out of date. If such a thread needs its data again later, the line must be copied back into its data cache from memory, even though the data item it needs has not changed. This pattern is known as false sharing, and the resulting cache misses, which are attempts to access data that is not in the data cache, waste a lot of CPU time. This is why ComputeB() uses much more user CPU time than ComputeA() in the four-CPU experiment. (A minimal code sketch of the two data layouts follows these steps.)

    In the one-CPU experiment, only one thread runs at a time, so no other thread can write to memory while it runs, and the running thread's cached data never becomes invalid. No cache misses or copies from memory result, so ComputeB() performs just as efficiently as ComputeA() when only one CPU is available.
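
The following minimal sketch, written in C with POSIX threads, illustrates the two data layouts; it is not the actual mttest source. The names private_data, shared_data, add_private, and add_shared, the 64-byte cache-line size, and the thread and iteration counts are illustrative assumptions. Each thread repeatedly adds 1.0 to one double: in the padded layout each double occupies its own cache line, as in the call to ComputeA(); in the packed layout the doubles are adjacent in memory, so the threads keep invalidating each other's cached copies of the shared line, as in the call to ComputeB().

/*
 * Sketch only: compare a padded (one double per cache line) layout
 * with a packed layout in which adjacent doubles share cache lines
 * (false sharing).
 */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS   4              /* assumed thread count */
#define ITERATIONS 10000000L      /* assumed loop count */
#define LINE_SIZE  64             /* assumed cache-line size in bytes */

/* Padded layout: each thread's double sits in its own cache line. */
struct padded { double value; char pad[LINE_SIZE - sizeof(double)]; };
static struct padded private_data[NTHREADS];

/* Packed layout: adjacent doubles share cache lines. */
static double shared_data[NTHREADS];

static void *add_private(void *arg)          /* like ComputeA() */
{
    long id = (long)arg;
    long i;
    for (i = 0; i < ITERATIONS; i++)
        private_data[id].value += 1.0;
    return NULL;
}

static void *add_shared(void *arg)           /* like ComputeB() */
{
    long id = (long)arg;
    long i;
    for (i = 0; i < ITERATIONS; i++)
        shared_data[id] += 1.0;
    return NULL;
}

static void run(void *(*work)(void *), const char *label)
{
    pthread_t tid[NTHREADS];
    long i;
    for (i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, work, (void *)i);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    printf("%s done\n", label);
}

int main(void)
{
    run(add_private, "padded layout");
    run(add_shared, "packed layout");
    return 0;
}

On a multiprocessor, profiling a program like this should show the packed run spending noticeably more user CPU time (or data-cache stall cycles) than the padded run for the same amount of arithmetic, which is the effect you see for ComputeB() in mttest.1.er.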

Extension Exercise

If you are using a computer that has hardware counters, run the four-CPU experiment again and collect data for one of the cache hardware counters, such as cache misses or stall cycles. On UltraSPARC III hardware you can use the following command:

% collect -p off -h dcstall -o mttest.3.er mttest

You can combine the information from this new experiment with the previous four-CPU experiment by choosing Add from the File menu. Examine the hardware-counter data for ComputeA() and ComputeB() in the Functions tab and the Source tab.

