Some efficiency problems result from a poor ordering of the functions in the executable or from the sequence in which instructions are loaded for execution.
In the worst case, two functions that are connected by a call from one to the other could be widely separated in the text space, so that both the call and the return cause text page faults. You can locate these faults by examining Text Fault timing metrics. One way of fixing this problem is to generate a mapfile that is used by the compiler to reorder the functions in the executable. The new order is determined by the sort metric, so you should choose a sort metric that places the functions that cause text page faults next to each other in the function list. See Generating and Using a Mapfile for more information.
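The effect of reordering can be illustrated with a small model. The following Python sketch is not part of the tools described here; it only assumes that functions are laid out back to back in list order. The page size, function sizes, call pairs, and metric values are all hypothetical, and a call is treated as crossing pages when the two functions start on different pages, which is a simplification.

```python
PAGE_SIZE = 8192  # assumed page size in bytes; the real value depends on the platform


def layout(order, sizes):
    """Assign consecutive start addresses to functions in the given order."""
    addr = 0
    starts = {}
    for name in order:
        starts[name] = addr
        addr += sizes[name]
    return starts


def cross_page_calls(order, sizes, calls):
    """Return the (caller, callee) pairs whose start addresses fall on different pages."""
    starts = layout(order, sizes)
    return [
        (caller, callee)
        for caller, callee in calls
        if starts[caller] // PAGE_SIZE != starts[callee] // PAGE_SIZE
    ]


def reorder_by_metric(sizes, metric):
    """Sort functions by descending metric value, breaking ties by name."""
    return sorted(sizes, key=lambda f: (-metric.get(f, 0.0), f))


# Hypothetical function sizes (bytes), call pairs, and text fault times (seconds).
sizes = {"main": 2000, "init": 6000, "parse": 3000, "log": 7000, "eval": 2500}
calls = [("parse", "eval"), ("main", "parse")]
text_fault_time = {"parse": 0.9, "eval": 0.8, "main": 0.1}

original_order = ["main", "init", "parse", "log", "eval"]
new_order = reorder_by_metric(sizes, text_fault_time)

print("before:", cross_page_calls(original_order, sizes, calls))
print("order: ", new_order)
print("after: ", cross_page_calls(new_order, sizes, calls))
```

In this model, sorting by the text fault metric moves the caller and callee that fault most often onto the same page. A real layout depends on the actual function sizes and on how the mapfile is applied.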
Another kind of delay in the execution of instructions happens when the instruction is not mapped. The lack of mapping causes an instruction translation lookaside buffer (ITLB) miss. The first miss is necessary to load the instruction, but other misses might be avoidable. You can record data for the ITLB hardware counter (for example, using the itlb alias for the counter name) and examine the corresponding metrics.
Further delays can occur if the instruction is mapped but not in the instruction cache. The first time an instruction is mapped, an instruction cache miss occurs, but subsequent misses might be avoidable. You can locate the places where instruction cache misses occur and the time they take by examining metrics for instruction cache misses and instruction cache stall cycles. To view these metrics, you must record the corresponding hardware-counter overflow profiling data. On UltraSPARC III hardware, you can use the icm and icstall aliases to record data for the hardware counters that count instruction cache misses and instruction cache stall cycles.
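Once miss counts and stall cycles have been collected, stall cycles can be converted to time by dividing by the clock rate. The following Python sketch shows that arithmetic on hypothetical per-function numbers. The clock rate, function names, and counts are assumptions made for illustration and are not output from any tool.

```python
CLOCK_HZ = 900_000_000  # assumed processor clock rate in Hz (hypothetical)


def stall_report(samples, clock_hz=CLOCK_HZ):
    """Return (function, stall seconds, stall cycles per miss) rows, largest stall time first.

    samples maps a function name to (instruction cache misses, stall cycles).
    """
    rows = []
    for name, (misses, stall_cycles) in samples.items():
        seconds = stall_cycles / clock_hz
        per_miss = stall_cycles / misses if misses else 0.0
        rows.append((name, seconds, per_miss))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Hypothetical counts: (instruction cache misses, instruction cache stall cycles).
samples = {
    "parse": (1_000_000, 45_000_000),
    "eval": (200_000, 6_000_000),
    "log": (0, 0),
}

report = stall_report(samples)
for name, seconds, per_miss in report:
    print(f"{name:8s} {seconds:8.4f} s  {per_miss:6.1f} cycles/miss")
```

Ranking by stall time rather than by miss count highlights the functions where the misses actually cost the most.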