Comparing Parallel Sections and Parallel Do Strategies

If you have not already done so, you must collect performance data for omptest and open the first experiment, omptest.1.er. See Collecting Data for the omptest Example for instructions.

This exercise compares the performance of two routines, psec_() and pdo_(), which use the PARALLEL SECTIONS directive and the PARALLEL DO directive, respectively. The performance of the routines is compared as a function of the number of CPUs.

To compare the four-CPU run with the two-CPU run, you must start another instance of the Performance Analyzer with omptest.2.er loaded into it. In a terminal window, type the following command.

% analyzer omptest.2.er &

  1. In the Functions tab of each Performance Analyzer window, find the function psec_ and select it.

    You can use the Find tool to find this function. Note that the compiler has generated other functions whose names also start with psec_.

  2. Position the windows so that you can compare the Summary tabs.
  3. Compare the inclusive metrics for user CPU time, wall-clock time, and total LWP time.

    For the two-CPU run, the ratio of wall-clock time to either user CPU time or total LWP time is about 1 to 2, which indicates relatively efficient parallelization.

    For the four-CPU run, psec_() takes about the same wall-clock time as for the two-CPU run, but both the user CPU time and the total LWP time are higher. There are only two sections within the psec_() PARALLEL SECTIONS construct, so only two threads are required to execute them, and only two of the four available CPUs are doing useful work at any given time. The other two threads spend CPU time waiting for work. Because no more work is available, that time is wasted.

  4. In each Analyzer window, select pdo_ in the Functions tab.

    The data for pdo_() is now displayed in the Summary tab.

  5. Compare the inclusive metrics for user CPU time, wall-clock time, and total LWP time.

    The user CPU time for pdo_() is about the same as for psec_(). The ratio of wall-clock time to user CPU time is about 1 to 2 on the two-CPU run and about 1 to 4 on the four-CPU run. This indicates that the pdo_() parallelization strategy scales much more efficiently across multiple CPUs: it takes into account how many CPUs are available and schedules the loop accordingly.

  6. Close the Analyzer window that is displaying omptest.2.er.
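
The scaling behavior observed in the steps above can be illustrated with a small model. The following Python sketch is not part of omptest and does not reproduce its measurements. It assumes that a parallel region consists of equal, indivisible units of work (two sections for the PARALLEL SECTIONS case, many loop iterations for the PARALLEL DO case) and that units are handed out in rounds, one unit per thread per round. The function name model_region, the 8000 ms total, and the 1000-iteration loop are hypothetical values chosen only for illustration.

```python
def model_region(units, unit_ms, threads):
    """Model one parallel region (a simplified sketch, not a measurement).

    units   -- number of equal, indivisible pieces of work (assumption)
    unit_ms -- time one thread needs for one piece, in milliseconds
    threads -- number of threads, assumed to run one per CPU
    Returns (wall_ms, busy_ms, idle_ms, efficiency).
    """
    rounds = -(-units // threads)             # integer ceiling of units / threads
    wall_ms = rounds * unit_ms                # elapsed time until the last unit ends
    busy_ms = units * unit_ms                 # time spent doing useful work
    idle_ms = wall_ms * threads - busy_ms     # time threads spend waiting for work
    efficiency = busy_ms / (wall_ms * threads)
    return wall_ms, busy_ms, idle_ms, efficiency


TOTAL_MS = 8000  # hypothetical amount of work, identical for both strategies

# (number of units, time per unit): two sections versus a 1000-iteration loop
strategies = {
    "sections": (2, TOTAL_MS // 2),
    "do loop": (1000, TOTAL_MS // 1000),
}

for name, (units, unit_ms) in strategies.items():
    for threads in (2, 4):
        wall, busy, idle, eff = model_region(units, unit_ms, threads)
        print(f"{name:8s} threads={threads} wall={wall} ms "
              f"busy={busy} ms idle={idle} ms efficiency={eff:.2f}")
```

In this model, the two-section region takes 4000 ms of wall-clock time on both two and four threads, and on four threads half of the thread time is idle. The loop region drops from 4000 ms to 2000 ms when the thread count doubles. Whether idle time is reported as user CPU time in a real experiment depends on how the runtime makes idle threads wait, which the model does not attempt to capture.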
