The Compaq C compiler supports the development of shared memory parallel applications through its conformance to the OpenMP C Application Program Interface. This API defines a collection of compiler directives, library functions, and environment variables that instruct the compiler, linker, and run-time environment to perform portions of the application in parallel.
The OpenMP directives let you write code that can run concurrently on multiple processors without altering the structure of the source code from that of ordinary ANSI C serial code. Correct use of these directives can greatly improve the elapsed-time performance of user code by allowing that code to execute simultaneously on different processors of a multiprocessor machine. Compiling the same source code, but ignoring the parallel directives, produces a serial C program that performs the same function as the OpenMP compilation.
The OpenMP C and C++ Application Programming Interface specification is available on the Internet from:
http://www.openmp.org/mp-documents/cspec.pdf
http://www.openmp.org/mp-documents/cspec.ps
This chapter addresses the following topics:
Compilation options (Section 13.1)
Environment variables (Section 13.2)
Run-time performance tuning (Section 13.3)
Common programming problems (Section 13.4)
Implementation-specific behavior (Section 13.5)
Debugging (Section 13.6)
13.1 Compilation Options

The following options on the cc command line support parallel processing:

-mp
Causes the compiler to recognize both OpenMP manual decomposition pragmas and old-style manual decomposition directives. Forces libots3 to be included in the link. (Old-style manual decomposition directives are described in Appendix D.)
-omp
Causes the compiler to recognize only OpenMP manual decomposition pragmas and to ignore old-style manual decomposition directives. (Note that the -mp and -omp switches are the same except for their treatment of old-style manual decomposition directives; -mp recognizes the old-style directives and -omp does not.)
-granularity size
Controls the size of shared data in memory that can be safely accessed from different threads. Valid values for size are byte, longword, and quadword:

byte
Requests that all data of one byte or greater can be accessed from different threads sharing data in memory. This option will slow run-time performance.

longword
Ensures that naturally aligned data of four bytes or greater can be accessed safely from different threads sharing access to that data in memory. Accessing data items of three bytes or less and unaligned data may result in data items written from multiple threads being inconsistently updated.

quadword
Ensures that naturally aligned data of eight bytes can be accessed safely from different threads sharing data in memory. Accessing data items of seven bytes or less and unaligned data may result in data items written from multiple threads being inconsistently updated. This is the default.
-check_omp
Enables run-time checking of certain OpenMP constructs. This includes run-time detection of invalid nesting and other invalid OpenMP cases. When invalid nesting is discovered at run time and this switch is set, the executable will fail with a Trace/BPT trap. If this switch is not set and invalid nesting is discovered, the behavior is indeterminate (for example, an executable may hang).
The compiler detects the following invalid nesting conditions:

Entering a for, single, or sections directive if already in a work-sharing construct, a critical section, or a master directive

Executing a barrier directive if already in a work-sharing construct, a critical section, or a master directive

Executing a master directive if already in a work-sharing construct

Executing an ordered directive if already in a critical section

Executing an ordered directive unless already in an ordered for

The default is disabled run-time checking.
13.2 Environment Variables

In addition to the environment variables outlined in the OpenMP specification, the following environment variables are recognized by the compiler and the run-time system:
MP_THREAD_COUNT
Specifies how many threads are to be created by the run-time system. The default is the number of processors available to your process. The OMP_NUM_THREADS environment variable takes precedence over this variable.

MP_STACK_SIZE
Specifies how many bytes of stack space are to be allocated by the run-time system for each thread. If you specify zero, the run-time system uses the default, which is very small. Therefore, if a program declares any large arrays to be private, specify a value large enough to allocate them (see the sketch following this list). If you do not use this environment variable, the run-time system allocates 5 MB.

MP_SPIN_COUNT
Specifies how many times the run-time system can spin while waiting for a condition to become true. The default is 16,000,000, which is approximately one second of CPU time.

MP_YIELD_COUNT
Specifies how many times the run-time system can alternate between calling sched_yield and testing the condition before going to sleep by waiting for a thread condition variable. The default is 10.
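The following sketch illustrates when MP_STACK_SIZE matters; the array name, its size, and the suggested setting are illustrative, not taken from the manual. Each thread receives its own copy of the private array on its stack, so the default 5 MB stack may be too small:

/* Sketch only: each thread gets a private copy of "scratch" on its own
 * stack.  If the default 5 MB per-thread stack is too small, set
 * MP_STACK_SIZE (for example, to 20000000) before running the program. */
#include <omp.h>

#define N 1000000                       /* about 8 MB of doubles */

double scratch[N];

void fill(void)
{
    #pragma omp parallel private(scratch)
    {
        int i;                          /* block-scope, so private per thread */

        for (i = 0; i < N; i++)
            scratch[i] = (double) omp_get_thread_num();
    }
}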
13.3 Run-Time Performance Tuning
The OpenMP specification provides a variety of methods for distributing work to the available threads within a parallel for construct. The following sections describe these methods.
13.3.1 Schedule Type and Chunksize Settings
The choice of settings for the schedule type and the chunksize can affect the ultimate performance of the resulting parallelized application, either positively or negatively. Choosing inappropriate settings for the schedule type and the chunksize can degrade the performance of the parallelized application to the point where it performs as badly as, or worse than, its serial counterpart.
The general guidelines are as follows:
Smaller chunksize values generally perform better than larger ones. The chunksize value should be less than or equal to the value derived by dividing the number of iterations by the number of available threads.
The behavior of the dynamic and guided schedule types makes them better suited for target machines with a variety of workloads other than the parallelized application. These types assign iterations to threads as they become available; if a processor (or processors) becomes tied up with other applications, the available threads will pick up the next iterations.
Although the runtime schedule type does facilitate tuning of the schedule type at run time, it results in a minor performance penalty in run-time overhead.
An effective means of determining appropriate settings for schedule and chunksize is to set the schedule to runtime and experiment with various schedule and chunksize pairs through the OMP_SCHEDULE environment variable. After the exercise, explicitly set the schedule and chunksize to the values that yielded the best performance.
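For example, the following sketch (the loop body and the do_work function are illustrative, not part of the specification) uses the runtime schedule type, so the schedule and chunksize can be varied through OMP_SCHEDULE, for example guided,8 or static,100, without recompiling:

#include <omp.h>

#define N 10000

extern double do_work(int i);       /* hypothetical per-iteration work */

double results[N];

void run(void)
{
    int i;

    /* The schedule and chunksize are read from OMP_SCHEDULE at run time. */
    #pragma omp parallel for schedule(runtime)
    for (i = 0; i < N; i++)
        results[i] = do_work(i);
}

Once the best pair has been found, the schedule(runtime) clause can be replaced with, for example, schedule(guided,8).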
Note that the schedule and chunksize settings are only two of the many factors that can affect the performance of your application. Some of the other areas that can affect performance include:
Availability of system resources: CPUs on the target machine spending time processing other applications are not available to the parallelized application.
Structure of parallelized code: Threads of a parallelized region that perform disproportionate amounts of work.
Use of implicit and explicit barriers: Parallelized regions that force synchronization of all threads at these explicit or implicit points may cause the application to suspend while waiting for a thread (or threads).
Use of critical sections versus atomic statements: Using critical sections incurs more overhead than using atomic updates (see the sketch following this list).
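As a rough sketch (not taken from the specification), the two functions below protect the same shared update; the atomic form is restricted to a single simple update but typically costs less than the general critical section:

#include <omp.h>

double total = 0.0;

void sum_atomic(const double *a, int n)
{
    int i;

    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        #pragma omp atomic
        total += a[i];              /* lightweight protected update */
    }
}

void sum_critical(const double *a, int n)
{
    int i;

    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        #pragma omp critical
        {
            total += a[i];          /* general mutual exclusion; more overhead */
        }
    }
}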
For more information on schedule types and chunksize settings, see Appendix D of the OpenMP C and C++ Application Programming Interface specification.
13.3.2 Additional Controls

When one of the threads needs to wait for an event caused by some other thread, a three-level process begins:
The thread spins for a number of iterations waiting for the event to occur.
It yields the processor to other threads a number of times, checking for the event to occur.
It posts a request to be awakened and goes to sleep.
When another thread causes the event to occur, it will awaken the sleeping thread.
You may get better performance by tuning the threaded environment with the MP_SPIN_COUNT and MP_YIELD_COUNT environment variables or by using the mpc_destroy routine:
MP_SPIN_COUNT -- If your application is running standalone, the default settings will give good performance. However, if your application needs to share the processors with other applications, it is probably appropriate to reduce MP_SPIN_COUNT. This will make the threads waste less time spinning and give up the processor sooner; the cost is the extra time to put a thread to sleep and re-awaken it. In such a shared environment, an MP_SPIN_COUNT of about 1000 might be a good choice.
mpc_destroy -- If you need to perform operations that are awkward when extra threads are present (for example, fork), the mpc_destroy routine can be useful. It destroys any worker threads created to run parallel regions. Normally, you would call it only when you are not inside a parallel region. (The mpc_destroy routine is defined in the libots3 library.)
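For example, the following sketch destroys the worker threads before calling fork. It assumes mpc_destroy takes no arguments (check the libots3 documentation for the exact prototype), and the helper routine and program path are illustrative:

#include <unistd.h>
#include <sys/types.h>

void mpc_destroy(void);             /* assumed prototype, from libots3 */

void spawn_helper(const char *path)
{
    pid_t pid;

    mpc_destroy();                  /* call only outside a parallel region */

    pid = fork();                   /* no extra threads to inherit now */
    if (pid == 0) {
        execl(path, path, (char *) 0);
        _exit(1);
    }
}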
13.4 Common Programming Problems
The following sections describe some errors that commonly occur in parallelized
programs.
13.4.1 Scoping
The OpenMP parallel construct applies to the structured block that immediately follows it. When more than one statement is to be performed in parallel, make sure that the structured block is contained within curly braces. For example:
#pragma omp parallel
{
   pstatement one
   pstatement two
}
The preceding structured block is quite different from the following, where the OpenMP parallel construct applies to only the first statement:
#pragma omp parallel
   pstatement one
pstatement two
The use of curly braces to explicitly define the scope of the subsequent
block (or blocks) is strongly encouraged.
13.4.2 Deadlock
As with any multithreaded application, programmers must take care to prevent run-time deadlock conditions. Because many OpenMP constructs have implicit barriers at their ends, an application will deadlock if all threads do not actively participate in the construct. These conditions may be more prevalent when parallelism is implemented in the dynamic extents of the application. For example:
worker ()
{
   #pragma omp barrier
}

main ()
{
   #pragma omp parallel sections
   {
      #pragma omp section
         worker();
   }
}
The preceding example results in deadlock (with more than one thread active) because not all threads visit the worker routine and the barrier waits for all threads. The -check_omp option (see Section 13.1) aids in detecting such conditions.

For more information, see the OpenMP C and C++ Application Programming Interface specification for a description of valid and invalid directive nesting.
13.4.3 Threadprivate Storage
The threadprivate directive identifies variables that have file scope but are private to each thread. The values for these variables are maintained if the number of threads remains constant. If you explicitly increase or decrease the number of threads within a program, the impact on the values of threadprivate variables is not defined.
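For example, in the following sketch (the variable name and values are illustrative) the file-scope variable counter keeps a separate value in each thread, and those values persist between the two parallel regions because the thread count does not change:

#include <stdio.h>
#include <omp.h>

int counter;
#pragma omp threadprivate(counter)

void count_twice(void)
{
    #pragma omp parallel
    {
        counter = omp_get_thread_num();    /* each thread sets its own copy */
    }

    #pragma omp parallel
    {
        /* Values persist because the number of threads is unchanged. */
        printf("thread %d still sees %d\n", omp_get_thread_num(), counter);
    }
}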
13.4.4 Using Locks
Using the lock control routines (see the OpenMP C and C++ Application Programming Interface specification) requires that they be called in a specific sequence:
The lock to be associated with the lock variable must first be initialized.
The associated lock is made available to the executing thread.
The executing thread is released from lock ownership.
When finished, the lock must always be disassociated from the lock variable.
Attempting to use the locks outside the above sequence may cause unexpected
behavior, including deadlock conditions.
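The following sketch shows the standard OpenMP lock routines called in the required order (initialize, set, unset, destroy); the shared counter is illustrative:

#include <omp.h>

omp_lock_t lock;
int next = 0;

void count_threads(void)
{
    omp_init_lock(&lock);           /* 1. initialize the lock variable    */

    #pragma omp parallel
    {
        omp_set_lock(&lock);        /* 2. acquire ownership               */
        next++;                     /*    protected update of shared data */
        omp_unset_lock(&lock);      /* 3. release ownership               */
    }

    omp_destroy_lock(&lock);        /* 4. destroy the lock when finished  */
}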
13.5 Implementation-Specific Behavior
The OpenMP specification identifies several features and default values as implementation-specific. This section lists those instances and the implementation chosen by Compaq C.
Whenever a nested parallel region is encountered, a team consisting of one thread is created to execute that region.
OMP_SCHEDULE
The default value is dynamic,1. If an application uses the runtime schedule type but OMP_SCHEDULE is not defined, then this value is used.
OMP_NUM_THREADS
The default value is equal to the number of processors on the machine.
OMP_DYNAMIC
The default value is 0. Note that this implementation does not support dynamic adjustment of the thread count; calling omp_set_dynamic with a nonzero value has no effect on the run-time environment.
When a for or parallel for loop does not contain a schedule clause, a dynamic schedule type is used with the chunksize set to 1.
The flush directive, when encountered, will flush all variables, even if one or more variables are specified in the directive.
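For example, in the following producer/consumer sketch (the variable names and values are illustrative) the directives name only the variables of interest, but with this implementation each flush makes all shared variables consistent:

#include <omp.h>

int data = 0;
int flag = 0;

void handshake(void)
{
    #pragma omp parallel sections shared(data, flag)
    {
        #pragma omp section
        {                           /* producer */
            data = 42;
            #pragma omp flush(data)
            flag = 1;               /* signal that data is ready */
            #pragma omp flush(flag)
        }
        #pragma omp section
        {                           /* consumer */
            while (!flag) {
                #pragma omp flush(flag)
            }
            #pragma omp flush(data)
            /* ... use data here ... */
        }
    }
}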
13.6 Debugging

The following sections provide tips and hints on how to diagnose the behavior of and debug applications that use the OpenMP application programming interface (API).
13.6.1 Background Information Needed for Debugging
The -mp or -omp options cause the compiler to recognize OpenMP directives and to transform specified portions of code into parallel regions. The compiler implements a parallel region by taking the code in the region and putting it into a separate, compiler-created routine. This process is called outlining, because it is the inverse of inlining a routine into source code at the point where the routine is called.
Note
Understanding how the parallel regions are outlined is necessary to effectively use the debugger and other application-analysis tools.
In place of the parallel region, the compiler inserts a call to a run-time
library routine.
The run-time library routine creates the slave threads in
the team (if they were not already created), starts all threads in the team,
and causes them to call the outlined routine.
As threads return from the outlined
routine, they return to the run-time library, which waits for all threads
to finish before the master thread returns to the calling routine.
While the master thread continues nonparallel execution, the slave threads wait, or spin, until either a new parallel region is encountered or the environment-variable-controlled wait time (MP_SPIN_COUNT) is reached. If the wait time expires, the slave threads are put to sleep until the next parallel region is encountered.
The following source code contains a parallel region in which the variable id is private to each thread. The code preceding the parallel region explicitly sets the number of threads used in the parallel region to 2. The parallel region then obtains the thread number of the executing thread and displays it with a printf statement.
 1
 2  main()
 3  {
 4     int id;
 5     omp_set_num_threads(2);
 6  #  pragma omp parallel private (id)
 7     {
 8        id = omp_get_thread_num();
 9        printf ("Hello World from OpenMP Thread %d\n", id);
10     }
11  }
Using the dis command to disassemble the object module produced from the preceding source code results in the following output:
_ _main_6:                                                      [1]
      0x0:   27bb0001   ldah     gp, 1(t12)
      0x4:   2ffe0000   ldq_u    zero, 0(sp)
      0x8:   23bd8110   lda      gp, -32496(gp)
      0xc:   2ffe0000   ldq_u    zero, 0(sp)
      0x10:  23defff0   lda      sp, -16(sp)
      0x14:  b75e0000   stq      ra, 0(sp)
      0x18:  a2310020   ldl      a1, 32(a1)
      0x1c:  f620000e   bne      a1, 0x58
      0x20:  a77d8038   ldq      t12, -32712(gp)
      0x24:  6b5b4000   jsr      ra, (t12), omp_get_thread_num
      0x28:  27ba0001   ldah     gp, 1(ra)
      0x2c:  47e00411   bis      zero, v0, a1
      0x30:  23bd80e8   lda      gp, -32536(gp)
      0x34:  a77d8028   ldq      t12, -32728(gp)
      0x38:  a61d8030   ldq      a0, -32720(gp)
      0x3c:  6b5b4000   jsr      ra, (t12), printf
      0x40:  27ba0001   ldah     gp, 1(ra)
      0x44:  23bd80d0   lda      gp, -32560(gp)
      0x48:  a75e0000   ldq      ra, 0(sp)
      0x4c:  63ff0000   trapb
      0x50:  23de0010   lda      sp, 16(sp)
      0x54:  6bfa8001   ret      zero, (ra), 1
      0x58:  221ffff4   lda      a0, -12(zero)
      0x5c:  000000aa   call_pal gentrap
      0x60:  c3ffffef   br       zero, 0x20
      0x64:  2ffe0000   ldq_u    zero, 0(sp)
      0x68:  2ffe0000   ldq_u    zero, 0(sp)
      0x6c:  2ffe0000   ldq_u    zero, 0(sp)
main:
      0x70:  27bb0001   ldah     gp, 1(t12)
      0x74:  2ffe0000   ldq_u    zero, 0(sp)
      0x78:  23bd80a0   lda      gp, -32608(gp)
      0x7c:  2ffe0000   ldq_u    zero, 0(sp)
      0x80:  a77d8020   ldq      t12, -32736(gp)
      0x84:  23defff0   lda      sp, -16(sp)
      0x88:  b75e0000   stq      ra, 0(sp)
      0x8c:  47e05410   bis      zero, 0x2, a0
      0x90:  6b5b4000   jsr      ra, (t12), omp_set_num_threads
      0x94:  27ba0001   ldah     gp, 1(ra)
      0x98:  47fe0411   bis      zero, sp, a1
      0x9c:  2ffe0000   ldq_u    zero, 0(sp)
      0xa0:  23bd807c   lda      gp, -32644(gp)
      0xa4:  47ff0412   bis      zero, zero, a2
      0xa8:  a77d8010   ldq      t12, -32752(gp)
      0xac:  a61d8018   ldq      a0, -32744(gp)
      0xb0:  6b5b4000   jsr      ra, (t12), _OtsEnterParallelOpenMP   [2]
      0xb4:  27ba0001   ldah     gp, 1(ra)
      0xb8:  a75e0000   ldq      ra, 0(sp)
      0xbc:  2ffe0000   ldq_u    zero, 0(sp)
      0xc0:  23bd805c   lda      gp, -32676(gp)
      0xc4:  47ff0400   bis      zero, zero, v0
      0xc8:  23de0010   lda      sp, 16(sp)
      0xcc:  6bfa8001   ret      zero, (ra), 1
[1] _ _main_6 is the outlined routine created by the compiler for the parallel region beginning in routine main at listing line 6. The format for naming the compiler-generated outlined routines is as follows:

_ _original-routine-name_listing-line-number
[2] The call to _OtsEnterParallelOpenMP is inserted by the compiler to coordinate the thread creation and execution for the parallel region. Run-time control remains within _OtsEnterParallelOpenMP until all threads have completed the parallel region.
The principal tool for debugging OpenMP applications is the Ladebug debugger. Other tools include Visual Threads, the Atom tools pixie and third, and the OpenMP tool ompc.
13.6.2.1 Ladebug
This section describes how to use the Ladebug debugger with OpenMP applications. It explains unique considerations for an OpenMP application over a traditional, multithreaded application. It uses the example program in Section 13.6.1 to demonstrate the concepts of debugging an OpenMP application. For more complete information on debugging multithreaded programs, see the Ladebug Debugger Manual.
Because OpenMP applications are multithreaded, they can generally be debugged using the same strategies as regular multithreaded programs. There are, however, a few special considerations:
As with optimized code, the compiler alters the source module to enable OpenMP support. Thus, the source module shown in the debugger will not reflect the actual execution of the program. For example, the generated routines from the outlining process performed by the compiler will not be visible as distinct routines. Prior to a debugging session, an output listing or object module disassembly will provide the names of these routines. These routines can be analyzed within a Ladebug session in the same way as any normal routine.
The OpenMP standard defines thread numbers beginning with thread 0 (the master thread). Ladebug does not interpret OpenMP thread numbers; it interprets their pthreads equivalent, whose numbering begins with thread 1.
The call stacks for OpenMP slave threads originate at a pthreads library routine called thdBase and proceed through a libots3 routine called slave_main.
Variables that are private to a parallel region are private to each thread. Variables that are explicitly private (qualified by firstprivate, lastprivate, private, or reduction) have different memory locations for each thread.
To debug a parallel region, you can set a breakpoint at the outlined routine name. The following example depicts starting a Ladebug session, setting a breakpoint in the parallel region, and continuing execution. The user commands are described in footnotes.
> ladebug example                                               [1]
Welcome to the Ladebug Debugger Version 4.0-48
------------------
object file name: example
Reading symbolic information ...done
(ladebug) stop in _ _main_6                                     [2]
[#1: stop in void _ _main_6(int, int) ]
(ladebug) run                                                   [3]
[1] stopped at [void _ _main_6(int, int):6 0x1200014e0]
      6 #      pragma omp parallel private (id)
(ladebug) thread                                                [4]
Thread Name              State        Substate    Policy       Pri
------ ----------------- ------------ ----------- ------------ ---
>*>  1 default thread    running                  SCHED_OTHER  19
(ladebug) cont                                                  [5]
Hello World from OpenMP Thread 0
[1] stopped at [void _ _main_6(int, int):6 0x1200014e0]
      6 #      pragma omp parallel private (id)
(ladebug) thread                                                [6]
Thread Name              State        Substate    Policy       Pri
------ ----------------- ------------ ----------- ------------ ---
>*   2 <anonymous>       running                  SCHED_OTHER  19
(ladebug) cont                                                  [7]
Hello World from OpenMP Thread 1
Process has exited with status 0
[1] Start a Ladebug session with the example application.

[2] Create a breakpoint to stop at the start of the outlined routine _ _main_6.

[3] Start the program. Note that control stops at the beginning of _ _main_6.

[4] Show which thread is actively executing the parallel region (pthread 1, OpenMP thread 0, in this example).

[5] Continue from this point to allow the parallel region for OpenMP thread 0 to complete and print the Hello World message with the proper OpenMP thread number before the breakpoint is hit again.

[6] Show the next thread that is actively executing the parallel region (pthread 2, OpenMP thread 1).

[7] Continue from this point to print the next message and complete the execution of the program.
The following example shows how to set a breakpoint at the beginning of the outlined routine when pthread 2 (OpenMP thread 1) begins execution of the parallel region.
> ladebug example
Welcome to the Ladebug Debugger Version 4.0-48
------------------
object file name: example
Reading symbolic information ...done
(ladebug) stop thread 2 in _ _main_6                            [1]
[#1: stop thread (2) in void _ _main_6(int, int) ]
(ladebug) r
Hello World from OpenMP Thread 0
[1] stopped at [void _ _main_6(int, int):6 0x1200014e0]
      6 #      pragma omp parallel private (id)
(ladebug) thread
Thread Name              State        Substate    Policy       Pri
------ ----------------- ------------ ----------- ------------ ---
>*   2 <anonymous>       running                  SCHED_OTHER  19
(ladebug) c
Hello World from OpenMP Thread 1
Process has exited with status 0
[1] Stop OpenMP thread 1 (pthread 2) when it encounters the start of the parallel region.
Debugging the OpenMP combined work-sharing constructs (the for and sections directives) is analogous to the process shown in the preceding examples.
NOTE
The Ladebug debugger does not yet fully support OpenMP debugging. Variables that are declared as threadprivate are not recognized by Ladebug and cannot be viewed.
13.6.2.2 Visual Threads

Programs instrumented with OpenMP can be monitored with the Compaq Visual Threads (dxthreads) product, which is on the Associated Products Volume 1 CD-ROM. For details, see the Visual Threads online help.
13.6.2.3 Atom and OpenMP Tools
OpenMP applications can be instrumented using ompc, a special tool created for monitoring OpenMP applications, and the Atom-based tools pixie (for profiling an executable) and Third Degree (third, for monitoring memory access and potential leaks).
The ompc tool captures the pertinent environment variable settings and maintains counts of calls to the OpenMP-related run-time library routines. It generates warnings and error messages to call attention to situations that may not be what the developer intended. Finally, based on the settings of environment variables, it will trace all calls to these run-time library routines and report, by OpenMP thread number, the sequence in which they were called. See ompc(5) for more information.
The Atom-based pixie tool can be used to detect inefficient thread usage in the application. As described in Section 13.6.1, slave threads will wait, or spin, until a new parallel region is encountered or MP_SPIN_COUNT expires. If an application experiences long delays between parallel regions, the threads will spin until they are put to sleep. By instrumenting the application with pixie, you can see where the application is spending most of its time. If your application is spending large amounts of compute time in slave_main, this is a good indication that the threads are spending too much time spinning. By reducing MP_SPIN_COUNT (the default is 16,000,000) for these types of applications, you may realize better overall performance.
For more information about pixie, see Chapter 8 and pixie(1). For information about the Third Degree tool, see Chapter 7 and third(1).
13.6.2.4 Other Debugging Aids
Other debugging aids include the following:
The compile-time option -check_omp, which embeds additional run-time checking to detect deadlock and race conditions (see Section 13.1).
The omp_set_num_threads and mpc_destroy functions, which let you modify the number of active threads in a program.
You can modify the number of active threads in a program by calling either omp_set_num_threads or mpc_destroy. In either case, any data declared threadprivate and associated with the slave threads is reinitialized to the values at application startup. For example, if the active number of threads is 4 and a call is made to set the number to 2 (via omp_set_num_threads), then any threadprivate data associated with OpenMP threads 1, 2, and 3 will be reset. The threadprivate data associated with the master thread (OpenMP thread 0) is unchanged.
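The following sketch illustrates this reset behavior; the variable name and its initial value of -1 are illustrative:

#include <stdio.h>
#include <omp.h>

int tp_val = -1;
#pragma omp threadprivate(tp_val)

int main(void)
{
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        tp_val = omp_get_thread_num();     /* threads 0..3 set their copies */
    }

    omp_set_num_threads(2);                /* copies for threads 1, 2, 3 are reset */
    #pragma omp parallel
    {
        /* Thread 0 still sees 0; thread 1's copy is back to -1. */
        printf("thread %d: tp_val = %d\n", omp_get_thread_num(), tp_val);
    }
    return 0;
}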
For more information about mpc_destroy, see Section 13.3.2.