The Compaq C compiler supports the development of shared memory parallel applications through its conformance to the OpenMP C Application Program Interface. This API defines a collection of compiler directives, library functions, and environment variables that instruct the compiler, linker, and run-time environment to perform portions of the application in parallel.
The OpenMP directives let you write code that can run concurrently on multiple processors without altering the structure of the source code from that of ordinary ANSI C serial code. Correct use of these directives can greatly improve the elapsed-time performance of user code by allowing that code to execute simultaneously on different processors of a multiprocessor machine. Compiling the same source code, but ignoring the parallel directives, produces a serial C program that performs the same function as the OpenMP compilation.
The OpenMP C and C++ Application Programming Interface specification is available on the Internet from:
http://www.openmp.org/mp-documents/cspec.pdf
http://www.openmp.org/mp-documents/cspec.ps
This chapter addresses the following topics:
Compilation options (Section 13.1)
Environment variables (Section 13.2)
Run-time performance tuning (Section 13.3)
Common programming problems (Section 13.4)
Implementation-specific behavior (Section 13.5)
Debugging (Section 13.6)
13.1 Compilation Options

The following options on the cc command line support parallel processing:

-mp
Causes the compiler to recognize both OpenMP manual decomposition pragmas and old-style manual decomposition directives. Forces libots3 to be included in the link. (Old-style manual decomposition directives are described in Appendix D.)
-omp
Causes the compiler to recognize only OpenMP manual decomposition pragmas and to ignore old-style manual decomposition directives. (Note that the -mp and -omp switches are the same except for their treatment of old-style manual decomposition directives; -mp recognizes the old-style directives and -omp does not.)
-granularity size
Controls the size of shared data in memory that can be safely accessed from different threads. Valid values for size are byte, longword, and quadword:

byte
Requests that all data of one byte or greater can be accessed from different threads sharing data in memory. This option will slow run-time performance.

longword
Ensures that naturally aligned data of four bytes or greater can be accessed safely from different threads sharing access to that data in memory. Accessing data items of three bytes or less and unaligned data may result in data items written from multiple threads being inconsistently updated.

quadword
Ensures that naturally aligned data of eight bytes can be accessed safely from different threads sharing data in memory. Accessing data items of seven bytes or less and unaligned data may result in data items written from multiple threads being inconsistently updated. This is the default.
-check_omp
Enables run-time checking of certain OpenMP constructs. This includes run-time detection of invalid nesting and other invalid OpenMP cases. When invalid nesting is discovered at run time and this switch is set, the executable will fail with a Trace/BPT trap. If this switch is not set and invalid nesting is discovered, the behavior is indeterminate (for example, an executable may hang).
The compiler detects the following invalid nesting conditions:

Entering a for, single, or sections directive if already in a work-sharing construct, a critical section, or a master directive

Executing a barrier directive if already in a work-sharing construct, a critical section, or a master directive

Executing a master directive if already in a work-sharing construct

Executing an ordered directive if already in a critical section

Executing an ordered directive unless already in an ordered for

The default is disabled run-time checking.
13.2 Environment Variables

In addition to the environment variables outlined in the OpenMP specification, the following environment variables are recognized by the compiler and the run-time system:
MP_THREAD_COUNT
Specifies how many threads are to be created by the run-time system. The default is the number of processors available to your process. The OMP_NUM_THREADS environment variable takes precedence over this variable.

MP_STACK_SIZE
Specifies how many bytes of stack space are to be allocated by the run-time system for each thread. If you specify zero, the run-time system uses the default, which is very small. Therefore, if a program declares any large arrays to be private, specify a value large enough to allocate them (see the sketch following this list). If you do not use this environment variable, the run-time system allocates 5 MB.

MP_SPIN_COUNT
Specifies how many times the run-time system can spin while waiting for a condition to become true. The default is 16,000,000, which is approximately one second of CPU time.

MP_YIELD_COUNT
Specifies how many times the run-time system can alternate between calling sched_yield and testing the condition before going to sleep by waiting for a thread condition variable. The default is 10.
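The following sketch illustrates when MP_STACK_SIZE matters; the array name, its size, and the suggested setting are illustrative, not taken from the manual. Each thread receives its own copy of the private array on its stack, so the default 5 MB stack may be too small:

/* Sketch only: each thread gets a private copy of "scratch" on its own
 * stack.  If the default 5 MB per-thread stack is too small, set
 * MP_STACK_SIZE (for example, to 20000000) before running the program. */
#include <omp.h>

#define N 1000000                       /* about 8 MB of doubles */

double scratch[N];

void fill(void)
{
    #pragma omp parallel private(scratch)
    {
        int i;                          /* block-scope, so private per thread */

        for (i = 0; i < N; i++)
            scratch[i] = (double) omp_get_thread_num();
    }
}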
13.3 Run-Time Performance Tuning
The OpenMP specification provides a variety of methods for distributing work to the available threads within a parallel for construct. The following sections describe these methods.
13.3.1 Schedule Type and Chunksize Settings
The choice of settings for the schedule type and the chunksize can affect the ultimate performance of the resulting parallelized application, either positively or negatively. Choosing inappropriate settings for the schedule type and the chunksize can degrade the performance of the parallelized application to the point where it performs as badly as, or worse than, its serial counterpart.
The general guidelines are as follows:
Smaller chunksize values generally perform better than larger ones. The chunksize value should be less than or equal to the value derived by dividing the number of iterations by the number of available threads.
The behavior of the dynamic and guided schedule types makes them better suited for target machines with a variety of workloads other than the parallelized application. These types assign iterations to threads as they become available; if a processor (or processors) becomes tied up with other applications, the available threads will pick up the next iterations.
Although the runtime schedule type does facilitate tuning of the schedule type at run time, it results in a minor performance penalty in run-time overhead.
An effective means of determining appropriate settings for schedule and chunksize is to set the schedule to runtime and experiment with various schedule and chunksize pairs through the OMP_SCHEDULE environment variable. After the exercise, explicitly set the schedule and chunksize to the values that yielded the best performance.
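For example, the following sketch (the loop body and the do_work function are illustrative, not part of the specification) uses the runtime schedule type, so the schedule and chunksize can be varied through OMP_SCHEDULE, for example guided,8 or static,100, without recompiling:

#include <omp.h>

#define N 10000

extern double do_work(int i);       /* hypothetical per-iteration work */

double results[N];

void run(void)
{
    int i;

    /* The schedule and chunksize are read from OMP_SCHEDULE at run time. */
    #pragma omp parallel for schedule(runtime)
    for (i = 0; i < N; i++)
        results[i] = do_work(i);
}

Once the best pair has been found, the schedule(runtime) clause can be replaced with, for example, schedule(guided,8).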
Note that the schedule and chunksize settings are only two of the many factors that can affect the performance of your application. Some of the other areas that can affect performance include:
Availability of system resources: CPUs on the target machine spending time processing other applications are not available to the parallelized application.
Structure of parallelized code: Threads of a parallelized region that perform disproportionate amounts of work.
Use of implicit and explicit barriers: Parallelized regions that force synchronization of all threads at these explicit or implicit points may cause the application to suspend while waiting for a thread (or threads).
Use of critical sections versus atomic statements: Using critical sections incurs more overhead than using atomic updates (see the sketch following this list).
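As a rough sketch (not taken from the specification), the two functions below protect the same shared update; the atomic form is restricted to a single simple update but typically costs less than the general critical section:

#include <omp.h>

double total = 0.0;

void sum_atomic(const double *a, int n)
{
    int i;

    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        #pragma omp atomic
        total += a[i];              /* lightweight protected update */
    }
}

void sum_critical(const double *a, int n)
{
    int i;

    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        #pragma omp critical
        {
            total += a[i];          /* general mutual exclusion; more overhead */
        }
    }
}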
For more information on schedule types and chunksize settings, see Appendix D of the OpenMP C and C++ Application Programming Interface specification.
13.3.2 Additional Controls

When one of the threads needs to wait for an event caused by some other thread, a three-level process begins:
The thread spins for a number of iterations waiting for the event to occur.
It yields the processor to other threads a number of times, checking for the event to occur.
It posts a request to be awakened and goes to sleep.
When another thread causes the event to occur, it will awaken the sleeping thread.
You may get better performance by tuning the threaded environment with the MP_SPIN_COUNT and MP_YIELD_COUNT environment variables or by using the mpc_destroy routine:
MP_SPIN_COUNT -- If your application is running standalone, the default settings will give good performance. However, if your application needs to share the processors with other applications, it is probably appropriate to reduce MP_SPIN_COUNT. This will make the threads waste less time spinning and give up the processor sooner; the cost is the extra time to put a thread to sleep and re-awaken it. In such a shared environment, an MP_SPIN_COUNT of about 1000 might be a good choice.
mpc_destroy -- If you need to perform operations that are awkward when extra threads are present (for example, fork), the mpc_destroy routine can be useful. It destroys any worker threads created to run parallel regions. Normally, you would call it only when you are not inside a parallel region. (The mpc_destroy routine is defined in the libots3 library.)
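For example, the following sketch destroys the worker threads before calling fork. It assumes mpc_destroy takes no arguments (check the libots3 documentation for the exact prototype), and the helper routine and program path are illustrative:

#include <unistd.h>
#include <sys/types.h>

void mpc_destroy(void);             /* assumed prototype, from libots3 */

void spawn_helper(const char *path)
{
    pid_t pid;

    mpc_destroy();                  /* call only outside a parallel region */

    pid = fork();                   /* no extra threads to inherit now */
    if (pid == 0) {
        execl(path, path, (char *) 0);
        _exit(1);
    }
}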
13.4 Common Programming Problems
The following sections describe some errors that commonly occur in parallelized
programs.
13.4.1 Scoping
The OpenMP parallel construct applies to the structured block that immediately follows it. When more than one statement is to be performed in parallel, make sure that the structured block is contained within curly braces. For example:
#pragma omp parallel
{
   pstatement one
   pstatement two
}
The preceding structured block is quite different from the following, where the OpenMP parallel construct applies to only the first statement:
#pragma omp parallel
   pstatement one
pstatement two
The use of curly braces to explicitly define the scope of the subsequent
block (or blocks) is strongly encouraged.
13.4.2 Deadlock
As with any multithreaded application, programmers must take care to prevent run-time deadlock conditions. Because many OpenMP constructs have implicit barriers at their ends, an application will deadlock if all threads do not actively participate in the construct. These conditions may be more prevalent when parallelism is implemented in the dynamic extents of the application. For example:
worker ()
{
   #pragma omp barrier
}

main ()
{
   #pragma omp parallel sections
   {
      #pragma omp section
         worker();
   }
}
The preceding example results in deadlock (with more than one thread active) because not all threads visit the worker routine and the barrier waits for all threads. The -check_omp option (see Section 13.1) aids in detecting such conditions.

For more information, see the OpenMP C and C++ Application Programming Interface specification for a description of valid and invalid directive nesting.
13.4.3 Threadprivate Storage
The threadprivate directive identifies variables that have file scope but are private to each thread. The values for these variables are maintained if the number of threads remains constant. If you explicitly increase or decrease the number of threads within a program, the impact on the values of threadprivate variables is not defined.
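For example, in the following sketch (the variable name and values are illustrative) the file-scope variable counter keeps a separate value in each thread, and those values persist between the two parallel regions because the thread count does not change:

#include <stdio.h>
#include <omp.h>

int counter;
#pragma omp threadprivate(counter)

void count_twice(void)
{
    #pragma omp parallel
    {
        counter = omp_get_thread_num();    /* each thread sets its own copy */
    }

    #pragma omp parallel
    {
        /* Values persist because the number of threads is unchanged. */
        printf("thread %d still sees %d\n", omp_get_thread_num(), counter);
    }
}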
13.4.4 Using Locks
Using the lock control routines (see the OpenMP C and C++ Application Programming Interface specification) requires that they be called in a specific sequence:
The lock to be associated with the lock variable must first be initialized.
The associated lock is made available to the executing thread.
The executing thread is released from lock ownership.
When finished, the lock must always be disassociated from the lock variable.
Attempting to use the locks outside the above sequence may cause unexpected
behavior, including deadlock conditions.
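The following sketch shows the standard OpenMP lock routines called in the required order (initialize, set, unset, destroy); the shared counter is illustrative:

#include <omp.h>

omp_lock_t lock;
int next = 0;

void count_threads(void)
{
    omp_init_lock(&lock);           /* 1. initialize the lock variable    */

    #pragma omp parallel
    {
        omp_set_lock(&lock);        /* 2. acquire ownership               */
        next++;                     /*    protected update of shared data */
        omp_unset_lock(&lock);      /* 3. release ownership               */
    }

    omp_destroy_lock(&lock);        /* 4. destroy the lock when finished  */
}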
13.5 Implementation-Specific Behavior
The OpenMP specification identifies several features and default values as implementation-specific. This section lists those instances and the implementation chosen by Compaq C.
Whenever a nested parallel region is encountered, a team consisting of one thread is created to execute that region.
OMP_SCHEDULE
The default value is dynamic,1. If an application uses the runtime schedule type but OMP_SCHEDULE is not defined, then this value is used.
OMP_NUM_THREADS
The default value is equal to the number of processors on the machine.
OMP_DYNAMIC
The default value is 0. Note that this implementation does not support dynamic adjustment of the thread count; calling omp_set_dynamic with a nonzero value has no effect on the run-time environment.
When a for or parallel for loop does not contain a schedule clause, a dynamic schedule type is used with the chunksize set to 1.
The flush directive, when encountered, will flush all variables, even if one or more variables are specified in the directive.
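For example, in the following producer/consumer sketch (the variable names and values are illustrative) the directives name only the variables of interest, but with this implementation each flush makes all shared variables consistent:

#include <omp.h>

int data = 0;
int flag = 0;

void handshake(void)
{
    #pragma omp parallel sections shared(data, flag)
    {
        #pragma omp section
        {                           /* producer */
            data = 42;
            #pragma omp flush(data)
            flag = 1;               /* signal that data is ready */
            #pragma omp flush(flag)
        }
        #pragma omp section
        {                           /* consumer */
            while (!flag) {
                #pragma omp flush(flag)
            }
            #pragma omp flush(data)
            /* ... use data here ... */
        }
    }
}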
13.6 Debugging

The following sections provide tips and hints on how to diagnose the behavior of and debug applications that use the OpenMP application programming interface (API).
13.6.1 Background Information Needed for Debugging
The -mp or -omp options cause the compiler to recognize OpenMP directives and to transform specified portions of code into parallel regions. The compiler implements a parallel region by taking the code in the region and putting it into a separate, compiler-created routine. This process is called outlining, because it is the inverse of inlining a routine into source code at the point where the routine is called.
Note
Understanding how the parallel regions are outlined is necessary to effectively use the debugger and other application-analysis tools.
In place of the parallel region, the compiler inserts a call to a run-time
library routine.
The run-time library routine creates the slave threads in
the team (if they were not already created), starts all threads in the team,
and causes them to call the outlined routine.
As threads return from the outlined
routine, they return to the run-time library, which waits for all threads
to finish before the master thread returns to the calling routine.
While the master thread continues nonparallel execution, the slave threads wait, or spin, until either a new parallel region is encountered or the environment-variable-controlled wait time (MP_SPIN_COUNT) is reached. If the wait time expires, the slave threads are put to sleep until the next parallel region is encountered.
The following source code contains a parallel region in which the variable id is private to each thread. The code preceding the parallel region explicitly sets the number of threads used in the parallel region to 2. The parallel region then obtains the thread number of the executing thread and displays it with a printf statement.
 1
 2  main()
 3  {
 4     int id;
 5     omp_set_num_threads(2);
 6  #  pragma omp parallel private (id)
 7     {
 8        id = omp_get_thread_num();
 9        printf ("Hello World from OpenMP Thread %d\n", id);
10     }
11  }
Using the dis command to disassemble the object module produced from the preceding source code results in the following output:
_ _main_6:                                                      [1]
      0x0:   27bb0001   ldah     gp, 1(t12)
      0x4:   2ffe0000   ldq_u    zero, 0(sp)
      0x8:   23bd8110   lda      gp, -32496(gp)
      0xc:   2ffe0000   ldq_u    zero, 0(sp)
      0x10:  23defff0   lda      sp, -16(sp)
      0x14:  b75e0000   stq      ra, 0(sp)
      0x18:  a2310020   ldl      a1, 32(a1)
      0x1c:  f620000e   bne      a1, 0x58
      0x20:  a77d8038   ldq      t12, -32712(gp)
      0x24:  6b5b4000   jsr      ra, (t12), omp_get_thread_num
      0x28:  27ba0001   ldah     gp, 1(ra)
      0x2c:  47e00411   bis      zero, v0, a1
      0x30:  23bd80e8   lda      gp, -32536(gp)
      0x34:  a77d8028   ldq      t12, -32728(gp)
      0x38:  a61d8030   ldq      a0, -32720(gp)
      0x3c:  6b5b4000   jsr      ra, (t12), printf
      0x40:  27ba0001   ldah     gp, 1(ra)
      0x44:  23bd80d0   lda      gp, -32560(gp)
      0x48:  a75e0000   ldq      ra, 0(sp)
      0x4c:  63ff0000   trapb
      0x50:  23de0010   lda      sp, 16(sp)
      0x54:  6bfa8001   ret      zero, (ra), 1
      0x58:  221ffff4   lda      a0, -12(zero)
      0x5c:  000000aa   call_pal gentrap
      0x60:  c3ffffef   br       zero, 0x20
      0x64:  2ffe0000   ldq_u    zero, 0(sp)
      0x68:  2ffe0000   ldq_u    zero, 0(sp)
      0x6c:  2ffe0000   ldq_u    zero, 0(sp)
main:
      0x70:  27bb0001   ldah     gp, 1(t12)
      0x74:  2ffe0000   ldq_u    zero, 0(sp)
      0x78:  23bd80a0   lda      gp, -32608(gp)
      0x7c:  2ffe0000   ldq_u    zero, 0(sp)
      0x80:  a77d8020   ldq      t12, -32736(gp)
      0x84:  23defff0   lda      sp, -16(sp)
      0x88:  b75e0000   stq      ra, 0(sp)
      0x8c:  47e05410   bis      zero, 0x2, a0
      0x90:  6b5b4000   jsr      ra, (t12), omp_set_num_threads
      0x94:  27ba0001   ldah     gp, 1(ra)
      0x98:  47fe0411   bis      zero, sp, a1
      0x9c:  2ffe0000   ldq_u    zero, 0(sp)
      0xa0:  23bd807c   lda      gp, -32644(gp)
      0xa4:  47ff0412   bis      zero, zero, a2
      0xa8:  a77d8010   ldq      t12, -32752(gp)
      0xac:  a61d8018   ldq      a0, -32744(gp)
      0xb0:  6b5b4000   jsr      ra, (t12), _OtsEnterParallelOpenMP   [2]
      0xb4:  27ba0001   ldah     gp, 1(ra)
      0xb8:  a75e0000   ldq      ra, 0(sp)
      0xbc:  2ffe0000   ldq_u    zero, 0(sp)
      0xc0:  23bd805c   lda      gp, -32676(gp)
      0xc4:  47ff0400   bis      zero, zero, v0
      0xc8:  23de0010   lda      sp, 16(sp)
      0xcc:  6bfa8001   ret      zero, (ra), 1
[1] _ _main_6 is the outlined routine created by the compiler for the parallel region beginning in routine main at listing line 6. The format for naming the compiler-generated outlined routines is as follows:

_ _original-routine-name_listing-line-number
[2] The call to _OtsEnterParallelOpenMP is inserted by the compiler to coordinate the thread creation and execution for the parallel region. Run-time control remains within _OtsEnterParallelOpenMP until all threads have completed the parallel region.
The principal tool for debugging OpenMP applications is the Ladebug debugger. Other tools include Visual Threads, the Atom tools pixie and third, and the OpenMP tool ompc.
13.6.2.1 Ladebug
This section describes how to use the Ladebug debugger with OpenMP applications. It explains unique considerations for an OpenMP application over a traditional, multithreaded application. It uses the example program in Section 13.6.1 to demonstrate the concepts of debugging an OpenMP application. For more complete information on debugging multithreaded programs, see the Ladebug Debugger Manual.
Because OpenMP applications are multithreaded, they can generally be debugged using the same strategies as regular multithreaded programs. There are, however, a few special considerations:
As with optimized code, the compiler alters the source module to enable OpenMP support. Thus, the source module shown in the debugger will not reflect the actual execution of the program. For example, the generated routines from the outlining process performed by the compiler will not be visible as distinct routines. Prior to a debugging session, an output listing or object module disassembly will provide the names of these routines. These routines can be analyzed within a Ladebug session in the same way as any normal routine.
The OpenMP standard defines thread numbers beginning with thread 0 (the master thread). Ladebug does not interpret OpenMP thread numbers; it interprets their pthreads equivalent, whose numbering begins with thread 1.
The call stacks for OpenMP slave threads originate at a pthreads library routine called thdBase and proceed through a libots3 routine called slave_main.
Variables that are private to a parallel region are private to each thread. Variables that are explicitly private (qualified by firstprivate, lastprivate, private, or reduction) have different memory locations for each thread.
To debug a parallel region, you can set a breakpoint at the outlined routine name. The following example depicts starting a Ladebug session, setting a breakpoint in the parallel region, and continuing execution. The user commands are described in footnotes.
> ladebug example                                               [1]
Welcome to the Ladebug Debugger Version 4.0-48
------------------
object file name: example
Reading symbolic information ...done
(ladebug) stop in _ _main_6                                     [2]
[#1: stop in void _ _main_6(int, int) ]
(ladebug) run                                                   [3]
[1] stopped at [void _ _main_6(int, int):6 0x1200014e0]
      6 #      pragma omp parallel private (id)
(ladebug) thread                                                [4]
Thread Name              State        Substate    Policy       Pri
------ ----------------- ------------ ----------- ------------ ---
>*>  1 default thread    running                  SCHED_OTHER  19
(ladebug) cont                                                  [5]
Hello World from OpenMP Thread 0
[1] stopped at [void _ _main_6(int, int):6 0x1200014e0]
      6 #      pragma omp parallel private (id)
(ladebug) thread                                                [6]
Thread Name              State        Substate    Policy       Pri
------ ----------------- ------------ ----------- ------------ ---
>*   2 <anonymous>       running                  SCHED_OTHER  19
(ladebug) cont                                                  [7]
Hello World from OpenMP Thread 1
Process has exited with status 0
[1] Start a Ladebug session with the example application.

[2] Create a breakpoint to stop at the start of the outlined routine _ _main_6.

[3] Start the program. Note that control stops at the beginning of _ _main_6.

[4] Show which thread is actively executing the parallel region (pthread 1, OpenMP thread 0, in this example).

[5] Continue from this point to allow the parallel region for OpenMP thread 0 to complete and print the Hello World message with the proper OpenMP thread number before the breakpoint is hit again.

[6] Show the next thread that is actively executing the parallel region (pthread 2, OpenMP thread 1).

[7] Continue from this point to print the next message and complete the execution of the program.
The following example shows how to set a breakpoint at the beginning of the outlined routine when pthread 2 (OpenMP thread 1) begins execution of the parallel region.
> ladebug example
Welcome to the Ladebug Debugger Version 4.0-48
------------------
object file name: example
Reading symbolic information ...done
(ladebug) stop thread 2 in _ _main_6                            [1]
[#1: stop thread (2) in void _ _main_6(int, int) ]
(ladebug) r
Hello World from OpenMP Thread 0
[1] stopped at [void _ _main_6(int, int):6 0x1200014e0]
      6 #      pragma omp parallel private (id)
(ladebug) thread
Thread Name              State        Substate    Policy       Pri
------ ----------------- ------------ ----------- ------------ ---
>*   2 <anonymous>       running                  SCHED_OTHER  19
(ladebug) c
Hello World from OpenMP Thread 1
Process has exited with status 0
[1] Stop OpenMP thread 1 (pthread 2) when it encounters the start of the parallel region.
Debugging the OpenMP combined work-sharing constructs (the for and sections directives) is analogous to the process shown in the preceding examples.
NOTE
The Ladebug debugger does not yet fully support OpenMP debugging. Variables that are declared as threadprivate are not recognized by Ladebug and cannot be viewed.
13.6.2.2 Visual Threads

Programs instrumented with OpenMP can be monitored with the Compaq Visual Threads (dxthreads) product, which is on the Associated Products Volume 1 CD-ROM. For details, see the Visual Threads online help.
13.6.2.3 Atom and OpenMP Tools
OpenMP applications can be instrumented using ompc, a special tool created for monitoring OpenMP applications, and the Atom-based tools pixie (for profiling an executable) and Third Degree (third, for monitoring memory access and potential leaks).
The ompc tool captures the pertinent environment variable settings and maintains counts of calls to the OpenMP-related run-time library routines. It generates warnings and error messages to call attention to situations that may not be what the developer intended. Finally, based on the settings of environment variables, it will trace all calls to these run-time library routines and report, by OpenMP thread number, the sequence in which they were called. See ompc(5) for more information.
The Atom-based pixie tool can be used to detect inefficient thread usage in the application. As described in Section 13.6.1, slave threads will wait, or spin, until a new parallel region is encountered or MP_SPIN_COUNT expires. If an application experiences long delays between parallel regions, the threads will spin until they are put to sleep. By instrumenting the application with pixie, you can see where the application is spending most of its time. If your application is spending large amounts of compute time in slave_main, this is a good indication that the threads are spending too much time spinning. By reducing MP_SPIN_COUNT (the default is 16,000,000) for these types of applications, you may realize better overall performance.
For more information about pixie, see Chapter 8 and pixie(1). For information about the Third Degree tool, see Chapter 7 and third(1).
13.6.2.4 Other Debugging Aids
Other debugging aids include the following:
The compile-time option -check_omp, which embeds additional run-time checking to detect deadlock and race conditions (see Section 13.1).
The omp_set_num_threads and mpc_destroy functions, which let you modify the number of active threads in a program.
You can modify the number of active threads in a program by calling either omp_set_num_threads or mpc_destroy. In either case, any data declared threadprivate and associated with the slave threads is reinitialized to the values at application startup. For example, if the active number of threads is 4 and a call is made to set the number to 2 (via omp_set_num_threads), then any threadprivate data associated with OpenMP threads 1, 2, and 3 will be reset. The threadprivate data associated with the master thread (OpenMP thread 0) is unchanged.
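The following sketch illustrates this reset behavior; the variable name and its initial value of -1 are illustrative:

#include <stdio.h>
#include <omp.h>

int tp_val = -1;
#pragma omp threadprivate(tp_val)

int main(void)
{
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        tp_val = omp_get_thread_num();     /* threads 0..3 set their copies */
    }

    omp_set_num_threads(2);                /* copies for threads 1, 2, 3 are reset */
    #pragma omp parallel
    {
        /* Thread 0 still sees 0; thread 1's copy is back to -1. */
        printf("thread %d: tp_val = %d\n", omp_get_thread_num(), tp_val);
    }
    return 0;
}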
For more information about mpc_destroy, see Section 13.3.2.