Parallel processing of Compaq C programs is supported in two forms on Tru64 UNIX systems:
OpenMP interface -- a parallel-processing interface defined by the OpenMP Architecture Review Board.
Old-style parallel-processing interface -- a parallel-processing interface developed prior to the OpenMP interface.
This appendix describes the old-style parallel-processing interface, that is, the language features supported before the OpenMP interface was implemented. See Chapter 13 for information about the OpenMP interface.
NOTE
Programmers using the old-style interface should consider converting to the OpenMP interface, an industry standard.
Anyone converting an application to parallel processing or developing a new parallel-processing application should use the OpenMP interface.
Understanding this appendix requires a basic understanding of multiprocessing concepts, such as what a thread is and whether a data access is thread-safe.
The parallel-processing directives use the #pragma preprocessing directive of ANSI C, which is the standard C mechanism for adding implementation-defined behaviors to the language. Because of this, the terms parallel-processing directives (or parallel directives) and parallel-processing pragmas are used somewhat interchangeably in this appendix.
This appendix contains information on the following topics:
The general coding rules that apply to the use of parallel-processing pragmas (Section D.1)
The syntax of the parallel-processing pragmas (Section D.2)
Environment variables that can be used to control certain aspects of thread resource allocation at run time (Section D.3)
D.1 Use of Parallel-Processing Pragmas
This section describes the general coding rules that apply to all parallel-processing
pragmas and provides an overview of how the pragmas are generally used.
D.1.1 General Coding Rules
In many ways, the coding rules for the parallel-processing pragmas follow the rules of most other pragmas in Compaq C. For example, macro substitution is not performed in the pragmas. In other ways, the parallel-processing pragmas are unlike any other pragmas in Compaq C. This is because while other pragmas generally perform such functions as setting a compiler state (for example, message or alignment), these pragmas are statements. For example:
The pragmas must appear inside a function body.
The pragmas affect the program execution (as described in this appendix).
Most pragmas apply to the statement that follows them. If you wish the pragma to apply to more than one statement, you must use a compound statement (that is, the statements must be enclosed in curly braces).
Several of the pragmas can be followed by modifiers that specify additional information. To make using these modifiers easier, each one can appear on a separate line following the parallel-processing pragma, as long as the line containing the modifiers also begins with #pragma followed by the modifier. For example:
#pragma parallel if(test_function()) local(var1, var2, var3)
This example can also be written as:
#pragma parallel
#pragma if(test_function())
#pragma local(var1, var2, var3)
Note that the modifiers themselves cannot be broken over several lines. For example, the earlier code could not be written as:
#pragma parallel
#pragma if(test_function()) local(var1,
#pragma var2, var3)
D.1.2 General Use
The #pragma parallel directive is used to begin a parallel region. The statement that follows the #pragma parallel directive delimits the extent of the parallel region. It is typically either a compound statement containing ordinary C statements (with or without other parallel-processing directives) or another parallel-processing directive (in which case the parallel region consists of that one statement). Within a compound statement delimiting a parallel region, any ordinary C statements not controlled by other parallel-processing directives simply execute on each thread. The C statements within the parallel region that are controlled by other parallel-processing directives execute according to the semantics of those directives.
All other parallel-processing pragmas, except for #pragma critical, must appear lexically inside a parallel region. The most common type of code that appears in a parallel region is a for loop to be executed by the threads. Such a for loop must be preceded by a #pragma pfor. This construct allows different iterations of the for loop to be executed by different threads, which speeds up the program execution. The following example shows the pragmas that might be used to execute a loop in parallel:
#pragma parallel local(a)
#pragma pfor iterate(a = 0 ; 1000 ; 1)
    for(a = 0 ; a < 1000 ; a++)
    {
        <loop code>
    }
A loop that executes in parallel must obey certain properties. These include:
The index variable is not modified by the loop except by the third expression of the for statement. Further, that expression must always adjust the index variable by the same amount.
Each iteration of the loop must be independent. That is, the computations performed by one iteration of the loop must not depend on the results of another iteration.
The number of iterations of the loop is known before the loop starts.
The programmer is responsible for verifying that the parallel loops obey these restrictions.
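For instance, the following sketch (the arrays x and y, the index i, and the bound n are hypothetical) contrasts a loop that satisfies these restrictions with one that violates the independence requirement:

    /* Safe to parallelize: each iteration writes only its own element. */
    for (i = 0 ; i < n ; i++)
        x[i] = y[i] + 1.0;

    /* Not safe: iteration i reads the element written by iteration i-1. */
    for (i = 1 ; i < n ; i++)
        x[i] = x[i-1] + y[i];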
Another use of parallel processing is to have several different blocks of code run in parallel. The #pragma psection and #pragma section directives are used for this purpose. The following code shows how these directives might be used:
#pragma parallel
#pragma psection
{
#pragma section
    {
        <code block>
    }
#pragma section
    {
        <code block>
    }
#pragma section
    {
        <code block>
    }
}
Once again, certain restrictions apply to the code block. For example, one code block must not rely on computations performed in other code blocks.
The final type of code that can appear in a parallel region is serial code. Serial code is neither within a #pragma pfor nor a #pragma psection. In this case, the same code will be executed by all of the threads created to execute the parallel region. While this may seem wasteful, it is often desirable to place serial code between two #pragma pfor loops or #pragma psection blocks. Although the serial code will be executed by all of the threads, this construct is more efficient than closing the parallel region after the first pfor or psection and opening another before the second one. This is due to the run-time overhead associated with creating and closing a parallel region.
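As an illustrative sketch (the array names and bounds are hypothetical), serial code can sit between two parallel loops inside one region; here the serial statement is harmless because t is local to each thread:

#pragma parallel local(i, t) shared(x, y)
{
#pragma pfor iterate(i = 0 ; 1000 ; 1)
    for (i = 0 ; i < 1000 ; i++)
        x[i] = x[i] + 1.0;

    t = 0.5;    /* serial code: every thread computes its own local t */

#pragma pfor iterate(i = 0 ; 1000 ; 1)
    for (i = 0 ; i < 1000 ; i++)
        y[i] = y[i] * t;
}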
Be careful when placing serial code in a parallel region. Note that the following statements could produce unexpected results:
a++;
b++;
Unexpected results may occur because all threads will execute the statements, causing the variables a and b to be incremented some number of times. To avoid this problem, enclose the serial code in a #pragma one processor block. For example:

#pragma one processor
{
    a++;
    b++;
}
Note that no thread can proceed past the code until it has been executed.
D.1.3 Nesting Parallel Directives
Nested parallel regions are not currently supported in Compaq C. If a parallel region lexically contains another parallel region, the compiler will issue an error. However, if a routine executing inside a parallel region calls another routine that then tries to enter a parallel region, this second parallel region will execute serially and no error will be reported.
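For example, the compiler would reject a lexically nested region such as the following sketch:

#pragma parallel
{
#pragma parallel   /* error: parallel regions cannot be lexically nested */
    {
        <code block>
    }
}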
With the exception of #pragma parallel, it is invalid for most parallel constructs to execute other parallel constructs. For example, when running the code in a #pragma pfor, #pragma one processor, #pragma section, or #pragma critical code block, the only other parallel-processing construct that can execute is a #pragma critical. In the case where one parallel-processing pragma is lexically nested within another, the compiler will issue an error for all illegal cases. However, if code running in a code block transfers to a routine that then executes one of these directives, the behavior is unpredictable. (As noted earlier in this appendix, all parallel-processing pragmas, except for #pragma critical, must appear lexically inside a #pragma parallel region.)
D.2 Parallel-Processing Pragma Syntax
This section describes the syntax of each of the parallel-processing pragmas.
The following parallel-processing pragmas are supported by the old-style parallel-processing interface:

#pragma parallel -- Denotes a parallel region of code (Section D.2.1).

#pragma pfor -- Marks a for loop that is to be run in parallel (Section D.2.2).

#pragma psection -- Begins a number of code sections, each of which is to be run in parallel with the others (Section D.2.3).

#pragma section -- Specifies each code section within a psection area (Section D.2.3).

#pragma critical -- Protects access to a critical area of code so that only one thread at a time can execute it (Section D.2.4).

#pragma one processor -- Marks a section of code that is to be executed by only one thread (Section D.2.5).

#pragma synchronize -- Stops threads until they all reach this point (Section D.2.6).

#pragma enter gate and #pragma exit gate -- Provide a more complex form of synchronization; no thread is allowed to leave the exit gate until all threads have passed the enter gate (Section D.2.7).
D.2.1 #pragma parallel
The #pragma parallel directive marks a parallel region of code. The syntax of this pragma is:
#pragma parallel [parallel-modifiers...] statement-or-code-block
The parallel-modifiers for #pragma parallel are:
local(variable-list)
byvalue(variable-list)
shared(variable-list)
if (expression) [[no]ifinline]
numthreads(numthreads-option)
local, byvalue, and shared modifiers

The variable-list argument to the local, byvalue, and shared modifiers is a comma-separated list of variables that have already been declared in the program. You can specify any number of local, byvalue, and shared modifiers. This is useful if one of the modifiers requires a large number of variables.

The variables following the shared and byvalue modifiers will be shared by each thread. The variables following the local modifier will be unique to each thread. Note that the values of variables outside the region are not passed into the region. Inside the region, the value of a variable named in the local modifier is undefined. Putting a variable in the local list has the same effect as declaring that variable inside the parallel region.

These modifiers are provided only for compatibility with other C compilers. In Compaq C, all visible variables declared outside the parallel region can be accessed and modified within the parallel region, and are shared by all threads (unless the variable is specified in the local modifier). For example:
int a,b,c;
#pragma parallel local(a) shared(b) byvalue(c)
{
    <code that references a, b, and c>
}
This is the same as:
int a,b,c;
#pragma parallel
{
    int a;
    <code that references a, b, and c>
}
if modifier

The expression following the if modifier specifies a condition that determines whether the code in the parallel region will actually be executed in parallel by many threads or serially by a single thread. If the condition is nonzero, the code will be run in parallel. This modifier can be used to delay, until run time, the decision of whether to parallelize.
Note that running a small amount of code in parallel may take more time than running the code serially. This is due to the overhead involved in creating and destroying the threads needed to run the code in parallel.
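As a sketch (the threshold, the array x, and the bound n are hypothetical), the if modifier can make that trade-off at run time:

/* Parallelize only when there is enough work to repay the thread overhead. */
#pragma parallel if(n > 10000) local(i) shared(x, n)
#pragma pfor iterate(i = 0 ; n ; 1)
    for (i = 0 ; i < n ; i++)
        x[i] = 2.0 * x[i];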
noifinline modifier

The noifinline modifier can only be used if the if modifier is present. The default value, ifinline, tells the compiler to generate two versions of the code within the parallel region: one to execute in parallel if the if expression is nonzero, and one to execute serially if the if expression is zero. The noifinline modifier tells the compiler to generate only one form of the code. The noifinline modifier will cause less code to be generated, but the code will execute more slowly for those cases in which the if expression is zero.
numthreads modifier

The numthreads-option is one of:

expr
min=expr1, max=expr2
percent=expr

In all cases, the expressions should evaluate to a positive integer value. The case of numthreads(expr) is equivalent to numthreads(min=0,max=expr).
If a min clause is specified, the code will be run in parallel only if expr1 threads (or more) are available to execute the region. If a max clause is specified, the parallel region will be executed by no more than expr2 threads. If a percent clause is specified, the parallel region will be executed by expr percent of the available threads.
An example of a parallel region is:
#pragma parallel local(a,b) if(func()) numthreads(x)
{
    code
}
The region of code will be executed if func returns a nonzero value. If it is executed in parallel, at most x threads will be used. Inside the parallel region, each thread will have a local copy of the variables a and b. All other variables will be shared by the threads.
D.2.2 #pragma pfor
The #pragma pfor directive marks a loop for parallel execution. A #pragma pfor can only appear lexically inside a parallel region. The syntax of this pragma is:
#pragma pfor iterate(iterate-expressions) [pfor-options]
for-statement
As the syntax shows, the #pragma pfor must be followed by the iterate modifier. The iterate-expressions argument takes a form similar to a for loop:

index-variable = expr1 ; expr2 ; expr3
The index-variable = expr1 expression must match the first expression in the for statement that follows the #pragma pfor. To run correctly, the index-variable must be local to the parallel region.
The expr2 expression specifies the number of times the loop will execute.
The expr3 expression specifies the value to be added to the index variable during each iteration of the loop.
Note that the iterate-expressions are closely related to the expressions that appear in the for statement that follows the #pragma pfor. It is the programmer's responsibility to ensure that the information provided in the iterate-expressions correctly characterizes how the for loop will execute.
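For instance, in the following sketch (the array x and the bounds are hypothetical), the loop body executes 250 times and the index advances by 4 on each iteration, and the iterate-expressions state exactly that:

#pragma parallel local(i) shared(x)
#pragma pfor iterate(i = 0 ; 250 ; 4)
    for (i = 0 ; i < 1000 ; i += 4)
        x[i] = 0.0;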
The pfor-options are:
schedtype(schedule-type)
chunksize(expr)
The schedtype option tells the run-time scheduler how to partition the iterations among the available threads. Valid schedule-type values are:

simple -- The scheduler will partition the iterations evenly among all of the available threads. This is the default.

dynamic -- The scheduler will give each thread the number of iterations specified by the chunksize expression.

interleave -- This is the same as dynamic except that the work is assigned to the threads in an interleaved way.

gss -- The scheduler will give each thread a varied number of iterations. This is like dynamic, but instead of giving each thread a fixed chunksize, the number of iterations will begin with a large number and end with a small number.
The chunksize option is required for a schedtype of either dynamic or interleave. It specifies the number of iterations to assign to a thread at one time.
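As an illustrative sketch (the array and bounds are hypothetical), a dynamically scheduled loop that hands each thread ten iterations at a time might look like this:

#pragma parallel local(i) shared(x)
#pragma pfor iterate(i = 0 ; 1000 ; 1) schedtype(dynamic) chunksize(10)
    for (i = 0 ; i < 1000 ; i++)
        x[i] = x[i] + 1.0;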
D.2.3 #pragma psection and #pragma section
The #pragma psection and #pragma section directives designate sections of code that are to be executed in parallel with each other. These directives can only appear lexically inside a parallel region. The syntax of these pragmas is:
#pragma psection
{
#pragma section
    stmt1
#pragma section
    stmt2
    . . .
#pragma section
    stmtn
}
These pragmas do not have modifiers. The #pragma psection must be followed by a code block enclosed in braces. The code block must consist only of #pragma section directives followed by a statement or a group of statements enclosed in braces. You can specify any number of #pragma section directives within a psection code block.
D.2.4 #pragma critical
The #pragma critical directive designates a section of code that is to be executed by no more than one thread at a time. The syntax of this pragma is:
#pragma critical [lock-option] statement-or-code-block
The lock-option can be one of:

block -- The lock is specific to this critical section. Other threads can execute other critical sections while this critical section is executing, but only one thread can execute this critical section. This option can only be specified for critical sections within a parallel region.

region -- The lock is specific to this parallel region. Other threads that are executing code lexically outside the parallel region can execute other critical sections, but no other critical section within the parallel region can execute. This option can only be specified for critical sections within a parallel region.

global -- The global lock. No other critical section can execute while this one is executing. This is the default value.

expr -- An expression that specifies a user-supplied lock variable. In this case, the expression must designate a 32-bit or 64-bit integer variable.
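A minimal sketch (the shared counter total and the bounds are hypothetical) of a critical section protecting a shared update inside a parallel loop:

#pragma parallel shared(total) local(i)
#pragma pfor iterate(i = 0 ; 1000 ; 1)
    for (i = 0 ; i < 1000 ; i++)
    {
#pragma critical
        total = total + i;    /* only one thread at a time updates total */
    }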
D.2.5 #pragma one processor
The #pragma one processor directive designates a section of code that is to be executed by only one thread. This directive can only appear inside a parallel region. The syntax of this pragma is:

#pragma one processor statement-or-code-block
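As a sketch (the shared flag done is hypothetical), a one-processor block is a natural place for one-time initialization within a region:

#pragma parallel shared(done)
{
#pragma one processor
    done = 0;    /* executed by exactly one thread; the others wait */
    <code executed by every thread>
}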
D.2.6 #pragma synchronize
The #pragma synchronize directive prevents the next statement from being executed until all threads have reached this point. This directive can only appear inside a parallel region. The syntax of this pragma is:

#pragma synchronize
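For example, assuming the second loop must not begin on any thread until the first loop has finished everywhere, a barrier between the two phases might be sketched as follows (the array names and bounds are hypothetical):

#pragma parallel local(i) shared(x, y)
{
#pragma pfor iterate(i = 0 ; 1000 ; 1)
    for (i = 0 ; i < 1000 ; i++)
        x[i] = i;

#pragma synchronize    /* all of x is written before any thread reads it */

#pragma pfor iterate(i = 0 ; 1000 ; 1)
    for (i = 0 ; i < 1000 ; i++)
        y[i] = x[999 - i];
}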
D.2.7 #pragma enter gate and #pragma exit gate
The #pragma enter gate and #pragma exit gate directives allow a more flexible form of synchronization than #pragma synchronize. These directives can only appear inside a parallel region. Each #pragma enter gate in the region must have a matching #pragma exit gate. The syntax of these pragmas is:
#pragma enter gate (name)
#pragma exit gate (name)
The name argument is an identifier that designates each gate. The names of gates are in their own name space; for example, a gate name of foo is distinct from a variable named foo. A gate name need not be declared before it is used.
This type of synchronization operates as follows: no thread can execute the statement after the #pragma exit gate until all threads have passed the matching #pragma enter gate.
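A minimal sketch, with a hypothetical gate named phase1, might look like this:

#pragma parallel local(i) shared(x)
{
#pragma pfor iterate(i = 0 ; 1000 ; 1)
    for (i = 0 ; i < 1000 ; i++)
        x[i] = i;

#pragma enter gate (phase1)
    <code that may run while other threads are still in the loop>
#pragma exit gate (phase1)
    <code that requires every thread to have passed the enter gate>
}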
D.3 Environment Variables
Certain aspects of parallel code execution can be controlled by the values of environment variables in the process when the program is started. The environment variables currently examined at the start of the first parallel execution in the program are as follows:
MP_THREAD_COUNT -- Tells the run-time system how many threads to create. The default is to use the number of processors on the system as the number of threads to create.

MP_CHUNK_SIZE -- Tells the run-time system what chunksize to use if the user either asked for the RUNTIME schedule type or omitted the chunksize when asking for another schedule type that requires a chunksize.

MP_STACK_SIZE -- Tells the run-time system how many bytes of stack space to allocate for each thread when it creates threads. The default is quite small; if you declare any large arrays as local, you need to specify a stack size large enough to hold them.
MP_SPIN_COUNT -- Tells the run-time system how many times to spin while waiting for a condition to become true.

MP_YIELD_COUNT -- Tells the run-time system how many times to alternate between calling sched_yield and testing the condition before really going to sleep by waiting for a Pthread condition variable.
You can set these environment variables to integer values by using the conventions of your command-line shell (for example, setenv MP_THREAD_COUNT 4 in the C shell). If an environment variable is not set, the run-time system chooses a plausible default behavior (which is generally biased toward allocating resources to minimize elapsed time).