Parallel processing of Compaq C programs is supported in two forms on Tru64 UNIX systems:
OpenMP interface -- a parallel-processing interface defined by the OpenMP Architecture Review Board.
Old-style parallel-processing interface -- a parallel-processing interface developed prior to the OpenMP interface.
This appendix describes the old-style parallel-processing interface, that is, the language features supported before the OpenMP interface was implemented. See Chapter 13 for information about the OpenMP interface.
NOTE
Programmers using the old-style interface should consider converting to the OpenMP interface, an industry standard.
Anyone converting an application to parallel processing or developing a new parallel-processing application should use the OpenMP interface.
Understanding this appendix requires a basic understanding of multiprocessing concepts, such as what a thread is and whether a data access is thread-safe.
The parallel-processing directives use the #pragma preprocessing directive of ANSI C, which is the standard C mechanism for adding implementation-defined behaviors to the language. Because of this, the terms parallel-processing directives (or parallel directives) and parallel-processing pragmas are used somewhat interchangeably in this appendix.
This appendix contains information on the following topics:
The general coding rules that apply to the use of parallel-processing pragmas (Section D.1)
The syntax of the parallel-processing pragmas (Section D.2)
Environment variables that can be used to control certain aspects of thread resource allocation at run time (Section D.3)
D.1 Use of Parallel-Processing Pragmas
This section describes the general coding rules that apply to all parallel-processing
pragmas and provides an overview of how the pragmas are generally used.
D.1.1 General Coding Rules
In many ways, the coding rules for the parallel-processing pragmas follow the rules of most other pragmas in Compaq C. For example, macro substitution is not performed in the pragmas. In other ways, the parallel-processing pragmas are unlike any other pragmas in Compaq C. This is because while other pragmas generally perform such functions as setting a compiler state (for example, message or alignment), these pragmas are statements. For example:
The pragmas must appear inside a function body.
The pragmas affect the program execution (as described in this appendix).
Most pragmas apply to the statement that follows them. If you wish the pragma to apply to more than one statement, you must use a compound statement (that is, the statements must be enclosed in curly braces).
Several of the pragmas can be followed by modifiers that specify additional information. To make using these modifiers easier, each one can appear on a separate line following the parallel-processing pragma, as long as the line containing the modifiers also begins with #pragma followed by the modifier. For example:
#pragma parallel if(test_function()) local(var1, var2, var3)
This example can also be written as:
#pragma parallel
#pragma if(test_function())
#pragma local(var1, var2, var3)
Note that the modifiers themselves cannot be broken over several lines. For example, the earlier code could not be written as:
#pragma parallel
#pragma if(test_function()) local(var1,
#pragma var2, var3)
D.1.2 General Use
The #pragma parallel directive is used to begin a parallel region. The statement that follows the #pragma parallel directive delimits the extent of the parallel region. It is typically either a compound statement containing ordinary C statements (with or without other parallel-processing directives) or another parallel-processing directive (in which case the parallel region consists of that one statement). Within a compound statement delimiting a parallel region, any ordinary C statements not controlled by other parallel-processing directives simply execute on each thread. The C statements within the parallel region that are controlled by other parallel-processing directives execute according to the semantics of those directives.
All other parallel-processing pragmas, except for #pragma critical, must appear lexically inside a parallel region. The most common type of code that appears in a parallel region is a for loop to be executed by the threads. Such a for loop must be preceded by a #pragma pfor. This construct allows different iterations of the for loop to be executed by different threads, which speeds up the program execution. The following example shows the pragmas that might be used to execute a loop in parallel:
#pragma parallel local(a)
#pragma pfor iterate(a = 0 ; 1000 ; 1)
    for(a = 0 ; a < 1000 ; a++)
    {
        <loop code>
    }
A loop that executes in parallel must obey certain properties. These include:
The index variable is not modified by the loop except by the third expression of the for statement. Further, that expression must always adjust the index variable by the same amount.
Each iteration of the loop must be independent. That is, the computations performed by one iteration of the loop must not depend on the results of another iteration.
The number of iterations of the loop is known before the loop starts.
The programmer is responsible for verifying that the parallel loops obey these restrictions.
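For instance, the following sketch (the arrays x and y, the index i, and the bound n are hypothetical) contrasts a loop that satisfies these restrictions with one that violates the independence requirement:

    /* Safe to parallelize: each iteration writes only its own element. */
    for (i = 0 ; i < n ; i++)
        x[i] = y[i] + 1.0;

    /* Not safe: iteration i reads the element written by iteration i-1. */
    for (i = 1 ; i < n ; i++)
        x[i] = x[i-1] + y[i];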
Another use of parallel processing is to have several different blocks of code run in parallel. The #pragma psection and #pragma section directives are used for this purpose. The following code shows how these directives might be used:
#pragma parallel
#pragma psection
{
#pragma section
    {
        <code block>
    }
#pragma section
    {
        <code block>
    }
#pragma section
    {
        <code block>
    }
}
Once again, certain restrictions apply to the code block. For example, one code block must not rely on computations performed in other code blocks.
The final type of code that can appear in a parallel region is serial code. Serial code is neither within a #pragma pfor nor a #pragma psection. In this case, the same code will be executed by all of the threads created to execute the parallel region. While this may seem wasteful, it is often desirable to place serial code between two #pragma pfor loops or #pragma psection blocks. Although the serial code will be executed by all of the threads, this construct is more efficient than closing the parallel region after the first pfor or psection and opening another before the second one. This is due to the run-time overhead associated with creating and closing a parallel region.
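As an illustrative sketch (the array names and bounds are hypothetical), serial code can sit between two parallel loops inside one region; here the serial statement is harmless because t is local to each thread:

#pragma parallel local(i, t) shared(x, y)
{
#pragma pfor iterate(i = 0 ; 1000 ; 1)
    for (i = 0 ; i < 1000 ; i++)
        x[i] = x[i] + 1.0;

    t = 0.5;    /* serial code: every thread computes its own local t */

#pragma pfor iterate(i = 0 ; 1000 ; 1)
    for (i = 0 ; i < 1000 ; i++)
        y[i] = y[i] * t;
}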
Be careful when placing serial code in a parallel region. Note that the following statements could produce unexpected results:
a++;
b++;
Unexpected results may occur because all threads will execute the statements, causing the variables a and b to be incremented some number of times. To avoid this problem, enclose the serial code in a #pragma one processor block. For example:

#pragma one processor
{
    a++;
    b++;
}
Note that no thread can proceed past the code until it has been executed.
D.1.3 Nesting Parallel Directives
Nested parallel regions are not currently supported in Compaq C. If a parallel region lexically contains another parallel region, the compiler will issue an error. However, if a routine executing inside a parallel region calls another routine that then tries to enter a parallel region, this second parallel region will execute serially and no error will be reported.
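For example, the compiler would reject a lexically nested region such as the following sketch:

#pragma parallel
{
#pragma parallel   /* error: parallel regions cannot be lexically nested */
    {
        <code block>
    }
}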
With the exception of #pragma parallel, it is invalid for most parallel constructs to execute other parallel constructs. For example, when running the code in a #pragma pfor, #pragma one processor, #pragma section, or #pragma critical code block, the only other parallel-processing construct that can execute is a #pragma critical. In the case where one parallel-processing pragma is lexically nested within another, the compiler will issue an error for all illegal cases. However, if code running in a code block transfers to a routine that then executes one of these directives, the behavior is unpredictable. (As noted earlier in this appendix, all parallel-processing pragmas, except for #pragma critical, must appear lexically inside a #pragma parallel region.)
D.2 Parallel-Processing Pragma Syntax
This section describes the syntax of each of the parallel-processing pragmas.
The following parallel-processing pragmas are supported by the old-style parallel-processing interface:

#pragma parallel -- Denotes a parallel region of code (Section D.2.1).

#pragma pfor -- Marks a for loop that is to be run in parallel (Section D.2.2).

#pragma psection -- Begins a number of code sections, each of which is to be run in parallel with the others (Section D.2.3).

#pragma section -- Specifies each code section within a psection area (Section D.2.3).

#pragma critical -- Protects access to a critical area of code so that only one thread at a time can execute it (Section D.2.4).

#pragma one processor -- Marks a section of code that is to be executed by only one thread (Section D.2.5).

#pragma synchronize -- Stops threads until they all reach this point (Section D.2.6).

#pragma enter gate and #pragma exit gate -- Provide a more complex form of synchronization; no thread is allowed to leave the exit gate until all threads have passed the enter gate (Section D.2.7).
D.2.1 #pragma parallel
The #pragma parallel directive marks a parallel region of code. The syntax of this pragma is:
#pragma parallel [parallel-modifiers...] statement-or-code-block
The parallel-modifiers for #pragma parallel are:
local(variable-list)
byvalue(variable-list)
shared(variable-list)
if (expression) [[no]ifinline]
numthreads(numthreads-option)
local, byvalue, and shared modifiers

The variable-list argument to the local, byvalue, and shared modifiers is a comma-separated list of variables that have already been declared in the program. You can specify any number of local, byvalue, and shared modifiers. This is useful if one of the modifiers requires a large number of variables.

The variables following the shared and byvalue modifiers will be shared by each thread. The variables following the local modifier will be unique to each thread. Note that the values of variables outside the region are not passed into the region. Inside the region, the value of a variable named in the local modifier is undefined. Putting a variable in the local list has the same effect as declaring that variable inside the parallel region.

These modifiers are provided only for compatibility with other C compilers. In Compaq C, all visible variables declared outside the parallel region can be accessed and modified within the parallel region, and are shared by all threads (unless the variable is specified in the local modifier). For example:
int a,b,c;
#pragma parallel local(a) shared(b) byvalue(c)
{
    <code that references a, b, and c>
}
This is the same as:
int a,b,c;
#pragma parallel
{
    int a;
    <code that references a, b, and c>
}
if modifier

The expression following the if modifier specifies a condition that determines whether the code in the parallel region will actually be executed in parallel by many threads or serially by a single thread. If the condition is nonzero, the code will be run in parallel. This modifier can be used to delay, until run time, the decision of whether to parallelize.
Note that running a small amount of code in parallel may take more time than running the code serially. This is due to the overhead involved in creating and destroying the threads needed to run the code in parallel.
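As a sketch (the threshold, the array x, and the bound n are hypothetical), the if modifier can make that trade-off at run time:

/* Parallelize only when there is enough work to repay the thread overhead. */
#pragma parallel if(n > 10000) local(i) shared(x, n)
#pragma pfor iterate(i = 0 ; n ; 1)
    for (i = 0 ; i < n ; i++)
        x[i] = 2.0 * x[i];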
noifinline modifier

The noifinline modifier can only be used if the if modifier is present. The default value, ifinline, tells the compiler to generate two versions of the code within the parallel region: one to execute in parallel if the if expression is nonzero, and one to execute serially if the if expression is zero. The noifinline modifier tells the compiler to generate only one form of the code. The noifinline modifier will cause less code to be generated, but the code will execute more slowly for those cases in which the if expression is zero.
numthreads modifier

The numthreads-option is one of:

expr
min=expr1, max=expr2
percent=expr

In all cases, the expressions should evaluate to a positive integer value. The case of numthreads(expr) is equivalent to numthreads(min=0,max=expr).
If a min clause is specified, the code will be run in parallel only if expr1 threads (or more) are available to execute the region. If a max clause is specified, the parallel region will be executed by no more than expr2 threads. If a percent clause is specified, the parallel region will be executed by expr percent of the available threads.
An example of a parallel region is:
#pragma parallel local(a,b) if(func()) numthreads(x)
{
    code
}
The region of code will be executed if func returns a nonzero value. If it is executed in parallel, at most x threads will be used. Inside the parallel region, each thread will have a local copy of the variables a and b. All other variables will be shared by the threads.
D.2.2 #pragma pfor
The #pragma pfor directive marks a loop for parallel execution. A #pragma pfor can only appear lexically inside a parallel region. The syntax of this pragma is:
#pragma pfor iterate(iterate-expressions) [pfor-options]
for-statement
As the syntax shows, the #pragma pfor must be followed by the iterate modifier. The iterate-expressions argument takes a form similar to a for loop:

index-variable = expr1 ; expr2 ; expr3
The index-variable = expr1 expression must match the first expression in the for statement that follows the #pragma pfor. To run correctly, the index-variable must be local to the parallel region.
The expr2 expression specifies the number of times the loop will execute.
The expr3 expression specifies the value to be added to the index variable during each iteration of the loop.
Note that the iterate-expressions are closely related to the expressions that appear in the for statement that follows the #pragma pfor. It is the programmer's responsibility to ensure that the information provided in the iterate-expressions correctly characterizes how the for loop will execute.
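For instance, in the following sketch (the array x and the bounds are hypothetical), the loop body executes 250 times and the index advances by 4 on each iteration, and the iterate-expressions state exactly that:

#pragma parallel local(i) shared(x)
#pragma pfor iterate(i = 0 ; 250 ; 4)
    for (i = 0 ; i < 1000 ; i += 4)
        x[i] = 0.0;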
The pfor-options are:
schedtype(schedule-type)
chunksize(expr)
The schedtype option tells the run-time scheduler how to partition the iterations among the available threads. Valid schedule-type values are:

simple -- The scheduler will partition the iterations evenly among all of the available threads. This is the default.

dynamic -- The scheduler will give each thread the number of iterations specified by the chunksize expression.

interleave -- This is the same as dynamic except that the work is assigned to the threads in an interleaved way.

gss -- The scheduler will give each thread a varied number of iterations. This is like dynamic, but instead of giving each thread a fixed chunksize, the number of iterations will begin with a large number and end with a small number.
The chunksize option is required for a schedtype of either dynamic or interleave. It specifies the number of iterations to assign to a thread at one time.
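As an illustrative sketch (the array and bounds are hypothetical), a dynamically scheduled loop that hands each thread ten iterations at a time might look like this:

#pragma parallel local(i) shared(x)
#pragma pfor iterate(i = 0 ; 1000 ; 1) schedtype(dynamic) chunksize(10)
    for (i = 0 ; i < 1000 ; i++)
        x[i] = x[i] + 1.0;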
D.2.3 #pragma psection and #pragma section
The #pragma psection and #pragma section directives designate sections of code that are to be executed in parallel with each other. These directives can only appear lexically inside a parallel region. The syntax of these pragmas is:
#pragma psection
{
#pragma section
    stmt1
#pragma section
    stmt2
    . . .
#pragma section
    stmtn
}
These pragmas do not have modifiers. The #pragma psection must be followed by a code block enclosed in braces. The code block must consist only of #pragma section directives followed by a statement or a group of statements enclosed in braces. You can specify any number of #pragma section directives within a psection code block.
D.2.4 #pragma critical
The #pragma critical directive designates a section of code that is to be executed by no more than one thread at a time. The syntax of this pragma is:
#pragma critical [lock-option] statement-or-code-block
The lock-option can be one of:

block -- The lock is specific to this critical section. Other threads can execute other critical sections while this critical section is executing, but only one thread can execute this critical section. This option can only be specified for critical sections within a parallel region.

region -- The lock is specific to this parallel region. Other threads that are executing code lexically outside the parallel region can execute other critical sections, but no other critical section within the parallel region can execute. This option can only be specified for critical sections within a parallel region.

global -- The global lock. No other critical section can execute while this one is executing. This is the default value.

expr -- An expression that specifies a user-supplied lock variable. In this case, the expression must designate a 32-bit or 64-bit integer variable.
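A minimal sketch (the shared counter total and the bounds are hypothetical) of a critical section protecting a shared update inside a parallel loop:

#pragma parallel shared(total) local(i)
#pragma pfor iterate(i = 0 ; 1000 ; 1)
    for (i = 0 ; i < 1000 ; i++)
    {
#pragma critical
        total = total + i;    /* only one thread at a time updates total */
    }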
D.2.5 #pragma one processor
The #pragma one processor directive designates a section of code that is to be executed by only one thread. This directive can only appear inside a parallel region. The syntax of this pragma is:

#pragma one processor statement-or-code-block
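As a sketch (the shared flag done is hypothetical), a one-processor block is a natural place for one-time initialization within a region:

#pragma parallel shared(done)
{
#pragma one processor
    done = 0;    /* executed by exactly one thread; the others wait */
    <code executed by every thread>
}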
D.2.6 #pragma synchronize
The #pragma synchronize directive prevents the next statement from being executed until all threads have reached this point. This directive can only appear inside a parallel region. The syntax of this pragma is:

#pragma synchronize
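For example, assuming the second loop must not begin on any thread until the first loop has finished everywhere, a barrier between the two phases might be sketched as follows (the array names and bounds are hypothetical):

#pragma parallel local(i) shared(x, y)
{
#pragma pfor iterate(i = 0 ; 1000 ; 1)
    for (i = 0 ; i < 1000 ; i++)
        x[i] = i;

#pragma synchronize    /* all of x is written before any thread reads it */

#pragma pfor iterate(i = 0 ; 1000 ; 1)
    for (i = 0 ; i < 1000 ; i++)
        y[i] = x[999 - i];
}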
D.2.7 #pragma enter gate and #pragma exit gate
The #pragma enter gate and #pragma exit gate directives allow a more flexible form of synchronization than #pragma synchronize. These directives can only appear inside a parallel region. Each #pragma enter gate in the region must have a matching #pragma exit gate. The syntax of these pragmas is:
#pragma enter gate (name)
#pragma exit gate (name)
The name argument is an identifier that designates each gate. The names of gates are in their own name space; for example, a gate name of foo is distinct from a variable named foo. A gate name need not be declared before it is used.
This type of synchronization operates as follows: no thread can execute the statement after the #pragma exit gate until all threads have passed the matching #pragma enter gate.
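A minimal sketch, with a hypothetical gate named phase1, might look like this:

#pragma parallel local(i) shared(x)
{
#pragma pfor iterate(i = 0 ; 1000 ; 1)
    for (i = 0 ; i < 1000 ; i++)
        x[i] = i;

#pragma enter gate (phase1)
    <code that may run while other threads are still in the loop>
#pragma exit gate (phase1)
    <code that requires every thread to have passed the enter gate>
}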
D.3 Environment Variables
Certain aspects of parallel code execution can be controlled by the values of environment variables in the process when the program is started. The environment variables currently examined at the start of the first parallel execution in the program are as follows:
MP_THREAD_COUNT -- Tells the run-time system how many threads to create. The default is to use the number of processors on the system as the number of threads to create.

MP_CHUNK_SIZE -- Tells the run-time system what chunksize to use if the user either asked for the RUNTIME schedule type or omitted the chunksize when asking for another schedule type that requires a chunksize.

MP_STACK_SIZE -- Tells the run-time system how many bytes of stack space to allocate for each thread when it creates threads. The default is quite small; if you declare any large arrays as local, you need to specify a stack size large enough to hold them.
MP_SPIN_COUNT -- Tells the run-time system how many times to spin while waiting for a condition to become true.

MP_YIELD_COUNT -- Tells the run-time system how many times to alternate between calling sched_yield and testing the condition before really going to sleep by waiting for a Pthread condition variable.
You can set these environment variables to integer values by using the conventions of your command-line shell (for example, setenv MP_THREAD_COUNT 4 in the C shell). If an environment variable is not set, the run-time system chooses a plausible default behavior (which is generally biased toward allocating resources to minimize elapsed time).