Let us consider the following program example.

• Fortran
```
real function asum(a, n, a0)
real a(10), a0, asum

asum = a0
do i=1, n
asum = asum+a(i)
enddo
return
end
```
• C
```
double asum(double a[], int n, double a0){
double asum;
int i;

asum = a0;
for(i=0; i<n; i++)
asum = asum+a[i];

return asum;
}
```

In the loop above, under sequential execution the value of the variable asum is carried from one iteration to the next. Therefore, the loop directive described so far cannot, on its own, be used to parallelize this loop. However, if we recognize that this loop is a reduction with asum as the reduction variable, we can parallelize it by adding the reduction clause to the loop directive, as follows.

• Fortran
```
real function asum(a, n, a0)
!$xmp nodes p(2)
!$xmp template t(10)
!$xmp distribute t(cyclic) onto p
real a(10), a0, asum
!$xmp align a(i) with t(i)

asum = a0
!$xmp loop on t(i) reduction(+:asum)
do i=1, n
asum = asum+a(i)
enddo
return
end
```
• C
```
double a[10];
#pragma xmp nodes p(2)
#pragma xmp template t(0:9)
#pragma xmp distribute t(cyclic) onto p
#pragma xmp align a[i] with t(i)
...

double asum(int n, double a0){
double asum;
int i;

asum = a0;
#pragma xmp loop on t(i) reduction(+:asum)
for(i=0; i<n; i++)
asum = asum+a[i];

return asum;
}
```

The reduction clause specifies the reduction variable together with the reduction operation to be performed on it. In the example above, the summation operation is specified, expressing that the reduction computes the total sum shared across the nodes. The program adds the elements of a to asum one after another; as long as this sequential order must be preserved, the loop cannot be parallelized. The reduction clause declares that the order of accumulation may be changed, which allows the loop to be parallelized.

The following program shows parallel processing on two processors. The loop can be parallelized because the iterations on line 4 can be divided in accordance with the (cyclic) distribution of the array a. The code generated for the reduction variable is likely to vary somewhat between compilers, but will generally look like the code shown in bold face type. tmp is a variable that is generated automatically by the compiler. The execution sequence is modified as follows:

1. The current value of the reduction variable is saved, and the variable is then initialized to zero (lines 2 and 3).
2. Within the parallelized loop, each processor tallies only its local data.
3. The partial sums are combined across the processors (line 7).
4. The saved value is added back, and the loop terminates (line 8).

the loop is thereby parallelized. (When the reduction variable is a floating-point type, differences in the order of accumulation may cause the result of parallel execution to differ from that of sequential execution.)

The implementation of this kind of reduction is limited to associative operations. The operations that can be used in XcalableMP include addition, multiplication, logical AND and OR, maximum, minimum, and bitwise AND and OR. The reduction variables can be arrays. In such a case, the reduction operation is applied to all of the array elements.