• Using Multidimensional Arrays
• Using Duplicate Arrays
• Selecting the Distribution Dimension

## Using Multidimensional Arrays

We will now consider a program that uses multidimensional arrays. The following example is a subroutine that computes the product of matrices. An [m,l]-size matrix a, an [l,n]-size matrix b, and the sizes l, m, and n are assigned as parameters, and the result is stored in an [m,n]-matrix c.

The align directive for arrays b and c instructs a two-dimensional block distribution. Because there is no align directive for array a, it becomes a duplicate array. Loop directives are also coded for the same template.

• Fortran
```subroutine sub(a, b, c, m, l, n)
integer m, n, l
real a(m,l), b(l,n), c(m,n)

!\$xmp nodes p(2)
!\$xmp template t(n)
!\$xmp distribute t(block) onto p
!\$xmp align b(*, i) with t(i)
!\$xmp align c(*, i) with t(i)

!\$xmp loop on t(j)
do j=1, n
do i=1, m
do k=1, l
c(i,j) = c(i,j)+a(i,k)*b(k,j)
enddo
enddo
enddo
return
end```
• C
```#pragma xmp nodes p(2)
#pragma xmp template t(0:99)  // n = 100
#pragma xmp distribute t(block) onto p
#pragma xmp align b[i][*] with t(i)
#pragma xmp align c[i][*] with t(i)
a?�a?�a?�
void sub(int m, int l, int n){

#pragma xmp loop on t(j)
for(j=0; j<n; j++)
for(i=0; i<m; i++)
for(k=0; k<l;k++)
c[j][i] = c[j][i] + a[k][i] * b[j][k];

}```

The following diagram shows the distribution and array access pattern. In the diagram, the vertical lines cutting through the array show the data distribution in the case of four parallel distributions. Looking from a particular processor’s perspective (the second processor), the read-only array elements are light blue, and the read-write array elements are red. For distributed arrays b and c, the data distribution and the read-write array elements match exactly, so no communication occurs. For the duplicate array a, each node references all the array elements to perform its own calculation. ## Using Duplicate Arrays

Like array a in the preceding example, if the data are read-only, communication can be eliminated because they are duplicate variables. If distribution instructions are not written, they become duplicate variables. The characteristics of duplicate arrays can be summarized as follows:

• Duplicate arrays can be read at high speed because there is no communication required.
• Writing duplicate arrays takes longer than writing distributed arrays. To write to duplicate arrays, it is necessary to either use a reduction execution or to broadcast to all the nodes after writing to a particular node. It is not possible to parallelize, as with writing to a distributed array. The upper limit for the execution speed is the same as that for a sequential program.
• Duplicate arrays use more memory than distributed arrays. The memory used per node does not decrease, even if the number of nodes increases.

In other words, the data are suited for reading. One should therefore note whether the benefit of parallelization is being seen and whether memory is being wasted.

## Selecting the Distribution Dimension

The distribution dimension can be selected for a multidimensional array. For example, for a two-dimensional array, as the figure below shows, selecting the dimension for aligning the template selects the distribution dimension.

• Fortran
```!\$xmp nodes p(4)
real a(100,200),b(100,200)
!\$xmp template t1(100)
!\$xmp distribute t1(block) onto p
!\$xmp align b(i,*) with t1(i)
!\$xmp template t2(200)
!\$xmp distribute t2(block) onto p
!\$xmp align a(*,j) with t2(j)```
• C
```#pragma xmp nodes p(4)
double a, b;
#pragma xmp t1(0:99)
#pragma xmp distribute t1(block) onto p
#pragma xmp align b[*][i] with t1(i)
#pragma xmp template t2(0:199)
#pragma xmp distribute t2(block) onto p
#pragma xmp align a[j][*] with t2(i)``` The following example is the same matrix multiplication program that was presented earlier, except that the output matrix c is a duplicate array. Arrays a and b are distributed on the second and first dimensions, respectively. Moving the k loop, which is parallelized by loop interchange, to the outermost position, we can decrease the number of times communication is performed.

The following points are characteristic of the parallelization of this program.

• For the on clause of the loop directive, the distribution is aligned to the right side of the assignment statement, rather than to the left side. For the access pattern, the data used (light blue portion) will be the area handled by each node, and the defined data (red portion) will be the entire array.
• The array variable c is regarded as a reduction variable, and the reduction operation is performed on the array.
• Fortran
```subroutine sub(a, b, c, m, l, n)
!\$xmp nodes p(2)
real a(m,l), b(l,n), c(m,n)
!\$xmp template t(l)
!\$xmp distribute t(block) onto p
!\$xmp align a(*,i) with t(i)
!\$xmp align b(i,*) with t(i)

!\$xmp loop on t(j) reduction(+:c)
do k=1, l
do j=1, n
do i=1, m
c(i,j) = c(i,j) + a(i,k) * b(k,j)
enddo
enddo
enddo
return
end```
• C
```#pragma xmp nodes p(2)
#pragma xmp template t(0:l-1)
#pragma xmp t(block) onto p
double a[l][m], b[n][l]:, c[n][m];
#pragma xmp a[i][*] with t(i)
#pragma xmp b[*][i] with t(i)
a?�a?�a?�
void sub(int m, int l, int n){

#pragma xmp loop on t(j) reduction(+:c)
for(k=0; k<l;k++)
for(j=0; j<n; j++)
for(i=0; i<m; i++)
c[j][i] = c[j][i] + a[k][i] * b[j][k];

}``` For the same matrix multiplication program, we examined the methods for (1) making array a a duplicate array and (2) making array c a duplicate array. Comparing the two, one may say that method (1) has better performance. This is because, for method (2), there is additional overhead compared to (1), in that the reduction operation is performed to add between nodes for all the array elements of duplicate array c.

However, we cannot decide on the array distribution with this subroutine alone. For other parts of the program, distribution methods like in (2) may be better than the kind in (1). Also, duplicate arrays require a large memory space, so the size of matrices a, b, and c must also be taken into consideration.