
 Introduction

XcalableMP is a directive-based language extension that allows users to develop parallel programs for distributed memory systems easily and to tune their performance with minimal and simple notation.

Distributed memory systems, such as PC clusters, are the typical platform for high-performance computing, and most users write programs using MPI. Although MPI is the de-facto standard for parallel programming on distributed memory systems, writing MPI programs is often a cumbersome and complicated process. Although a number of parallel programming languages, including High Performance Fortran (HPF), have been developed for distributed memory architectures, these programming models are no longer commonly used [1][2][3]. On the other hand, OpenMP is widely used on shared memory architectures, including SMP-configured PC clusters and multi-core CPUs. The most important feature of OpenMP from the perspective of programmability is that it enables parallelization with simple directives, which helps users extend their sequential codes relatively easily. However, the target platform of OpenMP is limited to shared memory architectures.

The XcalableMP Application Program Interface (XcalableMP API) is a collection of compiler directives and runtime library routines that can be used to specify distributed-memory parallel programming in C and Fortran programs. This specification provides a model of parallel programming for distributed memory multiprocessor systems; the directives extend the C and Fortran base languages to describe distributed-memory parallel programs, as in OpenMP.

XcalableMP introduces simple, but effective features to describe typical scientific applications using a concept similar to OpenMP. This paradigm features functions for array distribution and work mapping for loops on parallel processes, which are normally executed as MPI processes. These features can be coded using directives similar to OpenMP.

XcalableMP supports typical parallelization based on the data-parallel paradigm and work mapping under a “global view”, and enables parallelization of the original sequential code with minimal modification using simple directives, as in OpenMP. XcalableMP also includes a CAF-like Partitioned Global Address Space (PGAS) feature for “local view” programming. An important design principle of XcalableMP is “performance awareness”. All communication and synchronization actions are performed via directives, unlike in automatic parallelizing compilers. The user should be aware of what each XcalableMP directive does in the execution model on the distributed memory architecture. This is very important for ease of understanding in performance tuning.

Existing Solutions

Existing parallel programming languages or models for distributed memory systems, such as Co-Array Fortran (CAF) or HPF, provide various features by which to describe parallel programs and execute them efficiently. However, these features are often too complicated for most users to write parallel programs. Moreover, the sophisticated implementation makes it difficult for users to optimize their code.

HPF provides many useful directives to describe parallel programs. HPF makes it easier to write parallel programs from serial programs. However, it is not always easy to achieve high performance. In HPF, internode communications are automatically inserted by the compiler. This makes it difficult for the users to efficiently describe communications and optimize performance.

In CAF, each process (image) has its own data, which is referred to collectively as a coarray. In order to access the data of another process, the user must use an extended assignment statement with the process (image) number. Since CAF does not provide global array indices for the entire data space, the user can only use local array indices in communication statements. Programming in CAF is very similar to programming in MPI, in that CAF provides primitive elements for parallel programming. Since all communications are visible to the user, performance tuning can be achieved easily. However, CAF requires a considerable learning effort in order to code a parallel program effectively.

XcalableMP is being designed based on experience gained with HPF, Fujitsu XPF (VPP Fortran), and OpenMPD [4].

XcalableMP Overview

Execution model and nodes directive

The target of XcalableMP is a distributed memory system. Each computation node, which may have several cores sharing main memory, has its own local memory, and each node is connected via a network. Each node can access and modify its local memory directly and can access the memory on the other nodes via communication. However, it is assumed that accessing remote memory is much slower than accessing local memory.

The basic execution model of XcalableMP is a Single Program Multiple Data (SPMD) model on distributed memory. In each node, a program starts from the same main routine. An XcalableMP program begins as a single thread of execution in each node. The set of nodes when starting a program is referred to as the entire set of nodes.

When a thread encounters an XcalableMP directive, synchronization and communication occur between nodes. In other words, no synchronization or communication happens without directives; otherwise, each node redundantly executes the same program on its local memory. By default, the data declared in the program is allocated in each node and is referenced locally by the threads executed in that node.

The nodes directive declares a node array to express a set of nodes. The following directive declares the entire set of nodes as an array of 16 nodes.

#pragma xmp nodes p(16)

A task is a specific instance of executable code and its data environment, executed on a set of nodes. The task executed on the entire set of nodes when a program starts is referred to as the initial task. The initial task can generate subtasks, which are executed on subsets of the nodes, by means of the task construct. The set of nodes executing the same task is referred to as the executing nodes. If no task construct is encountered, then the program is executed as a single task, and its executing nodes are the entire set of nodes.

The task construct is used to execute a block of code on the specified nodes. For example, the following code executes the block only on the master node (specified by 1), similar to the master directive of OpenMP.

#pragma xmp task on 1
  { …  block … }

XcalableMP supports two models of viewing data: the global-view programming model and the local-view programming model. In the local-view programming model, accesses to data on remote nodes are performed explicitly through language extensions for get/put operations that specify the node number of the target node, while references to local data are executed implicitly.

Global-view programming model

The global-view programming model is useful when, starting from the sequential version of the program, the programmer parallelizes the program in the data-parallel model by adding directives incrementally with minimum modifications. Since these directives can be ignored as a comment by the compilers of base languages (C and Fortran), an XcalableMP program derived from a sequential program can preserve the integrity of the original program when it is run sequentially.

The global-view programming model shares a number of major conceptual similarities with HPF. The programmer describes the data distribution of the data shared among the nodes by data distribution directives. In order to specify the data distribution, a template is used as a dummy array distributed on nodes.

#pragma xmp nodes p(4)
#pragma xmp template T(0:15)
#pragma xmp distribute T(block) onto p

Block, cyclic, block-cyclic, and gen-block distributions are supported. In this example, the one-dimensional template T is distributed on four nodes in blocks of the same size.
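
As an illustrative sketch (the keyword spellings cyclic(w) for the block-cyclic format and gblock for the gen-block format are assumptions following common XcalableMP usage), the other distribution formats can be written in the same style:

#pragma xmp distribute T(cyclic) onto p      /* cyclic: elements are dealt to the nodes one at a time in round-robin order */
#pragma xmp distribute T(cyclic(2)) onto p   /* block-cyclic: blocks of 2 elements are dealt out in round-robin order */
int m[4] = {2, 2, 4, 8};
#pragma xmp distribute T(gblock(m)) onto p   /* gen-block: uneven block sizes given by the mapping array m */

Only one distribute directive would apply to a given template; the three forms are listed together only to show the notation.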

A distributed array is declared by aligning the array with the template using an align directive. In the following fragment of code, array A is aligned with the template T, i.e., distributed with the block distribution.

double A[16];
#pragma xmp align A[i] with T(i)

Figure 1 shows the assignment of one-dimensional array A to four nodes. Each node is assigned an actual portion of the entire array to use, as denoted by gray/red in Figure 1.

The loop construct maps iterations of a loop onto the nodes that contain the referenced data. The template is used to specify the mapping of iterations: using the template that describes the data distribution, each iteration is assigned to the node that owns the corresponding data. Note that in XcalableMP the programmer must ensure, by means of appropriate directives, that all data referenced by a computation is available locally. For example, consider the following XcalableMP loop:

#pragma xmp loop on T(i)
for (i = 2; i <= 10; i++) A[i] = ...

Figure 1 shows an example in which the loop actually scans the array elements A[2] to A[10], for the case in which the size of the array is 16.

Figure 1. Data distribution and work sharing

Data synchronization and communication

Global-view communication directives are used to synchronize nodes, maintain the consistency of the shadow area, and move a part or all of the distributed data globally. In XcalableMP, inter-node communication must be described explicitly. The compiler guarantees that communication takes place only if communication is explicitly specified.

The gmove construct is a powerful operation in global-view programming in XcalableMP. The gmove construct copies the data of a distributed array in the global view. The directive is followed by an assignment statement of scalar values or array sections. An assignment between array sections of distributed arrays may require communication between nodes. In XcalableMP, the C language is extended with array section notation in order to support assignment of array objects.

The gmove construct must be executed by nodes in the executing node set. In addition, the value of scalar objects, the index value, and the range value of the array section in the assignment statement must be the same in every node executing this directive. When no option is specified, the copy operation is performed collectively by all nodes in the executing node set. In this case, all elements in both the source array and the target array must be distributed onto the executing node set. For example, the following example executes all-to-all communication in order to perform data copy between arrays that have different distributions.

double X[16][16], Y[16][16];
#pragma xmp align X[i][*] with T(i)
#pragma xmp align Y[*][i] with T(i)

#pragma xmp gmove
  X[:][:] = Y[:][:];  // array section assignment, which executes all-to-all communication

If the right-hand side is owned by one node, then the gmove operation is implemented as a broadcast communication.
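
For instance, a minimal sketch (reusing the template T and node set p declared above; the unaligned scalar b and the array c are introduced here only for illustration):

double c[16], b;
#pragma xmp align c[i] with T(i)   /* c is block-distributed like A above; b is replicated on every node */
#pragma xmp gmove
  b = c[0];                        /* c[0] is owned by a single node, so its value is broadcast to all executing nodes */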

The shadow directive specifies the width of the shadow area, which is used to communicate the neighboring elements of a block of a distributed array. The data stored in the storage area declared by the shadow directive is referred to as a shadow object. The reflect directive assigns the value of a reflection source to the shadow object of a variable having the shadow attribute. The data that is allocated outside the shadow area and represents the same array element as a shadow object is referred to as the reflection source of that shadow object.

As shown in Figure 2, spatial domain decomposition methods typically refer to the neighboring elements of an array. Here, the white region is assigned to each process. If neighboring elements are referenced in order to update an element, then a process requires the right-edge elements of its left neighboring process (and the left-edge elements of its right neighboring process). This region (shown in gray) is referred to as the shadow, and these neighboring shadows must be exchanged before each process refers to these regions. The size of the shadow depends on the algorithm and can be specified as an option.

For example, in Figure 2, the shadow size is 2 because the maximum offset from the loop index i is 2. The shadow directive specifies the size of the shadow area, and the reflect directive describes the shadow (sleeve) exchange before the array access.

#pragma xmp align v[i] with T(i)
#pragma xmp shadow v[2:2]
...
#pragma xmp reflect (v)
#pragma xmp loop on T(i)
for (i = 2; i < N+1; i++){
     u[i] = v[i-2]/2.0 + v[i-1] - v[i]/6.0 + v[i+1] + v[i+2]/2.0;
}

 

Figure 2. Shadow synchronization

It is sometimes useful for each process to have a complete copy of an array: the actual computation results are stored individually into the elements mapped to each process, and all values are exchanged before the next reference to the array. This synchronization is referred to as full-shadow synchronization, and is shown in Figure 3. However, since full-shadow synchronization requires an all-to-all data exchange among all parallel processes, it is much more expensive than shadow synchronization.
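
For instance, a minimal sketch following the [*] full-shadow notation that also appears in the NPB CG example below (whether a single reflect performs the whole exchange is inferred from that example):

double f[16];
#pragma xmp align f[i] with T(i)
#pragma xmp shadow f[*]        /* full shadow: every node keeps storage for the entire array */
...
#pragma xmp reflect (f)        /* all-to-all exchange so that every node sees the complete, up-to-date array */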

 
Figure 3. Full shadow synchronization

When each process computes a part of the same array as the other processes, a global operation must be performed after these partial computations on the array. This type of synchronization, referred to as reduction, can be specified by the reduction clause of the loop directive. Reduction corresponds to the all-reduce operation on an array. After synchronization, all processes share the same reduced value of the array.
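
For instance, a minimal sketch reusing the distributed array A and template T declared above, with the same reduction(+:sum) clause that appears in the Laplace example below:

int i;
double sum = 0.0;
#pragma xmp loop on T(i) reduction(+:sum)
for (i = 0; i < 16; i++)
    sum += A[i];     /* each node sums its local part of A; the partial sums are then combined (all-reduced) */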

For collective communications, barrier, reduction, and broadcast operations are provided by the directives.
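
As a sketch (the stand-alone reduction directive and the bcast form with a from clause are assumptions; only barrier, reduction, and broadcast operations are named above):

#pragma xmp barrier                  /* synchronize the executing nodes */
#pragma xmp reduction (+:sum)        /* all-reduce the replicated variable sum across the executing nodes */
#pragma xmp bcast (niter) from p(1)  /* broadcast the value of niter from node 1 to the other nodes */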

Local-view programming model

The local view is suitable for programs that explicitly describe the algorithm of each node and that reference remote data explicitly (Figure 3). Since MPI is essentially a local-view model, the local-view programming model of XcalableMP has high interoperability with MPI.

XcalableMP adopts coarray notation as a language extension for local-view programming. When Fortran is the base language, most coarray notations are compatible with those of Coarray Fortran (CAF), except that task constructs are used for task parallelism. For example, in Fortran, in order to access the array element A(i) located on computation node N, the expression A(i)[N] is used. If the access is a value reference, then communication to obtain the value takes place. If the access updates the value, then communication to set the new value occurs.

In order to use coarray notation in C, we propose an extension to the language. A coarray is declared by the coarray directive in C.

#pragma xmp coarray array-variable co-array-dimension

For example,

int A[10], B[10];
#pragma xmp coarray [*]: A, B

A coarray object is referenced by an expression of the following forms:

   scalar-variable : [image-index]
   array-section-expression:[image-index]

Array section notation describes part of an array and is adopted from Fortran 90. In C, an array section has the following form:

    array_name '[' [start_index]':'[length] [':' step] ']'...

An array section is built from some subset of the elements of an array object, namely, the elements associated with a selected subset of the index range of the object. The start_index and length specify the first element of the section and the number of elements taken, respectively. Either can be omitted, in which case they default to the first index of the array and the remaining extent, respectively. Thus, A[:] is a section containing the whole of A. If the step is specified, then the elements of the array section are every “step”-th element in the specified range. For example, B[1:4:3] is an array section of size 4 containing every third element of B starting from index 1 (i.e., indices 1, 4, 7, and 10). The ranges specified by start_index, length, and step are collectively referred to as triplets. For multi-dimensional arrays, some dimensions can be subscripted with a normal scalar expression, and some dimensions can be “sectioned” with triplets.

For example,

 A[:] = B[:]:[10];  // copy from B on image 10 to A
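
For instance, a sketch of mixing scalar subscripts with triplets in a multi-dimensional coarray (the arrays M and row and the strided remote access are illustrative assumptions):

double M[8][8], row[4];
#pragma xmp coarray [*]: M, row
...
row[:] = M[2][1:4:2]:[10];   /* get, from image 10, row 2 of M: 4 elements starting at column 1 with step 2 (columns 1, 3, 5, 7) */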

 

Example

Example 1: Laplace solver with a 4-point stencil operation

#pragma xmp nodes p(*)
#pragma xmp template t(1:N)
#pragma xmp distribute t(block) onto p

double u[XSIZE+2][YSIZE+2], uu[XSIZE+2][YSIZE+2];
#pragma xmp align u[i][*] with t(i)
#pragma xmp align uu[i][*] with t(i)
#pragma xmp shadow uu[1:1]

void lap_main()
{
    int x,y,k;
    double sum;

    for(k = 0; k < NITER; k++){
	/* old <- new */
#pragma xmp loop on t(x)
	for(x = 1; x <= XSIZE; x++)
	  for(y = 1; y <= YSIZE; y++)
	    uu[x][y] = u[x][y];
#pragma xmp reflect (uu)
#pragma xmp loop on t(x)
	for(x = 1; x <= XSIZE; x++)
	  for(y = 1; y <= YSIZE; y++)
	    u[x][y] = (uu[x-1][y] + uu[x+1][y] + uu[x][y-1] + uu[x][y+1])/4.0;
    }
    /* check sum */
    sum = 0.0;
#pragma xmp loop on t(x) reduction(+:sum)
    for(x = 1; x <= XSIZE; x++)
	for(y = 1; y <= YSIZE; y++)
	  sum += (uu[x][y]-u[x][y]);

#pragma xmp task on 1
    {
        printf("sum = %g\n", sum);
    }
}

Example 2: Conjugate gradient routine in NPB CG

#pragma xmp nodes p(*)
#pragma xmp template t(0:N-1)
#pragma xmp distribute t(block) onto p

static void conj_grad (
    int colidx[],       /* colidx[1:nzz] */
    int rowstr[],       /* rowstr[1:naa+1] */
    double x[],         /* x[*] */
    double z[],         /* z[*] */
    double a[],         /* a[1:nzz] */
    double p[],         /* p[*] */
    double q[],         /* q[*] */
    double r[],         /* r[*] */
    double w[],         /* w[*] */
    double *rnorm )
#pragma xmp align [i] with t(i) :: x,z,p,q,r,w
#pragma xmp shadow [*] :: x,z,p,q,r,w
{
    static double d, sum, rho, rho0, alpha, beta;
    int i, j, k;
    int cgit, cgitmax = 25;

#pragma xmp loop on t(j)
    for (j = 1; j <= naa+1; j++) {
        q[j] = 0.0;
        z[j] = 0.0;
        r[j] = x[j];
        p[j] = r[j];
        w[j] = 0.0;
    }

    sum = 0.0;
#pragma xmp loop on t(j) reduction(+:sum)
    for (j = 1; j <= lastcol-firstcol+1; j++) {
        sum = sum + r[j]*r[j];
    }
    rho = sum;
    for (cgit = 1; cgit <= cgitmax; cgit++) {
#pragma xmp reflect (p)
#pragma xmp loop on t(j)
        for (j = 1; j <= lastrow-firstrow+1; j++) {
            sum = 0.0;
            for (k = rowstr[j]; k <= rowstr[j+1]-1; k++) {
                sum = sum + a[k]*p[colidx[k]];
            }
            w[j] = sum;
        }
#pragma xmp loop on t(j)
        for (j = 1; j <= lastcol-firstcol+1; j++) {
            q[j] = w[j];
        }
     ...
   }
  ...
}

References

  • [1] Hitoshi Murai, Takuya Araki, Yasuharu Hayashi, Kenji Suehiro and Yoshiki Seo: Implementation and Evaluation of HPF/SX V2, Concurrency and Computation – Practice & Experience, Vol.14, No. 8-9, Wiley (2002), 603-629.
  • [2] Hitoshi Sakagami, Hitoshi Murai, Yoshiki Seo and Mitsuo Yokokawa: 14.9 TFLOPS Three-dimensional Fluid Simulation for Fusion Science with HPF on the Earth Simulator, Proc. SC2002 (2002).
  • [3] Hitoshi Murai and Yasuo Okabe: Pipelined Parallelization in HPF Programs on the Earth Simulator, Proc. 6th Intl. Symposium on HPC (ISHPC-VI), LNCS 4759, Springer (2005), 365-373.
  • [4] Jinpil Lee, Mitsuhisa Sato and Taisuke Boku: OpenMPD: A Directive-Based Data Parallel Language Extensions for Distributed Memory Systems, Proc. First International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2) (2008).