Operating System for Manycore-based Supercomputers

Introduction

A manycore architecture is one of the promising approaches to building a post-petascale supercomputing environment. There are two main ways to build a compute node from manycore processors, as shown in Figure 1. In the first, a manycore CPU acts as an accelerator connected to the host CPU via PCI Express; Intel Xeon Phi is an example of this architecture. In the second, the manycore CPU itself serves as the compute node.

Figure 1 Manycore architectures

In either case, a new OS that preserves the Linux API is required to realize an efficient, scalable distributed parallel system.

What Are the Challenges

The challenges for system software in realizing a post-petascale environment with manycore architectures are summarized as follows:

  • Cache-aware system software stack

Because manycores have small memory caches and limited memory bandwidth, the cache footprint of both user and system program execution should be minimized.

  • Scalability

Scalability issues arise from two sources. One is the growth of the internal data structures that manage resources not only for the local node but also for other nodes; a new resource management technique should be designed. The other is so-called OS noise. To eliminate OS noise during application execution, the OS for application execution should be separated from the OS for system services, e.g., a light-weight microkernel running alongside a Linux kernel.

  • Minimum overhead of communication facility

Minimum overhead of communication between cores as well as direct memory access between manycore units is required for strong scaling.

  • Portability

The system software stack should keep existing programs that run on PC clusters portable.

Development Platform

The Information Technology Center has been designing and developing a new scalable and cache-aware system software stack for manycore-based supercomputers in cooperation with RIKEN AICS, Hitachi, NEC, and Fujitsu. The current development platform is based on Intel Xeon Phi as shown in Figure 2.

Figure 2 Development Platform

Operating System for Manycore-based Supercomputers

Figure 3 depicts three types of configurations. Figure 3 (a) is for the development platform that we have designed and developed. A Linux kernel runs on the host CPU while a co-kernel, called McKernel, runs on the manycore unit. Figures 3 (b) and (c) are for the manycore-only node configuration. The manycore units are partitioned into two groups: a Linux kernel runs on one group, and a single McKernel or multiple McKernels run on the other.

(a) Attached
(b) Built-in (single McKernel)
(c) Built-in (multiple McKernel)

Figure 3 An Overview of McKernel
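
The partitioning used in the built-in configurations can be illustrated with a small sketch. It is only a model of the idea, not McKernel or IHK code: the function name partition_cores, the Partition type, the kernel labels, and the near-equal split among McKernel instances are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    kernel: str   # hypothetical label: "linux" or "mckernel-<n>"
    cores: list   # core ids owned exclusively by this kernel


def partition_cores(total_cores, linux_cores, mckernel_count):
    """Split core ids 0..total_cores-1 into one Linux group and
    mckernel_count McKernel groups of near-equal size (assumed policy)."""
    if not 0 < linux_cores < total_cores:
        raise ValueError("Linux needs at least one core and must leave some free")
    remaining = list(range(linux_cores, total_cores))
    if not 0 < mckernel_count <= len(remaining):
        raise ValueError("each McKernel instance needs at least one core")

    parts = [Partition("linux", list(range(linux_cores)))]
    base, extra = divmod(len(remaining), mckernel_count)
    start = 0
    for i in range(mckernel_count):
        # The first `extra` instances take one additional core each.
        size = base + (1 if i < extra else 0)
        parts.append(Partition(f"mckernel-{i}", remaining[start:start + size]))
        start += size
    return parts
```

Calling partition_cores(8, 2, 1) models configuration (b), with one McKernel owning every core that Linux does not; a larger mckernel_count models configuration (c).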

To make a co-kernel that runs alongside Linux portable, hardware-specific functions and a communication facility between the co-kernel and Linux have been designed and implemented as a separate layer. This layer is called IHK (Interface for Heterogeneous Kernel). It consists of three modules. The IHK-Linux driver provides co-processor control functions such as booting, memory copy, and interrupts. IHK-cokernel abstracts the hardware functions of the manycore devices. IHK-IKC provides communication between Linux and the co-kernel.
McKernel is implemented on top of IHK. Applications that run on the Linux kernel run on McKernel without modification, but this does not mean that McKernel implements every Linux API. Instead, McKernel implements only the minimum OS services needed to manage threads and processes, such as clone, fork, synchronization, and signals. When other Linux APIs are called, McKernel delegates those requests to the Linux kernel via IHK. Thus, as shown in Figure 3, the GNU libc library for Linux and other Linux libraries run on McKernel.
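
The delegation scheme can be sketched as a small dispatch model. This is a toy illustration, not McKernel's implementation: the class names, the contents of LOCAL_CALLS (futex and kill stand in for "synchronization" and "signals"), and the two delegated calls are assumptions, and the channel is an ordinary method call rather than a real inter-kernel message.

```python
import os

# Calls the lightweight kernel handles itself. The text names clone, fork,
# synchronization, and signals; this exact set is an assumption.
LOCAL_CALLS = {"clone", "fork", "futex", "kill"}


class LinuxProxy:
    """Stand-in for the Linux side, which services delegated requests."""

    def handle(self, name, args):
        if name == "getpid":
            return os.getpid()
        if name == "getcwd":
            return os.getcwd()
        raise NotImplementedError(name)


class InterKernelChannel:
    """Stand-in for the IHK communication path (hypothetical interface)."""

    def __init__(self, proxy):
        self.proxy = proxy
        self.log = []  # names of the requests that crossed the channel

    def request(self, name, args):
        self.log.append(name)
        return self.proxy.handle(name, args)


class LightweightKernel:
    def __init__(self, channel):
        self.channel = channel
        self.next_tid = 1

    def syscall(self, name, *args):
        # Thread and process management stays local; everything else
        # is forwarded to Linux over the channel.
        if name in LOCAL_CALLS:
            return self._handle_local(name, args)
        return self.channel.request(name, args)

    def _handle_local(self, name, args):
        if name in ("clone", "fork"):
            tid = self.next_tid
            self.next_tid += 1
            return tid
        return 0  # placeholder result for the other local calls
```

In this model an application sees a single syscall entry point and cannot tell which kernel served the request, which is the property that lets unmodified Linux libraries run on the co-kernel.
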
Because McKernel is a light-weight microkernel and is isolated from Linux in terms of memory, core, and interrupt management, it is easy to modify the kernel to test new ideas. For example, the paper [2] explores a new memory management scheme for a hierarchical memory system.

DCFA and MPICH/DCFA

In order to provide a low-latency communication facility, the DCFA user-level communication facility [3][4] has been designed and implemented in the manycore unit. The host CPU initializes the InfiniBand HCA and sets up internal structures for the HCA. The memory regions and the work and completion queues of the HCA are allocated in the memory of the manycore unit instead of the host CPU, so that data in the manycore unit is transferred directly to memory in a remote manycore unit via InfiniBand. That is, a program running on the manycore unit can post send/receive requests to the HCA directly, without assistance from the host CPU. MPICH has been ported to DCFA as MPICH/DCFA, based on the implementation described in [1].
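
The data path described above can be modeled in a few lines. The sketch below is a toy simulation, not the DCFA API or the InfiniBand verbs interface: Node, post_send, poll_completion, and hca_process are hypothetical names, and the "HCA" is a plain function that copies bytes between two in-process buffers.

```python
from collections import deque


class Node:
    """One manycore unit: its memory plus the queues the HCA reads."""

    def __init__(self, name, mem_size):
        self.name = name
        self.memory = bytearray(mem_size)  # memory on the manycore unit
        self.send_queue = deque()          # work queue, kept in that memory
        self.completion_queue = deque()    # completion queue, also local

    def post_send(self, wr_id, offset, length, remote, remote_offset):
        # Posted by the program itself; the host CPU is not on this path.
        if offset < 0 or offset + length > len(self.memory):
            raise ValueError("source range outside local memory")
        if remote_offset < 0 or remote_offset + length > len(remote.memory):
            raise ValueError("destination range outside remote memory")
        self.send_queue.append((wr_id, offset, length, remote, remote_offset))

    def poll_completion(self):
        # Returns the id of a finished request, or None if nothing completed.
        return self.completion_queue.popleft() if self.completion_queue else None


def hca_process(node):
    """Toy HCA: drain the work queue, copy bytes to the remote node's
    memory, and record each finished request in the completion queue."""
    while node.send_queue:
        wr_id, off, length, remote, roff = node.send_queue.popleft()
        remote.memory[roff:roff + length] = node.memory[off:off + length]
        node.completion_queue.append(wr_id)
```

The point of the model is the ownership of the queues: both live in the sending node's own memory, so posting a request and polling for its completion never involve a host-side object.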

Distribution

McKernel, DCFA, and MPICH/DCFA are now open source software. They are distributed by the PC Cluster Consortium.

References

[1] Masamichi Takagi, Yuichi Nakamura, Atsushi Hori, Balazs Gerofi, and Yutaka Ishikawa, “Revisiting Rendezvous Protocols in the Context of RDMA-capable Host Channel Adapters and Many-Core Processors,” EuroMPI2013, 2013.
[2] Balazs Gerofi, Akio Shimada, Atsushi Hori, and Yutaka Ishikawa, “Partially Separated Page Tables for Efficient Operating System Assisted Hierarchical Memory Management on Heterogeneous Architectures,” CCGRID 2013, pp. 360-368, 2013.
[3] Min Si, Yutaka Ishikawa, and Masamichi Takagi, “Direct MPI Library for Intel Xeon Phi co-processors,” The 3rd Workshop on Communication Architecture for Scalable Systems (CASS 2013) in conjunction with IPDPS2013, 2013.
[4] Min Si and Yutaka Ishikawa, “Design of Direct Communication Facility for Manycore-based Accelerators,” to appear at CASS2012 in conjunction with IPDPS2012, 2012.
[5] Taku Shimosawa and Yutaka Ishikawa, “Inter-kernel Communication between Multiple Kernels on Multicore Machines,” IPSJ Transactions on Advanced Computing Systems Vol.2 No.4 (ACS 28), pp. 64-82, 2009.
[6] Taku Shimosawa, Hiroya Matsuba, and Yutaka Ishikawa, “Logical Partitioning without Architectural Supports,” 32nd IEEE Intl. Computer Software and Applications Conference (COMPSAC 2008), pp. 355-364, 2008.
Note that McKernel and IHK originate from HIDOS and AAL, which were designed and implemented by Taku Shimosawa [5][6] when he was a Ph.D. student in Yutaka Ishikawa’s laboratory at the University of Tokyo.

(Nov. 2013)