Introduction

The University of Tokyo, the RIKEN Advanced Institute for Computational Science, Hitachi, NEC, and Fujitsu have been investigating an exascale system software stack as part of the feasibility study on advanced and efficient latency-core-based architectures. In this feasibility study, the architecture, system software, execution/programming models, and applications are designed cooperatively. Such co-design of software and architecture is a chicken-and-egg problem: in particular, it is difficult to design software without a concrete architecture. Because our approach is an evolution of the K computer, a homogeneous manycore architecture is assumed for the system software design as a first step. A prototype system is being designed and implemented as a proof of concept, and it is also intended to run on commodity manycore architectures.

An Example of the Target Architecture

There are several architectural parameters that define a compute node built from manycore architectures: whether the cores are homogeneous or heterogeneous, how the memory hierarchy is organized, what kinds of inter-core communication mechanisms are provided, and how cache coherence is maintained. An example of the target architecture is depicted in Figure 1.

Figure 1 An example of the target architecture
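
To make these design parameters concrete, the sketch below simply records them as a node-description structure in C. It is purely illustrative: the type and field names are hypothetical and do not correspond to any interface of the prototype.

    #include <stddef.h>

    /* Hypothetical sketch: the architectural parameters of a compute node
     * discussed above, collected in one structure.  None of these names
     * come from the prototype; they only enumerate the design space. */
    enum core_kind { CORES_HOMOGENEOUS, CORES_HETEROGENEOUS };
    enum coherence { COHERENT_NODE_WIDE, COHERENT_PER_DOMAIN, NON_COHERENT };

    struct node_arch {
        enum core_kind kind;         /* homogeneous or heterogeneous cores */
        int            num_cores;    /* O(100) cores per node is assumed   */
        int            mem_levels;   /* depth of the memory hierarchy      */
        size_t         mem_size[4];  /* capacity of each level, in bytes   */
        enum coherence coherence;    /* how cache coherence is organized   */
        int            has_core_msg; /* dedicated inter-core messaging?    */
        int            has_rdma;     /* remote DMA between nodes?          */
    };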

What are the Challenges?

The challenges of exascale system software are categorized according to architectural characteristics:

  • Limited core capabilities

➢ Reducing cache pollution
As the number of cores increases, the cache capacity available to each core decreases. The footprint of the system software should therefore be as small as possible.
➢ Localizing data
To reduce cache misses, the internal data structures of the system software should be kept as local as possible; a sketch of this idea follows the list.
➢ Managing the memory hierarchy
Exascale architectures are expected to introduce new memory devices to provide enough memory for applications on each node. The system software should exploit such a memory hierarchy while hiding it from applications.

  • O(100) cores per node

➢ Reducing memory and hardware device contention
OS and runtime-system threads running on the cores should share a minimal set of memory and coordinate access to hardware devices, such as the network and secondary storage, so that contention is minimized.
➢ Reducing data movement among local memory areas
On a NUMA architecture, data movement caused by buffering I/O data should be minimized. One approach is to parallelize system functions so that each thread handles the data located in its local memory.
➢ Providing fast communication
Communication between cores with minimal overhead, as well as remote direct memory access between nodes, is required for strong scaling.

  • O(100M+) cores / O(1M+) nodes

➢ Reducing global management information
Information for managing resources should be obtained on demand to minimize memory consumption.
➢ Providing fast communication mechanisms
Low-latency, high-bandwidth communication is needed for collective communications and I/O operations.
➢ Handling a huge number of files, accesses, and stores
An efficient file I/O mechanism for handling a huge number of files must be provided.
➢ Providing fault resilience
A fault-resilience mechanism that cooperates with applications is needed.
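
The localization and contention items above ("Localizing data", "Reducing memory and hardware device contention", and the NUMA item) share a common technique: giving each core its own cache-line-aligned, locally allocated state, so that system-software threads rarely touch shared cache lines or remote memory. The sketch below illustrates the idea with plain C and POSIX threads; it is not code from the prototype, and it relies on the first-touch page placement commonly used on NUMA Linux systems.

    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>

    #define CACHE_LINE 64
    #define NCORES     4          /* small number for the example only */

    /* Per-core state, padded to a cache line so that one core's updates
     * never invalidate another core's line (no false sharing). */
    struct percore_state {
        unsigned long requests_handled;   /* core-local counter            */
        char         *io_buffer;          /* core-local I/O staging buffer */
    } __attribute__((aligned(CACHE_LINE)));

    static struct percore_state state[NCORES];

    static void *service_thread(void *arg)
    {
        int core = (int)(long)arg;

        /* The owning thread touches its buffer first; under a first-touch
         * policy the pages then reside in that thread's local NUMA memory
         * (assuming the thread is pinned to a core on that node). */
        state[core].io_buffer = malloc(1 << 20);
        memset(state[core].io_buffer, 0, 1 << 20);

        state[core].requests_handled++;   /* no shared lines, no contention */
        free(state[core].io_buffer);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NCORES];
        for (long i = 0; i < NCORES; i++)
            pthread_create(&t[i], NULL, service_thread, (void *)i);
        for (int i = 0; i < NCORES; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

Pinning each service thread to its core (e.g., with sched_setaffinity) would complete the picture; that step is omitted here for brevity.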

Candidate OS Organizations

As shown in Figures 2 through 4, there are three main candidates for the OS organization. The first candidate runs the Linux kernel on each node. As shown in Figure 2, two configurations can be considered: a single Linux instance on each node, or multiple Linux instances running on each node.

(a) Single Linux (b) Multiple Linux instances
Figure 2 Linux kernel on a compute node

Another possible organization is a combination of Linux and a light-weight micro kernel (LMK for short). As shown in Figure 3, two configurations can be considered: a single Linux kernel with a single LMK, or a single Linux kernel with multiple LMKs; a sketch of how an LMK might delegate work to Linux follows Figure 3.

(a) Single Linux with a single LMK (b) Single Linux with multiple LMKs
Figure 3 Linux kernel with an LMK on a compute node
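
In the Linux-plus-LMK organizations, the LMK would typically implement only the performance-critical services itself and forward everything else to the Linux kernel. The toy model below sketches that control flow in user-space C; every name in it is invented for illustration, and it is not the delegation mechanism of McKernel or of the prototype.

    #include <stdio.h>

    /* Toy model of system-call delegation in a Linux + LMK organization.
     * A real design would use shared memory and inter-processor interrupts
     * between the two kernels; here a plain function call stands in. */
    struct syscall_msg {
        long number;    /* system call number            */
        long args[6];   /* arguments                     */
        long result;    /* filled in by the "Linux" side */
    };

    /* Stand-in for the Linux-side helper that executes delegated calls. */
    static void offload_to_linux(struct syscall_msg *msg)
    {
        msg->result = -38;           /* pretend Linux returned -ENOSYS */
    }

    static long lmk_syscall(long number, long args[6])
    {
        if (number == 39)            /* e.g. getpid: cheap, node-local */
            return 1234;             /* handled directly in the LMK    */

        /* Everything else is packed into a message and sent to Linux. */
        struct syscall_msg msg = { .number = number };
        for (int i = 0; i < 6; i++)
            msg.args[i] = args[i];
        offload_to_linux(&msg);
        return msg.result;
    }

    int main(void)
    {
        long args[6] = { 0 };
        printf("getpid -> %ld\n", lmk_syscall(39, args));   /* local     */
        printf("open   -> %ld\n", lmk_syscall(2, args));    /* offloaded */
        return 0;
    }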

The last possible organization runs only an LMK on each compute node, with separate Linux servers providing rich OS services to the compute nodes. As shown in Figure 4, two configurations similar to the previous organizations can be considered.

(a) Single LMK (b) Multiple LMKs
Figure 4 LMK on a compute node


Current Status


Figure 5 System Software

Figure 5 depicts the system software stack being designed in the project. A light-weight micro kernel, called McKernel [1], is based on the configuration depicted in Figure 3. On top of Linux and McKernel, a low-level communication layer is being designed to hide the communication hardware and to provide a remote-DMA-based, scalable, low-latency communication library. On top of this library, a parallel file I/O facility, an MPI communication library, and parallel programming languages are implemented. A prototype system has been implemented and is distributed as open-source software.
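
Applications use this stack through standard interfaces. For instance, a minimal MPI program such as the one below should run unchanged; it uses only standard MPI calls, and how the MPI library maps onto the low-level communication layer is internal to the stack.

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal MPI program: each rank reports its identity.  Only standard
     * MPI is used; nothing here is specific to the prototype stack. */
    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        printf("rank %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }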

References

[1] Yutaka Ishikawa, “Operating System for Manycore-based Supercomputers,” SC’13 Brochure of ITC at the University of Tokyo, 2013.