Distributed
Task-based parallelism in Chmy.jl is built around `Threads.@spawn`, with an additional `Worker` construct for efficiently managing the lifespan of tasks. Note that task-based parallelism provides a high-level abstraction of program execution not only for shared-memory architectures on a single device; it can also be extended to hybrid parallelism, combining shared- and distributed-memory parallelism. The `Distributed` module in Chmy.jl allows users to leverage this hybrid parallelism through the power of abstraction.
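For reference, the following is a minimal sketch of task-based parallelism in plain Julia using `Threads.@spawn` (it does not use Chmy.jl's `Worker` construct): the work is split into chunks, one task is spawned per chunk, and the partial results are collected with `fetch`.

```julia
# Minimal sketch: sum a collection in parallel by spawning one task per chunk.
# Plain Julia only; Chmy.jl's Worker construct is not involved here.
function parallel_sum(xs; nchunks = Threads.nthreads())
    chunks = Iterators.partition(xs, cld(length(xs), nchunks))
    tasks  = [Threads.@spawn sum(chunk) for chunk in chunks]
    return sum(fetch.(tasks))  # fetch blocks until each spawned task is done
end

parallel_sum(1:1_000_000)  # == sum(1:1_000_000)
```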
We start with some basic background for understanding the architecture of modern HPC clusters, the underlying memory model, and the programming paradigm that goes with it.
HPC Cluster & Distributed Memory
A high-performance computing (HPC) cluster is a collection of many separate servers (computers), called nodes, which are connected via a fast interconnect. Each node manages its own private memory. Such a system of interconnected nodes, where no node has direct access to the memory of any other node, follows the distributed-memory model. The underlying fast interconnect (e.g. InfiniBand), which physically connects the nodes in the network via specialised hardware, can transfer data from one node to another extremely efficiently.
By using the fast interconnect, processes across different nodes can communicate with each other through the exchange of messages in a high-throughput, low-latency fashion. The syntax and semantics of how message passing proceeds through such a network are defined by a standard called the Message-Passing Interface (MPI), and different libraries implement the standard, giving users a wide range of choices (MPICH, Open MPI, MVAPICH, etc.). The MPI.jl package provides a high-level API for Julia users to call library routines of the implementation of their choice.
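As an illustration (not specific to Chmy.jl), a minimal MPI.jl program that passes a message between two ranks could look as follows; the keyword-argument form of `MPI.Send` and `MPI.Recv!` shown here corresponds to recent MPI.jl versions, and the script would be launched with an MPI launcher, e.g. `mpiexec -n 2 julia message.jl`.

```julia
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

if rank == 0
    # Rank 0 sends a small buffer to rank 1 over the interconnect.
    MPI.Send(collect(1.0:4.0), comm; dest = 1, tag = 0)
elseif rank == 1
    # Rank 1 receives the buffer into a preallocated array.
    buf = zeros(4)
    MPI.Recv!(buf, comm; source = 0, tag = 0)
    println("rank 1 received $buf")
end

MPI.Finalize()
```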
In general, implementations of the MPI standard can be used on a great variety of computers, not just HPC clusters, as long as the computers are connected by a communication network.
Distributed Architecture
Expanding upon our understanding of message passing in HPC clusters, we now turn our focus to its application within GPU-enhanced environments in Chmy.jl. Our distributed architecture builds upon the abstraction of a GPU cluster whose nodes share the same GPU architecture. Note that, in general, GPU clusters may be equipped with hardware from different vendors, incorporating different types of GPUs to exploit their unique capabilities for specific tasks.
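As a hedged sketch (the names below follow the pattern used in Chmy.jl's distributed examples and should be checked against the current API reference), a distributed architecture is constructed from a backend, an MPI communicator, and the dimensions of the process grid, where zeros let MPI choose the decomposition:

```julia
using Chmy, Chmy.Architectures, Chmy.Distributed
using KernelAbstractions  # provides the CPU() backend; swap for a GPU backend on a GPU cluster
using MPI

MPI.Init()

# Distributed architecture over all MPI ranks; (0, 0) lets MPI pick the
# 2D process-grid dimensions. Constructor pattern assumed from Chmy.jl's
# distributed examples.
arch = Arch(CPU(), MPI.COMM_WORLD, (0, 0))
topo = topology(arch)      # Cartesian topology of the process grid (assumed name)
me   = global_rank(topo)   # this process' rank in the global communicator (assumed name)

MPI.Finalize()
```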
The `Distributed` module currently supports only GPU-aware MPI when a GPU backend is selected for multi-GPU computations. For the `Distributed` module to function properly, a GPU-aware MPI library installation must be used; otherwise, a segmentation fault will occur.
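One way to guard against this (a suggestion rather than part of the Chmy.jl API) is to query MPI.jl for GPU support before selecting a GPU backend; MPI.jl exposes `MPI.has_cuda()` for CUDA-aware builds, and recent versions provide an analogous `MPI.has_rocm()`.

```julia
using MPI

MPI.Init()

# Fail early with a clear error instead of a segmentation fault if the
# underlying MPI library was not built with CUDA (GPU) support.
if !MPI.has_cuda()
    error("The MPI library is not CUDA-aware; use a GPU-aware MPI installation ",
          "before running multi-GPU computations with the Distributed module.")
end
```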