Kernels

The KernelAbstractions.jl package provides a macro-based dialect that hides the intricacies of vendor-specific GPU programming. It allows one to write hardware-agnostic kernels that can be instantiated and launched for different device backends without modifying the high-level code or sacrificing performance.

In the following, we show how to write and launch kernels on various backends. We also explain the concept of a Launcher in Chmy.jl, which complements the default kernel launching by allowing us to hide the latency of boundary condition handling and MPI communication behind the bulk of the computations.

Writing Kernels

This section highlights some important features of KernelAbstractions.jl that are essential for understanding the high-level kernel abstraction used throughout our package. As it serves illustrative purposes only, please refer to the KernelAbstractions.jl documentation for more specific examples.

using KernelAbstractions

# Define a kernel that performs element-wise operations on A
@kernel function mul2!(A)
    # use @index macro to obtain the global Cartesian index of the current work item.
    I = @index(Global, Cartesian)
    A[I] *= 2
end

With the kernel mul2! defined using the @kernel macro, we can launch it on the desired backend to perform the element-wise operations, here on the host (CPU).

# Define array and work group size
A       = ones(1024, 1024)
backend = get_backend(A) # CPU

# Launch kernel and explicitly synchronize
kernel = mul2!(backend)
kernel(A, ndrange=size(A))
KernelAbstractions.synchronize(backend)

# Result assertion
@assert all(A .== 2.0)

To launch the kernel on GPU devices, one could simply define A as a CuArray, ROCArray, or oneArray, as detailed in the KernelAbstractions.jl section "Launching kernel on the backend". More fine-grained memory access is available using the @index macro, as described in the next section.
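For instance, a minimal sketch of the same launch on an NVIDIA GPU, assuming CUDA.jl is installed, a CUDA-capable device is available, and mul2! is defined as above:

using CUDA

# Allocate the array directly on the device
A       = CUDA.ones(1024, 1024)
backend = get_backend(A) # CUDABackend

# Launch kernel and explicitly synchronize, exactly as on the host
kernel = mul2!(backend)
kernel(A, ndrange=size(A))
KernelAbstractions.synchronize(backend)

# Copy back to the host for the assertion
@assert all(Array(A) .== 2.0)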

Thread Indexing

Thread indexing is essential for correctly accessing memory on GPU devices; however, it can quickly become cumbersome to figure out the thread index, especially when working with multi-dimensional grids of multi-dimensional blocks of threads. The performance of kernels can also depend significantly on the memory access pattern.

In the example above, we saw the usage of I = @index(Global, Cartesian), which retrieves the global Cartesian index of the current work item and is used to address the two-dimensional array A. KernelAbstractions.jl provides such macros for conveniently retrieving the desired thread index.

The following table is non-exhaustive and provides a reference of commonly used terminology. Here, KernelAbstractions.@index is used for index retrieval, and KernelAbstractions.@groupsize is used for obtaining the dimensions of blocks of threads.

| KernelAbstractions           | CPU       | CUDA                            | AMDGPU                          |
|------------------------------|-----------|---------------------------------|---------------------------------|
| @index(Local, Linear)        | mod(i, g) | threadIdx().x                   | workitemIdx().x                 |
| @index(Local, Cartesian)[2]  |           | threadIdx().y                   | workitemIdx().y                 |
| @index(Local, Cartesian)[3]  |           | threadIdx().z                   | workitemIdx().z                 |
| @index(Group, Linear)        | i ÷ g     | blockIdx().x                    | workgroupIdx().x                |
| @index(Group, Cartesian)[2]  |           | blockIdx().y                    | workgroupIdx().y                |
| @groupsize()[3]              |           | blockDim().z                    | workgroupDim().z                |
| @index(Global, Linear)       | i         | global index computation needed | global index computation needed |
| @index(Global, Cartesian)[2] |           | global index computation needed | global index computation needed |
| @index(Global, NTuple)       |           | global index computation needed | global index computation needed |
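The entries marked "global index computation needed" reflect that raw CUDA.jl and AMDGPU.jl kernels only expose local (thread/workitem) and group (block/workgroup) indices, so the global index must be assembled by hand. As a hedged sketch (using CUDA.jl for illustration; the kernel names cuda_mul2! and mul2_linear! are ours), compare a raw kernel with its KernelAbstractions counterpart:

# Raw CUDA.jl kernel: the global linear index is computed from block and thread indices
function cuda_mul2!(A)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(A)
        @inbounds A[i] *= 2
    end
    return nothing
end
# launched e.g. with: @cuda threads=256 blocks=cld(length(A), 256) cuda_mul2!(A)

# KernelAbstractions kernel: @index(Global, Linear) performs this computation for us
@kernel function mul2_linear!(A)
    i = @index(Global, Linear)
    @inbounds A[i] *= 2
end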

The @index(Global, NTuple) variant returns an NTuple object, allowing more fine-grained memory control over the allocated arrays.

@kernel function memcpy!(a, b)
    i, j = @index(Global, NTuple)
    @inbounds a[i, j] = b[i, j]
end

The tuple can also be splatted with the Julia ... operator to avoid explicitly unpacking it into the i, j indices.

@kernel function splatting_memcpy!(a, b)
    I = @index(Global, NTuple)
    @inbounds a[I...] = b[I...]
end
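As a usage sketch (assuming a CPU backend and the kernels defined above; the variable names a, b, and copy_kernel are ours), both copy kernels are instantiated and launched just like mul2!:

a = zeros(64, 64)
b = rand(64, 64)

# Instantiate the kernel for the backend of a and launch over the full array
copy_kernel = memcpy!(get_backend(a))
copy_kernel(a, b, ndrange=size(a))
KernelAbstractions.synchronize(get_backend(a))

@assert a == b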

Kernel Launcher

In Chmy.jl, the KernelLaunch module provides convenient utilities for performing grid operations on selected data entries of the Fields involved in each kernel launch, taking the underlying grid geometry into account.

In the following, we define a kernel launcher associated with a UniformGrid object, using the CUDA backend.

# Load Chmy and its submodules, KernelAbstractions, and the CUDA backend
using Chmy, Chmy.Architectures, Chmy.Grids, Chmy.Fields, Chmy.GridOperators, Chmy.BoundaryConditions, Chmy.KernelLaunch
using KernelAbstractions, CUDA

# Define backend and geometry
arch = Arch(CUDABackend())
grid = UniformGrid(arch; origin=(-1, -1), extent=(2, 2), dims=(126, 126))

# Define launcher
launch = Launcher(arch, grid)

We also define two kernel functions, compute_q! and update_C!, which update the fields q and C using the grid operators ∂x, ∂y, and divg (see section Grid Operators) anchored on a grid g. The last argument O is an index offset supplied by the launcher, which shifts the global index so that only the desired part of the index range is updated.

@kernel inbounds = true function compute_q!(q, C, χ, g::StructuredGrid, O)
    I = @index(Global, Cartesian)
    I = I + O
    q.x[I] = -χ * ∂x(C, g, I)
    q.y[I] = -χ * ∂y(C, g, I)
end

@kernel inbounds = true function update_C!(C, q, Δt, g::StructuredGrid, O)
    I = @index(Global, Cartesian)
    I = I + O
    C[I] -= Δt * divg(q, g, I)
end

To spawn the kernels, we call the launcher launch at each physical time step to perform the field updates, and specify the desired boundary conditions for the fields involved where needed.

# Define physics, numerics, geometry ...
for it in 1:nt
    # without boundary conditions
    launch(arch, grid, compute_q! => (q, C, χ, grid))

    # with Neumann boundary conditions and MPI exchange
    launch(arch, grid, update_C! => (C, q, Δt, grid); bc=batch(grid, C => Neumann(); exchange=C))
end