[실습] CUDA의 Cooperative Groups

매운돌 2023. 3. 5. 04:34

병렬 알고리즘을 작성하다 보면 Thread들을 그룹화하고 이러한 그룹들간에 동기화를 수행해야 하는 경우가 많이 있습니다. 그런데 CUDA 버전9 이전에는 Thread Block 내의 모든 Thread들을 동기화 하는 __syncthreads( )만 제공 되어졌습니다. 그러다 보니 Thread Block보다 더 작은 그룹을 정의하고 이에 대한 동기화하고자 하는 요구가 많이 존재했었습니다. 왜냐하면 보다 작은 그룹을 구성함으로써 불필요한 동기화를 최소화 할수있고, 이에 더해 설계에 대한 유연성을 높일 수 있기 때문입니다. 그리고 Cooperative Group는 이러한 요구를 충족시킬 수 있게 하위 블록 및 다중 블록 세분성에서 Thread 그룹을 명시적으로 정의하고 Thread에 대한 동기화와 같은 집합적 작업을 수행하는 기능이 도입되었습니다. (추가적으로 Grid단위나 멀티 디바이스에 대한 Synchronization도 제공합니다.) 또한 추상화를 제공함으로써 다양한 GPU에 유연하고 확장성 있는 코드를 작성할 수 있게 해줍니다.

우선 Cooperative Groups를 사용하기 위해서는 아래의 헤더 파일들을 상황에 맞게 포함해야 합니다.

// Primary header is compatible with pre-C++11, collective algorithm headers require C++11
#include <cooperative_groups.h>
// Optionally include for memcpy_async() collective
#include <cooperative_groups/memcpy_async.h>
// Optionally include for reduce() collective
#include <cooperative_groups/reduce.h>
// Optionally include for inclusive_scan() and exclusive_scan() collectives
#include <cooperative_groups/scan.h>

Cooperative Groups 프로그래밍 모델은 다음 요소로 구성됩니다.

cooperating threads 그룹을 나타내는 데이터 타입
CUDA launch API에 의해 정의된 암시적 그룹을 얻기 위한 operation들
기존 그룹을 새 그룹으로 분할하기 위한 Collectives
데이터 이동과 생산을 위한 알고리즘 모음 (e.g. memcpy_async, reduce, scan)
그룹 내 모든 Thread를 동기화하는 operation
그룹 속성을 검사하기 위한 operation
low-level, 그룹별, 하드웨어 가속이 된 operation들 모음

아래의 코드는 block-wide하게 sum reduction을 수행하는 코드입니다. parallel_kernel에서 sum을 호출하게 되는데, sum 내부에서 __syncthreads()을 수행하고 있습니다. 따라서 block내의 모든 thread가 __syncthreads()에 도달해야만 합니다. 하지만 다른 개발자들이 이러한 조건을 파악하기 위해서는 sum 함수 코드를 확인해야만 합니다.

__device__ int sum(int *x, int n) {
    // ...
    __syncthreads();
    return total;
}

__global__ void parallel_kernel(float *x) {
    // ...
    // Entire thread block must call sum
    sum(x, n);
}

Cooperative Group을 사용하면 thread 그룹을 명시적 넘겨주어 동기화에 대해 sum코드을 읽어보지 않고도 쉽게 파악할 수 있습니다.

__device__ int sum(const thread_block& g, int *x, int n) {
    // ...
    g.sync()
    return total;
}

__global__ void parallel_kernel(...) {
    // ...
    // Entire thread block must call sum
    thread_block tb = this_thread_block();
    sum(tb, x, n);
    // ...
}

단, 암시적 그룹은 코드의 어느 부분에서든 생성할 수 있지만, 가능한 코드 앞부분에서 생성하는게 좋습니다. 왜냐하면 특정 분기에서 암시적 그룹을 생성하게 되면 특정 thread들은 해당 분기에 진입하지 못해 데드락이나 data corruption이 발생할 수 있습니다.

Implicit Groups

Thread Groups

Cooperative Groups에서 가장 기본적인 타입입니다.

모든 Group은 Thread Groups을 상속받아 구현되어 있습니다.

unsigned size()을 통해 현재 그룹의 Thread 개수 [0, num_threads)
unsigned num_threads()을 통해 현재 그룹의 Thread 개수 [0, num_threads)
unsigned thread_rank()을 통해 현재 그룹내의 Thread 인덱스
bool is_valid()을 통해 해당 그룹이 유효한지 알 수 있습니다.
아래의 두가지 방법을 통해서 동기화를 수행할 수 있습니다.

g.sync();           // synchronize group g
cg::synchronize(g); // an equivalent way to synchronize g

Thread Blocks Group

CUDA개발을 경험해 본 사람이라면 thread blocks에 대해 익숙할 것입니다. 그래서 Cooperative Groups에서는 커널 내에서 이 개념을 명시적으로 나타내기 위해, 새로운 데이터 유형인 thread_block을 도입했습니다. 아래의 방법을 통해 초기화 할 수 있습니다.

thread_block block = this_thread_block();

또한 이들은 각각 CUDA의 blockIdx 및 threadIdx와 동일합니다.

dim3 group_index();  // 3-dimensional block index within the grid
dim3 thread_index(); // 3-dimensional thread index within the block
dim3 dim_threads(); // Dimensions of the launched block in units of threads

Cluster Group

이 그룹은 단일 클러스터에서 시작된 모든 Thread를 나타냅니다.
(만약 비클러스터 Grid에서 시작했다면, API는 1x1x1 클러스터를 가정합니다.)

cluster_group g = this_cluster();

cluster_group::arrival_token barrier_arrive(): cluster barrier를 기다립니다. 그리고 cluster barrier에 도착하면 barrier_wait()에 전달할 토큰을 가져옵니다.
void barrier_wait(cluster_group::arrival_token&& t): RValue 참조로 barrier_arrive()에서 얻은 arrival_token을 반환합니다.
unsigned int query_shared_rank(const void *addr): Obtain the block rank to which a shared memory address belongs

#include <cooperative_groups.h>

using namespace cooperative_groups;

void __device__ init_shared_data(const thread_block& block, int *data);
void __device__ local_processing(const thread_block& block);
void __device__ process_shared_data(const thread_block& block, int *data);

__global__ void cluster_kernel() {
    extern __shared__ int array[];
    auto cluster = this_cluster();
    auto block   = this_thread_block();

    // Use this thread block to initialize some shared state
    init_shared_data(block, &array[0]);

    auto token = cluster.barrier_arrive(); // Let other blocks know this block is running and data was initialized

    // Do some local processing to hide the synchronization latency
    local_processing(block);

    // Map data in shared memory from the next block in the cluster
    int *dsmem = cluster.map_shared_rank(&array[0], (cluster.block_rank() + 1) % cluster.num_blocks());

    // Make sure all other blocks in the cluster are running and initialized shared data before accessing dsmem
    cluster.barrier_wait(std::move(token));

    // Consume data in distributed shared memory
    process_shared_data(block, dsmem);
}

Grid Group

이 그룹은 단일 Grid에서 시작된 모든 Thread를 나타냅니다.

grid_group g = this_grid();

기존에는 글로벌 함수 안에서 Grid내의 모든 Thread들을 동기화하는 방법은 없었습니다. 따라서 아래의 코드와 같이 현재 글로벌 함수를 끝내고 새로운 글로벌 함수를 시작하여 이러한 문제를 해결했습니다.

// threads update particles in parallel
integrate<<<blocks, threads, 0, s>>>(particles);

// Note: implicit sync between kernel launches

// Collide each particle with others in neighborhood
collide<<<blocks, threads, 0, s>>>(particles);

하지만 Grid Group을 사용하면 아래 처럼 하나의 global 함수내에서 해결할 수 있습니다.

이와 같이 구현하게 되면 불필요한 Global Memory 접근을 줄일 수 있어서 성능에 도움이 됩니다.

__global__ void particleSim(Particle *p, int N) {
  
  grid_group g = this_grid();
  // phase 1
  for (i = g.thread_rank(); i < N; i += g.size())
    integrate(p[i]);

  g.sync() // Sync whole grid

  // phase 2
  for (i = g.thread_rank(); i < N; i += g.size())
    collide(p[i], p, N);
}

unsigned long long cluster_rank(): 현재 그룹내의 Cluster의 인덱스 [0, num_clusters)
unsigned long long num_clusters(): 현재 그룹내의 Cluster의 개수
dim3 dim_clusters(): clusters 단위로 실행된 grid의 Dimensions
dim3 cluster_index(): 시작된 grid 내의 cluster의 3 차원 인덱스

Multi Grid Group

이 그룹은 모든 장치에서 시작된 모든 Thread를 나타냅니다.
(Grid 그룹과 달리 모든 API는 적절한 launch API를 사용해야합니다.)

// Kernel must be launched with the cooperative multi-device API
multi_grid_group g = this_multi_grid();

unsigned int grid_rank(): Rank of the grid within [0,num_grids]
unsigned int num_grids(): 실행된 전체 Grid의 개수

Group Partitioning

tiled_partition

Cooperative Groups은 유연한하게 기존에 Group으로 부터 새로운 Group을 생성할 수 있게 해주는 tiled_partition 함수를 제공합니다.

tiled_partition는 아래 코드에서 볼 수 있듯이 template 형태와 기본 함수 형태가 존재합니다.

template <unsigned int Size, typename ParentT>
thread_block_tile<Size, ParentT> tiled_partition(const ParentT& g);

thread_group tiled_partition(const thread_group& parent, unsigned int tilesz);

tiled_partition 함수를 수행하면 (부모 그룹의 크기 / tilesz)개의 그룹이 생성됩니다.

부모 그룹 크기는 tilesz에 따라 균등하게 나눌 수 있어야 하며, tilesz보다 더 커야 합니다.

또한 tilesz는 native hardware sizes(1/2/4/8/16/32)에 제한됩니다.(템플릿 버전의 tiled_partition은 64/128/256/512도 지원하지만, 추가적인 단계가 필요합니다.)

아래의 예시는 thread_block 그룹으로 부터 32개의 Thread를 가지는 새로운 그룹을 생성하게 됩니다.

/// The following code will create a 32-thread tile
thread_block block = this_thread_block();
thread_block_tile<32> tile32 = tiled_partition<32>(block);

또한 Warp의 크기보다 더 작은 Thread 그룹을 생성할 수도 있습니다.

auto tile4 = tiled_partition<4>(tile32);
// or using a general group
// thread_group tile4 = tiled_partition(tile32, 4);

if (tile4.thread_rank()==0) printf("Hello from tile4 rank 0\n");
// then the statement would be printed by every fourth thread in the block:

labeled_partition

labeled_partition 함수는 부모 그룹을 1 차원 하위 그룹으로 분할하는 collective operation입니다.

그룹내 각 Thread들을 조건 레이블(condition label)로 평가하고, 레이블에 대해 동일한 값은 같은 그룹으로 지정합니다.

template <typename Label>
coalesced_group labeled_partition(const coalesced_group& g, Label label);

template <unsigned int Size, typename Label>
coalesced_group labeled_partition(const thread_block_tile<Size>& g, Label label);

binary_partition

binary_partition 함수는 부모 그룹을 1 차원 하위 그룹으로 분할하는 collective operation입니다.

predicate(= pred)를 평가하고 동일한 값을 가진 스레드를 동일한 그룹에 할당합니다. 이것은 라벨이 0 또는 1 일 수있는 특수 형태의 labeled_partition()입니다.

coalesced_group binary_partition(const coalesced_group& g, bool pred);

template <unsigned int Size>
coalesced_group binary_partition(const thread_block_tile<Size>& g, bool pred);

이번 글에서는 Cooperative Groups에서 다양한 그룹들과 partition에 대해서만 알아보았습니다. 하지만 Cooperative Groups에는 이외에도 더 많은 내용들(Data Transfer and Manipulation, Execution control, Multi-Grid synchronization 등등 )이 존재합니다. 하지만 너무 많은 내용이 한 글에 존재하면 정독하는데 어려움이 많으니, 다음 내용들은 다음 글에서 다뤄보겠습니다.

Reference

CUDA C++ Programming Guide (nvidia.com)

CUDA C++ Programming Guide

Texture memory is read from kernels using the device functions described in Texture Functions. The process of reading a texture calling one of these functions is called a texture fetch. Each texture fetch specifies a parameter called a texture object for t

docs.nvidia.com

CUDA 9 Features Revealed: Volta, Cooperative Groups and More | NVIDIA Technical Blog

The CUDA 9 release includes support for Volta GPUs, Cooperative Groups programming model extensions, faster libraries, and improved developer tools.

developer.nvidia.com

Cooperative Groups: Flexible CUDA Thread Programming | NVIDIA Technical Blog

In efficient parallel algorithms, threads cooperate and share data to perform collective computations. To share data, the threads must synchronize. The granularity of sharing varies from algorithm to…

developer.nvidia.com