#cuda #cpp #gpu #parallel-programming

Learning CUDA: From Zero to Parallel

A ground-up guide to the fundamentals of CUDA programming: grids, blocks, and threads, and how to build your first kernel.

CUDA (Compute Unified Device Architecture) is NVIDIA’s parallel computing platform and programming model. If you’re used to writing serial CPU code, writing for the GPU requires a mental shift: instead of one very fast processor doing things sequentially, you have thousands of simpler processors doing things concurrently.

In this post, we’ll break down the concepts of CUDA from the ground up.


The Core Concept: CPU vs GPU

Think of a CPU like a small team of highly intelligent professors. They can handle complex, branching logic exceptionally well. Think of a GPU like an army of thousands of elementary school students. Individually, they won’t solve a hard calculus problem very quickly, but give them a million basic arithmetic problems to solve in parallel and they’ll crush the professors.

In CUDA terminology:

  • Host: The CPU and its memory.
  • Device: The GPU and its memory.

You write code that runs on the Host, which then copies data to the Device, launches a function (called a Kernel) on the Device to process that data in parallel, and finally copies the results back to the Host.


Grids, Blocks, and Threads

When you launch a kernel on the GPU, you spawn thousands of Threads. To organize them, CUDA groups threads into a hierarchy:

  1. Thread: The fundamental unit of execution. Each thread runs the kernel code.
  2. Block: A group of threads. Threads within the same block can cooperate, synchronize, and share fast memory.
  3. Grid: A group of blocks. A kernel launch creates one grid.

💡 The Indexing Trick: When your kernel runs, every thread executes the same code. How does a thread know which piece of data to work on? By using its unique ID! A thread’s global index is calculated as: int index = threadIdx.x + blockIdx.x * blockDim.x;
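
To make that concrete, here is a quick worked example with illustrative numbers (256 threads per block is the value we use later in this post):

// Suppose each block holds blockDim.x = 256 threads,
// and this thread is thread threadIdx.x = 5 inside block blockIdx.x = 2.
int index = threadIdx.x + blockIdx.x * blockDim.x; // 5 + 2 * 256 = 517
// So this particular thread works on element 517 of the data.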


A Real Example: Vector Addition

Let’s look at the “Hello World” of parallel computing: adding two arrays (vectors) together element by element, so that C[i] = A[i] + B[i].

1. The Kernel (Runs on the GPU)

Notice the __global__ keyword. This tells the compiler: “This function runs on the device (GPU) but is called from the host (CPU).”

__global__ void addVectors(float *A, float *B, float *C, int N) {
    // Calculate the unique global thread ID
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    
    // Make sure we don't read/write past the end of our arrays
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}

2. The Host Code (Runs on the CPU)

Memory management in CUDA is similar to standard C (malloc and free), but we use cudaMalloc and cudaFree for GPU memory. We also use cudaMemcpy to move data between the host and device.

#include <iostream>
#include <cuda_runtime.h>

int main() {
    int N = 100000; // Array size
    size_t size = N * sizeof(float);

    // 1. Allocate Host Memory
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);

    // Initialize data on the host
    for (int i = 0; i < N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }

    // 2. Allocate Device (GPU) Memory
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // 3. Copy data from Host to Device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // 4. Launch the Kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    
    // The <<<blocks, threads>>> syntax is specific to CUDA
    addVectors<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // 5. Copy the result back to Host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // (Optional) Print first result to verify
    std::cout << "C[0] = " << h_C[0] << std::endl;

    // 6. Free Memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    free(h_A);
    free(h_B);
    free(h_C);

    return 0;
}
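
One thing the example above leaves out for brevity is error checking. Every CUDA runtime call (cudaMalloc, cudaMemcpy, cudaFree, …) returns a cudaError_t, and a kernel launch reports problems through cudaGetLastError(). A common pattern is to wrap each call in a small checking macro; the CUDA_CHECK name below is just an illustration:

#include <cstdlib>  // for std::exit

// Hypothetical helper macro: aborts with a readable message if a CUDA call fails
#define CUDA_CHECK(call)                                                     \
    do {                                                                     \
        cudaError_t err = (call);                                            \
        if (err != cudaSuccess) {                                            \
            std::cerr << "CUDA error: " << cudaGetErrorString(err)           \
                      << " at " << __FILE__ << ":" << __LINE__ << std::endl; \
            std::exit(EXIT_FAILURE);                                         \
        }                                                                    \
    } while (0)

// Example usage:
CUDA_CHECK(cudaMalloc(&d_A, size));
CUDA_CHECK(cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice));

addVectors<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
CUDA_CHECK(cudaGetLastError()); // the launch itself has no return value to check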

Understanding the Launch Configuration

The weirdest part of the code above is addVectors<<<blocksPerGrid, threadsPerBlock>>>(). This is called the execution configuration.

  • threadsPerBlock = 256: We are telling the GPU we want 256 threads in every block. You generally want a multiple of 32 (the warp size); see the snippet after this list for how to query your GPU’s limits.
  • blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock: This is a standard integer division trick to round up. If N is 100,000, and we have 256 threads per block, we need exactly 391 blocks to process all elements (391 * 256 = 100,096 threads).
  • The if (i < N) check inside the kernel ensures those extra 96 threads simply do nothing instead of reading or writing past the end of the arrays.
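
How do you know what block sizes your GPU actually supports? The runtime can tell you. A minimal sketch, assuming we query device 0 (the first GPU):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);  // fill prop with the capabilities of device 0

std::cout << "Max threads per block: " << prop.maxThreadsPerBlock << std::endl;
std::cout << "Warp size:             " << prop.warpSize << std::endl;

On current NVIDIA hardware this typically reports a warp size of 32 and a maximum of 1024 threads per block.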

Conclusion

You just learned the three fundamental steps of virtually every CUDA program:

  1. Move data from CPU to GPU (cudaMemcpy H2D).
  2. Process data on the GPU by launching a kernel (<<<...>>>).
  3. Move data from GPU to CPU (cudaMemcpy D2H).

From here, you can start writing your own simple kernels and compiling them using nvcc!
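
For reference, if you save the full example as vector_add.cu (the filename is just an example), compiling and running it looks like this:

nvcc vector_add.cu -o vector_add
./vector_add

If everything worked, it prints C[0] = 3.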