Learning CUDA: From Zero to Parallel
A comprehensive guide to learning the fundamentals of CUDA programming, grids, blocks, threads, and building your first kernel.
CUDA (Compute Unified Device Architecture) is NVIDIA’s parallel computing platform and programming model. If you’re used to writing serial CPU code, writing for the GPU requires a mental shift: instead of one very fast processor doing things sequentially, you have thousands of simpler processors doing things concurrently.
In this post, we’ll break down the concepts of CUDA from the ground up.
The Core Concept: CPU vs GPU
Think of a CPU like a small team of highly intelligent professors: they handle complex, branching logic exceptionally well. Think of a GPU like an army of thousands of elementary school students: individually, none of them will work through complex calculus very quickly, but give them a million basic arithmetic problems to solve in parallel and they’ll crush the professors.
In CUDA terminology:
- Host: The CPU and its memory.
- Device: The GPU and its memory.
You write code that runs on the Host, which then copies data to the Device, launches a function (called a Kernel) on the Device to process that data in parallel, and finally copies the results back to the Host.
Grids, Blocks, and Threads
When you launch a kernel on the GPU, you spawn thousands of Threads. To organize them, CUDA groups threads into a hierarchy:
- Thread: The fundamental unit of execution. Each thread runs the kernel code.
- Block: A group of threads. Threads within the same block can cooperate, synchronize, and share fast memory.
- Grid: A group of blocks. A kernel launch creates one grid.
💡 The Indexing Trick: When your kernel runs, every thread executes the same code. How does a thread know which piece of data to work on? By using its unique ID! A thread’s global index is calculated as:
int index = threadIdx.x + blockIdx.x * blockDim.x;
A Real Example: Vector Addition
Let’s look at the “Hello World” of parallel computing: adding two arrays (vectors) element by element, so that C[i] = A[i] + B[i].
1. The Kernel (Runs on the GPU)
Notice the __global__ keyword. This tells the compiler: “This function runs on the device (GPU) but is called from the host (CPU).”
__global__ void addVectors(float *A, float *B, float *C, int N) {
    // Calculate the unique global thread ID
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Make sure we don't read/write past the end of our arrays
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}
2. The Host Code (Runs on the CPU)
Memory management in CUDA is similar to standard C (malloc and free), but we use cudaMalloc and cudaFree for GPU memory. We also use cudaMemcpy to move data between the host and device.
#include <iostream>
#include <cuda_runtime.h>
int main() {
    int N = 100000; // Array size
    size_t size = N * sizeof(float);

    // 1. Allocate Host Memory
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);

    // Initialize data on the host
    for (int i = 0; i < N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }

    // 2. Allocate Device (GPU) Memory
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // 3. Copy data from Host to Device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // 4. Launch the Kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    // The <<<blocks, threads>>> syntax is specific to CUDA
    addVectors<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // 5. Copy the result back to Host
    // (this cudaMemcpy also waits for the kernel to finish)
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // (Optional) Print first result to verify
    std::cout << "C[0] = " << h_C[0] << std::endl;

    // 6. Free Memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    free(h_A);
    free(h_B);
    free(h_C);

    return 0;
}
Understanding the Launch Configuration
The weirdest part of the code above is addVectors<<<blocksPerGrid, threadsPerBlock>>>(). This is called the execution configuration.
- threadsPerBlock = 256: We are telling the GPU we want 256 threads in every block. You generally want a multiple of 32 (the warp size).
- blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock: This is a standard integer-division trick to round up. If N is 100,000 and we have 256 threads per block, we need exactly 391 blocks to cover all elements (391 * 256 = 100,096 threads).
- The if (i < N) check inside the kernel ensures those extra 96 threads do nothing instead of reading or writing out of bounds.
Conclusion
You just learned the three fundamental steps of virtually every CUDA program:
- Move data from CPU to GPU (cudaMemcpy with cudaMemcpyHostToDevice).
- Process data on the GPU by launching a kernel (<<<...>>>).
- Move data from GPU to CPU (cudaMemcpy with cudaMemcpyDeviceToHost).
From here, you can start writing your own simple kernels and compiling them using nvcc!
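As a starting point, if you save the full example as vector_add.cu (a filename chosen here just for illustration), a typical compile-and-run session looks like this:

```shell
# Compile with the NVIDIA CUDA compiler driver
nvcc -o vector_add vector_add.cu

# Run on a machine with a CUDA-capable GPU and driver installed
./vector_add
```

If everything works, the program should print C[0] = 3, since every element of the result is 1.0 + 2.0.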