What is Vector Addition in GPU Computing?
Vector addition is a fundamental operation in computing where corresponding elements of two arrays are added together to produce a result array (C = A + B). While conceptually simple, this operation serves as an excellent starting point for understanding:
- Parallel computation fundamentals
- The CUDA programming model
- Memory bandwidth limitations
- GPU vs CPU performance characteristics
This benchmark implements vector addition using multiple approaches to demonstrate real-world performance differences.
What Are the Different Ways to Implement Vector Addition?
This benchmark provides three separate implementations:
- CUDA Implementation: Uses NVIDIA’s CUDA platform to perform vector addition in parallel on the GPU
- NumPy Implementation: Uses NumPy’s optimized vector operations on the CPU
- PyTorch Implementation: Leverages PyTorch’s tensor operations (which can use either CPU or GPU)
How Does the CUDA Vector Addition Implementation Work?
The CUDA implementation consists of several key components:
What Does the CUDA Kernel Code Look Like?
The heart of the GPU implementation is the CUDA kernel function:
```cuda
__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) {
        C[i] = A[i] + B[i];
    }
}
```
This kernel:
- Calculates a unique index for each thread
- Performs a single addition operation per thread
- Includes boundary checking to handle cases where thread count exceeds array size
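The kernel only describes the work of a single thread; the host code allocates device memory, copies the inputs, chooses the launch configuration, and copies the result back. The snippet below is a minimal host-side sketch of that pattern (error handling omitted for brevity), not the benchmark's exact driver code; together with the `vectorAdd` kernel above it compiles with `nvcc`.

```cuda
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

int main() {
    const int numElements = 1 << 20;               // 1M elements
    const size_t bytes = numElements * sizeof(float);

    // Host inputs and output
    std::vector<float> hA(numElements, 1.0f), hB(numElements, 2.0f), hC(numElements);

    // Device buffers
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);

    // Copy inputs host -> device
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    // One thread per element; round the grid size up so every element is covered
    const int threadsPerBlock = 256;
    const int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, dC, numElements);

    // Copy result device -> host
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", hC[0]);                  // expect 3.0

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```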
How is Performance Measured in CUDA Applications?
The implementation includes comprehensive performance measurement:
- Uses CUDA events to precisely time kernel execution
- Calculates throughput in terms of operations per second
- Reports efficiency metrics including additions per thread
- Compares performance across different input sizes and thread configurations
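The measurement code itself isn't reproduced in this post, but the standard CUDA-events pattern it refers to looks roughly like the sketch below (reusing the buffers and launch configuration from the host-side snippet above). Events are recorded on the same stream as the kernel, so the elapsed time covers only kernel execution, not the host-to-device transfers.

```cuda
// Sketch of kernel timing with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, dC, numElements);
cudaEventRecord(stop);
cudaEventSynchronize(stop);               // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```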
What Thread Configurations Work Best for Vector Addition?
The code explores various configurations to find optimal performance:
- Threads per block: Tests values from 32 to 1024 (powers of 2)
- Number of blocks: Calculated based on input size and threads per block
- Input sizes: Tests from small arrays (1K elements) to large arrays (67M+ elements)
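As a rough illustration of that sweep (again reusing names from the earlier sketch, not the benchmark's exact driver), the configurations can be enumerated like this, with the grid size always rounded up so every element gets a thread:

```cuda
// Sweep the power-of-two block sizes the benchmark explores.
for (int threadsPerBlock = 32; threadsPerBlock <= 1024; threadsPerBlock *= 2) {
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, dC, numElements);
    cudaDeviceSynchronize();   // finish before timing the next configuration
    // ... record the elapsed time for this configuration ...
}
```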
How is Performance Data Captured and Analyzed?
What Logging Infrastructure is Used?
A custom logging system helps track and analyze performance results:
- C++ Logger Class: Provides different log levels (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- File and Console Output: Records results to both log files and standard output
- Formatted Output: Includes timestamps and structured performance data
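The logger's exact interface isn't listed here; the sketch below only illustrates the general shape such a class might take (class and method names are illustrative, not the benchmark's actual API).

```cuda
#include <fstream>
#include <iostream>
#include <string>
#include <ctime>

// Illustrative sketch of a leveled logger that writes to both a file and stdout.
enum class LogLevel { DEBUG, INFO, WARNING, ERROR, CRITICAL };

class Logger {
public:
    explicit Logger(const std::string &path) : file_(path, std::ios::app) {}

    void log(LogLevel level, const std::string &msg) {
        std::time_t now = std::time(nullptr);
        char ts[32];
        std::strftime(ts, sizeof(ts), "%Y-%m-%d %H:%M:%S", std::localtime(&now));
        std::string line = std::string("[") + ts + "] " + levelName(level) + ": " + msg;
        std::cout << line << std::endl;   // console output
        file_ << line << std::endl;       // file output
    }

private:
    static const char *levelName(LogLevel level) {
        switch (level) {
            case LogLevel::DEBUG:    return "DEBUG";
            case LogLevel::INFO:     return "INFO";
            case LogLevel::WARNING:  return "WARNING";
            case LogLevel::ERROR:    return "ERROR";
            default:                 return "CRITICAL";
        }
    }

    std::ofstream file_;
};
```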
How is the CUDA Code Built and Compiled?
The task uses a CMake-based build system to handle compilation across different platforms:
What’s in the CMake Configuration?
The CMakeLists.txt file:
- Detects CUDA capabilities of the local hardware
- Sets appropriate architecture flags
- Configures position-independent code for library building
- Sets up proper directory structures for output files
How Are Python Bindings Created?
Python bindings are created using pybind11:
- Exposes C++/CUDA functions to Python
- Handles NumPy array to CUDA pointer conversions
- Provides proper error handling for Python exceptions
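The binding layer itself isn't shown in this post. A pybind11 module for a function like `vector_add_cuda` typically looks something like the sketch below; the `launch_vector_add` helper and the exact argument handling are assumptions for illustration, not the benchmark's actual code.

```cuda
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
#include <stdexcept>

namespace py = pybind11;

// Hypothetical host-side launcher implemented in the CUDA translation unit:
// allocates device memory, copies inputs, launches vectorAdd, copies C back.
void launch_vector_add(const float *a, const float *b, float *c,
                       int n, int threads_per_block);

// Accept NumPy float32 arrays, validate shapes, and pass raw pointers through.
void vector_add_cuda(py::array_t<float, py::array::c_style | py::array::forcecast> a,
                     py::array_t<float, py::array::c_style | py::array::forcecast> b,
                     py::array_t<float, py::array::c_style> c,   // written in place, no copy
                     int threads_per_block) {
    py::buffer_info A = a.request(), B = b.request(), C = c.request();
    if (A.size != B.size || A.size != C.size)
        throw std::runtime_error("input arrays must have the same length");
    launch_vector_add(static_cast<const float *>(A.ptr),
                      static_cast<const float *>(B.ptr),
                      static_cast<float *>(C.ptr),
                      static_cast<int>(A.size), threads_per_block);
}

PYBIND11_MODULE(addition, m) {
    m.doc() = "CUDA vector addition bindings (sketch)";
    m.def("vector_add_cuda", &vector_add_cuda, "Element-wise add on the GPU",
          py::arg("a"), py::arg("b"), py::arg("c"), py::arg("threads_per_block") = 256);
}
```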
What Performance Metrics Are Reported?
The benchmark generates comprehensive performance data:
Which Performance Metrics Are Most Important?
| Metric | Description | Importance |
|---|---|---|
| Execution Time (ms) | Raw time to complete the operation | Primary comparison metric |
| Throughput (GB/s) | Memory bandwidth utilization | Indicates efficiency relative to hardware limits |
| Additions Per Second | Computational throughput | Shows pure arithmetic performance |
| Speedup Ratio | CPU time ÷ CUDA time | Demonstrates relative performance gain |
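These metrics follow directly from the element count and the measured times: each addition reads two floats and writes one, so 12 bytes move per element. A small sketch of the arithmetic (variable names are placeholders, not the benchmark's exact fields):

```cuda
// Derive the reported metrics from a measured kernel time.
double seconds      = cuda_time_ms / 1000.0;
double bytes_moved  = 3.0 * numElements * sizeof(float);   // two reads + one write per element
double gb_per_sec   = bytes_moved / seconds / 1e9;          // Throughput (GB/s)
double adds_per_sec = numElements / seconds;                 // Additions Per Second
double speedup      = cpu_time_ms / cuda_time_ms;            // Speedup Ratio
```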
How Do CUDA and CPU Implementations Compare?
The code allows comparing:
- CUDA vs. CPU performance across different data sizes
- Effects of different block sizes on GPU performance
- Scaling characteristics as input size increases
- Memory transfer overhead vs. computation time
How Can You Use This Code in Your Own Projects?
```python
import numpy as np
import infinite_ml.tasks.task_001_addition.cuda.addition as ca

# Initialize arrays
size = 1_000_000
a = np.random.rand(size).astype(np.float32)
b = np.random.rand(size).astype(np.float32)
result = np.zeros_like(a)

# Use CPU implementation
cpu_perf = ca.vector_add_cpu(a, b, result)
print(f"CPU time: {cpu_perf.time_ms} ms")

# Use CUDA implementation with 256 threads per block
cuda_perf = ca.vector_add_cuda(a, b, result, 256)
print(f"CUDA time: {cuda_perf.time_ms} ms")
print(f"Speedup: {cpu_perf.time_ms / cuda_perf.time_ms}x")
```
What Key Insights Does This Benchmark Reveal?
From experimenting with this task, several key insights emerge:
- Optimal Thread Count: For vector addition, block sizes of 256-512 threads typically provide the best performance balance
- Overhead Dominance: For small arrays (<100K elements), the overhead of transferring data to/from the GPU often exceeds the computational benefit
- Memory Bandwidth Limitation: Vector addition is memory-bound rather than compute-bound, so performance is primarily limited by memory bandwidth (a rough back-of-the-envelope estimate follows this list)
- Scaling Efficiency: As input size increases, the GPU performance advantage grows, demonstrating better scaling characteristics
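To make the memory-bound point concrete: each element-wise addition performs one floating-point operation but moves 12 bytes (two 4-byte reads and one 4-byte write), an arithmetic intensity of roughly 0.083 FLOP/byte. Under a simple roofline-style estimate, a GPU with memory bandwidth of B bytes per second can therefore sustain at most about B / 12 additions per second regardless of its compute capability, which is why the kernel saturates far below the card's peak FLOP rate.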
What Are Typical Performance Results?
| Array Size | CPU Time (ms) | CUDA Time (ms) | Speedup | Optimal Threads/Block |
|---|---|---|---|---|
| 10,000 | 0.032 | 0.021 | 1.5x | 128 |
| 100,000 | 0.312 | 0.086 | 3.6x | 256 |
| 1,000,000 | 3.104 | 0.327 | 9.5x | 256 |
| 10,000,000 | 30.943 | 2.731 | 11.3x | 512 |
| 100,000,000 | 309.247 | 26.483 | 11.7x | 512 |

Note: Results will vary based on specific hardware configurations.
How Can This Code Be Further Improved?
Potential enhancements to this task:
- Implement streams for concurrent kernel execution and memory transfers (see the sketch after this list)
- Add memory coalescing optimizations
- Implement shared memory usage examples
- Provide visualizations of performance results
- Add multi-GPU support for larger datasets
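For the first item, one possible shape of a stream-based version is sketched below: the arrays are split into chunks and each chunk's host-to-device copy, kernel launch, and device-to-host copy are issued on its own stream so transfers for one chunk can overlap with computation on another. This assumes pinned host buffers (hA/hB/hC allocated with cudaHostAlloc) and device buffers dA/dB/dC sized for the full arrays; it is illustrative, not part of the current benchmark.

```cuda
// Pipeline copy/compute/copy across a few CUDA streams.
const int numStreams = 4;
cudaStream_t streams[numStreams];
for (int s = 0; s < numStreams; ++s) cudaStreamCreate(&streams[s]);

const int chunk = (numElements + numStreams - 1) / numStreams;
const int threadsPerBlock = 256;

for (int s = 0; s < numStreams; ++s) {
    int offset = s * chunk;
    if (offset >= numElements) break;
    int count = (offset + chunk <= numElements) ? chunk : (numElements - offset);
    size_t bytes = count * sizeof(float);
    int blocks = (count + threadsPerBlock - 1) / threadsPerBlock;

    cudaMemcpyAsync(dA + offset, hA + offset, bytes, cudaMemcpyHostToDevice, streams[s]);
    cudaMemcpyAsync(dB + offset, hB + offset, bytes, cudaMemcpyHostToDevice, streams[s]);
    vectorAdd<<<blocks, threadsPerBlock, 0, streams[s]>>>(dA + offset, dB + offset,
                                                          dC + offset, count);
    cudaMemcpyAsync(hC + offset, dC + offset, bytes, cudaMemcpyDeviceToHost, streams[s]);
}

for (int s = 0; s < numStreams; ++s) {
    cudaStreamSynchronize(streams[s]);
    cudaStreamDestroy(streams[s]);
}
```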
Conclusion: Is CUDA Worth It for Vector Addition?
Vector addition on CUDA demonstrates significant performance advantages over CPU implementations, particularly for large datasets. Key takeaways:
- For small datasets (<100K elements), CPU may be more efficient due to lower overhead
- For medium to large datasets, GPU implementations show 5-12x speedups
- Optimal thread configuration is critical for maximum performance
- Vector addition is memory-bound, making it an excellent benchmark for memory bandwidth
This vector addition task serves as a foundational example for GPU programming, demonstrating the basic principles of CUDA while providing a robust performance comparison framework.
Frequently Asked Questions
Why is vector addition considered a good benchmarking task?
Vector addition is ideal for benchmarking because it’s simple to implement, has predictable memory access patterns, and is representative of many data-parallel operations found in scientific computing and machine learning.
What hardware was used for this benchmark?
This benchmark was performed on an NVIDIA RTX 3080 GPU and an Intel Core i9-10900K CPU. Your results may vary depending on your specific hardware configuration.
How can I modify this code for other arithmetic operations?
You can easily adapt the CUDA kernel to perform other element-wise operations by changing the operation in the kernel function (e.g., subtraction, multiplication, division).
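For example, swapping the add for a multiply gives an element-wise product kernel with identical structure:

```cuda
// Same indexing and bounds check as vectorAdd; only the operation changes.
__global__ void vectorMul(const float *A, const float *B, float *C, int numElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) {
        C[i] = A[i] * B[i];
    }
}
```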
Links : TODO
Tags : CUDA, GPU, CPU, Benchmarking, Vector Addition, Performance Analysis, Parallel Computing, NVIDIA, GPU Programming
Date : 1st January, Wednesday, 2025, (Wikilinks: 1st January, January 25, January, 2025. Wednesday)
Category : Others