[**A Simpler Introduction To CUDA**](20.04.11-cuda.html)
This article aims to provide you with the best practices to start CUDA C++ programming, so you can write clean, efficient code without breaking a sweat. We'll first go through what CUDA C++ is in the first place, then detail what a modern CUDA architecture looks like, and finish with an overview of the CUDA ecosystem and the tools you should use.

## What is CUDA C++ in the first place?

CUDA C++ is described by Nvidia as a superset of C++, which means you should be able to run any C++ program on your GPU, and more specifically that the **memory model** implemented by CUDA C++ is compliant with **standard C++**. While this is *mostly* true, language compatibility depends on which compiler you're using. Also, only "device" functions can run in your CUDA code, while "host" functions can only run on the CPU. "global" functions are the entry points for GPU computation: they are invoked from the CPU and run on the GPU. They take two launch parameters in addition to their regular parameters: the dimensions of the thread block grid, and the dimensions of a thread block.

```C++
my_func<<< 16, 256 >>>(); // Invokes 16 blocks of 256 threads running my_func
```

All threads within a block share the same shared memory/cache. For several reasons, including this one, thread block sizing can have a **very high impact** on performance. This, and the bug-inducing hassle of juggling two levels of sizing, is why you should stay away from invoking kernels by yourself. The good news is that Nvidia provides great tools that take care of those matters for you, and we'll cover them later. But first, a quick overview of CUDA GPU architecture.

## A short briefing on CUDA GPU architecture

CUDA GPU architectures can vary quite a lot, but they all revolve around the same idea: lots of powerful SIMD processing units. CUDA threads are grouped into "warps" and run simultaneously, meaning that all the threads within the same warp execute the same instruction at the same time. When threads diverge after a condition, they can't all run simultaneously anymore: the two branches have to be executed consecutively, and then the two groups of threads resume simultaneous execution. This is something you can achieve with SIMD processing on CPUs by masking your registers.

However, GPUs improve branching performance by grouping several warps within larger units called Streaming Multiprocessors (SMs), in which threads can be rescheduled. That way, when two groups of threads diverge, they can be redispatched so that the two groups end up in different warps that can run simultaneously, thus mitigating the performance loss induced by branching.

In the Turing architecture[^turing], streaming multiprocessors pack four processing blocks (i.e. SIMD cores) that can execute 32 threads per clock. These four blocks share 96 KB of L1 cache/shared memory (which you can configure to get more cache or more shared memory, depending on your needs). Because Turing SMs have four processing blocks that can execute up to 32 threads at a time, you should use block sizes that are multiples of 32, while making sure you don't saturate the L1 cache or shared memory by packing too many threads into a block... I think you got it: tuning kernels for GPUs is **difficult**.
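To make the launch configuration above concrete, here is a minimal sketch of a hand-written kernel. The `scale` kernel, its parameters, and the sizes are made up for illustration; the part that matters is picking a block size that is a multiple of 32 and deriving the grid size from the problem size.

```C++
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies every element of a device array by a factor.
__global__ void scale(float* data, float factor, int n) {
  // Global index of this thread, computed from block and thread coordinates.
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)            // Guard against the last, partially filled block.
    data[i] *= factor;
}

int main() {
  constexpr int n = 1 << 20;
  float* data = nullptr;
  cudaMalloc(&data, n * sizeof(float));

  constexpr int block_size = 256;                           // A multiple of 32 (the warp size).
  int const grid_size = (n + block_size - 1) / block_size;  // Enough blocks to cover n elements.
  scale<<< grid_size, block_size >>>(data, 2.0f, n);

  cudaDeviceSynchronize(); // Kernel launches are asynchronous; wait for completion.
  cudaFree(data);
}
```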
Nvidia has some great documentation about kernel tuning[^turing-tuning], but doing it for every architecture ends up being an incredibly time-consuming process, and I haven't even mentioned more specific topics such as thread synchronization, shuffle instructions, or simply how to address memory correctly in order to maximize data locality. What you must remember, however, is that GPUs have **massive** memory bandwidth (900 GB/s on Volta, while even the best CPUs won't go past 100 GB/s), and if your application can benefit from that bandwidth, then you should probably go for it. Even a mildly tuned GPU application can easily outperform its well-optimized CPU counterpart, as long as **all** of your application's code gets to run on the GPU. Otherwise its performance will be severely limited by the bandwidth of the CPU/GPU interconnect (i.e. PCIe). The good news is that you don't have to optimize GPU code by yourself: Nvidia already did that work and packed efficient routines into libraries like Thrust, CUB, and cuBLAS.

## The CUDA tooling ecosystem

Now we'll take a closer look at the tools and libraries you can use to get the most out of your GPUs, without spending too much time reinventing the wheel.

### Libraries

Libraries in CUDA C++ are the key to well-performing programs across generations of GPU architectures. They gather all the optimization tricks you'll need and make them easily reusable at no cost, thanks to C++'s zero-cost abstraction principle.

The first library you should think about when writing CUDA code is Thrust[^thrust]. It is made to be a drop-in replacement for the STL, but with CUDA compatibility, a few extra algorithms, and very high-performance implementations. Its [documentation](https://thrust.github.io/doc/modules.html) is pretty straightforward, and any code you implement using Thrust can run on both CPU and GPU (see the sketch at the end of this section). If your problem can be solved using Thrust's primitives, just go for it. It is provided in the CUDA package, so you can think of it as the CUDA standard library.

If your implementation requires more specialized algorithmic skeletons, however, make sure you have a look at CUB[^cub]. It is designed to provide you with primitives at all scales: device-wide (invocable from the CPU), block-wide (to abstract things that require shared memory, for example), and warp-wide (all the way down to CUDA assembly, aka PTX, intrinsics). Similarly to Thrust, CUB's primitives are optimized for all CUDA GPU architectures, but they offer different granularities for optimization at every scale, and they are specific to GPUs (as in: not CPU-compatible).

And finally, for domain-specific tasks, Nvidia has you covered again[^gpu-libraries], from linear algebra with [cuBLAS](https://developer.nvidia.com/cublas) all the way to signal processing with [cuFFT](https://docs.nvidia.com/cuda/cufft/index.html), and many others.
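Coming back to Thrust, here is a minimal sketch of what "drop-in replacement for the STL" means in practice. The saxpy-style operation, the functor, and the sizes are made up for illustration; the point is that the containers and algorithms mirror their STL counterparts while the work runs on the GPU.

```C++
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>

// Hypothetical functor, callable from both host and device code.
struct saxpy {
  float a;
  __host__ __device__ float operator()(float x, float y) const { return a * x + y; }
};

int main() {
  // Device-side containers, analogous to std::vector but living in GPU memory.
  thrust::device_vector<float> x(1 << 20);
  thrust::device_vector<float> y(1 << 20, 1.0f);

  thrust::sequence(x.begin(), x.end()); // Fill x with 0, 1, 2, ...

  // y = 2 * x + y, executed as a GPU kernel under the hood.
  thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy{2.0f});

  // Parallel reduction on the device; the result is copied back to the host.
  float const sum = thrust::reduce(y.begin(), y.end(), 0.0f);
  (void)sum;
}
```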
### Compilers

Compilers, however, are a grey area of CUDA development. The only two compilers that support CUDA C++ are `nvcc` and `clang`. `nvcc` is the only one officially supported by Nvidia, but don't be fooled: `clang` is a **much better** option. `nvcc`'s C++ frontend is full of shortcomings, and it's not hard to hit its limits when you try to compile regular C++ code: parts of the STL do *not* compile with `nvcc`, and some libraries that rely heavily on template resolution mechanisms, such as Blaze, will probably never compile with it. `nvcc` doesn't even support C++17, and even its C++14 support is **very** fragile. It also lacks the error reporting you get from regular C++ compilers, which makes your code hard to debug... So overall, `nvcc` is not a great tool. It will mostly get in your way, forcing you to split your project between C++ and CUDA C++ code and link the two together, which makes templates unusable across that boundary. Basically, you'll have to rearrange your whole project just to work around `nvcc`'s poor C++ support.

`clang`, on the other hand, has bulletproof support for C++, and for CUDA C++ as well. All the CUDA libraries compile with `clang`, and while it lags behind `nvcc` when it comes to the latest supported CUDA version, it's not even *that* far behind, and its support for modern C++ standards is a whole different story. Even the CLI is better with `clang`: it supports the same flags as `gcc`. This is very important and underrated, because it means that all you have to do is add a `-x cuda` flag to compile `.cpp` files with CUDA C++ support. Take a look at my [project templates](https://github.com/JPenuchot/project-templates) if you want to know more about compiling CUDA code with `clang` using both `make` and `cmake`. And finally, `clang` is and will always remain open-source. CUDA support is a mainline `clang` feature: Google supports it and will continue supporting it in the future.

### General advice

- Profiling: NSight
- Doing everything on the GPU beats merely offloading work to the GPU (the PCIe bus becomes the bottleneck, etc.)
- Forget `nvcc`
- In Thrust we trust
- In CUB we cub

### For more specific cases

- Entry points are "global" functions: `void __global__ fun( ... ) { ... }`, invoked as `fun <<< grid_size, block_size >>> ( ... );`. The CPU drives calls on data that lives in GPU memory. Warning: only pass pointers, iterators, views, etc.
- Functions & lambdas marked `__host__ __device__` or `constexpr` are callable on the GPU.
- `cudaMallocManaged()` allocates unified memory (accessible from both CPU & GPU). No need for `cudaMemcpy()`: transfers are handled through paging on x86, and through a fabric with NVLink on PowerPC (better performance, lower latency, etc.). It can also back an allocator for `std::vector`.

Unrolling & memory addressing: memory holds cells 0 through 7, and the tables below show which cell is accessed by threads 1, 2, 3 & 4 during iterations 0 & 1.

| i \ T | 1 | 2 | 3 | 4 |
|-------|---|---|---|---|
| 0     | 0 | 1 | 2 | 3 |
| 1     | 4 | 5 | 6 | 7 |

→ Contiguous addressing, OK

| i \ T | 1 | 2 | 3 | 4 |
|-------|---|---|---|---|
| 0     | 0 | 2 | 4 | 6 |
| 1     | 1 | 3 | 5 | 7 |

→ Non-contiguous addressing, NOT OK: poor locality

## Conclusion

Use Clang!

- A fail-proof C++ compiler
- A simple, standard interface that is easy to integrate with Make or CMake
- No more separate compilation forced on you by `nvcc`

Use Thrust!

- Simple & efficient
- Included in the CUDA package

Use cuBLAS!

- An interface similar to BLAS (it diverges on a few points: handles, storage)
- Uses "secret" PTX and the tensor cores to gain performance

-------------------------------------------------------------------------------

[^cuda-intro]: https://devblogs.nvidia.com/even-easier-introduction-cuda/
[^shared-mem]: https://devblogs.nvidia.com/using-shared-memory-cuda-cc/
[^turing]: https://devblogs.nvidia.com/nvidia-turing-architecture-in-depth/
[^turing-tuning]: https://docs.nvidia.com/cuda/archive/10.1/pdf/Turing_Tuning_Guide.pdf
[^thrust]: https://thrust.github.io/
[^cub]: https://nvlabs.github.io/cub/index.html
[^gpu-libraries]: https://developer.nvidia.com/gpu-accelerated-libraries