[**A Simpler Introduction To CUDA**](20.04.11-cuda.html)
This article aims to provide you with best practices for getting started with CUDA C++
programming, so you can write clean, efficient code without breaking a sweat.
We'll first go through what CUDA C++ actually is, then detail what a modern
CUDA architecture looks like, and finish with an overview of the CUDA ecosystem
and the tools you should use.
## What is CUDA C++ in the first place?
CUDA C++ is described by Nvidia as a superset of C++, which means you should
be able to run any C++ program on your GPU, and more specifically that the
**memory model** implemented by CUDA C++ is compliant with **standard C++**.
While that is *mostly* true, language compatibility still depends on which
compiler you're using. Also, only "device" functions can run in your CUDA
code, while "host" functions can only run on the CPU.
"global" functions are the entry points for GPU computation: they are invoked
from the CPU and run on the GPU. Two launch parameters must be given in
addition to their regular parameters: the dimension of the thread block grid, and
the dimension of a thread block.
```C++
my_func<<< 16, 256 >>>(); // Invokes 16 blocks of 256 threads running my_func
```
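To make the qualifiers more concrete, here is a minimal sketch; the function
names and parameters (`square`, `square_all`, `log_progress`, `data`, `n`) are
made up for illustration:

```C++
__device__ float square(float x) { return x * x; }  // callable from GPU code only

__host__ void log_progress();                       // regular function, CPU only

__global__ void square_all(float* data, int n) {    // entry point: invoked from the
  int i = blockIdx.x * blockDim.x + threadIdx.x;    // CPU, runs on the GPU
  if (i < n)
    data[i] = square(data[i]);
}

// square_all<<< 16, 256 >>>(d_data, n); // 16 blocks of 256 threads each
```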
All threads within a block share the same shared memory/cache.
For several reasons, including this one, thread block sizing can have a **very
high impact** on performance. This, together with the bug-inducing hassle of
juggling two levels of sizing, is why you should stay away from invoking kernels
yourself. The good news is that Nvidia provides great tools that take
care of those matters for you, and we'll cover them later. But first, let's do a quick
overview of a CUDA GPU architecture.
## A short briefing on CUDA GPU architecture
CUDA GPU architectures can vary quite a lot, but they all revolve around the
same idea: lots of powerful SIMD processing units. CUDA
threads are grouped into "warps" that run simultaneously, meaning that all the
threads within the same warp execute the same instruction at the
same time. When threads diverge on a condition, they can't all run
simultaneously anymore: the two branches have to be executed
consecutively, and only then do the two groups of threads resume simultaneous
execution. This is something you can achieve with SIMD processing on CPUs by
masking your registers.
However, GPUs improve branching performance by grouping several warps within
larger units called Streaming Multiprocessors, in which threads can be
rescheduled. That way, when two groups of threads diverge, they can be
redispatched so that the two groups run in different warps that
can execute simultaneously, thus mitigating the performance loss induced by
branching.
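As an illustration, here is a hypothetical kernel where threads of the same
warp take different branches, forcing the warp to execute both branches one
after the other:

```C++
__global__ void branchy(float* data) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (threadIdx.x % 2 == 0)   // even and odd threads of the same warp diverge:
    data[i] *= 2.f;           // the warp runs this branch with odd threads masked,
  else
    data[i] += 1.f;           // then this one with even threads masked
}
```

Branching on a condition that is uniform across a warp (e.g. on
`threadIdx.x / 32`) avoids this serialization.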
In the Turing architecture[^turing], streaming multiprocessors pack four
processing blocks (i.e. SIMD cores) that can each execute 32 threads per clock. These
four blocks share 96 KB of L1 cache/shared memory (which you can configure to
trade cache for shared memory, or the other way around, to fit your needs).
Because Turing SMs have four processing blocks that each execute up to 32 threads at a
time, you should use block sizes that are multiples
of 32, while making sure you don't saturate the L1 cache or shared memory by
putting too many threads in a block...
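As a rough sketch (256 is just a common starting point, not a tuned choice;
`my_func` reuses the kernel from the earlier snippet), picking a block size
that is a multiple of 32 and deriving the grid size from the problem size
looks like this:

```C++
int n = 1 << 20;                                    // number of elements to process
int block_size = 256;                               // a multiple of 32
int grid_size = (n + block_size - 1) / block_size;  // enough blocks to cover all n elements
// my_func<<< grid_size, block_size >>>();
```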
I think you get it: tuning kernels for GPUs is **difficult**. Nvidia has some
great documentation about how to do it[^turing-tuning], but doing it for every
architecture ends up being an incredibly time-consuming process, and I haven't
even mentioned more specific things such as thread synchronization, shuffle
instructions, or simply how to correctly address memory in order to maximize
data locality.
What you must remember, however, is that GPUs have **massive** memory bandwidth
(900 GB/s on Volta, while even the best CPUs won't go past 100 GB/s), and if your
application can benefit from that, then you should probably go for it. Even a mildly
tuned GPU application can easily outperform its well-optimized CPU counterpart,
as long as **all** the code of your application gets to run on the
GPU. Otherwise its performance will be limited by the bandwidth of the
CPU/GPU interconnect (i.e. PCIe).
The good news is that you don't have to optimize GPU code yourself:
Nvidia already made that effort and packed efficient routines into libraries
like Thrust, CUB, and cuBLAS.
## The CUDA tooling ecosystem
Now we'll have a closer look at the tools and libraries you can use to get the
most out of your GPUs, without spending too much time reinventing the wheel.
### Libraries
Libraries are the key to well-performing CUDA C++ programs across
generations of GPU architectures. They gather all the optimization tricks
you'll need and make them easily reusable at no runtime cost, thanks to C++'s
zero-cost abstraction principle.
The first library you should think about when writing CUDA code is
Thrust[^thrust]. It is designed as a drop-in replacement for the STL, but with
CUDA compatibility, a few extra algorithms, and very high-performance
implementations.
Its [documentation](https://thrust.github.io/doc/modules.html) is pretty
straightforward, and any code you implement using Thrust can run on both CPU
and GPU. If your problem can be solved using Thrust's primitives, just go for
it. It ships with the CUDA package, so you can think of it as the CUDA
standard library.
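For instance, here is a minimal sketch of a sum of squares computed entirely
on the GPU with Thrust (the sizes and values are arbitrary):

```C++
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>

int main() {
  thrust::device_vector<float> x(1024, 2.f);  // vector living in GPU memory
  // Square each element in place, then sum everything up, all on the GPU.
  thrust::transform(x.begin(), x.end(), x.begin(), thrust::square<float>());
  float sum = thrust::reduce(x.begin(), x.end(), 0.f, thrust::plus<float>());
  return sum > 0.f ? 0 : 1;
}
```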
If your implementation requires more specialized algorithmic skeletons, however,
make sure you have a look at CUB[^cub]. It is designed to provide
primitives at every scale: device-wide (invokable from the CPU), block-wide
(to abstract things that require shared memory, for example), and warp-wide
(all the way down to CUDA assembly, aka PTX, intrinsics).
Similarly to Thrust, CUB's primitives are optimized for all CUDA GPU
architectures, but they offer finer-grained control for optimization at every
scale, and they are specific to GPUs (as in, not CPU-compatible).
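As an example, the usual pattern for CUB's device-wide primitives is to call
them twice: once to query the temporary storage they need, then again to do
the actual work. A minimal sketch for a sum (the `device_sum` wrapper and
pointer names are made up):

```C++
#include <cuda_runtime.h>
#include <cub/cub.cuh>

// Sum `num_items` floats already living in GPU memory into *d_out.
void device_sum(const float* d_in, float* d_out, int num_items) {
  void*  d_temp_storage     = nullptr;
  size_t temp_storage_bytes = 0;
  // First call: only computes the required temporary storage size.
  cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  // Second call: performs the actual reduction.
  cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
  cudaFree(d_temp_storage);
}
```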
And finally, for domain-specific tasks, Nvidia has you covered
again[^gpu-libraries], from linear algebra with
[cuBLAS](https://developer.nvidia.com/cublas) all the way to signal processing
with [cuFFT](https://docs.nvidia.com/cuda/cufft/index.html), and many others.
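For example, a single-precision matrix product with cuBLAS boils down to
creating a handle and calling `cublasSgemm`; a sketch assuming column-major
matrices already in GPU memory (the `sgemm` wrapper and pointer names are made
up):

```C++
#include <cublas_v2.h>

// C = A * B, with A (m x k), B (k x n) and C (m x n) stored column-major
// in GPU memory (d_A, d_B, d_C are device pointers).
void sgemm(const float* d_A, const float* d_B, float* d_C, int m, int n, int k) {
  cublasHandle_t handle;
  cublasCreate(&handle);
  const float alpha = 1.f, beta = 0.f;
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
              &alpha, d_A, m, d_B, k, &beta, d_C, m);
  cublasDestroy(handle);
}
```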
### Compilers
Compilers, however, are a grey area of CUDA development. The only two compilers
that support CUDA C++ are `nvcc` and `clang`. `nvcc` is the only one that's
officially supported by Nvidia, but don't be fooled: `clang` is a **much
better** option.
`nvcc`'s C++ frontend is full of shortcomings, and it's not hard to find its limits
when you try to compile regular C++ code: parts of the STL simply do *not* compile
with `nvcc`, and libraries that rely heavily on template resolution
mechanisms, such as Blaze, will probably never compile with it. `nvcc`
doesn't even support C++17, and even its C++14 support is **very** fragile. It also lacks
the error reporting you find in regular C++ compilers, making your code hard to
debug... So overall, `nvcc` is not a great tool. It will mostly get in your
way, forcing you to split your project into separate C++ and CUDA C++ parts
that get linked together, which makes templates unusable across the C++/CUDA C++ boundary.
Basically, you'll have to rearrange your whole project just to work around `nvcc`'s
poor C++ support.
On the other hand, `clang` has bulletproof support for C++, and for CUDA C++ as
well. All the CUDA libraries compile with `clang`, and while it lags
behind `nvcc` when it comes to the latest supported CUDA version, the gap isn't even
*that* big, and its support for modern C++ standards is a whole different
story.
Even the CLI is better with `clang`: it supports the same flags as `gcc`. This
is very important and underrated, because it means that all you have to
do is add a `-x cuda` flag to compile `.cpp` files with CUDA C++
support. You can look at my
[project templates](https://github.com/JPenuchot/project-templates) if you want
to know more about how to compile CUDA code using `clang` with both `make` and
`cmake`.
And finally, `clang` is and will always remain open source. Its CUDA support is a
mainline feature: Google backs it and will keep supporting it in the future.
### General advice
- Profiling: NSight
- Do everything on the GPU rather than offloading pieces of work to it (PCIe bus, etc.)
- Forget `nvcc`
- In Thrust we Trust
- In CUB we cub
### For more specific cases
- Entry points: "global" functions: `void __global__ fun( ... ) { ... }`
  Invocation: `fun <<< grid_size, block_size >>> ( ... );`
  --> The CPU drives calls that operate on data living in GPU memory
  /!\ Only pass pointers, iterators, views, etc.
- Functions & lambdas marked `__host__ __device__` or `constexpr` can be called on the GPU
- `cudaMallocManaged()`: allocates unified memory (accessible from both CPU & GPU)
  --> No need for `cudaMemcpy()`: it is handled through paging on x86, and through an
  NVLink fabric on POWER systems (better performance, lower latency, etc.)
  --> Usable as the allocator behind a `std::vector`
Unrolling & memory addressing:
Memory layout: elements `0 1 2 3 4 5 6 7`; threads 1-4 each process two iterations.

| i \ thread | 1 | 2 | 3 | 4 |
|------------|---|---|---|---|
| 0          | 0 | 1 | 2 | 3 |
| 1          | 4 | 5 | 6 | 7 |

--> Contiguous addressing: OK

| i \ thread | 1 | 2 | 3 | 4 |
|------------|---|---|---|---|
| 0          | 0 | 2 | 4 | 6 |
| 1          | 1 | 3 | 5 | 7 |

--> Non-contiguous addressing: NOT OK, poor locality
(a sketch combining unified memory and contiguous addressing follows these notes)
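Putting the two notes above together, here is a minimal sketch (names and
sizes are made up) that allocates unified memory with `cudaMallocManaged()`
and uses a grid-stride loop so that consecutive threads always touch
consecutive elements:

```C++
#include <cuda_runtime.h>

__global__ void scale(float* data, float factor, int n) {
  int stride = blockDim.x * gridDim.x;
  // Grid-stride loop: on each iteration, a warp touches a contiguous
  // chunk of memory, which keeps addressing coalesced.
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
    data[i] *= factor;
}

int main() {
  int n = 1 << 20;
  float* data;
  cudaMallocManaged(&data, n * sizeof(float)); // unified memory: visible to CPU & GPU
  for (int i = 0; i < n; ++i) data[i] = 1.f;   // initialize from the CPU, no cudaMemcpy
  scale<<<256, 256>>>(data, 2.f, n);
  cudaDeviceSynchronize();                     // wait for the GPU before reading back
  cudaFree(data);
}
```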
## Conclusion
Use Clang!
- A fail-proof C++ compiler
- A simple, standard interface that is easy to integrate with Make or CMake
- No more separate compilation imposed by nvcc
Use Thrust!
- Simple & efficient
- Included in the CUDA package
Use cuBLAS!
- Interface similar to BLAS (diverges on a few points: handles, storage)
- Uses "secret" PTX and tensor cores to squeeze out extra performance
-------------------------------------------------------------------------------
[^cuda-intro]: https://devblogs.nvidia.com/even-easier-introduction-cuda/
[^shared-mem]: https://devblogs.nvidia.com/using-shared-memory-cuda-cc/
[^turing]: https://devblogs.nvidia.com/nvidia-turing-architecture-in-depth/
[^turing-tuning]: https://docs.nvidia.com/cuda/archive/10.1/pdf/Turing_Tuning_Guide.pdf
[^thrust]: https://thrust.github.io/
[^cub]: https://nvlabs.github.io/cub/index.html
[^gpu-libraries]: https://developer.nvidia.com/gpu-accelerated-libraries