This lecture focuses on the following topics:
✅ Optimized Matrix Multiplication
✅ Shared Memory Techniques for CUDA
✅ Implementing Shared Memory Optimization
✅ Translating Python to CUDA and Performance Considerations
✅ Numba: Bringing Python and CUDA Together
#HPC#CUDA#OpenCL#LAPACK
If you had to do a lot of linear least square solves, with potentially rank-deficient matrices, what would you use on a GPU? On CPUs, LAPACK's DGELSY does work, but most GPU libraries seem to not implement routines for rank-deficient matrices.
A new crash course for getting started with #CUDA with #Python by Jeremy Howard 🚀. CUDA is NVIDIA's programming model for parallel computing on GPUs. CUDE is being used by tools such as #PyTorch#tensorflow and other #deeplearning and LLMs frameworks to speed up calculations. The course covers the following topics:
✅ Setting up CUDA
✅ CUDA foundation
✅ Working with Kernel
✅ CUDA with PyTorch
@VileLasagna Has a blog post on the relative speed of different #GPU compute frameworks on the same hardware and driver.
Tl;dr: on an #Nvidia card, with Nvidia drivers, #CUDA is the slowest, by far. Fastest is our old stalwart #OpenCL - almost twice as fast when used only for compute. #Vulcan is good, and the least affected by using the card for your desktop at the same time. Read it - it's good.
I've been facing many issues with using #Poetry (#pythonpoetry) with my #Python based #objectdetection project. I love Poetry for publishing packages, but think that #conda would be better since I have to deal with #CUDA and whatnot. Anyone familiar with a way to use pyproject.toml for publishing and building packages, even if Poetry isn't being used for dependency management?
So I bought a fancy #AMD graphics card because I didn‘t want to support the #Nvidia#CUDA hegemony. I also had high hopes for their supposedly more open drivers.
I am not sure if this was a great idea, because while it‘s been super good for my kids and their games, it‘s been a steep uphill climb (both ways) to get #ROCm and #HIP to do anything.
And the core bits are distributed as these precompiled packages that only work on a handful of specific versions of Linux distributions.
OK so I'm ready for today's #GPGPU lesson with the new laptop. My only gripe for the lesson will be that #Rusticl in #Mesa 23.2 doesn't support #profiling information. Apparently the feature was merged at a later commit https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24101
and I even tried upgrading to my distro's experimental 23.3-rc1 packages, but trying to use rusticl on those packages segfaults. So either I've messed up something with this mixed upgrade, or I've hit an actual bug.
I'm still moderately annoyed by the fact that there's no single #OpenCL platform to drive all computer devices on this machine. #PoCL comes close because it supports both the CPU and the #NVIDIA dGPU through #CUDA, but the not the #AMD iGPU (there's an #HSA device, but). #Rusticl supports the iGP (radeonsi) and the CPU (llvmpipe), but not the dGPU (partly because I'm running that on proprietary drivers for CUDA). Everything else has at best one supported device out of three available.
What an amazing talk by @airlied on the state of vendors, compute and community feedback. Please take the 45 minutes to watch - worth every minute! https://youtu.be/HzzLY5TdnZo
After one more year of intensive work and numerous test runs, a new major update for https://github.com/chrxh/alien is finally polished and ready. It offers possibilities I had only dreamed of before. 🪐
Anyway, as I mentioned recently, I have a new workstation that finally allows me to test our code using all three backends (#CUDA, #ROCm/#HIP and #CPU w/ #OpeMP) thanks to having an #AMD#Ryzen processor with an integrated #GPU in addition to a discrete #NVIDIA GPU.
Of course the iGPU is massively underpowered compared to the high-end dGPU workhorse, but I would expect it to outperform the CPU on most workloads.
And this is where things get interesting.
So, one of the reasons why we could implement the #HIP backend easily in #GPUSPH is that #AMD provides #ROCm drop-in replacement for much of the #NVIDIA#CUDA libraries, including #rocThrust, which (as I mentioned in the other thread) is a fork of #Thrust with a #HIP/#ROCm backend.
This is good as it reduces porting effort, but it also means you have to trust the quality of the provided implementation.
Turns out, the #AMD#HIP ecosystem is less mature than the #NVIDIA #CUDA one it emulates (unsurprising, giving how much more recent it is), and has obviously been tested much less in more exotic hardware configurations and with the wide variety of software and developers the CUDA libraries have had interactions with.
In the few days in which I've had the opportunity to play with it, I've already discovered two issues with it: