Yes, this is a reference to our experience with the one single library we have a hard dependency on, #Thrust, which isn't even “a random library on GitHub” but an official #NVIDIA thing, or its #AMD counterpart:
Ooof, trying to build a piece of software that depends on the #Thrust library is painful on a system that has both the #NVIDIA and the #AMD version, with one of them on the default include path and the other not.
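One way to at least detect which Thrust you ended up with (a minimal sketch, relying only on the version macros that <thrust/version.h> actually ships) is to just print them:

#include <thrust/version.h>
#include <cstdio>

int main() {
    // THRUST_VERSION = major*100000 + minor*100 + subminor
    std::printf("Thrust %d.%d.%d\n",
        THRUST_MAJOR_VERSION, THRUST_MINOR_VERSION, THRUST_SUBMINOR_VERSION);
}

Build this once with each toolchain and you immediately see whether the include paths are resolving to the Thrust you think they are.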
Anyway, as I mentioned recently, I have a new workstation that finally allows me to test our code using all three backends (#CUDA, #ROCm/#HIP and #CPU w/ #OpenMP), thanks to having an #AMD #Ryzen processor with an integrated #GPU in addition to a discrete #NVIDIA GPU.
Of course the iGPU is massively underpowered compared to the high-end dGPU workhorse, but I would expect it to outperform the CPU on most workloads.
And this is where things get interesting.
There's another important difference between the CA and #GPUSPH that I haven't mentioned yet. All of the GPU code in the CA is “custom”: compute kernels written by yours truly. In GPUSPH, instead, there are a few places where we rely on an external library: #Thrust.
I've already complained about how this affects us <https://fediscience.org/@giuseppebilotta/110283708975056091>, especially in terms of backend support, but things are even worse, so I'll take the opportunity here to complain some more!
So, one of the reasons why we could implement the #HIP backend easily in #GPUSPH is that #AMD provides drop-in #ROCm replacements for much of the #NVIDIA #CUDA library ecosystem, including #rocThrust, which (as I mentioned in the other thread) is a fork of #Thrust with a #HIP/#ROCm backend.
This is good as it reduces porting effort, but it also means you have to trust the quality of the provided implementation.
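To be concrete about what “drop-in” means here, a minimal sketch (not actual GPUSPH code): the same Thrust source builds unchanged with nvcc against NVIDIA's Thrust and with hipcc against #rocThrust, since rocThrust keeps the thrust:: namespace and header layout.

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> h(1000, 1.0f);
    thrust::device_vector<float> d(h.begin(), h.end());
    // runs on the CUDA backend under nvcc, on the HIP backend under hipcc
    float sum = thrust::reduce(d.begin(), d.end());
    std::printf("%g\n", sum);
}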
Case in point, we've already hit two issues with #rocThrust: one is a build failure against my GPU (already reported, with a fix ready and pending release), and the other is … slow performance in one of the #Thrust API calls that we use!
Turns out that sort_by_key, at least in the way we use it, is somewhere between 25% and 50% slower on my #AMD iGPU using the latest #rocThrust (from the 5.6.0 software stack) than it is on the CPU using the latest #Thrust with the OpenMP backend!
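For context, a sketch of the kind of call involved (not our actual benchmark, and the size here is hypothetical): the very same source runs on the CPU when compiled with -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP plus the OpenMP flags, and on the iGPU when compiled with hipcc against rocThrust.

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <cstdlib>
#include <vector>

int main() {
    const size_t n = 1 << 20; // placeholder size, not our actual one
    std::vector<unsigned> k(n), v(n);
    for (size_t i = 0; i < n; ++i) { k[i] = std::rand(); v[i] = i; }
    thrust::device_vector<unsigned> keys(k.begin(), k.end());
    thrust::device_vector<unsigned> vals(v.begin(), v.end());
    // the call whose performance differs so much between backends:
    // sorts keys and applies the same permutation to vals
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());
}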
FWIW, even the “original” #NVIDIA #Thrust has been a continuous source of issues for us, so much so that I have a GitHub repository just to collect test cases for my bug reports to upstream.
The most notable issues?
Certain versions of Thrust with certain versions of the NVIDIA drivers led to GTX TITAN X (Maxwell) GPUs stalling or deadlocking after thousands of iterations. https://github.com/NVIDIA/thrust/issues/742
You can imagine how fun this was for our PhD student, whose workstation had that hardware.
The other one was #Thrust producing completely bogus results: https://github.com/NVIDIA/thrust/issues/1341
Interestingly, in both cases the issue wasn't with Thrust proper, but with complex interactions between our usage of the Thrust API, the compiler and/or the driver and the hardware.
Now, I don't know where the performance issues I'm seeing in #rocThrust are coming from, but I'm sure they'll get fixed soon.
Corporate #FLOSS at its worst: #NVIDIA controls the #Thrust library and its #CUDA, #OpenMP and #TBB backends. #AMD provides #rocThrust, which is just Thrust with the CUDA part stripped out and a new backend for #ROCm/#HIP. Nobody* is working on a backend for #SYCL. #Intel provides its own #oneAPI alternative in the form of #oneDPL, which is NOT a drop-in replacement.
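To illustrate the “NOT a drop-in replacement” part, a hedged sketch of a device sort in #oneDPL (based on the oneDPL documentation, not on an actual GPUSPH port, and assuming a SYCL toolchain such as icpx): the spelling is standard-algorithms-plus-execution-policy rather than the thrust:: API, so moving Thrust code over is a rewrite, not a recompile.

#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/iterator>
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    std::vector<int> v{3, 1, 2};
    // SYCL buffers + oneDPL iterators instead of thrust::device_vector,
    // an execution policy instead of a backend chosen at compile time
    sycl::buffer<int> buf(v.data(), sycl::range<1>(v.size()));
    oneapi::dpl::sort(oneapi::dpl::execution::dpcpp_default,
                      oneapi::dpl::begin(buf), oneapi::dpl::end(buf));
}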