Anyway, as I mentioned recently, I have a new workstation that finally allows me to test our code using all three backends (#CUDA, #ROCm/#HIP and #CPU w/ #OpenMP) thanks to having an #AMD #Ryzen processor with an integrated #GPU in addition to a discrete #NVIDIA GPU.
Of course the iGPU is massively underpowered compared to the high-end dGPU workhorse, but I would expect it to outperform the CPU on most workloads.
And this is where things get interesting.
So, one of the reasons why we could implement the #HIP backend easily in #GPUSPH is that #AMD provides drop-in #ROCm replacements for much of the #NVIDIA #CUDA libraries, including #rocThrust, which (as I mentioned in the other thread) is a fork of #Thrust with a #HIP/#ROCm backend.
This is good as it reduces porting effort, but it also means you have to trust the quality of the provided implementation.
I've run into two issues with it so far: one is a build failure against my GPU (already reported, with a fix ready and pending release), and the other is … slow performance in one of the #Thrust API calls that we use!
Turns out, sort_by_key, at least in the way we use it, is somewhere between 25% and 50% slower on my #AMD iGPU when using the latest #rocThrust (from the 5.6.0 software stack) than it is on the CPU when using the latest #Thrust with the OpenMP backend!
The other one was #Thrust producing completely bogus results: https://github.com/NVIDIA/thrust/issues/1341
Interestingly, in both cases the issue wasn't with Thrust proper, but with complex interactions between our usage of the Thrust API, the compiler and/or the driver and the hardware.
Now, I don't know where the performance issues I'm seeing in #rocThrust are coming from, but I'm sure they'll get fixed soon.