I got a bunch of responses to this, some private. Thanks for those!
I also did some exploring myself, including a partial fusion of FidelityFX sort and Onesweep ported to WebGPU, with the warp-local multi-split adapted to use shared memory instead of subgroup (warp) operations. Details are in this Zulip thread: https://xi.zulipchat.com/#narrow/stream/197075-gpu/topic/Sorting.20revisited
This is a preliminary investigation, which shows that performant WebGPU sorting is likely feasible. I hope it gets that conversation started.
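
For anyone not familiar with the multi-split step mentioned above, here is a rough CPU-side sketch in Python of what it computes for one tile of keys and one radix digit. This is not the WGSL from my experiment; the names `multisplit_tile` and `radix_sort` are made up for illustration, and the `counts` list only stands in for a histogram that a real shader would keep in workgroup shared memory and update cooperatively.

```python
from itertools import accumulate

def multisplit_tile(keys, shift, bits=4):
    """Stable split of one tile of keys by the radix digit at `shift`.

    Hypothetical helper: a sequential model of the result, not of the
    parallel algorithm a GPU workgroup would run.
    """
    mask = (1 << bits) - 1
    digits = [(k >> shift) & mask for k in keys]

    # Stand-in for the shared-memory histogram: one counter per digit value.
    counts = [0] * (1 << bits)
    ranks = []
    for d in digits:
        # Rank = number of earlier keys in the tile with the same digit.
        ranks.append(counts[d])
        counts[d] += 1

    # Exclusive prefix sum of the histogram gives each digit's start offset.
    starts = [0] + list(accumulate(counts))[:-1]

    out = [None] * len(keys)
    for k, d, r in zip(keys, digits, ranks):
        out[starts[d] + r] = k
    return out, counts

def radix_sort(keys, key_bits=32, bits=4):
    """LSD radix sort built from repeated single-tile multi-splits."""
    for shift in range(0, key_bits, bits):
        keys, _ = multisplit_tile(keys, shift, bits)
    return keys
```

Calling `radix_sort([5, 3, 9, 1, 3])` treats the whole input as a single tile; a GPU version would also have to combine the per-tile histograms across workgroups, which this sketch leaves out.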