I'm curious to hear if anyone has successful strategies for debugging shader... - GraphicsProgramming

julienbarnoin, 2 months ago

I'm curious to hear if anyone has successful strategies for debugging shader compile time issues (specifically #GLSL in #Vulkan in my case but still interested in hearing about others).

I've got this shader that takes over a minute to compile. There are various things I could do to prevent loop unrolling etc., but I still have to use trial and error to find the right place.
Do you have any better strategy than changing random places and seeing what makes a difference?

#gamedev #programming
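
One systematic alternative to changing random places is to bisect on compile time. The sketch below is only an illustration under stated assumptions: `slowest_chunk` and `measure` are hypothetical names, the shader is assumed to split into chunks that can be stubbed out independently, and a single chunk is assumed to dominate the cost. `measure` stands for whatever builds the shader with only the given chunks enabled and returns the compile time in seconds.

```python
from typing import Callable, List, Sequence


def slowest_chunk(chunks: Sequence[str],
                  measure: Callable[[List[str]], float]) -> str:
    """Narrow down the chunk that dominates compile time by repeated halving.

    `measure` is a hypothetical callback: it compiles the shader with only
    the given chunks enabled and returns the elapsed seconds.
    """
    if not chunks:
        raise ValueError("need at least one chunk")
    candidates = list(chunks)
    while len(candidates) > 1:
        mid = len(candidates) // 2
        left, right = candidates[:mid], candidates[mid:]
        # Keep whichever half is slower when compiled on its own.
        candidates = left if measure(left) >= measure(right) else right
    return candidates[0]
```

With n chunks this needs about 2*log2(n) compiles. It will mislead if the slowdown only appears when two chunks interact, so it is a starting point rather than a replacement for profiling the compiler.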

solidangle, 2 months ago

@julienbarnoin hey @aras would your flamegraph stuff be available in any clang-based shader compiler like dxc (maybe with a little work to expose it?)

https://aras-p.info/blog/2019/01/16/time-trace-timeline-flame-chart-profiler-for-Clang/

aras, 2 months ago

@solidangle @julienbarnoin you mean what @beanz already did and shipped in DXC in late 2022? :) https://github.com/microsoft/DirectXShaderCompiler/releases/tag/v1.7.2212

julienbarnoin, 2 months ago

@aras @solidangle @beanz
Nice, thanks !
Well, this doesn't actually help for my specific issue which is glsl rather than hlsl, however definitely an interesting data point and giving me more insight.
I'm curious if there are ways of obtaining similar profiling data for Mesa's SPIR-V to NIR compiler.

crzwdjk, 2 months ago

@julienbarnoin @aras @solidangle @beanz You can just use a regular profiling tool on Mesa to see which pass is taking a long time. There is also NIR_PRINT=1 (or something like that) to print out the NIR after each pass in a debug build, to see if the code size blows up for some reason.

julienbarnoin, 2 months ago

@crzwdjk @aras @solidangle @beanz @NOTimothyLottes
Thanks for the suggestions all !
So I haven't been able to profile the compiler yet due to life things, but from some trial and error checks, it's quite clear that I'm always plagued with the same issue: I need to stop passing large structs as inout function parameters in GLSL just to change a few members.

This is not a compiler-specific issue: in my experience, this pattern takes a long time to optimize in several compilers.

julienbarnoin, 2 months ago

@crzwdjk @aras @solidangle @beanz @NOTimothyLottes I'm not sure if this is a known GLSL trap, but I'm pretty sure I can build a simple pathological example that demonstrates it. I have a real-world code example that now takes over 10 minutes compiling.
After optimizing the result is alright I think, it all gets eliminated, but there's just so much code copying things around before optimizing that it takes forever to optimize out.
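
A pathological example of this kind could be generated rather than written by hand. The sketch below is an assumption, not the shader discussed in the thread: `make_shader`, `Big`, and `touch` are hypothetical names, and whether such a simple shape reproduces the slowdown in any given compiler is untested.

```python
def make_shader(num_members: int, num_calls: int) -> str:
    """Return GLSL compute-shader source that passes a large struct as inout.

    All names in the generated shader (Big, touch, Result) are made up for
    this illustration.
    """
    members = "\n".join(f"    vec4 m{i};" for i in range(num_members))
    init = "\n".join(f"    s.m{i} = vec4(0.0);" for i in range(num_members))
    calls = "\n".join("    touch(s);" for _ in range(num_calls))
    return f"""#version 450
layout(local_size_x = 1) in;
layout(std430, binding = 0) buffer Result {{ vec4 result; }};

struct Big {{
{members}
}};

// Only m0 changes, but inout copies the whole struct in and back out.
void touch(inout Big s) {{
    s.m0 += vec4(1.0);
}}

void main() {{
    Big s;
{init}
{calls}
    result = s.m0;
}}
"""
```

Writing `make_shader(64, n)` to a file for increasing `n` and timing each compile gives a series that can be checked for super-linear growth. Deeper call nesting may be needed before the effect shows.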

julienbarnoin, 2 months ago

@crzwdjk @aras @solidangle @beanz @NOTimothyLottes
There's some related discussion at https://github.com/KhronosGroup/GLSL/issues/84, though the original complaint there was from a performance perspective. I might submit a pathological example that kills the Vulkan SDK's SPIR-V optimizer (and others) in terms of compile time, to demonstrate why inout is just not an adequate replacement for pass-by-reference for some uses.

crzwdjk, 2 months ago

@julienbarnoin @aras @solidangle @beanz @NOTimothyLottes Ah yeah that would do it. Both GLSL and HLSL semantics say that inout function parameters are copied in at the beginning and out at the end. So yeah, passing it around as an inout is going to be messy, especially if the optimizer does things like break the structure into a bunch of scalars, then add code to copy each one. But I think SPIR-V does have the ability to pass by reference, and GLSL should expose it.
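
To put rough numbers on the copy-in/copy-out description above, here is a back-of-the-envelope model. It assumes the struct is split into one scalar copy per member, as described, and that nothing has been optimized away yet; the function names are hypothetical and the figures are counts, not measurements.

```python
def copies_inout(members: int, calls: int) -> int:
    """Member copies emitted before optimization under copy-in/copy-out."""
    # Every call copies each member in at the start and out at the end.
    return 2 * members * calls


def writes_by_reference(members_touched: int, calls: int) -> int:
    """Member writes if the callee could modify the caller's struct directly."""
    return members_touched * calls
```

For a 64-member struct passed through 100 calls that each change one member, this gives 12800 copies for the optimizer to eliminate, against 100 writes that do real work.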

julienbarnoin, 2 months ago

@crzwdjk @aras @solidangle @beanz @NOTimothyLottes It would be great if it did ! I know you mentioned making changes to GLSL a couple times, do you mind elaborating on your relationship with GLSL? Are you an external contributor or do you officially work on GLSL?

crzwdjk, 2 months ago

@julienbarnoin @aras @solidangle @beanz @NOTimothyLottes I officially work on glslang, I am not sure anyone is really working on GLSL itself that much, but someone really should.

beanz, 2 months ago

@crzwdjk @julienbarnoin @aras @solidangle @beanz @NOTimothyLottes come work on HLSL. We have SFINAE.

julienbarnoin, 2 months ago

@beanz @crzwdjk @aras @solidangle @NOTimothyLottes
Stop tempting me with your features... It would be so nice... But... I want independent standards ! Standards !!

julienbarnoin, 2 months ago

@beanz @crzwdjk @aras @solidangle @NOTimothyLottes Seriously if Microsoft gave control of HLSL to Khronos there'd be an instant adoption boost I'm sure :P

beanz, 2 months ago

@julienbarnoin @beanz @crzwdjk @aras @solidangle @NOTimothyLottes I’m not sure I think that is a good idea… I think language design needs a different approach than Khronos can really offer.

crzwdjk, 2 months ago

@beanz @julienbarnoin @aras @solidangle @NOTimothyLottes From the GLSL point of view, I kind of agree. Khronos is ultimately a vendor consortium and nobody there is too motivated to move the language itself forward, because vendors mostly want to show off their cool new features which can mostly be bolted onto the side. It has to come from users (really from engines) who aren't well represented there.

beanz, 2 months ago

@crzwdjk @beanz @julienbarnoin @aras @solidangle @NOTimothyLottes I should also note that I do think MS needs to be more involved with Khronos and in a collaborative way. I just also think that HLSL is a mess of a language because it has become this messy amalgamation of disjointed ideas and concepts. We need a few years (or a decade) of polish and refinement to make it a sane cohesive language.
I welcome collaboration from outside MS to accomplish that goal.

NOTimothyLottes, 2 months ago

@beanz @crzwdjk @julienbarnoin @aras @solidangle The important transition is the one to having real pointers: making HLSL on par with what is possible on the hardware in general purpose compute. Branch tables, function pointers, branch to address in register, etc.

You'd get a revolution of algorithm design.

beanz, 2 months ago

@NOTimothyLottes @beanz @crzwdjk @julienbarnoin @aras @solidangle ignoring the technical debt in the compilers, data pointers aren’t terribly hard. Function pointers are hard to do efficiently because you can’t spill your full register file to a stack, and divergent function calls could be the biggest foot-gun in GPU programming.

NOTimothyLottes, 2 months ago

@beanz @crzwdjk @julienbarnoin @aras @solidangle

Solve that and the shader permutation problem in one go:

(1.) global union of structures for virtual register allocation

(2.) functions with no arguments or returns (instead they use (1.))

(3.) shared separately compiled modules

Then shaders that can either branch or copy-paste inline stuff from (3.).

More on divergence next...

dneto, 2 months ago

@NOTimothyLottes @beanz @crzwdjk @julienbarnoin @aras @solidangle

COMMON blocks.
You've just reinvented Fortran COMMON blocks with (1.) and (2.)

NOTimothyLottes, 2 months ago

@dneto @beanz @crzwdjk @julienbarnoin @aras @solidangle Sure common blocks, or the C64 zeropage. It's not a new thing. It's how we did fast stuff back in the day.

NOTimothyLottes, 2 months ago

@beanz @crzwdjk @julienbarnoin @aras @solidangle

Would be nice to support using a wave as a single thread of execution with an explicit lane mask which is only conditionally applied when required. No actual divergence. It makes everything a lot easier when wave ops can always use all lanes if necessary, and when it's always ok to readlane() on lane 0.

This doesn't exclude also building in a complex divergence setup, but have capacity for both.

dneto, 2 months ago

@NOTimothyLottes @beanz @crzwdjk @julienbarnoin @aras @solidangle

Tongue in cheek, but you're giving me flashbacks to the explicit predication using "indicators" (a.k.a. flags) in the Calculation lines in RPG II. You could have up to three predicates apply to guard a line of code or output.

beanz, 2 months ago

@crzwdjk @julienbarnoin @aras @solidangle @beanz @NOTimothyLottes DXIL can also represent references, but it is difficult to expose in HLSL due to tech debt. References will probably end up being a feature of the first HLSL version that is exclusive to Clang.
Feature Proposal: https://github.com/microsoft/hlsl-specs/blob/main/proposals/0006-reference-types.md
Tech debt (for context): https://github.com/microsoft/DirectXShaderCompiler/pull/5249

dneto, 2 months ago

@beanz @crzwdjk @julienbarnoin @aras @solidangle @NOTimothyLottes

WGSL can pass pointers as params. But it also adds no-aliasing rules so that:

it's safe to map to inout in backend languages HLSL or GLSL, and

it correctly maps to pointer args in MSL and SPIR-V.

The legalese is here: https://www.w3.org/TR/WGSL/#aliasing
But that doesn't explain the implementation strategy.

And Chrome 123 made pointer params a lot more flexible. See
https://developer.chrome.com/blog/new-in-webgpu-123?hl=en#unrestricted_pointer_parameters_in_wgsl
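
Why a no-aliasing rule makes the two mappings agree can be shown with a toy model. This is not WGSL or any compiler's implementation: one-element Python lists stand in for storage, and `call_copy_in_copy_out` is a hypothetical helper that emulates inout by copying arguments in and back out.

```python
import copy


def bump_both(a: list, b: list) -> None:
    """Increment the value stored behind each parameter."""
    a[0] += 1
    b[0] += 1


def call_copy_in_copy_out(func, *args: list) -> None:
    """Emulate inout: copy the arguments in, run, then copy the results back."""
    temps = [copy.copy(arg) for arg in args]
    func(*temps)
    for arg, temp in zip(args, temps):
        arg[:] = temp


x = [0]
bump_both(x, x)                      # by reference: both names alias one cell
y = [0]
call_copy_in_copy_out(bump_both, y, y)  # inout: each parameter gets a copy
print(x[0], y[0])                    # prints "2 1"
```

When the same variable is passed twice, the two calling conventions disagree. Forbidding such aliasing removes the cases where they differ, which is what lets one source language target both.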

dneto, 2 months ago

@beanz @crzwdjk @julienbarnoin @aras @solidangle @NOTimothyLottes

Heh. I should have pointed at https://google.github.io/tour-of-wgsl/types/pointers/

(Beware the "tour" is not finished.)

NOTimothyLottes, 2 months ago

@julienbarnoin @aras @solidangle @beanz Are you getting killed in the IHV compiler(s) or in glslang? I've found that if I don't run spirv-opt.exe for size, the IHV compiler chokes.

If I was to guess, you are going to need workarounds like translating loops from compile-time known loop counts to fake dynamic (load something the compiler doesn't know is dynamically static).
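
One way to answer the "which stage" question for the offline tools is to time each one separately. The sketch below assumes `glslangValidator` and `spirv-opt` are on PATH; the helper names and the output file names are made up for illustration. It does not cover the driver (IHV) compile, which happens at pipeline creation and has to be timed in the application.

```python
import subprocess
import time
from typing import Dict, List, Sequence


def time_command(cmd: Sequence[str]) -> float:
    """Run one command to completion and return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.perf_counter() - start


def stage_commands(src: str) -> Dict[str, List[str]]:
    """Command lines for the two offline stages (output names are assumed)."""
    spv, opt = src + ".spv", src + ".opt.spv"
    return {
        # GLSL to SPIR-V; the shader stage is inferred from the file extension.
        "glslang": ["glslangValidator", "-V", src, "-o", spv],
        # Size-oriented optimization recipe, as mentioned above.
        "spirv-opt": ["spirv-opt", "-Os", spv, "-o", opt],
    }
```

Calling `time_command(cmd)` on each entry of `stage_commands("shader.comp")` in order shows whether the time goes to the front end or to the optimizer.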

dneto, 2 months ago

@NOTimothyLottes @julienbarnoin @aras @solidangle @beanz

We've seen bad big-Oh algorithms in FXC and driver compilers.

spirv-opt algorithms were originally written for simplicity, but not with obviously wasteful algorithms. If you see a quadratic or worse blowup, do file a bug.

The very first thing spirv-opt does (for size or for speed; the recipes are almost the same) is to inline everything; passing large structs around might be bad in any case.

beanz, 2 months ago

@dneto @NOTimothyLottes @julienbarnoin @aras @solidangle @beanz this is effectively what DXC does too. I’ve got a post-it note on my monitor about trying a slightly different pass order in Clang to see if we can meaningfully simplify the code in functions before inlining.
DXC has an old LLVM 3.x bug where inlining can sometimes spend an insane amount of time propagating memory aliasing annotations. This can cause enormous compile time explosions (observed on the order of 10x).

julienbarnoin, 2 months ago

@dneto @NOTimothyLottes @aras @solidangle @beanz
Still have to look at this closer and make an example I can share, but just from a quick test of copy-pasting of function calls in my shader and measuring the time spirv-opt takes, I see this, which seems a pretty tight match for quadratic time.

(This is not actually measuring the spirv-opt program but calling into the shaderc library with shaderc_optimization_level_performance, I assume this ends up being pretty much the same internally?)
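
A quick check for "looks quadratic" is the slope of log(time) against log(number of calls): a slope near 2 suggests quadratic growth, near 1 linear. The sketch below is a plain least-squares fit with a hypothetical function name; it needs at least two distinct sizes, and the data in the usage note is synthetic, not the measurements from this post.

```python
import math
from typing import Sequence, Tuple


def growth_exponent(samples: Sequence[Tuple[float, float]]) -> float:
    """Least-squares slope of log(time) versus log(size).

    Each sample is (size, seconds); both values must be positive.
    """
    xs = [math.log(size) for size, _ in samples]
    ys = [math.log(seconds) for _, seconds in samples]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den
```

For example, `growth_exponent([(n, 0.5 * n * n) for n in (1, 2, 4, 8, 16)])` comes out at 2 up to rounding. Fixed startup cost pulls the slope down at small sizes, so larger sizes give a cleaner estimate.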
