@NOTimothyLottes@mastodon.gamedev.place

NOTimothyLottes

if(!burning(GPU)) try harder;

aras, to Playdate
@aras@mastodon.gamedev.place

Short blog post about "Everybody Wants to Crank the World", a #playdate #demoscene demo I made recently https://aras-p.info/blog/2024/05/20/Crank-the-World-Playdate-demo/

Calculating only some pixels both spatially and temporally is pretty much the same as DLSS, right? :P

NOTimothyLottes,

@aras Fixed spatial blue noise works great; it would have been interesting to see whether spatiotemporal blue noise adds any value.

NOTimothyLottes,

@demofox @aras Even before going temporal, there is pre-sharpening before quantization and spatial error diffusion to try. And temporally, there is that interesting question of making it blue in time with fixed position, or blue in time with reprojected position (which is seriously hard, but would be fun to try on a GPU). Aras, it's inspirational to see blue-noise 1-bit/pixel on that device; I had been hoping others would do that instead of fixed dither patterns :)
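
For readers unfamiliar with the two spatial steps named above, here is a minimal sketch in plain Python. It is not code from the demo or from this thread: the 4-neighbor unsharp mask, the `amount` parameter, and the choice of Floyd-Steinberg weights are all assumptions picked for illustration.

```python
def sharpen(img, amount=0.5):
    """Pre-sharpen: push each pixel away from its 4-neighbor average.

    img is a list of rows of floats in [0, 1]; edges are clamped.
    The kernel and `amount` are illustrative assumptions.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            n = (img[max(y - 1, 0)][x] + img[min(y + 1, h - 1)][x] +
                 img[y][max(x - 1, 0)] + img[y][min(x + 1, w - 1)]) / 4.0
            out[y][x] = img[y][x] + amount * (img[y][x] - n)
    return out


def error_diffuse_1bit(img):
    """Quantize to 0/1 with Floyd-Steinberg error diffusion (one possible kernel)."""
    h, w = len(img), len(img[0])
    work = [row[:] for row in img]          # copy so the input is not modified
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            old = work[y][x]
            new = 1 if old >= 0.5 else 0
            out[y][x] = new
            err = old - new                 # push the quantization error forward
            if x + 1 < w:
                work[y][x + 1] += err * 7 / 16
            if y + 1 < h:
                if x > 0:
                    work[y + 1][x - 1] += err * 3 / 16
                work[y + 1][x] += err * 5 / 16
                if x + 1 < w:
                    work[y + 1][x + 1] += err * 1 / 16
    return out


if __name__ == "__main__":
    gradient = [[x / 15 for x in range(16)] for _ in range(8)]
    for row in error_diffuse_1bit(sharpen(gradient)):
        print("".join("#" if b else "." for b in row))
```

Sharpening first compensates for some of the softening that dithering introduces; whether that helps on a given 1-bit display is something to judge by eye.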

NOTimothyLottes, to random

It is not the amount of transistors, but rather what you do with them that counts ... C64 it on a real CRT: https://www.youtube.com/watch?v=bcUmVdd_t2s

dotstdy, to random
@dotstdy@mastodon.social

How does GPU memory actually work in an APU situation? There's a dedicated carve-out, but what specifically is the purpose of that? Can the GPU address all of the system memory in practice? Is the split between "device memory" and "host memory" in UMA devices on PC just there because otherwise titles / perhaps drivers / the OS would flip out about memory if you told the truth?

NOTimothyLottes,

@dotstdy The GPU part of an APU needs larger pages to avoid the small-page TLB perf tax. The carve-out enables that.
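
To put rough numbers on the TLB point, here is a small arithmetic sketch. The page sizes (4 KiB and 2 MiB) and the 64-entry TLB are assumptions chosen for illustration; they are not taken from the post or from any specific AMD part.

```python
KiB, MiB, GiB = 1024, 1024 ** 2, 1024 ** 3


def pages_needed(region_bytes, page_bytes):
    """Number of pages (and so page-table entries) needed to map a region."""
    return -(-region_bytes // page_bytes)   # ceiling division


def tlb_reach(entries, page_bytes):
    """Bytes addressable without a TLB miss for a given entry count."""
    return entries * page_bytes


if __name__ == "__main__":
    # Hypothetical sizes, for illustration only.
    print(pages_needed(1 * GiB, 4 * KiB))    # 262144 small pages
    print(pages_needed(1 * GiB, 2 * MiB))    # 512 large pages
    print(tlb_reach(64, 4 * KiB) // KiB)     # 256 (KiB of reach)
    print(tlb_reach(64, 2 * MiB) // MiB)     # 128 (MiB of reach)
```

With the same number of TLB entries, larger pages cover far more memory, which is the "perf tax" being avoided.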

NOTimothyLottes,

@dotstdy That, and CPU-side page allocation is quite expensive. For example, with the AMD kernel driver on the Steam Deck, if you try to allocate and use really large regions of memory, it can stall for as long as a minute in kernel land. It looks like a hang, but actually isn't.

NOTimothyLottes, to random

Have a pair of what I think are American Racer snakes living in my yard. Yard is either rock, or pine needles with bamboo, or ferns. In 2023 I accidentally found their prior home under the pine straw. Not sure where they sleep now. I'm surprised they get enough to eat on 1/3 acre of my unusual (for FL) residential land, and that my 4-year-olds didn't scare them away.

image/png

NOTimothyLottes,

@bitinn Could be, but I don't usually see them out in the rocks like in the photos. Got lucky the light was right on a window so they couldn't see us watching from the inside of the house.

NOTimothyLottes, to random

If you enjoyed the visuals in the Dune 2 movie and want to try it out at a grand scale, at 230m tall, the Great Sand Dunes (CO) is a worthy stop. Just be careful: the nearby towns' entire GDP comes from speeding tickets, and they will stoop as low as ticketing you for speeding during a lane-crossing pass.

NOTimothyLottes,

@castano Enjoyed the drive to the dunes (and valley with moving rocks) in Death Valley. But the CO Dunes definitely feel monstrous in comparison. April is a good time to go for CO Dunes: cool air, easier hike, no snow. There is an off-roading section by CO Dunes too, but it requires airing down, and I didn't bring a pump this time.

NOTimothyLottes, to random

Temporarily back from vacation, have a MakVision/Wei-YA M3129DS-LG new-old-stock arcade CRT sitting around ... was going to sell it, but after hooking it up to take a photo of its 'poor linearity' (because it's the really short LG flat screen tube), realized it's actually not that bad, so will keep this one for a future cab.

Old curved CRTs are way better, but these end-of-stock Wei-YA's have new tubes ...

castano, to random
@castano@mastodon.gamedev.place

Sierra Buttes, fire lookout.
Spring snow, race against the sun.
Hard work, in solitude.

NOTimothyLottes,

@castano Looks like a lot of fun. A week ago we tried hiking the easy paths out of Bear Lake in Rocky Mountain, but with no gear (no cleats + flat shoes) and a pair of 4-year-olds. Had to turn back at some point due to safety concerns. But it felt easy in comparison to hiking the Great Sand Dunes later. Already miss the snow ... we were nearly acclimatized, and now we are already back in the flatlands.

NOTimothyLottes, to random

Thankfully HLSL defined readlane to work with non-compile-time constants. So even though Vulkan says the lane index must be a 'compile time constant', that isn't actually required in practice (see the associated AMD disassembly, which reads a lane using an index that comes from a scalar load). For the sake of shader compatibility, no PC driver could get away with anything other than HLSL-compatible behavior.

image/png
image/png

aras, to random
@aras@mastodon.gamedev.place

Jpegli: new JPG encoder library that achieves same visual quality with smaller file sizes. https://opensource.googleblog.com/2024/04/introducing-jpegli-new-jpeg-coding-library.html

NOTimothyLottes,

@aras "same visual quality" without publishing any examples on that weblink

NOTimothyLottes,

@BartWronski @aras Yes, certainly custom quant tables and encoding changes can show a benefit; I was just hoping to see it first hand instead of seeing a graph :)

dougbinks, to random
@dougbinks@mastodon.gamedev.place

Annoying that in Vulkan compute I can't easily get hold of the optimal minimum number of invocations of a shader to run.

Subgroup functionality gives us the SIMD lane width with subgroupSize (see https://www.khronos.org/blog/vulkan-subgroup-tutorial by @neilhenning ), but as far as I am aware there's no easy way to get the number of subgroups which can run simultaneously.

For many applications this doesn't matter, but in some cases it's really useful to know.

NOTimothyLottes,

@dougbinks @neilhenning If you don't do mobile, it's easy:
(1.) AMD = 64 <yes RDNA can do 32, but 64 is better for various reasons like RDNA3's dual issue>
(2.) NV = 32
(3.) Intel = 16 <yes it can do 8 and 32, but 16 is the optimal one for memory access>

That roughly translates, though, into branching on subgroup size and then relying on the compiler's dead-code removal and optimization to remove the branch.
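
The vendor table above can be written down as a small host-side lookup. This is a sketch only: the function names, the path labels, and the 256-invocation workgroup are hypothetical, and the real branch would live in shader code where the compiler can fold it away once the subgroup size is known.

```python
# Preferred subgroup sizes as listed in the post above.
PREFERRED_SUBGROUP_SIZE = {"AMD": 64, "NV": 32, "Intel": 16}


def select_path(subgroup_size):
    """Pick a code path per subgroup size (path names are hypothetical)."""
    if subgroup_size == 64:
        return "wave64"
    if subgroup_size == 32:
        return "wave32"
    if subgroup_size == 16:
        return "wave16"
    return "group_shared_fallback"   # anything unexpected avoids wave ops


def plan(vendor, workgroup_size=256):
    """Derive per-workgroup numbers from the preferred subgroup size."""
    size = PREFERRED_SUBGROUP_SIZE[vendor]
    return {
        "subgroup_size": size,
        "path": select_path(size),
        "subgroups_per_workgroup": workgroup_size // size,
        # log2(size) butterfly steps for a full subgroup reduction
        "shuffle_steps": size.bit_length() - 1,
    }


if __name__ == "__main__":
    for vendor in PREFERRED_SUBGROUP_SIZE:
        print(vendor, plan(vendor))
```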

NOTimothyLottes,

@dougbinks @neilhenning If mobile vendors are in the picture for PC (and only thinking about chipsets that can do compute)

(4.) Qualcomm = 128 <because they need that in compute to get dual issue 16-bit, but it comes with a problem of smaller register limits per invocation>

...

But I'd place a warning here: I think in STP all the mobile platforms ended up using the group-shared fallback because of either correctness or perf bugs with wave ops when using shader source filtered through DXC.

NOTimothyLottes,

@dougbinks @neilhenning Note STP packs and aliases dual component 16-bit as a uint, and does multi-component uint wave ops, so that is where the dragons sleep ... DXC depends on vendor compilers to pattern match much of that, which they don't do.
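
As an illustration of what "packs and aliases dual component 16-bit as a uint" means at the bit level, here is a host-side sketch using Python's `struct` module. It is not STP's code; the helper names and the choice of putting the first component in the low 16 bits are assumptions.

```python
import struct


def pack_half2(a, b):
    """Pack two floats as IEEE fp16 into one 32-bit unsigned int.

    Assumed layout: `a` in the low 16 bits, `b` in the high 16 bits.
    """
    return struct.unpack("<I", struct.pack("<2e", a, b))[0]


def unpack_half2(u):
    """Reverse of pack_half2: reinterpret a uint32 as two fp16 values."""
    return struct.unpack("<2e", struct.pack("<I", u))


if __name__ == "__main__":
    u = pack_half2(1.0, 2.0)
    print(hex(u))            # 0x40003c00
    print(unpack_half2(u))   # (1.0, 2.0)
```

A wave op applied to the packed `uint` moves both halves at once, which is why the compiler has to recognize the pattern for the 16-bit path to stay efficient.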

NOTimothyLottes, to random

Is TDR in Windows no longer a thing? When I screw up the forward progress on compute workgroups dispatched in a graphics queue on AMD (meaning accidental infinite runtime, probably mixed with bad memory behavior) it just takes the entire machine down, so debug is one test per reboot.

Wasn't it the myth of stability that MS used to force the WDDM middle man? Would rather have the old style direct IHV drivers instead.

NOTimothyLottes,

@aeva Yeah BSOD is super rare for me, but complete loss of display BlackScreens[OD] is common on AMD. Once upon a time NVIDIA also had the Red[SOD] which didn't have any info beyond the color being red.

NOTimothyLottes, to random

What is shader execution reordering?

NOTimothyLottes,

@aras Oh yeah, that thing. In all irony, I wrote a version of that in CUDA over a decade ago when I worked at NVIDIA. It took that long before the benefit/cost curve finally got >1, simply due to how bad divergent shading actually is in practice today + vector cache sizes getting larger. It's there to soften the impact with the ground when you fall off the perf cliff, but it cannot undo the perf fatality.

NOTimothyLottes, to random

Thread on kernel overlap. Mixing something latency bound with something full ALU bound.

It's only a 7.4% win to overlap these. That's more than good enough to bother, but why so low?

The latency-bound task is launched first; its runtime is only 6% longer during overlap.

However, the ALU-bound task grows by 56% in total runtime due to the overlap, and its ALU boundness drops from 97% to 65%.

The theory: the latency-bound task is eating too many waves, leaving not enough VALU work to schedule.

image/png
image/png
image/png

NOTimothyLottes,

So let's prove it: transform the latency-bound task to run 1/16 the waves with 16x the work each (basic unrolling).

Now it's a 38% perf win. It's almost like that latency bound work just disappeared completely.

The ALU bound task is now only 3% slower.

Key insight here: when pipelining GPU work, it's important first to order (group) workloads so that they complement each other, and second (the big one) to rate limit the non-ALU-bound stuff!!!!!

image/png

NOTimothyLottes,

Drivers often (or even always) don't rate limit stuff. It's obvious if you see VALU drop when some memory bound overlap just eats all the waves and parks them doing nothing due to latency!

So the only option you have to make it right is to do the rate limiting yourself via dispatch size and unrolling.
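
A sketch of that bookkeeping, written as host-side Python rather than shader code. The function names, the wave size of 64, and the wave-strided indexing are assumptions for illustration; the thread does not show the actual kernel.

```python
def waves_needed(total_items, wave_size, unroll):
    """Waves to dispatch when each lane handles `unroll` items."""
    per_wave = wave_size * unroll
    return (total_items + per_wave - 1) // per_wave   # ceiling division


def lane_items(wave, lane, total_items, wave_size, unroll):
    """Item indices one lane processes in its unrolled loop.

    Consecutive lanes touch consecutive items in each unrolled step
    (stride of wave_size between steps); this layout is an assumption.
    """
    base = wave * wave_size * unroll
    return [i for i in (base + k * wave_size + lane for k in range(unroll))
            if i < total_items]


if __name__ == "__main__":
    total = 1 << 20
    print(waves_needed(total, 64, 1))    # 16384 waves, one item per lane
    print(waves_needed(total, 64, 16))   # 1024 waves, 16 items per lane
```

The total work is unchanged; only the number of waves competing for scheduler slots drops by the unroll factor, which is the rate limit.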

NOTimothyLottes,

This, BTW, is a real-world case: it's my engine, and I'm working on the GPU-side groups-of-points scene graph management. This latency-bound task is the one that rebuilds group index lists for {visible, and recycled} based on a group visibility bitarray.
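
For readers who want a concrete picture of that kind of task, here is a serial CPU sketch. It is a guess at the shape of the problem, not the engine's code: the 32-bit word layout and, in particular, the reading of "recycled" as "groups whose visibility bit is clear" are assumptions.

```python
def rebuild_lists(visibility_words, group_count):
    """Split group indices into two lists based on a packed bitarray.

    visibility_words: list of 32-bit ints, bit g%32 of word g//32 is group g.
    Assumption: set bit = visible, clear bit = recycled.
    """
    visible, recycled = [], []
    for g in range(group_count):
        bit = (visibility_words[g >> 5] >> (g & 31)) & 1
        (visible if bit else recycled).append(g)
    return visible, recycled


if __name__ == "__main__":
    print(rebuild_lists([0b1011], 6))   # ([0, 1, 3], [2, 4, 5])
```

On a GPU the appends would typically become a compaction (for example prefix sums or atomics), which is where the memory latency comes from.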

No workgraph snake oil here, only raw perf, raw portability, and something workgraphs could never do :)
