NOTimothyLottes

@NOTimothyLottes@mastodon.gamedev.place

if(!burning(GPU)) try harder;


NOTimothyLottes, 1 day ago to random

https://blog.danielschroeder.me/2024/05/voxel-displacement-modernizing-retro-3d/


aras, 4 days ago to Playdate

Short blog post about "Everybody Wants to Crank the World", a #playdate #demoscene demo I made recently https://aras-p.info/blog/2024/05/20/Crank-the-World-Playdate-demo/

Calculating only some pixels both spatially and temporally is pretty much the same as DLSS, right? :P


NOTimothyLottes, 4 days ago

@aras Fixed spatial blue noise works great; it would have been interesting to see whether spatiotemporal blue noise adds any value or not.


NOTimothyLottes, 4 days ago

@demofox @aras Even before getting temporal: pre-sharpening before quantization, and using spatial error diffusion. And temporally, there is that interesting question of making it blue in time with fixed position, or blue in time with reprojected position (which is seriously hard, but would be fun to try on a GPU). Aras, it's inspirational to see blue 1-bit/pixel on that device, had been hoping others would have done that instead of fixed dither patterns :)
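As a rough illustration of the "spatial error diffusion" idea for 1-bit output, here is a minimal sketch using the classic Floyd-Steinberg weights. That specific kernel is an assumption picked for the example; the post does not say which diffusion scheme is meant, and the function name `error_diffuse_1bit` is hypothetical.

```python
def error_diffuse_1bit(gray):
    """Quantize a 2D list of floats in [0, 1] to 0/1 using Floyd-Steinberg error diffusion."""
    h = len(gray)
    w = len(gray[0])
    work = [row[:] for row in gray]      # working copy that accumulates diffused error
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            old = work[y][x]
            new = 1 if old >= 0.5 else 0  # 1-bit quantization
            out[y][x] = new
            err = old - new               # error pushed to not-yet-visited neighbors
            # Floyd-Steinberg weights: 7/16 right, 3/16 down-left, 5/16 down, 1/16 down-right.
            if x + 1 < w:
                work[y][x + 1] += err * 7 / 16
            if y + 1 < h:
                if x > 0:
                    work[y + 1][x - 1] += err * 3 / 16
                work[y + 1][x] += err * 5 / 16
                if x + 1 < w:
                    work[y + 1][x + 1] += err * 1 / 16
    return out
```

The scan order makes this inherently serial, which is one reason a fixed blue noise threshold mask is attractive on small devices and on GPUs.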


NOTimothyLottes, 13 days ago to random

It is not the amount of transistors, but rather what you do with them that counts ... C64 it on a real CRT: https://www.youtube.com/watch?v=bcUmVdd_t2s


dotstdy, 23 days ago to random

How does GPU memory actually work in an APU situation? There's a dedicated carve-out, but what specifically is the purpose of that? Can the GPU address all of the system memory in practice? Is the split between "device memory" and "host memory" in UMA devices on PC just there because otherwise titles / perhaps drivers / the OS would flip out about memory if you told the truth?


NOTimothyLottes, 23 days ago

@dotstdy APU GPU part needs larger pages to avoid the small page TLB perf tax. The carve out enables that.
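A back-of-the-envelope sketch of why page size matters for TLB pressure: the memory a TLB can cover without a miss is roughly entries times page size. The entry count (64) and the page sizes (4 KiB and 2 MiB) below are hypothetical values for illustration only, not figures for any particular AMD part.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

def tlb_reach_bytes(entries, page_size_bytes):
    """Memory covered by a fully populated TLB (hypothetical entry count)."""
    return entries * page_size_bytes

def pages_needed(buffer_bytes, page_size_bytes):
    """Number of pages (and so page-table entries) needed to map a buffer."""
    return -(-buffer_bytes // page_size_bytes)  # ceiling division

# Assumed example: a 64-entry TLB and a 1 GiB buffer.
small_reach = tlb_reach_bytes(64, 4 * KIB)   # 256 KiB
large_reach = tlb_reach_bytes(64, 2 * MIB)   # 128 MiB
small_pages = pages_needed(1 * GIB, 4 * KIB) # 262144 pages
large_pages = pages_needed(1 * GIB, 2 * MIB) # 512 pages
```

With the same number of entries, larger pages cover far more memory, so streaming access patterns miss the TLB far less often.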


NOTimothyLottes, 23 days ago

@dotstdy That and page allocation CPU side is quite expensive. For example on the AMD kernel driver on Steam Deck, if you try to allocate and use really large regions of memory, it can even stall for a minute in kernel land, looks like a hang, but actually isn't.


NOTimothyLottes, 23 days ago to random

Have a pair of what I think are American Racer snakes living in my yard. Yard is either rock, or pine needles with bamboo, or ferns. In 2023 I accidentally found their prior home under the pine straw. Not sure where they sleep now. I'm surprised they get enough to eat on 1/3 acre of my unusual (for FL) residential land, and that my 4-year-olds didn't scare them away.

image/png


NOTimothyLottes, 23 days ago

@bitinn Could be, but I don't usually see them out in the rocks like in the photos. Got lucky the light was right on a window so they couldn't see us watching from the inside of the house.


NOTimothyLottes, 1 month ago to random

If you enjoyed the visuals in the Dune 2 movie and want to try it out at a grand scale, at 230m tall, the Great Sand Dunes (CO) is a worthy stop. Just be careful, the nearby towns' entire GDP comes from speeding tickets, and they will go as low as to ticket you for speeding during a lane crossing pass.


NOTimothyLottes, 1 month ago

@castano Enjoyed the drive to the dunes (and valley with moving rocks) in Death Valley. But the CO Dunes definitely feel monstrous in comparison. April is a good time to go for CO Dunes, cool air, easier hike, no snow. There is an off-roading section by CO Dunes too, but it requires airing down, and I didn't bring a pump this time.


NOTimothyLottes, 1 month ago to random

Temp back from vacation, have a MakVision/Wei-YA M3129DS-LG new-old-stock arcade CRT sitting around ... was going to sell it, but after hooking it up to take a photo of its 'poor linearity' (because it's the really short LG flat screen tube), realized it's actually not that bad, so will keep this one for a future cab.

Old curved CRTs are way better, but these end-of-stock Wei-YA's have new tubes ...


castano, 1 month ago to random

Sierra Buttes, fire lookout.
Spring snow, race against the sun.
Hard work, in solitude.


NOTimothyLottes, 1 month ago

@castano Looks like a lot of fun. A week ago tried hiking the easy paths out of Bear Lake in Rocky Mountain, but with no gear (no cleats + flat shoes) and a pair of 4-year-olds. Had to turn back at some point due to safety concerns. But felt easy in comparison to hiking the Great Sand Dunes later. Already miss the snow ... near acclimatization, we are already back in the flat lands.


NOTimothyLottes, 1 month ago to random

Thankfully HLSL defined readlane to work with non-compile-time constants. So even though Vulkan says {'compile time constant' for lane index} it isn't actually required (see associated AMD disassembly, reading a lane index based on a scalar load). For the sake of shader compatibility with HLSL no PC driver could get away with anything other than HLSL compatibility.

image/png
image/png


aras, 1 month ago to random

Jpegli: new JPG encoder library that achieves same visual quality with smaller file sizes. https://opensource.googleblog.com/2024/04/introducing-jpegli-new-jpeg-coding-library.html


NOTimothyLottes, 1 month ago

@aras "same visual quality" without publishing any examples on that weblink


NOTimothyLottes, 1 month ago

@BartWronski @aras Yes, certainly custom quant tables and encoding changes can show benefit; was just hoping to see it first hand instead of seeing a graph :)


dougbinks, 1 month ago to random

Annoying that in Vulkan compute I can't easily get hold of the optimal minimum number of invocations of a shader to run.

Subgroup functionality gives us the SIMD lane width with subgroupSize (see https://www.khronos.org/blog/vulkan-subgroup-tutorial by @neilhenning ), but as far as I am aware there's no easy way to get the number of subgroups which can run simultaneously.

For many applications this doesn't matter, but in some cases it's really useful to know.


NOTimothyLottes, 1 month ago

@dougbinks @neilhenning If you don't do mobile, it's easy,
(1.) AMD = 64 <yes RDNA can do 32, but 64 is better for various reasons like RDNA3's dual issue>
(2.) NV = 32
(3.) Intel = 16 <yes it can do 8 and 32, but 16 is the optimal one for memory access>

That roughly translates though into doing branching based on subgroup size and then depending on the compiler's dead code removal and optimization to remove the branch.
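One way to read the list above on the host side is a small lookup keyed on the PCI vendor ID that Vulkan reports in `VkPhysicalDeviceProperties::vendorID`. This is only a sketch: the table and the function name `preferred_subgroup_size` are hypothetical, and actually forcing a given size on a pipeline would additionally need something like `VK_EXT_subgroup_size_control`, which is not shown here.

```python
# PCI vendor IDs as they appear in VkPhysicalDeviceProperties::vendorID.
VENDOR_AMD = 0x1002
VENDOR_NVIDIA = 0x10DE
VENDOR_INTEL = 0x8086

# Preferred sizes taken from the list in the post (desktop vendors only).
PREFERRED_SUBGROUP_SIZE = {
    VENDOR_AMD: 64,
    VENDOR_NVIDIA: 32,
    VENDOR_INTEL: 16,
}

def preferred_subgroup_size(vendor_id, reported_subgroup_size):
    """Return the preferred size for known vendors, else fall back to what the driver reports."""
    return PREFERRED_SUBGROUP_SIZE.get(vendor_id, reported_subgroup_size)
```

Inside the shader, the same decision would show up as a branch on the subgroup size that the compiler is expected to fold away.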


NOTimothyLottes, 1 month ago

@dougbinks @neilhenning If mobile vendors are in the picture for PC (and only thinking about chipsets that can do compute)

(4.) Qualcomm = 128 <because they need that in compute to get dual issue 16-bit, but it comes with a problem of smaller register limits per invocation>

...

But I'd place a warning here, that I think in STP all the mobile platforms ended up using group share fallback because of either correctness or perf bugs with wave ops when using shader source filtered through DXC.


NOTimothyLottes, 1 month ago

@dougbinks @neilhenning Note STP packs and aliases dual component 16-bit as a uint, and does multi-component uint wave ops, so that is where the dragons sleep ... DXC depends on vendor compilers to pattern match much of that, which they don't do.
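To make "dual component 16-bit aliased as a uint" concrete, here is a CPU-side sketch that packs two half floats into one 32-bit unsigned value and back, using Python's `struct` module (format code `e` is IEEE 754 binary16). Which component lands in the low 16 bits is an assumption made for the example, and the helper names are hypothetical; this is not STP's actual layout.

```python
import struct

def pack_half2(lo, hi):
    """Pack two values as binary16 into one uint32 (assumed: `lo` in the low 16 bits)."""
    (packed,) = struct.unpack("<I", struct.pack("<2e", lo, hi))
    return packed

def unpack_half2(packed):
    """Inverse of pack_half2: reinterpret a uint32 as two binary16 values."""
    return struct.unpack("<2e", struct.pack("<I", packed))

packed = pack_half2(1.0, 2.0)   # 1.0 is 0x3C00 and 2.0 is 0x4000 in binary16
```

A wave op on the packed uint moves both components at once, which is the pattern the vendor compilers would have to recognize.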


NOTimothyLottes, 2 months ago to random

Is TDR in Windows no longer a thing? When I screw up the forward progress on compute workgroups dispatched in a graphics queue on AMD (meaning accidental infinite runtime probably mixed with bad memory behavior) it just takes the entire machine down, so debug is one test per reboot.

Wasn't it the myth of stability that MS used to force the WDDM middle man? Would rather have the old style direct IHV drivers instead.


NOTimothyLottes, 2 months ago

@aeva Yeah BSOD is super rare for me, but complete loss of display BlackScreens[OD] is common on AMD. Once upon a time NVIDIA also had the Red[SOD] which didn't have any info beyond the color being red.


NOTimothyLottes, 2 months ago to random

What is shader execution reordering?


NOTimothyLottes, 2 months ago

@aras Oh yeah that thing. In all irony, I wrote a version of that in CUDA over a decade ago when I worked at NVIDIA. It took that long before the benefit/cost curve finally got >1 simply due to how bad divergent shading actually is in practice today + vector cache sizes getting larger. It's there to soften the impact with the ground when you fall off the perf cliff, but it cannot undo the perf fatality.


NOTimothyLottes, 2 months ago to random

Thread on kernel overlap. Mixing something latency bound with something full ALU bound.

It's only a 7.4% win to overlap these. That's more than good enough to bother, but why so low?

The latency bound task is launched first, its runtime is only 6% longer during overlap.

However the ALU bound task grows by 56% in total runtime due to the overlap. But its ALU boundness drops from 97% to 65%.

In theory, the latency bound task is eating too many waves, so there is not enough VALU work left to schedule.

image/png
image/png
image/png


NOTimothyLottes, 2 months ago

So let's prove it: transform the latency bound task to run 1/16 the waves with 16x the work (basic unrolling).

Now it's a 38% perf win. It's almost like that latency bound work just disappeared completely.

The ALU bound task is now only 3% slower.

Key insight here, when pipelining GPU work, it's important to order (group) such that workloads complement each other, and second (the big one) it's important to rate limit the non-ALU bound stuff!!!!!

image/png


NOTimothyLottes, 2 months ago

Drivers often (or even always) don't rate limit stuff. It's obvious if you see VALU drop when some memory bound overlap just eats all the waves and parks them doing nothing due to latency!

So the only option you have to make it right is to do the rate limiting yourself via dispatch size and unrolling.
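A small sketch of what "rate limiting via dispatch size and unrolling" can look like in terms of indexing: dispatch fewer workgroups and have each invocation loop over several items. The names `group_count`, `items_for_invocation`, `group_size` and `unroll` are hypothetical, and the strided layout is just one reasonable choice, not necessarily what the engine in this thread does.

```python
def group_count(total_items, group_size, unroll):
    """Workgroups to dispatch when each invocation handles `unroll` items."""
    per_group = group_size * unroll
    return -(-total_items // per_group)  # ceiling division

def items_for_invocation(group_id, local_id, group_size, unroll, total_items):
    """Item indices one invocation processes (stride of group_size keeps accesses coalesced)."""
    base = group_id * group_size * unroll + local_id
    candidates = (base + k * group_size for k in range(unroll))
    return [i for i in candidates if i < total_items]

# Assumed example: 1000 items, 64 invocations per group.
groups_plain = group_count(1000, 64, 1)      # 16 groups, one item per invocation
groups_unrolled = group_count(1000, 64, 16)  # 1 group, up to 16 items per invocation
```

With `unroll = 16` the same work occupies one sixteenth of the workgroups, leaving wave slots free for the ALU bound task.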


NOTimothyLottes, 2 months ago

This BTW is a real world case, it's my engine, I'm working on the GPU-side groups of points scene graph management. This latency bound task is the one that rebuilds group index lists for {visible, and recycled} based on a group visibility bitarray.

No workgraph snake oil here, only raw perf, raw portability, and something workgraphs could never do :)
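For readers unfamiliar with the pattern, rebuilding index lists from a visibility bitarray is a form of stream compaction. The serial sketch below only shows the data flow. It assumes, purely for illustration, that a set bit means "visible" and a cleared bit means the group goes on the "recycled" list; the post does not define those semantics, and a GPU version would typically use prefix sums or wave ops instead of a loop.

```python
def rebuild_index_lists(visibility_words, group_count):
    """Split group indices into (visible, recycled) from 32-bit bitarray words.

    Assumption: bit set -> visible, bit clear -> recycled.
    """
    visible, recycled = [], []
    for g in range(group_count):
        word = visibility_words[g >> 5]   # 32 groups per word
        bit = (word >> (g & 31)) & 1
        if bit:
            visible.append(g)
        else:
            recycled.append(g)
    return visible, recycled
```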
