Proving that Immediate Mode GUIs aren’t significant battery hogs. Computers... - Random

forrestthewoods, 22 days ago

Proving that Immediate Mode GUIs aren’t significant battery hogs. Computers are really fast!

MacBook M1 Idle: 3.5 watts

Dear ImGui: 7.5 watts
ImPlot: 8.9 watts
EGUI: 8.2 watts
Rerun: 11.1 watts

Spotify: 5.8 watts
VSCode: 7.0 watts
YouTube: 11.5 watts
Facebook: 8.7 watts

Compiling: 50.0 watts

Full blog post: https://www.forrestthewoods.com/blog/proving-immediate-mode-guis-are-performant/

Doomed_Daniel, 21 days ago

@forrestthewoods
I think it would be interesting to have a "classic" retained mode GUI for comparison, like an application using Qt or similar - AFAIK Spotify and VS Code both use Chromium (and I assume Facebook is running in a browser as well)

TomF, 21 days ago

@Doomed_Daniel @forrestthewoods Makes me think you could write an interposer library that exposes a retained-mode GUI API but internally displays it using an immediate-mode GUI.

Rather pointless, except to enable direct apples-to-apples comparisons like this. I assume it would be slightly more power-hungry because you've fundamentally added code in the middle, but maybe by only a few %.
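
A minimal Python sketch of that interposer idea: long-lived retained widget objects that, every frame, replay themselves as immediate-mode calls and turn the immediate "was I clicked?" return value back into a callback. `ImmediateContext`, `Label`, `Button`, and `Window` are hypothetical stand-ins, not Dear ImGui's or any toolkit's real API, and clicks are simulated.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Set


class ImmediateContext:
    """Hypothetical immediate-mode backend (not Dear ImGui's API)."""

    def __init__(self) -> None:
        self.draw_list: List[str] = []  # commands emitted this frame
        self.clicked: Set[str] = set()  # simulated input: labels clicked this frame

    def begin_frame(self, clicked: Iterable[str] = ()) -> None:
        self.draw_list.clear()
        self.clicked = set(clicked)

    def text(self, label: str) -> None:
        self.draw_list.append(f"text:{label}")

    def button(self, label: str) -> bool:
        self.draw_list.append(f"button:{label}")
        return label in self.clicked  # immediate style: interaction is a return value


@dataclass
class Label:
    text: str

    def emit(self, ctx: ImmediateContext) -> None:
        ctx.text(self.text)


@dataclass
class Button:
    label: str
    on_click: Callable[[], None]

    def emit(self, ctx: ImmediateContext) -> None:
        if ctx.button(self.label):
            self.on_click()  # translate the return value into a retained-style callback


@dataclass
class Window:
    children: list = field(default_factory=list)

    def add(self, widget):
        self.children.append(widget)
        return widget

    def frame(self, ctx: ImmediateContext, clicked: Iterable[str] = ()) -> List[str]:
        # The retained tree is re-walked every frame and re-issued as immediate calls.
        ctx.begin_frame(clicked)
        for widget in self.children:
            widget.emit(ctx)
        return list(ctx.draw_list)
```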

Doomed_Daniel, 21 days ago

@TomF @forrestthewoods
Would probably be less work to implement a test application with the same functionality in ImGui and Qt (or similar), with the additional benefit that both versions could use code that's idiomatic for the respective platform (which would probably not be the case with a wrapper).

Of course still more work than one wants to do for a quick comparison :)

forrestthewoods, 21 days ago

@Doomed_Daniel @TomF if I did that people would just complain “test app isn’t reflective of real app”.

There genuinely is no pleasing people. And yet we live in a world of imperfect information where people have to make binary choices all the time!

pervognsen, 21 days ago (edited 21 days ago)

@forrestthewoods I've worked on applications with (mostly custom) immediate-mode UIs that needed specialized performance-oriented design for the UI aspects. And the solution was always indexing/query caching at the level of the application domain's data rather than UI-level caching. And this applies as much to power efficiency at the lower end as it does to peak performance at the higher end.
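
A minimal Python sketch of caching at the data-model level rather than the UI level, assuming the domain store bumps a version counter on every mutation: the UI can still rebuild every frame, but the expensive query only reruns when the underlying data changes. `EventStore` and `CachedQuery` are hypothetical names for illustration.

```python
from typing import Any, Callable, List, Optional


class EventStore:
    """Hypothetical application-domain store; bumps `version` on every mutation."""

    def __init__(self) -> None:
        self.events: List[int] = []
        self.version = 0

    def append(self, value: int) -> None:
        self.events.append(value)
        self.version += 1


class CachedQuery:
    """Memoizes compute(store); invalidated by the store's version, not by UI state."""

    def __init__(self, store: EventStore, compute: Callable[[EventStore], Any]) -> None:
        self.store = store
        self.compute = compute
        self.cached_version: Optional[int] = None
        self.value: Any = None
        self.recomputes = 0  # instrumentation for the example

    def get(self) -> Any:
        if self.cached_version != self.store.version:
            self.value = self.compute(self.store)
            self.cached_version = self.store.version
            self.recomputes += 1
        return self.value
```

Calling `get()` from the per-frame UI code costs one integer comparison when nothing changed.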

dotstdy, 21 days ago

@pervognsen @forrestthewoods This is also a good description of most game UI performance issues I would guess. :') We have a bunch of kind of structural performance issues that limit the maximum speed, but in terms of "why is this slow" it's mostly just "we're querying N game systems every frame and then filtering the results manually to decide whether to show a button or not".

forrestthewoods, 21 days ago

@pervognsen would love to hear more specifics about this. I’m a sucker for war stories.

pervognsen, 21 days ago (edited 21 days ago)

@forrestthewoods In the case of the way query LOD worked in Telemetry 2, the basic idea was pretty similar to how you'd approach hierarchical LOD in a game renderer--you'd like to only do O(visible pixels) work if possible, so you both need to have fast data structures for culling to the extents of the viewport/frustum but you also need some kind of hierarchical LOD or mipmapping-like precomputation to avoid slowing down or "aliasing" when you zoom out too far.
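
The culling half of this can be sketched in Python with two binary searches, assuming intervals that don't overlap each other and are sorted by start (as with zones at a single nesting depth), so their ends are sorted too. The per-frame cost is then O(log n) plus the number of visible intervals; the LOD half is not covered here. `visible_range` is a hypothetical helper.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple


def visible_range(starts: List[float], ends: List[float],
                  t0: float, t1: float) -> Tuple[int, int]:
    """Half-open index range [lo, hi) of intervals overlapping the viewport [t0, t1].

    Assumes the intervals don't overlap and are sorted by start, so `ends` is sorted too.
    """
    lo = bisect_left(ends, t0)     # first interval ending at or after t0
    hi = bisect_right(starts, t1)  # one past the last interval starting at or before t1
    return lo, max(lo, hi)
```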

pervognsen, 21 days ago (edited 21 days ago)

@forrestthewoods Anyway, you do all this at the level of the data model, not the UI. In Telemetry's case you have time series data, everything from plots/events associated with instants/points in time to hierarchical zones, which are like timespans with attributes. So you index and LOD those (usually on the fly as you ingest the original time series data over the wire) and query against that each frame to build the UI on demand for the viewport.

pervognsen, 21 days ago (edited 21 days ago)

@forrestthewoods There's a bunch of "UX design choices" that go into how you choose to LOD things. For example, for plots I precomputed mipmap-like pyramidal summaries of min/max/sum so I could compute the min/max/avg over any span of time in O(log n) time. That way you can zoom out on a plot and still see a big spike in your game's particle count or whatever (assuming you log particle count each frame), even though one zoomed-out pixel might cover a million recorded game frames of data.
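
A minimal Python sketch of such a pyramid: level k stores (min, max, sum) for aligned blocks of 2^k samples, and a range query combines at most two nodes per level, so it touches O(log n) nodes. This uses the standard bottom-up segment-tree walk as one way to reach the stated bound; `PlotPyramid` is a hypothetical name, not Telemetry's code. Because max is kept, a single spike survives at any zoom level.

```python
import math
from typing import List, Tuple

Summary = Tuple[float, float, float]  # (min, max, sum)


def combine(a: Summary, b: Summary) -> Summary:
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2])


class PlotPyramid:
    """levels[k][i] summarizes samples[i * 2**k : (i + 1) * 2**k]."""

    def __init__(self, samples: List[float]) -> None:
        self.levels: List[List[Summary]] = [[(x, x, x) for x in samples]]
        while len(self.levels[-1]) > 1:
            prev = self.levels[-1]
            self.levels.append(
                [combine(prev[2 * j], prev[2 * j + 1]) for j in range(len(prev) // 2)])

    def query(self, lo: int, hi: int) -> Tuple[float, float, float]:
        """(min, max, avg) of samples[lo:hi], combining at most two nodes per level."""
        if not 0 <= lo < hi <= len(self.levels[0]):
            raise ValueError("empty or out-of-range span")
        count = hi - lo
        acc: Summary = (math.inf, -math.inf, 0.0)
        level = 0
        while lo < hi:
            row = self.levels[level]
            if lo & 1:  # lo is a right child: take it on its own and move past it
                acc = combine(acc, row[lo])
                lo += 1
            if hi & 1:  # hi - 1 is a left child whose sibling lies outside the span
                hi -= 1
                acc = combine(acc, row[hi])
            lo >>= 1
            hi >>= 1
            level += 1
        return acc[0], acc[1], acc[2] / count
```

The levels halve in size, so the whole pyramid takes roughly twice the memory of the raw samples.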

theWarhelm, 20 days ago

@pervognsen @forrestthewoods I might be misremembering but I think Tracy has a similar approach to rendering a large number of zones that cover a small number of pixels. There might be a blog about it as well

pervognsen, 20 days ago

@theWarhelm @forrestthewoods Yeah, Tracy currently does some kind of LOD for zones but I don't think it does it yet for other kinds of data like plots (I admit I haven't checked in a while). The TM2 work I cited is over 10 years old at this point and I didn't see any other profilers doing that style of large-scale data LOD at the time but nowadays there are a couple that do at least some of them.

pervognsen, 20 days ago

@theWarhelm @forrestthewoods There was also a nice write-up a few years ago by @trishume of an experiment he did: https://thume.ca/2021/03/14/iforests/

wolfpld, 15 days ago

@pervognsen @theWarhelm @forrestthewoods All Tracy does is check whether the zone is big enough to show on the screen and, if not, where the next zone large enough to be visible is. It is brute force, nothing too smart about it. Combine that with small data structures and parallel processing to overcome memory latency, and that's it.

https://github.com/wolfpld/tracy/blob/master/profiler/src/profiler/TracyTimelineItemThread.cpp#L415-L469
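
A rough Python sketch of that brute-force idea, assuming zones at one depth are sorted and non-overlapping: draw a zone if it is at least `min_px` wide, otherwise fold it together with the following too-small zones into one block, stopping at the first zone that is big enough or at a gap wide enough to see. `fold_zones` and the `min_px` default are assumptions for illustration; this does not reproduce the linked Tracy code.

```python
from typing import List, Tuple

Zone = Tuple[float, float]  # (start, end) in nanoseconds, sorted, non-overlapping


def fold_zones(zones: List[Zone], ns_per_px: float,
               min_px: float = 3.0) -> List[Tuple[str, float, float]]:
    """Emit ("zone", s, e) for visible zones and ("folded", s, e) for runs of tiny ones."""
    min_ns = min_px * ns_per_px
    out: List[Tuple[str, float, float]] = []
    i, n = 0, len(zones)
    while i < n:
        start, end = zones[i]
        if end - start >= min_ns:
            out.append(("zone", start, end))
            i += 1
            continue
        # Too small: skip ahead over following tiny zones that sit close enough together.
        j = i + 1
        while j < n and zones[j][1] - zones[j][0] < min_ns and zones[j][0] - end < min_ns:
            end = zones[j][1]
            j += 1
        out.append(("folded", start, end))
        i = j
    return out
```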

pervognsen, 15 days ago

@wolfpld @theWarhelm @forrestthewoods Nice, using binary search to skip to the next first visible interval should work too if you're just trying to avoid processing too much data but don't care about "aliasing". I tried to handle both with the same LOD approach. I don't remember it being too tricky. First, each zone interval is binned by its length into the appropriate power-of-two octave, each octave with its own sorted array. The preprocess then does a bottom-up "merge" from finest to coarsest level.
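
The binning step might look like this in Python, assuming integer tick timestamps and octave k holding interval lengths in [2^k, 2^(k+1)). Since intervals are appended in start order, each octave's array stays sorted. `octave_of` and `bin_by_octave` are hypothetical helpers.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end) in integer ticks


def octave_of(length: int) -> int:
    """Octave k holds lengths in [2**k, 2**(k+1)); zero-length intervals go to octave 0."""
    return max(length, 1).bit_length() - 1


def bin_by_octave(intervals: List[Interval]) -> Dict[int, List[Interval]]:
    bins: Dict[int, List[Interval]] = defaultdict(list)
    for start, end in sorted(intervals):  # start order keeps every bin sorted
        bins[octave_of(end - start)].append((start, end))
    return dict(bins)
```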

pervognsen, 15 days ago

@wolfpld @theWarhelm @forrestthewoods The merge is just an ordered walk of the sorted array from the previous level where you merge together the longest sequence of intervals until the bounding width of the merged interval is above the minimum interval length for that level. Such an "LOD interval" aggregates relevant information like the number of covered zones, min/max/avg covered zone duration, etc.
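
One reading of that merge step, as a Python sketch: walk the finer level's LOD intervals in start order and grow a run until its bounding width reaches `min_len`, then emit it with aggregated zone count and min/max/sum durations. Emitting greedily as soon as the threshold is crossed, and leaving out how a level's own native intervals are folded in, are assumptions here; `LodInterval`, `leaf`, and `merge_level` are hypothetical names.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass
class LodInterval:
    start: int
    end: int
    count: int    # number of original zones covered
    min_dur: int
    max_dur: int
    sum_dur: int

    @property
    def avg_dur(self) -> float:
        return self.sum_dur / self.count


def leaf(start: int, end: int) -> LodInterval:
    """Wrap a single original zone as a trivial LOD interval."""
    d = end - start
    return LodInterval(start, end, 1, d, d, d)


def merge_level(prev: List[LodInterval], min_len: int) -> List[LodInterval]:
    """Ordered walk over `prev` (sorted by start), merging runs until width >= min_len."""
    out: List[LodInterval] = []
    cur: Optional[LodInterval] = None
    for iv in prev:
        if cur is None:
            cur = replace(iv)  # copy so the finer level is left untouched
        else:
            cur.end = max(cur.end, iv.end)
            cur.count += iv.count
            cur.min_dur = min(cur.min_dur, iv.min_dur)
            cur.max_dur = max(cur.max_dur, iv.max_dur)
            cur.sum_dur += iv.sum_dur
        if cur.end - cur.start >= min_len:
            out.append(cur)
            cur = None
    if cur is not None:
        out.append(cur)  # trailing run that never reached min_len
    return out
```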

pervognsen, 15 days ago

@wolfpld @theWarhelm @forrestthewoods Then the query engine in the visualizer has to query each relevant octave for zones. You query all the octaves starting from the first relevant octave ("first relevant" meaning the finest octave whose maximum interval width is at least one pixel or whatever your cutoff is).
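
The query side can be sketched in Python under the same assumptions as the binning (integer ticks, octave k holding lengths in [2^k, 2^(k+1)), each octave's array sorted by start): pick the first relevant octave from the zoom level, then binary-search each remaining octave, using that octave's maximum length to bound how far before the viewport a start can lie. Finer octaves are skipped; presumably their contents are represented by coarser LOD intervals in the full scheme. `first_relevant_octave` and `query_octaves` are hypothetical.

```python
import math
from bisect import bisect_right
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end) in integer ticks


def first_relevant_octave(ticks_per_px: float, cutoff_px: float = 1.0) -> int:
    """Finest octave k whose widest interval (just under 2**(k+1) ticks) spans cutoff_px."""
    k = 0
    while 2 ** (k + 1) / ticks_per_px < cutoff_px:
        k += 1
    return k


def query_octaves(bins: Dict[int, List[Interval]], t0: int, t1: int,
                  ticks_per_px: float, cutoff_px: float = 1.0) -> List[Interval]:
    """Intervals overlapping [t0, t1] from every octave at or above the first relevant one."""
    out: List[Interval] = []
    k0 = first_relevant_octave(ticks_per_px, cutoff_px)
    for k in sorted(o for o in bins if o >= k0):
        ivs = bins[k]           # sorted by start
        max_len = 2 ** (k + 1)  # every length in octave k is < max_len
        # Anything starting at or before t0 - max_len ends before t0, so skip it.
        lo = bisect_right(ivs, (t0 - max_len, math.inf))
        hi = bisect_right(ivs, (t1, math.inf))  # one past the last start <= t1
        out.extend(iv for iv in ivs[lo:hi] if iv[1] >= t0)
    return out
```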

pervognsen, 15 days ago

@wolfpld @theWarhelm @forrestthewoods Sorry for not explaining it better, I haven't thought about it for 10 years and I'm having to reconstruct the details quickly on demand.

wolfpld, 15 days ago

@pervognsen @theWarhelm @forrestthewoods What you describe seems to be quite tricky to me ;) But I think I get the general gist of it.

The zones are not the problem though; they are fine as they are. You can run a debug build and it will happily run at 144 FPS, even if the screen is full of razor-thin zones.

The real problem is with handling plot data, including CPU usage, which basically does a binary search for each screen column, and it's slow.

pervognsen, 15 days ago

@wolfpld @theWarhelm @forrestthewoods Like I said, I handle plots too but I went into less detail with my description of that because they're much easier since you're dealing with point-sampled values rather than hierarchical intervals/timespans. At least for the plot types we supported.

wolfpld, 15 days ago

@pervognsen @theWarhelm @forrestthewoods For plots, it checks if the number of plot items to draw is less than an arbitrary number, then draw all, otherwise do random sampling.

https://github.com/wolfpld/tracy/blob/master/profiler/src/profiler/TracyTimelineItemPlot.cpp#L132-L235
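
A hedged Python sketch of that draw-all-or-sample policy for a visible index range `[lo, hi)`; `plot_indices` and `limit` are hypothetical, and the sampling in the linked code may well differ. Uniform random sampling can skip an isolated spike, which is the case the min/max summaries pervognsen describes are meant to preserve.

```python
import random
from typing import List, Optional


def plot_indices(lo: int, hi: int, limit: int,
                 rng: Optional[random.Random] = None) -> List[int]:
    """Indices of plot points to draw for the visible range [lo, hi)."""
    n = hi - lo
    if n <= limit:
        return list(range(lo, hi))  # few enough points: draw them all
    rng = rng or random.Random()
    # Too many: draw a uniform random subset, kept in time order for line drawing.
    return sorted(rng.sample(range(lo, hi), limit))
```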

wolfpld, 15 days ago

@pervognsen @theWarhelm @forrestthewoods I have thought about how to improve this, but have not come up with a satisfactory solution. You either have a ton of metadata and a hard time when you need to insert a new item in the middle of already existing ones, or you have to do some kind of partitioning scheme. But then you have to decide whether to do it by time or by number of items, and your basic problem does not go away anyway, it just happens some time later.

pervognsen, 15 days ago

@wolfpld @theWarhelm @forrestthewoods Maybe I'm misunderstanding but why would you need to "insert an item into the middle of an already existing one"? The fact that all the data arrives in time order (except for a short grace period where events from different threads can possibly be out of order with each other) should mean you don't require that.
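
To illustrate the point about time-ordered data, here is a Python sketch of a min/max/sum pyramid built incrementally as samples arrive: each append only touches the tail of each level, so nothing is ever inserted mid-array. `AppendOnlyPyramid` is a hypothetical name, and events that arrive out of order within the grace period are assumed to be reordered before they reach it.

```python
from typing import List, Tuple

Summary = Tuple[float, float, float]  # (min, max, sum) over a block of samples


class AppendOnlyPyramid:
    """Min/max/sum pyramid grown one time-ordered sample at a time."""

    def __init__(self) -> None:
        self.levels: List[List[Summary]] = [[]]

    def append(self, x: float) -> None:
        node: Summary = (x, x, x)
        level = 0
        while True:
            if level == len(self.levels):
                self.levels.append([])
            row = self.levels[level]
            row.append(node)         # only ever appends at the tail of a level
            if len(row) % 2:         # no sibling yet, so no parent can be formed
                return
            a, b = row[-2], row[-1]  # a pair just completed: push its parent up
            node = (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2])
            level += 1
```

Each append is amortized O(1) and at worst O(log n), and the result matches building the same pyramid in one batch.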

pervognsen, 21 days ago (edited 21 days ago)

@forrestthewoods Incidentally, the RAD debugger's UI is not single-pass immediate-mode UI. The application builds the UI in a traditional immediate-mode style but that implicitly constructs a full widget tree under the hood which is traversed and laid out in several passes (in fact, far more passes than algorithmically needed for that layout system) before being turned into draw commands.
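
A toy Python sketch of that structure: immediate-style calls (`label`, `begin`, `end`) record nodes into a tree through a parent stack, and layout then runs as separate passes over the finished tree (sizes bottom-up, then positions top-down). The API, the fixed 8x16 glyph size, and the row/column rules are all hypothetical; this is not the RAD debugger's code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str  # "column", "row", or "label"
    text: str = ""
    children: List["Node"] = field(default_factory=list)
    w: int = 0
    h: int = 0
    x: int = 0
    y: int = 0


class UI:
    """Immediate-style calls that implicitly record a widget tree (hypothetical API)."""

    def __init__(self) -> None:
        self.root = Node("column")
        self.stack = [self.root]  # current parent is on top

    def label(self, text: str) -> None:
        self.stack[-1].children.append(Node("label", text))

    def begin(self, kind: str) -> None:
        node = Node(kind)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def end(self) -> None:
        self.stack.pop()


def measure(n: Node) -> None:
    """Layout pass 1 (post-order): sizes flow bottom-up."""
    for c in n.children:
        measure(c)
    if n.kind == "label":
        n.w, n.h = 8 * len(n.text), 16  # assumed fixed 8x16 glyphs
    elif n.kind == "row":
        n.w = sum(c.w for c in n.children)
        n.h = max((c.h for c in n.children), default=0)
    else:  # column
        n.w = max((c.w for c in n.children), default=0)
        n.h = sum(c.h for c in n.children)


def place(n: Node, x: int = 0, y: int = 0) -> None:
    """Layout pass 2 (pre-order): absolute positions flow top-down."""
    n.x, n.y = x, y
    for c in n.children:
        place(c, x, y)
        if n.kind == "row":
            x += c.w
        else:
            y += c.h
```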

pervognsen, 21 days ago (edited 21 days ago)

@forrestthewoods I have an unfinished email sitting in my draft folder I've been meaning to send to Ryan about the needless passes. But basically his separate pre-order, in-order and post-order passes can be folded into the measure() function here: https://c.godbolt.org/z/h137o3v1G. With a Flutter-style layout system you can actually do both measuring and parent-relative placement in the same recursive pass. Although this still requires a separate layout pass from the application's UI build pass.
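
This is not the linked godbolt code; as a rough Python illustration of the Flutter-style idea, a single recursive `measure` passes a width constraint down, returns sizes up, and has each parent write its children's parent-relative offsets on the way back, so measuring and placement share one pass. The `Box` type, glyph size, and row/column rules are assumptions for illustration, and building the tree is still a separate pass.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Box:
    kind: str  # "label", "row", or "column"
    text: str = ""
    children: List["Box"] = field(default_factory=list)
    dx: int = 0  # offset relative to the parent, written by the parent
    dy: int = 0


def measure(b: Box, max_w: int) -> Tuple[int, int]:
    """One recursive pass: constraint down, size up, children placed on the way back."""
    if b.kind == "label":
        return min(8 * len(b.text), max_w), 16  # assumed 8x16 glyphs, clipped to width
    w = h = 0
    for c in b.children:
        if b.kind == "row":
            cw, ch = measure(c, max(0, max_w - w))  # remaining width is the constraint
            c.dx, c.dy = w, 0
            w += cw
            h = max(h, ch)
        else:  # column
            cw, ch = measure(c, max_w)
            c.dx, c.dy = 0, h
            w = max(w, cw)
            h += ch
    return w, h
```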

zwarich, 21 days ago

@pervognsen @forrestthewoods Is this just bidirectional type checking (or left corner parsing) in a new guise again? It seems like there's only like 3 good ideas...

pervognsen, 21 days ago (edited 21 days ago)

@zwarich @forrestthewoods Heh, I didn't think of the connection with bidirectional typechecking but I did think of the connection with attribute grammars with how you dynamically switch per-node whether an attribute is inherited or synthesized (to use attribute grammar terminology).

aras, 22 days ago

@forrestthewoods nice! You could also use Unity (prob some version older than 2021 - less UIToolkit there) as a “large/complex” IMGUI app case.
