pervognsen,
@pervognsen@mastodon.social avatar

This came up in another thread today, but I figured I'd throw a brief comment onto the timeline. The concept of "grace periods", where you separate the logical and physical deletion of resources, is something you see in RCU, EBR, QSBR, etc., but it's just as useful in single-threaded code where you only have pseudo-concurrency through subroutine calls. Think of the age-old pattern of deferring deletion until the end of the game loop, or autorelease pools in Objective-C, which get drained by the run loop.
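
A minimal single-threaded sketch of that end-of-frame pattern in C; the names `defer_free` and `frame_end` and the fixed queue capacity are illustrative, not from any particular engine:

```c
#include <stddef.h>
#include <stdlib.h>

// Logical deletion queues the pointer; physical deletion waits for
// the end of the current game-loop iteration, so anything handed
// out earlier in the frame stays valid until then.
#define MAX_DEFERRED 1024

static void *deferred[MAX_DEFERRED];
static size_t deferred_count;

void defer_free(void *p) {
    // Overflow handling elided; a real queue would grow or flush.
    if (deferred_count < MAX_DEFERRED)
        deferred[deferred_count++] = p;
}

void frame_end(void) {
    // The grace period ends here: no live borrows can outlast the frame.
    for (size_t i = 0; i < deferred_count; i++)
        free(deferred[i]);
    deferred_count = 0;
}
```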

pervognsen,
@pervognsen@mastodon.social avatar

E.g. the reason that weak handles for game objects don't need to do the thing people usually do for general-purpose weak pointers (i.e. temporarily upgrade a weak pointer to a strong pointer as a way to lock it in memory for the extent of the code region) is that we only clean up/recycle objects when the current game loop iteration is done. (A standard variation is to add a multi-frame delay, with a semaphore as a backstop, to allow for pipelined threads that read/borrow from game data.)
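
A sketch of what such a weak-handle lookup can reduce to, assuming generation-tagged slots (discussed further down-thread); all names are hypothetical:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { uint32_t index; uint32_t generation; } Handle;

typedef struct {
    uint32_t generation;  // bumped each time the slot is recycled
    int      alive;
    /* ... object payload ... */
} Slot;

#define MAX_OBJECTS 4096
static Slot slots[MAX_OBJECTS];

// No weak->strong upgrade and no refcount: slots are only recycled
// once the loop iteration is done, so a non-NULL result stays valid
// for the rest of the current iteration.
Slot *resolve(Handle h) {
    Slot *s = &slots[h.index];
    return (s->alive && s->generation == h.generation) ? s : NULL;
}
```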

pervognsen, (edited )
@pervognsen@mastodon.social avatar

And if you really need to explicitly mark "critical regions", the lesson from EBR is that it's more efficient to mark that in public data that's attached to your thread rather than to the particular object (which is effectively what you're doing when you upgrade a weak pointer to a strong pointer for a typical refcounted pointer class). Then the code that's responsible for recycling (logically but not physically deleted) resources can scan that public data from all threads.
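
A rough sketch of that EBR-style public per-thread data in C11 atomics; it ignores the fences and re-validation a production EBR implementation needs, and all names are made up:

```c
#include <stdatomic.h>
#include <stdint.h>

#define MAX_THREADS 64
#define IDLE UINT64_MAX

// One public slot per thread (all initialized to IDLE at startup).
// Marking a critical region is a store to your own slot, not to the
// object you're reading.
static _Atomic uint64_t thread_epoch[MAX_THREADS];
static _Atomic uint64_t global_epoch;

void enter_critical(int tid) {
    // Announce "I may be holding pointers from this epoch".
    atomic_store(&thread_epoch[tid], atomic_load(&global_epoch));
}

void leave_critical(int tid) {
    atomic_store(&thread_epoch[tid], IDLE);
}

// The reclaimer scans everyone's public slot: anything logically
// deleted before the minimum active epoch can be physically recycled.
uint64_t min_active_epoch(void) {
    uint64_t min = atomic_load(&global_epoch);
    for (int i = 0; i < MAX_THREADS; i++) {
        uint64_t e = atomic_load(&thread_epoch[i]);
        if (e < min) min = e;
    }
    return min;
}
```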

pervognsen,
@pervognsen@mastodon.social avatar

Jeff Preshing has an old article on using QSBR where he also mentions the comparison to game loops: https://preshing.com/20160726/using-quiescent-states-to-reclaim-memory/

pervognsen,
@pervognsen@mastodon.social avatar

Also, the low-level ABA part of the reclamation problem is the typically cited reason people use EBR, etc., but that part can be practically solved with type-stable memory with generation-tagged slots to distinguish different generations of objects in the same slot. That's already standard fare in games. But that still doesn't solve the problem of making sure the slot doesn't get recycled while you have a (validated!) live pointer to it--that's a time-of-check vs time-of-use issue.

SonnyBonds,
@SonnyBonds@mastodon.gamedev.place avatar

@pervognsen What would you say the big upsides of e.g. "slot arenas" are compared to classic allocation of objects? Memory locality is obviously one, although I'm thinking that mostly affects operations that involve sequential iteration over objects?

pervognsen,
@pervognsen@mastodon.social avatar

@SonnyBonds It depends on the specific use cases. You can allocate/free the arena as a single operation. An index can be a lot smaller than a 64-bit pointer (though if you need a generation tag in the pointer to detect slot recycling that eats into that), usually only 32 bits but potentially even smaller if you can apply domain-specific knowledge to set reasonable capacity limits (e.g. 16 bits is plenty for almost any UI or game if it's just a matter of max live object count).
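
For illustration, one way the bit budget of such a 32-bit handle might be split; the 24/8 split is an arbitrary choice, not a recommendation:

```c
#include <stdint.h>
#include <assert.h>

typedef uint32_t Handle32;

// 24 bits of slot index (16M live objects) leaves 8 bits of
// generation tag, which wraps mod 256, so recycling detection is
// probabilistic at this width. Shrink the index bits if you can
// bound the live object count (16 bits covers most UIs and games).
#define INDEX_BITS 24
#define INDEX_MASK ((1u << INDEX_BITS) - 1)

static inline Handle32 make_handle(uint32_t index, uint32_t gen) {
    assert(index <= INDEX_MASK);
    return (gen << INDEX_BITS) | index;
}

static inline uint32_t handle_index(Handle32 h) { return h & INDEX_MASK; }
static inline uint32_t handle_gen(Handle32 h)   { return h >> INDEX_BITS; }
```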

pervognsen, (edited )
@pervognsen@mastodon.social avatar

@SonnyBonds You can easily craft bespoke (both simpler and more efficient in space and time than general purpose) strategies for some of the aforementioned problems like weak handles or deferred deletion. You reduce fragmentation by packing objects of the same size together, especially if it's not a typical power-of-two size class. Related to the last point, you improve performance reliability by not co-mingling your objects with other allocations that "randomly" happen to share your size class.

pervognsen, (edited )
@pervognsen@mastodon.social avatar

@SonnyBonds I should mention that you can get some of these benefits (e.g. the last two) with a custom pool allocator too, but I consider most uses of "pluggable allocators" like that an anti-pattern: it doesn't address the problem holistically, it just tries to reduce it to a memory management problem. At best, you're leaving massive opportunities on the table when you do that, so it's a kind of lazy shortcut (but lazy shortcuts have their place).

SonnyBonds,
@SonnyBonds@mastodon.gamedev.place avatar

@pervognsen Can you have generations in an arena with different object sizes? I'm thinking an old generation value should never end up in the middle of a new object, if that makes sense.

pervognsen, (edited )
@pervognsen@mastodon.social avatar

@SonnyBonds If you want to be able to reuse space in an arena incrementally (i.e., not just a bulk reset/clear at certain checkpoints where you know everything is dead), it needs to be divided into fixed-size slots. That's why I mentioned "one array per widget type" in the other thread. I generally don't co-locate objects that happen to have the same size in the same array either. In fact, I go a step further and almost always split tagged union variants into separate arrays.
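
A sketch of such a per-type slot arena with incremental reuse via an intrusive free list; capacity, names, and layout are all illustrative:

```c
#include <stdint.h>

#define CAP  4096
#define NONE UINT32_MAX

// One arena per widget type: fixed-size slots so space can be reused
// incrementally, with free slots chained through an intrusive free list.
typedef struct {
    uint32_t generation;  // bumped on free to invalidate old handles
    uint32_t next_free;   // next free slot, or NONE
    /* payload for this widget type */
} WidgetSlot;

typedef struct {
    WidgetSlot slots[CAP];
    uint32_t   count;      // high-water mark of slots ever used
    uint32_t   free_head;  // must be initialized to NONE
} WidgetArena;

uint32_t widget_alloc(WidgetArena *a) {
    if (a->free_head != NONE) {  // reuse a freed slot first
        uint32_t i = a->free_head;
        a->free_head = a->slots[i].next_free;
        return i;
    }
    return a->count < CAP ? a->count++ : NONE;  // or grow/fail
}

void widget_free(WidgetArena *a, uint32_t i) {
    a->slots[i].generation++;  // outstanding handles now fail validation
    a->slots[i].next_free = a->free_head;
    a->free_head = i;
}
```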

SonnyBonds,
@SonnyBonds@mastodon.gamedev.place avatar

@pervognsen Got it, just wanted to double check if there was some trick I hadn't thought about.

My most recent foray into it was mostly for a different reason: having clonable arenas so a full state can be synced between threads or stored for undo, etc. But that means the arena location isn't static and needs to be passed around together with references, and I never came up with a satisfactory way of doing that.

pervognsen, (edited )
@pervognsen@mastodon.social avatar

@SonnyBonds Oh yeah, I left out a bunch of other advantages. It's hard to make a complete list on demand like this. :) The fact that an arena (or just index-based arrays) is a position-independent representation means that you can memcpy/relocate/serialize them without fixups, memory map them, etc. As for having to know the base pointer for the arena to resolve handles, I've never found it to be a big issue if you structure your code properly.

pervognsen, (edited )
@pervognsen@mastodon.social avatar

@SonnyBonds In practice, you either have the arena accessible from your context (through function parameters usually though sometimes static TLS can be an option in lieu of a thread-local global) or a fully resolved slot pointer (which was resolved up-stack somewhere that did have the arena through its context) or, very rarely, you might want to have a fat smart pointer class that pairs an arena pointer and a handle so it carries the context with it. The latter is almost never needed IME.

pervognsen,
@pervognsen@mastodon.social avatar

@SonnyBonds Actually, a different kind of fat pointer/cursor has been more useful in my experience--if you only have a fully resolved pointer and no arena context, you generally can't recover the key/handle corresponding to the pointer unless you are wasteful and replicate it in the header of the object. So it's useful to have a fat pointer that pairs the resolved slot pointer with its handle/key so it's available as a unit. This is a general pattern for collection cursors, not arena specific.
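
In code, the cursor is just the pair, kept together so neither half has to be re-derived; a hypothetical shape:

```c
#include <stdint.h>

// The resolved pointer and the handle it came from travel as a unit,
// so code further down the stack can recover the key without it
// being replicated in the object's header.
typedef struct {
    void    *ptr;     // resolved slot pointer
    uint32_t handle;  // the key it was resolved from
} Cursor;
```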

pervognsen, (edited )
@pervognsen@mastodon.social avatar

@SonnyBonds (I also think that last example shows one of the general issues with pointers. It's nice that they're "free standing" as a way to access a piece of memory but unless you put all the information related to the object directly in the pointed-to block of memory, it's hard or impossible to access that other information if you "just" have the pointer. So it tends to promote shoving everything related to an object, or at least pointers to that associated data, into one place.)

zwarich,
@zwarich@hachyderm.io avatar

@pervognsen @SonnyBonds If only programming languages could associate the arena with the handle for you so you don't need to pass the arena around everywhere.

pervognsen,
@pervognsen@mastodon.social avatar

@zwarich @SonnyBonds I'm not sure that would do much to be honest. It feels like solving the wrong problem. You only really need that for referencing remote arena-owned resources. And usually you don't want those to proliferate too much into "leaves" anyway.

pervognsen,
@pervognsen@mastodon.social avatar

@zwarich @SonnyBonds E.g. if you have arenas A and B with independent lifetimes and you need A's resources to be able to reference B's you're often better off with these (remote arena pointer, remote handle) pairs in an arena-wide "import table" for A and then A's resources reference B's resources only through local handles into A's import table, so there's a level of indirection. So at that point you're back to accessing remote resources via local handles only. It's like dynamic linking.
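
A structural sketch of that import-table idea (types and field names invented for illustration):

```c
#include <stdint.h>

struct ArenaB;  // remote arena with its own independent lifetime

// Like a dynamic-linking import section: only the import table knows
// about B. A's objects store plain local indices into this table.
typedef struct {
    struct ArenaB *arena;   // remote arena pointer
    uint32_t       handle;  // remote handle within that arena
} Import;

typedef struct {
    Import   imports[256];  // arena-wide import table for A
    uint32_t import_count;
    /* ... A's own slots reference B only via indices into imports ... */
} ArenaA;
```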

zwarich,
@zwarich@hachyderm.io avatar

@pervognsen @SonnyBonds The only thing that annoys me about serializing these sorts of web-of-arenas structures is that once you start doing any sort of reclamation, the exact layout of the data depends on the history of all operations performed on that arena.

Of course, since you can always canonicalize the whole web, this is mostly an aesthetic concern.

pervognsen,
@pervognsen@mastodon.social avatar

@zwarich @SonnyBonds You have a similar issue with intern IDs. I think it's tolerable, especially if you make them sufficiently opaque and stick to treating them invariantly with respect to the equivalence relation that the interning is actually trying to model. But physically, in terms of the actual bitwise representation for any particular "local context", yeah, it's highly non-canonical. Arguably that part is a violation of what I said about position-independence, if not in an address sense.

zwarich,
@zwarich@hachyderm.io avatar

@pervognsen @SonnyBonds Related, but do you have a parallel interning strategy that you really like? I remembered seeing this post from matklad, but didn't reach a conclusion about it from reading it alone (and had no chance to implement it): https://matklad.github.io/2023/04/23/data-oriented-parallel-value-interner.html

pervognsen, (edited )
@pervognsen@mastodon.social avatar

@zwarich I think step one is recognizing that this problem is brought about, largely, by trying to make the problem too globally consistent/coordinated and then getting stuck focusing on solving the scalability issues that inevitably ensue. :)

zwarich,
@zwarich@hachyderm.io avatar

@pervognsen I know, it's giving Meta "<popular tool> can't handle our scale" vibes.

In places where I've had this problem in the past, I've just used local interners and eaten the cost of doing cross-boundary comparisons. It also has the advantage that you're much more likely to be able to eventually reclaim names than when using a global interner.

pervognsen,
@pervognsen@mastodon.social avatar

@zwarich One useful separation of concerns I've made in addressing related issues is dis-entangling interning/canonicalization from sharing/dedup. Opportunistic/best-effort sharing has much looser consistency requirements but depending on what you're trying to achieve it can generate a lot of the same benefits.

pervognsen,
@pervognsen@mastodon.social avatar

@zwarich For canonicalization, the local vs remote distinction is helpful. You can guarantee local canonicalization and if you eventually need global canonicalization it can be done, essentially, with bulk merging/linking, which can be done about as fast as just copying data if you set things up right. And then any on-the-fly non-local sharing can be opportunistic as a way of avoiding wasted work in an incremental/parallel computing sense.

pervognsen,
@pervognsen@mastodon.social avatar

@zwarich I looked at the article; I admit I don't really see anything special there. An exponentially segmented array is a nice data structure, but for "one large array" kind of use cases like a global interner, you can just use a huge VM reservation. I tend to only use exponentially segmented arrays as a fallback when I need something "general purpose" for a library where I can't make assumptions on behalf of the API's client about whether reserving 4 GB (or whatever) for an array is reasonable/possible.
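
A sketch of the huge-VM-reservation approach on POSIX; the OS backs pages lazily, so the reservation costs address space, not memory (Linux-flavored flags assumed):

```c
#include <sys/mman.h>
#include <stddef.h>

// Reserve a huge range of address space up front; the array never
// moves and never needs the segmented array's indirection. A
// stricter variant reserves with PROT_NONE and commits ranges with
// mprotect as the array grows.
#define RESERVE_SIZE (4ull << 30)  // 4 GB of address space, not memory

void *reserve_big_array(void) {
    void *p = mmap(NULL, RESERVE_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```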

pervognsen,
@pervognsen@mastodon.social avatar

@zwarich I don't see anything unreasonable he's doing, but he's missing at least a few standard things like a thread-local cache in front of the shared table(s). The local caches are usually a win IME even for single-threaded interning, once the global hash table gets big enough. I think he's seriously underestimating how much contention he would get with that sharded locking strategy for hot keys; you really want that thread-local cache.
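
A sketch of such a thread-local cache in front of a shared interner; `intern_shared` stands in for the sharded/locked global table and is hypothetical:

```c
#include <stdint.h>
#include <stddef.h>

#define CACHE_SIZE 1024  // power of two, direct-mapped

typedef struct { uint64_t hash; uint32_t id; } CacheEntry;
static _Thread_local CacheEntry tl_cache[CACHE_SIZE];

// Slow path: the sharded/locked global table (hypothetical).
extern uint32_t intern_shared(const char *str, size_t len, uint64_t hash);

uint32_t intern(const char *str, size_t len, uint64_t hash) {
    CacheEntry *e = &tl_cache[hash & (CACHE_SIZE - 1)];
    // Matching on the 64-bit hash alone assumes collisions are
    // negligible (and that no key hashes to 0, the empty marker);
    // a cache that stores the key and memcmps would be exact.
    if (e->hash == hash)
        return e->id;  // hot keys never touch the shared table
    uint32_t id = intern_shared(str, len, hash);
    *e = (CacheEntry){hash, id};
    return id;
}
```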

pervognsen, (edited )
@pervognsen@mastodon.social avatar

@zwarich Since you're dealing with an append-only data structure, you can also consider a kind of memory hierarchy design beyond the thread-local caches where you have a big frozen/read-only shared table you always consult first when your local cache misses before you check the sharded read/write tables. You need a strategy for migrating "old" items (which are usually the hottest keys) to that frozen table but you can integrate that with your hash table growth strategy, for example.

pervognsen,
@pervognsen@mastodon.social avatar

@zwarich A new "old table" is a completely fresh rehash that's built from merging the old table with left-behind read/write tables when they have to grow. It can be rebuilt asynchronously by a background task and a pointer to it can be atomically swapped in to publish it as the new "old table". This "old table" is also a good candidate for pre-populating from previous compiler runs and such on the same work space.

pervognsen,
@pervognsen@mastodon.social avatar

@zwarich Anyway, just a random potpourri of other things I've tried in this vein, since you asked.

pervognsen,
@pervognsen@mastodon.social avatar

@zwarich Oh, another thing that works well for rebuilding the old table is to assign intern IDs sequentially as usual and when you rebuild the old table you just include all intern items with keys [0..watermark]. You publish the watermark when you've atomically swapped in/published the new table. Then read/write tables can tell what keys they don't need to carry forward when rehashing to a new table, by just doing a simple integer comparison of the intern ID vs the current watermark.
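
A sketch of the watermark mechanics, assuming sequentially assigned intern IDs; `OldTable` and the publication order are illustrative:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct OldTable OldTable;  // frozen table serving IDs [0..watermark)

static _Atomic(OldTable *) old_table;
static _Atomic uint32_t    watermark;

// Background rebuild finishes: publish the table before the
// watermark so readers never see a watermark the table can't serve.
void publish_old_table(OldTable *t, uint32_t new_watermark) {
    atomic_store(&old_table, t);
    atomic_store(&watermark, new_watermark);
}

// When a read/write shard rehashes, entries already covered by the
// frozen table are dropped instead of carried forward: one integer
// comparison against the published watermark.
static inline bool covered_by_old_table(uint32_t intern_id) {
    return intern_id < atomic_load(&watermark);
}
```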

pervognsen,
@pervognsen@mastodon.social avatar

@zwarich If you're rebuilding the old table and the previous old table was [0..old_watermark] and the next old table is going to have [0..new_watermark] then a nice trick is to build a much smaller hash table for just the ID range [old_watermark..new_watermark] and then do the Robin Hood sorted-hash merge of that with the previous old table.

pkhuong,

@pervognsen @zwarich @SonnyBonds That's why I overuse collision resistant hashes (:

pervognsen,
@pervognsen@mastodon.social avatar

@pkhuong @zwarich And you can intern the collision-resistant hashes to get a compact local representation (e.g. 16-to-4 byte compression for repeat references to the same object hash in a local context)! The intern IDs for those hashes are still only locally meaningful but now you can translate back and forth between those local IDs and content-based hashes in constant time, so it's almost trivial to make the intern IDs fully opaque, even for hash tables where you now use the content hash.
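
A sketch of interning 16-byte content hashes into dense 4-byte local IDs; since the content hash is already uniform and collision-resistant, the table can probe on its raw bits (sizes and names are arbitrary):

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t bytes[16]; } ContentHash;

#define MAX_IDS    (1u << 20)
#define TABLE_SIZE (1u << 21)  // power of two, ~2x MAX_IDS
#define EMPTY      UINT32_MAX

static ContentHash id_to_hash[MAX_IDS];  // local ID -> hash: constant time
static uint32_t    table[TABLE_SIZE];    // filled with EMPTY at startup
static uint32_t    id_count;

// hash -> local ID via linear probing seeded directly with bits of
// the content hash. Bounds checks and growth elided.
uint32_t intern_hash(ContentHash h) {
    uint64_t bits;
    memcpy(&bits, h.bytes, sizeof bits);
    for (uint32_t i = (uint32_t)bits & (TABLE_SIZE - 1); ;
         i = (i + 1) & (TABLE_SIZE - 1)) {
        uint32_t id = table[i];
        if (id == EMPTY) {  // new hash: assign the next dense ID
            id_to_hash[id_count] = h;
            return table[i] = id_count++;
        }
        if (memcmp(id_to_hash[id].bytes, h.bytes, 16) == 0)
            return id;  // repeat reference: 4 bytes instead of 16
    }
}
```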

tobinbaker,

@pervognsen I found that EBR worked really well to truncate a log at a "safe" point in the past, where log truncation happens concurrently during cooperative GC, with no mutual exclusion. Each thread scanning the log just needed to register the starting point of the scan (as a timestamp), and any other thread trying to truncate the log would take the minimum of all such registered timestamps as a safe truncation point.
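
A paraphrase of that min-timestamp pattern in C11 atomics (not the linked GaiaPlatform code, and it glosses over the publication race a real implementation has to re-check):

```c
#include <stdatomic.h>
#include <stdint.h>

#define MAX_THREADS  64
#define NOT_SCANNING UINT64_MAX

// Each reader publishes the timestamp at which its log scan begins
// (slots initialized to NOT_SCANNING). No mutual exclusion anywhere.
static _Atomic uint64_t scan_start[MAX_THREADS];

void begin_scan(int tid, uint64_t start_ts) {
    atomic_store(&scan_start[tid], start_ts);
}

void end_scan(int tid) {
    atomic_store(&scan_start[tid], NOT_SCANNING);
}

// A truncator may cut the log at the minimum of all published
// timestamps; everything before that point is unreachable.
uint64_t safe_truncation_point(uint64_t now) {
    uint64_t min = now;
    for (int i = 0; i < MAX_THREADS; i++) {
        uint64_t ts = atomic_load(&scan_start[i]);
        if (ts < min) min = ts;
    }
    return min;
}
```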

The GC side of the algorithm is here:
https://github.com/gaia-platform/GaiaPlatform/blob/d0473e6e5710b3dbfc4ab674dda9ebaee8a5a5d9/production/db/core/src/db_server.cpp#L2962

And the read side is here:
https://github.com/gaia-platform/GaiaPlatform/blob/d0473e6e5710b3dbfc4ab674dda9ebaee8a5a5d9/production/db/core/src/db_server.cpp#L2839

pervognsen, (edited )
@pervognsen@mastodon.social avatar

@tobinbaker Is it a stretch to call that EBR? With EBR the collector advances the epoch and then waits for all the readers to (1) enter a read-side critical section and acknowledge that they've seen the new epoch and (2) leave that read-side critical section. I.e. there's a collector -> reader -> collector round-trip of information flow. With trailing log truncation you don't really have anything like that?

pervognsen, (edited )
@pervognsen@mastodon.social avatar

@tobinbaker And in EBR you also have a third role, the writer, who is orphaning old items and thereby producing garbage that needs to be cleaned up eventually, but not so soon the readers will be affected, by attaching it to an epoch. So it's not enough to have one-way information flow/passive observers.

But yeah, certainly the "min" approach works well for linear/circular buffer truncation. You also see this in LMAX's Disruptor.
