amonakov

@amonakov@mastodon.gamedev.place

pervognsen, to random
@pervognsen@mastodon.social avatar

There's a bunch of C-like successor languages that say they want to eliminate undefined behavior. I've never been able to figure out how they intend to deal with reads and writes to memory since a lot of these languages take what I would call the "naive" machine-centric view of memory which is hard to reconcile with source-level semantics for variables, etc. You can't really rename all of this stuff as "implementation-defined" and get out of jail for free.

amonakov,
@amonakov@mastodon.gamedev.place avatar

@pervognsen @harold <pic: Fry squinting at the first sentence>

(some behavior in C is undefined by virtue of the language standard not saying anything about it)

Yes, there is the truly weird category of UB which pertains to translation time, not execution time, and it seems completely unnecessary, but I assume that's not what was meant here.

amonakov, to random
@amonakov@mastodon.gamedev.place avatar

(prompted by discussion of detecting bitwise and-not earlier in GCC's optimization pipeline)

My ideal compiler IR would not have and/or/xor as distinct bitwise ops, just generic ternlog and probably the corresponding two-operand function ("bilog"?) too.
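A minimal sketch of what such generic ops compute, assuming the truth-table indexing used by x86 VPTERNLOG (first operand supplies the most significant index bit). The names `ternlog` and `bilog` follow the post; the 4-bit table encoding for `bilog` is an assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Three-input bitwise function. For every bit position k, the result bit
   is bit ((a_k << 2) | (b_k << 1) | c_k) of the 8-bit truth table imm. */
static uint32_t ternlog(uint32_t a, uint32_t b, uint32_t c, uint8_t imm)
{
    uint32_t r = 0;
    for (unsigned idx = 0; idx < 8; idx++) {
        if (!(imm >> idx & 1))
            continue;
        /* Mask of bit positions whose inputs form exactly this index. */
        uint32_t ma = (idx & 4) ? a : ~a;
        uint32_t mb = (idx & 2) ? b : ~b;
        uint32_t mc = (idx & 1) ? c : ~c;
        r |= ma & mb & mc;
    }
    return r;
}

/* Two-input counterpart (hypothetical encoding): 4-bit truth table,
   indexed by (a_k << 1) | b_k. */
static uint32_t bilog(uint32_t a, uint32_t b, unsigned table)
{
    assert(table < 16);
    uint32_t r = 0;
    for (unsigned idx = 0; idx < 4; idx++) {
        if (!(table >> idx & 1))
            continue;
        r |= ((idx & 2) ? a : ~a) & ((idx & 1) ? b : ~b);
    }
    return r;
}
```

With this encoding, and/or/xor are `bilog` tables 0x8, 0xE and 0x6, and-not (a & ~b) is 0x4, three-way xor is `ternlog` table 0x96, and bitwise select (a ? b : c) is 0xCA.
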

amonakov,
@amonakov@mastodon.gamedev.place avatar

Perhaps my ideal CPU would also have a ternlog opcode instead of an incomplete zoo of bitwise ops.

mul-add is another useful three-input instruction (although unlike ternlog it's macro-fusible from separate mul+add). If one made a CPU geared towards such three-input instructions, I wonder what other combined ops would be there (clmul+xor? shift+or?) and what the trade-offs are.

amonakov,
@amonakov@mastodon.gamedev.place avatar

@pervognsen I guess because on GPUs the need for efficient blend is that much higher than on CPUs?

NVIDIA has lop3.

I'm not sure AMD has an equivalent?

amonakov, to random
@amonakov@mastodon.gamedev.place avatar

PSA for people writing Arm SIMD code in C or C++: unlike x86, where you can cast any pointer to __m128* and be able to dereference it regardless of the dynamic type of the pointed-to memory, that is not the case on Arm: Neon types do not carry the may_alias attribute and standard type compatibility rules apply. Compare the differences between 'f' and 'g' on the first pic, and Arm codegen on the second pic.

pervognsen, to random
@pervognsen@mastodon.social avatar

Digging through old stuff, here's a fairly extreme example of branch predictor training effects in a benchmark of Robin Hood linear probing vs conventional linear probing. Look at the branch misses for 1 vs 1000 outer loop iterations for RH (and the impact on wall clock time): https://gist.github.com/pervognsen/e818251d52f725db7c67e562577a12f6

amonakov,
@amonakov@mastodon.gamedev.place avatar

@pervognsen Is that Rust's built-in perf_event interface in action for {branch,cache} miss counts?

amonakov, to random
@amonakov@mastodon.gamedev.place avatar

In 'less', you can interactively add command-line arguments without leaving the pager by pressing '-': you can press '-S' to flip wrapping/chopping of long lines, and '-j11' to spawn 10 wor^W^W^W see extra ten lines of context above the match when searching!

vegard, to random
@vegard@mastodon.social avatar

Are these all the special sections that run code on program init/exit? Did I miss any?

.section .preinit_array; .quad fun
.section .init, "ax"; callq fun
.section .init_array; .quad fun
.section .ctors, "aw"; .quad fun
.section .dtors, "aw"; .quad fun
.section .fini_array; .quad fun
.section .fini, "ax"; callq fun

amonakov,
@amonakov@mastodon.gamedev.place avatar

@vegard you absolutely did, because you're not asking the right question. The right question is, what makes code from those sections run at startup?
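A small sketch of the point, assuming GCC or Clang on an ELF target: the sections only hold data (function pointers) or code fragments, and it is the C library startup code and the dynamic loader that walk them and make the calls. The function names below are hypothetical; the constructor/destructor attributes typically cause the compiler to emit pointers into .init_array and .fini_array.

```c
#include <assert.h>

static int init_ran;

/* The compiler emits a pointer to this function into .init_array (on
   typical modern ELF toolchains); startup code calls it before main. */
__attribute__((constructor))
static void early_init(void)
{
    init_ran = 1;
}

/* Likewise a pointer goes into .fini_array and is called at exit. */
__attribute__((destructor))
static void late_fini(void)
{
    assert(init_ran == 1);
}

/* Helper for callers that want to check whether the constructor ran. */
static int init_has_run(void)
{
    return init_ran;
}
```

If nothing at startup iterates over the array (for example with a custom entry point and no libc startup files), placing a pointer in the section by itself runs nothing.
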

forrestthewoods, to random
@forrestthewoods@mastodon.gamedev.place avatar

One of the arguments against immediate mode UI is that it will drain your battery. I think that's bollocks!

Here is my "proof of life" test to measure power draw. Preliminary result is that Dear ImGui consumes less power than YouTube. That feels like a fair bar to compare against.

I'll of course take some much more rigorous measurements and tests over longer periods of time. This is just the first time I have actual data.

amonakov,
@amonakov@mastodon.gamedev.place avatar

@wolfpld @forrestthewoods @castano I am well aware of the joke that goes like "to learn how to do <X> on Linux, loudly claim 'Linux cannot do <X>! WTF!' and wait to be corrected", and I am not sure what you were looking for when both perf and turbostat can work with RAPL sensors, but I'm curious how you solved it in the end.

amonakov, to random
@amonakov@mastodon.gamedev.place avatar

shower thought re. rgb 565 vs. theoretical 555|1 (common least significant bit for each component): it introduces higher "distortion" for darker colors (e.g. the closest encodable to dark purple { 1, 0, 1 }/64 is dark grey { 1, 1, 1 }/64), but our vision loses color sensitivity in low-light conditions anyway, so that probably would have been a better fit
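A sketch of one way to pack and unpack such a format, assuming 6-bit input components and a hypothetical bit layout (R in bits 15..11, G in 10..6, B in 5..1, shared least significant bit in bit 0). The function names are made up for illustration. The shared bit is chosen by majority vote of the three component LSBs, which changes at most one component by one step.

```c
#include <assert.h>
#include <stdint.h>

/* Pack three 6-bit components into 16 bits with one shared low bit. */
static uint16_t pack_555_1(unsigned r, unsigned g, unsigned b)
{
    assert(r < 64 && g < 64 && b < 64);
    unsigned ones = (r & 1) + (g & 1) + (b & 1);
    unsigned shared = ones >= 2;   /* majority of the three LSBs */
    return (uint16_t)((r >> 1) << 11 | (g >> 1) << 6 | (b >> 1) << 1 | shared);
}

/* Recover 6-bit components: 5 stored bits plus the shared low bit. */
static void unpack_555_1(uint16_t v, unsigned *r, unsigned *g, unsigned *b)
{
    unsigned shared = v & 1;
    *r = (v >> 11 & 31) << 1 | shared;
    *g = (v >> 6 & 31) << 1 | shared;
    *b = (v >> 1 & 31) << 1 | shared;
}
```

Packing { 1, 0, 1 } and unpacking it again gives { 1, 1, 1 }, the dark purple to dark grey case from the post.
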

amonakov,
@amonakov@mastodon.gamedev.place avatar

Oh, it's not theoretical at all: at least Sharp X68000 and Neo Geo employed it. Thanks, Wikipedia!

regehr, to random
@regehr@mastodon.social avatar

this has to be one of my all-time favorite bug-finding techniques: in your widely deployed software, at very low probability, you put a new heap allocation next to a protected page. performance is unaffected and the bugs that you find are those that actually matter to users in practice.

https://arxiv.org/pdf/2311.09394v2.pdf

amonakov,
@amonakov@mastodon.gamedev.place avatar

@regehr how do you rate this verbatim quote from the paper:

GWP-ASan is neither GWP nor ASan.

castano, to random
@castano@mastodon.gamedev.place avatar

Apparently PVR has a Vulkan layer that enables support for GPU timestamps. Has anybody had any success enabling it?

https://developer.imaginationtech.com/downloads/latest-release-notes/#pvrcarbon

amonakov,
@amonakov@mastodon.gamedev.place avatar

@castano What. Why is it a layer. Is it because it adds overhead to everything.

amonakov, to random
@amonakov@mastodon.gamedev.place avatar

We can

typedef float vec4 __attribute__((ext_vector_type(4)));

But what if we could

typedef void lane_fn(...);
typedef lane_fn vec_fn __attribute__((ext_vector_size(4)));

amonakov, to random
@amonakov@mastodon.gamedev.place avatar

With xz backdoor opening an RCE pathway, have you thought "hey, it would be nice if the sshd sub-process doing the key/cert parsing would not be able to fork/exec anything?" Ideally the only thing it should be able to do is read/write to already-open fds and die a peaceful death, right?

Now, this particular backdoor was embedded deep enough that it might be able to work around such privilege separation, but in general dropping privs for risky computations is an important part of defence-in-depth.

amonakov,
@amonakov@mastodon.gamedev.place avatar

And that reminds me of another scenario where we parse untrusted certificates: WPA2-Enterprise authentication. Venerable wpa_supplicant does have some privilege-separation code (which I believe is rarely enabled on Linux), but what iwd does is completely incomprehensible to me: they pass certs from the access point straight to the kernel keyring subsystem, using the kernel as a fancy SSL library. Any weakness in the involved kernel code is thus open for exploitation by rogue access points.

demofox, to random
@demofox@mastodon.gamedev.place avatar
amonakov,
@amonakov@mastodon.gamedev.place avatar

@demofox the adversity of the effect will fall more on the waterer than the wateree

amonakov, to random
@amonakov@mastodon.gamedev.place avatar

Inside of you there are two reviewers

amonakov, to random
@amonakov@mastodon.gamedev.place avatar

My kingdom for a distro with smooth cross-installation (preparing a rootfs for a foreign architecture without qemu).

amonakov,
@amonakov@mastodon.gamedev.place avatar

@lanodan @wolf480pl good to know, thanks!

shafik, to programming
@shafik@hachyderm.io avatar

'long long long' is too long for GCC

is such a good ol' diagnostic

amonakov,
@amonakov@mastodon.gamedev.place avatar

@shafik and yet sometimes a quad-long is okay with g++ if you twist its arm in exactly the right way:

retr0id, to random
@retr0id@retr0.id avatar

SIZEOF_CHAR (sizeof 'A')

amonakov,
@amonakov@mastodon.gamedev.place avatar
amonakov, to random
@amonakov@mastodon.gamedev.place avatar

I still can't figure out the intended use-case for AMD's IBS (instruction-based sampling). You select a period N, and then for each N'th instruction you get info about that particular instruction (in which caches it missed, whether it was a branch, whether it was mispredicted, ...). Which seems... completely unworkable for rare events? If I want to sample on mispredicted branches, and they account for 1% of all instructions, I'll have to discard 99% of IBS data, and my effective sampling rate is 0.01 of nominal?

amonakov, to random
@amonakov@mastodon.gamedev.place avatar

Fun fact: there are at least five distinct choices for type T (not counting typedefs) such that a C compiler targeting a POSIX system cannot optimize out the call to 'aux' in

void *alloc(T *psize)
{
    size_t sz = *psize;
    void *a = malloc(sz);
    if (sz != *psize)
        aux();
    return a;
}

mattst88, to gentoo
@mattst88@fosstodon.org avatar

Another day, another bizarre software discovery.

Apparently Gentoo's sys-apps/sandbox (which ensures ebuilds don't make a mess outside of their build "sandbox") had a huge performance regression which caused webkit-gtk build times to go from 9 minutes to 1 hour.

After collecting a ton of data, applying patches, reverting patches, etc, I filed https://bugs.gentoo.org/910273 and it seems we have a fix.

But I don't know how it's fixing things!

amonakov,
@amonakov@mastodon.gamedev.place avatar

@mattst88 Why guess when you can profile? ;)
Without AT_EACCESS in flags, faccessat really doesn't scale due to how access_override_creds works:
https://gist.github.com/amonakov/9281bba3974d931fe500eaad0369568c

The sandbox could probably use AT_EACCESS (which makes faccessat more efficient than fstatat).
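For reference, a minimal sketch of an access check that passes AT_EACCESS, so the check uses the effective IDs instead of the real ones. The wrapper name `can_access` is hypothetical and this is not the sandbox's actual code; how libc maps the call onto syscalls depends on the libc and kernel versions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if path is accessible with the given mode (F_OK, R_OK, ...)
   when checked against the effective user and group IDs, 0 otherwise. */
static int can_access(const char *path, int mode)
{
    assert(path != NULL);
    return faccessat(AT_FDCWD, path, mode, AT_EACCESS) == 0;
}
```

Usage would look like `can_access("/etc/passwd", R_OK)`; a relative path is resolved against the current directory because of AT_FDCWD.
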
