(prompted by discussion of detecting bitwise and-not earlier in GCC's optimization pipeline)
My ideal compiler IR would not have and/or/xor as distinct bitwise ops, just generic ternlog and probably the corresponding two-operand function ("bilog"?) too.
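A rough sketch of what such generic ops could look like, written as plain C rather than as IR. The function names `ternlog` and `bilog` and the 32-bit operand width are illustrative assumptions; the 8-bit table uses the same bit numbering as the x86 VPTERNLOG immediate (first operand is the most significant index bit).

```c
#include <stdint.h>

/* Generic three-input bitwise function. For every bit position k, the result
   bit is table bit number (a_k << 2 | b_k << 1 | c_k). */
uint32_t ternlog(uint32_t a, uint32_t b, uint32_t c, uint8_t table)
{
    uint32_t r = 0;
    for (unsigned idx = 0; idx < 8; idx++) {
        if (!((table >> idx) & 1))
            continue;
        /* Mask of positions where (a, b, c) equals the input pattern idx. */
        uint32_t ma = (idx & 4) ? a : ~a;
        uint32_t mb = (idx & 2) ? b : ~b;
        uint32_t mc = (idx & 1) ? c : ~c;
        r |= ma & mb & mc;
    }
    return r;
}

/* Two-input analogue ("bilog"): 4-bit table indexed by (a_k << 1 | b_k). */
uint32_t bilog(uint32_t a, uint32_t b, uint8_t table)
{
    uint32_t r = 0;
    for (unsigned idx = 0; idx < 4; idx++) {
        if (!((table >> idx) & 1))
            continue;
        r |= ((idx & 2) ? a : ~a) & ((idx & 1) ? b : ~b);
    }
    return r;
}
```

With this numbering, and is `bilog(a, b, 0x8)`, or is `0xE`, xor is `0x6`, and and-not (`a & ~b`) is `0x4`, so there is nothing special to detect: and-not is just another table value. Three-input xor is `ternlog(a, b, c, 0x96)` and bitwise select (`a ? b : c`) is `0xCA`.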
@amonakov @pervognsen It's just about how early in their development they were committed to having 3 source operands. Same with 2-source-reg-plus-index-reg shuffles. The generic crossbar network is usually already there: sometime past the 8th dedicated unpack/pack/shuffle pattern for different type sizes, it's easier to build the general crossbar and just supply constants for the index vector in the "canned shuffle" cases.
@rygorous @amonakov What about the increased RF/operand forwarding port pressure from three input operands? Don't GPU cores usually have some additional tricks they can play with RF ports due to their latency-tolerant design? Does this figure into the CPU vs GPU difference at all?
PSA for people writing Arm SIMD code in C or C++: unlike x86, where you can cast any pointer to __m128* and be able to dereference it regardless of the dynamic type of the pointed-to memory, that is not the case on Arm: Neon types do not carry the may_alias attribute and standard type compatibility rules apply. Compare the differences between 'f' and 'g' on the first pic, and Arm codegen on the second pic.
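One way to stay within the aliasing rules regardless of target is to load through memcpy instead of dereferencing a cast pointer. The sketch below uses a GCC/Clang generic vector type `v4f` as a stand-in for a Neon type such as float32x4_t (it likewise has no may_alias attribute); `v4f` and `load_v4f` are made-up names for illustration, not part of any header.

```c
#include <string.h>

/* Generic 4 x float vector (GCC/Clang extension). Like the Neon types, it
   carries no may_alias attribute, so *(v4f *)p is only valid when the memory
   at p really has a compatible effective type. */
typedef float v4f __attribute__((vector_size(16)));

/* Load 16 bytes from memory of any dynamic type. memcpy has no aliasing
   restrictions, and compilers normally lower it to one vector load. */
v4f load_v4f(const void *p)
{
    v4f v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

For example, `load_v4f` can be pointed at an array of uint32_t holding float bit patterns, where a cast-and-dereference would be undefined for a type without may_alias.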
In 'less', you can interactively add command-line arguments without leaving the pager by pressing '-': you can press '-S' to flip wrapping/chopping of long lines, and '-j11' to spawn 10 wor^W^W^W see ten extra lines of context above the match when searching!
shower thought re. rgb 565 vs. theoretical 555|1 (common least significant bit for each component): it introduces higher "distortion" for darker colors (e.g. the closest encodable color to dark purple { 1, 0, 1 }/64 is dark grey { 1, 1, 1 }/64), but our vision loses color sensitivity in low-light conditions anyway, so that probably would have been a better fit
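To make the rounding concrete, here is a small sketch of a 555|1 quantizer. The bit layout (r in bits 15..11, g in 10..6, b in 5..1, shared low bit in bit 0) and the names `rgb6`, `pack_5551`, `unpack_5551` are assumptions made for illustration. Since a component whose low bit disagrees with the shared bit moves by exactly one step, a majority vote over the three low bits gives the nearest encodable color.

```c
#include <stdint.h>

struct rgb6 { uint8_t r, g, b; };   /* each component in 0..63 */

/* Pack into the hypothetical layout r[15:11] g[10:6] b[5:1] shared_lsb[0]. */
uint16_t pack_5551(struct rgb6 c)
{
    /* Majority vote: at most one component ends up off by one step. */
    unsigned s = ((c.r & 1) + (c.g & 1) + (c.b & 1)) >= 2;
    return (uint16_t)((c.r >> 1) << 11 | (c.g >> 1) << 6 | (c.b >> 1) << 1 | s);
}

/* Expand back to 6 bits per component by appending the shared low bit. */
struct rgb6 unpack_5551(uint16_t p)
{
    unsigned s = p & 1;
    struct rgb6 c = {
        (uint8_t)(((p >> 11) & 31) << 1 | s),
        (uint8_t)(((p >> 6) & 31) << 1 | s),
        (uint8_t)(((p >> 1) & 31) << 1 | s),
    };
    return c;
}
```

Round-tripping { 1, 0, 1 } through this gives { 1, 1, 1 }, the dark grey from the example above, while any color whose three low bits already agree is reproduced exactly.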
With xz backdoor opening an RCE pathway, have you thought "hey, it would be nice if the sshd sub-process doing the key/cert parsing would not be able to fork/exec anything?" Ideally the only thing it should be able to do is read/write to already-open fds and die a peaceful death, right?
Now, this particular backdoor was embedded deep enough that it might be able to work around such privilege separation, but in general dropping privs for risky computations is an important part of defence-in-depth
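On Linux, the "read/write to already-open fds and die" model is roughly what seccomp strict mode provides. The following is a minimal sketch of the idea, not how sshd is structured: `run_confined` and the two demo callbacks are hypothetical names, and a real parser would also have to avoid allocating memory or doing anything else that needs a syscall after entering strict mode.

```c
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run parse(in_fd, out_fd) in a child confined by seccomp strict mode, where
   only read, write, exit and sigreturn are permitted. Returns the child's
   wait status, or -1 if fork/waitpid failed. */
int run_confined(void (*parse)(int, int), int in_fd, int out_fd)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(127);          /* seccomp unavailable: refuse to parse */
        parse(in_fd, out_fd);
        /* glibc's _exit() issues exit_group, which strict mode forbids,
           so use the raw exit syscall. */
        syscall(SYS_exit, 0);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}

/* Hypothetical well-behaved "parser": only writes to an open fd. */
void demo_good_parser(int in_fd, int out_fd)
{
    (void)in_fd;
    (void)!write(out_fd, "ok", 2);
}

/* Hypothetical compromised "parser": attempts any other syscall. */
void demo_bad_parser(int in_fd, int out_fd)
{
    (void)in_fd;
    syscall(SYS_getpid);         /* the kernel kills the child with SIGKILL here */
    (void)!write(out_fd, "no", 2);
}
```

With this setup the well-behaved callback exits normally, while the other one is killed at its first forbidden syscall, before it could reach fork or exec.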
And that reminds me of another scenario where we parse untrusted certificates: WPA2-Enterprise authentication. Venerable wpa_supplicant does have some privilege-separation code (which I believe is rarely enabled on Linux), but what iwd does is completely incomprehensible to me: they pass certs from the access point straight to the kernel keyring subsystem, using the kernel as a fancy SSL library. Any weakness in the involved kernel code is thus open for exploitation by rogue access points.
@wolf480pl @amonakov cross-compilation without qemu? (note: I'm including binfmt here)
For cross-install without qemu I guess you'd want something like Alpine, in fact pmbootstrap literally is a tool for a cross-install.
I still can't figure out the intended use-case for AMD's IBS (instruction-based sampling). You select a period N, and then for each N'th instruction you get info about that particular instruction (in which caches it missed, whether it was a branch, whether it was mispredicted, ...). Which seems... completely unworkable for rare events? If I want to sample on mispredicted branches, and they account for 1% of all instructions, I'll have to discard 99% of IBS data, and my effective sampling rate is 0.01 of nominal?
@amonakov (Only asking because if it's cycle based you'd expect mispredicts and cache misses to get higher weight with that kind of periodic sampling scheme.)
@pervognsen With back-end "Ops" IBS you have a choice of cycle-based or uop-based period, but the front-end "Fetch" IBS only does instruction-based sampling (but that's the side which sees the L1i and iTLB misses).
Fun fact: there are at least five distinct choices for type T (not counting typedefs) such that a C compiler targeting a POSIX system cannot optimize out the call to 'aux' in