Is there a good reason #LLVM targeting #AArch64 doesn't seem to fold shifts into... - LLVM

chandlerc, 4 months ago
Is there a good reason #LLVM targeting #AArch64 doesn't seem to fold shifts into operands when it would require shifting in multiple operands?

I'm seeing lots of:
lsr xN, xN, #7  
and x?, x?, xN  
...  
and x?, x?, xN  
With no other uses of xN.

Is there a reason to prefer this over:
and x?, x?, xN, lsr #7  
...  
and x?, x?, xN, lsr #7  
While "duplicated", it seems like it would save an instruction at least in decode?
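
For reference, the two sequences above compute the same values; only the instruction count differs (three instructions hoisted versus two folded, for two uses). A minimal Python sketch of that equivalence follows. The helper names `lsr`, `hoisted_form`, and `folded_form` are hypothetical, and X registers are modelled as 64-bit unsigned integers:

```python
MASK64 = (1 << 64) - 1  # model X registers as 64-bit unsigned values


def lsr(x, amount):
    """Logical shift right on a 64-bit value."""
    return (x & MASK64) >> amount


def hoisted_form(a, b, x):
    """Models: lsr xN, xN, #7 once, then two plain ANDs that reuse the result."""
    s = lsr(x, 7)
    return a & s, b & s


def folded_form(a, b, x):
    """Models: two ANDs, each with its own 'xN, lsr #7' shifted operand."""
    return a & lsr(x, 7), b & lsr(x, 7)
```

Both functions return the same pair for any inputs, so the choice between them is purely a question of cost, not of semantics.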

steve, 4 months ago

@chandlerc @TomF not sure it’s “the” reason, but a lot of arm64 designs will crack those shifted ands into 2 uOps, so you save a uOp by pulling it out.
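
The uOp arithmetic behind this can be sketched as follows. This is an illustration under the stated assumption that a shifted-operand AND cracks into 2 uOps; the function names and the default value are hypothetical, and real designs differ:

```python
def uops_hoisted(uses):
    """One standalone shift plus one plain (unshifted) AND per use."""
    return 1 + uses


def uops_folded(uses, uops_per_shifted_op=2):
    """Each use carries its own shifted operand.

    uops_per_shifted_op=2 is an assumption: the case where the design
    cracks a shifted AND into a shift uOp and an AND uOp.
    """
    return uses * uops_per_shifted_op
```

With two uses, the hoisted form costs 3 uOps against 4 for the folded form, and the gap grows by one for every further use. With a single use the two tie at 2 uOps, and on a design that does not crack the op (`uops_per_shifted_op=1`) the folded form is never worse.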

steve, 4 months ago

@chandlerc @TomF it’s also pretty common for a design to allow “simple” ALU ops on every pipe and “complex” ops on only a subset, which would also tip the balance (see simple vs complex address generation on some x86 uArches, for example).
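
A rough way to see how pipe restrictions tip the balance is a throughput lower bound. The sketch below is hypothetical: the pipe counts are made up for illustration, dependencies between ops are ignored, and it assumes a standalone shift counts as a "simple" op while a shifted-operand AND counts as "complex":

```python
from math import ceil


def min_issue_cycles(simple_ops, complex_ops, all_pipes=4, complex_pipes=1):
    """Lower bound on cycles needed to issue a set of independent ops.

    Simple ops may issue on any pipe; complex ops only on `complex_pipes`
    of them. Both pipe counts are illustrative assumptions.
    """
    total = simple_ops + complex_ops
    return max(ceil(total / all_pipes), ceil(complex_ops / complex_pipes))
```

Under these assumptions, four folded ANDs (`min_issue_cycles(0, 4)`) are bound by the single complex pipe, while one shift plus four plain ANDs (`min_issue_cycles(5, 0)`) can spread across all pipes.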

TomF, 4 months ago

@steve @chandlerc The "free" shifter is one of those perfect examples where what was a GREAT idea on one implementation of the arch turns out to be a TERRIBLE idea on later ones.

Other examples are MIPS branch-delay slot, SPARC sliding register window, and every load-link/store-conditional implementation ever.

chandlerc, 4 months ago

@TomF @steve Totally makes sense.

So from a performance perspective, it seems reasonable to think of this purely as an encoding density hack with no real benefit once decoded compared to normal shifts?

(This isn't a case where I have any data or evidence that says anything to the contrary, I was just noticing it in the compiler output and wondered what was up, so appreciate the pointers.)

steve, 4 months ago

@chandlerc @TomF yep, that’s exactly right

steve, 4 months ago

@chandlerc @TomF at least some designs special-case left-shift by 1-3, however, since those come up all the time in addressing, and handle them like an unshifted op.

chandlerc, 4 months ago

@steve @TomF Ooof, that makes perfect sense, but it also makes me really want that aspect of any u-arch to be clearly documented, given that the compiler needs a pretty sharply different strategy to generate efficient code there.

Maybe LLVM already has this info? I've not seen surprising stuff in more "normal" addressing instruction sequences so far.

TomF, 4 months ago

@chandlerc @steve So... while this is a reasonable request from a "control ALL the things" perspective, just be aware that (a) there's a huge variety of uarchs out there, (b) they are far more complex internally than you probably think, and (c) 99.9999% of the time this will not affect your performance in any measurable way :-)

chandlerc, 4 months ago

@TomF @steve I mean, I'm somewhat aware of the diversity of uarchs out there... And I don't really want more knobs in the compiler. I hate them.

But I'm specifically saying that thresholds where encoding A vs. encoding B results in 2 vs. 1 uop seem very important to document and teach compilers about. Not every other difference. =D Nicer to not have them at all, but if they exist, we need to know? And this doesn't seem like a terribly frustrating threshold to model.
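
As a sketch of how small such a threshold model could be, the following combines the points raised in this thread: shifted operands crack into 2 uops, except left shifts by 1-3, which are handled like an unshifted op. Everything here is hypothetical; `ShiftCostModel`, its fields, and `should_fold` are illustrative names, not LLVM's API, and the numbers are the thread's examples rather than data for any specific core:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShiftCostModel:
    """Hypothetical per-target description of shifted-operand cost."""
    cheap_lsl_max: int  # largest LSL amount handled like an unshifted op
    cracked_uops: int   # uops for any other shifted-operand ALU op

    def uops(self, kind, amount):
        """Uops for one ALU op whose register operand is shifted."""
        if amount == 0 or (kind == "lsl" and amount <= self.cheap_lsl_max):
            return 1
        return self.cracked_uops


def should_fold(model, kind, amount, uses):
    """Fold the shift into each use unless hoisting it costs fewer uops.

    Ties go to folding, since it needs one fewer instruction.
    """
    folded = uses * model.uops(kind, amount)
    hoisted = 1 + uses  # one standalone shift plus one plain op per use
    return folded <= hoisted
```

With `ShiftCostModel(cheap_lsl_max=3, cracked_uops=2)`, an `lsl #2` operand is folded regardless of the number of uses, while the `lsr #7` from the original question is folded only when it has a single use.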

chandlerc, 4 months ago

@TomF @steve I'm much more salty about the uarch thresholds of "only N instructions fitting criteria X within each aligned 32-byte encoded sequence" on Intel CPUs which swing perf by 10% - 20% and are nigh impossible to model even when they are documented...

TomF, 4 months ago

@steve @chandlerc Haha - "lea" is such an ugly weird little instruction, but it turns out it's so annoyingly useful it sneaks into every arch :-)
