Addressing the adding situation (xania.org)
128 points by messe 3 hours ago | 33 comments
pansa2 2 hours ago | parent | next [-]

> Using `lea` […] is useful if both of the operands are still needed later on in other calculations (as it leaves them unchanged)

As well as making it possible to preserve the values of both operands, it’s also occasionally useful to use `lea` instead of `add` because it preserves the CPU flags.

andrepd an hour ago | parent [-]

Funny to see a comment on HN raising this exact point, when just ~2 hours ago I was writing inline asm that used `lea` precisely to preserve the carry flag before a jump table! :)

MYEUHD 22 minutes ago | parent [-]

I'm curious, what are you working on that requires writing inline assembly?

veltas 4 minutes ago | parent [-]

I'm not them but whenever I've used it it's been for arch specific features like adding a debug breakpoint, synchronization, using system registers, etc.

Never for performance. If I wanted to hand-optimise code I'd be more likely to use SIMD intrinsics, play with the C until the compiler does the right thing, or write the entire function in a separate asm file for better highlighting and easier handling of state at the ABI boundary, rather than mid-function like the carry flag mentioned above.

f311a 8 minutes ago | parent | prev | next [-]

What are the current best resources to learn assembly? So that I can understand the output of simple functions. I don't want to learn to write it properly, I just want to be able to understand what's happening.

greatgib an hour ago | parent | prev | next [-]

As a side note of appreciation: I don't think you can do better than what he did here, being transparent that an LLM was used, but only for the proof-reading.

Thorrez 2 hours ago | parent | prev | next [-]

>However, in this case it doesn’t matter; those top bits5 are discarded when the result is written to the 32-bit eax.

>Those top bits should be zero, as the ABI requires it: the compiler relies on this here. Try editing the example above to pass and return longs to compare.

Sorry, I don't understand. How could the compiler both discard the top bits, and also rely on the top bits being zero? If it's discarding the top bits, it won't matter whether the top bits are zero or not, so it's not relying on that.

201984 an hour ago | parent | next [-]

He's actually wrong on the ABI requiring the top bits to be 0. It only requires that the bottom 32 bits match the parameter, but the top bits of a 32-bit parameter passed in a 64-bit register can be anything (at least on Linux).

You can see that in this godbolt example: https://godbolt.org/z/M1ze74Gh6

The reason the code in his post works is that the upper 32 bits of the parameters going into an addition can't affect the low 32 bits of the result, and he's only storing the low 32 bits.

fweimer an hour ago | parent | next [-]

The LLVM x86-64 ABI requires the top bits to be zero. GCC treats them as undefined. Until a recent clarification, the x86-64 psABI made the upper bits undefined by omission only, which is why I think most people followed the GCC interpretation.

https://github.com/llvm/llvm-project/issues/12579 https://groups.google.com/g/x86-64-abi/c/h7FFh30oS3s/m/Gksan... https://gitlab.com/x86-psABIs/x86-64-ABI/-/merge_requests/61

jfindper 13 minutes ago | parent | prev [-]

There is something fun about using godbolt.org to say that Matt Godbolt is wrong.

Joker_vD 2 hours ago | parent | prev [-]

(Almost) any instruction on x64 that writes to a 32-bit register as its destination writes the lower 32 bits of the value into the lower 32 bits of the full 64-bit register and zeroes out the upper 32 bits. He touched on this in his previous note, "why xor eax, eax".

But the funny thing is, the x64-specific supplement for SysV ABI doesn't actually specify whether the top bits should be zeroes or not (and so, if the compiler could rely on e.g. function returning ints to have upper 32 bits zeroes, or those could be garbage), and historically GCC and Clang diverged in their behaviour.

sethops1 2 hours ago | parent | prev | next [-]

This guy is tricking us into learning assembly! Get 'em!!

miningape 2 hours ago | parent | prev | next [-]

Loving this series! I'm currently implementing a Z80 emulator (Game Boy) and it's my first real introduction to CISC; it's really pushing my assembly / machine code skills, so having these blog posts come from the "other direction" is really interesting and gives me some good context.

I've implemented toy languages and bytecode compilers/vms before but seeing it from a professional perspective is just fascinating.

That being said it was totally unexpected to find out we can use "addresses" for addition on x86.

Joker_vD 2 hours ago | parent [-]

A seasoned C programmer knows that "&arr[index]" is really just "arr + index" :) So in a sense, the optimizer rewrote "x + y" into "(int)&(((char*)x)[y])", which looks scarier in C, I admit.

crote 2 hours ago | parent [-]

The horrifying side effect of this is that "arr[idx]" is equal to "idx[arr]", so "5[arr]" is just as valid as "arr[5]".

Your colleagues would probably prefer if you forget this.

Joker_vD an hour ago | parent | next [-]

> so "5[arr]" is just as valid as "arr[5]"

This is, I am sure, one of the stupid legacy reasons we still write "lr a0, 4(a1)" instead of the more sensible "lr a0, a1[4]". The other is that FORTRAN used round parentheses for both array access and function calls, so that stuck somehow too.

miningape an hour ago | parent | prev | next [-]

Mom, please come pick me up. These kids are scaring me.

rocqua an hour ago | parent | prev [-]

That depends on sizeof(*arr) no?

unwind an hour ago | parent | next [-]

Not in C, no, since arithmetic on a pointer is implicitly scaled by the size of the value being pointed at (this statement is kind of breaking the abstraction... oh well).

messe 19 minutes ago | parent | prev [-]

Nope, a[b] is equivalent to *(a + b) regardless of a and b.

sureglymop 13 minutes ago | parent [-]

Given that, why don't we use just `*(a + b)` everywhere?

Wouldn't that be more verbose and less confusing? (genuinely asking)

dist-epoch an hour ago | parent | prev | next [-]

Do we really know that LEA is using the hardware memory address computation units? What if the CPU frontend just redirects it to the standard integer add units/execution ports? What if the hardware memory address units use those too?

It would be weird to have 2 sets of different adders.

adrian_b 39 minutes ago | parent [-]

The modern Intel/AMD CPUs have distinct ALUs (arithmetic-logic units, where additions and other integer operations are done; usually between 4 ALUs and 8 ALUs in recent CPUs) and AGUs (address generation units, where the complex addressing modes used in load/store/LEA are computed; usually 3 to 5 AGUs in recent CPUs).

Modern CPUs can execute up to 6 to 10 instructions per clock cycle, and up to 3 to 5 of those may be load and store instructions.

So they have a set of execution units that allow the concurrent execution of a typical mix of instructions. Because a large fraction of the instructions generate load or store micro-operations, there are dedicated units for address computation, to not interfere with other concurrent operations.

dist-epoch 32 minutes ago | parent [-]

But can the frontend direct these computations based on what's available? If it sees 10 LEA instructions in a row, and it has 5 AGU units, can it dispatch 5 of those LEA instructions to other ALUs?

Or is it guaranteed that a LEA instruction will always execute on an AGU, and an ADD instruction always on an ALU?

secondcoming an hour ago | parent | prev | next [-]

The confusing thing about LEA is that the source operands are within a '[]' block which makes it look like a memory access.

I'd love to know why that is.

I think the calculation is also done during instruction decode rather than on the ALU, but I could be wrong about that.

pwg 44 minutes ago | parent | next [-]

It (LEA) does all the work of a memory access (the address computation part) without actually performing the memory access.

Instead of reading from memory at "computed address value" it returns "computed address value" to you to use elsewhere.

The intent was likely to compute the address values for MOVS/MOVSB/MOVSW/MOVSD/MOVSQ when setting up a REP MOVS (or other repeated string operation). But it turned out they were useful for doing three operand adds as well.

trollbridge an hour ago | parent | prev [-]

LEA is the equivalent of & in C. It gives you the address of something.

Fun question: what does the last line of this do?

    MOV BP,12
    LEA AX,[BP]
    MOV BX,34
    LEA AX,BX

hota_mazi 9 minutes ago | parent [-]

I think OP was just making a comment on the asymmetry of the syntax. Brackets [] are usually used to dereference.

Why is this written

    lea eax, [rdi + rsi]
instead of just

    lea eax, rdi + rsi

?
Joker_vD 2 hours ago | parent | prev [-]

Honestly, x86 is not nearly as CISC as those go. It just has somewhat developed addressing modes compared to the utterly anemic "register plus constant offset" one, and you are allowed to fold some load-arithmetic-store combinations into a single instruction. But that's it: no double- or triple-indexing or anything like what the VAXen had.

    BINOP   disp(rd1+rd2 shl #N), rs

        vs.

    SHL     rTMP1, rd2, #N
    ADD     rTMP1, rTMP1, rd1
    LOAD    rTMP2, disp(rTMP1)
    BINOP   rTMP2, rTMP2, rs
    STORE   disp(rTMP1), rTMP2
And all it really takes to support this is just adding a second (smaller) ALU on your chip to do addressing calculations.
rocqua an hour ago | parent | next [-]

There's also a lot of specialized instructions like AES ones.

But the main thing that makes x86 CISC to me is not the actual instruction set, but the byte encoding, and the complexity there.

201984 an hour ago | parent | next [-]

The classic distinction is that a CISC has data processing instructions with memory operands, and in a RISC they only take register parameters. This gets fuzzy though when you look at AArch64 atomic instructions like ldadd which do read-modify-write all in a single instruction.

Joker_vD an hour ago | parent | prev [-]

Eh, that's really just a side effect of almost 50 years of constant evolution from an 8-bit microprocessor. Take a look at the VAX [0], for instance: its instruction encoding is pretty clean, yet it's an actual example of a CISC ISA that was impossible to speed up, literally: DEC engineers tried very hard and concluded that a truly pipelined & super-scalar implementation was basically impossible, so DEC had to move to Alpha. See [1] for more from John Mashey.

Edit: the very, very compressed TL;DR is that if you do only one memory load (or one memory load plus a store back into that exact location) per instruction, it scales fine. But the moment you start doing chained loads, with pre- and post-increments that are supposed to write changed values back into memory and be visible, with several memory sources, and with a memory model of actual "strong consistency", well, you're in a world of pain.

[0] https://minnie.tuhs.org/CompArch/Resources/webext3.pdf

[1] https://yarchive.net/comp/vax.html

andrepd an hour ago | parent | prev [-]

Would this matter for performance? You already have so many execution units that are actually difficult to keep fully fed even when decoding instructions and data at the speed of cache.