Exploring calling conventions with x86 assembly

hDeraj · on Sept 3, 2016

When calling a simple function like this within a large loop, would it make a noticeable difference in speed to inline the computation vs. having a function call? If so, what's the best practice for inlining a computation like this? I imagine a macro would be the simplest solution but I'm interested to hear any other techniques that are used

ordu · on Sept 4, 2016

> When calling a simple function like this within a large loop, would it make a noticeable difference in speed to inline the computation vs. having a function call?

Maybe. Inlining small functions can reduce cache load, and it means no call/ret instructions and no overhead of argument passing. Moreover, inlining allows further optimizations that can't be done while the function boundary is in the way. It may be noticeable, or it may not. It depends on the loop.

> what's the best practice for inlining a computation like this?

There are a lot of examples in the Linux kernel. Just a random one from include/linux/list.h:

  static inline void list_replace(struct list_head *old,
				struct list_head *new)
  {
	new->next = old->next;
	new->next->prev = new;
	new->prev = old->prev;
	new->prev->next = new;
  }

The 'static' keyword lets the compiler emit no callable (non-inlined) copy of the function at all, and it also makes it possible to define such functions in header files. The compiler can't inline a call if it doesn't have the function definition at compile time; a declaration is not enough for inlining. That is why such functions are usually defined in headers, where 'static' becomes necessary. When the function is defined in a *.c file, 'static' can be omitted, but it's probably better not to.

In C++ such functions will be methods in most cases, and (if so) "static" would be unneeded and wrong.

And you'll need to turn on optimizations when compiling. The compiler does not inline when it is not optimizing.
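A minimal sketch of the pattern described above, assuming the function in question is a two-argument minimum like the min() discussed later in the thread. The names min_int and array_min are made up for this illustration; in a real project min_int would sit in a header so every caller sees its definition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: 'static inline' so it can live in a header and
 * the compiler is free to emit no out-of-line copy at all. */
static inline int min_int(int a, int b)
{
	return a < b ? a : b;
}

/* Smallest element of a non-empty array. The call to min_int() is in
 * the loop body, which is where inlining would pay off. */
int array_min(const int *values, size_t count)
{
	assert(count > 0);

	int smallest = values[0];
	for (size_t i = 1; i < count; i++)
		smallest = min_int(smallest, values[i]);
	return smallest;
}
```

To see whether the call was actually inlined, one option is to compare the assembly from something like `gcc -O0 -S` and `gcc -O2 -S`: with optimization on, GCC and Clang will normally fold min_int into the loop, but that is the compiler's decision, not a guarantee.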

userbinator · on Sept 4, 2016

A call and a return is already two instructions, and this simple function is essentially also 2 (or 1 if you use a conditional move as illustrated). Passing the arguments and making the call alone takes more instructions than the function body itself. It'll be both smaller and faster, so no tradeoffs there. To me, this is clearly in the "yes, definitely inline it" category.

unsignedqword · on Sept 4, 2016

It can sometimes make a difference, but usually the compiler's optimizer does a good job of deciding whether or not a function should be inlined.

If you want you can nudge the compiler in the direction you want via the "inline" keyword, although the compiler won't always take this suggestion to heart. MSVC has "__forceinline" but it too will not always comply.

Before the "inline" keyword, macros were the standard way to do this, IIRC
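One reason an inline function is usually preferred over a macro is that a function-like macro pastes its argument text in more than once. The sketch below counts how often an argument with a side effect gets evaluated; MIN_MACRO, min_inline, and the helper names are all hypothetical, chosen only for this illustration.

```c
#include <assert.h>

/* Classic macro version: 'a' and 'b' each appear twice in the expansion. */
#define MIN_MACRO(a, b) ((a) < (b) ? (a) : (b))

/* Function version: each argument is evaluated exactly once. */
static inline int min_inline(int a, int b)
{
	return a < b ? a : b;
}

static int calls; /* how many times next_value() has run */

static int next_value(void)
{
	return ++calls;
}

/* Number of times the argument expression runs when passed to the macro. */
int evaluations_with_macro(void)
{
	calls = 0;
	(void)MIN_MACRO(next_value(), 10);
	return calls;
}

/* Same measurement for the inline function. */
int evaluations_with_inline(void)
{
	calls = 0;
	(void)min_inline(next_value(), 10);
	return calls;
}

/* The macro evaluates next_value() twice, the function once. */
void check_evaluation_counts(void)
{
	assert(evaluations_with_macro() == 2);
	assert(evaluations_with_inline() == 1);
}
```

The macro also has no argument types, so mistakes tend to show up later and with worse diagnostics than with the function.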

colejohnson66 · on Sept 4, 2016

There's something funny about a compiler being able to ignore something called "force inline"

gruez · on Sept 4, 2016

Sometimes it's not possible to inline functions, for example recursive calls

userbinator · on Sept 4, 2016

If you have any experience with writing Asm you'll see how rigid calling conventions are almost entirely an artificial construct of HLLs (and possibly an artifact of early compilers), and it's possible to do so much more and so much better --- which is partly the reason why Asm can be so fun. ;-)

When calling functions written in Asm from Asm, you get to decide exactly how to do it: Pass arguments in registers in any order, on the stack, a combination of both, directly following the call instruction itself[1], etc.; the limit is practically your imagination. You can choose the best way to pass arguments for each function instead of being forced into one suboptimal one for every function. Ditto for return values --- you can easily return multiple values, in different registers, and also make use of HLL-inaccessible "registers" like the flags (carry bit in particular is quite useful).

I think the PC BIOS / DOS API is a pretty nice calling convention, clearly designed for and by Asm programmers; all arguments are passed in registers and CF is used to indicate success/error. These compiler-imposed calling conventions like cdecl/stdcall/fastcall are just awfully inefficient in comparison because of how much memory access they require, especially when "fastcall" can only pass two arguments in registers.
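For comparison, the closest portable C analogue of "result in a register, success/error in CF" is probably returning a small struct. This is only a rough sketch: struct result and checked_div are hypothetical names, and whether the struct actually comes back in registers depends on the ABI (many 64-bit ABIs do that for a struct this small, but C itself promises nothing).

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical result type: 'value' plays the role of AX,
 * 'carry' the role of the carry flag. */
struct result {
	int value;
	bool carry; /* true means the call failed */
};

/* Division that reports failure through the flag instead of trapping. */
struct result checked_div(int num, int den)
{
	struct result r = { 0, true };

	if (den == 0 || (num == INT_MIN && den == -1))
		return r; /* would fault or overflow: signal an error */

	r.value = num / den;
	r.carry = false;
	return r;
}

/* Caller side: test the "flag" first, then use the value. */
void check_checked_div(void)
{
	struct result ok = checked_div(7, 2);
	assert(!ok.carry && ok.value == 3);

	struct result bad = checked_div(1, 0);
	assert(bad.carry);
}
```

Unlike the real carry flag, the error indicator here costs a byte of the return value and an explicit test, which is roughly the overhead the comment above is complaining about.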

Incidentally, these 3 examples are also great at showing how compilers can be so very stupid at code generation. Observe that in all 3 cases, the return value in eax after calling foo is written to memory --- then immediately read from memory again, into the exact same register. This is not something that should ever appear in human-written Asm, and I've actually made use of this fact in marking a course assignment: asked to manually "compile" a short function, some students cheated and used the compiler (with no optimisations, i.e. the defaults), and it was dead easy to recognise.

It's funny to see the entirely-register-based fastcall somehow still managing to generate 5 totally useless memory accesses. If I really wanted to write a fastcall min() function instead of just inlining it as I probably should, it'd be 4 lines:

    mov eax, ecx
    cmp edx, ecx
    cmovl eax, edx
    ret

Likewise, cdecl and stdcall (only differing in one instruction):

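    mov eax, [esp+4]      ; eax = first argument
    cmp eax, [esp+8]      ; compare it with the second argument
    cmovg eax, [esp+8]    ; take the second argument if it is smaller
    ret    ; ret 8 for stdcall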

... and seeing WINAPI in GAS/AT&T syntax just feels very very weird.

[1] Like this:

    call puts
    db "Hello world!", 0
    ; execution continues here

I believe it's not the fastest on modern CPUs, but it does save space and was a very common technique on 8-bit CPUs like 8080/8085/Z80. It's also a good way of confusing automatic disassemblers.

mmozeiko · on Sept 4, 2016

> Incidentally, these 3 examples are also great at showing how compilers can be so very stupid at code generation. Observe that in all 3 cases, the return value in eax after calling foo is written to memory --- then immediately read from memory again, into the exact same register.

That's because code in article is compiled without optimizations. When you enable optimizations the compiler will do the right thing: https://godbolt.org/g/Lc7giO

bjourne · on Sept 4, 2016

You certainly can do it differently but I really doubt you can do it better. :) A common piece of wisdom learned from the "calling convention wars" (there are many more than just cdecl, stdcall and fastcall) was that it just doesn't matter all that much. The same amount of work has to be done and the only thing that changes is whether the caller or the callee is the one doing it.

For example, if your convention mandates that the callee must preserve RAX-RDX, then it must push/pop those registers if it wants to use them. Which leads to redundant push/pops if the registers aren't in use by the caller. But if it is free to clobber them, then the caller must push/pop them even if the callee doesn't use them, leading to the exact same number of redundant push/pops!

Narishma · on Sept 4, 2016

By "doing it better", I don't believe parent is saying to create another "better" calling convention, but instead to use no convention at all.

userbinator · on Sept 4, 2016

Exactly. What I see from the "calling convention wars" is not that "it just doesn't matter all that much", it's that there is no single optimal convention in all cases. Some functions need to use more registers than others; some arguments are used very early in the function and their values are not needed after that (prefer these in a register), while others may be used later after a bunch of computation that needs many registers (these might be better staying on the stack.) Some instructions like multiply/divide require certain registers (does your function start with a multiply or divide and is one of the arguments the multiplicand/dividend? Use AX, EAX, or EDX:EAX for that one.)

The short examples in the article are illustrative of "used early and not needed afterwards" --- in cdecl/stdcall the caller writes the arguments into memory, only to have the callee immediately read them back again. Ignoring the extra memory accesses, even fastcall isn't optimal in this case --- it uses ECX and EDX when what's really needed is for one of the arguments to be in EAX since it may become the return value. In my "optimised" fastcall above, you can see I had to spend an extra mov instruction just to get the return value in the right place. It would be two instructions (cmp eax, ecx | cmovge eax, ecx) otherwise. All this useless data movement, for what? Just to conform to some arbitrary convention. These may be small things, but they can add up.

GoToRO · on Sept 4, 2016

so why does it allocate extra space? "A value of 16 is subtracted from esp." What's the purpose of that?

jmgao · on Sept 4, 2016

There are some SSE instructions that fault if they're used with memory operands that aren't 16-byte aligned. All of the ones I can think of have a version that supports unaligned access at a performance penalty, so it's basically just a choice by the ABI to require the stack to be 16-byte aligned at function call boundaries, so that functions don't have to verify that their stack frame has the proper alignment.
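A small sketch of what "16-byte aligned" means from the C side, using C11 alignas. The function names are hypothetical. The buffer here is the kind of local that an aligned SSE load or store would want; when the ABI already keeps the stack 16-byte aligned at calls, the compiler can place it with a fixed offset instead of realigning the stack pointer at run time.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stdint.h>

/* True if the address is a multiple of 16. */
bool is_aligned_16(const void *p)
{
	return ((uintptr_t)p % 16) == 0;
}

/* A local buffer the size of one SSE register (4 floats), with the
 * alignment requested explicitly. */
bool local_buffer_is_aligned(void)
{
	alignas(16) float lanes[4] = { 1.0f, 2.0f, 3.0f, 4.0f };

	assert(is_aligned_16(lanes));
	return is_aligned_16(lanes);
}
```

Without the alignas, a float array only has to be aligned for float, so its address may or may not be a multiple of 16.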

GoToRO · on Sept 4, 2016

apparently "stack alignment" is a thing.

ninjabeans · on Sept 4, 2016

How did he create that diff graphic?

Karliss · on Sept 4, 2016

From the colors it looks like he simply ran Meld on text files containing the assembly.