If you have any experience with writing Asm you'll see how rigid calling conventions are almost entirely an artificial construct of HLLs (and possibly an artifact of early compilers), and it's possible to do so much more and so much better --- which is partly the reason why Asm can be so fun. ;-)
When calling functions written in Asm from Asm, you get to decide exactly how to do it: Pass arguments in registers in any order, on the stack, a combination of both, directly following the call instruction itself[1], etc.; the limit is practically your imagination. You can choose the best way to pass arguments for each function instead of being forced into one suboptimal one for every function. Ditto for return values --- you can easily return multiple values, in different registers, and also make use of HLL-inaccessible "registers" like the flags (carry bit in particular is quite useful).
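To make the flags point concrete, here is a rough C sketch of what an HLL interface has to do to imitate "result in a register, status in CF": the status has to travel as ordinary data, in this case packed into a small struct. The names `add_result` and `checked_add` are made up for illustration, and the wraparound test merely stands in for the carry bit.

```c
#include <stdbool.h>

/* Hypothetical result type: in Asm this would simply be EAX plus CF. */
struct add_result {
    unsigned value;  /* the sum, reduced modulo 2^N */
    bool carry;      /* what CF would hold after an ADD */
};

/* Unsigned add that also reports the carry-out. */
struct add_result checked_add(unsigned a, unsigned b)
{
    struct add_result r;
    r.value = a + b;        /* unsigned wraparound is well defined in C */
    r.carry = r.value < a;  /* the sum wrapped iff a carry occurred */
    return r;
}
```

Whether the struct comes back in registers or through a hidden pointer is up to the ABI, which is exactly the kind of decision the Asm programmer gets to make per function.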
I think the PC BIOS / DOS API is a pretty nice calling convention, clearly designed for and by Asm programmers; all arguments are passed in registers and CF is used to indicate success/error. These compiler-imposed calling conventions like cdecl/stdcall/fastcall are just awfully inefficient in comparison because of how much memory access they require, especially when "fastcall" can only pass two arguments in registers.
Incidentally, these 3 examples are also great at showing how compilers can be so very stupid at code generation. Observe that in all 3 cases, the return value in eax after calling foo is written to memory --- then immediately read from memory again, into the exact same register. This is not something that should ever appear in human-written Asm, and I've actually made use of this fact in marking a course assignment: asked to manually "compile" a short function, some students cheated and used the compiler (with no optimisations, i.e. the defaults), and it was dead easy to recognise.
It's funny to see the entirely-register-based fastcall somehow still managing to generate 5 totally useless memory accesses. If I really wanted to write a fastcall min() function instead of just inlining it as I probably should, it'd be 4 lines:
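    mov   eax, ecx      ; eax = a (fastcall passes the first argument in ecx)
    cmp   edx, ecx      ; compare b (edx) with a
    cmovl eax, edx      ; if b < a (signed), eax = b
    ret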
Likewise, cdecl and stdcall (only differing in one instruction):
    mov   eax, [esp+4]  ; eax = a
    cmp   eax, [esp+8]  ; compare a with b
    cmovg eax, [esp+8]  ; if a > b (signed), eax = b
    ret                 ; ret 8 for stdcall
... and seeing WINAPI in GAS/AT&T syntax just feels very very weird.
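As a sanity check on the hand-written min() versions above, this is roughly the C they correspond to. The name `min_ref` is just a placeholder, and the register comments assume the fastcall layout (a in ecx, b in edx); a compiler is free to emit something else for the same source.

```c
/* Reference semantics for the hand-written min(): a signed compare
   followed by a conditional replacement of the result. */
int min_ref(int a, int b)
{
    int result = a;   /* mov   eax, ecx */
    if (b < a)        /* cmp   edx, ecx */
        result = b;   /* cmovl eax, edx */
    return result;    /* ret */
}
```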
[1] Like this:
    call puts
    db "Hello world!", 0
    ; execution continues here
I believe it's not the fastest on modern CPUs, but it does save space and was a very common technique on 8-bit CPUs like 8080/8085/Z80. It's also a good way of confusing automatic disassemblers.
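For anyone who hasn't seen the trick, here is a toy model in C of what the callee does. This is a sketch under assumptions, not real calling-convention code: `*ret` stands in for the return address that CALL pushed, and `fetch_inline_string` is a hypothetical name. The callee treats the return address as a pointer to its argument, then moves it past the terminating 0 so that execution would resume after the data.

```c
#include <string.h>

/* Toy model: *ret plays the role of the return address on the stack.
   Returns the inline string and advances *ret to the first byte after
   its terminating NUL, which is where execution would continue. */
const char *fetch_inline_string(const unsigned char **ret)
{
    const char *s = (const char *)*ret;  /* argument starts at the return address */
    *ret += strlen(s) + 1;               /* skip the string and its 0 byte */
    return s;
}
```

In real Asm the callee would pop or patch the return address in place; the pointer-to-pointer parameter here only imitates that.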
> Incidentally, these 3 examples are also great at showing how compilers can be so very stupid at code generation. Observe that in all 3 cases, the return value in eax after calling foo is written to memory --- then immediately read from memory again, into the exact same register.
That's because the code in the article is compiled without optimizations. When you enable optimizations, the compiler does the right thing: https://godbolt.org/g/Lc7giO
You certainly can do it differently but I really doubt you can do it better. :) The common wisdom learned from the "calling convention wars" (there are many more than just cdecl, stdcall and fastcall) was that it just doesn't matter all that much. The same amount of work has to be done, and the only thing that changes is whether the caller or the callee is the one doing it.
For example, if your convention mandates that the callee must preserve RAX-RDX, then the callee must push/pop those registers if it wants to use them, which leads to redundant push/pops whenever the caller isn't using them. But if the callee is free to clobber them, then the caller must push/pop the ones it cares about even if the callee never touches them, leading to the exact same number of redundant push/pops!
Exactly. What I see from the "calling convention wars" is not that "it just doesn't matter all that much", it's that there is no single optimal convention in all cases. Some functions need to use more registers than others; some arguments are used very early in the function and their values are not needed after that (prefer these in a register), while others may be used later after a bunch of computation that needs many registers (these might be better staying on the stack.) Some instructions like multiply/divide require certain registers (does your function start with a multiply or divide and is one of the arguments the multiplicand/dividend? Use AX, EAX, or EDX:EAX for that one.)
The short examples in the article are illustrative of "used early and not needed afterwards" --- in cdecl/stdcall the caller writes the arguments into memory, only to have the callee immediately read them back again. Ignoring the extra memory accesses, even fastcall isn't optimal in this case --- it uses ECX and EDX when what's really needed is for one of the arguments to be in EAX since it may become the return value. In my "optimised" fastcall above, you can see I had to spend an extra mov instruction just to get the return value in the right place. It would be two instructions (cmp eax, ecx | cmovge eax, ecx) otherwise. All this useless data movement, for what? Just to conform to some arbitrary convention. These may be small things, but they can add up.