Microbenchmarking return address branch prediction (2018)

rep_lodsb · on June 27, 2024

>Special case: CALL +0 is not a call

There still seems to be a lot of documentation out there saying that you should never pop the return address, and should instead call a "proper" function that reads it from the stack before returning normally.

I wonder if, now that recent-ish processors treat CALL +0 as a special case, there is instead a performance bug when the return address is not popped, with code like this:

    do_something_twice:
        call  do_something_once
        ;fall through to run next piece of code again
    do_something_once:
        ;...
        ret

Neither of these uses of CALL would ever appear in compiler output, and they are probably not common in handwritten assembly either, especially from people who are aware of the "common wisdom" about how to get the instruction pointer. But there must have been at least one high-profile piece of code where this had a real performance impact, or Intel/AMD wouldn't have bothered to optimize this?
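To make the question concrete, here is a toy model of a return address stack (RAS) in C. It is only a sketch under loud assumptions: the "special case" is modelled as "a CALL whose target equals its return address is not pushed onto the RAS", an empty or mismatched RAS entry counts as one misprediction, and all names and addresses (`Model`, `get_ip_idiom`, `fall_through_twice`, `0x1005`, ...) are made up for illustration. Real hardware may behave differently, for example by falling back to the indirect predictor.

```c
#include <assert.h>
#include <stdbool.h>

#define DEPTH 16

/* Toy model: the architectural stack holds the real return addresses,
 * the RAS holds what the predictor believes they are. */
typedef struct {
    unsigned long arch[DEPTH]; int arch_top;
    unsigned long ras[DEPTH];  int ras_top;
    bool special_case_call0;   /* ASSUMPTION: CALL +0 is not pushed on the RAS */
    int mispredicts;
} Model;

static void do_call(Model *m, unsigned long target, unsigned long ret_addr) {
    assert(m->arch_top < DEPTH && m->ras_top < DEPTH);
    m->arch[m->arch_top++] = ret_addr;          /* CALL always pushes for real */
    if (m->special_case_call0 && target == ret_addr)
        return;                                 /* CALL +0: predictor ignores it */
    m->ras[m->ras_top++] = ret_addr;
}

/* POP of the return address: changes the real stack, invisible to the RAS. */
static void do_pop(Model *m) {
    assert(m->arch_top > 0);
    m->arch_top--;
}

static void do_ret(Model *m) {
    assert(m->arch_top > 0);
    unsigned long actual = m->arch[--m->arch_top];
    /* Empty RAS or wrong top entry is counted as a misprediction here. */
    if (m->ras_top == 0 || m->ras[--m->ras_top] != actual)
        m->mispredicts++;
}

/* call +0 / pop inside a function, followed by that function's ret. */
static int get_ip_idiom(bool special) {
    Model m = {0};
    m.special_case_call0 = special;
    do_call(&m, 0x1000, 0x100);    /* caller enters the function */
    do_call(&m, 0x1005, 0x1005);   /* call +0: target == return address */
    do_pop(&m);                    /* pop the pushed address */
    do_ret(&m);                    /* return to the caller */
    return m.mispredicts;
}

/* call to the immediately following label, then fall through into it. */
static int fall_through_twice(bool special) {
    Model m = {0};
    m.special_case_call0 = special;
    do_call(&m, 0x2000, 0x100);    /* caller enters do_something_twice */
    do_call(&m, 0x2005, 0x2005);   /* call do_something_once (also a CALL +0) */
    do_ret(&m);                    /* first ret: back to 0x2005 */
    do_ret(&m);                    /* second ret: back to the caller */
    return m.mispredicts;
}
```

In this model, calling `get_ip_idiom` and `fall_through_twice` with `false` and then `true` shows the trade-off: the special case removes the misprediction from the call/pop idiom, but moves the cost to the fall-through pattern, where both returns now miss. Whether real CPUs actually behave this way is exactly the open question.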

im3w1l · on July 3, 2024

Late reply, but I think that people who use a "proper function" don't usually put it immediately after the call - can you really call it a proper function if control flow falls through into it? That's certainly not how functions usually work. So I think people would either put the function somewhere entirely different, or do something like

    do_something_with_rip:
        call  .get_rip
        jmp .after_get_rip
    .get_rip:
        ;...
        ret
    .after_get_rip:
        ;...

Your example is kinda cute, but I'd guess too few people use that trick for it to be worth special-casing or optimizing for.

As for

    call .push_rip
    .push_rip:
    pop rax

I can see that appearing in old hand-written code from before the branch predictor implications became well known. Or perhaps it comes from people who know it's suboptimal but figure that it is fast enough (getting the rip is likely to be very cold code).

gpderetta · on June 27, 2024

In principle a CPU might use a meta-predictor to decide between the "normal" indirect predictor and the RAS predictor. It can help programs, like those written in Go, that use stackful coroutines, as the RAS will be out of sync after context switching deep into a call stack.

I wouldn't be surprised if this is already the case for the last few generations of CPUs, although I haven't tried to benchmark it.

edit: also good article, TIL that CALL +0 is handled specially.

camkego · on June 27, 2024

Too bad the source code from the blog post cannot be accessed. It returns “forbidden permission”.