> Eventually we'll do some instruction combining using this information (best place may be at entry to I$0 trace cache), or possibly at the rename stage
So much for "we will implement only the simplest instructions and u-op fusion will fix the performance".
This is why I'm very suspicious of this argument from RISC-V proponents.
As far as I understand, RISC-V proponents want "recommended" instruction sequences for compilers to emit, to avoid the situation where different RISC-V CPUs need different compilations. But if different RISC-V implementations have different fuseable instruction sequences, we will be in the dreadful situation where you need the exact "-mcpu" for decent performance, and binary packages will be far from optimal.
And such "conventions" are a bad idea, IMHO, for the same reason comments in code are: they cannot be checked by tools.
-target <triple>
The triple has the general format <arch><sub>-<vendor>-<sys>-<abi>, where:
arch = x86_64, i386, arm, thumb, mips, etc.
sub = for ex. on ARM: v5, v6m, v7a, v7m, etc.
vendor = pc, apple, nvidia, ibm, etc.
sys = none, linux, win32, darwin, cuda, etc.
abi = eabi, gnu, android, macho, elf, etc.
"arch", "sys" and "eabi" are irrelevant to the core performance. You can not run "arm" on "i386" at all, and "eabi" and "sys" don't affect command scheduling, u-ops fusing and other hardware "magic".
So, only "sub" is somewhat relevant and it is exactly what RISC-V should avoid, IMHO, and it doesn't with its reliance on things like u-op fusion (and not ISA itself) to achieve high-performance.
For example, performance on modern x86_64 doesn't gain a lot if code is compiled for "-march=skylake" instead of "-march=generic" (I remember times, when re-compiling for "i686" instead of "i386" had provided +10% of performance!).
If RISC-V performance is based on u-op fusing (and it is what RISC-V proponents says every time when RISC-V ISA is criticized for performance bottlenecks, like absence of conditional move or integer overflow detection) we will have situation, when "sub" becomes very important again.
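To make the conditional-move complaint concrete, here is the kind of loop people usually point to; a generic C sketch of mine with illustrative names, nothing RISC-V specific. With a conditional-move instruction the select compiles branch-free; without one the compiler emits a data-dependent branch that mispredicts badly on unsorted input, unless the hardware fuses a short branch with the following move.

    #include <stddef.h>
    #include <stdint.h>

    /* The select inside the loop is ideally a single cmov-style instruction;
     * otherwise it becomes an unpredictable branch (or relies on fusion). */
    int64_t clamped_sum(const int64_t *v, size_t n, int64_t hi)
    {
        int64_t s = 0;
        for (size_t i = 0; i < n; i++) {
            int64_t x = v[i];
            s += (x > hi) ? hi : x;    /* branch-free with cmov, a branch without */
        }
        return s;
    }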
Relying on "sub" for performance is OK for embedded use, as an embedded CPU and its firmware are tightly coupled anyway, but it is very bad for a general-purpose CPU. Which "sub" should the Debian build cluster use? And why?
It is always frustrating when you have put in the work to optimize code, and turn out to have pessimized it for the next chip over.
The extreme case is getting a 10x performance boost by using, e.g., POPCNT on one chip, and instead suffering a 10-100x pessimization on another because there POPCNT is trapped and emulated.
I'm not sure "the point" is a well-defined term in this context.
Are you guessing that the extension is optional specifically so that nobody will need to emulate things they can't afford to implement in hardware?
But trapping and emulating is explicitly allowed. Maybe it should be possible to ask at runtime whether an extension is emulated. Maybe it is? But I have not seen any way to tell. I guess a program could run it a thousand times and see how long it takes... It would be a serious nuisance to need to do that for each optimization, and then provide alternate implementations of algorithms that don't depend on the missing features.
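For what it's worth, the "run it many times and see how long it takes" idea could look roughly like the C sketch below. This is only a sketch under assumptions I'm making up: the binary was built with the relevant extension enabled, so __builtin_popcountll lowers to the hardware instruction, and the 10 ms threshold is arbitrary and machine-dependent.

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    /* Time a million popcounts; if the instruction is trapped and emulated,
     * each one costs an exception round-trip and this takes orders of
     * magnitude longer than the native case. */
    static double seconds_for_popcounts(uint64_t iters)
    {
        struct timespec t0, t1;
        volatile uint64_t sink = 0;    /* keep the loop from being optimized away */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (uint64_t i = 1; i <= iters; i++)
            sink += (uint64_t)__builtin_popcountll(i * 0x9e3779b97f4a7c15ull);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        (void)sink;
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    }

    int main(void)
    {
        double s = seconds_for_popcounts(1000000);
        printf("1e6 popcounts took %.3f ms -> %s\n", s * 1e3,
               s > 0.01 ? "probably emulated" : "probably native");
        return 0;
    }

And that only answers the question for one instruction, on one core, after the program has already started.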
This is why leaving popcount out of the core instruction set is such a nuisance. It is cheap in hardware, and very slow to emulate.
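For scale, this is roughly what a software fallback has to do when the instruction isn't there; the standard SWAR bit-counting trick, not any particular trap handler's code. It is a couple of dozen ALU operations, plus a full exception round-trip if it runs inside a trap handler, versus a single instruction in hardware.

    #include <stdint.h>

    /* Portable 64-bit popcount: count bits in pairs, then nibbles, then
     * bytes, and finally gather the byte sums with a multiply. */
    static unsigned soft_popcount64(uint64_t x)
    {
        x = x - ((x >> 1) & 0x5555555555555555ull);
        x = (x & 0x3333333333333333ull) + ((x >> 2) & 0x3333333333333333ull);
        x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0full;
        return (unsigned)((x * 0x0101010101010101ull) >> 56);
    }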
On organically-evolved ISAs, there are about N variants that correspond to releases. You can decide what is the oldest variant M you want to support, and use anything that is implemented in targets >=M; and the number of <M machines declines more or less exponentially with time.
With RISC-V, there are instead N=2^V variants, at all times, increasing exponentially with time. You too frequently don't know if your program might need to run on one that lacks feature X. So you (1) arbitrarily fail on an unknown fraction of targets, (2) fail on some and run badly on some others, with instructions you relied on for optimization instead emulated very slowly, (3) run non-optimally on all targets, or (4) have variant versions of (parts of) your program configured to substitute at runtime, for each of K features that might be missing. None of these choices is tenable.
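Option (4) in practice ends up looking something like the sketch below; the names are illustrative and the probe is a placeholder, because as noted there is no architectural way to ask whether an extension is native or trap-emulated. Now multiply this by K features and by every hot routine that cares about any of them.

    #include <stdbool.h>
    #include <stdint.h>

    static unsigned popcount_hw(uint64_t x)
    {
        return (unsigned)__builtin_popcountll(x);   /* native instruction, if built for it */
    }

    static unsigned popcount_sw(uint64_t x)
    {
        unsigned n = 0;
        while (x) { x &= x - 1; n++; }              /* portable fallback, no special instructions */
        return n;
    }

    /* Placeholder probe: in reality this would be a timing test or an
     * OS/firmware query, neither of which the ISA itself provides. */
    static bool have_fast_popcount(void) { return false; }

    /* One pointer per feature-sensitive routine; start with the safe fallback. */
    static unsigned (*popcount_impl)(uint64_t) = popcount_sw;

    void init_dispatch(void)
    {
        if (have_fast_popcount())
            popcount_impl = popcount_hw;
    }

    unsigned popcount(uint64_t x) { return popcount_impl(x); }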
The notion of "profiles" appears to be meant to reduce the impact of this problem, but it makes things even more complicated.