
RDNA3 CUs do not have FP8 support, and their INT8 path runs at the same speed as FP16, so Strix Halo's max theoretical throughput is basically 60 TFLOPS no matter how you slice it (well, it does have double-rate INT4, but I'm unclear on how generally useful that is):

    512 ops/clock/CU * 40 CU * 2.9e9 clock / 1e12 = 59.392 FP16 TFLOPS
Note, even with all my latest manual compilation tweaks and the latest TheRock ROCm builds, the best I've gotten mamf-finder to is about 35 TFLOPS. That's still not amazing efficiency, roughly 60% of peak (most Nvidia cards hit 70-80%), although it's a huge improvement over the single-digit TFLOPS you might get out of the box.
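For anyone who wants to sanity-check the arithmetic, here's the peak and efficiency calculation spelled out (all inputs are the numbers quoted above):

```python
# Back-of-the-envelope check: RDNA3 peak FP16 throughput and achieved
# matmul efficiency, using the figures from the comment above.

OPS_PER_CLOCK_PER_CU = 512   # FP16 ops/clock/CU on RDNA3
NUM_CUS = 40                 # Strix Halo
CLOCK_HZ = 2.9e9             # boost clock

peak_tflops = OPS_PER_CLOCK_PER_CU * NUM_CUS * CLOCK_HZ / 1e12
achieved_tflops = 35.0       # best mamf-finder result so far

efficiency = achieved_tflops / peak_tflops
print(f"peak: {peak_tflops:.3f} TFLOPS, efficiency: {efficiency:.0%}")
# → peak: 59.392 TFLOPS, efficiency: 59%
```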

If you're not training, your inference speed will largely be limited by available memory bandwidth, so the Spark token generation will be about the same as the 395.
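The bandwidth-bound reasoning can be sketched as a one-liner: during decode, every generated token has to stream all the active model weights through memory once, so tokens/s is capped at bandwidth divided by model size. The ~256 GB/s bandwidth figure and the example model size below are my assumptions for illustration, not measurements:

```python
# Rough upper bound for memory-bound token generation:
# tokens/s <= memory bandwidth / bytes of weights read per token.

def decode_tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
    """Bandwidth-bound ceiling on decode speed (ignores compute, KV cache)."""
    return bandwidth_gb_s / model_gb

# Assumed: ~256 GB/s for the 395's LPDDR5X, a ~40 GB quantized model.
print(f"{decode_tokens_per_sec(256, 40):.1f} tok/s ceiling")
# → 6.4 tok/s ceiling
```

Since the Spark and the 395 have similar memory bandwidth, this ceiling, not raw TFLOPS, is what makes their token generation comparable.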

On general utility, I will say that the 16 Zen 5 cores are impressive: they beat my 24C EPYC 9274F in both single- and multithreaded workloads by about 25%.


