I'm less concerned about it being 32-bit and more about them being exclusively s...

ryao · on Feb 10, 2025

The programming model is that all threads in the warp / thread block run the same instruction (barring masking for branch divergence). SIMD instructions at the thread level are a rarity, given that SIMD is implemented across warps / thread blocks (groups of warps). They do exist, but only within 32-bit words and really only for limited use cases, since the proper way to do SIMD on the GPU is to have all of the threads execute the same instruction:

https://docs.nvidia.com/cuda/parallel-thread-execution/index...

Note that I am using the Nvidia PTX documentation here. I have barely looked at the AMD RDNA documentation, so I cannot cite it without doing a bunch of reading.
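
To make "SIMD within a 32-bit word" concrete, here is a minimal host-side sketch in plain C++ (not GPU code) of what a packed per-byte add computes: four independent 8-bit lanes in one 32-bit register, each wrapping on overflow. The function name `add_bytes_wrap` is hypothetical and the masking trick is a generic software emulation, not how the hardware implements it. CUDA exposes this kind of operation through intrinsics such as `__vadd4`.

```cpp
#include <cstdint>

// Emulates a packed add of four unsigned 8-bit lanes held in one 32-bit word.
// Each lane wraps around independently; no carry leaks into the next lane.
// Hypothetical helper for illustration only.
std::uint32_t add_bytes_wrap(std::uint32_t a, std::uint32_t b) {
    const std::uint32_t low7 = 0x7F7F7F7Fu;  // lower 7 bits of every lane
    const std::uint32_t top  = 0x80808080u;  // top bit of every lane

    // Adding only the low 7 bits per lane cannot carry past bit 7 of that lane.
    std::uint32_t sum = (a & low7) + (b & low7);

    // Fold the top bits back in with XOR, which is addition without carry-out.
    return sum ^ ((a ^ b) & top);
}
```

For example, `add_bytes_wrap(0x01FF10F0u, 0x0101F020u)` yields `0x02000010`: the lanes are 0x01+0x01, 0xFF+0x01, 0x10+0xF0 and 0xF0+0x20, and the three lanes that overflow wrap without disturbing their neighbours.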

Agentlien · on Feb 10, 2025

I know all of that. I was talking about RDNA2, which is AMD. There, instructions come in two flavours:

1. Scalar - run once per thread group, only acting on shared memory. So these won't be SIMD.

2. Vector - run across all threads, with each thread accessing its own copy of the variables. This is what you typically think of GPU instructions as doing.

LegionMammal978 · on Feb 10, 2025

That does sound like it would be a pretty big limitation. But there appear to be plenty of vector instructions for 32-bit integers in RDNA2 and RDNA3 [0] [1]. They're named V_*_U32 or V_*_I32 (e.g., V_ADD3_U32), even including things like a widening multiply V_MAD_U64_U32. The only thing missing is integer division, which is apparently emulated using floating-point instructions.

[0] https://www.amd.com/content/dam/amd/en/documents/radeon-tech..., p. 259, Table 83, "VOP3A Opcodes"

[1] https://www.amd.com/content/dam/amd/en/documents/radeon-tech..., p. 160, Table 85, "VOP3 Opcodes"
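
As a rough per-lane illustration of the two points above, here is a host-side C++ sketch. It assumes V_MAD_U64_U32 computes a full 64-bit product of two 32-bit operands plus a 64-bit addend, which is how I read the name; check the ISA tables cited above for the exact semantics. The division routine is a generic "float reciprocal estimate plus integer correction" scheme, not the actual instruction sequence a compiler emits for RDNA. Both function names are hypothetical.

```cpp
#include <cstdint>

// Assumed per-lane semantics of a widening multiply-add:
// 32-bit x 32-bit -> 64-bit product, plus a 64-bit addend.
std::uint64_t mad_u64_u32(std::uint32_t a, std::uint32_t b, std::uint64_t c) {
    return static_cast<std::uint64_t>(a) * b + c;
}

// Unsigned 32-bit division built from a float reciprocal and integer fix-ups.
// Precondition: d != 0. Illustrative only; real sequences are more optimized.
std::uint32_t udiv_via_float(std::uint32_t n, std::uint32_t d) {
    const std::int64_t n64 = n;
    const std::int64_t d64 = d;
    const float rcp = 1.0f / static_cast<float>(d);

    // First estimate: float has a 24-bit significand, so this can be off.
    std::int64_t q = static_cast<std::int64_t>(static_cast<float>(n) * rcp);

    // One refinement step: estimate the quotient of the remaining error.
    std::int64_t r = n64 - q * d64;
    q += static_cast<std::int64_t>(static_cast<float>(r) * rcp);
    r = n64 - q * d64;

    // Exact integer correction guarantees 0 <= r < d on exit.
    while (r < 0)    { --q; r += d64; }
    while (r >= d64) { ++q; r -= d64; }
    return static_cast<std::uint32_t>(q);
}
```

The float estimate alone is not exact for large operands, which is why the integer remainder check at the end is needed; the widening multiply is what makes computing that remainder cheap.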