>The reason people do potentially more expensive mix/lerps is because while it might cost a tiny overhead, they are scared of making it a branch.
And the reason for that is the confusing documentation from NVidia and its cg/CUDA compilers. I believe they did not want to scare programmers at first and hid the execution model, talking about "threads" and then they kept using that abstraction to hype up their GPUs ("it has 100500 CUDA threads!"). The result is people coding for GPUs with some bizarre superstitions though.
You actually want branches in the the code. Those are quick. The problem is that you cannot have a branch off a SIMD way so, instead of a branch the compiler will emit code for both branches and the results will be masked out based on the branch's condition.
So, to answer your question - any computation based on shader inputs (vertices, computer shader indices and what not) cannot and won't branch. It will all be executed sequentially with masking. Even in the TFA example, both values of ? operator are computed, the same happens with any conditional on an SIMD value. There can be shortcut branches emitted by the compiler to quickly bypass computations when all ways are the same value but in general case everything will be computed for every condition being true as well as being false.
Only conditionals based on scalar registers (shader constants/unform values) will generate branches and those are super quick.
> So, to answer your question - any computation based on shader inputs (vertices, computer shader indices and what not) cannot and won't branch.
It can do an actual branch if the condition ends up the same for the entire workgroup - or to be even more pedantic, for the part of the workgroup that is still alive.
You can also check that explicitly to e.g. take a faster special case branch if possible for the entire workgroup and otherwise a slower general case branch but also for the entire workgroup instead of doing both and then selecting.
And this is why I wrote There can be shortcut branches emitted by the compiler to quickly bypass computations when all ways are the same value but in general case everything will be computed for every condition being true as well as being false.
Execution with masking is pretty much how broaching works on GPUs. What’s more relevant however is that conditional statements add overhead on terms of additional instructions and execution state management. Eliminating small branches using conditional moves or manual masking can be a performance win.
No, branching works on GPU just like everywhere else - the instruction pointer gets changed to another value. But you cannot branch on a vector value unless every element of the vector is the same, this is why branching on vector values is a bad idea. However, if your vectorized computation is naturally divergent then there is no way around it, conditional moves are not going to help as they also will evaluate both branches in a conditional. The best you can do is to arrange it in such a way that you only add computation instead of alternating it, i.e. you do if() ... instead of if() ... else ... then you only take as long as the longest path.
This reminds me that people who believe that GPU is not capable of branches do stupid things like writing multiple shaders instead of branching off a shader constant e.g. you have some special mode, say x-ray vision, in a game and instead of doing a branch in your materials, you write an alternative version of every shader.
Does this also apply for shaders? And is it even useful given the enormous variation in hardware capabilities out there. My impression was that it's all JIT compiled unless you know which hardware you're targeting, e.g. Valve precompiling highly optimized shaders for the Steam Deck
(I'm not a grapics programmer, mind you, so please correct any misunderstandings on my end)
It's all JIT'd based on the specific driver/GPU, but the intermediate assembly language is sufficient to inspect things like branches and loop unrolling.
Not really. DXIL in particular will still have branches and not care much about unrolling. You need to look at the assembly that is generated. And yes, that depends on the target hardware and compiler/driver.
You will have to check for the different GPUs you are targetting. But GPU vendors don't start from scratch for each hardware generation so you will often see similar results.
I'll comment this here as I got downvoted when I made the point in a standalone comment - this is mostly an academic issue, since you don't want to use step of pixel-level if statements in your shader code, as it will lead to ugly aliasing artifacts as the pixel color transitions from a to b.
What you want is to use smoothstep which blends a bit between these two values and for that you need to compute both paths anyway.
>since you don't want to use step of pixel-level if statements in your shader code
The observation relates to pixel shaders, and even within that, it relates to values that vary based on pixel-level data. In these cases having if statements without any sort of interpolation introduces aliasing, which tends to look very noticeable.
Now you might be fine with that, or have some way of masking it, so it might be fine in your use case, but most in the most common, naive case the issue does show up.
I don't know how many graphics products you shipped and when, but, say, clamping values at 0, is pretty common even in most basic shaders. It's not magic and won't introduce "aliasing" just for the fact of using it. On the other hand, for example, using negative dot products in you lighting computation will introduce bizarre artifacts. And yes, everyone uses various forms of MSAA for the past 15 years or so even in games. Welcome to the 21st century.
The way you write seems to imply you have professional experience in the matter, which makes it very strange you're not getting what I'm writing about.
Nobody ever talked about clamping - and it's not even relavant to the discussion as it doesn't introduce discontinuity that can cause aliasing.
What I'm referring to is shader aliasing, which MSAA does nothing about - MSAA is for geometry aliasing.
To illustrate what I'm talking about with, an example that draws a red circle on a quad:
The first version has a hard boundary for the circle which has an ugly aliased and pixelated contour, while the latter version smooths it. This example might not be egregious, but this can and does show up in some circumstances.
And the reason for that is the confusing documentation from NVidia and its cg/CUDA compilers. I believe they did not want to scare programmers at first and hid the execution model, talking about "threads" and then they kept using that abstraction to hype up their GPUs ("it has 100500 CUDA threads!"). The result is people coding for GPUs with some bizarre superstitions though.
You actually want branches in the the code. Those are quick. The problem is that you cannot have a branch off a SIMD way so, instead of a branch the compiler will emit code for both branches and the results will be masked out based on the branch's condition.
So, to answer your question - any computation based on shader inputs (vertices, computer shader indices and what not) cannot and won't branch. It will all be executed sequentially with masking. Even in the TFA example, both values of ? operator are computed, the same happens with any conditional on an SIMD value. There can be shortcut branches emitted by the compiler to quickly bypass computations when all ways are the same value but in general case everything will be computed for every condition being true as well as being false.
Only conditionals based on scalar registers (shader constants/unform values) will generate branches and those are super quick.