It's very important to control for every variable.
It's not, actually. Say you have 10,000 listeners and you randomly assign each one to 16-bit or 24-bit listening. With that many listeners, any other differences between the two groups are due to chance and will come very close to evening out. Now, if you find people are unable to distinguish between 16-bit and 24-bit, you might want to try the test again with more control over the environment, since uncontrolled noise could be hiding a small effect. But if you find a substantial difference in a large, blind, randomized test, that's a real finding.
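If you want to see the "evens out" part for yourself, here is a rough simulation sketch. Everything in it is made up for illustration: the uncontrolled variable is a hypothetical "headphone quality" score drawn from a normal distribution (mean 50, standard deviation 10), and `covariate_gap` is just a helper name I picked. It randomly splits the listeners into two groups and measures how far apart the groups end up on that variable.

```python
import random
import statistics


def covariate_gap(n_listeners, seed=0):
    """Randomly split listeners into two groups and return the absolute
    difference between the groups' mean value of an uncontrolled variable."""
    rng = random.Random(seed)
    # Hypothetical uncontrolled variable (assumption): a "headphone quality"
    # score per listener, drawn from a normal distribution N(50, 10).
    covariate = [rng.gauss(50, 10) for _ in range(n_listeners)]
    # Random assignment: shuffle, then send half to each condition.
    rng.shuffle(covariate)
    half = n_listeners // 2
    group_16_bit = covariate[:half]
    group_24_bit = covariate[half:]
    return abs(statistics.fmean(group_16_bit) - statistics.fmean(group_24_bit))


def average_gap(n_listeners, trials=50):
    """Average the gap over several independent random assignments."""
    return statistics.fmean(covariate_gap(n_listeners, seed=s) for s in range(trials))


if __name__ == "__main__":
    for n in (20, 200, 10_000):
        print(f"{n:>6} listeners: average gap = {average_gap(n):.3f}")
```

The gap between groups should shrink roughly in proportion to one over the square root of the number of listeners, so at 10,000 it is tiny compared with the spread of the variable itself. That is why a substantial difference between the groups is hard to blame on something you didn't control.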