Yes, and `gettimeofday` is a non-deterministic primitive. There is nothing special about GPUs here. If you write tests that fail sometimes because you used non-deterministic primitives like `gettimeofday` and someone files a bug, we don't throw up our hands and say "this is not a bug but due to how CPUs work." We remove the non-deterministic bit.
There's no difference here. This isn't a GPU problem.
Except the issue is inextricably linked to GPUs. All of the work in practical DNNs exists because of the extreme parallel performance available from GPUs, and that performance is only possible with non-deterministic threading. You can't get reasonable training and inference time on existing hardware without it.
It's not the operation being performed on GPUs that's the problem. The issue is that GPUs fundamentally achieve high performance by using atomics, and that comes at the cost of nondeterministic results: atomic floating-point adds commit in a different order on each run, and floating-point addition is not associative. You can get deterministic results, but doing so comes with a significant performance cost.
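To make the atomics point concrete, here's a minimal CPU-side sketch in plain Python (no GPU involved) of the underlying effect: because floating-point addition is not associative, the same set of addends accumulated in two different orders can produce different sums.

```python
def accumulate(values):
    """Sum floats left to right, like atomic adds committing in some order."""
    total = 0.0
    for v in values:
        total += v
    return total

# Same three addends, two different commit orders:
a = accumulate([1e16, 1.0, -1e16])  # the 1.0 is absorbed into 1e16's rounding
b = accumulate([1e16, -1e16, 1.0])  # the large terms cancel first, so 1.0 survives

print(a == b)  # False: order changed the result
```

On a GPU the order is decided by thread scheduling, which varies run to run, so the same kernel can return slightly different floats each time.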
In my experience cuBLAS is deterministic; since matmul is the most intensive part, I don't see other reasons for non-determinism besides sloppiness (at least when just a single GPU is involved).
Yeah. In curated transformers [1] we are seeing completely deterministic output across multiple popular transformer architectures on a single GPU (there can be variance between GPUs due to different kernels). Of course, it completely depends on what ops and implementations you are using. But most transformers do not use ops that are typically non-deterministic to be fast (like scatter-add).
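As a rough illustration of why scatter-add is the classic offender (plain Python sketch, not the curated-transformers code): on a GPU, many threads atomically accumulate into the same output slot, and the order in which those adds commit varies between runs. Reordering the contributions in a sequential reference implementation shows the same effect.

```python
def scatter_add(out, index, src):
    """Sequential reference for scatter-add: out[index[i]] += src[i]."""
    for i, j in enumerate(index):
        out[j] += src[i]
    return out

# Three threads all accumulate into slot 0. On a GPU the commit order is
# whatever the scheduler produces; here we just try two orders by hand.
r1 = scatter_add([0.0], [0, 0, 0], [1e16, 1.0, -1e16])
r2 = scatter_add([0.0], [0, 0, 0], [1e16, -1e16, 1.0])
print(r1[0] == r2[0])  # False: the slot's value depends on commit order
```

Deterministic scatter-add variants exist (e.g. sort-by-index then segmented reduction), which is where the speed trade-off comes from.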
One source of non-determinism we see with a temperature of 0 is that once you have quantized weights, many predicted pieces end up with exactly the same probability, including multiple pieces tied for the highest probability. The sampler (if you are not using a greedy decoder) will then sample among those tied pieces. So generation is non-deterministic even with a temperature of 0.
In other words, a temperature of 0 is a poor man’s greedy decoding. (It is totally possible that OpenAI’s implementation switches to a greedy decoder with a temperature of 0).
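A sketch of the distinction (plain Python with hypothetical `greedy_pick`/`sample_pick` helpers, not OpenAI's actual sampler): when quantization produces exact ties at the top, greedy decoding is deterministic while a temperature-0 sampler that breaks ties randomly is not.

```python
import random

def greedy_pick(probs):
    # Deterministic: always the first index with the highest probability.
    return max(range(len(probs)), key=lambda i: probs[i])

def sample_pick(probs):
    # "Temperature 0" sampler that still samples uniformly among exact ties.
    best = max(probs)
    ties = [i for i, p in enumerate(probs) if p == best]
    return random.choice(ties)

# Quantization has collapsed two pieces onto the same top probability:
probs = [0.4, 0.2, 0.4]
print(greedy_pick(probs))                         # always 0
print({sample_pick(probs) for _ in range(100)})   # may contain both 0 and 2
```

A decoder that explicitly switches to `greedy_pick` at temperature 0 sidesteps the tie problem entirely, which is presumably why some implementations special-case it.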