Tunabrain's comments

Is arXiv a major trade publication?

I've never seen arXiv papers counted towards your publications anywhere that publication count is used as a metric. Is USCIS different?


GPUs are deterministic machines, even for floating point.

The behavior in the linked article has to do with the use of atomic adds to reduce sums in parallel. Floating point addition is not associative, so the order in which addition occurs matters. When using atomic adds this way, you get slightly different results depending on the order in which threads arrive at the atomic add call. It's a simple race condition, although one which is usually deemed acceptable.
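A minimal sketch of why arrival order matters, in plain Python rather than GPU code (the values are invented for illustration): the same three floats reduced in two different "thread arrival orders" give two different results.

```python
# Two arrival orders for the same three values at an atomic add.
# Floating point addition is not associative, so the running
# totals diverge depending on when each "thread" arrives.
order_a = [1e16, 1.0, -1e16]   # the 1.0 arrives second
order_b = [1e16, -1e16, 1.0]   # the 1.0 arrives last

def reduce_in_order(values):
    total = 0.0
    for v in values:   # each iteration models one thread's atomicAdd
        total += v
    return total

print(reduce_in_order(order_a))  # 0.0 -- the 1.0 is absorbed by 1e16
print(reduce_in_order(order_b))  # 1.0
```

The 1.0 is smaller than the spacing between representable doubles near 1e16, so in the first order it vanishes before the cancellation; in the second it survives.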


I just edited my comment while you were writing yours to add an explanation. The point here is that some primitives in e.g. cuDNN are non-deterministic. Whether you classify that as a race condition or not is a different question; but it's intended behaviour.


Right but that's not an inherent GPU determinism issue. It's a software issue.

https://github.com/tensorflow/tensorflow/issues/3103#issueco... is correct that it's not necessary, it's a choice.

Your line of reasoning appears to be "GPUs are inherently non-deterministic, so don't be quick to judge someone's code", which as far as I can tell is dead wrong.

Admittedly, there are some cases and instructions where non-determinism is inherently necessary. Still, the author should think carefully before introducing non-determinism. There are many scenarios where it is irrelevant, but ultimately the issue we are discussing here isn't the GPU's fault.


What I'm saying is "there are non-deterministic primitives", not "there are no deterministic primitives".


Yes, and `gettimeofday` is a non-deterministic primitive. There is nothing special about GPUs here. If you write tests that fail sometimes because you used non-deterministic primitives like gettimeofday and someone files a bug we don't throw up our hands and say "this is not a bug but due to how CPUs work." We remove the non-deterministic bit.

There's no difference here. This isn't a GPU problem.
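As a sketch of "remove the non-deterministic bit" (the helper names are invented, and Python's time.time stands in for gettimeofday): make the clock an injectable parameter, so tests can pin it down.

```python
import time

# Non-deterministic: the result depends on when the test happens to run.
def parity_of_now():
    return int(time.time()) % 2

# Deterministic refactor: the clock is a parameter. Production code
# passes time.time; tests pass a fixed fake clock.
def parity_at(now):
    return int(now()) % 2

assert parity_at(lambda: 4.0) == 0   # pinned clock, always passes
assert parity_at(lambda: 5.0) == 1
```

The first function makes any test flaky; the second is testable with no loss of functionality, which is the point being made about GPU kernels as well.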


Except the issue is inextricably linked to GPUs. All of the work in practical DNNs exists because of the extreme parallel performance available from GPUs, and that performance is only possible with non-deterministic threading. You can't get reasonable training and inference time on existing hardware without it.


1000 threads can run in parallel. That doesn't prevent us from summing their results deterministically:

    from multiprocessing.pool import ThreadPool
    import math

    # calc and inputs defined elsewhere
    results = ThreadPool(processes=1000).imap_unordered(calc, inputs)
    print(math.fsum(results))
Thanks to the magic of the fsum algorithm, the result is deterministic regardless of the order in which we receive results. https://docs.python.org/3/library/math.html#math.fsum
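A quick check of that claim (sequential Python rather than a thread pool; the input values are invented): math.fsum tracks exact partial sums and rounds once at the end, so shuffling the inputs never changes the result.

```python
import math
import random

# Values chosen so a naive left-to-right sum is order-sensitive.
values = [1e16, 1.0, -1e16, 0.1] * 100
baseline = math.fsum(values)

for seed in range(5):
    shuffled = values[:]
    random.Random(seed).shuffle(shuffled)
    # fsum is exactly rounded, so any permutation gives the same float
    assert math.fsum(shuffled) == baseline
```

A plain sum() over the same shuffles can produce different floats, which is exactly the atomic-add situation upthread.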


That's not the operation being performed on GPUs, though, which is the problem. Fundamentally, GPUs allow for high-performance operations using atomics, but this comes at the cost of nondeterministic results. You can get deterministic results, but doing so carries a significant performance cost.


Using atomics is easier than warp operations (using warp shuffle for example), but warp shuffle is quite fast.

I guess if determinism is that important, implementations can be changed; it's just maybe not that high a priority.


That summation is slow and would not be used in practice.

You could use just one thread on your 10000 thread GPU too and it would be deterministic, sure. Completely beside the point.


In my experience cuBLAS is deterministic. Since matmul is the most compute-intensive part, I don't see reasons for non-determinism other than sloppiness (at least when just a single GPU is involved).


Yeah. In curated transformers [1] we are seeing completely deterministic output across multiple popular transformer architectures on a single GPU (there can be variance between GPUs due to different kernels). Of course, it completely depends on what ops and implementations you are using. But most transformers do not use ops that are typically non-deterministic to be fast (like scatter-add).

One source of non-determinism we see with a temperature of 0 is that once you have quantized weights, many predicted pieces will have the same probability, including multiple pieces tied for the highest probability. The sampler (if you are not using a greedy decoder) will then sample from those pieces. So generation is non-deterministic even with a temperature of 0.

In other words, a temperature of 0 is a poor man’s greedy decoding. (It is totally possible that OpenAI’s implementation switches to a greedy decoder with a temperature of 0).

[1] https://github.com/explosion/curated-transformers
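A sketch of that failure mode with made-up numbers (the piece names and probabilities are invented, and this is not curated-transformers code): after quantization several pieces tie for the top probability, and a temperature-0 sampler that isn't a true greedy decoder still picks among the ties.

```python
import random

# Hypothetical post-quantization distribution with tied maxima.
probs = {"_the": 0.25, "_a": 0.25, "_an": 0.25, "_this": 0.25}

def greedy_decode(probs):
    # Deterministic: ties broken by a fixed rule (lexicographic here).
    best = max(probs.values())
    return min(p for p in probs if probs[p] == best)

def temp_zero_sample(probs, rng):
    # "Temperature 0" implemented as sampling from the argmax set:
    # deterministic only when the argmax is unique.
    best = max(probs.values())
    return rng.choice(sorted(p for p in probs if probs[p] == best))

print(greedy_decode(probs))                        # always the same piece
print(temp_zero_sample(probs, random.Random(0)))   # depends on the RNG state
```

With a unique maximum the two functions agree; with ties, only the greedy decoder is reproducible across runs.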


If the hardware is deterministic, so are the results. You can't generate random numbers purely in software with deterministic hardware.


The behaviour of atomic operations is definitely not deterministic. E.g. if you have a lot of atomic adds, every time you run the code you'll get a different result without a random number generator.


I think there may be a misunderstanding here regarding the use case. If the vectors are large and allocated on the heap/on an accelerator, then yes, writing out explicit temporaries may be faster. Of course, this does not preclude operator overloading at all; you could write the same code as

    auto t = vec1 - vec2;
    t *= 0.5/0.3;
    t += vec3;
    t *= 0.3;

However, if the operands are small (e.g. 2/3/4-element vectors are very common), then "unnecessary copies" or move semantics don't come into play at all. These are value types, and the compiler will boil them down to the same assembly as the code you posted above. Many modern C++ codebases in scientific computing, rendering, or the game industry make use of vector classes with operator overloading, with no performance drawbacks whatsoever; however, the code is much more readable, as it matches actual mathematical notation.


Thank you for putting this so eloquently!

> Many modern C++ codebases in scientific computing, rendering, or the game industry make use of vector classes with operator overloading, with no performance drawbacks whatsoever

I guess these people are all not writing "serious code" :-p


Oh wow, I bump into you everywhere!


Checked for my home country (Switzerland), and it doesn't seem to be correct - travel to Canada requires an eTA, but the site lists Canada as being visa free. That could lead to an unpleasant surprise when your airline won't let you board (they check for the eTA at check-in).


I'd be interested to read that paper, if you're willing to share it.



Any entry-level optimization course will have you implement something like that: do a matrix + matrix addition, but traverse it in row-major order in one program and column-major in the other. If the matrix is big enough, you will easily get a 10x difference. For even larger sizes (once you thrash the TLB), you pick up an additional factor.

You might not get up to 60, but this is just a simple matrix addition. Even just matrix multiplication might be enough to get a 60x difference just with optimizing for memory hierarchy. I expect that high-performance memory bound programs are more complex than matrix multiplication, so the parent comment's claim doesn't seem too unlikely to me.
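A structural sketch of that exercise in Python (note the hedge: Python lists of lists are not contiguous, so the cache effect is muted here; in C or in NumPy's default row-major layout the gap is dramatic):

```python
N = 256
a = [[1.0] * N for _ in range(N)]
b = [[2.0] * N for _ in range(N)]
c = [[0.0] * N for _ in range(N)]

# Row-major traversal: the inner index walks each row sequentially,
# so consecutive accesses hit consecutive memory in a C-style layout.
for i in range(N):
    for j in range(N):
        c[i][j] = a[i][j] + b[i][j]

# Column-major traversal: identical arithmetic, but every access
# jumps a full row's stride, so in a contiguous layout each touch
# of a large matrix is a fresh cache miss.
for j in range(N):
    for i in range(N):
        c[i][j] = a[i][j] + b[i][j]
```

Both loops compute the same matrix; only the memory access pattern differs, which is the whole point of the exercise.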


Sorry, I didn't mean to question it whether it is possible. I just think that a factor of 60 sounds tricky to achieve if we are just talking about RAM (no disk involved).

The first time I encountered this myself was doing texture mapping on the 486 back in the mid 90s. Texels would be laid out by row, and the mapper would draw horizontally. Texture maps that fit within the L1 cache would draw fine no matter which rotation they were drawn at.

However, texture maps that were larger than the L1 cache would draw fine as long as the texture map wasn't significantly rotated. But if you rotated the polygon by 90 degrees, you'd see a significant drop in frames per second, because you'd skip a whole row between drawing each texel, which would effectively cause a lot of cache misses. OK, I didn't manage to explain that well, but I hope it is understandable.

Obviously, I have encountered this many times since, in many variants (including disk read/write), and I definitely agree that a factor of 10 should be possible.

I guess I am just wondering if the difference between RAM and L1 has reached a factor of 60, and how easy it is to come up with something where the cache misses all the time, or if caches have become smarter.


Googling puts RAM latency (avoiding the caches) on modern processors at 40-60 cycles.


Exactly... Now, that alone could make it tricky to get a factor-of-60 difference.


Write a simplistic O(N^3) algorithm.

When N is 100, all fixed costs in the inner loop are multiplied by 1,000,000. So a nanosecond difference is now a millisecond. It's pretty easy to do worse than that with a boneheaded data layout (row-major vs column-major is a pretty common mistake).


No...

If you use a "bonehead" data structure, each memory reference will force a cache miss, i.e. taking q CPU cycles instead of 1 CPU cycle.

If the algorithm has a running time of O(N^3), it means it uses some number of instructions <= c * N^3 to complete. For simplicity's sake, let's say 1 operation is identical to 1 CPU cycle.

Now, if each operation takes q CPU cycles instead of 1 cycle due to a cache miss on every operation, the running time of your algorithm is still bounded by q * (c * N^3) operations.

For the example with N=100: the fast version takes c * 1,000,000 operations to complete, for some constant c. For the slow version, as each operation is now q times slower, it will take q * c * 1,000,000 operations to complete.

I.e. it takes q times as long, not q * q * q.

What you were saying is that if you run an O(N) algorithm on a 100 MHz CPU, it will take 10 times as long as on a 1000 MHz CPU, but an O(N^3) algorithm would take 1000 times as long. And that is obviously not true. Both will take 10 times longer.


No

I'm saying that a poor choice of data structure will have an amplified impact on an O(N^3) algorithm.

Giving it as an example of where you could have the different implementations of the same algorithm have massively different runtimes on the same processor.


If the algorithm has a running time of O(N^3), that implies it takes at most c * N^3 operations to complete, regardless of data structures. That's the definition of asymptotic running time.

If we disregard c, and set N = 100, that is 1,000,000 operations.

We have two situations: 1) either we use a good layout where each operation is fast because each memory access hits the cache, or 2) we use a bad layout where each operation is slow because we have a cache miss and have to access RAM. These access times are independent of the specific algorithm and only depend on the CPU architecture.

Let's say that the fast operations each take 1 second, and the slow operations each take f seconds.

In the fast case, with good memory layout, the algorithm completes in 1,000,000 seconds. In the slow case, with bad memory layout, the algorithm completes in f * 1,000,000 seconds.

The time of each operation does not depend on the size of N. In practice, however, you'll typically have a lot of cache hits even with a bad memory layout, so the difference may be less than a factor of f for smaller values of N (where a cubic algorithm gets more cache hits). But it will be upper-bounded by f. I.e., there is an upper bound on how much slower you can make an algorithm by changing the memory layout.
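A toy cost model of this argument (all numbers invented): if a cache hit costs 1 and a miss costs f, then even an all-miss traversal of an O(N^3) algorithm is slower by exactly a factor of f, for any N.

```python
def model_runtime(N, miss_cost, miss_rate):
    # O(N^3) algorithm: N**3 operations, each costing 1 (hit) or
    # miss_cost (miss), mixed according to miss_rate.
    ops = N ** 3
    return ops * (miss_rate * miss_cost + (1 - miss_rate))

f = 60.0
for N in (10, 100, 1000):
    slowdown = model_runtime(N, f, 1.0) / model_runtime(N, f, 0.0)
    print(N, slowdown)   # the ratio is f, independent of N
```

The per-operation penalty multiplies the total, so the N**3 factor cancels in the ratio; layout changes the constant, not the exponent.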


You assume N and f are independent (although they might not be: your N may dictate how often a cache miss occurs).

Right, so now let's assume that our O(N^3) algorithm is simply a triple nested loop. And that we have an expensive operation f, but that through choice of data structures we can choose in which layer that operation occurs. All other costs are constant.

If f is in the outer loop, it happens N times, in the middle loop N^2, in the inner loop N^3.

So our choice of data structure impacts our running times by any factor from linear through cube of N.

I was simply answering the question of is it really possible to have two implementations of the same algorithm have such disparate run times, and it's trivially possible.

Another way of saying it, is that the mistake need not look expensive on cursory inspection, a simple way to have very different runtimes is through the accumulation of a large number of small costs.

Maybe the important point is to use a profiler.

I've recently gained a 60x improvement in the real-world run time of what is considered in the literature to be the best known algorithm for a particular problem. The speedup didn't come from improving on the algorithm; it came from improving data locality, removing memory allocations, and general modern code hygiene (to be fair, the canonical implementation is ~20 years old).


I like the idea! But is there a reason why it's only available in app form? I was going to try it and see what gifts were available/how much they cost, but I can't do that from the website. I'm not going to install an app on my phone just to see whether it's affordable (and so I'm unlikely to try it at all).

I don't see why this couldn't be done in a Desktop browser (just enter a phone number/email address instead of selecting a contact).


Hey there! CEO of SwiftGift here.

We've gone mobile first because we are all about convenience. From your phone you can tap straight into your contacts and select a recipient with 2 taps. In a browser you would need to look up a number (most people don't know them by heart) and then manually type it in...

We may well add a website at a future date, but for now all our resources are focused on mobile: the iOS app (voted "Best Gift App 2016" by Apple), as well as the iMessage app and Android app :)

On pricing - we are an aggregator and mostly resell items from large retailers like Amazon. Our commission comes from their margin so our pricing is very competitive :)

We do hope you will invest the time to install the free App and take a look around :)


Are all of those ethernet cables? I can't say I know much about wiring, but that seems like a bit overkill.


Looks like they are. Actually a good idea IMO assuming they are just using them as robust 8 bit interconnects and not actually Ethernet.


Considering two of the bits will have large coupling between them, that sounds like a bad idea...

A 4 bit differential output would be pretty darn good.

Actually, that might be the case: some boards have 4 presumably output ethernet ports.

edit: https://bighexmachine.github.io/hexModuleSpec/ looks like I'm right?


They are, and the interconnects are 4 bit. These were used by first year undergraduates for years before the Big Hex Machine was built, so sturdy Ethernet cables were cheaper and more reliable than many other options.


Yup, we used 4 wires for data and 4 for power - there's a 12V supply which is spread around and downregulated to 5V on each board. We picked ethernet because it's extremely cheap, hard-wearing, widely available and has enough wires to play with.


Really, it beats having to fab your own connectors. Making your own isn't a big deal for most projects, but with that number of boards, this is a smart choice.


But this does use solar panels.

> The device uses solar electricity from a photovoltaic panel to power the chemistry that splits water into oxygen and hydrogen. Microbes within the system then feed on the hydrogen and convert carbon dioxide in the air into alcohol that can be burned as fuel

Only the hydrogen + co2 -> alcohol part uses biological components.

