Well my comment mentioned "random (but TLB friendly)", which I define as visiting each cache line exactly once, but with only a few (32-64) pages active at any time.

The reason for this is that I like to separate the latency to main memory from the TLB-related latencies. Certainly there are workloads that are completely random (thus the term cache thrashing), but there are also many workloads that only have a few tens of pages active. Doubly so under Linux, where you can switch to huge pages if your workload is TLB thrashing.

For a description of the Anandtech graph you posted see: https://www.anandtech.com/show/14072/the-samsung-galaxy-s10p...

So the cache friendly line is the "R per RV prange" one. On that line the 5950X latency is on the order of 65ns; the similar line for the M1 is dead on 30ns at around 128KB and goes up slightly in the 256-512KB range. Sadly they don't publish the raw numbers, and pixel counting on log/log graphs is a pain. However, I wrote my own code that produces similar numbers.

My numbers are pretty much a perfect match. If my sliding window is 64 pages (average swap distance = 32 pages) I get around 34ns. If I drop it to 32 pages I get 32ns.
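For anyone who wants to try this, here is a rough sketch of how a sliding-window order like that could be generated. It is not my actual benchmark code: the names `windowed_order` and `chase_table` are made up for illustration, the 64-byte line and 4096-byte page sizes are assumptions, and Python is only good for building and checking the order (the timed pointer chase itself needs C or similar).

```python
import random

LINE = 64    # bytes per cache line (assumed)
PAGE = 4096  # bytes per page (assumed)

def windowed_order(total_bytes, window_pages, seed=0):
    """Visit order over cache-line indices: each line appears exactly once,
    shuffled by swapping with a random line at most one window ahead."""
    rng = random.Random(seed)
    n = total_bytes // LINE
    window = window_pages * (PAGE // LINE)  # window size in cache lines
    order = list(range(n))
    for i in range(n):
        # Swap partner is uniform in [i, i + window), clipped to the array,
        # so the average swap distance is about half the window.
        j = rng.randrange(i, min(n, i + window))
        order[i], order[j] = order[j], order[i]
    return order

def chase_table(order):
    """Turn a visit order into a cyclic pointer-chase table: nxt[line] is
    the line to load after `line`."""
    nxt = [0] * len(order)
    for k, line in enumerate(order):
        nxt[line] = order[(k + 1) % len(order)]
    return nxt

if __name__ == "__main__":
    order = windowed_order(64 * 1024 * 1024, window_pages=64)
    nxt = chase_table(order)
    print(len(order), "cache lines, first few:", order[:8])
```

With this scheme no line is visited more than one window ahead of its sequential position, but an unlucky line can lag behind for a while, so the set of active pages is only approximately bounded by the window.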

So the M1, assuming a relatively TLB friendly access pattern that only keeps 32-64 pages active, has about half the latency of the AMD 5950X.

So compare the graphs yourself, and I can provide more details on my numbers if you're still curious.