What are the technical reasons why the M1 chip is so fast? Does Apple use any ex...

stefan_ · on Nov 30, 2020

They use this one trick that AMD and others hate: they paid TSMC to use up all of their 5nm capacity.

GloriousKoji · on Nov 30, 2020

I have an anecdote possibly related to this. I work for a semiconductor company and we were in pretty good standing with TSMC. All of a sudden we were informed we should no longer contact the usual engineers, and a new team of engineers was assigned to us. It was very clear the new staff was their B-team. Rumor around the office was that the A-team went to support Apple.

macintux · on Nov 30, 2020

The word I learned in years past because of Apple that applies to it in several ways is "monopsony": the ability to control the market for a product by being the dominant buyer of same.

lotsofpulp · on Dec 1, 2020

That is not what monopsony means. Monopsony means there is only one buyer. Apple is one of many buyers. However, they are able to secure the resources since they can pay the most.

macintux · on Dec 1, 2020

Monopsony means there’s a dominant buyer, not just a single buyer.

Fortune was talking about it almost 10 years ago.

https://macdailynews.com/2011/07/05/how-apple-became-a-monop...

Of course they’ve been accused in the Supreme Court of being a monopsonist with regard to the App Store, but that’s a separate discussion.

(Update: regrettably my economic theory is weak, so while I think it makes sense to apply the term, after more reading I can't argue the case with a reasonable level of conviction.

Certainly there should be some way to describe a buyer who locks up a supply of components to the exclusion of competitors, as Apple has done on multiple occasions dating at least to the original iPod hard drives.)

lotsofpulp · on Dec 1, 2020

https://en.wikipedia.org/wiki/Monopsony

See the box at the top of the article showing the difference between monopoly, monopsony, oligopoly, oligopsony.

A monopsony would mean the buyer gets to dictate all the terms, and the seller has to sell at very low price. In this case, TSMC has something very valuable, and Apple is paying a considerable amount (presumably, I haven’t seen real numbers) to the seller to make sure no one else gets it.

In fact, TSMC is a monopoly, the only seller of a much sought after product.

Justsignedup · on Nov 30, 2020

Sounds like apple made multiple specialized processors for specific tasks, built a motherboard around that, and cooked all that into the OS. It's a bit of a cheat since they aren't building a generic computer architecture, but a very very specialized one.

This is different from, say, the Windows architecture. In Windows, vendors work on generic CPUs/GPUs/motherboards. Some hardware isn't exactly available or uniformly accessible due to drivers and such. This is kind of the huge benefit Apple has by creating the hardware, OS, drivers, languages and libraries.

zepto · on Nov 30, 2020

This is what they do with everything. It is the reason for vertical integration.

Calling it ‘a cheat’ is somewhat weird.

The point is that a general purpose CPU is a bad solution for a personal computer.

The neural engine is about processing the ambiguous world that humans occupy so that the human doesn’t have to translate so much for the computer.

The GPU is about communicating visually with humans.

reallydontask · on Nov 30, 2020

GP said a bit of a cheat not that it was cheating.

Point is that if you try to run a different OS, it will likely run slower, perhaps significantly slower than running macOS

bee_rider · on Nov 30, 2020

Anecdotally, I saw some folks on reddit's linux board that were running Linux in virtual machines on the M1 and getting quite decent results.

zepto · on Nov 30, 2020

Given that the individual cores benchmark at the very high end of any core available today, this is almost certainly not significant.

If another OS doesn’t take advantage of the specialized processors, it will run slower, but only on specialized loads.

eng2prog · on Nov 30, 2020

I see Apple following the path of Nintendo. They don't need to compete on performance, their users aren't buying for performance. Vertical integration puts you at tremendous risk of falling behind.

The future is keeping Apple "fresh", "cool", etc... While keeping product costs low to compensate for a falling market share.

Really says something that Apple is willing to compromise on long term hardware performance.

jitl · on Nov 30, 2020

How are they following the Nintendo path? These machines are clearly faster than the market segment, and apple is the leader in ARM CPU technology at this point, and they’ve been the mobile CPU leader for about 7 years, whereas Nintendo’s latest offering integrated mostly commodity technology from Nvidia Tegra. I don’t see how you can draw this parallel.

zepto · on Nov 30, 2020

Can you explain how they are ‘compromising on long term performance’?

I don’t see any logic that supports this claim.

ascagnel_ · on Nov 30, 2020

Nintendo has, since the Wii, shied away from bleeding-edge or custom components, preferring to use cheaper, proven tech that they can more easily maintain but is still strong enough for their needs. Just look at the Switch -- it uses the Tegra X1 in a device that launched in March of 2017, despite the X2 having been available for more than a year.

Apple, on the other hand, has been running rings around Qualcomm and other ARM vendors for a while now, and sharing tech makes it all the more important that they continue to be competitive.

danbolt · on Dec 1, 2020

I think only really the SNES/N64 years embodied the cutting-edge solutions associated with novel games. Their whole ethos to entertainment (especially the handheld line becoming the core product in a way) still feels embodied in Yokoi's "lateral thinking with seasoned technology" philosophy.

socialdemocrat · on Nov 30, 2020

Yeah and I think it will be quite interesting to see in the years ahead how this kind of highly specialized approach to computing will stack up against the more generic wintel approach, and how customers will respond.

Customers may end up having the choice between a very high performance system at competitive prices, vs an open architecture which has more flexibility in what you add, but which will ultimately cost more and have worse performance.

You can see this play out in different markets. Just look at Tesla e.g. They are taking the Apple approach to car making with full vertical integration. They are doing everything from making their own alloys to creating their own machine learning hardware.

I think we are at some kind of paradigm shift and the old guard has been caught off guard.

NobodyNada · on Nov 30, 2020

If that were the case, though, we would expect to see performance gains only in areas the M1 is specialized for -- but that's not what we see. The M1's GeekBench scores are impressively high, and general-computing workloads like compiling code are very fast as well, indicating that the M1 is fast in general, not just for specific tasks.

There's been a lot of hype about things like "specialized hardware for NSObjects," but AFAIK that's more of a flawed/outdated design on Intel's side than a specialization on Apple's: the "magic" is really just ARM's weaker & more modern memory consistency model, which makes things like reference counting substantially faster.

sliken · on Nov 30, 2020

Apple did optimize for some common use cases, reference counting (as used by objc++), javascript tweaks, and for emulation for rosetta2. However they also made a great general purpose CPU that will run many standard codes impressively quickly for impressively little power.

eng2prog · on Nov 30, 2020

Today all of this is exciting, but I have a bad feeling this is going to slide Apple towards a low performance future.

They Apple'd themselves and locked themselves in.

In 2 years, secrets aside, Apple would need to beat all 3 big dogs to be the top. How long can they sustain that?

jitl · on Nov 30, 2020

Apple has been beating “the big dogs” in performance for years with this processor and OS architecture in mobile & tablet segments. They’re switching laptop and desktop segments to this architecture because the mobile architecture already overtook the big dogs and has been growing the lead for a couple of generations. The A14 phone chip is already faster than most laptop CPUs, just like the A12 phone CPU was faster than most laptop CPUs when it was released.

mrkstu · on Nov 30, 2020

They've been sustaining it for many years already, that's how they've gotten to the point of being able to pass up the incumbents. They didn't just start on this path this year, it's the culmination of a long play strategy.

zepto · on Nov 30, 2020

How do you reach this conclusion? Their individual cores and memory architecture still beat almost everyone else on general purpose loads without any of the specialized engines.

rado · on Nov 30, 2020

Today they overdelivered, why is it then so hard to believe they'll meet the expectations of a scaled up SoC? It's guaranteed.

vb6sp6 · on Nov 30, 2020

Couldn't they just switch back then?

eng2prog · on Nov 30, 2020

They could at significant expense.

But why would they? The users don't buy for performance.

vb6sp6 · on Nov 30, 2020

Basically every laptop from here on out is going to be benchmarked against these devices.

And battery life seems to be the sleeper performance metric here. Every review is going to be "you could get 6+ more hours if you just bought the macbook"

vmchale · on Nov 30, 2020

IIRC iPhone did well on SMT solver benchmarks as well.

amq · on Nov 30, 2020

I see two main reasons:

- 5nm process (1.8x the density of 7nm [1])

- unified memory directly on the SoC

[1] https://hexus.net/tech/news/industry/130175-tsmc-5nm-will-im...

kllrnohj · on Nov 30, 2020

Unified memory is old, every CPU with an integrated GPU is on unified memory. It's not doing anything at all for M1's general performance.

The main reason for M1's performance is just that Apple managed to get an 8-wide CPU to work & be fed. That's it. That's the entirety of the story of the M1. Apple got 8-wide to work, while Intel & AMD are still 4-wide. ARM's fixed length instructions are helping a ton there, but Apple also put in work to feed it.

All the other shit about specialized units & unified memory & vertical integration is all irrelevant and mostly wrong. The existing Intel Macbook Airs are all unified memory with specialized hardware units for specialized tasks, too (Intel QuickSync is 9 years old now - dedicated silicon for the specialized task of video encoding). Apple did absolutely nothing new or interesting on that front. Other than marketing. They are magic at marketing. And also magic at making a super wide CPU front-end, with really good cache latency numbers.
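
To illustrate the fixed-length point above: with 4-byte ARM64 instructions, the start of every instruction in a fetch block is known before any byte is examined, so eight decoders can begin in parallel. With a variable-length encoding, each start depends on the lengths of everything before it. A toy sketch in Python, where the function names and the example lengths are made up for illustration, and real x86 front ends use extra machinery to soften this serial dependency:

```python
from itertools import accumulate

def fixed_starts(block_bytes, insn_bytes=4):
    """Start offsets when every instruction has the same size.

    Offset k is simply k * insn_bytes, so no instruction has to be
    inspected to locate the next one.
    """
    return list(range(0, block_bytes, insn_bytes))

def variable_starts(lengths):
    """Start offsets when instruction sizes differ.

    Each offset is the running sum of all earlier lengths, so locating
    instruction k requires knowing the lengths of instructions 0..k-1.
    """
    return [0] + list(accumulate(lengths))[:-1]

# A 32-byte fetch block holds eight 4-byte instructions.
print(fixed_starts(32))
# Hypothetical variable-length stream; these lengths are made up.
print(variable_starts([1, 3, 2, 5, 4, 1, 7, 2]))
```

This only shows the intuition behind the decode-width argument; it says nothing about how either vendor's decoder is actually built.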

sliken · on Nov 30, 2020

That's overly simplified. Larger order buffer, larger caches, lower latency caches, more outstanding loads, etc.

They also managed in a cheap platform (mac mini = $700) get 4266 Mhz memory working with about half the latency of any x86-64 I've seen. The mac mini can manage a random cacheline (assuming a relatively TLB friendly pattern) of around 30-33ns.

Maybe the usual intel core i7/i9 or ryzen 5000 look rather quaint with their 60ns or higher memory latencies.

kllrnohj · on Dec 1, 2020

> They also managed in a cheap platform (mac mini = $700) get 4266 Mhz memory working with about half the latency of any x86-64 I've seen. The mac mini can manage a random cacheline (assuming a relatively TLB friendly pattern) of around 30-33ns.

Where are you getting those numbers? From https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste...

"In terms of memory latency, we’re seeing a (rather expected) reduction compared to the A14, measuring 96ns at 128MB full random test depth, compared to 102ns on the A14."

Which would put M1's DRAM latency at worse than a modern Intel or AMD desktop, which is measuring around 70-80ns: https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-di...

sliken · on Dec 1, 2020

I've replicated them myself with my own code, so I'm pretty confident. It doesn't hurt that my numbers match Anandtech's, at least for the range of arrays they use and only using a single thread.

On pretty much any current CPU if you randomly access an array significantly larger than cache (12MB in the M1 case) you end up thrashing the TLB which significantly impacts latency. The number of pages that can be quickly accessed depends on the number of page entries in the TLB.

To separate out TLB latency from memory latency I allow controlling size of the sliding window for randomizing the array, so that only a few pages are heavily used at any one time, while visiting each cache line exactly once.

That's exactly what the brown "R per RV prange" does. For more info look at the description at: https://www.anandtech.com/show/14072/the-samsung-galaxy-s10p...

My code builds an array, then does a knuth shuffle, but modified so the maximum shuffle distance is 64 pages, so the average shuffle is 32 pages or so. I get a nice clean line at 34ns. With 2 or 4 threads I get a throughput (not true latency) of a cacheline every 21ns. With 8 threads (using the 4 slow and 4 fast cores) I get a somewhat better cacheline per 12.5ns.

Pretty stunning to have latencies that low on a low end $700 mac mini that embarrasses machines that cost 10x that much. Even high end Epyc machines (200 watt TDP) with 8 x 64 bit memory channels have to try hard to get a cacheline every 10ns.
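
sliken's code is not shown in the thread, so the following is only one plausible reading of the description above: a Knuth (Fisher-Yates) shuffle whose swap partner is limited to a window ahead of the current slot, turned into a single pointer-chasing cycle so every slot is visited exactly once. The names `windowed_shuffle`, `build_chain`, and `chase` are hypothetical, slots stand in for cache lines, and Python is far too slow to time memory with; a real benchmark would run the chase loop in C or similar.

```python
import random

def windowed_shuffle(n, window, rng):
    """Fisher-Yates shuffle with the swap partner at most `window` slots ahead."""
    order = list(range(n))
    for i in range(n - 1):
        # randint is inclusive on both ends.
        j = rng.randint(i, min(n - 1, i + window))
        order[i], order[j] = order[j], order[i]
    return order

def build_chain(order):
    """Turn a visiting order into a next-pointer table forming one cycle."""
    nxt = [0] * len(order)
    for k, slot in enumerate(order):
        nxt[slot] = order[(k + 1) % len(order)]
    return nxt

def chase(nxt, start):
    """Follow the chain for len(nxt) dependent loads and record the visits."""
    seen = []
    pos = start
    for _ in range(len(nxt)):
        seen.append(pos)
        pos = nxt[pos]  # each load depends on the previous one
    return seen

rng = random.Random(0)      # fixed seed, purely for repeatability
order = windowed_shuffle(1000, 16, rng)
visits = chase(build_chain(order), order[0])
print(len(set(visits)) == len(order))  # every slot visited exactly once
```

The sketch guarantees `order[i] <= i + window`; elements can still lag behind their original position, so it bounds how far ahead the chase reaches, not the exact working set of pages.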

kllrnohj · on Dec 1, 2020

> Pretty stunning to have latencies that low on a low end $700 mac mini that embarrasses machines that cost 10x that much. Even high end Epyc machines (200 watt TDP) with 8 x 64 bit memory channels have to try hard to get a cacheline every 10ns.

Eh? That's not how memory latency works. The cheaper consumer chips with "non-spec" RAM and without ECC are regularly better here than the enterprise stuff. This isn't something that scales with price.

sliken · on Dec 1, 2020

Sure, ECC and in particular registered memory does increase the latency a bit. But servers are designed for throughput and have multiple memory channels to better feed the large amount of cores involved, up to 64 cores for the new AMD epyc chips. The amazing thing is that the Apple M1 can fetch random cachelines almost as fast as a current AMD Epyc.

kllrnohj · on Dec 1, 2020

You're confusing throughput & latency here. More channels increases throughput, but doesn't improve latency.

The M1's memory bandwidth is ~68GB/s, which is of course a tiny fraction of AMD Epyc's ~200GB/s per socket.

Epyc's latency isn't even competitive with AMD's own consumer parts, so I'm really not sure why you're surprised that Epyc's latency is also worse than the M1's?

sliken · on Dec 1, 2020

I'm not surprised the latency on the M1 is better than Epyc, but it's near half of any other consumer part, like say the AMD Ryzen 5950X. When accessed in a TLB friendly way (not TLB thrashing) the M1 manages 30ns which is excellent.

Even more impressive is that the random cacheline throughput is also excellent. So if all 8 cores have a cache miss, the M1 memory system is very good at keeping multiple pending requests in flight to achieve surprisingly good throughput. Granted this isn't pure latency, so I call it throughput. Getting a random cacheline per 12ns is quite good, especially for a cheap low power system. Normally getting more than 2 memory channels on a desktop requires something exotic like an AMD Threadripper.
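
The latency versus throughput distinction in this subthread can be put into numbers. The sketch below converts the per-cacheline times quoted earlier (34ns with one thread, 21ns with 2 or 4 threads, 12.5ns with 8 threads) into bandwidth and, via Little's law, into the average number of misses in flight. The 64-byte line size is an assumption that the thread never states, and the function names are illustrative.

```python
LINE_BYTES = 64  # assumption: not stated in the thread, may differ on the M1

def bandwidth_gb_per_s(line_bytes, ns_per_line):
    """Bytes per nanosecond is numerically the same as GB/s (1e9 / 1e9)."""
    return line_bytes / ns_per_line

def in_flight(latency_ns, ns_per_line):
    """Little's law: average outstanding requests = latency * completion rate."""
    return latency_ns / ns_per_line

SINGLE_THREAD_LATENCY_NS = 34.0  # sliken's single-thread figure

for label, ns_per_line in [("1 thread", 34.0),
                           ("2 or 4 threads", 21.0),
                           ("8 threads", 12.5)]:
    gbs = bandwidth_gb_per_s(LINE_BYTES, ns_per_line)
    depth = in_flight(SINGLE_THREAD_LATENCY_NS, ns_per_line)
    print(f"{label}: {gbs:.2f} GB/s, about {depth:.2f} misses in flight")
```

Under the 64-byte assumption, one line per 12.5ns is about 5 GB/s, far below the ~68GB/s peak quoted above, and corresponds to fewer than three misses in flight on average if each still takes about 34ns. That is consistent with both commenters: the figure is a random-access throughput, not a latency and not a streaming bandwidth.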

jayd16 · on Nov 30, 2020

Is this a bot? This is just lifted from the article verbatim.

Do I get my Turing tester badge?

gfody · on Nov 30, 2020

IIRC Apple got some exotic automated optimization tech when they acquired PA Semi - I think it was mentioned in the Steve Jobs biography. The A chips have always been impressive and the timing lines up.

chess_buster · on Nov 30, 2020

Did you read the article? Does not seem like it.

jonplackett · on Nov 30, 2020

read the article!