Modern x86_64 has supported multiple page sizes for a long time. I'm on commodit...

jeffbee · 2026-03-16T20:01:35 1773691295

Right, but on Intel the 1G page size has historically been the odd one. For example, Skylake-X has 1536 shared L2 TLB entries for either 4K or 2M pages, but only 16 entries that can be used for 1G pages. It wasn't unified until Cascade Lake. But Skylake-like Xeons are still incredibly common in the cloud, so it's hard to target the later ones.

Dylan16807 · 2026-03-16T21:03:16 1773694996

So for any process that's using less than 16GB, it's a significant performance boost. And most processes using more RAM, but not splitting accesses across more than 16 zones in rapid succession, will also see a performance boost.

My old Intel CPU only has 4 slots for 1GB pages, and that was enough to get me about a 20% performance boost on Factorio. (I think a couple of percent might have come from an allocator change, but the boost from forcing huge pages was very significant.)
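The "16GB" and "4 slots" figures above are TLB reach: entry count times page size. A minimal sketch of that arithmetic, using the entry counts quoted in this thread (not verified against vendor documentation); `tlb_reach` and the labels are illustrative names only:

```python
# TLB reach = number of TLB entries * page size.
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of address space that can be mapped without a TLB miss."""
    return entries * page_size


# Entry counts are the ones quoted in the thread; treat them as assumptions.
configs = [
    ("Skylake-X STLB, 4K pages", 1536, 4 * KIB),
    ("Skylake-X STLB, 2M pages", 1536, 2 * MIB),
    ("Skylake-X STLB, 1G pages", 16, GIB),
    ("older CPU with 4 slots, 1G pages", 4, GIB),
]

for name, entries, page_size in configs:
    print(f"{name}: {tlb_reach(entries, page_size) / MIB:g} MiB")
```

With these numbers the reach works out to 6 MiB, 3 GiB, 16 GiB, and 4 GiB respectively, which is why even a handful of 1G entries can cover a working set that thousands of 4K entries cannot.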

jeffbee · 2026-03-16T22:41:52 1773700912

That strikes me as a common hugepages win. People never believe you, though, when you say you can make their thing 20% faster for free.

menaerus · 2026-03-17T13:51:07 1773755467

Then it should be pretty easy to demonstrate that 20% "faster for free", no? But as always, the devil is in the details. I experimented a lot with huge pages, and although in theory you should see a performance boost, the workloads I used to test this hypothesis did not show anything statistically significant or measurable. So my conclusion was ... it depends.

Dylan16807 · 2026-03-17T18:18:54 1773771534

Try a big Factorio map just as a test case. It's a bit of an outlier on performance; in particular, it's very heavy on memory bandwidth.

jeffbee · 2026-03-17T15:53:28 1773762808

Of course, it only helps workloads that exhibit high rates of page table walking per instruction. But those are really common.

menaerus · 2026-03-17T19:04:32 1773774272

Yes, I understand that. It is implied that there's a high TLB miss rate. However, I'm wondering whether the penalty, which we can quantify as about 4 memory accesses for a 4-level page table walk (roughly 20 cycles if the page table entries are already in L1 cache, or roughly 60-200 cycles if they are in L2/L3), would be noticeable in workloads that are IO bound. In other words, would such workloads benefit from switching to huge pages when the CPU spends most of its time waiting for data to arrive from storage anyway?
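A rough way to frame that question is to compare time spent in page walks against time spent waiting on IO. The sketch below is a back-of-the-envelope model only; the 3 GHz clock, 100 µs IO wait, and 1000 TLB misses per request are hypothetical inputs, not measurements, and `walk_time_fraction` is an illustrative name:

```python
def walk_time_fraction(tlb_misses: int, cycles_per_walk: float,
                       clock_hz: float, io_wait_s: float) -> float:
    """Fraction of (page walk + IO wait) time that goes to page walks."""
    walk_s = tlb_misses * cycles_per_walk / clock_hz
    return walk_s / (walk_s + io_wait_s)


# Hypothetical inputs: 3 GHz core, 100 microseconds of IO wait per request,
# 1000 TLB misses per request. The per-walk costs are the ones quoted above.
CLOCK_HZ = 3e9
IO_WAIT_S = 100e-6
MISSES_PER_REQUEST = 1000

for cycles in (20, 60, 200):
    frac = walk_time_fraction(MISSES_PER_REQUEST, cycles, CLOCK_HZ, IO_WAIT_S)
    print(f"{cycles} cycles/walk: {frac:.1%} of the time is page walking")
```

Under this model the answer depends almost entirely on how many misses accompany each IO wait, so measuring the actual miss rate of the workload is the only way to settle it.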

jeffbee · 2026-03-17T19:37:37 1773776257

In a multi-tenant environment, yes. The faster they can get off the CPU and yield to some other tenant, the better it is.

tosti · 2026-03-17T10:40:55 1773744055

    > commodity
    > zen 5
    > 128GiB

Are you from the future?

Dylan16807 · 2026-03-18T02:11:55 1773799915

I'm not sure what point you're trying to make.

In the middle of last year, a 9900X was around $350 and 128GB of memory was also around $350. That's very easily "commodity" range.

tosti · 2026-03-18T07:07:50 1773817670

Damn. I feel old and must've missed that boat. Several other boats too, I guess.

Here I was thinking 16GiB is pretty good. I get to compile LibreOffice in an afternoon. QtWebEngine overnight.

Doesn't 128GiB make rowhammer much more feasible? You'd have 32GiB per DIMM.

Oh well

Dylan16807 · 2026-03-18T07:21:16 1773818476

Two 64GiB DIMMs would be the more likely setup. The current CPUs strongly prefer having only one stick of DDR5 per channel.

The effectiveness of rowhammer depends on how well the manufacturer implemented target row refresh. But the internal ECC on DDR5 should help defend against it somewhat.

Personally I've been in the 24-32GiB range since 2013, and that's despite the fact that I'm still on DDR3.