I'm still waiting for a lock-free, zero latency GC that runs on its own core. I ...

jeffbee · on April 25, 2022

Why would running the GC on its own core be a good idea? Don't you want it to happen wherever it would be most efficient, i.e. on the same core where the memory in question was most recently accessed?

caffeine · on April 25, 2022

Can’t speak for OP, but for certain low-latency applications like trading, one wants to isolate a core, pin a thread to it, and have that thread spinning hot on some condition (maybe network card, maybe IPC), for optimal latency in responding to a particular event.

In that scenario, you wouldn’t want the GC to run on that core, for sure.

snarfy · on April 25, 2022

Yep it's about latency. Even 1ms can be unacceptable depending on the situation. We currently don't have an option for that other than going native.

KMag · on April 25, 2022

>> wouldn’t want the GC to run on that core

> We currently don't have an option for that other than going native.

You mean going manual memory management, I believe. Technically, JIT vs. ahead-of-time native vs. interpreted is orthogonal to manual memory management vs. GC. Yes, most JITs also have garbage collectors, but it's not necessarily the case.

But, your point does stand that when you really care about 99.99th percentile latency, you almost certainly need to go with static native compilation and manual memory management. The long-tail on JIT and/or most GCs is just too high.

kaba0 · on April 26, 2022

The OS can cause 1ms pauses easily, so I would be wary whether 1ms is truly unacceptable. To consistently achieve it, not even native may be enough in itself, you have to circumvent the OS kernel as well.

caffeine · on April 27, 2022

You can enforce that the Linux kernel not do this with nohz cmdline options

KMag · on April 25, 2022

> In that scenario, you wouldn’t want the GC to run on that core, for sure.

In that scenario, you wouldn't want the GC to run at all. If it runs on another core but touches the GC header words on objects used by the application code, it's going to end up stealing cache lines from the application core's L1 cache, resulting in pipeline stalls on the application core.

I've talked with folks who just disabled the garbage collector, were careful to keep down the number of objects allocated per trade, and ran their trading engines on boxes with huge amounts of RAM so that they were very unlikely to run out of memory before the close of the market.

Now, I could see some GC-specific modifications to the cache coherency hardware. For one, it would be useful to have message to set or unset a single bit on a cache line owned by another core, and report the previous value of the bit on the cross-core interconnect, without changing the ownership of the cache line or moving the whole line across the bus. You'd probably also want a variant on the MESIF cache coherency protocol where another core can get a "notify" bit set on a cache line, so when a cache line in the F state with its notify bit set gets naturally evicted from the cache, its contents and ownership would be transferred to the requesting core. I imagine you'd have a small programmable push-down automaton similar to the ESP32's ULP core sitting in the L1 cache. Its stack would be that core's share of the grey set, and it would have a small array of addresses it was waiting on to mark in the background as their cache lines became un-contended. It would undoubtedly be complex, but it seems the cleanest way to do garbage collection in the background without stealing cache lines from cores executing application code.

Really, I think you're best off having Erlang/Elixir/BEAM-like concurrency (but using ahead-of-time native compilation) where most objects are kept in per-actor GC arenas, and maybe only have some types capable of being passed between actors, and allocating those types in separate arenas to keep their cache line contention from harming the happy path non-shared objects. (Yes, last I checked, BEAM had moved from per-actor ("process") GC arenas to shared arenas, but that's largely because they had no way of accurately determining which allocations would later be passed across actors.)

Also, ideally, you'd have static inference of region-based garbage collection to reduce the amount of scanning done. Static inference of Rust-like lifetimes is undecidable, so you'd need a conservative inference algorithm that would leave some performance on the table, but you'd still have your garbage collector as a fall back for objects where lifetime inference fails. I guess you'd short-circuit lifetime inference for most dev builds to keep build times sane.

caffeine · on April 27, 2022

> I've talked with folks who just disabled the garbage collector

Yes we did this too for a while. In the end it slowed down dev velocity too much and we could afford to trade off some slight tail latency. Mind you this was back in like 2016, I haven’t used Java in trading for a while..

> it's going to end up stealing cache lines from the application core's L1 cache

This happened but in practice it wasn’t a big deal, it could be pretty effectively mitigated by consistently running the fast-path code to keep the caches hot.

sanxiyn · on April 25, 2022

Because you don't want GC to interfere with cache.

jeffbee · on April 25, 2022

Which cache are we talking about? Are you saying the GC blows out the icache, or ?

KMag · on April 25, 2022

As the GC running on its own core sets mark bits in a mark-sweep or mark-sweep-compact collector, it's going to get exclusive cache line access to the line holding the GC word for every live object. That's going to force more traffic on the cross-core communication lanes and cause cache line invalidation on the cores running your application code. A dedicated GC core is basically a core dedicated to flooding the cross-core interconnect, and invalidating lots of cache lines for your cores running application code.

I think you're much better off coming up with some dedicated instructions and/or features to make read and/or write barriers lighter weight. One possibility would be hardware multi-threading, but where the one thread only issues micro-ops when the primary thread would otherwise stall the pipeline. That way, GC could run in the background using the same L1 cache as the application thread.

jeffbee · on April 26, 2022

> A dedicated GC core is basically a core dedicated to flooding the cross-core interconnect, and invalidating lots of cache lines for your cores running application code.

Yeah, that's what I was trying to chase out of the OP. Running GC on a dedicated core seems like such an obviously bad idea, I'm not sure why anyone would seriously propose it.

KMag · on April 26, 2022

> I'm not sure why anyone would seriously propose it.

Likely, cache line contention just probably never crosses their minds. I don't blame them. I'm largely self-taught. My degree is in mechanical engineering, and I did a bit of dabbling in kernel programming in college, and took a CPU design course, but I had been working in industry for more than 10 years before I had a decent mental model of cache coherency.

There are just so many things to learn. I think few practitioners were exposed to or remember MESI or similar cache coherency protocols unless they've really done a lot of deep optimization in low-level languages or other languages with value types.

Most of your GC languages (particularly those without value types) do so much pointer chasing that cache line contention may often get lost in the noise floor of cache line evictions.

jhgb · on April 27, 2022

> Running GC on a dedicated core seems like such an obviously bad idea, I'm not sure why anyone would seriously propose it.

I was under the impression that C4/Zing was doing exactly that. Did I misunderstand what C4/Zing was doing?

snarfy · on April 29, 2022

This is a great response and yes something I didn't really think about with the original idea.

Considering actual, real world hardware as you point out, it is not a very good idea.

alecco · on April 25, 2022

nick_ · on April 25, 2022

Yes, a GC implemented in hardware would be a huge improvement.