Hacker News
Low-Latency, High-Throughput Garbage Collection [pdf] (anu.edu.au)
81 points by todsacerdoti on April 25, 2022 | hide | past | favorite | 62 comments


Based on the abstract it looks like it's a combination of reference counting and tracing GC that brings improvements in both tail latency and throughput compared to G1 and Shenandoah.

> LXR introduces: i) RC remembered sets for judicious copying of mature objects; ii) a novel low-overhead write barrier that combines coalescing reference counting, concurrent tracing, and remembered set maintenance; iii) object reclamation while performing a concurrent trace; iv) lazy processing of decrements; and v) novel survival rate triggers that modulate pause durations.

There was a thread just recently debating the relative merits of RC & GC, and here we see them combined smartly.


It's been a little while, but as I remember, the Garbage Collection Handbook (an excellent read if you're into this sort of thing) outlines and cites some external work showing tracing garbage collection as a lazily evaluated form of reference counting (using a saturating 1-bit counter) taken to the limit. (Some reference counting systems use a small number of bits for a reference count, and if that reference count hits its max value, then never decrement a counter. Many reference counting systems also incorporate tracing major collections to avoid the cycle problem. If you're going to have a tracing collector, and you assume your high-reference-count objects are likely to be long-lived, then reserving only a small number of bits for the reference count makes sense. Your backup tracing major collector will handle both the cycle problem and the saturating counter problem at no extra cost.)
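
A toy sketch of that scheme (assumed semantics, not any particular runtime's): a small saturating counter handles the common case, and a backup tracing pass reclaims both cycles and objects whose counter stuck at its maximum.

```python
MAX = 3  # a 2-bit counter saturates at 3

class Obj:
    def __init__(self, refs=()):
        self.count = 0
        self.refs = list(refs)  # outgoing references
        self.freed = False

def incref(o):
    if o.count < MAX:
        o.count += 1  # once at MAX, the counter sticks forever

def decref(o):
    if o.count == MAX:
        return  # saturated: only the backup trace can reclaim this object
    o.count -= 1
    if o.count == 0:
        o.freed = True
        for child in o.refs:
            decref(child)

def backup_trace(all_objects, roots):
    # Handles cycles AND saturated counters at no extra cost.
    live, stack = set(), list(roots)
    while stack:
        o = stack.pop()
        if id(o) not in live:
            live.add(id(o))
            stack.extend(o.refs)
    for o in all_objects:
        if id(o) not in live:
            o.freed = True

a = Obj(); incref(a); decref(a)
assert a.freed                    # ordinary RC path frees it immediately
b = Obj()
for _ in range(5): incref(b)      # counter saturates at 3
for _ in range(5): decref(b)      # decrements are then ignored
assert not b.freed
backup_trace([b], roots=[])
assert b.freed                    # backup trace cleans it up
```

With more bits the counter saturates less often, so the tracing fallback has less extra work to do; the 1-bit limit of this idea is where tracing does everything.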

Eagerly updated single-bit reference counts are useful if you have a lot of code that allocates buffers, passes them to helper functions to get filled (potentially defeating static escape analysis), passes them to consuming functions, and then returns. A single bit to track if a reference to that buffer has ever been stored into another object will allow for early collection in the common case.

On a side note, the Garbage Collection handbook (and many practitioners) consider reference counting to be a form of garbage collection. What some people narrowly define as GC is tracing garbage collection.


> cites some external work showing tracing garbage collection as a lazily evaluated form of reference counting (using a saturating 1-bit counter) taken to the limit.

Yup. The paper you're thinking of is "A Unified Theory of Garbage Collection":

https://courses.cs.washington.edu/courses/cse590p/05au/p50-b...


Do you have a link to that thread? I’ve never quite understood the value of GC over RC save for vague claims of developer efficiency.


Chris Lattner on garbage collection vs. Automatic Reference Counting (2017), https://news.ycombinator.com/item?id=31139610

See also “A unified theory of garbage collection” by Bacon, Cheng, and Rajan (OOPSLA ’04) for a discussion of how tracing and RC essentially compute the least and greatest fixed points, respectively, of the “roots and referents” function (the difference between the two being the reference cycles), and of course the Garbage Collection Handbook by Jones, Hosking, and Moss (CRC, 2012), which includes both tracing and RC in its definitions of GC and draws comparisons (barrier costs, heap fragmentation, etc.).

(These are both bog-standard references; the thread above has people who know much more about this talking about ideas that aren't 20 years old.)


Excellent answer. I would just add a brief summary, that naive reference counting does work proportional to the rate at which references are created and destroyed, and naive tracing collectors do work proportional to either the full heap size or the number of live objects on the heap.

Note that as a rule of thumb, tracing garbage collectors need about twice as much memory as live objects, to avoid frequent heap scans. Non-delayed reference counting only keeps around live objects, which can significantly reduce memory usage.

If you have a huge number of live objects that you need to keep around, but infrequently have references to them created (e.g. very large caches), then reference counting is going to out-perform tracing collection, particularly if some of those infrequently used objects might be paged out and you're not using card marking to sometimes avoid needing to trace the paged-out objects. It's convenient that these use cases also tend to work well with reference counting's generally lower peak memory requirements.

On the other hand, if you're rapidly creating and destroying objects, or often updating references, tracing may perform better. Moving collectors (at least using moving or compacting collectors for the young generation) can use bump allocation, resulting in much faster object allocation.
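
The bump-allocation point can be sketched in a few lines (a toy model, ignoring alignment and object headers):

```python
class BumpAllocator:
    """Toy bump allocator: allocation is just a pointer increment,
    which is why moving/compacting young generations allocate so fast."""
    def __init__(self, capacity):
        self.heap = bytearray(capacity)
        self.top = 0

    def alloc(self, size):
        if self.top + size > len(self.heap):
            return None  # a real GC would trigger a minor collection here
        addr = self.top
        self.top += size  # the entire "allocation" is this one addition
        return addr

a = BumpAllocator(64)
assert a.alloc(16) == 0
assert a.alloc(16) == 16   # consecutive allocations are adjacent: good locality
assert a.alloc(64) is None
```

After a minor collection evacuates the survivors, `top` resets to zero and the whole region is reusable, with no free lists to maintain.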

The Garbage Collection Handbook goes into nearly countless variations, hybridizations, and optimizations on these basic collectors.

For over a decade, I helped maintain Goldman's Slang language. Some components had a bog-standard mark-and-sweep collector, but most of the language uses reference counting. The language uses tight integration with the SecDb globally distributed NoSQL database and pervasively locally caches database objects and memoizes method calls on those objects. I'm sure naive reference counting isn't optimal for the Slang use case, but it's easy to implement and isn't actually that far from optimal. (When I first started at Goldman, some parts of the language used an implementation of Bacon's cycle-detecting concurrent reference counting collector, but there was some rare corner case where someone missed a reference count change on the C++ side, and after a while of trying to find the missing reference count, we just replaced that component with a non-generational mark-and-sweep collector.) (There was a publish-subscribe system that would watch the SecDb transaction logs, so applications could subscribe for cache invalidation notifications. With huge caches, you don't want to use database polling for cache invalidation.)


A simple implementation of RC with multithreading implies a lot of atomic counter updates, which clog a multicore CPU fast.

Therefore you apply techniques like Apple's ARC, or deferred-coalesced counter mutation.

Add cycle leak detectors and the RC scheme is looking more and more like a really slow garbage collector, albeit a low latency one.

Note that simple RC is not deterministic or good for real time either because of the possibility of release/destruct cascade. Which looks pretty much like a GC pause (because it is one).


The short answer is: reference counting walks the dead part of the heap, and tracing gc walks the live part of the heap.

When the reference count of an object goes to zero, you recursively decrement the reference counts of everything the object points to. This leads you to trace the object graph of things whose reference counts have gone to zero. You never follow pointers from anything which is live (i.e., has a refcount > 0).

When you are doing the copy phase of a gc, you start with the root set of live objects, and follow pointers from everything that is live. Since anything pointed to by a live object is live, you only follow the pointers of live objects. You never follow pointers from anything which is dead (i.e., garbage).

If object lifetimes are short, most objects will be dead, and so RC will be worse than GC. If object lifetimes are long, most objects will be live, and GC will be worse than RC.
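
A toy model of the two walks (a made-up five-object heap, not any real collector's representation):

```python
# Heap as a dict of object -> outgoing references.
heap = {
    "root": ["a"],
    "a": ["b"],
    "b": [],
    "c": ["d"],   # c and d are garbage: nothing reachable points to them
    "d": [],
}

def trace_live(heap, roots):
    """Tracing walks only the LIVE part: follow pointers from the roots."""
    live, stack = set(), list(roots)
    while stack:
        obj = stack.pop()
        if obj not in live:
            live.add(obj)
            stack.extend(heap[obj])
    return live

def rc_free(heap, dead_root, refcounts):
    """RC walks only the DEAD part: recursively decrement the referents
    of an object whose count just hit zero."""
    freed, stack = set(), [dead_root]
    while stack:
        obj = stack.pop()
        freed.add(obj)
        for child in heap[obj]:
            refcounts[child] -= 1
            if refcounts[child] == 0:
                stack.append(child)
    return freed

assert trace_live(heap, ["root"]) == {"root", "a", "b"}
assert rc_free(heap, "c", {"a": 1, "b": 1, "c": 0, "d": 1}) == {"c", "d"}
```

Note that neither function ever visits the other's set: tracing never sees `c` and `d`, and the RC cascade never touches `root`, `a`, or `b`.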

Empirically, the overwhelming majority of objects have a very short lifetime, with only a few objects living a long time. (This is called "the generational hypothesis.") So the optimal memory allocator will GC short-lived objects and RC long-lived objects. Rust/C++ encourage you to do this manually, by stack-allocating things you think will be short-lived, and saving RC for things with an expected long lifetime.

Beyond this, RC has a few really heavy costs.

Reference counting doesn't handle cyclic memory graphs. You need to add tracing to handle those, and if you are going to do tracing anyway, it's tempting to just do tracing really well and skip the refcounts entirely.
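
CPython is a concrete example of exactly this hybrid: plain reference counting for the common case, plus a cycle-detecting tracing pass in the `gc` module for what RC can't reach.

```python
import gc
import weakref

class Node:
    def __init__(self):
        self.ref = None

gc.disable()  # keep the tracing pass from running on its own schedule

# Acyclic garbage: reference counting frees it the moment the count hits 0.
a = Node()
wa = weakref.ref(a)
del a
assert wa() is None

# Cyclic garbage: the refcount never reaches 0, so RC alone leaks it.
b = Node()
b.ref = b            # self-cycle pins the count at 1
wb = weakref.ref(b)
del b
assert wb() is not None   # still alive after the only name is gone
gc.collect()              # the backup tracing collector finds the cycle
assert wb() is None

gc.enable()
```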

Part of the temptation is that the memory overhead of reference counts is high -- empirically, most objects never have more than a single reference to them, and so using a whole word for reference counts is a lot of overhead. Moreover, the need to increment/decrement reference counts is really bad for performance: first, mutations are expensive in terms of memory bandwidth (you've got to maintain cache coherence with the other CPUs), and second, in a multicore setting, you have to lock that word to ensure the updates are atomic.

There are tricks to mitigate this (e.g., Rust distinguishes Arc and Rc for objects which can be shared between threads or not), and there are schemes to optimise away RC assignments with static analysis (deferred reference counting), but if you want to do a really good job of reference counting, then you will be implementing a lot of tracing GC machinery.

And vice versa! The algorithm in the link is partly about adding RC to handle old objects (empirically, as part of the generational hypothesis, objects which have lived a long time will live a long time more). In fact, Blackburn and McKinley (two of the three authors of the above paper), pioneered the combination approach with their paper "Ulterior Reference Counting."


Minor nit: the description above is true for copying GCs, but non-copying mark-sweep collectors generally still touch the header words of both live and dead objects in the heap in order to add dead objects to free lists. Mark-sweep-compact collectors also end up reading all of the mark words in the object headers of both live and dead objects in order to find the holes to be filled by compaction.

It's also not uncommon to have a copying young generation and a mark-sweep-compact tenured generation. That way, you get the advantages of not needing to scan the huge numbers of young dead objects, but the space savings of not needing 2x space for the older generation.


In theory, GC can be faster than RC. (A tracing GC only pays a cost proportional to what's still alive, once per collection; RC pays an overhead on every reference operation.)


> A key design insight for LXR is that regular brief STW pauses provide a highly-efficient, low-latency approach to garbage collection.

I have been playing around with low-latency GC strategies in .NET 6, and have discovered a similar pattern - if you intentionally run GC as frequently as makes sense (i.e. instead of trying to avoid it or sweep it under some other rug), you can sometimes reach a solution that is far superior to any other.

My example being a toy software rendering engine that is utilized by several concurrent clients. I asked myself "What would happen if I just ran GC right before every single frame"? Turns out, for reasonable scene sizes, the maximum amount of time the GC pause lasted was <5ms, producing a perfect result for my use case. Without intentionally calling GC, I would sometimes see pauses upwards of 3-4 seconds, which clearly has a severe impact on user experience for this type of application.

The way it works in my head - I am happy to pay a small 1~5ms cost for every frame (or batch of frames as the case may be) rather than allowing the .NET runtime to decide when would be most convenient and incur a much longer collection.
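
The same pattern, sketched in Python (`gc.collect` standing in for the .NET call; `render_frame` is a hypothetical per-frame callback, not from the original post):

```python
import gc
import time

def render_loop(frames, render_frame):
    """Run a tracing collection at a point WE choose in each frame,
    rather than letting the runtime pick an arbitrary (worse) moment."""
    pauses = []
    for frame in range(frames):
        t0 = time.perf_counter()
        gc.collect()                           # explicit, bounded pause...
        pauses.append(time.perf_counter() - t0)
        render_frame(frame)                    # ...then the latency-sensitive work
    return pauses

# A stand-in frame that allocates some transient garbage each time.
pauses = render_loop(10, lambda f: [object() for _ in range(1000)])
assert len(pauses) == 10
```

Because each collection runs when the heap holds little besides the previous frame's garbage, every individual pause stays small instead of accumulating into one long one.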

"Clean as you go" might be the best way to achieve low latency in any reality. Deferring and hiding is how things build up. Something is gonna have to eventually deal with all that trash unless you are shipping cruise missiles.


Of course, the limiting case of "clean as you go" is either fully manual memory management, or at least deterministic deallocation (as seen in reference counting systems). With some exceptions for small bursts of especially low-latency, high-throughput work, that can be modeled in a "manual" system by arenas, or higher-scoped references if using reference counting.


Alas, reference counting and C-style manual memory management can still require doing arbitrary amounts of work from time to time.


> My example being a toy software rendering engine that is utilized by several concurrent clients. I asked myself "What would happen if I just ran GC right before every single frame"? Turns out, for reasonable scene sizes, the maximum amount of time the GC pause lasted was <5ms, producing a perfect result for my use case. Without intentionally calling GC, I would sometimes see pauses upwards of 3-4 seconds, which clearly has a severe impact on user experience for this type of application.

Oberon, if I'm not mistaken, had a stupidly simple GC that simply ran between every two user interactions. That way you wouldn't notice pauses, simply because your fingers move way slower than bytes in memory.


I don’t believe it is as easy as that. What happens when your program has accumulated plenty of long-lived objects? Your case probably works due to having only objects with short lifetimes - but your GC’s pause time will generally grow with the heap size. Not even generational GCs solve that in the general case.

Java’s low-latency GC, ZGC, moves almost all of the work to concurrent threads by accepting the trade-off of slightly slower barriers. But in return it can decouple the pause time from the heap size, which is the key here.


The whole point of generational GC is that your pause times don't grow with the old objects. I think concurrent sweep makes a lot of sense (it's basically free), but concurrent mark has major performance compromises. It's not just the longer barrier; it also adds an extra indirection to all your memory accesses. Fully concurrent collection may be right for some use cases (where you *need* very low pause times), but it comes at a considerable cost.


I don’t think that generational GC can totally decouple pause times from heap size in the worst case, which is what is important for latency-oriented workloads.


For stuff like games, it probably does it well enough. If you drop a frame every minute, that's not a major problem.


Chris Seaton pointed out something very interesting in his Reddit comment here:

https://www.reddit.com/r/java/comments/ubgyva/comment/i646o4...

> "The beauty is it’s not a Java GC, it’s a generic GC they have evaluated in Java."

Important bit of info there.


Most interesting observation here: Short GC pauses do not assure low latency.

Anyway: this paper is mostly about Java, which I rarely use and basically only know from Elasticsearch (where log entries about GC pretty much always seem to indicate 'add more memory'...), but I've never run into any scenarios where .NET CLR GC was a performance issue either (not on the legacy .NET Framework nor in more recent releases, which are a lot better in most performance aspects).

Most GC complaints from the .NET world seem to be from game developers using Unity. Which mostly tells me that there should be a way to have a 'this is the rendering thread, never pause this for GC, unless I really do bad stuff' in Unity...


Well, the framework they use (mmtk: https://github.com/mmtk/mmtk-core) is language-runtime agnostic. They just used Java because Java is one of the most mature and well-researched managed languages. If you make a port for C#/.NET (assuming all the required features are implemented), you could pretty much use the same GC.



Thanks. Just finished reading the whole paper.

Thoughts in no particular order:

• A very impressive tour de force. To implement a GC in OpenJDK is hard, to come in competitive with G1 out of the gate doubly so. The approach is a very clever hybrid of both classical GC and RC techniques. They aimed to reinvigorate GC research and appear to have achieved this goal.

• Although advertised as low latency their pause times are comparable to G1. SOTA low latency collection is now <1msec and still falling. Their benchmarks don't really show this because they appear to be overdriving both ZGC and Shenandoah, which then totally collapse (e.g. >800msec pauses) due to the lack of generational collection making it hard for them to keep up with high allocation rates. This is an acknowledged limitation which at least the ZGC developers are working on removing, probably Shenandoah devs too, but the authors only mention this in one sentence right at the end.

In fairness they obviously cannot benchmark against unreleased versions of ZGC, but equally, I was expecting/hoping to see benchmarks run at allocation rates that ZGC/Shenandoah are actually rated for today. What they're mostly proving here is that you need generational collection to handle high throughputs on allocation heavy benchmarks like lusearch, rather than anything special about LXR specifically.

• It would have been nice to see a deeper dive of heap overheads and locality. Traditionally the place RC gets used is in consumer electronics / desktop apps, where you need very stable memory locality and predictable heap management overheads to avoid swapping/spikeyness. What they're doing here is sort of half-GC, half-RC and it's not really clear to what extent this would make LXR more appropriate for things like desktop apps. It'd be interesting to see a benchmark of "normal" apps like IntelliJ where the host OS is deliberately pushed into swapping, then seeing if there's a major performance gap. In theory LXR should win because objects don't move around as much but they don't seem to have tested that.

• Relatedly, the question I'm left with is this: OK, you've got roughly similar perf to G1 but with fewer features. That's kind of impressive from a coding perspective but from an end user perspective, where's the competitive advantage? The opening discussion mentions simplicity in the first paragraph, but no LOC comparisons are provided so it's not obvious that LXR is/would be simpler than G1 when feature parity is achieved. It mentions LXR will provide "sufficient responsiveness" without concurrent evacuation, but the whole reason ZGC/Shenandoah were developed was because for some use cases G1 level pause times - even though they're usually quite low - were not considered sufficient.

Overall I'm wondering if the OpenJDK guys will really want to merge this. For the generic good throughput, reasonably-but-not-insanely-low-pauses market they have G1 which is mature and feature complete. For the markets like HFT where they need pauses so low they may well not exist, but aren't spamming the collector, they have ZGC. Soon they'll have generational ZGC which will ramp the throughput capabilities of ZGC significantly and should prevent collapses like those in the paper, whilst still maintaining sub-1msec pause times.

Integrating and maintaining a new GC is a big effort because they touch so many parts of the system. Without some obviously killer advantage, it may require external maintenance. Maybe better ergonomics on desktop systems where much of the app is paged out can provide this?


> In fairness they obviously cannot benchmark against unreleased versions of ZGC, but equally, I was expecting/hoping to see benchmarks run at allocation rates that ZGC/Shenandoah are actually rated for today. What they're mostly proving here is that you need generational collection to handle high throughputs on allocation heavy benchmarks like lusearch, rather than anything special about LXR specifically.

They evaluate over a diverse range of 17 benchmarks, though only 4 latency benchmarks. Would have been nice to have more latency benchmarks, I agree.

> It mentions LXR will provide "sufficient responsiveness" without concurrent evacuation, but the whole reason ZGC/Shenandoah were developed was because for some use cases G1 level pause times - even though they're usually quite low - were not considered sufficient.

Right. At the cost of a significantly larger heap size, though. Memory is cheap nowadays, yes, but large datacenters want to reduce their (electricity/power) running costs, and so if we can reach similar (or _slightly_ worse) latency requirements (and the paper demonstrates that low pause times do not necessarily translate to low latency) while using less memory, I think that in and of itself is quite appealing. Also do note that, as mentioned in the paper, LXR has not seen the benefit of years of tuning and tweaks that production collectors have.

I am quite excited by this new GC and hope to see further work done to improve its performance.


> Although advertised as low latency their pause times are comparable to G1. SOTA low latency collection is now <1msec and still falling. Their benchmarks don't really show this because they appear to be overdriving both ZGC and Shenandoah, which then totally collapse (e.g. >800msec pauses) due to the lack of generational collection making it hard for them to keep up with high allocation rates. This is an acknowledged limitation which at least the ZGC developers are working on removing, probably Shenandoah devs too, but the authors only mention this in one sentence right at the end.

This happens quite a lot. From my experience with ZGC, once you push it hard enough to cause an allocation stall, you are in for a bad time. But if you manage to keep the load under that threshold (usually by giving it a larger heap or more GC threads) you should expect pause times under 0.1ms.

Hopefully with the generational ZGC that is in the works this gets better and one could run with smaller heaps/less cpu overhead.


I can't speak for the OpenJDK developers, but given that the LXR collector is written in Rust, the current implementation is extremely unlikely to be merged.


Seems like OpenJDK _may_ be interested [1]?

[1]: https://twitter.com/rkennke/status/1518552419000627202


I'm still waiting for a lock-free, zero latency GC that runs on its own core. I believe this might exist for java but not .net.


Why would running the GC on its own core be a good idea? Don't you want it to happen wherever it would be most efficient, i.e. on the same core where the memory in question was most recently accessed?


Can’t speak for OP, but for certain low-latency applications like trading, one wants to isolate a core, pin a thread to it, and have that thread spinning hot on some condition (maybe network card, maybe IPC), for optimal latency in responding to a particular event.

In that scenario, you wouldn’t want the GC to run on that core, for sure.


Yep it's about latency. Even 1ms can be unacceptable depending on the situation. We currently don't have an option for that other than going native.


>> wouldn’t want the GC to run on that core

> We currently don't have an option for that other than going native.

You mean going to manual memory management, I believe. Technically, JIT vs. ahead-of-time native vs. interpreted is orthogonal to manual memory management vs. GC. Yes, most JITs also have garbage collectors, but that's not necessarily the case.

But, your point does stand that when you really care about 99.99th percentile latency, you almost certainly need to go with static native compilation and manual memory management. The long-tail on JIT and/or most GCs is just too high.


The OS can cause 1ms pauses easily, so I would be wary whether 1ms is truly unacceptable. To consistently achieve it, not even native may be enough in itself, you have to circumvent the OS kernel as well.


You can enforce that the Linux kernel not do this with nohz cmdline options


> In that scenario, you wouldn’t want the GC to run on that core, for sure.

In that scenario, you wouldn't want the GC to run at all. If it runs on another core but touches the GC header words on objects used by the application code, it's going to end up stealing cache lines from the application core's L1 cache, resulting in pipeline stalls on the application core.

I've talked with folks who just disabled the garbage collector, were careful to keep down the number of objects allocated per trade, and ran their trading engines on boxes with huge amounts of RAM so that they were very unlikely to run out of memory before the close of the market.

Now, I could see some GC-specific modifications to the cache coherency hardware. For one, it would be useful to have a message to set or unset a single bit on a cache line owned by another core, and report the previous value of the bit on the cross-core interconnect, without changing the ownership of the cache line or moving the whole line across the bus. You'd probably also want a variant on the MESIF cache coherency protocol where another core can get a "notify" bit set on a cache line, so when a cache line in the F state with its notify bit set gets naturally evicted from the cache, its contents and ownership would be transferred to the requesting core. I imagine you'd have a small programmable push-down automaton similar to the ESP32's ULP core sitting in the L1 cache. Its stack would be that core's share of the grey set, and it would have a small array of addresses it was waiting on to mark in the background as their cache lines became un-contended. It would undoubtedly be complex, but it seems the cleanest way to do garbage collection in the background without stealing cache lines from cores executing application code.

Really, I think you're best off having Erlang/Elixir/BEAM-like concurrency (but using ahead-of-time native compilation) where most objects are kept in per-actor GC arenas, and maybe only have some types capable of being passed between actors, and allocating those types in separate arenas to keep their cache line contention from harming the happy path non-shared objects. (Yes, last I checked, BEAM had moved from per-actor ("process") GC arenas to shared arenas, but that's largely because they had no way of accurately determining which allocations would later be passed across actors.)

Also, ideally, you'd have static inference of region-based garbage collection to reduce the amount of scanning done. Static inference of Rust-like lifetimes is undecidable, so you'd need a conservative inference algorithm that would leave some performance on the table, but you'd still have your garbage collector as a fall back for objects where lifetime inference fails. I guess you'd short-circuit lifetime inference for most dev builds to keep build times sane.


> I've talked with folks who just disabled the garbage collector

Yes we did this too for a while. In the end it slowed down dev velocity too much and we could afford to trade off some slight tail latency. Mind you this was back in like 2016, I haven’t used Java in trading for a while..

> it's going to end up stealing cache lines from the application core's L1 cache

This happened but in practice it wasn’t a big deal, it could be pretty effectively mitigated by consistently running the fast-path code to keep the caches hot.


Because you don't want GC to interfere with cache.


Which cache are we talking about? Are you saying the GC blows out the icache, or ?


As the GC running on its own core sets mark bits in a mark-sweep or mark-sweep-compact collector, it's going to get exclusive cache line access to the line holding the GC word for every live object. That's going to force more traffic on the cross-core communication lanes and cause cache line invalidation on the cores running your application code. A dedicated GC core is basically a core dedicated to flooding the cross-core interconnect, and invalidating lots of cache lines for your cores running application code.

I think you're much better off coming up with some dedicated instructions and/or features to make read and/or write barriers lighter weight. One possibility would be hardware multi-threading, but where the one thread only issues micro-ops when the primary thread would otherwise stall the pipeline. That way, GC could run in the background using the same L1 cache as the application thread.


> A dedicated GC core is basically a core dedicated to flooding the cross-core interconnect, and invalidating lots of cache lines for your cores running application code.

Yeah, that's what I was trying to chase out of the OP. Running GC on a dedicated core seems like such an obviously bad idea, I'm not sure why anyone would seriously propose it.


> I'm not sure why anyone would seriously propose it.

Likely, cache line contention just never crosses their minds. I don't blame them. I'm largely self-taught. My degree is in mechanical engineering, and I did a bit of dabbling in kernel programming in college, and took a CPU design course, but I had been working in industry for more than 10 years before I had a decent mental model of cache coherency.

There are just so many things to learn. I think few practitioners were exposed to or remember MESI or similar cache coherency protocols unless they've really done a lot of deep optimization in low-level languages or other languages with value types.

Most of your GC languages (particularly those without value types) do so much pointer chasing that cache line contention may often get lost in the noise floor of cache line evictions.


> Running GC on a dedicated core seems like such an obviously bad idea, I'm not sure why anyone would seriously propose it.

I was under the impression that C4/Zing was doing exactly that. Did I misunderstand what C4/Zing was doing?


This is a great response and yes something I didn't really think about with the original idea.

Considering actual, real world hardware as you point out, it is not a very good idea.


L1


Yes, a GC implemented in hardware would be a huge improvement.


Low-Latency, High-Throughput Garbage Collection... twitter?


No matter how fast, Garbage Collection will always cause a performance knock-on effect by kicking useful stuff out of cache.


No matter how productive, high level languages will always cause a performance knock-on effect by kicking useful stuff out of instruction cache and internal registers.

Yet they are useful.


What high-level language feature causes inherently increased register pressure?


Assembly allows you to use all of them, and there is no need to reserve any for the ABI, other than for syscalls or specific opcode requirements.

On high level languages one needs to hope that the register allocator does a good job.


Ha, well. I just spent a fair bit of time writing an interpreter in assembly and hand-optimized the regalloc to the limit. But I was surprised by my competitors written in C and how well gcc, in particular, could regalloc hairy interpreter loops. Compilers are very good at register allocation at the right granularity; they just generally don't do it across procedures.


> they just generally don't do it across procedures.

They won't do it for extern (i.e. separately compiled) calls, because the ABI wouldn't really allow for anything of the sort. But calls to static (or inline) procedures that only have visibility within a single source file can occur with arbitrary calling conventions, so the compiler is free to optimize register allocation across procedures.


Sure, it is a matter of how good the regalloc algorithm is, and how many registers there are to play with.

My point was more about how we are willing to trade the optimal performance for something that gets us there most of the time, while offering much more productive tooling.


Right, the cost of maintaining systems would be exponentially high for a marginally irrelevant gain on an edge case.


Developer salary per hour * time developing the feature * time spent bug fixing * number of team members + hardware infrastructure cost


No matter how fast, malloc will always cause a performance knock-on effect by wrecking locality.


I don't agree with this. In the limit, an AI (if it is ever as smart as a human[1]) will just introduce malloc/free and stack allocation at the right places.

[1] at memory management. I think that it is only a matter of time until machine learning, region-based memory management, and inference of ownership (which really is region-based memory management, tbh) are combined in the right way to get superhuman results.


> introduce malloc/free and stack allocation at the right places

Will it? malloc/free are slow; gc can allocate faster and free faster. And problems are often dynamic and can't be statically determined.


That doesn't really make any sense. The CPU is totally unaware of whatever high-level concept the garbage collector is pursuing. If hot garbage becomes available for reuse, then it's the best outcome.


When the GC starts traversing the list of objects it fills the cache kicking out the previously cached things. This is a known performance problem in optimization. It's quite common. It also happens when having 2+ threads doing a lot of memory access.

And caches bring in a whole cache line, so for every access, no matter how small, at least 64 bytes are kicked out. Also, the prefetch heuristics often bring in more (e.g. the next cache line).

There are cache control instructions but they are rarely used.

[0] Reducing Garbage Collector Cache Misses (2000) (HP) https://www.hpl.hp.com/techreports/2000/HPL-2000-99.pdf

> Cache misses are currently a major factor in the cost of garbage collection, and we expect them to dominate in the future. Traditional garbage collection algorithms exhibit relatively little temporal locality; each live object in the heap is likely to be touched exactly once during each garbage collection. We measure two techniques for dealing with this issue: prefetch-ongrey, and lazy sweeping. The first of these is new in this context. Lazy sweeping has been in common use for a decade. It was introduced as a mechanism for reducing paging and pause times; we argue that it is also crucial for eliminating cache misses during the sweep phase.


GC currently causes a lot of memory traffic, that is true. So don't touch the whole heap. Wow, that reference is old. It doesn't even use a generational collector. Hans is a bright guy and is up-to-date on current GCs; I kind of doubt that he'd agree with your use of his old paper to characterize today's GCs.


The counter to this is that with a moving, generational collector, the GC mainly touches memory in the young generation, which then gets moved out of the young generation. This means that GC'd languages can have better locality than manually managed languages, since all your new objects get allocated together.



