
Twitter post from Steve Blackburn (one of the authors) announcing the paper and providing a bit of context: https://twitter.com/stevemblackburn/status/15196411576216576...

Haven't quite finished reading the paper yet, but it's interesting so far.

While they present a method that could be applied more widely, this paper is only analyzing current tracing GCs in OpenJDK (plus Epsilon GC). Would love to see future papers taking other GCs into account (e.g., various flavors of refcount GC).

Even better would be a way to compare vs methods of "manual" memory management (manual alloc/free, RAII/scope based alloc/free, arena allocation, etc.). Unfortunately, non-GC memory management is very application specific, so I'm not sure if a generalized analysis would be super useful.



> Even better would be a way to compare vs methods of "manual" memory management (manual alloc/free, RAII/scope based alloc/free, arena allocation, etc.). Unfortunately, non-GC memory management is very application specific, so I'm not sure if a generalized analysis would be super useful.

I've personally found Rust to be a very difficult language to learn and become productive in.

In contrast, I was able to jump into Swift and become productive very quickly.

Swift isn't GC, but is reference counted. Reference cycles will lead to leaked memory.

So what's interesting is to know GC's overhead relative to what: Reference counted languages like Swift? Or languages like Rust?

More importantly: Is the overhead worth it? Developer time is valuable; thus leading to the critical question: What's cheaper, extra developer time or a more powerful computer?


GC shifts human work from development time to deployment / ops time.

So the question is whether one-off developer time is cheaper than SRE time for the useful lifetime of the service.

However, it gets worse for GC: depending on the application, developing for a GC can be much more difficult / expensive than manual memory management (or reference counting). The canonical examples are low latency (with SLAs on the order of a disk or network I/O, so 10-100 microseconds) and applications that need to use bounded memory.

On top of that, there are confounding factors that hurt the case for GC. I don't know of an imperative GCed language with a reasonably expressive type system (all the JVM languages I know of use type erasure).

For experienced developers, rich type systems with actual runtime guarantees are a huge productivity boost.


I don't know of an imperative GCed language with a reasonably expressive type system

Well, we have no idea what you consider to be imperative and what you consider a reasonably expressive type system to be.


C#? F#? Go (now with generics)? Virgil...


GC shifts human work from development time to deployment / ops time.

Not unless you're load testing in production. Source: have been both a developer and an SRE in the past (an actual SRE, at Google, running services that used both manual memory management and GC).

Now, often people do in fact load test in production but this isn't some intrinsic property of GC. Also as an ops guy at Google I had to track down a number of memory management errors/segfaults that only occurred in prod, which wouldn't have occurred in a GCd language. Often double-frees or use-after-frees that only occurred after very specific combinations of RPC responses or timeouts that corrupted the state machine trying to handle them. Not much fun.

Also, you should be aware that "SRE" as a job role is deeply weird and most companies don't do what Google and fast-follower firms do. In most companies, ops people are cheaper than developers. Google was historically pretty unusual in demanding ops people who also happened to have histories hacking UNIX kernels but that was:

a. Not consistent; even in the early days most SRE people had sysadmin backgrounds.

b. To some extent because you would sometimes have to do things like ... debug segfaults in production.

Frankly they struggled to keep devs in SRE. They were only able to do it for a long time because they were making job offers to SWEs conditional on them working in SRE for a few years. Don't want to do ops work? Tough, you don't get to work at Google at all then. The desirability factor of working at Google was higher than the dislike of doing ops, so, people would accept this offer (that's how I ended up there). But obviously this is only a viable strategy in rare cases. Most companies can't do that.

Final point - your ideas about JVM GC are quite out of date. Modern JVMs are much better at auto-tuning. You can still squeeze extra performance out by tweaking obscure knobs and you should definitely tweak the one master knob that tells it whether your job is latency or throughput sensitive (i.e. batch or interactive). But that takes a few seconds and then you're done.

I don't know of an imperative GCed language with a reasonably expressive type system (all the JVM languages I know of use type erasure).

Type erasure and GC are orthogonal. C# has GC without type erasure.


You don't consider Scala to have a reasonably expressive type system?


They probably don't consider Scala an imperative language.


It is definitely both functional and imperative depending on author :)


> I don't know of an imperative GCed language with a reasonably expressive type system

Haskell with all code in the IO monad?


Now I'm mildly curious how viable this would be.


Extremely, if my last experience trying it (back in 2016 or so) counts.

Like all good jokes, "Haskell is the world's best imperative language" contains a kernel of truth. The standard library contains rather interesting control structures that can exist only because IO actions are lazily evaluated values -- but you can also choose to ignore those for a more Java-like experience.


> The canonical examples are low latency (with SLAs on the order of a disk or network I/O, so 10-100 microseconds) and applications that need to use bounded memory.

Well, in those cases extra developer time is cheaper than a more powerful computer!

Will you get ROI building a stereotypical web application with manual memory management? Doubtful.

Edit:

> GC shifts human work from development time to deployment / ops time.

I've never had to tune a garbage collector in deployment. I've heard of other people having to do it; but I've never heard of anyone saying, "yes, I'm so happy I built my website with C++."


All the world is not a web application.


True, but it’s not really about the web; it’s about “problem domains where GC pauses are significant, matter, and you can’t throw more compute and memory at the problem.” That’s not nothing, but on the dart board of all SWE jobs, good money is on hitting one where it isn’t the case, even if you’re aiming for one where it is.


However, web development is by far the most common domain that people develop for; and most web sites do not run at the scale where they would benefit from the kind of performance tuning that involves manual memory management.

(Good luck calling malloc from Javascript, or debugging use-after-free issues when you're iterating quickly.)


> Swift isn't GC, but is reference counted

GC is often an umbrella term that includes reference counting.

Almost all serious tracing GCs include some aspect of reference counting, and almost all serious reference counting GCs include some aspect of tracing.


> Almost all serious tracing GCs include some aspect of reference counting

I don't know about the Java world but in Common Lisp GC there's no reference counting whatsoever.


Common Lisp is a specification, not an implementation. It doesn't make sense to say that the GC is implemented in any particular way, because that's not part of the specification - it's up to the implementation.

I would bet my house that some implementation of Common Lisp somewhere has a GC that uses card marking, which is a form of reference counting.


I should have been more precise: The four CL implementations I'm familiar with (ACL, LW, CCL, SBCL) don't do reference counting and they probably comprise the vast majority of the market.


Interlisp used reference counting. Don't know if that is still the case -> interlisp.org . It includes a Common Lisp implementation.


No, it's not. Garbage collection and reference counting are "automatic memory management."

With reference counting there are no pauses, and memory is reclaimed immediately. It allows for strongly coupling deallocation with resource cleanup. (The consequence is higher overhead maintaining counts, and leaked reference cycles.)

In a garbage collected language you need to explicitly release your resources. (i.e., the Dispose pattern that's common in C#)


These days, "garbage collection" is the umbrella term that includes ref counting and other forms. "Tracing garbage collection" is the term for GC that doesn't include ref counting.

> With reference counting there are no pauses

You can still have pauses of arbitrary duration with ref counting. When the last reference to an object that is the sole owner of a large graph of other objects gets decremented, the entire object graph must be immediately deallocated. It's possible for a single decrement to trigger an unbounded amount of deallocation.

> and memory is reclaimed immediately.

Most production-quality ref counting implementations defer tracking references on the stack to minimize ref count churn. That significantly improves performance but it means you can no longer claim that the system always reclaims memory "immediately".

> In a garbage collected language you need to explicitly release your resources. (i.e., the Dispose pattern that's common in C#)

Managing non-memory resources is orthogonal to tracing versus ref counting. If you choose to tie a resource's lifetime to a piece of memory's lifetime then, yes, the memory manager's deallocation policy becomes user-visible.

I think you would get a lot out of reading "A Unified Theory of Garbage Collection".


> When the last reference to an object that is the sole owner of a large graph of other objects gets decremented, the entire object graph must be immediately deallocated. It's possible for a single decrement to trigger an unbounded amount of deallocation.

Two things:

1. Unbounded may be technically true, but it's really not a hard problem to ensure that time-critical runtime deallocations either rarely happen or are simply never huge chains of objects that release each other. Source: Used C++ & boost::shared_ptr/boost::weak_ptr extensively in game development and never had unexpected jank due to deallocations.

2. There's no actual reason that the full deallocation needs to happen immediately in all reference counting implementations. If you have hard time or latency constraints you could queue up the actual releases and ensure that you only free a certain number of objects per time period.

Yes, #2 changes some assumptions about when objects will be destroyed; in C++ one might rely on guarantees that objects are immediately released along with their associated resources. But if you go into a project knowing that it needs to support deferred reference counting, manually releasing critical resources is hardly the worst thing you'd need to do.


> it's really not a hard problem to ensure that time-critical runtime deallocations either rarely happen or are simply never huge chains of objects that release each other.

Sure, but now you're essentially in the same business as someone using a GC finds themselves where they are tuning the GC, using object pools, etc.

> There's no actual reason that the full deallocation needs to happen immediately in all reference counting implementations. If you have hard time or latency constraints you could queue up the actual releases and ensure that you only free a certain number of objects per time period.

Yup, that's a common optimization. But now you have given up the property that destruction happens immediately and deterministically.

My overall point is that the benefits of ref counting are not without cost. There are trade-offs and as you make different ones to improve performance or latency, you quickly discover that your ref counting system starts to take on shades of tracing GC with both the benefits and downsides that that entails.


Object pools are not hard, and I certainly used them a lot in game development. That's just a cost of doing business in low-latency programming.

There's no "tuning of the GC" beyond that, though, unless you count linking in a faster multithreaded malloc library, which is another thing that I've tried.

What's problematic about Java in particular in relation to latency is that Java's libraries can be pointlessly wasteful with regard to allocating garbage. I found that a function that was supposed to call through to a system function that got the system time in milliseconds was somehow allocating an object and discarding it as part of the call. The function returned an integer.

I found this when my Android JNI (C++) app was getting janky because of GC pauses, and I traced the allocation to that function. I wrote a quick JNI function in C++ to call the system and return the integer directly, and the jank went away.

The best GC in the world won't make up for wasteful allocations, and some languages (or their ecosystems?) seem to encourage rampant allocations.


> Most production-quality ref counting implementations defer tracking references on the stack to minimize ref count churn. That significantly improves performance but it means you can no longer claim that the system always reclaims memory "immediately".

Swift doesn't do this, if I understood what you meant properly. I suppose you might want to defer freeing some things to idle time but it's usually not a problem in my experience. I'm honestly surprised it always comes up.

There's "autorelease pools" in ObjC but they aren't used to move releasing to an idle time, rather it's to handle cases where nobody knows what the precise lifetime of the allocation is. Swift has a better ABI that doesn't have that issue.


> Swift doesn't do this, if I understood what you meant properly.

I think Swift can take advantage of the fact that many values are purely stack allocated and don't touch the heap or ref counting system at all. It leans pretty heavily on structs and value types.

Other managed languages are more Java-esque in that basically every value is in principle heap allocated. In those, I think it's more valuable to defer ref counting stack references since there are many more of them and they tend to be frequently mutated.


Swift surely does it; I advise reading a bit of what the compiler actually generates.

That is why, when Apple decided to fix their ARC optimizations, they had a talk at WWDC 2021 about ARC gotchas for code that will behave differently under the new optimizations.

"ARC in Swift: Basics and beyond"

https://developer.apple.com/videos/play/wwdc2021/10216/


I'm familiar with it, I wouldn't say Swift previously intentionally lengthened lifetimes for the purposes of having fewer refcount operations.

It's more like the old version of the optimizations were just heuristics (so they didn't work that well) and the new versions are formally proven, meaning they're more aggressive and found some memory safety holes in C/ObjC APIs.


> There's "autorelease pools" in ObjC but they aren't used to move releasing to an idle time, rather it's to handle cases where nobody knows what the precise lifetime of the allocation is. Swift has a better ABI that doesn't have that issue.

Autorelease is a no-op when using automatic reference counting. (But it was a pretty cool concept back then.)


It isn't. There's an optimization that removes things from the pool sometimes, but it doesn't always work, and it's used to pass back out-params like NSError**.

What you can't do is message autorelease yourself - although of course you can escape that pretty easily.


> You can still have pauses of arbitrary duration with ref counting. When the last reference to an object that is the sole owner of a large graph of other objects gets decremented, the entire object graph must be immediately deallocated. It's possible for a single dereference to trigger an unbounded amount of deallocation.

When people say "with reference counting there are no pauses", they are not talking about the bookkeeping cost of releasing the allocations when their last reference is dropped; they are talking about unrelated code (which might for instance be running a tight numerical loop with no allocation or deallocation at all) being arbitrarily stopped.

Also, it's misleading to call the amount of deallocation "unbounded". The amount of deallocation is precisely the size of the graph of objects being released at that point; no more, no less.


> When people say "with reference counting there are no pauses", they are not talking about the bookkeeping cost of releasing the allocations when their last reference is dropped;

I'm not sure that you can claim that confidently. I have certainly seen many people laboring under the false assumption that ref counting systems always have bounded, small latency when that isn't the case. The latency may be much lower than various other tracing GC alternatives, but I think many people have a simplistic mental model that a ref count decrement only ever leads to a single object being freed.

There is an ah-ha moment when you point out that freeing an object decrements its outgoing references, which can in turn cause a cascade of frees.

> they are talking about unrelated code (which might for instance be running a tight numerical loop with no allocation or deallocation at all) being arbitrarily stopped.

Many GCs will never arbitrarily stop the program in a tight loop that isn't doing allocations, so there's nothing particularly unique about ref counting here.

> Also, it's misleading to call the amount of deallocation "unbounded". The amount of deallocation is precisely the size of the graph of objects being released at that point; no more, no less.

Sure, and a program can build object graphs of arbitrary size. That's what "unbounded" means: there is no pre-defined bound on it.


Yes people have incomplete mental models.

Something else people often don't realize is that malloc/free aren't instant. Why is freeing a large object graph slow? Well, mostly because free is slow. Why is free slow? Because the internals of a malloc can be rather sophisticated: there is probably locking, and there may be internal bookkeeping like coalescing free regions inside the heap, and so on.

If you look at modern GC benchmark workloads, they're often generating gigabytes of allocations per second. Good luck shoving that much work through a typical malloc! Big companies like Google have had to write their own mallocs for a good reason.


> Something else people often don't realize is that malloc/free aren't instant.

True, and the costs vary wildly and are deeply tied into the rest of the memory management strategy.

In a copying collector, free() is essentially free: It takes the same amount of time to free 1 object in the semispace as it does 1,000 objects in the same semispace.


It's not unbounded, but it can be surprisingly high and variable, especially if some of those objects have things like handles to kernel objects and freeing them requires IPC etc.

Still, it usually works out, especially when lowering peak memory increases performance on its own.


> With reference counting there are no pauses, and memory is reclaimed immediately

You can't guarantee both of these properties at once, in the general case. Consider a very large tree of objects which are referenced through a single root node, and then that root node becomes eligible for reclamation. For a sufficiently large object graph, either reclamation will require a noticeable pause or some portion of the work will have to be deferred until later.


You may be able to guarantee it in this particular case with region-based allocation (also called arena allocation); discarding a region, or resetting its allocation pointer back to the beginning, is a constant-time operation. If your very large tree is in a single region, and it's the only thing in that region, then you can reclaim the entire thing in constant time as soon as you no longer need it.

Region-based memory management is commonly used, for example, with per-frame heaps in games, or per-request heaps in servers; but in those cases memory is not reclaimed immediately. The MLKit implements Standard ML with regions instead of garbage collection, though recent versions also have GC.


That only works if you are able to prove none of your objects do anything in their destructors.

In practice, manually memory managed languages always conflate memory and other resources. A File object is not just a block of memory, it's also a file descriptor, etc. So you have to run the destructors at the 'right' times and can't just toss the memory. In special cases where you control every allocation, sure, but often that isn't practical.


It's true that this solution doesn't apply when you have finalizers, but I think of that as the unusual case, not the usual one. I don't understand what you mean by "special cases where you control every allocation"; how are things getting allocated in your arena without you controlling them? Who is "you" if not the author of the program, who controls everything in it?


Libraries. If your libraries can't do that because you wrote your own malloc to get this arena allocation, it's because your strategy isn't general and will break the moment you want to use libraries that allocate.


The malloc interface doesn't support finalizers anyway! If a library is using malloc, but needs finalizers, it's going to have to invoke its own finalizers manually, usually just before it manually invokes free().

Also, though, it's very unusual to install an arena allocator as a replacement for malloc and (as a no-op) free — malloc has no way to specify which of possibly several arenas you want to allocate in, arena allocators suck at realloc, and writing an arena allocator that invokes malloc to get its arena is significantly easier than writing one that uses a variety of platform-specific calls depending on compilation options. Instead, you write your own arena allocation functions (h_alloc or apr_palloc or whatever), and then you invoke them explicitly, in your own code. Non-arena-aware library code will just keep invoking malloc as usual.

The case where you might end up with library code invoking your arena allocator is where you supply your allocator as an argument to something — like C++ STL containers or, again, APR pools, for example with apr_pool_create_ex. But that still doesn't mean you have to depend on the allocator to invoke finalizers — after all, the standard library operator new and operator delete don't invoke finalizers; instead, the delete expression invokes the destructor and then invokes operator delete. That still works — your files still get closed — even if the "operator delete" is a no-op because you're allocating out of an arena. In that case you don't get the major benefit of arena allocation (fast constant-time reclamation of the whole arena), just the minor ones, no fragmentation and fast constant-time allocation.

A more useful strategy is often to attach your finalizers to the arena instead of allocations within it; in the case of apr_palloc you can do this by creating subpools with apr_pool_create and attaching cleanup functions to them with apr_pool_userdata_set.

Regardless, my claim was more restricted than the one you're arguing against. I claimed that arena allocation can guarantee both that there are no pauses and that memory is reclaimed immediately in this case, where you have an arbitrarily large tree of objects that dies at a single point in time, and in many similar cases. This is what wgd was claiming was impossible in their second claim: "For a sufficiently large object graph, either reclamation will require a noticeable pause or some portion of the work will have to be deferred until later." That claim is not true.

As far as I know, however, wgd's first and less general claim is true: there's no way to extend this region-based strategy even to the general case of immediate memory reclamation without losing the constant-time guarantee, much less the even more general case of immediate execution of arbitrary finalizers. (By contrast, reference counting can guarantee immediate memory reclamation and immediate execution of arbitrary finalizers, although at the cost of unbounded pauses, and not in the general case because it can't handle cycles.) Obviously you can't execute an unbounded number of finalizers in constant or even bounded time, or call close() on an unbounded number of file descriptors in constant or even bounded time.

But that doesn't mean you can't reclaim an unbounded number of memory allocations in constant time. You can, it's a useful thing to do, and people do it all the time. If you've fetched an URL from Apache today you've done it yourself, even if you didn't know it.


There is a difference between pausing a single thread at a known, predictable point and pausing all threads at a random point in time. Tracing GCs do the latter. It is way easier to avoid long pauses with refcounting than with a tracing GC.


Java’s ZGC has a constant (O(1)) sub-millisecond pause time. GC runs concurrently with the app.


Yeah, according to the paper you pay for it with a very high overhead, up to 2x. So generally, pick your poison ;)


Exactly.

But the thing is, I can switch GC implementation according to my needs.

If throughput is my number 1 concern I can use ParallelGC, if not I can use ZGC. Or G1 if I want something in between (not sure if the paper picks up the huge drop in G1's overhead in Java 18).


You can switch the GC implementation only globally. In languages like C++ or Rust you can have different parts of the program using different memory management strategies, from carefully hand-crafted region allocation through refcounting to even tracing GC (rarely needed, though).


In such uses, if it matters you don't rely on RC to reclaim the storage. You allocate from an arena, and drop the whole graph as a unit.

Using pointers to make up a graph is a choice. It is a thing taught in CS classes, so it may feel comfortably familiar; that does not make it good. I never do.


How is this different than working around a GC?

There are lots of things you can do in GCed language to avoid allocations and avoid running the GC when it matters.


Often it is your SRE who has to do that work.


> Consider a very large tree of objects which are referenced through a single root node

That's predictable and can be planned for: You can hold onto the reference until you get to a point where the pause is acceptable.

You can't do that in a garbage collected language, because you have no control over when the garbage collector will pause. (In theory your application can explicitly trigger a garbage collection, but in practice such behavior is discouraged.)


[citation needed]

Reference counting is widely considered one strategy of garbage collection (https://en.wikipedia.org/wiki/Garbage_collection_(computer_s...). For that matter, some GCs that are primarily tracing also use ref counts for some portion of the heap, as an optimization (just as some primarily refcounting GCs also run a trace to clean up cycles).

Also, reference counting can certainly result in pauses, when large graphs of interconnected objects are reclaimed. On the upside, you know when a pause might happen. You still don't know for certain without explicitly keeping track of what is about to be freed, which would add additional overhead, and indeed may not be practical or even feasible.


Reference counting absolutely has pauses. Try allocating millions of objects in a block and then exiting that block. Hell, even malloc pauses when fragmentation gets too high.


It'll show up in a profiler straight away. With a tracing GC it won't.


What do you think won't show up in a profiler with GC? With the right profiler I can see the entire heap as well as where every object was allocated and all those deallocations are basically free. With a modern collector like ZGC, max pause times are now sub-millisecond for collections. Also easy to measure the total GC cost in the profiler as well. The modern GC story is extremely good. The biggest issue with GC in my opinion is that it uses more memory than you strictly need.


It won't show what we're talking about in this thread: deallocation of a large number of objects when they lose their last root reference all at once.


It absolutely will show them, because they cost nothing: zero will show up in the profiler, which is how much deallocations cost with a GC.


Magic.

In rc/arc you can achieve the same by making release a no-op.


I don't know what you mean by "making release a no-op", but with reference counting it inherently takes O(N) time to deallocate an object containing N references, and O(M) time to deallocate M objects containing no references. With a generational copying collector, it takes O(1) time to deallocate M unreferenced objects each containing N references in the nursery — or rather it takes O(P+Q) time, where P is the number of objects that don't get deallocated and Q is the number of root references that need to be scanned, the ones in the stack plus the cards marked by your write barrier or whatever.

You can't achieve that with reference counting, with malloc/free, or with a non-copying tracing collector. You can achieve it with regions.

On the other hand, you may not need to. Allocating and initializing those objects took O(MN) time, so your program can't get an asymptotic speedup by deallocating them in O(1) time instead of O(MN) time. So it really depends on the constant factors, which nowadays depend strongly on things like cache miss rates and cache line sharing between cores.


Allocation in a region is just a pointer update, quite cheap, and deallocation is constant time. You will get a speedup if you can replace a large number of allocations with an arena. In GC'd languages you don't have the luxury of specifying custom allocators for selected parts of the program. But the main argument was that a tracing GC hides all of this from you, and you can't profile your program to explicitly see which call stacks contribute to large deallocations. Just because the GC hides the cost, defers it, and spreads it over time doesn't mean it's zero cost; on the contrary, tracing will have more overhead. In the vast majority of programs the power-consumption overhead is not relevant, though; if it's traded for programming ergonomics, it's a win.


> With reference counting there are no pauses

How are you reclaiming cycles with no pauses?


I have had no occasion in the past three decades to make cycles.

It smacks of poor discipline, or poor design.


Well good for you I guess - but that’s not a general solution for language implementation as it’s not sound for a language like Python. Which is why in practice most RCs include tracing. And the reverse as well - most tracers include a degree of RC. That’s the pragmatic solution in practice.


I think tracing GC was added to CPython after about 15 years, so I guess RC was a general enough solution for it for quite a while. Even today CPython refuses to collect cycles with finalizers. Smalltalk also only used RC from 1971 to the early 1980s. I think Perl still only does RC. Swift apparently also only does RC. People programming in those languages just design their programs to not create cycles, or include a weak reference in the cycle somewhere, like for Observer kinds of things.

I'm not a fan of RC, myself. It's space-hungry, it creates unnecessary write traffic, it's unreliable (when you can have cycles), and it's slow. But there are lots of "not sound" and "not general" but "pragmatic" language implementations out there using it, without tracing. In practice, even.


Pervasive RC is the most costly GC. But it is fine where not pervasive, and even some places where it is. RC is about the last thing you would worry about slowing down your Python program.

I RC'd a thing in a program, five years back. The program had a linked list in it. Increments and decrements happened on a scale of minutes. I did not agonize over it.


I mostly agree, except that I do think reference counting is a significant part of CPython's runtime slowness. Maybe 10%.


Confirmed, Perl still only does reference counting. No cycle collector.


In my work, I regularly end up with cyclic data structures, e.g. the SSA IRs in several compilers I have worked on have oodles of cycles. And interpreters run user programs, which create cycles.

I don't think this is a particularly fruitful line of discussion to strongly claim that cycles can just be avoided, or worse, that you're an idiot for having them. It's needlessly inflammatory.


I think your SSA cycles can be trivially reclaimed by arena means, no?

Point is just that people get used to whatever they were directed to do for their CS class assignments, and get to thinking it is natural, and even unavoidable.

That something somebody is used to not thinking about is not actually necessary can be a novel and fruitful idea. The cheapest GC is always no GC at all.


I've worked on 5 different SSA compilers: C1, C1X, Crankshaft, TurboFan, and Virgil. C1X and Virgil are written in safe, garbage-collected languages, while C1, Crankshaft, and TurboFan are all C++ and use arena allocators. Of these, in my totally subjective opinion, the two written in GC'd languages are a heck of a lot more productive to write in: like 10x as productive. They still run really fast, too. In fact, C1X was faster than C1, and Virgil is in the same ballpark or even better, though they do very significantly different analyses and optimizations in their pipelines. Crankshaft might have the edge on C1X in some respects, but I never did a direct comparison. TurboFan is by far the most incredibly intricate piece of software of my life, and I'm convinced it could not have been written with ownership, acyclicity, or manual memory management without one or two suicides and a bug tail of security vulnerabilities decades long. I would have either walked out or been forced to resign in disgrace for failing to deliver a workable product, had I made the call to prioritize memory management efficiency (by presumably using optimal malloc/free everywhere) over literally every other priority when building a piece of software: correctness, maintainability, team retention, extensibility, security, etc. It wouldn't have been the V8 way, after all; V8 was designed with regions quite extensively even back then, because most VMs have long realized it is a waste of time to try to do anything more complicated, and regions are basically el-cheapo GC for C++.

> The cheapest GC is always no GC at all.

I kind of chuckle here because the Virgil compiler proves you manifestly right and wrong simultaneously; it's a fully GC'd, semi-functional language with a self-hosted compiler written entirely in the language. Incidentally, in a careful style that doesn't allocate a whole lot or create a lot of garbage. It chooses efficient, C-like data structures for IRs and doesn't waste a lot of time on Java-like armored classes because I am sick of wasting my life on that. It bootstraps in less than 200MB of memory, compiling the entire compiler at full optimization in a single whole program compilation. Because I made the dumbest, simplest Cheney moving GC, without even generations, it doesn't even have a write barrier. But because it only allocates 200MB, when bootstrapping it doesn't even GC once. It's basically a giant, single region that is thrown away at process exit, and the mutator doesn't pay the cost of a single write barrier. (And the funny thing is, GC wouldn't even find that much garbage, because only about half of the heap does actually die, primarily because it's mostly IR that is always reachable--it's a whole program compiler, after all). It's basically your mythical perfect GC just because it has enough memory and I am damn well in my rights to spend it as I have.


You seem to be agreeing that not doing even so much as RC, just leaking, is better in this application than any sort of GC. I think that is common, for compilers. In that case, what does your other language do for you that C++ does not?


It doesn't leak. While I could give you a long rundown of why Virgil is better than C++, in all reality I designed it as a better alternative to Java--statically compiled, polymorphic specialization, delegates and first-class functions, tuples, variance on overriding, simpler syntax, properly immutable fields and constants, and algebraic types--and its advantage over C++ is that I literally never, ever, have to think about memory management. I don't have to decide regions, reason about errant pointers, or mess with custom allocators. I don't have leaks, the object representation is efficient and straightforward, allocation is fast, and if I am a little careful, then I just don't blow a lot of memory on garbage. If I'm not careful, then the program just uses more memory and GC's some.

I'm going to time this thread out now because I'd rather continue not thinking about memory management until I cycle back on GC work, and I don't really see any evidence that you're listening. Besides, I've said it better with code and the machine seems to happily hum along to my music.


I for one have found it very educational; thank you!


This was all enlightening, thank you.


While I also haven't ever had a cycle in my decades as an engineer, it doesn't really answer the parent's question about pauses w.r.t. ref counting. Ref counting, in general, isn't free.

(Also, in the sense of releasing a cycle, that would be pretty quick, since a typical RC implementation is just going to leak it. It's where a large resource graph is successfully ref-counted to 0 that a pause can occur, theoretically, although that's another case I've not seen.)


The usual method is to have strategically placed "weak" references.
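In Java terms (the same idea applies to Swift's `weak` or Rust's `Weak`), the back-edge of a would-be cycle goes through a weak reference, so no strong cycle ever exists for a reference counter to leak. A hypothetical parent/child sketch:

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;

// Parent strongly owns its children; each child points back through a
// WeakReference, so there is no strong cycle keeping both alive.
final class Parent {
    final List<Child> children = new ArrayList<>();
    void adopt(Child c) {
        children.add(c);
        c.parent = new WeakReference<>(this); // weak back-edge breaks the cycle
    }
}

final class Child {
    WeakReference<Parent> parent;
}
```

Once the last strong reference to the Parent goes away, the whole structure becomes reclaimable; the child's weak back-pointer never counts toward keeping it alive.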


… yes (and I know), but that has nothing to do with the point being made here, which is about the compute time required to manage the memory.


Cycles can happen in lots of ways.

For example, heap allocated closures for functions with non-local references create a cycle when a pointer to the closure is stored in a non-local environment.
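A contrived Java sketch of that shape: an object stores a closure that captures the object itself, so object -> closure -> object forms a cycle that plain reference counting could never reclaim (a tracing GC handles it fine):

```java
// A node that stores a callback; the callback captures the node,
// closing the cycle node -> callback -> node.
final class Node {
    Runnable callback;
    int value = 0;
}

class CycleDemo {
    static Node makeCycle() {
        Node n = new Node();
        n.callback = () -> n.value++; // lambda captures n: cycle closed
        return n;
    }
}
```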


Yes, it is and has always been easy to make cycles. My point is that it is almost always about equally easy not to make cycles, if you are paying any attention.


The fact that you have to pay attention makes it significantly harder.

You have to be aware that cycles can cause issues and you have to think about the impact of cycles on whatever you're doing.


So you say. But that does not match my experience. It has always been wholly trivial to know where the "weak references" go.


Use enough indirection and the difficulty of not making cycles increases very, very fast.


Reference counting and tracing are the two main approaches for garbage collection. See, for example, https://courses.cs.washington.edu/courses/cse590p/05au/p50-b...

While tracing is considered to have better throughput and reference counting a lower footprint, as the linked paper shows, they are really on a spectrum of techniques with different tradeoffs.


> No, it's not. Garbage collection and reference counting are "automatic memory management."

In strict computer science parlance, refcounting is a kind of GC, but to most people GC refers to "tracing GC" specifically. It's fairly pedantic.

> In a garbage collected language you need to explicitly release your resources.

I assume "resources" here must mean "non-memory resources"? Even still, GC can apply to any kind of resource, it's just most popularly applied to memory (you could have a GC for file handles, for example).

Also, so you know, the user you're replying to knows a thing or two about programming languages.


I advise reading a couple of CS books on the subject.


> With reference counting there are no pauses...

Same with GC. "Stop the world" may be a characteristic of a simple GC, but it's entirely possible to have GC that works by "coloring" objects.

> In a garbage collected language you need to explicitly release your resources. (IE, the Dispose pattern that's common in C#)

You only need to implement IDisposable if you are dealing with unmanaged resources (i.e., non-GC objects). But it's the same with C/C++: you have to call `free()` or `delete`; the difference is that managed resources are auto-released for you.


Rust can also use reference counting when necessary. Many Rust novices don't realize this, and run into needless trouble when trying to learn Rust. Rc<RefCell<T>> and Arc<Mutex<T>> are not "unidiomatic" other than in a very weak sense. Sometimes they are even unavoidable.


They're significantly more time-consuming, for the developer, to work with.

Think of using something like a lambda to start a thread: you have to explicitly clone your Rc<Whatever> into another variable, and then use the second variable in your lambda. The compiler won't implicitly clone your Rc<Whatever> for you.

IMO, it's one of the biggest drawbacks in Rust.


It's still pretty painful to deal in Rc<...> and friends in Rust. I was expecting it to be a few extra keystrokes around my types, but I still ended up having to deal with the borrow checker when I would put something into an Rc or take it out to pass into some other function. It was nowhere near as easy as a GC or refcounted language.


> I was expecting it to be a few extra keystrokes around my types, but I still ended up having to deal with the borrow checker when I would put something into an Rc or take it out to pass into some other function.

I'd be interested to hear where you experienced difficulty, specifically. `Rc<T>` is `Deref<T>`, so you should never have any problems calling borrowing APIs. Owning APIs require an explicit borrow-and-clone, but that's just two calls (`as_ref().clone()`).


One must also remember to differentiate the deployment targets. Not every piece of software is deployed in your own machine.

Let's take games as an example. You cannot say to users "Just buy a more powerful computer/phone". And even if they would, you may have literally millions of machines, because you have millions of users, in which case saving a dollar's worth of machine per user is a savings of millions, which is worth a lot of developer time. Naturally you as a dev won't really be paying this, so the incentive is not directly there; but the incentive comes from people just not buying a game that runs like crap.

In these cases GC is nice and dandy at the start, but then you run into the inevitable "Why does my game stutter?", after which object pools are introduced, and even then some annoying "leaks" may remain. If the same amount of time had been spent from the get-go actually thinking about memory management, the whole issue could have been avoided.
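A minimal Java sketch (illustrative names, not from any engine) of the kind of pool that gets bolted on once the stutter shows up: acquire() reuses a freed instance when one is available, and release() hands it back instead of letting it become garbage:

```java
import java.util.ArrayDeque;
import java.util.function.Supplier;

// Minimal object pool: reuses released instances to avoid generating
// garbage on the hot path. Caller is responsible for resetting state.
final class Pool<T> {
    private final ArrayDeque<T> free = new ArrayDeque<>();
    private final Supplier<T> factory;

    Pool(Supplier<T> factory) { this.factory = factory; }

    T acquire() {
        T t = free.poll();
        return (t != null) ? t : factory.get(); // allocate only on miss
    }

    void release(T t) { free.push(t); }
}
```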


Yes, but GC and object re-use aren't mutually exclusive.

Sometimes people argue as if they are, which is understandable because the overlap between semi-FP programming styles and GCd languages is fairly strong, and the FP world pushes immutable state as almost a matter of moral principle. If you tried to do functional programming with purely manual memory management you'd see high overheads too simply because there'd be so many small allocations and copies. In practice though, people tend to think harder about in-place mutation when they don't have a GC to clean up after them.


They aren't mutually exclusive, but if you're using a tracing garbage collector this millennium you're probably using a generational copying collector, and if you're using a generational copying collector you have a write barrier, and the write barrier means that allocating and initializing a new object is less expensive than overwriting all the fields of an existing object, probably by a factor of several. Also, if your reused object is only referenced from the nursery, and it references other objects in the nursery, those objects ought to be collected on a minor collection, but they won't be; instead they'll be promoted, possibly through several generations, until they reach the generation where your reused object is, which will finally allow them to die. Balanced against this you have less frequent minor collections — none at all during times when you're allocating no new objects.

So, with a tracing GC, introducing an object pool will usually make your program slower, not faster. (This is not true with reference counting.)


Yes, both.

Indeed, the fundamental question is whether the costs of automatic memory management (be it tracing GC or refcount GC) are "worth it".

But it's not always as simple as a monetary consideration, and even when it is, the monetary cost often involves way more than just dev time vs cost of computer (e.g., power, cooling, distributed cost over many end users, market applicability because of system requirements).


Reference counting is GC. There are no pauses, but incrementing and decrementing a counter on every touch isn’t free either.


Well, the pauses are somewhat predictable in that they may appear when the count of an object reaches zero. I had a very expensive closing brace in (admittedly not well-written) C++ code once.


Reference counting is a GC algorithm as per any worthwhile CS book on GC algorithms.


Well, it depends a bit: in my low-latency application, where we can't afford GC pauses, we tolerate them only from 5:00 to 5:30 am. We do everything preallocated, and if we must tolerate new heap allocation we simply never release (so we have a lot of object pools).

I felt it was hard and exotic at first, but now I feel it's quite logical: never free, have proper monitoring of your size per unit of information to size the heap right, and you're done.

But then it also depends on why you don't want GC. For us it's because clients cannot tolerate a pause (we had a socket manager creating and releasing garbage that triggered one 50ms GC pause during the last 3 seconds of trading on a particularly crazy trading day, and the client threatened to leave), but often pauses are tolerable if you just stay reasonable with your garbage.

The problem in non-critical applications is not the collection, it's the garbage generation getting completely out of hand. In a C# application my team had to micro-optimize because of covid-related explosive trading volume, we simply bought a gigantic amount of RAM for our 200 users, only allowed GC when CPU usage was low, and ran with 80GB of memory at the end of the day for an order grid that could have run with 5GB if collected. That's another way: I wish Java 8 had an easier way to simply deactivate the GC and let us run on hardware money.


One problem is that GC'd languages also tend to have programming practices that encourage the generation of garbage, while languages with manual stack/heap allocation make you think about it a bit more. The sheer number of String and other lightweight (but existent) instances in a running Java program is astounding, because the language simply doesn't encourage you to be parsimonious with creating them.

For most people's applications this is totally fine. But, yeah, you can end up with a scenario where GC pauses are just totally unacceptable, or where you need absolutely predictable behaviour and then you might resort to radical things like what you're describing here and it starts to feel like you've just chosen the wrong tool for the job.

I say this as a person who wrote an RTB ad server in Java, many moons ago... and never would again.


> The sheer number of String and other lightweight (but existent) instances in a running Java program is astounding, because the language simply doesn't encourage you to be parsimonious with creating them.

I agree with this. Java promotes a profligate programming style where allocation (and reclamation) are considered free. (They aren't). This is baked into the JDK deeply. E.g. with standard APIs it is not even possible to parse an integer out of a string without allocating memory. Those are language and library design choices that have a huge confounding effect on trying to measure GC.

For example, I wrote the Avrora AVR emulator and sensor network simulator, which is one of the DaCapo benchmarks. It's a CPU emulator for very small devices and uses almost no standard JDK stuff in its core work. As such it allocates very little and has a pretty small heap. It often shows up weird (sometimes obnoxiously small or uninteresting) in papers that study memory management because it really isn't very Java-like in its behavior.


> with standard APIs it is not even possible to parse an integer out of a string without allocating memory

Not true anymore :)

https://docs.oracle.com/en/java/javase/17/docs/api/java.base...
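Since Java 9 (the link above is the Java 17 docs), `Integer.parseInt` has a `CharSequence` overload taking begin/end indices, so you can parse a number out of the middle of a string without allocating a substring. A small sketch (the input format is made up for illustration):

```java
class ParseDemo {
    // Parses the digits at positions 6..8 of e.g. "price=42;..."
    // directly from the CharSequence, with no substring allocation.
    static int priceFrom(CharSequence line) {
        return Integer.parseInt(line, 6, 8, 10);
    }
}
```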


Wonders never cease :)


I'd simply take a no-op GC in the most recent JVM (I can't yet, in my conservative bank) and make people suffer through it in prod until everything is constrained correctly. But you're right: Strings are banned where I work, and what a pain it is to do everything via char buffers. It takes a dev maybe 10 minutes per "String" handled instead of 5 seconds during code writing (to massage the buffer all the way from the input signal to the output destination, be it a log file, a storage structure, a comparator, or a map key), so it's a giganormous increase in dev time relatively, but in absolute terms it's quite trivial over the last 20 years. Yeah, we wasted time per feature on String handling, but... maybe a few hours per feature per dev, and we generate billions out of the trading backend, so...

Maybe Java had the wrong assumption and really should just have made C-style strings optional first-class citizens or something.

One thing is sure, though: our C++ colleagues in the algo team are so slow to deliver, so prone to crazy bugs, so wild and wizardy, so unable to onboard new joiners, that they're getting replaced this year by... a brand new Java algo engine... So not sure what to think.


I know exactly what to think: salaries offered for your C++ colleagues are not competitive, so you are collecting competitors' dregs. It can be very hard to come back from that, maybe impossible.

In other shops, the notion of conducting trading in Java, GC'd or no, would get you laughed from the room. But Java coders certainly are cheaper.


Well, we do Java for the slow path and FPGA for the fast path, with C++ for the algo decisions; but even when we do Java, we JNI everything that would be slower than C++. Nobody's laughing, everyone is just profiling and fitting constraints; so far Java has worked, and we'll switch if we must...

The problem with FPGA is that it's pointless to try to correct client mistakes or validate exchange regulation with it, so it's only for upstreams who have the whole chain clean; it's a slow transition.


See now this all sounds very interesting and exciting. I love both FPGA and C++.

Too bad it's finance :-)


I had a colleague who called it "automated stealing".

I think there is a time threshold above which it defensibly is not, and another below which you would have to agree, but I am not sure of the values.


There's no stealing if all parties consent. If you don't consent, convince others and vote better. Why blame finance when house representatives don't even understand they shouldn't insider-trade, something even I, a "stealing automation engineer", would never even think of?


For the record, I don't think of you as a stealing automation engineer. You perform a valued service for an appreciative employer. (My employer, meantime, provides valued services for just such employers, maybe even yours.)

But your employers take home very large annual bonuses while contributing what, exactly, to society? If they and their sub-ms competitors did not do it, would anyone miss it? Suppose, e.g., trades did not register until 1000ms after entry?

Rhetorical questions.


The City in London is full of people with humour and deep pockets.


I'm in Asia; we need the ability to validate 13 different exchange regulations, and Java helps more than C++ as far as I can observe in teamwork, modularity, forcing some sort of okay common practice, etc. The guys in the US have it easy with regard to logic, so they can get enough throughput in Java and only switch fast clients to FPGA, without needing C++ in between.

The guys in London have nowhere near the volumes of APAC or the US. They can trade with excel if they want.


Might be, but it doesn't change the fact that plenty of businesses in finance have moved to GC-based languages, reducing their use of C++ to niche use cases.

Just as using C or C++ hasn't taken away the fact that some niche use cases still require coding in assembly.

It isn't only Java, C#, F#, Haskell and OCaml are also tooling they reach for.


Java was initially created with a totally different application scope in mind compared to what it has now. I know it may look like a mess, but maybe they should've added a new, optimized string type in addition to the already available one.

As for C++: I do just that. All my servers are C++, and while I won't claim "bug free", I don't really remember when I last had a memory-related bug (alloc / boundary / use-after-free / etc.). Modern C++ along with sanitizers helps greatly in this department. All the fixed bugs were logical.

I did my share of Java development a while ago and got sick of trying to make it performant enough. To achieve the goal I spent way more time caring about memory / resource issues in Java than I would have in "dangerous" languages. Granted, it was high-performance servers; doing a different type of task in Java would've cost way less aggravation.


I'm curious, is there a reason why GC'd languages don't use some form of escape analysis so that only allocations that can possibly live beyond the current scope need to be heap allocated?


Lots of GCed languages do use escape analysis. That's what escape analysis was invented for!
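In HotSpot, for example, an object the JIT can prove never escapes a (possibly inlined) method may be scalar-replaced into registers, so no heap allocation happens at all. A sketch of the kind of code this applies to (whether the optimization actually fires depends on the JIT, inlining, and heuristics):

```java
class EscapeDemo {
    record Point(double x, double y) {}

    // Both Points are local and never escape distance(), so HotSpot's
    // escape analysis can scalar-replace them after inlining.
    static double distance(double x1, double y1, double x2, double y2) {
        Point a = new Point(x1, y1);
        Point b = new Point(x2, y2);
        double dx = a.x() - b.x(), dy = a.y() - b.y();
        return Math.sqrt(dx * dx + dy * dy);
    }
}
```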


C# has the option of background/concurrent GC. Did that not work for your application?


It was less CPU-intensive to buy RAM, so we did that first. Sadly, we need all 20 cores for all the other stuff the desktop user runs, and we can only give them 3 computers with 20 cores each.

I tried lobbying for threadrippers but... not easy to onboard a new CPU in a giga corporation :(



