Hacker News | gavinhoward's comments

As someone who has thought a lot about VCS design [1] [2], the chunking approach is the wrong one and will still waste space.

[1]: https://gavinhoward.com/uploads/designs/yore.md

[2]: My WIP VCS has been named Yore for at least two years; I did not copy Lore's name.


This is a very long document that says nothing about chunking at first skim. If chunking is actually wrong, then just explain why, here. Wasting space is not actually a problem if it’s optimized for other purposes instead.

When it comes to large assets, wasting large chunks of space is a problem. If your chunks average 64 KiB (from the Lore document), but changes average only 1 KiB (which could be a high estimate), then you will still run out of space 64 times faster and need to read 64 times more data off the disk for certain operations.
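
To make the 64x figure concrete, here is a minimal sketch of the arithmetic, assuming each change lands inside a single chunk and that every touched chunk is stored whole. The function name and the `chunks_touched` parameter are made up for illustration; they are not from the Lore document.

```python
KIB = 1024

def write_amplification(change_bytes, chunk_bytes, chunks_touched=1):
    """Ratio of bytes stored to bytes actually changed.

    Assumes (hypothetically) that every touched chunk is rewritten in full.
    """
    stored = chunk_bytes * chunks_touched
    return stored / change_bytes

# A 1 KiB change inside one 64 KiB chunk.
ratio = write_amplification(1 * KIB, 64 * KIB)

# The same change straddling a chunk boundary touches two chunks.
straddling = write_amplification(1 * KIB, 64 * KIB, chunks_touched=2)

print(ratio, straddling)
```

A change that straddles a chunk boundary touches two chunks, which doubles the ratio under the same assumptions.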

It also makes diffing hard, as well as diff viewing.


Seems like if Lore wants to reduce space usage, they could apply something like Git's delta compression (as used in packfiles) to the chunks.

Suppose you make a 1 kB change in a 50 MB file. That causes a 64 kB chunk to be created and stored. Disk space is wasted.

But since the 50 MB file was already stored as a sequence of 64 kB chunks, there is an existing 64 kB chunk that is very similar to your new 64 kB chunk. You can store your new chunk as a delta to that, so only ~1 kB of disk space is used.

Admittedly, it's complicated and inelegant. But it allows both deduplication between files (one of the reasons Lore chose chunks, apparently) and efficient space usage for small changes.
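
A minimal sketch of the idea, assuming the edit is an in-place change inside one chunk. This is not Git's packfile delta algorithm; it is a much simpler stand-in that trims the common prefix and suffix shared with the old chunk and stores only the bytes in between. `make_delta` and `apply_delta` are hypothetical names.

```python
def make_delta(base: bytes, new: bytes):
    """Return (prefix_len, suffix_len, middle) describing `new` relative to `base`."""
    limit = min(len(base), len(new))
    prefix = 0
    while prefix < limit and base[prefix] == new[prefix]:
        prefix += 1
    suffix = 0
    # Do not let the suffix overlap the prefix already matched.
    while (suffix < limit - prefix
           and base[len(base) - 1 - suffix] == new[len(new) - 1 - suffix]):
        suffix += 1
    return prefix, suffix, new[prefix:len(new) - suffix]

def apply_delta(base: bytes, delta) -> bytes:
    """Rebuild the new chunk from the old chunk plus the stored delta."""
    prefix, suffix, middle = delta
    return base[:prefix] + middle + base[len(base) - suffix:]

# A deterministic 64 KiB "old" chunk; its byte values are always below 251.
old_chunk = bytes((i * 7) % 251 for i in range(64 * 1024))

# Overwrite 1 KiB in the middle with 0xFF, a value that never occurs in old_chunk.
new_chunk = old_chunk[:10000] + b"\xff" * 1024 + old_chunk[11024:]

delta = make_delta(old_chunk, new_chunk)
print(len(new_chunk), len(delta[2]))
```

Only the middle section plus two integers would need to be stored for the new chunk. An insertion or deletion that shifts content-defined chunk boundaries would need a real delta algorithm rather than this prefix/suffix trick.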


What do you do instead of chunking your snapshots? Storing diffs is usually the other approach.

I tried to give that section of the doc a fair read.

Looks like operational transforms to me.

The doc claims it's the first with this technique. A 30 second search reminded me of Darcs, and taught me about Pijul, and Weave. And yes, Google Docs storage works the same way - there are probably papers documenting how efficient Google Docs storage is, but it's not wrapped up in a full VCS that folks can use.

The example in the doc uses text, and unfortunately I think it's for a reason. I think with large, binary game assets, the most common operation is going to be strings of "replace A with B", and depending on your chunk size relative to the distribution of changes you make on your assets, I see it as pretty close to a wash, for efficiency. Especially considering that content-addressable blocks also solve de-duplication, which for a multi-game studio is probably going to be significant. Especially if they're managing multiple releases, patches, development branches, etc.


> Looks like operational transforms to me.

Sort of. I add provenance, which helps properly identify collisions, and require a well-defined order by stacking [1] changelists.

> The doc claims it's the first with this technique.

More like the first with the particular angle on the technique. I specifically mention patch theory as another side of the same coin.

> A 30 second search reminded me of Darcs, and taught me about Pijul, and Weave.

Darcs is Pijul's ancestor, and I mentioned Pijul. I also mentioned the weave and how reference sets scale better.

> The example in the doc uses text, and unfortunately I think it's for a reason.

Readability. Nothing more. The real stuff will be a compact binary format.

> I think with large, binary game assets, the most common operation is going to be strings of "replace A with B", and depending on your chunk size relative to the distribution of changes you make on your assets, I see it as pretty close to a wash, for efficiency.

Yore will dedup change data instead because as the Lore document itself identifies, dedupping content is hard using chunks; you either get dedupping or canonical addresses. Change data doesn't have one canonical address; the address is in the commit data instead.

Dedupping changes has another benefit. If most instances are "replace A with B," and B replaces A in multiple places, Yore will be able to store just one instance of B, no matter its size. This matters because the larger the chunk, the less likely it will match any other chunk.
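
A rough sketch of what deduplicating change data could look like, assuming replacement payloads are keyed by their SHA-256 hash and that commit data references them by that key. `Change`, `PayloadStore`, and all field names are hypothetical and are not Yore's actual format.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Change:
    """One 'replace a range with a payload' record, kept in commit data."""
    path: str
    offset: int
    old_len: int
    payload_id: str  # hash of the replacement bytes, not the bytes themselves

class PayloadStore:
    """Stores each distinct replacement payload exactly once."""

    def __init__(self):
        self._blobs = {}  # hex digest -> payload bytes

    def put(self, payload: bytes) -> str:
        key = hashlib.sha256(payload).hexdigest()
        self._blobs.setdefault(key, payload)
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self._blobs.values())

store = PayloadStore()
replacement = b"\x42" * 4096  # the same 4 KiB payload used in three places

changes = [
    Change("level1.asset", 0, 4096, store.put(replacement)),
    Change("level1.asset", 900000, 4096, store.put(replacement)),
    Change("level2.asset", 12345, 2048, store.put(replacement)),
]

print(len(changes), store.stored_bytes())
```

The three change records differ in where they apply, but they share one stored payload, so the payload's size is paid once regardless of how many places it is used.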

> Especially considering that content-addressable blocks also solve de-duplication, which for a multi-game studio is probably going to be significant. Especially if they're managing multiple releases, patches, development branches, etc.

True, but that should be table stakes. The fact that Git does not do this is a poor reflection on Git, not an innovation in Lore.

[1]: https://www.stacking.dev/


I'm not saying this goes over my head, but respectfully it goes over how much time I'm willing to spend understanding it... From what I can tell, it's a type of diffing approach. You're storing the diffs but tying them to hashes of the original data too.

Your argument was essentially that you kick chunking's ass.

I look forward to the benchmarks, but am highly skeptical.


The best answer I have is for you to read the "History Model" section of that design doc through the "Implementing CRDTs" subsection.

I wrote down some thoughts about a "next generation VCS" in 2019: https://beza1e1.tuxen.de/monorepo_vcs.html

Back then I concluded that it will probably never be built because OSS projects don't need it. Maybe this is changing now as AI allows for larger OSS projects.


Because Git was faster.

This mattered because speed is the killer feature [1], and speed is often seen by users as a proxy for reliability [2].

[1]: https://bdickason.com/posts/speed-is-the-killer-feature/

[2]: https://craigmod.com/essays/fast_software/


> At this point if your VCS isn't a layer above git plumbing, nobody gonna waste time using it.

Probably true, but it's a shame because there are better ways of storing and processing the data, ways that natively handle binary files, semantics, and large files without falling over.


Okay, but if you combine the curried and tuple styles, and add a dash of runtime function pointers, you can solve the expression problem. [1]

[1]: https://gavinhoward.com/2025/04/how-i-solved-the-expression-...


Bram Cohen is awesome, but this feels a little bare. I've put much more thought into version control ([1]), including the use of CRDTs (search for "# History Model" and read through the "Implementing CRDTs" section).

[1]: https://gavinhoward.com/uploads/designs/yore.md


That's worth making a separate post! (and I recommend rendering it to HTML)

But "bare" is part of the value of Cohen's post, I think. When you want to publicize a paradigm shift, it helps to make it in small, digestible chunks.


Is this the Bram Cohen who made bittorrent? There is surprisingly little information on this page.


Yes


Yes, just look at his GitHub page.


> For instance, you might think that big tech engineers are being deliberately demoralized as part of an anti-labor strategy to prevent them from unionizing, which is nuts. Tech companies are simply not set up to engage in these kind of conspiracies.

https://en.wikipedia.org/wiki/High-Tech_Employee_Antitrust_L...


I have had to put programming aside in 2025, probably for the rest of my life, so 2026 will be the year I reskill and reinvent myself.

But most importantly, I want to finally become as kind, patient, and charitable as I have always wanted to be.


May I ask why you have had to put programming aside?


Hey, where can I apply to a job like yours? I may not be smart enough, but I may be. And I am very interested in formal verification.


The title of the blog post downplays the absolute masterclass that this post is. It should be called "A Tale of Four Fuzzers: Best Practices for Advanced Fuzzing."

And if you don't have time, just go to the bullet point list at the end; that's all of the best practices, and they are fantastic.



Just a comment on this article that may be unrelated to the point you want to make: Gavin makes a fatal mistake in interpreting RMS's intent. He claims that RMS only wanted control over his hardware. That is not true. He also wanted the right to share his code with others. The person who had the code for his printer was not allowed to share that code. RMS wanted to ensure that the person who has the code is also allowed to share it. Source available does not do that.


SAMS does do that. Read my article carefully; it also requires modification and distribution rights for users. See principles 0 and 1.

