This is a very long document that says nothing about chunking at first skim. If chunking is actually wrong, then just explain why, here. Wasting space is not actually a problem if it’s optimized for other purposes instead.
When it comes to large assets, wasting large chunks of space is a problem. If your chunks average 64 KiB (from the Lore document), but changes average only 1 KiB (which could be a high estimate), then you will run out of space 64 times faster and need to read 64 times more data off the disk for certain operations.
It also makes diffing hard, as well as diff viewing.
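To make the 64x figure concrete, here is a rough back-of-the-envelope sketch. It assumes every change is chunk-aligned and that each touched chunk is rewritten in full; `write_amplification` is a name made up for illustration, not anything from Lore.

```python
KIB = 1024


def write_amplification(chunk_size: int, change_size: int) -> float:
    """Bytes stored per byte actually changed.

    Assumption: the change is aligned to chunk boundaries, so it touches
    ceil(change_size / chunk_size) chunks, and each touched chunk is
    stored again in full.
    """
    if chunk_size <= 0 or change_size <= 0:
        raise ValueError("sizes must be positive")
    chunks_touched = -(-change_size // chunk_size)  # ceiling division
    return (chunks_touched * chunk_size) / change_size


# 64 KiB chunks, 1 KiB average change: the numbers from the comment above.
amplification = write_amplification(64 * KIB, 1 * KIB)
```

An unaligned change that straddles a chunk boundary would touch two chunks, so real numbers could be worse than this estimate.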
Seems like if Lore wants to reduce space usage, they could apply something like Git's delta compression (as used in packfiles) to the chunks.
Suppose you make a 1 kB change in a 50 MB file. That causes a 64 kB chunk to be created and stored. Disk space is wasted.
But since the 50 MB file was already stored as a sequence of 64 kB chunks, there is an existing 64 kB chunk that is very similar to your new 64 kB chunk. You can store your new chunk as a delta to that, so only ~1 kB of disk space is used.
Admittedly, it's complicated and inelegant. But it allows both deduplication between files (one of the reasons Lore chose chunks, apparently) and efficient space usage for small changes.
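A minimal sketch of the idea, for anyone who wants to see it in code. This is not Git's packfile delta format; a much simpler common-prefix/common-suffix trim stands in for it, and `Delta`, `make_delta`, and `apply_delta` are hypothetical names.

```python
from typing import NamedTuple


class Delta(NamedTuple):
    prefix_len: int  # leading bytes shared with the base chunk
    suffix_len: int  # trailing bytes shared with the base chunk
    middle: bytes    # the only bytes that actually need storing


def make_delta(base: bytes, new: bytes) -> Delta:
    """Describe `new` as base-prefix + middle + base-suffix."""
    limit = min(len(base), len(new))
    prefix = 0
    while prefix < limit and base[prefix] == new[prefix]:
        prefix += 1
    suffix = 0
    # Stop before the suffix would overlap the prefix.
    while suffix < limit - prefix and base[-1 - suffix] == new[-1 - suffix]:
        suffix += 1
    return Delta(prefix, suffix, new[prefix:len(new) - suffix])


def apply_delta(base: bytes, delta: Delta) -> bytes:
    """Rebuild the new chunk from the base chunk and the delta."""
    tail = base[len(base) - delta.suffix_len:]
    return base[:delta.prefix_len] + delta.middle + tail


# A 64 KiB chunk with a 1 KiB edit in the middle.
old_chunk = bytes(64 * 1024)
new_chunk = old_chunk[:1000] + b"\x01" * 1024 + old_chunk[2024:]
delta = make_delta(old_chunk, new_chunk)
```

Here only the 1 KiB `middle` plus two integers would need to hit the disk. A real delta encoder handles multiple scattered edits and picks a good base chunk, which is where the complexity comes from.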
I tried to give that section of the doc a fair read.
Looks like operational transforms to me.
The doc claims it's the first with this technique. A 30 second search reminded me of Darcs, and taught me about Pijul, and Weave. And yes, Google Docs storage works the same way - there are probably papers documenting how efficient Google Docs storage is, but it's not wrapped up in a full VCS that folks can use.
The example in the doc uses text, and unfortunately I think it's for a reason. I think with large, binary game assets, the most common operation is going to be strings of "replace A with B", and depending on your chunk size relative to the distribution of changes you make on your assets, I see it as pretty close to a wash, for efficiency. Especially considering that content-addressable blocks also solve de-duplication, which for a multi-game studio is probably going to be significant. Especially if they're managing multiple releases, patches, development branches, etc.
Sort of. I add provenance, which helps properly identify collisions, and require a well-defined order by stacking [1] changelists.
> The doc claims it's the first with this technique.
More like the first with the particular angle on the technique. I specifically mention patch theory as another side of the same coin.
> A 30 second search reminded me of Darcs, and taught me about Pijul, and Weave.
Darcs is Pijul's ancestor, and I mentioned Pijul. I also mentioned the weave and how reference sets scale better.
> The example in the doc uses text, and unfortunately I think it's for a reason.
Readability. Nothing more. The real stuff will be a compact binary format.
> I think with large, binary game assets, the most common operation is going to be strings of "replace A with B", and depending on your chunk size relative to the distribution of changes you make on your assets, I see it as pretty close to a wash, for efficiency.
Yore will dedup change data instead because, as the Lore document itself identifies, deduping content is hard using chunks; you get either deduping or canonical addresses. Change data doesn't have one canonical address; the address is in the commit data instead.
Deduping changes has another benefit. If most instances are "replace A with B," and B replaces A in multiple places, Yore will be able to store just one instance of B, no matter its size. This matters because the larger the chunk, the less likely it will match any other chunk.
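Roughly what that could look like, as a toy sketch only. This is not Yore's actual format (which will be a compact binary one); `PayloadStore` and `Replace` are hypothetical names, and keying payloads by SHA-256 is an assumption made for the example.

```python
import hashlib
from dataclasses import dataclass


class PayloadStore:
    """Stores each distinct replacement payload once, keyed by its hash."""

    def __init__(self) -> None:
        self._blobs = {}  # hex digest -> payload bytes

    def put(self, payload: bytes) -> str:
        key = hashlib.sha256(payload).hexdigest()
        self._blobs.setdefault(key, payload)  # no-op if already stored
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self._blobs.values())


@dataclass(frozen=True)
class Replace:
    """One 'replace A with B' change; B is referenced, not embedded."""
    offset: int       # where A starts in the old data
    old_len: int      # length of A
    payload_key: str  # address of B in the PayloadStore


store = PayloadStore()
b_payload = b"B" * 4096

# The same B replaces three different regions; it is stored only once.
changes = [
    Replace(offset=off, old_len=4096, payload_key=store.put(b_payload))
    for off in (0, 10_000, 20_000)
]
```

The changes themselves live in the commit data, so the payload can be shared without needing a single canonical address for the content it ends up in.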
> Especially considering that content-addressable blocks also solve de-duplication, which for a multi-game studio is probably going to be significant. Especially if they're managing multiple releases, patches, development branches, etc.
True, but that should be table stakes. The fact that Git does not do this is a poor reflection on Git, not an innovation in Lore.
I'm not saying this goes over my head, but respectfully it goes over how much time I'm willing to spend understanding it... From what I can tell, it's a type of diffing approach. You're storing the diffs but tying them to hashes of the original data too.
Back then I concluded that it will probably never be built because OSS projects don't need it. Maybe this is changing now as AI allows for larger OSS projects.
> At this point if your VCS isn't a layer above git plumbing, nobody gonna waste time using it.
Probably true, but it's a shame because there are better ways of storing and processing the data, ways that natively handle binary files, semantics, and large files without falling over.
Bram Cohen is awesome, but this feels a little bare. I've put much more thought into version control ([1]), including the use of CRDTs (search for "# History Model" and read through the "Implementing CRDTs" section).
That's worth making a separate post! (and I recommend rendering it to HTML)
But "bare" is part of the value of Cohen's post, I think. When you want to publicize a paradigm shift, it helps to make it in small, digestible chunks.
> For instance, you might think that big tech engineers are being deliberately demoralized as part of an anti-labor strategy to prevent them from unionizing, which is nuts. Tech companies are simply not set up to engage in these kind of conspiracies.
The title of the blog post downplays the absolute masterclass that this post is. It should be called "A Tale of Four Fuzzers: Best Practices for Advanced Fuzzing."
And if you don't have time, just go to the bullet point list at the end; that's all of the best practices, and they are fantastic.
just a comment on this article, that may be unrelated to the point you want to make: gavin makes a fatal mistake in interpreting RMS intent. he claims that he only wanted control over his hardware. that is not true. he also wanted the right to share his code with others. the person who had the code for his printer was not allowed to share that code. RMS wanted to ensure that the person who has the code is also allowed to share it. source available does not do that.
[1]: https://gavinhoward.com/uploads/designs/yore.md
[2]: My WIP VCS has been named Yore for at least two years; I did not copy Lore's name.