matja · 2026-06-03T16:58:45 1780505925

One side-effect is that the separate .mmproj file (Multi-Modal Projection encoder) is no longer needed when using the model with llama.cpp etc.

lambda · 2026-06-03T19:26:27 1780514787

It's not? There's an mmproj in the GGUFs released by ggml-org: https://huggingface.co/ggml-org/gemma-4-12B-it-GGUF/tree/mai...

From the visual guide, there's still the 35M parameter embedder, then the linear projector, for vision, and the linear projector for audio, so it does have some parameters used for the multimodal input to project it into the LLM latent space: https://newsletter.maartengrootendorst.com/p/a-visual-guide-...

And the Unsloth quants, which are missing this, don't support multimodal input. (edit: actually, I may have just needed to update my llama.cpp, will check with an updated llama.cpp soon)

I'm downloading the ggml-org GGUFs now, I tried Unsloth but got some weird problems, double checking with the bf16 model to see if the issue was just the quant.

lambda · 2026-06-04T01:07:15 1780535235

Ah, Unsloth has uploaded mmproj now as well.

pferdone · 2026-06-03T17:29:48 1780507788

But do I have the option to run it 'text only'?

matja · 2026-05-31T15:08:58 1780240138

+1 this. Example: Using Mistral TTS voice cloning appears to be not possible via the "providers" pass-through object in the OpenRouter API because some parameters are always forwarded which conflict with the provider's parameters.

matja · 2026-05-31T14:38:27 1780238307

The latest Raspberry Pi 5 has one 32-bit channel (2x 16-bit subchannels) of LPDDR4X-4267 SDRAM giving 17.1GB/s of bandwidth, 52x less than this GPU. Never mind lacking the CUDA and Tensor cores, so the FP16 performance is 102x less (307 GFLOPS vs 31.4 TFLOPS). So for £200, there's absolutely no comparison for this specific use-case.
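A rough sketch of the arithmetic behind those ratios. The 900 GB/s memory bandwidth for the GPU is an assumption on my part (it is not stated above); the other numbers are the ones quoted in the comment.

```python
# LPDDR4X-4267: 4267 mega-transfers per second on one 32-bit (4-byte) channel.
transfers_per_s = 4267e6
bus_bytes = 32 // 8
pi5_bw_gbs = transfers_per_s * bus_bytes / 1e9  # about 17.1 GB/s

# ASSUMPTION (not from the comment): the GPU has roughly 900 GB/s of HBM2 bandwidth.
gpu_bw_gbs = 900.0
bw_ratio = gpu_bw_gbs / pi5_bw_gbs

# FP16 throughput figures quoted in the comment: 31.4 TFLOPS vs 307 GFLOPS.
fp16_ratio = 31.4e12 / 307e9

print(f"Pi 5 bandwidth:  {pi5_bw_gbs:.1f} GB/s")
print(f"Bandwidth ratio: {bw_ratio:.1f}x")
print(f"FP16 ratio:      {fp16_ratio:.1f}x")
```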

knollimar · 2026-05-31T15:58:53 1780243133

Yeah that's what I'm saying. How is it so cheap????

feisuzhu · 2026-05-31T16:15:25 1780244125

V100 GPUs are e-waste.

matja · 2026-05-31T14:26:53 1780237613

The AMD MI250X GPUs are also interesting - 128GB of HBM2E at 3TB/s, sometimes you see them second-hand for under $1k, the catch obviously is that it needs an OAM socket. Never seen an easy way to hook them up to a regular mainboard.

Gracana · 2026-05-31T15:15:03 1780240503

An additional complication is that MI250Xes are two GPUs in one package, so you need to connect the first and last x16 SERDES groups to the host, otherwise you'll only see one GPU (or it won't work at all, idk).

Also, the cheap HPE pulls on eBay need some proprietary HPE magic to work, and I have yet to see anyone figure that out.

sonzohan · 2026-05-31T19:29:50 1780255790

This person has built a converter for the OAM socket, but it is only confirmed working with NVIDIA cards at the moment (https://www.reddit.com/r/NVIDIA_SXM2PCIE/comments/1d076cn/oa...)

It fits an MI250X, and the system sees it, but the drivers don't work. They tested an HPE MI250X. There's a rumor on the thread that there are two kinds of MI250X: the ones from HPE and everyone else's. The HPE ones require special firmware, the normal ones do not. However, the majority of the MI250Xs on the secondhand market are HPE, so caveat emptor.

plagiarist · 2026-05-31T15:23:32 1780241012

Ahh luckily this OAM socket will prevent me from spending money.

Teknomadix · 2026-05-31T14:42:22 1780238542

These are interesting, and offer beefy throughput. No point in adapting to a PCI lane though, stuck behind the slot-bus bottleneck.

matja · 2026-05-29T17:18:28 1780075108

> canonicality matters — for signatures, content-addressing, or any kind of “two implementations must agree on the bytes” property

If you don't do this properly, you end up with things like:

- SAML XSW attack due to XML signature wrapping
- ASN.1 BER/DER signature forgery
- Bitcoin transaction malleability attacks
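A toy illustration of the "must agree on the bytes" property, using JSON rather than XML or ASN.1. The record contents and helper names are made up for the example, and sorting keys is only a sketch, not a complete canonicalization scheme (it ignores number formatting and Unicode normalization, for instance).

```python
import hashlib
import json

# Hypothetical records: the same data, built with a different key order.
record = {"to": "alice", "amount": 10}
same_record = {"amount": 10, "to": "alice"}

def naive_bytes(obj):
    # Whatever the serializer emits: insertion order and whitespace leak in.
    return json.dumps(obj).encode("utf-8")

def canonical_bytes(obj):
    # One fixed choice: sorted keys, no optional whitespace, UTF-8 output.
    return json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")

def digest(data):
    return hashlib.sha256(data).hexdigest()

naive_match = digest(naive_bytes(record)) == digest(naive_bytes(same_record))
canonical_match = digest(canonical_bytes(record)) == digest(canonical_bytes(same_record))

print("naive encodings hash equal:    ", naive_match)
print("canonical encodings hash equal:", canonical_match)
```

Anything signed or content-addressed over the naive bytes would treat these two equal records as different objects, which is the gap the attacks above exploit in their respective formats.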

matja · 2026-05-24T12:21:22 1779625282

The problem is AVX-512 was disabled in later Intel Alder Lake CPUs, and later generation Intel desktop CPUs, so very few Intel desktop CPUs have AVX-512 now. Ironic that AMD has better support/performance for an ISA extension that Intel invented.

matja · 2026-05-24T12:07:54 1779624474

I loved hearing this comment in my mind :)

matja · 2026-05-19T11:38:22 1779190702

I was also thinking along the same lines. Interesting, but I'm not sure in which aspect it is an achievement, considering the loop isn't a regex.

Meanwhile, 1K ZX Chess takes fewer bytes of memory than the first four paragraphs from the post.

matja · 2026-05-09T15:15:04 1778339704

Even with O_DIRECT and aligned blocks, I still don't understand how the storage engine can return a "successful commit" to the client without a sync at some point, because a sync (IIRC) is the only way to guarantee an ATA/NVMe FUA command is sent, and the device write cache/buffer is committed.

klodolph · 2026-05-09T15:53:28 1778342008

:-/ it’s a statistical guarantee in the first place. A successful commit in a durable storage engine just needs to achieve some finite level of durability, like “10^-7 probability of loss per year”. The durability is a property of the whole system, and it is possible to achieve durability without fsync, you just may have a hard time explaining what the durability is, how you calculated it, and what the evidence or justifications are for the numbers you give.

Even if you just look at hardware failure rates, you get unrecoverable I/O errors (data corruption) at about one in 10^15 bits, disk failures at a rate of about 1% per year, etc. People usually like to have better guarantees than those numbers give you with just a plain fsync anyway; so you are probably forced to do an analysis of the whole system if you want to provide good durability guarantees and be able to explain where the guarantees come from.

asdfasgasdgasdg · 2026-05-09T16:18:25 1778343505

10^-7 (loss/record) * 10^8 (record/year) yields 10 data losses per year. If you're even a medium sized business you need a much better than 10^-7 probability of losses.

Dylan16807 · 2026-05-09T17:14:51 1778346891

That's only true if your typical loss event loses one record. If you have a one in a million chance of an array failure taking out 10% of your production database, and otherwise have zero possibility of data loss, you also get 10^-7 losses per record.
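To make the two readings concrete, here is a small sketch. It assumes a database of 10^8 records and treats the one-in-a-million array failure as an annual probability; both are assumptions for illustration only.

```python
records = 10**8  # ASSUMPTION: database size, matching the 10^8 figure above

# Reading A: every record is lost independently with probability 1e-7.
p_record_a = 1e-7
expected_lost_a = p_record_a * records
p_any_loss_a = 1 - (1 - p_record_a) ** records

# Reading B: a one-in-a-million event that wipes out 10% of the records.
p_event = 1e-6
fraction_lost = 0.10
p_record_b = p_event * fraction_lost          # same per-record probability
expected_lost_b = p_record_b * records
p_any_loss_b = p_event

print(f"A: per-record {p_record_a:.0e}, expected lost {expected_lost_a:.1f}, "
      f"P(any loss) {p_any_loss_a:.5f}")
print(f"B: per-record {p_record_b:.0e}, expected lost {expected_lost_b:.1f}, "
      f"P(any loss) {p_any_loss_b:.0e}")
```

Both readings give the same per-record probability and the same expected number of lost records, but the chance of seeing any loss event at all differs by several orders of magnitude.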

And I wouldn't assume they meant that number to be per record in the first place.

asdfasgasdgasdg · 2026-05-09T17:18:38 1778347118

I don't think anyone in history has ever achieved a true 10^-7 annual probability of any data loss incident. So they must have been making some kind of per record or per operation claim.

klodolph · 2026-05-10T04:19:26 1778386766

I like to think that the true AFR for data is bounded by something like 10^-3, because maybe that’s close to the rate at which civilizations collapse. You have to use a kind of subtle definition to support 10^-7 or 10^-9 or 10^-11. Or maybe instead of “subtle definition”, you can call it a “whimsical, imaginary definition”. Depends on how cynical you are.

The way I would go is by saying that you multiply the number of objects by AFR, and that’s close to the actual losses on most years. You can then exclude WW3 and the late holocene extinction event from your consideration. Or simple bankruptcy, for that matter. If your employer is gone, you don’t care about its data any more.

klodolph · 2026-05-10T04:07:36 1778386056

The half-remembered storage system I pulled those numbers from had records ~100G in size, so a 10^-7 loss is 1 loss event per year, per exabyte of data. A loss event is just “at least one bit in the record cannot be read within a certain deadline”.
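The arithmetic, as a quick sketch (assuming decimal units, i.e. 100 GB = 10^11 bytes and 1 EB = 10^18 bytes, and reading 10^-7 as a per-record annual loss probability):

```python
record_bytes = 100e9       # ~100 GB per record (decimal units assumed)
exabyte_bytes = 1e18

records_per_exabyte = exabyte_bytes / record_bytes   # 10 million records
annual_loss_prob = 1e-7                              # per record, per year

expected_losses_per_eb_year = records_per_exabyte * annual_loss_prob

print(f"records per exabyte: {records_per_exabyte:.0e}")
print(f"expected loss events per exabyte-year: {expected_losses_per_eb_year:.1f}")
```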

Durability is a knob. If you have enough data, or turn the knob too far in the direction of durability, you will simply bankrupt yourself or maybe drown your service in latency. It makes sense that you would have storage services that provide different levels of durability.

jakewins · 2026-05-09T16:23:01 1778343781

I used to say this as well but like.. industry has, for a long time now, equated “durable” with “stored on disk”. Any DBA will assume that’s what it means, and use that fact when they work out the replication they need either in clustering or in RAID.

If you’re building a data storage system and are using the term “durable” to mean “it’s in RAM on three virtual machines”, for example, I don’t think it’s unfair to say that you are lying to your customers, because you are intentionally misusing a well-established term.

zbentley · 2026-05-10T01:37:34 1778377054

I forget the product, but more than a decade ago I remember someone broke out their durability into a table with columns for all the settings their data store offered between “ram on one node” and “fsync confirmed on a quorum of nodes’ disks” and rows for example failure cases ranging from “unexpected reboot of one machine” to “catastrophic loss of quorum-1 machines”. Cells were data loss risks from “prevented” to “possible” to “likely”.

That was very helpful when choosing durability levels.

klodolph · 2026-05-10T04:00:57 1778385657

I don’t have any respect for the viewpoint that “durable” is equatable with “stored on disk”, and I don’t want to spend time accommodating that viewpoint. It is just an oversimplification in a very bad way.

AFRs and discussions about different failure scenarios are the bare minimum. The bare minimum for scenarios is disk loss, total machine loss, and data center loss. This is just my take on things. I don’t care if something is on disk or not. I do care what happens when a sector on disk goes bad, when a faulty power supply destroys all the disks in a machine, or when a data center floods.

That forces you to think about things like whether you want to turn on synchronous replication.

jakewins · 2026-05-10T17:19:54 1778433594

The point of “durable” implying stored to durable media is precisely that it allows the operator of the system to make that kind of calculation. They know the disks they picked and the replication chosen, and as long as the database calls fsync, their calculations will work.

My beef is with database systems that use the argument you made further up thread to skip fsync to juice their performance numbers. Data is not “durable” if turning off the machines storing it means it’s lost, that’s a category difference, not a pure probability difference as you are claiming.

It is of course totally fine to not store data to durable media and say the risk of devops doing a coordinated reboot is as low as the risk of raid disk data loss, but then don’t use the word “durable”.

klodolph · 2026-05-12T16:21:32 1778602892

That definition of durable doesn’t seem useful to me, sorry. I want the failure rates and scenarios.

thomas_fa · 2026-05-09T16:26:49 1778344009

Yes, as we mentioned in the post, it is targeted at virtualized NVMe disks, and we don't have control over actually issuing the FUA command. We are also changing to open data files with O_DATA_SYNC to make them work with normal on-prem deployment environments.

nh2 · 2026-05-09T18:00:45 1778349645

Even then, I also share the confusion of the poster you're replying to.

I don't see how a virtualised NVMe disk is different from a physical one.

Especially if you don't have control over the underlying hardware (so you don't know if it has SSDs with power-loss protection, PLP), you should send the FUA.

> O_DATA_SYNC

You mean `O_DSYNC`?

Why would you need `O_DSYNC` on-premise, but not on cloud VMs? (Or are you saying you'd include it everywhere?) Similar to my above point, surely it is the task of the VM to pass through any FUA commands the VM guest issues to the actual storage?

Further: Is `O_DSYNC` actually substantially different from writing and then `fdatasync()`ing yourself?

My understanding is that no, it's the same. In particular, the same amount of data gets written. So if you believe that you avoid the "can trigger an order of magnitude more I/O" problem by avoiding `fdatasync()`, you would re-introduce it with `O_DSYNC`.
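For reference, the two variants being compared look roughly like this. It is a minimal sketch assuming Linux (where both `os.O_DSYNC` and `os.fdatasync` are available); the file names and helper names are made up, and whether the device write cache actually gets flushed still depends on the kernel, filesystem, and hardware underneath.

```python
import os
import tempfile

payload = b"x" * 4096

def write_with_fdatasync(path, data):
    # Variant 1: plain write() followed by an explicit fdatasync() (two syscalls).
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        written = os.write(fd, data)
        os.fdatasync(fd)
        return written
    finally:
        os.close(fd)

def write_with_o_dsync(path, data):
    # Variant 2: O_DSYNC requests that each write() return only once the data
    # has reached stable storage, so no separate sync call is issued.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o644)
    try:
        return os.write(fd, data)
    finally:
        os.close(fd)

with tempfile.TemporaryDirectory() as tmp:
    n_fdatasync = write_with_fdatasync(os.path.join(tmp, "a.bin"), payload)
    n_o_dsync = write_with_o_dsync(os.path.join(tmp, "b.bin"), payload)

print(n_fdatasync, n_o_dsync)
```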

However, I suspect that that whole consideration is pointless:

The only thing that makes your O_DIRECT+preallocated-only-overwrites writes safe is enterprise SSDs with Power Loss Protection (PLP), usually capacitors.

On those SSDs, NVMe Flush/FUA are no-ops [1]. So you might as well `fdatasync()`/`O_DSYNC`, always. This is simpler, and also better because you do not need to assume/hope that your underlying SSDs have PLP: Doing the safe thing is fast on PLP [2], and safe on non-PLP.

    [1] https://news.ycombinator.com/item?id=46532675
    [2] https://tanelpoder.com/posts/using-pg-test-fsync-for-testing-low-latency-writes/

So the only remaining benefit of `O_DSYNC` over `fdatasync()` is that you save a syscall. That's an OK optimisation given they are equivalent, but it would surprise me if it had any noticeable impact at the latencies you are reporting ("413 us"), because [2] reports the difference being 6 us.

Let me know if I got anything wrong.

The only remaining question is: Why do you then see any difference in your benchmark?

    Configuration            Throughput (obj/s)
    -------------------------------------------
    ext4 + O_DIRECT + fsync             116,041
    Our engine                          190,985

That is what I'd find very valuable to investigate.

The first suspicion I have is: Shouldn't you be measuring `+ fdatasync` instead?

So I'd be interested in:

    ext4 + O_DIRECT + fdatasync
    ext4 + O_DIRECT + O_DSYNC
    Our engine + O_DSYNC (which you're suggesting above)

Also I don't fully understand what the remaining difference between "ext4 + O_DIRECT + O_DSYNC" and "Our engine + O_DSYNC" would be.

thomas_fa · 2026-05-09T18:17:57 1778350677

Thanks for the feedback. I have replied in another thread about O_DSYNC, which a lot of folks have already suggested, so I will not repeat it here.

As for the benchmark results, they were mainly due to metadata management. We have implemented our own KV store (see the internals here [1]), which is more efficient than ext4 namespace management, even after doing very aggressive fs tuning for that [2] (plus 65536 sharding for each leveled dir).

[1] https://fractalbits.com/blog/metadata-engine-for-our-object-...

[2] https://github.com/fractalbits-labs/fractalbits/commit/12109...

jmalicki · 2026-05-10T11:05:52 1778411152

Fsync on PLP drives isn't strictly a NOP - you still take a latency hit from the round trip of the command to the NVMe device, where it is implemented as a NOP.

binaryturtle · 2026-05-09T16:00:36 1778342436

To truly guarantee things you probably also would need an uncached read afterwards (to verify the data comes back properly from the device). Now that would kill any sort of performance, of course.

asdfasgasdgasdg · 2026-05-09T16:21:33 1778343693

There is no such thing as a guarantee in life, there are only probabilities. The goal is to make it sufficiently unlikely that data is lost, and to balance that against the cost of doing so.

That is where the disparity lies here. Reading back the data after the device reports that it has been written offers little in the way of additional assurances that it's successfully written. But if you report successful writes without syncing, there is a near certainty that you'll lose data on every power loss.

matja · 2026-04-24T15:36:16 1777044976

The eigenvalue distribution looks somewhat similar to Benford's Law - isn't that expected for a human-curated corpus?

btilly · 2026-04-24T16:52:15 1777049535

I would expect that for any sampling of data that has a roughly similar distribution over many scales.

Which will be true of many human-curated corpuses. But it will also be true of natural data, such as the lengths of random rivers, or the brightness of random stars.

The law was first discovered because logarithm books tended to wear out at the front first. That turned out to be because most numbers had a small leading digit, and therefore the pages at the front were being looked up more often.
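A small simulation of that claim, as a sketch: it draws samples that are spread evenly across six orders of magnitude (a log-uniform distribution, chosen here purely for illustration) and compares the leading-digit frequencies with Benford's prediction, log10(1 + 1/d).

```python
import math
import random
from collections import Counter

# Benford's predicted probability of leading digit d.
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

random.seed(0)
# ASSUMPTION for illustration: values log-uniformly spread over 1 .. 10^6.
samples = [10 ** random.uniform(0, 6) for _ in range(100_000)]

def leading_digit(x):
    # Scientific notation puts the leading digit first, e.g. '3.141593e+04'.
    return int(f"{x:e}"[0])

counts = Counter(leading_digit(x) for x in samples)
observed = {d: counts[d] / len(samples) for d in range(1, 10)}

for d in range(1, 10):
    print(d, f"predicted {benford[d]:.3f}", f"observed {observed[d]:.3f}")
```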

IshKebab · 2026-04-25T16:43:09 1777135389

Benford's law has nothing to do with humans or decimal digits. It's just a statement that data often follows an exponential distribution.