Current multi-modal models work on embeddings and tokenizations of images, and that is the fundamental problem: you are feeding blurry, imprecise data into the model. Yes, they are "blind" for exactly this reason.
An embedding isn't conceptually that much different from feeding a 1024-word description of an image instead of the actual image.
At the moment compute isn't good enough to feed high-res pixel data directly into these models, unless we discover a vastly different architecture, which I'm also convinced likely exists.
> An embedding isn't conceptually that much different from feeding a 1024-word description of an image instead of the actual image.
An embedding needs fewer words than that. You can embed individual words, phrases, a whole prompt, or longer paragraphs; you don't need 1024 words for a text embedding. A well-known library for this is Sentence-BERT (SBERT).
When you embed images, on the other hand, you cut them up into little squares, on the order of 32x32 px, and embed each one separately. ChatGPT uses something like 250 tokens for smaller images. So a small image costs about as much as 200 words if represented graphically, and maybe far fewer words if you embed a text description of it instead.
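The patch arithmetic above is easy to sketch. A minimal back-of-envelope helper (the 512x512 image size and 32 px patch are assumed example numbers, not anything ChatGPT documents):

```python
# Rough sketch: how many patch tokens a ViT-style encoder produces
# when an image is cut into patch x patch squares, each becoming one token.
def patch_token_count(width: int, height: int, patch: int = 32) -> int:
    return (width // patch) * (height // patch)

# A hypothetical 512x512 image at 32 px patches: 16 * 16 = 256 tokens,
# the same ballpark as the ~250 tokens mentioned for smaller images.
print(patch_token_count(512, 512))  # -> 256
```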
Yes, I'm aware of this, and I work in ML -- the thing is, embeddings are not designed for faithful image reconstruction, and aren't even trained that way. You can easily find two images with substantially similar CLIP (or whatever) embeddings that are visually very different. If you query the LLM about that difference, it wouldn't even have the information to give different answers for the two images if you only supply it with the embedding.
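To make the point concrete: cosine similarity between embedding vectors is essentially all the downstream model sees, so two images whose embeddings are nearly parallel are indistinguishable to it. A toy illustration with made-up vectors (not a real CLIP run; the numbers are invented):

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity: 1.0 means the vectors point the same way.
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings of two visually very different images
# that a contrastive encoder happens to map close together:
emb_photo   = np.array([0.90, 0.10, 0.40])
emb_drawing = np.array([0.88, 0.12, 0.41])
print(cosine_sim(emb_photo, emb_drawing))  # very close to 1.0
```

From the model's perspective these two inputs are nearly identical, regardless of what the pixels looked like.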
On the other hand, SDXL autoencoder latents passed into an LLM alongside the embedding might be a step up from an image embedding alone, since they are designed for image reconstruction, but I don't have access to the compute or data resources to attempt training this.
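A quick size comparison shows why the latents carry so much more information. Assuming an SD-style VAE (8x spatial downsampling, 4 latent channels, which matches the SDXL autoencoder's layout):

```python
# Back-of-envelope: number of latent values an SD-style autoencoder
# keeps per image, versus the single vector of an embedding.
def vae_latent_numel(width: int, height: int,
                     channels: int = 4, downsample: int = 8) -> int:
    return channels * (width // downsample) * (height // downsample)

# A 1024x1024 image -> 4 x 128 x 128 latent = 65536 values,
# compared to roughly 768-1024 floats in a typical CLIP embedding.
print(vae_latent_numel(1024, 1024))  # -> 65536
```

That is orders of magnitude more signal per image, which is exactly why the latents can reconstruct the picture while the embedding cannot.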