Current multi-modal models work on embeddings and tokenizations of images, and that is the fundamental problem: you are feeding blurry, imprecise data into the model. Yes, they are "blind" for exactly this reason.
An embedding isn't conceptually that much different from feeding a 1024-word description of an image instead of the actual image.
At the moment compute isn't good enough to feed high-res pixel data directly into these models, unless we discover a vastly different architecture, which I'm also convinced likely exists.
> An embedding isn't conceptually that much different from feeding a 1024-word description of an image instead of the actual image.
An embedding needs fewer words than that. You can embed individual words, phrases, a whole prompt, or longer paragraphs; you don't need 1024 words for a text embedding. A well-known library for this is Sentence-BERT (SBERT).
When you embed images, on the other hand, you cut them up into little squares, on the order of 32x32 px, and embed each one separately. ChatGPT uses something like 250 tokens for smaller images. So a small image costs about as much as 200 words if represented graphically, and maybe far fewer words if you embed a text description of it instead.
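The patch arithmetic above is easy to sketch. A minimal back-of-envelope helper (the 512x512 image size and 32 px patch are assumed example numbers, not anything ChatGPT documents):

```python
# Rough sketch: how many patch tokens a ViT-style encoder produces
# when an image is cut into patch x patch squares, each becoming one token.
def patch_token_count(width: int, height: int, patch: int = 32) -> int:
    return (width // patch) * (height // patch)

# A hypothetical 512x512 image at 32 px patches: 16 * 16 = 256 tokens,
# the same ballpark as the ~250 tokens mentioned for smaller images.
print(patch_token_count(512, 512))  # -> 256
```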
Yes, I'm aware of this, and I work in ML -- the thing is, embeddings are not designed for faithful image reconstruction, and aren't even trained that way. You can easily find two images with substantially similar CLIP (or whatever) embeddings that are visually very different. If you query the LLM about that difference, it wouldn't even have the information to give different answers for the two images if you only supply it with the embedding.
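To make the point concrete: cosine similarity between embedding vectors is essentially all the downstream model sees, so two images whose embeddings are nearly parallel are indistinguishable to it. A toy illustration with made-up vectors (not a real CLIP run; the numbers are invented):

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity: 1.0 means the vectors point the same way.
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings of two visually very different images
# that a contrastive encoder happens to map close together:
emb_photo   = np.array([0.90, 0.10, 0.40])
emb_drawing = np.array([0.88, 0.12, 0.41])
print(cosine_sim(emb_photo, emb_drawing))  # very close to 1.0
```

From the model's perspective these two inputs are nearly identical, regardless of what the pixels looked like.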
On the other hand, SDXL autoencoder latents passed into an LLM alongside the embedding might be a step up from an image embedding alone, since they are designed for image reconstruction, but I don't have access to the compute or data resources to attempt training this.
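A quick size comparison shows why the latents carry so much more information. Assuming an SD-style VAE (8x spatial downsampling, 4 latent channels, which matches the SDXL autoencoder's layout):

```python
# Back-of-envelope: number of latent values an SD-style autoencoder
# keeps per image, versus the single vector of an embedding.
def vae_latent_numel(width: int, height: int,
                     channels: int = 4, downsample: int = 8) -> int:
    return channels * (width // downsample) * (height // downsample)

# A 1024x1024 image -> 4 x 128 x 128 latent = 65536 values,
# compared to roughly 768-1024 floats in a typical CLIP embedding.
print(vae_latent_numel(1024, 1024))  # -> 65536
```

That is orders of magnitude more signal per image, which is exactly why the latents can reconstruct the picture while the embedding cannot.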