
I think his point is that LLMs are pre-trained transformers, and pre-trained transformers are general sequence predictors. Those sequences started out as text or language only, but by no means is the architecture constrained to text alone. You can train a transformer that embeds and predicts sound and images as well as text.
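Roughly, the idea is that the transformer only ever sees a sequence of vectors, so any modality you can project into that shared embedding space becomes just more tokens. Here's a minimal sketch assuming PyTorch; the class name, dimensions, and vocab sizes are all made up for illustration, not any particular model's actual setup:

    import torch
    import torch.nn as nn

    class MultimodalSequenceModel(nn.Module):
        def __init__(self, d_model=256, text_vocab=32000, audio_dim=80, patch_dim=768):
            super().__init__()
            # Each modality gets its own projection into the shared embedding space.
            self.text_embed = nn.Embedding(text_vocab, d_model)
            self.audio_proj = nn.Linear(audio_dim, d_model)   # e.g. mel-spectrogram frames
            self.image_proj = nn.Linear(patch_dim, d_model)   # e.g. flattened image patches
            # The transformer itself is modality-agnostic: it just sees vectors.
            layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=6)
            self.next_token_head = nn.Linear(d_model, text_vocab)

        def forward(self, text_ids, audio_frames, image_patches):
            # Project every modality into the same space and treat the result
            # as one long sequence. A real autoregressive model would also apply
            # a causal attention mask; omitted here for brevity.
            seq = torch.cat([
                self.text_embed(text_ids),
                self.audio_proj(audio_frames),
                self.image_proj(image_patches),
            ], dim=1)
            hidden = self.encoder(seq)
            return self.next_token_head(hidden)  # predict the next token at each position

    # Example shapes: batch of 2, 16 text tokens, 50 audio frames, 196 image patches.
    model = MultimodalSequenceModel()
    out = model(torch.randint(0, 32000, (2, 16)),
                torch.randn(2, 50, 80),
                torch.randn(2, 196, 768))
    print(out.shape)  # torch.Size([2, 262, 32000])

Nothing in the architecture cares whether a position came from text, audio, or an image patch; that's the sense in which it's a general sequence predictor.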

