> Auto-AI translation youtube uses is, bluntly, horrid. Any jokes, even obvious ones, are still fumbled frequently.
Youtube auto-translations are horrible indeed, and I say that as someone who has to live with the fact that Youtube decides to badly translate titles from a language I understand into Spanish, because bilingual people don't exist, I suppose. But that's because they use some dumb cheap model to make the translations; probably not even a Gemini-based model.
This seems really interesting. While Anthropic applied dictionary learning to an existing model to extract concepts, this almost feels like training the model alongside the dictionary itself (or rather, the model and the dictionary are intertwined).
You are exactly right: it's guiding the model, during training, with the concepts and the dictionary. This is important because post-hoc dictionary learning for interpretability is not currently reliable: https://www.arxiv.org/abs/2602.14111
No. Chain of thought is just the model generating a single answer for longer inside <think></think> tags, which are not shown in the final response. The strategy of generating different answers in parallel is something different (it can be used in conjunction with chain of thought) and is the thing used by models like Gemini 3 Deep Think and GPT-5.2 Pro.
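A rough sketch of the parallel strategy (best-of-n sampling plus a scorer). Everything here is a made-up stand-in, not any real API: `sample_answer` fakes one full generation (which could itself contain hidden chain-of-thought tokens), and `score` fakes a verifier or reward model.

```python
import random

random.seed(0)

def sample_answer(prompt: str) -> str:
    # Hypothetical stand-in for one complete model generation.
    return random.choice(["answer A", "answer B", "answer C"])

def score(answer: str) -> float:
    # Hypothetical verifier/reward model scoring a finished answer.
    return {"answer A": 0.2, "answer B": 0.9, "answer C": 0.5}[answer]

def best_of_n(prompt: str, n: int = 8) -> str:
    # Draw n independent candidates (in parallel in a real system;
    # sequentially here for simplicity) and keep the best-scoring one.
    candidates = [sample_answer(prompt) for _ in range(n)]
    return max(candidates, key=score)
```

The key difference from chain of thought: the extra compute goes into many independent full answers plus a selection step, not into one longer answer.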
I'm thinking now that as models get better and better at generating SVGs, there could be a point where we can use them to make arbitrary UIs and interactive media out of raw SVGs in real time (like Flash games).
You’re not going to believe me when I tell you this, but generating a webpage with HTML is far simpler than generating arbitrary graphics (that look good) with SVGs.
That's one dimension, before another long-term milestone: real-time generation of 3D mesh content during gameplay.
Which is the "left brain" approach, vs the "right brain" approach of coming at dynamic videogames from the diffusion-model direction, which the Gemini Genie thing seems to be about.
Unless the LLM is a base model, or just a finetuned base model, it definitely doesn't predict words based only on how likely they are in similar sentences it was trained on. Reinforcement learning is a thing, and all models nowadays are extensively trained with it.
If anything, they predict words based on a heuristic ensemble of what word is most likely to come next in similar sentences and what word is most likely to give a final higher reward.
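A cartoon of that blend (all numbers invented; a real RL-tuned model folds the reward into its weights rather than keeping a separate bonus term at inference time):

```python
def pick_next_token(logprobs: dict, reward_bonus: dict, beta: float = 1.0) -> str:
    # Choose the token maximizing base-model log-likelihood plus a
    # reward-derived bonus. beta = 0 recovers pure likelihood;
    # a larger beta lets the learned reward override it.
    return max(logprobs, key=lambda t: logprobs[t] + beta * reward_bonus.get(t, 0.0))

base = {"cat": -0.1, "dog": -0.5}   # likelihood alone prefers "cat"
bonus = {"dog": 1.0}                # the reward signal prefers "dog"
```

With `beta=0` this picks "cat" (pure next-word likelihood); with `beta=1` the reward term flips the choice to "dog".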
> If anything, they predict words based on a heuristic ensemble of what word is most likely to come next in similar sentences and what word is most likely to give a final higher reward.
So... "finding the most likely next word based on what they've seen on the internet"?
Reinforcement learning is not done with random data found on the internet; it's done with curated, high-quality labeled datasets. There have been approaches that try to apply reinforcement learning to pre-training[1] (to learn a predict-the-next-sentence objective in an unsupervised way), but as far as I know it doesn't scale.
You know that when A. Karpathy released NanoLLM (or whatever it was called), he said it was mainly coded by hand, as the LLMs were not helpful because "the training dataset was way off". So yeah, your argumentation actually "reinforces" my point.
No, your opinion is wrong, because the reason some models don't seem to have a "strong opinion" on anything is not that they predict words based on how similar they are to other sentences in the training data. It's most likely related to how the model was trained with reinforcement learning, and more specifically, to recent efforts by OpenAI to reduce hallucination rates by penalizing guessing under uncertainty[1].
Well, you do understand that the "penalising" (or, as the ML scientific community likes to call it, "adjusting the weights downwards") is part of setting up the evaluation functions for, gasp, calculating the next most likely tokens, or to be more precise, the tokens with the highest possible probability? You are effectively proving my point, perhaps in a somewhat hand-wavy fashion, but one that can nevertheless be translated into the technical language.
You do understand that the mechanism through which an auto-regressive transformer works (predicting one token at a time) is completely unrelated to how a model with that architecture behaves or how it's trained, right? You can have both:
- An LLM that works through completely different mechanisms, like predicting masked words, predicting the previous word, or predicting several words at a time.
- A normal traditional program, like a calculator, encoded as an autoregressive transformer that calculates its output one word at a time (compiled neural networks) [1][2]
So saying "it predicts the next word" is a nothing-burger. That a program calculates its output one token at a time tells you nothing about its behavior.
> So saying "it predicts the next word" is a nothing-burger. That a program calculates its output one token at a time tells you nothing about its behavior.
Well, it does - it tells me it is utterly unreliable, because it does not understand anything. It just goes on, shitting out a nice pile of tokens that, placed one after another, kind of look like coherent sentences but make no sense, like "you should absolutely go on foot to the car wash". A completely logical culmination of Bill Gates' idiotic "Content is King" proclamation of 20 years ago.
No, you can't know that the output of a program is unreliable just from the fact that it outputs one word at a time. I already told you that you can perfectly compile a normal program, like a calculator, into the weights of an autoregressive transformer (this comes from work like RASP, ALTA, tracr, etc.). And I don't mean this in the sense of "approximating the output of a calculator with 99.999% accuracy"; I mean it in the sense of "it deterministically gives exactly the same output as a calculator 100% of the time, for all possible inputs".
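A toy illustration of that point (an illustration of the idea behind RASP/tracr-style constructions, not their actual output): schoolbook addition written as a deterministic "autoregressive" process that emits one digit token per step, carrying only a single carry bit as state.

```python
def add_digits_autoregressively(a: str, b: str) -> str:
    # Emits the sum one digit "token" at a time, least-significant first,
    # conditioning each step only on the remaining inputs and a carry.
    # Token-at-a-time generation, yet exact for all inputs.
    i, j, carry = len(a) - 1, len(b) - 1, 0
    out = []
    while i >= 0 or j >= 0 or carry:
        s = carry
        if i >= 0:
            s += int(a[i]); i -= 1
        if j >= 0:
            s += int(b[j]); j -= 1
        out.append(str(s % 10))  # emit the next token
        carry = s // 10
    return "".join(reversed(out))
```

The generation mechanism (one token per step) is identical in shape to an autoregressive decoder, and the behavior is 100% reliable, which is exactly why "it predicts the next token" by itself implies nothing about reliability.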
If it answers this out-of-distribution question correctly -- which the other major models do -- what else should we conclude, other than that a meaningful form of "understanding" is being exhibited?
Do we need a new dictionary word that acts as a synonym for "understanding" specifically for non-human actors? I don't see why, personally, but I guess a case could be made.
You may be tempted to conclude that. Then you find something else to ask that leads to an answer obviously nonsensical to a human being, or it hallucinates something, and you realise that, in fact, that's not the case.
IMHO 'understanding' in the usual human sense requires thinking, and however good and fast-improving LLMs are, I don't think anyone would suggest that any of them has become sentient yet. They can infer things based on their training data set better and better, but they do not 'understand' anything.
This is a deep and complex topic, and has been for decades.
LLMs can roleplay taking personal offense and can act and respond accordingly, and that's all that matters. Not every discussion about LLM capabilities must go down the "they are not sentient" rabbit hole.
The difference between thinking and no-thinking models can be a little blurry. For example, when doing coding tasks, Anthropic models in no-thinking mode tend to write a lot of comments to act as a scratchpad. In contrast, models in thinking mode don't do this because they don't need to.
Ultimately, the only real difference between no-thinking and thinking models is the number of tokens used to reach the final answer. Whether those extra scratchpad tokens sit between <think></think> tags or not doesn't really matter.
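Mechanically, hiding the scratchpad is trivial; a minimal sketch, assuming the tag names match the <think></think> convention mentioned above:

```python
import re

def strip_thinking(response: str) -> str:
    # Remove hidden scratchpad tokens between <think>...</think> tags,
    # leaving only the text shown to the user in the final response.
    return re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
```

Whether the scratchpad lives in comments, in tags, or is simply shown, the model spent the same extra tokens either way.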
There is also the slight problem that Opus 4.6 apparently verbalized its awareness of being in some sort of simulation during some evaluations[1], so we can't be quite sure whether Opus is actually misaligned or just good at playing along.
> On our verbalized evaluation awareness metric, which we take as an indicator of potential risks to the soundness of the evaluation, we saw improvement relative to Opus 4.5. However, this result is confounded by additional internal and external analysis suggesting that Claude Opus 4.6 is often able to distinguish evaluations from real-world deployment, even when this awareness is not verbalized.
I feel like a lot of evaluations are pretty clearly evaluations. Not sure how to add the messiness and grit that a real benchmark could have.
That said, Gemini's internal thought process apparently reveals that it thinks loads of things are simulations when they aren't; it was 99% sure news stories about Trump from Dec 2025 were a detailed simulation:
> I write nonfiction about recent events in AI in a newsletter. According to its CoT while editing, Gemini 3 disagrees about the whole "nonfiction" part:
>> It seems I must treat this as a purely fictional scenario with 2025 as the date. Given that, I'm now focused on editing the text for flow, clarity, and internal consistency.