>With only instructional materials (a 500-page reference grammar, a dictionary, and ≈400 extra parallel sentences) all provided in context, Gemini 1.5 Pro and Gemini 1.5 Flash are capable of learning to translate from English to Kalamang— a Papuan language with fewer than 200 speakers and therefore almost no online presence—with quality similar to a person who learned from the same materials
I'm not entirely sure that I'm totally convinced, but yeah, it is better than me. I mean, I could do the same, but it would take me ages to go through 500 pages and to use them for the actual translation.
I'm not sure, because Gemini knows a lot of languages. The third language is easier to learn than the second one, so I suppose the 100th language is even easier? But still, Gemini does better than I believed.
>Re 2: There's something tremendous in the fact, staring us right in the face, that LLMs are unable to meaningfully contribute to academic/medical research. I'm not saying that they need to perform on the level of a one-in-a-million Maxwell, DaVinci, or whatever. But as Dwarkesh asked one year ago: "What do you make of the fact that these things have basically the entire corpus of human knowledge memorized and they haven't been able to make a single new connection that has led to a discovery?"
This isn't really true, so what? If you really cared and were actually paying attention, you'd see that frontier LLMs have begun to contribute to academic research. There are other impressive results for math as well.
>Turing's imitation game is about making it difficult for a human to tell whether they are communicating with a computer or not. If a computer can trick the human, then... what? The computer is "thinking" ?
If you read his paper, Turing was trying to make a specific point. The Turing test itself is just one example of how that broader point might manifest.
If a thinking machine can not be distinguished from a thinking human then it is thinking. That was his idea. In broader terms, any material distinction should be testable. If it is not, then it does not exist. What do you call 'fake gold' that looks, smells etc and reacts as 'real gold' in every testable way ? That's right - Real gold. And if you claimed otherwise, you would just look like a mad man, but swap gold for thinking, intelligence etc and it seems a lot of mad men start to appear.
You don't need to 'prove' anything, and it's not important or relevant that anyone try to do so. You can't prove to me that you think, so why on earth should the machine do so ? And why would you think it matters ? Does the fact you can't prove to me that you think change the fact that it would be wise to model you as someone that does ?
>Secondly, it has been debunked for almost half a century at this point by Searle’s Chinese room thought experiment.
Searle's thought experiment is stupid and debunked nothing. What neuron, cell, atom of your brain understands English? That's right. You can't answer that any more than you can answer the subject of Searle's proposition, ergo the brain is a Chinese room. If you conclude that you understand English, then the Chinese room understands Chinese.
> Searle’s response to the Systems Reply is simple: in principle, he could internalize the entire system, memorizing all the instructions and the database, and doing all the calculations in his head. He could then leave the room and wander outdoors, perhaps even conversing in Chinese. But he still would have no way to attach “any meaning to the formal symbols”. The man would now be the entire system, yet he still would not understand Chinese. For example, he would not know the meaning of the Chinese word for hamburger. He still cannot get semantics from syntax.
> The man would now be the entire system, yet he still would not understand Chinese.
Really, here the only issue is Searle's inability to grasp the concept that the process is what does the understanding, not the person (or machine, or neurons) that performs it.
People just overstate their understanding and knowledge, the usual human stuff. The same user has a comment in this thread that contains:
'If you actually know what models are doing under the hood to product output that...'
Anyone who tells you they know 'what models are doing under the hood' simply has no idea what they're talking about, and it's amazing how common this is.
Fair, I should define what I mean by under the hood. By “under the hood” I mean that models are still just being fed a stream of text (or other tokens in the case of video and audio models), being asked to predict the next token, and then doing that again. There is no technique that anyone has discovered that is different than that, at least not that is in production. If you think there is, and people are just keeping it secret, well, you clearly don’t know how these places work. The elaborations that make this more interesting than the original GPT/Attention stuff are 1) there is more than one model in the mix now, even though you may only be told you’re interacting with “GPT 5.4”, 2) there’s a significant amount of fine tuning with RLHF in specific domains that each lab feels is important to be good at because of benchmarks, strategy, or just conviction (DeepMind, we see you). There’s also a lot of work being put into speeding up inference, as well as making it cheaper to operate. I probably shouldn’t forget tool use for that matter, since that’s the only reason they can count the r’s in strawberry these days.
None of that changes the concept that a model is just fundamentally very good at predicting what the next element in the stream should be, modulo injected randomness in the form of a temperature. Why does that actually end up looking like intelligence? Well, because we see the model’s ability to be plausibly correct over a wide range of topics and we get excited.
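To make the "predict the next token, then do it again" loop concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `TOY_LOGITS` is a made-up lookup table standing in for a real network, and it only looks at the last token, whereas an actual LLM conditions on the whole context. It only shows the outer sampling loop and where temperature enters.

```python
import math
import random

# Hypothetical toy "model": maps the last token to unnormalized scores
# (logits) for the next token. Not how a real LLM computes its logits.
TOY_LOGITS = {
    "<s>": {"the": 2.0, "a": 1.0},
    "the": {"cat": 1.5, "dog": 1.2},
    "a": {"cat": 1.0, "dog": 1.0},
    "cat": {"sat": 2.0, "</s>": 0.5},
    "dog": {"sat": 1.0, "</s>": 1.0},
    "sat": {"</s>": 3.0},
}

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens them."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def generate(temperature=1.0, max_tokens=10, rng=None):
    """Sample one token at a time, feeding each choice back in as context."""
    rng = rng or random.Random()
    context = ["<s>"]
    for _ in range(max_tokens):
        probs = softmax_with_temperature(TOY_LOGITS[context[-1]], temperature)
        tokens, weights = zip(*probs.items())
        next_token = rng.choices(tokens, weights=weights, k=1)[0]
        if next_token == "</s>":  # end-of-sequence marker stops generation
            break
        context.append(next_token)
    return context[1:]

print(generate(temperature=1.0))   # varies from run to run
print(generate(temperature=0.01))  # near-greedy: almost always the top choice
```

The loop itself is this simple; the disagreement in this thread is about what happens inside the function that produces the logits.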
Btw, don’t take this reductionist approach as being synonymous with thinking these models aren’t incredibly useful and transformative for multiple industries. They’re a very big deal. But OpenAI shouldn’t give up because Opus 4.whatever is doing better on a bunch of benchmarks that are either saturated or in the training data, or have been RLHF’d to hell and back. This is not AGI.
Everybody says "but they just predict tokens" as if that's not just "I hope you won't think too much about this" sleight of hand.
Why does predicting the next token mean that they aren't AGI? Please clarify the exact logical steps there, because I make a similar argument that human brains are merely electrical signals propagating, and not real intelligence, but I never really seem to convince people.
Moreover, take an episode like Loops from Radiolab, where a person’s memory resets back to a specific set of inputs/state and they pretty much respond the same way over and over again - very much like predicting the next token. Almost all human interaction is reflexive, not thoughtful. Even now as you read this and process it, there’s not a lot of thought - but a whole lot of prediction and pattern matching going on.
Because there are some really fundamental things they cannot do with next token prediction. For instance, their memory is akin to someone who reads the phone book and memorizes the entire thing, but can't tell you what a phone number is for. Moreover, they can mimic semantic knowledge, because they have been trained on that knowledge, but take them out of their training distribution and they get into a "creative story-telling" mode very quickly. They can quote me all the rules of chess, but when it comes to actually making a chess move they break those rules with abandon simply because they didn't actually understand the rules. Chess is instructive in another way, too, in that you can get them to play a pretty solid opening game, maybe 10, 15 moves in, but then they start forgetting pieces, creating board positions that are impossible to reach, etc. They have memorized the forms of a board, know the names of the pieces, but they have no true understanding of what a chess game is. Coding is similar, they're fine when you give them Python or Bash shell scripts to write, they've been heavily trained on those, but ask them to deal with a system that has a non-standard stack and they will go haywire if you let their context get even medium sized. Something else they lack is any kind of learning efficiency as you or I would understand the concept. By this I mean the entire Internet is not sufficient to train today's models, the labs have to synthesize new data for models to train on to get sufficient coverage of a given area they want the model to be knowledgeable about. Continuous learning is a well-known issue as well, they simply don't do it. The labs have created memory, which is just more context engineering, but it's not the same as updating as you interact with them. I could go on.
At the end of the day next token prediction is a sleight of hand. It produces amazingly powerful effects, I agree. You can turn this one magic trick into the illusion of reasoning, but what it's doing is more of a "one thing after another" style story-telling that is fine for a lot of things, but doesn't get to the heart of what intelligence means. If you want to call them intelligent because they can do this stuff, fine, but it's an alien kind of intelligence that is incredibly limited. A dog or a cat actually demonstrates more ability to learn, to contextualize, and to make meaning.
You didn't actually give an example of what the issue with next token prediction is. You just mentioned current constraints (ie generalization and learning are difficult, needs mountains of data to train, can't play chess very well) that are not fundamental problems. You can trivially train a transformer to play chess above the level any human can play at, and they would still be doing "next token prediction". I wouldn't be surprised if every single thing you list as a challenge is solved in a few years, either through improvement at a basic level (ie better architectures) or harnessing.
We don't know how human brains produce intelligence. At a fundamental level, they might also be doing next token prediction or something similarly "dumb". Just because we know the basic mechanism of how LLMs work doesn't mean we can explain how they work and what they do, in a similar way that we might know everything we need to know about neurons and we still cannot fully grasp sentience.
I use the chess example because it’s especially instructive. It would NOT be trivial to train an LLM to play chess, next token prediction breaks down when you have so many positions to remember and you can’t adequately assign value to intermediate positions. Chess bots work by being trained on how to assign value to a position, something fundamentally different than what an LLM is doing.
A simpler example — without tool use, the standard BPE tokenization method made it impossible for state of the art LLMs to tell you how many ‘r’s are in strawberry. This is because they are thinking in tokens, not letters and not words. Can you think of anything in our intelligence where the way we encode experience makes it impossible for us to reason about it? The closest thing I can come to is how some cultures/languages have different ways of describing color and as a result cannot distinguish between colors that we think are quite distinct. And yet I can explain that, think about it, etc. We can reason abstractly and we don’t have to resort to a literal deus ex machina to do so.
Not being able to explain our brain to you doesn’t mean I can’t notice things that LLMs can’t do, and that we can, and draw some conclusions.
There are chess engines based on transformers, even DeepMind released one [1]. It achieved ~2900 Elo. It does have peculiarities for example in the endgame that are likely derived from its architecture, though I think it definitely qualifies as an example of the fact that simply because something is a next token predictor doesn't mean it cannot perform tasks that require intelligence and planning.
The r in strawberry is more of a fundamental limitation of our tokenization procedures, not the transformer architecture. We could easily train a LLM with byte-size tokens that would nail those problems. It can also be easily fixed with harnessing (ie for this class of problems, write a script rather than solve it yourself). I mean, we do this all the time ourselves, even mathematicians and physicists will run to a calculator for all kinds of problems they could in principle solve in their heads.
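A small sketch of the difference, in Python. The subword split and the ids below are made up for illustration (a real BPE tokenizer may split the word differently); the point is only that ids hide the letters, while byte-level tokens or a plain script keep them visible.

```python
# Hypothetical subword split and vocabulary ids, for illustration only.
ASSUMED_TOKENS = ["str", "aw", "berry"]
ASSUMED_VOCAB = {"str": 101, "aw": 102, "berry": 103}

def token_ids(tokens):
    """What the model receives: opaque integer ids, not letters."""
    return [ASSUMED_VOCAB[t] for t in tokens]

def byte_tokens(word):
    """Byte-level tokenization: one token per UTF-8 byte."""
    return list(word.encode("utf-8"))

def count_letter(word, letter):
    """The harness route: hand the question to ordinary code."""
    return word.count(letter)

word = "".join(ASSUMED_TOKENS)
print(token_ids(ASSUMED_TOKENS))           # three ids, no letters in sight
print(byte_tokens(word).count(ord("r")))   # counting works directly on bytes
print(count_letter(word, "r"))             # and trivially in a script
```

The trade-off with byte-sized tokens is much longer sequences for the same text, which is one reason subword tokenization is used in the first place.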
But chess models aren't trained the same way LLMs are trained. If I am not mistaken, they are trained directly from chess moves using pure reinforcement learning, and it's definitely not trivial as for instance AlphaZero took 64 TPUs to train.
Modern LLMs often start at "imitation learning" pre-training on web-scale data and continue with RLVR for specific verifiable tasks like coding. You can pre-train a chess engine transformer on human or engine chess games, "imitation learning" mode, and then add RL against other engines or as self-play - to anneal the deficiencies and improve performance.
This was used for a few different game engines in practice. Probably not worth it for chess unless you explicitly want humanlike moves, but games with wider state and things like incomplete information benefit from the early "imitation learning" regime getting them into the envelope fast.
I meant trivial in the sense it's a solved problem, I'm sure it still costs a non-negligible amount of money to train it. See for example the chess transformer built by DeepMind a couple of years ago which I referred to in a sibling comment [1].
I admit, my knowledge of reinforcement learning is a bit outdated so it seemed to me that it was unattainable for a non-specialized model to train efficiently on something like chess, which has a huge state space.
None of this is a logical certainty of "X, therefore Y", it's just opinions. You can trivially add memory to a model by continuing to train it, we just don't do it because it's expensive, not because it can't be done.
Also, the phone book example is off the mark, because if I take a human who's never seen a phone and ask them to memorise the phone book, they would (or not), while not knowing what a phone number was for. Did you expect that a human would just come up with knowledge about phones entirely on their own, from nothing?
Next token prediction is about predicting the future by minimizing the number of bits required to encode the past. It is fundamentally causal and has a discrete time domain. You can't predict token N+2 without having first predicted token N+1. The human brain has the same operational principles.
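The "bits" framing matches the usual cross-entropy reading of the training loss: under an ideal code, a token the model predicted with probability p costs -log2(p) bits. A minimal Python sketch, with made-up probabilities standing in for what a model assigned to the tokens that actually occurred:

```python
import math

def code_length_bits(probabilities):
    """Total bits needed to encode a sequence, given the probability the
    model assigned to each token that actually came next."""
    return sum(-math.log2(p) for p in probabilities)

# Made-up per-token probabilities for the same four-token sequence.
better_model = [0.5, 0.5, 0.5, 0.5]      # 1 bit per token
worse_model = [0.25, 0.25, 0.25, 0.25]   # 2 bits per token

print(code_length_bits(better_model))  # 4.0
print(code_length_bits(worse_model))   # 8.0
```

A model that predicts the observed text better needs fewer bits to encode it, which is the sense in which lowering the next-token loss is compression.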
Next-token prediction is just the training objective. I could describe your reply to me as “next-word prediction” too, since the words necessarily come out one after another. But that framing is trivial. It tells you what the system is being optimized to do, not how it actually does it.
Model training can be summed up as 'This is what you have to do (objective), figure it out. Well, here's a little skeleton that might help you out (architecture)'.
We spend millions of dollars and months training these frontier models precisely because the training process figures out numerous things we don't know or understand. Every day, Large Language Models, in service of their reply, in service of 'predicting the next token', perform sophisticated internal procedures far more complex than anything any human has come up with or possesses knowledge of. So for someone to say that they 'know how the models work under the hood', well it's all very silly.
> Btw, don’t take this reductionist approach as being synonymous with thinking these models aren’t incredibly useful and transformative for multiple industries. They’re a very big deal. But OpenAI shouldn’t give up because Opus 4.whatever is doing better on a bunch of benchmarks that are either saturated or in the training data, or have been RLHF’d to hell and back. This is not AGI.
It's sad that you have to add this postscript lest you be accused of being ignorant or anti-AI because you acknowledge that LLMs are not AGI.
If you typed your comment by reading all the others' in the chain, then you responded by typing your response in one go, then you 'just' did next-token prediction based on textual input.
I would still argue that does not prevent you from having intelligence, so that's why this argument is silly.
I'm not sure the assumption is that he's coming across HN for the first time rather than making an alt/stop lurking to post this. Or even that someone in tech their entire life must have already had a HN account before today. HN is big, but it's not so big that statement is even remotely reasonable.
You are arguing a point no-one is making. LLMs are not random sentence generators. Their probability distributions are anything but random. You could make an actual random sentence generator, but no-one would argue about its intelligence.
>Right. But HN, among other platforms, is full of users who will confidently run their mouths about something they don't fully understand while believing they do.
This is honestly funny and kind of ironic.
If this:
'The "reasoning" is two matrix transformations based on how often words appear next to each other.'
is what byang364 has to say, then he's part of the people you mention.
https://arxiv.org/abs/2403.05530