Yes, the actual LLM returns a probability distribution, which gets sampled to produce output tokens.
[Edit: but to be clear, for a pretrained model this probability means "what's my estimate of the conditional probability of this token occurring in the pretraining dataset?", not "how likely is this statement to be true?" And for a post-trained model, the probability really has no simple interpretation other than "this is the probability that I will output this token in this situation".]
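To make the "distribution, then sample" step concrete, here is a minimal sketch. The three-word vocabulary and the logit values are made up for illustration and are not taken from any real model; a real LLM does the same thing over tens of thousands of tokens.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities that sum to 1."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(vocab, logits, rng, temperature=1.0):
    """Draw one token according to the model's output distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical toy vocabulary and made-up scores (assumptions).
vocab = ["cat", "dog", "fish"]
logits = [2.0, 1.0, 0.1]

probs = softmax(logits)
rng = random.Random(0)  # seeded so the run is repeatable
token = sample_token(vocab, logits, rng)
print(probs, token)
```

Note that the probabilities here say only how likely each token is to be emitted, which is the point of the edit above.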
It’s often very difficult (intractable) to derive the probability distribution of an estimator, even when the probability distribution of the data is known.
Basically, you’d need a lot more computing power to come up with a distribution of the output of an LLM than to come up with a single answer.
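A rough back-of-the-envelope sketch of why: the number of possible outputs grows exponentially with length, so in practice you would approximate the distribution by sampling many complete answers, and each sample costs as much as one ordinary answer. The vocabulary size, answer length, and sample count below are illustrative assumptions, not figures for any particular model.

```python
def num_possible_outputs(vocab_size: int, length: int) -> int:
    """Count distinct token sequences of a given length."""
    return vocab_size ** length

def forward_passes(num_samples: int, tokens_per_answer: int) -> int:
    """One forward pass per generated token, per sampled answer."""
    return num_samples * tokens_per_answer

# Assumed numbers, purely for illustration.
VOCAB_SIZE = 50_000
ANSWER_LENGTH = 100
NUM_SAMPLES = 1_000

single_cost = forward_passes(1, ANSWER_LENGTH)
estimate_cost = forward_passes(NUM_SAMPLES, ANSWER_LENGTH)

print(f"possible outputs: about 10^{len(str(num_possible_outputs(VOCAB_SIZE, ANSWER_LENGTH))) - 1}")
print(f"one answer: {single_cost} forward passes")
print(f"sampled distribution: {estimate_cost} forward passes")
```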
In microgpt, there's no alignment. It's all pretraining (learning to predict the next token). But for production systems, models go through post-training, often with some sort of reinforcement learning which modifies the model so that it produces a different probability distribution over output tokens.
But the model "shape" and the computation graph itself don't change as a result of post-training. All that changes is the values of the weights in the matrices.
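A toy illustration of that point, using a single hand-written linear layer as a stand-in for the whole network. The weight values are invented; the only thing the sketch is meant to show is that the same `forward` function runs on both sets of weights and only the numbers differ.

```python
def forward(weights, x):
    """One linear layer: the computation is identical for any weight values."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def shape(matrix):
    return (len(matrix), len(matrix[0]))

# Made-up weights standing in for "before" and "after" post-training.
pretrained = [[0.5, -0.2], [0.1, 0.3]]
post_trained = [[0.6, -0.1], [0.0, 0.4]]

x = [1.0, 2.0]
out_pre = forward(pretrained, x)
out_post = forward(post_trained, x)

print(shape(pretrained) == shape(post_trained))  # same architecture
print(out_pre, out_post)                         # different outputs
```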
That's 4–6 months over the 18 months the trials lasted, i.e. about a 30% slowdown of progression. The open-label extensions suggest this relative slowdown continues at least to the 4-year mark (at which point it would have bought you over a year of time): https://www.alzforum.org/news/conference-coverage/signs-last...
Time will tell if the 30% slowdown continues beyond four years, and/or if earlier treatment with more effective amyloid clearance from newer drugs has greater effects. The science suggests it should.
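The arithmetic behind those figures, as a quick check. It assumes the simplest reading of "30% slowdown": a treated patient after T months is where an untreated patient would be after 0.7 × T months, so the time gained is 0.3 × T.

```python
def months_gained(elapsed_months: float, slowdown_fraction: float) -> float:
    """Time bought if progression runs slower by a constant fraction (assumed model)."""
    return elapsed_months * slowdown_fraction

# Trial figures from the comment: 4 to 6 months gained over 18 months.
trial_low = 4 / 18
trial_high = 6 / 18

# Extrapolating the rounded 30% figure to the 4-year mark.
four_year_gain = months_gained(48, 0.30)

print(f"trial slowdown: {trial_low:.0%} to {trial_high:.0%}")
print(f"gain at 4 years: {four_year_gain:.1f} months")
```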
A mistake in this critique is that it assumes an exponential: a constant proportional rate of growth. It is true that, in some sense, an exponential always seems to be accelerating while infinity always remains equally far away.
But this is a bit of a straw man. Mathematical models of the technological singularity [1], along with the history of human economic growth [2], are super-exponential: the rate of growth is itself increasing over time, or at least has taken multiple discrete leaps [3] at the transitions to agriculture and industry, respectively. A true singularity/infinity can of course never be achieved for physical reasons (limited stuff within the cubically-expanding lightcone, plus inherent limits to technology itself), but the growth curve can look hyperbolic and traverse many orders of magnitude before those physical limits are encountered.
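To see the difference, compare the textbook forms of the two curves. Exponential growth solves dx/dt = k·x and stays finite at every time; the simplest hyperbolic model solves dx/dt = k·x² and diverges at the finite time t* = 1/(k·x0). These equations are a standard illustration, not necessarily the specific models in [1]; the constants are arbitrary.

```python
import math

def exponential(x0: float, k: float, t: float) -> float:
    """Solution of dx/dt = k*x: finite for every finite t."""
    return x0 * math.exp(k * t)

def hyperbolic(x0: float, k: float, t: float) -> float:
    """Solution of dx/dt = k*x**2: blows up at t_star = 1/(k*x0)."""
    t_star = 1.0 / (k * x0)
    if t >= t_star:
        raise ValueError("at or past the finite-time singularity")
    return x0 / (1.0 - k * x0 * t)

# Arbitrary constants: x0 = 1, k = 1, so the singularity sits at t = 1.
for t in (0.0, 0.5, 0.9, 0.99, 0.999):
    print(f"t={t}: exp={exponential(1.0, 1.0, t):.2f}  hyp={hyperbolic(1.0, 1.0, t):.2f}")
```

The hyperbolic curve covers several orders of magnitude in the last sliver of time before t*, which is the regime where physical limits would have to take over.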
It can’t be infinitely fast, but after the point where we all collectively cease to be able to comprehend the rate of change, it’s effectively a discontinuity from our point of view.
In some redacted documents, there is even an alphabetical word index at the end with a list of pages on which the words appear.
The redacted words are also redacted in the word index, but the alphabetically preceding and succeeding words are visible, as is the number of index lines taken up by the redacted word's entry, which correlates with the number of appearances of that word.
This seems like rather useful information to constrain a search by such a tool.
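A minimal sketch of how the index could narrow the search: keep only candidate words that sort strictly between the visible neighbours. The word list is invented, and the sketch assumes the index uses plain case-insensitive alphabetical order, which a real index may not.

```python
def candidates_between(wordlist, prev_word, next_word):
    """Keep words that sort strictly between the two visible index neighbours."""
    lo, hi = prev_word.lower(), next_word.lower()
    return sorted(w for w in wordlist if lo < w.lower() < hi)

# Hypothetical dictionary and index neighbours (assumptions).
wordlist = ["apple", "banana", "bandit", "bank", "barn", "cherry"]
matches = candidates_between(wordlist, "banana", "barn")
print(matches)  # ['bandit', 'bank']
```

The number of index lines the redacted entry occupies could then be used as a second filter on how often each remaining candidate plausibly appears.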