I'm not sure why you think hallucinations can't be "fixed". If we define hallucinations as falsehoods introduced between the training data and LLM output, then it seems obvious that the hallucination rate could at least be reduced significantly. Are you defining hallucinations as falsehoods introduced at any point in the process?
Alternatively, are you saying that they can never be entirely fixed because LLMs are an approximate method? I'm in agreement here, but I don't think the researchers are claiming that they solved hallucinations completely.
Do you think LLMs don't have an internal model of the world? Many people seem to think that, but it is possible to find an internal model of the world in small LLMs trained on specific tasks (See [0] for a nice write-up of someone doing that with an LLM trained on Othello moves). Presumably larger general LLMs have various models inside of them too, but those would be more difficult to locate. That being said, I haven't been keeping up with the literature on LLM interpretation, so someone might have managed it by now.
> If we define hallucinations as falsehoods introduced between the training data and LLM output,
Yes, if.
Or we could realize that the LLM's output is a random draw from a distribution learned from the training data, i.e. ALL of its outputs are hallucinations. It has no concept of truth or falsehood.
I think what you are saying here is that because it has no "concept" (I'll assume that means internal model) of truth, there is no possible way of improving the truthiness of an LLM's outputs.
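To make the "random draw from a distribution" picture concrete, here is a minimal sketch of sampling one token from a softmax over logits. The vocabulary, the logit values, and the helper names are made up for illustration; a real LLM does this over tens of thousands of tokens, with logits produced by the network.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(vocab, logits, rng, temperature=1.0):
    """Draw one token at random, weighted by its softmax probability."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical next-token logits after the prompt "two plus two is".
vocab = ["four", "five", "fish"]
logits = [4.0, 1.0, -2.0]

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_token(vocab, logits, rng) for _ in range(1000)]
counts = {tok: draws.count(tok) for tok in vocab}
print(counts)
```

Every output is a draw in this sense, but the draws are heavily weighted: with these made-up logits "four" carries about 95% of the probability mass, so the distribution itself can be more or less concentrated on the answer a reader would call correct.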
However, we do know that LLMs possess viable internal models, as I linked to in the post you are responding to. The OP paper notes that the probes it uses find the strongest signal of truth, where truth is defined by whatever the correct answer on each benchmark is, on the middle layers of the model during the activation of these "exact answer" tokens. That is, we have something which statistically correlates with whether the LLM's output matches "benchmark truth" inside the LLM. Assuming that you are willing to grant that "concept" and "internal model" are pretty much the same, this sure sounds like a concept of "benchmark truth" at work. If you aren't willing to grant that, I have no idea of what you mean by concept.
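For anyone unfamiliar with probes, a rough sketch of the idea follows. This is not the paper's code: the "activations" are synthetic stand-ins generated under the loud assumption that correct answers shift the middle-layer activation along one fixed direction, and all function names are hypothetical. The probe itself is just logistic regression trained to predict "matches benchmark truth" from the activation vector.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def make_synthetic_activations(n, dim, rng):
    """Stand-in for activations at the exact-answer token.
    ASSUMPTION: label 1 (correct) shifts every coordinate by +1, label 0 by -1."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        shift = 1.0 if label else -1.0
        x = [rng.gauss(0.0, 1.0) + shift for _ in range(dim)]
        data.append((x, label))
    return data

def train_probe(data, dim, lr=0.1, epochs=50):
    """Fit a linear probe (logistic regression) with plain SGD."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def probe_score(w, b, x):
    """Probe's estimated probability that the output is correct."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

rng = random.Random(0)
dim = 8
train = make_synthetic_activations(400, dim, rng)
held_out = make_synthetic_activations(200, dim, rng)

w, b = train_probe(train, dim)
hits = sum((probe_score(w, b, x) >= 0.5) == bool(y) for x, y in held_out)
accuracy = hits / len(held_out)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

The probe only shows that a signal correlated with the label is linearly readable from the vector. On real models the activations would come from a chosen layer and token position, and the accuracy would be whatever the data gives you, not something this toy setup can predict.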
If you mean to say that humans have some model of Objective Truth which is inherently superior, I'd argue that isn't really the case. Human philosophers have been arguing for centuries over how to define truth, and don't seem to have come to any conclusion on the matter. In practice, people have wildly diverging definitions of truth, which depend on things like how religious or skeptical they are, what the standards for truth are in their culture, and various specific quirks from their own personality and life experience.
This paper only measured "benchmark truth" because that is easy to measure, but it seems reasonable to assume that other models of truth exist within them. Given that LLMs are supposed to replicate the words that humans wrote, I suspect that their internal models of truth work out to be some agglomeration (plus some noise) of what various humans think of as truth.
If that were the case, you couldn't give it a statement and ask whether that statement is true or not, and get back a response that is correct more often than not.
If language communicates thoughts, thoughts have a relationship with reality, and that relationship might be true or false or something else.
Then what thought is LLM language communicating, to what reality does it bear a relationship, and what is the truth or falseness of that language?
To me, LLM-generated sentences have no true or false value. They are strings, literally, not thoughts.
Take the simple "user: how much is two plus two? assistant: two plus two is four". It may seem trivial, but how do you ascertain that that statement maps to 2+2=4? Do you make a leap of faith, or argue that the word "plus" maps to the adding function? What about "is", does it map to equality, even though it is the same token as in "water is wet" (where wet is not water)? Or are we arguing that the truthfulness lies in the embedding interpretation, where tokens and strings merely communicate the multidimensional embedding space, which could be said to be a thought, and we are now mapping some of the vectors in that space as true and some as false?
Let's assume LLMs don't "think". We feed an LLM an input and get back an output string. It is then possible to interpret that string as having meaning in the same way we interpret human writing as having meaning, even though we may choose not to. At that point, we have created a thought in our heads which could be true or false.
Now let's talk about calculators. We can think of calculators as similar to LLMs, but speaking a more restricted language and giving significantly more reliable results. The calculator takes a thought converted to a string as input from the user, and outputs a string, which the user then converts to a thought. The user values that string because the thought it creates has a higher truthiness. People don't like buggy calculators.
I'd say one can view an LLM in exactly the same way, just that they can take a much richer language of thoughts, but output significantly buggier results.
You might not be able to sell someone a library that fixes all bugs, but you can sell (or give away) software systems that reduce the number of bugs. Doing that is pretty useful.
Examples include linters, fuzzers, testing frameworks, and memory safe programming languages (as in Rust, but also as in any language with a GC). All these things reduce the number of bugs in the final product by giving you a way to detect them (except for memory safe languages, which just eliminate a class of bugs). The paper is advertising a method to detect whether a given output is likely to be affected by a "bug", and a taxonomy of the symptoms of such bugs. The paper doesn't provide a way to fix those, and hallucinations don't necessarily have a single cause. Some hallucinations might be fixed by contextual calibration [0], others might be fixed by adding more training data similar to the wrong example.
In any case, you need to find the bad outputs before you can perform any fixes. Because LLMs tend to be used to produce "fuzzy" outputs with no single right answer, traditional testing frameworks and the like aren't always applicable.
Yeah for sure, but the claim in the article is something like "we found the line in compiler code that causes bugs" or "we found the bytes in the compiled object that cause bugs".
To me the claims in the article read something like "we have found a way to identify execution paths in some common compiler architecture (which are the transformer architecture in the case of LLMs) which are often but not always associated with buggy code". This seems like a reasonable claim to make.
Additionally, I think you may be suspecting research malpractice. Obviously I don't have insider knowledge, but I would note that the idea of training probes in the middle layer of the model wasn't their idea. This paper cites other papers that already did exactly that. The contribution of this paper is simply that focusing on the middle layers for certain "critical tokens" gives a better signal than just checking the middle layers on every token.
It's of course possible that this paper in particular is fraudulent, but note that there is a field of research making the same basic claim as this paper, so this isn't some one-off thing. A fair number of people from different institutions would need to be in on it for the entire field to be fraudulent.
Alternatively, I think you may be objecting to the use of the word "truthfulness" in the abstract of the paper, because you seem to think that only human thoughts can possibly have a true or false value. I'm not actually going to object to the idea that only human thoughts can be true or false, but as in the response I wrote to your koan comment, the user can interpret the LLM's output, which gives the user's thought a true or false value.
In this case, philosophically, you can think of this paper as trying to find cases where the LLM outputs strings that the user interprets as false. I think the authors of the paper are probably thinking about true or false more as a property of sentences, and thus a thing mere strings can possess regardless of how they are created. This is also a philosophically valid way to look at it, but differs from your view in a way that possibly made you think their claims absurd.
[0] https://thegradient.pub/othello