I'm getting tired of seeing this idea that hallucinations are practically guaranteed from any LLM used in production. I've used LLMs for various tasks, and if you tune your system correctly, they can be very reliable. It's just not always plug-and-play reliability. You need to set up your fine-tuning and prompts and then test well for consistent results.
> You need to set up your fine-tuning and prompts and then test well for consistent results.
Tell that to Google...
Seriously, it is well established that these systems hallucinate. Claiming otherwise shows you are pushing something that just is not true.
They can be right, yes. But when they are wrong they can be catastrophically wrong. You could be wasting time looking into the wrong problem with something like this.
It is a serious problem when these tools are being pushed as trustworthy when they are anything but.
Almost daily I deal with some sort of hallucination, whether in code or in summarizing something, and we see it constantly on social media when people try to use Google's AI summary as a source of truth.
Let's not lie to push an agenda about what these models can do. They are very powerful, but they make mistakes, and quite often. There is zero question about that.
The problem isn't that they hallucinate; the problem is that we have comments like yours trying to downplay it. Then we have people for whom it is right just enough times that they start trusting it without double-checking.
That is the problem: it is right enough times that you just start accepting the answers. That leads to things like scripts that grab data and put it into a database without checking. That's fine if it is not business-critical data, but it's not really fine when we are talking about health care data or... oh, I don't know, police records, like a recent post was talking about.
If you are going to use it for your silly little project, or you're going to bring down your own company's infrastructure, go for it. But let's not pretend the problem doesn't exist and shove this technology into far more sensitive areas.
I think you're exaggerating. You're imagining the worst, but your argument basically boils down to not trusting that people can handle it, and to calling me a liar. Good one.
Yeah, because Google's LLMs have a completely open question/answer space.
For e.g. a Kubernetes AI, you can nowadays just feed in the whole Kubernetes docs plus a few reference Helm charts, tell it to stick close to the material, and you'll have next to no hallucinations. Same thing for simple data extraction tasks: in the past you couldn't use LLMs for those because they would hallucinate data into the output that wasn't there in the input (e.g. completely mangling an ID), but nowadays that is essentially a non-issue.
As soon as you have a restrictable space in which the LLM acts, you have a lot of options for tuning it so that hallucinations are no longer a major issue.
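To make that concrete, here is a minimal sketch of the kind of grounded setup I mean. Everything in it is illustrative: call_llm() is a placeholder for whatever client you actually use, and the directory name and instructions are made up.

    from pathlib import Path

    def call_llm(prompt: str) -> str:
        # Placeholder for whatever model/client you actually use.
        raise NotImplementedError

    def build_prompt(question: str, doc_dir: str = "k8s-docs") -> str:
        # Concatenate the reference material (docs, Helm charts) the model must stick to.
        reference = "\n\n".join(p.read_text() for p in sorted(Path(doc_dir).rglob("*.md")))
        return (
            "Answer using ONLY the reference material below. "
            "If the answer is not in the material, say you don't know.\n\n"
            "=== REFERENCE ===\n" + reference + "\n\n"
            "=== QUESTION ===\n" + question + "\n"
        )

    # answer = call_llm(build_prompt("How do I configure a PodDisruptionBudget?"))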
I am getting very tired of seeing everyone pushing LLMs and being disingenuous about exactly how often they get things wrong, and about what the impact of that is. There is a reason AI is not taking off in the enterprise, and it is that people who take their jobs seriously are getting tired too.
I understand your sentiment, but I also don't think it's fair to say people are being disingenuous. I don't work for an AI company, I just use it with decent results.
My last project needed a way to tag certain types of business activity indicated in 17,000 user reviews. I wrote a prompt with 5 different tags along with descriptions for each business activity, took a few-shot approach by defining 8 different examples and how I would tag them, and then ran the tagging prompt on batches of indexed reviews, giving it 100 reviews per batch. I did a random sampling of about 200 items, and the success rate was roughly 89%. I could have improved it by doing more iterations, and possibly fine-tuning if I felt that it was worth it.
In every run, it generated results matching the input JSON in a corresponding output JSON, without a single error.
That's the only example I have numbers on off the top of my head.
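For anyone curious about the shape of that pipeline, here is a rough sketch. The tag names, example lines, and call_llm() placeholder are all illustrative, not the actual project code.

    import json

    def call_llm(prompt: str) -> str:
        # Placeholder for whatever model/client was actually used.
        raise NotImplementedError

    # Hypothetical tag names; the real prompt had 5 tags, each with a description,
    # plus 8 worked examples (few-shot).
    FEW_SHOT_PROMPT = """\
    Tag each review with exactly one of: tag_a, tag_b, tag_c, tag_d, tag_e.
    Examples (8 in the real prompt):
      {"id": 1, "review": "..."} -> {"id": 1, "tag": "tag_b"}
      {"id": 2, "review": "..."} -> {"id": 2, "tag": "tag_d"}
    Return a JSON array of {"id": ..., "tag": ...} objects, one per input review.
    """

    def tag_batch(reviews):
        # reviews: [{"id": int, "review": str}, ...]; about 100 per batch in practice
        prompt = FEW_SHOT_PROMPT + "\nInput:\n" + json.dumps(reviews, ensure_ascii=False)
        return json.loads(call_llm(prompt))  # output keyed by "id" so it maps back to the input

    def tag_all(reviews, batch_size=100):
        tagged = []
        for i in range(0, len(reviews), batch_size):
            tagged.extend(tag_batch(reviews[i:i + batch_size]))
        return tagged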
(Reader beware: basically everything the LLM emits in here is wrong.)
I was attempting to find formats that used a particular quoting style — SQL-style double-the-character-to-escape — and bizarrely JSON came up.
Gemini:
> JSON: In JSON, double quotes are used to delimit strings. If a double quote needs to be included within the string itself, it's escaped by doubling it (e.g., "This is a string with a "quote" in it.").
That is,
"This is a string with a "quote" in it."
Me: Does your JSON example parse as valid JSON?
Gemini:
> Yes, the JSON example I provided ("This is a string with a "quote" in it.") is actually valid JSON.
> The reason it's valid is because the doubled quote within the string is interpreted as an escaped character, representing a single literal quote within the string itself. The JSON parser recognizes the double quotes at the beginning and end as delimiters for the entire string, not the escaped quote within.
It also called out,
> XML: Use a backslash (\) to escape characters, including quotes.
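For the record, both claims are easy to check: the doubled-quote string does not parse as JSON (the real JSON escape is a backslash), and XML escapes quotes with entities like &quot;, not backslashes. A quick illustration:

    import json

    # Gemini's "valid JSON" example: the doubled quote is NOT a JSON escape.
    bad = '"This is a string with a "quote" in it."'
    try:
        json.loads(bad)
    except json.JSONDecodeError as e:
        print("invalid:", e)  # the parser treats the second quote as the end of the string

    # The actual JSON escape for a double quote is a backslash.
    good = '"This is a string with a \\"quote\\" in it."'
    print(json.loads(good))  # This is a string with a "quote" in it.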
In a different context, and closer to the "working with k8s" use case that the OP's AI is focused on,
Me: Regex to extract the replica number from a statefulset pod's name?
Gemini:
> Use kubectl: The kubectl get pods command with output formatting can directly show replica numbers. Here's an example:
kubectl get pods -o jsonpath='{.items[*].metadata.name}{" - replica: "}{.metadata.labels.replicaset-序号}'
The Chinese here, as best as I can tell, reads "serial number" … which sort of makes sense contextually, for an LLM, I guess. The command is utter nonsense, of course.
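For what it's worth, the working answer is much simpler: StatefulSet pods are named <statefulset-name>-<ordinal>, so the replica number is just the trailing digits. A quick sketch:

    import re

    # StatefulSet pods are named <statefulset-name>-<ordinal>, e.g. "web-0", "web-1", "web-2".
    def replica_number(pod_name: str) -> int:
        m = re.search(r"-(\d+)$", pod_name)
        if m is None:
            raise ValueError(f"{pod_name!r} does not look like a StatefulSet pod name")
        return int(m.group(1))

    print(replica_number("web-2"))      # 2
    print(replica_number("my-app-10"))  # 10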
My absolute favorite is when the model will tell you why something should be one way, but in the same message confidently tell you it is the other way.
I had a situation where I was playing around with having it write a quick method for me. I gave it the specs it needed to work with and the other methods it would call.
Basically, it wrote a call to one of those methods with the wrong (missing) arguments. When I asked it to fix that call, it first said its current call was right, but then, when it went to explain it (in the same message), it pointed out that a certain missing argument was required to be passed in.
I have seen similar situations time and time again with these models looking at data. In theory the model sees the correct data, but once that wrong data (the first failed interpretation) is part of its context, it's basically all over.