
We are working on an AI product in a highly regulated industry (investing). Recently we have been experimenting with the GPT-3 API as a junior equity analyst. On an eyeball check, the results of the technology are impressive.

The problem is that there is no way to validate the output at scale; i.e., we can't get statistics from the API that explain the results.

In contrast, for our own Entity Recognition models we can (and do) calculate probabilities that explain why a certain entity is shown.

Hence, I think that for API users of GPT-3, OpenAI should return additional statistics explaining why a certain result is returned the way it is, to make it really useful and, more importantly, compliant.
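For comparison, the kind of statistic our entity-recognition models expose is essentially a normalized probability over candidate labels for each span. A minimal sketch of that idea (the labels and logit values here are entirely hypothetical, not from any real model):

```python
import math

def softmax(logits):
    """Convert raw per-label scores into a probability distribution."""
    m = max(logits.values())
    exps = {label: math.exp(score - m) for label, score in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Hypothetical raw scores for the span "Tesla" from an NER model
logits = {"ORG": 4.1, "PERSON": 1.2, "O": 0.3}
probs = softmax(logits)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))  # → ORG 0.928
```

A statistic like `probs[best]` is something a compliance team can log and threshold; the complaint above is that the GPT-3 API returns no equivalent signal.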



GPT is a voluble, entertaining bullshitter that is occasionally correct. It is not reliable, and for now, it cannot be made reliable.

It is a very impressive toy, but still just a toy for now.


It can be made more reliable.

Here’s how:

https://github.com/williamcotton/empirical-philosophy/blob/m...


For now, GPT is creating the filler content that moves the web in 2023. But given the results I have seen from our PoCs, it can do more and will do more in the future.


> that is occasionally correct

It's almost always correct; otherwise it would be worthless.

Yes, it's possible to construct questions that lead to nonsensical answers, and sometimes nonsensical answers are given even to sensible questions, but saying that ChatGPT's answers are "occasionally" correct is weapons-grade BS. ChatGPT is a hair's breadth from being an ultimate answer machine, and is far more likely to be correct on almost any question than the average human.

The FUD that is currently being manufactured around language models is insane. I guess we should all stop using search engines, since those are even less reliable.


It's nearly always wrong in my experience. If I ask it a basic technical question, it gets things mostly correct, but if you look too closely at the details, errors start popping up. If you ask it something relatively complicated, it's nearly always wrong.

As an easy example of the latter, I asked ChatGPT "What is the S4 model in NLP?" and it gave back:

> The S4 model (Simple Synchronous Syntactic-Semantic Sentence Simplification Model) is a model in natural language processing (NLP) that aims to simplify complex sentences while preserving their original meaning. The model was introduced in the paper "A Simple Synchronous Syntactic-Semantic Sentence Simplification Model" by Siddharthan et al. (2014).

> The S4 model consists of two main components: a synchronous grammar and a semantic model. The synchronous grammar is used to identify the syntactic structure of the input sentence and generate a simplified version of it. The semantic model is then used to ensure that the simplified sentence retains the original meaning.

> The S4 model is designed to address the problem of text simplification, which is the process of transforming complex text into simpler text that is easier to read and understand. Text simplification is important for making information more accessible to people with cognitive disabilities or limited literacy skills, as well as for improving readability for non-native speakers of a language.

> Overall, the S4 model represents an important contribution to the field of NLP, as it provides a simple yet effective approach to sentence simplification that can be used in a wide range of applications.

However, all of the above statements are complete nonsense. The citation doesn't exist, and the S4 model it describes doesn't either.

I have found search engines to be significantly more reliable (with the ability to actually check sources) than ChatGPT. Usually, when I get decent answers from ChatGPT, I would also get the same from Wikipedia, which is usually a top result on most search engines. Do you have an example of an entire class of technical queries where ChatGPT might perform better?


> "What is the S4 model in NLP?"

99.99% of people would be unable to answer that question (without looking it up, I mean). Such hyper-specific queries for highly technical information from niche fields say very little about the model's overall performance at natural language tasks.

If you ask things like "Which of these animals doesn't live in Africa?" or "What is the most reactive chemical element?", ChatGPT's answers are almost always correct. And they are far more likely to be correct than the average (unaided) human's.


We already had Watson for Jeopardy-style general knowledge quiz questions a decade ago. It didn't revolutionize anything.


Update. This morning I asked ChatGPT what day today was. It answered correctly. I then asked how it could know that given that its training data ends in September 2021. It said it was based on the number of days since its training data ended. I pointed out it still had no way of knowing that number of days if it had no knowledge past September 2021. It kept apologizing and repeating the same story over and over.


ChatGPT is almost always bullshitting if you ask it to create a complete list of something with more than 10 entries or so.


I'm not sure exactly what the ask here is.

>In contrast, for our own Entity Recognition models we can (and do) calculate probabilities that explain why a certain entity is shown.

>Hence, I think for API users of GPT3, OpenAI should return additional statistics why a certain result is returned the way it is to make it really useful and more importantly compliant.

For LLMs, you can get the same thing: the distribution of probabilities over the next token, for each token generated. But right now we cannot say why the probabilities are the way they are; the same goes for your entity recognition models.
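To make that concrete, here is a minimal sketch of what such a next-token distribution looks like once you exponentiate top-k logprobs of the kind the completions API can return. The candidate tokens and values are made up for illustration:

```python
import math

# Hypothetical top-5 logprobs for the next token; log probabilities
# are already normalized, so exp() recovers the probabilities directly
top_logprobs = {"Apple": -0.12, "Alphabet": -2.8, "Amazon": -3.5,
                "Meta": -4.1, "Tesla": -4.9}

probs = {tok: math.exp(lp) for tok, lp in top_logprobs.items()}
ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
margin = ranked[0][1] - ranked[1][1]  # gap between the two best candidates
print(ranked[0][0], round(ranked[0][1], 3), round(margin, 3))
```

A large margin tells you the model strongly preferred one continuation; it still tells you nothing about *why*, which is the explainability gap being discussed.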


The problem in a nutshell, and the one the FTC has pointed out, is model explainability. In the past I worked on an AI for automated lending decisions; we were required to be able to explain every single decision the engine took.

If a news article reaches our AI engine now, it will tag, categorize, classify, and rank that article, all based on models that are explainable.

LLMs, at least as I have personally implemented them in the past, are a huge black box that is largely non-explainable.


A blogger (an experienced ML professional in fintech) published an excellent write-up on AI in fintech in December. His basic take was that there's a lot of room for the tech to grow before it becomes truly ubiquitous, because answers in finance must be correct 100% of the time. Worth a read!

https://chaosengineering.substack.com/p/artificial-intellige...


You can return log probs per token generated. These can be used to assess the confidence the model has in handling tasks that involve nominal data.

If that’s not helpful, were you getting at having the model return some rich data about the attention weights that went into generating some token?
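For the first option, a common way to turn per-token log probs into a single confidence signal is the mean logprob, or equivalently perplexity. A minimal sketch with hypothetical values:

```python
import math

def sequence_confidence(token_logprobs):
    """Aggregate per-token log probabilities into sequence-level statistics:
    mean logprob and perplexity (lower perplexity = higher confidence)."""
    mean_lp = sum(token_logprobs) / len(token_logprobs)
    return mean_lp, math.exp(-mean_lp)

# Hypothetical logprobs for two 5-token completions
confident = sequence_confidence([-0.1, -0.2, -0.05, -0.3, -0.15])
uncertain = sequence_confidence([-1.8, -2.5, -0.9, -3.1, -2.2])
print(confident, uncertain)  # perplexity ~1.17 vs ~8.17
```

A threshold on perplexity could flag low-confidence completions for human review, though it still doesn't explain *why* a given answer was produced.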


For most of our models we return more information. Especially from a vendor/customer perspective, I believe this to be quite important.


And what happens when you run out of analysts with 3 years of experience in <checks notes> three years?


[flagged]


It is a regulated industry. Compliance exists for a reason. As we have seen during the crypto fallout... ^-^



