I wonder how OpenAI are going to avoid the problem after the web is littered wit...

valine · on April 24, 2023

Presumably the ChatGPT content that makes it onto the web is at the very least curated by humans, making that text on average slightly higher quality than the raw output of ChatGPT. If that's the case, then you would expect model performance to continue to improve even if the dataset is polluted.

senttoschool · on April 24, 2023

That's a bold assumption. I can imagine a world where 99.999% of the web will be filled with non-human curated AI generated text.

The rate at which AI can generate text will be so much greater than what humans can generate.

pr337h4m · on April 24, 2023

Doesn't matter. We want high-quality text - it's not necessary for it to be human-written. Social signals like upvotes or PageRank will still remain useful even if most text is AI generated.

goatlover · on April 24, 2023

I certainly don't want most discussion forums to be generated by bots. I'd rather there was none of it. High-quality generated text is good for fiction and summaries, but not when you want to hear what actual humans have to say.

senttoschool · on April 24, 2023

The point is that AIs will run out of human-generated text, or that they won't be able to distinguish AI-generated from human-generated text to train on.

You're already assuming pagerank and upvote systems won't break down in the future.

checkyoursudo · on April 24, 2023

You just gotta get the AIs to do the upvoting, then cut the humans out of the loop altogether and only have AIs read the AI-generated text, and then everything will be fine. Just an endless death spiral of AI generation, AI filtering, and AI consumption, forever and ever.

Presumably at some point computers will become (already are for all I know?) the largest consumers of content on the internet as well as its producers.

imtringued · on April 24, 2023

"bold assumption" says the guy who assumes $2 worth of energy spent on AI generated text for every single written word by humans.

Now go ahead and spend $50 on AI generated text nobody is ever going to read, just like almost nobody is going to read this comment.

senttoschool · on April 24, 2023

Bold assumption that AI generated text won't get exponentially cheaper. It already costs orders of magnitude less than human generated text of the same quality.

AstralStorm · on April 24, 2023

Costs a lot more than free text written by thinking humans.

pixl97 · on April 24, 2023

I think you're very confused about the costs required in operating a human... Or are you assuming because the human was going to be doing it anyway the cost is free?

creatonez · on April 24, 2023

I don't think this problem matters as much as people say it does, except maybe from a research perspective. The chatbot has essentially become part of human culture, it speaks human languages and could actually subtly influence the way human language works. It may develop its own idioms and communication style, and humans may adopt some of this. So yes: now that LLMs are released, everything is polluted in some way, similar to radioactive isotopes. But language is descriptive, not prescriptive: it always works as long as there is shared understanding. People will cherry pick the ChatGPT answers they were able to understand when publishing to the internet, and ignore/ridicule the output that didn't make sense to them.

Note that GPT-3.5 and above are already intentionally polluted with their own output by the RLHF process.

alex_sf · on April 24, 2023

My apologies, but as a human language model, it is unlikely that ChatGPT would have much impact on human culture.

geraldhh · on April 24, 2023

why not?

i'd say LLMs represent an institutionalized reinforcement of bias (much like journalism) combined with some inhuman autonomy.

moelf · on April 24, 2023

what do we say to people who make the argument "but the web is already littered with spam blogs and SEO stuff"?

rcme · on April 24, 2023

They probably fingerprint their generated content.

creatonez · on April 24, 2023

This has been researched, but no such thing has been implemented by OpenAI or in Bard.

rcme · on April 24, 2023

I think my comment was misunderstood. I didn’t mean the output text would contain some identifying information. Rather, OpenAI could generate a fingerprint from the text, similar to Apple’s NeuralHash for images, and store that so they can filter out generated text later.

p-e-w · on April 24, 2023

How could that possibly work?

stewartmcgown · on April 24, 2023

Well, they have all of the outputs of ChatGPT stored on their own servers. I suppose it wouldn't be out of the question to filter any future datasets they scrape against the outputs they have.
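To make the idea concrete, here is a minimal sketch of filtering scraped text against a store of previously generated outputs. Nothing here is known to reflect what OpenAI actually does: the `OutputStore` class, the 5-word shingle size, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import hashlib
import re


def shingles(text, k=5):
    """Return hashes of every run of k consecutive normalized words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return set()
    # Texts shorter than k words yield a single shingle of the whole text.
    count = max(1, len(words) - k + 1)
    return {
        hashlib.sha1(" ".join(words[i:i + k]).encode()).hexdigest()
        for i in range(count)
    }


class OutputStore:
    """Hypothetical store of fingerprints for everything a model has emitted."""

    def __init__(self, k=5):
        self.k = k
        self.seen = set()

    def record(self, generated):
        """Remember the shingles of one generated output."""
        self.seen |= shingles(generated, self.k)

    def overlap(self, scraped):
        """Fraction of the scraped text's shingles that were seen before."""
        found = shingles(scraped, self.k)
        if not found:
            return 0.0
        return len(found & self.seen) / len(found)

    def looks_generated(self, scraped, threshold=0.5):
        """Flag scraped text whose overlap meets the (assumed) threshold."""
        return self.overlap(scraped) >= threshold
```

Normalizing case and punctuation catches trivial reformatting, but any real rewording changes the shingles, so a paraphrased output would slip through this kind of filter.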

bckr · on April 24, 2023

Keep track of all embeddings ever emitted. While scraping, check all data against those embeddings.

So, not like a watermark, which would be impossible.
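A rough sketch of the embedding check described above, under heavy assumptions: `toy_embed` is a hypothetical stand-in (a hashed bag of words) for a real learned embedding model, and the linear scan over stored vectors would need an approximate nearest-neighbour index at any realistic scale.

```python
import hashlib
import math
import re

DIM = 256  # toy dimensionality, chosen arbitrarily for this sketch


def toy_embed(text):
    """Stand-in for a real embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9']+", text.lower()):
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


def max_similarity(scraped, emitted):
    """Highest similarity between scraped text and any stored embedding."""
    query = toy_embed(scraped)
    return max((cosine(query, e) for e in emitted), default=0.0)
```

A scraper would record `toy_embed(output)` for every emitted output and drop any scraped document whose `max_similarity` exceeds some cutoff. Where to put that cutoff is the hard part: too low and human text gets discarded, too high and lightly edited output passes.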

gojomo · on April 24, 2023

A watermark is absolutely possible - see for example some of the work Scott Aaronson has mentioned doing for OpenAI.

But: very fragile, especially if people are specifically trying to hide their GPT use, or have access to the watermarking algorithm or online oracle.

And: other methods – like remembering all output ever, or fuzzy summary representations of all output ever – seem to me similarly fragile, & introduce other problems & impracticalities.

A guess: OpenAI internally initially shared the common concern that "consuming its own junk outputs" could be a problem. But their own experiments so far, private & public, may have convinced them it's not as much of a problem in practice as it seems in theory. The model outputs have a mix of good and bad text – just like the pre-LLM internet. And, the same filterings/weightings that have worked on pre-LLM content keep working. And, counter to some early intuitions, often one LLM's quality output is in fact very useful input for other later LLMs.
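On the watermark point, a toy illustration of how a statistical text watermark can work in principle. This is not the scheme referred to above, whose details are not given in this thread; it is a simplified keyed "green list" approach, and the vocabulary, the key, and the 8-candidate sampler are all made up for the sketch.

```python
import hashlib
import math
import random

VOCAB = [f"w{i}" for i in range(1000)]  # hypothetical toy vocabulary


def is_green(key, prev, token):
    """Keyed pseudo-random coin flip: about half of all tokens are 'green'."""
    digest = hashlib.sha256(f"{key}|{prev}|{token}".encode()).digest()
    return digest[0] % 2 == 0


def generate(n, key, rng, watermark):
    """Toy 'model': picks each token at random from 8 sampled candidates.

    With watermark=True it restricts the choice to green candidates
    whenever at least one is available.
    """
    tokens = ["<s>"]
    for _ in range(n):
        candidates = rng.sample(VOCAB, 8)
        if watermark:
            green = [t for t in candidates if is_green(key, tokens[-1], t)]
            candidates = green or candidates
        tokens.append(rng.choice(candidates))
    return tokens[1:]


def z_score(tokens, key):
    """How far the green-token count sits above the 50% expected by chance."""
    prev, hits = "<s>", 0
    for token in tokens:
        hits += is_green(key, prev, token)
        prev = token
    n = len(tokens)
    return (hits - 0.5 * n) / math.sqrt(0.25 * n)
```

Text from `generate(..., watermark=True)` scores far above chance under `z_score` with the right key, while unmarked text stays near zero. The fragility is visible too: each replaced token randomizes its own coin flip and the next one, so heavy editing or paraphrasing pulls the score back toward chance, and anyone holding the key can test their edits until the text passes.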

pprotas · on April 24, 2023

Computerphile has a video that explains it very well: https://youtu.be/XZJc1p6RE78

(You can skip to the section “Verifying”)