They probably fingerprint their generated content.

creatonez · on April 24, 2023

This has been researched, but neither OpenAI nor Google (for Bard) has implemented anything like it.

rcme · on April 24, 2023

I think my comment was misunderstood. I didn't mean the output text would contain some identifying information. Rather, OpenAI could generate a fingerprint from the text, similar to Apple's NeuralHash for images, and store it so they can filter out generated text later.

p-e-w · on April 24, 2023

How could that possibly work?

stewartmcgown · on April 24, 2023

Well, they have all of the outputs of ChatGPT stored on their own servers. I suppose it wouldn't be out of the question to filter any future datasets they scrape against the outputs they have.
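
To make the idea concrete, here is a minimal sketch of filtering scraped text against stored outputs. It assumes fingerprints are hashed word 5-grams ("shingles") and that a page is flagged when at least half of its shingles were seen before. The names `OutputStore`, `shingles`, and the 0.5 threshold are hypothetical, and nothing here is known to reflect what OpenAI actually does.

```python
import hashlib
import re


def shingles(text, n=5):
    """Return the set of hashed word n-grams (shingles) for a text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        grams = [" ".join(words)] if words else []
    else:
        grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    # Truncated SHA-1 keeps the stored fingerprints small (assumption: 64 bits is enough).
    return {hashlib.sha1(g.encode("utf-8")).hexdigest()[:16] for g in grams}


class OutputStore:
    """Hypothetical store of fingerprints for every generated output."""

    def __init__(self):
        self.seen = set()

    def record(self, generated_text):
        """Remember the shingles of one generated output."""
        self.seen |= shingles(generated_text)

    def overlap(self, scraped_text):
        """Fraction of the scraped text's shingles that were seen before."""
        grams = shingles(scraped_text)
        if not grams:
            return 0.0
        return len(grams & self.seen) / len(grams)

    def looks_generated(self, scraped_text, threshold=0.5):
        """Flag scraped text whose overlap meets the (arbitrary) threshold."""
        return self.overlap(scraped_text) >= threshold
```

Exact-match shingles survive changes in case and punctuation and partial quoting, but not paraphrasing, and the store grows with every output ever produced.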

bckr · on April 24, 2023

Keep track of all embeddings ever emitted. While scraping, check all data against those embeddings.

So, not like a watermark, which would be impossible.

gojomo · on April 24, 2023

A watermark is absolutely possible - see for example some of the work Scott Aaronson has mentioned doing for OpenAI.

But: very fragile, especially if people are specifically trying to hide their GPT use, or have access to the watermarking algorithm or online oracle.
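
For illustration, a toy sketch loosely modeled on the kind of scheme Aaronson has described publicly: a secret key seeds a pseudorandom value r for each (context, candidate token) pair, the sampler picks the token maximizing r^(1/p), and a detector holding the key sums ln(1/(1-r)) over the text. The vocabulary, `toy_probs`, the context length, and all function names are assumptions made up for this example; this is not OpenAI's implementation.

```python
import hashlib
import hmac
import math
import random

VOCAB = [f"w{i}" for i in range(50)]   # toy vocabulary, purely illustrative
CONTEXT = 4                            # how many previous tokens seed the PRF


def toy_probs(context):
    """Stand-in for a language model: a fixed, non-uniform distribution."""
    total = sum(range(1, len(VOCAB) + 1))
    return {tok: (i + 1) / total for i, tok in enumerate(VOCAB)}


def prf(key, context, token):
    """Keyed pseudorandom value in [0, 1) for a (context, token) pair."""
    msg = " ".join(list(context) + [token]).encode("utf-8")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def generate(key, length):
    """At each step, pick the token maximizing r ** (1 / p)."""
    tokens = []
    for _ in range(length):
        context = tokens[-CONTEXT:]
        probs = toy_probs(context)
        tokens.append(max(probs, key=lambda t: prf(key, context, t) ** (1 / probs[t])))
    return tokens


def score(key, tokens):
    """Mean of ln(1 / (1 - r)); close to 1 for text made without the key."""
    total = 0.0
    for i, tok in enumerate(tokens):
        r = prf(key, tokens[max(0, i - CONTEXT):i], tok)
        total += math.log(1 / (1 - r))
    return total / len(tokens)


if __name__ == "__main__":
    key = b"secret"
    marked = generate(key, 200)
    plain = random.Random(0).choices(VOCAB, k=200)
    print("watermarked:", round(score(key, marked), 2))
    print("unmarked:   ", round(score(key, plain), 2))
```

In this toy, anyone without the key sees ordinary-looking samples, while the key holder sees a score well above 1. Rewording the text changes the (context, token) pairs and pulls the score back toward 1, which is the fragility described above.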

And: other methods – like remembering all output ever, or fuzzy summary representations of all output ever – seem to me similarly fragile, & introduce other problems & impracticalities.

A guess: OpenAI internally initially shared the common concern that "consuming its own junk outputs" could be a problem. But their own experiments so far, private & public, may have convinced them it's not as much of a problem in practice as it seems in theory. The model outputs have a mix of good and bad text, just like the pre-LLM internet. And the same filterings/weightings that have worked on pre-LLM content keep working. And, counter to some early intuitions, one LLM's quality output is often very useful input for other, later LLMs.

pprotas · on April 24, 2023

Computerphile has a video that explains it very well: https://youtu.be/XZJc1p6RE78

(You can skip to the section “Verifying”)