I think my comment was misunderstood. I didn’t mean the output text would contain some identifying information. Rather, OpenAI could generate a fingerprint from the text, similar to Apple’s neural has for images, and store that so they can filter out generated text later.
Well, they have all of the outputs of ChatGPT stored on their own servers. I suppose it wouldn't be out of the question to filter any future datasets they scrape against the outputs they have.
A watermark is absolutely possible - see for example some of the work Scott Aaronson has mentioned doing for OpenAI.
But: very fragile, especially if people are specifically trying to hide their GPT use, or have access to the watermarking algorithm or online oracle.
And: other methods – like remembering all output ever, or fuzzy summary representations of all output ever – seem to me similarly fragile, & introduce other problems & impracticalities.
A guess: OpenAI internally initially shared the common concern that "consuming its own junk outputs" could be a problem. But their own experiments so far, private & public, may have convinced them it's not as much of a problem in practice as it seems in theory. The model outputs have a mix of good and bad text – just like the pre-LLM internet. And, the same filterings/weightings that have worked on pre-LLM content keep working. And, counter to some early intuitions, often one LLM's quality output is in fact very-useful input for other later LLMs.