
It is kind of unfortunate how people don't actually read the paper but only run with the conclusions, speculating whether this would or would not work.

Here's the paper in question:

https://arxiv.org/abs/2310.13828

My two cents is that in its current implementation the compromised images can be easily detected, and possibly even 'de-poisoned'.

The attack works by targeting an underrepresented concept (let's say 1% of images contain dogs, so 'dog' is a good concept to attack).

They poison the concept of 'dog' with the concept of 'cat' by blending (in latent space) an archetypal image of a 'cat' (always the same one) into every image containing a 'dog'.

This works during training: every poisoned 'dog' image contains the same blended-in cat image, so this false signal eventually builds up in the model, even if the poisoned sample count is low.
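To make the blending idea concrete, here is a toy numpy sketch (not the paper's actual implementation). `encode` is a stand-in for a real image-to-latent encoder such as a VAE, and `alpha` is an illustrative blend strength:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(image):
    # Stand-in for a real image -> latent encoder (e.g. a VAE);
    # here just a flatten, to keep the sketch self-contained.
    return image.reshape(-1).astype(np.float64)

def poison(dog_latent, cat_anchor, alpha=0.2):
    # Blend the SAME cat anchor into every dog latent.
    # alpha trades off poison strength against visibility.
    return (1.0 - alpha) * dog_latent + alpha * cat_anchor

# One fixed 'cat' anchor, blended into every 'dog' image's latent.
cat_anchor = encode(rng.normal(size=(16, 16)))
dog_latents = [encode(rng.normal(size=(16, 16))) for _ in range(5)]
poisoned = [poison(d, cat_anchor) for d in dog_latents]
```

Because the same anchor is reused, every poisoned latent gains a shared correlation with the 'cat' anchor that clean latents lack.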

But note: this exploits the lack of data in a domain - this would not prevent the model from generating anime waifus or porn, because the training set of those is huge.

But how to detect poisoned images?

1. You take a non-poisoned labeler (these exist, because clean pre-SD datasets and pre-poison diffusion models exist)

2. You have both your new model and the non-poisoned labeler check your images; where their labels systematically disagree, you find that the concept of 'dog' has been poisoned

3. You convert all your 'dog' images to latent space and take the average. Most likely all the non-poison details will average out, while the poison will accumulate.

4. You now have a 'signature' of the poison. You check each of your images in latent space for correlation with the signature. If the correlation is high, the image is poisoned.

The poison is easily detectable for the same reason it works - it embeds a very strong signal that gets repeated across the training set.
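Steps 3 and 4 can be sketched with toy latents. All names, dimensions, and the 30% poison rate below are illustrative assumptions, not values from the paper; the point is just that per-image detail averages toward zero while the repeated poison survives:

```python
import numpy as np

rng = np.random.default_rng(1)
n_images, dim = 200, 512

# Toy stand-ins for 'dog' latents: uncorrelated per-image detail.
clean_detail = rng.normal(size=(n_images, dim))

# A fixed poison vector blended into a subset of the images
# (step 2 told us the concept is poisoned; we don't yet know which images).
poison_vec = rng.normal(size=dim)
is_poisoned = rng.random(n_images) < 0.3
latents = clean_detail + np.where(is_poisoned[:, None], 0.5 * poison_vec, 0.0)

# Step 3: average all latents. Per-image detail cancels (~1/sqrt(n)),
# while the repeated poison accumulates, yielding a poison signature.
signature = latents.mean(axis=0)
signature /= np.linalg.norm(signature)

# Step 4: score each latent by its projection onto the signature;
# high scores flag poisoned images.
scores = latents @ signature
flagged = scores > scores.mean()
```

With a simple mean threshold the poisoned and clean populations separate cleanly in this toy setup; a real pipeline would need a more careful threshold and a real latent encoder.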


