This doesn't seem all that impressive compared to earlier work like 'g.pt' (Peebles et al. 2022, https://arxiv.org/abs/2209.12892). They cite it in passing but offer no comparison or discussion, and to my eyes, g.pt is a lot more interesting (for example, you can prompt it for a variety of network properties, like low vs. high score, whereas this just generates unconditionally) and more thoroughly evaluated. The autoencoder here doesn't seem to add much.