Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders
Shengbang Tong, Boyang Zheng, Ziteng Wang +7 more
Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework extends to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decode...