Understanding and Mitigating Compositional Issues in Text-to-Image Generative Models
Under Review

We show that text-to-image generative models often fail to compose attributes and relationships accurately because the CLIP text encoder provides sub-optimal text conditioning, and that fine-tuning a simple linear projection on CLIP's representation space yields significant compositional improvements.
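To make the idea of a linear projection on a frozen encoder's output concrete, here is a minimal toy sketch. It is not the paper's implementation. It assumes random vectors in place of CLIP text embeddings and a placeholder mean-squared-error objective toward hypothetical target embeddings in place of whatever training loss the paper uses. All names (`project`, `fine_tune`, `frozen_embeddings`, `target_embeddings`) are illustrative.

```python
import random

def project(W, b, x):
    """Apply the trainable linear map y = W x + b to one embedding x."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def mse(W, b, inputs, targets):
    """Mean squared error between projected inputs and targets (placeholder loss)."""
    total = 0.0
    for x, t in zip(inputs, targets):
        y = project(W, b, x)
        total += sum((yi - ti) ** 2 for yi, ti in zip(y, t))
    return total / len(inputs)

def sgd_step(W, b, x, t, lr):
    """One in-place gradient step on the squared error of a single example."""
    y = project(W, b, x)
    for i in range(len(W)):
        g = 2.0 * (y[i] - t[i])
        for j in range(len(x)):
            W[i][j] -= lr * g * x[j]
        b[i] -= lr * g

def fine_tune(inputs, targets, dim, epochs=200, lr=0.05):
    """Train only W and b; the embeddings themselves are never modified."""
    # Start from the identity so the projection initially leaves embeddings unchanged.
    W = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    b = [0.0] * dim
    for _ in range(epochs):
        for x, t in zip(inputs, targets):
            sgd_step(W, b, x, t, lr)
    return W, b

rng = random.Random(0)
DIM = 4

# Assumption: stand-ins for frozen text-encoder outputs (not real CLIP embeddings).
frozen_embeddings = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(8)]
# Assumption: hypothetical "better" conditioning vectors, built as a fixed linear
# function of the inputs so that a linear projection can fit them.
target_embeddings = [[0.5 * v + 0.1 for v in reversed(x)] for x in frozen_embeddings]

identity = [[1.0 if i == j else 0.0 for j in range(DIM)] for i in range(DIM)]
loss_before = mse(identity, [0.0] * DIM, frozen_embeddings, target_embeddings)
W, b = fine_tune(frozen_embeddings, target_embeddings, DIM)
loss_after = mse(W, b, frozen_embeddings, target_embeddings)
print(loss_before, loss_after)
```

Only `W` and `b` are updated here, which mirrors the summary's point that the trainable component is small and sits on top of the encoder's representation space. In a real pipeline the projected embeddings would be passed to the image generator as its text conditioning.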