AgentComp significantly enhances the compositional abilities of text-to-image generative models, improving text–image alignment while preserving image quality and even boosting capabilities it was never explicitly trained for, such as text rendering.

Abstract

Text-to-image generative models have achieved remarkable visual quality but still struggle with compositionality—accurately capturing object relationships, attribute bindings, and fine-grained details in prompts. A key limitation is that models are not explicitly trained to differentiate between compositionally similar prompts and images, resulting in outputs that are close to the intended description yet deviate in fine-grained details. To address this, we propose AgentComp, a framework that explicitly trains models to better differentiate such compositional variations and enhance their reasoning ability. AgentComp leverages the reasoning and tool-use capabilities of large language models equipped with image generation, editing, and VQA tools to autonomously construct compositional datasets. Using these datasets, we apply an agentic preference optimization method to fine-tune text-to-image models, enabling them to better distinguish between compositionally similar samples and resulting in overall stronger compositional generation ability. AgentComp achieves state-of-the-art results on compositionality benchmarks such as T2I-CompBench, without compromising image quality—a common drawback in prior approaches—and even generalizes to other capabilities not explicitly trained for, such as text rendering.

Method Overview

We propose an agentic orchestration framework in which specialized agents collaborate to generate a positive image, synthesize contrastive prompts, produce corresponding negative images, and rank them by compositional distance. Each agent is equipped with appropriate tools, such as image editing modules and VQA models, for interacting with generated images and analyzing their compositional properties. These groups of near-identical yet compositionally contrasting samples are then used to explicitly train the model via our Agentic Preference Optimization method, enabling it to distinguish between the denoising trajectories of visually similar but compositionally distinct samples. This targeted supervision substantially improves compositional reasoning and understanding in text-to-image models.
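To make the training signal concrete, the sketch below shows one way such a pairwise preference objective can be written for a diffusion-based text-to-image model, in the spirit of Diffusion-DPO: the fine-tuned model is encouraged to denoise the positive sample better than its near-identical negative, relative to a frozen reference copy. The function and argument names (agentic_preference_loss, alphas_cumprod) and the choice of comparing per-sample denoising errors are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def agentic_preference_loss(model, ref_model, prompt_emb, x_pos, x_neg,
                            alphas_cumprod, beta=0.1):
    """Illustrative DPO-style preference loss over denoising errors.

    x_pos / x_neg: latents of a compositionally correct image and a
    near-identical negative produced by the agentic pipeline.
    ref_model: frozen copy of the base text-to-image model.
    """
    b = x_pos.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=x_pos.device)
    noise = torch.randn_like(x_pos)
    a = alphas_cumprod[t].view(b, 1, 1, 1)

    def denoise_error(net, x0, frozen=False):
        x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise        # DDPM forward noising
        with torch.no_grad() if frozen else torch.enable_grad():
            pred = net(x_t, t, prompt_emb)                   # predict the added noise
        return F.mse_loss(pred, noise, reduction="none").mean(dim=(1, 2, 3))

    # The fine-tuned model should denoise the positive better than the negative,
    # relative to the frozen reference model (standard Diffusion-DPO pairwise form).
    diff_pos = denoise_error(model, x_pos) - denoise_error(ref_model, x_pos, frozen=True)
    diff_neg = denoise_error(model, x_neg) - denoise_error(ref_model, x_neg, frozen=True)
    return -F.logsigmoid(-beta * (diff_pos - diff_neg)).mean()

In practice the ranked groups produced by the agents contain several negatives per positive; the sketch keeps a single pair for clarity.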


Results

Example from the dataset generated by the agentic orchestration framework. The dataset contains high-quality samples: each reference image accurately captures the compositional details of its prompt, and negative samples are created by subtly altering those details in the reference text–image pair.
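A minimal record for one such group could look like the sketch below; the class and field names are illustrative assumptions rather than the released dataset format.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class CompositionalSample:
    """One preference group from the agentic dataset (illustrative schema only)."""
    prompt: str                                # original compositional prompt
    positive_image: str                        # path to the verified reference image
    # Each negative pairs a subtly altered (contrastive) prompt with its edited
    # image, ranked by the agents according to compositional distance.
    negatives: List[Tuple[str, str]] = field(default_factory=list)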


Example scenario of the Image Generation Agent. The agent employs iterative reasoning and tool calls to produce a compositionally accurate image that aligns with the given prompt.
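A rough sketch of this loop is given below, assuming generic t2i_tool, vqa_tool, and llm callables; these interfaces and the yes/no questioning scheme are assumptions for illustration, not the agent's actual tool API.

def generate_positive_image(prompt, t2i_tool, vqa_tool, llm, max_rounds=4):
    """Illustrative generate-verify-refine loop for an image generation agent."""
    working_prompt = prompt
    image = None
    for _ in range(max_rounds):
        image = t2i_tool(working_prompt)
        # Derive yes/no checks covering every compositional detail of the
        # original prompt, e.g. "Is the cup to the left of the laptop?"
        questions = llm(
            f"List yes/no questions verifying every detail of: {prompt}"
        ).splitlines()
        failed = [q for q in questions
                  if vqa_tool(image, q).strip().lower() != "yes"]
        if not failed:
            return image  # compositionally faithful reference sample
        # Otherwise, rewrite the working prompt to stress the failing details.
        working_prompt = llm(
            f"Rewrite '{working_prompt}' so the image satisfies: " + "; ".join(failed)
        )
    return image  # best effort after max_rounds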


Example scenario of the Image Editing Agent. Given a source image, its prompt, and a target prompt, the image editing agent leverages editing and VQA tools to produce a correct contrastive sample. Although the editing tools may introduce unintended modifications (Steps 1 and 2), the agent detects these errors through reasoning and VQA feedback, adjusts its intermediate prompts, and ultimately generates the intended result.
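The behaviour in this scenario can be summarized by a loop like the one sketched below, where edit_tool applies a textual instruction to an image and vqa_tool answers questions about the result; both interfaces and the retry logic are illustrative assumptions, not the paper's implementation.

def make_contrastive_image(src_image, src_prompt, tgt_prompt,
                           edit_tool, vqa_tool, llm, max_rounds=4):
    """Illustrative edit-verify-retry loop for an image editing agent."""
    instruction = llm(
        f"Give a concise edit instruction that turns an image of '{src_prompt}' "
        f"into an image of '{tgt_prompt}'."
    )
    image = src_image
    for _ in range(max_rounds):
        image = edit_tool(src_image, instruction)
        # Verify that the intended change landed...
        hit = vqa_tool(image, f"Does this image match the description: {tgt_prompt}?")
        # ...and spot-check details the source and target prompts share, so that
        # unintended modifications (as in Steps 1 and 2) are caught.
        shared = llm(
            f"List yes/no questions about details shared by '{src_prompt}' "
            f"and '{tgt_prompt}'."
        ).splitlines()
        preserved = all(vqa_tool(image, q).strip().lower() == "yes" for q in shared)
        if hit.strip().lower() == "yes" and preserved:
            return image  # valid negative sample for the preference dataset
        # Otherwise refine the instruction based on what went wrong and retry.
        instruction = llm(
            f"The edit '{instruction}' did not produce the intended result. "
            f"Propose a better instruction for the same change."
        )
    return image  # best effort after max_rounds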


Quantitative comparison of AgentComp against other baselines on T2I-CompBench. AgentComp achieves state-of-the-art performance, demonstrating substantial improvements in compositional reasoning and understanding for T2I models.


Qualitative comparison across compositional categories. AgentComp produces more compositionally accurate images than the base model across various categories.


General quality and text rendering comparison. AgentComp preserves and even improves image generation quality while also significantly enhancing text rendering accuracy.

BibTeX

@article{zarei2025agentcomp,
  title={AgentComp: From Agentic Reasoning to Compositional Mastery in Text-to-Image Models},
  author={Zarei, Arman and Pan, Jiacheng and Gwilliam, Matthew and Feizi, Soheil and Yang, Zhenheng},
  journal={arXiv preprint arXiv:2512.09081},
  year={2025}
}