Localization across various DiT models and knowledge categories. For each model, heatmaps indicate the frequency of each block being selected as a dominant carrier of different target knowledge. Green-bordered images are standard generations, while red-bordered images result from withholding knowledge-specific information in the localized blocks. Our method successfully localizes diverse knowledge types, with variation in localization patterns across models.

Abstract

Understanding how knowledge is distributed across the layers of generative models is crucial for improving interpretability, controllability, and adaptation. While prior work has explored knowledge localization in UNet-based architectures, Diffusion Transformer (DiT)-based models remain underexplored in this context. In this paper, we propose a model- and knowledge-agnostic method to localize where specific types of knowledge are encoded within the DiT blocks. We evaluate our method on state-of-the-art DiT-based models, including PixArt-$\alpha$, FLUX, and SANA, across six diverse knowledge categories. We show that the identified blocks are both interpretable and causally linked to the expression of knowledge in generated outputs. Building on these insights, we apply our localization framework to two key applications: model personalization and knowledge unlearning. In both settings, our localized fine-tuning approach enables efficient and targeted updates, reducing computational cost, improving task-specific performance, and better preserving general model behavior with minimal interference to unrelated or surrounding content. Overall, our findings offer new insights into the internal structure of DiTs and introduce a practical pathway for more interpretable, efficient, and controllable model editing.

Method

Given a target knowledge $\kappa$, we first construct a set of prompts $\{p_1^\kappa, p_2^\kappa, \dots, p_N^\kappa\}$ that contain the knowledge, either manually or with an LLM. Using the DiT model, we generate images and compute the attention contribution of the tokens $\{\mathbf{x}_{j_1}, \mathbf{x}_{j_2}, \dots, \mathbf{x}_{j_\tau}\}$ corresponding to $\kappa$ in each prompt $p_i^\kappa$ at each layer (step 1 in the figure). These values are averaged across seeds and prompts to obtain a per-layer score indicating how much each block contributes to injecting the knowledge into the image (step 2 in the figure). We then select the top-$K$ most dominant blocks, $\mathcal{B}_K^\kappa$, as the most informative. To verify the causal role of the localized blocks $\mathcal{B}_K^\kappa$, we generate images using the original prompts $\{p_1^\kappa, p_2^\kappa, \dots, p_N^\kappa\}$, but replace the text inputs to the blocks in $\mathcal{B}_K^\kappa$ with knowledge-agnostic prompts $\{p_1^{\kappa\text{-neutral}}, p_2^{\kappa\text{-neutral}}, \dots, p_N^{\kappa\text{-neutral}}\}$, which omit the knowledge (step 3 in the figure). In models like PixArt-$\alpha$, this is done by swapping the cross-attention input. For MMDiT-based models like FLUX, which process the prompt in a separate text branch, we instead perform two passes, one with $\{p_i^\kappa\}$ and one with $\{p_i^{\kappa\text{-neutral}}\}$, and overwrite the text-branch inputs to $\mathcal{B}_K^\kappa$ in the first pass with those from the second.
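The scoring and selection steps above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the function name `localize_blocks`, the array layout, and the synthetic attention weights are our assumptions, and the contribution measure is simplified to the attention mass received by the knowledge tokens.

```python
import numpy as np

def localize_blocks(attn, knowledge_token_ids, top_k):
    """Select the top-K blocks by attention contribution (toy sketch).

    attn: array of shape (num_prompts, num_seeds, num_blocks,
          num_image_tokens, num_text_tokens) holding per-block
          cross-attention weights from image tokens to text tokens.
    knowledge_token_ids: indices of the text tokens carrying kappa.
    Returns the indices of the top-K blocks and the per-block scores.
    """
    # Attention mass the knowledge tokens receive at each block,
    # averaged over image tokens.
    contrib = attn[..., knowledge_token_ids].sum(axis=-1).mean(axis=-1)
    # Average across prompts and seeds -> one score per block (step 2).
    scores = contrib.mean(axis=(0, 1))
    # Top-K most dominant blocks, i.e. B_K^kappa.
    return np.argsort(scores)[::-1][:top_k], scores

# Toy data: 2 prompts, 3 seeds, 6 blocks, 16 image tokens, 8 text tokens.
rng = np.random.default_rng(0)
attn = rng.random((2, 3, 6, 16, 8))
attn[:, :, 4] += 0.5  # synthetic signal: block 4 attends more strongly
blocks, scores = localize_blocks(attn, knowledge_token_ids=[2, 3], top_k=2)
```

In a real DiT the `attn` tensor would come from hooks on each block's (cross-)attention, and step 3 would then rerun generation while swapping the text inputs of the selected blocks for those of the neutral prompts.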


Results


Differences in how knowledge is localized across categories and models. LLaVA-based evaluations and generation samples as the number of intervened blocks $K$ increases, where $K$ denotes the top-$K$ most informative blocks identified by our localization method. Some knowledge types (e.g., copyright) are highly concentrated in a few blocks, while others (e.g., animals) are more distributed across the model. Examples include outputs from the base models and their intervened counterparts.


Qualitative examples of knowledge localization in FLUX across different values of $K$



Variation in how artistic styles are localized within the model. We report CSD scores for various artists in the PixArt-$\alpha$ model as the number of intervened blocks $K$ increases. The numbers indicate how many artist styles remain identifiable at each $K$. While styles like Patrick Caulfield's are localized in fewer blocks, others, like Van Gogh's, are distributed across more of the model.



Relationship between artistic style complexity and the number of blocks required for localization. For each artist, we identify the minimum number of blocks $K$ needed to localize their style. Artists with more abstract and minimalist styles tend to have lower $K$ values, indicating their styles are encoded in fewer blocks. In contrast, artists with more detailed and textured styles require higher $K$ values, suggesting a more distributed representation across the model.


Applications

Targeted fine-tuning via knowledge localization. Given a concept to personalize or remove, our method first identifies the most relevant blocks via knowledge localization and restricts fine-tuning to those blocks. This enables efficient adaptation (top) and targeted suppression (bottom) with minimal impact on surrounding content, while better preserving the model’s prior performance.
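The restriction step can be sketched as freezing every parameter outside the localized blocks before fine-tuning begins. The sketch below uses a toy stand-in model; the names `ToyDiT` and `restrict_finetuning` are ours, and a real setup would apply the same loop to the actual DiT's block list.

```python
import torch
import torch.nn as nn

class ToyDiT(nn.Module):
    """Toy stand-in for a DiT: a stack of transformer-like blocks."""
    def __init__(self, num_blocks=6, dim=8):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_blocks))

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x

def restrict_finetuning(model, localized_blocks):
    """Train only the blocks in B_K^kappa; freeze everything else."""
    for i, blk in enumerate(model.blocks):
        trainable = i in localized_blocks
        for p in blk.parameters():
            p.requires_grad_(trainable)

model = ToyDiT()
restrict_finetuning(model, localized_blocks={1, 4})
# Only parameters of blocks 1 and 4 would receive gradient updates.
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```

The optimizer is then built over `filter(lambda p: p.requires_grad, model.parameters())`, so both personalization (e.g., DreamBooth) and unlearning objectives touch only the localized blocks.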


Results


Improved prompt alignment and surrounding identity preservation via localized DreamBooth. Left: Localized fine-tuning better adheres to prompt specifications. Right: Surrounding class-level identities are better preserved, demonstrating reduced interference with other concepts.


Quantitative results showing comparable performance with localized unlearning while being more efficient. Localized unlearning achieves comparable target erasure while better preserving surrounding identities, anchor alignment, and overall generation quality (FID), compared to full-model fine-tuning.

BibTeX

@article{zarei2025localizing,
  title={Localizing Knowledge in Diffusion Transformers},
  author={Zarei, Arman and Basu, Samyadeep and Rezaei, Keivan and Lin, Zihao and Nag, Sayan and Feizi, Soheil},
  journal={arXiv preprint arXiv:2505.18832},
  year={2025}
}