SliderEdit produces continuous edit trajectories in state-of-the-art instruction-based image editing models. Our method provides fine-grained, disentangled control over the intensity of the edit attributes described in an instruction, allowing continuous transitions between editing strengths. Despite its effectiveness, SliderEdit is extremely lightweight and can be trained efficiently to turn an existing instruction-based image editing model into a continuously controllable editing framework.

Abstract

Instruction-based image editing models have recently achieved impressive performance, enabling complex edits to an input image from a multi-instruction prompt. However, these models apply each instruction in the prompt with a fixed strength, limiting the user's ability to precisely and continuously control the intensity of individual edits. We introduce SliderEdit, a framework for continuous image editing with fine-grained, interpretable instruction control. Given a multi-part edit instruction, SliderEdit disentangles the individual instructions and exposes each as a globally trained slider, allowing smooth adjustment of its strength. Unlike prior works that introduced slider-based attribute controls in text-to-image generation, typically requiring separate training or fine-tuning for each attribute or concept, our method learns a single set of low-rank adaptation matrices that generalize across diverse edits, attributes, and compositional instructions. This enables continuous interpolation along individual edit dimensions while preserving both spatial locality and global semantic consistency. We apply SliderEdit to state-of-the-art image editing models, including FLUX-Kontext and Qwen-Image-Edit, and observe substantial improvements in edit controllability, visual consistency, and user steerability. To the best of our knowledge, we are the first to explore and propose a framework for continuous, fine-grained instruction control in instruction-based image editing models. Our results pave the way for interactive, instruction-driven image manipulation with continuous and compositional control.

Method

Given an input image $X_\text{orig}$ and a prompt $\mathcal{P} = \{ \mathcal{P}_1, ..., \mathcal{P}_K \}$ containing $K$ edit instructions, a base image editing model produces an edited output $X_\text{edited}^{\mathcal{P}_1, ..., \mathcal{P}_K}$ in which all edits are applied simultaneously. Our objective is to learn a flexible adapter $M_\theta(\mathcal{P}_i)$ capable of suppressing or modulating a specific instruction $\mathcal{P}_i$ within $\mathcal{P}$. When this adapter is activated, the model should generate $X_\text{edited}^{\mathcal{P}_1, ..., \mathcal{P}_{i-1}, \mathcal{P}_{i+1}, ..., \mathcal{P}_K}$, effectively removing the influence of $\mathcal{P}_i$ while keeping the other edits intact.


Partial Prompt Suppression Loss. To train $M_\theta$, we propose the Partial Prompt Suppression (PPS) objective. Using the frozen base model $\epsilon(Z, X, \mathcal{P})$, where $Z$ denotes the noisy latents, $X$ the original image latents, and $\mathcal{P}$ the text prompt, we first perform a forward pass with the prompt excluding the $i$-th instruction $\mathcal{P}_i$. We then require that the adapted model $\epsilon_{M_\theta(\mathcal{P}_i)}$, when given the full prompt, produces an equivalent denoising direction:

\[ \mathcal{L}_\text{PPS} = \left\| \epsilon_{M_\theta(\mathcal{P}_i)}(Z, X_\text{orig}, \mathcal{P}) - \epsilon\left(Z, X_\text{orig}, \mathcal{P} \setminus \{\mathcal{P}_i\}\right) \right\| \]

Intuitively, this objective teaches the adapter to neutralize the representation of the tokens corresponding to $\mathcal{P}_i$ throughout the model so that their visual effect disappears.
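
As a concrete illustration, the sketch below shows one way the PPS objective could be computed per training step in PyTorch. The callables `adapted_model` and `base_model`, their argument layout, and the use of a mean-squared error for the norm are assumptions made for illustration, not the paper's exact implementation.

import torch
import torch.nn.functional as F

def pps_loss(adapted_model, base_model, z, x_orig, prompt_full, prompt_minus_i):
    """Partial Prompt Suppression loss (illustrative sketch).

    adapted_model  : base editing model with the trainable adapter M_theta attached
    base_model     : frozen base editing model (no adapter)
    z              : noisy latents at a sampled timestep
    x_orig         : latents of the original (unedited) image
    prompt_full    : embeddings of the full prompt P = {P_1, ..., P_K}
    prompt_minus_i : embeddings of the prompt with instruction P_i removed
    """
    with torch.no_grad():
        # Denoising direction of the frozen model when P_i is absent (the target).
        target = base_model(z, x_orig, prompt_minus_i)

    # The adapted model sees the full prompt but must reproduce the target
    # direction, i.e. neutralize the visual effect of the tokens of P_i.
    pred = adapted_model(z, x_orig, prompt_full)
    return F.mse_loss(pred, target)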

STLoRA and GSTLoRA. We instantiate $M_\theta$ as a Selective Token LoRA (STLoRA), a lightweight, token-aware adapter. STLoRA learns low-rank updates for selected linear projections in the model but applies them only to the embeddings of target tokens corresponding to the suppressed instruction $\mathcal{P}_i$. While STLoRA effectively handles both single- and multi-instruction prompts by selectively modulating the tokens corresponding to each instruction $\mathcal{P}_i$, we introduce Globally Selective Token LoRA (GSTLoRA) for the single-instruction setting. In this variant, all token embeddings (both text and image) are included in the adaptation, allowing LoRA updates to be applied globally across the representation space. This design provides stronger control and often yields higher-fidelity edits when manipulating a single instruction, as the update can leverage global context rather than being limited to a subset of intermediate text token embeddings. A sketch of the token-selective mechanism is given below.
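
To make the token-selective mechanism concrete, the following minimal sketch wraps a frozen linear layer with a LoRA update that is added only at masked token positions (STLoRA); passing an all-ones mask corresponds to applying the update to every token, as in GSTLoRA. The class name, masking interface, and hyperparameters are illustrative assumptions rather than the released implementation.

import torch
import torch.nn as nn

class SelectiveTokenLoRALinear(nn.Module):
    """Token-selective LoRA wrapper around a frozen linear projection (sketch)."""

    def __init__(self, base_linear: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)              # base weights stay frozen
        self.lora_down = nn.Linear(base_linear.in_features, rank, bias=False)
        self.lora_up = nn.Linear(rank, base_linear.out_features, bias=False)
        nn.init.zeros_(self.lora_up.weight)      # adapter starts as a no-op
        self.alpha = alpha                       # slider scale used at inference

    def forward(self, x: torch.Tensor, token_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_in); token_mask: (batch, seq_len), 1.0 at the
        # token positions belonging to the suppressed instruction P_i.
        out = self.base(x)
        delta = self.lora_up(self.lora_down(x))  # low-rank update for every token
        # STLoRA: add the update only where token_mask is 1.
        # GSTLoRA: pass an all-ones mask so the update covers all tokens.
        return out + self.alpha * token_mask.unsqueeze(-1) * delta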

Continuous Control via Scaling STLoRA. Once trained, the LoRA adapter naturally supports continuous control through its scaling parameter. We denote by $M_\theta^\alpha$ the adapter with scaled updates $\alpha \Delta W_\ell$ for each layer $\ell$. By varying $\alpha$ within a predefined range $[\alpha_\text{min}, \alpha_\text{max}]$, we obtain a smooth continuum of effects, ranging from complete suppression ($\alpha = 1$) to full application ($\alpha = 0$), and even exaggerated edits for $\alpha < 0$.
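
In weight space, this amounts to rescaling the merged low-rank update before it is added to each frozen layer weight. Below is a minimal, self-contained sketch assuming a standard LoRA parameterization; the function and tensor names are hypothetical.

import torch

def apply_scaled_lora(weight: torch.Tensor, lora_down: torch.Tensor,
                      lora_up: torch.Tensor, alpha: float) -> torch.Tensor:
    """Return W + alpha * (up @ down): the slider value alpha rescales the
    learned low-rank update Delta W before merging it into a frozen weight.
    Shapes: weight (d_out, d_in), lora_down (r, d_in), lora_up (d_out, r)."""
    return weight + alpha * (lora_up @ lora_down)

# alpha = 1.0 -> instruction fully suppressed; alpha = 0.0 -> adapter off,
# edit applied at the base model's default strength; alpha < 0 -> exaggerated edit.
W = torch.randn(64, 64)
down, up = torch.randn(4, 64), torch.randn(64, 4)
for alpha in (-0.5, 0.0, 0.5, 1.0):
    W_alpha = apply_scaled_lora(W, down, up, alpha)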


Results


Qualitative Samples of GSTLoRA. GSTLoRA demonstrates smooth, continuous control over the strength of both local and global edits.



Interactive 2D Control for Multi-Instruction Edits

Qualitative Samples of STLoRA on Multi-Instruction Edits. SliderEdit produces a disentangled and smooth interpolation space in multi-instruction editing scenarios, offering fine control over individual instruction directions.



Controllable zero-shot multi-subject personalization with STLoRA. STLoRA enables smooth adjustment of each instruction’s strength to generate coherent, evolving image sequences, supporting story-like visual editing. (Best viewed from top-left to top-right, then bottom-right to bottom-left)

BibTeX

@article{zarei2025slideredit,
  title={SliderEdit: Continuous Image Editing with Fine-Grained Instruction Control},
  author={Zarei, Arman and Basu, Samyadeep and Pournemat, Mobina and Nag, Sayan and Rossi, Ryan and Feizi, Soheil},
  journal={arXiv preprint arXiv:2511.09715},
  year={2025}
}