Diffusion models have shown exceptional performance in generative tasks, making them a natural fit for image editing. Recent research highlights their ability to apply edits from textual instructions, but two key challenges persist. First, these models struggle to perform multiple edits simultaneously, forcing sequential processing that is computationally inefficient. Second, relying on textual prompts to specify editing regions can cause unintended changes elsewhere in the image.
We introduce FunEditor, an efficient diffusion model that learns basic editing functions and composes them to perform complex edits. This design enables tasks such as object movement by merging multiple functions and applying them simultaneously to designated areas. Our experiments show that FunEditor significantly outperforms recent inference-time optimization techniques and fine-tuned models, both quantitatively across various metrics and in visual comparisons, on complex tasks such as object movement and object pasting. Moreover, with only four inference steps, FunEditor achieves a 5-24x speedup over existing popular methods. Code and datasets are available online.
[Interactive demos on the project page: object movement and object resizing.]
Atomic Task Learning: FunEditor learns individual editing functions like object removal, edge enhancement, and harmonization by representing tasks with trainable tokens embedded in the text tokenizer, leveraging simple task-specific datasets to minimize data collection costs.
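As a rough illustration of this setup, the sketch below registers task tokens in a Stable-Diffusion-style CLIP text encoder via Hugging Face transformers, textual-inversion style. The token names ("<remove>", "<enhance-edges>", "<harmonize>") are placeholders, not the paper's exact vocabulary.

```python
# Minimal sketch: registering trainable task tokens in a CLIP text encoder,
# textual-inversion style. Token names are illustrative placeholders.
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

task_tokens = ["<remove>", "<enhance-edges>", "<harmonize>"]
tokenizer.add_tokens(task_tokens)                     # extend the vocabulary
text_encoder.resize_token_embeddings(len(tokenizer))  # new embedding rows

# Freeze the encoder; only the embedding table stays trainable (in practice,
# gradients would additionally be restricted to the newly added rows).
for p in text_encoder.parameters():
    p.requires_grad_(False)
text_encoder.get_input_embeddings().weight.requires_grad_(True)
```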
Training Procedure: The training process involves a source image, target image, task tokens, and binary masks. Task tokens are activated randomly, and their positions are shuffled to enhance model robustness. Localized cross-attention masking ensures task-specific edits are confined to designated regions.
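The masking logic can be made concrete with a toy sketch, assuming one binary region mask per task token at latent resolution. The helper names and shapes below are illustrative, not the released implementation.

```python
# Toy sketch of the training-time token handling and localized
# cross-attention masking. All shapes and names are illustrative.
import random
import torch

def sample_active_tokens(task_tokens, p=0.5):
    """Randomly activate task tokens and shuffle their order, as in training."""
    active = [t for t in task_tokens if random.random() < p]
    random.shuffle(active)
    return active

def localized_cross_attn_mask(token_regions, num_text_tokens):
    """token_regions: {text-token index -> (H, W) binary region mask at latent
    resolution}. Returns an additive mask of shape [H*W, num_text_tokens]:
    0 where attention is allowed, -inf where an image position lies outside
    a task token's designated region."""
    first = next(iter(token_regions.values()))
    mask = torch.zeros(first.numel(), num_text_tokens)
    for tok_idx, region in token_regions.items():
        mask[region.flatten() == 0, tok_idx] = float("-inf")
    return mask

# Example: a 16x16 latent; the token at prompt position 5 may only be
# attended to from the top half of the image.
region = torch.zeros(16, 16)
region[:8, :] = 1
attn_mask = localized_cross_attn_mask({5: region}, num_text_tokens=77)
# attn_mask is added to the cross-attention logits (queries = latent pixels,
# keys = text tokens) before the softmax in each attention layer.
```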
Task Token Aggregation: Multiple task tokens are combined at inference time to perform complex edits. For instance, combining object removal and edge enhancement achieves object movement (a combined sketch follows the next item).
Simultaneous Multi-Function Execution: All aggregated functions are executed in a single inference pass, yielding efficient, high-quality edits with far fewer processing steps than running each edit separately.
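Building on the sketches above (and reusing the illustrative tokenizer and mask helper), object movement could be expressed as removal at the source region plus edge enhancement and harmonization at the target region, all resolved within the same denoising passes. Here "source_region" and "target_region" are assumed binary masks at latent resolution.

```python
# Sketch: aggregating task tokens for object movement in a single pass.
# Reuses the illustrative tokenizer and localized_cross_attn_mask helper
# from the sketches above; source_region / target_region are assumed masks.
prompt = "<remove> <enhance-edges> <harmonize>"
ids = tokenizer(prompt, return_tensors="pt").input_ids[0]

def position_of(ids, token):
    """Find a task token's position in the tokenized prompt."""
    tok_id = tokenizer.convert_tokens_to_ids(token)
    return (ids == tok_id).nonzero(as_tuple=True)[0].item()

token_regions = {
    position_of(ids, "<remove>"): source_region,         # erase object here
    position_of(ids, "<enhance-edges>"): target_region,  # blend edges here
    position_of(ids, "<harmonize>"): target_region,      # match lighting here
}
attn_mask = localized_cross_attn_mask(token_regions, num_text_tokens=ids.numel())
# One UNet forward per denoising step now applies every edit at once,
# instead of running a separate full editing pipeline per task.
```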
Few-Step Inference & Localization Control: FunEditor applies edits in just four steps, delivering up to 24× speed improvements over conventional methods. Masks ensure precise, localized edits without unintended changes.
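For intuition on the cost side only: a four-step denoising loop in a diffusers-style setup looks like the sketch below. This is a generic DDIM loop in which "unet", "latents", and "text_embeds" are assumed to already exist; it is not FunEditor's released sampler.

```python
# Generic four-step denoising loop (diffusers-style); `unet`, `latents`,
# and `text_embeds` are assumed to exist. Not FunEditor's exact sampler.
from diffusers import DDIMScheduler

scheduler = DDIMScheduler.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="scheduler"
)
scheduler.set_timesteps(4)  # 4 steps instead of the usual ~50: the speedup

for t in scheduler.timesteps:
    noise_pred = unet(latents, t, encoder_hidden_states=text_embeds).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample
```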
@article{samadi2024achieving,
  title={Achieving Complex Image Edits via Function Aggregation with Diffusion Models},
  author={Samadi, Mohammadreza and Han, Fred X and Salameh, Mohammad and Wu, Hao and Sun, Fengyu and Zhou, Chunhua and Niu, Di},
  journal={arXiv preprint arXiv:2408.08495},
  year={2024}
}