🎨 FunEditor: Achieving Complex Image Edits via Function Aggregation with Diffusion Models

¹Huawei Technologies Canada, ²University of Alberta, ³Huawei Kirin Solution

AAAI Conference on Artificial Intelligence (AAAI-25)

[Figure: FunEditor overview]
Our method efficiently performs complex image editing tasks such as object movement, resizing, and pasting in just 4 steps by composing and applying multiple functions simultaneously. Key functions include object removal (\(f_{OR}\)), edge enhancement (\(f_{EE}\)), and harmonization (\(f_{HR}\)), each applied exclusively to specified mask regions for precise edits. The source image \(I\) is omitted from function arguments for simplicity in the demo visualization.
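To make the composition concrete, the masked aggregation can be sketched as follows. This is a minimal illustration, assuming NumPy arrays and user-supplied editing functions; unlike this sequential loop, FunEditor realizes the functions as task tokens and executes them together in a single diffusion pass.

```python
import numpy as np

def compose_edits(image, edits):
    """Apply each editing function only inside its binary mask region.

    image: H x W x 3 float array in [0, 1]
    edits: list of (edit_fn, mask) pairs, where mask is H x W x 1 in {0, 1}
    """
    out = image.copy()
    for edit_fn, mask in edits:
        edited = edit_fn(out)                    # hypothetical per-task editor
        out = mask * edited + (1 - mask) * out   # confine the edit to its mask
    return out

# Usage with the functions from the figure (all names illustrative):
# result = compose_edits(I, [(f_OR, source_mask),   # object removal
#                            (f_EE, target_mask),   # edge enhancement
#                            (f_HR, target_mask)])  # harmonization
```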

Abstract

Diffusion models have shown exceptional performance in generative tasks, making them suitable for image editing. Recent research highlights their ability to apply edits based on textual instructions, but two key challenges persist. First, these models struggle to perform multiple edits simultaneously, causing computational inefficiencies due to sequential processing. Second, using textual prompts to specify editing regions can result in unintended changes to the image.

We introduce FunEditor, an efficient diffusion model designed to learn basic editing functions and combine them to perform complex edits. This approach enables tasks such as object movement by aggregating multiple functions and applying them simultaneously to designated areas. Our experiments show that FunEditor significantly outperforms recent inference-time optimization techniques and fine-tuned models, both quantitatively across various metrics and through visual comparisons, on complex tasks such as object movement and object pasting. Moreover, with only four inference steps, FunEditor achieves a 5-24× speedup over existing popular methods. Code and datasets are available online.

[Interactive demo: object movement (hover to move)]

[Interactive demo: object resizing (hover to enlarge, click to shrink)]

Method Overview

[Figure: method overview]

(a) Basic Task Training

Atomic Task Learning: FunEditor learns individual editing functions like object removal, edge enhancement, and harmonization by representing tasks with trainable tokens embedded in the text tokenizer, leveraging simple task-specific datasets to minimize data collection costs.
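As a rough sketch of how such trainable tokens could be registered (in the style of textual inversion with the Hugging Face transformers API), consider the following; the token strings, base checkpoint, and gradient-masking detail are assumptions, not the paper's exact setup.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# Placeholder token names and base checkpoint; the paper's choices may differ.
TASK_TOKENS = ["<obj-removal>", "<edge-enhance>", "<harmonize>"]
MODEL_ID = "runwayml/stable-diffusion-v1-5"

tokenizer = CLIPTokenizer.from_pretrained(MODEL_ID, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(MODEL_ID, subfolder="text_encoder")

# Register one new trainable embedding per editing function.
tokenizer.add_tokens(TASK_TOKENS)
text_encoder.resize_token_embeddings(len(tokenizer))

# Freeze the encoder; only the new embedding rows should receive gradients.
for p in text_encoder.parameters():
    p.requires_grad_(False)
embeddings = text_encoder.get_input_embeddings()
embeddings.weight.requires_grad_(True)

# After loss.backward(), zero out gradients for all pre-existing rows so
# that only the task-token embeddings are updated:
new_ids = tokenizer.convert_tokens_to_ids(TASK_TOKENS)
grad_mask = torch.zeros(len(tokenizer), 1)
grad_mask[new_ids] = 1.0
# embeddings.weight.grad.mul_(grad_mask)  # apply inside the training loop
```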

Training Procedure: The training process involves a source image, target image, task tokens, and binary masks. Task tokens are activated randomly, and their positions are shuffled to enhance model robustness. Localized cross-attention masking ensures task-specific edits are confined to designated regions.
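Localized cross-attention masking can be sketched as a bias on the attention logits so that each task token attends only to latent positions inside its designated region; the tensor shapes and mask layout below are assumptions for illustration.

```python
import torch

def masked_cross_attention(q, k, v, token_region_mask):
    """Cross-attention in which each text token acts only inside its region.

    q:    (B, N, d) latent queries, N = number of spatial positions
    k, v: (B, T, d) text-token keys / values, T = number of tokens
    token_region_mask: (B, N, T) binary; 1 where position n may attend to
    token t (keep non-task tokens, e.g. padding, fully unmasked)
    """
    scale = q.shape[-1] ** -0.5
    logits = torch.einsum("bnd,btd->bnt", q, k) * scale
    # A large negative bias suppresses disallowed (position, token) pairs, so
    # a task token cannot influence pixels outside its mask region.
    logits = logits.masked_fill(token_region_mask == 0, -1e9)
    attn = logits.softmax(dim=-1)
    return torch.einsum("bnt,btd->bnd", attn, v)
```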

(b) Complex Task Inference

Task Token Aggregation: Multiple task tokens are combined during inference for complex edits. For instance, combining object removal and edge enhancement achieves object movement.

Simultaneous Multi-Function Execution: Functions are executed simultaneously in a single inference pass, enabling efficient, high-quality edits while minimizing processing steps for faster results.

Few-Step Inference & Localization Control: FunEditor applies edits in just four steps, delivering up to 24× speed improvements over conventional methods. Masks ensure precise, localized edits without unintended changes.
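Putting these pieces together, a hypothetical few-step inference loop could look like the sketch below. The `cross_attention_kwargs` routing (which would require a custom attention processor) and the direct use of encoded source latents are simplifying assumptions, not FunEditor's published interface.

```python
import torch

@torch.no_grad()
def funeditor_infer(unet, scheduler, encode_prompt, source_latents,
                    token_masks, steps=4):
    """Hypothetical: one 4-step pass applies all aggregated task tokens at once.

    encode_prompt: callable mapping the aggregated token string to embeddings
    token_masks:   dict {task_token: binary spatial mask}, consumed by custom
                   cross-attention processors (see the masking sketch above)
    """
    prompt = " ".join(token_masks)  # e.g. "<obj-removal> <edge-enhance> <harmonize>"
    text_emb = encode_prompt(prompt)
    scheduler.set_timesteps(steps)
    latents = source_latents  # simplification: noise injection into the source is omitted
    for t in scheduler.timesteps:
        noise_pred = unet(latents, t, encoder_hidden_states=text_emb,
                          cross_attention_kwargs={"token_masks": token_masks}).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```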

Comparison To Current Methods

Complex Task 1: Object Movement

[Figure: qualitative comparison on the object movement task]

Complex Task 2: Object Placement

[Figure: qualitative comparison on the object placement task]
Qualitative comparison between our approach and baseline methods on object placement, repositioning, and pasting tasks. Our method achieves superior results by composing functions such as object removal, harmonization, and edge enhancement: FunEditor seamlessly pastes objects from a reference image into a target image while ensuring proper integration and a realistic appearance.

Quantitative Results

COCOEE Dataset

[Table: object movement results on the COCOEE dataset]

ReS Dataset

[Table: object movement results on the ReS dataset]
Quantitative evaluation of our approach compared to the baselines on the object movement task using the COCOEE and ReS datasets.

Efficiency Comparison

[Figure: efficiency comparison]
Efficiency is compared in terms of the number of function evaluations (NFEs) and latency in seconds. Latency is measured as the average wall-clock time to edit one image over 10 runs on a single NVIDIA V100 (32 GB) GPU.
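For reference, a latency figure of this kind can be reproduced with a simple wall-clock harness; `edit_fn` below is a stand-in for any of the compared editing pipelines.

```python
import time
import torch

def mean_latency(edit_fn, inputs, runs=10):
    """Average wall-clock seconds per edit over `runs` repetitions."""
    edit_fn(inputs)              # warm-up run: CUDA init, kernel compilation
    torch.cuda.synchronize()     # ensure pending GPU work is done before timing
    start = time.perf_counter()
    for _ in range(runs):
        edit_fn(inputs)
    torch.cuda.synchronize()     # wait for all queued kernels to finish
    return (time.perf_counter() - start) / runs
```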

BibTeX

@article{samadi2024achieving,
  title={Achieving Complex Image Edits via Function Aggregation with Diffusion Models},
  author={Samadi, Mohammadreza and Han, Fred X and Salameh, Mohammad and Wu, Hao and Sun, Fengyu and Zhou, Chunhua and Niu, Di},
  journal={arXiv preprint arXiv:2408.08495},
  year={2024}
}