🎨 FunEditor: Achieving Complex Image Edits via Function Aggregation with Diffusion Models

¹Huawei Technologies Canada, ²University of Alberta, ³Huawei Kirin Solution

AAAI Conference on Artificial Intelligence (AAAI-25)

[Figure: FunEditor overview]
Our method efficiently performs complex image editing tasks such as object movement, resizing, and pasting in just 4 steps by composing and applying multiple functions simultaneously. Key functions include object removal (\(f_{OR}\)), edge enhancement (\(f_{EE}\)), and harmonization (\(f_{HR}\)), each applied exclusively to specified mask regions for precise edits. The source image \(I\) is omitted from function arguments for simplicity in the demo visualization.
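To make the composition concrete, the masked aggregation can be sketched as follows. This is a minimal illustration, assuming NumPy arrays and user-supplied editing functions; unlike this sequential loop, FunEditor realizes the functions as task tokens and executes them together in a single diffusion pass.

```python
import numpy as np

def compose_edits(image, edits):
    """Apply each editing function only inside its binary mask region.

    image: H x W x 3 float array in [0, 1]
    edits: list of (edit_fn, mask) pairs, where mask is H x W x 1 in {0, 1}
    """
    out = image.copy()
    for edit_fn, mask in edits:
        edited = edit_fn(out)                    # hypothetical per-task editor
        out = mask * edited + (1 - mask) * out   # confine the edit to its mask
    return out

# Usage with the functions from the figure (all names illustrative):
# result = compose_edits(I, [(f_OR, source_mask),   # object removal
#                            (f_EE, target_mask),   # edge enhancement
#                            (f_HR, target_mask)])  # harmonization
```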

Abstract

Diffusion models have shown exceptional performance in generative tasks, making them suitable for image editing. Recent research highlights their ability to apply edits based on textual instructions, but two key challenges persist. First, these models struggle to perform multiple edits simultaneously, causing computational inefficiencies due to sequential processing. Second, using textual prompts to specify editing regions can result in unintended changes to the image.

We introduce FunEditor, an efficient diffusion model designed to learn basic editing functions and combine them to perform complex edits. This approach enables tasks such as object movement by aggregating multiple functions and applying them simultaneously to designated areas. Our experiments show that FunEditor significantly outperforms recent inference-time optimization techniques and fine-tuned models, both quantitatively across various metrics and through visual comparisons, on complex tasks such as object movement and object pasting. Moreover, with only four inference steps, FunEditor achieves a 5-24× speedup over existing popular methods. Code and datasets are available online.

[Interactive demo: object movement (hover to move)]

[Interactive demo: object resizing (hover to enlarge, click to shrink)]

Method Overview

[Figure: method overview]

(a) Basic Task Training

Atomic Task Learning: FunEditor learns individual editing functions like object removal, edge enhancement, and harmonization by representing tasks with trainable tokens embedded in the text tokenizer, leveraging simple task-specific datasets to minimize data collection costs.
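As a rough sketch of how such trainable tokens could be registered (in the style of textual inversion with the Hugging Face transformers API), consider the following; the token strings, base checkpoint, and gradient-masking detail are assumptions, not the paper's exact setup.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# Placeholder token names and base checkpoint; the paper's choices may differ.
TASK_TOKENS = ["<obj-removal>", "<edge-enhance>", "<harmonize>"]
MODEL_ID = "runwayml/stable-diffusion-v1-5"

tokenizer = CLIPTokenizer.from_pretrained(MODEL_ID, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(MODEL_ID, subfolder="text_encoder")

# Register one new trainable embedding per editing function.
tokenizer.add_tokens(TASK_TOKENS)
text_encoder.resize_token_embeddings(len(tokenizer))

# Freeze the encoder; only the new embedding rows should receive gradients.
for p in text_encoder.parameters():
    p.requires_grad_(False)
embeddings = text_encoder.get_input_embeddings()
embeddings.weight.requires_grad_(True)

# After loss.backward(), zero out gradients for all pre-existing rows so
# that only the task-token embeddings are updated:
new_ids = tokenizer.convert_tokens_to_ids(TASK_TOKENS)
grad_mask = torch.zeros(len(tokenizer), 1)
grad_mask[new_ids] = 1.0
# embeddings.weight.grad.mul_(grad_mask)  # apply inside the training loop
```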

Training Procedure: The training process involves a source image, target image, task tokens, and binary masks. Task tokens are activated randomly, and their positions are shuffled to enhance model robustness. Localized cross-attention masking ensures task-specific edits are confined to designated regions.
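Localized cross-attention masking can be sketched as a bias on the attention logits so that each task token attends only to latent positions inside its designated region; the tensor shapes and mask layout below are assumptions for illustration.

```python
import torch

def masked_cross_attention(q, k, v, token_region_mask):
    """Cross-attention in which each text token acts only inside its region.

    q:    (B, N, d) latent queries, N = number of spatial positions
    k, v: (B, T, d) text-token keys / values, T = number of tokens
    token_region_mask: (B, N, T) binary; 1 where position n may attend to
    token t (keep non-task tokens, e.g. padding, fully unmasked)
    """
    scale = q.shape[-1] ** -0.5
    logits = torch.einsum("bnd,btd->bnt", q, k) * scale
    # A large negative bias suppresses disallowed (position, token) pairs, so
    # a task token cannot influence pixels outside its mask region.
    logits = logits.masked_fill(token_region_mask == 0, -1e9)
    attn = logits.softmax(dim=-1)
    return torch.einsum("bnt,btd->bnd", attn, v)
```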

(b) Complex Task Inference

Task Token Aggregation: Multiple task tokens are combined during inference for complex edits. For instance, combining object removal and edge enhancement achieves object movement.

Simultaneous Multi-Function Execution: Functions are executed simultaneously in a single inference pass, enabling efficient, high-quality edits while minimizing processing steps for faster results.

Few-Step Inference & Localization Control: FunEditor applies edits in just four steps, delivering up to 24× speed improvements over conventional methods. Masks ensure precise, localized edits without unintended changes.
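Putting these pieces together, a hypothetical few-step inference loop could look like the sketch below. The `cross_attention_kwargs` routing (which would require a custom attention processor) and the direct use of encoded source latents are simplifying assumptions, not FunEditor's published interface.

```python
import torch

@torch.no_grad()
def funeditor_infer(unet, scheduler, encode_prompt, source_latents,
                    token_masks, steps=4):
    """Hypothetical: one 4-step pass applies all aggregated task tokens at once.

    encode_prompt: callable mapping the aggregated token string to embeddings
    token_masks:   dict {task_token: binary spatial mask}, consumed by custom
                   cross-attention processors (see the masking sketch above)
    """
    prompt = " ".join(token_masks)  # e.g. "<obj-removal> <edge-enhance> <harmonize>"
    text_emb = encode_prompt(prompt)
    scheduler.set_timesteps(steps)
    latents = source_latents  # simplification: noise injection into the source is omitted
    for t in scheduler.timesteps:
        noise_pred = unet(latents, t, encoder_hidden_states=text_emb,
                          cross_attention_kwargs={"token_masks": token_masks}).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```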

Comparison To Current Methods

Complex Task 1: Object Movement

[Figure: qualitative comparison on the object movement task]

Complex Task 2: Object Placement

[Figure: qualitative comparison on the object placement task]
Qualitative comparison between our approach and baseline methods on object placement, repositioning, and pasting tasks. Our method achieves superior results by composing functions such as object removal, harmonization, and edge enhancement: FunEditor seamlessly pastes objects from a reference image into a target image while ensuring proper integration and a realistic appearance.

Quantitative Results

COCOEE Dataset

[Table: object movement results on the COCOEE dataset]

ReS Dataset

[Table: object movement results on the ReS dataset]
Quantitative evaluation of our approach compared to the baselines on the object movement task using the COCOEE and ReS datasets.

Efficiency Comparison

[Figure: efficiency comparison]
Efficiency is compared in terms of the number of function evaluations (NFEs) and latency in seconds. Latency is measured as the average wall-clock time to edit one image over 10 runs on a single NVIDIA V100 (32 GB) GPU.
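For reference, a latency figure of this kind can be reproduced with a simple wall-clock harness; `edit_fn` below is a stand-in for any of the compared editing pipelines.

```python
import time
import torch

def mean_latency(edit_fn, inputs, runs=10):
    """Average wall-clock seconds per edit over `runs` repetitions."""
    edit_fn(inputs)              # warm-up run: CUDA init, kernel compilation
    torch.cuda.synchronize()     # ensure pending GPU work is done before timing
    start = time.perf_counter()
    for _ in range(runs):
        edit_fn(inputs)
    torch.cuda.synchronize()     # wait for all queued kernels to finish
    return (time.perf_counter() - start) / runs
```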

BibTeX

@article{samadi2024achieving,
  title={Achieving Complex Image Edits via Function Aggregation with Diffusion Models},
  author={Samadi, Mohammadreza and Han, Fred X and Salameh, Mohammad and Wu, Hao and Sun, Fengyu and Zhou, Chunhua and Niu, Di},
  journal={arXiv preprint arXiv:2408.08495},
  year={2024}
}