Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model

1Xiamen University 2OpenGVLab, Shanghai AI Laboratory
3The University of Hong Kong


Abstract

This paper addresses the important problem of adding objects to images with only text guidance. It is challenging because the new object must be integrated seamlessly into the image with a consistent visual context, including lighting, texture, and spatial location. While existing text-guided image inpainting methods can add objects, they either fail to preserve background consistency or require cumbersome human intervention to specify bounding boxes or user-scribbled masks. To tackle this challenge, we introduce Diffree, a Text-to-Image (T2I) model that performs text-guided object addition with only text control. To this end, we curate OABench, a high-quality synthetic dataset built by removing objects with advanced image inpainting techniques. OABench comprises 74K real-world tuples of an original image, an inpainted image with the object removed, an object mask, and an object description. Trained on OABench using the Stable Diffusion model with an additional mask prediction module, Diffree uniquely predicts the position of the new object and achieves object addition with guidance from text alone. Extensive experiments demonstrate that Diffree excels at adding new objects with a high success rate while maintaining background consistency, spatial appropriateness, and object relevance and quality.

Overview

Diffree is trained to predict masks and images containing the new object given the original image and object text description. Thanks to the extensive coverage of objects in natural scenes in OABench, Diffree can add various objects to the same image while matching the visual context well. Moreover, Diffree can iteratively insert objects into a single image while preserving the background consistency using the generated mask.
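The iterative insertion described above can be sketched as a simple loop: each step blends the model's generated image into the current image using the predicted mask, so the background outside the mask is untouched and earlier additions survive later steps. This is a minimal NumPy sketch; `diffree_step` is a hypothetical stand-in for the actual model call, and the exact blending Diffree uses may differ.

```python
import numpy as np

def composite_with_mask(original, generated, mask):
    """Blend the generated image into the original using the predicted
    object mask; pixels outside the mask keep their original values."""
    m = mask[..., None].astype(np.float32)  # H x W -> H x W x 1 for broadcasting
    return (m * generated + (1.0 - m) * original).astype(original.dtype)

def add_objects_iteratively(image, prompts, diffree_step):
    """Insert objects one at a time: each step's composited output
    becomes the next step's input, preserving background consistency."""
    for prompt in prompts:
        # diffree_step is assumed to return (generated_image, predicted_mask)
        generated, mask = diffree_step(image, prompt)
        image = composite_with_mask(image, generated, mask)
    return image
```

Because the mask confines each edit to the new object's region, repeated calls compose cleanly on the same image.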

OABench

Toward high-quality text-guided object addition, we curate a synthetic dataset named the Object Addition Benchmark (OABench), which consists of 74K real-world tuples, each containing an original image, an inpainted image, a mask of the object, and an object description. The data curation process is illustrated in the figure below. Note that object addition can be viewed as the inverse of object removal. We therefore build OABench by removing objects from images with advanced image inpainting algorithms: each removal yields an original image containing the object, an inpainted image with the object removed, the object mask, and the object description.
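The curation step for a single sample can be sketched as follows. Here `inpaint_fn` is a hypothetical stand-in for the advanced inpainting model the paper uses for removal; the key idea is only that the original image becomes the training target ("after addition") and the inpainted image the training input ("before addition").

```python
import numpy as np

def build_oabench_tuple(image, object_mask, description, inpaint_fn):
    """Construct one OABench tuple by removing an existing object.
    `inpaint_fn(image, mask)` is assumed to return the image with the
    masked object removed and the background plausibly filled in."""
    inpainted = inpaint_fn(image, object_mask)
    return {
        "original": image,        # contains the object (training target)
        "inpainted": inpainted,   # object removed (training input)
        "mask": object_mask,      # where the object was located
        "caption": description,   # text description of the object
    }
```

Training then proceeds in the inverse direction: given the inpainted image and the caption, the model learns to predict both the original image and the mask.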

Visualization

Application

Applications of Diffree combined with other models. (a) Combined with AnyDoor to add a specific object. (b) Using GPT-4V to plan what should be added. (c) Using the KLING video model to animate the edited image.

Cite Us

@article{zhao2024diffree,
  title={Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model},
  author={Zhao, Lirui and Yang, Tianshuo and Shao, Wenqi and Zhang, Yuxin and Qiao, Yu and Luo, Ping and Zhang, Kaipeng and Ji, Rongrong},
  journal={arXiv preprint arXiv:2407.16982},
  year={2024}
}