Official Implementations "Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models" (ICLR'24)
Abstract: The success of recent text-to-image diffusion models is largely due to their capacity to be guided by a complex text prompt, which enables users to precisely describe the desired content. However, these models struggle to effectively suppress the generation of undesired content, which is explicitly requested to be omitted from the generated image in the prompt. In this paper, we analyze how to manipulate the text embeddings and remove unwanted content from them. We introduce two approaches, which we refer to as soft-weighted regularization and inference-time text embedding optimization. The first regularizes the text embedding matrix and effectively suppresses the undesired content. The second method aims to further suppress the unwanted content generation of the prompt, and encourages the generation of desired content. We evaluate our method quantitatively and qualitatively on extensive experiments, validating its effectiveness. Furthermore, our method is generalizability to both the pixel-space diffusion models (i.e. DeepFloyd-IF) and the latent-space diffusion models (i.e. Stable Diffusion).
Existing text-to-image models can encounter challenges in effectively suppressing the generation of the negative target. For example, when requesting an image using the prompt "a face without glasses", the diffusion models (i.e., SD) synthesize the subject without "glasses", as shown in Figure (the first column). However, when using the prompt "a man without glasses", both SD and DeepFloyd-IF models still generate the subject with "glasses" (It also happens in both Ideogram and Mijourney models, see Appendix F), as shown in Figure (the second and fifth columns). Figure (the last column) quantitatively show that SD has 0.819 DetScore for "glasses" using 1000 randomly generated images, indicating a very common failure cases in diffusion models. Also, when giving the prompt "a man", often the glasses are included, see Figure (the third and sixth columns).
Overview of the proposed method. (a) We devise a negative target embedding matrix
The codebase is tested on
- Python 3.8
- PyTorch 1.12.1
- Quadro RTX 3090 GPUs (24 GB VRAM) with CUDA version 11.7
environment or python libraries:
pip install -r requirements.txt
python suppress_eot_w_nulltext.py --type Real-Image \
--prompt "A man with a beard wearing glasses and a hat in blue shirt" \
--image_path "./example_images/A man with a beard wearing glasses and a hat in blue shirt.jpg" \
--token_indices "[[4,5],[7],[9,10],]" \
--alpha "[1.,]" --cross_retain_steps "[.2,]"
You can use NPI (Negative-prompt Inversion) for faster inversion of the real image, but it may lead to a certain degree of degradation in editing quality:
python suppress_eot_w_nulltext.py --type Real-Image --inversion NPI\
--prompt "A man with a beard wearing glasses and a hat in blue shirt" \
--image_path "./example_images/A man with a beard wearing glasses and a hat in blue shirt.jpg" \
--token_indices "[[4,5],[7],[9,10],]" \
--alpha "[1.,]" --cross_retain_steps "[.2,]"
python suppress_eot_w_nulltext.py --type Generated-Image \
--prompt "A dog in Van Gogh Style" --seed 2023 \
--token_indices "[[4,5,6],]" \
--alpha "[1.,]" --cross_retain_steps "[.2,]" --iter_each_step 0
Our method can suppress copyrighted materials and memorized images from pretrained text-to-image diffusion models, such as ESD, Concept-ablation and Forget-Me-Not, and it is training-free.
1️⃣ Generating subjects for generated image (Attend-and-Excite similar results)
python suppress_eot_w_nulltext.py --type Generated-Image \
--prompt "A painting of an elephant with glasses" --seed 16 \
--token_indices "[[7],]" \
--alpha "[-0.001,]" --cross_retain_steps "[.2,]" --iter_each_step 0
2️⃣ Adding subjects for real image (GLIGEN similar results)
python suppress_eot_w_nulltext.py --type Real-Image \
--prompt "A car near trees with raining" \
--image_path "./example_images/A car near trees.jpg" \
--token_indices "[[6],]" \
--alpha "[-0.001,]" --cross_retain_steps "[.2,]" --iter_each_step 0
python suppress_eot_w_nulltext.py --type Real-Image \
--prompt "a robot is jumping out of a jeep" \
--image_path "./example_images/a man is jumping out of a jeep.jpg" \
--token_indices "[[2],]" \
--alpha "[-0.001,]" --cross_retain_steps "[.5,]" --iter_each_step 0
4️⃣ (Top) Cracks removal results. (Middle) Rain removal for synthetic rainy image. (Bottom) Rain removal for real-world rainy image
python suppress_eot_w_nulltext.py --type Real-Image \
--prompt "A photo with cracks" \
--image_path "./example_images/A photo with cracks.jpg" \
--token_indices "[[4],]" \
--alpha "[1,]" --cross_retain_steps "[.5,]"
Following Inst-Inpaint, we use FID and CLIP Accuracy to evaluate the accuracy of the removal operation on the GQA-Inpaint dataset. We achieve superior suppression results and higher CLIP Accuracy scores on the GQA-Inpaint dataset.
There are 18883 pairs of test data in the GQA-Inpaint dataset, including source image, target image, and prompt. Inst-Inpaint attempts to remove objects from the source image based on the provided prompt as an instruction (e.g., "Remove the airplane at the center"). We suppress the noun immediately following "remove" in the instruction (e.g., "airplane") and use the remaining part, deleting the word "remove" at the beginning of the instruction to form our input prompt (e.g., "The airplane at the center").
Outputs for GQA-Inpaint in Inst-Inpaint and Ours.
@inproceedings{li2023get,
title={Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models},
author={Li, Senmao and van de Weijer, Joost and Khan, Fahad and Hou, Qibin and Wang, Yaxing and others},
booktitle={The Twelfth International Conference on Learning Representations},
year={2023}
}
Should you have any questions, please contact [email protected]
The code in this repository is licensed under the MIT License for academic and other non-commercial uses.
Acknowledgment: This code is based on the P2P, Null-text repository.