Abstract
Previous methods for inserting new subjects (e.g., face identities and cats) into pre-trained Text-to-Image diffusion models for personalized generation have two problems: (1) Attention Overfit: As shown in the activation maps of Textual Inversion and ProSpect, their ‘‘V’’ attention nearly takes over the whole image, which means the learned embedding tries to encode both the target subject and subject-unrelated information in the target images, such as the subject region layout and the background. This severely limits their interaction with other existing concepts such as ‘‘cup’’ and leads to generated content that fails to align with the given prompt. (2) Limited Semantic-Fidelity: Although some methods alleviate overfitting by introducing subject priors such as face recognition models, the ‘‘cup’’ attention of Celeb Basis still affects the ‘‘V’’ face region, which hinders the control of subject attributes such as ‘‘eyes closed’’, while IP-Adapter learns a mismatched subject embedding (i.e., its attention of ‘‘V*’’ is inconsistent with the generated face). These flaws limit the semantic fidelity of text-to-image generation. Therefore, we propose a Subject-Wise Attention Loss and Semantic-Fidelity Token Optimization to address problems (1) and (2), respectively.
Framework
The overview of our framework. We first propose a novel Subject-Wise Attention Loss to alleviate the attention overfit problem and make the subject embedding focus on the subject region, which improves subject accuracy and interactive generation ability. Then, we optimize the target subject embedding as five per-stage token pairs with disentangled features to expand the textual conditioning space with Semantic-Fidelity control ability.
Motivation
We note that T2I models have learned a robust general concept prior for various subcategories, e.g., different human identities falling under the broad concept of ‘‘person’’, which can act as an anchor for regularizing the subject embedding. Furthermore, when ‘‘Obama’’ is replaced with ‘‘Hillary’’ in a prompt, the attention maps of each token remain similar. Thus, we propose the Subject-Wise Attention Loss, which encourages the attention of each token to align with that from a reference prompt, jointly optimizing controllability and subject similarity.
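A minimal sketch of how such a loss could be computed, assuming per-token cross-attention maps have already been extracted from the denoising UNet (e.g., via forward hooks) and averaged over heads and layers; the function name and the dict-of-maps interface are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def subject_wise_attention_loss(attn_maps_subject, attn_maps_anchor):
    """Encourage each token's cross-attention map under the subject prompt
    (e.g., 'a photo of V* holding a cup') to match the map of the same token
    position under an anchor prompt where the subject is replaced by a
    well-learned concept (e.g., 'a photo of Obama holding a cup').

    Both arguments: dict mapping token position -> attention map tensor of
    shape (H * W,), averaged over heads and cross-attention layers.
    """
    device = next(iter(attn_maps_subject.values())).device
    loss = torch.zeros((), device=device)
    for pos, a_subj in attn_maps_subject.items():
        a_ref = attn_maps_anchor[pos].detach()  # anchor maps act as a fixed reference
        loss = loss + F.mse_loss(a_subj, a_ref)
    return loss / max(len(attn_maps_subject), 1)
```

Viewed this way, the anchor prompt regularizes which region the learned token is allowed to claim, so the embedding no longer absorbs background or layout information.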
When it comes to generating images with controllable attributes, previous methods fail on examples like ‘‘an old person’’. (1) Attention Mismatch: The ${K}$ feature cooperates with the image feature ${Q}$ to decide how to sample from the ${V}$ feature. However, the attention maps are mismatched for attribute tokens such as ‘‘beard’’ or ‘‘closed’’. (2) Insufficient ${V}$ Feature: Even when the attention is correct, the ${V}$ feature from the prompt ‘‘an old Emma Watson/Rich Sommer’’, as a detailed texture feature provider, cannot reflect ‘‘old’’ on ‘‘Emma Watson/Rich Sommer’’. We address this challenge by disentangling the ${K}$ and ${V}$ features, which helps achieve fine-grained subject attribute control.
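A minimal sketch of one way such K/V disentanglement could be wired into a cross-attention layer, assuming the text conditioning is assembled twice: once with the per-stage K-token (steering where attention lands) and once with the per-stage V-token (supplying subject texture). The class name, shapes, and two-context interface are assumptions for illustration, not the released code.

```python
import torch
import torch.nn as nn

class DisentangledKVCrossAttention(nn.Module):
    """Cross-attention where the text features feeding K and V can differ,
    so the learned subject token provides separate 'where' (K) and
    'what' (V) embeddings."""

    def __init__(self, query_dim, context_dim, heads=8, dim_head=64):
        super().__init__()
        inner = heads * dim_head
        self.heads, self.scale = heads, dim_head ** -0.5
        self.to_q = nn.Linear(query_dim, inner, bias=False)
        self.to_k = nn.Linear(context_dim, inner, bias=False)
        self.to_v = nn.Linear(context_dim, inner, bias=False)
        self.to_out = nn.Linear(inner, query_dim)

    def forward(self, x, context_k, context_v):
        b, n, _ = x.shape
        q = self.to_q(x)
        k = self.to_k(context_k)   # text embeddings containing the K-token
        v = self.to_v(context_v)   # text embeddings containing the V-token

        def split(t):  # (b, seq, inner) -> (b, heads, seq, dim_head)
            return t.view(b, -1, self.heads, t.shape[-1] // self.heads).transpose(1, 2)

        q, k, v = map(split, (q, k, v))
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.to_out(out)
```

Keeping the two paths separate lets an attribute word such as ‘‘old’’ reshape the attention layout through the K path without being overridden by the identity texture carried through the V path.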
Results
Single Person's Generation
Single Person’s Comparisons
Manipulation of various scenes, actions, and facial attributes
Various face photos generated by ours and the comparison methods
More Single Person’s Generation Results
Multiple Persons' Generation
Multiple Persons’ Comparisons
More Multiple Persons’ Generation Results
More Evaluation
Embedding Other Objects
Various subjects generated by ours and the comparison methods
Since our method does not introduce a face prior from other models, we also adopt animals (Bear, Cat, and Dog) and general objects (Car, Chair, and Plushie) for experiments, which demonstrates the generalization ability of our method.
Manipulation of various scenes, actions, and attributes
Using Stable Diffusion XL
We select the SDXL model stable-diffusion-xl-base-1.0 as the target model and compare against newly released methods built on it.