ViscoNet: Bridging and Harmonizing Visual and Textual Conditioning for ControlNet

Soon Yau Cheong
University of Surrey
s.cheong@surrey.ac.uk

Armin Mustafa
University of Surrey
armin.mustafa@surrey.ac.uk

Andrew Gilbert
University of Surrey
a.gilbert@surrey.ac.uk

Abstract

This paper introduces ViscoNet, a novel method that enhances text-to-image human generation models with visual prompting. Unlike existing methods that rely on lengthy text descriptions to control the image structure, ViscoNet allows users to specify the visual appearance of the target object with a reference image. ViscoNet disentangles the object’s appearance from the image background and injects it into a pre-trained latent diffusion model (LDM) via a ControlNet branch. This way, ViscoNet mitigates the style mode collapse problem and enables precise and flexible visual control. We demonstrate the effectiveness of ViscoNet on human image generation, where it can manipulate visual attributes and artistic styles with text and image prompts. We also show that ViscoNet can learn visual conditioning from small and specific object domains while preserving the generative power of the LDM backbone.

Harmonizing text and visual prompts, our method can generate a variety of images from the reference image (bottom left in (a)), and maintain pose and appearance throughout. All images in this paper are in full resolution; zoom in for the best viewing experience.

1 Introduction

Diffusion models [16, 33, 39, 29] are powerful tools for generating realistic and diverse images and videos from various inputs. Among them, latent diffusion models (LDM), most notably Stable Diffusion (SD) [36], have shown impressive results in text-to-image (T2I) synthesis, thanks to their high quality and open-source availability. Controlling the image structure, including object positions and human pose, is inherently challenging. In this work, we focus on controllable high-quality stylized human generation. To enable this, previous works [20, 10, 51] have trained large LDMs from scratch; however, this is both resource-intensive and time-consuming. Consequently, recent approaches, such as T2I-Adapter [28] and ControlNet [52], have added separate branches to extract additional conditioning information. These branches are then applied to a frozen backbone LDM for finetuning. However, these models are often trained on small datasets, creating domain gaps with the backbone LDMs. As observed by [20], the conflict arising from the domain gap can result in LDMs being unable to generate the correct pose and image styles. The authors of [20] improve human image generation by assigning higher loss weight to the person’s body keypoints. Nevertheless, there is still a lack of fine-grained visual conditioning, and the overall visual appearance of individuals and images continues to be primarily controlled through text.

Text prompting in image generation is challenging due to the ambiguity in natural language. Extensive effort is spent crafting lengthy and detailed text prompts to achieve the desired person’s appearance, often leaving little room to describe the image background. Even so, given the text’s inherent ambiguity and the stochastic nature of diffusion models, unintended variations in people’s appearance can emerge. This instability can pose a significant challenge for specific applications that demand faithful reconstruction or precise control over people’s appearance. Moreover, because transformers [42, 32] (used to encode text for LDMs) have inherent difficulty relating properties to objects, the model often generates clothing with incorrect colors, such as assigning the color of the top clothing to the pants, or applies the clothing stripe pattern to the background. Furthermore, it is difficult to specify fine texture details, such as the width of the clothing stripe pattern. We propose to solve this issue by adding visual conditioning to ControlNet without changing the underlying architecture, instead using an image encoder to capture fine-grained visual attributes.

In contrast, the text prompts used in machine learning literature lack detailed descriptions to control individuals’ appearance effectively. Our research reveals that mode collapse can occur when a detailed description is introduced into the text prompt. For example, “a woman wearing ripped jeans, in Ukiyoe style” could trigger mode collapse, trapping the model at extreme modes. As modern clothing was not present in ancient Japan, the model may produce either a photorealistic image, driven by ControlNet trained on real images, or a Ukiyoe-style image without the ripped jeans, driven by the backbone LDM. Usually, the highly optimized ControlNet will prevail, and there is currently no effective mechanism to control and manage this conflict. Only when one of the conflicting texts, i.e., ripped jeans or Ukiyoe style, is removed will the model escape the stuck mode. Even then, the model cannot generate the desired images that fulfill all the conditions. In the ControlNet architecture, the entanglement with the backbone LDM is attributed to the sharing of text conditioning. Our method severs this tie by replacing the text conditioning in ControlNet with image conditioning. This establishes a clear disentanglement, with ControlNet overseeing the human (structure and texture) and the remainder controlled by the backbone LDM through text prompts.

In general, the solution of other works is to train on a more extensive dataset to cover the gap. HumanSD [20] attempts to bridge the domain gap by compiling a 1M image dataset, integrating data from diverse domains, including an art dataset for its artistic image styles. However, our research found that this is not sufficient to prevent mode collapse. HyperHuman [25] curated an extensive dataset comprising 340M images. Instead of pursuing an endlessly expansive model to encompass every conceivable image composition and style, a more pragmatic and environmentally friendly approach is to disentangle distinct objects (e.g., humans, clothing, watches, shoes) and finetune them in specific domains with smaller data. Combining these individually optimized components allows for more targeted and efficient model development. Hence, we propose a method that applies masking in training to capture the unique visual properties of these objects while retaining the power of the backbone LDM in generating the background. Subsequently, these individually optimized objects can be combined harmoniously to create a comprehensive and coherent representation of the larger scene.

In this work, we focus on controllable high-quality stylized human generation; this has a different set of challenges from the more widely studied object editing and compositing [41, 47, 8, 26]. Reconstructing and editing human images from a reference image and a target pose remains largely unexplored, with most existing works focusing on specific sub-problems such as pose transfer or face swapping and text-to-image generation. DeepFashion [57] is a popular dataset in this field. The dataset contains 52k images, primarily in a consistent style of fashion models posing against plain colored studio backgrounds. It contains fine-grained clothing segmentation, which helps learn specific fashion domain knowledge. However, training on this small dataset could lead ControlNet to ‘forget’ its capability to generate rich image backgrounds. Our proposed method applies masking in model training to enable the learning of foreground objects from a small dataset while preserving the rich generative capacity of the LDM. Our method does not overfit to fashion model faces and plain backgrounds. Due to the use of the proposed Control Feature Masks and Control Strength Scaling, our approach successfully retains the generative capability, including rich backgrounds and well-known people, movie characters, and artistic styles not present in the dataset.

Our proposed method, which we call ViscoNet (Visual ControlNet), addresses some limitations of the ControlNet model. The summary of our method contributions:

• Adding visual conditioning capability to ControlNet for precise and consistent visual control.

• Applying a human mask in model training, which is effective for learning from small datasets without overfitting and thus without losing the generative capability of the backbone LDM.

• A novel method to regulate and harmonize the textual and visual conditioning to control image style and avoid mode collapse.

2 Related Works

Text to Image Generation. The landscape of image generation has transformed significantly with the introduction of diffusion models [16], which iteratively apply transformations to images, simulating the gradual diffusion of information and creating intricate, diverse, and realistic visual content. Novel approaches in personalized image generation, such as DreamBooth [38], explore finetuning vocabularies to define specific identities. [6, 13] follow the same idea, while [40, 18, 7] leverage large-scale upstream training to eliminate the need for test-time finetuning.

Visual Conditioning. Although images are used as conditioning in ControlNet [52] and T2I-Adapter [28], they are usually formatted for spatial conditioning, such as skeleton images, sketch images, and segmentation maps. They are not effective for precise visual conditioning. In contrast, our method allows visual conditioning on top of the existing structure conditioning provided by the aforementioned methods. UPGPT [10] is among the first works to use visual conditioning in diffusion models. It segments the body and clothing parts, encodes them independently into CLIP [31] image embeddings, and concatenates them alongside the text embedding for text and visual control. However, using a global CLIP embedding results in losing spatial information critical for learning fine texture details. DisCo [44] substitutes SD with SDIV [30] in ControlNet for human pose transfer. By extracting the last hidden layer embedding before the pooling layer in the CLIP image transformer, also known as the local CLIP embedding, DisCo captures finer image texture details. Nevertheless, the clothing color between the top and bottom body parts can be mixed up when the target human size and location significantly deviate from the reference image, owing to the spatial information encoded for the entire image.

Human Image Generation. Text-to-image models have existed since the early days of GANs, with examples such as [49, 46, 55], transformer [42]-based DALL-E [32], and diffusion models like Glide [29], DALL-E 2 [33], LDM [36] and Imagen [39]. However, these models lack precise control over human pose and fine-grained appearances. KPE [9] developed the first text-and-pose-guided image generative model by encoding body key points into transformer tokens as conditions. While it generates accurate poses, it cannot provide fine-grained appearance control and consistency for pose transfer due to weak text conditions. Text2Human [19] uses a parsing map and hierarchical autoencoder to encode different body regions, offering more fine-grained appearance control. However, it cannot specify unlabelled person and clothing attributes like color. Later approaches [50, 54, 53, 34, 9, 4] were trained exclusively on small datasets, resulting in overfitting and an inability to generalize to realistic images in diverse, real-world scenarios.

Image Editing. Within the broader field of image editing, diffusion models are transforming the task by enabling the creation of complex, diverse, and general images. When combined with text prompts, these models allow for various inputs such as strokes or image patches for local editing, as seen in methods like SDEdit [27]. Other techniques like InpaintAnything [48] or BlendedDiffusion [2] use text descriptions to insert objects into images. Controllable editing first became possible by semantically disentangling GAN latent spaces [35, 3, 1] and utilizing semantic masks [3, 14, 23, 43] for localized edits. Diffusion models have since extended these capabilities to more complex and diverse tasks [15, 21, 24].

Figure 1: Architectural diagram showing our contribution concerning backbone LDM and ControlNet layers. We omit time embedding, zero convolution, and number of blocks from the ControlNet diagram [52] for simplicity.

3 Method

To mitigate the observed entanglement caused by sharing text embedding between an LDM and ControlNet, we propose to remove text embedding from ControlNet and replace it with the image embedding of the objects targeted for fine-tuning. Then, a Control Strength Scale connecting the LDM and ControlNet is introduced to regulate the control signal strength for harmonious interaction with the LDM. While people are the specific objects in our experiments, our method can be extended to other object domains.

3.1 Preliminaries

Stable Diffusion (SD), a backbone LDM [36], and a ControlNet model [52] are shown in the left and right block in Figure 1. LDM uses UNet[37] as a denoising network and progressively refines the input noise into latent variables that can be reconstructed into realistic synthetic images, relying on understanding intricate image distributions. The words within a text prompt are decomposed into smaller subword units and tokenized and encoded with a CLIP [31] text transformer [42]. The text embedding is injected into the cross-attention layers in UNet, serving as the sole conditioning in image generation. The same text embedding is also used as conditioning in ControlNet using the same method as LDM.

The loss function of LDM is:

$$\mathcal{L_{MSE}}:=\mathbb{E}_{z,p_{R},c,t,\epsilon\sim\mathcal{N}(0,1)}\left[\|\epsilon-\epsilon_{\theta}(z_{t},t,c)\|^{2}_{2}\right] \qquad (1)$$

$c$ is text conditioning, $t$ is diffusion time step, and $z$ is the latent variable (denoted as input in Figure 1).

Training LDMs from scratch is often impractical due to the substantial volume of data and computational demands, which leads to the need for fine-tuning pre-trained models. Therefore, ControlNet [52] adds a learnable branch parallel to a frozen pre-trained LDM, as shown in the block on the right in Figure 1. ControlNet generates spatial features and adds them to the LDM as a control signal across multiple spatial resolutions.

3.2 Model Architecture

As shown in Figure 1, our approach sits between the LDM and ControlNet. We replace the text embedding in ControlNet with the image embedding of person images, allowing ControlNet to control both the pose and the person’s appearance while leaving everything else in the image to the LDM; this includes the image background, other objects, and artistic styles. We apply Control Features Masking to limit the flow of image background information into the LDM. Our visual extraction method makes the Control Strength Scale effective in balancing the influence with the LDM to achieve the desired image output and avoid mode collapse.

3.2.1 Visual Conditioning

To effectively control the generation of the image, we adopt the approach of UPGPT [10] to segment person images into eight categories: hair, face, top, bottom, outerwear, headwear, shoes, and accessories. These segmented images, referred to as (fashion) style images, are normalized to $224\times 224$, aligning with the standard input size for the CLIP model. Given that people in images are captured from various camera views, ranging from the entire body to the occluded half body, the normalization process preserves texture and spatial information. The utilization of style images decouples visual attributes, such as fashion garments, from human pose. This allows us to generate human images in different poses while wearing the same outfit. We use a frozen pretrained CLIP vision transformer [12] as an image encoder. UPGPT replaces the CLIP image encoder with the CLIP text encoder during inference, enabling control over fashion color using text and the main text prompt that governs fashion structure. In contrast to their approach, we do not require a separate text prompt exclusively for fashion styles. Instead, we can influence the fashion styles directly from the text prompt by adjusting the control signal strength (Section 3.2.3).

We propose to encode the image into localized CLIP embeddings. Unlike DisCo [44], we use multiple style images, eight instead of just one, to enable finer-grained control of the result. However, this results in a significantly larger embedding dimension. To address this, we employ a linear layer to perform weighted averaging, reducing the embedding size from 257 to 8 tokens for each of the multiple-style images.
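The token reduction described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the embedding width of 1024, the random weights, the absence of a bias term, the concatenation of the reduced tokens into a single sequence, and the names `reduce_tokens` and `weight` are all assumptions made for illustration. Only the eight style images and the reduction from 257 to 8 tokens come from the text.

```python
import numpy as np

NUM_STYLE_IMAGES = 8   # from the text: eight style images per person
TOKENS_IN = 257        # local CLIP tokens per style image (from the text)
TOKENS_OUT = 8         # tokens kept per style image (from the text)
EMBED_DIM = 1024       # assumption: width of the CLIP hidden state


def reduce_tokens(local_embeddings, weight):
    """Mix the input tokens of each style image into fewer output tokens.

    local_embeddings: (num_images, tokens_in, dim)
    weight:           (tokens_out, tokens_in), stands in for the learned linear layer
    returns:          (num_images * tokens_out, dim)
    """
    # Linear map applied along the token axis, shared across style images.
    reduced = np.einsum("kt,ntd->nkd", weight, local_embeddings)
    # Assumption: per-image tokens are concatenated into one conditioning sequence.
    return reduced.reshape(-1, local_embeddings.shape[-1])


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(NUM_STYLE_IMAGES, TOKENS_IN, EMBED_DIM))
weight = rng.normal(size=(TOKENS_OUT, TOKENS_IN)) / TOKENS_IN  # hypothetical init
conditioning = reduce_tokens(embeddings, weight)
print(conditioning.shape)  # (64, 1024)
```

Under these assumptions the cross-attention layers would attend over 64 tokens rather than the 2056 tokens obtained by stacking all eight unreduced embeddings.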

3.2.2 Control Features Masking

To focus the model on learning the people rather than the plain background, we multiply the SD loss function (Equation 1) by a binary human silhouette mask at the output in Figure 1. The training loss backpropagates via the frozen LDM to train the ControlNet. This approach is akin to [10, 20], although they use it to assign loss weights to different body segmentation parts rather than masking a region entirely. We add masking to the LDM loss function (Equation 1):

$$\mathcal{L_{MSE}}:=\mathbb{E}_{z,p_{R},c,v,t,\epsilon\sim\mathcal{N}(0,1)}\left[\|\mathcal{M}\odot(\epsilon-\epsilon_{\beta}(z_{t},t,c,v))\|^{2}_{2}\right] \qquad (2)$$

where $\epsilon_{\beta}$ is ControlNet, $v$ is the image embedding, $\odot$ is element-wise multiplication, and $\mathcal{M}\in\mathbb{R}^{H\times W}$ is the binary mask resized to the resolution (H, W) of the LDM output. Although the text condition is not used in ControlNet, it is used by the LDM in training and is thus included in the equation.
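As a rough illustration of Equation 2, the masked loss can be sketched in NumPy as below. This is a sketch under stated assumptions rather than the training code: the function name `masked_mse_loss` is hypothetical, the mask is assumed to be broadcast over the latent channels, and the expectation is approximated by a mean over the batch.

```python
import numpy as np


def masked_mse_loss(noise, predicted_noise, mask):
    """Squared L2 norm of the masked noise residual, averaged over the batch.

    noise, predicted_noise: (batch, channels, H, W)
    mask:                   (batch, H, W), 1 on the person and 0 on the background
    """
    # Assumption: the same spatial mask is applied to every channel.
    residual = mask[:, None, :, :] * (noise - predicted_noise)
    per_sample = np.sum(residual ** 2, axis=(1, 2, 3))
    return float(np.mean(per_sample))


noise = np.ones((1, 4, 2, 2))
predicted = np.zeros((1, 4, 2, 2))
person_mask = np.array([[[1.0, 0.0], [0.0, 0.0]]])  # one foreground pixel
print(masked_mse_loss(noise, predicted, person_mask))          # 4.0
print(masked_mse_loss(noise, predicted, np.ones((1, 2, 2))))   # 16.0
```

With the mask set to all ones, the function reduces to the unmasked loss of Equation 1, so background pixels contribute to the gradient only when the mask allows them to.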

Unlike [10, 20], the masking is also applied to multi-resolution spatial features on all paths originating from ControlNet, as illustrated in Figure 1, aiming to eliminate any unwanted background information leakage. An ablation study on the effectiveness of masking is included in Section 5.
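Masking the multi-resolution control features can be sketched as follows. The nearest-neighbour resizing, the feature shapes, and the names `resize_mask_nearest` and `mask_control_features` are assumptions chosen for illustration; the text specifies only that the mask is applied to the spatial features on every path leaving ControlNet.

```python
import numpy as np


def resize_mask_nearest(mask, height, width):
    """Nearest-neighbour resize of a 2D binary mask to (height, width)."""
    rows = np.arange(height) * mask.shape[0] // height
    cols = np.arange(width) * mask.shape[1] // width
    return mask[np.ix_(rows, cols)]


def mask_control_features(features, mask):
    """Zero out background locations in every ControlNet feature map.

    features: list of arrays shaped (channels, h_i, w_i)
    mask:     (H, W) binary person silhouette
    """
    masked = []
    for feat in features:
        _, h, w = feat.shape
        # Resize the silhouette to this resolution and broadcast over channels.
        masked.append(feat * resize_mask_nearest(mask, h, w)[None])
    return masked


silhouette = np.zeros((8, 8))
silhouette[:, :4] = 1.0  # toy mask: the person occupies the left half
feature_maps = [np.ones((2, 8, 8)), np.ones((2, 4, 4)), np.ones((2, 2, 2))]
masked_maps = mask_control_features(feature_maps, silhouette)
print([m.sum() for m in masked_maps])
```

In this toy case every feature map keeps its left half and loses its right half, which is the behaviour intended to stop studio-background features from reaching the LDM.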

3.2.3 Control Strength Scale

Nevertheless, the indiscriminate use of masking across all spatial resolutions to generate pixels exclusively from the ControlNet signal for the person can lead to undesirable copy-and-paste visual results, causing the person to appear incoherent with the image background. Therefore, we introduce the Control Strength Scale, in which scalar values adjust the strength of the control signals at different resolutions during the image generation process. ControlNet has a similar mechanism for controlling the signal strength, but because it is conditioned on text, its control features lack the spatial properties needed for the scaling to give effective results. In contrast, our visual conditioning method is effective in extracting hierarchical image features, where lower resolutions have a greater influence on the image structure and higher resolutions have more impact on the texture. Leveraging these properties, we devise scale combinations to manage the effect of ControlNet.

For the sake of discussion, the 13 individual control strength scales (c0-c12) can be roughly grouped into three blocks: low (LB), mid (MB), and high (HB). The control signals directed to LB exert the strongest influence on structure and texture, resulting in faithful representations learned from the dataset but potentially limiting artistic creativity. Consequently, these signals (c0-c3) are set to 0 by default during image sampling. The MB, in turn, governs the object’s structure, such as the person’s pose. We found that setting c4 or c5 to 0.5 can disentangle the structure information from control features that contain both design and visual conditions. We then start tuning the scale by increasing the control signal from c5 onwards. HB regulates fine texture details, and its scales are only increased from 0 if we want faithful person reconstruction. Varying within [0.0, 2.0], these scale parameters are tunable to create distinct artistic effects and alterations to the person’s appearance, including pose and facial features, as demonstrated in Section 4.1.
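The grouping of the scales can be sketched in plain Python. Only the count of 13 scales, the LB range c0-c3, and the tuning interval [0.0, 2.0] come from the text; the boundary between MB and HB, the use of a single value per block, and the names `make_scales` and `apply_scales` are assumptions made for illustration.

```python
NUM_SCALES = 13  # c0..c12, from the text

LOW_BLOCK = range(0, 4)     # c0-c3, from the text
MID_BLOCK = range(4, 9)     # assumption: the MB/HB boundary is not given in the text
HIGH_BLOCK = range(9, 13)   # assumption


def make_scales(low=0.0, mid=1.0, high=1.0):
    """Build the 13 control strength scales from one value per block."""
    scales = [0.0] * NUM_SCALES
    for block, value in ((LOW_BLOCK, low), (MID_BLOCK, mid), (HIGH_BLOCK, high)):
        if not 0.0 <= value <= 2.0:
            raise ValueError("scales are tuned within [0.0, 2.0]")
        for index in block:
            scales[index] = value
    return scales


def apply_scales(control_features, scales):
    """Scale each ControlNet output before it is added to the LDM."""
    return [scale * feature for scale, feature in zip(scales, control_features)]


# Stylized setting: LB off, MB weakened, HB off so that texture follows the text prompt.
stylized = make_scales(low=0.0, mid=0.5, high=0.0)
print(stylized)
```

In practice each of the 13 values would be tuned individually, as described above; the per-block helper only makes the three roles of LB, MB, and HB explicit.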

4 Experiments

We employ the DeepFashion In-shop Clothes Retrieval dataset [57], comprising 52K images, and adopt the train-test split proposed by [56] for the pose transfer task, padding the images to the size of $512\times 512$ . Pose information is extracted using OpenPose [5] to create body-and-hand skeleton images, and we use pre-segmented style images from [10]. We explored various options for text labels, including using the manually annotated text in the DeepFashion Multimodal subset, written in detailed fashion technical language. However, this is unrealistic for general use. We also explored using the image-to-text model [22] to generate text labels but found that they can be inaccurate. Ultimately, we chose a simple text prompt in the format of “[object], [image style]” (e.g., “a person, realistic”). This serves two purposes: first, the neutral description avoids potential conflicts with the LDM, and secondly, it acts as an unconditional text embedding, enabling users to amplify the desired visual effect using positive prompts, negative prompts, and guidance scales[11].

We initialize ControlNet by copying weights from the LDM. However, since the cross-attention input has shifted from global CLIP text embedding to local CLIP image embedding, we re-initialize the weights in the cross-attention layer at the start of training. All weights in the LDM and in the CLIP text and image encoders are frozen. We trained the model on a single GPU, specifically an RTX 3090, for 2 epochs, using a batch size of 4 with 4 gradient accumulation steps, resulting in an effective batch size of 16. We retained the remaining configurations and hyperparameters from [52].

4.1 Visual Control

Figure 2 illustrates a step-by-step process of independently controlling the image foreground and background using our method. The control scales are set to 1.0 initially. We take Figure 1(a) from the unseen test data set as a visual reference to transfer the pose denoted at the top left of Figure 1(b) using the default text prompt “a person, realistic”. This showcases the capability of our method to learn about the person and reconstruct them in a different pose, emphasizing its ability to generalize rather than memorize specific instances. As part of our design, the visual conditioning controls only the foreground object.

Consequently, the LDM generates a plain color background with random colors. In Figure 1(c) and 1(d), we perform a virtual clothing try-on by replacing the style image of the top clothing with the one depicted in the top right corner, sourced from outside the dataset. Lastly, we add the words “in jungle” into the text prompt to replace the background with a realistic scene. This demonstrates the versatility and effectiveness of our method in controlling both the foreground object and background, leveraging visual and text prompts, respectively.

Continuing from Figure 1(e), our attempt to change the image style to “ukiyo-e style” in the text prompt reveals a noticeable copy-and-paste visual artifact in Figure 2(a) (zoom in to view). In Figure 2(b), we mitigate this artifact by setting the Control Strength Scale LB to 0, thereby reducing the conditioning strength from ControlNet. Subsequently, we decrease the Control MB from 1.0 to around 0.5, enhancing the influence of the text prompt on the image style. As illustrated in Figure 2(c), the clothing styles diverge from the visual conditioning, aligning more with the clothing from the ancient Japanese era. Continuing the adjustment, we further reduce the MB values to a point where the control strength is weak enough to allow the text prompt to alter the person’s pose. Figure 2(d) displays the effect of adding “raise her hand to hold a flag” to the text prompt. While existing pose methods, including ControlNet, effectively exert pose control, they can be overly rigid for creative purposes. To the best of our knowledge, our method is the first image generation method demonstrating the ability to modify pose with text while mainly maintaining the overall visual appearance.

Nonetheless, achieving this level of control remains challenging and necessitates careful tweaking. Finally, Figure 2(e) shows the effect of blocking off the high spatial resolution signal going through HB altogether, leaving essentially the structural information flowing through MB for pose conditioning. Figure 4 shows our method does not override the person’s information from LDM. Instead, we create harmonized conditioning and leverage LDM knowledge to alter the person’s appearance, including faces. The interaction between the person and their surroundings, e.g., holding the flag in Figure 2(d) and the rocket pack occluding Angelina Jolie in Figure 3(c) also successfully demonstrate that our method mitigates the copy-and-paste effect from naive masking.

4.2 Foreground / Background Disentanglement

We found that trying to control both the pose and clothing color with existing methods is difficult. The clothing color in the text prompt often spreads over to the image background. Using pretrained HumanSD [17], ControlNet [52] and T2I-Adapter [28], we generated images across various image styles using the text prompt “a woman wearing a tank top in blue and purple pattern, and white short pants, standing in a jungle”. Figure 5 compares our results with some of the best examples produced using their methods. The discrepancy between the specified green jungle background and the observed purple hue from the clothing color is consistently present across various image styles, significantly impacting the ControlNet and T2I-Adapter methods. In contrast, our approach, which avoids explicitly specifying clothing color in the text prompt, successfully generates vibrant backgrounds and faithfully reproduces clothing appearances. Notably, white pants, a challenging detail for other methods, are accurately represented in our samples.

We conducted a quantitative assessment separately for the foreground and background. Utilizing CLIP similarity [31] to compare the CLIP image embedding to the CLIP text embeddings of [“green forest”, “purple forest”], we classified whether the background exhibited a purple smear. We measure the consistency of the person’s appearance by cropping out the person and calculating Multiscale Structural Similarity (MS-SSIM) [45] with the reference image. The results in Table 1 indicate that our method significantly outperformed reference methods in preventing clothing color leakage into the background. Additionally, our approach achieved superior results in maintaining the person’s appearance across various image categories (detailed in Table 2).
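The background classification can be sketched as a nearest-text-embedding lookup. The sketch below uses toy three-dimensional vectors in place of real CLIP embeddings, and the function names are hypothetical. Reading the background score as the fraction of samples classified as “green forest” is one plausible interpretation of the Background column in Table 1, not a confirmed definition.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def classify(image_embedding, text_embeddings):
    """Return the label whose text embedding is most similar to the image."""
    return max(
        text_embeddings,
        key=lambda label: cosine_similarity(image_embedding, text_embeddings[label]),
    )


def background_score(image_embeddings, text_embeddings, target="green forest"):
    """Fraction of images whose background is classified as the target label."""
    hits = sum(classify(e, text_embeddings) == target for e in image_embeddings)
    return hits / len(image_embeddings)


# Toy stand-ins for CLIP text and image embeddings (assumption, not real data).
texts = {"green forest": [0.0, 1.0, 0.0], "purple forest": [1.0, 0.0, 1.0]}
images = [[0.1, 0.9, 0.0], [0.8, 0.2, 0.7], [0.0, 1.0, 0.2], [0.2, 0.7, 0.1]]
print(background_score(images, texts))  # 0.75
```

With real data, the image embeddings would come from the CLIP image encoder applied to the generated samples and the text embeddings from the CLIP text encoder applied to the two prompts.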

Method	Background (CLIPSIM) $\uparrow$	Person (MS-SSIM) $\uparrow$
ControlNet [52]	0.172	0.342
T2I-Adapter [28]	0.237	0.348
HumanSD [20]	0.348	0.336
ViscoNet (ours)	0.975	0.408

Table 1: Comparing foreground background disentanglement and person appearance similarity.

4.3 Mode Collapse

	Human Evaluation $\uparrow$				CLIP similarity classification accuracy $\uparrow$
Image Styles	HumanSD	ControlNet	T2I-Adapter	ViscoNet (Ours)	HumanSD	ControlNet	T2I-Adapter	ViscoNet (Ours)
Ukiyoe	✓	✓	✓	✓	1.00	0.68	0.53	1.00
Cyberpunk anime	✓	✓	✓	✓	0.98	0.79	0.85	0.88
Stained glass	✗	✓	✓	✓	0.23	0.96	1.00	1.00
Van Gogh	✗	✗	✓	✓	0.03	0.55	0.73	0.64
Picasso	✗	✗	✓	✓	0.00	0.17	0.66	1.00
Oil Painting	✗	✗	✗	✓	0.39	0.20	0.88	0.84
Disney	✗	✗	✗	✗	0.14	0.33	0.40	0.58
Average	0.286	0.429	0.714	0.857	0.396	0.526	0.722	0.849

Table 2: Faithfulness of generated images to the prescribed image style categories.

To study mode collapse, we repeat the experiments in Section 4.2 but introduce the conflicting word “ripped jeans” into the text prompt. Figure 6 shows mode collapse when the person reverts into the image domain of the controlling network.

Before beginning the experiments, we use the neutral text prompt “a woman, [image style]” to generate images and identify the common image styles that all the models can generate; these are listed in Table 2. We performed both qualitative and quantitative analysis. For the qualitative assessment, human evaluators assess the generated samples and decide whether most of the samples within each category are faithful to the image styles in the baseline. Our method demonstrated superior performance, experiencing mode collapse in only 1 out of 7 categories, the Disney cartoon style. While our method could generate a Disney cartoon style (Figure 4(f)), the success rate of the batch-generated samples was deemed relatively low. Apart from Disney cartoons, computer graphics is another category where all models fail, but it is excluded from the comparison as HumanSD could not generate it in the baseline. For the same reason, color sketch is excluded because HumanSD and ControlNet could not generate it in the baseline, although both our method and T2I-Adapter successfully maintain this image style.

We confirm our human evaluation result with quantitative results. As in Section 4.2, we employ CLIP similarity to classify the image style. We use the image styles as the text options for the CLIP text embedding, and we add “photo” to the comparison as it is among the most common modes in mode collapse. To ensure the effectiveness of CLIP in detecting image styles, we run CLIP similarity classification on the baseline samples generated by each model. They all achieved accuracy close to 100%, apart from Disney for HumanSD, which reached only 40%. The quantitative result in Table 2 aligns with the qualitative assessment. Oil painting for T2I-Adapter is the only exception: human evaluators deemed the samples closer to watercolor, which is not in the test list. Overall, the results show that our method greatly mitigates mode collapse in ControlNet, on which our architecture is based.
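The averages in Table 2 can be recomputed from the per-style entries with a short script. The values below are transcribed from Table 2, with ✓ encoded as 1 and ✗ as 0; the variable names are ours. Because the per-style accuracies are already rounded to two decimals, a recomputed average may differ from the reported one in the last digit.

```python
# Rows follow the style order of Table 2: Ukiyoe, Cyberpunk anime, Stained glass,
# Van Gogh, Picasso, Oil Painting, Disney.
human_pass = {
    "HumanSD":     [1, 1, 0, 0, 0, 0, 0],
    "ControlNet":  [1, 1, 1, 0, 0, 0, 0],
    "T2I-Adapter": [1, 1, 1, 1, 1, 0, 0],
    "ViscoNet":    [1, 1, 1, 1, 1, 1, 0],
}
clip_accuracy = {
    "HumanSD":     [1.00, 0.98, 0.23, 0.03, 0.00, 0.39, 0.14],
    "ControlNet":  [0.68, 0.79, 0.96, 0.55, 0.17, 0.20, 0.33],
    "T2I-Adapter": [0.53, 0.85, 1.00, 0.73, 0.66, 0.88, 0.40],
    "ViscoNet":    [1.00, 0.88, 1.00, 0.64, 1.00, 0.84, 0.58],
}


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


for method in human_pass:
    print(method,
          round(mean(human_pass[method]), 3),
          round(mean(clip_accuracy[method]), 3))
```

The human evaluation averages are simply the fraction of the seven styles in which a method avoided mode collapse, for example 6 out of 7 for ViscoNet.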

5 Ablation and Limitations

Initially, we did not apply Control Features Masking in model training; this caused the ControlNet layers to overpower the backbone LDM and overfit, resulting in plain background images. We attempted to remove the influence of ControlNet on the background generation by applying the masking post-training. However, the generative power of the LDM was massively reduced to generating only a few simple scenes, as shown in Figure 7. Despite masking the influence, we can observe background features leaking into the jungle image, such as the flat horizon originating from fashion studio flooring. This is further evident in Figure 7d, where image padding errors in the DeepFashion dataset give rise to the repeating leaf pixels on the left. This underscores the importance of applying our Control Features Masking during training.

After extensive experimentation, we identified control scale combinations that can be used as a starting point for tuning, and they perform well across various image styles, particularly in the context of painting styles. However, achieving an optimal balance between visual prompts and challenging image styles, like Disney cartoons, remains problematic. The scales may require recalibration after each training cycle to maintain effectiveness.

6 Conclusions

In conclusion, our work introduced ViscoNet, a novel approach addressing challenges in image generation, particularly mitigating mode collapse and enhancing visual conditioning. Our method’s key contributions are:

• The addition of visual conditioning that allows for precise visual control.

• The application of masking to reduce overfitting when small datasets are used.

• The ability to avoid mode collapse by disentangling the joint textual prompt and scaling the control signal strengths.

By disentangling the visual appearance of objects from Stable Diffusion and relocating them to ControlNet, ViscoNet effectively harmonizes and balances the influence of text prompts and visual conditioning, showcasing its efficacy in generating diverse and precise visual attributes. While previous research has explored cascading multiple ControlNets for multiple controls, we envision the possibility of cascading multiple ViscoNets to achieve even more nuanced visual control over objects in images.

References

Alaluf et al. [2022] Yuval Alaluf, Omer Tov, Ron Mokady, Rinon Gal, and Amit Bermano. Hyperstyle: Stylegan inversion with hypernetworks for real image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18511–18521, 2022.
Avrahami et al. [2022] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18208–18218, 2022.
Bau et al. [2020] David Bau, Hendrik Strobelt, William Peebles, Jonas Wulff, Bolei Zhou, Jun-Yan Zhu, and Antonio Torralba. Semantic photo manipulation with a generative image prior. arXiv preprint arXiv:2005.07727, 2020.
Bhunia et al. [2023] Ankan Kumar Bhunia, Salman Khan, Hisham Cholakkal, Rao Muhammad Anwer, Jorma Laaksonen, Mubarak Shah, and Fahad Shahbaz Khan. Person image synthesis via denoising diffusion model. IEEE Conference of Computer Vision and Pattern Recognition (CVPR), 2023.
Cao et al. [2019] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Openpose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
Chen et al. [2023a] Hong Chen, Yipeng Zhang, Xin Wang, Xuguang Duan, Yuwei Zhou, and Wenwu Zhu. Disenbooth: Disentangled parameter-efficient tuning for subject-driven text-to-image generation. arXiv preprint arXiv:2305.03374, 2023a.
Chen et al. [2023b] Wenhu Chen, Hexiang Hu, Yandong Li, Nataniel Ruiz, Xuhui Jia, Ming-Wei Chang, and William W Cohen. Subject-driven text-to-image generation via apprenticeship learning. arXiv preprint arXiv:2304.00186, 2023b.
Chen et al. [2023c] Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. Anydoor: Zero-shot object-level image customization. arXiv preprint arXiv:2307.09481, 2023c.
Cheong et al. [2022] Soon Yau Cheong, Armin Mustafa, and Andrew Gilbert. Kpe: Keypoint pose encoding for transformer-based image generation. British Machine Vision Conference (BMVC), 2022.
Cheong et al. [2023] Soon Yau Cheong, Armin Mustafa, and Andrew Gilbert. Upgpt: Universal diffusion model for person image generation, editing and pose transfer. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2023.
Dhariwal and Nichol [2021] Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis. Conference on Neural Information Processing Systems (NeurIPS), 2021.
Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. International Conference on Learning Representations (ICLR), 2020.
Gal et al. [2022] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
Gu et al. [2019] Shuyang Gu, Jianmin Bao, Hao Yang, Dong Chen, Fang Wen, and Lu Yuan. Mask-guided portrait editing with conditional gans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3436–3445, 2019.
Hertz et al. [2022] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Conference on Neural Information Processing Systems (NeurIPS), 2020.
[17] HuggingFace. openai/clip-vit-large-patch14.
Jia et al. [2023] Xuhui Jia, Yang Zhao, Kelvin CK Chan, Yandong Li, Han Zhang, Boqing Gong, Tingbo Hou, Huisheng Wang, and Yu-Chuan Su. Taming encoder for zero fine-tuning image customization with text-to-image diffusion models. arXiv preprint arXiv:2304.02642, 2023.
Jiang et al. [2022] Yuming Jiang, Shuai Yang, Haonan Qiu, Wayne Wu, Chen Change Loy, and Ziwei Liu. Text2human: Text-driven controllable human image generation. SIGGRAPH, 2022.
Ju et al. [2023] Xuan Ju, Ailing Zeng, Chenchen Zhao, Jianan Wang, Lei Zhang, and Qiang Xu. Humansd: A native skeleton-guided diffusion model for human image generation. International Conference on Computer Vision (ICCV), 2023.
Kim et al. [2022] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2426–2435, 2022.
Li et al. [2023] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. International Conference on Machine Learning (ICML), 2023.
Ling et al. [2021] Huan Ling, Karsten Kreis, Daiqing Li, Seung Wook Kim, Antonio Torralba, and Sanja Fidler. Editgan: High-precision semantic image editing. Advances in Neural Information Processing Systems, 34:16331–16345, 2021.
Liu et al. [2023a] Xihui Liu, Dong Huk Park, Samaneh Azadi, Gong Zhang, Arman Chopikyan, Yuxiao Hu, Humphrey Shi, Anna Rohrbach, and Trevor Darrell. More control for free! image synthesis with semantic diffusion guidance. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 289–299, 2023a.
Liu et al. [2023b] Xian Liu, Jian Ren, Aliaksandr Siarohin, Ivan Skorokhodov, Yanyu Li, Dahua Lin, Xihui Liu, Ziwei Liu, and Sergey Tulyakov. Hyperhuman: Hyper-realistic human generation with latent structural diffusion. arXiv preprint arXiv:2310.08579, 2023b.
Lu et al. [2023] Shilin Lu, Yanzhu Liu, and Adams Wai-Kin Kong. Tf-icon: Diffusion-based training-free cross-domain image composition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2294–2305, 2023.
Meng et al. [2021] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
Mou et al. [2023] Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
Nichol et al. [2021] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. Proceedings of Machine Learning Research, 2021.
Pinkney [2022] Justin Pinkney. Stable diffusion image variations. https://github.com/justinpinkney/stable-diffusion, 2022.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. International Conference on Machine Learning (ICML), 2021.
Ramesh et al. [2021] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. International Conference on Machine Learning (ICML), 2021.
Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
Ren et al. [2022] Yurui Ren, Xiaoqing Fan, Ge Li, Shan Liu, and Thomas H. Li. Neural texture extraction and distribution for controllable person image synthesis. IEEE Conference of Computer Vision and Pattern Recognition (CVPR), 2022.
Richardson et al. [2021] Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding in style: a stylegan encoder for image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2287–2296, 2021.
Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. International Conference on Medical Image Computing and Computer Assisted Interventions (MICCAI), 2015.
Ruiz et al. [2023] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
Saharia et al. [2022] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
Shi et al. [2023] Jing Shi, Wei Xiong, Zhe Lin, and Hyun Joon Jung. Instantbooth: Personalized text-to-image generation without test-time finetuning. arXiv preprint arXiv:2304.03411, 2023.
Song et al. [2022] Yizhi Song, Zhifei Zhang, Zhe Lin, Scott Cohen, Brian Price, Jianming Zhang, Soo Ye Kim, and Daniel Aliaga. Objectstitch: Generative object compositing. arXiv preprint arXiv:2212.00932, 2022.
Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Conference on Neural Information Processing Systems (NeurIPS), 2017.
Wang et al. [2022] Tengfei Wang, Ting Zhang, Bo Zhang, Hao Ouyang, Dong Chen, Qifeng Chen, and Fang Wen. Pretraining is all you need for image-to-image translation. arXiv preprint arXiv:2205.12952, 2022.
Wang et al. [2023] Tan Wang, Linjie Li, Kevin Lin, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, and Lijuan Wang. Disco: Disentangled control for referring human dance generation in real world. arXiv preprint arXiv:2307.00040, 2023.
Wang et al. [2003] Z. Wang, E.P. Simoncelli, and A.C. Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, 2003, pages 1398–1402 Vol.2, 2003.
Xu et al. [2017] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
Yang et al. [2023] Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by example: Exemplar-based image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18381–18391, 2023.
Yu et al. [2023] Tao Yu, Runseng Feng, Ruoyu Feng, Jinming Liu, Xin Jin, Wenjun Zeng, and Zhibo Chen. Inpaint anything: Segment anything meets image inpainting. arXiv preprint arXiv:2304.06790, 2023.
Zhang et al. [2017] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. International Conference on Computer Vision (ICCV), 2017.
Zhang et al. [2021] Jinsong Zhang, Kun Li, Yu-Kun Lai, and Jingyu Yang. Pise: Person image synthesis and editing with decoupled gan. IEEE Conference of Computer Vision and Pattern Recognition (CVPR), 2021.
Zhang et al. [2022a] Kaiduo Zhang, Muyi Sun, Jianxin Sun, Binghao Zhao, Kunbo Zhang, Zhenan Sun, and Tieniu Tan. Humandiffusion: a coarse-to-fine alignment diffusion framework for controllable text-driven person image generation. arXiv preprint arXiv:2211.06235, 2022a.
Zhang and Agrawala [2023] Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. International Conference on Computer Vision (ICCV), 2023.
Zhang et al. [2022b] Pengze Zhang, Lingxiao Yang, Jianhuang Lai, and Xiaohua Xie. Exploring dual-task correlation for pose guided person image generation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022b.
Zhou et al. [2022] Xinyue Zhou, Mingyu Yin, Xinyuan Chen, Li Sun, Changxin Gao, and Qingli Li. Cross attention based style distribution for controllable person image synthesis. European Conference on Computer Vision (ECCV), 2022.
Zhu et al. [2019a] Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019a.
Zhu et al. [2019b] Zhen Zhu, Tengteng Huang, Baoguang Shi, Miao Yu, Bofei Wang, and Xiang Bai. Progressive pose attention transfer for person image generation. IEEE Conference of Computer Vision and Pattern Recognition (CVPR), 2019b.
Liu et al. [2016] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Supplementary Material
This appendix is divided into two sections. Section A includes additional comparisons with other methods and further analysis of our method. We then showcase more image results in Section B, demonstrating the versatility and quality of our method in performing DeepFakes, virtual try-on, pose transfer and stylization.

Appendix A Further Analysis

A.1 User Study

In addition to the evaluation in Table 2 of the main paper, in which we asked humans to evaluate the image styles within each method, we conducted a larger-scale user study on Amazon Mechanical Turk (AMT) to measure real-life preferences between our model and the baseline approaches. We performed a 4-way comparison, asking workers to select their preferred image from randomly shuffled samples, as shown in Figure 8.

We generated 50 images for each image style for each method using the following text prompt: “A woman, wearing a tank top in purple and blue pattern, white short ripped jeans, in the jungle, [style]”, where [style] is replaced with one of 7 stylization terms: Disney cartoon, Japanese anime cyberpunk, oil painting, Picasso, stained glass, Ukiyo-e, and Van Gogh. We asked 5 different workers to pick their preference among the four images, and we used consensus to determine the overall decision for each stylized image. In total, 221 different workers contributed to the user study, resulting in 100 user responses per image style, or 700 responses in total for the entire study.

Style            ControlNet [52]  T2I-Adapter [28]  HumanSD [20]  ViscoNet (ours)
Van Gogh         13               9                 2             76
Oil painting     11               7                 9             73
Disney cartoon   23               5                 5             67
Stained glass    32               23                0             45
Picasso          13               42                0             45
Cyberpunk anime  15               21                23            41
Ukiyo-e          32               4                 27            37
Total            139              111               66            384
%                19.9%            15.9%             9.43%         54.9%

Table 3: User study comparing faithfulness of image styles. Higher preference indicates lower mode collapse.

As shown in Table 3, users preferred our model over the baselines for all styles, and in about 55% of cases overall. The highest preference rates were for Van Gogh, oil painting and Disney cartoon, at 76%, 73% and 67%, respectively.
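As a quick arithmetic check, the totals and percentages in Table 3 can be recomputed from the per-style counts. The short sketch below uses only the counts listed in the table; the variable names are illustrative. Summing the HumanSD column gives 66, which is the value consistent with the reported 9.43% share of 700 responses.

```python
# Per-style preference counts from Table 3, in the order
# (ControlNet, T2I-Adapter, HumanSD, ViscoNet).
votes = {
    "Van Gogh":        (13, 9, 2, 76),
    "Oil painting":    (11, 7, 9, 73),
    "Disney cartoon":  (23, 5, 5, 67),
    "Stained glass":   (32, 23, 0, 45),
    "Picasso":         (13, 42, 0, 45),
    "Cyberpunk anime": (15, 21, 23, 41),
    "Ukiyo-e":         (32, 4, 27, 37),
}

# Column totals per method and the grand total over all responses.
totals = [sum(row[i] for row in votes.values()) for i in range(4)]
grand_total = sum(totals)

# Share of all responses won by each method, in percent.
shares = [round(100 * t / grand_total, 2) for t in totals]

print(totals)       # [139, 111, 66, 384]
print(grand_total)  # 700
print(shares)       # [19.86, 15.86, 9.43, 54.86]
```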

A.2 Good Scaling and Interpolation Properties

Our method has a good scaling property for balancing the controlling network against the backbone LDM. In Figure 8(a), with 100% control strength, the human is completely in the controlling network’s domain, while the image background is in the artistic style described by the text prompt conditioned on the LDM. With our method, reducing the signal strength gradually reduces the controlling network’s influence and transitions smoothly to the LDM’s domain. In contrast, this has little effect in ControlNet and often results only in abrupt, off-the-cliff changes between the two modes. Our control signal strength method exploits this unique property to avoid mode collapse and also to create various visual effects, as shown in Section B.

This property is further demonstrated in Figure 10, which shows the effect of latent space interpolation between the visual prompt (Figure 9(a)) and the text prompt “a woman, wearing a spacesuit, standing inside a space shuttle.” As we reduce the signal strength, we can observe the clothing texture, color and shape gradually transition to a spacesuit. Human face and pose play important roles in human image generation, and our method is able to preserve both well despite the weakened control signal strength. We can leverage this to perform visual interpolation while maintaining the person’s identity. Figures 9(i) to 9(l) show another example, in which the background remains largely consistent with a fixed random seed.
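The control signal strength described above can be pictured as a per-layer scale on the control features before they are added to the backbone features. The following is a minimal sketch under that assumption; `apply_control`, the list-of-arrays layout and the example scale values are hypothetical and are not taken from the actual implementation.

```python
import numpy as np

def apply_control(backbone_feats, control_feats, scales):
    """Blend control features into backbone features, one scale per layer.

    A scale of 1.0 applies the full control signal; 0.0 leaves the
    backbone features untouched.
    """
    if not (len(backbone_feats) == len(control_feats) == len(scales)):
        raise ValueError("expected one control feature and one scale per layer")
    return [b + s * c for b, c, s in zip(backbone_feats, control_feats, scales)]

# Toy example with three layers of different spatial sizes.
backbone = [np.zeros((4, 8, 8)), np.zeros((4, 4, 4)), np.zeros((4, 2, 2))]
control = [np.ones_like(b) for b in backbone]

# Sweep a single global strength from full control down to no control.
sweep = {s: apply_control(backbone, control, [s] * 3) for s in (1.0, 0.5, 0.0)}
```

Passing a different scale for each layer, rather than one global value, corresponds to the idea of control scale combinations that may need tuning for each image style.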

A.3 CLIP Image Embedding

We experimented with two image embedding methods for visual conditioning: global and local CLIP image embeddings (see Section 2, Related Works, Visual Conditioning). Figure 11 shows that the local CLIP embedding used in our method is better at capturing fine texture details.
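The difference between the two embedding types can be sketched in terms of token shapes. The example below assumes that "global" refers to a single pooled vector per image and "local" to the sequence of per-patch tokens. It uses random numbers as a stand-in for real encoder output; the shapes follow a CLIP ViT-L/14 encoder at 224x224 input (256 patch tokens plus one class token, width 1024). The function names are illustrative, and the projection layer normally applied to the global vector is omitted.

```python
import numpy as np

# Assumed token layout for a ViT-L/14 image encoder at 224x224 input:
# 1 class token followed by (224 / 14)^2 = 256 patch tokens of width 1024.
NUM_PATCHES = (224 // 14) ** 2
WIDTH = 1024

rng = np.random.default_rng(0)
tokens = rng.standard_normal((1 + NUM_PATCHES, WIDTH))  # stand-in for encoder output

def global_embedding(tokens):
    """One vector summarising the whole image (the class token)."""
    return tokens[:1]

def local_embedding(tokens):
    """One vector per image patch, which retains spatial detail."""
    return tokens[1:]

g = global_embedding(tokens)  # shape (1, 1024)
l = local_embedding(tokens)   # shape (256, 1024)
```

A cross-attention layer conditioned on the local embedding can attend to 256 separate patch vectors rather than a single summary vector, which is one plausible reason it captures fine texture better.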

Appendix B Qualitative Results

B.1 DeepFakes

We can perform DeepFakes using two methods: visual prompts or text prompts. Figure 13 shows that by conditioning on face and hair images, our method generates realistic people with the correct skin tones and body shapes matching the faces. Although more than 90% of the DeepFashion dataset consists of female images, predominantly of fair-skinned women, our approach shows excellent generalization capability in generating people with a broad spectrum of skin tones, genders, faces, hair and body shapes. We can also leverage the knowledge of the backbone LDM to generate famous people. To do this, we replace the face and hair images with blank images to unset the face and hair conditions while keeping the clothing visual conditioning. Then we replace the neutral text “a person” with the famous person’s name. The results are shown in Figure 14. Despite using the same human mask, by tweaking the control signal strength, our method can generate people of various sizes and hair lengths. Moreover, we found that our conditioning method can create more realistic and better-looking celebrity images than the standalone backbone LDM (Stable Diffusion), which is often plagued by distorted faces and extra limbs.

B.2 Pose Transfer

We trained our model on data from the DeepFashion pose transfer task, but we did not perform a quantitative comparison with other pose transfer methods. This is predominantly because our visual conditioning ignores the image background by design. The random background color can result in deceptively poor similarity scores when compared to the ground truth images. This randomness also affects the clothing color, making the method unsuitable for mass-producing the images required for pose transfer evaluation. Nevertheless, with careful random seed selection and control strength tuning, we can achieve excellent pose transfer outcomes, as shown in Figure 15.

B.3 Virtual Try-on

Figure 16 demonstrates how we perform fashion virtual try-on using visual and text prompts. Figure 17 illustrates the culmination of our methods, showcasing the seamless integration of DeepFakes, virtual try-on, and pose transfer.

B.4 Stylization

Figure 18 and Figure 19 show that our visual conditioning is effective across many image domains in creating the desired person’s appearance, including various painting styles and 3D objects such as statues, sculptures, toys and 3D graphics. Some image domains have distinctive characteristics that diverge greatly from real photos, such as cartoons with disproportionately large heads; this can lead to a higher mode collapse rate. We circumvent this by removing the face visual condition, as we did for DeepFakes, to create results such as those in Figure 17(l) and Figure 18(l).

B.5 Video Demo

We also include a video, demo.mp4 (Figure 12), to demonstrate how our method creates the images in this project.