AGILE: A Diffusion-Based Attention-Guided Image and Label Translation for Efficient Cross-Domain Plant Trait Identification

Earl Ranario, Lars Lundqvist, Heesup Yun, Brian N. Bailey, J. Mason Earles
University of California, Davis
{ewranario, llund, hspyun, bnbailey, jmearles}@ucdavis.edu
Abstract

Semantically consistent cross-domain image translation facilitates the generation of training data by transferring labels across different domains, making it particularly useful for plant trait identification in agriculture. However, existing generative models struggle to maintain object-level accuracy when translating images between domains, especially when domain gaps are significant. In this work, we introduce AGILE (Attention-Guided Image and Label Translation for Efficient Cross-Domain Plant Trait Identification; code available at https://github.com/plant-ai-biophysics-lab/AGILE), a diffusion-based framework that leverages optimized text embeddings and attention guidance to semantically constrain image translation. AGILE utilizes pretrained diffusion models and publicly available agricultural datasets to improve the fidelity of translated images while preserving critical object semantics. Our approach optimizes text embeddings to strengthen the correspondence between source and target images and guides attention maps during the denoising process to control object placement. We evaluate AGILE on cross-domain plant datasets and demonstrate its effectiveness in generating semantically accurate translated images. Quantitative experiments show that AGILE enhances object detection performance in the target domain while maintaining realism and consistency. Compared to prior image translation methods, AGILE achieves superior semantic alignment, particularly in challenging cases where objects vary significantly or domain gaps are substantial.

1 Introduction

Figure 1: We use labels from synthetic data to gain object domain knowledge represented by text labels by optimizing for semantic correspondences between the source and target domains.
Figure 2: AGILE uses pretrained diffusion models to find semantic correspondences between unpaired source and target images. We optimize text embeddings through query attention maps generated from labeled source images, guiding the model to focus on desired regions in the target domain. Attention guidance is applied during the denoising process to enhance control over semantic alignment, achieving improved consistency in translation between source and target domains.

1.1 Computer Vision in Agriculture

Computer vision is used in a wide range of agricultural tasks such as plant phenotyping, disease detection, and yield estimation. Such tasks involve identifying plant-specific traits or characteristics that reflect the condition of the plant, which helps farmers and plant breeders make better decisions. For example, Palacios et al. [20] implemented a segmentation model to detect visible berries and canopy features to predict the yields of different varieties of grapes. They further state that their results would improve if a larger and more diverse set of data points were used to build the models. Chen et al. [4] detected rice plant diseases with deep transfer learning, using pre-trained models that were trained on a large number of images and then fine-tuned on a smaller, domain-specific dataset.

Although machine learning tools have allowed for high-throughput identification of plant traits, the performance of these tools is limited by the availability of labeled data and resource constraints [12]. There is an emphasis on expanding public image datasets for agricultural tasks, but this alone may not adequately address the complexities of multi-domain scenarios [15]. For instance, a model trained on one domain may not generalize well to another domain due to differences in lighting, camera angle, or plant species. The general approach is to label new data tailored to specific domains, but this process is costly and time-consuming [14]. However, recent advances in artificial intelligence (AI) provide new capabilities in improving the efficiency of labeling.

1.2 Domain Translation

Generative AI techniques that can improve labeling efficiency include image-to-image translation, text-to-image generation, and style transfer. These methods have been applied in a variety of fields, including medical imaging, autonomous driving, and video games. In image-to-image translation, images from existing labeled public datasets can be translated to another domain while maintaining the semantics of the labeled object, where semantics refers to the structural features of an object, such as its shape, color, and positional relationships within the image.

Generative Adversarial Networks (GANs) [7, 32] are an early example of image-to-image translation for unpaired images. More recently, Stable Diffusion models [27, 23] have enabled high-quality image generation through various conditioning mechanisms. Although diffusion models such as DALL-E [24, 25, 2] and Stable Diffusion are capable of generating complex scenes, using them for semantically constrained image-to-image translation comes with challenges. First, for most diffusion-based models, training a model that can translate an image from one domain to another requires paired images, which are difficult to obtain. Second, images collected in the field do not come with text descriptions, and a generic prompt underspecifies the scene. For instance, the prompt “grapes in a vineyard” could output different types of grapes in various vineyard settings. Third, the generated images may not be semantically accurate or may not contain the desired object. In image-to-image translation, the user may want to keep an object in a specific location based on the semantics of the input image.

[Figure 3 panels: Synthetic Grape, Borden Day, Borden Night, Synthetic Flower, Real Flower]

Figure 3: The dataset is pulled from AgML, a machine learning library for agricultural datasets. The original synthetic images were generated from Helios, a 3D Plant and Environment Biophysical Modeling Framework [1, 13]. Synthetic images are treated as the source domain because they can be generated with labels in effectively unlimited quantities. We train and evaluate our method on object detection tasks and constrain the translation within the same plant.

Therefore, this paper studies how to efficiently use existing labeled images to improve semantic accuracy in image-to-image translation tasks, specifically for plant trait identification. We propose a diffusion-based method, Attention-Guided Image and Label Translation for Efficient Cross-Domain Plant Trait Identification (AGILE), which uses existing pretrained diffusion models and public agricultural datasets to generate labeled images for specific domains. The idea is to use labels generated for supervised tasks to gain object domain knowledge represented by text labels. With this knowledge, we can find semantic correspondences between the source and target domains and guide the model's attention to desired regions. Our contributions include:

  • Images collected in real-world settings often lack accompanying text descriptions and semantic alignment with text inputs. By optimizing prompt embeddings, we leverage existing labeled images to emphasize regions of interest, improving semantic knowledge. This allows us to have text-image correspondence with few labeled images.

  • With semantically aware, optimized prompt embeddings, we can control object semantics in the target domain through attention guidance during the denoising process of a diffusion-based model. This allows labels to be transferable from the source to the target domain.

  • We compare the performance between the source, target and generated data for object detection tasks.

2 Related Work

2.1 Image-to-Image Translation

Image-to-image translation is the task of translating one possible representation of a scene into another, which was explored early with GANs [11]. Within the GAN framework, a generator creates an output image that aims to resemble a target image, while the discriminator evaluates how real or fake the output looks compared to the actual data. Fei et al. proposed CropGAN, which enlists 3D crop models and GANs to semantically constrain fruit position and geometry [6]. CropGAN performs effectively when the source and target domains are similar, but its performance deteriorates significantly when applied to vastly different domains.

For the case of classifier-free diffusion-based models, conditional inputs can be included during the generation process, allowing for more creativity or control. However, diffusion models can dramatically change the content of the desired image and introduce unexpected changes in regions of interest. Parmar et al. developed a zero-shot image-to-image translation method using the Stable Diffusion framework [22]. Their proposed method focuses on changing desired regions while keeping unrelated regions consistent by editing the text embedding space and using cross-attention guidance. While their approach effectively maintains the structure of the desired region, it struggles with complex features. Moreover, their method is limited to translating only a single object per image. Parmar et al. additionally addressed the slow processing speed and the reliance on paired data for model fine-tuning with CycleGAN-turbo and pix2pix-turbo, which adapt a single-step diffusion model to new domains through adversarial learning objectives [21]. Xu et al. introduced CycleNet, which incorporates cycle consistency into diffusion models to regularize image-to-image manipulation without the need for paired data [30].

Zhang et al. developed a method, called ControlNet, to add conditional control to text-to-image diffusion models [31]. ControlNet locks the production-ready large diffusion models and reuses their pretrained encoding layers to learn a diverse set of conditional controls. It also contains “zero convolutions” (zero-initialized convolutional layers) that progressively grow the parameters from zero and ensure that no harmful noise affects fine-tuning. Their method supports various control inputs and converges quickly during training. However, as is the case for most diffusion-based methods, paired data needs to be available to streamline the training process.

2.2 Semantic Correspondence

Semantic correspondence is a core problem in computer vision that involves finding locations in different images that share the same semantic meaning [16, 5, 8]. Hedin et al. leverage semantic knowledge within diffusion models to find locations in multiple images that have the same semantic meaning [9]. Their key insight is that since recent diffusion models can generate photo-realistic images from text prompts only, knowledge about semantic correspondences must be built into them. Therefore, they deduced that one may not need any ground-truth semantic correspondences between image pairs to find semantic correspondences. By exploiting the attention maps of latent diffusion models, one can identify the prompt corresponding to a particular image location. Given arbitrary input images, these attention maps should respond to the semantics of the prompt. They proposed a method, inspired by recent prompt-to-prompt text-based image editing [18], to first optimize a randomly initialized text embedding to maximize the cross-attention score at a query location. Then, they find the semantically corresponding location in another image by taking the pixel with the maximum attention map score within the target image. Attend and Excite [3] is an example of a similar approach. We build upon this method by removing the reliance on paired images and instead using the model’s underlying knowledge of the object based on the given prompt.

2.3 Cross-Attention Guidance

Content preservation through cross-attention guidance involves maintaining the semantics of an image before and after diffusion translation by ensuring that the text-image cross-attention map remains consistent. Ma et al. demonstrated cross-attention guidance in their method, Directed Diffusion, which improves upon diffusion models by providing direct control over object placement within generated images [17]. Their approach uses text-image cross-attention maps to guide the positioning of labeled or specified objects while maintaining contextual coherence. However, the constraint is that the provided text prompts must have text-image correspondence, which is not readily available for images collected in the field. Therefore, prior to attention guidance, we must first optimize text embeddings to find semantic correspondences between the text prompt and target images.

3 Experimental Setup

Figure 4: The displayed timesteps indicate when attention guidance is halted. The optimal stopping range is between $t=5$ and $t=15$, as this preserves object structure and color effectively.
[Figure 5 panels. Columns: Target, Source, AGILE, CropGAN, CycleGAN-turbo. Rows: Borden Day, Borden Night, Real Flower]
Figure 5: Generation results across translation tasks for our method (AGILE), CropGAN, and CycleGAN-turbo. The Target column represents the desired output domain for each translation task. Top Row: Synthetic Grape to Borden Day. Middle Row: Synthetic Grape to Borden Night. Bottom Row: Synthetic Flower to Real Flower.

3.1 Datasets

Each set of images is derived from AgML (https://github.com/Project-AgML/AgML), a machine learning library for agricultural datasets. AgML provides labeled datasets of different plants in various domains. For example, we use synthetic grape and flower datasets generated with Helios [1], a 3D Plant and Environment Biophysical Modeling Framework, as our source domain. Typically, synthetic images are treated as a source domain because they can be generated with labels in effectively unlimited quantities. Additionally, plant modeling parameters can be tweaked to close the domain gap, at the cost of additional resources and compute time, to improve generative results. As seen in Figure 3, we train and evaluate our method on object detection tasks and constrain our translation within the same plant.

3.2 Method

We build the proposed method on top of ControlNet [31], which enables conditional control of pretrained diffusion models by incorporating external control signals during the generation process. Our method first finds semantic correspondences between source and target images using pretrained diffusion models. To do so, we optimize text embeddings using a query attention map generated from the labels of the source images. This gives us text-image correspondence without the need for paired images. Then, we use the optimized text embeddings to highlight regions of interest in the target domain. We further guide the attention maps to highlight the desired regions using the same set of query attention maps. This allows us to control the semantics of the target image through attention guidance during the denoising process of a diffusion-based model.

In diffusion models [19], during the forward process, an image $I$ is encoded into its latent representation, $x_0$. Noise is gradually added to $x_0$ for $T$ timesteps, making the data increasingly noisy until it becomes pure noise at $t=T$. The training objective is to minimize the discrepancy between the model’s predicted noise $\epsilon_\theta(x_t, t)$ and the true noise $\epsilon$. The model can be further conditioned on a text prompt $y$ by providing an embedding $e=\tau_\theta(y)$ using a text encoder $\tau_\theta$:

$$L = \mathbb{E}_{x_0, \epsilon, t}\big[\|\epsilon - \epsilon_\theta(x_t, t, e)\|_2^2\big]. \tag{1}$$
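As a rough illustration of Eq. (1), the sketch below noises a toy latent and evaluates the squared error for a placeholder noise predictor. It relies on the standard closed-form forward step $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, which is not stated in the text above. The linear beta schedule, the array shapes, and the function `predict_noise` are assumptions made for illustration and are not part of AGILE.

```python
import numpy as np

def alpha_bar_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, eps):
    """Closed-form forward step: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def noise_prediction_loss(eps, eps_pred):
    """Squared L2 error between true and predicted noise (one sample of Eq. 1)."""
    return float(np.sum((eps - eps_pred) ** 2))

def predict_noise(x_t, t, e):
    """Hypothetical stand-in for eps_theta(x_t, t, e); a real model is a trained network."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
alpha_bar = alpha_bar_schedule()
t = 500                                   # arbitrary timestep for the demo
x0 = rng.standard_normal((4, 8, 8))       # toy latent (shape is an assumption)
eps = rng.standard_normal(x0.shape)       # true noise
e = rng.standard_normal((77, 768))        # toy text embedding (shape is an assumption)
x_t = add_noise(x0, t, alpha_bar, eps)
loss = noise_prediction_loss(eps, predict_noise(x_t, t, e))
```

In training, the expectation in Eq. (1) would be approximated by averaging this quantity over many sampled latents, noise draws, and timesteps.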

In the reverse process, using Denoising Diffusion Implicit Models (DDIM) [28], a series of denoising steps, parameterized by $\theta$, are applied to predict $x_{t-1}$ given by:

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, f_\theta(x_t, t) + \sqrt{1 - \bar{\alpha}_{t-1}}\, \epsilon_\theta(x_t, t), \tag{2}$$

where $\bar{\alpha}_{t-1}$ is the noise scaling factor, $f_\theta(x_t, t)$ is the model’s predicted denoised version of $x_0$, and $\epsilon_\theta(x_t, t)$ is the model’s prediction of the noise at step $t$. The denoiser for Stable Diffusion based models is a transformer architecture [29] utilizing a series of self-attention and cross-attention layers. For latent $x_t$ at the $t$-th timestep, we can compute the attention maps by computing the query $Q_l$ and key $K_l$ attention map for each $l$-th layer of the decoder. The cross-attention for a given sample is defined as:

$$M_l(x_t \mid t, e, I_i) = \mathrm{Attn}(Q_l, K_l) = \mathrm{softmax}\left(\frac{Q_l K_l^\top}{\sqrt{d_k}}\right), \tag{3}$$

where $d_k$ is the dimension of the key vectors. We first define our source and target models by fine-tuning the Stable Diffusion [23, 27] model, SD1.5, until the output image matches the input, as seen in Parts A and C of Figure 2. During this fine-tuning process, we apply geometric and color augmentations to the images concatenated during the denoising process, which were not part of the original ControlNet implementation. This helps prevent the model from overfitting to the training set.
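The cross-attention map of Eq. (3) can be sketched with plain arrays. In this minimal example the queries stand in for projected image features and the keys for projected text-token embeddings; the number of heads, tokens, spatial size, and $d_k$ are assumed values chosen only to make the example run, and the random inputs replace the activations of a real denoiser.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along one axis."""
    z = z - z.max(axis=axis, keepdims=True)
    w = np.exp(z)
    return w / w.sum(axis=axis, keepdims=True)

def cross_attention_map(Q, K):
    """Eq. 3. Q: (heads, pixels, d_k) image queries; K: (heads, tokens, d_k) text keys.
    Returns attention weights of shape (heads, pixels, tokens)."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    return softmax(scores, axis=-1)

def token_map(attn, token_index, height, width):
    """Average one token's attention over heads and reshape it to a spatial map."""
    return attn[:, :, token_index].mean(axis=0).reshape(height, width)

rng = np.random.default_rng(0)
heads, h, w, tokens, d_k = 8, 16, 16, 77, 40   # assumed sizes
Q = rng.standard_normal((heads, h * w, d_k))
K = rng.standard_normal((heads, tokens, d_k))
attn = cross_attention_map(Q, K)
first_token_map = token_map(attn, 0, h, w)     # spatial response of the first token
```

Each pixel's weights sum to one across tokens, so a token's spatial map shows where that token receives relatively more attention than the others.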

3.2.1 Text Optimization

Table 1: Comparison of Fréchet Inception Distance (FID) and Average Precision (AP) metrics for various image translation methods across three domain translation tasks: Synthetic Grape to Borden Day, Synthetic Grape to Borden Night, and Synthetic Flower to Real Flower. Among the translation methods, ours achieves the highest AP on all three tasks and the lowest FID on the two grape tasks. All methods were fine-tuned for each task. CropGAN was provided few-shot examples of labeled target images.
| Method | Syn. Grape to Borden Day: FID ↓ | Syn. Grape to Borden Day: AP ↑ | Syn. Grape to Borden Night: FID ↓ | Syn. Grape to Borden Night: AP ↑ | Syn. Flower to Real Flower: FID ↓ | Syn. Flower to Real Flower: AP ↑ |
|---|---|---|---|---|---|---|
| Synthetic Only | 265.96 | 0.12 | 363.87 | 0.00 | 175.94 | 0.11 |
| Real Only | 92.89 | 0.51 | 129.27 | 0.69 | 81.93 | 0.65 |
| CropGAN [6] | 131.18 | 0.28 | 192.12 | 0.17 | 202.96 | 0.24 |
| CycleGAN-turbo [21] | 146.64 | 0.09 | 276.94 | 0.00 | 154.21 | 0.33 |
| ControlNet [31] | 140.60 | 0.09 | 198.90 | 0.00 | 183.53 | 0.26 |
| Ours | 123.08 | 0.33 | 187.18 | 0.32 | 156.44 | 0.35 |

Images collected in the field do not come with a paired text description. Therefore, we cannot collect semantic correspondences between images using paired data. To address this, we propose to optimize provided text embeddings using existing labeled images for improved semantic knowledge, as summarized in Part B of Figure 2. The labeled images, which form the source domain, must contain the object of interest that makes semantic correspondence between the source and target domain possible. A query attention map, $M_s$, is created for each labeled image, containing multiple Gaussian markers centered within the labeled bounding boxes. Each Gaussian marker, $G(x,y)$, is defined as:

$$G(x,y) = \exp\left(-\frac{(x-x_c)^2}{2\sigma_x^2} - \frac{(y-y_c)^2}{2\sigma_y^2}\right), \tag{4}$$

where $(x_c, y_c)$ is the center of the Gaussian marker and $\sigma_x^2$ and $\sigma_y^2$ are the variances. The variances can be scaled by the dimensions of the bounding boxes. After creating the query attention map for each labeled image, we condition the Stable Diffusion model on the text embedding $e$ and optimize the text embedding at a specific timestep. As the Gaussian regions represent the desired region of focus, we can optimize for the embedding $e$ that reproduces the desired attention map. We extract the cross-attention map at timestep 30 and optimize only the first token using an MSE loss:

$$e^* = \arg\min_e \frac{1}{N}\sum_{i=1}^{N}\big\|M_l(x_t \mid t, e, I_i) - M_s(I_i)\big\|_2^2. \tag{5}$$

The attention response from each layer in the decoding stage exhibits different levels of semantic knowledge [18]. Therefore, to summarize these responses, we average the attention maps for the first token across all the decoding layers and attention heads. At the end of each optimization step, we obtain an optimized text embedding $e^*$ that highlights the desired regions of interest in the source domain. Once $e^*$ is obtained from (5), it is used as input for subsequent optimization steps using $M_l(x_t \mid t, e^*, I_i)$. We have found that at least five images are needed to obtain the desired optimized embedding.
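A minimal sketch of how a query attention map could be built from bounding boxes with the Gaussian marker of Eq. (4), together with the value of the objective in Eq. (5) for a given set of attention maps. Placing one marker per box, merging markers with a pixel-wise maximum, and the `scale` factor that ties the standard deviation to the box size are assumptions made for illustration; the optimization over $e$ itself requires a differentiable diffusion model and is not shown.

```python
import numpy as np

def gaussian_marker(height, width, xc, yc, sigma_x, sigma_y):
    """Eq. 4 evaluated on a (height, width) pixel grid."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - xc) ** 2) / (2 * sigma_x ** 2)
                  - ((ys - yc) ** 2) / (2 * sigma_y ** 2))

def query_attention_map(height, width, boxes, scale=0.25):
    """Build M_s from boxes given as (x_min, y_min, x_max, y_max) in pixels.
    Assumption: one marker per box, merged with a pixel-wise maximum."""
    m = np.zeros((height, width))
    for x0, y0, x1, y1 in boxes:
        xc, yc = (x0 + x1) / 2, (y0 + y1) / 2
        sigma_x = max(scale * (x1 - x0), 1e-6)   # spread scaled by box width
        sigma_y = max(scale * (y1 - y0), 1e-6)   # spread scaled by box height
        m = np.maximum(m, gaussian_marker(height, width, xc, yc, sigma_x, sigma_y))
    return m

def embedding_objective(attn_maps, query_maps):
    """Value minimized in Eq. 5: mean over images of the squared L2 distance."""
    return float(np.mean([np.sum((a - q) ** 2) for a, q in zip(attn_maps, query_maps)]))

boxes = [(8, 8, 24, 24), (40, 30, 56, 50)]       # hypothetical labels for one image
M_s = query_attention_map(64, 64, boxes)
uniform_attn = np.full_like(M_s, M_s.mean())     # stand-in for an unoptimized attention map
objective_before = embedding_objective([uniform_attn], [M_s])
```

In the full method the attention maps passed to `embedding_objective` would come from the diffusion model conditioned on the current embedding, and the embedding would be updated by gradient descent on this value.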

3.2.2 Attention Guidance

The following steps are applied after defining the target model as shown in Part C in Figure 2. Our approach builds on ControlNet by concatenating the source and synthetic images during the denoising process to preserve semantic consistency, while skip connections help maintain fine details [21]. In contrast to text optimization, we target the first three cross-attention layers of the decoding block, which are closest to the high-level latent feature space. Within each of these layers, we aggregate the attention maps by averaging over all attention heads and spatial dimensions to obtain a single representative map per token, $M_{l,token}$. Since we employ single-word text embeddings, we designate the first token as the “object” token and treat the remaining tokens as “background” tokens. This allows us to apply different scaling weights to object and background tokens. Prior to scaling, we compute the mean and standard deviation of each map, $M_{l,token}$, for the object and background tokens separately, and normalize the query map based on these statistics. As a result, we get normalized representative maps per token for each selected decoding layer, $\tilde{M}_{l,token}$. Since each layer produces attention maps at different resolutions, the representative maps are interpolated to match the dimensions of the query map, ensuring consistency during the attention guidance process.

We add the final query map $\tilde{M}_s(I_i)$ to the final attention map $\tilde{M}_{l,token}$ for each token index as shown:

$$M^{edit}_{l,token} = \beta \cdot \tilde{M}_{l,token}(x_t \mid t, e^*, I_i) + \tilde{M}_s(I_i), \tag{6}$$

where $\beta$ is the scaling factor that can be different for object or background tokens. This editing approach can be stopped early to prevent harsh contrast differences with the target domain, as seen in Figure 4, though this can change depending on the dataset used.
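The edit in Eq. (6) can be sketched for one layer as follows. This is one possible reading of the normalization step, in which the query map is rescaled to the mean and standard deviation of the resized attention map before the two are added; the nearest-neighbor resize, the map sizes, and the $\beta$ values are assumptions for illustration rather than the settings used in our experiments.

```python
import numpy as np

def resize_nearest(m, height, width):
    """Nearest-neighbor resize (an assumed stand-in for the interpolation step)."""
    rows = (np.arange(height) * m.shape[0] / height).astype(int)
    cols = (np.arange(width) * m.shape[1] / width).astype(int)
    return m[np.ix_(rows, cols)]

def match_statistics(query_map, attn_map, eps=1e-8):
    """Rescale the query map to the mean and standard deviation of an attention map."""
    z = (query_map - query_map.mean()) / (query_map.std() + eps)
    return z * attn_map.std() + attn_map.mean()

def edit_attention(attn_map, query_map, beta):
    """Eq. 6 for one layer and one token group: beta * M_tilde_{l,token} + M_tilde_s."""
    attn_resized = resize_nearest(attn_map, *query_map.shape)
    return beta * attn_resized + match_statistics(query_map, attn_resized)

rng = np.random.default_rng(0)
query_map = rng.random((32, 32))                   # stand-in for M_s(I_i)
layer_maps = {"object": rng.random((8, 8)),        # head-averaged map of the first token
              "background": rng.random((8, 8))}    # head-averaged map of the other tokens
betas = {"object": 1.0, "background": 0.5}         # hypothetical scaling weights
edited = {name: edit_attention(m, query_map, betas[name])
          for name, m in layer_maps.items()}
```

Keeping the object and background weights separate makes it possible to strengthen the object response without amplifying the background by the same amount.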

4 Experiments and Ablation Study

Our method leverages synthetic images and their corresponding labels to enforce semantic constraints during image-to-image translation. In our framework, the real images are treated as the target domain, but their labels are withheld during training to simulate an unpaired setting. To evaluate our approach, we compute the Fréchet Inception Distance (FID) [10] between the generated images and the target domain. For comparison, we also calculate the FID for the Synthetic Only and Real Only baselines by measuring the distance between their respective training sets and the target test set.
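For reference, the Fréchet distance underlying FID compares the mean and covariance of two feature sets, $\|\mu_1-\mu_2\|_2^2 + \mathrm{Tr}\big(\Sigma_1+\Sigma_2-2(\Sigma_1\Sigma_2)^{1/2}\big)$. The sketch below computes this quantity with NumPy on placeholder features; in an actual FID evaluation the features come from a pretrained Inception network, which is omitted here, and the random arrays are assumptions used only to exercise the code.

```python
import numpy as np

def feature_statistics(features):
    """Mean vector and covariance matrix of an (n_samples, n_features) array."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between two Gaussians N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    # Tr((S1 S2)^(1/2)) equals the sum of square roots of the eigenvalues of S1 S2,
    # which are real and non-negative when both covariances are positive semi-definite.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
generated_feats = rng.standard_normal((200, 4))        # placeholder "generated" features
target_feats = rng.standard_normal((200, 4)) + 0.5     # placeholder "target" features
fid_like = frechet_distance(*feature_statistics(generated_feats),
                            *feature_statistics(target_feats))
```

Lower values indicate that the two feature distributions are closer, which is why FID is reported with a downward arrow in Table 1.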

To demonstrate the practical utility of our generated images, we use them as training data for a Faster R-CNN [26] object detection model. Specifically, we train the detection model on the synthetic, real and generated images annotated with bounding boxes corresponding to grape clusters or flowers. After training, we test the model on real images from the target domain, comparing the predicted bounding boxes to the ground truth bounding boxes from the real dataset. We quantify detection performance using Average Precision (AP). Finally, we compare our results against recent unpaired image-to-image translation methods to highlight the effectiveness of our approach.
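To make the detection metric concrete, the sketch below computes intersection over union (IoU) between boxes and a single-image, single-threshold AP with greedy matching of score-sorted predictions. The IoU threshold of 0.5, the all-point interpolation of the precision-recall curve, and the example boxes are assumptions for illustration; the exact AP variant used in the evaluation is not specified here, and a full evaluation aggregates predictions over the whole test set.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def average_precision(predictions, ground_truth, iou_threshold=0.5):
    """AP for one image. predictions: list of (score, box); ground_truth: list of boxes."""
    matched, hits = set(), []
    for _, box in sorted(predictions, key=lambda p: p[0], reverse=True):
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(ground_truth):
            if j not in matched and iou(box, gt) > best_iou:
                best_iou, best_j = iou(box, gt), j
        if best_j is not None and best_iou >= iou_threshold:
            matched.add(best_j)      # each ground-truth box can be matched once
            hits.append(1)
        else:
            hits.append(0)
    precisions, recalls, tp = [], [], 0
    for k, hit in enumerate(hits, start=1):
        tp += hit
        precisions.append(tp / k)
        recalls.append(tp / len(ground_truth) if ground_truth else 0.0)
    # Make precision non-increasing in recall, then integrate over recall.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap

ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]      # hypothetical labels
predictions = [(0.9, (0, 0, 10, 10)),                  # correct detection
               (0.8, (50, 50, 60, 60)),                # false positive
               (0.7, (20, 20, 30, 30))]                # correct detection
ap = average_precision(predictions, ground_truth)
```

A false positive ranked above a true positive lowers the precision at the recall level reached afterwards, so AP rewards detectors that assign higher confidence to correct boxes.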

Table 2: Ablation study results summarizing the importance of text optimization prior to attention guidance.
| Method | Syn. Grape to Borden Day: AP ↑ | Syn. Grape to Borden Night: AP ↑ | Syn. Flower to Real Flower: AP ↑ |
|---|---|---|---|
| No Text Optim. | 0.10 | 0.01 | 0.13 |
| No Guidance | 0.14 | 0.00 | 0.25 |
| Ours | 0.33 | 0.32 | 0.35 |

4.1 Quantitative Results

We evaluate our approach against CropGAN [6], CycleGAN-turbo [21], and ControlNet [31]. Table 1 presents a quantitative comparison based on FID between the generated images and the target domain, as well as the AP when performing object detection on the target domain.

Our results demonstrate that AGILE outperforms the baseline models across all datasets in terms of AP. For example, when translating Synthetic Grape to Borden Day, AGILE achieves an AP of 0.33, an improvement over CropGAN (0.28) and CycleGAN-turbo (0.09). Similarly, for the Synthetic Grape to Borden Night task, AGILE achieves an AP of 0.32, surpassing both CropGAN (0.17) and CycleGAN-turbo (0.00). Additionally, when translating Synthetic Flower to Real Flower, AGILE achieves an AP of 0.35, outperforming CropGAN (0.24) and slightly surpassing CycleGAN-turbo (0.33).

While AGILE excels in AP performance across all tasks, CycleGAN-turbo achieves a marginally better FID score for the Synthetic Flower to Real Flower translation task. Despite this, AGILE demonstrates superior performance in maintaining object semantics and producing high-quality images across the other domains. Furthermore, our approach surpasses CropGAN, a few-shot learning method, across all datasets, despite CropGAN being provided with some of the labeled target images.

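We also conducted an ablation study to assess the impact of text optimization on attention guidance. As shown in Table 2, when text optimization is omitted, attention guidance adversely affects the generation process compared to both the base ControlNet method and the approach using only text optimization (No Guidance): AP falls from 0.14 to 0.10 on Borden Day and from 0.25 to 0.13 on Real Flower, while both variants are near zero on Borden Night (0.01 and 0.00).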

Figure 6: Without attention guidance, the generated image fails to translate the labeled objects. The background and object colors of the generated Borden Day image are inconsistent with the target domain.

4.2 Qualitative Comparisons

Figure 5 presents a visual comparison of the images generated by the different approaches. Our method preserves realism while ensuring semantic consistency, whereas CropGAN and CycleGAN-turbo fall short. For example, CropGAN struggles to distinguish between the sky and grape regions in the Synthetic Grape to Borden Night translation, and CycleGAN-turbo fails to maintain proper color consistency with the target domain. Overall, our approach preserves the color, shape, and texture of both objects and background more effectively than the other methods. For instance, although a grape in the synthetic image can look very different from one in the target domain, our approach aims to maintain its shape across domains. However, it does encounter challenges when transferring very small objects that provide minimal signal, as seen in the flower translation task.

In our ablation study, we attempted attention guidance using a text embedding that was not optimized to contain object knowledge, as shown in Figure 7. As a result, the method without text optimization failed to preserve the object’s shape and color in the target domain. We also observed an inconsistency in the background color between the full proposed method and the variant without text optimization, though this discrepancy may be unrelated. These results suggest that text optimization is a necessary preliminary step prior to attention guidance. Figure 6 shows that without attention guidance, the generated image fails to translate the labeled objects and struggles to maintain object color consistency with the target domain.

5 Discussion

Figure 7: Without text optimization, attention guidance fails to accurately generate the correct shape and color of the flower in the real image. Despite differences between the synthetic and real flowers, our method successfully preserves the flower’s shape within the target domain, demonstrating improved semantic consistency.

The timing and location for optimizing text embeddings and guiding attention maps depend on the denoising timestep at which the edits or optimizations are applied and on the decoding layers targeted for modification. We optimize at $t=30$, while noise is still present, which ensures that the optimized text embedding remains robust to variability across images. Furthermore, we edit only the first three layers to avoid object distortion or inconsistent colors in the final image. Since attention guidance involves manipulating the attention scores, excessive adjustments could lead to unwanted artifacts or background inconsistency.

Additionally, we found that different layers in the decoding block attend to different areas of the image. For example, early layers focus on interpreting the foreground or objects defined by the text embedding, while later layers contribute to background refinement and overall image coherence. Overall, there are two key axes to consider for successful text optimization and attention guidance: the denoising timestep and the decoding layers.
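The two axes can be summarized as a small configuration object. The sketch below is only illustrative: the class name `EditSchedule`, the 0-based layer indexing, and the total of nine decoder layers are assumptions, while the timestep of 30 and the restriction to the first three layers follow the description above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EditSchedule:
    """When and where edits are applied during denoising (hypothetical helper)."""
    optim_timestep: int = 30          # timestep at which the text embedding is optimized
    guided_layers: tuple = (0, 1, 2)  # first three decoder layers, 0-based (assumed indexing)

    def optimizes_text_at(self, timestep: int) -> bool:
        return timestep == self.optim_timestep

    def guides_layer(self, layer_idx: int) -> bool:
        return layer_idx in self.guided_layers

schedule = EditSchedule()
num_decoder_layers = 9  # assumption: placeholder layer count for illustration
guided = [i for i in range(num_decoder_layers) if schedule.guides_layer(i)]
```

Keeping both choices in one place makes it straightforward to sweep either axis independently when tuning for a new domain pair.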

The scaling of the attention maps using $\beta$ from Equation 6 enhances signals for particularly small objects, ensuring they receive adequate focus. Applying this scalar to background tokens improves semantic consistency, even though these tokens are not explicitly optimized to contain the specified object knowledge. ControlNet provides a mechanism to preserve the semantics of the source image by scaling the concatenated feature map with a factor known as the control scale during the decoding process. However, we observed that increasing this value beyond one degrades image quality. Conversely, decreasing this value when there are large domain gaps, such as between synthetic grape images and Borden night scenes, yields better results by preventing the output from overly adhering to the source domain.
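The effect of the control scale can be sketched as follows. This is a simplified NumPy stand-in, assuming the control residual is scaled before it is merged with the skip features and concatenated with the decoder state, as in common ControlNet implementations; the function name and tensor shapes are hypothetical, and the exact placement of the scale in our pipeline may differ.

```python
import numpy as np

def decoder_block_input(hidden, skip, control, control_scale=1.0):
    """Build the input of one decoder block.

    hidden, skip, control: arrays of shape (C, H, W). The control residual
    is weighted by control_scale, added to the skip features, and the result
    is concatenated with the current decoder state along the channel axis.
    """
    return np.concatenate([hidden, skip + control_scale * control], axis=0)

# Toy tensors (assumption): a scale below one weakens the pull toward the source image.
hidden = np.ones((2, 4, 4))
skip = np.zeros((3, 4, 4))
control = np.ones((3, 4, 4))
weak = decoder_block_input(hidden, skip, control, control_scale=0.5)
off = decoder_block_input(hidden, skip, control, control_scale=0.0)
```

With a scale of zero the decoder sees only its own skip features, so the source image no longer constrains the output; intermediate values trade source fidelity against adaptation to the target domain.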

6 Limitations

Our method faces challenges when translating small objects, as object guidance via attention scores is constrained by the resolution of the attention map. If an object is too small, the corresponding signal may be too weak to be effectively captured, leading to reduced visibility in the final output. Our method also struggles with unseen examples. For instance, if an object appears in the synthetic image but has no counterpart in the real images, the model fills the gap with plausible details drawn from its learned prior. The result stays coherent with the training distribution but may contain hallucinations if the object differs significantly from those seen during training.
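The resolution constraint can be made concrete with a short calculation. The sketch below counts how many attention-map cells a pixel-space bounding box overlaps; the 512-pixel image size and the 16×16 and 64×64 map sizes are assumptions chosen for illustration, not values taken from our configuration, and the function name is hypothetical.

```python
import math

def attention_cells_covered(box, image_size, map_size):
    """Number of attention-map cells overlapped by a box (x1, y1, x2, y2) in pixels."""
    cell = image_size / map_size  # pixels per attention cell along one axis
    x1, y1, x2, y2 = box
    cols = math.ceil(x2 / cell) - math.floor(x1 / cell)
    rows = math.ceil(y2 / cell) - math.floor(y1 / cell)
    return max(cols, 0) * max(rows, 0)

# A 10x10-pixel object in a 512x512 image (assumed sizes).
small_box = (100, 100, 110, 110)
cells_coarse = attention_cells_covered(small_box, image_size=512, map_size=16)
cells_fine = attention_cells_covered(small_box, image_size=512, map_size=64)
```

Under these assumptions the object falls inside a single 32×32-pixel cell of the coarse map and covers about a tenth of that cell's area, so its contribution to the attention score is easily dominated by the surrounding background.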

A potential solution for improving small object translation and overall image quality is enabling guidance for multiple objects, which requires incorporating multi-object knowledge within the text embedding. Our proposed method currently relies on labels for a single class, but extending the approach to optimize for multiple classes could make multi-object guidance feasible.

Despite applying these image translation methods, domain gaps still exist between the generated target images and the ground truth target images. Our visual analysis revealed that the most influential factors contributing to this gap are perspective, brightness, and plant type. These parameters are not explicitly accounted for during the diffusion process, as they are not defined on a per-image basis.

7 Conclusion and Future Work

In this paper, we proposed AGILE (Attention-Guided Image and Label Translation for Efficient Cross-Domain Plant Trait Identification), a diffusion-based framework aimed at improving semantic consistency for cross-domain image translation tasks relevant to plant trait identification. By leveraging pretrained diffusion models, optimized text embeddings, and attention guidance during the denoising process, AGILE effectively maintains object structure and semantics even when there are significant domain gaps.

Our experimental results demonstrate that AGILE consistently outperforms existing image translation methods across multiple datasets. The quantitative results highlight improvements in object detection performance, while the qualitative comparisons show enhanced realism and consistency in generated images. Furthermore, ablation studies validate the importance of text optimization and attention guidance, emphasizing their complementary roles in preserving semantic alignment between source and target domains.

However, our approach still faces challenges when translating small objects or generalizing to unseen examples. Additionally, existing domain gaps such as perspective, brightness, and plant type are not fully addressed during the diffusion process. While the domain-translated images generated by AGILE may not completely surpass the performance of using real, labeled target images, they can still provide valuable information for pretraining a backbone model. This pretraining can enhance the model’s ability to extract relevant features, which can then be further fine-tuned using a limited set of labeled images from the target domain.

Future work will focus on extending our method to incorporate multi-object guidance through enhanced text embeddings and improving robustness to various domain gaps. Furthermore, we will explore more efficient optimization techniques to enhance performance and generalization across diverse agricultural datasets.

References

  • [1] Brian N. Bailey. Helios: A scalable 3D plant and environmental biophysical modeling framework. 10. Publisher: Frontiers.
  • [2] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving Image Generation with Better Captions.
  • [3] Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models.
  • [4] Junde Chen, Defu Zhang, Yaser A Nanehkaran, and Dele Li. Detection of rice plant diseases based on deep transfer learning. Journal of the Science of Food and Agriculture, 100(7):3246–3256, 2020.
  • [5] Seokju Cho, Sunghwan Hong, and Seungryong Kim. CATs++: Boosting Cost Aggregation With Convolutions and Transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6):7174–7194, 2023.
  • [6] Zhenghao Fei, Alex Olenskyj, Brian N. Bailey, and Mason Earles. Enlisting 3D Crop Models and GANs for More Data Efficient and Generalizable Fruit Detection. In 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 1269–1277, Montreal, BC, Canada, 2021. IEEE.
  • [7] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014.
  • [8] Kamal Gupta, Varun Jampani, Carlos Esteves, Abhinav Shrivastava, Ameesh Makadia, Noah Snavely, and Abhishek Kar. ASIC: Aligning Sparse in-the-wild Image Collections, 2023. arXiv:2303.16201.
  • [9] Eric Hedlin, Gopal Sharma, Shweta Mahajan, Hossam Isack, Abhishek Kar, Andrea Tagliasacchi, and Kwang Moo Yi. Unsupervised Semantic Correspondence Using Stable Diffusion, 2023. arXiv:2305.15581 [cs].
  • [10] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium.
  • [11] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks, 2018.
  • [12] Yu Jiang and Changying Li. Convolutional neural networks for image-based high-throughput plant phenotyping: A review. Plant Phenomics, 2020:4152816, 2020.
  • [13] Tong Lei, Jan Graefe, Ismael K. Mayanja, Mason Earles, and Brian N. Bailey. Simulation of automatically annotated visible and multi-/hyperspectral images using the Helios 3D plant and radiative transfer modeling framework. 6:0189.
  • [14] Jiajia Li, Dong Chen, Xinda Qi, Zhaojian Li, Yanbo Huang, Daniel Morris, and Xiaobo Tan. Label-efficient learning in agriculture: A comprehensive review. Computers and Electronics in Agriculture, 215:108412, 2023.
  • [15] Yuzhen Lu and Sierra Young. A survey of public datasets for computer vision tasks in precision agriculture. 178:105760.
  • [16] Jiayi Ma, Xingyu Jiang, Aoxiang Fan, Junjun Jiang, and Junchi Yan. Image Matching from Handcrafted to Deep Features: A Survey. International Journal of Computer Vision, 129(1):23–79, 2021.
  • [17] Wan-Duo Kurt Ma, J. P. Lewis, Avisek Lahiri, Thomas Leung, and W. Bastiaan Kleijn. Directed Diffusion: Direct Control of Object Placement through Attention Guidance, 2023. arXiv:2302.13153 [cs].
  • [18] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text Inversion for Editing Real Images using Guided Diffusion Models, 2022. arXiv:2211.09794.
  • [19] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In Proceedings of the 38th International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
  • [20] Fernando Palacios, Maria P. Diago, Pedro Melo-Pinto, and Javier Tardaguila. Early yield prediction in different grapevine varieties using computer vision and machine learning. Precision Agriculture, 24(2):407–435, 2023.
  • [21] Gaurav Parmar, Taesung Park, Srinivasa Narasimhan, and Jun-Yan Zhu. One-step image translation with text-to-image models.
  • [22] Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image translation, 2023.
  • [23] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis, 2023.
  • [24] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-Shot Text-to-Image Generation, 2021.
  • [25] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents, 2022.
  • [26] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks.
  • [27] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022.
  • [28] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.
  • [29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
  • [30] Sihan Xu, Ziqiao Ma, Yidong Huang, Honglak Lee, and Joyce Chai. CycleNet: Rethinking cycle consistency in text-guided diffusion for image manipulation.
  • [31] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023.
  • [32] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks, 2020.