AGILE: A Diffusion-Based Attention-Guided Image and Label Translation for Efficient Cross-Domain Plant Trait Identification

Earl Ranario, Lars Lundqvist, Heesup Yun, Brian N. Bailey, J. Mason Earles
University of California, Davis
{ewranario, llund, hspyun, bnbailey, jmearles}@ucdavis.edu

Abstract

Semantically consistent cross-domain image translation facilitates the generation of training data by transferring labels across different domains, making it particularly useful for plant trait identification in agriculture. However, existing generative models struggle to maintain object-level accuracy when translating images between domains, especially when domain gaps are significant. In this work, we introduce AGILE (Attention-Guided Image and Label Translation for Efficient Cross-Domain Plant Trait Identification; code available at https://github.com/plant-ai-biophysics-lab/AGILE), a diffusion-based framework that leverages optimized text embeddings and attention guidance to semantically constrain image translation. AGILE utilizes pretrained diffusion models and publicly available agricultural datasets to improve the fidelity of translated images while preserving critical object semantics. Our approach optimizes text embeddings to strengthen the correspondence between source and target images and guides attention maps during the denoising process to control object placement. We evaluate AGILE on cross-domain plant datasets and demonstrate its effectiveness in generating semantically accurate translated images. Quantitative experiments show that AGILE enhances object detection performance in the target domain while maintaining realism and consistency. Compared to prior image translation methods, AGILE achieves superior semantic alignment, particularly in challenging cases where objects vary significantly or domain gaps are substantial.

1 Introduction

Figure 1: We use labels from synthetic data to gain object domain knowledge represented by text labels by optimizing for semantic correspondences between the source and target domains.

1.1 Computer Vision in Agriculture

Computer vision is used in a wide range of agricultural tasks such as plant phenotyping, disease detection, and yield estimation. Such tasks involve the identification of plant-specific traits or characteristics that reflect the condition of the plant. This allows farmers or plant breeders to identify key phenotypic traits in plants that improve decision-making. For example, Palacios et al. [20] implemented a segmentation model to detect visible berries and canopy features to predict the yields of different varieties of grapes. Palacios et al. further state that their results would improve if a larger number of diverse data points were used to build the models. Chen et al. [4] detected rice plant diseases based on deep transfer learning, utilizing pre-trained models that were trained on a large number of images and fine-tuned on a smaller, domain-specific dataset.

Although machine learning tools have allowed for high-throughput identification of plant traits, the performance of these tools is limited by the availability of labeled data and resource constraints [12]. There is an emphasis on expanding public image datasets for agricultural tasks, but this alone may not adequately address the complexities of multi-domain scenarios [15]. For instance, a model trained on one domain may not generalize well to another domain due to differences in lighting, camera angle, or plant species. The general approach is to label new data tailored to specific domains, but this process is costly and time-consuming [14]. However, recent advances in artificial intelligence (AI) provide new capabilities in improving the efficiency of labeling.

1.2 Domain Translation

Some of the many applications of generative AI to improve labeling efficiency include image-to-image translation, text-to-image generation, and style transfer. These methods have been applied to a variety of applications, including medical imaging, autonomous driving, and video games. For the case of image-to-image translation, by using existing labeled public datasets, it is possible to translate the images to another domain while maintaining the semantics of the labeled object, where semantics refers to the structural features of an object, such as its shape, color, and positional relationships within the image.

Generative Adversarial Networks (GANs) [7, 32] are an early example of image-to-image translation for unpaired images. More recently, Stable Diffusion models [27, 23] have enabled high-quality image generation through various conditioning mechanisms. Although diffusion models such as DALL-E [24, 25, 2] and Stable Diffusion are capable of generating complex scenes, using them for semantically constrained image-to-image translation comes with challenges. First, for most diffusion-based models, training a model that can translate an image from one domain to another requires paired images, which are difficult to obtain. Second, images collected in the field do not come with text descriptions. For instance, the prompt “grapes in a vineyard” could output different types of grapes in various vineyard settings. Third, the generated images may not be semantically accurate or may not contain the desired object. For the case of image-to-image translation, the user may want to keep an object in a specific location based on the semantics of the input image.

Therefore, this paper studies how to efficiently utilize existing labeled images to improve semantic accuracy in image-to-image translation tasks, specifically for plant trait identification. We propose a diffusion-based method, Attention-Guided Image and Label Translation for Efficient Cross-Domain Plant Trait Identification (AGILE), which utilizes existing pretrained diffusion models and public agricultural datasets to generate labeled images for specific domains. The idea is to use labels generated for supervised tasks to gain object domain knowledge represented by text labels. Through this, we should be able to find semantic correspondences between the source and target domains and guide generation toward desired regions. Our contributions include:

• Images collected in real-world settings often lack accompanying text descriptions and semantic alignment with text inputs. By optimizing prompt embeddings, we leverage existing labeled images to emphasize regions of interest, improving semantic knowledge. This allows us to have text-image correspondence with few labeled images.
• With semantically aware, optimized prompt embeddings, we can control object semantics in the target domain through attention guidance during the denoising process of a diffusion-based model. This allows labels to be transferable from the source to the target domain.
• We compare the performance between the source, target, and generated data for object detection tasks.

2 Related Work

2.1 Image-to-Image Translation

Image-to-image translation is the task of translating one possible representation of a scene into another, which was explored early with GANs [11]. Within the GAN framework, a generator creates an output image that aims to resemble a target image, while the discriminator evaluates how real or fake the output looks compared to the actual data. Fei et al. introduced CropGAN, which enlists 3D crop models and GANs to semantically constrain fruit position and geometry [6]. CropGAN performs effectively when the source and target domains are similar, but its performance deteriorates significantly when applied to vastly different domains.

For the case of classifier-free diffusion-based models, conditional inputs can be included during the generation process, allowing for more creativity or control. However, diffusion models can dramatically change the content of the desired image and introduce unexpected changes in regions of interest. Parmar et al. developed a zero-shot image-to-image translation method using the Stable Diffusion framework [22]. Their proposed method focuses on changing desired regions while keeping unrelated regions consistent by editing the text embedding space and using cross-attention guidance. While their approach effectively maintains the structure of the desired region, it struggles with handling complex features. Moreover, their method is limited to translating only a single object per image. Parmar et al. additionally addressed the slow processing speed and the reliance on paired data for model fine-tuning with CycleGAN-turbo and pix2pix-turbo by adapting a single-step diffusion model to new domains through adversarial learning objectives [21]. Xu et al. introduced CycleNet, which incorporates cycle consistency into diffusion models to regularize image-to-image manipulation without the need for paired data [30].

Zhang et al. developed a method, called ControlNet, to add conditional control for text-to-image diffusion models [31]. ControlNet locks the production ready large diffusion models and reuses their pretrained encoding layers to learn a diverse set of conditional controls. It also contains “zero convolutions” (zero-initialized convolutional layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect finetuning. Their method allows for various control methods to be used and is very fast to converge during training. However, as is the case for most diffusion-based methods, paired data needs to be accessible to streamline the training process.
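The role of the zero-initialized layers can be illustrated with a toy example. The sketch below is not ControlNet's implementation: it replaces the network blocks with hypothetical stand-in functions (`frozen_block`, `trainable_copy`) and uses 1x1 convolutions on a single pixel, assuming the commonly described structure in which the trainable branch is wrapped between two zero convolutions and added to the locked branch. It shows only that, at initialization, the conditioned output equals the locked model's output.

```python
import random

def conv1x1(x, weight, bias):
    """Apply a 1x1 convolution to the channel vector of a single pixel."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def zero_conv(channels):
    """Weights and biases of a zero-initialized 1x1 convolution."""
    return [[0.0] * channels for _ in range(channels)], [0.0] * channels

def frozen_block(x):
    """Hypothetical stand-in for a locked pretrained block."""
    return [2.0 * v + 1.0 for v in x]

def trainable_copy(x):
    """Hypothetical stand-in for the trainable copy of that block."""
    return [2.0 * v + 1.0 for v in x]

def controlled_block(x, condition, z_in, z_out):
    """Locked output plus a residual that passes through two zero convolutions."""
    c = conv1x1(condition, *z_in)
    h = trainable_copy([a + b for a, b in zip(x, c)])
    residual = conv1x1(h, *z_out)
    return [a + b for a, b in zip(frozen_block(x), residual)]

random.seed(0)
x = [random.uniform(-1, 1) for _ in range(4)]
condition = [random.uniform(-1, 1) for _ in range(4)]
z_in, z_out = zero_conv(4), zero_conv(4)

# Before any training step the condition has no effect on the output.
y_start = controlled_block(x, condition, z_in, z_out)
print(y_start == frozen_block(x))
```

Because the residual starts at exactly zero, fine-tuning begins from the behavior of the pretrained model and the conditional signal is introduced gradually as the zero-initialized weights move away from zero.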

2.2 Semantic Correspondence

Semantic correspondence is a core problem in computer vision that concerns finding locations in different images that share the same semantic meaning [16, 5, 8]. Hedlin et al. leverage semantic knowledge within diffusion models to find locations in multiple images that have the same semantic meaning [9]. Their key insight is that since recent diffusion models can generate photo-realistic images from text prompts only, there must be knowledge about semantic correspondences built into them. Therefore, they deduced that one may not need any ground-truth semantic correspondences between image pairs to find semantic correspondences. By exploiting the attention maps of latent diffusion models, one can identify the prompt corresponding to a particular image location. Given arbitrary input images, these attention maps should respond to the semantics of the prompt. They proposed a method, inspired by recent prompt-to-prompt text-based image editing [18], to first optimize a randomly initialized text embedding to maximize the cross-attention score at a query location. Then, they find the semantically corresponding location in another image by using the pixel attaining the maximum attention map score within the target image. Attend-and-Excite [3] is an example of a similar approach. We build upon this method by removing the reliance on paired images and instead using the model’s underlying knowledge of the object based on the given prompt.

2.3 Cross-Attention Guidance

Content preservation through cross-attention guidance involves maintaining the semantics of an image before and after diffusion translation by ensuring that the text-image cross-attention map remains consistent. Ma et al. demonstrated cross-attention guidance in their method, Directed Diffusion, which improves upon diffusion models by providing direct control over object placement within generated images [17]. Their approach uses text-image cross-attention maps to guide the positioning of labeled or specified objects while maintaining contextual coherence. However, the constraint is that the text prompts provided must have text-image correspondence, which is not readily available for images collected in the field. Therefore, prior to attention guidance, we must first optimize text embeddings to find semantic correspondences between the text prompt and target images.

3 Experimental Setup

3.1 Datasets

Each set of images is derived from AgML (https://github.com/Project-AgML/AgML), a machine learning library for agricultural datasets. AgML provides labeled datasets of different plants in various domains. For example, we use synthetic grape and flower datasets generated from Helios [1], a 3D Plant and Environment Biophysical Modeling Framework, as our source domain. Typically, synthetic images are treated as a source domain because an unlimited number of labeled images can be generated. Additionally, plant modeling parameters can be tweaked to close the domain gap, in exchange for resources and compute time, to allow for improved generative results. As seen in Figure 3, we train and evaluate our method on object detection tasks and constrain our translation within the same plant.

3.2 Method

We build the proposed method on top of ControlNet [31], which enables conditional control of pretrained diffusion models by incorporating external control signals during the generation process. Our method first includes finding semantic correspondences between source and target images using pretrained diffusion models. In order to do that, we first optimize text embeddings using a query attention map generated from the labels of the source images. This allows us to have text-image correspondence without the need for paired images. Then, we use the optimized text embeddings to highlight regions of interest in the target domain. We further guide the attention maps to highlight the desired regions using the same set of query attention maps. This allows us to control the semantics of the target image through attention guidance during the denoising process of a diffusion-based model.

In diffusion models [19], during the forward process, an image $I$ is encoded into its latent representation, $x_{0}$. Noise is gradually added to $x_{0}$ for $T$ timesteps, making the data increasingly noisy until it becomes pure noise at $t=T$. The training objective is to minimize the discrepancy between the model’s predicted noise $\epsilon_{\theta}(x_{t},t)$ and the true noise $\epsilon$. The model can be further conditioned on a text prompt $y$ by providing an embedding $e=\tau_{\theta}(y)$ using a text encoder $\tau_{\theta}$:

$$L=\mathbb{E}_{x_{0},\epsilon,t}\big[\|\epsilon-\epsilon_{\theta}(x_{t},t,e)\|_{2}^{2}\big]. \tag{1}$$
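As a concrete illustration of this objective, the sketch below noises a toy latent and evaluates a single-sample estimate of the loss. It assumes the standard closed-form forward step $x_{t}=\sqrt{\bar{\alpha}_{t}}\,x_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\epsilon$, which is not written out in this paper; the function names and the value of `alpha_bar_t` are illustrative, and a real model would replace the hand-written predictions with a network output.

```python
import math
import random

def add_noise(x0, eps, alpha_bar_t):
    """Assumed closed-form forward step: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps."""
    scale_signal = math.sqrt(alpha_bar_t)
    scale_noise = math.sqrt(1.0 - alpha_bar_t)
    return [scale_signal * a + scale_noise * e for a, e in zip(x0, eps)]

def noise_prediction_loss(eps, eps_pred):
    """Squared L2 distance between true and predicted noise for one sample."""
    return sum((a - b) ** 2 for a, b in zip(eps, eps_pred))

random.seed(0)
x0 = [0.5, -1.0, 0.25]                      # toy latent
eps = [random.gauss(0.0, 1.0) for _ in x0]  # true noise
x_t = add_noise(x0, eps, alpha_bar_t=0.7)   # noisy latent fed to the denoiser

# A perfect predictor attains zero loss; predicting zeros pays the noise energy.
loss_perfect = noise_prediction_loss(eps, eps)
loss_zero = noise_prediction_loss(eps, [0.0] * len(eps))
print(loss_perfect, loss_zero)
```

Averaging this quantity over latents, noise draws, and timesteps gives the expectation in Equation 1.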

In the reverse process, using Denoising Diffusion Implicit Models (DDIM) [28], a series of denoising steps, parameterized by $\theta$, are applied to predict $x_{t-1}$, given by:

$$x_{t-1}=\sqrt{\bar{\alpha}_{t-1}}\,f_{\theta}(x_{t},t)+\sqrt{1-\bar{\alpha}_{t-1}}\,\epsilon_{\theta}(x_{t},t), \tag{2}$$

where $\bar{\alpha}_{t-1}$ is the noise scaling factor, $f_{\theta}(x_{t},t)$ is the model’s predicted denoised version of $x_{0}$, and $\epsilon_{\theta}(x_{t},t)$ is the model’s prediction of the noise at step $t$. The denoiser for Stable Diffusion based models is a transformer architecture [29] utilizing a series of self-attention and cross-attention layers. For latent $x_{t}$ at the $t$-th timestep, we obtain the attention maps from the query $Q_{l}$ and key $K_{l}$ of each $l$-th layer of the decoder. The cross-attention for a given sample is defined as:

$$M_{l}(x_{t}\mid t,e,I_{i})=\text{Attn}(Q_{l},K_{l})=\text{softmax}\left(\frac{Q_{l}K_{l}^{\top}}{\sqrt{d_{k}}}\right), \tag{3}$$

where $d_{k}$ is the dimension of the key vectors. We first define our source and target models by fine-tuning the Stable Diffusion [23, 27] model, SD1.5, until the output matches the input image, as seen in Parts A and C of Figure 2. During this fine-tuning process, we applied geometric and color augmentations to the images concatenated during the denoising process, which was not originally implemented in ControlNet. Doing this prevents the model from overfitting to the training set.
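The cross-attention map of Equation 3 can be sketched on a toy input. The example assumes the usual convention that queries come from spatial latent positions and keys come from text tokens, so each row of the result is a distribution over tokens and each column, reshaped to the latent grid, is the attention map of one token. The helper names and the 2x2 grid are illustrative only.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]

def cross_attention_map(queries, keys):
    """softmax(Q K^T / sqrt(d_k)); rows index pixels, columns index tokens."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(q * k for q, k in zip(qv, kv)) for kv in keys])
        for qv in queries
    ]

def token_map(attn, token_index, height, width):
    """Reshape one token's column of the attention matrix into an H x W map."""
    column = [row[token_index] for row in attn]
    return [column[r * width:(r + 1) * width] for r in range(height)]

# Four pixels of a 2x2 latent grid (queries) and two text tokens (keys).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
keys = [[2.0, 0.0], [0.0, 2.0]]

attn = cross_attention_map(queries, keys)
object_map = token_map(attn, token_index=0, height=2, width=2)
print(object_map)
```

In this toy case the first pixel aligns with the first token and therefore receives the largest value in `object_map`, while the all-zero pixel splits its attention evenly between the two tokens.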

3.2.1 Text Optimization

Table 1: Comparison of Fréchet Inception Distance (FID) and Average Precision (AP) metrics for various image translation methods across three domain translation tasks: Synthetic Grape to Borden Day, Synthetic Grape to Borden Night, and Synthetic Flower to Real Flower. Among the translation methods, ours achieves the highest AP on all three tasks and the lowest FID on the two grape tasks. All methods were fine-tuned for each task. CropGAN was provided few-shot examples of labeled target images.

Method	Syn. Grape to Borden Day		Syn. Grape to Borden Night		Syn. Flower to Real Flower
	FID $\downarrow$	AP $\uparrow$	FID $\downarrow$	AP $\uparrow$	FID $\downarrow$	AP $\uparrow$
Synthetic Only	265.96	0.12	363.87	0.00	175.94	0.11
Real Only	92.89	0.51	129.27	0.69	81.93	0.65
CropGAN [6]	131.18	0.28	192.12	0.17	202.96	0.24
CycleGAN-turbo [21]	146.64	0.09	276.94	0.00	154.21	0.33
ControlNet [31]	140.60	0.09	198.90	0.00	183.53	0.26
Ours	123.08	0.33	187.18	0.32	156.44	0.35

Images collected in the field do not come with a paired text description. Therefore, we cannot collect semantic correspondences between images using paired data. To address this, we propose to optimize provided text embeddings using existing labeled images for improved semantic knowledge, as summarized in Part B of Figure 2. The labeled images, the source domain, must contain the object of interest that makes semantic correspondence between the source and target domain possible. A query attention map, $M_{s}$ , is created for each labeled image containing multiple Gaussian markers centered within the labeled bounding box. Each Gaussian marker, $G(x,y)$ , is defined as:

$$G(x,y)=\exp\left(-\frac{(x-x_{c})^{2}}{2\sigma_{x}^{2}}-\frac{(y-y_{c})^{2}}{2\sigma_{y}^{2}}\right), \tag{4}$$

where $(x_{c},y_{c})$ is the center of the Gaussian marker and $\sigma_{x}^{2}$ and $\sigma_{y}^{2}$ are the variances. The variances can be scaled by the dimensions of the bounding boxes. After creating the query attention map for each labeled image, we condition the Stable Diffusion model with the text embedding $e$ and optimize the text embedding at a specific timestep. As the Gaussian regions represent the desired region of focus, we can optimize the embedding $e$ so that it reproduces the desired attention map. We extract the cross-attention map from timestep 30 and optimize the first token only using an MSE loss:

$$e^{*}=\arg\min_{e}\frac{1}{N}\sum_{i=1}^{N}\|M_{l}(x_{t}\mid t,e,I_{i})-M_{s}(I_{i})\|_{2}^{2}. \tag{5}$$

The attention response from each layer in the decoding stage exhibits different levels of semantic knowledge [18]. Therefore, to summarize these responses, we average the attention maps for token 1 across all the decoding layers and attention heads. At the end of each optimization step, we obtain an optimized text embedding $e^{*}$ that highlights the desired regions of interest in the source domain. Once $e^{*}$ is obtained from Equation 5, it is used as input for subsequent optimization steps using $M_{l}(x_{t}\mid t,e^{*},I_{i})$. We have found that at least five images are needed to get the desired optimized embedding.
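The query attention map of Equation 4 and the per-image term of Equation 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: it places one marker at the center of each bounding box, sets each standard deviation to a hypothetical fraction `sigma_scale` of the box size, and merges overlapping markers with a pixel-wise maximum. The attention map being compared would come from the diffusion model and is replaced here by a uniform placeholder.

```python
import math

def gaussian_query_map(height, width, boxes, sigma_scale=0.25):
    """Build a query map with one Gaussian marker per (x_min, y_min, x_max, y_max) box."""
    grid = [[0.0] * width for _ in range(height)]
    for x_min, y_min, x_max, y_max in boxes:
        x_c, y_c = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
        # Assumption: spread proportional to the box dimensions.
        sigma_x = max(sigma_scale * (x_max - x_min), 1e-6)
        sigma_y = max(sigma_scale * (y_max - y_min), 1e-6)
        for y in range(height):
            for x in range(width):
                g = math.exp(
                    -((x - x_c) ** 2) / (2.0 * sigma_x ** 2)
                    - ((y - y_c) ** 2) / (2.0 * sigma_y ** 2)
                )
                # Assumption: overlapping markers are merged with a maximum.
                grid[y][x] = max(grid[y][x], g)
    return grid

def mse(map_a, map_b):
    """Mean squared error between two maps of equal size."""
    diffs = [(a - b) ** 2 for ra, rb in zip(map_a, map_b) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)

# One labeled box on an 8x8 grid; the marker peaks at the box center (4, 4).
query = gaussian_query_map(8, 8, [(2, 2, 6, 6)])
uniform_attention = [[0.5] * 8 for _ in range(8)]  # placeholder for a model attention map
print(query[4][4], mse(uniform_attention, query))
```

In the optimization of Equation 5, the loss computed by `mse` would be averaged over the labeled images and minimized with respect to the text embedding, which requires differentiating through the attention layers and is therefore omitted from this sketch.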

3.2.2 Attention Guidance

The following steps are applied after defining the target model as shown in Part C in Figure 2. Our approach builds on ControlNet by concatenating the source and synthetic images during the denoising process to preserve semantic consistency, while skip connections help maintain fine details [21]. In contrast to text optimization, we target the first three cross-attention layers of the decoding block, which are closest to the high-level latent feature space. Within each of these layers, we aggregate the attention maps by averaging over all attention heads and spatial dimensions to obtain a single representative map per token, $M_{l,token}$ . Since we employ single-word text embeddings, we designate the first token as the “object” token and treat the remaining tokens as “background” tokens. This allows us to apply different scaling weights to object and background tokens. Prior to scaling, we compute the mean and standard deviation of each map, $M_{l,token}$ , for the object and background token separately, and normalize the query map based on these statistics. As a result, we get normalized representative maps per token for each selected decoding layer, $\tilde{M}_{l,token}$ . Since each layer produces attention maps at different resolutions, the representative maps are interpolated to match the dimensions of the query map, ensuring consistency during the attention guidance process.

We add the final query map $\tilde{M}_{s}(I_{i})$ to the final attention map $\tilde{M}_{l,token}$ for each token index as follows:

$$M^{edit}_{l,token}=\beta\,\tilde{M}_{l,token}(x_{t}\mid t,e^{*},I_{i})+\tilde{M}_{s}(I_{i}), \tag{6}$$

where $\beta$ is the scaling factor that can be different for object or background tokens. This editing approach can be stopped early to prevent harsh contrast differences with the target domain, as seen in Figure 4, though this can change depending on the dataset used.
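The edit in Equation 6 can be sketched for a single token and layer. The normalization details are only loosely specified above, so the sketch makes explicit assumptions: the attention map is standardized with its own mean and standard deviation, it is resized to the query-map resolution with nearest-neighbor interpolation, and the query map is passed in already normalized. All function names and the value of `beta` are illustrative.

```python
import math

def standardize(grid, eps=1e-8):
    """Subtract the mean and divide by the standard deviation of the map."""
    values = [v for row in grid for v in row]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [[(v - mean) / (std + eps) for v in row] for row in grid]

def resize_nearest(grid, height, width):
    """Nearest-neighbor resize to height x width (an assumed interpolation mode)."""
    src_h, src_w = len(grid), len(grid[0])
    return [
        [grid[r * src_h // height][c * src_w // width] for c in range(width)]
        for r in range(height)
    ]

def edit_attention(attn_map, query_map, beta):
    """M_edit = beta * normalized attention map + query map, at query resolution."""
    height, width = len(query_map), len(query_map[0])
    attn = resize_nearest(standardize(attn_map), height, width)
    return [
        [beta * a + q for a, q in zip(attn_row, query_row)]
        for attn_row, query_row in zip(attn, query_map)
    ]

# A 2x2 attention map from a coarse layer and a 4x4 query map.
attn_map = [[0.1, 0.3], [0.2, 0.4]]
query_map = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
edited = edit_attention(attn_map, query_map, beta=0.5)
print(edited[1][1])
```

Using a different `beta` for the object token than for the background tokens then amounts to calling `edit_attention` once per token with the corresponding weight.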

4 Experiments and Ablation Study

Our method leverages synthetic images and their corresponding labels to enforce semantic constraints during image-to-image translation. In our framework, the real images are treated as the target domain, but their labels are withheld during training to simulate an unpaired setting. To evaluate our approach, we compute the Fréchet Inception Distance (FID) [10] between the generated images and the target domain. For comparison, we also calculate the FID for the Synthetic Only and Real Only baselines by measuring the distance between their respective training sets and the target test set.
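For reference, FID [10] is commonly computed as the Fréchet distance between two Gaussians fitted to Inception features of the two image sets. The symbols below are introduced here for illustration: $\mu_{g},\Sigma_{g}$ denote the feature mean and covariance of the generated (or baseline training) images, and $\mu_{r},\Sigma_{r}$ those of the target test images.

```latex
\mathrm{FID} = \lVert \mu_{g} - \mu_{r} \rVert_{2}^{2}
  + \operatorname{Tr}\!\left( \Sigma_{g} + \Sigma_{r} - 2\left( \Sigma_{g}\Sigma_{r} \right)^{1/2} \right)
```

Lower values indicate that the two feature distributions are closer, which is why FID is reported with a downward arrow in Table 1.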

To demonstrate the practical utility of our generated images, we use them as training data for a Faster R-CNN [26] object detection model. Specifically, we train the detection model on the synthetic, real and generated images annotated with bounding boxes corresponding to grape clusters or flowers. After training, we test the model on real images from the target domain, comparing the predicted bounding boxes to the ground truth bounding boxes from the real dataset. We quantify detection performance using Average Precision (AP). Finally, we compare our results against recent unpaired image-to-image translation methods to highlight the effectiveness of our approach.
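The detection metric can be illustrated with a small sketch. The exact AP protocol is not specified here, so the example assumes a common variant: predictions are ranked by confidence, greedily matched to unmatched ground-truth boxes at an IoU threshold of 0.5, and AP is the area under the all-point interpolated precision-recall curve. It handles a single image and a single class; the function names and threshold are illustrative.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def average_precision(predictions, ground_truth, iou_threshold=0.5):
    """AP for one image; predictions is a list of (score, box) pairs."""
    if not ground_truth:
        return 0.0
    matched, hits = set(), []
    for _, box in sorted(predictions, key=lambda p: p[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truth):
            overlap = iou(box, gt)
            if idx not in matched and overlap > best_iou:
                best_iou, best_idx = overlap, idx
        is_hit = best_idx is not None and best_iou >= iou_threshold
        if is_hit:
            matched.add(best_idx)
        hits.append(is_hit)
    precisions, recalls, true_positives = [], [], 0
    for rank, is_hit in enumerate(hits, start=1):
        true_positives += is_hit
        precisions.append(true_positives / rank)
        recalls.append(true_positives / len(ground_truth))
    # Interpolate: precision at each rank is the best precision at any later rank.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap

ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
predictions = [
    (0.9, (0, 0, 10, 10)),    # correct detection
    (0.8, (50, 50, 60, 60)),  # false positive
    (0.7, (20, 20, 30, 30)),  # correct detection
]
ap = average_precision(predictions, ground_truth)
print(round(ap, 4))
```

A full evaluation would pool predictions over all test images before ranking them, but the matching and integration steps are the same.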

Table 2: Ablation study results summarizing the importance of text optimization prior to attention guidance.

Method	Syn. Grape to Borden Day	Syn. Grape to Borden Night	Syn. Flower to Real Flower
	AP $\uparrow$	AP $\uparrow$	AP $\uparrow$
No Text Optim.	0.10	0.01	0.13
No Guidance	0.14	0.00	0.25
Ours	0.33	0.32	0.35

4.1 Quantitative Results

We evaluate our approach against CropGAN [6], CycleGAN-turbo [21], and ControlNet [31]. Table 1 presents a quantitative comparison based on FID between the generated images and the target domain, as well as the AP when performing object detection on the target domain.

Our results demonstrate that AGILE outperforms the baseline models across all datasets in terms of AP. For example, when translating Synthetic Grape to Borden Day, AGILE achieves an AP of 0.33, an improvement over both CropGAN (0.28) and CycleGAN-turbo (0.09). Similarly, for the Synthetic Grape to Borden Night task, AGILE achieves an AP of 0.32, surpassing both CropGAN (0.17) and CycleGAN-turbo (0.00). Additionally, when translating Synthetic Flower to Real Flower, AGILE achieves an AP of 0.35, outperforming CropGAN (0.24) and slightly surpassing CycleGAN-turbo (0.33).

While AGILE excels in AP performance across all tasks, CycleGAN-turbo achieves a marginally better FID score for the Synthetic Flower to Real Flower translation task. Despite this, AGILE demonstrates superior performance in maintaining object semantics and producing high-quality images across the other domains. Furthermore, our approach surpasses CropGAN, a few-shot learning method, across all datasets, despite CropGAN being provided with some of the labeled target images.

We also conducted an ablation study to assess the impact of text optimization on attention guidance. As shown in Table 2, applying attention guidance without text optimization is not reliably helpful: on the grape tasks its AP (0.10 and 0.01) is close to that of the base ControlNet method (0.09 and 0.00 in Table 1), and on the flower task its AP (0.13) falls well below both ControlNet (0.26) and the variant using only text optimization (0.25).

4.2 Qualitative Comparisons

Figure 5 presents a visual comparison of the images generated by the different approaches. Our method preserves realism while ensuring semantic consistency, whereas CropGAN and CycleGAN-turbo fall short. For example, CropGAN struggles to distinguish between the sky and grape regions in the Synthetic Grape to Borden Night translation, and CycleGAN-turbo fails to maintain proper color consistency with the target domain. Overall, our approach effectively preserves the color, shape, and texture of both object and background compared to the other methods. For instance, although a grape in the synthetic image could look very different in the target image, our approach aims to maintain its shape despite differences between domains. However, it does encounter challenges when transferring very small objects that provide minimal signal, as seen in the flower translation task.

In our ablation study, we attempted attention guidance using a text embedding that is not optimized to have object knowledge, as seen in Figure 7. As a result, the method without text optimization failed to preserve the object’s shape and color in the target domain. Additionally, we observed an inconsistency in the background color between the full proposed method and the variant without text optimization, though this discrepancy may be unrelated. These results indicate that text optimization is a necessary preliminary step prior to attention guidance. Figure 6 shows that without attention guidance, the generated image fails to translate the labeled objects and struggles to maintain object color consistency with the target domain.

5 Discussion

The timing and location for optimizing text embeddings and guiding attention maps depend on the specific timestep chosen for applying edits or optimizations during the denoising process, as well as the decoding layers targeted for modification. We optimize at $t=30$ while noise is still present, which helps the optimized text embedding remain robust against variability in images. Furthermore, we edit only the first three layers to avoid object distortion or inconsistent colors in the final image. Since attention guidance involves manipulating the attention scores, excessive adjustments could lead to unwanted artifacts or background inconsistency.

Additionally, we found that different layers in the decoding block attend to different areas of the image. For example, early layers focus on interpreting the foreground or objects defined by the text embedding, while later layers contribute to background refinement and overall image coherence. Overall, there are two key axes to consider for successful text optimization and attention guidance: the denoising timestep and the decoding layers.

The scaling of the attention maps using $\beta$ from Equation 6 enhances signals for particularly small objects, ensuring they receive adequate focus. Applying this scalar to background tokens improves semantic consistency, even though these tokens are not explicitly optimized to contain the specified object knowledge. ControlNet provides a mechanism to preserve the semantics of the source image by scaling the concatenated feature map with a factor known as the control scale during the decoding process. However, we observed that increasing this value beyond one degrades image quality. Conversely, decreasing this value when there are large domain gaps, such as between synthetic grape images and Borden night scenes, yields better results by preventing the output from overly adhering to the source domain.

6 Limitations

Our method faces challenges when translating small objects, as object guidance via attention scores is constrained by the resolution of the attention map. If an object is too small, the corresponding signal may be too weak to be effectively captured, leading to reduced visibility in the final output. Our method also struggles with unseen examples. For instance, if an object is found in the synthetic image but is not found in the real image, the model will try to fill in the gaps with plausible details based on its learned prior, ensuring coherence with the training distribution. This means that the model synthesizes missing features in a way that aligns with its learned representations, potentially leading to hallucinations if the object is significantly different from those seen during training.

A potential solution for improving small object translation and overall image quality is enabling guidance for multiple objects, which requires incorporating multi-object knowledge within the text embedding. Our proposed method currently relies on labels for a single class, but extending the approach to optimize for multiple classes could make multi-object guidance feasible.

Despite applying these image translation methods, domain gaps still exist between the generated target images and the ground truth target images. Our visual analysis revealed that the most influential factors contributing to this gap are perspective, brightness, and plant type. These parameters are not explicitly accounted for during the diffusion process, as they are not defined on a per-image basis.

7 Conclusion and Future Work

In this paper, we proposed AGILE (Attention-Guided Image and Label Translation for Efficient Cross-Domain Plant Trait Identification), a diffusion-based framework aimed at improving semantic consistency for cross-domain image translation tasks relevant to plant trait identification. By leveraging pretrained diffusion models, optimized text embeddings, and attention guidance during the denoising process, AGILE effectively maintains object structure and semantics even when there are significant domain gaps.

Our experimental results demonstrate that AGILE consistently outperforms existing image translation methods across multiple datasets. The quantitative results highlight improvements in object detection performance, while the qualitative comparisons show enhanced realism and consistency in generated images. Furthermore, ablation studies validate the importance of text optimization and attention guidance, emphasizing their complementary roles in preserving semantic alignment between source and target domains.

However, our approach still faces challenges when translating small objects or generalizing to unseen examples. Additionally, existing domain gaps such as perspective, brightness, and plant type are not fully addressed during the diffusion process. While the domain-translated images generated by AGILE may not completely surpass the performance of using real, labeled target images, they can still provide valuable information for pretraining a backbone model. This pretraining can enhance the model’s ability to extract relevant features, which can then be further fine-tuned using a limited set of labeled images from the target domain.

Future work will focus on extending our method to incorporate multi-object guidance through enhanced text embeddings and improving robustness to various domain gaps. Furthermore, we will explore more efficient optimization techniques to enhance performance and generalization across diverse agricultural datasets.

References

[1] Brian N. Bailey. Helios: A scalable 3d plant and environmental biophysical modeling framework. 10. Publisher: Frontiers.
[2] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving Image Generation with Better Captions.
[3] Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models.
[4] Junde Chen, Defu Zhang, Yaser A Nanehkaran, and Dele Li. Detection of rice plant diseases based on deep transfer learning. Journal of the Science of Food and Agriculture, 100(7):3246–3256, 2020.
[5] Seokju Cho, Sunghwan Hong, and Seungryong Kim. CATs++: Boosting Cost Aggregation With Convolutions and Transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6):7174–7194, 2023.
[6] Zhenghao Fei, Alex Olenskyj, Brian N. Bailey, and Mason Earles. Enlisting 3D Crop Models and GANs for More Data Efficient and Generalizable Fruit Detection. In 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 1269–1277, Montreal, BC, Canada, 2021. IEEE.
[7] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014.
[8] Kamal Gupta, Varun Jampani, Carlos Esteves, Abhinav Shrivastava, Ameesh Makadia, Noah Snavely, and Abhishek Kar. ASIC: Aligning Sparse in-the-wild Image Collections, 2023. arXiv:2303.16201.
[9] Eric Hedlin, Gopal Sharma, Shweta Mahajan, Hossam Isack, Abhishek Kar, Andrea Tagliasacchi, and Kwang Moo Yi. Unsupervised Semantic Correspondence Using Stable Diffusion, 2023. arXiv:2305.15581 [cs].
[10] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium.
[11] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks, 2018.
[12] Yu Jiang and Changying Li. Convolutional neural networks for image-based high-throughput plant phenotyping: A review. Plant Phenomics, 2020:4152816, 2020.
[13] Tong Lei, Jan Graefe, Ismael K. Mayanja, Mason Earles, and Brian N. Bailey. Simulation of automatically annotated visible and multi-/hyperspectral images using the Helios 3D plant and radiative transfer modeling framework. 6:0189.
[14] Jiajia Li, Dong Chen, Xinda Qi, Zhaojian Li, Yanbo Huang, Daniel Morris, and Xiaobo Tan. Label-efficient learning in agriculture: A comprehensive review. Computers and Electronics in Agriculture, 215:108412, 2023.
[15] Yuzhen Lu and Sierra Young. A survey of public datasets for computer vision tasks in precision agriculture. 178:105760.
[16] Jiayi Ma, Xingyu Jiang, Aoxiang Fan, Junjun Jiang, and Junchi Yan. Image Matching from Handcrafted to Deep Features: A Survey. International Journal of Computer Vision, 129(1):23–79, 2021.
[17] Wan-Duo Kurt Ma, J. P. Lewis, Avisek Lahiri, Thomas Leung, and W. Bastiaan Kleijn. Directed Diffusion: Direct Control of Object Placement through Attention Guidance, 2023. arXiv:2302.13153 [cs].
[18] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text Inversion for Editing Real Images using Guided Diffusion Models, 2022. arXiv:2211.09794.
[19] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In Proceedings of the 38th International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
[20] Fernando Palacios, Maria P. Diago, Pedro Melo-Pinto, and Javier Tardaguila. Early yield prediction in different grapevine varieties using computer vision and machine learning. Precision Agriculture, 24(2):407–435, 2023.
[21] Gaurav Parmar, Taesung Park, Srinivasa Narasimhan, and Jun-Yan Zhu. One-step image translation with text-to-image models.
[22] Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image translation, 2023.
[23] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis, 2023.
[24] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-Shot Text-to-Image Generation, 2021.
[25] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents, 2022.
[26] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks.
[27] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022.
[28] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.
[29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
[30] Sihan Xu, Ziqiao Ma, Yidong Huang, Honglak Lee, and Joyce Chai. CycleNet: Rethinking cycle consistency in text-guided diffusion for image manipulation.
[31] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023.
[32] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks, 2020.