Single Image Reflection Removal via Self-Supervised Diffusion Models
Abstract
Reflections often degrade the visual quality of images captured through transparent surfaces, and reflection removal methods suffer from the shortage of paired real-world samples. This paper proposes a hybrid approach that combines cycle-consistency with denoising diffusion probabilistic models (DDPM) to effectively remove reflections from single images without requiring paired training data. The method introduces a Reflective Removal Network (RRN) that leverages DDPMs to model the decomposition process and recover the transmission image, and a Reflective Synthesis Network (RSN) that re-synthesizes the input image using the separated components through a nonlinear attention-based mechanism. Experimental results demonstrate the effectiveness of the proposed method on the SIR2 dataset, the Flash-Based Reflection Removal (FRR) dataset, and a newly introduced Museum Reflection Removal (MRR) dataset, showing superior performance compared to state-of-the-art methods.
Index Terms:
single image reflection removal, denoising diffusion models, cycle-consistency, artifact photography, heritage preservation, digital archiving
I Introduction
Reflections are a common occurrence in images captured through transparent surfaces, such as glass windows, mirrors, or protective covers[1]. These reflections often degrade the visual quality of the captured images, making them less useful for various computer vision and image processing tasks[2, 3]. Removing reflections from camera images is a challenging problem that has attracted significant attention in recent years[4] due to its practical importance in applications such as image enhancement[5, 6, 7], augmented reality [8] and computational photography[9, 10].
Denoising diffusion probabilistic models (DDPMs) have recently shown remarkable capabilities in modeling complex image distributions and generating high-quality images. Their gradual denoising process is particularly suited for reflection removal as it can progressively separate superimposed image components while maintaining structural coherence. Different from traditional methods that directly predict the reflection-free image, DDPMs can better handle the inherent ambiguity and multimodal nature of the reflection separation problem. Recent studies have demonstrated the effectiveness of combining diffusion models with transformer-based architectures for various image restoration tasks [11, 12, 13, 14, 15], suggesting the potential of leveraging such frameworks for complex degradation scenarios.
Existing methods for single image reflection removal (SIRR) can be broadly categorized into two groups: model-based methods and data-driven methods. Model-based methods [16, 17] typically rely on hand-crafted priors such as gradient sparsity to distinguish transmission from reflection. While these methods can handle simple reflections, they often fail to generalize to real-world images with complex reflections. In contrast, data-driven methods, especially those based on deep learning [18, 19, 20], learn to separate the layers from training data and have achieved promising results on real images. These methods can handle most reflections by learning from diverse datasets. Fig. 1 illustrates the superimposed reflections from multi-layered media in real-world scenarios. Most existing methods trained on synthetic datasets struggle to handle such cases due to the inherent difficulty in simultaneously capturing both reflection and transmission images as training samples.

To address these challenges, we propose a self-supervised approach that combines cycle-consistency with denoising diffusion probabilistic models (DDPM) for single image reflection removal. The proposed method introduces a Reflective Removal Network that leverages DDPMs to model the decomposition process and recover the transmission image, and a Reflective Synthesis Network that re-synthesizes the input image using the separated components through a nonlinear attention-based mechanism. We conduct extensive experiments on synthetic and real-world datasets, including a newly introduced Museum Reflection Removal (MRR) dataset. The proposed method outperforms state-of-the-art techniques, achieving PSNR gains of 0.50-3.84 dB, SSIM improvements of 0.005-0.074, LPIPS reductions of 0.003-0.057, and RAM decreases of 0.0050-0.0561 compared to the baselines on the SIR2, FRR, and MRR datasets.
The main contributions of this work are:
• We propose a reflection removal approach that combines cycle-consistency and DDPMs, effectively modeling the complex distribution of reflections and transmissions.
• We introduce an attention-based Reflective Synthesis Network that adaptively fuses the separated components to reconstruct the input image with high fidelity.
• We present the Museum Reflection Removal dataset, a diverse collection of real-world and synthetic images with reflections from various artistic contents, facilitating research on reflection removal in challenging scenarios.
• We propose a Reflection Artifact Measure to quantify the reflection artifacts in the recovered transmission image, providing a comprehensive assessment of reflection removal performance.
• We conduct extensive experiments and ablation studies, demonstrating the superiority of the proposed method over state-of-the-art techniques on multiple datasets.
II Related Work
Reflection removal from images has been an active area in computer vision. Existing approaches can be broadly categorized into two main groups: model-based methods and data-driven methods.
II-A Model-Based SIRR Methods
Model-based methods for single image reflection removal typically rely on handcrafted priors and optimization techniques to separate the reflection and transmission layers. Levin and Weiss [16] proposed a user-assisted approach that requires manual labeling of gradients to separate the layers. Li and Brown [17] developed a method based on optimizing a Laplacian data fidelity term and a gradient sparsity prior to remove reflections. These methods often struggle to handle complex reflections and require manual intervention, limiting their practicality.
Several pioneering approaches have laid the foundation for model-based reflection removal. Farid and Adelson [21] introduced independent components analysis to separate reflections and lighting, demonstrating the potential of statistical methods in layer decomposition. Building upon this, Szeliski et al. [22] developed an optimal approach for recovering layer images through constrained least squares and iterative refinement, though their method required multiple input images. A significant advancement came from Levin et al. [23], who proposed using local features and belief propagation to decompose a single image by minimizing the total amount of edges and corners. Kong et al. [24] later enhanced the separation quality by leveraging polarized images, introducing a constrained optimization framework that exploits mutually exclusive image information. While these model-based methods established important theoretical foundations and demonstrated promising results in controlled scenarios, they often struggle with complex real-world reflections due to their reliance on simplified assumptions about image statistics and gradient distributions.
II-B Data-Driven Methods
More recently, deep learning-based approaches have gained popularity for single-image reflection removal. Fan et al. [18] proposed a deep neural network architecture that learns to suppress reflections by explicitly modeling the ghosting effects. Zhang et al. [19] introduced a perceptual loss function and a reflection removal network to recover the background layer from a single image. Wei et al. [20] proposed an edge-guided reflection removal network that utilizes edge information to improve the separation of transmission and reflection layers. Yang et al. [25] developed a bidirectional deep network with a recurrent structure to iteratively refine the reflection removal results.
Several works have also explored the use of generative adversarial networks (GANs) for single-image reflection removal. Wan et al. [26] proposed a concurrent reflection removal network (CRRN) that utilizes a transformation-induced image formation model and a concurrent optimization strategy to remove reflections. Wen et al. [27] developed a dual attention network that exploits channel and spatial attention mechanisms to effectively remove reflections. Liu and Lu [28] proposed an unsupervised approach using GANs with self-supervision and cycle consistency constraints. Abiko and Ikehara [29] introduced a gradient constraint loss in a GAN framework to minimize the correlation between background and reflection layers.
To address the challenge of limited paired training data, recent works have explored cycle consistency and physically-based approaches. RahmaniKhezri et al. [30] developed an unsupervised method using cross-coupled deep networks that leverages semantic features for layer separation, demonstrating strong performance without requiring extensive paired datasets. Kim et al. [31] utilized physically-based rendering to synthesize training data, successfully reproducing spatially variant anisotropic effects of glass reflection. They further introduced a backtrack network for removing complicated ghosting and defocused effects. These methods demonstrate the importance of realistic training data synthesis and cycle-consistent learning in improving reflection removal performance.
Multiple-image reflection removal methods utilize additional images or priors to aid the reflection removal process. These methods typically require capturing multiple images of the same scene under different conditions, such as varying polarization states or focus settings. Schechner et al. [32] proposed a method that uses a sequence of images captured with different polarization filters to separate the reflection and transmission layers. Agrawal et al. [33] utilized a pair of flash and no-flash images to remove reflections by exploiting the differences in the reflective properties of the two images. Kong et al. [34] proposed a method that uses multiple images captured with different focus settings to remove reflections based on depth-of-field differences. Xue et al. [35] developed a computational approach that utilizes a pair of images captured from slightly different viewpoints to remove reflections by exploiting the motion parallax.
More recently, deep learning-based approaches have also been explored for multiple-image reflection removal. Fan et al. [18] proposed a deep architecture that learns to remove reflections from a pair of images captured under different polarization states. Li et al. [36] developed a deep neural network that utilizes a pair of images captured with different focus settings to remove reflections. Li et al. [37] introduced a deep learning framework that uses a sequence of images captured with different polarization angles to remove reflections and recover the underlying scene. Wieschollek et al. [38] proposed a kernel-based method that separates the reflection and transmission layers using a kernel estimation approach.
Despite the progress made by these data-driven methods, they often struggle to handle strong and complex reflections, resulting in artifacts or residual reflections in the recovered images. Our proposed approach aims to address these limitations by leveraging the power of cycle-consistency and denoising diffusion probabilistic models to effectively remove reflections from single images.
III Proposed Method
In this section, we present a self-supervised method for single image reflection removal, which combines cycle-consistency and denoising diffusion probabilistic models (DDPMs). As shown in Fig.2, the proposed approach consists of three main components: a Reflective Removal Network (RRN), a Reflective Synthesis Network (RSN) and a Transmission Discriminator (TD). The RRN utilizes the DDPM framework to model the decomposition process and recover the transmission image from the input image, while the RSN synthesizes the input image with reflections by combining the recovered transmission and reflection components through a nonlinear attention-based mechanism. The cycle-consistent framework enables the learning of mapping functions between the domains without paired training data.

III-A Cycle-Consistent Framework
The cycle-consistent framework enables the learning of mapping functions between two domains without paired training data. Let $\mathcal{I}$ denote the domain of camera images containing reflections, and let $\mathcal{T}$ and $\mathcal{R}$ represent the domains of the transmission images and reflection images, respectively. We define two mapping functions: $G: \mathcal{I} \rightarrow \mathcal{T} \times \mathcal{R}$ and $F: \mathcal{T} \times \mathcal{R} \rightarrow \mathcal{I}$. The function $G$ aims to decompose an input camera image $\mathbf{I}$ into its transmission image component $\hat{\mathbf{T}}$ and reflection image component $\hat{\mathbf{R}}$, while $F$ synthesizes the camera image from the recovered transmission image $\hat{\mathbf{T}}$ and reflection image $\hat{\mathbf{R}}$.
The cycle-consistency constraint enforces that the mapping functions $G$ and $F$ should be bijective, ensuring that the recovered components can be reconstructed back to the original input image. Mathematically, this constraint can be expressed as:
$$F(G(\mathbf{I})) \approx \mathbf{I} \tag{1}$$
Additionally, we consider the inverse direction, in which $F$ synthesizes the camera image directly from the transmission and reflection image components $\mathbf{T}$ and $\mathbf{R}$. The inverse cycle-consistency constraint is then formulated as:
$$G(F(\mathbf{T}, \mathbf{R})) \approx (\mathbf{T}, \mathbf{R}) \tag{2}$$
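The two constraints can be checked numerically on a toy pair of mappings. The sketch below is only an illustration: `decompose` and `synthesize` are hypothetical stand-ins for the decomposition and synthesis mappings (a fixed blend and a plain sum, whereas the paper's networks are learned and the synthesis is nonlinear), and the mean absolute error is an assumed distance.

```python
import numpy as np

def decompose(image, weight=0.7):
    """Toy stand-in for the decomposition mapping: returns (transmission, reflection)."""
    return weight * image, (1.0 - weight) * image

def synthesize(transmission, reflection):
    """Toy stand-in for the synthesis mapping: additive recombination (assumption)."""
    return transmission + reflection

def forward_cycle_error(image):
    """Mean absolute error between synthesize(decompose(I)) and I."""
    t_hat, r_hat = decompose(image)
    return float(np.mean(np.abs(synthesize(t_hat, r_hat) - image)))

def inverse_cycle_error(transmission, reflection):
    """Mean absolute error between decompose(synthesize(T, R)) and (T, R)."""
    t_hat, r_hat = decompose(synthesize(transmission, reflection))
    return float(np.mean(np.abs(t_hat - transmission)) + np.mean(np.abs(r_hat - reflection)))

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(3, 8, 8))

print(forward_cycle_error(image))               # near zero for this toy pair
print(inverse_cycle_error(image, 0.5 * image))  # nonzero: the toy decomposition does not invert the sum
```

The second call shows why the inverse constraint carries information of its own: a pair of mappings can satisfy the forward cycle while failing to recover an arbitrary (transmission, reflection) pair.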
III-B Reflective Removal Network
The Reflective Removal Network (RRN) is responsible for decomposing the input camera image into its transmission image and reflection image components. The proposed approach builds upon the denoising diffusion probabilistic model (DDPM) framework[39] to effectively model the decomposition process and handle complex reflections.

The DDPM framework learns a mapping from a noisy image distribution to the clean image distribution through a forward diffusion process and a reverse diffusion process. Let $\mathbf{I}$ denote the input camera image, and let $\mathbf{T}_0$ and $\mathbf{R}_0$ represent the corresponding ground truth transmission image and reflection image, respectively, as shown in Fig. 3. The DDPM framework defines a forward diffusion process that gradually adds Gaussian noise to the input image $\mathbf{I}_0 = \mathbf{I}$ over $T$ timesteps, resulting in a sequence of noisy images $\mathbf{I}_1, \ldots, \mathbf{I}_T$. The forward diffusion process can be described as a Markov chain with the following transition probability:
$$q(\mathbf{I}_t \mid \mathbf{I}_{t-1}) = \mathcal{N}\!\left(\mathbf{I}_t;\ \sqrt{1-\beta_t}\,\mathbf{I}_{t-1},\ \beta_t\,\mathrm{Id}\right) \tag{3}$$
where $\beta_t$ is a variance schedule that determines the amount of noise added at each timestep, and $\mathrm{Id}$ is the identity matrix.
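Composing the Markov transitions gives the standard closed form for sampling a noisy image at any timestep directly from the clean one. The sketch below illustrates it with NumPy under stated assumptions: the linear schedule ends at 0.02 over 1000 steps as in Section IV-B, while the start value `1e-4` is the common DDPM default and is an assumption here; `linear_beta_schedule` and `q_sample` are hypothetical helper names.

```python
import numpy as np

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule; beta_start is an assumed default."""
    return np.linspace(beta_start, beta_end, num_steps)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t given x_0 in closed form; t is a 0-indexed timestep.

    Returns the noisy image and the Gaussian noise that produced it.
    """
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

betas = linear_beta_schedule()
alpha_bar = np.cumprod(1.0 - betas)  # cumulative product of (1 - beta_s)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))  # toy image scaled to [-1, 1]
x_mid, _ = q_sample(x0, 499, alpha_bar, rng)
x_last, _ = q_sample(x0, 999, alpha_bar, rng)
```

At the last timestep the cumulative signal weight is close to zero, so `x_last` is almost pure Gaussian noise, which is what lets the reverse process start from noise.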
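In the context of reflection removal, we extend the forward diffusion process to generate two separate noisy image sequences: one for the transmission component $\mathbf{T}_0$ and another for the reflection component $\mathbf{R}_0$. This extension allows the RRN to model the decomposition of the input camera image into its constituent transmission and reflection components. The forward diffusion processes for the transmission and reflection components are defined as follows:
$$q(\mathbf{T}_t \mid \mathbf{T}_{t-1}) = \mathcal{N}\!\left(\mathbf{T}_t;\ \sqrt{1-\beta_t}\,\mathbf{T}_{t-1},\ \beta_t\,\mathrm{Id}\right) \tag{4}$$
$$q(\mathbf{R}_t \mid \mathbf{R}_{t-1}) = \mathcal{N}\!\left(\mathbf{R}_t;\ \sqrt{1-\beta_t}\,\mathbf{R}_{t-1},\ \beta_t\,\mathrm{Id}\right) \tag{5}$$
where $\mathbf{T}_t$ and $\mathbf{R}_t$ denote the noisy transmission and reflection images at timestep $t$, respectively, with the same variance schedule $\beta_t$ and identity covariance structure as in Eq. (3).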
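The objective of the Reflective Removal Network is to learn the reverse diffusion processes $p_\theta(\mathbf{T}_{t-1} \mid \mathbf{T}_t)$ and $p_\theta(\mathbf{R}_{t-1} \mid \mathbf{R}_t)$ that recursively denoise the noisy transmission and reflection images, ultimately reconstructing the clean transmission and reflection images $\mathbf{T}_0$ and $\mathbf{R}_0$, respectively. By leveraging the DDPM framework, the RRN can effectively model the complex distribution of reflections and transmissions in camera images, enabling accurate decomposition and removal of reflections.
Following the DDPM framework, the reverse diffusion processes are parameterized by a shared neural network with parameters $\theta$, denoted as $\epsilon_\theta$. The reverse diffusion processes can be described as follows:
$$p_\theta(\mathbf{T}_{t-1} \mid \mathbf{T}_t) = \mathcal{N}\!\left(\mathbf{T}_{t-1};\ \mu_\theta(\mathbf{T}_t, t),\ \sigma_t^2\,\mathrm{Id}\right) \tag{6}$$
$$p_\theta(\mathbf{R}_{t-1} \mid \mathbf{R}_t) = \mathcal{N}\!\left(\mathbf{R}_{t-1};\ \mu_\theta(\mathbf{R}_t, t),\ \sigma_t^2\,\mathrm{Id}\right) \tag{7}$$
where $\mu_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(\mathbf{x}_t, t)\right)$ for $\mathbf{x}_t \in \{\mathbf{T}_t, \mathbf{R}_t\}$, $\alpha_t = 1 - \beta_t$, and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.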
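A single reverse step under the standard DDPM parameterization can be sketched as follows. This is a hedged illustration, not the paper's implementation: the noise predictor is replaced by the true noise (an oracle), the step variance is taken as the schedule value at that step (one of the usual DDPM choices, assumed here), and `reverse_step` is a hypothetical name.

```python
import numpy as np

def reverse_step(x_t, t, eps_pred, betas, alpha_bar, rng):
    """One ancestral sampling step from x_t to x_{t-1}; t is a 0-indexed timestep."""
    alpha_t = 1.0 - betas[t]
    # Posterior mean computed from the predicted noise.
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_t)
    if t == 0:
        return mean  # no noise is added on the final step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

betas = np.linspace(1e-4, 0.02, 1000)  # start value is an assumed default
alpha_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)

x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))
eps = rng.standard_normal(x0.shape)
x_noisy = np.sqrt(alpha_bar[0]) * x0 + np.sqrt(1.0 - alpha_bar[0]) * eps

# With an oracle noise prediction, the final step recovers the clean image.
x_rec = reverse_step(x_noisy, 0, eps, betas, alpha_bar, rng)
```

In the full method the same shared network would supply `eps_pred` for both the transmission and the reflection sequence, and the step would be repeated from the last timestep down to the first.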
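The training objective for the Reflective Removal Network is to minimize the combination of the denoising loss and the reconstruction loss. The denoising loss is derived from the variational lower bound of the negative log-likelihood of the data distribution. Specifically, for the transmission component $\mathbf{T}_0$, the denoising loss can be expressed as:
$$\mathcal{L}_{\mathbf{T}} = \mathbb{E}_{\mathbf{T}_0,\,\epsilon,\,t}\left[\left\lVert \epsilon - \epsilon_\theta(\mathbf{T}_t, t) \right\rVert^2\right] \tag{8}$$
$$\mathbf{T}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{T}_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon \tag{9}$$
where $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathrm{Id})$ is the standard Gaussian noise, $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$, and $\mathbf{T}_t$ is the noisy transmission image at timestep $t$. The denoising loss measures the difference between the predicted noise $\epsilon_\theta(\mathbf{T}_t, t)$ and the actual noise $\epsilon$ used to generate the noisy image $\mathbf{T}_t$. By minimizing this loss, the network learns to predict the noise that needs to be removed from the noisy image to reconstruct the clean transmission image.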
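Similarly, for the reflection component $\mathbf{R}_0$, the denoising loss can be expressed as:
$$\mathcal{L}_{\mathbf{R}} = \mathbb{E}_{\mathbf{R}_0,\,\epsilon,\,t}\left[\left\lVert \epsilon - \epsilon_\theta(\mathbf{R}_t, t) \right\rVert^2\right] \tag{10}$$
$$\mathbf{R}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{R}_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon \tag{11}$$
where $\mathbf{R}_t$ is the noisy reflection image at timestep $t$.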
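The overall denoising loss for the Reflective Removal Network is the sum of the denoising losses for the transmission component, $\mathcal{L}_{\mathbf{T}}$, and the reflection component, $\mathcal{L}_{\mathbf{R}}$:
$$\mathcal{L}_{\text{denoise}} = \mathcal{L}_{\mathbf{T}} + \mathcal{L}_{\mathbf{R}} \tag{12}$$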
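The noise-prediction loss can be estimated by Monte Carlo with one random timestep per sample. The sketch below assumes the standard squared-error form and uses a placeholder predictor that always outputs zeros in place of the shared network; `denoising_loss` and `zero_predictor` are hypothetical names, and the schedule start value is an assumed default.

```python
import numpy as np

def denoising_loss(x0, eps_model, alpha_bar, rng):
    """Single-sample estimate of the noise-prediction loss for one clean image."""
    t = int(rng.integers(0, len(alpha_bar)))  # random 0-indexed timestep
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return float(np.mean((noise - eps_model(x_t, t)) ** 2))

def zero_predictor(x_t, t):
    """Placeholder for the shared noise-prediction network (always predicts zero)."""
    return np.zeros_like(x_t)

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
rng = np.random.default_rng(0)
transmission = rng.uniform(-1.0, 1.0, size=(3, 16, 16))
reflection = rng.uniform(-1.0, 1.0, size=(3, 16, 16))

loss_t = denoising_loss(transmission, zero_predictor, alpha_bar, rng)
loss_r = denoising_loss(reflection, zero_predictor, alpha_bar, rng)
loss_denoise = loss_t + loss_r  # sum over the two components
```

With the zero predictor each term is roughly the variance of the injected noise, about 1, which gives a useful baseline: a trained predictor should drive both terms well below that.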
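In addition to the denoising loss, we introduce a reconstruction loss to ensure the fidelity of the reconstructed transmission and reflection images:
$$\mathcal{L}_{\text{rec}} = \left\lVert \hat{\mathbf{T}}_0 - \mathbf{T}_0 \right\rVert + \left\lVert \hat{\mathbf{R}}_0 - \mathbf{R}_0 \right\rVert \tag{13}$$
where $\hat{\mathbf{T}}_0$ and $\hat{\mathbf{R}}_0$ are the reconstructed transmission and reflection images, respectively, obtained by recursively applying the reverse diffusion processes starting from the noisy images at timestep $T$, and $\mathbf{T}_0$ and $\mathbf{R}_0$ are the corresponding ground truth images.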
III-C Transmission Discriminator
To address potential network degradation during unpaired training with the cycle-consistent network and to improve the accuracy of the recovered transmission image, we introduce a Transmission Discriminator (TD) as an adversarial component in the proposed framework.
The Transmission Discriminator is a convolutional neural network that learns to distinguish between the recovered transmission images generated by the Reflective Removal Network (RRN) and the real transmission images from the training dataset. By incorporating this discriminator, we can ensure that the RRN generates visually realistic and accurate transmission images. The adversarial loss is defined as:
$$\mathcal{L}_{\text{adv}} = \mathbb{E}_{\mathbf{T} \sim p_{\text{data}}(\mathbf{T})}\left[\log D(\mathbf{T})\right] \tag{14}$$
$$\qquad\quad + \mathbb{E}_{\mathbf{I} \sim p_{\text{data}}(\mathbf{I})}\left[\log\left(1 - D(G(\mathbf{I}))\right)\right] \tag{15}$$
where $D$ denotes the Transmission Discriminator, $G$ represents the Reflective Removal Network (with $G(\mathbf{I})$ taken here as its recovered transmission image), $\mathbf{T}$ is a real transmission image sampled from the data distribution $p_{\text{data}}(\mathbf{T})$, and $\mathbf{I}$ is an input camera image sampled from the data distribution $p_{\text{data}}(\mathbf{I})$.
The RRN is trained to minimize the adversarial loss, encouraging it to generate transmission images that are indistinguishable from real ones. The Transmission Discriminator, on the other hand, is trained to maximize the adversarial loss, improving its ability to distinguish between real and generated transmission images. In the full framework, the TD enhances the quality of the recovered transmission images by forcing the RRN to generate more realistic and accurate results. This adversarial training process helps maintain the complexity of the network and prevents degradation during unpaired training.
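The min-max behaviour described above can be seen on scalar discriminator outputs. The sketch below assumes the standard GAN log-loss form; `adversarial_loss` is a hypothetical helper, and the arrays stand in for discriminator probabilities on real and recovered transmission images.

```python
import numpy as np

def adversarial_loss(d_real, d_fake, eps=1e-12):
    """Mean log D(real) plus mean log(1 - D(fake)); eps guards against log(0)."""
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return float(np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps)))

# A discriminator that separates the two sets perfectly reaches the maximum (about 0).
perfect = adversarial_loss([1.0, 1.0], [0.0, 0.0])

# A discriminator that cannot tell them apart outputs 0.5 everywhere.
confused = adversarial_loss([0.5, 0.5], [0.5, 0.5])

print(perfect, confused)
```

The discriminator pushes the value up toward `perfect`, while the removal network pushes it down by making its outputs harder to distinguish from real transmission images.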
The overall training objective for the Reflective Removal Network is updated to include the adversarial loss:
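$$\mathcal{L}_{\text{RRN}} = \lambda_1\,\mathcal{L}_{\text{denoise}} + \lambda_2\,\mathcal{L}_{\text{rec}} + \lambda_3\,\mathcal{L}_{\text{adv}} \tag{16}$$
where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are hyper-parameters controlling the relative importance of the denoising, reconstruction, and adversarial loss terms, respectively.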
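As a small worked example of the weighting, the function below combines three loss values using the weights reported in Section IV-B (1.0, 10.0, and 0.1 for the denoising, reconstruction, and adversarial terms). The function name, argument names, and the input loss values are hypothetical.

```python
def rrn_objective(loss_denoise, loss_rec, loss_adv,
                  weight_denoise=1.0, weight_rec=10.0, weight_adv=0.1):
    """Weighted sum of the three loss terms used to train the removal network."""
    return (weight_denoise * loss_denoise
            + weight_rec * loss_rec
            + weight_adv * loss_adv)

# Illustrative values only: 1.0 * 0.5 + 10.0 * 0.02 + 0.1 * (-1.0)
total = rrn_objective(loss_denoise=0.5, loss_rec=0.02, loss_adv=-1.0)
print(total)
```

With these weights a reconstruction error of 0.02 contributes as much as a denoising loss of 0.2, so pixel fidelity is emphasized relative to its raw magnitude.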
By incorporating the Transmission Discriminator, the proposed framework benefits from the adversarial training process, which helps maintain the complexity of the network and improves the accuracy of the recovered transmission images.
III-D Reflective Synthesis Network
The Reflective Synthesis Network (RSN) aims to synthesize the input image with reflections by combining the recovered transmission and reflection components. We propose a nonlinear attention-based combination module to adaptively fuse the transmission and reflection components while preserving the details and structures of the input image.

As shown in Fig. 4, the RSN aims to synthesize the input image by applying a nonlinear superposition of $\hat{\mathbf{T}}$ and $\hat{\mathbf{R}}$, guided by an attention mechanism. We first compute a set of spatial attention maps $\{\mathbf{A}_k\}_{k=1}^{K}$, where $K$ is the number of attention heads. Each attention map is obtained by applying a convolutional operation followed by a softmax activation to the concatenation of $\hat{\mathbf{T}}$ and $\hat{\mathbf{R}}$:
$$\mathbf{A}_k = \mathrm{softmax}\!\left(f_k\!\left([\hat{\mathbf{T}}, \hat{\mathbf{R}}]\right)\right) \tag{17}$$
where $f_k$ represents a convolutional neural network with learnable parameters, and $[\cdot, \cdot]$ denotes the concatenation operation along the channel dimension.
The attention maps are then used to modulate the transmission image and the reflection image , producing attended feature maps and , respectively:
(18) | ||||
(19) |
where denotes the element-wise multiplication operation.
The final synthesized camera image is obtained by combining the attended feature maps and through a nonlinear transformation:
(20) |
where is a convolutional neural network that learns to synthesize the camera image with reflections from the attended transmission and reflection image components.
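One possible reading of the fusion pipeline is sketched below with NumPy. It is an assumption-laden illustration rather than the paper's module: it uses a single attention head with two channels (one weight map per stream), a channel-wise softmax, a weighted sum of the two streams, and a final 1×1 convolution, following the description in Section IV-B. All function names and weight shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def conv1x1(x, weight, bias):
    """1x1 convolution: x is (C_in, H, W), weight is (C_out, C_in), bias is (C_out,)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def fuse(t_hat, r_hat, w_att, b_att, w_out, b_out):
    """Attention-weighted fusion of transmission and reflection estimates."""
    stacked = np.concatenate([t_hat, r_hat], axis=0)    # (6, H, W)
    attn = softmax(conv1x1(stacked, w_att, b_att), 0)   # (2, H, W), sums to 1 per pixel
    mixed = attn[0:1] * t_hat + attn[1:2] * r_hat       # dual-stream weighted sum
    return conv1x1(mixed, w_out, b_out), attn

rng = np.random.default_rng(0)
t_hat = rng.uniform(0.0, 1.0, size=(3, 8, 8))
r_hat = rng.uniform(0.0, 1.0, size=(3, 8, 8))

# Zero attention weights give uniform attention; identity output weights pass the mix through.
out, attn = fuse(t_hat, r_hat,
                 w_att=np.zeros((2, 6)), b_att=np.zeros(2),
                 w_out=np.eye(3), b_out=np.zeros(3))
```

With learned weights the attention maps would vary per pixel, letting the synthesis favour the reflection stream where reflections dominate and the transmission stream elsewhere.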
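To ensure that the synthesized camera image $\hat{\mathbf{I}}$ closely resembles the original input image $\mathbf{I}$, we introduce a cycle-consistency loss as the overall objective function for the Reflective Synthesis Network:
$$\mathcal{L}_{\text{cyc}} = \left\lVert \mathbf{I} - \hat{\mathbf{I}} \right\rVert \tag{21}$$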
III-E Training and Inference
The training and inference processes of the proposed method involve two main components: the complete data training flow and the single input inference flow in the cycle-consistent network.
III-E1 Paired Sample Training Flow
The complete data training flow leverages paired data, consisting of input camera images with reflections and their corresponding ground truth transmission images. The training process alternates between updating the Reflective Removal Network (RRN) and the Reflective Synthesis Network (RSN). Algorithm 1 outlines the paired sample training flow.
During the RRN update step, the input camera image $\mathbf{I}$ is passed through the RRN to obtain the estimated transmission image $\hat{\mathbf{T}}$ and reflection image $\hat{\mathbf{R}}$. The denoising loss and reconstruction loss are computed based on the estimated and ground truth transmission images. The RRN parameters are then updated using gradient descent to minimize these losses. In the RSN update step, the estimated transmission image $\hat{\mathbf{T}}$ and reflection image $\hat{\mathbf{R}}$ are fed into the RSN to synthesize the camera image $\hat{\mathbf{I}}$. The cycle-consistency loss is computed between the input camera image $\mathbf{I}$ and the synthesized image $\hat{\mathbf{I}}$. The RSN parameters are updated using gradient descent to minimize the cycle-consistency loss.
III-E2 Unpaired Sample Training Flow
In the unpaired setting, the proposed method takes a single input camera image with reflections, without a corresponding ground truth transmission image, and aims to remove the reflections to obtain the transmission image. This single-input flow uses the RRN, RSN, and TD models in a cycle-consistent framework. Algorithm 2 describes the unpaired sample training flow with the transmission discriminator.
Given an input camera image $\mathbf{I}$, the RRN first predicts the transmission image $\hat{\mathbf{T}}$ and reflection image $\hat{\mathbf{R}}$. The RSN then takes $\hat{\mathbf{T}}$ and $\hat{\mathbf{R}}$ as inputs and synthesizes the reconstructed camera image $\hat{\mathbf{I}}$. The cycle-consistency loss is computed between the input image $\mathbf{I}$ and the reconstructed image $\hat{\mathbf{I}}$ to enforce the bijective mapping between the domains. In addition to the cycle-consistency loss, we also incorporate the TD to improve the quality of the predicted transmission image $\hat{\mathbf{T}}$. The adversarial loss is computed by passing $\hat{\mathbf{T}}$ through the TD, which learns to distinguish between real and predicted transmission images. During training, the RRN and RSN parameters are updated simultaneously to minimize the combined loss function, which combines the cycle-consistency loss and the adversarial loss.
III-E3 Model Inference
During inference, our method takes a single input image and processes it through the trained RRN to obtain reflection-free results. This process is deterministic and requires no iterative optimization, enabling efficient real-world applications.
The inference pipeline consists of two stages: first, the RRN decomposes the input into transmission and reflection components through a series of denoising steps. Then, only the transmission component is extracted as the final output, discarding the reflection component.
Computationally, inference requires only a single forward pass through the RRN, making it significantly more efficient than training. The RSN and TD components are not used during inference as they serve only training purposes. For an input image of resolution $H \times W$, the inference process has a computational complexity of $O(HW)$, with actual runtime primarily determined by the network architecture and available computing resources.
IV Experimental Results
In this section, we present comprehensive evaluations of the proposed method for reflective removal from camera images. We first introduce the datasets used for training and testing, followed by details of the experimental setup. We then conduct ablation studies to analyze the contribution of different components and loss functions. Furthermore, we compare the proposed method with existing state-of-the-art techniques and demonstrate its effectiveness on real-world scenarios.
IV-A Datasets
We evaluate the proposed method on real-world datasets and Museum Reflection Removal (MRR) datasets:
SIR2 Dataset
For real-world evaluation, we utilize the publicly available SIR2 dataset [40]. The SIR2 dataset contains 500 real-world image pairs captured through glass surfaces with reflections. We use the SIR2 dataset to assess the generalization ability of existing methods in real-world scenarios.
Flash-Based Reflection Removal (FRR) Dataset
The FRR dataset[41] is the first to provide RAW flash/no-flash image pairs for reflection removal. It contains 157 real-world scenes captured by Nikon Z6 and Huawei Mate30, and 1964 synthetic scenes. Each real-world scene includes ambient images with and without reflection captured under flash and no-flash conditions, and a no-flash reflection image. The dataset enables flash-based reflection removal.
Museum Reflection Removal (MRR) Dataset
To evaluate the performance of the proposed method on diverse artistic contents in museums, we introduce the Museum Reflection Removal (MRR) dataset. As shown in Fig. 6, the dataset comprises 1,621 high-quality unpaired images and 721 paired images collected from various museums and exhibitions, covering geological specimens, botanical specimens, animal specimens, anthropological artifacts, art objects, historical artifacts, technological artifacts, and archaeological finds. As shown in Fig. 5, each image in the dataset contains reflections caused by protective glass or display cases, presenting a challenging real-world scenario for reflection removal algorithms. To facilitate paired training, the paired images are created by combining reflection-free samples with exhibition scenes to produce realistic reflections.


IV-B Experimental Setup
We implement the proposed method using the PyTorch deep-learning framework [42]. All experiments were conducted on a workstation with NVIDIA GeForce RTX 4090 GPUs, 128GB RAM, and an Intel Xeon Gold 6226R CPU. The model training takes approximately 48 hours for both stages combined. The RRN and the Reflective Synthesis Network (RSN) are trained jointly using the Adam optimizer [43] with a learning rate of 1e-4 and a batch size of 8. The RRN is based on the DDPM architecture, with a modified UNet backbone [44] consisting of 8 downsampling and 8 upsampling layers. The RSN employs a similar architecture with 4 downsampling and 4 upsampling layers, along with the attention-based combination module. The hyperparameters $\lambda_1$, $\lambda_2$, and $\lambda_3$, which control the weights of the denoising loss, reconstruction loss, and adversarial loss, respectively, are set to 1.0, 10.0, and 0.1 based on empirical observations.
Component | Parameter | Value |
---|---|---|
RRN | Number of layers | 8 down + 8 up |
RRN | Channel dimensions | [128,256,512,512,512,512,512,512] |
RRN | Diffusion timesteps | T = 1000 |
RRN | Beta schedule | Linear ( to 0.02) |
RRN | Attention resolutions | 16×16, 32×32 |
RSN | Feature extraction | 4 down + 4 up |
RSN | Channel dimensions | [64,128,256,512] |
RSN | Attention module | Single-layer with 1×1 conv |
RSN | Softmax normalization | Channel-wise |
RSN | Feature fusion | Dual-stream weighted sum |
RSN | Final synthesis | 1×1 conv |
Table I presents the detailed architecture and training parameters of our networks. The RRN adopts a U-Net backbone with symmetric skip connections, where the number of channels doubles after each downsampling operation until reaching 512. For the RSN, we implement a novel attention-based fusion mechanism that operates on the reconstructed transmission and reflection images. The attention module uses 1×1 convolutions followed by softmax normalization to generate attention maps, which are then used to weight and combine the features from both streams. This adaptive weighting mechanism allows the network to selectively focus on relevant features from each component when synthesizing the final image. The weighted features are further processed through a 1×1 convolution layer to produce the synthesized output. Both networks are trained end-to-end using the Adam optimizer with carefully tuned loss weights to balance the different learning objectives.
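For reproduction purposes, the reported settings can be gathered into a small configuration object. The sketch below only restates values given in this section and in Table I; the class and field names are hypothetical, and the start value of the beta schedule is left out because it is not stated here.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class RRNConfig:
    down_layers: int = 8
    up_layers: int = 8
    channels: Tuple[int, ...] = (128, 256, 512, 512, 512, 512, 512, 512)
    timesteps: int = 1000
    beta_schedule: str = "linear"
    beta_end: float = 0.02
    attention_resolutions: Tuple[int, ...] = (16, 32)

@dataclass(frozen=True)
class RSNConfig:
    down_layers: int = 4
    up_layers: int = 4
    channels: Tuple[int, ...] = (64, 128, 256, 512)
    attention_kernel_size: int = 1   # 1x1 convolution in the attention module
    synthesis_kernel_size: int = 1   # 1x1 convolution for the final synthesis

@dataclass(frozen=True)
class TrainConfig:
    optimizer: str = "adam"
    learning_rate: float = 1e-4
    batch_size: int = 8
    weight_denoise: float = 1.0
    weight_rec: float = 10.0
    weight_adv: float = 0.1

rrn_cfg, rsn_cfg, train_cfg = RRNConfig(), RSNConfig(), TrainConfig()
print(rrn_cfg.channels, rsn_cfg.channels, train_cfg.learning_rate)
```

Freezing the dataclasses keeps the settings immutable across the two training stages, so both stages are guaranteed to see the same architecture parameters.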
For experimental evaluation, we employed a two-stage training strategy. First, we pre-trained our model using the 721 paired samples from the MRR dataset following Algorithm 1, where each pair consists of a reflection-contaminated image and its corresponding reflection-free image. Subsequently, we fine-tuned the pre-trained model using Algorithm 2 on the remaining 1,621 unpaired samples, alternating between updating the RRN and RSN using separate mini-batches of reflection-contaminated and reflection-free images. This sequential training approach enables the model to first learn basic reflection removal from supervised data, then adapt to domain-specific characteristics through self-supervision. For fair comparison, all baseline methods were trained using the same paired training data from the MRR dataset, and the quantitative results reported in Table III were obtained using our final model after both training stages.
IV-C Evaluation Metrics
To comprehensively assess the performance of the proposed reflection removal method, we employ a combination of traditional image quality metrics and advanced perceptual quality measures. Specifically, we apply Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM) [45], Learned Perceptual Image Patch Similarity (LPIPS) [46], and the proposed Reflection Artifact Measure (RAM).
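Of these metrics, PSNR has a simple closed form, 10 times the base-10 logarithm of the squared peak value divided by the mean squared error. A minimal NumPy sketch follows, assuming images scaled to the range [0, 1]; the function name `psnr` and the test images are illustrative.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with peak value max_val."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

target = np.zeros((4, 4))
pred = np.full((4, 4), 0.1)   # uniform error of 0.1, so the MSE is 0.01
value = psnr(pred, target)    # 10 * log10(1 / 0.01), i.e. about 20 dB
print(value)
```

Because the scale is logarithmic, the reported gains of a few dB correspond to substantial reductions in mean squared error: a 3 dB gain roughly halves it.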

The Reflection Artifact Measure (RAM) is a novel metric designed to quantify the presence of frequency-aware reflection components in the recovered transmission layer. As illustrated in Fig. 7, the RAM can be defined as follows:
(22) |
where $\mathcal{F}$ and $\mathcal{F}^{-1}$ denote the Fourier transform and its inverse, respectively. By measuring the energy of the frequency-aware components and normalizing it by the total energy of the predicted transmission layer, RAM provides a quantitative assessment of the reflection components in the recovered image. A lower RAM score indicates better suppression of reflection artifacts and higher quality of the reflection removal results.
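The general idea of an energy ratio computed through the Fourier domain can be sketched as follows. This is not the paper's definition of RAM: the choice of frequency band is an assumption (a hypothetical radial high-pass mask with an arbitrary cutoff), and `band_energy_ratio` and `radial_highpass_mask` are illustrative names. The sketch only shows how the energy of a selected band is normalized by the total energy of the image.

```python
import numpy as np

def radial_highpass_mask(height, width, cutoff=0.1):
    """Hypothetical band selector: 1 where the normalized frequency radius >= cutoff."""
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    return (np.sqrt(fx ** 2 + fy ** 2) >= cutoff).astype(float)

def band_energy_ratio(image, mask):
    """Energy of the masked frequency band divided by the total image energy."""
    spectrum = np.fft.fft2(image)
    band = np.fft.ifft2(spectrum * mask).real  # back to the spatial domain
    total = float(np.sum(image ** 2))
    return float(np.sum(band ** 2) / total) if total > 0.0 else 0.0

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(32, 32))
mask = radial_highpass_mask(32, 32)

ratio_noise = band_energy_ratio(image, mask)            # some energy in the band
ratio_flat = band_energy_ratio(np.ones((32, 32)), mask)  # constant image: no energy in the band
```

A ratio of this kind lies between 0 and 1 and decreases as the selected band is suppressed, which matches the convention that lower scores indicate fewer residual artifacts.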
Metric | Pearson correlation with human judgments | | |
---|---|---|---|
PSNR | 0.76 | 68.4% | 12.3% |
SSIM | 0.79 | 71.2% | 9.8% |
LPIPS | 0.81 | 75.8% | 8.2% |
RAM (Ours) | 0.85 | 89.5% | 6.5% |
To validate RAM’s effectiveness, we conducted a correlation analysis between RAM scores and human perceptual evaluations on the SIR2 dataset. While PSNR and SSIM show moderate correlation with human judgments of reflection removal quality (Pearson coefficients of 0.76 and 0.79, respectively), RAM demonstrates a stronger correlation (0.85) specifically for reflection artifact detection. Additionally, RAM successfully identifies residual reflections in cases where traditional metrics fail, particularly in textured regions where PSNR/SSIM may overlook subtle reflection artifacts.
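The Pearson coefficient used in this analysis is the covariance of two score lists divided by the product of their standard deviations. A minimal sketch follows; the numbers are made-up placeholders rather than data from the study, and since a lower artifact score is better, the metric is negated before correlating it with human ratings (an assumed convention).

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two equally long score lists."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

# Placeholder scores for five images (illustrative only).
artifact_scores = [0.12, 0.09, 0.07, 0.05, 0.04]   # lower means fewer artifacts
human_ratings = [2.0, 2.5, 3.5, 4.0, 4.5]          # higher means better perceived quality

r = pearson([-s for s in artifact_scores], human_ratings)
print(r)
```

Values close to 1 mean the metric ranks images almost linearly in line with the human ratings, which is the sense in which the coefficients above are compared.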
IV-D Comparison with State-of-the-Art Methods
We conduct a comprehensive evaluation of the method against state-of-the-art techniques, including CEILNet [18], PercepNet [19], BDN [25], ERRNet [20], KimFe et al. [38], DSRN [47] and SDN [48], on three benchmark datasets: SIR2 [40], FRR[41] and the proposed MRR dataset. These datasets cover various real-world and synthetic scenarios, enabling a thorough assessment of reflection removal performance.
Table III shows that the proposed method outperforms state-of-the-art reflection removal techniques on the SIR2, FRR, and MRR datasets across all evaluation metrics. Deep learning approaches such as CEILNet [18], PercepNet [19], BDN [25], and ERRNet [20] achieve competitive results but are surpassed by the proposed method. Across the three datasets, our approach improves PSNR by 0.24-3.84 dB and SSIM by 0.002-0.074, and reduces LPIPS by 0.003-0.060 and RAM by 0.0050-0.0563 relative to the baselines, with the smallest margins over SDN [48] and the largest over KimFe et al. [38] or CEILNet [18]. These quantitative results on diverse datasets, including the improvements in the proposed RAM, support the effectiveness of combining cycle-consistency and denoising diffusion models for single image reflection removal, particularly in suppressing reflection artifacts. Specifically, the proposed method achieves RAM scores of 0.0549, 0.0437, and 0.0498 on the SIR2, FRR, and MRR datasets, respectively, which are lower than those of all compared methods, indicating better suppression of reflection artifacts.
Further analysis reveals scenario-specific performance variations. Our method shows particular advantages in handling complex museum artifacts (+2.1 dB PSNR improvement over ERRNet) and textured surfaces (+1.8 dB over SDN), likely due to the diffusion model’s strong capability in handling multimodal distributions. However, for scenes with extremely strong reflections or motion blur, the performance gain is more modest (+0.3 dB over DSRN). Earlier methods such as BDN perform competitively on simple flat surfaces but struggle with layered reflections, where our approach excels.
Method | SIR2 PSNR | SIR2 SSIM | SIR2 LPIPS | SIR2 RAM | FRR PSNR | FRR SSIM | FRR LPIPS | FRR RAM | MRR PSNR | MRR SSIM | MRR LPIPS | MRR RAM | #Params |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CEILNet [18] | 25.23 | 0.861 | 0.132 | 0.1110 | 22.14 | 0.793 | 0.185 | 0.0998 | 21.37 | 0.778 | 0.193 | 0.1061 | 45.2M |
PercepNet [19] | 26.17 | 0.879 | 0.116 | 0.0924 | 23.08 | 0.815 | 0.169 | 0.0812 | 22.25 | 0.802 | 0.177 | 0.0873 | 58.7M |
BDN [25] | 27.35 | 0.895 | 0.099 | 0.0736 | 23.92 | 0.828 | 0.157 | 0.0624 | 23.11 | 0.817 | 0.164 | 0.0685 | 64.3M |
ERRNet [20] | 27.63 | 0.901 | 0.095 | 0.0662 | 24.21 | 0.834 | 0.151 | 0.0550 | 23.48 | 0.825 | 0.158 | 0.0611 | 71.5M |
KimFe et al. [38] | 24.57 | 0.842 | 0.147 | 0.1048 | 21.63 | 0.776 | 0.198 | 0.0936 | 20.89 | 0.761 | 0.205 | 0.0999 | 43.8M |
DSRN [47] | 27.82 | 0.905 | 0.092 | 0.0624 | 24.39 | 0.837 | 0.147 | 0.0512 | 23.74 | 0.830 | 0.153 | 0.0573 | 82.1M |
SDN [48] | 27.91 | 0.907 | 0.090 | 0.0599 | 24.52 | 0.839 | 0.145 | 0.0487 | 23.86 | 0.832 | 0.151 | 0.0548 | 85.3M |
Ours | 28.41 | 0.912 | 0.087 | 0.0549 | 24.76 | 0.841 | 0.142 | 0.0437 | 24.15 | 0.835 | 0.148 | 0.0498 | 89.7M |
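The margins between our method and each baseline can be recomputed directly from Table III. The sketch below is a small bookkeeping aid, not part of the method; the dictionary layout and the function name `margin_ranges` are our own choices, and the values are transcribed from the table.

```python
# (PSNR, SSIM, LPIPS, RAM) per dataset, transcribed from Table III
TABLE_III = {
    "CEILNet":   {"SIR2": (25.23, 0.861, 0.132, 0.1110), "FRR": (22.14, 0.793, 0.185, 0.0998), "MRR": (21.37, 0.778, 0.193, 0.1061)},
    "PercepNet": {"SIR2": (26.17, 0.879, 0.116, 0.0924), "FRR": (23.08, 0.815, 0.169, 0.0812), "MRR": (22.25, 0.802, 0.177, 0.0873)},
    "BDN":       {"SIR2": (27.35, 0.895, 0.099, 0.0736), "FRR": (23.92, 0.828, 0.157, 0.0624), "MRR": (23.11, 0.817, 0.164, 0.0685)},
    "ERRNet":    {"SIR2": (27.63, 0.901, 0.095, 0.0662), "FRR": (24.21, 0.834, 0.151, 0.0550), "MRR": (23.48, 0.825, 0.158, 0.0611)},
    "KimFe":     {"SIR2": (24.57, 0.842, 0.147, 0.1048), "FRR": (21.63, 0.776, 0.198, 0.0936), "MRR": (20.89, 0.761, 0.205, 0.0999)},
    "DSRN":      {"SIR2": (27.82, 0.905, 0.092, 0.0624), "FRR": (24.39, 0.837, 0.147, 0.0512), "MRR": (23.74, 0.830, 0.153, 0.0573)},
    "SDN":       {"SIR2": (27.91, 0.907, 0.090, 0.0599), "FRR": (24.52, 0.839, 0.145, 0.0487), "MRR": (23.86, 0.832, 0.151, 0.0548)},
    "Ours":      {"SIR2": (28.41, 0.912, 0.087, 0.0549), "FRR": (24.76, 0.841, 0.142, 0.0437), "MRR": (24.15, 0.835, 0.148, 0.0498)},
}
METRICS = ("PSNR", "SSIM", "LPIPS", "RAM")
HIGHER_IS_BETTER = (True, True, False, False)

def margin_ranges(table, ours="Ours"):
    """Return {metric: (min, max)} improvement of `ours` over every baseline
    and dataset; positive values mean `ours` is better."""
    margins = {m: [] for m in METRICS}
    for name, per_dataset in table.items():
        if name == ours:
            continue
        for dataset, scores in per_dataset.items():
            for i, metric in enumerate(METRICS):
                diff = table[ours][dataset][i] - scores[i]
                margins[metric].append(diff if HIGHER_IS_BETTER[i] else -diff)
    return {m: (round(min(v), 4), round(max(v), 4)) for m, v in margins.items()}

ranges = margin_ranges(TABLE_III)
```

Every margin is positive, which confirms that the proposed method is ahead on each metric and dataset, although by a narrow amount against the strongest baselines.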
Fig. 8 provides qualitative comparisons of the proposed method with the baseline techniques on sample images from the SIR2 and MRR datasets. The proposed method effectively removes the reflective component while preserving the original image content and structures, outperforming the existing methods in terms of visual quality and realism. Fully supervised approaches, which rely on paired samples, can also yield reasonable results. However, these methods may encounter limitations in certain scenarios, leading to artifacts or incomplete removal in the recovered transmission images.
While our method has a slightly larger model size (89.7M parameters) than recent approaches such as SDN (85.3M) and DSRN (82.1M), the increased capacity primarily comes from the attention mechanism in RSN and the dual-branch structure in the diffusion model, which are essential for handling complex reflection patterns. Earlier methods such as CEILNet (45.2M) and KimFe (43.8M) have smaller model sizes but show limited capability in handling complex scenes. The moderate increase in model parameters (approximately 5% compared to SDN) brings consistent gains across all metrics, indicating a favorable trade-off between model complexity and reflection removal capability.

IV-E Ablation experiments
To investigate the contribution of each component in the proposed method, we conduct ablation experiments on the SIR2 dataset. We evaluate the performance of the proposed method under the following settings:
- Full Model: The complete proposed framework with all components and loss functions.
- w/o TD: The proposed framework without the Transmission Discriminator and adversarial loss.
- w/o Attention: The proposed framework with the attention module in the Reflective Synthesis Network replaced by simple superposition.
- w/o RSN: The proposed framework without the Reflective Synthesis Network and cycle-consistency loss.

Table IV presents the quantitative results of the ablation study. The full model achieves the best performance across all evaluation metrics, demonstrating the effectiveness of the proposed components and loss functions. Removing the Transmission Discriminator leads to a drop in performance, with a decrease of 1.26 dB in PSNR and 0.016 in SSIM, and an increase of 0.015 in LPIPS and 0.0334 in RAM. This indicates the importance of the adversarial loss and the TD in improving the quality of the recovered transmission image. The attention-based combination module in the RSN also contributes positively to the overall performance. Without the attention module, the PSNR decreases by 2.09 dB, SSIM drops by 0.029, LPIPS increases by 0.027, and RAM increases by 0.0676, highlighting the significance of the attention mechanism in adaptively fusing the transmission and reflection components. Removing the Reflective Synthesis Network results in the most significant performance degradation among the ablation settings. The PSNR decreases by 2.54 dB, SSIM drops by 0.041, LPIPS increases by 0.040, and RAM increases by 0.1058, emphasizing the crucial role of the RSN and the cycle-consistency loss in the proposed framework.
For computational complexity, our full model requires 89.4G FLOPs for processing a 512×512 image, with the TD and RSN components contributing additional overhead compared to the baseline architectures. While removing these components reduces computational cost (65.3G FLOPs for w/o RSN), the significant performance degradation suggests that the added complexity is justified by the quality improvements. This trade-off between computational efficiency and reflection removal performance provides flexibility for different application scenarios.
Fig. 9 presents the qualitative results of the ablation study on sample images from the MRR dataset. The full model achieves the best visual quality in the recovered transmission images, successfully removing the reflections while preserving the image content. Removing the Transmission Discriminator (w/o TD) leads to incomplete removal of reflections in some regions. The absence of the attention module (w/o attention) results in artifacts and distortions in the recovered images. Without the Reflective Synthesis Network (w/o RSN), the quality of the transmission images degrades significantly, with visible residual reflections and loss of details.
Method | PSNR | SSIM | LPIPS | RAM | FLOPs (G) |
---|---|---|---|---|---|
Full Model | 28.41 | 0.912 | 0.087 | 0.0549 | 89.4 |
w/o TD | 27.15 | 0.896 | 0.102 | 0.0883 | 76.2 |
w/o Attention | 26.32 | 0.883 | 0.114 | 0.1225 | 82.8 |
w/o RSN | 25.87 | 0.871 | 0.127 | 0.1607 | 65.3 |
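To make the cost-quality trade-off in Table IV easier to read, the sketch below tabulates, for each ablated variant, the PSNR lost and the FLOPs saved relative to the full model. The values are transcribed from Table IV; the function name `tradeoff` and the tuple layout are illustrative choices of ours.

```python
# (PSNR, SSIM, LPIPS, RAM, GFLOPs), transcribed from Table IV
ABLATION = {
    "Full Model":    (28.41, 0.912, 0.087, 0.0549, 89.4),
    "w/o TD":        (27.15, 0.896, 0.102, 0.0883, 76.2),
    "w/o Attention": (26.32, 0.883, 0.114, 0.1225, 82.8),
    "w/o RSN":       (25.87, 0.871, 0.127, 0.1607, 65.3),
}

def tradeoff(table, ref="Full Model"):
    """Return {variant: (PSNR drop in dB, GFLOPs saved, % FLOPs saved)}."""
    full = table[ref]
    out = {}
    for name, row in table.items():
        if name == ref:
            continue
        psnr_drop = full[0] - row[0]
        flops_saved = full[4] - row[4]
        out[name] = (
            round(psnr_drop, 2),
            round(flops_saved, 1),
            round(100.0 * flops_saved / full[4], 1),
        )
    return out

for variant, (drop, saved, pct) in tradeoff(ABLATION).items():
    print(f"{variant}: -{drop} dB PSNR for {saved} GFLOPs saved ({pct}%)")
```

Read this way, dropping the RSN saves about 27% of the FLOPs at a cost of 2.54 dB, while replacing the attention module saves only about 7% yet costs 2.09 dB.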
IV-F Runtime Analysis
We analyze the computational complexity and runtime of the proposed method in comparison with existing state-of-the-art approaches. Table V presents the average runtime of each method at different resolutions on a single NVIDIA GeForce RTX 4090 GPU.
Method | Parameters | 256×256 (s) | 512×512 (s) | 1024×1024 (s) |
---|---|---|---|---|
CEILNet [18] | 45.2M | 0.032 | 0.128 | 0.482 |
BDN [25] | 64.3M | 0.041 | 0.156 | 0.589 |
ERRNet [20] | 71.5M | 0.045 | 0.167 | 0.634 |
DSRN [47] | 82.1M | 0.053 | 0.198 | 0.756 |
SDN [48] | 85.3M | 0.057 | 0.212 | 0.823 |
Ours | 89.7M | 0.061 | 0.235 | 0.892 |
As expected, lighter models such as CEILNet achieve faster inference (0.128 s at 512×512) due to their simpler architectures. Our method is the slowest of the compared methods (0.235 s at 512×512, about 11% slower than SDN), reflecting the cost of more sophisticated components such as the attention mechanism and dual-branch diffusion structure. We consider the increased latency justified by the improvement in reflection removal quality shown in our quantitative results. For all methods, runtime grows just under fourfold each time the image side length doubles, i.e. approximately quadratically in side length and linearly in pixel count, which is consistent with the cost of convolutional operations. For practical applications with latency constraints, our method still processes multiple frames per second at common resolutions (about 16 at 256×256 and 4 at 512×512) while delivering superior reflection removal results.
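The scaling behaviour and throughput can be checked from Table V with a short calculation. The sketch below fits the slope of log(runtime) against log(side length) for each method, where a slope of 2 corresponds to runtime proportional to the number of pixels; the function names are our own, and the timings are transcribed from the table.

```python
import math

SIDES = (256, 512, 1024)
# Average runtime in seconds at each resolution, transcribed from Table V
RUNTIME_S = {
    "CEILNet": (0.032, 0.128, 0.482),
    "BDN":     (0.041, 0.156, 0.589),
    "ERRNet":  (0.045, 0.167, 0.634),
    "DSRN":    (0.053, 0.198, 0.756),
    "SDN":     (0.057, 0.212, 0.823),
    "Ours":    (0.061, 0.235, 0.892),
}

def scaling_exponent(times, sides=SIDES):
    """Least-squares slope of log(time) versus log(side length)."""
    xs = [math.log(s) for s in sides]
    ys = [math.log(t) for t in times]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def frames_per_second(times):
    """Throughput implied by per-image latency, one image at a time."""
    return tuple(1.0 / t for t in times)

exponents = {name: scaling_exponent(t) for name, t in RUNTIME_S.items()}
fps_ours = frames_per_second(RUNTIME_S["Ours"])
```

The fitted exponents fall slightly below 2 for every method, and the throughput figures assume single-image inference without batching.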
IV-G Real-World results

To further validate the effectiveness of the proposed method in practical scenarios, we evaluate its performance on real-world images with reflections. Fig. 10 presents a challenging case where the reflections are primarily concentrated within the painting frame, while the painting itself remains relatively reflection-free. This scenario often occurs in real-world settings, such as museums or galleries, where the reflective positions are ambiguous and can be easily confused with the actual content of the artwork.
The proposed method successfully handles this challenging case, accurately separating the reflection component from the transmission layer. The recovered transmission image preserves the intricate details and textures of the painting, while the estimated reflection image captures the reflective elements present in the frame. These results demonstrate the robustness of the proposed approach in dealing with complex real-world reflections.
IV-H Failure Cases and Limitations

While the proposed method demonstrates strong performance in removing reflections from single images, there are some limitations and failure cases to consider. One failure case occurs when dealing with misclassified details in complex textures, as illustrated in Fig. 11. In such scenarios, the proposed method may struggle to accurately separate the reflection component from the transmission layer, leading to over-removal of reflections in the recovered image.
The limitations in handling low-contrast reflections particularly impact applications in cultural heritage digitization and museum photography, where subtle reflections from protective glass can affect the digital archiving quality. When reflection patterns closely match the underlying object’s texture or when multiple reflections overlap with varying intensities, our method may struggle to correctly separate the components. Future improvements could explore incorporating physical reflection formation models, multi-scale attention mechanisms for better feature discrimination, or auxiliary depth information to guide separation. Additionally, domain-specific pre-training on particular types of artifacts (e.g., paintings, sculptures) might help the model better handle characteristic reflection patterns in specialized settings.
V Conclusion
This paper presents a self-supervised approach for single image reflection removal that combines cycle-consistency and denoising diffusion probabilistic models. The proposed method introduces a Reflective Removal Network that utilizes DDPMs to model the decomposition process and recover the transmission image, and a Reflective Synthesis Network that re-synthesizes the input using the separated components through a nonlinear attention-based mechanism. The RRN effectively handles complex reflections by leveraging the power of DDPMs, while the RSN ensures accurate reconstruction of the input image. We conduct extensive experiments on both synthetic and real-world datasets, demonstrating that the proposed technique outperforms state-of-the-art methods in terms of quantitative metrics and visual quality. The recovered reflection-free images exhibit high fidelity and preserve important details. Despite existing limitations, such as incomplete removal of low-contrast reflections and reliance on synthetic training data, this work represents a significant advance in single image reflection removal, offering substantial benefits for image processing applications.
Acknowledgments
This work is funded by the National Social Science Fund of China Major Project in Artistic Studies (No. 22ZD18), the China Postdoctoral Science Foundation (No. 2023M741411), the Postdoctoral Fellowship Program of CPSF (No. GZC20240608), and the Jiangsu Funding Program for Excellent Postdoctoral Talent (No. 2024ZB488).
References
- [1] J. F. Blinn and M. E. Newell, “Texture and reflection in computer generated images,” Communications of the ACM, vol. 19, no. 10, pp. 542–547, 1976.
- [2] Y. Li and M. S. Brown, “Exploiting reflection change for automatic reflection removal,” in Proceedings of the IEEE international conference on computer vision, 2013, pp. 2432–2439.
- [3] C. Li, Y. Yang, K. He, S. Lin, and J. E. Hopcroft, “Single image reflection removal through cascaded refinement,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 3565–3574.
- [4] N. Arvanitopoulos, R. Achanta, and S. Susstrunk, “Single image reflection suppression,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4498–4506.
- [5] Z. Lu and Y. Chen, “Self-supervised monocular depth estimation on water scenes via specular reflection prior,” Digital Signal Processing, vol. 149, p. 104496, 2024.
- [6] X. Guo, X. Cao, and Y. Ma, “Robust separation of reflection from multiple images,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 2187–2194.
- [7] Z. Lu and Y. Chen, “Single image super-resolution based on a modified u-net with mixed gradient loss,” Signal, Image and Video Processing, vol. 16, no. 5, pp. 1143–1151, 2022.
- [8] S. N. Sinha, J. Kopf, M. Goesele, D. Scharstein, and R. Szeliski, “Image-based rendering for scenes with reflections,” ACM Transactions on Graphics (TOG), vol. 31, no. 4, pp. 1–10, 2012.
- [9] Y. Shih, D. Krishnan, F. Durand, and W. T. Freeman, “Reflection removal using ghosting cues,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3193–3201.
- [10] Z. Lu and Y. Chen, “Joint self-supervised depth and optical flow estimation towards dynamic objects,” Neural Processing Letters, vol. 55, no. 8, pp. 10235–10249, 2023.
- [11] L. Wang, Q. Yang, C. Wang, W. Wang, and Z. Su, “Coarse-to-fine mechanisms mitigate diffusion limitations on image restoration,” Computer Vision and Image Understanding, vol. 248, p. 104118, 2024.
- [12] B. Fu, Y. Jiang, D. Wang, J. Gao, C. Wang, and X. Li, “Uncertainty-aware sparse transformer network for single image deraindrop,” IEEE Transactions on Instrumentation and Measurement, 2024.
- [13] D. Wang, J. Liu, L. Ma, R. Liu, and X. Fan, “Improving misaligned multi-modality image fusion with one-stage progressive dense registration,” IEEE Transactions on Circuits and Systems for Video Technology, 2024.
- [14] Z. Lu and Y. Chen, “Pyramid frequency network with spatial attention residual refinement module for monocular depth estimation,” Journal of Electronic Imaging, vol. 31, no. 2, p. 023005, 2022.
- [15] H.-C. Dan, B. Lu, and M. Li, “Evaluation of asphalt pavement texture using multiview stereo reconstruction based on deep learning,” Construction and Building Materials, vol. 412, p. 134837, 2024.
- [16] A. Levin and Y. Weiss, “User assisted separation of reflections from a single image using a sparsity prior,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 9, pp. 1647–1654, 2007.
- [17] Y. Li and M. S. Brown, “Single image layer separation using relative smoothness,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 2752–2759.
- [18] Q. Fan, J. Yang, G. Hua, B. Chen, and D. Wipf, “A generic deep architecture for single image reflection removal and image smoothing,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3238–3247.
- [19] X. Zhang, R. Ng, and Q. Chen, “Single image reflection separation with perceptual losses,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4786–4794.
- [20] K. Wei, J. Yang, Y. Fu, D. Wipf, and H. Huang, “Single image reflection removal exploiting misaligned training data and network enhancements,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8178–8187.
- [21] H. Farid and E. H. Adelson, “Separating reflections and lighting using independent components analysis,” in Proceedings. 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149), vol. 1. IEEE, 1999, pp. 262–267.
- [22] R. Szeliski, S. Avidan, and P. Anandan, “Layer extraction from multiple images containing reflections and transparency,” in Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No. PR00662), vol. 1. IEEE, 2000, pp. 246–253.
- [23] A. Levin, A. Zomet, and Y. Weiss, “Separating reflections from a single image using local features,” in Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1. IEEE, 2004, pp. I–I.
- [24] N. Kong, Y.-W. Tai, and S. Y. Shin, “High-quality reflection separation using polarized images,” IEEE Transactions on Image Processing, vol. 20, no. 12, pp. 3393–3405, 2011.
- [25] J. Yang, D. Gong, L. Liu, and Q. Shi, “Seeing deeply and bidirectionally: A deep learning approach for single image reflection removal,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 654–669.
- [26] R. Wan, B. Shi, L.-Y. Duan, A.-H. Tan, and A. C. Kot, “Crrn: Multi-scale guided concurrent reflection removal network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4777–4785.
- [27] Q. Wen, Y. Tan, J. Qin, W. Liu, G. Han, and S. He, “Single image reflection removal beyond linearity,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3771–3779.
- [28] Y. Liu and F. Lu, “Separate in latent space: Unsupervised single image layer separation,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 07, 2020, pp. 11661–11668.
- [29] R. Abiko and M. Ikehara, “Single image reflection removal based on gan with gradient constraint,” IEEE Access, vol. 7, pp. 148790–148799, 2019.
- [30] H. RahmaniKhezri, S. Kim, and M. Hefeeda, “Unsupervised single-image reflection removal,” IEEE Transactions on Multimedia, vol. 25, pp. 4958–4971, 2022.
- [31] S. Kim, Y. Huo, and S.-E. Yoon, “Single image reflection removal with physically-based training images,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 5164–5173.
- [32] Y. Y. Schechner, J. Shamir, and N. Kiryati, “Polarization and statistical analysis of scenes containing a semireflector,” JOSA A, vol. 17, no. 2, pp. 276–284, 2000.
- [33] A. Agrawal, R. Raskar, S. K. Nayar, and Y. Li, “Removing photography artifacts using gradient projection and flash-exposure sampling,” ACM Transactions on Graphics, vol. 24, no. 3, pp. 828–835, 2005.
- [34] N. Kong, Y.-W. Tai, and J. S. Shin, “A physically-based approach to reflection separation: from physical modeling to constrained optimization,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 2, pp. 209–221, 2013.
- [35] T. Xue, M. Rubinstein, C. Liu, and W. T. Freeman, “A computational approach for obstruction-free photography,” ACM Transactions on Graphics (TOG), vol. 34, no. 4, pp. 1–11, 2015.
- [36] T. Li and D. P. Lun, “Single-image reflection removal via a two-stage background recovery process,” IEEE Signal Processing Letters, vol. 26, no. 8, pp. 1237–1241, 2019.
- [37] Z. Li, J. Yang, Z. Liu, X. Yang, G. Jeon, and W. Wu, “Feedback network for image super-resolution,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 3867–3876.
- [38] P. Wieschollek, O. Gallo, J. Gu, and J. Kautz, “Separating reflection and transmission images in the wild,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 89–104.
- [39] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020.
- [40] R. Wan, B. Shi, L.-Y. Duan, A.-H. Tan, and A. C. Kot, “Benchmarking single-image reflection removal algorithms,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3922–3930.
- [41] C. Lei and Q. Chen, “Robust reflection removal with reflection-free flash-only cues,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14811–14820.
- [42] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” Advances in neural information processing systems, vol. 32, 2019.
- [43] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- [44] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18. Springer, 2015, pp. 234–241.
- [45] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
- [46] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 586–595.
- [47] Q. Hu and X. Guo, “Single image reflection separation via component synergy,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 13138–13147.
- [48] Y. Chang, C. Jung, J. Sun, and F. Wang, “Siamese dense network for reflection removal with flash and no-flash image pairs,” International Journal of Computer Vision, vol. 128, pp. 1673–1698, 2020.