Single Image Reflection Removal via Self-Supervised Diffusion Models

Zhengyang Lu, Weifan Wang, Tianhao Guo, Feng Wang

School of Design, Jiangnan University, China
Abstract

Reflections often degrade the visual quality of images captured through transparent surfaces, and reflection removal methods suffer from the shortage of paired real-world samples. This paper proposes a hybrid approach that combines cycle-consistency with denoising diffusion probabilistic models (DDPM) to effectively remove reflections from single images without requiring paired training data. The method introduces a Reflective Removal Network (RRN) that leverages DDPMs to model the decomposition process and recover the transmission image, and a Reflective Synthesis Network (RSN) that re-synthesizes the input image using the separated components through a nonlinear attention-based mechanism. Experimental results demonstrate the effectiveness of the proposed method on the SIR2 dataset, the Flash-Based Reflection Removal (FRR) dataset, and a newly introduced Museum Reflection Removal (MRR) dataset, showing superior performance compared to state-of-the-art methods.

Index Terms:
single image reflection removal, denoising diffusion models, cycle-consistency, artifact photography, heritage preservation, digital archiving

I Introduction

Reflections are a common occurrence in images captured through transparent surfaces, such as glass windows, mirrors, or protective covers [1]. These reflections often degrade the visual quality of the captured images, making them less useful for various computer vision and image processing tasks [2, 3]. Removing reflections from camera images is a challenging problem that has attracted significant attention in recent years [4] due to its practical importance in applications such as image enhancement [5, 6, 7], augmented reality [8], and computational photography [9, 10].

Denoising diffusion probabilistic models (DDPMs) have recently shown remarkable capabilities in modeling complex image distributions and generating high-quality images. Their gradual denoising process is particularly suited for reflection removal as it can progressively separate superimposed image components while maintaining structural coherence. Different from traditional methods that directly predict the reflection-free image, DDPMs can better handle the inherent ambiguity and multimodal nature of the reflection separation problem. Recent studies have demonstrated the effectiveness of combining diffusion models with transformer-based architectures for various image restoration tasks [11, 12, 13, 14, 15], suggesting the potential of leveraging such frameworks for complex degradation scenarios.

Existing methods for single image reflection removal (SIRR) can be broadly categorized into two groups: model-based methods and data-driven methods. Model-based methods [16, 17] typically rely on hand-crafted priors such as gradient sparsity to distinguish transmission from reflection. While these methods can handle simple reflections, they often fail to generalize to real-world images with complex reflections. In contrast, data-driven methods, especially those based on deep learning [18, 19, 20], learn to separate the layers from training data and have achieved promising results on real images. These methods can handle most reflections by learning from diverse datasets. Fig. 1 illustrates the superimposed reflections from multi-layered media in real-world scenarios. Most existing methods trained on synthetic datasets struggle to handle such cases due to the inherent difficulty in simultaneously capturing both reflection and transmission images as training samples.

Figure 1: Real-world reflections can be highly complex, often involving multiple superimposed reflections and blurring phenomena. Most methods trained on simple synthetic datasets struggle to reconstruct transmission images in the wild.

To address these challenges, we propose a self-supervised approach that combines cycle-consistency with denoising diffusion probabilistic models (DDPM) for single image reflection removal. The proposed method introduces a Reflective Removal Network that leverages DDPMs to model the decomposition process and recover the transmission image, and a Reflective Synthesis Network that re-synthesizes the input image using the separated components through a nonlinear attention-based mechanism. We conduct extensive experiments on synthetic and real-world datasets, including a newly introduced Museum Reflection Removal (MRR) dataset. The proposed method outperforms state-of-the-art techniques, achieving PSNR gains of 0.50-3.84 dB, SSIM improvements of 0.005-0.074, LPIPS reductions of 0.003-0.057, and RAM decreases of 0.0050-0.0561 compared to the baselines on the SIR2, FRR, and MRR datasets.

The main contributions of this work are:

  • We propose a reflection removal approach that combines cycle-consistency and DDPMs, effectively modeling the complex distribution of reflections and transmissions.

  • We introduce an attention-based Reflective Synthesis Network that adaptively fuses the separated components to reconstruct the input image with high fidelity.

  • We present the Museum Reflection Removal dataset, a diverse collection of real-world and synthetic images with reflections from various artistic contents, facilitating research on reflection removal in challenging scenarios.

  • We propose a Reflection Artifact Measure to quantify the reflection artifacts in the recovered transmission image, providing a comprehensive assessment of reflection removal performance.

  • We conduct extensive experiments and ablation studies, demonstrating the superiority of the proposed method over state-of-the-art techniques on multiple datasets.

II Related Work

Reflection removal from images has been an active area in computer vision. Existing approaches can be broadly categorized into two main groups: model-based methods and data-driven methods.

II-A Model-Based SIRR Methods

Model-based methods for single image reflection removal typically rely on handcrafted priors and optimization techniques to separate the reflection and transmission layers. Levin and Weiss [16] proposed a user-assisted approach that requires manual labeling of gradients to separate the layers. Li and Brown [17] developed a method based on optimizing a Laplacian data fidelity term and a gradient sparsity prior to remove reflections. These methods often struggle to handle complex reflections and require manual intervention, limiting their practicality.

Several pioneering approaches have laid the foundation for model-based reflection removal. Farid and Adelson [21] introduced independent component analysis to separate reflections and lighting, demonstrating the potential of statistical methods in layer decomposition. Building upon this, Szeliski et al. [22] developed an optimal approach for recovering layer images through constrained least squares and iterative refinement, though their method required multiple input images. A significant advancement came from Levin et al. [23], who proposed using local features and belief propagation to decompose a single image by minimizing the total amount of edges and corners. Kong et al. [24] later enhanced the separation quality by leveraging polarized images, introducing a constrained optimization framework that exploits mutually exclusive image information. While these model-based methods established important theoretical foundations and demonstrated promising results in controlled scenarios, they often struggle with complex real-world reflections due to their reliance on simplified assumptions about image statistics and gradient distributions.

II-B Data-Driven Methods

More recently, deep learning-based approaches have gained popularity for single-image reflection removal. Fan et al. [18] proposed a deep neural network architecture that learns to suppress reflections by explicitly modeling the ghosting effects. Zhang et al. [19] introduced a perceptual loss function and a reflection removal network to recover the background layer from a single image. Wei et al. [20] proposed an edge-guided reflection removal network that utilizes edge information to improve the separation of transmission and reflection layers. Yang et al. [25] developed a bidirectional deep network with a recurrent structure to iteratively refine the reflection removal results.

Several works have also explored the use of generative adversarial networks (GANs) for single-image reflection removal. Wan et al. [26] proposed a concurrent reflection removal network (CRRN) that utilizes a transformation-induced image formation model and a concurrent optimization strategy to remove reflections. Wen et al. [27] developed a dual attention network that exploits channel and spatial attention mechanisms to effectively remove reflections. Liu and Lu [28] proposed an unsupervised approach using GANs with self-supervision and cycle-consistency constraints. Abiko and Ikehara [29] introduced a gradient constraint loss in a GAN framework to minimize the correlation between background and reflection layers.

To address the challenge of limited paired training data, recent works have explored cycle consistency and physically-based approaches. RahmaniKhezri et al. [30] developed an unsupervised method using cross-coupled deep networks that leverages semantic features for layer separation, demonstrating strong performance without requiring extensive paired datasets. Kim et al. [31] utilized physically-based rendering to synthesize training data, successfully reproducing spatially variant anisotropic effects of glass reflection. They further introduced a backtrack network for removing complicated ghosting and defocused effects. These methods demonstrate the importance of realistic training data synthesis and cycle-consistent learning in improving reflection removal performance.

Multiple-image reflection removal methods utilize additional images or priors to aid the reflection removal process. These methods typically require capturing multiple images of the same scene under different conditions, such as varying polarization states or focus settings. Schechner et al. [32] proposed a method that uses a sequence of images captured with different polarization filters to separate the reflection and transmission layers. Agrawal et al. [33] utilized a pair of flash and no-flash images to remove reflections by exploiting the differences in the reflective properties of the two images. Kong et al. [34] proposed a method that uses multiple images captured with different focus settings to remove reflections based on depth-of-field differences. Xue et al. [35] developed a computational approach that utilizes a pair of images captured from slightly different viewpoints to remove reflections by exploiting the motion parallax.

More recently, deep learning-based approaches have also been explored for multiple-image reflection removal. Fan et al. [18] proposed a deep architecture that learns to remove reflections from a pair of images captured under different polarization states. Li et al. [36] developed a deep neural network that utilizes a pair of images captured with different focus settings to remove reflections. Li et al. [37] introduced a deep learning framework that uses a sequence of images captured with different polarization angles to remove reflections and recover the underlying scene. Wieschollek et al. [38] proposed a kernel-based method that separates the reflection and transmission layers using a kernel estimation approach.

Despite the progress made by these data-driven methods, they often struggle to handle strong and complex reflections, resulting in artifacts or residual reflections in the recovered images. Our proposed approach aims to address these limitations by leveraging the power of cycle-consistency and denoising diffusion probabilistic models to effectively remove reflections from single images.

III Proposed Method

In this section, we present a self-supervised method for single image reflection removal, which combines cycle-consistency and denoising diffusion probabilistic models (DDPMs). As shown in Fig. 2, the proposed approach consists of three main components: a Reflective Removal Network (RRN), a Reflective Synthesis Network (RSN), and a Transmission Discriminator (TD). The RRN utilizes the DDPM framework to model the decomposition process and recover the transmission image from the input image, while the RSN synthesizes the input image with reflections by combining the recovered transmission and reflection components through a nonlinear attention-based mechanism. The cycle-consistent framework enables the learning of mapping functions between the domains without paired training data.

Figure 2: The self-supervised diffusion model has three main components: Reflective Removal Network, Reflective Synthesis Network, and Transmission Discriminator.

III-A Cycle-Consistent Framework

The cycle-consistent framework enables the learning of mapping functions between two domains without paired training data. Let $\mathcal{X}$ denote the domain of camera images containing reflections, and $\mathcal{S}$ and $\mathcal{R}$ represent the domains of the transmission images and reflection images, respectively. We define two mapping functions: $G:\mathcal{X}\rightarrow\mathcal{S}\times\mathcal{R}$ and $F:\mathcal{S}\times\mathcal{R}\rightarrow\mathcal{X}$. The function $G$ aims to decompose an input camera image $x\in\mathcal{X}$ into its transmission image component $s\in\mathcal{S}$ and reflection image component $r\in\mathcal{R}$, while $F$ synthesizes the camera image $x\in\mathcal{X}$ from the recovered transmission image $s$ and reflection image $r$.

The cycle-consistency constraint enforces that the mapping functions $G$ and $F$ should be bijective, ensuring that the recovered components can be reconstructed back to the original input image. Mathematically, this constraint can be expressed as:

$$x \approx F(G(x)) \quad \forall x\in\mathcal{X} \tag{1}$$

Additionally, we introduce an inverse mapping $G^{-1}:\mathcal{S}\times\mathcal{R}\rightarrow\mathcal{X}$, which synthesizes the camera image directly from the transmission and reflection image components. The inverse cycle-consistency constraint is then formulated as:

$$s, r \approx G(G^{-1}(s,r)) \quad \forall s\in\mathcal{S},\ r\in\mathcal{R} \tag{2}$$
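The two constraints in Eqs. (1) and (2) can be checked numerically with a small sketch. Everything below is illustrative: `toy_G` and `toy_F` are hypothetical stand-ins (not the RRN or RSN), `toy_F` plays the role of both $F$ and $G^{-1}$, and the mean absolute error is an assumed distance measure.

```python
import numpy as np

def cycle_errors(G, F, x, s, r):
    """Return the mean absolute errors of both cycles.

    forward: x -> G(x) -> F(G(x)), compared with x          (Eq. 1)
    inverse: (s, r) -> F(s, r) -> G(F(s, r)), compared with (s, r)  (Eq. 2)
    """
    s_hat, r_hat = G(x)
    forward = np.mean(np.abs(x - F(s_hat, r_hat)))
    s_back, r_back = G(F(s, r))
    inverse = np.mean(np.abs(s - s_back)) + np.mean(np.abs(r - r_back))
    return forward, inverse

# Hypothetical toy mappings, used only to exercise the two constraints.
def toy_G(x):
    return 0.7 * x, 0.3 * x          # split an image into two fixed fractions

def toy_F(s, r):
    return s + r                     # additive re-synthesis

rng = np.random.default_rng(0)
x = rng.random((4, 4, 3))            # stand-in camera image
s = rng.random((4, 4, 3))            # stand-in transmission image
r = rng.random((4, 4, 3))            # stand-in reflection image

forward_err, inverse_err = cycle_errors(toy_G, toy_F, x, s, r)
```

With these toy mappings the forward cycle is satisfied up to rounding, while the inverse cycle is not, because a fixed split cannot recover an arbitrary pair $(s, r)$. This illustrates why the two constraints are stated separately.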

III-B Reflective Removal Network

The Reflective Removal Network (RRN) is responsible for decomposing the input camera image into its transmission image and reflection image components. The proposed approach builds upon the denoising diffusion probabilistic model (DDPM) framework[39] to effectively model the decomposition process and handle complex reflections.

Figure 3: Dual DDPM ensures accurate transmission and reflection image predictions.

The DDPM framework learns a mapping from a noisy image distribution to the clean image distribution through a forward diffusion process and a reverse diffusion process. Let $x_0\in\mathbb{R}^{H\times W\times 3}$ denote the input camera image, and $s_0\in\mathbb{R}^{H\times W\times 3}$ and $r_0\in\mathbb{R}^{H\times W\times 3}$ represent the corresponding ground truth transmission image and reflection image, respectively, as shown in Fig. 3. The DDPM framework defines a forward diffusion process $q(x_t|x_{t-1})$ that gradually adds Gaussian noise to the input image over $T$ timesteps, resulting in a sequence of noisy images $\{x_t\}_{t=1}^{T}$. The forward diffusion process can be described as a Markov chain with the following transition probability:

$$q(x_t|x_{t-1})=\mathcal{N}\left(x_t;\sqrt{1-\beta_t}\,x_{t-1},\beta_t I\right), \tag{3}$$

where $\beta_t\in(0,1)$ is a variance schedule that determines the amount of noise added at each timestep.
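As a minimal sketch of the Markov chain in Eq. (3), the snippet below applies the transition step by step. The linear schedule with $T=1000$ and $\beta_t$ ranging from $10^{-4}$ to $0.02$ is an assumption borrowed from common DDPM practice, not a setting reported in this section, and the random array is only a stand-in for an image.

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I), as in Eq. (3)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

T = 1000                               # assumed number of timesteps
betas = np.linspace(1e-4, 0.02, T)     # assumed linear variance schedule

rng = np.random.default_rng(0)
x0 = rng.random((32, 32, 3))           # stand-in for an H x W x 3 image in [0, 1)

x = x0
for beta_t in betas:                   # run the whole chain x_0 -> x_T
    x = forward_step(x, beta_t, rng)
x_T = x
```

Under this assumed schedule almost none of the original signal survives at the end of the chain, so `x_T` is close to a sample of standard Gaussian noise.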

In the context of reflection removal, we extend the forward diffusion process to generate two separate noisy image sequences: one for the transmission component $\{s_t\}_{t=1}^{T}$ and another for the reflection component $\{r_t\}_{t=1}^{T}$. This extension allows the RRN to model the decomposition of the input camera image into its constituent transmission and reflection components. The forward diffusion processes for the transmission and reflection components are defined as follows:

\begin{align}
q(s_t|s_{t-1}) &= \mathcal{N}\left(s_t;\sqrt{1-\beta_t}\,s_{t-1},\beta_t I\right), \tag{4} \\
q(r_t|r_{t-1}) &= \mathcal{N}\left(r_t;\sqrt{1-\beta_t}\,r_{t-1},\beta_t I\right), \tag{5}
\end{align}

where $s_t$ and $r_t$ denote the noisy transmission and reflection images at timestep $t$, respectively.

The objective of the Reflective Removal Network is to learn the reverse diffusion processes $p_\theta(s_{t-1}|s_t)$ and $p_\theta(r_{t-1}|r_t)$ that recursively denoise the noisy transmission and reflection images, ultimately reconstructing the clean transmission and reflection images $s_0$ and $r_0$, respectively. By leveraging the DDPM framework, the RRN can effectively model the complex distribution of reflections and transmissions in camera images, enabling accurate decomposition and removal of reflections.

Following the DDPM framework, the reverse diffusion processes are parameterized by a shared neural network with parameters $\theta$, denoted as $\epsilon_\theta(\cdot,t)$. The reverse diffusion processes can be described as follows:

\begin{align}
p_\theta(s_{t-1}|s_t) &= \mathcal{N}\left(s_{t-1};\mu_\theta(s_t,t),\sigma_t^2 I\right), \tag{6} \\
p_\theta(r_{t-1}|r_t) &= \mathcal{N}\left(r_{t-1};\mu_\theta(r_t,t),\sigma_t^2 I\right), \tag{7}
\end{align}

where $\mu_\theta(z_t,t)=\frac{1}{\sqrt{1-\beta_t}}\left(z_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_\theta(z_t,t)\right)$ with $z_t$ standing for either $s_t$ or $r_t$, $\sigma_t^2=\beta_t$, and $\bar{\alpha}_t=\prod_{i=1}^{t}(1-\beta_i)$.
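A single reverse step of Eqs. (6) and (7) can be sketched as follows. The names `reverse_step`, `eps_model`, and `oracle` are hypothetical, the schedule is an assumed linear one, and returning the mean without added noise at the final step is a common DDPM convention assumed here rather than something stated in the text. The noise predictor is passed in as a plain callable so the same function serves both the transmission and the reflection component.

```python
import numpy as np

def reverse_step(z_t, t, betas, alpha_bars, eps_model, rng):
    """One reverse step z_t -> z_{t-1} with the mean and variance given above.

    t is a 0-based index into the schedule, so t = 0 is the paper's timestep 1.
    """
    beta_t = betas[t]
    eps_hat = eps_model(z_t, t)
    mean = (z_t - beta_t / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(1.0 - beta_t)
    if t == 0:
        # Assumption: the final step returns the mean and adds no noise.
        return mean
    return mean + np.sqrt(beta_t) * rng.standard_normal(z_t.shape)

betas = np.linspace(1e-4, 0.02, 1000)      # assumed linear variance schedule
alpha_bars = np.cumprod(1.0 - betas)       # cumulative products of (1 - beta_i)

rng = np.random.default_rng(0)
z0 = rng.random((8, 8, 3))                 # stand-in clean component (s_0 or r_0)
eps = rng.standard_normal(z0.shape)
# One forward step from z_0, following the forward transition with beta_1.
z1 = np.sqrt(1.0 - betas[0]) * z0 + np.sqrt(betas[0]) * eps

oracle = lambda z, t: eps                  # hypothetical perfect noise predictor
z0_rec = reverse_step(z1, 0, betas, alpha_bars, oracle, rng)
```

With a perfect noise prediction, the step from timestep 1 recovers the clean image exactly, which is a quick sanity check on the form of $\mu_\theta$.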

The training objective for the Reflective Removal Network is to minimize the combination of the denoising loss and the reconstruction loss. The denoising loss $\mathcal{L}_{\text{de}}$ is derived from the variational bound on the negative log-likelihood of the data distribution. Specifically, for the transmission component $s_0$, the denoising loss can be expressed as:

\begin{align}
\mathcal{L}_{\text{de}}(s_0) &= \mathbb{E}_{s_0,\epsilon}\left[\frac{1}{T}\sum_{t=1}^{T}\left\|\epsilon-\epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\,s_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon,t\right)\right\|_2^2\right] \tag{8} \\
&= \mathbb{E}_{s_0,\epsilon}\left[\frac{1}{T}\sum_{t=1}^{T}\left\|\epsilon-\epsilon_\theta(s_t,t)\right\|_2^2\right], \tag{9}
\end{align}

where $\epsilon\sim\mathcal{N}(0,I)$ is the standard Gaussian noise, and $s_t=\sqrt{\bar{\alpha}_t}\,s_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon$ is the noisy transmission image at timestep $t$. The denoising loss measures the difference between the predicted noise $\epsilon_\theta(s_t,t)$ and the actual noise $\epsilon$ used to generate the noisy image $s_t$. By minimizing this loss, the network learns to predict the noise that needs to be removed from the noisy image to reconstruct the clean transmission image.
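The loss in Eq. (9) can be estimated as in the sketch below. Two choices here are assumptions rather than the authors' procedure: the average over all $T$ timesteps is replaced by a Monte Carlo average over a few randomly drawn timesteps, and the schedule is an assumed linear one. `denoising_loss` and `zero_model` are hypothetical names.

```python
import numpy as np

def denoising_loss(z0, eps_model, alpha_bars, rng, num_samples=8):
    """Monte Carlo estimate of the squared noise-prediction error in Eq. (9)."""
    total = 0.0
    for _ in range(num_samples):
        t = rng.integers(len(alpha_bars))              # 0-based timestep index
        eps = rng.standard_normal(z0.shape)            # noise the model must predict
        # Closed-form noisy image at timestep t.
        z_t = np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * eps
        total += np.sum((eps - eps_model(z_t, t)) ** 2)  # squared L2 norm
    return total / num_samples

alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule

rng = np.random.default_rng(0)
s0 = rng.random((16, 16, 3))                   # stand-in transmission image
zero_model = lambda z, t: np.zeros_like(z)     # hypothetical untrained predictor
loss_s = denoising_loss(s0, zero_model, alpha_bars, rng)
```

A predictor that always outputs zero leaves a loss of roughly one per pixel value, since the target noise has unit variance. A trained network should fall well below that baseline.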

Similarly, for the reflection component $r_0$, the denoising loss can be expressed as:

\begin{align}
\mathcal{L}_{\text{de}}(r_0) &= \mathbb{E}_{r_0,\epsilon}\left[\frac{1}{T}\sum_{t=1}^{T}\left\|\epsilon-\epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\,r_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon,t\right)\right\|_2^2\right] \tag{10} \\
&= \mathbb{E}_{r_0,\epsilon}\left[\frac{1}{T}\sum_{t=1}^{T}\left\|\epsilon-\epsilon_\theta(r_t,t)\right\|_2^2\right], \tag{11}
\end{align}

where $r_t=\sqrt{\bar{\alpha}_t}\,r_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon$ is the noisy reflection image at timestep $t$.

The overall denoising loss for the Reflective Removal Network is the sum of the denoising losses for the transmission and reflection components:

$$\mathcal{L}_{\text{de}}=\mathcal{L}_{\text{de}}(s_0)+\mathcal{L}_{\text{de}}(r_0). \tag{12}$$

In addition to the denoising loss, we introduce a reconstruction loss to ensure the fidelity of the reconstructed transmission and reflection images:

\begin{equation}
\mathcal{L}_{\text{rec}}=\mathbb{E}_{s_0,r_0}\left[\left|s_0-\hat{s}_0\right|_1+\left|r_0-\hat{r}_0\right|_1\right], \tag{13}
\end{equation}

where $\hat{s}_0$ and $\hat{r}_0$ are the reconstructed transmission and reflection images, respectively, obtained by recursively applying the reverse diffusion processes starting from the noisy images at timestep $T$.

III-C Transmission Discriminator

To address potential network degradation during unpaired training with the cycle-consistent network and to improve the accuracy of the recovered transmission image, we introduce a Transmission Discriminator (TD) as an adversarial component in the proposed framework.

The Transmission Discriminator is a convolutional neural network that learns to distinguish between the recovered transmission images generated by the Reflective Removal Network (RRN) and the real transmission images from the training dataset. By incorporating this discriminator, we encourage the RRN to generate visually realistic and accurate transmission images. The adversarial loss is defined as:

\begin{align}
\mathcal{L}_{\text{adv}} = {} & \mathbb{E}_{s\sim p_{\text{data}}(s)}[\log D(s)] + {} \tag{14}\\
& \mathbb{E}_{x\sim p_{\text{data}}(x)}[\log(1-D(G(x)))], \tag{15}
\end{align}

where $D$ denotes the Transmission Discriminator, $G$ represents the Reflective Removal Network, $s$ is a real transmission image sampled from the data distribution $p_{\text{data}}(s)$, and $x$ is an input camera image sampled from the data distribution $p_{\text{data}}(x)$.

The RRN is trained to minimize the adversarial loss, encouraging it to generate transmission images that are indistinguishable from real ones. The Transmission Discriminator, on the other hand, is trained to maximize the adversarial loss, improving its ability to distinguish between real and generated transmission images. In the full framework, the TD enhances the quality of the recovered transmission images by forcing the RRN to generate more realistic and accurate results. This adversarial training process helps maintain the complexity of the network and prevents degradation during unpaired training.
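The opposing roles of the two networks can be made concrete with a small sketch that evaluates Eqs. (14) and (15) from discriminator outputs. It assumes that $D$ outputs probabilities in $(0,1)$; the clipping constant `eps` is added here only for numerical stability and is not part of the formulation above.

```python
import numpy as np

def adversarial_loss(d_real, d_fake, eps=1e-12):
    """L_adv = E[log D(s)] + E[log(1 - D(G(x)))].

    d_real: discriminator outputs D(s) on real transmission images.
    d_fake: discriminator outputs D(G(x)) on recovered transmission images.
    """
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

# A confident discriminator pushes the value toward its maximum of 0,
# while a fooled discriminator drives it strongly negative.
confident = adversarial_loss([0.99, 0.98], [0.01, 0.02])
fooled = adversarial_loss([0.01, 0.02], [0.99, 0.98])
```

The discriminator update therefore ascends this quantity, while the RRN update descends it through the second term.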

The overall training objective for the Reflective Removal Network is updated to include the adversarial loss:

\begin{equation}
\mathcal{L}_{\text{RRN}}=\lambda_1\mathcal{L}_{\text{de}}+\lambda_2\mathcal{L}_{\text{rec}}+\lambda_3\mathcal{L}_{\text{adv}}, \tag{16}
\end{equation}

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are hyper-parameters controlling the relative importance of each loss term.
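A minimal sketch of how the terms in Eq. (16) combine is given below. It assumes that the $\ell_1$ terms of Eq. (13) are summed over pixels (implementations often average instead) and uses the weights 1.0, 10.0, and 0.1 reported in the experimental setup as defaults. The function names are illustrative.

```python
import numpy as np

def l1(a, b):
    """Sum of absolute differences between two arrays."""
    return float(np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)).sum())

def reconstruction_loss(s0, s0_hat, r0, r0_hat):
    """Single-sample version of Eq. (13): |s0 - s0_hat|_1 + |r0 - r0_hat|_1."""
    return l1(s0, s0_hat) + l1(r0, r0_hat)

def rrn_objective(l_de, l_rec, l_adv, lambdas=(1.0, 10.0, 0.1)):
    """Eq. (16): weighted sum of denoising, reconstruction, and adversarial losses."""
    lam1, lam2, lam3 = lambdas
    return lam1 * l_de + lam2 * l_rec + lam3 * l_adv
```

With these defaults the reconstruction term carries ten times the weight of the denoising term, and the adversarial term acts as a comparatively light regularizer.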

By incorporating the Transmission Discriminator, the proposed framework benefits from the adversarial training process, which helps maintain the complexity of the network and improves the accuracy of the recovered transmission images.

III-D Reflective Synthesis Network

The Reflective Synthesis Network (RSN) aims to synthesize the input image with reflections by combining the recovered transmission and reflection components. We propose a nonlinear attention-based combination module to adaptively fuse the transmission and reflection components while preserving the details and structures of the input image.

Figure 4: The Reflective Synthesis Network restores the original image from the transmission and reflection images using an attention mechanism.

As shown in Fig. 4, the RSN synthesizes the input image $\hat{x}\in\mathcal{X}$ by applying a nonlinear superposition of $s$ and $r$, guided by an attention mechanism. We first compute a set of spatial attention maps $A=\{a_1,a_2,\ldots,a_K\}$, where $K$ is the number of attention heads. Each attention map $a_k\in\mathbb{R}^{H\times W}$ is obtained by applying a convolutional operation followed by a softmax activation to the concatenation of $s$ and $r$:

\begin{equation}
a_k=\text{softmax}(f_c([s,r])) \tag{17}
\end{equation}

where $f_c$ represents a convolutional neural network with learnable parameters, and $[\cdot,\cdot]$ denotes the concatenation operation along the channel dimension.

The attention maps $A$ are then used to modulate the transmission image $s$ and the reflection image $r$, producing attended feature maps $\hat{s}$ and $\hat{r}$, respectively:

\begin{align}
\hat{s} &= \sum_{k=1}^{K} a_k\odot s \tag{18}\\
\hat{r} &= \sum_{k=1}^{K} (1-a_k)\odot r \tag{19}
\end{align}

where $\odot$ denotes the element-wise multiplication operation.

The final synthesized camera image $\hat{x}$ is obtained by combining the attended feature maps $\hat{s}$ and $\hat{r}$ through a nonlinear transformation:

\begin{equation}
\hat{x}=g([\hat{s},\hat{r}]) \tag{20}
\end{equation}

where $g$ is a convolutional neural network that learns to synthesize the camera image with reflections from the attended transmission and reflection image components.
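The data flow of Eqs. (17) to (20) can be traced with a short NumPy sketch. It is only a shape-level illustration under several assumptions: $f_c$ and $g$ are each reduced to a single 1×1 convolution (written as a channel-mixing `einsum`) with arbitrary weights instead of learned ones, and, following the channel-wise softmax entry in Table I, the softmax is taken across the $K$ heads at each pixel. The names `rsn_fuse`, `w_attn`, and `w_out` are hypothetical.

```python
import numpy as np

def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def rsn_fuse(s, r, w_attn, w_out):
    """s, r: (C, H, W) images; w_attn: (K, 2C); w_out: (C, 2C)."""
    sr = np.concatenate([s, r], axis=0)                 # [s, r] along channels, (2C, H, W)
    logits = np.einsum("kc,chw->khw", w_attn, sr)       # 1x1 conv standing in for f_c
    a = softmax(logits, axis=0)                         # Eq. (17): K maps of size (H, W)
    s_hat = (a[:, None] * s[None]).sum(axis=0)          # Eq. (18)
    r_hat = ((1.0 - a)[:, None] * r[None]).sum(axis=0)  # Eq. (19)
    fused = np.concatenate([s_hat, r_hat], axis=0)      # [s_hat, r_hat], (2C, H, W)
    x_hat = np.einsum("oc,chw->ohw", w_out, fused)      # 1x1 conv standing in for g
    return x_hat, a

rng = np.random.default_rng(0)
C, H, W, K = 3, 4, 5, 2
s, r = rng.random((C, H, W)), rng.random((C, H, W))
x_hat, a = rsn_fuse(s, r, rng.standard_normal((K, 2 * C)), rng.standard_normal((C, 2 * C)))
```

Each attention map broadcasts over the colour channels, so `x_hat` keeps the shape of the input images.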

To ensure that the synthesized camera image closely resembles the original input image, we introduce a cycle-consistency loss as the overall objective function for the Reflective Synthesis Network:

\begin{equation}
\mathcal{L}_{\text{RSN}}=\mathcal{L}_{\text{cyc}}=\left\lVert x-\hat{x}\right\rVert_1=\left\lVert x-F(G(x))\right\rVert_1 \tag{21}
\end{equation}

where $F$ denotes the RSN and $G$ the RRN.

III-E Training and Inference

The training and inference processes of the proposed method comprise three parts: the paired sample training flow, the unpaired sample training flow in the cycle-consistent network, and the single-input inference flow.

III-E1 Paired Sample Training Flow

The paired sample training flow leverages paired data, consisting of input camera images with reflections and their corresponding ground truth transmission images. The training process alternates between updating the Reflective Removal Network (RRN) and the Reflective Synthesis Network (RSN). Algorithm 1 outlines the paired sample training flow.

Algorithm 1 Paired Sample Training Flow
Input:  Paired training data $\{(x_i,s_i,r_i)\}_{i=1}^{N}$, where $x_i$ is the input camera image, $s_i$ is the ground truth transmission image and $r_i$ is the ground truth reflection image
Output:  Trained RRN and RSN models
1:  while not converged do
2:     Sample a mini-batch of paired data $\{(x,s,r)\}$
3:     Update RRN:
4:         Compute $\hat{s},\hat{r}=\text{RRN}(x)$
5:         Compute denoising loss $\mathcal{L}_{\text{de}}$
6:         Compute reconstruction loss $\mathcal{L}_{\text{rec}}$
7:         Compute adversarial loss $\mathcal{L}_{\text{adv}}$
8:         Update RRN parameters to minimize $\mathcal{L}_{\text{RRN}}=\lambda_1\mathcal{L}_{\text{de}}+\lambda_2\mathcal{L}_{\text{rec}}+\lambda_3\mathcal{L}_{\text{adv}}$
9:         Update TD parameters to maximize $\mathcal{L}_{\text{adv}}$
10:     Update RSN:
11:         Compute $\hat{x}=\text{RSN}(\hat{s},\hat{r})$
12:         Compute cycle-consistency loss $\mathcal{L}_{\text{cyc}}$
13:         Update RSN parameters to minimize $\mathcal{L}_{\text{cyc}}$
14:  end while

During the RRN update step, the input camera image $x$ is passed through the RRN to obtain the estimated transmission image $\hat{s}$ and reflection image $\hat{r}$. The denoising loss $\mathcal{L}_{\text{de}}$ and reconstruction loss $\mathcal{L}_{\text{rec}}$ are computed based on the estimated and ground truth transmission images. The RRN parameters are then updated using gradient descent to minimize these losses. In the RSN update step, the estimated transmission image $\hat{s}$ and reflection image $\hat{r}$ are fed into the RSN to synthesize the camera image $\hat{x}$. The cycle-consistency loss $\mathcal{L}_{\text{cyc}}$ is computed between the input camera image $x$ and the synthesized image $\hat{x}$. The RSN parameters are updated using gradient descent to minimize the cycle-consistency loss.

III-E2 Unpaired Sample Training Flow

During unpaired training, the proposed method takes input camera images with reflections for which no ground truth transmission image is available. The unpaired sample training flow trains the RRN, RSN, and TD jointly in a cycle-consistent framework. Algorithm 2 describes the unpaired sample training flow with the transmission discriminator.

Algorithm 2 Unpaired Sample Training Flow
Input:  Unpaired training data $\{x_i\}_{i=1}^{N}$, where $x_i$ is the input camera image, $s_i$ is the transmission image and $r_i$ is the reflection image
Output:  Trained RRN and RSN models
1:  while not converged do
2:     Sample a mini-batch of unpaired data $x$ and $s$
3:     Update RRN, RSN, and TD:
4:         Compute $\hat{s},\hat{r}=\text{RRN}(x)$
5:         Compute $\hat{x}=\text{RSN}(\hat{s},\hat{r})$
6:         Compute cycle-consistency loss $\mathcal{L}_{\text{cyc}}=\left\lVert x-\hat{x}\right\rVert_1$
7:         Compute adversarial loss $\mathcal{L}_{\text{adv}}=\text{TD}(\hat{s})$
8:         Update RRN and RSN parameters to minimize $\mathcal{L}_{\text{cyc}}+\lambda_{\text{adv}}\mathcal{L}_{\text{adv}}$
9:  end while

Given an input camera image $x$, the RRN first predicts the transmission image $\hat{s}$ and reflection image $\hat{r}$. The RSN then takes $\hat{s}$ and $\hat{r}$ as inputs and synthesizes the reconstructed camera image $\hat{x}$. The cycle-consistency loss $\mathcal{L}_{\text{cyc}}$ is computed between the input image $x$ and the reconstructed image $\hat{x}$ to enforce the bijective mapping between the domains. In addition to the cycle-consistency loss, we also incorporate the TD to improve the quality of the predicted transmission image $\hat{s}$. The adversarial loss $\mathcal{L}_{\text{adv}}$ is computed by passing $\hat{s}$ through the TD, which learns to distinguish between real and predicted transmission images. During training, the RRN and RSN parameters are updated simultaneously to minimize the combined loss function, which is the weighted sum of the cycle-consistency loss and the adversarial loss.

III-E3 Model Inference

During inference, our method takes a single input image and processes it through the trained RRN to obtain reflection-free results. This process is deterministic and requires no iterative optimization, enabling efficient real-world applications.

The inference pipeline consists of two stages: first, the RRN decomposes the input into transmission and reflection components through a series of denoising steps. Then, only the transmission component is extracted as the final output, discarding the reflection component.

Computationally, inference requires only a single forward pass through the RRN, making it significantly more efficient than training. The RSN and TD components are not used during inference as they serve only training purposes. For an input image of resolution $H\times W$, the inference process has a computational complexity of $O(HW)$, with actual runtime primarily determined by the network architecture and available computing resources.

Algorithm 3 Model Inference Process
Input:  Input image $x$, trained RRN model
Output:  Reflection-free transmission image
1:  Load trained RRN parameters
2:  $s,r=\text{RRN}(x)$ {Decompose input}
3:  $\hat{s}=\text{normalize}(s)$ {Post-process transmission}
4:  return  $\hat{s}$ {Return transmission only}

IV Experimental Results

In this section, we present comprehensive evaluations of the proposed method for reflection removal from camera images. We first introduce the datasets used for training and testing, followed by details of the experimental setup. We then compare the proposed method with existing state-of-the-art techniques and demonstrate its effectiveness in real-world scenarios. Finally, we conduct ablation studies to analyze the contribution of different components and loss functions.

IV-A Datasets

We evaluate the proposed method on two public benchmarks, SIR2 and FRR, and on the newly introduced Museum Reflection Removal (MRR) dataset:

SIR2 Dataset

For real-world evaluation, we utilize the publicly available SIR2 dataset [40]. The SIR2 dataset contains 500 real-world image pairs captured through glass surfaces with reflections. We use the SIR2 dataset to assess the generalization ability of reflection removal methods in real-world scenarios.

Flash-Based Reflection Removal (FRR) Dataset

The FRR dataset[41] is the first to provide RAW flash/no-flash image pairs for reflection removal. It contains 157 real-world scenes captured by Nikon Z6 and Huawei Mate30, and 1964 synthetic scenes. Each real-world scene includes ambient images with and without reflection captured under flash and no-flash conditions, and a no-flash reflection image. The dataset enables flash-based reflection removal.

Museum Reflection Removal (MRR) Dataset

To evaluate the performance of the proposed method on diverse artistic contents in museums, we introduce the Museum Reflection Removal (MRR) dataset. As shown in Fig. 6, the dataset comprises 1,621 high-quality unpaired images and 721 paired images collected from various museums and exhibitions, covering geological specimens, botanical specimens, animal specimens, anthropological artifacts, art objects, historical artifacts, technological artifacts, and archaeological finds. As shown in Fig. 5, each image in the dataset contains reflections caused by protective glass or display cases, presenting a challenging real-world scenario for reflection removal algorithms. To facilitate paired training, the paired images are created by combining reflection-free samples with exhibition scenes to produce realistic reflections.

Figure 5: The Museum Reflection Removal Dataset includes exhibition samples with reflections from various fields.
Figure 6: Artifacts Distribution in MRR Dataset.

IV-B Experimental Setup

We implement the proposed method using the PyTorch deep-learning framework [42]. All experiments were conducted on a workstation with NVIDIA GeForce RTX 4090 GPUs, 128GB RAM, and an Intel Xeon Gold 6226R CPU. The model training takes approximately 48 hours for both stages combined. The RRN and the Reflective Synthesis Network (RSN) are trained jointly using the Adam optimizer [43] with a learning rate of 1e-4 and a batch size of 8. The RRN is based on the DDPM architecture, with a modified UNet backbone [44] consisting of 8 downsampling and 8 upsampling layers. The RSN employs a similar architecture with 4 downsampling and 4 upsampling layers, along with the attention-based combination module. The hyperparameters $\lambda_1$, $\lambda_2$, and $\lambda_3$, which control the weights of the denoising loss, reconstruction loss, and adversarial loss, respectively, are set to 1.0, 10.0, and 0.1 based on empirical observations.

TABLE I: Network Architecture and Training Parameters

| Component | Parameter | Value |
|---|---|---|
| RRN | Number of layers | 8 down + 8 up |
| | Channel dimensions | [128,256,512,512,512,512,512,512] |
| | Diffusion timesteps | T = 1000 |
| | Beta schedule | Linear ($10^{-4}$ to 0.02) |
| | Attention resolutions | 16×16, 32×32 |
| RSN | Feature extraction | 4 down + 4 up |
| | Channel dimensions | [64,128,256,512] |
| | Attention module | Single-layer with 1×1 conv |
| | Softmax normalization | Channel-wise |
| | Feature fusion | Dual-stream weighted sum |
| | Final synthesis | 1×1 conv |

Table I presents the detailed architecture and training parameters of our networks. The RRN adopts a U-Net backbone with symmetric skip connections, where the number of channels doubles after each downsampling operation until reaching 512. For the RSN, we implement a novel attention-based fusion mechanism that operates on the reconstructed transmission $s$ and reflection $r$ images. The attention module uses 1×1 convolutions followed by softmax normalization to generate attention maps $a_k$, which are then used to weight and combine the features from both streams. This adaptive weighting mechanism allows the network to selectively focus on relevant features from each component when synthesizing the final image. The weighted features are further processed through a 1×1 convolution layer to produce the synthesized output. Both networks are trained end-to-end using the Adam optimizer with carefully tuned loss weights to balance the different learning objectives.

For experimental evaluation, we employed a two-stage training strategy. First, we pre-trained our model using the 721 paired samples from the MRR dataset following Algorithm 1, where each pair consists of a reflection-contaminated image and its corresponding reflection-free image. Subsequently, we fine-tuned the pre-trained model using Algorithm 2 on the remaining 1,621 unpaired samples, alternating between updating the RRN and RSN using separate mini-batches of reflection-contaminated and reflection-free images. This sequential training approach enables the model to first learn basic reflection removal from supervised data, then adapt to domain-specific characteristics through self-supervision. For fair comparison, all baseline methods were trained using the same paired training data from the MRR dataset, and the quantitative results reported in Table III were obtained using our final model after both training stages.

IV-C Evaluation Metrics

To comprehensively assess the performance of the proposed reflection removal method, we employ a combination of traditional image quality metrics and advanced perceptual quality measures. Specifically, we use Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM) [45], Learned Perceptual Image Patch Similarity (LPIPS) [46] and the proposed Reflection Artifact Measure (RAM).

Figure 7: The Fourier transform difference proves to be an effective method for recognizing reflections, making it an ideal evaluation metric for assessing reflection removal results.

The Reflection Artifact Measure (RAM) is a novel metric designed to quantify the presence of frequency-aware reflection components in the recovered transmission layer. As illustrated in Fig. 7, RAM is defined as follows:

\begin{equation}
RAM=\frac{1}{N}\sum_{i=1}^{N}\frac{\left|\mathcal{F}^{-1}\left(\mathcal{F}(\hat{T}_i)-\mathcal{F}(T_i)\right)\right|_1}{\left|T_i\right|_1} \tag{22}
\end{equation}

where $\mathcal{F}(\cdot)$ and $\mathcal{F}^{-1}(\cdot)$ denote the Fourier transform and its inverse, respectively. By measuring the energy of the frequency-aware components and normalizing it by the total energy of the transmission layer $T_i$, RAM provides a quantitative assessment of the residual reflection components in the recovered image. A lower RAM score indicates better suppression of reflection artifacts and higher quality of the reflection removal results.
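Eq. (22) can be evaluated directly with a discrete Fourier transform. The sketch below is one possible NumPy reading of the definition, assuming that $\hat{T}_i$ and $T_i$ are the recovered and reference transmission images stored as arrays of equal shape, and that $|\cdot|_1$ is the sum of absolute values. The function name `ram` is illustrative.

```python
import numpy as np

def ram(pred, target):
    """Reflection Artifact Measure over a batch of images.

    pred, target: arrays of shape (N, H, W) or (N, H, W, C).
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    scores = []
    for t_hat, t in zip(pred, target):
        # Difference of the 2-D spectra, taken over the two spatial axes.
        diff_spec = np.fft.fft2(t_hat, axes=(0, 1)) - np.fft.fft2(t, axes=(0, 1))
        # Back to the spatial domain; the magnitude handles round-off in the imaginary part.
        residual = np.fft.ifft2(diff_spec, axes=(0, 1))
        scores.append(np.abs(residual).sum() / np.abs(t).sum())
    return float(np.mean(scores))
```

A perfect recovery gives a score of zero, and the score grows with the magnitude of the residual relative to the reference image.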

TABLE II: Human-centered Comparison of Evaluation Metrics on Reflection Removal Assessment on SIR2 Dataset

| Metric | Human Correlation | Reflection Detection Rate | False Positive Rate |
|---|---|---|---|
| PSNR | 0.76 | 68.4% | 12.3% |
| SSIM | 0.79 | 71.2% | 9.8% |
| LPIPS | 0.81 | 75.8% | 8.2% |
| RAM (Ours) | 0.85 | 89.5% | 6.5% |

To validate the effectiveness of RAM, we conducted a correlation analysis between metric scores and human perceptual evaluations on the SIR2 dataset. While PSNR and SSIM show moderate correlation with human judgments of reflection removal quality (Pearson coefficients of 0.76 and 0.79, respectively), RAM demonstrates a stronger correlation (0.85), specifically for reflection artifact detection. Additionally, RAM identifies residual reflections in cases where traditional metrics fail, particularly in textured regions where PSNR and SSIM may overlook subtle reflection artifacts.

IV-D Comparison with State-of-the-Art Methods

We conduct a comprehensive evaluation of the method against state-of-the-art techniques, including CEILNet [18], PercepNet [19], BDN [25], ERRNet [20], KimFe et al. [38], DSRN [47] and SDN [48], on three benchmark datasets: SIR2 [40], FRR[41] and the proposed MRR dataset. These datasets cover various real-world and synthetic scenarios, enabling a thorough assessment of reflection removal performance.

Table III demonstrates that the proposed method outperforms state-of-the-art reflection removal techniques on the SIR2, FRR, and MRR datasets across all evaluation metrics. Deep learning approaches such as CEILNet [18], PercepNet [19], BDN [25], and ERRNet [20] achieve competitive results but are surpassed by the proposed method. Our approach achieves an average PSNR gain of 0.50-3.84 dB, SSIM improvement of 0.005-0.074, LPIPS reduction of 0.003-0.057, and RAM decrease of 0.0050-0.0561 compared to the baselines. The superior quantitative results on diverse datasets, including significant improvements in the proposed RAM, demonstrate the effectiveness of combining cycle-consistency and denoising diffusion models for single image reflection removal, particularly in suppressing reflection artifacts. Specifically, the proposed method achieves RAM scores of 0.0549, 0.0437, and 0.0498 on the SIR2, FRR, and MRR datasets, respectively, which are considerably lower than the other methods, indicating better suppression of reflection artifacts.

Further analysis reveals scenario-specific performance variations. Our method shows particular advantages in handling complex museum artifacts (+2.1dB PSNR improvement over ERRNet) and textured surfaces (+1.8dB over SDN), likely due to the diffusion model’s strong capability in handling multimodal distributions. However, for scenes with extremely strong reflections or motion blur, the performance gain is more modest (+0.3dB over DSRN). Traditional methods like BDN perform competitively on simple flat surfaces but struggle with layered reflections where our approach excels.

TABLE III: Quantitative comparison with state-of-the-art methods on the SIR2, FRR and MRR datasets.

| Method | SIR2 PSNR | SIR2 SSIM | SIR2 LPIPS | SIR2 RAM | FRR PSNR | FRR SSIM | FRR LPIPS | FRR RAM | MRR PSNR | MRR SSIM | MRR LPIPS | MRR RAM | #Params |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CEILNet [18] | 25.23 | 0.861 | 0.132 | 0.1110 | 22.14 | 0.793 | 0.185 | 0.0998 | 21.37 | 0.778 | 0.193 | 0.1061 | 45.2M |
| PercepNet [19] | 26.17 | 0.879 | 0.116 | 0.0924 | 23.08 | 0.815 | 0.169 | 0.0812 | 22.25 | 0.802 | 0.177 | 0.0873 | 58.7M |
| BDN [25] | 27.35 | 0.895 | 0.099 | 0.0736 | 23.92 | 0.828 | 0.157 | 0.0624 | 23.11 | 0.817 | 0.164 | 0.0685 | 64.3M |
| ERRNet [20] | 27.63 | 0.901 | 0.095 | 0.0662 | 24.21 | 0.834 | 0.151 | 0.0550 | 23.48 | 0.825 | 0.158 | 0.0611 | 71.5M |
| KimFe et al. [38] | 24.57 | 0.842 | 0.147 | 0.1048 | 21.63 | 0.776 | 0.198 | 0.0936 | 20.89 | 0.761 | 0.205 | 0.0999 | 43.8M |
| DSRN [47] | 27.82 | 0.905 | 0.092 | 0.0624 | 24.39 | 0.837 | 0.147 | 0.0512 | 23.74 | 0.830 | 0.153 | 0.0573 | 82.1M |
| SDN [48] | 27.91 | 0.907 | 0.090 | 0.0599 | 24.52 | 0.839 | 0.145 | 0.0487 | 23.86 | 0.832 | 0.151 | 0.0548 | 85.3M |
| Ours | 28.41 | 0.912 | 0.087 | 0.0549 | 24.76 | 0.841 | 0.142 | 0.0437 | 24.15 | 0.835 | 0.148 | 0.0498 | 89.7M |
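The margins quoted in the text can be recomputed from the table entries. The short script below does this for the SIR2 PSNR column and for the parameter overhead relative to SDN; the numbers are copied from Table III, and the variable names are illustrative.

```python
# PSNR (dB) on SIR2 and parameter counts (millions), copied from Table III.
sir2_psnr = {
    "CEILNet": 25.23, "PercepNet": 26.17, "BDN": 27.35, "ERRNet": 27.63,
    "KimFe et al.": 24.57, "DSRN": 27.82, "SDN": 27.91,
}
ours_psnr = 28.41
params_m = {"SDN": 85.3, "Ours": 89.7}

# Margin of the proposed method over each baseline, rounded to table precision.
psnr_gain = {name: round(ours_psnr - value, 2) for name, value in sir2_psnr.items()}
smallest, largest = min(psnr_gain.values()), max(psnr_gain.values())

# Relative parameter overhead with respect to SDN, in percent.
overhead_pct = round(100.0 * (params_m["Ours"] / params_m["SDN"] - 1.0), 1)

print(smallest, largest, overhead_pct)  # prints: 0.5 3.84 5.2
```

On SIR2 the PSNR margin thus ranges from 0.50 dB (over SDN) to 3.84 dB (over KimFe et al.), and the proposed model has about 5.2% more parameters than SDN.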

Fig. 8 provides qualitative comparisons of the proposed method with the baseline techniques on sample images from the SIR2 and MRR datasets. The proposed method effectively removes the reflective component while preserving the original image content and structures, outperforming the existing methods in terms of visual quality and realism. Fully supervised approaches, which rely on paired samples, can also yield reasonable results. However, these methods may encounter limitations in certain scenarios, leading to artifacts or incomplete removal in the recovered transmission images.

While our method has a slightly larger model size (89.7M parameters) than recent approaches such as SDN (85.3M) and DSRN (82.1M), the increased capacity comes primarily from the attention mechanism in the RSN and the dual-branch structure in the diffusion model, which are essential for handling complex reflection patterns. Earlier methods such as CEILNet (45.2M) and KimFe (43.8M) are smaller but show limited capability on complex scenes. The moderate increase in parameters (approximately 5% relative to SDN) yields consistent gains across all metrics on all three datasets (for example, +0.50 dB PSNR on SIR2), indicating a favorable trade-off between model complexity and reflection removal capability.
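The trade-off described above can be checked directly against Table III. The following sketch copies the SDN and proposed-method rows from the table and computes the relative parameter increase and the per-dataset PSNR margin. The dictionary layout and helper names are assumptions made for illustration only.

```python
# Rows copied from Table III: (PSNR, SSIM, LPIPS, RAM) per dataset,
# plus the parameter count in millions.
TABLE_III = {
    "SDN": {
        "params_m": 85.3,
        "SIR2": (27.91, 0.907, 0.090, 0.0599),
        "FRR": (24.52, 0.839, 0.145, 0.0487),
        "MRR": (23.86, 0.832, 0.151, 0.0548),
    },
    "Ours": {
        "params_m": 89.7,
        "SIR2": (28.41, 0.912, 0.087, 0.0549),
        "FRR": (24.76, 0.841, 0.142, 0.0437),
        "MRR": (24.15, 0.835, 0.148, 0.0498),
    },
}


def param_increase_percent(method: str, baseline: str) -> float:
    """Relative parameter increase of `method` over `baseline`, in percent."""
    ours = TABLE_III[method]["params_m"]
    base = TABLE_III[baseline]["params_m"]
    return 100.0 * (ours - base) / base


def psnr_gain_db(method: str, baseline: str, dataset: str) -> float:
    """PSNR margin (dB) of `method` over `baseline` on one dataset."""
    return round(TABLE_III[method][dataset][0] - TABLE_III[baseline][dataset][0], 2)


print(f"Parameter increase over SDN: {param_increase_percent('Ours', 'SDN'):.1f}%")
for dataset in ("SIR2", "FRR", "MRR"):
    print(f"{dataset}: +{psnr_gain_db('Ours', 'SDN', dataset):.2f} dB PSNR over SDN")
```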

Figure 8: Qualitative comparison of the proposed method with state-of-the-art techniques on sample images from the SIR2 and MRR datasets. From left to right: Input image with reflections, results from BDN [25], DSRN [47], SDN [48] and the proposed method.

IV-E Ablation experiments

To investigate the contribution of each component of the proposed method, we conduct ablation experiments on the SIR2 dataset. We evaluate the proposed method under the following settings:

  • Full Model: The complete proposed framework with all components and loss functions.

  • w/o TD: The proposed framework without the Transmission Discriminator and adversarial loss.

  • w/o Attention: The proposed framework with the attention module in the Reflective Synthesis Network replaced by simple superposition.

  • w/o RSN: The proposed framework without the Reflective Synthesis Network and cycle-consistency loss.
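One way to organize the four settings listed above is as a set of boolean switches. The sketch below is a hypothetical configuration layout, not the authors' code: the class, field, and loss names are assumptions, and only the relationships stated in the list (the TD is removed together with the adversarial loss, the RSN together with the cycle-consistency loss, and the attention module is swapped for superposition) are taken from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AblationConfig:
    """Hypothetical switches describing one ablation setting."""
    name: str
    use_td: bool = True          # Transmission Discriminator
    use_attention: bool = True   # attention module inside the RSN
    use_rsn: bool = True         # Reflective Synthesis Network

    def optional_losses(self) -> List[str]:
        """Loss terms that depend on the ablated components."""
        losses = []
        if self.use_td:
            losses.append("adversarial")
        if self.use_rsn:
            losses.append("cycle_consistency")
        return losses

    def synthesis_mode(self) -> str:
        """How the separated layers are recombined, if at all."""
        if not self.use_rsn:
            return "none"
        return "attention" if self.use_attention else "superposition"


ABLATIONS = [
    AblationConfig("Full Model"),
    AblationConfig("w/o TD", use_td=False),
    AblationConfig("w/o Attention", use_attention=False),
    AblationConfig("w/o RSN", use_rsn=False),
]

for cfg in ABLATIONS:
    print(f"{cfg.name}: losses={cfg.optional_losses()}, synthesis={cfg.synthesis_mode()}")
```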

Figure 9: Qualitative ablation results on the MRR dataset. From top to bottom: full transmission image, full model, w/o TD, w/o attention, and w/o RSN.

Table IV presents the quantitative results of the ablation study. The full model achieves the best performance across all evaluation metrics, demonstrating the effectiveness of the proposed components and loss functions. Removing the Transmission Discriminator lowers PSNR by 1.26 dB and SSIM by 0.016, and raises LPIPS by 0.015 and RAM by 0.0334. This indicates the importance of the adversarial loss and the TD in improving the quality of the recovered transmission image. The attention-based combination module in the RSN also contributes positively to the overall performance. Without the attention module, PSNR decreases by 2.09 dB, SSIM drops by 0.029, LPIPS increases by 0.027, and RAM increases by 0.0676, highlighting the significance of the attention mechanism in adaptively fusing the transmission and reflection components. Removing the Reflective Synthesis Network causes the largest degradation among the ablation settings: PSNR decreases by 2.54 dB, SSIM drops by 0.041, LPIPS increases by 0.040, and RAM increases by 0.1058, emphasizing the crucial role of the RSN and the cycle-consistency loss in the proposed framework.

For computational complexity, our full model requires 89.4G FLOPs for processing a 512×512 image, with the TD and RSN components contributing additional overhead compared to the baseline architectures. While removing these components reduces computational cost (65.3G FLOPs for w/o RSN), the significant performance degradation suggests that the added complexity is justified by the quality improvements. This trade-off between computational efficiency and reflection removal performance provides flexibility for different application scenarios.

Fig. 9 presents the qualitative results of the ablation study on sample images from the MRR dataset. The full model achieves the best visual quality in the recovered transmission images, successfully removing the reflections while preserving the image content. Removing the Transmission Discriminator (w/o TD) leads to incomplete removal of reflections in some regions. The absence of the attention module (w/o attention) results in artifacts and distortions in the recovered images. Without the Reflective Synthesis Network (w/o RSN), the quality of the transmission images degrades significantly, with visible residual reflections and loss of details.

TABLE IV: Quantitative results of the ablation study on the SIR2 dataset.

| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | RAM ↓ | FLOPs (G) |
|---|---|---|---|---|---|
| Full Model | 28.41 | 0.912 | 0.087 | 0.0549 | 89.4 |
| w/o TD | 27.15 | 0.896 | 0.102 | 0.0883 | 76.2 |
| w/o Attention | 26.32 | 0.883 | 0.114 | 0.1225 | 82.8 |
| w/o RSN | 25.87 | 0.871 | 0.127 | 0.1607 | 65.3 |
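The differences quoted in the ablation discussion can be reproduced from Table IV. The sketch below copies the table rows and reports, for each ablated variant, the change in every metric relative to the full model together with the FLOPs saved. The data layout and function name are illustrative assumptions.

```python
# Rows copied from Table IV: PSNR, SSIM, LPIPS, RAM, FLOPs (G).
METRICS = ("PSNR", "SSIM", "LPIPS", "RAM", "FLOPs")
TABLE_IV = {
    "Full Model": (28.41, 0.912, 0.087, 0.0549, 89.4),
    "w/o TD": (27.15, 0.896, 0.102, 0.0883, 76.2),
    "w/o Attention": (26.32, 0.883, 0.114, 0.1225, 82.8),
    "w/o RSN": (25.87, 0.871, 0.127, 0.1607, 65.3),
}


def deltas_vs_full(variant: str) -> dict:
    """Metric change of an ablated variant relative to the full model.

    Negative PSNR/SSIM and positive LPIPS/RAM indicate degradation;
    negative FLOPs indicates reduced computational cost.
    """
    full = TABLE_IV["Full Model"]
    row = TABLE_IV[variant]
    return {m: round(v - f, 4) for m, v, f in zip(METRICS, row, full)}


for variant in ("w/o TD", "w/o Attention", "w/o RSN"):
    d = deltas_vs_full(variant)
    print(f"{variant}: PSNR {d['PSNR']:+.2f} dB, SSIM {d['SSIM']:+.3f}, "
          f"LPIPS {d['LPIPS']:+.3f}, RAM {d['RAM']:+.4f}, FLOPs {d['FLOPs']:+.1f} G")
```

Under these numbers, dropping the RSN saves the most computation (24.1G FLOPs) but also costs the most quality, while dropping the attention module saves only 6.6G FLOPs for a 2.09 dB PSNR loss.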

IV-F Runtime Analysis

We analyze the computational complexity and runtime of the proposed method in comparison with existing state-of-the-art approaches. Table V reports the average runtime of each method at different image resolutions on a single NVIDIA GeForce RTX 4090s GPU.

TABLE V: Runtime Analysis and Comparison (in seconds) on Different Image Resolutions

| Method | Parameters | 256×256 | 512×512 | 1024×1024 |
|---|---|---|---|---|
| CEILNet [18] | 45.2M | 0.032 | 0.128 | 0.482 |
| BDN [25] | 64.3M | 0.041 | 0.156 | 0.589 |
| ERRNet [20] | 71.5M | 0.045 | 0.167 | 0.634 |
| DSRN [47] | 82.1M | 0.053 | 0.198 | 0.756 |
| SDN [48] | 85.3M | 0.057 | 0.212 | 0.823 |
| Ours | 89.7M | 0.061 | 0.235 | 0.892 |

As expected, lighter models such as CEILNet achieve faster inference (0.128 s at 512×512) due to their simpler architectures. Our method shows competitive runtime (0.235 s at 512×512) despite incorporating more sophisticated components such as the attention mechanism and the dual-branch diffusion structure. The increased latency is justified by the improvement in reflection removal quality demonstrated in our quantitative results. For all methods, runtime grows roughly fourfold each time the image side length doubles, that is, approximately quadratically in side length and linearly in pixel count, which aligns with the theoretical complexity of convolutional operations. For applications that need near-real-time processing, our method still processes about four frames per second at 512×512 while delivering superior reflection removal results.
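The scaling behavior and throughput can be estimated from Table V with a log-log fit. The sketch below uses three rows copied from the table and fits the exponent k in runtime ∝ (side length)^k by ordinary least squares. The function names are illustrative, and the fit rests on only three resolutions per method, so the exponents should be read as rough estimates.

```python
import math

# Seconds per image from Table V, keyed by image side length in pixels.
RUNTIME_S = {
    "CEILNet": {256: 0.032, 512: 0.128, 1024: 0.482},
    "SDN": {256: 0.057, 512: 0.212, 1024: 0.823},
    "Ours": {256: 0.061, 512: 0.235, 1024: 0.892},
}


def scaling_exponent(times: dict) -> float:
    """Least-squares slope of log(runtime) versus log(side length)."""
    points = sorted(times.items())
    xs = [math.log(side) for side, _ in points]
    ys = [math.log(sec) for _, sec in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def frames_per_second(seconds_per_image: float) -> float:
    """Throughput implied by a per-image latency."""
    return 1.0 / seconds_per_image


for method, times in RUNTIME_S.items():
    print(f"{method}: exponent {scaling_exponent(times):.2f}, "
          f"{frames_per_second(times[512]):.1f} fps at 512x512")
```

An exponent close to 2 in side length means that runtime is close to linear in the number of pixels.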

IV-G Real-World Results

Figure 10: Reflective positions in real-world scenes are often ambiguous. Our method recovers visually accurate transmission and reflection images.

To further validate the effectiveness of the proposed method in practical scenarios, we evaluate its performance on real-world images with reflections. Fig. 10 presents a challenging case where the reflections are primarily concentrated within the painting frame, while the painting itself remains relatively reflection-free. This scenario often occurs in real-world settings, such as museums or galleries, where the reflective positions are ambiguous and can be easily confused with the actual content of the artwork.

The proposed method successfully handles this challenging case, accurately separating the reflection component from the transmission layer. The recovered transmission image preserves the intricate details and textures of the painting, while the estimated reflection image captures the reflective elements present in the frame. These results demonstrate the robustness of the proposed approach in dealing with complex real-world reflections.

IV-H Failure Cases and Limitations

Figure 11: Failure case of the proposed method on a scene with complex patterns and reflections. The reconstructed samples achieve high metric scores but misclassify details.

While the proposed method demonstrates strong performance in removing reflections from single images, it has limitations and failure cases. One failure case arises with complex textures, where fine details are misclassified as reflections, as illustrated in Fig. 11. In such scenarios, the proposed method may struggle to accurately separate the reflection component from the transmission layer, leading to over-removal in the recovered image.

The limitations in handling low-contrast reflections particularly impact applications in cultural heritage digitization and museum photography, where subtle reflections from protective glass can affect the digital archiving quality. When reflection patterns closely match the underlying object’s texture or when multiple reflections overlap with varying intensities, our method may struggle to correctly separate the components. Future improvements could explore incorporating physical reflection formation models, multi-scale attention mechanisms for better feature discrimination, or auxiliary depth information to guide separation. Additionally, domain-specific pre-training on particular types of artifacts (e.g., paintings, sculptures) might help the model better handle characteristic reflection patterns in specialized settings.

V Conclusion

This paper presents a self-supervised approach for single image reflection removal that combines cycle-consistency and denoising diffusion probabilistic models. The proposed method introduces a Reflective Removal Network that utilizes DDPMs to model the decomposition process and recover the transmission image, and a Reflective Synthesis Network that re-synthesizes the input using the separated components through a nonlinear attention-based mechanism. The RRN effectively handles complex reflections by leveraging the power of DDPMs, while the RSN ensures accurate reconstruction of the input image. We conduct extensive experiments on both synthetic and real-world datasets, demonstrating that the proposed technique outperforms state-of-the-art methods in terms of quantitative metrics and visual quality. The recovered reflection-free images exhibit high fidelity and preserve important details. Despite existing limitations, such as incomplete removal of low-contrast reflections and reliance on synthetic training data, this work represents a significant advance in single image reflection removal, offering substantial benefits for image processing applications.

Acknowledgments

This work is funded by the National Social Science Fund of China Major Project in Artistic Studies (No. 22ZD18), the China Postdoctoral Science Foundation (No. 2023M741411), the Postdoctoral Fellowship Program of CPSF (No. GZC20240608), and the Jiangsu Funding Program for Excellent Postdoctoral Talent (No. 2024ZB488).

References

  • [1] J. F. Blinn and M. E. Newell, “Texture and reflection in computer generated images,” Communications of the ACM, vol. 19, no. 10, pp. 542–547, 1976.
  • [2] Y. Li and M. S. Brown, “Exploiting reflection change for automatic reflection removal,” in Proceedings of the IEEE international conference on computer vision, 2013, pp. 2432–2439.
  • [3] C. Li, Y. Yang, K. He, S. Lin, and J. E. Hopcroft, “Single image reflection removal through cascaded refinement,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 3565–3574.
  • [4] N. Arvanitopoulos, R. Achanta, and S. Susstrunk, “Single image reflection suppression,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4498–4506.
  • [5] Z. Lu and Y. Chen, “Self-supervised monocular depth estimation on water scenes via specular reflection prior,” Digital Signal Processing, vol. 149, p. 104496, 2024.
  • [6] X. Guo, X. Cao, and Y. Ma, “Robust separation of reflection from multiple images,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 2187–2194.
  • [7] Z. Lu and Y. Chen, “Single image super-resolution based on a modified u-net with mixed gradient loss,” Signal, Image and Video Processing, vol. 16, no. 5, pp. 1143–1151, 2022.
  • [8] S. N. Sinha, J. Kopf, M. Goesele, D. Scharstein, and R. Szeliski, “Image-based rendering for scenes with reflections,” ACM Transactions on Graphics (TOG), vol. 31, no. 4, pp. 1–10, 2012.
  • [9] Y. Shih, D. Krishnan, F. Durand, and W. T. Freeman, “Reflection removal using ghosting cues,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3193–3201.
  • [10] Z. Lu and Y. Chen, “Joint self-supervised depth and optical flow estimation towards dynamic objects,” Neural Processing Letters, vol. 55, no. 8, pp. 10 235–10 249, 2023.
  • [11] L. Wang, Q. Yang, C. Wang, W. Wang, and Z. Su, “Coarse-to-fine mechanisms mitigate diffusion limitations on image restoration,” Computer Vision and Image Understanding, vol. 248, p. 104118, 2024.
  • [12] B. Fu, Y. Jiang, D. Wang, J. Gao, C. Wang, and X. Li, “Uncertainty-aware sparse transformer network for single image deraindrop,” IEEE Transactions on Instrumentation and Measurement, 2024.
  • [13] D. Wang, J. Liu, L. Ma, R. Liu, and X. Fan, “Improving misaligned multi-modality image fusion with one-stage progressive dense registration,” IEEE Transactions on Circuits and Systems for Video Technology, 2024.
  • [14] Z. Lu and Y. Chen, “Pyramid frequency network with spatial attention residual refinement module for monocular depth estimation,” Journal of Electronic Imaging, vol. 31, no. 2, p. 023005, 2022.
  • [15] H.-C. Dan, B. Lu, and M. Li, “Evaluation of asphalt pavement texture using multiview stereo reconstruction based on deep learning,” Construction and Building Materials, vol. 412, p. 134837, 2024.
  • [16] A. Levin and Y. Weiss, “User assisted separation of reflections from a single image using a sparsity prior,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 9, pp. 1647–1654, 2007.
  • [17] Y. Li and M. S. Brown, “Single image layer separation using relative smoothness,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 2752–2759.
  • [18] Q. Fan, J. Yang, G. Hua, B. Chen, and D. Wipf, “A generic deep architecture for single image reflection removal and image smoothing,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3238–3247.
  • [19] X. Zhang, R. Ng, and Q. Chen, “Single image reflection separation with perceptual losses,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4786–4794.
  • [20] K. Wei, J. Yang, Y. Fu, D. Wipf, and H. Huang, “Single image reflection removal exploiting misaligned training data and network enhancements,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8178–8187.
  • [21] H. Farid and E. H. Adelson, “Separating reflections and lighting using independent components analysis,” in Proceedings. 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149), vol. 1. IEEE, 1999, pp. 262–267.
  • [22] R. Szeliski, S. Avidan, and P. Anandan, “Layer extraction from multiple images containing reflections and transparency,” in Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No. PR00662), vol. 1. IEEE, 2000, pp. 246–253.
  • [23] A. Levin, A. Zomet, and Y. Weiss, “Separating reflections from a single image using local features,” in Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1. IEEE, 2004, pp. I–I.
  • [24] N. Kong, Y.-W. Tai, and S. Y. Shin, “High-quality reflection separation using polarized images,” IEEE Transactions on Image Processing, vol. 20, no. 12, pp. 3393–3405, 2011.
  • [25] J. Yang, D. Gong, L. Liu, and Q. Shi, “Seeing deeply and bidirectionally: A deep learning approach for single image reflection removal,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 654–669.
  • [26] R. Wan, B. Shi, L.-Y. Duan, A.-H. Tan, and A. C. Kot, “Crrn: Multi-scale guided concurrent reflection removal network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4777–4785.
  • [27] Q. Wen, Y. Tan, J. Qin, W. Liu, G. Han, and S. He, “Single image reflection removal beyond linearity,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3771–3779.
  • [28] Y. Liu and F. Lu, “Separate in latent space: Unsupervised single image layer separation,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 07, 2020, pp. 11 661–11 668.
  • [29] R. Abiko and M. Ikehara, “Single image reflection removal based on gan with gradient constraint,” IEEE Access, vol. 7, pp. 148 790–148 799, 2019.
  • [30] H. RahmaniKhezri, S. Kim, and M. Hefeeda, “Unsupervised single-image reflection removal,” IEEE Transactions on Multimedia, vol. 25, pp. 4958–4971, 2022.
  • [31] S. Kim, Y. Huo, and S.-E. Yoon, “Single image reflection removal with physically-based training images,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 5164–5173.
  • [32] Y. Y. Schechner, J. Shamir, and N. Kiryati, “Polarization and statistical analysis of scenes containing a semireflector,” JOSA A, vol. 17, no. 2, pp. 276–284, 2000.
  • [33] A. Agrawal, R. Raskar, S. K. Nayar, and Y. Li, “Removing photography artifacts using gradient projection and flash-exposure sampling,” ACM Transactions on Graphics, vol. 24, no. 3, pp. 828–835, 2005.
  • [34] N. Kong, Y.-W. Tai, and J. S. Shin, “A physically-based approach to reflection separation: from physical modeling to constrained optimization,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 2, pp. 209–221, 2013.
  • [35] T. Xue, M. Rubinstein, C. Liu, and W. T. Freeman, “A computational approach for obstruction-free photography,” ACM Transactions on Graphics (TOG), vol. 34, no. 4, pp. 1–11, 2015.
  • [36] T. Li and D. P. Lun, “Single-image reflection removal via a two-stage background recovery process,” IEEE Signal Processing Letters, vol. 26, no. 8, pp. 1237–1241, 2019.
  • [37] Z. Li, J. Yang, Z. Liu, X. Yang, G. Jeon, and W. Wu, “Feedback network for image super-resolution,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 3867–3876.
  • [38] P. Wieschollek, O. Gallo, J. Gu, and J. Kautz, “Separating reflection and transmission images in the wild,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 89–104.
  • [39] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020.
  • [40] R. Wan, B. Shi, L.-Y. Duan, A.-H. Tan, and A. C. Kot, “Benchmarking single-image reflection removal algorithms,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3922–3930.
  • [41] C. Lei and Q. Chen, “Robust reflection removal with reflection-free flash-only cues,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14 811–14 820.
  • [42] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” Advances in neural information processing systems, vol. 32, 2019.
  • [43] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [44] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18. Springer, 2015, pp. 234–241.
  • [45] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [46] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 586–595.
  • [47] Q. Hu and X. Guo, “Single image reflection separation via component synergy,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 13 138–13 147.
  • [48] Y. Chang, C. Jung, J. Sun, and F. Wang, “Siamese dense network for reflection removal with flash and no-flash image pairs,” International Journal of Computer Vision, vol. 128, pp. 1673–1698, 2020.