One Look is Enough: A Novel Seamless Patchwise Refinement for Zero-Shot Monocular Depth Estimation Models on High-Resolution Images


Byeongjun Kwon
KAIST (Korea Advanced Institute of Science and Technology)
kbj2738@kaist.ac.kr
Munchurl Kim (corresponding author)
KAIST (Korea Advanced Institute of Science and Technology)
mkimee@kaist.ac.kr
https://kaist-viclab.github.io/One-Look-is-Enough_site
Abstract

Zero-shot depth estimation (DE) models exhibit strong generalization performance because they are trained on large-scale datasets. However, existing models struggle with high-resolution images because of the discrepancy between the (smaller) training resolution and the (higher) inference resolution. Processing such images at full resolution decreases depth estimation accuracy and consumes a tremendous amount of memory, while downsampling to the training resolution blurs edges in the estimated depth images. Prevailing high-resolution depth estimation methods adopt a patch-based approach, which introduces depth discontinuities when the estimated depth patches are reassembled and results in test-time inefficiency. Additionally, to obtain fine-grained depth details, these methods rely on synthetic datasets because real-world ground truth depth is sparse, which leads to poor generalizability. To tackle these limitations, we propose Patch Refine Once (PRO), an efficient and generalizable tile-based framework. Our PRO consists of two key components: (i) Grouped Patch Consistency Training, which enhances test-time efficiency while mitigating the depth discontinuity problem by jointly processing four overlapping patches and enforcing a consistency loss on their overlapping regions within a single backpropagation step, and (ii) Bias Free Masking, which prevents the DE models from overfitting to dataset-specific biases, enabling better generalization to real-world datasets even after training on synthetic data. Zero-shot evaluations on Booster, ETH3D, Middlebury 2014, and NuScenes demonstrate that our PRO can be well harmonized with existing DE models, keeping their DE capabilities effective for the grid input of high-resolution images with few depth discontinuities at the grid boundaries. Our PRO runs fast at inference time.

Figure 1: Qualitative comparison of patch-based depth estimation models (PatchFusion [23], PatchRefiner [24] and PRO (Ours)) for high-resolution images. For PatchRefiner [24] and PRO (Ours), the residuals are also visualized in the 4th and 6th columns to effectively highlight depth discontinuity issues. All models performed depth estimation on $4\times 4$ patches of the 1st and 3rd input images. Zooming in can help distinguish the grid lines (depth discontinuities) of the $4\times 4$ patches. Our proposed PRO achieves smooth boundary transitions without depth discontinuity artifacts along the grid boundaries.

1 Introduction

Monocular Depth Estimation (MDE) [10, 33, 36, 27, 43, 13, 14] has been widely investigated as the demand for 3D information continues to grow in autonomous driving, robotics, and virtual reality applications. After MiDaS [30] proposed an MDE network trainable with mixed training datasets, numerous zero-shot MDE networks [31, 9, 2, 46, 47, 19, 11] have been studied to predict depth for unseen real-world images. However, due to architectural limitations or the limited image resolutions of training datasets, most zero-shot MDE networks are trained specifically on low-resolution images (e.g., $384\times 384$, $518\times 518$). Consequently, when high-resolution images are fed to these models, the resulting estimated depth images tend to contain disrupted overall structures with low-frequency artifacts, although they yield improved edge details [26, 7].

Patch-based methods [26, 24, 23] have shown promising results by splitting high-resolution images into patches for DE, mitigating memory consumption issues while achieving remarkable performance. However, they suffer from depth discontinuity problems (e.g. boundary artifacts along grid boundaries) which occur when independent DE patches are reassembled to construct a complete (whole) depth map because depth continuity is maintained within each patch but not between patches. Previous methods [24, 23] alleviate this depth discontinuity problem (i) by incorporating a consistency loss during training and (ii) by ensemble averaging at test time which slows down inference speeds, making it impractical for real-world applications.

Figure 2: Performance and efficiency comparison on Middlebury 2014 [34]. The area of each circle represents the inference time. Circles of the same color represent the same model with different patch numbers ($\mathrm{P}=16$ for small-sized yellow and grey circles, $\mathrm{P}=177$ for large-sized yellow and grey circles). Our model (PRO) achieves the best performance in terms of edge errors ($\mathrm{D}^{3}\mathrm{R}$) and depth errors (AbsRel) while maintaining the fastest inference time. Inference time for all models is measured on the RTX 4090.

A common strategy for training high-resolution depth estimation is to train models on real-world high-resolution datasets [17, 20, 39, 42, 34] or leverage high-resolution synthetic datasets [38, 41, 40, 18]. Real-world datasets have the advantage of a smaller domain gap compared to synthetic datasets, but their ground truth (GT) depths are often sparse, particularly around edges, where supervision is crucial. Additionally, acquiring real-world dense depth data is challenging, limiting the DE learning for high-resolution images. On the other hand, synthetic datasets provide dense GT depth with fine details, but models trained on synthetic datasets often struggle with zero-shot inference on real-world data due to the domain gap [21]. Moreover, we observe that transparent objects do not have accurate depth annotations in the UnrealStereo4K dataset [38] that is the only synthetic dataset with 4K depth labeling, as they are annotated with the depth of the background behind the transparent objects.

To address the aforementioned problems, we propose Patch Refine Once (PRO), a novel refinement model that enables DE models to generate accurate DE results via only a single refinement per patch during inference (i.e., "One Look"). Our PRO consists of two strategies: Grouped Patch Consistency Training and Bias Free Masking. Fig. 1 shows a qualitative comparison of patch-based depth estimation models for high-resolution images. As shown, our PRO yields depth refinement results with few depth discontinuities along the grid boundaries, while the two very recent state-of-the-art (SOTA) models exhibit depth discontinuity artifacts. Fig. 2 illustrates a performance and efficiency comparison between SOTA depth refinement models and our PRO. As observed, our PRO provides the best performance in terms of $\mathrm{D}^{3}\mathrm{R}$, AbsRel and inference speed. Our contributions are summarized as follows:

  • We propose a Grouped Patch Consistency Training (GPCT) strategy that can be well harmonized with existing DE models to mitigate the depth discontinuity problem for patch-wise DE approaches on high-resolution images. GPCT does not require a test-time ensemble, allowing a $12\times$ faster inference speed compared to the SOTA patch-based method;

  • We introduce Bias Free Masking (BFM) to selectively apply supervision signals by masking out unreliable regions in images. The unreliable regions are those with inaccurately annotated GT depth, which are often observed in the UnrealStereo4K dataset. Without BFM, supervision on these regions causes depth refinement models to overfit to domain-specific bias; BFM prevents this;

  • Our PRO achieves SOTA performance in zero-shot evaluation on Booster [29], ETH3D [35], Middlebury 2014 [34], and NuScenes [4] across all metrics, and runs efficiently at inference time ($12\times$ faster than PatchRefiner with 177 patches), compared to the very recent depth refinement methods for high-resolution images.

2 Related work

2.1 Zero-shot Monocular Depth Estimation

Early works on MDE models [1, 25, 10, 33, 49] focused on training on individual datasets [12, 37], resulting in high performance on the training dataset but degraded performance on unseen datasets because of the domain gap between datasets. Zero-shot MDE models that perform well on unseen (in-the-wild) images have been widely studied to improve generalization. MegaDepth [22] and DiverseDepth [48] construct large datasets by collecting images from the Internet, improving adaptability to images from diverse scenes and conditions. They encourage efforts to scale up datasets to enhance zero-shot performance.

MiDaS [30] proposes a scale and shift invariant loss, which enables MDE networks to be trained on mixed diverse dataset, making the model robust to unseen datasets. DPT [31], Omnidata [9], and MiDaS v3.1 [2] improve DE performance by applying transformer architectures to replace CNN architectures. DepthAnything [46] introduces semi-supervised learning that utilizes 62M unlabeled images as pseudo labels. A dramatic increase in dataset size has demonstrated robust generalization performance. DepthAnythingv2 [47] points out that real-world depth datasets contain label noise and lack fine details. It is trained with pseudo GT labels obtained from a teacher model trained on a synthetic dataset, achieving high-quality depth maps.

Instead of scaling up datasets, some studies deploy prior knowledge from diffusion models for DE. Marigold [19] fine-tunes a pretrained Stable Diffusion [32] model on a relatively small synthetic dataset, showing competitive results. Geowizard [11] jointly estimates depth and surface normals by utilizing self-attention across different domains to enhance geometric consistency. However, zero-shot MDE models are trained on datasets of low resolutions (e.g., $384\times 384$, $518\times 518$), which results in degraded performance on high-resolution images. When high-resolution images are downsampled to the training resolution for inference, fine details are lost. On the other hand, performing inference with a larger resolution input preserves details but degrades depth accuracy.

2.2 High-Resolution Depth Estimation

High-resolution DE models aim to predict accurate depth while preserving fine details. SMD-Net [38] utilizes an implicit function [5] to predict a mixture density for precise DE at object boundaries. Dai et al. [7] introduce a Poisson fusion-based depth map optimization method by applying guided filtering in a self-supervised learning framework. SDDR [21] proposes a self-distilled depth refinement method that generates pseudo edge labels, addressing local consistency issues and edge deformation noise with Noisy Poisson Fusion. Patch-based high-resolution DE methods select patches and merge patch-wise results to improve depth details. BoostingDepth [26] introduces a patch selection process based on edge density, and employs iterative refinement through multi-resolution depth merging. PatchFusion [23] eliminates the need for patch selection by splitting patches according to a predefined method and utilizes shifted window self-attention to fuse global and local information. PatchRefiner [24] initializes depth estimation with a low-resolution depth map and predicts residuals, and proposes a Detail and Scale Disentangling (DSD) loss for training on both real-world and synthetic datasets. However, patch-based methods inherently suffer from depth discontinuities at patch boundaries, as they emphasize local information over global consistency, or maintain depth continuity only within individual patches. To address this problem, prior works [23, 24] employ a significantly large number of patches for test-time ensemble, which considerably slows down inference. By contrast, our PRO (Patch Refine Once) model achieves fine-grained details efficiently by performing only a single refinement per patch during inference, despite utilizing a tile-based approach.

2.3 Training Data for High-resolution Depth Estimation

High-resolution DE requires datasets with fine-grained depth details and minimal label noise. Real-world datasets [12, 37, 4, 15, 22] are often sparse due to the limitations of LiDAR or contain inaccurate depth boundaries when obtained through Structure-from-Motion (SfM). Therefore, previous studies [7, 26] have trained their networks using a small number of relatively dense real-world datasets [44, 20, 34]. Some studies [7, 21] utilize pseudo labels for self-supervision, but these labels often contain noise. Other works [23, 24] leverage synthetic datasets for more accurate depth, which degrades generalization ability due to the domain gap. To take advantage of dense depth annotations in synthetic datasets without compromising generalization performance, we introduce Bias Free Masking (BFM), which selectively identifies the regions where supervision is applied. It mitigates overfitting to synthetic dataset biases and enhances robustness in zero-shot high-resolution DE.

3 Method

Figure 3: Labeling error examples in the UnrealStereo4K [38] dataset.
Figure 4: Overview of the framework. (a) Patch-wise Refinement Process. First, the pretrained MDE model $\Psi$ estimates a coarse depth $\mathbf{D}_{\mathrm{c}}$ and a fine depth $\mathbf{D}_{\mathrm{f}}^{i}$ from the resized input image $\mathbf{I}$ and the $i$-th patch $\mathbf{P}^{i}$. Based on $\mathbf{D}_{\mathrm{c}}$, the residual prediction network predicts residuals $\mathbf{R}$. This process is applied to every patch. Note that Bias Free Masking (BFM) is only required at training. (b) Grouped Patch Consistency Training. The training image is cropped with overlapping regions, followed by patch-wise refinement. Subsequently, the depth consistency loss $\mathcal{L}_{\mathrm{con}}$ is applied to the gray-shaded regions to enforce consistency between depth results refined separately. (c) Bias Free Masking. We identify the reliable region as $\mathbf{M}_{\mathrm{BFM}}$ by utilizing $\mathbf{D}_{\mathrm{c}}$ and $\mathbf{D}_{\mathrm{gt}}$.

We describe our Patch Refine Once (PRO) model. In Sec. 3.1, we explain the process of PRO. Then, Grouped Patch Consistency Training (GPCT) and Bias Free Masking (BFM), which are the key contributions of our paper, are described in Sec. 3.2 and Sec. 3.3, respectively.

3.1 Overview of Patch Refine Once (PRO)

Fig. 4 illustrates the overall pipeline of the proposed PRO framework. Following [23], given an original input image $\mathbf{I}\in\mathbb{R}^{H\times W\times 3}$, we estimate a coarse depth map $\mathbf{D}_{\mathrm{c}}$ by feeding a resized version of $\mathbf{I}$ into a pretrained zero-shot MDE network $\Psi$. $\mathbf{D}_{\mathrm{c}}$ captures the overall scene structure, which is essential for preserving global consistency.

To complement the local details, we take the following steps. First, we crop the original image $\mathbf{I}$ into $N$ grid-based patches $\{\mathbf{P}^{i}\}_{i=1}^{N}$. Each individual patch $\mathbf{P}^{i}$ is then resized and fed into $\Psi$ to estimate the corresponding fine depth map $\mathbf{D}_{\mathrm{f}}^{i}$. We extract two sets of five-level features from the decoder of $\Psi$: (i) $\mathbf{F}_{\mathrm{c}}=\{\mathbf{f}_{\mathrm{c},j}^{i}\}_{j=1}^{5}$ for the resized image and (ii) $\mathbf{F}_{\mathrm{f}}=\{\mathbf{f}_{\mathrm{f},j}^{i}\}_{j=1}^{5}$ for $\mathbf{P}^{i}$. To align $\mathbf{F}_{\mathrm{c}}$ with $\mathbf{F}_{\mathrm{f}}$, we apply the $\mathsf{ROI}$ operation [16] to extract patch-aligned features from $\mathbf{F}_{\mathrm{c}}$. The patch-level features $\mathbf{F}_{\mathrm{f}}^{i}$ and the corresponding $\mathsf{ROI}$-extracted features are subsequently passed into a fusion module within the residual prediction network $\theta$. The fusion module integrates these features using a wavelet transform [8] and shallow convolutional layers, effectively combining global and local information. Finally, the refined depth map for $\mathbf{P}^{i}$, denoted as $\mathbf{D}_{\mathrm{refine}}^{i}$, is obtained as:

$$\mathbf{D}_{\mathrm{refine}}^{i}=\theta\left(\mathsf{concat}\left(\mathbf{P}^{i},\mathsf{ROI}(\mathbf{D}_{\mathrm{c}}),\mathbf{D}_{\mathrm{f}}^{i}\right)\right), \tag{1}$$

where $\mathsf{ROI}(\mathbf{D}_{\mathrm{c}})=\mathbf{D}_{\mathrm{c}}^{i}$ is the region-of-interest extraction operator that extracts from $\mathbf{D}_{\mathrm{c}}$ the region $\mathbf{D}_{\mathrm{c}}^{i}$ corresponding to $\mathbf{P}^{i}$, and $\mathsf{concat}$ denotes channel-wise concatenation. Unlike previous methods [23, 24] where separate networks are trained to predict $\mathbf{D}_{\mathrm{c}}$ and $\mathbf{D}_{\mathrm{f}}$ individually, we utilize the same pretrained zero-shot MDE network $\Psi$ and train only the residual prediction network, as shown in Fig. 4-(a).
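The data flow of Eq. (1) can be sketched in NumPy as follows. This is only an illustrative sketch under several assumptions that are not in the paper: the coarse depth is assumed to have already been upsampled to the full $H\times W$ resolution so that $\mathsf{ROI}$ reduces to a crop, the patch resizing step and the multi-level feature fusion are omitted, and `grid_boxes`, `refine_patch`, `mde` and `refiner` are hypothetical names standing in for the grid cropping, Eq. (1), $\Psi$ and $\theta$.

```python
import numpy as np

def grid_boxes(h, w, rows, cols):
    """Return (y0, y1, x0, x1) boxes of a rows x cols grid over an h x w image."""
    ys = np.linspace(0, h, rows + 1).astype(int)
    xs = np.linspace(0, w, cols + 1).astype(int)
    return [(ys[r], ys[r + 1], xs[c], xs[c + 1])
            for r in range(rows) for c in range(cols)]

def refine_patch(image, coarse_full, box, mde, refiner):
    """Assemble the input of Eq. (1) for one patch and call the refiner.

    image:       (H, W, 3) input image I
    coarse_full: (H, W) coarse depth D_c, assumed already upsampled to H x W
    mde:         hypothetical callable, (h, w, 3) patch -> (h, w) fine depth D_f^i
    refiner:     hypothetical callable, (h, w, 5) tensor -> (h, w) refined depth
    """
    y0, y1, x0, x1 = box
    patch = image[y0:y1, x0:x1]            # P^i
    d_c_roi = coarse_full[y0:y1, x0:x1]    # ROI(D_c) = D_c^i (a plain crop here)
    d_f = mde(patch)                       # D_f^i
    # Channel-wise concatenation: 3 image channels + coarse depth + fine depth.
    x = np.concatenate([patch, d_c_roi[..., None], d_f[..., None]], axis=-1)
    return refiner(x)                      # D_refine^i
```

At inference time each box returned by `grid_boxes` would be passed through `refine_patch` exactly once, which is the "One Look" per patch described in the text.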

3.2 Grouped Patch Consistency Training (GPCT)

To address the boundary artifacts introduced by patch-wise refinement, we propose a simple yet effective GPCT strategy that ensures depth consistency across patch boundaries. While previous approaches [23, 24] focus only on two diagonally adjacent overlapping patches (e.g., (A, D) or (B, C) in Fig. 4) to enforce consistency constraints, our method employs four overlapping patches (A, B, C, and D in Fig. 4) simultaneously. As shown in Fig. 4-(b), we divide the training sample into overlapping patches, so that each patch overlaps with its neighbors. After refining each patch independently, we apply a depth consistency loss $\mathcal{L}_{\mathrm{con}}$ to enforce consistency between the depth refinement results of overlapping patches:

$$\mathcal{L}_{\mathrm{con}}=\sum_{i\neq j}\frac{1}{|\Omega|}\sum_{p\in\Omega}\left(\mathbf{D}_{\mathrm{refine}}^{i}(p)-\mathbf{D}_{\mathrm{refine}}^{j}(p)\right)^{2}, \tag{2}$$

where $\mathbf{D}_{\mathrm{refine}}^{i}$ and $\mathbf{D}_{\mathrm{refine}}^{j}$ denote the depth predictions from overlapping patches $i$ and $j$, respectively. We denote the overlapping region between patches $i$ and $j$ as $\Omega$, with $|\Omega|$ indicating the number of pixels within this region. The merged depth map $\mathbf{D}_{\mathrm{merged}}$ at position $p$ is computed as:

$$\mathbf{D}_{\mathrm{merged}}(p)=\frac{1}{N_{o}(p)}\sum_{i\in\{A,B,C,D\}}\mathbf{D}_{\mathrm{refine}}^{i}(p), \tag{3}$$

where $\mathbf{D}_{\mathrm{refine}}^{i}$ is the predicted depth from patch $i\in\{A,B,C,D\}$ shown in Fig. 4-(b), and $N_{o}(p)$ is the number of overlapping patches contributing depth estimates at location $p$. Then, we calculate the loss between the merged depth $\mathbf{D}_{\mathrm{merged}}$ and the GT depth $\mathbf{D}_{\mathrm{gt}}$. Since our method computes the loss using all four patches simultaneously, every backward propagation step enforces consistency at all boundaries. In contrast, the prior method [23] only applied a consistency loss on the overlapped region between two patches (e.g., (A, D) or (B, C)) at a time, leading to insufficient boundary supervision. Our approach provides a stronger supervision signal, yielding significantly improved cross-patch consistency and smooth boundary transitions during inference. By training with this GPCT strategy, our PRO model mitigates boundary artifacts without requiring additional refinement during inference, allowing efficient patch-wise depth estimation, as only a single refinement step per patch (i.e., "One Look") is required.
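Eqs. (2) and (3) can be illustrated with a small NumPy sketch. It assumes that each refined patch is stored together with its (y0, y1, x0, x1) box in the coordinates of the training crop, that the sum over $i\neq j$ runs over unordered pairs (ordered pairs would only double the value), and that patches with an empty overlap contribute nothing. The function names are hypothetical, and a real training loop would use a differentiable framework rather than NumPy.

```python
import itertools
import numpy as np

def overlap(box_a, box_b):
    """Intersection of two (y0, y1, x0, x1) boxes, or None if they do not overlap."""
    y0, y1 = max(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    x0, x1 = max(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    return (y0, y1, x0, x1) if y0 < y1 and x0 < x1 else None

def crop(depth, box, region):
    """Crop `region` (canvas coordinates) out of a patch located at `box`."""
    y0, y1, x0, x1 = region
    return depth[y0 - box[0]:y1 - box[0], x0 - box[2]:x1 - box[2]]

def consistency_loss(patches):
    """Eq. (2): mean squared difference on every pairwise overlap region."""
    loss = 0.0
    for (d_i, b_i), (d_j, b_j) in itertools.combinations(patches, 2):
        region = overlap(b_i, b_j)
        if region is None:
            continue
        diff = crop(d_i, b_i, region) - crop(d_j, b_j, region)
        loss += float(np.mean(diff ** 2))
    return loss

def merge_patches(patches, h, w):
    """Eq. (3): average the refined depths of all patches covering each pixel."""
    total = np.zeros((h, w))
    count = np.zeros((h, w))            # N_o(p)
    for d, (y0, y1, x0, x1) in patches:
        total[y0:y1, x0:x1] += d
        count[y0:y1, x0:x1] += 1
    return total / np.maximum(count, 1)  # uncovered pixels stay at zero
```

With four overlapping patches A, B, C and D, `consistency_loss` touches all six pairwise overlaps in one call, which mirrors the point made above that every backward step sees all boundaries at once.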

3.3 Bias Free Masking (BFM)

To utilize dense GT depth when training depth refinement models for fine-grained details, we use a synthetic dataset such as UnrealStereo4K [38]. However, the UnrealStereo4K dataset, which provides 4K resolution, annotates the depths of transparent objects as the depths of their backgrounds (e.g., window objects in Fig. 3), which becomes a bias specific to the dataset. To maintain the benefits of dense depth labeling while avoiding overfitting to the dataset-specific biases, we propose Bias Free Masking (BFM), which uses the prior knowledge of pretrained zero-shot MDE models. That is, the strategy of BFM is to exclude, during training, the unreliable regions corresponding to the incorrect GT depths of transparent objects in the synthetic dataset (UnrealStereo4K). Since the pretrained zero-shot MDE model $\Psi$ can be considered to have informative prior knowledge learned from large-scale image-depth pairs, it can be used to identify the unreliable regions where $\mathbf{D}_{\mathrm{c}}$ and $\mathbf{D}_{\mathrm{gt}}$ significantly deviate from each other, indicating potential biases of the synthetic dataset.

| Methods | Publications | Runtime (s) | Booster AbsRel↓ | Booster $\delta_{1}$↑ | ETH3D AbsRel↓ | ETH3D $\delta_{1}$↑ | Middle14 AbsRel↓ | Middle14 $\delta_{1}$↑ | Middle14 $\mathrm{D}^{3}\mathrm{R}$↓ | NuScenes AbsRel↓ | NuScenes $\delta_{1}$↑ | DIS BR↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DepthAnythingV2 | NIPS 2024 | - | 0.0307 | 0.993 | 0.0465 | 0.983 | 0.0307 | 0.994 | 0.0979 | 0.106 | 0.882 | 0.059 |
| BoostingDepth | CVPR 2021 | 12.7 | 0.0330 | 0.993 | 0.0552 | 0.974 | 0.0330 | 0.995 | 0.1035 | 0.115 | 0.870 | 0.170 |
| PatchFusion P=16 | CVPR 2024 | 3.4 | 0.0504 | 0.985 | 0.0735 | 0.956 | 0.0450 | 0.989 | 0.1124 | 0.141 | 0.831 | 0.206 |
| PatchFusion P=177 | CVPR 2024 | 36.8 | 0.0496 | 0.986 | 0.0723 | 0.957 | 0.0448 | 0.989 | 0.1050 | 0.139 | 0.833 | 0.189 |
| PatchRefiner P=16 | ECCV 2024 | 1.5 | 0.0348 | 0.989 | 0.0435 | 0.985 | 0.0292 | 0.995 | 0.0830 | 0.107 | 0.879 | 0.151 |
| PatchRefiner P=177 | ECCV 2024 | 16.2 | 0.0336 | 0.991 | 0.0430 | 0.985 | 0.0292 | 0.995 | 0.0805 | 0.106 | 0.881 | 0.141 |
| PRO (Ours) | - | 1.4 | 0.0304 | 0.994 | 0.0422 | 0.985 | 0.0287 | 0.996 | 0.0803 | 0.104 | 0.883 | 0.156 |

Table 1: Quantitative comparison of depth estimation methods on Booster [29], ETH3D [35], Middlebury 2014 [34], NuScenes [4], and DIS-5K [28] datasets. Bold indicates the best performance in each metric.
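For readers unfamiliar with the AbsRel and $\delta_{1}$ columns, the sketch below computes them using their commonly used definitions (mean absolute relative error, and the fraction of pixels whose prediction-to-GT ratio is within a factor of 1.25). This section does not spell out the definitions, so the threshold of 1.25, the validity rule `gt > 0`, and the assumption that predictions have already been aligned to the GT scale are all assumptions of this example; $\mathrm{D}^{3}\mathrm{R}$ and BR are not covered.

```python
import numpy as np

def abs_rel(pred, gt, valid=None):
    """Mean absolute relative error |pred - gt| / gt over valid pixels (lower is better)."""
    valid = gt > 0 if valid is None else valid
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

def delta1(pred, gt, valid=None, thresh=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) < thresh (higher is better)."""
    valid = gt > 0 if valid is None else valid
    ratio = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return float(np.mean(ratio < thresh))
```

For example, with GT depths (1, 2, 4) and predictions (1.1, 2, 6), the relative errors are 0.1, 0 and 0.5, and only the last pixel falls outside the 1.25 ratio.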

Given a resized version of the input image $\mathbf{I}$, we obtain the coarse depth $\mathbf{D}_{\mathrm{c}}$ from $\Psi$ and the GT depth $\mathbf{D}_{\mathrm{gt}}$ from the synthetic dataset. Then, by measuring the relative consistency between $\mathbf{D}_{\mathrm{c}}$ and $\mathbf{D}_{\mathrm{gt}}$, we identify unreliable regions as:

$$\mathbf{M}_{\mathrm{unreliable}}=\left[\max\left(\frac{N(\mathbf{D}_{\mathrm{c}})}{N(\mathbf{D}_{\mathrm{gt}})},\frac{N(\mathbf{D}_{\mathrm{gt}})}{N(\mathbf{D}_{\mathrm{c}})}\right)>\tau\right], \tag{4}$$

where $\left[\cdot\right]$ represents the Iverson bracket, $N(\mathbf{D})$ denotes min-max normalization of $\mathbf{D}$, and $\tau$ is an empirically chosen threshold (set to 2 in our experiments). The unreliable mask $\mathbf{M}_{\mathrm{unreliable}}$ identifies the regions where $\mathbf{D}_{\mathrm{c}}$ and $\mathbf{D}_{\mathrm{gt}}$ significantly deviate, indicating potential synthetic dataset biases or inconsistencies, as mentioned above. Since $\mathbf{D}_{\mathrm{c}}$ lacks sharp edges, naively complementing $\mathbf{M}_{\mathrm{unreliable}}$ to obtain the reliable regions excludes critical edge regions from training, thus removing valuable depth gradients required for refinement. To address this, we incorporate edge information into the reliable mask generation. However, simply utilizing the edge map from $\mathbf{D}_{\mathrm{gt}}$ as an edge mask could introduce noise into a training sample, because a flat transparent region could contain edges that belong to the background behind it, as observed in Fig. 4-(c). To handle this problem, we additionally employ the edge map from $\mathbf{D}_{\mathrm{c}}$. Consequently, we first extract edges from both $\mathbf{D}_{\mathrm{c}}$ and $\mathbf{D}_{\mathrm{gt}}$, and dilate them as:

$$\mathbf{E}_{\mathrm{c}}=\mathrm{dilate}(\mathrm{edge}(\mathbf{D}_{\mathrm{c}})),\quad\mathbf{E}_{\mathrm{gt}}=\mathrm{dilate}(\mathrm{edge}(\mathbf{D}_{\mathrm{gt}})), \tag{5}$$

to guarantee that the final edge mask does not miss the regions that require depth refinement, which might otherwise be excluded by simply complementing $\mathbf{M}_{\mathrm{unreliable}}$. We define the edge mask as $\mathbf{M}_{\mathrm{edge}} = \mathbf{E}_{\mathrm{c}} \cap \mathbf{E}_{\mathrm{gt}}$. Finally, the BFM, which denotes the reliable mask, is constructed as $\mathbf{M}_{\mathrm{BFM}} = \mathbf{M}_{\mathrm{edge}} \cup \sim\mathbf{M}_{\mathrm{unreliable}}$, where $\sim\mathbf{M}_{\mathrm{unreliable}}$ is the complement of $\mathbf{M}_{\mathrm{unreliable}}$. This ensures that the unreliable regions caused by large discrepancies between $\mathbf{D}_{\mathrm{c}}$ and $\mathbf{D}_{\mathrm{gt}}$ are ignored, while essential edge details are preserved for training. Consequently, the supervision is applied only to the reliable regions via a masked loss:

$$\mathcal{L}_{\mathrm{masked}} = \frac{1}{|\mathbf{M}_{\mathrm{BFM}}|} \sum_{p \in \mathbf{M}_{\mathrm{BFM}}} \mathcal{L}(\mathbf{D}_{\mathrm{merged}}(p), \mathbf{D}_{\mathrm{gt}}(p)) \tag{6}$$

where $\mathcal{L}$ is the loss between the merged depth and the GT depth, which combines an L1 loss, an L2 loss, and a multi-scale gradient loss with four scale levels [22] at a ratio of 1:1:5. By restricting training to reliable regions, the depth refinement model refines depth only where necessary, avoiding modifications in regions where the pretrained model’s prior knowledge is considered more reliable than the (noisy) GT. This helps prevent the model from refining transparent regions incorrectly.
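As an illustration of the mask construction described above, the following minimal Python sketch builds $\mathbf{M}_{\mathrm{BFM}} = (\mathbf{E}_{\mathrm{c}} \cap \mathbf{E}_{\mathrm{gt}}) \cup \sim\mathbf{M}_{\mathrm{unreliable}}$ on small nested lists. It is a sketch under stated assumptions, not the authors' implementation: the edge detector (a thresholded forward difference), the square dilation kernel, and the names `edge_map`, `dilate`, `bias_free_mask`, `edge_thresh`, and `radius` are all hypothetical, and the unreliable mask is taken as a given input.

```python
def edge_map(depth, thresh):
    """Binary edge map: 1 where a forward difference exceeds `thresh`.
    The paper does not specify the edge operator; this one is an assumption."""
    h, w = len(depth), len(depth[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx = abs(depth[y][min(x + 1, w - 1)] - depth[y][x])
            dy = abs(depth[min(y + 1, h - 1)][x] - depth[y][x])
            if max(dx, dy) > thresh:
                edges[y][x] = 1
    return edges


def dilate(mask, radius):
    """Binary dilation with a (2 * radius + 1) square kernel (assumed shape)."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[yy][xx] = 1
    return out


def bias_free_mask(d_c, d_gt, m_unreliable, edge_thresh=0.5, radius=1):
    """M_BFM = (E_c AND E_gt) OR (NOT M_unreliable)."""
    e_c = dilate(edge_map(d_c, edge_thresh), radius)
    e_gt = dilate(edge_map(d_gt, edge_thresh), radius)
    h, w = len(d_gt), len(d_gt[0])
    return [[1 if (e_c[y][x] and e_gt[y][x]) or not m_unreliable[y][x] else 0
             for x in range(w)] for y in range(h)]


# Toy data: the GT has a second edge (columns 4 to 5) that the coarse depth lacks,
# mimicking a background edge seen through a transparent surface.
d_gt = [[0.0, 0.0, 0.0, 1.0, 1.0, 3.0] for _ in range(3)]
d_c = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(3)]
all_unreliable = [[1] * 6 for _ in range(3)]
partly_unreliable = [[0, 0, 0, 0, 1, 1] for _ in range(3)]

mask_edges_only = bias_free_mask(d_c, d_gt, all_unreliable)
mask_mixed = bias_free_mask(d_c, d_gt, partly_unreliable)
```

In this toy case the edge shared by both depth maps survives even where everything is flagged unreliable, while the GT-only edge is dropped because the intersection requires support from the coarse depth as well.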

Final Loss function. The final training loss consists of the masked loss $\mathcal{L}_{\mathrm{masked}}$ and the depth consistency loss $\mathcal{L}_{\mathrm{con}}$ as:

$$\mathcal{L}_{\mathrm{final}} = \mathcal{L}_{\mathrm{masked}} + \lambda \mathcal{L}_{\mathrm{con}}, \tag{7}$$

where $\lambda$ is empirically set to 4 in our experiments.
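To make Eqs. (6) and (7) concrete, the sketch below averages a per-pixel loss over the pixels selected by the mask and then adds the weighted consistency term. It is only an illustration: the per-pixel loss `l1_plus_l2` is a stand-in that omits the multi-scale gradient term, the consistency loss value is passed in as a plain number, and all function names are hypothetical.

```python
def masked_loss(pred, gt, mask, pixel_loss):
    """Eq. (6): mean of the per-pixel loss over pixels where mask == 1."""
    total, count = 0.0, 0
    for p_row, g_row, m_row in zip(pred, gt, mask):
        for p, g, m in zip(p_row, g_row, m_row):
            if m:
                total += pixel_loss(p, g)
                count += 1
    return total / count if count else 0.0


def l1_plus_l2(p, g):
    """Stand-in per-pixel loss: L1 + L2 with 1:1 weights (gradient term omitted)."""
    d = p - g
    return abs(d) + d * d


def final_loss(l_masked, l_con, lam=4.0):
    """Eq. (7): masked loss plus lambda times the consistency loss."""
    return l_masked + lam * l_con


pred = [[1.0, 2.0], [3.0, 9.0]]
gt = [[1.0, 1.0], [3.0, 0.0]]
mask = [[1, 1], [1, 0]]  # the bottom-right pixel is treated as unreliable

l_masked = masked_loss(pred, gt, mask, l1_plus_l2)  # (0 + 2 + 0) / 3
l_final = final_loss(l_masked, l_con=0.5)           # 2/3 + 4 * 0.5
```

The large error at the masked-out pixel does not contribute to the loss, which is the intended effect of restricting supervision to reliable regions.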

4 Experiments

4.1 Datasets and Metrics

For a fair comparison, we train recent SOTA depth refinement models and our PRO model on a synthetic dataset to take advantage of its dense GT depth.

UnrealStereo4K. The UnrealStereo4K dataset [38] is a synthetic dataset comprising stereo images at 4K resolution (2,160 $\times$ 3,840) with pixel-wise GT disparity labels. As aforementioned, some of its GT labels are inaccurate, especially for transparent objects, as shown in Fig. 3.

Booster. The Booster dataset [29] has high-resolution (3,008 $\times$ 4,112) indoor images with specular and transparent surfaces. We use the training set with GT depth for testing. To ensure diversity, we removed images captured from the same scenes, resulting in a test set of 38 images.

ETH3D. The ETH3D dataset [35] contains high-resolution indoor and outdoor images (6,048 $\times$ 4,032) with GT depth captured by LiDAR sensors. In this dataset, 34 images are rotated 90 degrees counterclockwise, so that their bottom sides appear on the right. Before testing, we rotate these images 90 degrees clockwise so that the floor is correctly positioned at the bottom.
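The orientation fix described for ETH3D amounts to a 90-degree clockwise rotation of a row-major grid. A minimal sketch is given below; the helper names are hypothetical, and applying the same rotation to the GT depth map so that it stays aligned with the image is our assumption rather than a detail stated above.

```python
def rotate_cw(img):
    """Rotate a row-major 2D grid 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def rotate_ccw(img):
    """Rotate a row-major 2D grid 90 degrees counterclockwise."""
    return [list(row) for row in zip(*img)][::-1]


upright = [[1, 2],
           [3, 4],
           [5, 6]]            # the bottom row (5, 6) plays the role of the floor
stored = rotate_ccw(upright)  # as stored: the floor ends up in the right column
restored = rotate_cw(stored)  # clockwise rotation puts the floor at the bottom
```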

Middlebury 2014. The Middlebury 2014 dataset [34] consists of 23 high-resolution (nearly 4K) indoor images with dense ground truth disparity maps.

NuScenes. The NuScenes dataset [4] is an autonomous driving dataset that provides multi-directional camera images along with depth information obtained from LiDAR and radar sensors. Among outdoor datasets, the Cityscapes dataset [6] provides higher resolution but suffers from blurred edges since its disparity maps are generated using a stereo matching algorithm instead of LiDAR sensors. Therefore, we use the NuScenes dataset for testing.

DIS-5K. The DIS-5K dataset [28] is a high-resolution image segmentation dataset that provides accurately annotated masks, making it suitable for evaluating edge accuracy.

Evaluation Metrics. We use the commonly adopted DE metrics, Absolute Relative Error (AbsRel) and $\delta_1$. Additionally, to evaluate edge quality, we adopt the $\mathrm{D}^3\mathrm{R}$ metric on the Middlebury 2014 dataset following [26], and measure the Boundary Recall (BR) metric on the DIS-5K dataset following [3]. We use the Consistency Error (CE) [23] to evaluate the consistency of the DE results across patches.
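For reference, the two depth metrics can be sketched as follows, using their standard definitions: AbsRel is the mean of $|d - d^{*}| / d^{*}$ over valid pixels, and $\delta_1$ is the fraction of valid pixels with $\max(d / d^{*}, d^{*} / d) < 1.25$. The sketch operates on flat lists, treats pixels with non-positive GT as invalid, and omits any scale or shift alignment of the prediction; these choices and the function names are assumptions, not the evaluation code used here.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error over pixels with valid (positive) GT."""
    vals = [abs(p - g) / g for p, g in zip(pred, gt) if g > 0]
    return sum(vals) / len(vals)


def delta1(pred, gt, thresh=1.25):
    """Fraction of valid pixels whose ratio max(p/g, g/p) is below `thresh`."""
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    hits = sum(1 for p, g in valid if p > 0 and max(p / g, g / p) < thresh)
    return hits / len(valid)


pred = [1.0, 2.2, 3.0, 8.0]
gt = [1.0, 2.0, 4.0, 0.0]  # the last pixel has no valid GT and is skipped

score_abs_rel = abs_rel(pred, gt)  # (0 + 0.1 + 0.25) / 3
score_delta1 = delta1(pred, gt)    # 2 of 3 valid pixels pass
```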

Figure 5: Qualitative comparisons of patch-wise DE methods on ETH3D [35] and DIS-5K [28]. We compare PRO (Ours) with BoostingDepth [26], PatchFusion (PF) [23], and PatchRefiner (PR) [24]. The time displayed in the leftmost depth column represents the inference time. Black rectangles mark transparent object regions. Black arrows indicate patch boundaries. Zoom in for details.

| Metric | PatchFusion | PatchRefiner | PRO (Ours) |
|---|---|---|---|
| CE ↓ | 0.364 | 0.347 | 0.049 |

Table 2: Comparison of PatchFusion [23], PatchRefiner [24], and PRO (Ours) in terms of Consistency Error (CE).

| Model | GPCT | BFM | Booster AbsRel ↓ | Booster $\delta_1$ ↑ | ETH3D AbsRel ↓ | ETH3D $\delta_1$ ↑ | Middle14 AbsRel ↓ | Middle14 $\delta_1$ ↑ | Middle14 $\mathrm{D}^3\mathrm{R}$ ↓ | NuScenes AbsRel ↓ | NuScenes $\delta_1$ ↑ | CE ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (a) | | | 0.0385 | 0.987 | 0.0428 | 0.985 | 0.0292 | 0.995 | 0.0807 | 0.105 | 0.882 | 0.208 |
| (b) | | ✓ | 0.0303 | 0.994 | 0.0426 | 0.985 | 0.0290 | 0.995 | 0.0837 | 0.105 | 0.882 | 0.117 |
| (c) | ✓ | | 0.0313 | 0.993 | 0.0425 | 0.985 | 0.0288 | 0.996 | 0.0792 | 0.105 | 0.882 | 0.058 |
| (d) | ✓ | ✓ | 0.0304 | 0.994 | 0.0422 | 0.985 | 0.0287 | 0.996 | 0.0803 | 0.104 | 0.883 | 0.049 |

Table 3: Ablation on GPCT and BFM of our PRO for Booster [29], ETH3D [35], Middlebury 2014 [34], and NuScenes [4] datasets.

4.2 Implementation Details

We adopt the pretrained DepthAnythingV2 (DA2) [47] as the baseline model ($\Psi$ in Fig. 4). PRO, PatchFusion [23], and PatchRefiner [24] are (re)trained based on DA2, while BoostingDepth [26] uses DA2 without additional training. The input resolution to $\Psi$ is fixed at $518 \times 518$. When we use the GPCT strategy, adjacent patches are cropped with an overlap of 224 pixels, giving an overlapping region of $224 \times h$ for horizontally adjacent patches and $224 \times w$ for vertically adjacent patches, where $h$ and $w$ denote the height and width of the patch, respectively. In our BFM, the dilation kernel size is empirically set to (10, 20). Training is conducted on 7,592 UnrealStereo4K samples for 8 epochs with a batch size of 64 (with gradient accumulation), taking 10 hours on a single RTX 4090 GPU.
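The overlap geometry used by GPCT can be sketched as follows: four patches of size $h \times w$ are laid out in a 2 $\times$ 2 arrangement with a stride of $h - 224$ vertically and $w - 224$ horizontally, so that neighbouring patches share a 224-pixel band. The patch size of 540 $\times$ 960 in the example is our assumption (one cell of a 4 $\times$ 4 grid over a 2,160 $\times$ 3,840 image), and the helper names are hypothetical.

```python
def group_boxes(y0, x0, h, w, overlap=224):
    """Return four (top, left, bottom, right) boxes in a 2x2 overlapping layout."""
    sy, sx = h - overlap, w - overlap  # vertical and horizontal strides
    return [(y0 + i * sy, x0 + j * sx, y0 + i * sy + h, x0 + j * sx + w)
            for i in range(2) for j in range(2)]


def intersection(a, b):
    """Overlapping region of two boxes, or None if they do not overlap."""
    top, left = max(a[0], b[0]), max(a[1], b[1])
    bottom, right = min(a[2], b[2]), min(a[3], b[3])
    if bottom <= top or right <= left:
        return None
    return (top, left, bottom, right)


boxes = group_boxes(0, 0, h=540, w=960)
horizontal = intersection(boxes[0], boxes[1])  # shared band of left/right patches
vertical = intersection(boxes[0], boxes[2])    # shared band of top/bottom patches
```

A consistency loss would then compare the refined depths of the two patches inside each such region; that step is not shown here.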

4.3 Performance Comparison

At test time, each input image is divided into a $4 \times 4$ grid of patches. The depth of each patch is refined individually, and the refined patches are then reassembled to generate the final depth map.
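The test-time procedure above can be sketched as a split, a per-patch call, and a reassembly. The sketch assumes that the image height and width are divisible by 4 and that patches do not overlap at test time; `refine_patch` is a hypothetical placeholder for the refinement network, and the remaining names are illustrative as well.

```python
def split_grid(img, rows=4, cols=4):
    """Split a 2D grid into rows x cols non-overlapping patches."""
    h, w = len(img), len(img[0])
    ph, pw = h // rows, w // cols
    return [[[r[j * pw:(j + 1) * pw] for r in img[i * ph:(i + 1) * ph]]
             for j in range(cols)] for i in range(rows)]


def merge_grid(patches):
    """Inverse of split_grid: stitch the patches back into one 2D grid."""
    out = []
    for patch_row in patches:
        for y in range(len(patch_row[0])):
            line = []
            for patch in patch_row:
                line.extend(patch[y])
            out.append(line)
    return out


def refine_image(img, refine_patch):
    """Refine every patch exactly once and reassemble the result."""
    patches = split_grid(img)
    refined = [[refine_patch(p) for p in row] for row in patches]
    return merge_grid(refined)


image = [[y * 8 + x for x in range(8)] for y in range(8)]
unchanged = refine_image(image, lambda p: p)
shifted = refine_image(image, lambda p: [[v + 1 for v in r] for r in p])
```

Each patch is visited once, giving 16 calls to the refinement function per image, which corresponds to the P=16 setting.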

Zero-shot performance on high-resolution datasets. We evaluate four patch-wise DE models: three depth refinement models, namely our PRO, BoostingDepth [26], and PatchRefiner [24], and one direct DE model, PatchFusion [23]. For PatchFusion and PatchRefiner, our evaluations are done under two settings: (i) the ‘$\mathrm{P}$=16’ setting, which uses the same number of patches as our method, and (ii) the ‘$\mathrm{P}$=177’ setting, where additional patches are used for test-time ensembling. Note that $\mathrm{P}$ denotes the number of patches used to assemble the final depth map during inference. As shown in Table 1, our PRO achieves SOTA performance across all depth metrics while maintaining the lowest inference time. Notably, our BFM prevents overfitting to synthetic dataset biases, leading to a 9.5% improvement in AbsRel on the Booster dataset, which contains transparent objects. In terms of edge quality, our method performs worse than BoostingDepth and PatchFusion in Boundary Recall (BR). However, as shown by the depth metrics, these two methods prioritize edge sharpness over overall depth accuracy. In contrast, compared to PatchRefiner, which exhibits similar depth performance, our PRO achieves superior results in the edge metric (BR). In terms of consistency, our PRO achieves the best result, with an 85.9% improvement, as shown in Table 2. This demonstrates the effectiveness of our GPCT strategy in maintaining consistency between individually processed patches.
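The 85.9% figure can be reproduced from Table 2 as a relative reduction of a lower-is-better metric. The short sketch below does the arithmetic; the function name is hypothetical, and reading the reported number as being relative to PatchRefiner (the second-best CE) is our interpretation of the table.

```python
def relative_improvement(baseline, ours):
    """Percentage reduction of a lower-is-better metric relative to `baseline`."""
    return 100.0 * (baseline - ours) / baseline


ce_patchfusion, ce_patchrefiner, ce_pro = 0.364, 0.347, 0.049  # Table 2

gain_vs_patchrefiner = round(relative_improvement(ce_patchrefiner, ce_pro), 1)
gain_vs_patchfusion = round(relative_improvement(ce_patchfusion, ce_pro), 1)
```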

Qualitative Comparisons. Fig. 5 shows qualitative comparisons of the four patch-wise DE models. PatchFusion and PatchRefiner, which are trained on the synthetic dataset without masking, exhibit artifacts in the window regions. Additionally, these two patch-based methods [23, 24] produce noticeable boundary artifacts when test-time ensembling is not applied. In contrast, despite refining each patch only once, our PRO demonstrates minimal depth discontinuity while achieving the fastest inference time.

4.4 Ablation Studies

We analyze the effectiveness of the core components of our PRO through ablation studies: Grouped Patch Consistency Training (GPCT) and Bias Free Masking (BFM).

Ablation on GPCT and BFM. Table 3 presents the ablation results on GPCT and BFM. When BFM is applied to the baseline (a), the improvements are relatively minor for most datasets. However, for the Booster dataset [29], which contains transparent objects, we observe a significant improvement of 21.3% in the AbsRel metric, demonstrating that our BFM method effectively identifies unreliable regions. Even when only BFM is applied without GPCT, a reduction in consistency error (CE) suggests that supervision limited to reliable regions helps mitigate unnecessary refinement, thereby reducing confusion in the depth refinement process. Furthermore, applying GPCT to the baseline yields a 50.4% improvement in consistency error, demonstrating its effectiveness in maintaining consistency between independently processed patches. Although model (c) achieves the best performance in the $\mathrm{D}^3\mathrm{R}$ metric, it introduces artifacts in transparent regions. Consequently, model (d), which incorporates both GPCT and BFM, achieves the best overall performance across most metrics.

| Overlap | CE ↓ | ETH3D AbsRel ↓ | Middle14 AbsRel ↓ |
|---|---|---|---|
| 28 | 0.108 | 0.0423 | 0.0288 |
| 56 | 0.113 | 0.0424 | 0.0289 |
| 112 | 0.065 | 0.0425 | 0.0288 |
| 224 | 0.049 | 0.0422 | 0.0287 |
| 448 | 0.060 | 0.0423 | 0.0287 |

Table 4: Ablations on overlap sizes in Consistency Error (CE) for ETH3D [35] and Middlebury 2014 [34] datasets.

Ablation on Overlap Sizes in GPCT. Table 4 shows the effect of different overlap sizes in GPCT. Each overlap value represents the shorter side of the overlapping region between adjacent patches. As shown, increasing the overlap generally reduces the CE. However, when the overlap reaches 448 pixels, the CE begins to increase. We attribute this to the fact that highly overlapped patches capture nearly identical scene content, resulting in fewer inconsistencies between patches. As a result, the depth consistency loss provides less meaningful supervision, which reduces its effectiveness.

5 Conclusion

In this paper, we propose the Patch Refine Once (PRO) model for depth refinement on high-resolution images. To address the depth discontinuity problem, we introduce the Grouped Patch Consistency Training (GPCT) strategy, ensuring consistency between independently processed patches. Also, we propose Bias Free Masking (BFM) to prevent the depth refinement model from overfitting to dataset-specific biases. Through these strategies, our PRO achieves superior zero-shot performance while maintaining a low inference time (12$\times$ faster than PatchRefiner with 177 patches), outperforming recent state-of-the-art methods across diverse datasets.

APPENDIX

Figure 6: Architecture of the Residual Prediction Network and Frequency Fusion Module (FFM). (a) Residual Prediction Network. The Residual Prediction Network comprises an encoder, a decoder, and the Fusion Module. (b) Frequency Fusion Module (FFM). We utilize Discrete Wavelet Transform (DWT) to decompose the input features into four frequency components. Then, each frequency component is processed independently using convolutional layers.

| Models | FLOPs (MACs) | Parameters | DIS BR ↑ | UHRSD BR ↑ | Middle14 AbsRel ↓ | Middle14 $\delta_1$ ↑ | Middle14 $\mathrm{D}^3\mathrm{R}$ ↓ | ETH3D AbsRel ↓ | ETH3D $\delta_1$ ↑ | Booster AbsRel ↓ | Booster $\delta_1$ ↑ | NuScenes AbsRel ↓ | NuScenes $\delta_1$ ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Conv | 31664G | 63M | 0.127 | 0.071 | 0.0290 | 0.995 | 0.0842 | 0.0428 | 0.984 | 0.0306 | 0.993 | 0.106 | 0.882 |
| FFM (Ours) | 20250G | 51M | **0.156** | **0.083** | **0.0287** | **0.996** | **0.0803** | **0.0422** | **0.985** | **0.0304** | **0.994** | **0.104** | **0.883** |

Table 5: Ablation study of the Frequency Fusion Module (FFM) on DIS-5K [28], UHRSD [45], Middlebury 2014 [34], ETH3D [35], Booster [29], and NuScenes [4]. Bold indicates the best performance in each metric.

Appendix A Architecture of the Fusion Module

The encoder takes $\mathbf{P}^{i}$, $\mathbf{D}_{\mathrm{f}}^{i}$, and $\mathsf{ROI}(\mathbf{D}_{\mathrm{c}})$ as inputs and produces a set of five intermediate features, denoted as $\mathbf{F}_{\mathrm{enc}} = \{\mathbf{f}_{\mathrm{enc},j}^{i}\}_{j=1}^{5}$, following [23]. Then, the fused feature map $\mathbf{F}_{\mathrm{fuse}}$ is obtained through the frequency fusion module ($\mathsf{FFM}$), defined as $\mathbf{F}_{\mathrm{fuse}} = \mathsf{FFM}(\mathsf{concat}(\mathbf{F}_{\mathrm{f}}, \mathsf{ROI}(\mathbf{F}_{\mathrm{c}})))$.

Subsequently, $\mathsf{concat}(\mathbf{F}_{\mathrm{fuse}}, \mathsf{ROI}(\mathbf{F}_{\mathrm{c}}), \mathbf{F}_{\mathrm{enc}})$ is processed through two consecutive layers, each consisting of a $3 \times 3$ convolution, batch normalization, and ReLU activation. Finally, the resulting feature is fed into the DPT decoder [31] to obtain the residual map $\mathbf{R}$.

Architecture of the Frequency Fusion Module (FFM). To obtain accurate depth values from the coarse depth and preserve fine details from the fine depth, we design a Frequency Fusion Module (FFM) that effectively extracts and integrates edge information. We utilize the Discrete Wavelet Transform (DWT) to decompose the input features into four frequency components: LL, LH, HL, and HH, which represent the low-frequency and high-frequency information. Each component is processed with its own dedicated convolution to capture scale-specific features. Finally, the components are recombined using the Inverse Discrete Wavelet Transform (IDWT), resulting in fused features that retain both global depth consistency and enhanced edge details. The overall process is illustrated in Fig. 6-(b). In more detail, we first decompose $\mathsf{ROI}(\mathbf{F}_{\mathrm{c}})$ and $\mathbf{F}_{\mathrm{f}}$ into four frequency sub-bands $\mathbf{X}$ ($\forall \mathbf{X} \in \{\mathbf{LL}, \mathbf{LH}, \mathbf{HL}, \mathbf{HH}\}$) using $\mathsf{DWT}$. Each sub-band is then fused using a corresponding convolution $\mathrm{Conv}_{\mathbf{X}}$. Finally, the fused feature map $\mathbf{F}_{\mathrm{fuse}}$ is obtained through $\mathsf{IDWT}$:

$$\mathbf{X}_{\mathrm{c}}, \mathbf{X}_{\mathrm{f}} = \mathsf{DWT}(\mathsf{ROI}(\mathbf{F}_{\mathrm{c}})), \mathsf{DWT}(\mathbf{F}_{\mathrm{f}}) \tag{8}$$

$$\mathbf{X}_{\mathrm{fuse}} = \mathrm{Conv}_{\mathbf{X}}(\mathsf{concat}(\mathbf{X}_{\mathrm{c}}, \mathbf{X}_{\mathrm{f}})) \tag{9}$$

$$\mathbf{F}_{\mathrm{fuse}} = \mathsf{IDWT}(\mathbf{LL}_{\mathrm{fuse}}, \mathbf{LH}_{\mathrm{fuse}}, \mathbf{HL}_{\mathrm{fuse}}, \mathbf{HH}_{\mathrm{fuse}}) \tag{10}$$
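The decompose, fuse, and recombine steps of Eqs. (8)-(10) can be illustrated with a single-level 2D Haar transform on one channel. This is a sketch under several assumptions: the wavelet family is not stated above, so Haar is our choice; the learned $\mathrm{Conv}_{\mathbf{X}}$ is replaced by a fixed per-band weighted average purely for illustration; the LH/HL naming follows one of several conventions in use; and all function names are hypothetical.

```python
BANDS = ("LL", "LH", "HL", "HH")


def haar_dwt2(x):
    """Single-level 2D Haar DWT of an even-sized grid; returns the four sub-bands."""
    h2, w2 = len(x) // 2, len(x[0]) // 2
    bands = {k: [[0.0] * w2 for _ in range(h2)] for k in BANDS}
    for i in range(h2):
        for j in range(w2):
            a, b = x[2 * i][2 * j], x[2 * i][2 * j + 1]
            c, d = x[2 * i + 1][2 * j], x[2 * i + 1][2 * j + 1]
            bands["LL"][i][j] = (a + b + c + d) / 2
            bands["LH"][i][j] = (a - b + c - d) / 2  # horizontal difference
            bands["HL"][i][j] = (a + b - c - d) / 2  # vertical difference
            bands["HH"][i][j] = (a - b - c + d) / 2  # diagonal difference
    return bands


def haar_idwt2(bands):
    """Inverse of haar_dwt2 (perfect reconstruction)."""
    ll, lh, hl, hh = (bands[k] for k in BANDS)
    h2, w2 = len(ll), len(ll[0])
    x = [[0.0] * (2 * w2) for _ in range(2 * h2)]
    for i in range(h2):
        for j in range(w2):
            s, t, u, v = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            x[2 * i][2 * j] = (s + t + u + v) / 2
            x[2 * i][2 * j + 1] = (s - t + u - v) / 2
            x[2 * i + 1][2 * j] = (s + t - u - v) / 2
            x[2 * i + 1][2 * j + 1] = (s - t - u + v) / 2
    return x


def fuse(f_c, f_f, weights):
    """Eqs. (8)-(10), with a per-band weighted average standing in for Conv_X."""
    x_c, x_f = haar_dwt2(f_c), haar_dwt2(f_f)
    fused = {k: [[weights[k] * c + (1 - weights[k]) * f for c, f in zip(rc, rf)]
                 for rc, rf in zip(x_c[k], x_f[k])] for k in BANDS}
    return haar_idwt2(fused)


coarse = [[2.0, 2.0], [2.0, 2.0]]  # smooth, accurate level
fine = [[0.0, 1.0], [1.0, 0.0]]    # detailed, but with a different mean
# Take the low-frequency band from the coarse input and the details from the fine one.
fused = fuse(coarse, fine, {"LL": 1.0, "LH": 0.0, "HL": 0.0, "HH": 0.0})
```

In this toy case the output keeps the mean level of the coarse input while inheriting the high-frequency pattern of the fine input, which mirrors the stated motivation of the module. A learned convolution per sub-band can of course mix the two inputs in far richer ways than a fixed weight.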

A.1 Ablation study of FFM

To validate the effectiveness of the Frequency Fusion Module (FFM), we conduct an ablation study by replacing the FFM with a simple convolutional block consisting of Conv-ReLU-Conv layers. To ensure that any performance gain is not simply due to an increase in the number of parameters or FLOPs, we design the simple convolutional block to have more parameters and FLOPs than the FFM. This allows us to attribute the performance improvement to the design of the FFM itself, rather than to computational complexity. As shown in Table 5, our method not only achieves the best performance in standard depth metrics, but also yields significant improvements in edge accuracy. Specifically, it achieves a 22.5% improvement on the DIS-5K dataset and a 16.9% improvement on the UHRSD dataset in the Boundary Recall (BR) metric. In addition, we observe a 4.6% improvement in the edge quality metric ($\mathrm{D}^3\mathrm{R}$). This demonstrates that the proposed FFM effectively integrates edge information through the use of the Discrete Wavelet Transform (DWT), which enables selective enhancement of high-frequency details without sacrificing global structure. This highlights the benefit of frequency-domain processing in depth refinement tasks.

Appendix B Qualitative Results

Ablation Study

In the ablation study (Sec. 4.4), we analyze the effect of Grouped Patch Consistency Training (GPCT) and Bias Free Masking (BFM) quantitatively. In this section, we analyze their effect with qualitative results. As shown in Fig. 7, the model trained without GPCT shows pronounced depth discontinuities at the grid boundaries. In contrast, our PRO model trained with GPCT alleviates the depth discontinuity problem. Likewise, as shown in Fig. 8, the model trained without BFM exhibits artifacts on transparent surfaces, such as glass windows, as well as on reflective surfaces like TV screens. In contrast, our model trained with BFM effectively refines only the edge regions while preventing artifacts on transparent objects.

Additional Qualitative Results

We provide additional qualitative comparisons of BoostingDepth [26], PatchFusion [23], PatchRefiner [24], and PRO (Ours) on the UHRSD [45] dataset and on internet images (e.g., from Unsplash (https://unsplash.com) and Pexels (https://www.pexels.com)), as shown in Fig. 9 and Fig. 10.

Figure 7: Qualitative comparisons of GPCT’s impact on ETH3D [35], Booster [29], and DIS-5K [28]. We compare PRO (Ours (d)) with the model trained without Grouped Patch Consistency Training (GPCT) (b). Labels (b) and (d) refer to the model indices in Table 3. We also visualize the residuals to highlight the presence of artifacts more effectively. Zoom in for details.

Figure 8: Qualitative comparisons of BFM’s impact on ETH3D [35], Booster [29], and DIS-5K [28]. We compare PRO (Ours (d)) with the model trained without Bias Free Masking (BFM) (c). Labels (c) and (d) refer to the model indices in Table 3. We also visualize the residuals to highlight the presence of artifacts more effectively. Zoom in for details.

Figure 9: Qualitative comparisons of patch-wise DE methods on UHRSD [45] and images from the internet. We compare PRO (Ours) with BoostingDepth [26], PatchFusion (PF) [23], and PatchRefiner (PR) [24]. The time displayed in the leftmost depth column represents the inference time. Black rectangles mark transparent object regions. Black arrows indicate patch boundaries. Zoom in for details.

Figure 10: Qualitative comparisons of patch-wise DE methods on images from the internet. We compare PRO (Ours) with BoostingDepth [26], PatchFusion (PF) [23], and PatchRefiner (PR) [24]. The time displayed in the top depth row represents the inference time. Black rectangles mark transparent object regions. Black arrows indicate patch boundaries. Zoom in for details.

References

  • Bhat et al. [2021] Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Adabins: Depth estimation using adaptive bins. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4009–4018, 2021.
  • Birkl et al. [2023] Reiner Birkl, Diana Wofk, and Matthias Müller. Midas v3.1 - a model zoo for robust monocular relative depth estimation. arXiv preprint arXiv:2307.14460, 2023.
  • Bochkovskii et al. [2024] Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. arXiv preprint arXiv:2410.02073, 2024.
  • Caesar et al. [2020] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020.
  • Chen et al. [2021] Yinbo Chen, Sifei Liu, and Xiaolong Wang. Learning continuous image representation with local implicit image function. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8628–8638, 2021.
  • Cordts et al. [2016] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • Dai et al. [2023] Yaqiao Dai, Renjiao Yi, Chenyang Zhu, Hongjun He, and Kai Xu. Multi-resolution monocular depth map fusion by self-supervised gradient-based composition. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 488–496, 2023.
  • Daubechies [1990] Ingrid Daubechies. The wavelet transform, time-frequency localization and signal analysis. IEEE transactions on information theory, 36(5):961–1005, 1990.
  • Eftekhar et al. [2021] Ainaz Eftekhar, Alexander Sax, Jitendra Malik, and Amir Zamir. Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10786–10796, 2021.
  • Eigen et al. [2014] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. Advances in neural information processing systems, 27, 2014.
  • Fu et al. [2024] Xiao Fu, Wei Yin, Mu Hu, Kaixuan Wang, Yuexin Ma, Ping Tan, Shaojie Shen, Dahua Lin, and Xiaoxiao Long. Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image. In European Conference on Computer Vision, pages 241–258. Springer, 2024.
  • Geiger et al. [2013] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The international journal of robotics research, 32(11):1231–1237, 2013.
  • Godard et al. [2017] Clément Godard, Oisin Mac Aodha, and Gabriel J Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 270–279, 2017.
  • Godard et al. [2019] Clément Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J Brostow. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3828–3838, 2019.
  • Guizilini et al. [2020] Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, and Adrien Gaidon. 3d packing for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2485–2494, 2020.
  • He et al. [2017] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
  • Hua et al. [2020] Yiwen Hua, Puneet Kohli, Pritish Uplavikar, Anand Ravi, Saravana Gunaseelan, Jason Orozco, and Edward Li. Holopix50k: A large-scale in-the-wild stereo image dataset. arXiv preprint arXiv:2003.11172, 2020.
  • Huang et al. [2018] Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. Deepmvs: Learning multi-view stereopsis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2821–2830, 2018.
  • Ke et al. [2024] Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9492–9502, 2024.
  • Koch et al. [2018] Tobias Koch, Lukas Liebel, Friedrich Fraundorfer, and Marco Korner. Evaluation of cnn-based single-image depth estimation methods. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pages 0–0, 2018.
  • Li et al. [2025] Jiaqi Li, Yiran Wang, Jinghong Zheng, Zihao Huang, Ke Xian, Zhiguo Cao, and Jianming Zhang. Self-distilled depth refinement with noisy poisson fusion. Advances in Neural Information Processing Systems, 37:69999–70025, 2025.
  • Li and Snavely [2018] Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2041–2050, 2018.
  • Li et al. [2024a] Zhenyu Li, Shariq Farooq Bhat, and Peter Wonka. Patchfusion: An end-to-end tile-based framework for high-resolution monocular metric depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10016–10025, 2024a.
  • Li et al. [2024b] Zhenyu Li, Shariq Farooq Bhat, and Peter Wonka. Patchrefiner: Leveraging synthetic data for real-domain high-resolution monocular metric depth estimation. In European Conference on Computer Vision, pages 250–267. Springer, 2024b.
  • Li et al. [2024c] Zhenyu Li, Xuyang Wang, Xianming Liu, and Junjun Jiang. Binsformer: Revisiting adaptive bins for monocular depth estimation. IEEE Transactions on Image Processing, 2024c.
  • Miangoleh et al. [2021] S Mahdi H Miangoleh, Sebastian Dille, Long Mai, Sylvain Paris, and Yagiz Aksoy. Boosting monocular depth estimation models to high-resolution via content-adaptive multi-resolution merging. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9685–9694, 2021.
  • Moon et al. [2024] Jaeho Moon, Juan Luis Gonzalez Bello, Byeongjun Kwon, and Munchurl Kim. From-ground-to-objects: Coarse-to-fine self-supervised monocular depth estimation of dynamic objects with ground contact prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10519–10529, 2024.
  • Qin et al. [2022] Xuebin Qin, Hang Dai, Xiaobin Hu, Deng-Ping Fan, Ling Shao, and Luc Van Gool. Highly accurate dichotomous image segmentation. In European Conference on Computer Vision, pages 38–56. Springer, 2022.
  • Ramirez et al. [2023] Pierluigi Zama Ramirez, Alex Costanzino, Fabio Tosi, Matteo Poggi, Samuele Salti, Stefano Mattoccia, and Luigi Di Stefano. Booster: a benchmark for depth from images of specular and transparent surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(1):85–102, 2023.
  • Ranftl et al. [2020] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1623–1637, 2020.
  • Ranftl et al. [2021] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12179–12188, 2021.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  • Saxena et al. [2008] Ashutosh Saxena, Min Sun, and Andrew Y Ng. Make3D: Learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):824–840, 2008.
  • Scharstein et al. [2014] Daniel Scharstein, Heiko Hirschmüller, York Kitajima, Greg Krathwohl, Nera Nešić, Xi Wang, and Porter Westling. High-resolution stereo datasets with subpixel-accurate ground truth. In Pattern Recognition: 36th German Conference, GCPR 2014, Münster, Germany, September 2-5, 2014, Proceedings 36, pages 31–42. Springer, 2014.
  • Schops et al. [2017] Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3260–3269, 2017.
  • Shao et al. [2023] Shuwei Shao, Zhongcai Pei, Weihai Chen, Xingming Wu, and Zhengguo Li. Nddepth: Normal-distance assisted monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7931–7940, 2023.
  • Silberman et al. [2012] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V 12, pages 746–760. Springer, 2012.
  • Tosi et al. [2021] Fabio Tosi, Yiyi Liao, Carolin Schmitt, and Andreas Geiger. SMD-Nets: Stereo mixture density networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8942–8952, 2021.
  • Wang et al. [2019] Chaoyang Wang, Simon Lucey, Federico Perazzi, and Oliver Wang. Web stereo video supervision for depth prediction from dynamic scenes. In 2019 International Conference on 3D Vision (3DV), pages 348–357. IEEE, 2019.
  • Wang et al. [2021] Qiang Wang, Shizhen Zheng, Qingsong Yan, Fei Deng, Kaiyong Zhao, and Xiaowen Chu. IRS: A large naturalistic indoor robotics stereo dataset to train deep models for disparity and surface normal estimation. In 2021 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2021.
  • Wang et al. [2020] Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. TartanAir: A dataset to push the limits of visual SLAM. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4909–4916. IEEE, 2020.
  • Wang et al. [2023] Yiran Wang, Min Shi, Jiaqi Li, Zihao Huang, Zhiguo Cao, Jianming Zhang, Ke Xian, and Guosheng Lin. Neural video depth stabilizer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9466–9476, 2023.
  • Watson et al. [2021] Jamie Watson, Oisin Mac Aodha, Victor Prisacariu, Gabriel Brostow, and Michael Firman. The temporal opportunist: Self-supervised multi-frame monocular depth. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1164–1174, 2021.
  • Xian et al. [2018] Ke Xian, Chunhua Shen, Zhiguo Cao, Hao Lu, Yang Xiao, Ruibo Li, and Zhenbo Luo. Monocular relative depth perception with web stereo data supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 311–320, 2018.
  • Xie et al. [2022] Chenxi Xie, Changqun Xia, Mingcan Ma, Zhirui Zhao, Xiaowu Chen, and Jia Li. Pyramid grafting network for one-stage high resolution saliency detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11717–11726, 2022.
  • Yang et al. [2024] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10371–10381, 2024.
  • Yang et al. [2025] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. Advances in Neural Information Processing Systems, 37:21875–21911, 2025.
  • Yin et al. [2020] Wei Yin, Xinlong Wang, Chunhua Shen, Yifan Liu, Zhi Tian, Songcen Xu, Changming Sun, and Dou Renyin. Diversedepth: Affine-invariant depth prediction using diverse data. arXiv preprint arXiv:2002.00569, 2020.
  • Yuan et al. [2022] Weihao Yuan, Xiaodong Gu, Zuozhuo Dai, Siyu Zhu, and Ping Tan. Neural window fully-connected CRFs for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3916–3925, 2022.