One Look is Enough: A Novel Seamless Patchwise Refinement for Zero-Shot Monocular Depth Estimation Models on High-Resolution Images
Abstract
Zero-shot depth estimation (DE) models exhibit strong generalization performance as they are trained on large-scale datasets. However, existing models struggle with high-resolution images due to the discrepancy between the image resolutions used for training (smaller) and inference (higher). Processing such images at full resolution decreases depth estimation accuracy and consumes a tremendous amount of memory, while downsampling to the training resolution results in blurred edges in the estimated depth images. Prevailing high-resolution depth estimation methods adopt a patch-based approach, which introduces depth discontinuities when reassembling the estimated depth patches and results in test-time inefficiency. Additionally, to obtain fine-grained depth details, these methods rely on synthetic datasets because real-world ground truth depth is sparse, leading to poor generalizability. To tackle these limitations, we propose Patch Refine Once (PRO), an efficient and generalizable tile-based framework. Our PRO consists of two key components: (i) Grouped Patch Consistency Training that enhances test-time efficiency while mitigating the depth discontinuity problem by jointly processing four overlapping patches and enforcing a consistency loss on their overlapping regions within a single backpropagation step, and (ii) Bias Free Masking that prevents the DE models from overfitting to dataset-specific biases, enabling better generalization to real-world datasets even after training on synthetic data. Zero-shot evaluation on Booster, ETH3D, Middlebury 2014, and NuScenes demonstrates that our PRO can be well harmonized with existing zero-shot DE models, keeping their DE capabilities effective for the grid input of high-resolution images with few depth discontinuities at the grid boundaries. Our PRO runs fast at inference time.
![[Uncaptioned image]](/html/2503.22351v1/x1.png)
1 Introduction
Monocular Depth Estimation (MDE) [10, 33, 36, 27, 43, 13, 14] has been widely investigated as the demand for 3D information continues to grow in autonomous driving, robotics, and virtual reality applications. After MiDaS [30] proposed an MDE network trainable with mixed training datasets, numerous zero-shot MDE networks [31, 9, 2, 46, 47, 19, 11] have been studied to predict depth for unseen real-world images. However, due to architectural limitations or the limited image resolutions of training datasets, most zero-shot MDE networks are trained specifically on low-resolution images (e.g., 384×384, 518×518). Consequently, when high-resolution images are fed to these models, the resulting estimated depth images tend to contain disrupted overall structures with low-frequency artifacts, although they yield improved edge details [26, 7].
Patch-based methods [26, 24, 23] have shown promising results by splitting high-resolution images into patches for DE, mitigating memory consumption issues while achieving remarkable performance. However, they suffer from depth discontinuity problems (e.g. boundary artifacts along grid boundaries) which occur when independent DE patches are reassembled to construct a complete (whole) depth map because depth continuity is maintained within each patch but not between patches. Previous methods [24, 23] alleviate this depth discontinuity problem (i) by incorporating a consistency loss during training and (ii) by ensemble averaging at test time which slows down inference speeds, making it impractical for real-world applications.

A common strategy for training high-resolution depth estimation is to train models on real-world high-resolution datasets [17, 20, 39, 42, 34] or leverage high-resolution synthetic datasets [38, 41, 40, 18]. Real-world datasets have the advantage of a smaller domain gap compared to synthetic datasets, but their ground truth (GT) depths are often sparse, particularly around edges, where supervision is crucial. Additionally, acquiring real-world dense depth data is challenging, limiting the DE learning for high-resolution images. On the other hand, synthetic datasets provide dense GT depth with fine details, but models trained on synthetic datasets often struggle with zero-shot inference on real-world data due to the domain gap [21]. Moreover, we observe that transparent objects do not have accurate depth annotations in the UnrealStereo4K dataset [38] that is the only synthetic dataset with 4K depth labeling, as they are annotated with the depth of the background behind the transparent objects.
To address the aforementioned problems, we propose Patch Refine Once (PRO), a novel refinement model that enables DE models to generate accurate DE results via only a single refinement per patch during inference (i.e., “One Look”). Our PRO consists of two strategies: Grouped Patch Consistency Training and Bias Free Masking. Fig. 1 shows a qualitative comparison of patch-based depth estimation models for high-resolution images. As shown, our PRO yields depth refinement results with few depth discontinuities along the grid boundaries, while the two most recent state-of-the-art (SOTA) models exhibit depth discontinuity artifacts. Fig. 2 illustrates a performance and efficiency comparison of SOTA depth refinement models and our PRO. As observed, our PRO provides the best performance in terms of $\delta_1$, AbsRel, and inference speed. Our contributions are summarized as follows:
- We propose a Grouped Patch Consistency Training (GPCT) strategy that can be well harmonized with existing DE models to mitigate the depth discontinuity problem of patch-wise DE approaches on high-resolution images. GPCT does not require test-time ensembling, allowing a 12× faster inference speed compared to the SOTA patch-based method;
- We introduce Bias Free Masking (BFM) to selectively apply supervision signals by masking out unreliable regions in images. The unreliable regions are those with inaccurately annotated GT depth, which is often observed in the UnrealStereo4K dataset. Without BFM, such labels cause depth refinement models to overfit to domain-specific bias, which our BFM efficiently prevents;
- Our PRO achieves SOTA performance in zero-shot evaluation on Booster [29], ETH3D [35], Middlebury 2014 [34], and NuScenes [4] across all depth metrics, and runs efficiently at inference time (12× faster than PatchRefiner with 177 patches) compared to very recent depth refinement methods for high-resolution images.
2 Related work
2.1 Zero-shot Monocular Depth Estimation
Early works on MDE models [1, 25, 10, 33, 49] focused on training on each dataset [12, 37], resulting in high performance on the trained dataset with degenerate performance on unseen datasets because of domain gap between datasets. Zero-shot MDE models that perform well on unseen (in-the-wild) images have been widely studied to improve generalization. MegaDepth [22] and DiverseDepth [48] construct large-sized datasets by collecting images from the Internet, improving adaptability on images from diverse scenes and conditions. They encourage efforts to scale up datasets to enhance zero-shot performance.
MiDaS [30] proposes a scale and shift invariant loss, which enables MDE networks to be trained on mixed diverse dataset, making the model robust to unseen datasets. DPT [31], Omnidata [9], and MiDaS v3.1 [2] improve DE performance by applying transformer architectures to replace CNN architectures. DepthAnything [46] introduces semi-supervised learning that utilizes 62M unlabeled images as pseudo labels. A dramatic increase in dataset size has demonstrated robust generalization performance. DepthAnythingv2 [47] points out that real-world depth datasets contain label noise and lack fine details. It is trained with pseudo GT labels obtained from a teacher model trained on a synthetic dataset, achieving high-quality depth maps.
Instead of scaling up datasets, some studies deploy prior knowledge from Diffusion Models for DE. Marigold [19] fine-tunes a pretrained Stable Diffusion [32] model on a relatively small synthetic dataset, showing competitive results. Geowizard [11] jointly estimates depth and surface normals by utilizing self-attention across different domains to enhance geometric consistency. However, zero-shot MDE models are trained on datasets of low resolutions (e.g., 384×384, 518×518), which results in degenerate performance on high-resolution images. When high-resolution images are downsampled to the training resolution for inference, fine details are lost. On the other hand, performing inference with a larger resolution input preserves details but degrades depth accuracy.
2.2 High-Resolution Depth Estimation
High-resolution DE models aim to predict accurate depth while preserving fine details. SMD-Net [38] utilizes an implicit function [5] to predict mixture density for precise DE at object boundaries. Dai et al. [7] introduce a Poisson fusion-based depth map optimization method by applying guided filtering in a self-supervised learning framework. SDDR [21] proposes a self-distilled depth refinement method that generates pseudo edge labels, addressing local consistency issues and edge deformation noise with Noisy Poisson Fusion. Patch-based high-resolution DE methods select patches and merge patch-wise results to improve depth details. BoostingDepth [26] introduces a patch selection process based on edge density, and employs iterative refinement through multi-resolution depth merging. PatchFusion [23] eliminates the need for patch selection by splitting patches according to a predefined method and utilizes shifted window self-attention to fuse global and local information. PatchRefiner [24] initializes depth estimation with a low-resolution depth map and predicts residuals, and proposes a detail and scale disentangling (DSD) loss for training on both real-world and synthetic datasets. However, patch-based methods inherently suffer from depth discontinuities at patch boundaries, as they emphasize local information over global consistency, or maintain depth continuity only within individual patches. To address this problem, prior works [23, 24] employ a large number of patches for test-time ensembling, which considerably slows down inference. By contrast, our PRO (Patch Refine Once) model achieves fine-grained details efficiently by performing only a single refinement per patch during inference, despite utilizing a tile-based approach.
2.3 Training Data for High-resolution Depth Estimation
High-resolution DE requires datasets with fine-grained depth details and minimal label noise. Real-world datasets [12, 37, 4, 15, 22] are often sparse due to the limitations of LiDAR or contain inaccurate depth boundaries when obtained through Structure-from-Motion (SfM). Therefore, previous studies [7, 26] have trained their networks using a small number of relatively dense real-world datasets [44, 20, 34]. Some studies [7, 21] utilize pseudo labels for self-supervision, but these labels often contain noise. Other works [23, 24] leverage synthetic datasets for more accurate depth, which degrades generalization due to the domain gap. To take advantage of dense depth annotations in synthetic datasets without compromising generalization performance, we introduce Bias Free Masking (BFM), which selectively identifies the regions where supervision is applied. It mitigates overfitting to synthetic dataset biases and enhances robustness in zero-shot high-resolution DE.
3 Method


We describe our Patch Refine Once (PRO) model. In Sec. 3.1, we explain the process of PRO. Then, Grouped Patch Consistency Training (GPCT) and Bias Free Masking (BFM), which are the key contributions of our paper, are described in Sec. 3.2 and Sec. 3.3, respectively.
3.1 Overview of Patch Refine Once (PRO)
Fig. 4 illustrates the overall pipeline of the proposed PRO framework. Following [23], given an original input image $I$, we estimate a coarse depth map $D_c$ by feeding a resized version of $I$ into a pretrained zero-shot MDE network $\mathcal{N}_d$. $D_c$ captures the overall scene structure, which is essential for preserving global consistency.
To complement the local details, we take the following steps. First, we crop the original image into grid-based patches. Each individual patch $P$ is then resized and fed into $\mathcal{N}_d$ to estimate the corresponding fine depth map $D_f$. We extract two sets of five-level features from the decoder of $\mathcal{N}_d$: (i) $F_c$ for the resized image and (ii) $F_f$ for $P$. To align $F_c$ with $F_f$, we apply the $\mathrm{roi}$ operation [16] to extract patch-aligned features from $F_c$. The patch-level features $F_f$ and the corresponding $\mathrm{roi}$-extracted features $\mathrm{roi}(F_c)$ are subsequently passed into a fusion module within the residual prediction network $\mathcal{N}_r$. The fusion module integrates these features using a wavelet transform [8] and shallow convolutional layers, effectively combining global and local information. Finally, the refined depth map for $P$, denoted as $D_r$, is obtained as:
$$D_r = \mathrm{roi}(D_c) + \mathcal{N}_r\big(P \oplus \mathrm{roi}(D_c) \oplus D_f,\ \mathrm{roi}(F_c),\ F_f\big) \tag{1}$$
where $\mathrm{roi}(\cdot)$ is a region of interest extraction operator that crops the region of $D_c$ (or $F_c$) corresponding to $P$, and $\oplus$ denotes channel-wise concatenation. Unlike previous methods [23, 24] where separate networks are trained to predict $D_c$ and $D_f$ individually, we utilize the same pretrained zero-shot MDE network $\mathcal{N}_d$ for both and train only the residual prediction network $\mathcal{N}_r$, as shown in Fig. 4-(a).
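The per-patch refinement described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the coarse depth is assumed to be already upsampled to the full image resolution, `residual_fn` is a hypothetical stand-in for the residual prediction network, and the 4×4 grid is an illustrative default.

```python
import numpy as np

def grid_boxes(height, width, rows, cols):
    """Return (top, left, bottom, right) boxes of a rows x cols grid."""
    ys = np.linspace(0, height, rows + 1).astype(int)
    xs = np.linspace(0, width, cols + 1).astype(int)
    return [(ys[r], xs[c], ys[r + 1], xs[c + 1])
            for r in range(rows) for c in range(cols)]

def refine_image(coarse_depth, residual_fn, rows=4, cols=4):
    """Refine each grid patch once: cropped coarse depth plus a predicted residual.

    residual_fn is a hypothetical placeholder for the residual network; it
    receives the cropped coarse depth and returns a residual of the same shape.
    """
    refined = np.empty_like(coarse_depth)
    for top, left, bottom, right in grid_boxes(*coarse_depth.shape, rows, cols):
        coarse_roi = coarse_depth[top:bottom, left:right]  # region-of-interest crop
        refined[top:bottom, left:right] = coarse_roi + residual_fn(coarse_roi)
    return refined
```

Because each patch is visited exactly once and written back to its own grid cell, the cost grows linearly with the number of patches and no test-time ensembling is involved.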
3.2 Grouped Patch Consistency Training (GPCT)
To address the boundary artifacts introduced by patch-wise refinement, we propose a simple yet effective GPCT strategy that ensures depth consistency across patch boundaries. While previous approaches [23, 24] focus only on two diagonally adjacent overlapping patches (e.g., (A, D) or (B, C) in Fig. 4) to enforce consistency constraints, our method employs four overlapping patches (A, B, C, and D in Fig. 4) simultaneously. As shown in Fig. 4-(b), we divide the training sample into $2\times2$ overlapping patches, so that each patch overlaps with its neighbors. After refining each patch independently, we apply a depth consistency loss to enforce consistency between the depth refinement results of overlapping patches:
$$\mathcal{L}_{dc} = \sum_{(i,j)} \frac{1}{|\Omega_{ij}|} \sum_{x \in \Omega_{ij}} \left| D_r^{i}(x) - D_r^{j}(x) \right| \tag{2}$$
where $D_r^{i}$ and $D_r^{j}$ denote the depth predictions from overlapping patches $i$ and $j$, respectively. We denote the overlapping region between patches $i$ and $j$ as $\Omega_{ij}$, with $|\Omega_{ij}|$ indicating the number of pixels within this region. The merged depth map at position $x$ is computed as:
$$D_m(x) = \frac{1}{n(x)} \sum_{k} D_r^{k}(x) \tag{3}$$
where $D_r^{k}(x)$ is the predicted depth from patch $k \in \{A, B, C, D\}$ shown in Fig. 4-(b), the sum runs over the patches that cover $x$, and $n(x)$ is the number of overlapping patches contributing depth estimates at location $x$. Then, we calculate the loss between the merged depth $D_m$ and the GT depth $D_{gt}$. Since our method computes the loss using all four patches simultaneously, every backward propagation step enforces consistency at all boundaries. In contrast, the prior method [23] only applied a consistency loss on the overlapped region between two patches (e.g., (A, D) or (B, C)) at a time, leading to insufficient boundary supervision. Our approach provides a stronger supervision signal, yielding significantly improved cross-patch consistency and smooth boundary transitions during inference. By training with this GPCT strategy, our PRO model mitigates boundary artifacts without requiring additional refinement during inference, allowing efficient patch-wise depth estimation, as only a single refinement step per patch (i.e., “One Look”) is required.
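To make the grouped training signal concrete, the following NumPy sketch averages the refined patches where they overlap and accumulates a consistency term over every overlapping pair. It is a sketch under assumptions: the absolute difference is used as the per-pixel discrepancy (the norm is our choice for illustration), and the function and argument names are hypothetical.

```python
import numpy as np
from itertools import combinations

def merge_and_consistency(patch_depths, boxes, canvas_shape):
    """Merge overlapping patch predictions and measure their disagreement.

    patch_depths[k] is the refined depth of the patch located at
    boxes[k] = (top, left, bottom, right) on a canvas of canvas_shape.
    """
    depth_sum = np.zeros(canvas_shape)
    count = np.zeros(canvas_shape)
    canvases = []
    for depth, (t, l, b, r) in zip(patch_depths, boxes):
        canvas = np.full(canvas_shape, np.nan)  # NaN marks "not covered"
        canvas[t:b, l:r] = depth
        canvases.append(canvas)
        depth_sum[t:b, l:r] += depth
        count[t:b, l:r] += 1
    # Merged depth: average over the patches that cover each pixel.
    merged = depth_sum / np.maximum(count, 1)

    # Consistency: mean absolute difference on each pairwise overlap (assumed norm).
    loss = 0.0
    for first, second in combinations(canvases, 2):
        overlap = ~np.isnan(first) & ~np.isnan(second)
        if overlap.any():
            loss += np.abs(first[overlap] - second[overlap]).mean()
    return merged, loss
```

With four patches in a 2×2 layout, all six pairs (including the two diagonal ones) overlap, so a single call touches every boundary of the group at once.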
3.3 Bias Free Masking (BFM)
To utilize dense GT depth when training depth refinement models for fine-grained details, we use a synthetic dataset such as UnrealStereo4K [38]. However, the UnrealStereo4K dataset, which provides 4K resolution, annotates the depths of transparent objects as the depths of their backgrounds (e.g., window objects in Fig. 3), which becomes a bias specific to the dataset. To maintain the benefits of dense depth labeling while avoiding overfitting to such dataset-specific biases, we propose Bias Free Masking (BFM), which uses the prior knowledge of pretrained zero-shot MDE models. That is, the strategy of BFM is to exclude, during training, the unreliable regions corresponding to the incorrect GT depths of transparent objects in the synthetic dataset (UnrealStereo4K). Since the pretrained zero-shot MDE model can be considered to have informative prior knowledge learned from large-scale image-depth pairs, it can be used to identify the unreliable regions where its coarse depth $D_c$ and the GT depth $D_{gt}$ deviate significantly from each other, indicating potential biases of the synthetic dataset.
Table 1: Zero-shot performance comparison on high-resolution datasets.

| Methods | Publications | Runtime (s) | Booster AbsRel↓ | Booster δ1↑ | ETH3D AbsRel↓ | ETH3D δ1↑ | Middle14 AbsRel↓ | Middle14 δ1↑ | Middle14 D3R↓ | NuScenes AbsRel↓ | NuScenes δ1↑ | DIS BR↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DepthAnythingV2 | NeurIPS 2024 | - | 0.0307 | 0.993 | 0.0465 | 0.983 | 0.0307 | 0.994 | 0.0979 | 0.106 | 0.882 | 0.059 |
| BoostingDepth | CVPR 2021 | 12.7 | 0.0330 | 0.993 | 0.0552 | 0.974 | 0.0330 | 0.995 | 0.1035 | 0.115 | 0.870 | 0.170 |
| PatchFusion P=16 | CVPR 2024 | 3.4 | 0.0504 | 0.985 | 0.0735 | 0.956 | 0.0450 | 0.989 | 0.1124 | 0.141 | 0.831 | 0.206 |
| PatchFusion P=177 | CVPR 2024 | 36.8 | 0.0496 | 0.986 | 0.0723 | 0.957 | 0.0448 | 0.989 | 0.1050 | 0.139 | 0.833 | 0.189 |
| PatchRefiner P=16 | ECCV 2024 | 1.5 | 0.0348 | 0.989 | 0.0435 | 0.985 | 0.0292 | 0.995 | 0.0830 | 0.107 | 0.879 | 0.151 |
| PatchRefiner P=177 | ECCV 2024 | 16.2 | 0.0336 | 0.991 | 0.0430 | 0.985 | 0.0292 | 0.995 | 0.0805 | 0.106 | 0.881 | 0.141 |
| PRO (Ours) | - | 1.4 | 0.0304 | 0.994 | 0.0422 | 0.985 | 0.0287 | 0.996 | 0.0803 | 0.104 | 0.883 | 0.156 |
Given a resized version of the input image $I$, we obtain the coarse depth $D_c$ from the pretrained MDE network $\mathcal{N}_d$ and the GT depth $D_{gt}$ from the synthetic dataset. Then, by measuring the relative consistency between $D_c$ and $D_{gt}$, we identify unreliable regions as:
$$M_u = \left[\, \max\!\left( \frac{\mathrm{norm}(D_c)}{\mathrm{norm}(D_{gt})},\ \frac{\mathrm{norm}(D_{gt})}{\mathrm{norm}(D_c)} \right) > \tau \,\right] \tag{4}$$
where $[\cdot]$ represents the Iverson bracket, $\mathrm{norm}(\cdot)$ denotes min-max normalization for $D_c$ and $D_{gt}$, and $\tau$ is an empirically chosen threshold (set to 2 in our experiments). $M_u$, as an unreliable mask, identifies the regions where $D_c$ and $D_{gt}$ significantly deviate, indicating potential synthetic dataset biases or inconsistencies, as aforementioned. Since $D_c$ lacks sharp edges, simply complementing $M_u$ to obtain the reliable regions is a naive approach: it excludes critical edge regions from training and thus removes valuable depth gradients required for refinement. To address this, we incorporate edge information into the reliable mask generation. However, simply utilizing the edge map from $D_{gt}$ as an edge mask could bring noise into a training sample, because a flat transparent region may contain the edges of the background behind it, as observed in Fig. 4-(c). To handle this problem, we additionally employ the edge maps from $D_c$. Consequently, we first extract edges from both $D_c$ and $D_{gt}$, and dilate them as:
$$E_c = \mathrm{dilate}\big(\mathrm{edge}(D_c),\, k_c\big), \qquad E_{gt} = \mathrm{dilate}\big(\mathrm{edge}(D_{gt}),\, k_{gt}\big) \tag{5}$$
where $\mathrm{edge}(\cdot)$ denotes edge extraction and $\mathrm{dilate}(\cdot, k)$ denotes dilation with kernel size $k$. The dilation guarantees that the final edge mask does not miss the regions that require depth refinement, which might otherwise be excluded by simply complementing $M_u$. We define the edge mask as $M_e = E_c \cap E_{gt}$. Finally, the BFM that denotes the reliable mask is constructed as $M = \bar{M}_u \cup M_e$, where $\bar{M}_u$ is the complement of $M_u$. This ensures that the unreliable regions due to large discrepancies between $D_c$ and $D_{gt}$ can be ignored, while preserving essential edge details for training. Consequently, the supervision is applied only to the reliable regions via a masked loss:
$$\mathcal{L}_{mask} = \frac{1}{|M|} \sum_{x} M(x)\, \mathcal{L}_{d}\big(D_m(x),\, D_{gt}(x)\big) \tag{6}$$
where $\mathcal{L}_{d}$ is the loss between the merged depth $D_m$ and the GT depth $D_{gt}$, which is the combination of an L1 loss, an L2 loss, and a multi-scale gradient loss with four scale levels [22] using a ratio of 1:1:5. By restricting training to reliable regions, the depth refinement model refines depth only where necessary, avoiding modifications in regions where the pretrained model’s prior knowledge is considered more reliable than the (noisy) GT. This helps prevent the model from refining transparent regions incorrectly.
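The mask construction described above can be sketched as follows. This is an illustrative approximation under several assumptions that are ours, not the paper's: the relative consistency is measured as the larger of the two depth ratios after min-max normalization, edges are detected by thresholding the gradient magnitude, the edge mask is the intersection of the two dilated edge maps, and all parameter defaults (`edge_thresh`, `k_c`, `k_gt`, `eps`) are toy values chosen for small arrays.

```python
import numpy as np

def minmax(x):
    """Min-max normalize an array to [0, 1]."""
    return (x - x.min()) / (x.max() - x.min() + 1e-8)

def dilate(mask, k):
    """Binary dilation with a (2k+1) x (2k+1) square window, via shifted ORs."""
    padded = np.pad(mask, k, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    h, w = mask.shape
    for dy in range(2 * k + 1):
        for dx in range(2 * k + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def edges(depth, thresh):
    """Boolean edge map from the gradient magnitude (an assumed edge detector)."""
    gy, gx = np.gradient(depth)
    return np.hypot(gx, gy) > thresh

def bias_free_mask(coarse, gt, tau=2.0, edge_thresh=0.05, k_c=1, k_gt=1, eps=1e-3):
    """Reliable mask: not-unreliable pixels, plus edges found in both depth maps."""
    c, g = minmax(coarse) + eps, minmax(gt) + eps   # eps avoids division by zero
    unreliable = np.maximum(c / g, g / c) > tau     # large relative disagreement
    edge_mask = dilate(edges(c, edge_thresh), k_c) & dilate(edges(g, edge_thresh), k_gt)
    return ~unreliable | edge_mask
```

In a toy scene where the GT assigns a window the depth of its far background while the coarse prediction keeps it flat, the window interior is flagged as unreliable and, since the coarse map has no edges there, it is not re-admitted by the edge mask.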
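Final Loss function. The final training loss consists of the masked loss $\mathcal{L}_{mask}$ and the depth consistency loss $\mathcal{L}_{dc}$ as:
$$\mathcal{L} = \mathcal{L}_{mask} + \lambda\, \mathcal{L}_{dc} \tag{7}$$
where $\lambda$ is empirically set to 4 in our experiments.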
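As a rough illustration of how the training objective is assembled, the sketch below combines a masked L1 term, a masked L2 term, and a simplified multi-scale gradient term with weights 1:1:5, then adds the consistency term with weight 4. The gradient term here is a simplified variant written for illustration (finite differences of the masked error at strides 1, 2, 4, 8), not necessarily the exact formulation of [22], and all function names are hypothetical.

```python
import numpy as np

def gradient_loss(pred, gt, mask, scales=4):
    """Simplified multi-scale gradient loss on the masked error map."""
    total = 0.0
    for s in range(scales):
        step = 2 ** s
        diff = ((pred - gt) * mask)[::step, ::step]  # subsample the masked error
        total += np.abs(np.diff(diff, axis=0)).mean()
        total += np.abs(np.diff(diff, axis=1)).mean()
    return total

def masked_loss(pred, gt, mask, weights=(1.0, 1.0, 5.0)):
    """L1 + L2 + gradient terms (ratio 1:1:5), averaged over reliable pixels only."""
    m = mask.astype(float)
    n = max(m.sum(), 1.0)  # guard against an empty mask
    l1 = (np.abs(pred - gt) * m).sum() / n
    l2 = (((pred - gt) ** 2) * m).sum() / n
    return weights[0] * l1 + weights[1] * l2 + weights[2] * gradient_loss(pred, gt, m)

def total_loss(pred, gt, mask, consistency, lam=4.0):
    """Final objective: masked loss plus the weighted consistency loss."""
    return masked_loss(pred, gt, mask) + lam * consistency
```

Inputs should be at least 9 pixels along each side so that the coarsest scale still has neighbouring samples to difference.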
4 Experiments
4.1 Datasets and Metrics
We train recent SOTA depth refinement models and our PRO model on a synthetic dataset to get the advantage of dense GT depth for a fair comparison.
UnrealStereo4K. The UnrealStereo4K dataset [38] is a synthetic dataset comprising stereo images with 4K resolution (2,160×3,840) and pixel-wise GT disparity labels. As aforementioned, some of its GT labels are inaccurate, especially for transparent objects, as shown in Fig. 3.
Booster. The Booster dataset [29] has high-resolution (3,008×4,112) indoor images with specular and transparent surfaces. We use the training set with GT depth for testing. To ensure diversity, we removed images captured from the same scenes, resulting in a test set of 38 images.
ETH3D. The ETH3D dataset [35] contains high-resolution indoor and outdoor images (6,048×4,032) with GT depth captured by LiDAR sensors. In this dataset, 34 images are stored rotated 90 degrees counterclockwise, so that the bottom of the scene appears on the right side. Before testing, we rotate these images 90 degrees clockwise so that the floor is correctly positioned at the bottom.
Middlebury 2014. The Middlebury 2014 dataset [34] consists of 23 high-resolution (nearly 4K) indoor images with dense ground truth disparity maps.
NuScenes. The NuScenes dataset [4] is an autonomous driving dataset that provides multi-directional camera images along with depth information obtained from LiDAR and radar sensors. Among outdoor datasets, the Cityscapes dataset [6] provides higher resolution but suffers from blurred edges since its disparity maps are generated using a stereo matching algorithm instead of LiDAR sensors. Therefore, we use the NuScenes dataset for testing.
DIS-5K. DIS-5K dataset [28] is a high-resolution image segmentation dataset that provides accurately annotated masks, making it suitable for evaluating edge accuracy.
Evaluation Metrics. We use the commonly adopted metrics for DE, Absolute Relative Error (AbsRel) and $\delta_1$. Additionally, to evaluate the edge quality, we adopt the D3R metric on the Middlebury 2014 dataset following [26], and measure the Boundary Recall (BR) metric on the DIS-5K dataset following [3]. We utilize the Consistency Error (CE) [23] to evaluate the consistency of the DE results across patches.
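For reference, the two standard depth metrics can be computed as in the sketch below. It uses the conventional definitions (mean absolute relative error, and the fraction of pixels whose prediction-to-GT ratio is within 1.25) and assumes that predictions are positive and have already been aligned to the GT scale; pixels without GT (encoded here as zero, an assumption) are ignored.

```python
import numpy as np

def abs_rel(pred, gt, valid=None):
    """Absolute Relative Error: mean(|pred - gt| / gt) over valid pixels."""
    valid = gt > 0 if valid is None else valid
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

def delta1(pred, gt, valid=None, thresh=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) below thresh."""
    valid = gt > 0 if valid is None else valid
    ratio = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return float(np.mean(ratio < thresh))
```

Lower is better for AbsRel, while higher is better for the threshold accuracy.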
Table 2: Consistency Error (CE) comparison.

| Metric | PatchFusion | PatchRefiner | PRO (Ours) |
|---|---|---|---|
| CE↓ | 0.364 | 0.347 | 0.049 |
Table 3: Ablation study on GPCT and BFM.

| Model | GPCT | BFM | Booster AbsRel↓ | Booster δ1↑ | ETH3D AbsRel↓ | ETH3D δ1↑ | Middle14 AbsRel↓ | Middle14 δ1↑ | Middle14 D3R↓ | NuScenes AbsRel↓ | NuScenes δ1↑ | CE↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (a) | | | 0.0385 | 0.987 | 0.0428 | 0.985 | 0.0292 | 0.995 | 0.0807 | 0.105 | 0.882 | 0.208 |
| (b) | | ✓ | 0.0303 | 0.994 | 0.0426 | 0.985 | 0.0290 | 0.995 | 0.0837 | 0.105 | 0.882 | 0.117 |
| (c) | ✓ | | 0.0313 | 0.993 | 0.0425 | 0.985 | 0.0288 | 0.996 | 0.0792 | 0.105 | 0.882 | 0.058 |
| (d) | ✓ | ✓ | 0.0304 | 0.994 | 0.0422 | 0.985 | 0.0287 | 0.996 | 0.0803 | 0.104 | 0.883 | 0.049 |
4.2 Implementation Details
We adopt the pretrained DepthAnythingV2 (DA2) [47] as the baseline model ($\mathcal{N}_d$ in Fig. 4). PRO, PatchFusion [23], and PatchRefiner [24] are (re)trained based on DA2, while BoostingDepth [26] uses DA2 without additional training. The input resolution to $\mathcal{N}_d$ is fixed at . When we use the GPCT strategy, adjacent patches are cropped with an overlap of 224 pixels, ensuring an overlap of $H \times 224$ for horizontally adjacent patches and $224 \times W$ for vertically adjacent patches, where $H$ and $W$ denote the height and width of the patch, respectively. In our BFM, the dilation kernel size is empirically set to (10, 20). Training is conducted on 7,592 UnrealStereo4K samples for 8 epochs with a batch size of 64 (with gradient accumulation), taking 10 hours on a single RTX 4090 GPU.
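The overlap geometry can be checked with a short sketch. The layout (A top-left, B top-right, C bottom-left, D bottom-right) and the 540×960 patch size used in the usage note are assumptions made for illustration; only the 224-pixel overlap comes from the text.

```python
def grouped_patch_boxes(top, left, patch_h, patch_w, overlap=224):
    """Boxes (top, left, bottom, right) of four patches A, B, C, D in a 2 x 2
    layout, where neighbours share `overlap` pixels along the shared side."""
    step_y, step_x = patch_h - overlap, patch_w - overlap
    return {
        "A": (top, left, top + patch_h, left + patch_w),
        "B": (top, left + step_x, top + patch_h, left + step_x + patch_w),
        "C": (top + step_y, left, top + step_y + patch_h, left + patch_w),
        "D": (top + step_y, left + step_x,
              top + step_y + patch_h, left + step_x + patch_w),
    }

def overlap_shape(box_a, box_b):
    """(height, width) of the intersection of two boxes, zero if disjoint."""
    h = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    w = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    return (max(h, 0), max(w, 0))
```

For example, with a hypothetical 540×960 patch, `overlap_shape` gives (540, 224) for the horizontal pair A and B, (224, 960) for the vertical pair A and C, and (224, 224) for the diagonal pairs.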
4.3 Performance Comparison
At test time, each input image is divided into a grid of 16 patches. The depth of each patch is refined individually, and the refined patches are then reassembled to generate the final depth map.
Zero-shot performance on high-resolution datasets. We evaluate four patch-wise DE models: three depth refinement models, namely our PRO, BoostingDepth [26], and PatchRefiner [24], and one direct DE model, PatchFusion [23]. For PatchFusion and PatchRefiner, our evaluations are done under two conditions: (i) the ‘$P$=16’ setting, which uses the same number of patches as our method, and (ii) the ‘$P$=177’ setting, where additional patches are used for test-time ensembling. Note that $P$ denotes the number of patches used to reassemble the final depth maps during inference. As shown in Table 1, our PRO achieves SOTA performance across all depth metrics while maintaining the lowest inference time. Notably, our BFM prevents overfitting to synthetic dataset biases, leading to a 9.5% improvement in AbsRel over PatchRefiner ($P$=177) on the Booster dataset, which contains transparent objects. In terms of edge quality, our method performs worse than BoostingDepth and PatchFusion in Boundary Recall (BR). However, as shown in the depth metrics, these two methods prioritize edge sharpness over overall depth accuracy. In contrast, when compared to PatchRefiner, which exhibits similar depth performance, our PRO achieves superior results in terms of the edge metric (BR). In terms of consistency, our PRO achieves the best result, with an 85.9% improvement over PatchRefiner, as shown in Table 2. This demonstrates the effectiveness of our GPCT strategy in maintaining consistency between individually processed patches.
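The relative improvements quoted above follow directly from the table entries. The short check below recomputes them, taking PatchRefiner with 177 patches as the reference for Booster AbsRel and runtime, and PatchRefiner's CE as the reference for consistency; the choice of reference rows is our reading of the tables.

```python
def relative_improvement(baseline, ours):
    """Percentage reduction of an error metric relative to a baseline."""
    return 100.0 * (baseline - ours) / baseline

booster = relative_improvement(0.0336, 0.0304)  # Booster AbsRel, Table 1
ce = relative_improvement(0.347, 0.049)         # Consistency Error, Table 2
speedup = 16.2 / 1.4                            # runtime ratio in seconds, Table 1

print(f"{booster:.1f}% {ce:.1f}% {speedup:.1f}x")  # 9.5% 85.9% 11.6x
```

The runtime ratio of about 11.6 is what the text rounds to a 12× speedup.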
Qualitative Comparisons. Fig. 5 shows qualitative comparisons of the four patch-wise DE models. As shown in Fig. 5, PatchFusion and PatchRefiner, trained on the synthetic dataset without masking, exhibit artifacts in the window regions. Additionally, these two patch-based methods [24, 23] produce noticeable boundary artifacts when test-time ensembling is not applied. In contrast, despite refining each patch only once, our PRO demonstrates minimal depth discontinuity while achieving the fastest inference time.
4.4 Ablation Studies
We analyze the effectiveness of the core components of our PRO through ablation studies: Grouped Patch Consistency Training (GPCT) and Bias Free Masking (BFM).
Ablation on GPCT and BFM. Table 3 presents the ablation results on GPCT and BFM. When BFM is applied to the baseline (a), the improvements are relatively minor for most datasets. However, for the Booster dataset [29], which contains transparent objects, we observe a significant improvement of 21.3% in the AbsRel metric, demonstrating that our BFM method effectively identifies unreliable regions. Even when only BFM is applied without GPCT, the consistency error (CE) is reduced, which suggests that supervision limited to reliable regions helps mitigate unnecessary refinement, thereby reducing confusion in the depth refinement process. Furthermore, applying GPCT alone (model (c)) reduces the consistency error from 0.208 to 0.058, which is 50.4% lower than that of the BFM-only model (b), demonstrating its effectiveness in maintaining consistency between independently processed patches. Although model (c) achieves the best performance in the D3R metric, it introduces artifacts in transparent regions. Consequently, model (d), which incorporates both GPCT and BFM, achieves the best overall performance across most metrics.
Table 4: Ablation on overlap sizes in GPCT.

| Overlap | CE↓ | ETH3D AbsRel↓ | Middle14 AbsRel↓ |
|---|---|---|---|
| 28 | 0.108 | 0.0423 | 0.0288 |
| 56 | 0.113 | 0.0424 | 0.0289 |
| 112 | 0.065 | 0.0425 | 0.0288 |
| 224 | 0.049 | 0.0422 | 0.0287 |
| 448 | 0.060 | 0.0423 | 0.0287 |
Ablation on Overlap Sizes in GPCT. Table 4 shows the effect of different overlap sizes in GPCT. Each overlap value represents the shorter side of the overlapping region between adjacent patches. As shown, increasing the overlap generally reduces the CE. However, when the overlap reaches 448 pixels, CE begins to increase. We attribute this to the fact that highly overlapped patches capture nearly identical scene contents, so there are fewer inconsistencies between them during training. As a result, the depth consistency loss provides less meaningful supervision, reducing its effectiveness.
5 Conclusion
In this paper, we propose the Patch Refine Once (PRO) model for depth refinement on high-resolution images. To address the depth discontinuity problem, we introduce the Grouped Patch Consistency Training (GPCT) strategy, ensuring consistency between independently processed patches. Also, we propose Bias Free Masking (BFM) to prevent the depth refinement model from overfitting to dataset-specific biases. Through these strategies, our PRO achieves superior zero-shot performance while maintaining a low inference time (12× faster than PatchRefiner with 177 patches), outperforming recent state-of-the-art methods across diverse datasets.
APPENDIX
![[Uncaptioned image]](/html/2503.22351v1/x6.png)
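Table 5: Ablation study of the fusion module (simple convolutional block vs. FFM).

| Models | FLOPs (Mac) | Parameter | DIS BR↑ | UHRSD BR↑ | Middle14 AbsRel↓ | Middle14 δ1↑ | Middle14 D3R↓ | ETH3D AbsRel↓ | ETH3D δ1↑ | Booster AbsRel↓ | Booster δ1↑ | NuScenes AbsRel↓ | NuScenes δ1↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Conv | 31664G | 63M | 0.127 | 0.071 | 0.0290 | 0.995 | 0.0842 | 0.0428 | 0.984 | 0.0306 | 0.993 | 0.106 | 0.882 |
| FFM (Ours) | 20250G | 51M | 0.156 | 0.083 | 0.0287 | 0.996 | 0.0803 | 0.0422 | 0.985 | 0.0304 | 0.994 | 0.104 | 0.883 |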
Appendix A Architecture of the Fusion Module
The encoder takes , , and as inputs and produces a set of five intermediate features, denoted as following [23]. Then, the fused feature map is obtained through the frequency fusion module (), defined as
Subsequently, is processed through two consecutive layers, each consisting of a convolution, batch normalization, and ReLU activation. Finally, the resulting feature is fed into the DPT decoder [31] to obtain the residual map .
Architecture of the Frequency Fusion Module (FFM). To obtain accurate depth values from the coarse depth and preserve fine details from the fine depth, we design a Frequency Fusion Module (FFM) that effectively extracts and integrates edge information. We utilize the Discrete Wavelet Transform (DWT) to decompose the input features into four frequency components: LL, LH, HL, and HH, which represent the low-frequency and high-frequency information. Each component is processed with its own dedicated convolution to capture scale-specific features. Finally, the components are recombined using the Inverse Discrete Wavelet Transform (IDWT), resulting in fused features that retain both global depth consistency and enhanced edge details. The overall process is described in Fig. 6-(b). To describe this process in more detail, we first decompose the coarse features $F_c$ and the fine features $F_f$ into four frequency sub-bands ($b \in \{LL, LH, HL, HH\}$) using $\mathrm{DWT}$. Each sub-band is then fused using a corresponding convolution $\mathrm{Conv}_b$. Finally, the fused feature map $F$ is obtained through $\mathrm{IDWT}$.
$$\{F_c^{b}\} = \mathrm{DWT}(F_c), \qquad \{F_f^{b}\} = \mathrm{DWT}(F_f) \tag{8}$$
$$F^{b} = \mathrm{Conv}_{b}\big(F_c^{b} \oplus F_f^{b}\big) \tag{9}$$
$$F = \mathrm{IDWT}\big(F^{LL}, F^{LH}, F^{HL}, F^{HH}\big) \tag{10}$$
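The decompose, fuse, and recombine pattern of the FFM can be illustrated with a single-level 2D Haar wavelet in NumPy. This is a toy sketch under assumptions: the Haar basis is our choice (the text does not name the wavelet), a fixed per-band blending weight stands in for the learned convolutions, the inputs are single-channel arrays with even height and width, and the LH/HL naming follows one common convention.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar DWT; returns the LL, LH, HL, HH sub-bands."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a + b - c - d) / 2
    hl = (a - b + c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild the full-resolution array."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=float)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x

def frequency_fuse(coarse_feat, fine_feat, weights=(0.8, 0.2, 0.2, 0.2)):
    """Blend each sub-band (weights[b] is the coarse share), then invert.

    The scalar blend is a hypothetical stand-in for a learned per-band conv.
    """
    bands = [w * bc + (1 - w) * bf
             for w, bc, bf in zip(weights, haar_dwt2(coarse_feat), haar_dwt2(fine_feat))]
    return haar_idwt2(*bands)
```

With the default weights, the low-frequency band is taken mostly from the coarse input and the three high-frequency bands mostly from the fine input, which mirrors the stated goal of keeping global structure from the coarse depth and edges from the fine depth.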
A.1 Ablation study of FFM
To validate the effectiveness of the Frequency Fusion Module (FFM), we conduct an ablation study by replacing the FFM with a simple convolutional block consisting of Conv-ReLU-Conv layers. To ensure that any performance gain is not simply due to an increase in the number of parameters or FLOPs, we design the simple convolutional block to have more parameters and FLOPs than the FFM. This allows us to attribute the performance improvement to the design of the FFM itself, rather than to computational complexity. As shown in Table 5, our method not only achieves the best performance in standard depth metrics, but also yields significant improvements in edge accuracy. Specifically, it achieves a 22.5% improvement on the DIS-5K dataset and a 16.9% improvement on the UHRSD dataset in the Boundary Recall (BR) metric. In addition, we observe a 4.6% improvement in the edge quality metric (D3R). This demonstrates that the proposed FFM effectively integrates edge information through the use of the Discrete Wavelet Transform (DWT), which enables selective enhancement of high-frequency details without sacrificing global structure. This highlights the benefit of frequency-domain processing in depth refinement tasks.
Appendix B Qualitative Results
Ablation Study. In the ablation study (Sec. 4.4), we analyze the effect of Grouped Patch Consistency Training (GPCT) and Bias Free Masking (BFM) quantitatively. In this section, we analyze the effect of GPCT and BFM with qualitative results. As shown in Fig. 7, the model trained without GPCT shows a pronounced depth discontinuity problem along the grid boundaries. On the other hand, our PRO model trained with GPCT alleviates the depth discontinuity problem. Likewise, as shown in Fig. 8, the model trained without BFM exhibits artifacts on transparent surfaces, such as glass windows, as well as on reflective surfaces like TV screens. In contrast, our model trained with BFM effectively refines only the edge regions while preventing artifacts on transparent objects.
Additional Qualitative Results
We provide additional qualitative comparisons of BoostingDepth [26], PatchFusion [23], PatchRefiner [24], and PRO (Ours) on the UHRSD [45] dataset and on internet images (e.g., from Unsplash, https://unsplash.com, and Pexels, https://www.pexels.com), as shown in Fig. 9 and Fig. 10.
References
- Bhat et al. [2021] Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Adabins: Depth estimation using adaptive bins. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4009–4018, 2021.
- Birkl et al. [2023] Reiner Birkl, Diana Wofk, and Matthias Müller. Midas v3.1 – a model zoo for robust monocular relative depth estimation. arXiv preprint arXiv:2307.14460, 2023.
- Bochkovskii et al. [2024] Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. arXiv preprint arXiv:2410.02073, 2024.
- Caesar et al. [2020] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020.
- Chen et al. [2021] Yinbo Chen, Sifei Liu, and Xiaolong Wang. Learning continuous image representation with local implicit image function. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8628–8638, 2021.
- Cordts et al. [2016] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.
- Dai et al. [2023] Yaqiao Dai, Renjiao Yi, Chenyang Zhu, Hongjun He, and Kai Xu. Multi-resolution monocular depth map fusion by self-supervised gradient-based composition. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 488–496, 2023.
- Daubechies [1990] Ingrid Daubechies. The wavelet transform, time-frequency localization and signal analysis. IEEE transactions on information theory, 36(5):961–1005, 1990.
- Eftekhar et al. [2021] Ainaz Eftekhar, Alexander Sax, Jitendra Malik, and Amir Zamir. Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10786–10796, 2021.
- Eigen et al. [2014] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. Advances in neural information processing systems, 27, 2014.
- Fu et al. [2024] Xiao Fu, Wei Yin, Mu Hu, Kaixuan Wang, Yuexin Ma, Ping Tan, Shaojie Shen, Dahua Lin, and Xiaoxiao Long. Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image. In European Conference on Computer Vision, pages 241–258. Springer, 2024.
- Geiger et al. [2013] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The international journal of robotics research, 32(11):1231–1237, 2013.
- Godard et al. [2017] Clément Godard, Oisin Mac Aodha, and Gabriel J Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 270–279, 2017.
- Godard et al. [2019] Clément Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J Brostow. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3828–3838, 2019.
- Guizilini et al. [2020] Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, and Adrien Gaidon. 3d packing for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2485–2494, 2020.
- He et al. [2017] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
- Hua et al. [2020] Yiwen Hua, Puneet Kohli, Pritish Uplavikar, Anand Ravi, Saravana Gunaseelan, Jason Orozco, and Edward Li. Holopix50k: A large-scale in-the-wild stereo image dataset. arXiv preprint arXiv:2003.11172, 2020.
- Huang et al. [2018] Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. Deepmvs: Learning multi-view stereopsis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2821–2830, 2018.
- Ke et al. [2024] Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9492–9502, 2024.
- Koch et al. [2018] Tobias Koch, Lukas Liebel, Friedrich Fraundorfer, and Marco Korner. Evaluation of cnn-based single-image depth estimation methods. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.
- Li et al. [2025] Jiaqi Li, Yiran Wang, Jinghong Zheng, Zihao Huang, Ke Xian, Zhiguo Cao, and Jianming Zhang. Self-distilled depth refinement with noisy poisson fusion. Advances in Neural Information Processing Systems, 37:69999–70025, 2025.
- Li and Snavely [2018] Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2041–2050, 2018.
- Li et al. [2024a] Zhenyu Li, Shariq Farooq Bhat, and Peter Wonka. Patchfusion: An end-to-end tile-based framework for high-resolution monocular metric depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10016–10025, 2024a.
- Li et al. [2024b] Zhenyu Li, Shariq Farooq Bhat, and Peter Wonka. Patchrefiner: Leveraging synthetic data for real-domain high-resolution monocular metric depth estimation. In European Conference on Computer Vision, pages 250–267. Springer, 2024b.
- Li et al. [2024c] Zhenyu Li, Xuyang Wang, Xianming Liu, and Junjun Jiang. Binsformer: Revisiting adaptive bins for monocular depth estimation. IEEE Transactions on Image Processing, 2024c.
- Miangoleh et al. [2021] S Mahdi H Miangoleh, Sebastian Dille, Long Mai, Sylvain Paris, and Yagiz Aksoy. Boosting monocular depth estimation models to high-resolution via content-adaptive multi-resolution merging. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9685–9694, 2021.
- Moon et al. [2024] Jaeho Moon, Juan Luis Gonzalez Bello, Byeongjun Kwon, and Munchurl Kim. From-ground-to-objects: Coarse-to-fine self-supervised monocular depth estimation of dynamic objects with ground contact prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10519–10529, 2024.
- Qin et al. [2022] Xuebin Qin, Hang Dai, Xiaobin Hu, Deng-Ping Fan, Ling Shao, and Luc Van Gool. Highly accurate dichotomous image segmentation. In European Conference on Computer Vision, pages 38–56. Springer, 2022.
- Ramirez et al. [2023] Pierluigi Zama Ramirez, Alex Costanzino, Fabio Tosi, Matteo Poggi, Samuele Salti, Stefano Mattoccia, and Luigi Di Stefano. Booster: a benchmark for depth from images of specular and transparent surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(1):85–102, 2023.
- Ranftl et al. [2020] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE transactions on pattern analysis and machine intelligence, 44(3):1623–1637, 2020.
- Ranftl et al. [2021] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 12179–12188, 2021.
- Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
- Saxena et al. [2008] Ashutosh Saxena, Min Sun, and Andrew Y Ng. Make3d: Learning 3d scene structure from a single still image. IEEE transactions on pattern analysis and machine intelligence, 31(5):824–840, 2008.
- Scharstein et al. [2014] Daniel Scharstein, Heiko Hirschmüller, York Kitajima, Greg Krathwohl, Nera Nešić, Xi Wang, and Porter Westling. High-resolution stereo datasets with subpixel-accurate ground truth. In Pattern Recognition: 36th German Conference, GCPR 2014, Münster, Germany, September 2-5, 2014, Proceedings 36, pages 31–42. Springer, 2014.
- Schops et al. [2017] Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3260–3269, 2017.
- Shao et al. [2023] Shuwei Shao, Zhongcai Pei, Weihai Chen, Xingming Wu, and Zhengguo Li. Nddepth: Normal-distance assisted monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7931–7940, 2023.
- Silberman et al. [2012] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V 12, pages 746–760. Springer, 2012.
- Tosi et al. [2021] Fabio Tosi, Yiyi Liao, Carolin Schmitt, and Andreas Geiger. Smd-nets: Stereo mixture density networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8942–8952, 2021.
- Wang et al. [2019] Chaoyang Wang, Simon Lucey, Federico Perazzi, and Oliver Wang. Web stereo video supervision for depth prediction from dynamic scenes. In 2019 International Conference on 3D Vision (3DV), pages 348–357. IEEE, 2019.
- Wang et al. [2021] Qiang Wang, Shizhen Zheng, Qingsong Yan, Fei Deng, Kaiyong Zhao, and Xiaowen Chu. Irs: A large naturalistic indoor robotics stereo dataset to train deep models for disparity and surface normal estimation. In 2021 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2021.
- Wang et al. [2020] Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. Tartanair: A dataset to push the limits of visual slam. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4909–4916. IEEE, 2020.
- Wang et al. [2023] Yiran Wang, Min Shi, Jiaqi Li, Zihao Huang, Zhiguo Cao, Jianming Zhang, Ke Xian, and Guosheng Lin. Neural video depth stabilizer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9466–9476, 2023.
- Watson et al. [2021] Jamie Watson, Oisin Mac Aodha, Victor Prisacariu, Gabriel Brostow, and Michael Firman. The temporal opportunist: Self-supervised multi-frame monocular depth. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1164–1174, 2021.
- Xian et al. [2018] Ke Xian, Chunhua Shen, Zhiguo Cao, Hao Lu, Yang Xiao, Ruibo Li, and Zhenbo Luo. Monocular relative depth perception with web stereo data supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 311–320, 2018.
- Xie et al. [2022] Chenxi Xie, Changqun Xia, Mingcan Ma, Zhirui Zhao, Xiaowu Chen, and Jia Li. Pyramid grafting network for one-stage high resolution saliency detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11717–11726, 2022.
- Yang et al. [2024] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10371–10381, 2024.
- Yang et al. [2025] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. Advances in Neural Information Processing Systems, 37:21875–21911, 2025.
- Yin et al. [2020] Wei Yin, Xinlong Wang, Chunhua Shen, Yifan Liu, Zhi Tian, Songcen Xu, Changming Sun, and Dou Renyin. Diversedepth: Affine-invariant depth prediction using diverse data. arXiv preprint arXiv:2002.00569, 2020.
- Yuan et al. [2022] Weihao Yuan, Xiaodong Gu, Zuozhuo Dai, Siyu Zhu, and Ping Tan. Neural window fully-connected crfs for monocular depth estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3916–3925, 2022.