Temporally-consistent 3D Reconstruction of Seabirds


Johannes Hägerlind1, Jonas Hentati-Sundberg2, Bastian Wandt1
1Linköping University, Sweden
2Swedish University of Agricultural Sciences, Sweden
{johannes.hagerlind, bastian.wandt}@liu.se, jonas.sundberg@slu.se
Abstract

This paper deals with the 3D reconstruction of seabirds, which have recently come into the focus of environmental scientists as valuable bio-indicators of environmental change. Such 3D information is beneficial for analyzing the birds' behavior and physiological shape, for example by tracking motion, shape, and appearance changes. From a computer vision perspective, birds are especially challenging due to their rapid and oftentimes non-rigid motions. We propose an approach to reconstruct the 3D pose and shape from monocular videos of a specific species of seabird, the common murre. Our approach comprises a full pipeline of detection, tracking, segmentation, and temporally consistent 3D reconstruction. Additionally, we propose a temporal loss that extends current single-image 3D bird pose estimators to the temporal domain. Moreover, we provide a real-world dataset of 10000 frames of video observations that capture, on average, nine birds simultaneously and comprise a large variety of motions and interactions, including a smaller test set with bird-specific keypoint labels. Using our temporal optimization, we achieve state-of-the-art performance for the challenging sequences in our dataset (available at https://huggingface.co/datasets/seabirds/common_murre_temporal).

1 Introduction and Related Work

Studying the detailed behaviour of animals is a fundamental topic in biological, ecological and environmental conservation research [7]. Seabirds are a large and diverse group of animals with a high conservation value and are known for their potential to indicate changes in marine and terrestrial ecosystems [9, 18]. Behavioural studies of seabirds have a long history, and novel technologies such as cameras and computer vision have been increasingly used in applied research [8, 13].

An automated 3D reconstruction of seabirds from video sequences can offer detailed insights into behavior, physiology, and adaptability over time. In this paper, we present a novel approach aimed at reconstructing the 3D pose and shape of a specific species of seabird, namely the common murre (Uria aalge). Our method encompasses a multi-stage pipeline, including detection, tracking, segmentation, and temporally consistent 3D reconstruction.

Figure 1: The proposed pipeline. The pink box represents learning the 3D pose prior [24]; the blue boxes show the fitting of the parameterized model to the 3D scan and the prediction of segmentation masks, inspired by [11]; the orange boxes show additional improvements made in the current work; and the green boxes show the integration of temporal information, which is the main contribution of this work.

Many methods use parametric mesh models for the 3D reconstruction of humans, e.g. methods that build upon SMPL [16], such as [6, 14, 27, 15, 4, 26, 23, 22, 25]. For birds, Badger et al. [3] developed a 3D reconstruction method for cowbirds. Wang et al. [24] built on [3] and developed species-specific as well as multi-species shape models. Hägerlind [11] noted that the method of [24] was not sufficient to reconstruct the common murre from top-view images, which is the dominant view for the cliff-inhabiting common murre. Instead, [11] uses the pose prior and bone length prior of the cowbird model in [3] to fit keypoints annotated in a 3D scan. The resulting bone length and shape parameters are used as an initialization for a more information-rich side-view optimization that uses 2D images annotated with keypoints and masks as input. In the side-view optimization, [11] uses a method similar to [24] and moves the mean of the bone length and the shape parameters towards that of the common murre. Finally, the results from the side-view optimization are used to initialize the top-view optimization.

We build on top of the work by [11] by using the mesh parameters and optimization parameters and extending the single image-based approach to a temporal approach. To achieve this we introduce a motion consistency assumption. This temporal assumption is crucial for capturing the dynamic nature of seabird movements and ensuring the fidelity of reconstructed 3D poses over time. We also investigate the use of temporally consistent bone lengths. Additionally, to improve the keypoint detections, we investigate the use of a weighted median filter. Fig. 1 shows our full framework.

To facilitate further research and benchmarking efforts in this domain, we introduce a real-world dataset comprising video observations with 10K consecutive frames, created by researchers in the Baltic Seabird Project [1]. This dataset captures, on average, nine seabirds simultaneously engaged in a diverse array of behaviors that lead to large pose changes, e.g. flapping their wings, and to interactions with strong occlusions. We provide this dataset and a small test dataset containing keypoint labels for 7 birds in 100 consecutive frames at https://huggingface.co/datasets/seabirds/common_murre_temporal.

In summary, this paper presents a comprehensive framework for 3D reconstruction of seabirds from monocular videos, addressing the unique challenges posed by their behavior and movements. Through our proposed method and the accompanying dataset, we aim to advance the field of seabird research, providing valuable insights into their ecological significance and responses to environmental change.

2 Method

Fig. 1 shows all processing steps of our full approach. It consists of a detection and tracking stage, an offline 3D scan fitting and the temporal pose optimization.

2.1 Detection and Segmentation

We use the segmentation network provided by Álvarez Fernández Del Vallado [2]. The keypoint detector is trained using DeepLabCut [17, 19] by fine-tuning a ResNet-50 [12]. The training dataset consists of 500 images with 20 keypoints (2 more than [11]). Since there are many frames where birds are close together, we follow [11] and consider each animal individually. First, the image is cropped using the bounding boxes obtained from the segmentation network. This is followed by masking all pixels that are not labeled by the predicted segmentation masks. To compensate for possibly inaccurate segmentation masks, we pad the bounding box by 40 pixels in each direction and then dilate the original prediction using a square kernel of width 70 as in [11].

2.1.1 Weighted Median Filter

To filter occasional misdetections, a weighted median filter is applied to the detected 2D keypoints using a window size of 5. The x and y coordinates are filtered separately. Within each window, the output coordinate is chosen based on the median of the cumulative sum of the confidences associated with the keypoints (separately for the x and the y dimensions). This reduces the number of outliers in the keypoint detections.
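As an illustration, the filter can be sketched in plain Python as follows. This is a minimal sketch and not the authors' implementation: the function names are hypothetical, the weighted median is taken as the sorted value at which the cumulative confidence first reaches half of the total, and the truncation of the window at the sequence borders and the fallback for all-zero confidences are assumptions.

```python
def weighted_median(values, weights):
    """Return the value at which the cumulative weight first reaches half the total."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    if total <= 0:
        # Assumption: fall back to the plain (upper) median if all confidences are zero.
        return pairs[len(pairs) // 2][0]
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= total / 2.0:
            return value
    return pairs[-1][0]


def filter_track(coords, confidences, window=5):
    """Filter one coordinate (x or y) of one keypoint over time."""
    half = window // 2
    n = len(coords)
    filtered = []
    for t in range(n):
        # Assumption: the window is truncated at the start and end of the sequence.
        lo, hi = max(0, t - half), min(n, t + half + 1)
        filtered.append(weighted_median(coords[lo:hi], confidences[lo:hi]))
    return filtered


# Usage: a low-confidence outlier (80.0) in an otherwise smooth x-track.
xs = [10.0, 10.5, 80.0, 11.0, 11.5, 12.0]
conf = [0.9, 0.9, 0.1, 0.9, 0.9, 0.9]
smoothed_xs = filter_track(xs, conf)
```

The same function would be applied independently to the y coordinates and to each keypoint, so a single misdetection with low confidence does not influence the other dimension.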

2.2 Tracking

The tight bounding boxes around the predicted segmentation mask are used as input to a tracker. Using the bounding box of the segmentation masks allows for a direct connection between the tracker and the segmentation mask (necessary for later steps). In case a segmentation mask is missed, there is a 5-frame memory that keeps track of the previous prediction. We track based on the highest IoU between bounding boxes in consecutive frames.
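The tracking rule described above could look roughly like the following sketch. It is not the tracker used in the paper: the class name, the greedy matching order, and the minimum IoU threshold `min_iou` are assumptions introduced for illustration; only the IoU criterion and the 5-frame memory come from the text.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


class IoUTracker:
    """Greedy highest-IoU matching with a short memory for missed detections."""

    def __init__(self, memory=5, min_iou=0.1):
        self.memory = memory      # frames a track survives without a match
        self.min_iou = min_iou    # hypothetical threshold, not from the paper
        self.tracks = {}          # track id -> [last box, frames since last match]
        self.next_id = 0

    def update(self, boxes):
        """Return one track id per box of the current frame."""
        candidates = sorted(
            ((iou(track[0], box), tid, j)
             for tid, track in self.tracks.items()
             for j, box in enumerate(boxes)),
            reverse=True,
        )
        assigned, used = {}, set()
        for score, tid, j in candidates:
            if score < self.min_iou:
                break
            if tid in used or j in assigned:
                continue
            assigned[j] = tid
            used.add(tid)
        # Unmatched boxes start new tracks.
        for j in range(len(boxes)):
            if j not in assigned:
                assigned[j] = self.next_id
                self.next_id += 1
        # Age all tracks and drop those that exceeded the memory.
        for tid in list(self.tracks):
            self.tracks[tid][1] += 1
            if self.tracks[tid][1] > self.memory:
                del self.tracks[tid]
        # Matched and new tracks store the current box with age zero.
        for j, tid in assigned.items():
            self.tracks[tid] = [boxes[j], 0]
        return [assigned[j] for j in range(len(boxes))]
```

Because each track keeps its last box while unmatched, a bird whose mask is missed for a few frames is re-associated with the same id once its mask is detected again near the remembered position.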

2.3 Fitting the 3D Model to the Image

We aim to fit a 3D bird model to the 2D keypoints and 2D masks. To allow for batch optimization, we pad and scale the keypoints and segmentation masks to a dimension of 256×256 pixels. The starting point is the common murre model from Hägerlind [11], adapted from [24]. The shape and pose of the reconstructed bird model are controlled by the translation ($\kappa$), the scale ($\sigma$), the global orientation ($\theta_g$), and the body pose ($\theta_p$), which is parameterized by joint angles. The scale parameter scales all the bones by a common factor. As in [11], we keep the depth fixed since the camera is looking from the top towards a flat surface. We keep the bone lengths constant since optimizing them was shown to reduce the perceptual quality in this setting (cf. [11]). The model $M$ is hence described by the function $M(\kappa, \sigma, \theta_g, \theta_p)$. As initialization, we use the method in [11], where we rotate the 3D bird in top view through a full 360° in 12° steps and select the rotation that best matches the predicted 2D keypoints from Sec. 2.1. We optimize the full parameter set $\kappa$, $\sigma$, $\theta_g$, and $\theta_p$.
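The rotation sweep used for initialization can be illustrated with a simplified 2D sketch. Several simplifications are assumed here and are not taken from [11]: the template is given as already projected top-view keypoints, the rotation is performed in the image plane around the template centroid, and the match is scored with a confidence-weighted squared distance. All function names are hypothetical.

```python
import math


def centroid(points):
    """Mean position of a list of (x, y) points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def rotate(points, angle_deg, center):
    """Rotate 2D points counter-clockwise around a center."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    cx, cy = center
    return [(cx + c * (x - cx) - s * (y - cy),
             cy + s * (x - cx) + c * (y - cy)) for x, y in points]


def best_initial_rotation(template, detected, confidences, step_deg=12):
    """Try all rotations in step_deg increments and return the best-matching angle."""
    center = centroid(template)
    best_angle, best_cost = None, float("inf")
    for angle in range(0, 360, step_deg):  # 30 candidates for 12-degree steps
        rotated = rotate(template, angle, center)
        cost = sum(c * ((rx - dx) ** 2 + (ry - dy) ** 2)
                   for (rx, ry), (dx, dy), c in zip(rotated, detected, confidences))
        if cost < best_cost:
            best_angle, best_cost = angle, cost
    return best_angle
```

A coarse sweep of this kind only has to place the model in the correct basin of attraction; the subsequent gradient-based optimization refines the orientation.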

Frame-wise objective. We minimize the frame-wise loss that achieved the best results in [11]:

$$E_{start}(\Theta) = \lambda_{kpt} E_{kpt} + \lambda_{msk} E_{msk} + \lambda_{pp} E_{pp}, \tag{1}$$

where $E_{kpt}$ is a keypoint reprojection error, $E_{msk}$ is a mask error, and $E_{pp}$ is a pose prior. We set $\lambda_{msk} = 1$, $\lambda_{kpt} = 1$, and $\lambda_{pp} = 100$. The keypoint loss, mask loss, and pose prior loss are calculated similarly to [24]. The keypoint loss is an instance of the Geman-McClure error function (cf. [10]) given by

$$E_{kpt} = \sum_{i=1}^{N} c_i \frac{\sigma^2 (\Pi(m_i) - p_i)^2}{\sigma^2 + (\Pi(m_i) - p_i)^2}, \tag{2}$$

where $N$ is the number of keypoints, $\Pi(m_i)$ is a keypoint projected from the mesh (using a simple perspective camera without any distortion), $p_i$ is the corresponding target keypoint, and $c_i$ is the confidence assigned to the keypoint prediction. As in previous work [24, 11], we use $\sigma = 50$. The mask loss $E_{msk}$ is calculated as the L1 distance between the predicted mask and the soft mask (silhouette) rendered using the PyTorch3D soft rasterizer [21]. The pose prior loss is calculated using the squared Mahalanobis distance as in [5]:

$$E_{pp} = (\mathbf{x} - \boldsymbol{\mu})^{T} \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}), \tag{3}$$

where the mean $\boldsymbol{\mu}$ is taken from [11], and the covariance $\mathbf{\Sigma}$ is taken from [3] (from the cowbird species).
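To illustrate why the Geman-McClure form in Eq. 2 is robust to misdetected keypoints, the following sketch evaluates the keypoint term in plain Python. It assumes that the squared residual $(\Pi(m_i) - p_i)^2$ denotes the squared Euclidean distance in pixels; the function name is hypothetical and the code is not the authors' implementation.

```python
def geman_mcclure_keypoint_loss(projected, target, confidences, sigma=50.0):
    """Confidence-weighted Geman-McClure loss over 2D keypoints (cf. Eq. 2)."""
    loss = 0.0
    for (px, py), (tx, ty), c in zip(projected, target, confidences):
        # Assumption: the squared residual is the squared Euclidean distance.
        r2 = (px - tx) ** 2 + (py - ty) ** 2
        loss += c * sigma ** 2 * r2 / (sigma ** 2 + r2)
    return loss


# Usage: small residuals behave quadratically, large ones saturate below sigma^2.
small = geman_mcclure_keypoint_loss([(1.0, 0.0)], [(0.0, 0.0)], [1.0])
large = geman_mcclure_keypoint_loss([(1000.0, 0.0)], [(0.0, 0.0)], [1.0])
```

For residuals much smaller than $\sigma$ the term is close to the squared distance, while each keypoint contributes at most $c_i \sigma^2$, so a single gross outlier cannot dominate the fit.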

Temporal objective. Since our goal is to achieve temporal consistency in a sequence of poses, we introduce two additional regularization terms for the velocity and the acceleration.

The first regularizer aims to decrease the difference between consecutive 3D poses

$$E_{vel} = \sum_{k \in \{g,p\}} \beta_k \sum_{i=1}^{N} \lVert \theta_{k,i+1} - \theta_{k,i} \rVert_2. \tag{4}$$

While regularizing the velocity already significantly smooths the motion, some jitter remains. To address this, we introduce an additional acceleration-based term:

$$E_{acc} = \sum_{k \in \{g,p\}} \beta_k \sum_{j=2}^{N} \lVert \theta'_{k,j+1} - \theta'_{k,j} \rVert_2. \tag{5}$$

$\theta'$ denotes the velocity. The global orientation $\theta_g$ and body pose $\theta_p$ have separate weights: $\beta_g = 10$ for global orientation and $\beta_p = 1$ for body pose. This is based on the assumption that movements in the joints are likely to be faster than global orientation changes.

The combined objective function is

$$E = E_{start} + \lambda_{vel} E_{vel} + \lambda_{acc} E_{acc}. \tag{6}$$
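The temporal terms in Eqs. 4 to 6 can be sketched in plain Python as follows. This is a sketch under stated assumptions rather than the authors' code: poses are given as per-frame parameter vectors, the velocity is taken as the finite difference between consecutive frames, the sums run over all consecutive pairs available in the window, and the function names and the default weights $\lambda_{vel} = \lambda_{acc} = 100$ are chosen for illustration.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two parameter vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def finite_differences(sequence):
    """Frame-to-frame differences, used here as the velocity."""
    return [[b - a for a, b in zip(x, y)] for x, y in zip(sequence, sequence[1:])]


def sum_of_step_norms(sequence):
    """Sum of L2 norms of differences between consecutive entries."""
    return sum(l2_distance(x, y) for x, y in zip(sequence, sequence[1:]))


def temporal_losses(theta_g, theta_p, beta_g=10.0, beta_p=1.0):
    """Velocity (Eq. 4) and acceleration (Eq. 5) regularizers for one window."""
    e_vel, e_acc = 0.0, 0.0
    for beta, sequence in ((beta_g, theta_g), (beta_p, theta_p)):
        e_vel += beta * sum_of_step_norms(sequence)
        e_acc += beta * sum_of_step_norms(finite_differences(sequence))
    return e_vel, e_acc


def total_objective(e_start, e_vel, e_acc, lambda_vel=100.0, lambda_acc=100.0):
    """Combined objective of Eq. 6."""
    return e_start + lambda_vel * e_vel + lambda_acc * e_acc
```

Note that a motion with constant velocity is penalized by the velocity term but not by the acceleration term, which is why the latter targets jitter rather than steady movement.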

Common size constraint. Although a bird can vary in shape, the bone length should remain constant during a reasonable time frame. In some experiments, we enforce this by optimizing a single scale for all bones during the full temporal window.

Optimization. There are two steps in the mesh optimization, excluding initialization. The first step uses the objective in Eq. 6 and the second step adds a mask loss. The first step uses 600 iterations and the second step uses 400 iterations. We use the Adam optimizer [20] and a learning rate of 0.01.

3 Experiments

3.1 Dataset

The common murre is a particularly interesting seabird as an indicator of environmental change since it heavily interacts with the environment by catching fish in the ocean. Moreover, it is relatively easy to observe since it breeds on cliffs that can be equipped with surveillance cameras. Researchers in the Baltic Seabird Project [1] have created a dataset comprising 10K consecutive frames capturing common murres on a cliff ledge during the main breeding season. The resolution is $2592 \times 1520$ px and the frame rate is 25 frames per second. On average there are nine birds in the camera view. We identify several different behaviors: standing, walking, flying away, approaching, preening, flapping wings, and attacking other birds. The dataset shows many challenging poses, such as bending the neck backward, as well as non-rigid deformations, mainly of the neck. Additionally, interactions between individual birds lead to strong occlusions, posing an additional challenge for tracking and reconstruction. In addition to the video sequences, we provide temporally consistent 2D keypoint labels for 100 images for 7 out of 9 birds for testing purposes. While we target accurate and time-consistent 3D reconstruction, this dataset also enables further behavioral studies for the computer vision community.

3.2 Metrics

Since there is no available 3D data for evaluation, we reproject the mesh keypoints to 2D and compare them with the ground-truth keypoint labels. The metrics $me_p$ and $me_v$ measure the RMS error of the projected mesh keypoint positions and velocities, respectively. Each error is divided by the longest side of the bounding box of the predicted segmentation mask to enable comparison across different scales.
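One possible reading of these metrics is sketched below. The exact aggregation is not specified in the text, so the following choices are assumptions: the squared errors are normalized per frame by the longest bounding-box side, averaged over all keypoints and frames, and the velocity is the frame-to-frame difference of the 2D keypoints. All function names are hypothetical.

```python
import math


def normalized_rms(pred, gt, bbox_sizes):
    """RMS of 2D errors, each divided by the longest bounding-box side of its frame.

    pred, gt: per-frame lists of (x, y) tuples; bbox_sizes: per-frame (width, height).
    """
    squared = []
    for frame_pred, frame_gt, (w, h) in zip(pred, gt, bbox_sizes):
        scale = max(w, h)
        for (px, py), (gx, gy) in zip(frame_pred, frame_gt):
            squared.append(((px - gx) ** 2 + (py - gy) ** 2) / scale ** 2)
    return math.sqrt(sum(squared) / len(squared))


def keypoint_velocities(frames):
    """Frame-to-frame displacement of every keypoint."""
    return [[(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in zip(a, b)]
            for a, b in zip(frames, frames[1:])]


def position_and_velocity_error(pred, gt, bbox_sizes):
    """Return (me_p, me_v) under the assumptions stated above."""
    me_p = normalized_rms(pred, gt, bbox_sizes)
    # Assumption: velocities are normalized by the box of the later frame.
    me_v = normalized_rms(keypoint_velocities(pred), keypoint_velocities(gt), bbox_sizes[1:])
    return me_p, me_v
```

Under this reading, a reconstruction with a constant offset from the labels has a nonzero $me_p$ but zero $me_v$, whereas frame-to-frame jitter increases $me_v$ even when $me_p$ stays small.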

3.3 Experiments

We conduct a series of experiments in different settings, evaluating all parts of our proposed pipeline. Fig. 2 shows an example of a 3D reconstruction from our approach. Note that we only show reconstructions for a subset of all the birds in the image. This is due to limitations in the segmentation network, which only provides trackable regions for the visualized birds. A single-image reconstruction for the remaining frames is conceivable, but here we focus on the results of our tracker in combination with our temporal optimization. The supplementary material contains additional videos showing reconstructions on the test set using different parameter settings.

Figure 2: Example reconstruction. The odd rows show the input image. The even rows show the corresponding mesh for the tracked bird rendered on top of the background image. The texture of the reconstructed bird is only added for visualization purposes.

3.4 Quantitative Results

In total, 66 experiments were conducted. Temporal window sizes of 1 (no temporal optimization) and 100 are investigated. For the window size of 1, the use of a median filter for the input 2D joints is investigated. The setting with window size 1 and no median filter corresponds to the setting used by [11].

For the temporal window size of 100, the following cases are investigated:

  • $\lambda_{vel} \in \{10^2, 10^3, 10^4, 10^5\}$.

  • Use acceleration loss (True/False). If True, $\lambda_{acc} = \lambda_{vel}$; if False, $\lambda_{acc} = 0$.

  • Use weighted median filter for the predicted keypoints (True/False).

  • Optimize a common size in the temporal window (True/False).

| Method | $\lambda_{vel}$ | acc | med | size | $me_p \downarrow$ | $me_v \downarrow$ |
|---|---|---|---|---|---|---|
| baseline | 0 | - | False | - | 0.0824 | 0.0322 |
| Ours | $10^5$ | False | True | True | 0.1647 | 0.0164 |
| | $10^4$ | False | True | False | 0.1099 | 0.0142 |
| | 1000 | True | False | True | 0.0872 | 0.0196 |
| | 1000 | True | False | False | 0.0845 | 0.0174 |
| | 1000 | True | True | True | 0.0824 | 0.0172 |
| | 1000 | False | False | False | 0.0823 | 0.0188 |
| | 1000 | False | True | True | 0.0823 | 0.0179 |
| | 1000 | False | False | True | 0.0820 | 0.0187 |
| | 1000 | True | True | False | 0.0817 | 0.0154 |
| | 1000 | False | True | False | 0.0816 | 0.0178 |
| | 0 | - | True | - | 0.0809 | 0.0262 |
| | 100 | False | True | False | 0.0804 | 0.0182 |
| | 100 | False | True | True | 0.0804 | 0.0185 |
| | 100 | False | False | False | 0.0803 | 0.0223 |
| | 100 | False | False | True | 0.0803 | 0.0228 |
| | 100 | True | False | False | 0.0793 | 0.0196 |
| | 100 | True | False | True | 0.0791 | 0.0196 |
| | 100 | True | True | False | 0.0758 | 0.0149 |
| Ours (best) | 100 | True | True | True | 0.0756 | 0.0150 |

Table 1: Evaluation on our test set, sorted by $me_p$ in descending order (worst to best). The first row shows the baseline. The following abbreviations are used: acc for the acceleration loss, med for the median filter, size for the common size in the temporal window, $me_p$ for the mean error of the keypoint positions, and $me_v$ for the mean error of the keypoint velocities. Each individual contribution improves the performance.

Table 1 shows the evaluation results. The best $me_p$ is achieved using a window size of 100, a temporal loss weight of $\lambda_{vel} = 100$, the weighted median filter, a common size in the window, and an acceleration loss. Each individual component improves the performance. Looking at the top-8 $me_p$ values, we see that using the weighted median filter in conjunction with the acceleration loss results in a lower $me_p$. The best $me_p$ for a window size of 100 is $6.6\%$ lower than the best result for a window size of 1, validating the superior performance of our temporal approach compared to single-frame methods.
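The quoted relative improvement can be checked directly from the $me_p$ values in Table 1. The short computation below uses the best window-size-1 result (0.0809, with median filter) and the best window-size-100 result (0.0756); the variable names are chosen for illustration.

```python
best_single_frame = 0.0809  # window size 1 with median filter (Table 1)
best_temporal = 0.0756      # window size 100, "Ours (best)" (Table 1)

# Relative reduction of the position error with respect to the single-frame result.
relative_reduction = (best_single_frame - best_temporal) / best_single_frame
print(f"{100 * relative_reduction:.1f}%")  # prints 6.6%
```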

Comparing $\lambda_{vel} = 100$ and $\lambda_{vel} = 1000$, we see that the former results in a better $me_p$. The first two rows after the baseline show the best results for $\lambda_{vel} = 10^5$ and $\lambda_{vel} = 10^4$, respectively. Using either $\lambda_{vel} = 10^4$ or $\lambda_{vel} = 10^5$ greatly increases the $me_p$ (i.e. worsens the result).

Using $\lambda_{vel} = 0$, i.e. a window size of 1, produces many spurious high-frequency motions of the body pose and global orientation that do not exist in the video. Furthermore, the scale of the bird changes in an unnatural way.

Using $\lambda_{vel} = 100$ removes many, but not all, of the non-existing high-frequency motions, and $\lambda_{vel} = 1000$ reduces them further.

A large weight for the temporal regularizer, $\lambda_{vel} \in \{10^4, 10^5\}$, fails to capture the quick motion of the birds, and $\lambda_{vel} = 10^5$ even has a severe negative impact on the size, even when optimizing a single size in the window (see videos in the supplementary material).

4 Conclusion

This pilot study investigates how different temporal assumptions can be used to improve the 3D reconstruction of the common murre captured by monocular cameras. We showed that our temporal regularizer, including the acceleration, leads to a significantly improved performance when used together with our weighted median filter, which improves the 2D keypoint prediction. Additionally, the temporal loss helps to enforce more physically plausible motions. Moreover, optimizing for a single scale during the whole sequence is another way to enforce temporal coherence and further improves the reconstruction. Since we build upon [24] our method still fails for extreme pose changes.

We will deal with such strong deformations in future work.

5 Acknowledgments

This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation, Sweden.

The computations were enabled by the Berzelius resource provided by the Knut and Alice Wallenberg Foundation at the National Supercomputer Centre.

References

  • [1] Baltic seabird project. http://www.balticseabird.com/. Accessed: 2024-05-23.
  • [2] Juan Álvarez Fernández Del Vallado. Alternative solution to catastrophical forgetting on fewshot instance segmentation, 2021.
  • [3] Marc Badger, Yufu Wang, Adarsh Modh, Ammon Perkes, Nikos Kolotouros, Bernd G Pfrommer, Marc F Schmidt, and Kostas Daniilidis. 3d bird reconstruction: a dataset, model, and shape recovery from a single view. In European Conference on Computer Vision, pages 1–17. Springer, 2020.
  • [4] Fabien Baradel, Thibault Groueix, Philippe Weinzaepfel, Romain Brégier, Yannis Kalantidis, and Grégory Rogez. Leveraging mocap data for human mesh recovery. In 2021 International Conference on 3D Vision (3DV), pages 586–595. IEEE, 2021.
  • [5] Christopher M Bishop. Pattern recognition and machine learning. Springer Science+Business Media, LLC, 2006.
  • [6] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In European conference on computer vision, pages 561–578. Springer, 2016.
  • [7] Iain D Couzin and Conor Heins. Emerging technologies for behavioral research in changing environments. Trends in Ecology & Evolution, 38(4):346–354, 2023.
  • [8] Alice J Edney and Matt J Wood. Applications of digital imaging and analysis in seabird monitoring and research. Ibis, 163(2):317–337, 2021.
  • [9] Kyle Hamish Elliott, Kerry Woo, Anthony J Gaston, Silvano Benvenuti, Luigi Dall’Antonia, and Gail K Davoren. Seabird foraging behaviour indicates prey type. Marine Ecology Progress Series, 354:289–303, 2008.
  • [10] Stuart Geman. Statistical methods for tomographic image reconstruction. Bulletin of International Statistical Institute, 4:5–21, 1987.
  • [11] Johannes Hägerlind. 3d-reconstruction of the common murre. Master’s thesis, Linköping University, 2023.
  • [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [13] Jonas Hentati-Sundberg, Agnes B Olin, Sheetal Reddy, Per-Arvid Berglund, Erik Svensson, Mareddy Reddy, Siddharta Kasarareni, Astrid A Carlsen, Matilda Hanes, Shreyash Kad, et al. Seabird surveillance: combining cctv and artificial intelligence for monitoring and research. Remote sensing in ecology and conservation, 9(4):568–581, 2023.
  • [14] Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7122–7131, 2018.
  • [15] Nikos Kolotouros, Georgios Pavlakos, Dinesh Jayaraman, and Kostas Daniilidis. Probabilistic modeling for human mesh recovery. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11605–11614, 2021.
  • [16] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: A skinned multi-person linear model. ACM transactions on graphics (TOG), 34(6):1–16, 2015.
  • [17] Alexander Mathis, Pranav Mamidanna, Kevin M. Cury, Taiga Abe, Venkatesh N. Murthy, Mackenzie W. Mathis, and Matthias Bethge. Deeplabcut: markerless pose estimation of user-defined body parts with deep learning. Nature Neuroscience, 2018.
  • [18] Pat Monaghan. Relevance of the behaviour of seabirds to the conservation of marine environments. Oikos, pages 227–237, 1996.
  • [19] Tanmay Nath, Alexander Mathis, An Chi Chen, Amir Patel, Matthias Bethge, and Mackenzie W Mathis. Using deeplabcut for 3d markerless pose estimation across species and behaviors. Nature Protocols, 2019.
  • [20] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
  • [21] Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. Accelerating 3d deep learning with pytorch3d. arXiv:2007.08501, 2020.
  • [22] Yu Sun, Qian Bao, Wu Liu, Tao Mei, and Michael J Black. Trace: 5d temporal regression of avatars with dynamic cameras in 3d environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8856–8866, 2023.
  • [23] Yating Tian, Hongwen Zhang, Yebin Liu, and Limin Wang. Recovering 3d human mesh from monocular images: A survey. IEEE transactions on pattern analysis and machine intelligence, 2023.
  • [24] Yufu Wang, Nikos Kolotouros, Kostas Daniilidis, and Marc Badger. Birds of a feather: capturing avian shape models from images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14739–14749, 2021.
  • [25] Wei Yao, Hongwen Zhang, Yunlian Sun, and Jinhui Tang. Staf: 3d human mesh recovery from video with spatio-temporal alignment fusion. arXiv preprint arXiv:2401.01730, 2024.
  • [26] Ye Yuan, Umar Iqbal, Pavlo Molchanov, Kris Kitani, and Jan Kautz. Glamr: Global occlusion-aware human mesh recovery with dynamic cameras. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11038–11049, 2022.
  • [27] Hongwen Zhang, Jie Cao, Guo Lu, Wanli Ouyang, and Zhenan Sun. Learning 3d human shape and pose from dense body parts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(5):2610–2627, 2020.