Temporally-consistent 3D Reconstruction of Seabirds

Johannes Hägerlind¹, Jonas Hentati-Sundberg², Bastian Wandt¹
¹Linköping University, Sweden
²Swedish University of Agricultural Sciences, Sweden
{johannes.hagerlind, bastian.wandt}@liu.se, jonas.sundberg@slu.se

Abstract

This paper deals with the 3D reconstruction of seabirds, which have recently come into the focus of environmental scientists as valuable bio-indicators of environmental change. Such 3D information is beneficial for analyzing a bird's behavior and physiological shape, for example by tracking motion, shape, and appearance changes. From a computer vision perspective, birds are especially challenging due to their rapid and oftentimes non-rigid motions. We propose an approach to reconstruct the 3D pose and shape from monocular videos of a specific species of seabird, the common murre. Our approach comprises a full pipeline of detection, tracking, segmentation, and temporally consistent 3D reconstruction. Additionally, we propose a temporal loss that extends current single-image 3D bird pose estimators to the temporal domain. Moreover, we provide a real-world dataset of 10000 frames of video observations that capture, on average, nine birds simultaneously, comprising a large variety of motions and interactions, including a smaller test set with bird-specific keypoint labels. Using our temporal optimization, we achieve state-of-the-art performance for the challenging sequences in our dataset (available at https://huggingface.co/datasets/seabirds/common_murre_temporal).

1 Introduction and Related Work

Studying the detailed behaviour of animals is a fundamental topic in biological, ecological and environmental conservation research [7]. Seabirds are a large and diverse group of animals with a high conservation value and are known for their potential to indicate changes in marine and terrestrial ecosystems [9, 18]. Behavioural studies of seabirds have a long history, and novel technologies such as cameras and computer vision have been increasingly used in applied research [8, 13].

An automated 3D reconstruction of seabirds from video sequences can offer detailed insights into behavior, physiology, and adaptability over time. In this paper, we present a novel approach aimed at reconstructing the 3D pose and shape of a specific species of seabird, namely the common murre (Uria aalge). Our method encompasses a multi-stage pipeline, including detection, tracking, segmentation, and temporally consistent 3D reconstruction.

Figure 1: The proposed pipeline. The pink box represents learning the 3D pose prior [24]; the blue boxes show the fitting of the parameterized model and the prediction of segmentation masks, inspired by [11]; the orange boxes show additional improvements made in the current work; and the green boxes show the integration of temporal information, which is the main contribution of this work.

Many methods investigate the use of parametric mesh models for the 3D reconstruction of humans, e.g. methods that build upon SMPL [16], such as [6, 14, 27, 15, 4, 26, 23, 22, 25]. For birds, Badger et al. [3] developed a 3D reconstruction method for cowbirds. Wang et al. [24] built on [3] and developed species-specific as well as multi-species shape models. Hägerlind [11] noted that the method of [24] was not sufficient to reconstruct the common murre from top-view images, which is the dominant view for the cliff-inhabiting common murre. Instead, [11] uses the pose prior and bone length prior of the cowbird model in [3] to fit keypoints annotated in a 3D scan. The resulting bone length and shape parameters are used as an initialization for a more information-rich side-view optimization that uses 2D images annotated with keypoints and masks as input. In the side-view optimization, [11] uses a method similar to that of [24] and moves the mean of the bone length and the shape parameters towards that of the common murre. Finally, the results from the side-view optimization are used to initialize the top-view optimization.

We build on top of the work by [11] by using the mesh parameters and optimization parameters and extending the single image-based approach to a temporal approach. To achieve this we introduce a motion consistency assumption. This temporal assumption is crucial for capturing the dynamic nature of seabird movements and ensuring the fidelity of reconstructed 3D poses over time. We also investigate the use of temporally consistent bone lengths. Additionally, to improve the keypoint detections, we investigate the use of a weighted median filter. Fig. 1 shows our full framework.

To facilitate further research and benchmarking efforts in this domain, we introduce a real-world dataset comprising video observations with 10K consecutive frames, created by researchers in the Baltic Seabird Project [1]. This dataset captures, on average, nine seabirds simultaneously engaged in a diverse array of behaviors, which lead to large pose changes, e.g. flapping their wings, and interactions with strong occlusions. We provide this dataset and a small test dataset containing keypoint labels for 7 birds in 100 consecutive frames at https://huggingface.co/datasets/seabirds/common_murre_temporal.

In summary, this paper presents a comprehensive framework for 3D reconstruction of seabirds from monocular videos, addressing the unique challenges posed by their behavior and movements. Through our proposed method and the accompanying dataset, we aim to advance the field of seabird research, providing valuable insights into their ecological significance and responses to environmental change.

2 Method

Fig. 1 shows all processing steps of our full approach. It consists of a detection and tracking stage, an offline 3D scan fitting and the temporal pose optimization.

2.1 Detection and Segmentation

We use the segmentation network provided by Álvarez Fernández Del Vallado [2]. The keypoint detector is trained using DeepLabCut [17, 19] by fine-tuning a ResNet-50 [12]. The training dataset consists of 500 images with 20 keypoints (2 more than in [11]). Since there are many frames where birds are close together, we follow [11] and consider each animal individually. First, the image is cropped using the bounding boxes obtained from the segmentation network. This is followed by masking all pixels that are not covered by the predicted segmentation masks. To compensate for possibly inaccurate segmentation masks, we pad the bounding box by 40 pixels in each direction and then dilate the original prediction using a square kernel of width 70, as in [11].

2.1.1 Weighted Median Filter

To filter occasional misdetections, a weighted median filter with a window size of 5 is applied to the detected 2D keypoints. The x and y coordinates are filtered separately. Within each window, the filtered coordinate is the value at which the cumulative sum of the keypoint confidences reaches its median, i.e. half of the total confidence in the window. This reduces the number of outliers in the keypoint detections.
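The filter described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the paper: the function names, the truncated windows at the sequence boundaries, and the fallback to a plain median when all confidences are zero are assumptions.

```python
import numpy as np

def weighted_median(values, weights):
    """Value at which the cumulative weight first reaches half of the total weight."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values, kind="stable")
    cum = np.cumsum(weights[order])
    if cum[-1] <= 0:
        # Assumption: without any confidence, fall back to the plain median.
        return float(np.median(values))
    idx = int(np.searchsorted(cum, 0.5 * cum[-1]))
    return float(values[order][idx])

def weighted_median_filter(track, conf, window=5):
    """Filter one keypoint track of shape (T, 2) using confidences of shape (T,)."""
    track = np.asarray(track, dtype=float)
    conf = np.asarray(conf, dtype=float)
    half = window // 2
    out = np.empty_like(track)
    for t in range(len(track)):
        # Assumption: windows are truncated at the start and end of the sequence.
        lo, hi = max(0, t - half), min(len(track), t + half + 1)
        for d in range(2):  # x and y are filtered separately
            out[t, d] = weighted_median(track[lo:hi, d], conf[lo:hi])
    return out
```

A low-confidence outlier inside the window then contributes little cumulative weight and is unlikely to be selected as the filtered coordinate.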

2.2 Tracking

The tight bounding boxes around the predicted segmentation mask are used as input to a tracker. Using the bounding box of the segmentation masks allows for a direct connection between the tracker and the segmentation mask (necessary for later steps). In case a segmentation mask is missed, there is a 5-frame memory that keeps track of the previous prediction. We track based on the highest IoU between bounding boxes in consecutive frames.
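A greedy IoU-based association with a short memory, as described above, might look like the following sketch. The function names, the minimum IoU threshold `min_iou`, and the exact interpretation of the 5-frame memory (a track is dropped once it has been unmatched for more than `memory` frames) are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def track_boxes(frames, min_iou=0.1, memory=5):
    """Assign a track id to every box; frames is a list of per-frame box lists."""
    tracks = {}  # track id -> (last matched box, consecutive missed frames)
    next_id = 0
    assignments = []
    for boxes in frames:
        # All (IoU, track, detection) candidates, best overlap first.
        candidates = sorted(
            ((iou(box, last), tid, j)
             for j, box in enumerate(boxes)
             for tid, (last, _) in tracks.items()),
            reverse=True,
        )
        ids = [None] * len(boxes)
        matched = set()
        for score, tid, j in candidates:
            if score < min_iou:  # hypothetical threshold, not from the paper
                break
            if tid in matched or ids[j] is not None:
                continue
            ids[j] = tid
            matched.add(tid)
        for j, box in enumerate(boxes):
            if ids[j] is None:  # unmatched detection starts a new track
                ids[j] = next_id
                next_id += 1
            tracks[ids[j]] = (box, 0)
        for tid in list(tracks):
            if tid not in ids:  # keep the previous box for up to `memory` frames
                last, missed = tracks[tid]
                if missed + 1 > memory:
                    del tracks[tid]
                else:
                    tracks[tid] = (last, missed + 1)
        assignments.append(ids)
    return assignments
```

Because each track id is tied to the bounding box of a segmentation mask, the id can be used directly to look up the mask in the later fitting steps.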

2.3 Fitting the 3D Model to the Image

We aim to fit a 3D bird model to the 2D keypoints and 2D masks. To allow for batch optimization, we pad and scale the keypoints and segmentation masks to a dimension of 256x256 pixels. The starting point is the common murre model from Hägerlind [11], adapted from [24]. The shape and pose of the reconstructed bird model are controlled by the translation ( $\kappa$ ), the scale ( $\sigma$ ), the global orientation ( $\theta_{g}$ ), and the body pose ( $\theta_{p}$ ) parameterized by joint angles. The scale parameter scales all the bones by a common factor. As in [11], we keep the depth fixed since the camera is looking from the top towards a flat surface. We keep the bone lengths constant since optimizing them was shown to reduce the perceptual quality in this setting (cf. [11]). The model $M$ is hence described by the function $M(\kappa,\sigma,\theta_{g},\theta_{p})$ . As initialization, we use the method in [11], where we rotate the 3D bird in top-view through 360° in 12° steps and select the rotation that best matches the predicted 2D keypoints from Sec. 2.1. We optimize the full parameter set $\kappa,\sigma,\theta_{g}$ , and $\theta_{p}$ .

Frame-wise objective. We minimize the frame-wise loss that achieved the best results in [11]:
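The rotation sweep used for initialization can be illustrated with a short sketch. It assumes that the model keypoints are already centred and scaled to the target, that the rotation is about the viewing axis, and that an orthographic top-view projection with a confidence-weighted squared error is an adequate stand-in for the matching criterion; none of these details are specified in the text, and the function names are hypothetical.

```python
import numpy as np

def rotation_z(deg):
    """Rotation matrix about the viewing (z) axis."""
    a = np.deg2rad(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def best_initial_rotation(model_kpts, target_2d, conf, step_deg=12):
    """Try 360 / step_deg rotations and return the best angle and its error."""
    best_angle, best_err = None, np.inf
    for deg in range(0, 360, step_deg):
        rotated = model_kpts @ rotation_z(deg).T
        proj = rotated[:, :2]  # assumption: orthographic top-view projection
        err = float(np.sum(conf * np.sum((proj - target_2d) ** 2, axis=1)))
        if err < best_err:
            best_angle, best_err = deg, err
    return best_angle, best_err
```

With 12° steps this evaluates 30 candidate orientations per bird, which is cheap compared to the subsequent optimization.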

$$E_{start}(\Theta)=\lambda_{kpt}E_{kpt}+\lambda_{msk}E_{msk}+\lambda_{pp}E_{pp}, \tag{1}$$

where $E_{kpt}$ is a keypoint reprojection error, $E_{msk}$ is a mask error, and $E_{pp}$ is a pose prior. We set $\lambda_{msk}=1$ , $\lambda_{kpt}=1$ , and $\lambda_{pp}=100$ . The keypoint loss, mask loss, and pose prior loss are calculated similarly to [24]. The keypoint loss is an instance of the Geman-McClure error function (cf. [10]) given by

$$E_{kpt}=\sum_{i=1}^{N}c_{i}\frac{\sigma^{2}(\Pi(m_{i})-p_{i})^{2}}{\sigma^{2}+(\Pi(m_{i})-p_{i})^{2}}, \tag{2}$$

where $N$ is the number of keypoints, $\Pi(m_{i})$ is a projected keypoint from the mesh (using a simple perspective camera without any distortion) and $p_{i}$ is the corresponding target keypoint. $c_{i}$ is the confidence assigned to a keypoint prediction. As in previous work [24, 11] we use $\sigma=50$ (in Eq. 2, $\sigma$ is the robustness parameter of the error function, not the scale parameter of the model). The mask loss $E_{msk}$ is calculated as the L1 distance between the predicted mask and the soft mask (silhouette) rendered with the PyTorch3D soft rasterizer [21]. The pose prior loss is calculated using the squared Mahalanobis distance as in [5]:

$$E_{pp}=(\mathbf{x}-\bm{\mu})^{T}\mathbf{\Sigma}^{-1}(\mathbf{x}-\bm{\mu}), \tag{3}$$

where the mean $\bm{\mu}$ is taken from [11], and the covariance $\mathbf{\Sigma}$ is taken from [3] (from the cowbird species).
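The keypoint loss (Eq. 2) and the pose prior (Eq. 3) can be written compactly in NumPy. The sketch below is illustrative only: it interprets $(\Pi(m_{i})-p_{i})^{2}$ as the squared Euclidean distance between the projected and target keypoints, which is an assumption, and the function names are hypothetical.

```python
import numpy as np

def keypoint_loss(projected, target, conf, sigma=50.0):
    """Confidence-weighted Geman-McClure penalty (Eq. 2) for (N, 2) keypoints."""
    # Assumption: the squared residual is the squared Euclidean distance.
    sq = np.sum((projected - target) ** 2, axis=1)
    return float(np.sum(conf * (sigma ** 2 * sq) / (sigma ** 2 + sq)))

def pose_prior_loss(x, mu, cov):
    """Squared Mahalanobis distance of pose vector x to the prior (Eq. 3)."""
    d = x - mu
    # Solving the linear system avoids forming the explicit inverse covariance.
    return float(d @ np.linalg.solve(cov, d))
```

For small residuals the keypoint penalty behaves like a squared error, while for large residuals each keypoint contributes at most $c_{i}\sigma^{2}$, which limits the influence of outlier detections.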

Temporal objective. Since our goal is to achieve temporal consistency in a sequence of poses, we introduce two additional regularization terms for the velocity and the acceleration.

The first regularizer aims to decrease the difference between consecutive 3D poses

$$E_{vel}=\sum_{k\in\{g,p\}}\beta_{k}\sum_{i=1}^{N}{\lVert\theta_{k,i+1}-\theta_{k,i}\rVert_{2}}. \tag{4}$$

While regularizing the velocity already significantly smooths the motion, some jitter remains. To reduce it, we introduce another acceleration-based term:

$$E_{acc}=\sum_{k\in\{g,p\}}\beta_{k}\sum_{j=2}^{N}{\lVert\theta^{\prime}_{k,j+1}-\theta^{\prime}_{k,j}\rVert_{2}}. \tag{5}$$

$\theta^{\prime}$ denotes velocity. The global orientation $\theta_{g}$ and body pose $\theta_{p}$ have separate weights: $\beta_{g}=10$ for global orientation and $\beta_{p}=1$ for body pose. This is based on the assumption that movements in the joints are likely to be faster than global orientation changes.

The combined objective function is

$$E=E_{start}+\lambda_{vel}E_{vel}+\lambda_{acc}E_{acc}. \tag{6}$$
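As an illustration of Eqs. 4 to 6, the temporal terms can be computed with finite differences over a window of poses. The sketch assumes that the pose parameters of each frame are stored as one flattened vector per group (global orientation and body pose) and that the norm is taken over that whole vector; the function names and the default weights in the signature are only example values.

```python
import numpy as np

def velocity_loss(theta):
    """Sum of L2 norms of frame-to-frame differences for poses of shape (T, D)."""
    vel = np.diff(theta, axis=0)
    return float(np.sum(np.linalg.norm(vel, axis=1)))

def acceleration_loss(theta):
    """Sum of L2 norms of differences between consecutive velocities."""
    acc = np.diff(np.diff(theta, axis=0), axis=0)
    return float(np.sum(np.linalg.norm(acc, axis=1)))

def temporal_objective(e_start, theta_g, theta_p, lam_vel=100.0, lam_acc=100.0,
                       beta_g=10.0, beta_p=1.0):
    """Combine the frame-wise loss with the weighted temporal regularizers (Eq. 6)."""
    e_vel = beta_g * velocity_loss(theta_g) + beta_p * velocity_loss(theta_p)
    e_acc = beta_g * acceleration_loss(theta_g) + beta_p * acceleration_loss(theta_p)
    return e_start + lam_vel * e_vel + lam_acc * e_acc
```

A motion with constant velocity is penalized by the velocity term but not by the acceleration term, so the acceleration term specifically targets jitter rather than steady movement.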

Common size constraint. Although a bird can vary in shape, the bone length should remain constant during a reasonable time frame. In some experiments, we enforce this by optimizing a single scale for all bones during the full temporal window.

Optimization. There are two steps in the mesh optimization, excluding initialization. The first step uses the objective in Eq. 6 and the second step adds a mask loss. The first step uses 600 iterations and the second step uses 400 iterations. We use the Adam optimizer [20] and a learning rate of 0.01.

3 Experiments

3.1 Dataset

The common murre is a particularly interesting seabird as an indicator of environmental change since it heavily interacts with the environment by catching fish in the ocean. Moreover, it is relatively easy to observe since it breeds on cliffs that can be equipped with surveillance cameras. Researchers in the Baltic Seabird Project [1] have created a dataset comprising 10K consecutive frames capturing common murres on a cliff ledge during the main breeding season. The resolution is $2592\times 1520$ px and the frame rate is 25 frames per second. On average there are nine birds in the camera view. We identify several different behaviors: standing, walking, flying away, approaching, preening, flapping wings, and attacking other birds. The dataset shows many challenging poses, such as bending the neck backward, as well as non-rigid deformations, mainly of the neck. Additionally, interactions between individual birds lead to strong occlusions, posing an additional challenge for tracking and reconstruction. In addition to the video sequences, we provide temporally consistent 2D keypoint labels for 100 images for 7 out of 9 birds for testing purposes. While we target accurate and time-consistent 3D reconstruction, this dataset also enables further behavioral studies for the computer vision community.

3.2 Metrics

Since there is no available 3D data for evaluation, we project the mesh keypoints to 2D and compare them with the ground-truth keypoint labels. The metrics $me_{p}$ and $me_{v}$ measure the RMS error of the projected mesh keypoint positions and velocities, respectively. To enable comparison across different scales, each error is divided by the longest side of the bounding box of the predicted segmentation mask.
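One possible reading of these metrics is sketched below. The details are assumptions: the velocity is taken as the finite difference between consecutive frames, the RMS is computed jointly over all frames and keypoints, and the velocity error of a frame pair is normalized by the bounding box of the later frame. The function names are hypothetical.

```python
import numpy as np

def normalized_rms(pred, gt, bbox_sizes):
    """RMS keypoint error, normalized per frame by the longest bounding-box side.

    pred, gt: arrays of shape (T, K, 2); bbox_sizes: (T, 2) as (width, height).
    """
    scale = np.max(bbox_sizes, axis=1)[:, None]        # (T, 1)
    dist = np.linalg.norm(pred - gt, axis=2) / scale   # (T, K)
    return float(np.sqrt(np.mean(dist ** 2)))

def position_and_velocity_error(pred, gt, bbox_sizes):
    """Return (me_p, me_v) for one tracked bird."""
    me_p = normalized_rms(pred, gt, bbox_sizes)
    # Assumption: velocities are finite differences between consecutive frames.
    me_v = normalized_rms(np.diff(pred, axis=0), np.diff(gt, axis=0), bbox_sizes[1:])
    return me_p, me_v
```

Under this reading, a reconstruction with a constant positional offset has a non-zero $me_{p}$ but a zero $me_{v}$, which is why the two metrics capture different aspects of the result.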

3.3 Experiments

We conduct a series of experiments in different settings, evaluating all parts of our proposed pipeline. Fig. 2 shows an example of a 3D reconstruction from our approach. Note that we only show reconstructions for a subset of all the birds in the image. This is due to limitations in the segmentation network, which only provides trackable regions for the visualized birds. A single-image reconstruction for the remaining frames is conceivable, but here we focus on the results of our tracker in combination with our temporal optimization. The supplementary material contains additional videos showing reconstructions on the test set using different parameter settings.

3.4 Quantitative Results

In total, 66 experiments were conducted. Temporal window sizes of 1 (no temporal optimization) and 100 are investigated. For the window size of 1, the use of a median filter for the input 2D joints is investigated. The setting with window size 1 and no median filter corresponds to the setting used by [11].

For the temporal window size of 100, the following cases are investigated:

• $\lambda_{vel}\in\{10^{2},10^{3},10^{4},10^{5}\}$
• Use acceleration loss (True/False). If True, $\lambda_{acc}=\lambda_{vel}$ ; if False, $\lambda_{acc}=0$ .
• Use weighted median filter for the predicted keypoints (True/False).
• Optimize a common size in the temporal window (True/False).

	$\lambda_{vel}$	acc	med	size	$me_{p}\downarrow$	$me_{v}\downarrow$
baseline	0	-	False	-	0.0824	0.0322
Ours	$10^{5}$	False	True	True	0.1647	0.0164
	$10^{4}$	False	True	False	0.1099	0.0142
	1000	True	False	True	0.0872	0.0196
	1000	True	False	False	0.0845	0.0174
	1000	True	True	True	0.0824	0.0172
	1000	False	False	False	0.0823	0.0188
	1000	False	True	True	0.0823	0.0179
	1000	False	False	True	0.0820	0.0187
	1000	True	True	False	0.0817	0.0154
	1000	False	True	False	0.0816	0.0178
	0	-	True	-	0.0809	0.0262
	100	False	True	False	0.0804	0.0182
	100	False	True	True	0.0804	0.0185
	100	False	False	False	0.0803	0.0223
	100	False	False	True	0.0803	0.0228
	100	True	False	False	0.0793	0.0196
	100	True	False	True	0.0791	0.0196
	100	True	True	False	0.0758	0.0149
Ours (best)	100	True	True	True	0.0756	0.0150

Table 1: Evaluation on our test set, sorted by $me_{p}$ in descending order (worst to best). The first row shows the baseline. The following abbreviations are used: acc for the acceleration loss, med for the median filter, size for optimizing a common size in the temporal window, $me_{p}$ for mean error of the keypoint positions, and $me_{v}$ for mean error of the keypoint velocities. Each individual contribution improves the performance.

Table 1 shows the evaluation results. The best $me_{p}$ is achieved using a window size of 100, a temporal weight of $\lambda_{vel}=100$ , the weighted median filter, a common size in the window, and the acceleration loss. Each individual component improves the performance. Looking at the top-8 $me_{p}$ values, we see that using the weighted median filter in conjunction with the acceleration loss results in a lower $me_{p}$ . The best $me_{p}$ for a window size of 100 is $6.6\%$ lower than the best result for a window size of 1, validating the superior performance of our temporal approach compared to single-frame methods.
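The relative improvement can be verified directly from Table 1, using $me_{p}=0.0809$ for the best setting with window size 1 (median filter, no temporal loss) and $me_{p}=0.0756$ for the best temporal setting. The second line applies the same calculation to the baseline row ($me_{p}=0.0824$); this comparison is derived here from the table values and is not stated in the text.

```latex
\frac{0.0809 - 0.0756}{0.0809} = \frac{0.0053}{0.0809} \approx 0.0655 \approx 6.6\%
\qquad
\frac{0.0824 - 0.0756}{0.0824} = \frac{0.0068}{0.0824} \approx 0.0825 \approx 8.3\%
```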

Comparing $\lambda_{vel}=100$ and $\lambda_{vel}=1000$ we see that the former results in a better $me_{p}$ . The first two rows after the baseline in Table 1 show the best results for $\lambda_{vel}=10^{5}$ and $\lambda_{vel}=10^{4}$ , respectively. Using either $\lambda_{vel}=10^{4}$ or $\lambda_{vel}=10^{5}$ greatly increases the $me_{p}$ (i.e. worsens the result).

Using $\lambda_{vel}=0$ , i.e. a window size of 1, produces many spurious high-frequency motions of the body pose and global orientation that are not present in the video. Furthermore, the scale of the bird changes in an unnatural way.

Using $\lambda_{vel}=100$ removes many, but not all, of these spurious high-frequency motions, and $\lambda_{vel}=1000$ reduces them further.

A large weight for the temporal regularizer, $\lambda_{vel}\in\{10^{4},10^{5}\}$ , fails to capture the quick motions of the birds, and $\lambda_{vel}=10^{5}$ even has a severe negative impact on the size, even when optimizing a single size in the window (see videos in the supplementary material).

4 Conclusion

This pilot study investigates how different temporal assumptions can be used to improve the 3D reconstruction of the common murre captured by monocular cameras. We showed that our temporal regularizer, including the acceleration, leads to a significantly improved performance when used together with our weighted median filter, which improves the 2D keypoint prediction. Additionally, the temporal loss helps to enforce more physically plausible motions. Moreover, optimizing for a single scale during the whole sequence is another way to enforce temporal coherence and further improves the reconstruction. Since we build upon [24] our method still fails for extreme pose changes.

We will deal with such strong deformations in future work.

5 Acknowledgments

This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation, Sweden.

The computations were enabled by the Berzelius resource provided by the Knut and Alice Wallenberg Foundation at the National Supercomputer Centre.

References

[1] Baltic seabird project. http://www.balticseabird.com/. Accessed: 2024-05-23.
[2] Juan Álvarez Fernández Del Vallado. Alternative solution to catastrophical forgetting on fewshot instance segmentation, 2021.
[3] Marc Badger, Yufu Wang, Adarsh Modh, Ammon Perkes, Nikos Kolotouros, Bernd G Pfrommer, Marc F Schmidt, and Kostas Daniilidis. 3d bird reconstruction: a dataset, model, and shape recovery from a single view. In European Conference on Computer Vision, pages 1–17. Springer, 2020.
[4] Fabien Baradel, Thibault Groueix, Philippe Weinzaepfel, Romain Brégier, Yannis Kalantidis, and Grégory Rogez. Leveraging mocap data for human mesh recovery. In 2021 International Conference on 3D Vision (3DV), pages 586–595. IEEE, 2021.
[5] Christopher M Bishop. Pattern recognition and machine learning. Springer Science+Business Media, LLC, 2006.
[6] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In European conference on computer vision, pages 561–578. Springer, 2016.
[7] Iain D Couzin and Conor Heins. Emerging technologies for behavioral research in changing environments. Trends in Ecology & Evolution, 38(4):346–354, 2023.
[8] Alice J Edney and Matt J Wood. Applications of digital imaging and analysis in seabird monitoring and research. Ibis, 163(2):317–337, 2021.
[9] Kyle Hamish Elliott, Kerry Woo, Anthony J Gaston, Silvano Benvenuti, Luigi Dall’Antonia, and Gail K Davoren. Seabird foraging behaviour indicates prey type. Marine Ecology Progress Series, 354:289–303, 2008.
[10] Stuart Geman. Statistical methods for tomographic image reconstruction. Bulletin of International Statistical Institute, 4:5–21, 1987.
[11] Johannes Hägerlind. 3d-reconstruction of the common murre. Master’s thesis, Linköping University, 2023.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[13] Jonas Hentati-Sundberg, Agnes B Olin, Sheetal Reddy, Per-Arvid Berglund, Erik Svensson, Mareddy Reddy, Siddharta Kasarareni, Astrid A Carlsen, Matilda Hanes, Shreyash Kad, et al. Seabird surveillance: combining cctv and artificial intelligence for monitoring and research. Remote sensing in ecology and conservation, 9(4):568–581, 2023.
[14] Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7122–7131, 2018.
[15] Nikos Kolotouros, Georgios Pavlakos, Dinesh Jayaraman, and Kostas Daniilidis. Probabilistic modeling for human mesh recovery. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11605–11614, 2021.
[16] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: A skinned multi-person linear model. ACM transactions on graphics (TOG), 34(6):1–16, 2015.
[17] Alexander Mathis, Pranav Mamidanna, Kevin M. Cury, Taiga Abe, Venkatesh N. Murthy, Mackenzie W. Mathis, and Matthias Bethge. Deeplabcut: markerless pose estimation of user-defined body parts with deep learning. Nature Neuroscience, 2018.
[18] Pat Monaghan. Relevance of the behaviour of seabirds to the conservation of marine environments. Oikos, pages 227–237, 1996.
[19] Tanmay Nath*, Alexander Mathis*, An Chi Chen, Amir Patel, Matthias Bethge, and Mackenzie W Mathis. Using deeplabcut for 3d markerless pose estimation across species and behaviors. Nature Protocols, 2019.
[20] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[21] Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. Accelerating 3d deep learning with pytorch3d. arXiv:2007.08501, 2020.
[22] Yu Sun, Qian Bao, Wu Liu, Tao Mei, and Michael J Black. Trace: 5d temporal regression of avatars with dynamic cameras in 3d environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8856–8866, 2023.
[23] Yating Tian, Hongwen Zhang, Yebin Liu, and Limin Wang. Recovering 3d human mesh from monocular images: A survey. IEEE transactions on pattern analysis and machine intelligence, 2023.
[24] Yufu Wang, Nikos Kolotouros, Kostas Daniilidis, and Marc Badger. Birds of a feather: capturing avian shape models from images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14739–14749, 2021.
[25] Wei Yao, Hongwen Zhang, Yunlian Sun, and Jinhui Tang. Staf: 3d human mesh recovery from video with spatio-temporal alignment fusion. arXiv preprint arXiv:2401.01730, 2024.
[26] Ye Yuan, Umar Iqbal, Pavlo Molchanov, Kris Kitani, and Jan Kautz. Glamr: Global occlusion-aware human mesh recovery with dynamic cameras. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11038–11049, 2022.
[27] Hongwen Zhang, Jie Cao, Guo Lu, Wanli Ouyang, and Zhenan Sun. Learning 3d human shape and pose from dense body parts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(5):2610–2627, 2020.