Future Does Matter: Boosting 3D Object Detection with Temporal Motion Estimation in Point Cloud Sequences
Rui Yu (1), y80220166@mail.ecust.edu.cn
Runkai Zhao (2), rzha9419@uni.sydney.edu.au
Cong Nie (3), 2132586@tongji.edu.cn
Heng Wang (2), hwan9147@uni.sydney.edu.au
HuaiCheng Yan (1), hcyan@ecust.edu.cn
Meng Wang (1), mengwagn@ecust.edu.cn

(1) East China University of Science and Technology, Shanghai, China
(2) University of Sydney, Sydney, Australia
(3) Tongji University, Shanghai, China

Abstract

Accurate and robust LiDAR 3D object detection is essential for comprehensive scene understanding in autonomous driving. Despite its importance, LiDAR detection performance is limited by inherent constraints of point cloud data, particularly at long distances and under occlusion. Recently, temporal aggregation has been shown to significantly enhance detection accuracy by fusing multi-frame viewpoint information and enriching the spatial representation of objects. In this work, we introduce a novel LiDAR 3D object detection framework, namely LiSTM, to facilitate spatial-temporal feature learning with cross-frame motion forecasting information. We aim to improve the spatial-temporal interpretation capabilities of the LiDAR detector by incorporating a dynamic prior generated from a non-learnable motion estimation model. Specifically, Motion-Guided Feature Aggregation (MGFA) is proposed to use the object trajectory from previous and future motion states to model spatial-temporal correlations as a Gaussian heatmap over a driving sequence. This motion-based heatmap then guides the temporal feature fusion, enriching the proposed object features. Moreover, we design a Dual Correlation Weighting Module (DCWM) that facilitates the interaction between past and prospective frames through scene- and channel-wise feature abstraction. Finally, a cascade cross-attention-based decoder is employed to refine the 3D prediction. We have conducted experiments on the Waymo and nuScenes datasets to demonstrate that the proposed framework achieves superior 3D detection performance with effective spatial-temporal feature learning. Code is available at https://github.com/YuRui-Learning/LiSTM

Figure 1: Unlike the global bird's-eye-view (BEV) neighbor feature fusion method (a) and the trajectory-based method (b), which do not account for the role of future states, we propose a novel LiDAR 3D object detection framework that uses motion forecasting to guide temporal fusion learning across past and future frames, as shown in (c).

1 Introduction

3D LiDAR object detectors [Lang et al.(2019)Lang, Vora, Caesar, Zhou, Yang, and Beijbom, Zhou and Tuzel(2018), Yin et al.(2021)Yin, Zhou, and Krahenbuhl] play an important role in autonomous driving: they identify objects within a 3D road scene represented by an unstructured point cloud. Although discrete LiDAR points reflect the accurate spatial positioning of the surrounding driving scene, they are insufficient to comprehensively describe traffic objects due to data sparsity, particularly at far distances. Moreover, the LiDAR sensor captures only a partial view of a scene from a single-frame perspective, leading to incomplete information about the visible objects. These inherent limitations of LiDAR result in an inconsistent point distribution for the same object across a driving sequence. Hence, a dynamic object may be represented with varying point densities in different frames, which makes it ambiguous for a 3D detector to determine the object's true shape.

To eliminate this inconsistency, a growing body of work [Zhou et al.(2022)Zhou, Zhao, Wang, Wang, and Foroosh, Calvo et al.(2023)Calvo, Taveira, Kahl, Gustafsson, Larsson, and Tonderski, Yin et al.(2020)Yin, Shen, Guan, Zhou, and Yang] attempts to detect 3D objects using multiple frames of point clouds. The LiDAR sensor records driving scenarios as the vehicle moves, delineating objects from multiple perspectives in sequence. This adds valuable complementary information, enriching the object representation. A straightforward way to implement this idea is to fuse the features of neighboring frames, using the insight of historical frames to enhance the semantic representation of the current scene. Drawing on the use of transformers in computer vision, the cross-attention mechanism bridges the previous and current point features either densely or sparsely, as depicted in Figure 1(a).

Direct integration of features from historical frames enhances detection performance, but this method struggles to handle fast-moving objects. To address this issue, trajectory-based methods [Chen et al.(2022)Chen, Shi, Zhu, Cheung, Xu, and Li, He et al.(2023)He, Li, Zhang, Li, and Zhang] are designed to aggregate extensive temporal contexts of the object flows and use multi-frame proposals to comprehend the spatial information across driving scenes. As shown in Figure 1(b), this method enhances the representation of the object by incorporating multi-view complementary information from the corresponding trajectory. However, this input-level manipulation is resource-intensive, limiting detection efficiency.

To boost temporal object detection, we propose a novel LiDAR 3D object detection framework that enhances Spatial-Temporal feature fusion through Motion estimation, namely LiSTM. We bolster spatial-temporal feature fusion by integrating a Kalman filter module [Kim et al.(2021)Kim, Ošep, and Leal-Taixé] as prior kinetic information and focus on effectively integrating both ego and object motion states. Unlike previous approaches [Chen et al.(2022)Chen, Shi, Zhu, Cheung, Xu, and Li, He et al.(2023)He, Li, Zhang, Li, and Zhang] that directly encode proposal trajectories with point clouds, we uncover an implicit feature representation for both trajectories and point clouds within the BEV space using the motion-based heatmap generator. This enables direct feature-level fusion and removes the reliance on a PointNet [Qi et al.(2017)Qi, Su, Mo, and Guibas] backbone. To obtain a stronger dynamic prior for each frame, we design the Motion-Guided Feature Aggregation (MGFA) mechanism, which uses the heatmap generated by trajectory prediction to guide the reconstruction of LiDAR features. Finally, by integrating the Dual Correlation Weighting Module (DCWM) and the Motion Transformer, we enhance feature characterization across frames, thereby enriching the semantic and geometric representations.

The main contributions of this paper can be summarized as follows:

  • We propose a novel LiDAR object detector considering future motion estimation of objects and point clouds to enhance the effectiveness of the spatial-temporal fusion.

  • We design a Motion-Guided Feature Aggregation (MGFA) mechanism to enhance object geometric representations of motions, and the Dual Correlation Weighting Module (DCWM) to characterize the spatial relationship of features across sequences.

  • We conduct experiments on the nuScenes and Waymo datasets to validate our proposed framework, which outperforms CenterPoint by 8% on the Waymo dataset.

2 Related Work

BEV 3D Object Detection. The bird's-eye view (BEV) is a widely used feature representation in the field of autonomous driving, derived from LiDAR's ability to perceive objects from a circular viewpoint. Thanks to the PointNet series [Qi et al.(2017)Qi, Su, Mo, and Guibas], point-based methods [Shi et al.(2019a)Shi, Wang, and Li] have become extensively employed to extract geometric features directly from point clouds. Voxel-based methods [Zhou and Tuzel(2018), Chen et al.(2023)Chen, Liu, Zhang, Qi, and Jia, Zhao et al.(2024)Zhao, Heng, Wang, Gao, Liu, Yao, Chen, and Cai] and pillar-based methods [Lang et al.(2019)Lang, Vora, Caesar, Zhou, Yang, and Beijbom, Shi et al.(2022)Shi, Li, and Ma] are mainly applied in environmental perception by converting the point cloud into a BEV feature. Meanwhile, camera-based detectors [Philion and Fidler(2020), Li et al.(2023b)Li, Ge, Yu, Yang, Wang, Shi, Sun, and Li] learn pixel-wise categorical depth distributions to lift 2D images of different views into BEV space. Additionally, Li et al. [Li et al.(2022)Li, Wang, Li, Xie, Sima, Lu, Qiao, and Dai] propose a spatiotemporal transformer and focus on feature fusion in the spatial-temporal 4D working space.

Keypoint Detection. Anchor-based methods [Tian et al.(2019)Tian, Yang, Wang, Wang, Li, and Liang, Liu et al.(2016)Liu, Anguelov, Erhan, Szegedy, Reed, Fu, and Berg] often produce redundant bounding boxes, requiring the use of Non-Maximum Suppression. Law and Deng [Law and Deng(2018)] detect each object as a pair of corner keypoints, while Zhou et al. [Zhou et al.(2019)Zhou, Wang, and Krähenbühl] use keypoint estimation with a normal distribution to locate center points and then use the central region to regress other properties. Following this idea, CenterPoint [Yin et al.(2021)Yin, Zhou, and Krahenbuhl] adopts the structure of CenterNet [Zhou et al.(2019)Zhou, Wang, and Krähenbühl] and builds an object detector in BEV space. Zhou et al. [Zhou et al.(2022)Zhou, Zhao, Wang, Wang, and Foroosh] use the initial query embedding to facilitate learning of the transformer and use cross-attention to efficiently aggregate neighboring features.

Temporal Fusion Methodology. Temporal fusion plays a critical role in autonomous driving, allowing models to gain a deeper understanding of contextual geometric information. Zhou et al. [Zhou et al.(2022)Zhou, Zhao, Wang, Wang, and Foroosh] perform multi-frame feature fusion using spatial-aware attention, while RNN-based models [Calvo et al.(2023)Calvo, Taveira, Kahl, Gustafsson, Larsson, and Tonderski, Yin et al.(2020)Yin, Shen, Guan, Zhou, and Yang] employ LSTM and GRU to fuse previous state features with the current feature. BEVFormer [Li et al.(2022)Li, Wang, Li, Xie, Sima, Lu, Qiao, and Dai] designs a temporal deformable attention to fuse previous features for enhanced performance. Meanwhile, Wang et al. [Wang et al.(2023)Wang, Liu, Wang, Li, and Zhang] develop an object-centric temporal mechanism and a motion-aware layer normalization to model the movement of objects. 3D-MAN [Yang et al.(2021)Yang, Zhou, Chen, and Ngiam] uses a multi-frame alignment and aggregation module to learn temporal attention for detection from multiple frames. Motion-based models [Chen et al.(2022)Chen, Shi, Zhu, Cheung, Xu, and Li, He et al.(2023)He, Li, Zhang, Li, and Zhang, Huang et al.(2024)Huang, Lyu, Yang, and Tsai] design a point-trajectory transformer with long short-term memory for efficient temporal 3D object detection. Li et al. [Li et al.(2023a)Li, Qi, Zhou, Liu, and Anguelov] use motion forecasting outputs as a type of virtual lightweight sensor modality. Building on these ideas, we propose a more powerful and efficient spatial-temporal fusion model under BEV, using CenterPoint [Yin et al.(2021)Yin, Zhou, and Krahenbuhl] as the baseline.

Figure 2: Overview of our proposed framework LiSTM. It processes multi-frame point clouds by performing voxelization before feeding them into the LiDAR BEV encoder. The first module employs a single-stage detector combined with tracking prediction to produce trajectories and then enhances the spatial representation with a Motion-Guided Feature Aggregation Module. The second module is used for cross-frame feature extraction by the proposed Dual Correlation Weighting Module and Motion Transformer.

3 Approach

As depicted in Figure 2, to incorporate the motion prior, we focus in Section 3.1 on the generation of the motion feature and the Motion-Guided Feature Aggregation (MGFA) mechanism. Then, the Dual Correlation Weighting Module (DCWM) and the Motion Transformer will be presented in Sections 3.2 and 3.3 to describe the cross-frame fusion strategy.

3.1 Motion-Guided Feature Aggregation

Unlike the early fusion methods [Chen et al.(2022)Chen, Shi, Zhu, Cheung, Xu, and Li, He et al.(2023)He, Li, Zhang, Li, and Zhang], we use a motion-based heatmap that represents temporal streams to normalize object features for deep fusion. To predict object positions in future scenes, we use a kinematic model of ego-motion to derive the transformation matrix from time $t$ to $t+n$, based on prior motion data and ego-pose observations. The transformation matrix is then used to transfer the point cloud of the current scene into the future coordinate frame; because it models only ego-motion, it is valid only for static objects and yields a coarse-grained prediction. Both predicted and observed points are processed through voxelization and the encoder to produce features $F_{multi}=\{F_{t-n},...,F_{t+n}\}$. Then, as implemented in CenterPoint [Yin et al.(2021)Yin, Zhou, and Krahenbuhl], we obtain multi-frame proposals, which are temporally independent but geometrically correlated.
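The ego-motion warping step can be illustrated with a minimal NumPy sketch. It assumes planar ego-motion described by a displacement and a yaw change; the helper names `ego_delta_matrix` and `warp_static_points` are hypothetical and not taken from the released code.

```python
import numpy as np

def ego_delta_matrix(dx, dy, yaw):
    """Hypothetical helper: pose of the ego frame at t+n expressed in the frame at t
    (planar motion only, which is an assumption of this sketch)."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[0, 0], T[0, 1] = c, -s
    T[1, 0], T[1, 1] = s, c
    T[0, 3], T[1, 3] = dx, dy
    return T

def warp_static_points(points_xyz, ego_delta):
    """Express (N, 3) points observed at t in the ego frame at t+n.
    Only correct for static points, since object motion is ignored."""
    T = np.linalg.inv(ego_delta)
    homo = np.hstack([points_xyz, np.ones((len(points_xyz), 1))])  # (N, 4)
    return (homo @ T.T)[:, :3]

# Ego drives 2 m forward along x with no rotation.
points_t = np.array([[10.0, 0.0, 0.0], [5.0, 2.0, 1.0]])
warped = warp_static_points(points_t, ego_delta_matrix(2.0, 0.0, 0.0))
```

In this example a static point 10 m ahead ends up 8 m ahead after the ego vehicle advances 2 m, which is why the warped cloud is only a coarse prediction for moving objects.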

Motion Model. After acquiring multiple consecutive frames of object proposals, we use a Kalman filter [Kim et al.(2021)Kim, Ošep, and Leal-Taixé] to estimate the motion state of each object across the frames. We define a ten-dimensional state space $(x,y,z,\theta,l,w,h,\dot{x},\dot{y},\dot{z})$, where $B=(x,y,z)$ is the center of a 3D bounding box, $P_{dim}=(l,w,h)$ is the object size, $\theta$ is the orientation under BEV, and $V=(\dot{x},\dot{y},\dot{z})$ are the respective velocities in 3D space, estimated by a Kalman filter with a constant-velocity motion model and a linear observation model.

Trajectory Prediction. With the Kalman filter modeling multiple targets over a driving sequence, we obtain the velocity prediction $V^{t}$ of each proposal at every moment. For the forward trajectory prediction, we use the bounding box observation $B_{t-1}$ at $t-1$, along with the velocity prediction $V^{t-1}$, to predict the box $B'_{t}$ for frame $t$. Similarly, for the reverse trajectory prediction, we use the bounding box observation $B_{t+1}$ at $t+1$ and the updated velocity prediction $V^{t+1}$ to reverse-predict the box $B''_{t}$:

$$B'_{t} = B_{t-1} + V^{t-1} \cdot \Delta t, \tag{1}$$
$$B''_{t} = B_{t+1} - V^{t+1} \cdot \Delta t. \tag{2}$$
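Equations (1) and (2) amount to a one-step linear extrapolation in each direction. A minimal sketch, with illustrative numbers and hypothetical function names:

```python
import numpy as np

def forward_predict(center_prev, velocity_prev, dt):
    """Eq. (1): predict the box center at frame t from the observation at t-1."""
    return np.asarray(center_prev, dtype=float) + np.asarray(velocity_prev, dtype=float) * dt

def backward_predict(center_next, velocity_next, dt):
    """Eq. (2): predict the box center at frame t from the observation at t+1."""
    return np.asarray(center_next, dtype=float) - np.asarray(velocity_next, dtype=float) * dt

dt = 0.1  # assumed frame interval in seconds
b_fwd = forward_predict([10.0, 2.0, 0.5], [5.0, 0.0, 0.0], dt)
b_bwd = backward_predict([11.0, 2.0, 0.5], [5.0, 0.0, 0.0], dt)
```

For an object moving at a constant 5 m/s, both directions land on the same center at frame $t$; they diverge when the velocity estimate changes between frames, which gives the two predictions complementary information.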

Motion-based Heatmap Generator. After acquiring the forward and backward trajectory predictions $B'_{t}$ and $B''_{t}$, we convert these trajectories into the motion feature $F_{motion}$ using a Gaussian distribution whose mean is set to the predicted proposal center:

$$\mu_{k}^{x} = cx_{k}, \quad \mu_{k}^{y} = cy_{k}, \tag{3}$$

where $\mu_{k}=(\mu_{k}^{x},\mu_{k}^{y})$ represents the location of the proposal under BEV, and $\sigma_{k}$ is a hyperparameter determined by the category of the $k^{th}$ object.

For the normal representations $N_{t-1}^{t}(\mu_{k},\sigma_{k}^{2})$ and $N_{t+1}^{t}(\mu_{k},\sigma_{k}^{2})$ of each frame proposal generated by bidirectional trajectory prediction, we use $\sigma_{k}$ to control the spread of the distribution and $\mu_{k}$ to represent its center. Given the proposals from neighboring frames, we consolidate all distributions into the BEV representation $F_{motion}$, which enhances the understanding of agent objects by providing additional motion modality insights. This is effective for handling fast-moving objects and supplies a prior for occluded objects.
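The heatmap construction can be sketched as follows. The unnormalized Gaussian with peak value 1 and the element-wise maximum used to merge overlapping objects follow the common CenterNet-style convention; both are assumptions here, as is expressing centers directly in grid cells.

```python
import numpy as np

def motion_heatmap(centers_xy, sigmas, grid_hw):
    """Render predicted centers as Gaussians on a BEV grid.
    centers_xy: (cx, cy) in grid cells; sigmas: one per object (category-dependent)."""
    H, W = grid_hw
    ys, xs = np.mgrid[0:H, 0:W]
    heat = np.zeros((H, W))
    for (cx, cy), sigma in zip(centers_xy, sigmas):
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
        heat = np.maximum(heat, g)  # assumed merge rule for overlapping objects
    return heat

# Two predicted centers: a larger object (sigma 2) and a smaller one (sigma 1).
heat = motion_heatmap([(8, 5), (20, 12)], [2.0, 1.0], (16, 32))
```

One such map would be rendered per frame and per direction, giving the entries $N_{t-1}^{t}$ and $N_{t+1}^{t}$ that make up $F_{motion}$.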

Motion Guided Feature Aggregation Module. The designed MGFA module uses the information from previous and future motion states to interact with dense BEV features and model spatial-temporal correlations. By incorporating $F_{motion}$, we can enrich positional semantic information and integrate motion characterization into the model's understanding. As mentioned, the motion feature includes bidirectional projections for the target frame. Therefore, the specific motion features are denoted as follows, with $p2c$ and $f2c$ representing past and future predictions of the current frame, respectively:

$$F_{motion}^{p2c} = \{N_{t-n-1}^{t-n}, ..., N_{t-1}^{t}, ..., N_{t+n-1}^{t+n}\}, \tag{4}$$
$$F_{motion}^{f2c} = \{N_{t-n+1}^{t-n}, ..., N_{t+1}^{t}, ..., N_{t+n+1}^{t+n}\}. \tag{5}$$
Figure 3: Motion Guided Feature Aggregation.

Figure 4: Dual Correlation Weighting Module.

The motion feature $F_{motion}$ is first expanded along the channel dimension and then processed by a shared $Conv$ to encode the geometric information of the target center. Specifically, $Conv$ denotes channel expansion followed by dimensionality reduction within the channel dimension:

$$F^{p2c/f2c}_{center} = Conv(repeat(F_{motion}^{p2c/f2c})). \tag{6}$$

After obtaining the center distribution feature $F^{p2c/f2c}_{center}$, we follow the method illustrated in Figure 3 to perform the feature fusion using a shared convolutional network. In the aggregation of forward prediction, we merge the forward distribution feature $F^{p2c}_{center}$, predicted for the target frame from the previous frame, with the BEV feature using a convolutional network. The convolution acts mainly on the channel dimension to fuse the heterogeneous features:

$$F_{MGFA} = Conv(F_{multi}, F^{p2c}_{center}) = \{Conv(F_{t-n}, N_{t-n-1}^{t-n}), ..., Conv(F_{t+n}, N_{t+n-1}^{t+n})\}. \tag{7}$$

Similarly, in reverse trajectory prediction, the center distribution feature $F^{f2c}_{center}$ is sequentially concatenated and convolved with the BEV feature to enhance the dynamic property:

$$F'_{MGFA} = Conv(F_{MGFA}, F^{f2c}_{center}) = \{Conv(F'_{t-n}, N_{t-n+1}^{t-n}), ..., Conv(F'_{t+n}, N_{t+n+1}^{t+n})\}. \tag{8}$$
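The two-stage fusion of Eqs. (6) to (8) can be sketched for a single frame with point-wise convolutions written in NumPy. The 1x1 kernels, the ReLU between expansion and reduction, the channel sizes, and the random weights are all assumptions made for illustration; the paper does not specify these details in this section.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, weight):
    """Point-wise convolution: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def encode_center(heatmap, w_expand, w_reduce, repeat=4):
    """Eq. (6): repeat the heatmap along channels, then expand and reduce channels."""
    x = np.repeat(heatmap[None], repeat, axis=0)   # (repeat, H, W)
    x = np.maximum(conv1x1(x, w_expand), 0.0)      # channel expansion + assumed ReLU
    return conv1x1(x, w_reduce)                    # channel reduction

def fuse(bev, center, w_fuse):
    """Concatenate along channels and mix with a point-wise convolution."""
    return conv1x1(np.concatenate([bev, center], axis=0), w_fuse)

C, H, W = 8, 16, 16
bev = rng.standard_normal((C, H, W))       # BEV feature of one frame
heat_p2c = rng.random((H, W))              # past-to-current heatmap
heat_f2c = rng.random((H, W))              # future-to-current heatmap
w_expand = rng.standard_normal((16, 4))    # shared by both directions
w_reduce = rng.standard_normal((C, 16))
w_fuse_p = rng.standard_normal((C, 2 * C))
w_fuse_f = rng.standard_normal((C, 2 * C))

f_mgfa = fuse(bev, encode_center(heat_p2c, w_expand, w_reduce), w_fuse_p)           # Eq. (7)
f_mgfa_prime = fuse(f_mgfa, encode_center(heat_f2c, w_expand, w_reduce), w_fuse_f)  # Eq. (8)
```

The forward heatmap is fused first and the backward heatmap is applied to its output, mirroring the sequential order of Eqs. (7) and (8); the output keeps the shape of the input BEV feature.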

3.2 Dual Correlation Weighting Module

Unlike the feature concatenation in Figure 1(a), we propose learning a multi-frame fusion weight matrix to capture cross-frame correlations in both channel and temporal dimensions. As shown in Figure 4, global max pooling (GMP𝐺𝑀𝑃GMPitalic_G italic_M italic_P) is first applied along the spatial dimensions to obtain a feature vector vtsubscript𝑣𝑡v_{t}italic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. Subsequently, vectors from multiple frames are concatenated to form a representation for the scene sequence data, denoted as V={vtn,,vt+n}𝑉subscript𝑣𝑡𝑛subscript𝑣𝑡𝑛V=\{v_{t-n},...,v_{t+n}\}italic_V = { italic_v start_POSTSUBSCRIPT italic_t - italic_n end_POSTSUBSCRIPT , … , italic_v start_POSTSUBSCRIPT italic_t + italic_n end_POSTSUBSCRIPT }:

V=Concat(vtn,..,vt+n)=Concat(GMP{FMGFAtn},..,GMP{FMGFAt+n}),V=Concat(v_{t-n},..,v_{t+n})=Concat(GMP\{F^{{}^{\prime}t-n}_{MGFA}\},..,GMP\{F% _{MGFA}^{{}^{\prime}t+n}\}),italic_V = italic_C italic_o italic_n italic_c italic_a italic_t ( italic_v start_POSTSUBSCRIPT italic_t - italic_n end_POSTSUBSCRIPT , . . , italic_v start_POSTSUBSCRIPT italic_t + italic_n end_POSTSUBSCRIPT ) = italic_C italic_o italic_n italic_c italic_a italic_t ( italic_G italic_M italic_P { italic_F start_POSTSUPERSCRIPT start_FLOATSUPERSCRIPT ′ end_FLOATSUPERSCRIPT italic_t - italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_M italic_G italic_F italic_A end_POSTSUBSCRIPT } , . . , italic_G italic_M italic_P { italic_F start_POSTSUBSCRIPT italic_M italic_G italic_F italic_A end_POSTSUBSCRIPT start_POSTSUPERSCRIPT start_FLOATSUPERSCRIPT ′ end_FLOATSUPERSCRIPT italic_t + italic_n end_POSTSUPERSCRIPT } ) , (9)
Md/t=conv(Vid/t,Vjd/t)σVid/tσVjd/t.subscript𝑀𝑑𝑡𝑐𝑜𝑛𝑣subscriptsuperscript𝑉𝑑𝑡𝑖subscriptsuperscript𝑉𝑑𝑡𝑗subscript𝜎subscriptsuperscript𝑉𝑑𝑡𝑖subscript𝜎subscriptsuperscript𝑉𝑑𝑡𝑗M_{d/t}=\frac{conv(V^{d/t}_{i},V^{d/t}_{j})}{\sigma_{V^{d/t}_{i}}*\sigma_{V^{d% /t}_{j}}}.italic_M start_POSTSUBSCRIPT italic_d / italic_t end_POSTSUBSCRIPT = divide start_ARG italic_c italic_o italic_n italic_v ( italic_V start_POSTSUPERSCRIPT italic_d / italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_V start_POSTSUPERSCRIPT italic_d / italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) end_ARG start_ARG italic_σ start_POSTSUBSCRIPT italic_V start_POSTSUPERSCRIPT italic_d / italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∗ italic_σ start_POSTSUBSCRIPT italic_V start_POSTSUPERSCRIPT italic_d / italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG . (10)

We then compute the correlation between matrices across each vector(e.g., i and j), where Vd/tsuperscript𝑉𝑑𝑡V^{d/t}italic_V start_POSTSUPERSCRIPT italic_d / italic_t end_POSTSUPERSCRIPT denotes the process of transforming the sequence along the channel and temporal. After obtaining the correlation matrices Mdsubscript𝑀𝑑M_{d}italic_M start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT and Mtsubscript𝑀𝑡M_{t}italic_M start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, which represent interlinks within the feature structure and across frames in the temporal domain, respectively, the weight matrix is flattened and passed through a two-layer linear network with ReLU activation:

Wd/t=Linear(ReLu(Linear(Md/t))).subscript𝑊𝑑𝑡𝐿𝑖𝑛𝑒𝑎𝑟𝑅𝑒𝐿𝑢𝐿𝑖𝑛𝑒𝑎𝑟subscript𝑀𝑑𝑡W_{d/t}=Linear(ReLu(Linear(M_{d/t}))).italic_W start_POSTSUBSCRIPT italic_d / italic_t end_POSTSUBSCRIPT = italic_L italic_i italic_n italic_e italic_a italic_r ( italic_R italic_e italic_L italic_u ( italic_L italic_i italic_n italic_e italic_a italic_r ( italic_M start_POSTSUBSCRIPT italic_d / italic_t end_POSTSUBSCRIPT ) ) ) . (11)

Eventually, we obtain weight vectors Wd/tsuperscript𝑊𝑑𝑡W^{d/t}italic_W start_POSTSUPERSCRIPT italic_d / italic_t end_POSTSUPERSCRIPT for channels and temporal dimensions, respectively, and generate the weight matrix Mweightsubscript𝑀weightM_{\text{weight}}italic_M start_POSTSUBSCRIPT weight end_POSTSUBSCRIPT through their outer product tensor-product\otimes. Then, this weight is multiplied and channel-wise convolution with the MGFA feature FMGFA′′subscriptsuperscript𝐹′′𝑀𝐺𝐹𝐴F^{\prime\prime}_{MGFA}italic_F start_POSTSUPERSCRIPT ′ ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_M italic_G italic_F italic_A end_POSTSUBSCRIPT (excluding the current frame) to generate the Dual Correlation Weighting feature FDCWMsubscript𝐹𝐷𝐶𝑊𝑀F_{DCWM}italic_F start_POSTSUBSCRIPT italic_D italic_C italic_W italic_M end_POSTSUBSCRIPT as follows:

$$F_{DCWM}=\mathrm{Conv}(F''_{MGFA}\cdot M_{\text{weight}})=\mathrm{Conv}(F''_{MGFA}\cdot(W_{d}\otimes W_{t})). \tag{12}$$
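The weighting in Eqs. (10)-(12) can be illustrated with a small pure-Python sketch. It rests on several assumptions that are not stated in the text: the feature is reduced to a `T x D` array (frames by channels, with spatial dimensions pooled away), `conv(·,·)` in Eq. (10) is read as a covariance so that $M_{d/t}$ is a Pearson correlation matrix, the learned two-layer network of Eq. (11) is replaced by a plain row mean as a stand-in, and the final convolution of Eq. (12) is omitted. All function names are hypothetical.

```python
import math

def pearson(u, v):
    """Covariance of u and v divided by the product of their standard deviations."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / n
    su = math.sqrt(sum((a - mu) ** 2 for a in u) / n)
    sv = math.sqrt(sum((b - mv) ** 2 for b in v) / n)
    return cov / (su * sv + 1e-12)  # small epsilon guards against constant vectors

def correlation_matrix(vectors):
    """Pairwise correlation between all vectors, in the spirit of Eq. (10)."""
    return [[pearson(u, v) for v in vectors] for u in vectors]

def dual_correlation_weights(feat):
    """feat[t][d]: T frames x D channels (assumption: spatial dims already pooled)."""
    frames = feat                                 # T vectors of length D -> M_t
    channels = [list(col) for col in zip(*feat)]  # D vectors of length T -> M_d
    m_t = correlation_matrix(frames)
    m_d = correlation_matrix(channels)
    # Stand-in for the learned mapping of Eq. (11): a row mean, NOT the paper's MLP.
    w_t = [sum(row) / len(row) for row in m_t]
    w_d = [sum(row) / len(row) for row in m_d]
    # Outer product of the two weight vectors, laid out as [t][d] to match feat.
    m_weight = [[wt * wd for wd in w_d] for wt in w_t]
    # Element-wise reweighting of the feature (the Conv of Eq. (12) is omitted).
    weighted = [[feat[t][d] * m_weight[t][d] for d in range(len(w_d))]
                for t in range(len(w_t))]
    return m_d, m_t, m_weight, weighted

feat = [[1.0, 2.0, 4.0], [2.0, 4.0, 5.0], [3.0, 6.0, 9.0]]
m_d, m_t, m_weight, weighted = dual_correlation_weights(feat)
```

In this toy input the first two channels are perfectly correlated over time, so the corresponding entry of `m_d` is close to 1; in the actual module the weights would come from the trained linear layers rather than a row mean.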

3.3 Motion Transformer

With the assistance of the designed modules MGFA and DCWM, the features are enhanced to include details about both ego-motion and object-motion. The attention mechanism [Vaswani et al.(2017)Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin] is then employed using a transformer decoder to focus on feature learning within the spatial-temporal 4D space. First, the features are processed through self-attention as follows:

$$Q_{C/M}=\mathrm{MultiHeadAttn}(Q(F_{C/M}+PE),K(F_{C/M}+PE),V(F_{C/M})), \tag{13}$$

where $F_{C/M}$ represents the current frame feature and the DCWM feature, $Q_{C/M}$ denotes the query, and PE is the position embedding. After self-attention, we apply cross-attention between $Q_{C}$ and $Q_{M}$, which guides the training to focus on aggregating more spatial information containing meaningful object details. The cross-attention is computed as:

$$Q_{C}^{\prime}=\mathrm{MultiHeadAttn}(Q(Q_{C}+PE),K(Q_{M}+PE),V(Q_{M})). \tag{14}$$

After feature generation and fusion, we obtain the final target representation $Q_{C}^{\prime}$. We then follow CenterPoint [Yin et al.(2021)Yin, Zhou, and Krahenbuhl] to learn the representation of the different geometric elements in the 3D scene.
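The data flow of Eqs. (13) and (14) can be sketched with single-head scaled dot-product attention. This is a simplification under stated assumptions: the learned query, key, and value projections and the multi-head split are omitted, both feature sets are assumed to have the same number of tokens so that one position embedding `pe` applies to both, and the names `attention` and `motion_decoder_step` are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def add(a, b):
    """Element-wise sum of two token lists of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def attention(q, k, v):
    """Single-head scaled dot-product attention over lists of token vectors."""
    scale = math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out

def motion_decoder_step(f_c, f_m, pe):
    # Eq. (13): self-attention on each stream; PE is added to query and key only.
    q_c = attention(add(f_c, pe), add(f_c, pe), f_c)
    q_m = attention(add(f_m, pe), add(f_m, pe), f_m)
    # Eq. (14): the current-frame query attends to the motion (DCWM) stream.
    return attention(add(q_c, pe), add(q_m, pe), q_m)

f_c = [[1.0, 0.0], [0.0, 1.0]]   # toy current-frame tokens
f_m = [[2.0, 0.0], [0.0, 2.0]]   # toy DCWM tokens
pe = [[0.1, 0.0], [0.0, 0.1]]    # toy position embedding
q_c_prime = motion_decoder_step(f_c, f_m, pe)
```

Because the values in the last step come only from the motion stream, each output token is a convex combination of the self-attended DCWM tokens, with the mixing weights determined by the current-frame query.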

| Model | Frames | Vehicle L1 | Vehicle L2 | Pedestrian L1 | Pedestrian L2 | Cyclist L1 | Cyclist L2 |
|---|---|---|---|---|---|---|---|
| PointPillar [Lang et al.(2019)Lang, Vora, Caesar, Zhou, Yang, and Beijbom] | 1 | 66.94 / 66.36 | 58.96 / 58.43 | 63.35 / 45.22 | 55.21 / 39.32 | 55.06 / 52.55 | 52.97 / 50.55 |
| VoxelNet [Zhou and Tuzel(2018)] | 1 | 68.73 / 67.31 | 60.11 / 59.97 | 69.65 / 57.38 | 60.19 / 53.67 | 62.31 / 59.85 | 60.34 / 55.89 |
| PillarNet [Shi et al.(2022)Shi, Li, and Ma] | 1 | 66.29 / 65.63 | 59.03 / 58.43 | 70.35 / 64.24 | 64.24 / 55.75 | 65.43 / 63.93 | 63.53 / 62.08 |
| Second [Yan et al.(2018)Yan, Mao, and Li] | 1 | 68.95 / 68.33 | 61.81 / 61.24 | 65.59 / 54.80 | 57.85 / 48.16 | 61.14 / 59.50 | 56.84 / 55.26 |
| CenterPoint [Yin et al.(2021)Yin, Zhou, and Krahenbuhl] | 1 | 67.87 / 67.27 | 59.96 / 59.43 | 69.31 / 62.55 | 61.17 / 55.06 | 64.28 / 63.05 | 61.86 / 60.68 |
| PartA2 [Shi et al.(2019b)Shi, Wang, Wang, and Li] | 1 | 65.52 / 64.85 | 57.32 / 56.63 | 54.83 / 37.72 | 46.85 / 32.19 | 54.29 / 48.75 | 52.21 / 46.89 |
| PVRCNN [Shi et al.(2020)Shi, Guo, Jiang, Wang, Shi, Wang, and Li] | 1 | 71.11 / 70.32 | 62.60 / 61.88 | 63.63 / 32.77 | 54.88 / 28.26 | 59.49 / 34.14 | 57.22 / 32.83 |
| VoxelRCNN [Deng et al.(2021)Deng, Shi, Li, Zhou, Zhang, and Li] | 1 | 71.51 / 70.98 | 63.75 / 63.26 | 65.95 / 65.99 | 65.47 / 60.86 | 70.11 / 68.71 | 67.98 / 66.63 |
| CenterPoint [Yin et al.(2021)Yin, Zhou, and Krahenbuhl] | 4 | 71.27 / 70.73 | 63.59 / 63.09 | 73.91 / 70.45 | 66.28 / 60.10 | 63.78 / 62.98 | 61.59 / 60.82 |
| CenterPoint [Yin et al.(2021)Yin, Zhou, and Krahenbuhl] | 16 | 72.53 / 71.31 | 64.18 / 64.21 | 74.05 / 71.17 | 66.17 / 61.03 | 64.05 / 64.54 | 62.31 / 61.77 |
| MPPNet [Chen et al.(2022)Chen, Shi, Zhu, Cheung, Xu, and Li] | 4 | 74.24 / 73.55 | 66.29 / 65.38 | 76.94 / 72.29 | 68.63 / 66.16 | 67.34 / 66.67 | 65.12 / 64.48 |
| MSF [He et al.(2023)He, Li, Zhang, Li, and Zhang] | 4 | 74.37 / 73.97 | 66.35 / 65.85 | 78.16 / 74.91 | 70.27 / 67.21 | 67.89 / 67.14 | 65.58 / 64.89 |
| LiSTM | 3 | 74.83 / 74.32 | 66.85 / 66.17 | 75.89 / 69.72 | 66.83 / 63.43 | 70.84 / 69.75 | 68.23 / 69.12 |

Table 1: Quantitative comparisons on the Waymo validation set, using 20% of the sequences. Each cell reports AP / APH (higher is better).

| Model | NDS ↑ | mAP ↑ | mATE ↓ | mASE ↓ | mAOE ↓ | mAVE ↓ | mAAE ↓ |
|---|---|---|---|---|---|---|---|
| PointPillar [Lang et al.(2019)Lang, Vora, Caesar, Zhou, Yang, and Beijbom] | 58.62 | 45.27 | 0.3353 | 0.259 | 0.3286 | 0.2784 | 0.2002 |
| Second [Yan et al.(2018)Yan, Mao, and Li] | 62.31 | 50.8 | 0.3140 | 0.2554 | 0.2785 | 0.2587 | 0.2019 |
| CenterPoint [Yin et al.(2021)Yin, Zhou, and Krahenbuhl] | 66.29 | 58.77 | 0.2919 | 0.2566 | 0.3692 | 0.2081 | 0.1837 |
| VoxelNext [Chen et al.(2023)Chen, Liu, Zhang, Qi, and Jia] | 67.09 | 60.55 | 0.3023 | 0.2526 | 0.3701 | 0.2087 | 0.1851 |
| LiSTM | 68.32 | 63.77 | 0.2895 | 0.2479 | 0.3182 | 0.2472 | 0.1850 |

Table 2: Quantitative comparisons on the nuScenes validation set.

4 Experiments

Dataset and Metrics. The Waymo Open dataset [Sun et al.(2020)Sun, Kretzschmar, Dotiwalla, Chouard, Patnaik, Tsui, Guo, Zhou, Chai, Caine, et al.] is a widely used benchmark for autonomous driving. It consists of 1150 point cloud sequences, with over 200,000 frames in total. Results are evaluated using mean Average Precision (mAP) and its variant weighted by heading accuracy (mAPH). Results are reported for LEVEL 1 (L1, easy only) and LEVEL 2 (L2, easy and hard) difficulty levels, considering vehicles, pedestrians, and cyclists.

The nuScenes dataset [Caesar et al.(2020)Caesar, Bankiti, Lang, Vora, Liong, Xu, Krishnan, Pan, Baldan, and Beijbom] provides diverse annotations for autonomous driving and features challenging evaluation metrics. These include mean Average Precision (mAP) at four center distance thresholds and five true-positive metrics: ATE, ASE, AOE, AVE, and AAE, which measure translation, scale, orientation, velocity, and attribute errors, respectively. Additionally, the nuScenes detection score (NDS) combines mAP with these metrics.
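For reference, the nuScenes benchmark defines NDS as $\frac{1}{10}\big[5\,\text{mAP} + \sum_{\text{mTP}} (1 - \min(1, \text{mTP}))\big]$, where the sum runs over the five true-positive errors. The short sketch below applies this formula to the CenterPoint row of Table 2; the function name `nds` is ours and mAP is passed as a fraction rather than a percentage.

```python
def nds(m_ap, tp_errors):
    """nuScenes detection score from mAP (in [0, 1]) and the five mean TP errors."""
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    return (5.0 * m_ap + sum(tp_scores)) / 10.0

# CenterPoint row of Table 2: mAP, then mATE, mASE, mAOE, mAVE, mAAE.
centerpoint_nds = nds(0.5877, [0.2919, 0.2566, 0.3692, 0.2081, 0.1837])
print(round(100 * centerpoint_nds, 2))  # 66.29
```

Half of the score comes from mAP and half from the five error terms, so a detector can gain NDS by lowering localization, scale, orientation, velocity, or attribute errors even when mAP is unchanged.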

Experimental Settings. We follow the default settings of OpenPCDet [Team(2020)] and conduct the experiments on two 24GB Nvidia RTX 3090 GPUs. Validation is performed on the nuScenes and Waymo datasets. We employ the AdamW optimizer with a base learning rate of $3\times 10^{-3}$ and apply layer-wise learning rate decay.
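As an illustration of what layer-wise learning rate decay typically means, the sketch below assigns each layer a learning rate that shrinks geometrically with its distance from the output. This is one common scheme and only an assumption about the setup here: the decay factor `0.9` and the layer count `4` are hypothetical values that the text does not specify, and only the base rate of $3\times 10^{-3}$ comes from the paper.

```python
def layerwise_learning_rates(base_lr, num_layers, decay):
    """One learning rate per layer; index 0 is the layer closest to the input.

    The last layer uses base_lr, and each earlier layer is scaled by `decay` once more.
    """
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

# Hypothetical configuration: 4 layer groups, decay factor 0.9.
lrs = layerwise_learning_rates(base_lr=3e-3, num_layers=4, decay=0.9)
for index, lr in enumerate(lrs):
    print(f"layer {index}: lr = {lr:.6f}")
```

In a training framework these values would usually be passed to the optimizer as per-parameter-group learning rates, so that early layers change more slowly than the detection head.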

Comparison Experiment. We validate the effectiveness of LiSTM on the Waymo validation set (Table 1), using 20% of the sequences for training. Full results are available in Table 8 of the Appendix. LiSTM achieves an improvement of over 8% compared to single-stage models like CenterPoint [Yin et al.(2021)Yin, Zhou, and Krahenbuhl], while also outperforming two-stage models such as PVRCNN [Shi et al.(2020)Shi, Guo, Jiang, Wang, Shi, Wang, and Li] and VoxelRCNN [Deng et al.(2021)Deng, Shi, Li, Zhou, Zhang, and Li]. Moreover, as a multi-frame single-stage model, LiSTM eliminates the need for region-of-interest extraction, which reduces resource consumption, as shown in Table 7. Compared to multi-frame CenterPoint [Yin et al.(2021)Yin, Zhou, and Krahenbuhl], LiSTM achieves clear improvements while using fewer frames. Compared to the two-stage models MPPNet [Chen et al.(2022)Chen, Shi, Zhu, Cheung, Xu, and Li] and MSF [He et al.(2023)He, Li, Zhang, Li, and Zhang], LiSTM improves vehicle and cyclist detection, which we attribute to motion-based feature integration. More details and discussions can be found in the Appendix.

In Figure 5, we compare the baseline with our model. As highlighted by the pink arrows, LiSTM detects objects that CenterPoint fails to identify due to distance and occlusion. Additionally, as indicated by the yellow arrows, LiSTM detects further positive samples that have no annotations.

On the nuScenes dataset, LiSTM outperforms the benchmarks PointPillar [Lang et al.(2019)Lang, Vora, Caesar, Zhou, Yang, and Beijbom] and VoxelNet [Zhou and Tuzel(2018)], and improves NDS by about 2 points (66.29 to 68.32) and mAP by 5 points (58.77 to 63.77) compared to CenterPoint [Yin et al.(2021)Yin, Zhou, and Krahenbuhl]. LiSTM also achieves the lowest mATE and mASE in Table 2.

Figure 5: Qualitative visualization of LiSTM on the Waymo validation set. We show the 3D box predictions in the LiDAR bird's-eye view.

| CenterPoint | Motion Transformer | MGFA | DCWM | Veh. L2 APH | Ped. L2 APH | Cyc. L2 APH |
|---|---|---|---|---|---|---|
| ✓ | × | × | × | 59.51 | 55.22 | 60.54 |
| ✓ | ✓ | × | × | 62.49 | 56.04 | 63.17 |
| ✓ | ✓ | ✓ | × | 64.67 | 57.56 | 67.86 |
| ✓ | ✓ | ✓ | ✓ | 65.88 | 61.10 | 68.80 |

Table 3: Ablation studies on the Waymo validation set.

Ablation Study. As shown in Table 3, we add the Motion Transformer, Motion-Guided Feature Aggregation, and Dual Correlation Weighting Module sequentially on top of CenterPoint [Yin et al.(2021)Yin, Zhou, and Krahenbuhl] to study the feature fusion structure. CenterPoint alone struggles to model multi-frame features, and modeling features solely through a Transformer remains challenging. The proposed MGFA and DCWM improve APH by a further 2-3% by incorporating dynamic priors into the Transformer model.

| Experiment Number | Time | Veh. L2 APH | Ped. L2 APH | Cyc. L2 APH |
|---|---|---|---|---|
| 1 | $t$ | 62.13 | 60.91 | 61.16 |
| 2 | $t-1, t$ | 63.41 | 58.17 | 62.37 |
| 3 | $t-2, t-1, t$ | 63.46 | 58.62 | 63.89 |
| 4 | $t-1, t, t+1$ | 65.88 | 61.10 | 68.80 |
| 5 | $t-2, t-1, t, t+1, t+2$ | 65.73 | 61.13 | 67.56 |

Table 4: Ablation study of the frame fusion effects on the Waymo validation set.

Since our method relies on multi-frame fusion, we need to consider the number of frames to use. In Table 4, we compare single-frame input, past-frame fusion, and past-future fusion, and draw three key conclusions. First, cross-frame fusion (EXP. 1, 2, and 4) contributes significantly to detection results. Second, using too many frames (EXP. 5) not only increases memory requirements but also hampers model convergence. The main reason this conclusion differs from MSF [He et al.(2023)He, Li, Zhang, Li, and Zhang] is that we use feature-level temporal fusion, in which excessive attention stacking can hinder target characterization. Third, relying solely on past frames limits the model's understanding of the scene's geometry (EXP. 3 and 4). Incorporating both past and future frames provides a more comprehensive context and improves performance.

| Motion Feature | Cyc. L2 APH | Veh. L2 APH |
|---|---|---|
| pre2cur | 66.51 | 65.72 |
| fut2cur | 66.47 | 65.78 |
| cur2pre | 66.13 | 65.70 |
| cur2fut | 66.21 | 65.72 |
| cur2pre + cur2fut | 66.57 | 65.83 |
| pre2cur + fut2cur | 68.80 | 65.88 |

Table 5: Ablation study of motion-based heatmap feature selections on the Waymo validation set.

| Fusion Method | NDS | mAP |
|---|---|---|
| Concatenate | 66.79 | 58.13 |
| Attention [Dosovitskiy et al.(2020)Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, et al.] | 65.37 | 58.99 |
| Spatial Fusion [Zhou et al.(2022)Zhou, Zhao, Wang, Wang, and Foroosh] | 67.13 | 60.31 |
| DCWM | 68.32 | 63.77 |

Table 6: Ablation study of different feature fusion strategies on Waymo validation set.

We select the motion feature as shown in Table 5; it fuses the information of object motion and encodes its features according to trajectory predictions. We find that the time at which the feature is observed has little effect on the metrics. Nevertheless, the trajectory feature predicted from the future and the past for the present (pre2cur + fut2cur) works best and is the most logical choice.

For multi-frame feature map fusion, we compare concatenation, attention, the spatial-aware attention used in CenterFormer [Zhou et al.(2022)Zhou, Zhao, Wang, Wang, and Foroosh], and our proposed DCWM in Table 6. Directly employing attention can hinder model learning, potentially yielding inferior results compared to concatenation and spatial fusion. In contrast, our proposed Dual Correlation Weighting Module fuses multiple frames effectively and brings more pronounced improvements.

5 Conclusion

To address the challenge of detecting objects in sparse, occluded, and long-range LiDAR point clouds, we introduce LiSTM, a motion-based spatial-temporal fusion 3D point cloud detector. It leverages well-designed motion features and motion-guided feature fusion to enhance detection performance on the Waymo and nuScenes datasets. In future work, we will focus on developing an end-to-end motion generator and exploring sparse feature representations.

Appendix

Computational Efficiency. We acknowledge that some reviewers have raised concerns regarding computational resources. To address this, we compare CenterPoint, PV-RCNN++, MSF, and LiSTM in Table 7. Although LiSTM has considerably more model parameters, its actual FPS is comparable to that of CenterPoint [Yin et al.(2021)Yin, Zhou, and Krahenbuhl] (5.26 vs. 5.68 it/s). Moreover, LiSTM is roughly 40% faster than PV-RCNN++ [Shi et al.(2023)Shi, Jiang, Deng, Wang, Guo, Shi, Wang, and Li] (5.26 vs. 3.75 it/s), and it consumes less memory (4400 vs. 6684 MiB) and runs faster than MSF [He et al.(2023)He, Li, Zhang, Li, and Zhang]. This advantage primarily stems from our use of sparse feature operations and shared networks, which eliminate the need for computationally intensive processes such as multi-frame splicing and resampling.

| Model | Model Parameters | Memory Cost | FPS |
|---|---|---|---|
| CenterPoint [Yin et al.(2021)Yin, Zhou, and Krahenbuhl] | 7758811 | 2464 MiB | 5.68 it/s |
| PV-RCNN++ [Shi et al.(2023)Shi, Jiang, Deng, Wang, Guo, Shi, Wang, and Li] | 13073505 | 3918 MiB | 3.75 it/s |
| MSF [He et al.(2023)He, Li, Zhang, Li, and Zhang] | 15661651 | 6684 MiB | 4.58 it/s |
| LiSTM | 17592422 | 4400 MiB | 5.26 it/s |

Table 7: Computational efficiency.
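
The relative costs implied by Table 7 can be checked with a few lines of arithmetic. The sketch below copies the table values into a dictionary and computes LiSTM's throughput and memory relative to each other model; the dictionary keys and the helper name `ratio` are ours.

```python
# Values copied from Table 7.
TABLE7 = {
    "CenterPoint": {"params": 7_758_811, "memory_mib": 2464, "it_per_s": 5.68},
    "PV-RCNN++": {"params": 13_073_505, "memory_mib": 3918, "it_per_s": 3.75},
    "MSF": {"params": 15_661_651, "memory_mib": 6684, "it_per_s": 4.58},
    "LiSTM": {"params": 17_592_422, "memory_mib": 4400, "it_per_s": 5.26},
}

def ratio(model, reference, key):
    """Value of `key` for `model` divided by the value for `reference`."""
    return TABLE7[model][key] / TABLE7[reference][key]

for reference in ("CenterPoint", "PV-RCNN++", "MSF"):
    speed = ratio("LiSTM", reference, "it_per_s")
    memory = ratio("LiSTM", reference, "memory_mib")
    print(f"LiSTM vs {reference}: {speed:.2f}x throughput, {memory:.2f}x memory")
```

A throughput ratio above 1 means LiSTM processes more iterations per second than the reference, and a memory ratio below 1 means it uses less memory.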

Point-Trajectory Model Analysis and Performance Comparison. Taking MSF [He et al.(2023)He, Li, Zhang, Li, and Zhang] as an example, it enhances temporal features at the input level in two stages. In contrast, our approach targets implicit features, allowing for more efficient parallel computation and improved resource utilization. Unlike MSF's ROI sampling on point clouds, our method constructs a BEV heatmap, significantly boosting performance for larger targets such as vehicles (about 6 m) and cyclists (about 2 m). However, for smaller targets such as pedestrians (about 0.5 m), even minor deviations can reduce performance, leading to lower results compared to MSF.

Lack of Related Work on the Motion Estimation Model. Several works [Yan et al.(2021)Yan, Peng, Fu, Wang, and Lu, Cui et al.(2023)Cui, Li, and Fang, Xia et al.(2023)Xia, Wu, Li, Chan, and Stilla] have proposed learnable single object tracking (SOT) models, and we will try to build an end-to-end model in this direction in the future. However, this class of methods requires significant computational resources and is not well suited to multi-object detectors. Therefore, our strategy is to use a simple linear Kalman model for target trajectory prediction, which characterizes target motion a priori without the need for learnable parameters or GPU resources.
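To make the idea of a non-learnable linear Kalman model concrete, the following is a minimal sketch of a constant-velocity filter along one BEV axis. The constant-velocity state `[position, velocity]`, the noise values `q` and `r`, and the class name are all assumptions for illustration; the text only states that a simple linear Kalman model is used, so this should not be read as the authors' exact formulation.

```python
class ConstantVelocityKalman1D:
    """Linear Kalman filter with state [position, velocity] along one axis."""

    def __init__(self, pos, vel=0.0, p=1.0, q=0.01, r=0.1):
        self.x = [pos, vel]               # state estimate
        self.P = [[p, 0.0], [0.0, p]]     # state covariance
        self.q, self.r = q, r             # process / measurement noise (hypothetical)

    def predict(self, dt=1.0):
        pos, vel = self.x
        self.x = [pos + vel * dt, vel]
        (a, b), (c, d) = self.P
        # P <- F P F^T + q I, with F = [[1, dt], [0, 1]]
        self.P = [
            [a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
            [c + dt * d, d + self.q],
        ]
        return self.x[0]

    def update(self, z):
        (a, b), (c, d) = self.P
        s = a + self.r                    # innovation variance for H = [1, 0]
        k0, k1 = a / s, c / s             # Kalman gain
        y = z - self.x[0]                 # innovation
        self.x = [self.x[0] + k0 * y, self.x[1] + k1 * y]
        # P <- (I - K H) P
        self.P = [[(1 - k0) * a, (1 - k0) * b], [c - k1 * a, d - k1 * b]]

# Track a box center moving one unit per frame, then forecast the next frame.
kf = ConstantVelocityKalman1D(pos=0.0)
for z in range(1, 20):
    kf.predict()
    kf.update(float(z))
next_center = kf.predict()
```

Running one such filter per axis gives a predicted box center for a past or future frame using only a handful of scalar operations, which is consistent with the claim that the motion prior needs no learnable parameters or GPU resources.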

Total Waymo Evaluation. Table 8 reports validation results for models trained on Waymo's full training dataset, comparing LiSTM with CenterPoint [Yin et al.(2021)Yin, Zhou, and Krahenbuhl] and PV-RCNN++ [Shi et al.(2023)Shi, Jiang, Deng, Wang, Guo, Shi, Wang, and Li].

| Model | Vehicle L1 | Vehicle L2 | Pedestrian L1 | Pedestrian L2 | Cyclist L1 | Cyclist L2 |
|---|---|---|---|---|---|---|
| CenterPoint [Yin et al.(2021)Yin, Zhou, and Krahenbuhl] | 72.64 / 72.10 | 64.57 / 64.07 | 74.53 / 68.36 | 66.50 / 60.84 | 71.14 / 69.91 | 68.56 / 67.37 |
| PV-RCNN++ [Shi et al.(2023)Shi, Jiang, Deng, Wang, Guo, Shi, Wang, and Li] | 77.80 / 77.34 | 69.43 / 69.01 | 80.00 / 73.94 | 71.62 / 65.97 | 72.43 / 71.35 | 69.79 / 68.74 |
| LiSTM | 78.91 / 78.31 | 70.64 / 70.10 | 80.79 / 75.01 | 72.16 / 66.87 | 74.42 / 73.33 | 71.84 / 70.79 |

Table 8: Quantitative comparison on the Waymo validation set. Each cell reports AP / APH (higher is better).

Long Distance Perception. The LiSTM architecture leverages continuous frames and motion priors to enhance performance, particularly for long-range detection. In our evaluation on the Waymo dataset, which covers a 75m radius horizontally and vertically, we report metrics for three distance ranges (Table 9). LiSTM outperforms the baseline by roughly 3 to 5 mAP points in the 25m to 75m range. Even beyond this range, where the point cloud is mostly filtered out, LiSTM metrics remain somewhat higher than the baseline.

| Model | 25m away: Vehicle | 25m away: Pedestrians | 25m away: Cyclist | 50m-75m: Vehicle | 50m-75m: Pedestrians | 50m-75m: Cyclist | 75m away: Vehicle | 75m away: Pedestrians | 75m away: Cyclist |
|---|---|---|---|---|---|---|---|---|---|
| CenterPoint [Yin et al.(2021)Yin, Zhou, and Krahenbuhl] | 58.80 | 63.12 | 61.37 | 41.82 | 54.00 | 50.50 | 11.46 | 16.30 | 14.82 |
| LiSTM | 64.14 | 68.05 | 65.25 | 46.31 | 57.51 | 53.87 | 12.89 | 17.25 | 15.16 |

Table 9: Long distance perception metric (mAP, higher is better) on the Waymo validation set.

References

  • [Caesar et al.(2020)Caesar, Bankiti, Lang, Vora, Liong, Xu, Krishnan, Pan, Baldan, and Beijbom] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020.
  • [Calvo et al.(2023)Calvo, Taveira, Kahl, Gustafsson, Larsson, and Tonderski] Ernesto Lozano Calvo, Bernardo Taveira, Fredrik Kahl, Niklas Gustafsson, Jonathan Larsson, and Adam Tonderski. Timepillars: Temporally-recurrent 3d lidar object detection. arXiv preprint arXiv:2312.17260, 2023.
  • [Chen et al.(2022)Chen, Shi, Zhu, Cheung, Xu, and Li] Xuesong Chen, Shaoshuai Shi, Benjin Zhu, Ka Chun Cheung, Hang Xu, and Hongsheng Li. Mppnet: Multi-frame feature intertwining with proxy points for 3d temporal object detection. In European Conference on Computer Vision, pages 680–697. Springer, 2022.
  • [Chen et al.(2023)Chen, Liu, Zhang, Qi, and Jia] Yukang Chen, Jianhui Liu, Xiangyu Zhang, Xiaojuan Qi, and Jiaya Jia. Voxelnext: Fully sparse voxelnet for 3d object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21674–21683, 2023.
  • [Cui et al.(2023)Cui, Li, and Fang] Yubo Cui, Zhiheng Li, and Zheng Fang. Sttracker: Spatio-temporal tracker for 3d single object tracking. IEEE Robotics and Automation Letters, 2023.
  • [Deng et al.(2021)Deng, Shi, Li, Zhou, Zhang, and Li] Jiajun Deng, Shaoshuai Shi, Peiwei Li, Wengang Zhou, Yanyong Zhang, and Houqiang Li. Voxel r-cnn: Towards high performance voxel-based 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 1201–1209, 2021.
  • [Dosovitskiy et al.(2020)Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, et al.] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • [He et al.(2023)He, Li, Zhang, Li, and Zhang] Chenhang He, Ruihuang Li, Yabin Zhang, Shuai Li, and Lei Zhang. Msf: Motion-guided sequential fusion for efficient 3d object detection from point cloud sequences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5196–5205, 2023.
  • [Huang et al.(2024)Huang, Lyu, Yang, and Tsai] Kuan-Chih Huang, Weijie Lyu, Ming-Hsuan Yang, and Yi-Hsuan Tsai. Ptt: Point-trajectory transformer for efficient temporal 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14938–14947, 2024.
  • [Kim et al.(2021)Kim, Ošep, and Leal-Taixé] Aleksandr Kim, Aljoša Ošep, and Laura Leal-Taixé. Eagermot: 3d multi-object tracking via sensor fusion. In 2021 IEEE International Conference on Robotics and Automation, pages 11315–11321. IEEE, 2021.
  • [Lang et al.(2019)Lang, Vora, Caesar, Zhou, Yang, and Beijbom] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12697–12705, 2019.
  • [Law and Deng(2018)] Hei Law and Jia Deng. Cornernet: Detecting objects as paired keypoints. In European Conference on Computer Vision, pages 734–750, 2018.
  • [Li et al.(2023a)Li, Qi, Zhou, Liu, and Anguelov] Yingwei Li, Charles R Qi, Yin Zhou, Chenxi Liu, and Dragomir Anguelov. Modar: Using motion forecasting for 3d object detection in point cloud sequences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9329–9339, 2023a.
  • [Li et al.(2023b)Li, Ge, Yu, Yang, Wang, Shi, Sun, and Li] Yinhao Li, Zheng Ge, Guanyi Yu, Jinrong Yang, Zengran Wang, Yukang Shi, Jianjian Sun, and Zeming Li. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 1477–1485, 2023b.
  • [Li et al.(2022)Li, Wang, Li, Xie, Sima, Lu, Qiao, and Dai] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In European Conference on Computer Vision, pages 1–18. Springer, 2022.
  • [Liu et al.(2016)Liu, Anguelov, Erhan, Szegedy, Reed, Fu, and Berg] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
  • [Philion and Fidler(2020)] Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In European Conference on Computer Vision, pages 194–210. Springer, 2020.
  • [Qi et al.(2017)Qi, Su, Mo, and Guibas] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
  • [Shi et al.(2022)Shi, Li, and Ma] Guangsheng Shi, Ruifeng Li, and Chao Ma. Pillarnet: Real-time and high-performance pillar-based 3d object detection. In European Conference on Computer Vision, pages 35–52. Springer, 2022.
  • [Shi et al.(2019a)Shi, Wang, and Li] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 770–779, 2019a.
  • [Shi et al.(2019b)Shi, Wang, Wang, and Li] Shaoshuai Shi, Zhe Wang, Xiaogang Wang, and Hongsheng Li. Part-a2 net: 3d part-aware and aggregation neural network for object detection from point cloud. arXiv preprint arXiv:1907.03670, 2(3), 2019b.
  • [Shi et al.(2020)Shi, Guo, Jiang, Wang, Shi, Wang, and Li] Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10529–10538, 2020.
  • [Shi et al.(2023)Shi, Jiang, Deng, Wang, Guo, Shi, Wang, and Li] Shaoshuai Shi, Li Jiang, Jiajun Deng, Zhe Wang, Chaoxu Guo, Jianping Shi, Xiaogang Wang, and Hongsheng Li. Pv-rcnn++: Point-voxel feature set abstraction with local vector representation for 3d object detection. International Journal of Computer Vision, 131(2):531–551, 2023.
  • [Sun et al.(2020)Sun, Kretzschmar, Dotiwalla, Chouard, Patnaik, Tsui, Guo, Zhou, Chai, Caine, et al.] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset, 2020.
  • [Team(2020)] OpenPCDet Development Team. Openpcdet: An open-source toolbox for 3d object detection from point clouds. https://github.com/open-mmlab/OpenPCDet, 2020.
  • [Tian et al.(2019)Tian, Yang, Wang, Wang, Li, and Liang] Yunong Tian, Guodong Yang, Zhe Wang, Hao Wang, En Li, and Zize Liang. Apple detection during different growth stages in orchards using the improved yolo-v3 model. Computers and Electronics in Agriculture, 157:417–426, 2019.
  • [Vaswani et al.(2017)Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  • [Wang et al.(2023)Wang, Liu, Wang, Li, and Zhang] Shihao Wang, Yingfei Liu, Tiancai Wang, Ying Li, and Xiangyu Zhang. Exploring object-centric temporal modeling for efficient multi-view 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3621–3631, 2023.
  • [Xia et al.(2023)Xia, Wu, Li, Chan, and Stilla] Yan Xia, Qiangqiang Wu, Wei Li, Antoni B Chan, and Uwe Stilla. A lightweight and detector-free 3d single object tracker on point clouds. IEEE Transactions on Intelligent Transportation Systems, 24(5):5543–5554, 2023.
  • [Yan et al.(2021)Yan, Peng, Fu, Wang, and Lu] Bin Yan, Houwen Peng, Jianlong Fu, Dong Wang, and Huchuan Lu. Learning spatio-temporal transformer for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10448–10457, 2021.
  • [Yan et al.(2018)Yan, Mao, and Li] Yan Yan, Yuxing Mao, and Bo Li. Second: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.
  • [Yang et al.(2021)Yang, Zhou, Chen, and Ngiam] Zetong Yang, Yin Zhou, Zhifeng Chen, and Jiquan Ngiam. 3d-man: 3d multi-frame attention network for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1863–1872, 2021.
  • [Yin et al.(2020)Yin, Shen, Guan, Zhou, and Yang] Junbo Yin, Jianbing Shen, Chenye Guan, Dingfu Zhou, and Ruigang Yang. Lidar-based online 3d video object detection with graph-based message passing and spatiotemporal transformer attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11495–11504, 2020.
  • [Yin et al.(2021)Yin, Zhou, and Krahenbuhl] Tianwei Yin, Xingyi Zhou, and Philipp Krahenbuhl. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11784–11793, 2021.
  • [Zhao et al.(2024)Zhao, Heng, Wang, Gao, Liu, Yao, Chen, and Cai] Runkai Zhao, Yuwen Heng, Heng Wang, Yuanda Gao, Shilei Liu, Changhao Yao, Jiawen Chen, and Weidong Cai. Advancements in 3d lane detection using lidar point clouds: From data collection to model development. In 2024 IEEE International Conference on Robotics and Automation, pages 5382–5388. IEEE, 2024.
  • [Zhou et al.(2019)Zhou, Wang, and Krähenbühl] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
  • [Zhou and Tuzel(2018)] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4490–4499, 2018.
  • [Zhou et al.(2022)Zhou, Zhao, Wang, Wang, and Foroosh] Zixiang Zhou, Xiangchen Zhao, Yu Wang, Panqu Wang, and Hassan Foroosh. Centerformer: Center-based transformer for 3d object detection. In European Conference on Computer Vision, pages 496–513. Springer, 2022.