MambaBEV: An efficient 3D detection model with Mamba2

Zihan You¹, Ni Wang², Hao Wang³, Qichao Zhao³, Jinxiang Wang⁴

¹ School of Instrument Science and Engineering, Southeast University, China
² Amazon Development Center Germany GmbH, Germany
³ T3CAIC Technology, China
⁴ School of Mechanical Engineering, Southeast University, China

* This work is partially supported by National Natural Science Foundation of China (NSFC) under Grant 52372410. All correspondence should be sent to J. Wang (Email: wangjx@seu.edu.cn).

Abstract

Accurate 3D object detection in autonomous driving relies on Bird’s Eye View (BEV) perception and effective temporal fusion. However, existing fusion strategies—based on convolutional layers or deformable self-attention—struggle with global context modeling in BEV space, leading to lower accuracy for large objects. To address this, we introduce MambaBEV, a novel BEV-based 3D object detection model that leverages Mamba2, an advanced state-space model (SSM) optimized for long-sequence processing. Our key contribution is TemporalMamba, a temporal fusion module that enhances global awareness by introducing a BEV feature discrete rearrangement mechanism tailored for Mamba’s sequential processing. Additionally, we propose Mamba-based-DETR as the detection head to improve multi-object representation. Evaluations on the nuScenes dataset demonstrate that MambaBEV-base achieves an NDS of 51.7% and an mAP of 42.7%. Furthermore, an end-to-end autonomous driving paradigm validates its effectiveness in motion forecasting and planning. Our results highlight the potential of SSMs in autonomous driving perception, particularly in enhancing global context understanding and large-object detection.

I INTRODUCTION

Ensuring accurate and reliable 3D object detection is crucial for autonomous driving systems, directly impacting safety and path planning. Traditional perception methods, such as Hough Transform[1] and keypoint-based feature extraction [2], laid the groundwork for object detection but struggled with limited robustness and scale variance. The rise of deep learning-based perception has significantly improved detection accuracy, yet challenges remain, particularly for monocular camera-based methods[3], which suffer from depth estimation errors and blind spots, posing a risk to vehicle safety.

To tackle these issues, researchers have explored multi-camera perception systems, such as binocular stereo matching[4] and surround-view camera networks. While these approaches improve distance estimation, they introduce challenges like high computational cost, redundancy, and difficulty in cross-camera target re-identification. A more promising solution is Bird’s Eye View (BEV)-based 3D object detection, which consolidates multi-camera inputs into a unified top-down representation, enhancing distance estimation, obstacle detection, and cross-view information sharing[5].

TABLE I: Average precision for BEVFormer-tiny[6]

categories	dist0.5 $\uparrow$	dist1.0 $\uparrow$	dist2.0 $\uparrow$	dist4.0 $\uparrow$
car	0.0877	0.3366	0.6277	0.7809
pedestrian	0.0251	0.1993	0.4604	0.6469
bicycle	0.0056	0.1132	0.2902	0.3992
truck	0.0019	0.0700	0.2587	0.4403
construction vehicle	0.0000	0.0008	0.0651	0.1696
bus	0.0000	0.0588	0.3203	0.5619

Another critical aspect of autonomous driving perception is temporal aggregation. While single-frame detection provides a straightforward approach, it often suffers from occlusion, missed detections, and temporal inconsistency between frames. To address these limitations, temporal fusion techniques have been developed to incorporate historical features, significantly improving detection robustness and accuracy[6]. Traditional temporal fusion methods, such as deformable self-attention mechanisms[6], dynamically sample spatial features and are more computationally efficient than global self-attention. However, these methods struggle with global context modeling and long-range interactions. For example, on the COCO 2017 validation set, a deformable attention-based model (Deformable-DETR)[7] showed a 2.9% lower average precision (AP) for large objects than global self-attention models.

Similarly, in BEV-based 3D object detection, deformable self-attention models like BEVFormer[6] demonstrate higher accuracy for small objects (e.g., pedestrians, bicycles) but reduced performance for larger objects (e.g., trucks, buses) (Table I). This disparity arises because sparse sampling points in deformable attention limit spatial coverage, and the absence of explicit global interaction mechanisms prevents effective cross-scale feature fusion. Even increasing the number of sampling points does not fully resolve this issue, as deformable attention primarily aggregates local features rather than capturing holistic spatial relationships.

Figure 1: Given RGB images captured by six surrounding cameras, a pretrained backbone generates six feature maps. These feature maps are processed through a Feature Pyramid Network (FPN) to extract multi-scale features. Subsequently, the Spatial Cross Attention (SCA) module performs backward projection to produce a bird’s-eye view (BEV) feature map. The TemporalMamba block then fuses historical BEV features with current BEV features, guiding the generation of new current BEV features. After several processing layers, a Mamba-based-DETR head serves as the 3D object detection head.

Addressing these limitations is critical for enhancing 3D object detection performance in autonomous driving. Recently, state-space models (SSMs) have emerged as a promising alternative for long-sequence modeling, demonstrating superior efficiency and scalability compared to transformers[8]. Among them, Mamba, a novel structured SSM, has shown remarkable performance across multiple tasks. Mamba2, an improved version, further enhances computational efficiency and long-range dependency modeling[9]. These advances provide a strong foundation for developing a new approach to temporal fusion that overcomes the limitations of deformable attention. Integrating Mamba2 into 3D perception for autonomous driving is an innovative and promising direction. To address the challenges in temporal fusion modules, we introduce MambaBEV, a 3D perception model based on BEV and Mamba2. MambaBEV emphasizes the feasibility and potential of state-space models for autonomous driving perception systems, offering a solution to enhance the precision of large object detection. Our contributions can be summarized as follows:

• We introduce a Mamba2-based 3D object detection model named MambaBEV. To our knowledge, this is the first attempt to integrate Mamba2 into a camera-based 3D object detection network.

• We propose a Mamba2-based temporal fusion module called TemporalMamba, showing the feasibility of, and offering insight into, temporal fusion with Mamba. To accommodate the sequential scanning nature of Mamba, we design a BEV feature discrete rearrangement mechanism.

• In the decoder layer, we design a Mamba-based-DETR head built on the Mamba-cross-attention module.

• We conduct extensive experiments on the 3D object detection task and on the end-to-end autonomous driving paradigm adopted by VAD[10].

II Related work

II-A Camera based 3D Object detection

In the field of 3D object detection from images, several pioneering approaches have significantly advanced the domain. LSS[11] and BEVDet[5] are notable methods that transform image-based features into a bird’s-eye view (BEV) representation, enabling more accurate 3D detection by leveraging the geometric properties of the scene. Building upon this, BEVerse[12] enhances the BEV paradigm by integrating multi-view and temporal information, thereby improving the robustness and precision of 3D object detection. BEVDepth[13] goes a step further by explicitly modeling depth information within the BEV framework. This addition allows for a more precise understanding of the spatial relationships between objects, leading to enhanced detection performance. DETR3D[14] introduces the transformer architecture to the 3D detection task, leveraging its powerful sequence-to-sequence modeling capabilities. By generating a cost volume from a long history of image observations and augmenting per-frame monocular depth predictions with short-term, fine-grained matching, SOLOFusion[15] leverages both long-term and short-term temporal fusion to enhance object perception. VideoBEV[16] addresses the growing computational and memory overheads of parallel fusion methods, demonstrating strong performance across various camera-based 3D perception tasks, including object detection, segmentation, tracking, and motion prediction. FB-BEV[17] and FB-OCC[18] define the Spatial Cross Attention (SCA) process as backward projection, the inverse of forward projection. They combine the two processes into one method, which effectively improves the 3D detection capability of models.

II-B Mamba-based models

Transformers have revolutionized various domains of deep learning with their ability to model sequences effectively[19]. However, their application in long-sequence modeling remains computationally expensive due to quadratic complexity with respect to sequence length. Addressing this, structured state space models (SSMs) have emerged as a scalable alternative, boasting linear complexity during training and constant state size during generation[8]. Among the notable advancements in this field is the Mamba model, which offers a promising solution for efficiently handling long sequences without sacrificing performance.
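To make the linear-complexity and constant-state-size properties concrete, the following is a minimal sketch of a discrete linear state-space recurrence in NumPy. It is an illustration only: the function name `ssm_scan` and the fixed matrices `A`, `B`, `C` are assumptions for this example, and the recurrence is time-invariant, whereas Mamba makes its parameters input-dependent (selective) and uses a hardware-aware parallel scan rather than a Python loop.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Run h_t = A h_{t-1} + B x_t, y_t = C h_t over a sequence.

    x: (T, d_in) input sequence
    A: (n, n) state transition, B: (n, d_in) input map, C: (d_out, n) readout
    Returns y with shape (T, d_out).
    """
    h = np.zeros(A.shape[0])   # state size is fixed, independent of T
    ys = []
    for x_t in x:              # a single pass over the sequence: O(T)
        h = A @ h + B @ x_t
        ys.append(C @ h)
    return np.stack(ys)

# Illustrative sizes only (not taken from the paper).
rng = np.random.default_rng(0)
T, d_in, n, d_out = 16, 4, 8, 4
A = 0.9 * np.eye(n)            # simple stable transition, an assumption
B = rng.normal(size=(n, d_in))
C = rng.normal(size=(d_out, n))
x = rng.normal(size=(T, d_in))
y = ssm_scan(x, A, B, C)
```

Each step touches only the current input and a state of size `n`, so the cost grows linearly with sequence length, in contrast to the pairwise interactions of self-attention.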

The evolution of Mamba into Mamba2[9] marked a significant leap, incorporating a structured state space duality (SSD) framework. This framework bridges SSMs with attention mechanisms, enhancing efficiency and scalability. Mamba2 particularly excels in sequence processing tasks by employing block decompositions of semiseparable matrices, which optimize both computational and memory efficiency. These innovations have set a new benchmark in sequence modeling, extending their application to complex tasks such as language processing and, crucially, to 3D object detection in autonomous driving systems.

Recent developments have expanded the capabilities of Mamba-based models. Vision Mamba (Vim)[20] introduces a bidirectional state space model that enhances visual representation learning. Vim addresses the challenges of positional sensitivity and the need for global context in visual data, marking a significant advancement over traditional self-attention mechanisms. The VMamba[21] model transitions Mamba’s capabilities into the vision domain with its Visual State-Space (VSS) blocks and the innovative 2D Selective Scan (SS2D) module, optimizing contextual information gathering. Voxel Mamba[22] utilizes a group-free strategy to handle 3D voxel data, maintaining spatial proximity and enhancing detection accuracy without the computational overhead typical of Transformers. This model’s application to point cloud data for 3D object detection exemplifies the potential of SSMs to revolutionize the processing of spatial data. Furthermore, MS-Temba[23] adapts Mamba for action detection, introducing Temporal Mamba (Temba) Blocks that effectively capture both short- and long-term temporal relationships. This model’s scalability and reduced parameter count make it well suited to processing extensive video sequences.

These advancements illustrate the evolving landscape of Mamba-based models, highlighting their potential to redefine the efficiency and efficacy of autonomous systems requiring complex sequence and spatial data processing.

III Methodology

MambaBEV employs an advanced state-space model comprising two primary components. The first is the TemporalMamba block, a fusion engine based on a proposed Mamba-CNN architecture, which integrates BEV features across sequential frames to enhance temporal consistency and detection robustness. The second component is the Mamba-based-DETR, an innovative decoder head that processes the fused features to accurately localize and classify 3D objects.

III-A Architectural Design & Feature Encoding

The MambaBEV system architecture, illustrated in Figure 1, integrates four essential components to process input from six RGB cameras. Initially, inputs are handled by the image feature encoder, which extracts high-level features from each image using one of several backbones: ResNet-50 pretrained on ImageNet, or ResNet-101-DCN or VoV-99 initialized from the FCOS3D checkpoint. An alternative backbone, VMamba[21], can also be employed. The extracted features are then enhanced using a Feature Pyramid Network (FPN), generating multi-scale features crucial for detecting objects at various scales.

These multi-scale feature maps are subsequently processed by the Spatial Cross Attention (SCA) module, producing a unified Bird’s Eye View (BEV) feature map. The TemporalMamba module enriches this integration by fusing historical and current BEV features, thereby enhancing the feature context for accurate object detection. The enriched features undergo further refinement through several processing layers before the Mamba-based-DETR head analyzes them for final object detection.

III-B TemporalMamba block

Traditional temporal fusion strategies for BEV-based 3D object detection rely on deformable self-attention, which dynamically samples spatial features to aggregate historical and current BEV features. The Temporal Self-Attention (TSA) module, for instance, operates as follows: Given historical BEV feature maps and current feature maps, TSA concatenates them and uses a linear layer to generate attention weights and offsets. Each query, representing BEV features, is then calculated based on these weights in parallel.

Experimental results show that this paradigm has limitations. Table I lists per-category results, and it is notable that the deformable self-attention module performs much better on small objects such as bicycles and pedestrians than on large objects such as buses and construction vehicles. Other deformable-attention-based models show the same pattern. The reason is that the mechanism allows only three queries to interact with each reference query, which prevents cross-frame global interaction of large-object features.

Our approach uses Mamba2 to increase global interaction capability. First, features from the previous frame are transformed using the ego rotation angle. As shown in Figure 2, given the transformed historical BEV feature map and the current feature map (both with a feature dimension of 256), we concatenate them along the third dimension. The concatenated features are then compressed from 512 to 256 dimensions using a convolutional block. This convolutional block consists of two parallel sub-modules: a 3×3 down-sampling convolutional layer that preserves important features while reducing dimensionality, and a pointwise convolutional layer that lowers the dimensionality and introduces non-linearity to learn complex patterns. To mitigate internal covariate shift, batch normalization is applied after each convolutional layer. The outputs of both sub-modules are concatenated and passed through a non-linear activation function, followed by a linear layer and layer normalization. The process can be written as follows:

\begin{gathered}F_{c}=F+F_{history}\\ F_{0}=\mathrm{BN}(\mathrm{Conv}_{1}(F_{c}))\\ F_{1}=\mathrm{BN}(\mathrm{Conv}(\mathrm{BN}(\mathrm{Conv}_{3}(F_{c}))))\\ Z=\mathrm{LN}(\mathrm{Linear}(F_{1}))\end{gathered}

(1)
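As a rough illustration of the channel compression step described in the text, the sketch below concatenates a current and a historical BEV map along the channel axis and applies a pointwise (1×1) convolution followed by a batch-norm-style normalization. It covers only the pointwise branch; the function name `pointwise_fuse`, the random weights, and the normalization without learned scale and shift are assumptions made for this example rather than details from the paper.

```python
import numpy as np

def pointwise_fuse(cur, hist, weight, eps=1e-5):
    """Fuse two BEV maps with a 1x1 convolution and per-channel normalization.

    cur, hist: (C, H, W) current and ego-aligned historical BEV features
    weight:    (C_out, 2C) pointwise convolution kernel
    Returns an array of shape (C_out, H, W).
    """
    x = np.concatenate([cur, hist], axis=0)       # (2C, H, W)
    y = np.einsum("oc,chw->ohw", weight, x)       # 1x1 conv = per-pixel matmul
    mean = y.mean(axis=(1, 2), keepdims=True)     # statistics per output channel
    var = y.var(axis=(1, 2), keepdims=True)
    return (y - mean) / np.sqrt(var + eps)

# Sizes follow the tiny configuration in the text: 256 channels, 50 x 50 grid.
rng = np.random.default_rng(0)
cur = rng.normal(size=(256, 50, 50))
hist = rng.normal(size=(256, 50, 50))
weight = rng.normal(size=(256, 512)) / np.sqrt(512)   # 512 -> 256 channels
fused = pointwise_fuse(cur, hist, weight)
```

A 1×1 convolution mixes channels at each BEV cell independently, so any spatial context in the fusion block has to come from the 3×3 branch.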

Next, we discretely rearrange Z and process it through the Mamba2 block. The Mamba2 block, originally designed for natural language processing and sequence processing, faces significant challenges when applied to vision-like data. Therefore, designing an appropriate discrete rearrangement method is crucial. We propose a four-direction rearrangement method, based on experimental results and inspired by VMamba[21]. The impact of different rearrangement methods is discussed in the ablation study.

We propose a multi-directional feature sequence scanning mechanism in which the feature map Z is discretely serialized and then recombined in four directions: forward-left, forward-upward, reverse-left, and reverse-upward, as shown in Figure 3. Note that we do not adopt a serpentine (snake-like) recombination, as we believe it unbalances the interaction between adjacent features: some adjacent features end up close together in the sequence while others are far apart. After this query rearrangement, each sequence passes through a linear layer that projects it to the dimensionality expected by the Mamba2 block, and is then fed into the Mamba2 model. The Mamba2 model outputs enhanced sequence features that incorporate and interact with global features, which increases global awareness of the BEV space and aggregates cross-frame features. The sequences are then recombined and restored to their original order, as shown in Figure 4. We average the four resulting tensors, and the enhanced fused BEV feature map is added to the current BEV feature map as a skip connection, with a dropout rate of 0.9, to avoid overfitting and reduce co-adaptation of neurons.
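The serialization and restoration steps can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: the four directions are interpreted here as row-major, column-major, and their reversals, the helper names `rearrange_four` and `restore_four` are hypothetical, and the Mamba2 block is replaced by the identity so that the round trip can be checked.

```python
import numpy as np

def rearrange_four(z):
    """Serialize a BEV map z of shape (H, W, C) into four (H*W, C) sequences."""
    H, W, C = z.shape
    row_major = z.reshape(H * W, C)                      # scan row by row
    col_major = z.transpose(1, 0, 2).reshape(H * W, C)   # scan column by column
    # The reversed sequences traverse the same paths in the opposite direction.
    return [row_major, col_major, row_major[::-1], col_major[::-1]]

def restore_four(seqs, H, W):
    """Undo each scan order and average the four restored (H, W, C) maps."""
    row_f, col_f, row_r, col_r = seqs
    C = row_f.shape[-1]
    maps = [
        row_f.reshape(H, W, C),
        col_f.reshape(W, H, C).transpose(1, 0, 2),
        row_r[::-1].reshape(H, W, C),
        col_r[::-1].reshape(W, H, C).transpose(1, 0, 2),
    ]
    return np.mean(maps, axis=0)

# Round trip on a 50 x 50 grid with 256 channels; a real model would run
# each sequence through a Mamba2 block between the two calls.
z = np.random.default_rng(0).normal(size=(50, 50, 256))
restored = restore_four(rearrange_four(z), 50, 50)
```

Because raster scans are used instead of a serpentine path, horizontally adjacent cells are always one step apart in the row-major sequence and vertically adjacent cells are always one step apart in the column-major sequence.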

III-C Mamba-based-DETR head

As depicted in Figure 1, we redesign the DETR head by integrating the Mamba2 architecture with the traditional DETR encoder, and name it Mamba-based-DETR. Initially, 900 object queries undergo preprocessing and interact within the Mamba2 block, which plays a role similar to self-attention. The outputs of the Mamba2 block are then processed with deformable attention, as in the traditional deformable attention module.

IV Experiment

IV-A Datasets

We evaluate our method on the nuScenes[24] dataset for 3D object detection and the challenging nuPlan dataset for end-to-end planning. The nuScenes dataset is a large-scale dataset comprising 1,000 driving scenes collected in different cities, with a total of 280,000 annotated frames, designed for autonomous driving research with 3D object annotations. The nuScenes Detection Score (NDS) is the official metric for model performance, defined as:

NDS=\frac{1}{10}\left[5\cdot mAP+\sum_{mTP\in TP}\left(1-\min(1,mTP)\right)\right]

(2)

Here, mTP represents the average metrics of Average Translation Error, Average Scale Error, Average Orientation Error, Average Velocity Error, and Average Attribute Error. This comprehensive approach ensures that all critical aspects of object detection and tracking are considered, leading to a more robust evaluation of the autonomous driving system’s capabilities. For planning paradigms, displacement error and collision rate are used to validate planning performance.
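As a worked example of Eq. (2), the short sketch below computes NDS from mAP and the five true-positive error metrics. The function name `nds` is an assumption for this illustration; the inputs are the MambaBEV-B (R101) and FCOS3D rows of Table III, in the order mATE, mASE, mAOE, mAVE, mAAE.

```python
def nds(mAP, tp_errors):
    """nuScenes Detection Score from mAP and the five mean TP errors."""
    # Each error is clipped at 1 and converted to a score in [0, 1].
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    return (5.0 * mAP + sum(tp_scores)) / 10.0

# MambaBEV-B (R101): the result rounds to the reported NDS of 0.508.
score_mambabev = nds(0.410, [0.688, 0.275, 0.375, 0.432, 0.203])

# FCOS3D: mAVE = 1.315 exceeds 1, so its velocity term contributes 0.
score_fcos3d = nds(0.295, [0.806, 0.268, 0.511, 1.315, 0.170])
```

The clipping means that a velocity error above 1 m/s cannot reduce the score further, which is why single-frame methods with large mAVE still obtain a moderate NDS.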

IV-B Experimental settings

We produce two versions of MambaBEV, named MambaBEV-tiny and MambaBEV-base. For MambaBEV-tiny, the ResNet50 model pre-trained on ImageNet is employed as the backbone. The BEV grid size in the tiny version is 50 × 50, with a grid resolution of 2.048 meters. The input image size is 800 × 450, and three historical frames are used. The maximum perception distance is 51.2 meters in both the X and Y directions. The BEV encoder in the tiny version consists of three layers.

For MambaBEV-base, the ResNet101-DCN backbone initialized from the FCOS3D checkpoint is utilized. The BEV grid size in the base version is 200 × 200, with a grid resolution of 0.512 meters. The input image size is 900 × 1600, and four historical frames are used. Similar to DETR3D[14], the base version of the BEV feature encoder consists of six layers. An alternative backbone, VoV-99 initialized from the FCOS3D checkpoint, is also provided.
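The relationship between grid size, grid resolution, and perception range in the two configurations can be checked with a few lines of arithmetic. The helper names below are hypothetical; the sketch assumes a square grid centered on the ego vehicle.

```python
def bev_extent(cells, resolution_m):
    """Return (side length, half range) in meters of a square BEV grid."""
    side = cells * resolution_m
    return side, side / 2.0

def sequence_length(cells):
    """Number of BEV queries when the grid is flattened into one sequence."""
    return cells * cells

tiny_side, tiny_range = bev_extent(50, 2.048)    # about 102.4 m and 51.2 m
base_side, base_range = bev_extent(200, 0.512)   # about 102.4 m and 51.2 m
tiny_len = sequence_length(50)                   # 2500 queries
base_len = sequence_length(200)                  # 40000 queries
```

Both versions therefore cover the same area, while the base version represents it with a 16 times longer query sequence, which is the regime where a sequence model with linear complexity is most useful.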

During training, the AdamW optimizer is employed with a base learning rate of 2e-4 and a weight decay of 0.01 for both versions. The MambaBEV-tiny version is trained on eight Tesla V100S-PCIE-32GB GPUs, while the MambaBEV-base version is trained on eight NVIDIA A800-SXM4-80GB GPUs. The dimension of the BEV query is 256, and the batch size per GPU is 1. No data augmentation is applied during training.

For further analysis, we also build an end-to-end autonomous driving model to evaluate the effectiveness of the dense BEV-feature perception paradigm based on MambaBEV. In this model, MambaBEV serves as the BEV feature generation backbone. The model is trained for 60 epochs using eight NVIDIA A800-SXM4-80GB GPUs with a learning rate of 0.0001. ResNet50 is used as the backbone, and the input image size is 1280 × 720. The number of map queries is set to 100 × 20, and the number of agent queries is 300. All other training details remain the same as those for MambaBEV-base.

IV-C Main results

3D object detection task. We evaluate our model on the nuScenes validation set for the 3D object detection task. As shown in Table III, our MambaBEV-base outperforms BEVFormer-S by 3.51% in mean Average Precision (mAP) and 5.97% in the nuScenes Detection Score (NDS). BEVFormer-S uses Spatial Cross Attention (SCA) as the backward projection method, similar to our approach, but processes a single frame without any temporal fusion technique. This improvement highlights the effectiveness of our TemporalMamba block. Furthermore, the average velocity error decreases by 37% when the TemporalMamba block is added, demonstrating that the historical information processed by the TemporalMamba block significantly improves velocity estimation.

Compared to other methods leveraging temporal information, the Mamba block shows superior performance. For instance, our method achieves a 4.51% improvement in mAP and a 6.37% improvement in NDS over PolarDETR. Additionally, our method exhibits the lowest mean Average Velocity Error (mAVE) of 0.432, further validating its performance in velocity estimation.

TABLE II: Average precision for large objects (tiny version)

Methods	category	dist0.5 $\uparrow$	dist1.0 $\uparrow$	dist2.0 $\uparrow$	dist4.0 $\uparrow$
deformable	truck(%)	0.19	7.00	25.87	44.03
ours	truck(%)	0.45	7.73	26.97	46.61
deformable	bus(%)	0.0	5.8	32.03	56.19
ours	bus(%)	0.5	7.27	37.05	60.95

TABLE III: 3D object detection results on nuScenes val. set

method	backbone	image size	NDS $\uparrow$	mAP $\uparrow$	mATE $\downarrow$	mASE $\downarrow$	mAOE $\downarrow$	mAVE $\downarrow$	mAAE $\downarrow$
BEVFormer-T	R-50	450*800	0.354	0.252	0.900	0.294	0.655	0.657	0.216
MambaBEV-T	R-50	450*800	0.368	0.262	0.881	0.2909	0.5998	0.6366	0.2209
CenterNet[25]	DLA	-	0.328	0.306	0.716	0.264	0.609	1.426	0.658
FCOS3D[26]	R101	1600*900	0.372	0.295	0.806	0.268	0.511	1.315	0.170
PGD[27]	R101	1600*900	0.409	0.335	0.732	0.263	0.423	1.285	0.172
DETR3D	R101	1600*900	0.425	0.346	0.773	0.268	0.383	0.842	0.216
BEVDet	R101	1056*384	0.384	0.317	0.704	0.273	0.531	0.940	0.250
PolarDETR[28]	R101	1600*900	0.444	0.365	0.742	0.269	0.350	0.829	0.197
PolarDETR-T	R101	1600*900	0.488	0.383	0.707	0.269	0.344	0.518	0.196
PETR[29]	R101	1600*900	0.442	0.370	0.711	0.267	0.383	0.865	0.201
UVTR[30]	R101	1600*900	0.483	0.379	0.731	0.267	0.350	0.510	0.200
EPro-PnP-Detv2[31]	R101	-	0.490	0.423	0.547	0.236	0.302	1.071	0.123
DD3D[32]	-	-	0.477	0.418	0.572	0.249	0.368	1.014	0.124
BEVFormer-s	R101	-	0.448	0.375	0.725	0.272	0.391	0.802	0.200
Ego3RT[33]	R101	-	0.473	0.425	0.550	0.264	0.433	1.014	0.145
TempBEV[34]	-	-	0.508	0.408	-	-	-	-	-
MambaBEV-B	R101	1600*900	0.508	0.410	0.688	0.275	0.375	0.432	0.203
MambaBEV-B	V2-99	1600*900	0.517	0.427	0.669	0.265	0.365	0.468	0.193

TABLE IV: Open-loop planning performance

Method	1s(L2) $\downarrow$	2s(L2) $\downarrow$	3s(L2) $\downarrow$	Avg $\downarrow$	1s(Col) $\downarrow$	2s(Col) $\downarrow$	3s(Col) $\downarrow$	Avg $\downarrow$
NMP	-	-	2.31	-	-	-	1.92	-
ST-P3	1.33	2.11	2.90	2.11	0.23	0.62	1.27	0.71
ours	1.03	1.76	2.53	1.77	0.25	0.84	1.72	0.93

Table II presents the average precision for large objects, comparing deformable attention-based methods with our approach. We find that our model performs better in detecting large objects, such as trucks and buses, demonstrating its ability to facilitate global feature interaction and enhance global awareness within the bird’s-eye-view space.

Additionally, we observe that increasing the resolution of the bird’s-eye-view grid and incorporating additional frames significantly improve the model’s performance. This is evidenced by the substantial performance gains of the base version over the tiny version; further analysis is given in the Ablation Study.

TABLE V: Motion forecasting

Method	minADE $\downarrow$	minFDE $\downarrow$	MR $\downarrow$	EPA $\uparrow$
PnPNet	1.15	1.95	0.226	0.222
ViP3D	2.05	2.84	0.246	0.226
Traditional	2.06	3.02	0.277	0.209
Constant Pos.	5.80	10.27	0.347	-
Constant Vel.	2.13	4.01	0.318	-
Ours	0.84	1.203	0.1478	0.546

End-to-end autonomous driving paradigms. Furthermore, we test our backbone in an end-to-end autonomous driving paradigm, with results presented in Table IV and Table V. The paradigm using our method demonstrates strong performance in open-loop evaluation on the nuScenes dataset. We assess the model in two key aspects: motion forecasting and planning. Planning performance is validated using L2 error and collision rate metrics. The results indicate that the end-to-end autonomous driving paradigm utilizing our method outperforms LiDAR-based methods in some instances. Motion forecasting performance is reported in Table V.

V Ablation study

V-A Design of conv block

To further analyze the effectiveness of the basic convolutional block, a fair comparison is made between the method that uses concatenation and the one that uses the convolutional methods within the TemporalMamba block. All methods in this study employ four-direction discrete rearrangement techniques. The concatenation method refers to the simple concatenation of historical BEV features with current BEV features along the vector dimension. As illustrated in Table VI, the use of a convolutional block to fuse historical and current BEV features effectively improves model performance by approximately 1%. Furthermore, it is evident that the convolutional block reduces the average velocity error, which is a crucial metric in the historical fusion module.

TABLE VI: Comparison between concatenation and convolution methods

Ways	backbone	NDS $\uparrow$	mAP $\uparrow$	mATE $\downarrow$	mAVE $\downarrow$	mAAE $\downarrow$
Concat	R101	0.4936	0.399	0.696	0.513	0.213
Conv	R101	0.508	0.410	0.688	0.432	0.203
Concat	R50	0.3425	0.2518	0.9035	0.7612	0.2268
Conv	R50	0.3682	0.2622	0.8810	0.6366	0.2209

The improvement observed can be attributed to several factors. One key reason is the distinct types of features present in typical driving scenarios: dynamic (moving) features and static features. Concatenation-based fusion methods may fail to appropriately account for the interaction between these two feature types. Specifically, ego-motion and moving features can create discrepancies when fused without proper consideration of their spatial context. By employing a suitable receptive field, these issues can be mitigated, ensuring that related features are processed within the same receptive field.

TABLE VII: Impact of different window sizes

Window size	backbone	NDS $\uparrow$	mAP $\uparrow$	mATE $\downarrow$	mAVE $\downarrow$
5	R50	0.3642	0.2514	0.8844	0.6480
3	R50	0.3682	0.2622	0.8810	0.6366

As shown in Table VII, experiments indicate that a kernel size of 3 is optimal for feature fusion. Testing with different window sizes reveals that increasing the kernel size to 5 led to a decline in performance metrics. This suggests that overly large receptive fields may introduce noise, particularly when handling the distinct dynamic and static features found in typical driving environments.

V-B Impact of different discrete rearrangement methods

Mamba2 was originally designed for sequence modeling tasks such as natural language processing. For 3D object detection, however, the BEV feature map must be flattened or discretized before it can be processed by the Mamba block. Therefore, selecting a suitable discretization strategy is crucial. Motivated by [21], we compare two discretization methods: a one-direction discretization and a four-direction discretization approach.

TABLE VIII: Comparison between different rearrangement methods

Ways	backbone	NDS $\uparrow$	mAP $\uparrow$	mATE $\downarrow$	mAOE $\downarrow$	mAVE $\downarrow$
Single	R50	0.3323	0.2390	0.9100	0.6596	0.8040
Four	R50	0.3425	0.2518	0.9035	0.6545	0.7612

As demonstrated in Table VIII, the four-direction discretization method improves model performance. In the one-direction approach, the BEV queries (50×50) are discretized into a single long sequence. While all methods have positional embeddings and implicitly encoded CAN bus information, the four-direction discretization method outperforms the one-direction method by 1.02% in NDS and 1.28% in mAP. This result supports the hypothesis that grouping more related queries closely enhances the Mamba block’s ability to capture inter-query relationships. However, increasing the number of directions in the discretization method leads to higher parameter counts and computational complexity. Our experiments suggest that four directions offer a good trade-off. In this experiment, all models use the concatenation method.

V-C Impact of improving resolution of BEV features

To investigate the impact of resolution on model performance, the BEV grid of MambaBEV-tiny is increased from 50 × 50 to 100 × 100. As shown in Table IX, improving the resolution significantly enhances the model’s performance across all metrics. In this experiment, the convolutional block is employed instead of the concatenation strategy.

TABLE IX: Impact of different resolution of BEV features

Resolution	NDS $\uparrow$	mAP $\uparrow$	mATE $\downarrow$	mAVE $\downarrow$	mAAE $\downarrow$
$100\times 100$	0.4180	0.2951	0.7837	0.4732	0.1982
$50\times 50$	0.3425	0.2518	0.9035	0.7612	0.2268

V-D Visualization of BEV features

To further analyze the effectiveness of the TemporalMamba block, BEV features are visualized after passing through the TemporalMamba block. L2 norm is used to generate heatmaps for comparison. As illustrated in Figure 5, it is evident that the model learns the object features effectively. Additionally, a comparative analysis is conducted between the TemporalMamba block and the deformable attention module. The left side of the figure illustrates the bird’s-eye view (BEV) features generated by the Mamba-based module, while the right side displays those produced by the deformable-based module. It is evident that the BEV features from the Mamba-based module exhibit superior global awareness, which intuitively accounts for its enhanced performance in detecting large objects. Further comparisons across multiple frames are presented in Figure 6.
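A heatmap of this kind can be produced by taking the L2 norm over the channel axis at each BEV cell. The sketch below is a minimal illustration: the function name `bev_heatmap`, the (C, H, W) layout, and the min-max normalization to [0, 1] are assumptions for this example, and a random array stands in for real BEV features.

```python
import numpy as np

def bev_heatmap(feat):
    """Collapse a (C, H, W) BEV feature map into an (H, W) heatmap in [0, 1]."""
    norm = np.linalg.norm(feat, axis=0)      # L2 norm over channels per cell
    lo, hi = norm.min(), norm.max()
    return (norm - lo) / (hi - lo + 1e-12)   # min-max scaling for display

# Placeholder features with the tiny configuration's shape.
feat = np.random.default_rng(0).normal(size=(256, 50, 50))
heat = bev_heatmap(feat)
```

The resulting array can be passed to any image display routine; cells with large feature magnitude, which typically correspond to objects, appear brighter.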

VI CONCLUSIONS

This paper introduces MambaBEV, an efficient 3D detection model, which we believe is the first to integrate Mamba2 into a camera-based detection model. We design the TemporalMamba block to effectively fuse temporal information and enhance global awareness. Extensive experiments on the nuScenes dataset demonstrate the effectiveness and efficiency of our proposed methods, especially in improving the precision of large-object detection. Additionally, we adopt an end-to-end self-driving paradigm to further assess the model’s performance, with good results. This work highlights the feasibility and potential of state-space models for autonomous driving perception systems, and provides a solution for improving precision in large-object detection.

Limitation: Our method could significantly reduce the computational cost compared to the global attention based method, but it is still slightly higher than the deformable attention based method. In the future, we will develop a solution in this aspect.

References

[1] Sunil Kumar, Manisha Jailia, and Sudeep Varshney. An efficient approach for highway lane detection based on the hough transform and kalman filter. Innovative infrastructure solutions, 7(5):290, 2022.
[2] Yunshuang Yuan, Hao Cheng, and Monika Sester. Keypoints-based deep feature fusion for cooperative vehicle detection of autonomous driving. IEEE Robotics and Automation Letters, 7(2):3054–3061, 2022.
[3] Laiquan Han, Yuan Jiang, Yongjun Qi, Khuder Altangerel, et al. Monocular visual obstacle avoidance method for autonomous vehicles based on yolov5 in multi lane scenes. Alexandria Engineering Journal, 109:497–507, 2024.
[4] Ge Yang and Yuting Liao. An improved binocular stereo matching algorithm based on aanet. Multimedia Tools and Applications, 82(26):40987–41003, 2023.
[5] Junjie Huang, Guan Huang, Zheng Zhu, Yun Ye, and Dalong Du. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv preprint arXiv:2112.11790, 2021.
[6] Z Li, W Wang, H Li, E Xie, C Sima, T Lu, Q Yu, and J Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. arXiv preprint arXiv:2203.17270, 2022.
[7] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.
[8] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
[9] Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060, 2024.
[10] Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023.
[11] Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, pages 194–210. Springer, 2020.
[12] Yunpeng Zhang, Zheng Zhu, Wenzhao Zheng, Junjie Huang, Guan Huang, Jie Zhou, and Jiwen Lu. Beverse: Unified perception and prediction in birds-eye-view for vision-centric autonomous driving. arXiv preprint arXiv:2205.09743, 2022.
[13] Yinhao Li, Zheng Ge, Guanyi Yu, Jinrong Yang, Zengran Wang, Yukang Shi, Jianjian Sun, and Zeming Li. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 1477–1485, 2023.
[14] Yue Wang, Vitor Campagnolo Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In Conference on Robot Learning, pages 180–191. PMLR, 2022.
[15] Jinhyung Park, Chenfeng Xu, Shijia Yang, Kurt Keutzer, Kris Kitani, Masayoshi Tomizuka, and Wei Zhan. Time will tell: New outlooks and a baseline for temporal multi-view 3d object detection. arXiv preprint arXiv:2210.02443, 2022.
[16] Chunrui Han, Jinrong Yang, Jianjian Sun, Zheng Ge, Runpei Dong, Hongyu Zhou, Weixin Mao, Yuang Peng, and Xiangyu Zhang. Exploring recurrent long-term temporal fusion for multi-view 3d perception. IEEE Robotics and Automation Letters, 2024.
[17] Zhiqi Li, Zhiding Yu, Wenhai Wang, Anima Anandkumar, Tong Lu, and Jose M Alvarez. Fb-bev: Bev representation from forward-backward view transformations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6919–6928, 2023.
[18] Zhiqi Li, Zhiding Yu, David Austin, Mingsheng Fang, Shiyi Lan, Jan Kautz, and Jose M Alvarez. Fb-occ: 3d occupancy prediction based on forward-backward view transformation. arXiv preprint arXiv:2307.01492, 2023.
[19] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[20] Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. Vision mamba: Efficient visual representation learning with bidirectional state space model, 2024.
[21] Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Jianbin Jiao, and Yunfan Liu. Vmamba: Visual state space model. Advances in neural information processing systems, 37:103031–103063, 2025.
[22] Guowen Zhang, Lue Fan, Chenhang He, Zhen Lei, Zhaoxiang Zhang, and Lei Zhang. Voxel mamba: Group-free state space models for point cloud based 3d object detection. arXiv preprint arXiv:2406.10700, 2024.
[23] Arkaprava Sinha, Monish Soundar Raj, Pu Wang, Ahmed Helmy, and Srijan Das. Ms-temba: Multi-scale temporal mamba for efficient temporal action detection. arXiv preprint arXiv:2501.06138, 2025.
[24] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020.
[25] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6569–6578, 2019.
[26] Tai Wang, Xinge Zhu, Jiangmiao Pang, and Dahua Lin. Fcos3d: Fully convolutional one-stage monocular 3d object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 913–922, 2021.
[27] Tai Wang, Xinge Zhu, Jiangmiao Pang, and Dahua Lin. Probabilistic and geometric depth: Detecting objects in perspective. In Conference on Robot Learning, pages 1475–1485. PMLR, 2022.
[28] Shaoyu Chen, Xinggang Wang, Tianheng Cheng, Qian Zhang, Chang Huang, and Wenyu Liu. Polar parametrization for vision-based surround-view 3d detection. arXiv preprint arXiv:2206.10965, 2022.
[29] Yingfei Liu, Tiancai Wang, Xiangyu Zhang, and Jian Sun. Petr: Position embedding transformation for multi-view 3d object detection. In European conference on computer vision, pages 531–548. Springer, 2022.
[30] Yanwei Li, Yilun Chen, Xiaojuan Qi, Zeming Li, Jian Sun, and Jiaya Jia. Unifying voxel-based representation with transformer for 3d object detection. Advances in Neural Information Processing Systems, 35:18442–18455, 2022.
[31] Hansheng Chen, Pichao Wang, Fan Wang, Wei Tian, Lu Xiong, and Hao Li. Epro-pnp: Generalized end-to-end probabilistic perspective-n-points for monocular object pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2781–2790, 2022.
[32] Dennis Park, Rares Ambrus, Vitor Guizilini, Jie Li, and Adrien Gaidon. Is pseudo-lidar needed for monocular 3d object detection? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3142–3152, 2021.
[33] Jiachen Lu, Zheyuan Zhou, Xiatian Zhu, Hang Xu, and Li Zhang. Learning ego 3d representation as ray tracing. In European Conference on Computer Vision, pages 129–144. Springer, 2022.
[34] Thomas Monninger, Vandana Dokkadi, Md Zafar Anwar, and Steffen Staab. Tempbev: Improving learned bev encoders with combined image and bev space temporal aggregation. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 9668–9675. IEEE, 2024.