Autonomous driving paper index

Multimodal vehicle trajectory prediction method based on visual perception information

2025-03-25 · Proceedings of the Institution of Mechanical Engineers, Part D: Journal of Automobile Engineering

autonomous driving · bev · trajectory prediction · nuscenes · waymo · perception · prediction

One-line summary

This paper proposes a multimodal vehicle trajectory prediction model based on visual perception information (VP-MTP).

Engineering notes

Experiments on the Waymo motion and nuScenes datasets show that, compared with existing baseline models, VP-MTP improves minimum Average Displacement Error (minADE) and minimum Final Displacement Error (minFDE) by an average of 12.4% and 9.9% on Waymo, and by 9.3% and 10.0% on nuScenes.
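For readers unfamiliar with the two metrics, the following is a minimal sketch of how minADE and minFDE are commonly computed over K candidate trajectories, plus the relative-improvement arithmetic behind figures such as 12.4%. The function names, the toy trajectories, and the choice to minimise each metric independently over the candidates are assumptions made for illustration; this entry does not state the paper's exact evaluation protocol (number of modes, prediction horizon).

```python
import math


def displacement_errors(pred, truth):
    """Per-timestep Euclidean distance between one prediction and the ground truth."""
    return [math.dist(p, t) for p, t in zip(pred, truth)]


def min_ade_fde(candidates, truth):
    """Return (minADE, minFDE) over K candidate trajectories.

    Assumption: each metric is minimised independently over the candidates.
    Some benchmarks instead report both metrics for a single best candidate.
    """
    ades, fdes = [], []
    for pred in candidates:
        if len(pred) != len(truth):
            raise ValueError("prediction and ground truth must have the same length")
        errs = displacement_errors(pred, truth)
        ades.append(sum(errs) / len(errs))  # average error over the horizon
        fdes.append(errs[-1])               # error at the final timestep
    return min(ades), min(fdes)


def relative_improvement(baseline, ours):
    """Percentage reduction of an error metric relative to a baseline value."""
    return 100.0 * (baseline - ours) / baseline


# Toy data (hypothetical, not taken from the paper): two candidate modes.
truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
candidates = [
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],
    [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0)],
]
min_ade, min_fde = min_ade_fde(candidates, truth)
```

Under this convention, a baseline minADE of 1.000 m reduced to 0.876 m corresponds to `relative_improvement(1.0, 0.876)`, roughly 12.4%.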

Original abstract

Surrounding vehicle trajectory prediction is a crucial component of autonomous driving. Currently, trajectory prediction research relies primarily on publicly available datasets processed by perception methods rather than raw sensor perception information. With the increasing emphasis on visual perception, the integration of the visual perception trajectory prediction pathway will be highly important for the application of prediction algorithms. This paper proposes a multimodal vehicle trajectory prediction model based on visual perception information (VP-MTP). First, a vehicle detection network is employed to obtain the position coordinates of vehicles in consecutive frame bird’s eye view (BEV) images. Subsequently, the discrete position coordinates are processed into complete vehicle historical trajectories through a processing block that includes affine coordinate transformation, vehicle tracking, and trajectory smoothing (ATS). To address the high computational complexity of the standard Transformer, the input sequence is decomposed in the time dimension. Additionally, layer normalization positions are adjusted, convolutional feed-forward layers are introduced, and hierarchical encoding is employed to enhance feature extraction capability and encoding efficiency. Thus, a hierarchical Transformer encoder based on convolutional feed-forward with time decomposition attention (HT-CTA) is constructed. Considering the large workload and limited adaptability of clustering-based multimodal training strategies in complex scenarios, learnable anchor embedding features are introduced as model parameters to establish a multimodal trajectory decoder. Finally, experiments on the Waymo motion and nuScenes datasets demonstrate that, compared to existing baseline models, the VP-MTP achieves average improvements of 12.4% and 9.9% in minimum Average Displacement Error (minADE) and minimum Final Displacement Error (minFDE) on the Waymo dataset, and 9.3% and 10.0% on the nuScenes dataset, respectively. This enhancement provides higher prediction accuracy and good multimodality, achieving multimodal trajectory prediction based on raw visual perception information.
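The ATS block described in the abstract (affine coordinate transformation, vehicle tracking, trajectory smoothing) can be pictured with a rough sketch like the one below. It is not the authors' implementation: the pixel scale, the image origin, the greedy nearest-neighbour tracker, and the moving-average filter are all stand-in assumptions, since this entry does not specify which tracker or smoothing method the paper uses.

```python
import math


def affine_to_metric(px, py, scale=0.1, origin=(100.0, 100.0)):
    """Map a BEV pixel to ego-centred metres.

    `scale` (metres per pixel) and `origin` (ego pixel position) are hypothetical.
    """
    x = (px - origin[0]) * scale
    y = (origin[1] - py) * scale  # image rows grow downward, so flip the y axis
    return (x, y)


def track_nearest(frames, max_jump=3.0):
    """Greedy nearest-neighbour association of per-frame detections into tracks.

    A stand-in for the paper's tracker; `max_jump` (metres) is an assumed gate.
    """
    tracks = [[p] for p in frames[0]]
    for detections in frames[1:]:
        unused = list(detections)
        for track in tracks:
            if not unused:
                break
            nearest = min(unused, key=lambda p: math.dist(p, track[-1]))
            if math.dist(nearest, track[-1]) <= max_jump:
                track.append(nearest)
                unused.remove(nearest)
        tracks.extend([p] for p in unused)  # unmatched detections start new tracks
    return tracks


def smooth(track, window=3):
    """Centred moving average; the window shrinks at the ends of the track."""
    half = window // 2
    out = []
    for i in range(len(track)):
        seg = track[max(0, i - half): i + half + 1]
        out.append((sum(p[0] for p in seg) / len(seg),
                    sum(p[1] for p in seg) / len(seg)))
    return out


# Toy detections in metres for three consecutive frames (hypothetical data).
frames = [
    [(0.0, 0.0), (10.0, 0.0)],
    [(10.5, 0.0), (1.0, 0.0)],
    [(2.0, 0.0), (11.0, 0.0)],
]
tracks = track_nearest(frames)
histories = [smooth(t) for t in tracks]
```

The resulting `histories` are the kind of per-vehicle historical trajectories that a downstream encoder would consume. A production pipeline would more likely use a calibrated homography, a Hungarian-style assignment, and a Kalman or spline smoother.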

5.5 Engineering value

7.0 Research novelty

6.0 Business relevance

Links and sources

Official / arXiv page
