Autonomous driving technology has the potential to transform transportation, but its widespread adoption depends on interpretable and transparent decision-making systems. Scene captioning, which generates natural language descriptions of the driving environment, plays a crucial role in enhancing transparency, safety, and human-AI interaction. We introduce BEV-LLM, a lightweight model for 3D captioning of autonomous driving scenes. BEV-LLM leverages BEVFusion to combine 3D LiDAR point clouds and multi-view images, and incorporates a novel absolute positional encoding for view-specific scene descriptions.
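To make the view-specific conditioning concrete, the sketch below shows one way such an absolute positional encoding could be attached to fused BEV features before they are flattened into tokens for the language model. This is a minimal illustration under assumed shapes and names: `ViewPositionalEncoding`, `num_views`, the (B, C, H, W) feature layout, and the CAM_FRONT index are all assumptions, not the released BEV-LLM API.

```python
import torch
import torch.nn as nn


class ViewPositionalEncoding(nn.Module):
    """Illustrative absolute positional encoding for view-specific captioning.

    Adds a learned embedding for the camera view being described to the
    fused BEV feature map, so the downstream language model can condition
    its caption on one viewpoint. Shapes and names are assumptions, not
    the released BEV-LLM implementation.
    """

    def __init__(self, num_views: int = 6, channels: int = 256):
        super().__init__()
        # One learned embedding vector per camera view (nuScenes has 6).
        self.view_embed = nn.Embedding(num_views, channels)

    def forward(self, bev_feats: torch.Tensor, view_idx: torch.Tensor) -> torch.Tensor:
        # bev_feats: (B, C, H, W) fused LiDAR + camera BEV features.
        # view_idx:  (B,) index of the view the caption should describe.
        pos = self.view_embed(view_idx)   # (B, C)
        pos = pos[:, :, None, None]       # broadcast over the H, W grid
        return bev_feats + pos


# Usage: tag fused BEV features with the view to caption before handing
# them to the language model (view 0 = CAM_FRONT is an assumed convention).
enc = ViewPositionalEncoding(num_views=6, channels=256)
bev = torch.randn(2, 256, 180, 180)       # dummy stand-in for BEVFusion output
tagged = enc(bev, torch.tensor([0, 0]))
```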
Despite using a small 1B-parameter base model, BEV-LLM achieves competitive performance on the nuCaption dataset, surpassing the state of the art by up to 5% in BLEU score. In addition, we release two new datasets, nuView (focused on environmental conditions and viewpoints) and GroundView (focused on object grounding), to better assess scene captioning across diverse driving scenarios and to address gaps in current benchmarks, together with initial benchmarking results demonstrating their effectiveness.
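For reference, BLEU comparisons like the one above are typically computed as corpus-level BLEU of generated captions against ground-truth annotations. A minimal sketch with sacreBLEU follows; the caption strings here are made up, and the paper's exact evaluation protocol may differ.

```python
import sacrebleu  # pip install sacrebleu

# Made-up generated captions and nuCaption-style references; the real
# evaluation scores model outputs against the dataset's ground truth.
hypotheses = ["a truck is parked on the right side of the road"]
references = [["a truck is parked to the right of the ego vehicle"]]

# corpus_bleu takes the hypotheses and a list of reference streams,
# each aligned one-to-one with the hypotheses.
score = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {score.score:.2f}")
```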
Datasets for Download
nuView
GroundView
Important Links:
Paper (IEEE Xplore): https://ieeexplore.ieee.org/document/11097781
GitHub: https://github.com/intelligent-vehicles/BEV-LLM
Cite this:
@INPROCEEDINGS{11097781,
  author={Brandstätter, Felix and Schütz, Erik and Winter, Katharina and Flohr, Fabian B.},
  booktitle={2025 IEEE Intelligent Vehicles Symposium (IV)},
  title={BEV-LLM: Leveraging Multimodal BEV Maps for Scene Captioning in Autonomous Driving},
  year={2025},
  pages={345-350},
  keywords={Point cloud compression;Solid modeling;Three-dimensional displays;Laser radar;Image coding;Decision making;Transportation;Benchmark testing;Safety;Autonomous vehicles;Explainability;Scene Understanding},
  doi={10.1109/IV64158.2025.11097781}}