Autonomous driving technology has the potential to transform transportation, but its widespread adoption depends on the development of interpretable and transparent decision-making systems. Scene captioning, which generates natural language descriptions of the driving environment, plays a crucial role in enhancing transparency, safety, and human-AI interaction. We introduce BEV-LLM, a lightweight model for 3D captioning of autonomous driving scenes. BEV-LLM leverages BEVFusion to combine 3D LiDAR point clouds and multi-view images, and incorporates a novel absolute positional encoding to generate view-specific scene descriptions.
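As a rough illustration of this design, the sketch below shows how per-view BEV features might be combined with a learned absolute positional encoding tied to the camera view and then projected into the language model's embedding space. It is an illustrative assumption only; the class and parameter names (ViewAwareBEVProjector, bev_dim, llm_dim, num_views) are hypothetical and do not reflect BEV-LLM's actual implementation.

```python
# Minimal sketch (not the authors' code): fused BEV features, e.g. from BEVFusion,
# receive a per-view absolute positional encoding before being adapted to the
# token space of a small (~1B-parameter) language model. Shapes are illustrative.
import torch
import torch.nn as nn


class ViewAwareBEVProjector(nn.Module):
    """Hypothetical adapter: BEV features + per-view absolute encoding -> LLM tokens."""

    def __init__(self, bev_dim: int, llm_dim: int, num_views: int = 6):
        super().__init__()
        # Learned absolute encoding, one vector per surround-view camera.
        self.view_embedding = nn.Embedding(num_views, bev_dim)
        # Linear adapter from BEV feature space to the LLM hidden size.
        self.proj = nn.Linear(bev_dim, llm_dim)

    def forward(self, bev_tokens: torch.Tensor, view_idx: torch.Tensor) -> torch.Tensor:
        # bev_tokens: (batch, num_tokens, bev_dim) flattened BEV grid features
        # view_idx:   (batch,) index of the camera view the caption should describe
        pos = self.view_embedding(view_idx).unsqueeze(1)   # (batch, 1, bev_dim)
        return self.proj(bev_tokens + pos)                 # (batch, num_tokens, llm_dim)


if __name__ == "__main__":
    # Toy example: 6 surround-view cameras, 256-dim BEV features,
    # 2048-dim hidden size assumed for a ~1B-parameter base model.
    projector = ViewAwareBEVProjector(bev_dim=256, llm_dim=2048, num_views=6)
    bev = torch.randn(2, 200, 256)
    views = torch.tensor([0, 3])            # e.g. front and back cameras
    print(projector(bev, views).shape)      # torch.Size([2, 200, 2048])
```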
Despite using a small 1B-parameter base model, BEV-LLM achieves competitive performance on the nuCaption dataset, surpassing the state of the art by up to 5% in BLEU scores. Additionally, we release two new datasets, nuView (focused on environmental conditions and viewpoints) and GroundView (focused on object grounding), to better assess scene captioning across diverse driving scenarios and to address gaps in current benchmarks, along with initial benchmarking results demonstrating their effectiveness.
Datasets for Download
nuView
GroundView