Abstract
3D human pose and shape estimation (3DHPS) is crucial for the safety of vulnerable road users (VRUs) in autonomous driving (AD) scenarios, as it can serve as an additional feature for trajectory prediction and ego-motion planning in complex urban environments. Although a few multimodal datasets from other domains are available, transferring knowledge to the AD domain remains challenging because of its specific characteristics, such as ego-motion, varying lighting conditions, and dynamic object distances.
To address this gap, we propose a novel 3DHPS model that integrates both LiDAR and image features through an intermediate fusion network, allowing robust estimation of Skinned Multi-Person Linear Model (SMPL) parameters. Unlike prior approaches, our method uses both image and LiDAR data and is trained and validated on the Waymo Open Dataset (WOD). Leveraging WOD’s 2D/3D joint annotations, we generate 3D SMPL pseudo-ground truth for training and evaluate the model quantitatively and qualitatively. The model estimates accurate human poses from single as well as fused modalities.
Experimental results demonstrate the effectiveness and robustness of our approach in challenging real-world AD scenarios – including low-contrast conditions, poor lighting, and visually ambiguous backgrounds – while achieving an MPJPE of 138 mm and a PA-MPJPE of 103 mm. The code and model will be made publicly available at https://github.com/max-a-ai/lif-net.
Method
We introduce LIF-Net, a method for recovering 3D human pose and shape (SMPL) of pedestrians in urban driving scenes by fusing camera and LiDAR data. Monocular RGB and LiDAR have complementary failure modes – while RGB suffers from depth ambiguity and degrades under occlusion and poor lighting, LiDAR provides direct geometry but is sparse and lacks semantic cues. LIF-Net is designed to exploit this complementarity for robust pose estimation in the wild.
Our main contributions are: (1) a novel intermediate cross-attention fusion architecture that combines the complementary strengths of RGB and LiDAR for robust SMPL estimation; (2) an extensive multi-modal SMPL evaluation on the large-scale, in-the-wild Waymo Open Dataset; and (3) state-of-the-art accuracy, reducing MPJPE by 35.5% compared to image-only and by 9.1% compared to LiDAR-only baselines.
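For readers unfamiliar with the reported metrics, the following is a minimal sketch of how MPJPE and PA-MPJPE are commonly computed. It is not taken from the LIF-Net code base: the function names, the (J, 3) joint layout, and the use of a similarity (Procrustes) alignment via SVD are assumptions for illustration.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: mean Euclidean distance over joints."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def procrustes_align(pred, gt):
    """Align pred (J, 3) to gt (J, 3) with the best scale, rotation, translation."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation from the SVD of the cross-covariance matrix.
    u, s, vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = np.sign(np.linalg.det(u @ vt))
    corr = np.diag([1.0, 1.0, d])
    rot = u @ corr @ vt
    scale = (s * np.diag(corr)).sum() / (p ** 2).sum()
    return scale * p @ rot + mu_g

def pa_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE: error that remains after similarity alignment."""
    return mpjpe(procrustes_align(pred, gt), gt)

# Toy usage with hypothetical joints; units follow the inputs
# (multiply by 1000 to report millimetres if the joints are in metres).
rng = np.random.default_rng(0)
gt_joints = rng.normal(size=(17, 3))
pred_joints = gt_joints + 0.05 * rng.normal(size=(17, 3))
print(mpjpe(pred_joints, gt_joints), pa_mpjpe(pred_joints, gt_joints))
```

Because PA-MPJPE removes global scale, rotation, and translation before measuring the error, it isolates the articulated pose error, which is why it is typically lower than MPJPE.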
The architecture has three components. An RGB encoder (a ViT-H/16 backbone from ViTPose) processes a 256×256 image crop of the pedestrian into image features. A LiDAR encoder (PointNet++) processes a fixed set of 512 points into a global geometry feature. These two streams are combined by a cross-attention fusion module, in which the image features query the global LiDAR context. This lets the model dynamically reweight the two modalities and stay robust when one sensor degrades. Finally, an HMR2-initialised transformer decoder takes a mean-initialised SMPL query, attends to the fused features, and predicts a residual on the SMPL pose and shape parameters; the SMPL body model itself remains frozen.
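The fusion step can be illustrated with a minimal single-head cross-attention sketch in NumPy, in which image tokens act as queries and LiDAR tokens as keys and values. This is not the LIF-Net implementation. The token counts, the feature width, the random projection matrices, the residual connection, and the representation of the LiDAR context as a few tokens are all assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(img_tokens, lidar_tokens, w_q, w_k, w_v):
    """Image tokens (N, D) query LiDAR tokens (M, D); returns fused (N, D)."""
    q = img_tokens @ w_q      # queries come from the image stream
    k = lidar_tokens @ w_k    # keys come from the LiDAR stream
    v = lidar_tokens @ w_v    # values come from the LiDAR stream
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, M)
    # Residual connection (an assumption): image features are kept and
    # enriched with the attended LiDAR context.
    return img_tokens + attn @ v, attn

rng = np.random.default_rng(0)
n_img, n_lidar, dim = 256, 4, 32  # hypothetical sizes: 16x16 patches, 4 LiDAR tokens
img_tokens = rng.normal(size=(n_img, dim))
lidar_tokens = rng.normal(size=(n_lidar, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

fused, attn = cross_attention_fuse(img_tokens, lidar_tokens, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

Note that if the LiDAR context were a single global vector, the softmax over one key would always equal 1 and every image token would receive the same LiDAR value; the sketch therefore assumes several LiDAR tokens so that the attention weights carry information.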
Qualitative Results
Citation
@INPROCEEDINGS{2026_lifnet_buettner,
title = {{LIF-Net}: {LiDAR and Camera Fused 3D Human Pose and Shape Estimation for Autonomous Driving}},
author = {Buettner, Max A. and Schuetz, Erik and Flohr, Fabian B.},
booktitle = {2026 Proc. of the IEEE Intelligent Vehicles Symposium (IV)},
year = {2026},
}
