Walk-the-Talk: LLM driven pedestrian motion generation

A pedestrian walks towards a crosswalk while waving their right hand.
Animated Unreal Skeleton from Walk-the-Talk output
Retargeted animation to a CARLA pedestrian
Simulation with traffic
A dizzy jaywalking pedestrian trips and leans onto a parked vehicle.
Animated Unreal Skeleton from Walk-the-Talk output
Retargeted animation to a CARLA pedestrian
Simulation with traffic
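
The captions above correspond to two stages of the pipeline: the generated motion is first applied to an Unreal skeleton and then retargeted onto a CARLA pedestrian placed in a running simulation with traffic. As a rough illustration of the simulation side only, the sketch below spawns a walker and some background traffic through the CARLA Python API; importing the retargeted Walk-the-Talk animation onto the walker skeleton is an Unreal Engine asset step that this API does not cover, so it is omitted here.

# Minimal sketch: spawn a pedestrian and background traffic in CARLA (0.9.x).
# The retargeted Walk-the-Talk animation itself is imported on the Unreal side
# and is not part of this snippet.
import random
import carla

client = carla.Client('localhost', 2000)
client.set_timeout(10.0)
world = client.get_world()
blueprints = world.get_blueprint_library()

# Spawn a walker at a random navigable location.
walker_bp = random.choice(blueprints.filter('walker.pedestrian.*'))
spawn_point = carla.Transform(world.get_random_location_from_navigation())
spawn_point.location.z += 1.0  # lift slightly to avoid spawn collisions
walker = world.spawn_actor(walker_bp, spawn_point)

# Attach the built-in AI controller so the walker heads towards a goal.
controller_bp = blueprints.find('controller.ai.walker')
controller = world.spawn_actor(controller_bp, carla.Transform(), attach_to=walker)
controller.start()
controller.go_to_location(world.get_random_location_from_navigation())
controller.set_max_speed(1.4)  # roughly normal walking speed in m/s

# Add a few autopilot vehicles as background traffic.
for transform in random.sample(world.get_map().get_spawn_points(), 5):
    vehicle = world.try_spawn_actor(random.choice(blueprints.filter('vehicle.*')), transform)
    if vehicle is not None:
        vehicle.set_autopilot(True)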

Workflow

Walk-the-Talk framework: using domain-specific data and Large Language Models to generate realistic agent behaviours and motions for autonomous driving simulations.
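
As a rough usage sketch of what the text-to-motion step might look like in practice: the package, class, and method names below (walk_the_talk, MotionGenerator, generate, save_bvh) are hypothetical placeholders, not a released API, and the export format assumed for retargeting is likewise an assumption.

# Hypothetical sketch: prompt an LLM-based motion generator with a scenario
# description and export the resulting motion clip for retargeting onto an
# Unreal/CARLA skeleton. All names below are illustrative placeholders.
from walk_the_talk import MotionGenerator  # hypothetical package

generator = MotionGenerator(checkpoint='walk_the_talk.ckpt')  # hypothetical checkpoint
prompt = 'A pedestrian walks towards a crosswalk while waving their right hand.'

# Returns a motion clip, e.g. per-frame joint rotations at a fixed frame rate.
motion = generator.generate(prompt, num_frames=196)

# Export for retargeting in Unreal Engine / CARLA (format is an assumption).
motion.save_bvh('crosswalk_wave.bvh')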

Paper Abstract

In the field of autonomous driving, a key challenge is the “reality gap”: transferring knowledge gained in simulation to real-world settings. Despite various approaches to mitigate this gap, there is a notable absence of solutions targeting agent behavior generation, which is crucial for mimicking the spontaneous, erratic, and realistic actions of traffic participants. Recent advancements in Generative AI have enabled the representation of human activities in semantic space and the generation of real human motion from textual descriptions. Despite current limitations such as modality constraints, motion sequence length, resource demands, and data specificity, there is an opportunity to innovate and apply these techniques in the intelligent vehicles domain. We propose Walk-the-Talk, a motion generator utilizing Large Language Models (LLMs) to produce reliable pedestrian motions for high-fidelity simulators like CARLA. Thus, we contribute to autonomous driving simulations by aiming to scale realistic, diverse long-tail agent motion data – currently a gap in training datasets. We employ Motion Capture (MoCap) techniques to develop the Walk-the-Talk dataset, which covers a broad spectrum of pedestrian behaviors in street-crossing scenarios, ranging from standard walking patterns to extreme behaviors such as drunk walking and near-crash incidents. By utilizing this new dataset within an LLM, we facilitate the creation of realistic pedestrian motion sequences, a capability previously unattainable (cf. Figure 1). Additionally, our findings demonstrate that leveraging the Walk-the-Talk dataset enhances cross-domain generalization and significantly improves the Fréchet Inception Distance (FID) score by approximately 15% on the HumanML3D dataset.
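
For reference, the FID mentioned above compares the statistics of real and generated motion features: with means mu_r, mu_g and covariances Sigma_r, Sigma_g it is ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2 (Sigma_r Sigma_g)^{1/2}). A minimal sketch of this computation follows; the motion feature extractor (a pretrained encoder, as in the HumanML3D evaluation protocol) is assumed and not shown.

# Minimal sketch of the Fréchet Inception Distance (FID) between real and
# generated motion features. Feature extraction (a pretrained motion encoder)
# is assumed to have happened already.
import numpy as np
from scipy import linalg

def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))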

Some more examples

Two people meeting in the parking lane.
A drunk pedestrian falling in front of a car.

Citation

@conference{ramesh2024,
  title     = {Walk-the-Talk: LLM driven pedestrian motion generation},
  author    = {Mohan Ramesh and Fabian Flohr},
  year      = {2024},
  date      = {2024-03-30},
  booktitle = {IEEE Intelligent Vehicles Symposium (IV) (accepted for publication)},
  pubstate  = {accepted},
  tppubtype = {conference}
}