CLaD: Planning with Grounded Foresight via Cross-Modal Latent Dynamics

Andrew Jeong · Jaemin Kim · Sebin Lee · Sung-Eui Yoon*

KAIST

CLaD teaser overview
Overview of CLaD. (a) Conventional approaches either generate semantic artifacts (e.g., subgoal images or texts), or plan in unimodal latent spaces that lack cross-modal understanding. (b) CLaD learns cross-modal latent dynamics to predict grounded latent foresights, which condition a diffusion policy for action generation. CLaD achieves a 94.7% average success rate on LIBERO-LONG with only 0.66B parameters, competitive with OpenVLA (7B) and π0.5 (3.3B).

Overview

Core Contributions

Cross-Modal Latent Dynamics

We propose a cross-modal dynamics model that learns how proprioceptive and semantic transitions jointly evolve under actions via asymmetric cross-attention, in which each semantic transition is interpreted through a proprioceptive transition cue.

Grounded Latent Foresight

We introduce a two-stage training framework that predicts compact future latent states from the shared cross-modal dynamics, which are grounded to observable quantities, enabling them to serve as subgoals for diffusion-based control.

Parameter-Efficient Planning

CLaD achieves competitive performance to OpenVLA (7B) and π0.5 (3.3B) on LIBERO-LONG with significantly fewer parameters (0.66B), demonstrating that grounded latent foresight enables efficient and scalable robot planning within a compact latent space.

Method

Planning with Cross-Modal Latent Dynamics

CLaD method architecture
Stage 1: Learn cross-modal latent dynamics via asymmetric cross-attention and predict grounded latent foresights supervised by EMA targets with reconstruction losses.
Stage 2: Modulate foresights with observations via FiLM layers to condition a diffusion policy for action generation.
Cross-modal Latent Dynamics

Rather than aligning static states across modalities, CLaD learns how proprioceptive and semantic transitions co-evolve under actions via asymmetric cross-attention, capturing their shared dynamic context.
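The asymmetric direction of this cross-attention can be sketched as follows. This is a minimal, hypothetical numpy illustration, not the paper's implementation: token counts, dimensions, and the single-head form are assumptions; the only property taken from the text is that semantic transition tokens attend to (are "interpreted through") the proprioceptive transition cue, and not vice versa.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys, values):
    # Single-head scaled dot-product cross-attention.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)    # (Tq, Tk)
    return softmax(scores, axis=-1) @ values  # (Tq, d)

rng = np.random.default_rng(0)
d = 32
sem = rng.normal(size=(8, d))   # semantic transition tokens (hypothetical shape)
prop = rng.normal(size=(4, d))  # proprioceptive transition tokens (hypothetical shape)

# Asymmetric direction: semantic tokens are the queries, the proprioceptive
# cue supplies keys/values, so semantics are read through proprioception.
fused = cross_attend(sem, prop, prop)
print(fused.shape)  # (8, 32)
```

The asymmetry is the design choice: swapping queries and keys/values would instead ground proprioception in semantics, which is not what the method describes.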

Learning Grounded Latent Foresight

From the learned cross-modal dynamics, lightweight MLPs predict future latent states supervised by EMA target encoders, while auxiliary reconstruction losses ground these foresights to observable quantities and prevent representation collapse.
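The two supervision signals above, an EMA target encoder for the predicted latents and an auxiliary reconstruction term for grounding, can be sketched in a few lines. This is a hedged illustration under assumed names and a simple MSE form; the paper's actual losses, weighting, and EMA rate are not specified here.

```python
import numpy as np

def ema_update(target_params, online_params, tau=0.995):
    # Polyak/EMA update for the target encoder: target <- tau*target + (1-tau)*online.
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]

def foresight_loss(pred_latent, ema_target_latent, recon, obs, alpha=0.1):
    # Latent prediction loss (EMA target supervision) plus an auxiliary
    # reconstruction term that grounds foresights to observable quantities
    # and discourages representation collapse. alpha is a hypothetical weight.
    latent = np.mean((pred_latent - ema_target_latent) ** 2)
    ground = np.mean((recon - obs) ** 2)
    return latent + alpha * ground

rng = np.random.default_rng(1)
online = [rng.normal(size=(3,))]
target = [np.zeros(3)]
target = ema_update(target, online, tau=0.9)  # target is now 0.1 * online

loss = foresight_loss(rng.normal(size=(8,)), rng.normal(size=(8,)),
                      rng.normal(size=(16,)), rng.normal(size=(16,)))
print(loss >= 0.0)  # True
```

The key point the sketch captures is that the target encoder is never updated by gradients, only by the EMA rule, while the reconstruction head ties the latent to observable data.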

Diffusion Policy Guided by Latent Foresight

Predicted latent foresights are modulated with current observations via FiLM layers and condition a diffusion policy for action generation, serving as implicit subgoals without the overhead of explicit semantic artifact generation.
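FiLM conditioning itself is simple feature-wise affine modulation. The sketch below, with hypothetical dimensions and weight names, shows the direction described above: scale and shift parameters are predicted from the current observation and applied to the foresight features before they condition the diffusion policy.

```python
import numpy as np

def film(features, conditioning, W_gamma, W_beta):
    # FiLM: feature-wise linear modulation gamma * x + beta, where gamma and
    # beta are predicted (here by a single linear map) from the conditioning.
    gamma = conditioning @ W_gamma
    beta = conditioning @ W_beta
    return gamma * features + beta

rng = np.random.default_rng(0)
d_cond, d_feat = 16, 32                       # hypothetical dimensions
W_gamma = 0.1 * rng.normal(size=(d_cond, d_feat))
W_beta = 0.1 * rng.normal(size=(d_cond, d_feat))

foresight = rng.normal(size=(d_feat,))        # predicted latent foresight
obs = rng.normal(size=(d_cond,))              # current observation embedding

modulated = film(foresight, obs, W_gamma, W_beta)
print(modulated.shape)  # (32,)
```

The modulated vector then plays the role of an implicit subgoal for the denoising network, avoiding any explicit subgoal image or text generation step.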

Results

Efficient yet Effective Planning with CLaD

Performance on long-horizon planning

CLaD reaches the best average success rate on LIBERO-LONG (94.7% at 0.66B parameters), surpassing large VLAs such as OpenVLA (93.8%, 7B) and π0.5 (93.2%, 3.3B) with far fewer parameters.

Benchmark comparison chart
Performance comparison on LIBERO-LONG benchmark.
Computational efficiency

CLaD runs at 25 Hz with only 4 GB memory, versus OpenVLA at 6 Hz / 15 GB and π0.5 at 10 Hz / 19 GB. Among latent planning methods, CLaD achieves 94.7% success rate with a planning latency of just 0.012 s, outperforming UVA (90.0%) and LBP (88.6%) while maintaining real-time deployment capability.

Method    Avg. SR (%)   Params (B)   Inference (Hz)   Memory (GB)
OpenVLA   93.8          7.0          6                15
π0.5      93.2          3.3          10               19
CLaD      94.7          0.66         25               4

Method    Params (B)   Planning Time (s)   Avg. SR (%)
UVA       0.5          0.195               90.0
LBP       0.19         0.008               88.6
CLaD      0.66         0.012               94.7

Citation (TBU)

BibTeX

@misc{jeong2026clad,
  title  = {CLaD: Planning with Grounded Foresight via Cross-Modal Latent Dynamics},
  author = {Jeong, Andrew and Kim, Jaemin and Lee, Sebin and Yoon, Sung-Eui},
  year   = {2026},
}