CLaD: Planning with Grounded Foresight via Cross-Modal Latent Dynamics

Andrew Jeong · Jaemin Kim · Sebin Lee · Sung-Eui Yoon*

KAIST

CLaD teaser overview
Overview of CLaD. (a) Conventional approaches either generate semantic artifacts (e.g., subgoal images or texts), or plan in unimodal latent spaces that lack cross-modal understanding. (b) CLaD learns cross-modal latent dynamics to predict grounded latent foresights, which condition a diffusion policy for action generation. CLaD achieves a 94.7% average success rate on LIBERO-LONG with only 0.66B parameters, competitive with OpenVLA (7B) and π0.5 (3.3B).

Overview

Core Contributions

Cross-Modal Latent Dynamics

We propose a cross-modal dynamics model that learns how proprioceptive and semantic transitions jointly evolve under actions via asymmetric cross-attention, in which each semantic transition is interpreted through a proprioceptive transition cue.

Grounded Latent Foresight

We introduce a two-stage training framework that predicts compact future latent states from the shared cross-modal dynamics, which are grounded to observable quantities, enabling them to serve as subgoals for diffusion-based control.

Parameter-Efficient Planning

CLaD achieves competitive performance to OpenVLA (7B) and π0.5 (3.3B) on LIBERO-LONG with significantly fewer parameters (0.66B), demonstrating that grounded latent foresight enables efficient and scalable robot planning within a compact latent space.

Method

Planning with Cross-Modal Latent Dynamics

CLaD method architecture
Stage 1: Learn cross-modal latent dynamics via asymmetric cross-attention and predict grounded latent foresights supervised by EMA targets with reconstruction losses.
Stage 2: Modulate foresights with observations via FiLM layers to condition a diffusion policy for action generation.
Cross-modal Latent Dynamics

Rather than aligning static states across modalities, CLaD learns how proprioceptive and semantic transitions co-evolve under actions via asymmetric cross-attention, capturing their shared dynamic context.
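The asymmetric direction of this cross-attention can be sketched as follows. This is a minimal, hypothetical numpy illustration, not the paper's implementation: token counts, dimensions, and the single-head form are assumptions; the only property taken from the text is that semantic transition tokens attend to (are "interpreted through") the proprioceptive transition cue, and not vice versa.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys, values):
    # Single-head scaled dot-product cross-attention.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)    # (Tq, Tk)
    return softmax(scores, axis=-1) @ values  # (Tq, d)

rng = np.random.default_rng(0)
d = 32
sem = rng.normal(size=(8, d))   # semantic transition tokens (hypothetical shape)
prop = rng.normal(size=(4, d))  # proprioceptive transition tokens (hypothetical shape)

# Asymmetric direction: semantic tokens are the queries, the proprioceptive
# cue supplies keys/values, so semantics are read through proprioception.
fused = cross_attend(sem, prop, prop)
print(fused.shape)  # (8, 32)
```

The asymmetry is the design choice: swapping queries and keys/values would instead ground proprioception in semantics, which is not what the method describes.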

Learning Grounded Latent Foresight

From the learned cross-modal dynamics, lightweight MLPs predict future latent states supervised by EMA target encoders, while auxiliary reconstruction losses ground these foresights to observable quantities and prevent representation collapse.
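The two supervision signals above, an EMA target encoder for the predicted latents and an auxiliary reconstruction term for grounding, can be sketched in a few lines. This is a hedged illustration under assumed names and a simple MSE form; the paper's actual losses, weighting, and EMA rate are not specified here.

```python
import numpy as np

def ema_update(target_params, online_params, tau=0.995):
    # Polyak/EMA update for the target encoder: target <- tau*target + (1-tau)*online.
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]

def foresight_loss(pred_latent, ema_target_latent, recon, obs, alpha=0.1):
    # Latent prediction loss (EMA target supervision) plus an auxiliary
    # reconstruction term that grounds foresights to observable quantities
    # and discourages representation collapse. alpha is a hypothetical weight.
    latent = np.mean((pred_latent - ema_target_latent) ** 2)
    ground = np.mean((recon - obs) ** 2)
    return latent + alpha * ground

rng = np.random.default_rng(1)
online = [rng.normal(size=(3,))]
target = [np.zeros(3)]
target = ema_update(target, online, tau=0.9)  # target is now 0.1 * online

loss = foresight_loss(rng.normal(size=(8,)), rng.normal(size=(8,)),
                      rng.normal(size=(16,)), rng.normal(size=(16,)))
print(loss >= 0.0)  # True
```

The key point the sketch captures is that the target encoder is never updated by gradients, only by the EMA rule, while the reconstruction head ties the latent to observable data.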

Diffusion Policy Guided by Latent Foresight

Predicted latent foresights are modulated with current observations via FiLM layers and condition a diffusion policy for action generation, serving as implicit subgoals without the overhead of explicit semantic artifact generation.
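FiLM conditioning itself is simple feature-wise affine modulation. The sketch below, with hypothetical dimensions and weight names, shows the direction described above: scale and shift parameters are predicted from the current observation and applied to the foresight features before they condition the diffusion policy.

```python
import numpy as np

def film(features, conditioning, W_gamma, W_beta):
    # FiLM: feature-wise linear modulation gamma * x + beta, where gamma and
    # beta are predicted (here by a single linear map) from the conditioning.
    gamma = conditioning @ W_gamma
    beta = conditioning @ W_beta
    return gamma * features + beta

rng = np.random.default_rng(0)
d_cond, d_feat = 16, 32                       # hypothetical dimensions
W_gamma = 0.1 * rng.normal(size=(d_cond, d_feat))
W_beta = 0.1 * rng.normal(size=(d_cond, d_feat))

foresight = rng.normal(size=(d_feat,))        # predicted latent foresight
obs = rng.normal(size=(d_cond,))              # current observation embedding

modulated = film(foresight, obs, W_gamma, W_beta)
print(modulated.shape)  # (32,)
```

The modulated vector then plays the role of an implicit subgoal for the denoising network, avoiding any explicit subgoal image or text generation step.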

Results

Efficient yet Effective Planning with CLaD

Performance on long-horizon planning

CLaD reaches the best average success rate on LIBERO-LONG (94.7% at 0.66B parameters), surpassing large VLAs such as OpenVLA (93.8%, 7B) and π0.5 (93.2%, 3.3B) with far fewer parameters.

Benchmark comparison chart
Performance comparison on LIBERO-LONG benchmark.
Computational efficiency

CLaD runs at 25 Hz with only 4 GB memory, versus OpenVLA at 6 Hz / 15 GB and π0.5 at 10 Hz / 19 GB. Among latent planning methods, CLaD achieves 94.7% success rate with a planning latency of just 0.012 s, outperforming UVA (90.0%) and LBP (88.6%) while maintaining real-time deployment capability.

Method    Avg. SR (%)   Params (B)   Inference (Hz)   Memory (GB)
OpenVLA   93.8          7.0          6                15
π0.5      93.2          3.3          10               19
CLaD      94.7          0.66         25               4

Method    Params (B)   Planning Time (s)   Avg. SR (%)
UVA       0.5          0.195               90.0
LBP       0.19         0.008               88.6
CLaD      0.66         0.012               94.7

Citation (TBU)

BibTeX

@misc{jeong2026clad,
  title  = {CLaD: Planning with Grounded Foresight via Cross-Modal Latent Dynamics},
  author = {Jeong, Andrew and Kim, Jaemin and Lee, Sebin and Yoon, Sung-Eui},
  year   = {2026},
}