VTAM: Video-Tactile-Action Models for
Complex Physical Interaction Beyond VLAs
Abstract
Video-Action Models (VAMs) have emerged as a promising framework for embodied intelligence, learning implicit world dynamics from raw video streams to produce temporally consistent action predictions. Although such models demonstrate strong performance on long-horizon tasks through visual reasoning, they remain limited in contact-rich scenarios where critical interaction states are only partially observable from vision alone. In particular, fine-grained force modulation and contact transitions are not reliably encoded in visual tokens, leading to unstable or imprecise behaviors. To bridge this gap, we introduce the Video-Tactile Action Model (VTAM), a multimodal world modeling framework that incorporates tactile perception as a complementary grounding signal. VTAM augments a pretrained video transformer with tactile streams via lightweight modality-transfer finetuning, enabling efficient cross-modal representation learning without tactile-language paired data or independent tactile pretraining. To stabilize multimodal fusion, we introduce a tactile regularization loss that enforces balanced cross-modal attention, preventing visual latents from dominating the action model. VTAM demonstrates superior performance in contact-rich manipulation, achieving an average success rate of 90% across tasks. In challenging scenarios such as potato-chip pick-and-place, which requires high-fidelity force awareness, VTAM outperforms the pi0.5 baseline by 80 percentage points. Our findings demonstrate that integrating tactile feedback is essential for correcting visual estimation errors in world action models, providing a scalable path toward physically grounded embodied foundation models.
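The tactile regularization described above can be sketched as an attention-balance penalty. The snippet below is a minimal illustration, not the paper's exact formulation: the function name, the target fraction, and the token layout (visual keys first, tactile keys after) are all assumptions made for clarity.

```python
import numpy as np

def tactile_balance_loss(attn, n_visual, target=0.5):
    """Illustrative attention-balance penalty (hypothetical formulation).

    attn: softmax attention weights of shape (heads, queries, keys),
    where the first `n_visual` keys are visual tokens and the rest are
    tactile tokens. Penalizes the squared gap between the average
    attention mass placed on tactile keys and a target fraction,
    discouraging visual-latent dominance during fusion.
    """
    tactile_mass = attn[:, :, n_visual:].sum(axis=-1)  # (heads, queries)
    return float((tactile_mass.mean() - target) ** 2)

# Example: 2 heads, 3 queries, 4 visual + 4 tactile keys.
# Uniform attention splits mass evenly, so the penalty is zero.
uniform = np.full((2, 3, 8), 1.0 / 8)
print(tactile_balance_loss(uniform, n_visual=4))  # 0.0

# All attention on visual keys: tactile mass is 0, penalty is (0 - 0.5)^2.
skewed = np.zeros((2, 3, 8))
skewed[:, :, :4] = 0.25
print(tactile_balance_loss(skewed, n_visual=4))  # 0.25
```

In practice such a term would be added to the main training loss with a small weight; the sketch only shows the balancing mechanism itself.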
Method
VTAM uses a two-stage training strategy. In Stage I, the model learns joint visuo-tactile predictive dynamics in latent space using multi-view diffusion over two RGB views and one GelSight stream. In Stage II, the model learns control via conditional diffusion, jointly predicting actions, proprioceptive states, and a deformation-derived virtual force proxy. This force proxy preserves tactile sensitivity and stabilizes multimodal optimization.
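The Stage II objective can be sketched as a standard epsilon-prediction diffusion loss over the concatenated action/state/force targets. This is a generic sketch under assumptions: the function names, the concatenation of targets, and the noise schedule parameterization are illustrative and may differ from the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def stage2_diffusion_loss(denoiser, latents, action, state, force, alpha):
    """Illustrative Stage-II objective (assumed epsilon-prediction form).

    The action chunk, proprioceptive state, and deformation-derived
    force proxy are concatenated into one target vector, noised at a
    diffusion level with signal coefficient sqrt(alpha), and the
    denoiser (conditioned on the world-model latents) regresses the
    injected noise with a mean-squared error.
    """
    target = np.concatenate([action, state, force])
    eps = rng.standard_normal(target.shape)
    noisy = np.sqrt(alpha) * target + np.sqrt(1.0 - alpha) * eps
    pred = denoiser(noisy, latents, alpha)
    return float(np.mean((pred - eps) ** 2))

# Placeholder "model": a zero predictor, so the loss is just E[eps^2].
zero_denoiser = lambda noisy, latents, alpha: np.zeros_like(noisy)
loss = stage2_diffusion_loss(
    zero_denoiser, latents=None,
    action=np.zeros(7), state=np.zeros(7), force=np.zeros(3), alpha=0.9,
)
```

Jointly regressing state and the force proxy alongside actions acts as an auxiliary supervision signal, which is one plausible reading of why the force proxy stabilizes multimodal optimization.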
VTAM architecture: shared visuo-tactile latent world modeling + action-state-force diffusion head.
Real-world setup and teleoperation data collection pipeline.
Results and Analysis
VTAM is evaluated on three contact-rich real-world tasks, with 80 total trials per model (20 trials per task setting). It consistently outperforms strong baselines, including Genie Envisioner and both vision-only and tactile-augmented pi0.5 variants.
| Model | Chip Pick-and-Place | Cucumber Peeling | Whiteboard Wiping |
|---|---|---|---|
| Genie Envisioner | 0% | 0% | 2.5% |
| pi0.5 (Vision) | 10% | 0% | 0% |
| pi0.5 + Tactile | 5% | 0% | 0% |
| VTAM (Ours) | 90% | 85% | 95% |
Video Comparisons
Chip Pick-and-Place
Cucumber Peeling
Whiteboard Wiping
Model Behavior Diversity
Success Cases for Baseline Models
Prediction Visualization
Prediction visualization of the backbone video model. From top to bottom: Camera-1 view, Camera-2 view, and tactile stream. For each stream, ground truth (top rows) is shown above the corresponding VTAM predictions (bottom rows).
BibTeX
@article{vtam2026,
title={VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs},
author={Yuan, Haoran and Yi, Weigang and Zhang, Zhenyu and Chen, Wendi and Mo, Yuchen and Yin, Jiashi and Li, Xinzhuo and Zeng, Xiangyu and Driggs-Campbell, Katherine and Lourentzou, Ismini},
journal={arXiv preprint},
year={2026}
}