VTAM: Video-Tactile-Action Models for
Complex Physical Interaction Beyond VLAs

Haoran Yuan1,*,‡, Weigang Yi1,*, Zhenyu Zhang2,*, Wendi Chen3,*
Yuchen Mo1, Jiashi Yin1, Xinzhuo Li1, Xiangyu Zeng1
Chuan Wen3, Cewu Lu3, Katherine Driggs-Campbell1, Ismini Lourentzou1,†
1University of Illinois Urbana-Champaign 2Stanford University 3Shanghai Jiao Tong University
‡Project lead    *Equal contribution    †Corresponding author
We propose VTAM, a new video-tactile world action model for contact-rich robotic manipulation. Given multi-view visual observations, tactile signals, and robot state/action context, our method predicts interaction dynamics and generates physically grounded actions for complex real-world tasks.
VTAM teaser figure

Abstract

Video-Action Models (VAMs) have emerged as a promising framework for embodied intelligence, learning implicit world dynamics from raw video streams to produce temporally consistent action predictions. Although such models demonstrate strong performance on long-horizon tasks through visual reasoning, they remain limited in contact-rich scenarios where critical interaction states are only partially observable from vision alone. In particular, fine-grained force modulation and contact transitions are not reliably encoded in visual tokens, leading to unstable or imprecise behaviors. To bridge this gap, we introduce the Video-Tactile Action Model (VTAM), a multimodal world-modeling framework that incorporates tactile perception as a complementary grounding signal. VTAM augments a pretrained video transformer with tactile streams via lightweight modality-transfer finetuning, enabling efficient cross-modal representation learning without tactile-language paired data or independent tactile pretraining. To stabilize multimodal fusion, we introduce a tactile regularization loss that enforces balanced cross-modal attention, preventing visual latents from dominating the action model. VTAM demonstrates superior performance in contact-rich manipulation, maintaining a robust average success rate of 90%. In challenging scenarios such as potato-chip pick-and-place, which requires high-fidelity force awareness, VTAM outperforms the pi0.5 baseline by 80 percentage points. Our findings demonstrate that integrating tactile feedback is essential for correcting visual estimation errors in world action models, offering a scalable path toward physically grounded embodied foundation models.
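To make the tactile regularization idea concrete, the sketch below shows one plausible PyTorch form: a penalty on cross-modal attention mass that drifts away from tactile tokens. The function name, the target_tactile_mass ratio, and the squared-error form are illustrative assumptions; the exact formulation used by VTAM is not specified on this page.

```python
# Minimal sketch of a tactile-balance regularizer, assuming the loss penalizes
# attention mass that drifts away from tactile tokens. The target ratio and the
# squared-error form are illustrative, not the paper's exact formulation.
import torch

def tactile_balance_loss(attn_weights: torch.Tensor,
                         tactile_mask: torch.Tensor,
                         target_tactile_mass: float = 0.5) -> torch.Tensor:
    """attn_weights: (batch, heads, queries, keys) softmax attention maps.
    tactile_mask:  (keys,) boolean tensor, True where the key token is tactile."""
    # Fraction of each query's attention that lands on tactile tokens.
    tactile_mass = attn_weights[..., tactile_mask].sum(dim=-1)  # (B, H, Q)
    # Penalize deviation from the desired visual/tactile balance so that
    # visual latents cannot dominate the fused representation.
    return ((tactile_mass - target_tactile_mass) ** 2).mean()
```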

Method

VTAM uses a two-stage training strategy. In Stage I, the model learns joint visuo-tactile predictive dynamics in latent space using multi-view diffusion over two RGB views and one GelSight stream. In Stage II, the model learns control with conditional diffusion and jointly predicts action, state, and a deformation-derived virtual force proxy. This force proxy preserves tactile sensitivity and stabilizes multimodal optimization.
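As a rough illustration of the Stage II control head, the sketch below denoises a joint action/state/force-proxy trajectory with a DDPM-style conditional network conditioned on the Stage I visuo-tactile latents. The class name, dimensions, MLP architecture, and flattened conditioning layout are assumptions made for illustration; only the jointly predicted quantities (action, state, and a virtual force proxy) come from the description above.

```python
# Hedged sketch of a Stage II conditional diffusion head. Module names,
# dimensions, and the [action | state | force-proxy] layout are illustrative.
import torch
import torch.nn as nn

class ActionStateForceHead(nn.Module):
    """Denoises a joint [action | state | virtual-force-proxy] trajectory,
    conditioned on the visuo-tactile world-model latents from Stage I."""
    def __init__(self, latent_dim=512, action_dim=7, state_dim=7, force_dim=1,
                 horizon=16, hidden=1024):
        super().__init__()
        self.target_dim = action_dim + state_dim + force_dim
        self.net = nn.Sequential(
            nn.Linear(latent_dim + horizon * self.target_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, horizon * self.target_dim),
        )

    def forward(self, noisy_target, t, world_latent):
        # noisy_target: (B, horizon, target_dim); t: (B,); world_latent: (B, latent_dim)
        x = torch.cat([noisy_target.flatten(1), world_latent, t[:, None].float()], dim=-1)
        return self.net(x).view_as(noisy_target)  # predicted noise
```

A training step would add Gaussian noise to the ground-truth trajectory, predict that noise with this head, and minimize the mean-squared error between predicted and injected noise, as in standard diffusion-policy training.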

VTAM method architecture overview

VTAM architecture: shared visuo-tactile latent world modeling + action-state-force diffusion head.

Robot experiment setup · Data acquisition setup for VTAM

Real-world setup and teleoperation data collection pipeline.

Results and Analysis

VTAM is evaluated on three contact-rich real-world tasks, with 20 trials per task setting and 80 total trials per model. It consistently outperforms strong baselines, including Genie Envisioner and both vision-only and tactile-augmented pi0.5 variants.

Model               Chip Pick-and-Place   Cucumber Peeling   Whiteboard Wiping
Genie Envisioner    0%                    0%                 2.5%
pi0.5 (Vision)      10%                   0%                 0%
pi0.5 + Tactile     5%                    0%                 0%
VTAM (Ours)         90%                   85%                95%

Video Comparisons


Chip Pick-and-Place

VTAM (Ours)
Genie Envisioner
pi0.5 (Vision)
pi0.5 + Tactile

Cucumber Peeling

VTAM (Ours)
Genie Envisioner
pi0.5 (Vision)
pi0.5 + Tactile

Whiteboard Wiping

VTAM (Ours)
Genie Envisioner
pi0.5 (Vision) / pi0.5 + Tactile

Model Behavior Diversity

VTAM (Ours) - Failure Case
VTAM (Ours) - Retrial Recovery

Success Cases for Baseline Models

pi0.5 (Vision) - Success Case
pi0.5 + Tactile - Success Case

Prediction Visualization


Camera-1 prediction · Camera-2 prediction · Tactile prediction

Prediction visualization of the backbone video model. From top to bottom: Camera-1 view, Camera-2 view, and tactile stream. Ground truth is shown in the top rows and VTAM predictions in the bottom rows.

BibTeX

@article{vtam2026,
  title={VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs},
  author={Yuan, Haoran and Yi, Weigang and Zhang, Zhenyu and Chen, Wendi and Mo, Yuchen and Yin, Jiashi and Li, Xinzhuo and Zeng, Xiangyu and Driggs-Campbell, Katherine and Lourentzou, Ismini},
  journal={arXiv preprint},
  year={2026}
}