Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics

PLAN Lab, University of Illinois Urbana-Champaign

CVPR 2026

TL;DR: Phantom is a Physics-Infused Video Generation model that jointly models visual content and latent physical dynamics. Conditioned on observed video frames and inferred physical states, Phantom jointly predicts latent physical dynamics and generates future video frames, without requiring explicit specification of complex physical properties.
Phantom architecture overview
Overview. Phantom consists of two parallel latent flow-matching branches: a video branch and a physics branch. These branches jointly model future visual and physical dynamics: the video branch predicts future visual trajectories, while the physics branch predicts the evolution of latent physical states. Dual cross-attention layers tightly couple the two branches, allowing physics cues to guide visual generation and visual evidence to refine physics reasoning.
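The dual cross-attention coupling described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration under our own assumptions (token counts, embedding width, random projections); it is not the released Phantom implementation, only the generic pattern of two token streams querying each other with residual connections.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, wq, wk, wv):
    """Single-head cross-attention: `queries` attend to `keys_values`."""
    q = queries @ wq            # (Nq, d)
    k = keys_values @ wk        # (Nk, d)
    v = keys_values @ wv        # (Nk, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v  # (Nq, d)

rng = np.random.default_rng(0)
d = 16
video_tokens = rng.standard_normal((64, d))  # latent video tokens (assumed shape)
phys_tokens = rng.standard_normal((8, d))    # latent physical-state tokens (assumed shape)

# Dual cross-attention: each branch queries the other, then residual-adds,
# so physics cues flow into the video stream and vice versa.
w = lambda: rng.standard_normal((d, d)) / np.sqrt(d)  # stand-in projection weights
video_out = video_tokens + cross_attention(video_tokens, phys_tokens, w(), w(), w())
phys_out = phys_tokens + cross_attention(phys_tokens, video_tokens, w(), w(), w())

print(video_out.shape, phys_out.shape)  # (64, 16) (8, 16)
```

Note that each branch keeps its own token count and width; only the attended values cross between streams, which is what lets the two branches stay "parallel" while remaining tightly coupled.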

Abstract

Recent advances in generative video modeling, driven by large-scale datasets and increasingly powerful architectures, have produced striking visual realism. However, growing evidence suggests that scaling data and model size alone does not endow these systems with an understanding of the underlying physical laws that govern real-world dynamics.

In this work, we investigate whether integrating the inference of latent physical properties directly into the video generation process can equip models with the ability to produce physically plausible videos. To this end, we propose Phantom, a Physics-Infused Video Generation model that jointly models visual content and latent physical dynamics. Phantom leverages a physics-aware video representation that serves as an abstract yet informative embedding of the underlying physics, enabling the joint prediction of physical dynamics alongside video content without requiring complex physical dynamics and properties to be specified explicitly. By integrating the inference of this physics-aware video representation directly into the video generation process, Phantom produces video sequences that are both visually realistic and physically consistent.
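The "latent flow-matching" objective underlying both branches can be sketched generically: sample a time t, interpolate between noise and the clean latent, and regress the constant velocity of the linear path. This is the standard conditional flow-matching recipe, shown here under assumed shapes; it is not Phantom's training code.

```python
import numpy as np

def flow_matching_target(x1, x0, t):
    """Linear path x_t = (1 - t) * x0 + t * x1.
    The network's regression target is the constant velocity x1 - x0."""
    xt = (1.0 - t) * x0 + t * x1
    velocity = x1 - x0
    return xt, velocity

rng = np.random.default_rng(0)
x1 = rng.standard_normal((8, 16))  # clean latent (video or physics tokens; assumed shape)
x0 = rng.standard_normal((8, 16))  # Gaussian noise sample
t = rng.uniform()                  # flow time in [0, 1]

xt, v_target = flow_matching_target(x1, x0, t)
v_pred = np.zeros_like(v_target)   # stand-in for the network's velocity prediction
loss = float(np.mean((v_pred - v_target) ** 2))
print(loss >= 0.0)  # True
```

In a joint model like the one described above, both branches would share the same time t and be trained with this velocity-regression loss on their respective latents.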

Quantitative and qualitative results on both standard video generation and physics-aware benchmarks demonstrate that Phantom not only outperforms existing methods in terms of adherence to physical dynamics but also delivers competitive perceptual fidelity.

Video Comparisons

We compare Phantom against state-of-the-art video generation models in two settings: text-to-video (T2V) and text-/image-to-video (TI2V). We also demonstrate force-conditioned generation where Phantom responds to explicit physical control signals. Red boxes mark conditioning frames in the TI2V and force-conditioned settings.

Wan2.1-T2V
CogVideoX-5B
WISA
VideoREPA
Wan2.2-TI2V
Phantom (Ours)
Prompt: A colorful rubber ball is dropped from a height, showing the bounce of the ball as it makes contact with the hard floor.

Prompt: A balloon changes from large to small.

Prompt: A coffee pot pours a morning cup of joe.

Wan2.2-TI2V vs. Phantom (Ours)
Wan2.2-TI2V
Phantom (Ours)
Prompt: A round cookie falls from a table and shatters into crumbs on the hard floor below.

Prompt: A set of interlocking metal gears turns in synchrony, each gear meshing precisely with the next.

Prompt: A bottle of ketchup is tilted; the ketchup resists flow at first, then suddenly pours out in a thick burst.

Prompt: Water pours from a faucet into a glass, splashing and bubbling as it fills to the brim.
CogVideoX1.5-I2V
Wan2.2-TI2V
Phantom (Ours)
Prompt: The video captures a serene beach scene at sunset, where a group of people are engaged in creating large, colorful soap bubbles.

Prompt: A thick, viscous blue liquid pours into a bowl, forming folds, splashes, and slow flowing waves.

Prompt: A juicy burger is pressed down; the bun compresses, sauce oozes from the sides, and the layers shift under the weight.

Prompt: Fresh juice is poured from a pitcher into a glass; the liquid swirls and foam forms at the surface.

Additional examples — Wan2.2-TI2V vs. Phantom (Ours)
Wan2.2-TI2V
Phantom (Ours)
Prompt: A tall building stands as a demolition ball swings into its facade; the wall crumbles and debris cascades to the ground.

Prompt: Foam dispensed from a can expands rapidly into a fluffy white mound, its surface bubbling and settling.

Phantom can be further fine-tuned on the Force-Prompting dataset. Given a static image and a force tensor, Phantom synthesizes physically plausible motion driven by the applied force.

Car — pushed left
Car — pushed up
Flower — pushed right
Ornament — pushed left
Rose — pushed up
Trail motion
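One simple way to feed a force signal like the examples above into an image-conditioned generator is to broadcast the force vector over the spatial grid and concatenate it with the image channels as a dense conditioning tensor. The sketch below is purely illustrative (the function name, shapes, and encoding are our assumptions, not the Force-Prompting or Phantom conditioning scheme).

```python
import numpy as np

def force_conditioning(image, force_xy):
    """Broadcast a 2-D force vector (fx, fy) over the spatial grid and
    concatenate it with the image channels, yielding a dense conditioning
    tensor of shape (h, w, c + 2)."""
    h, w, c = image.shape
    force_map = np.broadcast_to(
        np.asarray(force_xy, dtype=np.float32), (h, w, 2)
    )
    return np.concatenate([image, force_map], axis=-1)

frame = np.zeros((32, 32, 3), dtype=np.float32)  # placeholder static image
cond = force_conditioning(frame, (0.8, -0.2))    # e.g. push right, slightly down
print(cond.shape)  # (32, 32, 5)
```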

Quantitative Results

VideoPhy & VideoPhy-2

SA (Semantic Adherence) measures video–text alignment. PC (Physical Commonsense) measures adherence to real-world physics.

† denotes results reported from VideoREPA with the original prompt. * denotes evaluation with detailed prompts. Green percentages show improvement over the base model Wan2.2-TI2V. Best results in bold, second-best underlined.

| Method | VideoPhy SA ↑ | VideoPhy PC ↑ | VideoPhy-2 SA ↑ | VideoPhy-2 PC ↑ |
|---|---|---|---|---|
| *General-Purpose* | | | | |
| VideoCrafter2 | 50.3 | 29.7 | 25.89 | 55.67 |
| LaVIE | 48.7 | 31.5 | – | – |
| Cosmos-Diffusion-7B | 57.0 | 18.0 | 26.32 | 54.19 |
| CogVideoX-5B | 63.1 | 31.4 | 28.86 | 68.42 |
| Wan2.2-TI2V-5B | 41.5 | 25.2 | 24.53 | 69.20 |
| Wan2.2-TI2V-5B* | 64.7 | 28.6 | 24.53 | 69.20 |
| *Physics-Focused* | | | | |
| PhyT2V (Round 4)† | 61 | 37 | – | – |
| WISA† | 62 | 33 | – | – |
| VideoREPA | 51.9 | 22.4 | 21.02 | 72.54 |
| VideoREPA*† | 72.1 | 40.1 | 21.02 | 72.54 |
| Phantom (Ours) | 47.5 (↑14.5%) | 37.9 (↑50.4%) | 27.75 (↑13.1%) | 71.74 (↑2.6%) |
| Phantom* (Ours) | 70.3 (↑8.7%) | 39.4 (↑37.8%) | 27.75 (↑13.1%) | 71.74 (↑2.6%) |

Physics-IQ

Physics-IQ tests physical extrapolation from real-world motion sequences under single-frame and multi-frame conditioning.

| Method | Spatial IoU ↑ | Spatiotemporal IoU ↑ | Weighted IoU ↑ | MSE ↓ | Physics-IQ ↑ |
|---|---|---|---|---|---|
| *Single-Frame, General-Purpose* | | | | | |
| VideoPoet | 0.141 | 0.126 | 0.087 | 0.012 | 20.30 |
| Lumiere | 0.113 | 0.173 | 0.061 | 0.016 | 19.00 |
| Runway Gen 3 | 0.201 | 0.115 | 0.116 | 0.015 | 22.80 |
| CogVideoX1.5-I2V | 0.198 | 0.189 | 0.127 | 0.015 | 27.90 |
| Wan2.2-TI2V-5B | 0.164 | 0.132 | 0.102 | 0.010 | 22.10 |
| *Single-Frame, Physics-Focused* | | | | | |
| RDPO | – | – | – | – | 25.21 |
| Phantom (Ours) | 0.245 (↑49.4%) | 0.146 (↑10.6%) | 0.140 (↑37.3%) | 0.009 (↑11.1%) | 29.59 (↑33.9%) |
| *Multi-Frame, General-Purpose* | | | | | |
| VideoPoet | 0.204 | 0.164 | 0.137 | 0.010 | 29.50 |
| Lumiere | 0.170 | 0.155 | 0.093 | 0.013 | 23.00 |
| *Multi-Frame, Physics-Focused* | | | | | |
| Phantom (Ours) | 0.235 | 0.133 | 0.132 | 0.011 | 27.53 |

VBench-2.0

VBench-2.0 evaluates five core dimensions (Human Fidelity, Controllability, Creativity, Physics, and Commonsense) across 18 fine-grained metrics, providing a comprehensive assessment of overall video quality.

| Model | Total ↑ | Creativity ↑ | Commonsense ↑ | Controllability ↑ | Human Fidelity ↑ | Physics ↑ |
|---|---|---|---|---|---|---|
| Wan2.2-TI2V-5B | 51.57 | 52.50 | 60.57 | 18.50 | 86.10 | 40.19 |
| Phantom (Ours) | 51.84 (↑0.5%) | 45.51 | 61.43 (↑1.4%) | 20.23 (↑9.4%) | 88.39 (↑2.7%) | 43.61 (↑6.0%) |

BibTeX

@inproceedings{shen2026phantom,
  title     = {Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics},
  author    = {Shen, Ying and Xiong, Jerry and Yu, Tianjiao and Lourentzou, Ismini},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}