Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics
CVPR 2026
Abstract
Recent advances in generative video modeling, driven by large-scale datasets and increasingly powerful architectures, have produced striking visual realism. However, growing evidence suggests that scaling data and model size alone does not endow these systems with an understanding of the underlying physical laws that govern real-world dynamics.
In this work, we investigate whether integrating the inference of latent physical properties directly into the generation process can equip video models to produce physically plausible videos. To this end, we propose Phantom, a Physics-Infused Video Generation model that jointly models visual content and latent physical dynamics. Phantom leverages a physics-aware video representation that serves as an abstract yet informative embedding of the underlying physics, enabling the joint prediction of physical dynamics alongside video content without requiring an explicit specification of physical properties and their governing dynamics. By inferring this physics-aware representation jointly with the video itself, Phantom produces sequences that are both visually realistic and physically consistent.
Quantitative and qualitative results on both standard video generation and physics-aware benchmarks demonstrate that Phantom not only outperforms existing methods in terms of adherence to physical dynamics but also delivers competitive perceptual fidelity.
Video Comparisons
We compare Phantom against state-of-the-art video generation models in two settings: text-to-video (T2V) and text-/image-to-video (TI2V). We also demonstrate force-conditioned generation where Phantom responds to explicit physical control signals. Red boxes mark conditioning frames in the TI2V and force-conditioned settings.
Phantom can be further fine-tuned on the Force-Prompting dataset. Given a static image and a force tensor, Phantom synthesizes physically plausible motion driven by the applied force.
Quantitative Results
VideoPhy & VideoPhy-2
SA (Semantic Adherence) measures video–text alignment. PC (Physical Commonsense) measures adherence to real-world physics.
† denotes results reported by VideoREPA using the original prompts. * denotes evaluation with detailed prompts. Green percentages show improvement over the base model Wan2.2-TI2V-5B. Best results in bold, second-best underlined.
| Method | VideoPhy SA ↑ | VideoPhy PC ↑ | VideoPhy-2 SA ↑ | VideoPhy-2 PC ↑ |
|---|---|---|---|---|
| General-Purpose | ||||
| VideoCrafter2 | 50.3 | 29.7 | 25.89 | 55.67 |
| LaVIE | 48.7 | 31.5 | — | — |
| Cosmos-Diffusion-7B | 57.0 | 18.0 | 26.32 | 54.19 |
| CogVideoX-5B | 63.1 | 31.4 | 28.86 | 68.42 |
| Wan2.2-TI2V-5B | 41.5 | 25.2 | 24.53 | 69.20 |
| Wan2.2-TI2V-5B* | 64.7 | 28.6 | 24.53 | 69.20 |
| Physics-Focused | ||||
| PhyT2V (Round 4)† | 61 | 37 | — | — |
| WISA† | 62 | 33 | — | — |
| VideoREPA | 51.9 | 22.4 | 21.02 | 72.54 |
| VideoREPA*† | 72.1 | 40.1 | 21.02 | 72.54 |
| Phantom (Ours) | 47.5 ↑14.5% | 37.9 ↑50.4% | 27.75 ↑13.1% | 71.74 ↑2.6% |
| Phantom* (Ours) | 70.3 ↑8.7% | 39.4 ↑37.8% | 27.75 ↑13.1% | 71.74 ↑2.6% |
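The green percentages in the table are relative gains over the corresponding Wan2.2-TI2V-5B row. As a quick sanity check on the arithmetic (`rel_gain` is a helper introduced here, not part of Phantom):

```python
def rel_gain(new: float, base: float) -> float:
    """Relative improvement over the base model, as a percentage."""
    return round((new - base) / base * 100, 1)

# Phantom vs. Wan2.2-TI2V-5B on VideoPhy PC (original prompts): 25.2 -> 37.9
print(rel_gain(37.9, 25.2))  # 50.4, matching the table
# Phantom* vs. Wan2.2-TI2V-5B* on VideoPhy SA (detailed prompts): 64.7 -> 70.3
print(rel_gain(70.3, 64.7))  # 8.7, matching the table
```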
Physics-IQ
Physics-IQ tests physical extrapolation from real-world motion sequences under single-frame and multi-frame conditioning. Green percentages show improvement over the base model Wan2.2-TI2V-5B; for MSE, lower is better.
| Method | Spatial IoU ↑ | Spatiotem. IoU ↑ | Weighted IoU ↑ | MSE ↓ | Physics-IQ ↑ |
|---|---|---|---|---|---|
| Single Frame — General-Purpose | |||||
| VideoPoet | 0.141 | 0.126 | 0.087 | 0.012 | 20.30 |
| Lumiere | 0.113 | 0.173 | 0.061 | 0.016 | 19.00 |
| Runway Gen 3 | 0.201 | 0.115 | 0.116 | 0.015 | 22.80 |
| CogVideoX1.5-I2V | 0.198 | 0.189 | 0.127 | 0.015 | 27.90 |
| Wan2.2-TI2V-5B | 0.164 | 0.132 | 0.102 | 0.010 | 22.10 |
| Single Frame — Physics-Focused | |||||
| RDPO | — | — | — | — | 25.21 |
| Phantom (Ours) | 0.245 ↑49.4% | 0.146 ↑10.6% | 0.140 ↑37.3% | 0.009 ↑11.1% | 29.59 ↑33.9% |
| Multi-Frame — General-Purpose | |||||
| VideoPoet | 0.204 | 0.164 | 0.137 | 0.010 | 29.50 |
| Lumiere | 0.170 | 0.155 | 0.093 | 0.013 | 23.00 |
| Multi-Frame — Physics-Focused | |||||
| Phantom (Ours) | 0.235 | 0.133 | 0.132 | 0.011 | 27.53 |
VBench-2.0
VBench-2.0 evaluates five core dimensions (Human Fidelity, Controllability, Creativity, Physics, and Commonsense) across 18 fine-grained metrics, providing a comprehensive assessment of overall video quality.
| Model | Total ↑ | Creativity ↑ | Commonsense ↑ | Controllability ↑ | Human Fidelity ↑ | Physics ↑ |
|---|---|---|---|---|---|---|
| Wan2.2-TI2V-5B | 51.57 | 52.50 | 60.57 | 18.50 | 86.10 | 40.19 |
| Phantom (Ours) | 51.84 ↑0.5% | 45.51 | 61.43 ↑1.4% | 20.23 ↑9.4% | 88.39 ↑2.7% | 43.61 ↑6.0% |
BibTeX
@inproceedings{shen2026phantom,
title = {Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics},
author = {Shen, Ying and Xiong, Jerry and Yu, Tianjiao and Lourentzou, Ismini},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}