Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics
CVPR 2026
Abstract
Recent advances in generative video modeling, driven by large-scale datasets and increasingly powerful architectures, have produced striking visual realism. However, growing evidence suggests that scaling data and model size alone does not endow these systems with an understanding of the underlying physical laws that govern real-world dynamics.
In this work, we investigate whether integrating the inference of latent physical properties directly into the generation process can equip video models to produce physically plausible videos. To this end, we propose Phantom, a Physics-Infused Video Generation model that jointly models visual content and latent physical dynamics. Phantom leverages a physics-aware video representation that serves as an abstract yet informative embedding of the underlying physics, enabling the joint prediction of physical dynamics alongside video content without requiring an explicit specification of physical properties and their governing dynamics. By inferring this physics-aware representation jointly with the video itself, Phantom produces sequences that are both visually realistic and physically consistent.
Quantitative and qualitative results on both standard video generation and physics-aware benchmarks demonstrate that Phantom not only outperforms existing methods in terms of adherence to physical dynamics but also delivers competitive perceptual fidelity.
Video Comparisons
We compare Phantom against state-of-the-art video generation models in two settings: text-to-video (T2V) and text-/image-to-video (TI2V). We also demonstrate force-conditioned generation where Phantom responds to explicit physical control signals. Red boxes mark conditioning frames in the TI2V and force-conditioned settings.
Phantom can be further fine-tuned on the Force-Prompting dataset. Given a static image and a force tensor, Phantom synthesizes physically plausible motion driven by the applied force.
Quantitative Results
VideoPhy & VideoPhy-2
SA (Semantic Adherence) measures video–text alignment. PC (Physical Commonsense) measures adherence to real-world physics.
† denotes results reported by VideoREPA using the original prompts. * denotes evaluation with detailed prompts. Green percentages show improvement over the base model Wan2.2-TI2V-5B. Best results in bold, second-best underlined.
| Method | VideoPhy SA ↑ | VideoPhy PC ↑ | VideoPhy-2 SA ↑ | VideoPhy-2 PC ↑ |
|---|---|---|---|---|
| General-Purpose | ||||
| VideoCrafter2 | 50.3 | 29.7 | 25.89 | 55.67 |
| LaVIE | 48.7 | 31.5 | — | — |
| Cosmos-Diffusion-7B | 57.0 | 18.0 | 26.32 | 54.19 |
| CogVideoX-5B | 63.1 | 31.4 | 28.86 | 68.42 |
| Wan2.2-TI2V-5B | 41.5 | 25.2 | 24.53 | 69.20 |
| Wan2.2-TI2V-5B* | 64.7 | 28.6 | 24.53 | 69.20 |
| Physics-Focused | ||||
| PhyT2V (Round 4)† | 61 | 37 | — | — |
| WISA† | 62 | 33 | — | — |
| VideoREPA | 51.9 | 22.4 | 21.02 | 72.54 |
| VideoREPA*† | 72.1 | 40.1 | 21.02 | 72.54 |
| Phantom (Ours) | 47.5 ↑14.5% | 37.9 ↑50.4% | 27.75 ↑13.1% | 71.74 ↑2.6% |
| Phantom* (Ours) | 70.3 ↑8.7% | 39.4 ↑37.8% | 27.75 ↑13.1% | 71.74 ↑2.6% |
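The green percentages in the table are relative gains over the corresponding Wan2.2-TI2V-5B row. As a quick sanity check on the arithmetic (`rel_gain` is a helper introduced here, not part of Phantom):

```python
def rel_gain(new: float, base: float) -> float:
    """Relative improvement over the base model, as a percentage."""
    return round((new - base) / base * 100, 1)

# Phantom vs. Wan2.2-TI2V-5B on VideoPhy PC (original prompts): 25.2 -> 37.9
print(rel_gain(37.9, 25.2))  # 50.4, matching the table
# Phantom* vs. Wan2.2-TI2V-5B* on VideoPhy SA (detailed prompts): 64.7 -> 70.3
print(rel_gain(70.3, 64.7))  # 8.7, matching the table
```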
Physics-IQ
Physics-IQ tests physical extrapolation from real-world motion sequences under single-frame and multi-frame conditioning. Green percentages show improvement over the base model Wan2.2-TI2V-5B; for MSE, lower is better.
| Method | Spatial IoU ↑ | Spatiotem. IoU ↑ | Weighted IoU ↑ | MSE ↓ | Physics-IQ ↑ |
|---|---|---|---|---|---|
| Single Frame — General-Purpose | |||||
| VideoPoet | 0.141 | 0.126 | 0.087 | 0.012 | 20.30 |
| Lumiere | 0.113 | 0.173 | 0.061 | 0.016 | 19.00 |
| Runway Gen 3 | 0.201 | 0.115 | 0.116 | 0.015 | 22.80 |
| CogVideoX1.5-I2V | 0.198 | 0.189 | 0.127 | 0.015 | 27.90 |
| Wan2.2-TI2V-5B | 0.164 | 0.132 | 0.102 | 0.010 | 22.10 |
| Single Frame — Physics-Focused | |||||
| RDPO | — | — | — | — | 25.21 |
| Phantom (Ours) | 0.245 ↑49.4% | 0.146 ↑10.6% | 0.140 ↑37.3% | 0.009 ↑11.1% | 29.59 ↑33.9% |
| Multi-Frame — General-Purpose | |||||
| VideoPoet | 0.204 | 0.164 | 0.137 | 0.010 | 29.50 |
| Lumiere | 0.170 | 0.155 | 0.093 | 0.013 | 23.00 |
| Multi-Frame — Physics-Focused | |||||
| Phantom (Ours) | 0.235 | 0.133 | 0.132 | 0.011 | 27.53 |
VBench-2.0
VBench-2.0 evaluates five core dimensions (Human Fidelity, Controllability, Creativity, Physics, and Commonsense) across 18 fine-grained metrics, providing a comprehensive assessment of overall video quality.
| Model | Total ↑ | Creativity ↑ | Commonsense ↑ | Controllability ↑ | Human Fidelity ↑ | Physics ↑ |
|---|---|---|---|---|---|---|
| Wan2.2-TI2V-5B | 51.57 | 52.50 | 60.57 | 18.50 | 86.10 | 40.19 |
| Phantom (Ours) | 51.84 ↑0.5% | 45.51 | 61.43 ↑1.4% | 20.23 ↑9.4% | 88.39 ↑2.7% | 43.61 ↑6.0% |
BibTeX
@inproceedings{shen2026phantom,
title = {Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics},
author = {Shen, Ying and Xiong, Jerry and Yu, Tianjiao and Lourentzou, Ismini},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}