Towards Physically Plausible Video Generation via VLM Planning

1Monash University,  2Dalian University of Technology,  3Shanghai Artificial Intelligence Laboratory,  4Oxford University,  5The University of Sydney,  6ZMO AI
Equal Contribution   ✉ Corresponding Author  

Showcases



Comparisons with Baseline Methods



Overview

Our pipeline consists of two stages. In the first stage, the VLM generates a coarse-grained, physically plausible motion trajectory from the provided input conditions. In the second stage, we synthesize a coarse motion video that follows the predicted trajectory, extract the optical flow from this video, and convert it into structured noise. These conditions are then fed into an image-to-video diffusion model, which ultimately generates a physically plausible video.
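To make the second stage concrete, below is a minimal sketch of the flow-to-noise step, assuming per-frame optical flow has already been estimated from the coarse motion video. The helper names in the commented usage (plan_trajectory_with_vlm, render_coarse_video, estimate_optical_flow, i2v_diffusion_sample) are hypothetical stand-ins for the actual VLM planner, renderer, flow estimator, and image-to-video model; the nearest-neighbour noise warping shown here is one plausible way to build structured noise, not necessarily the paper's exact procedure.

```python
import numpy as np

def flow_to_structured_noise(flows, h, w, seed=0):
    """Warp one Gaussian noise frame along per-frame optical flow so the
    noise is temporally correlated with the planned motion.

    flows: list of (h, w, 2) arrays of per-pixel (dx, dy) displacements.
    Returns a (num_frames, h, w) array of structured noise.
    """
    rng = np.random.default_rng(seed)
    base = rng.standard_normal((h, w))  # shared noise for the first frame
    ys, xs = np.mgrid[0:h, 0:w]
    frames = [base]
    for flow in flows:
        # Backward-warp the previous noise frame along the flow field
        # (nearest-neighbour sampling, clamped at the image border).
        src_y = np.clip((ys - flow[..., 1]).round().astype(int), 0, h - 1)
        src_x = np.clip((xs - flow[..., 0]).round().astype(int), 0, w - 1)
        frames.append(frames[-1][src_y, src_x])
    return np.stack(frames)

# Hypothetical end-to-end usage of the two-stage pipeline:
# trajectory   = plan_trajectory_with_vlm(image, text_prompt)
# coarse_video = render_coarse_video(image, trajectory)
# flows = [estimate_optical_flow(a, b)
#          for a, b in zip(coarse_video, coarse_video[1:])]
# noise = flow_to_structured_noise(flows, height, width)
# video = i2v_diffusion_sample(image, init_noise=noise)
```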



Abstract

Video diffusion models (VDMs) have advanced significantly in recent years, enabling the generation of highly realistic videos and drawing the community's attention to their potential as world simulators. However, despite their capabilities, VDMs often fail to produce physically plausible videos due to an inherent lack of understanding of physics, resulting in incorrect dynamics and event sequences. To address this limitation, we propose a novel two-stage image-to-video generation framework that explicitly incorporates physics. In the first stage, we employ a Vision Language Model (VLM) as a coarse-grained motion planner, integrating chain-of-thought and physics-aware reasoning to predict rough motion trajectories/changes that approximate real-world physical dynamics while ensuring inter-frame consistency. In the second stage, we use the predicted motion trajectories/changes to guide the video generation of a VDM. Because the predicted motion trajectories/changes are rough, noise is added during inference to give the VDM freedom to generate finer-grained motion details. Extensive experimental results demonstrate that our framework can produce physically plausible motion, and comparative evaluations highlight the notable superiority of our approach over existing methods.
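The noise-addition step mentioned above can be pictured with a toy example: small Gaussian perturbations loosen the VLM's rough plan so the diffusion model is guided by, rather than locked to, the trajectory. This is an illustrative sketch only; perturb_trajectory and the sigma value are hypothetical and not specified by the paper.

```python
import numpy as np

def perturb_trajectory(trajectory, sigma=2.0, seed=0):
    """trajectory: (num_frames, 2) array of planned (x, y) positions.
    Adds i.i.d. Gaussian noise so the guidance is soft, not exact."""
    rng = np.random.default_rng(seed)
    return trajectory + rng.normal(scale=sigma, size=trajectory.shape)

coarse = np.linspace([0.0, 0.0], [64.0, 16.0], num=16)  # straight-line plan
guided = perturb_trajectory(coarse)                     # loosened guidance
```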