Introduction
Humans internalize physical laws and causality not just by observing, but by actively interacting with the environment. Inspired by this cognitive process, our recent work explores whether artificial agents can acquire similar physical reasoning capabilities through interactive experience. While current large vision models and world models excel at static understanding, they often struggle to capture the underlying causal mechanisms required for robust, forward-looking physical interaction in dynamic worlds.
To bridge this gap, we introduce IPR-1 (Interactive Physical Reasoner), a paradigm that grounds an agent's understanding directly in the physical world. By integrating a generative World Model with a Vision-Language Model (VLM), our approach enables prediction-reinforced reasoning: the agent simulates potential future outcomes and refines its strategies through active interaction, turning static semantic knowledge into dynamic, actionable physical intelligence.
We validated this approach on G2U (Game-to-Unseen), a large-scale benchmark we constructed comprising over 1,000 heterogeneous game environments. Experiments show that IPR-1 performs robustly across three capability levels: Survival, Curiosity, and Utility. Notably, the model exhibits strong zero-shot transfer to completely unseen environments, outperforming state-of-the-art models such as GPT-5 on comprehensive physical reasoning benchmarks.
Method
We propose IPR (Interactive Physical Reasoner), which combines a VLM-based policy with world-model rollouts for look-ahead planning. Our approach introduces PhysCode, a physics-centric action representation that bridges semantic reasoning and physical dynamics, enabling the model to learn from interaction and steadily improve its physical reasoning capabilities.
Stage 1: PhysCode Pre-training
- Input: video clips + optical flow + action semantics as supervision.
- Train a VQ-style latent action model to learn discrete action codes (PhysCode).
- PhysCode captures core dynamics and serves as a shared action space (see the sketch below).
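The following is a minimal, hypothetical PyTorch sketch of what a VQ-style latent action model of this kind might look like. The module names (`PhysCodeQuantizer`, `LatentActionModel`), feature dimensions, and loss weights are illustrative assumptions, not the actual IPR-1 implementation.

```python
# Illustrative sketch only: dimensions, names, and losses are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhysCodeQuantizer(nn.Module):
    """Vector-quantize a continuous latent into a discrete PhysCode index."""
    def __init__(self, num_codes=512, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z):
        # z: (B, code_dim). Snap each latent to its nearest codebook entry.
        dist = torch.cdist(z, self.codebook.weight)        # (B, num_codes)
        idx = dist.argmin(dim=-1)                          # discrete PhysCode
        z_q = self.codebook(idx)
        # Straight-through estimator so gradients still reach the encoder.
        z_q_st = z + (z_q - z).detach()
        commit = F.mse_loss(z, z_q.detach()) + F.mse_loss(z_q, z.detach())
        return z_q_st, idx, commit

class LatentActionModel(nn.Module):
    """Infer the PhysCode that explains a (frame_t -> frame_t+1) transition."""
    def __init__(self, feat_dim=256, flow_dim=128, code_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(2 * feat_dim + flow_dim, 256), nn.ReLU(),
            nn.Linear(256, code_dim))
        self.quantizer = PhysCodeQuantizer(code_dim=code_dim)
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim + code_dim, 256), nn.ReLU(),
            nn.Linear(256, feat_dim))

    def forward(self, feat_t, feat_t1, flow_feat):
        # Encode the observed transition (frame features + optical-flow features).
        z = self.encoder(torch.cat([feat_t, feat_t1, flow_feat], dim=-1))
        z_q, idx, commit = self.quantizer(z)
        # The code is useful only if it reconstructs the next frame's features.
        pred_t1 = self.decoder(torch.cat([feat_t, z_q], dim=-1))
        recon = F.mse_loss(pred_t1, feat_t1)
        return recon + 0.25 * commit, idx
```

Action semantics (e.g., controller inputs when available) could additionally supervise the code indices; only the reconstruction and commitment losses are shown here.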
Stage 2: Latent-conditioned World Model
- Condition on current features plus PhysCode sequences.
- Predict future features and rewards under latent actions.
- Enable fast rollouts for look-ahead scoring and planning (see the sketch below).
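A minimal sketch of how such a latent-conditioned world model could roll forward in feature space follows; the GRU backbone, the reward head, and all dimensions are assumptions for illustration rather than the paper's architecture.

```python
# Illustrative sketch only: the recurrent backbone and heads are assumptions.
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    """Predict future features and rewards conditioned on PhysCode sequences."""
    def __init__(self, feat_dim=256, code_dim=64, hidden_dim=512):
        super().__init__()
        self.init_state = nn.Linear(feat_dim, hidden_dim)
        self.cell = nn.GRUCell(code_dim, hidden_dim)
        self.feat_head = nn.Linear(hidden_dim, feat_dim)    # next-step features
        self.reward_head = nn.Linear(hidden_dim, 1)         # next-step reward

    @torch.no_grad()
    def rollout(self, feat_0, physcodes):
        # feat_0: (B, feat_dim) current features; physcodes: (B, T, code_dim)
        # embedded PhysCodes. Everything stays in latent space, so rollouts are cheap.
        h = torch.tanh(self.init_state(feat_0))
        feats, rewards = [], []
        for t in range(physcodes.shape[1]):
            h = self.cell(physcodes[:, t], h)               # imagine one step
            feats.append(self.feat_head(h))
            rewards.append(self.reward_head(h).squeeze(-1))
        return torch.stack(feats, dim=1), torch.stack(rewards, dim=1)
```

Training would fit the feature and reward heads against observed trajectories; at planning time `rollout` is called many times with different candidate PhysCode sequences.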
Stage 3: Prediction-reinforced Reasoning
- The VLM proposes candidate PhysCode sequences from scene understanding.
- The world model rolls out candidates in imagination and scores them (reward/value).
- Use the scores to pick actions and reinforce the VLM policy, closing the predict–reason loop (see the sketch below).
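The loop itself can be summarized with the short sketch below. Here `vlm_policy.propose` is a hypothetical interface returning candidate PhysCode sequences with their log-probabilities, and using the discounted imagined reward as the score is an assumption, not necessarily the objective used in IPR-1.

```python
# Illustrative sketch only: the proposer interface and scoring rule are assumptions.
import torch

def plan_and_reinforce(vlm_policy, world_model, obs_feat, num_candidates=8, gamma=0.99):
    # 1) The VLM proposes candidate PhysCode sequences from its scene understanding.
    #    obs_feat: (1, feat_dim); codes: (K, T, code_dim); logprobs: (K,)
    codes, logprobs = vlm_policy.propose(obs_feat, num_candidates)

    # 2) The world model rolls the candidates out in imagination and scores them.
    _, rewards = world_model.rollout(obs_feat.expand(num_candidates, -1), codes)
    discounts = gamma ** torch.arange(rewards.shape[1],
                                      dtype=rewards.dtype, device=rewards.device)
    scores = (rewards * discounts).sum(dim=1)               # one score per candidate

    # 3) Act with the best-scoring plan ...
    best = scores.argmax()

    # 4) ... and reinforce the proposer with a score-weighted policy gradient,
    #    closing the predict-reason loop.
    advantages = scores - scores.mean()
    policy_loss = -(advantages.detach() * logprobs).mean()
    return codes[best], policy_loss
```

Executing `codes[best]` in the real environment and minimizing `policy_loss` is one plausible way to let imagined rollouts both select actions and improve the VLM policy.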
Performance Leaderboard
Evaluated on our Game-to-Unseen (G2U) benchmark with 1,000+ heterogeneous games, IPR demonstrates strong performance across three reasoning dimensions: Survival (basic physical intuition), Curiosity (exploration and discovery), and Utility (goal-driven reasoning). Results show that performance improves with more training games and interaction steps.
Interactive Demo (Under Development)
Experience IPR in action with our interactive Super Mario demo. Adjust the gravity parameter to modify the physical environment and observe how our VLM-based agent adapts its behavior in real-time, demonstrating robust physical reasoning under changing dynamics.
- Check the reference videos below for pre-recorded runs.
- Try it yourself: play the demo manually using the embedded game iframe.
Generalization to Robotics
Across different embodiments, the same physical and causal mechanisms apply. IPR picks objects zero-shot (though with visible motion jitter), while other vision–language models often fail to pick objects correctly, and in some cases fail even to approach the object reliably.
IPR
GPT-5.1
Gemini 2.5 Pro
Qwen3-VL
Citation
@article{zhang2025ipr,
title={IPR-1: Interactive Physical Reasoner},
author={Zhang, Mingyu and Zhuo, Lifeng and Tan, Tianxi and Xie, Guocan and Nie, Xian and Li, Yan and Zhao, Renjie and He, Zizhu and Wang, Ziyu and Cai, Jiting and others},
journal={arXiv preprint arXiv:2511.15407},
year={2025}
}