Introduction
Humans internalize physical laws and causality not just by observing, but by actively interacting with the environment. Inspired by this cognitive process, our recent work explores whether artificial agents can acquire similar physical reasoning capabilities through interactive experience. While current large vision models and world models excel at static understanding, they often struggle to capture the underlying causal mechanisms required for robust, forward-looking physical interaction in dynamic worlds.
To bridge this gap, we introduce IPR-1 (Interactive Physical Reasoner), a paradigm that grounds an agent's understanding directly in the physical world. By integrating a generative World Model with a Vision-Language Model (VLM), our approach enables prediction-reinforced reasoning: the agent simulates potential future outcomes and refines its strategies through active interaction, turning static semantic knowledge into dynamic, actionable physical intelligence.
We validated this approach on G2U (Game-to-Unseen), a large-scale benchmark we constructed comprising over 1,000 heterogeneous game environments. Experiments show that IPR-1 performs robustly across three capability levels: Survival, Curiosity, and Utility. Notably, the model exhibits strong zero-shot transfer to completely unseen environments, outperforming state-of-the-art models such as GPT-5 on comprehensive physical reasoning benchmarks.
Method
We propose IPR (Interactive Physical Reasoner), which combines a VLM-based policy with world-model rollouts for look-ahead planning. Our approach introduces PhysCode, a physics-centric action representation that bridges semantic reasoning and physical dynamics, enabling the model to learn from interaction and steadily improve its physical reasoning capabilities.
Stage 1: PhysCode Pre-training
- Input: video clips + optical flow + action semantics as supervision.
- Train a VQ-style latent action model to learn discrete action codes (PhysCode).
- PhysCode captures core dynamics and serves as a shared action space (see the sketch below).
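The following is a minimal, hypothetical PyTorch sketch of what a VQ-style latent action model of this kind might look like. The module names (`PhysCodeQuantizer`, `LatentActionModel`), feature dimensions, and loss weights are illustrative assumptions, not the actual IPR-1 implementation.

```python
# Illustrative sketch only: dimensions, names, and losses are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhysCodeQuantizer(nn.Module):
    """Vector-quantize a continuous latent into a discrete PhysCode index."""
    def __init__(self, num_codes=512, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z):
        # z: (B, code_dim). Snap each latent to its nearest codebook entry.
        dist = torch.cdist(z, self.codebook.weight)        # (B, num_codes)
        idx = dist.argmin(dim=-1)                          # discrete PhysCode
        z_q = self.codebook(idx)
        # Straight-through estimator so gradients still reach the encoder.
        z_q_st = z + (z_q - z).detach()
        commit = F.mse_loss(z, z_q.detach()) + F.mse_loss(z_q, z.detach())
        return z_q_st, idx, commit

class LatentActionModel(nn.Module):
    """Infer the PhysCode that explains a (frame_t -> frame_t+1) transition."""
    def __init__(self, feat_dim=256, flow_dim=128, code_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(2 * feat_dim + flow_dim, 256), nn.ReLU(),
            nn.Linear(256, code_dim))
        self.quantizer = PhysCodeQuantizer(code_dim=code_dim)
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim + code_dim, 256), nn.ReLU(),
            nn.Linear(256, feat_dim))

    def forward(self, feat_t, feat_t1, flow_feat):
        # Encode the observed transition (frame features + optical-flow features).
        z = self.encoder(torch.cat([feat_t, feat_t1, flow_feat], dim=-1))
        z_q, idx, commit = self.quantizer(z)
        # The code is useful only if it reconstructs the next frame's features.
        pred_t1 = self.decoder(torch.cat([feat_t, z_q], dim=-1))
        recon = F.mse_loss(pred_t1, feat_t1)
        return recon + 0.25 * commit, idx
```

Action semantics (e.g., controller inputs when available) could additionally supervise the code indices; only the reconstruction and commitment losses are shown here.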
Stage 2: Latent-conditioned World Model
- Condition on current features plus PhysCode sequences.
- Predict future features and rewards under latent actions.
- Enable fast rollouts for look-ahead scoring and planning (see the sketch below).
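A minimal sketch of how such a latent-conditioned world model could roll forward in feature space follows; the GRU backbone, the reward head, and all dimensions are assumptions for illustration rather than the paper's architecture.

```python
# Illustrative sketch only: the recurrent backbone and heads are assumptions.
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    """Predict future features and rewards conditioned on PhysCode sequences."""
    def __init__(self, feat_dim=256, code_dim=64, hidden_dim=512):
        super().__init__()
        self.init_state = nn.Linear(feat_dim, hidden_dim)
        self.cell = nn.GRUCell(code_dim, hidden_dim)
        self.feat_head = nn.Linear(hidden_dim, feat_dim)    # next-step features
        self.reward_head = nn.Linear(hidden_dim, 1)         # next-step reward

    @torch.no_grad()
    def rollout(self, feat_0, physcodes):
        # feat_0: (B, feat_dim) current features; physcodes: (B, T, code_dim)
        # embedded PhysCodes. Everything stays in latent space, so rollouts are cheap.
        h = torch.tanh(self.init_state(feat_0))
        feats, rewards = [], []
        for t in range(physcodes.shape[1]):
            h = self.cell(physcodes[:, t], h)               # imagine one step
            feats.append(self.feat_head(h))
            rewards.append(self.reward_head(h).squeeze(-1))
        return torch.stack(feats, dim=1), torch.stack(rewards, dim=1)
```

Training would fit the feature and reward heads against observed trajectories; at planning time `rollout` is called many times with different candidate PhysCode sequences.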
Stage 3: Prediction-reinforced Reasoning
- The VLM proposes candidate PhysCode sequences from scene understanding.
- The world model rolls out candidates in imagination and scores them (reward/value).
- Use the scores to pick actions and reinforce the VLM policy, closing the predict–reason loop (see the sketch below).
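The loop itself can be summarized with the short sketch below. Here `vlm_policy.propose` is a hypothetical interface returning candidate PhysCode sequences with their log-probabilities, and using the discounted imagined reward as the score is an assumption, not necessarily the objective used in IPR-1.

```python
# Illustrative sketch only: the proposer interface and scoring rule are assumptions.
import torch

def plan_and_reinforce(vlm_policy, world_model, obs_feat, num_candidates=8, gamma=0.99):
    # 1) The VLM proposes candidate PhysCode sequences from its scene understanding.
    #    obs_feat: (1, feat_dim); codes: (K, T, code_dim); logprobs: (K,)
    codes, logprobs = vlm_policy.propose(obs_feat, num_candidates)

    # 2) The world model rolls the candidates out in imagination and scores them.
    _, rewards = world_model.rollout(obs_feat.expand(num_candidates, -1), codes)
    discounts = gamma ** torch.arange(rewards.shape[1],
                                      dtype=rewards.dtype, device=rewards.device)
    scores = (rewards * discounts).sum(dim=1)               # one score per candidate

    # 3) Act with the best-scoring plan ...
    best = scores.argmax()

    # 4) ... and reinforce the proposer with a score-weighted policy gradient,
    #    closing the predict-reason loop.
    advantages = scores - scores.mean()
    policy_loss = -(advantages.detach() * logprobs).mean()
    return codes[best], policy_loss
```

Executing `codes[best]` in the real environment and minimizing `policy_loss` is one plausible way to let imagined rollouts both select actions and improve the VLM policy.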
Performance Leaderboard
Evaluated on our Game-to-Unseen (G2U) benchmark with 1,000+ heterogeneous games, IPR demonstrates strong performance across three reasoning dimensions: Survival (basic physical intuition), Curiosity (exploration and discovery), and Utility (goal-driven reasoning). Results show that performance improves with more training games and interaction steps.
Interactive Demo (Under Development)
Experience IPR in action with our interactive Super Mario demo. Adjust the gravity parameter to modify the physical environment and observe how our VLM-based agent adapts its behavior in real-time, demonstrating robust physical reasoning under changing dynamics.
- Check the reference videos below for pre-recorded runs.
- Try it yourself: play the demo manually using the embedded game iframe.
Generalization to Robotics
Across different embodiments, the same physical and causal mechanisms apply. IPR picks objects zero-shot (though with visible motion jitter), while other vision–language models often fail to pick objects correctly, and in some cases fail even to approach the object reliably.
IPR
GPT-5.1
Gemini 2.5 Pro
Qwen3-VL
Citation
@article{zhang2025ipr,
title={IPR-1: Interactive Physical Reasoner},
author={Zhang, Mingyu and Zhuo, Lifeng and Tan, Tianxi and Xie, Guocan and Nie, Xian and Li, Yan and Zhao, Renjie and He, Zizhu and Wang, Ziyu and Cai, Jiting and others},
journal={arXiv preprint arXiv:2511.15407},
year={2025}
}