Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation
Abstract
Kinema4D is a 4D generative robotic simulator that couples precise kinematic robot control with spatiotemporal synthesis of environmental reactions, enabling physically plausible, embodiment-agnostic simulation with zero-shot transfer capability.
Simulating robot-world interactions is a cornerstone of Embodied AI. Recently, a few works have shown promise in leveraging video generation to transcend the rigid visual and physical constraints of traditional simulators. However, they primarily operate in 2D space or are guided by static environmental cues, ignoring the fundamental reality that robot-world interactions are inherently 4D spatiotemporal events that require precise interactive modeling. To restore this 4D essence while ensuring precise robot control, we introduce Kinema4D, a new action-conditioned 4D generative robotic simulator that disentangles robot-world interaction into: i) Precise 4D representation of robot controls: we drive a URDF-based 3D robot via kinematics, producing an exact 4D robot control trajectory. ii) Generative 4D modeling of environmental reactions: we project the 4D robot trajectory into a pointmap that serves as a spatiotemporal visual signal, conditioning the generative model to synthesize the environment's complex reactive dynamics as synchronized RGB/pointmap sequences. To facilitate training, we curate a large-scale dataset, Robo4D-200k, comprising 201,426 robot interaction episodes with high-quality 4D annotations. Extensive experiments demonstrate that our method effectively simulates physically plausible, geometry-consistent, and embodiment-agnostic interactions that faithfully mirror diverse real-world dynamics. For the first time, it also shows potential zero-shot transfer capability, providing a high-fidelity foundation for advancing next-generation embodied simulation.
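As a rough illustration of the two-stage pipeline described in the abstract, the sketch below (Python/NumPy) drives a toy arm with forward kinematics to produce a 4D control trajectory, then rasterizes that trajectory into per-frame pointmaps that could condition a video generator. The 2-link arm, camera pose, image resolution, and intrinsics are placeholder assumptions rather than details from the paper, which drives a full URDF-based robot model; this is a minimal sketch, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): (i) forward kinematics -> 4D robot
# control trajectory, (ii) pinhole projection -> per-frame pointmap signal.
# The 2-link arm, camera pose, intrinsics, and 64x64 resolution are assumptions.
import numpy as np

def rot_z(theta):
    """4x4 homogeneous transform: rotation about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]
    return T

def trans(x, y, z):
    """4x4 homogeneous transform: pure translation."""
    T = np.eye(4)
    T[:3, 3] = [x, y, z]
    return T

def forward_kinematics(q):
    """Toy 2-joint planar arm: world positions of base, elbow, and end effector."""
    T1 = rot_z(q[0]) @ trans(0.3, 0.0, 0.0)        # link 1, 0.30 m
    T2 = T1 @ rot_z(q[1]) @ trans(0.25, 0.0, 0.0)  # link 2, 0.25 m
    return np.stack([np.zeros(3), T1[:3, 3], T2[:3, 3]])

def to_pointmap(points_world, T_world_to_cam, K, hw=(64, 64)):
    """Project 3D points into an HxWx3 pointmap holding camera-frame XYZ per pixel."""
    H, W = hw
    pm = np.zeros((H, W, 3), dtype=np.float32)
    pts_h = np.concatenate([points_world, np.ones((len(points_world), 1))], axis=1)
    pts_cam = (T_world_to_cam @ pts_h.T).T[:, :3]
    for p in pts_cam:
        if p[2] <= 1e-6:                            # skip points behind the camera
            continue
        u, v, _ = K @ (p / p[2])                    # pinhole projection
        ui, vi = int(round(u)), int(round(v))
        if 0 <= ui < W and 0 <= vi < H:
            pm[vi, ui] = p                          # store camera-frame XYZ
    return pm

# Per-frame joint angles (the 4D control trajectory) -> per-frame pointmaps.
K = np.array([[60.0, 0.0, 32.0], [0.0, 60.0, 32.0], [0.0, 0.0, 1.0]])  # toy intrinsics
T_wc = trans(0.0, 0.0, 1.0)            # camera 1 m behind the arm plane, looking along +z
frames = []
for t in range(16):                    # a 16-frame episode
    q = np.array([0.5 * np.sin(0.2 * t), 0.3 * np.cos(0.2 * t)])
    frames.append(to_pointmap(forward_kinematics(q), T_wc, K))
pointmap_video = np.stack(frames)      # (16, 64, 64, 3) spatiotemporal conditioning signal
```

In the full system, such per-frame pointmaps (paired with RGB frames) play the role of the spatiotemporal visual signal that controls the generative model's synthesis of environmental reactions.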
Community
Project page: https://mutianxu.github.io/Kinema4D-project-page/
Demo video: https://www.youtube.com/watch?v=9Z1fLIwuZdM
the idea of tying exact 4D robot kinematics to a learned 4D world reaction is neat, i like the disentanglement of control from environment dynamics. but for fast, contact-rich maneuvers, projection drift and occlusions could desynchronize the 4D pointmap from the true motion, which might hurt geometry consistency. did you test how sensitive the results are to the fidelity of the 4D projection, or to frame-rate differences between the kinematic trajectory and the generated sequence? btw, there's a solid walkthrough on arxivlens that helped me parse the method details: https://arxivlens.com/PaperView/Details/kinema4d-kinematic-4d-world-modeling-for-spatiotemporal-embodied-simulation-1744-9a570cd1
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- ImagiNav: Scalable Embodied Navigation via Generative Visual Prediction and Inverse Dynamics (2026)
- Egocentric World Model for Photorealistic Hand-Object Interaction Synthesis (2026)
- Mirage2Matter: A Physically Grounded Gaussian World Model from Video (2026)
- AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation (2026)
- DiT4DiT: Jointly Modeling Video Dynamics and Actions for Generalizable Robot Control (2026)
- TC-IDM: Grounding Video Generation for Executable Zero-shot Robot Motion (2026)
- Beyond Dense Futures: World Models as Structured Planners for Robotic Manipulation (2026)