
Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control

Linxi Xie, Lisong C. Sun, Ashley Neall, Tong Wu, Shengqu Cai, Gordon Wetzstein
Published: February 20, 2026
Authors: 6
Word Count: 8,116

A video world model generates immersive XR environments that respond in real time to tracked hand and head movements, without manual content creation.

Abstract

Extended reality (XR) demands generative models that respond to users' tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, limiting their utility for embodied interaction. We introduce a human-centric video world model that is conditioned on both tracked head pose and joint-level hand poses. For this purpose, we evaluate existing diffusion transformer conditioning strategies and propose an effective mechanism for 3D head and hand control, enabling dexterous hand-object interactions. We train a bidirectional video diffusion model teacher using this strategy and distill it into a causal, interactive system that generates egocentric virtual environments. We evaluate this generated reality system with human subjects and demonstrate improved task performance as well as significantly higher perceived control over the performed actions compared with relevant baselines.

Key Takeaways

  1. AI video models can generate immersive XR experiences by conditioning on real-time hand and camera pose data.

  2. Hybrid conditioning combining 2D skeleton and 3D joint parameters enables precise hand-object interaction in virtual worlds.

  3. Latent video diffusion transformers with specialized attention mechanisms can efficiently generate temporally coherent interactive content.
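
The hybrid conditioning named in the second takeaway can be illustrated with a minimal sketch. It is not the authors' implementation: the `Intrinsics` class, the `project_joints` and `hybrid_condition` helpers, the pinhole camera model, and the intrinsic values are all assumptions made for illustration. The sketch only shows why a projected 2D skeleton alone can lose depth, and why pairing it with the 3D joint coordinates keeps that information.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Vec2 = Tuple[float, float]


@dataclass
class Intrinsics:
    """Hypothetical pinhole camera intrinsics (pixels)."""
    fx: float
    fy: float
    cx: float
    cy: float


def project_joints(joints_cam: List[Vec3], k: Intrinsics) -> List[Vec2]:
    """Project camera-space joints (z > 0) to pixel coordinates."""
    points = []
    for x, y, z in joints_cam:
        if z <= 0:
            raise ValueError("joint must lie in front of the camera")
        points.append((k.fx * x / z + k.cx, k.fy * y / z + k.cy))
    return points


def hybrid_condition(joints_cam: List[Vec3], k: Intrinsics):
    """Return both signals: a 2D skeleton and the flattened 3D joints.

    Assumption: a real system would rasterize the skeleton and embed the
    3D vector before feeding either to the model; that is omitted here.
    """
    skeleton_2d = project_joints(joints_cam, k)
    joints_3d_flat = [c for joint in joints_cam for c in joint]
    return skeleton_2d, joints_3d_flat


if __name__ == "__main__":
    k = Intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
    near = (0.25, 0.0, 0.5)  # joint 0.5 m from the camera
    far = (0.5, 0.0, 1.0)    # joint on the same ray, 1.0 m away
    skeleton, joints_flat = hybrid_condition([near, far], k)
    print(skeleton)     # both joints land on the same pixel
    print(joints_flat)  # the 3D values still differ in depth
```

In the usage example, the two joints sit on the same camera ray, so their 2D projections coincide while their 3D coordinates differ. That is the kind of depth ambiguity the 3D joint parameters are meant to resolve.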

Limitations

  • Existing video world models lack the fine-grained control needed for precise hand interactions in immersive experiences.

  • High-dimensional hand pose data creates ambiguity between 2D skeleton depth information and 3D spatial grounding.

Keywords

video world models, diffusion transformer, 3D head pose, joint-level hand poses, dexterous hand-object interactions, bidirectional video diffusion model, causal interactive system, egocentric virtual environments
