Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control

Linxi Xie, Lisong C. Sun, Ashley Neall, Tong Wu, Shengqu Cai, Gordon Wetzstein

Published: February 20, 2026
Authors: 6
Word Count: 8,116

AI generates immersive VR worlds responding to real-time hand and head movements without manual content creation.

Abstract

Extended reality (XR) demands generative models that respond to users' tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, limiting their utility for embodied interaction. We introduce a human-centric video world model that is conditioned on both tracked head pose and joint-level hand poses. For this purpose, we evaluate existing diffusion transformer conditioning strategies and propose an effective mechanism for 3D head and hand control, enabling dexterous hand-object interactions. We train a bidirectional video diffusion model teacher using this strategy and distill it into a causal, interactive system that generates egocentric virtual environments. We evaluate this generated reality system with human subjects and demonstrate improved task performance as well as a significantly higher level of perceived control over the performed actions compared with relevant baselines.

Key Takeaways

1. AI video models can generate immersive XR experiences by conditioning on real-time hand and camera pose data.
2. Hybrid conditioning combining 2D skeleton and 3D joint parameters enables precise hand-object interaction in virtual worlds.
3. Latent video diffusion transformers with specialized attention mechanisms can efficiently generate temporally coherent interactive content.
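
To make the second takeaway concrete, here is a minimal sketch of what a hybrid hand-conditioning signal might look like. It is not the authors' implementation: this summary does not specify how the 2D skeleton or the 3D joint parameters are encoded. The sketch assumes hand joints already expressed in the camera (head) frame and a simple pinhole camera; the names `Pinhole`, `project`, and `hybrid_hand_condition` are hypothetical. It also shows why a 2D skeleton alone can be ambiguous in depth: two joints on the same viewing ray project to the same pixel.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Pinhole:
    """Hypothetical pinhole intrinsics (focal lengths and principal point, in pixels)."""
    fx: float
    fy: float
    cx: float
    cy: float


def project(point_cam: Vec3, cam: Pinhole) -> Tuple[float, float]:
    """Project a 3D point given in the camera frame onto the image plane."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (cam.fx * x / z + cam.cx, cam.fy * y / z + cam.cy)


def hybrid_hand_condition(joints_cam: List[Vec3], cam: Pinhole) -> Dict[str, list]:
    """Bundle a 2D skeleton with the flattened 3D joint coordinates.

    Assumption: a real system would rasterize the skeleton and embed the
    3D parameters before feeding them to the model; here both are kept raw.
    """
    skeleton_2d = [project(p, cam) for p in joints_cam]
    joints_3d = [coord for p in joints_cam for coord in p]
    return {"skeleton_2d": skeleton_2d, "joints_3d": joints_3d}


if __name__ == "__main__":
    cam = Pinhole(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
    near = (0.25, 0.0, 0.5)  # joint 0.5 m in front of the camera
    far = (0.5, 0.0, 1.0)    # joint on the same viewing ray, 1.0 m away
    cond = hybrid_hand_condition([near, far], cam)
    # Both joints land on the same pixel; only joints_3d tells them apart.
    print(cond["skeleton_2d"])
    print(cond["joints_3d"])
```

In this toy setup the 2D skeleton supplies image-aligned position while the 3D values carry the depth that projection discards, which is one plausible reading of why the two signals are combined.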

Limitations

Existing video world models lack fine-grained control needed for precise hand interactions in immersive experiences.
High-dimensional hand pose data creates ambiguity between 2D skeleton depth information and 3D spatial grounding.

Keywords

video world models, diffusion transformer, 3D head pose, joint-level hand poses, dexterous hand-object interactions, bidirectional video diffusion model, causal interactive system, egocentric virtual environments
