Latest Generative AI Research Papers

Research on AI systems that create new content including image generation, text-to-image, video synthesis, and creative AI applications.

76 Papers

Stroke3D: Lifting 2D strokes into rigged 3D model via latent diffusion models

Ruisi Zhao, Haoren Zheng, Zongxin Yang +2 more

Rigged 3D assets are fundamental to 3D deformation and animation. However, existing 3D generation methods face challenges in generating animatable geometry, while rigging techniques lack fine-grained structural control over skeleton creation. To address these limitations, we introduce Stroke3D, a no...

Skeletal Graph VAE, Sk-VAE, Skeletal Graph DiT, Sk-DiT, TextuRig, +6 more
Feb 10, 2026

ArcFlow: Unleashing 2-Step Text-to-Image Generation via High-Precision Non-Linear Flow Distillation

Zihan Yang, Shuyuan Tu, Licheng Zhang +3 more

Diffusion models have achieved remarkable generation quality, but they suffer from significant inference cost due to their reliance on multiple sequential denoising steps, motivating recent efforts to distill this inference process into a few-step regime. However, existing distillation methods typic...

diffusion models, distillation, velocity field, momentum processes, trajectory distillation, +5 more
Feb 9, 2026

Rethinking Global Text Conditioning in Diffusion Transformers

Nikita Starodubcev, Daniil Pakhomov, Zongze Wu +6 more

Diffusion transformers typically incorporate textual information via attention layers and a modulation mechanism using a pooled text embedding. Nevertheless, recent approaches discard modulation-based text conditioning and rely exclusively on attention. In this paper, we address whether modulation-b...
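The modulation mechanism this abstract refers to is commonly implemented as AdaLN-style conditioning, where the pooled text embedding is projected to per-channel scale, shift, and gate. A minimal numpy sketch of that general mechanism (weights and shapes are illustrative assumptions, not this paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def adaln_modulate(x, pooled, W, b):
    """AdaLN-style modulation: project a pooled text embedding to
    per-channel scale, shift, and gate, then modulate the normalized
    hidden states and apply a gated residual update."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    h = (x - mu) / np.sqrt(var + 1e-6)               # LayerNorm, no affine
    scale, shift, gate = np.split(pooled @ W + b, 3, axis=-1)
    h = h * (1 + scale[:, None, :]) + shift[:, None, :]
    return x + gate[:, None, :] * h

x = rng.normal(size=(2, 16, 64))            # (batch, tokens, hidden)
pooled = rng.normal(size=(2, 128))          # pooled text embedding
W = 0.02 * rng.normal(size=(128, 3 * 64))   # hypothetical learned projection
b = np.zeros(3 * 64)
out = adaln_modulate(x, pooled, W, b)
print(out.shape)  # (2, 16, 64)
```

The attention-only alternative the abstract mentions would instead feed per-token text embeddings through cross- or joint-attention, with no pooled vector at all.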

diffusion transformers, attention layers, modulation mechanism, pooled text embedding, text-to-image generation, +2 more
Feb 9, 2026

Autoregressive Image Generation with Masked Bit Modeling

Qihang Yu, Qihao Liu, Ju He +4 more

This paper challenges the dominance of continuous pipelines in visual generation. We systematically investigate the performance gap between discrete and continuous methods. Contrary to the belief that discrete tokenizers are intrinsically inferior, we demonstrate that the disparity arises primarily ...

discrete tokenizers, continuous pipelines, latent space, codebook size, autoregressive transformer, +4 more
Feb 9, 2026

MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE

Ruijie Zhu, Jiahao Lu, Wenbo Hu +4 more

We introduce MotionCrafter, a video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense motion from a monocular video. The core of our method is a novel joint representation of dense 3D point maps and 3D scene flows in a shared coordinate system, and a novel 4D VAE to...

video diffusion, 4D geometry, dense motion estimation, 3D point maps, 3D scene flows, +6 more
Feb 9, 2026

GEBench: Benchmarking Image Generation Models as GUI Environments

Haodong Li, Jingwei Wu, Quan Sun +14 more

Recent advancements in image generation models have enabled the prediction of future Graphical User Interface (GUI) states based on user instructions. However, existing benchmarks primarily focus on general domain visual fidelity, leaving the evaluation of state transitions and temporal coherence in...

GUI generation, temporal coherence, dynamic interaction, visual fidelity, GUI-specific contexts, +7 more
Feb 9, 2026

PISCO: Precise Video Instance Insertion with Sparse Control

Xiangbo Gao, Renjie Li, Xinghao Chen +4 more

The landscape of AI video generation is undergoing a pivotal shift: moving beyond general generation - which relies on exhaustive prompt-engineering and "cherry-picking" - towards fine-grained, controllable generation and high-fidelity post-processing. In professional AI-assisted filmmaking, it is c...

video diffusion model, video instance insertion, sparse keyframe control, variable-information guidance, distribution-preserving temporal masking, +5 more
Feb 9, 2026

Optimizing Few-Step Generation with Adaptive Matching Distillation

Lichen Bai, Zikai Zhou, Shitong Shao +5 more

Distribution Matching Distillation (DMD) is a powerful acceleration paradigm, yet its stability is often compromised in the Forbidden Zone: regions where the real teacher provides unreliable guidance while the fake teacher exerts insufficient repulsive force. In this work, we propose a unified optimizat...

Distribution Matching Distillation, Forbidden Zone, reward proxies, structural signal decomposition, Repulsive Landscape Sharpening, +2 more
Feb 7, 2026

Alleviating Sparse Rewards by Modeling Step-Wise and Long-Term Sampling Effects in Flow-Based GRPO

Yunze Tong, Mushui Liu, Canyu Zhao +9 more

Deploying GRPO on Flow Matching models has proven effective for text-to-image generation. However, existing paradigms typically propagate an outcome-based reward to all preceding denoising steps without distinguishing the local effect of each step. Moreover, current group-wise ranking mainly compare...
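In the outcome-reward paradigm this abstract critiques, a single reward per sample is standardized within its sampling group and then applied uniformly to every denoising step. A toy sketch of that baseline credit assignment (illustrative numbers, not this paper's method):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantage: standardize each outcome reward
    against the mean and std of its sampling group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

group_rewards = [0.2, 0.8, 0.5, 0.9]   # one outcome reward per sample
adv = grpo_advantages(group_rewards)

# Outcome-based credit assignment: the same advantage is broadcast to
# every denoising step of the trajectory, regardless of each step's
# local contribution (the issue the abstract highlights).
num_denoising_steps = 4
per_step_adv = np.tile(adv[:, None], (1, num_denoising_steps))
print(per_step_adv.shape)  # (4, 4)
```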

GRPO, flow matching models, text-to-image generation, denoising steps, reward sparsity, +5 more
Feb 6, 2026

Stable Velocity: A Variance Perspective on Flow Matching

Donglin Yang, Yongxing Zhang, Xin Yu +5 more

While flow matching is elegant, its reliance on single-sample conditional velocities leads to high-variance training targets that destabilize optimization and slow convergence. By explicitly characterizing this variance, we identify 1) a high-variance regime near the prior, where optimization is cha...
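The high-variance targets the abstract describes can be seen in a toy experiment with standard conditional flow matching on a linear path, x_t = (1 - t) x_0 + t x_1 with single-sample target u = x_1 - x_0: near the prior, many different data pairings pass through the same x_t, so the regression target there is noisy. This illustrates the baseline problem, not the paper's remedy:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
x1 = rng.choice([-2.0, 2.0], size=N) + 0.1 * rng.normal(size=N)  # bimodal "data"
x0 = rng.normal(size=N)                                          # prior

def mean_conditional_target_variance(t):
    """Average, over bins of x_t, of the variance of the single-sample
    conditional velocity target u = x1 - x0 given x_t (linear path)."""
    xt = (1 - t) * x0 + t * x1
    u = x1 - x0
    bins = np.digitize(xt, np.linspace(-4, 4, 81))
    num, den = 0.0, 0
    for b in np.unique(bins):
        m = bins == b
        if m.sum() > 50:
            num += u[m].var() * m.sum()
            den += m.sum()
    return num / den

v_near_prior = mean_conditional_target_variance(0.05)  # t near the prior
v_near_data = mean_conditional_target_variance(0.95)   # t near the data
print(v_near_prior > v_near_data)  # True: targets are noisier near the prior
```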

flow matching, conditional velocities, variance reduction, Stable Velocity Matching, Variance-Aware Representation Alignment, +3 more
Feb 5, 2026

RISE-Video: Can Video Generators Decode Implicit World Rules?

Mingxin Liu, Shuran Ma, Shibei Meng +9 more

While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under-explored frontier. To bridge this gap, we present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2...

Text-Image-to-Video, multimodal models, reasoning alignment, temporal consistency, physical rationality, +3 more
Feb 5, 2026

Pathwise Test-Time Correction for Autoregressive Long Video Generation

Xunzhi Xiang, Zixuan Duan, Guiyu Zhang +7 more

Distilled autoregressive diffusion models facilitate real-time short video synthesis but suffer from severe error accumulation during long-sequence generation. While existing Test-Time Optimization (TTO) methods prove effective for images or short clips, we identify that they fail to mitigate drift ...
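The error accumulation the abstract identifies can be illustrated with a toy autoregressive recurrence in which each step inherits the previous prediction's small systematic error (an illustration of the failure mode, not this paper's correction):

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(steps, bias=0.02, noise=0.01):
    """Toy autoregressive rollout: each predicted frame is built from the
    previous prediction, so a small per-step bias compounds into drift
    from the ground-truth trajectory (here, the sequence 1, 2, 3, ...)."""
    x, drift = 0.0, []
    for k in range(1, steps + 1):
        x = x + 1.0 + bias + noise * rng.normal()  # autoregressive update
        drift.append(abs(x - k))                   # deviation from truth
    return drift

d = rollout(200)
print(d[9] < d[199])  # True: drift grows with sequence length
```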

distilled autoregressive diffusion models, Test-Time Optimization, Test-Time Correction, error accumulation, long-sequence generation, +4 more
Feb 5, 2026

Context Forcing: Consistent Autoregressive Video Generation with Long Context

Shuo Chen, Cong Wei, Sun Sun +5 more

Recent approaches to real-time long video generation typically employ streaming tuning strategies, attempting to train a long-context student using a short-context (memoryless) teacher. In these frameworks, the student performs long rollouts but receives supervision from a teacher limited to short 5...

streaming tuning strategies, long-context student, short-context teacher, student-teacher mismatch, context management system, +3 more
Feb 5, 2026

Skin Tokens: A Learned Compact Representation for Unified Autoregressive Rigging

Jia-peng Zhang, Cheng-Feng Pu, Meng-Hao Guo +2 more

The rapid proliferation of generative 3D models has created a critical bottleneck in animation pipelines: rigging. Existing automated methods are fundamentally limited by their approach to skinning, treating it as an ill-posed, high-dimensional regression task that is inefficient to optimize and is ...
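The FSQ-CVAE tag suggests the compact representation builds on Finite Scalar Quantization, which bounds each latent channel and rounds it onto a small per-channel grid to obtain a discrete token. A generic sketch of FSQ (levels and inputs are illustrative, not this paper's configuration):

```python
import numpy as np

def fsq_quantize(z, levels=(7, 7, 7)):
    """Finite Scalar Quantization: bound each latent channel with tanh,
    round it onto a small per-channel integer grid, and read off a
    single discrete token as a mixed-radix code."""
    z = np.asarray(z, dtype=float)
    half = (np.asarray(levels) - 1) / 2.0     # grid spans [-half, half]
    q = np.round(np.tanh(z) * half)           # nearest grid point
    digits = (q + half).astype(int)           # shift to 0..levels-1
    token = 0
    for d, l in zip(digits, levels):
        token = token * l + d
    return q, token

q, token = fsq_quantize([0.3, -1.2, 2.5])
print(q, token)   # grid points in [-3, 3] and a token in [0, 342]
```

Because the grid is fixed rather than learned, FSQ avoids the codebook-collapse issues of VQ-style tokenizers, which makes it attractive for token-sequence prediction.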

FSQ-CVAE, skinning weights, token sequence prediction, autoregressive framework, reinforcement learning, +2 more
Feb 4, 2026

Semantic Routing: Exploring Multi-Layer LLM Feature Weighting for Diffusion Transformers

Bozhou Li, Yushuo Guan, Haolin Li +7 more

Recent DiT-based text-to-image models increasingly adopt LLMs as text encoders, yet text conditioning remains largely static and often utilizes only a single LLM layer, despite pronounced semantic hierarchy across LLM layers and non-stationary denoising dynamics over both diffusion time and network ...
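The "normalized convex fusion" tag points to weighting per-layer LLM features with weights that are non-negative and sum to one; a generic sketch of such a combination (shapes and the softmax parameterization are assumptions, not this paper's design):

```python
import numpy as np

rng = np.random.default_rng(0)

def convex_layer_fusion(layer_feats, logits):
    """Fuse per-layer features with softmax weights: the weights are
    non-negative and sum to one, so the fused feature is a convex
    combination of the individual layers."""
    w = np.exp(logits - logits.max())
    w = w / w.sum()
    return np.tensordot(w, layer_feats, axes=1)

layer_feats = rng.normal(size=(12, 16, 64))  # (LLM layers, tokens, dim)
logits = np.zeros(12)                        # learnable per-layer scores
fused = convex_layer_fusion(layer_feats, logits)
print(fused.shape)  # (16, 64)
```

Making the logits depend on the diffusion timestep or network depth would yield the non-static conditioning the abstract argues for.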

DiT-based text-to-image models, LLMs, text encoders, diffusion models, normalized convex fusion, +13 more
Feb 3, 2026

3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation

Zhixue Fang, Xu He, Songlin Tang +5 more

Existing methods for human motion control in video generation typically rely on either 2D poses or explicit 3D parametric models (e.g., SMPL) as control signals. However, 2D poses rigidly bind motion to the driving viewpoint, precluding novel-view synthesis. Explicit 3D models, though structurally i...

motion encoder, video generator, motion tokens, cross-attention, view-rich supervision, +5 more
Feb 3, 2026

Diversity-Preserved Distribution Matching Distillation for Fast Visual Synthesis

Tianhe Wu, Ruibin Li, Lei Zhang +1 more

Distribution matching distillation (DMD) aligns a multi-step generator with its few-step counterpart to enable high-quality generation under low inference cost. However, DMD tends to suffer from mode collapse, as its reverse-KL formulation inherently encourages mode-seeking behavior, for which exist...
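The mode-seeking behavior of the reverse-KL objective can be checked numerically: against a bimodal target, a distribution collapsed onto one mode scores a lower reverse KL than one covering both modes, while forward KL prefers the opposite. A toy illustration of that general fact, not this paper's algorithm:

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence KL(p || q) over a shared grid."""
    return float(np.sum(p * np.log(p / q)))

def gauss(x, mu, sig):
    g = np.exp(-0.5 * ((x - mu) / sig) ** 2)
    return g / g.sum()

x = np.linspace(-6, 6, 1201)
p = 0.5 * gauss(x, -2, 0.4) + 0.5 * gauss(x, 2, 0.4)  # bimodal target

q_collapsed = gauss(x, 2, 0.4)   # sits on one mode only
q_covering = gauss(x, 0, 2.2)    # spreads over both modes

# Reverse KL, the direction DMD optimizes, rewards mode collapse:
print(kl(q_collapsed, p) < kl(q_covering, p))   # True
# Forward KL penalizes dropping a mode instead:
print(kl(p, q_covering) < kl(p, q_collapsed))   # True
```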

distribution matching distillation, mode collapse, reverse-KL formulation, v-prediction, text-to-image generation, +2 more
Feb 3, 2026

Privasis: Synthesizing the Largest "Public" Private Dataset from Scratch

Hyunwoo Kim, Niloofar Mireshghallah, Michael Duan +11 more

Research involving privacy-sensitive data has always been constrained by data scarcity, standing in sharp contrast to other areas that have benefited from data scaling. This challenge is becoming increasingly urgent as modern AI agents--such as OpenClaw and Gemini Agent--are granted persistent acces...

privacy-sensitive data, synthetic dataset, text sanitization, large language models, parallel corpus, +1 more
Feb 3, 2026

Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars

Youliang Zhang, Zhengguang Zhou, Zhentao Yu +11 more

Generating talking avatars is a fundamental task in video generation. Although existing methods can generate full-body talking avatars with simple human motion, extending this task to grounded human-object interaction (GHOI) remains an open challenge, requiring the avatar to perform text-aligned int...

dual-stream framework, perception and interaction module, audio-interaction aware generation module, motion-to-video aligner, grounded human-object interaction, +1 more
Feb 2, 2026
Page 2 of 4