Latest Generative AI Research Papers

Research on AI systems that create new content including image generation, text-to-image, video synthesis, and creative AI applications.

76 Papers

Show, Don't Tell: Morphing Latent Reasoning into Image Generation

Harold Haodong Chen, Xinxiang Yin, Wen-Jie Shu +6 more

Text-to-image (T2I) generation has achieved remarkable progress, yet existing methods often lack the ability to dynamically reason and refine during generation--a hallmark of human creativity. Current reasoning-augmented paradigms mostly rely on explicit thought processes, where intermediate reasoning...

text-to-image generation, latent reasoning, visual memory, latent thoughts, actionable guidance, +9 more
Feb 2, 2026

Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

Hongzhou Zhu, Min Zhao, Guande He +3 more

To achieve real-time interactive video generation, current methods distill pretrained bidirectional video diffusion models into few-step autoregressive (AR) models, which creates an architectural gap when full attention is replaced by causal attention. However, existing approaches do not bridge this gap th...

video diffusion models, autoregressive models, causal attention, bidirectional attention, ODE distillation, +4 more
Feb 2, 2026

FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space

FSVideo Team, Qingyu Chen, Zhiyuan Fang +17 more

We introduce FSVideo, a fast, transformer-based image-to-video (I2V) diffusion framework. We build our framework on the following key components: (1) a new video autoencoder with a highly compressed latent space (64×64×4 spatial-temporal downsampling ratio), achieving competitive reconstr...

video autoencoder, diffusion transformer, DiT, layer memory design, inter-layer information flow, +4 more
Feb 2, 2026
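The stated 64×64×4 downsampling ratio implies a large reduction in the number of latent cells the diffusion transformer must process. A rough arithmetic sketch of the implied latent grid (the latent channel count here is an assumption, not from the paper):

```python
def latent_shape(frames, height, width, latent_channels=16,
                 spatial_ds=64, temporal_ds=4):
    """Latent grid implied by a spatial-temporal downsampling ratio of
    spatial_ds x spatial_ds x temporal_ds. Illustrative arithmetic only:
    the latent channel count is an assumed value."""
    return (frames // temporal_ds,
            height // spatial_ds,
            width // spatial_ds,
            latent_channels)

# e.g. a 128-frame 1024x1024 clip compresses to a 32x16x16 latent grid
print(latent_shape(128, 1024, 1024))  # (32, 16, 16, 16)
```

Fewer latent cells per clip is what makes the "fast speed" claim plausible: the transformer's attention cost scales with the number of tokens, which this ratio cuts by four orders of magnitude relative to raw pixels.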

Condition Errors Refinement in Autoregressive Image Generation with Diffusion Loss

Yucheng Zhou, Hao Li, Jianbing Shen

Recent studies have explored autoregressive models for image generation, with promising results, and have combined diffusion models with autoregressive frameworks to optimize image generation via diffusion losses. In this study, we present a theoretical analysis of diffusion and autoregressive model...

autoregressive models, diffusion models, diffusion loss, patch denoising optimization, condition error, +5 more
Feb 2, 2026

PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss

Zehong Ma, Ruihan Xu, Shiliang Zhang

Pixel diffusion generates images directly in pixel space in an end-to-end manner, avoiding the artifacts and bottlenecks introduced by VAEs in two-stage latent diffusion. However, it is challenging to optimize high-dimensional pixel manifolds that contain many perceptually irrelevant signals, leavin...

pixel diffusion, perceptual supervision, LPIPS loss, DINO-based perceptual loss, diffusion model, +3 more
Feb 2, 2026
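Perceptual supervision of the kind this abstract describes typically mixes a pixel-space loss with a distance measured in a frozen encoder's feature space, so that perceptually irrelevant pixel detail is down-weighted. A generic sketch, using a stand-in feature extractor rather than PixelGen's actual DINO/LPIPS networks:

```python
import numpy as np

def perceptual_loss(pred, target, feat_fn, lam=1.0):
    """Pixel-space MSE plus a distance in a frozen feature space.
    feat_fn stands in for a pretrained encoder (e.g. an LPIPS network
    or DINO); this is a generic sketch, not PixelGen's exact loss."""
    pixel = np.mean((pred - target) ** 2)
    fp, ft = feat_fn(pred), feat_fn(target)
    # unit-normalize each feature vector before comparing, as LPIPS does
    fp = fp / (np.linalg.norm(fp, axis=-1, keepdims=True) + 1e-8)
    ft = ft / (np.linalg.norm(ft, axis=-1, keepdims=True) + 1e-8)
    return pixel + lam * np.mean((fp - ft) ** 2)

# toy "encoder": 8x8 average pooling; a real setup would use DINO features
def toy_feats(img):
    h, w = img.shape
    return img.reshape(h // 8, 8, w // 8, 8).mean(axis=(1, 3))
```

The weight `lam` trades off pixel fidelity against feature-space similarity; identical images score zero under both terms.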

PISCES: Annotation-free Text-to-Video Post-Training via Optimal Transport-Aligned Rewards

Minh-Quan Le, Gaurav Mittal, Cheng Zhao +3 more

Text-to-video (T2V) generation aims to synthesize videos with high visual quality and temporal consistency that are semantically aligned with input text. Reward-based post-training has emerged as a promising direction to improve the quality and semantic alignment of generated videos. However, recent...

text-to-video generation, reward-based post-training, optimal transport, Dual Optimal Transport, OT-aligned Rewards, +7 more
Feb 2, 2026

Balancing Understanding and Generation in Discrete Diffusion Models

Yue Liu, Yuzhong Zhao, Zheyong Xie +5 more

In discrete generative modeling, two dominant paradigms demonstrate divergent capabilities: Masked Diffusion Language Models (MDLM) excel at semantic understanding and zero-shot generalization, whereas Uniform-noise Diffusion Language Models (UDLM) achieve strong few-step generation quality, yet nei...

Masked Diffusion Language Models, Uniform-noise Diffusion Language Models, stationary noise kernel, Pareto frontier, posterior probabilities, +5 more
Feb 1, 2026

DINO-SAE: DINO Spherical Autoencoder for High-Fidelity Image Reconstruction and Generation

Hun Chang, Byunghee Cha, Jong Chul Ye

Recent studies have explored using pretrained Vision Foundation Models (VFMs) such as DINO for generative autoencoders, showing strong generative performance. Unfortunately, existing approaches often suffer from limited reconstruction fidelity due to the loss of high-frequency details. In this work,...

Vision Foundation Models, DINO, generative autoencoders, contrastive representations, feature vectors, +11 more
Jan 30, 2026

TokenTrim: Inference-Time Token Pruning for Autoregressive Long Video Generation

Ariel Shaulov, Eitan Shaar, Amit Edenzon +1 more

Auto-regressive video generation enables long video synthesis by iteratively conditioning each new batch of frames on previously generated content. However, recent work has shown that such pipelines suffer from severe temporal drift, where errors accumulate and amplify over long horizons. We hypothe...

auto-regressive video generation, temporal drift, latent conditioning tokens, inference-time error propagation, unstable tokens, +1 more
Jan 30, 2026
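Dropping unstable conditioning tokens at inference time can be pictured as a simple top-k filter over the previous batch's latent tokens. A minimal sketch; the instability score itself is an assumption here (lower = more stable), not TokenTrim's actual criterion:

```python
import numpy as np

def prune_unstable_tokens(tokens, instability, keep_ratio=0.75):
    """Keep the keep_ratio most stable conditioning tokens before the
    next autoregressive step, preserving their temporal order.
    'instability' is a per-token score; how TokenTrim actually scores
    tokens is not reproduced here."""
    k = max(1, int(round(len(tokens) * keep_ratio)))
    keep = np.sort(np.argsort(instability)[:k])  # k most stable, in order
    return tokens[keep]
```

Because pruning happens only at inference, no retraining is needed: the model simply conditions on a cleaner subset of history, which is the abstract's proposed remedy for error accumulation.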

One-step Latent-free Image Generation with Pixel Mean Flows

Yiyang Lu, Susie Lu, Qiao Sun +6 more

Modern diffusion/flow-based models for image generation typically exhibit two core characteristics: (i) using multi-step sampling, and (ii) operating in a latent space. Recent advances have made encouraging progress on each aspect individually, paving the way toward one-step diffusion/flow without l...

diffusion models, flow-based models, multi-step sampling, latent space, one-step generation, +4 more
Jan 29, 2026

DreamActor-M2: Universal Character Image Animation via Spatiotemporal In-Context Learning

Mingshuang Luo, Shuang Liang, Zhengkun Rong +7 more

Character image animation aims to synthesize high-fidelity videos by transferring motion from a driving sequence to a static reference image. Despite recent advancements, existing methods suffer from two fundamental challenges: (1) suboptimal motion injection strategies that lead to a trade-off betw...

motion conditioning, in-context learning, latent space, generative prior, self-bootstrapped data synthesis, +3 more
Jan 29, 2026

DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment

Haoyou Deng, Keyu Yan, Chaojie Mao +4 more

Recent GRPO-based approaches built on flow matching models have shown remarkable improvements in human preference alignment for text-to-image generation. Nevertheless, they still suffer from the sparse reward problem: the terminal reward of the entire denoising trajectory is applied to all intermedi...

flow matching models, denoising trajectory, sparse reward problem, dense rewards, step-wise reward gain, +7 more
Jan 28, 2026
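Densifying a terminal reward into per-step credit can be sketched as telescoping reward gains along the denoising trajectory; what each intermediate reward is evaluated on (e.g. a one-step-denoised estimate) is an assumption here, not DenseGRPO's exact recipe:

```python
def stepwise_reward_gains(trajectory_rewards):
    """Convert a trajectory of reward evaluations R_0..R_T into per-step
    gains r_t = R_t - R_{t-1}, so each denoising step is credited for
    its own improvement rather than sharing one terminal reward.
    A sketch of the dense-reward idea only."""
    R = trajectory_rewards
    return [b - a for a, b in zip(R[:-1], R[1:])]
```

The gains telescope: they sum to the terminal reward minus the initial one, so total credit is preserved while each intermediate step receives its own signal.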

Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models

Zengbin Wang, Xuecai Hu, Yong Wang +3 more

Text-to-image (T2I) models have achieved remarkable success in generating high-fidelity images, but they often fail to handle complex spatial relationships, e.g., spatial perception, reasoning, or interaction. These critical aspects are largely overlooked by current benchmarks due to their short o...

text-to-image models, spatial intelligence, benchmark, long prompts, information-dense prompts, +5 more
Jan 28, 2026

Efficient Autoregressive Video Diffusion with Dummy Head

Hang Guo, Zhaoyang Jia, Jiahao Li +5 more

The autoregressive video diffusion model has recently gained considerable research interest due to its causal modeling and iterative denoising. In this work, we identify that the multi-head self-attention in these models under-utilizes historical frames: approximately 25% of heads attend almost exclusi...

autoregressive video diffusion model, multi-head self-attention, causal modeling, iterative denoising, KV caches, +4 more
Jan 28, 2026

Self-Refining Video Sampling

Sangwon Jang, Taekyung Ki, Jaehyeong Jo +3 more

Modern video generators still struggle with complex physical dynamics, often falling short of physical realism. Existing approaches address this using external verifiers or additional training on augmented data, which is computationally expensive and still limited in capturing fine-grained motion. I...

video generators, denoising autoencoder, iterative inner-loop refinement, self-refining video sampling, uncertainty-aware refinement, +2 more
Jan 26, 2026

SkyReels-V3 Technique Report

Debang Li, Zhengcong Fei, Tuanhui Li +18 more

Video generation serves as a cornerstone for building world models, where multimodal contextual inference stands as the defining test of capability. To this end, we present SkyReels-V3, a conditional video generation model, built upon a unified multimodal in-context learning framework with diffusion...

diffusion Transformers, multimodal in-context learning framework, reference images-to-video synthesis, video-to-video extension, audio-guided video generation, +8 more
Jan 24, 2026

iFSQ: Improving FSQ for Image Generation with 1 Line of Code

Bin Lin, Zongjian Li, Yuwei Niu +9 more

The field of image generation is currently bifurcated into autoregressive (AR) models operating on discrete tokens and diffusion models utilizing continuous latents. This divide, rooted in the distinction between VQ-VAEs and VAEs, hinders unified modeling and fair benchmarking. Finite Scalar Quantiz...

autoregressive models, diffusion models, VQ-VAEs, VAEs, finite scalar quantization, +5 more
Jan 23, 2026
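For context, vanilla FSQ (the method iFSQ modifies) quantizes each latent channel independently to a small fixed set of levels, avoiding VQ-VAE codebooks entirely. A minimal sketch using odd level counts for simplicity (even counts require a half-step offset), with straight-through gradients omitted and the paper's one-line change not reproduced:

```python
import numpy as np

def fsq(z, levels=(7, 7, 5, 5, 5)):
    """Vanilla Finite Scalar Quantization: bound each latent channel
    with tanh, then round it to one of L_i uniformly spaced levels in
    [-1, 1]. Sketch only; iFSQ's modification is not shown."""
    half = (np.asarray(levels, dtype=float) - 1) / 2
    bounded = np.tanh(z) * half       # channel i lies in (-half_i, half_i)
    return np.round(bounded) / half   # snap to grid, rescale to [-1, 1]
```

Because the grid is fixed, the effective codebook size is simply the product of the per-channel level counts (7·7·5·5·5 = 6125 here), with no learned embedding table.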

LoL: Longer than Longer, Scaling Video Generation to Hour

Justin Cui, Jie Wu, Ming Li +6 more

Recent research in long-form video generation has shifted from bidirectional to autoregressive models, yet these methods commonly suffer from error accumulation and a loss of long-term coherence. While attention sink frames have been introduced to mitigate this performance decay, they often induce a...

Rotary Position Embedding, multi-head attention, autoregressive models, sink-collapse, attention sink frames, +3 more
Jan 23, 2026
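For context, the Rotary Position Embedding that the tags refer to encodes position by rotating pairs of query/key channels through position-dependent angles, so attention depends on relative offsets. A minimal sketch of standard RoPE (LoL's hour-scale modification is not shown):

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Standard Rotary Position Embedding: rotate each channel pair
    (x1_i, x2_i) by angle positions * base**(-i/half). Illustrative
    of the mechanism only."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair frequencies
    ang = positions[:, None] * freqs            # (T, half) angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

Since rotations are norm-preserving, RoPE changes where a token attends without rescaling its queries or keys; degenerate use of this geometry over very long horizons is what the abstract's "sink-collapse" failure mode concerns.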

ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion

Remy Sabathier, David Novotny, Niloy J. Mitra +1 more

Generating animated 3D objects is at the heart of many applications, yet most advanced works are difficult to apply in practice because of their restrictive setups, long runtimes, or limited quality. We introduce ActionMesh, a generative model that predicts production-ready 3D meshes...

3D diffusion models, temporal axis, latent sequences, temporal 3D autoencoder, reference shape, +3 more
Jan 22, 2026
Page 3 of 4