
Show, Don't Tell: Morphing Latent Reasoning into Image Generation

Harold Haodong Chen, Xinxiang Yin, Wen-Jie Shu, Hongfei Zhang, Zixin Zhang, Chenfei Liao, Litao Guo, Qifeng Chen, Ying-Cong Chen
Published: February 2, 2026
Authors: 9
Word Count: 11,235
Code: Includes code

Adaptive latent reasoning for more faithful and efficient text-to-image generation.

Abstract

Text-to-image (T2I) generation has achieved remarkable progress, yet existing methods often lack the ability to dynamically reason and refine during generation--a hallmark of human creativity. Current reasoning-augmented paradigms mostly rely on explicit thought processes, where intermediate reasoning is decoded into discrete text at fixed steps with frequent image decoding and re-encoding, leading to inefficiencies, information loss, and cognitive mismatches. To bridge this gap, we introduce LatentMorph, a novel framework that seamlessly integrates implicit latent reasoning into the T2I generation process. At its core, LatentMorph introduces four lightweight components: (i) a condenser for summarizing intermediate generation states into compact visual memory, (ii) a translator for converting latent thoughts into actionable guidance, (iii) a shaper for dynamically steering next image token predictions, and (iv) an RL-trained invoker for adaptively determining when to invoke reasoning. By performing reasoning entirely in continuous latent spaces, LatentMorph avoids the bottlenecks of explicit reasoning and enables more adaptive self-refinement. Extensive experiments demonstrate that LatentMorph (I) enhances the base model Janus-Pro by 16% on GenEval and 25% on T2I-CompBench; (II) outperforms explicit paradigms (e.g., TwiG) by 15% and 11% on abstract reasoning tasks such as WISE and IPV-Txt; (III) reduces inference time by 44% and token consumption by 51%; and (IV) exhibits 71% cognitive alignment with human intuition on reasoning invocation.
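
The abstract describes four lightweight modules wrapped around an autoregressive image-token backbone, with all reasoning kept in a continuous latent space. The sketch below is a minimal, hypothetical PyTorch rendering of that loop; the class names (Condenser, Translator, Shaper, Invoker), the gated-residual steering, and all tensor shapes are illustrative assumptions rather than the paper's released implementation.

```python
# Minimal conceptual sketch of a LatentMorph-style inference step.
# All module designs and shapes are assumptions for illustration only.
import torch
import torch.nn as nn


class Condenser(nn.Module):
    """Summarizes intermediate generation states into a compact visual memory."""
    def __init__(self, d_model: int, n_slots: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_slots, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        # states: (B, T, D) hidden states of the image tokens generated so far
        q = self.queries.unsqueeze(0).expand(states.size(0), -1, -1)
        memory, _ = self.attn(q, states, states)            # (B, n_slots, D)
        return memory


class Translator(nn.Module):
    """Converts latent thoughts (visual memory) into an actionable guidance vector."""
    def __init__(self, d_model: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                 nn.Linear(d_model, d_model))

    def forward(self, memory: torch.Tensor) -> torch.Tensor:
        return self.mlp(memory).mean(dim=1)                 # (B, D)


class Shaper(nn.Module):
    """Steers the next image-token prediction with the guidance vector."""
    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, hidden: torch.Tensor, guidance: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([hidden, guidance], dim=-1)))
        return hidden + g * guidance                        # gated residual steering


class Invoker(nn.Module):
    """Policy (RL-trained in the paper) deciding when to invoke latent reasoning."""
    def __init__(self, d_model: int):
        super().__init__()
        self.policy = nn.Linear(d_model, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.policy(hidden))           # invocation probability


def generate_step(backbone_hidden, past_states, condenser, translator, shaper, invoker,
                  threshold: float = 0.5):
    """One autoregressive step with optional latent reasoning (batch size 1 assumed)."""
    if invoker(backbone_hidden).item() > threshold:         # reason only when needed
        memory = condenser(past_states)                     # compact visual memory
        guidance = translator(memory)                       # latent thought -> guidance
        backbone_hidden = shaper(backbone_hidden, guidance) # steer next-token prediction
    return backbone_hidden                                  # fed to the base model's head
```

In this reading, the invoker is what makes the reasoning adaptive: latent thinking runs only when its predicted probability crosses a threshold, and the gated residual lets the shaper fall back to the base model's unmodified prediction whenever the guidance is uninformative.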

Key Takeaways

  1. LatentMorph integrates adaptive reasoning into image generation.

  2. Enhances fidelity, abstract reasoning, and efficiency.

  3. Mimics human cognitive rhythms in creative processes.

Limitations

  • Relies on the quality of the base model and the RL-trained invoker.

  • Potentially high computational requirements for training.

Keywords

text-to-image generation, latent reasoning, visual memory, latent thoughts, actionable guidance, image token predictions, reinforcement learning, cognitive alignment, Janus-Pro, GenEval, T2I-CompBench, WISE, IPV-Txt, TwiG
