Thinking with Drafting: Optical Decompression via Logical Reconstruction

Jingxuan Wei, Honghao He, Caijun Jia, Siyuan Li, Zheng Sun, Yuhang Xu, Yuanyuan Lin, Linzhuang Sun, Yuchen Wu, Bihui Yu, Xiangxiang Zhang, Cheng Tan
Published
February 12, 2026

Abstract

Existing multimodal large language models have achieved high-fidelity visual perception and exploratory visual generation. However, a precision paradox persists in complex reasoning tasks: optical perception systems transcribe symbols without capturing logical topology, while pixel-based generative models produce visual artifacts lacking mathematical exactness. To bridge this gap, we propose that reasoning over visual inputs be reconceptualized as optical decompression: the process of reconstructing latent logical structures from compressed visual tokens. Guided by the axiom that Parsing is Reasoning, we introduce Thinking with Drafting (TwD), which utilizes a minimalist Domain-Specific Language (DSL) as a grounding intermediate representation. Unlike standard approaches that hallucinate answers directly, TwD forces the model to draft its mental model into executable code, rendering deterministic visual proofs for self-verification. To validate this, we present VisAlg, a visual algebra benchmark. Experiments demonstrate that TwD serves as a superior cognitive scaffold. Our work establishes a closed-loop system where visual generation acts not as a creative output but as a logical verifier, offering a generalizable path for visual reasoning.
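The draft-render-verify loop the abstract describes can be sketched in miniature. The names below (`Segment`, `render`, `verify`) and the toy grid DSL are illustrative assumptions, not the paper's actual DSL: a model emits an executable draft, the draft is deterministically rendered, and the rendering is checked against the observed visual input.

```python
# Hypothetical sketch of a Thinking-with-Drafting closed loop:
# DSL draft -> deterministic rendering -> self-verification.
# The DSL, helper names, and grid representation are illustrative only.
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One toy DSL primitive: an axis-aligned segment on an integer grid."""
    x0: int
    y0: int
    x1: int
    y1: int


def render(draft: list[Segment]) -> set[tuple[int, int]]:
    """Execute a DSL draft into a deterministic set of lit grid cells."""
    cells: set[tuple[int, int]] = set()
    for s in draft:
        if s.x0 == s.x1:  # vertical segment
            lo, hi = sorted((s.y0, s.y1))
            cells |= {(s.x0, y) for y in range(lo, hi + 1)}
        elif s.y0 == s.y1:  # horizontal segment
            lo, hi = sorted((s.x0, s.x1))
            cells |= {(x, s.y0) for x in range(lo, hi + 1)}
    return cells


def verify(draft: list[Segment], observed: set[tuple[int, int]]) -> bool:
    """Self-verification: the rendered visual proof must reproduce the input."""
    return render(draft) == observed


# A draft of a plus sign, verified against the "perceived" cells.
draft = [Segment(2, 0, 2, 4), Segment(0, 2, 4, 2)]
observed = render(draft)  # stand-in for cells decoded from visual tokens
assert verify(draft, observed)
```

The point of the sketch is the verifier role: generation is not a creative output but a deterministic check that the drafted logical structure decompresses back to the observed image.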

Keywords

multimodal large language models, visual perception, visual generation, optical decompression, visual tokens, Domain-Specific Language, visual algebra benchmark, visual reasoning, logical topology, deterministic visual proofs
