Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision
Zhixiang Wei, Yi Li, Zhehan Kan +38 more
Despite the significant advances that Vision-Language Models (VLMs) represent, current architectures are often limited in their ability to retain fine-grained visual information, which leads to coarse-grained multimodal comprehension. We attribute this deficiency to a suboptimal training paradigm inherent...