Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification?
Jie Zhu, Yiyang Su, Xiaoming Liu
Multi-modal large language models (MLLMs) exhibit strong general-purpose capabilities, yet they still struggle with Fine-Grained Visual Classification (FGVC), a core perception task that requires subtle visual discrimination and is crucial for many real-world applications. A widely adopted strategy for boo...