Multimodal AI

SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature

Yiming Ren, Junjie Wang, Yuxin Meng, Yihang Shi, Zhiqiang Lin, Ruihang Chu, Yiran Xu, Ziming Li, Yunfei Zhao, Zihan Wang, Yu Qiao, Ruiming Tang, Minghao Liu, Yujiu Yang
arXiv ID: 2601.10108
Published: January 15, 2026
Authors: 14
Hugging Face Likes: 6
Comments: 3

Abstract

Evaluating whether multimodal large language models truly understand long-form scientific papers remains challenging: answer-only metrics and synthetic "Needle-In-A-Haystack" tests often reward answer matching without requiring a causal, evidence-linked reasoning trace in the document. We propose the "Fish-in-the-Ocean" (FITO) paradigm, which requires models to construct explicit cross-modal evidence chains within native scientific documents. To operationalize FITO, we build SIN-Data, a scientific interleaved corpus that preserves the native interleaving of text and figures. On top of it, we construct SIN-Bench with four progressive tasks covering evidence discovery (SIN-Find), hypothesis verification (SIN-Verify), grounded QA (SIN-QA), and evidence-anchored synthesis (SIN-Summary). We further introduce a "No Evidence, No Score" protocol that awards credit only when predictions are grounded to verifiable anchors and diagnoses evidence quality along matching, relevance, and logic dimensions. Experiments on eight MLLMs show that grounding is the primary bottleneck: Gemini-3-pro achieves the best average overall score (0.573), while GPT-5 attains the highest SIN-QA answer accuracy (0.767) but underperforms on evidence-aligned overall scores, exposing a gap between correctness and traceable support.
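The "No Evidence, No Score" idea can be pictured as an evidence gate in front of the usual answer metric: an answer earns nothing unless it cites at least one verifiable anchor, and the remaining credit blends correctness with evidence-quality diagnostics. The sketch below is an illustrative assumption, not the paper's actual scorer; the `Prediction` class, `score_example` function, anchor IDs, and equal weighting are all hypothetical.

```python
# Minimal sketch of an evidence-gated scorer in the spirit of "No Evidence, No Score".
# All names and weights here are illustrative assumptions, not the benchmark's code.
from dataclasses import dataclass, field


@dataclass
class Prediction:
    answer: str
    cited_anchors: list[str] = field(default_factory=list)  # e.g. figure/paragraph IDs the model cites


def score_example(pred: Prediction,
                  gold_answer: str,
                  gold_anchors: set[str],
                  relevance: float,
                  logic: float) -> float:
    """Return 0 unless the prediction cites a verifiable anchor; otherwise
    blend answer correctness with matching/relevance/logic diagnostics."""
    valid = [a for a in pred.cited_anchors if a in gold_anchors]
    if not valid:  # "No Evidence, No Score": ungrounded answers earn nothing
        return 0.0
    matching = len(valid) / max(len(gold_anchors), 1)  # anchor-overlap proxy
    answer_correct = float(pred.answer.strip().lower() == gold_answer.strip().lower())
    # Hypothetical equal weighting of correctness and the three evidence dimensions
    return 0.25 * (answer_correct + matching + relevance + logic)


# Example: a correct answer that cites one of two gold anchors
p = Prediction(answer="Gemini-3-pro", cited_anchors=["fig3", "unrelated"])
print(score_example(p, "Gemini-3-pro", {"fig3", "tab2"}, relevance=0.8, logic=0.9))
```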

Keywords

multimodal large language models, scientific papers, evidence chains, SIN-Data, SIN-Bench, SIN-Find, SIN-Verify, SIN-QA, SIN-Summary, No Evidence No Score
