AI for Science

BABE: Biology Arena BEnchmark

Junting Zhou, Jin Chen, Linfeng Hao, Denghui Cao, Zheyu Wang, Qiguang Chen, Chaoyou Fu, Jiaze Chen, Yuchen Wu, Ge Zhang, Mingxuan Wang, Wenhao Huang, Tong Yang
Published: February 5, 2026
Authors: 13
Word Count: 7,953

BABE: Advancing AI for complex biological reasoning.

Abstract

The rapid evolution of large language models (LLMs) has expanded their capabilities from basic dialogue to advanced scientific reasoning. However, existing benchmarks in biology often fail to assess a critical skill required of researchers: the ability to integrate experimental results with contextual knowledge to derive meaningful conclusions. To address this gap, we introduce BABE (Biology Arena BEnchmark), a comprehensive benchmark designed to evaluate the experimental reasoning capabilities of biological AI systems. BABE is uniquely constructed from peer-reviewed research papers and real-world biological studies, ensuring that tasks reflect the complexity and interdisciplinary nature of actual scientific inquiry. BABE challenges models to perform causal reasoning and cross-scale inference. Our benchmark provides a robust framework for assessing how well AI systems can reason like practicing scientists, offering a more authentic measure of their potential to contribute to biological research.

Key Takeaways

  1. BABE evaluates AI's ability to reason like scientists.

  2. Current models struggle with complex biological reasoning tasks.

  3. BABE aims to accelerate scientific discovery with AI.

Limitations

  • Current benchmarks fail to assess interdisciplinary reasoning.

  • Existing models lack practical utility in biological research.

Keywords

large language models, biological AI systems, experimental reasoning, causal reasoning, cross-scale inference, peer-reviewed research papers, real-world biological studies
