
Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts

Chen Yang, Guangyue Peng, Jiaying Zhu, Ran Le, Ruixiang Feng, Tao Zhang, Xiyun Xu, Yang Song, Yiming Jia, Yuntao Wen, Yunzhi Xu, Zekai Wang, Zhenwei An, Zhicong Sun, Zongchao Chen
Published: February 13, 2026
Authors: 15
Word count: 7,209
Includes code

Nanbeige4.1-3B achieves reasoning, coding, and autonomy in just 3 billion parameters.

Abstract

We present Nanbeige4.1-3B, a unified generalist language model that simultaneously achieves strong agentic behavior, code generation, and general reasoning with only 3B parameters. To the best of our knowledge, it is the first open-source small language model (SLM) to achieve such versatility in a single model. To improve reasoning and preference alignment, we combine point-wise and pair-wise reward modeling, ensuring high-quality, human-aligned responses. For code generation, we design complexity-aware rewards during reinforcement learning, optimizing both correctness and efficiency. For deep search, we perform complex data synthesis and incorporate turn-level supervision during training. This enables stable long-horizon tool interactions, allowing Nanbeige4.1-3B to reliably execute up to 600 tool-call turns for complex problem-solving. Extensive experimental results show that Nanbeige4.1-3B significantly outperforms prior models of similar scale, such as Nanbeige4-3B-2511 and Qwen3-4B, and even surpasses much larger models, such as Qwen3-30B-A3B. Our results demonstrate that small models can achieve both broad competence and strong specialization simultaneously, redefining the potential of 3B-parameter models.
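The complexity-aware code reward described above could be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the gating-plus-bonus structure, and all weights are assumptions chosen to show how correctness can gate the reward while runtime efficiency adds a bonus.

```python
# Hypothetical sketch of a complexity-aware code reward: correctness gates the
# reward, and beating a baseline runtime earns a capped efficiency bonus.
# All names and weights are illustrative assumptions, not the paper's method.

def code_reward(passed: bool, runtime_s: float, baseline_s: float,
                efficiency_weight: float = 0.3) -> float:
    """Return 0 for incorrect code; otherwise 1 plus a speedup-based bonus."""
    if not passed:
        return 0.0
    # Relative efficiency: > 1 when the solution beats the baseline runtime.
    speedup = baseline_s / max(runtime_s, 1e-9)
    # Cap the speedup at 2x so extreme outliers do not dominate the reward.
    bonus = efficiency_weight * min(speedup, 2.0) / 2.0
    return 1.0 + bonus
```

Gating on correctness first keeps the efficiency bonus from ever rewarding fast-but-wrong code.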

Key Takeaways

  1. Nanbeige4.1-3B successfully combines reasoning, code generation, and agentic behavior in a single 3B parameter model.

  2. Extended context length to 256k tokens and enhanced chain-of-thought reconstruction significantly improved reasoning capabilities.

  3. Two-stage reinforcement learning using both point-wise and pair-wise rewards captures different aspects of response quality.
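One way to picture combining the two reward types from the takeaway above: a point-wise reward scores a single response absolutely, while a pair-wise reward scores it relative to an alternative. The sketch below is an assumption for illustration only; the blending function, its names, and the weight `alpha` are not from the paper.

```python
# Hypothetical sketch: blending point-wise (absolute) and pair-wise (relative)
# reward signals for a candidate response. Names and weights are assumptions.

def pointwise_reward(score: float) -> float:
    """Absolute quality score for a single response, clipped to [0, 1]."""
    return max(0.0, min(1.0, score))

def pairwise_reward(score_a: float, score_b: float) -> float:
    """Preference margin: positive when response A is preferred over B."""
    return score_a - score_b

def combined_reward(score_a: float, score_b: float, alpha: float = 0.5) -> float:
    """Blend absolute quality of A with its relative preference over B."""
    return alpha * pointwise_reward(score_a) + (1 - alpha) * pairwise_reward(score_a, score_b)
```

The point-wise term anchors the reward to an absolute quality scale, while the pair-wise term preserves fine-grained preference information between near-equal responses.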

Limitations

  • Previous small models forced users to choose between reasoning, code generation, or agentic capabilities.

  • Most models struggle with long-horizon planning and maintaining meaningful tool interactions beyond a few steps.
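The turn-level supervision the abstract credits for stable long-horizon tool use could, as a rough sketch, assign a reward to each tool-call turn and aggregate over the trajectory. The discounting scheme and function below are illustrative assumptions, not the paper's training objective.

```python
# Hypothetical sketch: aggregating per-turn rewards over a long tool-call
# trajectory, discounting later turns. The discount scheme is an assumption.

def trajectory_return(turn_rewards: list[float], gamma: float = 0.99) -> float:
    """Discounted sum of per-turn rewards for one tool-call trajectory."""
    total = 0.0
    for t, reward in enumerate(turn_rewards):
        total += (gamma ** t) * reward
    return total
```

Supervising at the turn level gives credit to individual good tool calls, rather than relying on a single sparse outcome reward at the end of hundreds of turns.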

Keywords

unified generalist language model, reward modeling, reinforcement learning, tool-call turns, deep search, complex data synthesis, turn-level supervision, point-wise reward modeling, pair-wise reward modeling, code generation, general reasoning, agentic behavior, human-aligned responses, model optimization
