Large Language Models

Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain

Özgür Uğur, Mahmut Göksu, Mahmut Çimen, Musa Yılmaz, Esra Şavirdi, Alp Talha Demir, Rumeysa Güllüce, İclal Çetin, Ömer Can Sağbaş
Published: January 22, 2026
Authors: 9
Word count: 20,846

Specialized Turkish legal language models boost NLP performance.

Abstract

This paper presents the Mecellem models, a framework for developing specialized language models for the Turkish legal domain through domain-adaptation strategies. We make two contributions: (1) Encoder models pre-trained from scratch: ModernBERT-based bidirectional encoders pre-trained on a Turkish-dominant corpus of 112.7 billion tokens. We implement a checkpoint-selection strategy that evaluates downstream retrieval performance throughout training, revealing that the best-retrieving checkpoints occur before the pre-training loss reaches its minimum. Our encoder models achieve top-3 rankings on the Turkish retrieval leaderboard, with smaller models (155M parameters) performing comparably to larger reference models (307M-567M parameters). Our approach reaches 92.36% production efficiency relative to state-of-the-art models (embeddinggemma-300m: 100.00%, BAAI/bge-m3: 99.54%, newmindai/bge-m3-stsb: 94.38%), ranking fourth overall while requiring fewer computational resources. SOTA models rely on multi-stage, computationally intensive training pipelines, making our single-stage pre-training followed by efficient post-training a cost-effective alternative. (2) Decoder models with continual pre-training (CPT): Qwen3-1.7B and Qwen3-4B models adapted to the Turkish legal domain through controlled curriculum learning. A four-phase CPT schedule with optimal sample ratios enables a gradual transition from general language knowledge to specialized legal terminology and long-context reasoning, achieving a 36.2% perplexity reduction on Turkish legal text and demonstrating the gains from domain adaptation.
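The checkpoint-selection strategy from the abstract can be sketched as follows: rather than keeping the checkpoint with the lowest pre-training loss, keep the one that scores best on a held-out retrieval benchmark. This is a minimal illustration, not the paper's implementation; the function names, metric, and training trajectory below are hypothetical.

```python
# Sketch of loss-agnostic checkpoint selection: pick the checkpoint with
# the best downstream retrieval score (e.g. NDCG@10 on a held-out set),
# not the lowest pre-training loss. All numbers below are illustrative.

def select_checkpoint(checkpoints):
    """Return the checkpoint dict with the highest retrieval score."""
    return max(checkpoints, key=lambda c: c["retrieval_score"])

# Illustrative trajectory: the loss keeps falling, but retrieval quality
# peaks earlier -- the phenomenon the abstract reports.
history = [
    {"step": 10_000, "loss": 2.10, "retrieval_score": 0.61},
    {"step": 20_000, "loss": 1.85, "retrieval_score": 0.68},  # retrieval peak
    {"step": 30_000, "loss": 1.72, "retrieval_score": 0.66},  # loss minimum
]

best = select_checkpoint(history)
lowest_loss = min(history, key=lambda c: c["loss"])
print(best["step"], lowest_loss["step"])  # retrieval peak precedes loss minimum
```

In practice this means evaluating each saved checkpoint on the downstream retrieval task during pre-training and retaining the argmax over that metric.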

Key Takeaways

  1. Innovative pre-training strategies yield significant performance gains.

  2. Domain-specific models achieve competitive results with fewer resources.

  3. Continual pre-training adapts general models to specialized domains.
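The 36.2% perplexity-reduction figure cited for continual pre-training follows from the standard definitions: perplexity is the exponential of the mean per-token negative log-likelihood, and the reduction is reported relative to the base model. The sketch below shows the arithmetic; the NLL values are made up for illustration (chosen so the example lands near the reported magnitude), not taken from the paper.

```python
import math

def perplexity(mean_nll):
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    return math.exp(mean_nll)

def relative_reduction(ppl_base, ppl_adapted):
    """Fractional perplexity reduction of the adapted model vs. the base."""
    return (ppl_base - ppl_adapted) / ppl_base

# Hypothetical mean NLL on Turkish legal text before and after CPT.
ppl_base = perplexity(2.30)     # base model
ppl_adapted = perplexity(1.85)  # after continual pre-training
print(f"{relative_reduction(ppl_base, ppl_adapted):.1%}")  # → 36.2%
```

Note that a fixed NLL gap translates to a fixed relative perplexity reduction, since the ratio of perplexities is `exp(nll_adapted - nll_base)`.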

Limitations

  • Requires high-quality, domain-specific data and powerful infrastructure.

  • May struggle with rare or highly specialized legal terms.

Keywords

domain adaptation, pre-trained models, encoder models, decoder models, continual pre-training, curriculum learning, perplexity, tokenization, parameter-efficient fine-tuning, retrieval performance, downstream tasks, multi-stage training, single-stage pre-training, post-training, embedding models, legal domain, Turkish language, BERT-based encoders, Qwen3 models, controlled curriculum learning, sample ratios, long-context reasoning
