Latest Large Language Models Research Papers

Research on large language models including GPT, Claude, Llama, and other transformer-based architectures for natural language understanding and generation.

189 Papers
Showing 20 of 189 papers

Test-Time Training with KV Binding Is Secretly Linear Attention

Junchen Liu, Sven Elflein, Or Litany +2 more

Test-time training (TTT) with KV binding as a sequence modeling layer is commonly interpreted as a form of online meta-learning that memorizes a key-value mapping at test time. However, our analysis reveals multiple phenomena that contradict this memorization-based interpretation. Motivated by these f...

test-time training · KV binding · online meta-learning · learned linear attention · sequence modeling layer · +4 more
Feb 24, 2026 · 22
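The claimed equivalence is easy to see in miniature: a TTT layer that "memorizes" key-value pairs with rank-one state updates is exactly causal linear attention. The sketch below is illustrative only (toy dimensions, plain NumPy), not the paper's formulation.

```python
import numpy as np

def linear_attention(queries, keys, values):
    """Causal linear attention viewed as an online key-value memory.

    At each step the state matrix S absorbs the outer product
    v_t k_t^T (one "memorization" update binding key to value),
    and the output is the read-out S q_t for the current query.
    """
    d_k, d_v = keys.shape[1], values.shape[1]
    S = np.zeros((d_v, d_k))          # accumulated key-value memory
    outputs = []
    for q, k, v in zip(queries, keys, values):
        S += np.outer(v, k)           # bind key k to value v
        outputs.append(S @ q)         # retrieve with query q
    return np.array(outputs)
```

Because the state update is a plain sum of outer products, the "trained" memory at step t is identical to the unnormalized linear-attention output over the prefix.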

On Data Engineering for Scaling LLM Terminal Capabilities

Renjie Pi, Grace Lam, Mohammad Shoeybi +3 more

Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents, making two key contr...

large language models · terminal agents · data engineering practices · synthetic task generation · Terminal-Task-Gen · +6 more
Feb 24, 2026 · 70

Arcee Trinity Large Technical Report

Varun Singh, Lucas Krauss, Sami Jaghouar +23 more

We present the technical report for Arcee Trinity Large, a sparse Mixture-of-Experts model with 400B total parameters and 13B activated per token. We also report on Trinity Nano and Trinity Mini: Trinity Nano has 6B total parameters with 1B activated per token, and Trinity Mini has 2...

Mixture-of-Experts · sparse Mixture-of-Experts · attention · gated attention · depth-scaled sandwich norm · +4 more
Feb 19, 2026 · 16
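The 400B-total / 13B-active split comes from sparse routing: a router scores all experts per token but executes only the top-k. A minimal sketch of that mechanism, assuming a single token vector and callable experts (this is generic top-k gating, not Arcee's implementation):

```python
import numpy as np

def moe_forward(x, experts, router_w, k=2):
    """Sparse mixture-of-experts forward pass for one token vector x.

    The router scores every expert, keeps only the top-k, and the
    output is the softmax-weighted sum of just those experts, so the
    vast majority of parameters stay inactive for this token.
    """
    logits = router_w @ x                       # one score per expert
    topk = np.argsort(logits)[-k:]              # indices of top-k experts
    gate = np.exp(logits[topk] - logits[topk].max())
    gate /= gate.sum()                          # renormalized softmax gates
    return sum(g * experts[i](x) for g, i in zip(gate, topk))
```

With k experts of the total E executed per token, activated parameters scale with k/E of the expert parameters, which is how a 400B model can run a ~13B-parameter forward pass.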

TAROT: Test-driven and Capability-adaptive Curriculum Reinforcement Fine-tuning for Code Generation with Large Language Models

Chansung Park, Juyong Jiang, Fan Wang +4 more

Large Language Models (LLMs) are changing the coding paradigm (popularly known as vibe coding), yet synthesizing algorithmically sophisticated and robust code remains a critical challenge. Incentivizing the deep reasoning capabilities of LLMs is essential to overcoming this hurdle. Reinforcement Fine-Tun...

Reinforcement Fine-Tuning · curriculum design · test suite · capability-adaptive · functional correctness · +4 more
Feb 17, 2026 · 5

On Surprising Effectiveness of Masking Updates in Adaptive Optimizers

Taejong Joo, Wenhan Xia, Cheolmin Kim +2 more

Training large language models (LLMs) relies almost exclusively on dense adaptive optimizers with increasingly sophisticated preconditioners. We challenge this by showing that randomly masking parameter updates can be highly effective, with a masked variant of RMSProp consistently outperforming rece...

large language models · dense adaptive optimizers · preconditioners · random masking · RMSProp · +4 more
Feb 17, 2026 · 7
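The core idea (randomly masking which coordinates receive their update) is simple enough to sketch. Below is one plausible reading of a masked RMSProp step; the hyperparameters and the choice to keep the second-moment accumulator unmasked are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def masked_rmsprop_step(p, grad, v, lr=1e-3, beta=0.999,
                        eps=1e-8, mask_frac=0.5, rng=None):
    """One RMSProp step with a random binary mask on the update.

    Only a random mask_frac fraction of coordinates receive their
    update this step; the second-moment estimate v is still updated
    everywhere (an assumption of this sketch).
    """
    rng = rng or np.random.default_rng()
    v = beta * v + (1 - beta) * grad**2             # second-moment estimate
    mask = rng.random(p.shape) < mask_frac          # random coordinate mask
    p = p - lr * mask * grad / (np.sqrt(v) + eps)   # masked update
    return p, v
```

Setting mask_frac=1.0 recovers plain RMSProp, so the mask fraction interpolates between dense updates and no updates at all.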

Query as Anchor: Scenario-Adaptive User Representation via Large Language Model

Jiahao Yuan, Yike Xu, Jinyong Wen +9 more

Industrial-scale user representation learning requires balancing robust universality with acute task-sensitivity. However, existing paradigms primarily yield static, task-agnostic embeddings that struggle to reconcile the divergent requirements of downstream scenarios within unified vector spaces. F...

User · UQ-Anchor Embedding · dual-tower LLMs · joint contrastive-autoregressive optimization · Cluster-based Soft Prompt Tuning · +1 more
Feb 16, 2026 · 12

Panini: Continual Learning in Token Space via Structured Memory

Shreyas Rajesh, Pavan Holur, Mehmet Yigit Turali +2 more

Language models are increasingly used to reason over content they were not trained on, such as new documents, evolving knowledge, and user-specific data. A common approach is retrieval-augmented generation (RAG), which stores verbatim documents externally (as chunks) and retrieves only a relevant su...

retrieval-augmented generation · continual learning · semantic memory state · Generative Semantic Workspaces · question-answer pairs · +4 more
Feb 16, 2026 · 6
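The RAG baseline the abstract contrasts against (store verbatim chunks, retrieve only the relevant subset) can be sketched in a few lines. This is a deliberately minimal bag-of-words version for illustration; real systems use learned embeddings and vector indexes, and none of the names below come from the paper.

```python
from collections import Counter
import math

def chunk(text, size=40):
    """Split a document into fixed-size word chunks (verbatim storage)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query (bag-of-words)."""
    q = Counter(query.lower().split())
    scored = [(cosine(q, Counter(c.lower().split())), c) for c in chunks]
    return [c for _, c in sorted(scored, key=lambda s: -s[0])[:k]]
```

The limitation the paper targets is visible here: the store never consolidates anything, so every query pays for re-retrieving and re-reading raw text rather than a structured memory state.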

Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality

Nitay Calderon, Eyal Ben-David, Zorik Gekhman +2 more

Standard factuality evaluations of LLMs treat all errors alike, obscuring whether failures arise from missing knowledge (empty shelves) or from limited access to encoded facts (lost keys). We propose a behavioral framework that profiles factual knowledge at the level of facts rather than questions, ...

factuality evaluations · LLMs · factual knowledge · encoded facts · recall accessibility · +5 more
Feb 15, 2026 · 20

The Pensieve Paradigm: Stateful Language Models Mastering Their Own Context

Xiaoyuan Liu, Tian Liang, Dongyang Ma +4 more

In the world of Harry Potter, when Dumbledore's mind is overburdened, he extracts memories into a Pensieve to be revisited later. In the world of AI, while we possess the Pensieve (mature databases and retrieval systems), our models inexplicably lack the "wand" to operate it. They remain like a Dumble...

foundation models · internal reasoning loop · memory tools · context pruning · document indexing · +6 more
Feb 12, 2026 · 13

DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels

Haolei Bai, Lingcheng Kong, Xueyi Chen +3 more

Diffusion large language models (dLLMs) have emerged as a compelling alternative to autoregressive (AR) LLMs, owing to their capacity for parallel token generation. This paradigm is particularly well-suited for code generation, where holistic structural planning and non-sequential refinement are cri...

diffusion large language models · autoregressive LLMs · parallel token generation · CUDA kernel generation · supervised fine-tuning · +4 more
Feb 12, 2026 · 5
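The parallel, non-sequential decoding that distinguishes dLLMs from autoregressive models can be shown with a toy loop: start from a fully masked sequence and commit the highest-confidence predictions at several positions per refinement step. The `predict` callable is a stand-in for a real diffusion model; everything here is an illustrative assumption.

```python
def parallel_unmask(length, predict, tokens_per_step=4, mask="<m>"):
    """Iteratively fill a fully masked sequence, several tokens per step.

    `predict` maps the current partial sequence to a dict
    {position: (token, confidence)} for every still-masked position;
    each refinement step commits the highest-confidence predictions
    in parallel, regardless of their position in the sequence.
    """
    seq = [mask] * length
    while mask in seq:
        proposals = predict(seq)                   # one model call per step
        best = sorted(proposals.items(),
                      key=lambda kv: -kv[1][1])[:tokens_per_step]
        for pos, (tok, _) in best:                 # parallel commit
            seq[pos] = tok
    return seq
```

A sequence of length L finishes in about L / tokens_per_step model calls instead of L, which is the efficiency argument for applying dLLMs to structured targets like CUDA kernels.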

Sci-CoE: Co-evolving Scientific Reasoning LLMs via Geometric Consensus with Sparse Supervision

Xiaohan He, Shiyang Feng, Songtao Huang +3 more

Large language models (LLMs) have demonstrated exceptional reasoning capabilities, and co-evolving paradigms have shown promising results in domains such as code and math. However, in scientific reasoning tasks, these models remain fragile due to unreliable solution evaluation and limited diversity ...

large language models · scientific reasoning · co-evolving paradigms · sparse supervision · unsupervised learning · +8 more
Feb 12, 2026 · 4

Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm

Jinrui Zhang, Chaodong Xiao, Aoqi Wu +2 more

Pretraining large language models (LLMs) typically requires centralized clusters with thousands of high-memory GPUs (e.g., H100/A100). Recent decentralized training methods reduce communication overhead by employing federated optimization; however, they still need to train the entire model on each n...

mixture-of-experts · language models · decentralized training · federated optimization · expert synchronization · +3 more
Feb 12, 2026 · 4

ThinkRouter: Efficient Reasoning via Routing Thinking between Latent and Discrete Spaces

Xin Xu, Tong Yu, Xiang Chen +3 more

Recent work explores latent reasoning to improve reasoning efficiency by replacing explicit reasoning trajectories with continuous representations in a latent space, yet its effectiveness varies across settings. Analysis of model confidence dynamics under latent reasoning reveals that thinking traje...

latent reasoning · continuous representations · latent space · discrete token space · model confidence · +8 more
Feb 12, 2026 · 6

T3D: Few-Step Diffusion Language Models via Trajectory Self-Distillation with Direct Discriminative Optimization

Tunyu Zhang, Xinxi Zhang, Ligong Han +9 more

Diffusion large language models (DLLMs) have the potential to enable fast text generation by decoding multiple tokens in parallel. However, in practice, their inference efficiency is constrained by the need for many refinement steps, while aggressively reducing the number of steps leads to a substan...

diffusion large language models · trajectory self-distillation · self-distillation · Direct Discriminative Optimization · reverse-KL objective · +4 more
Feb 12, 2026 · 8

MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling

MiniCPM Team, Wenhao An, Yingfa Chen +43 more

The evolution of large language models (LLMs) towards applications with ultra-long contexts faces challenges posed by the high computational and memory costs of the Transformer architecture. While existing sparse and linear attention mechanisms attempt to mitigate these issues, they typically involv...

large language models · Transformer architecture · sparse attention · linear attention · hybrid architecture · +6 more
Feb 12, 2026 · 5

Query-focused and Memory-aware Reranker for Long Context Processing

Yuqing Li, Jiangnan Li, Mo Yu +4 more

Building on the existing analysis of retrieval heads in large language models, we propose an alternative reranking framework that trains models to estimate passage-query relevance using the attention scores of selected heads. This approach provides a listwise solution that leverages holistic informat...

retrieval heads · reranking framework · attention scores · passage-query relevance · listwise solution · +10 more
Feb 12, 2026 · 46
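Scoring passages by the attention mass of selected heads can be sketched once the attention weights are in hand. The array shapes and span convention below are assumptions for illustration, not the paper's interface; a real reranker would extract `attn` from a forward pass over the concatenated query and passages.

```python
import numpy as np

def rerank_by_attention(attn, passage_spans, head_ids):
    """Rank passages by attention mass from selected retrieval heads.

    attn: array [n_heads, n_query_tokens, n_context_tokens] of
    attention weights from query tokens onto context tokens
    (assumed precomputed). A passage's score is the total attention
    its token span receives from the chosen heads; higher is better.
    """
    selected = attn[head_ids].sum(axis=(0, 1))   # attention mass per token
    scores = [selected[s:e].sum() for s, e in passage_spans]
    order = np.argsort(scores)[::-1]             # best passage first
    return list(order), scores
```

Because every passage is scored from the same forward pass, this is listwise by construction: the heads distribute a shared attention budget across all candidates at once.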

dVoting: Fast Voting for dLLMs

Sicheng Feng, Zigeng Chen, Xinyin Ma +2 more

Diffusion Large Language Models (dLLMs) represent a new paradigm beyond autoregressive modeling, offering competitive performance while naturally enabling a flexible decoding process. Specifically, dLLMs can generate tokens at arbitrary positions in parallel, endowing them with significant potential...

diffusion large language models · autoregressive modeling · parallel test-time scaling · token predictions · iterative refinement · +2 more
Feb 12, 2026 · 19
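The voting step at the heart of any test-time-scaling-by-voting scheme is straightforward; what dVoting adds is generating the candidates cheaply via the dLLM's parallel decoding. The sketch below shows only the generic voting step, with an illustrative normalization choice that is not from the paper.

```python
from collections import Counter

def vote(samples, normalize=lambda s: s.strip().lower()):
    """Majority vote over sampled model outputs.

    Outputs are normalized before counting (here: whitespace and
    case, an illustrative choice) so formatting variants pool
    together. Ties break toward the earliest-seen answer, matching
    Counter.most_common's insertion-order behavior.
    """
    counts = Counter(normalize(s) for s in samples)
    return counts.most_common(1)[0][0]
```

Normalization matters in practice: without it, "42" and " 42 " would split the majority and a minority answer could win.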

Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning

Dawid J. Kopiczko, Sagar Vaze, Tijmen Blankevoort +1 more

Supervised fine-tuning (SFT) on chain-of-thought data is an essential post-training step for reasoning language models. Standard machine learning intuition suggests that training with more unique training samples yields better generalization. Counterintuitively, we show that SFT benefits from repeti...

supervised fine-tuning · chain-of-thought data · reasoning language models · training epochs · token accuracy · +5 more
Feb 11, 2026 · 8

Benchmarking Large Language Models for Knowledge Graph Validation

Farzad Shami, Stefano Marchesin, Gianmaria Silvello

Knowledge Graphs (KGs) store structured factual knowledge by linking entities through relationships, crucial for many applications. These applications depend on the KG's factual accuracy, so verifying facts is essential, yet challenging. Expert manual verification is ideal but impractical on a large...

Knowledge Graphs · Large Language Models · fact validation · Retrieval-Augmented Generation · multi-model consensus · +2 more
Feb 11, 2026 · 4

When to Memorize and When to Stop: Gated Recurrent Memory for Long-Context Reasoning

Leheng Sheng, Yongtao Zhang, Wenchang Ma +6 more

While reasoning over long context is crucial for various real-world applications, it remains challenging for large language models (LLMs) as they suffer from performance degradation as the context length grows. Recent work MemAgent has tried to tackle this by processing context chunk-by-chunk in an ...

long-context reasoning · large language models · recurrent memory update · text-controlled gates · reward signals · +4 more
Feb 11, 2026 · 21
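The "when to memorize and when to stop" mechanism suggests a gated recurrent update: as each context chunk is processed, a gate decides per dimension whether to overwrite the memory with a new candidate or retain the old state. The sketch below shows only that update rule; how the gate logits are produced (the paper's text-controlled gates) is abstracted away.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_memory_update(memory, candidate, gate_logits):
    """Elementwise-gated recurrent memory update.

    The gate g = sigmoid(gate_logits) decides, per dimension, whether
    to write the new candidate (g -> 1, "memorize") or keep the old
    memory (g -> 0, "stop"/retain). A recurrent long-context reader
    would apply this once per processed chunk.
    """
    g = sigmoid(gate_logits)
    return g * candidate + (1.0 - g) * memory
```

The convex combination keeps the memory bounded regardless of context length, which is what lets chunk-by-chunk processing scale past the model's native window.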
Page 2 of 10