Reinforced Fast Weights with Next-Sequence Prediction

Hee Seung Hwang, Xindi Wu, Sanghyuk Chun, Olga Russakovsky
Published: February 18, 2026
Authors: 4
Word count: 11,529

Fast weights need sequence-level training, not token-level training; reinforced next-sequence prediction provides it.

Abstract

Fast weight architectures offer a promising alternative to attention-based transformers for long-context modeling by maintaining constant memory overhead regardless of context length. However, their potential is limited by the next-token prediction (NTP) training paradigm. NTP optimizes single-token predictions and ignores semantic coherence across multiple tokens following a prefix. Consequently, fast weight models, which dynamically update their parameters to store contextual information, learn suboptimal representations that fail to capture long-range dependencies. We introduce REFINE (Reinforced Fast weIghts with Next sEquence prediction), a reinforcement learning framework that trains fast weight models under the next-sequence prediction (NSP) objective. REFINE selects informative token positions based on prediction entropy, generates multi-token rollouts, assigns self-supervised sequence-level rewards, and optimizes the model with group relative policy optimization (GRPO). REFINE is applicable throughout the training lifecycle of pre-trained language models: mid-training, post-training, and test-time training. Our experiments on LaCT-760M and DeltaNet-1.3B demonstrate that REFINE consistently outperforms supervised fine-tuning with NTP across needle-in-a-haystack retrieval, long-context question answering, and diverse tasks in LongBench. REFINE provides an effective and versatile framework for improving long-context modeling in fast weight architectures.
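The abstract notes that REFINE optimizes the model with group relative policy optimization (GRPO). The paper's exact formulation is not reproduced here, but the core of GRPO is that each rollout's reward is normalized against the other rollouts generated from the same prefix, replacing a learned value baseline. A minimal sketch of that group-relative advantage computation (the function name and the `eps` stabilizer are illustrative, not from the paper):

```python
def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages in the style of GRPO.

    rewards: scalar rewards for a group of rollouts sampled from
    the same prefix. Each reward is normalized by the group's mean
    and standard deviation, so rollouts compete against their own
    group rather than against a learned critic.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    # eps guards against division by zero when all rewards are equal.
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, a group with rewards `[1.0, 0.0]` yields advantages close to `[+1, -1]`: the better rollout is reinforced and the worse one is suppressed, even though no absolute reward scale is assumed.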

Key Takeaways

  1. Fast weight architectures use fixed-size memory to handle long contexts efficiently with constant memory overhead.

  2. Next-token prediction misaligns with fast weights' design; next-sequence prediction better tests multi-token contextual information retention.

  3. Reinforcement learning enables selective application of sequence-level training, avoiding the computational explosion of exhaustive rollout generation.
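The selectivity in the last takeaway comes from the entropy-based position selection described in the abstract: rollouts are generated only at prefix positions where the model is most uncertain. A minimal sketch of that selection step, assuming per-position next-token distributions are available (all names here are illustrative, not the paper's API):

```python
import math

def entropy(probs):
    """Shannon entropy of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_rollout_positions(token_dists, k):
    """Return indices of the k prefix positions with the highest
    predictive entropy.

    token_dists: list of next-token distributions, one per position.
    Targeting high-entropy positions concentrates sequence-level
    supervision where the model is uncertain, instead of paying for
    multi-token rollouts at every prefix position.
    """
    scored = sorted(range(len(token_dists)),
                    key=lambda i: entropy(token_dists[i]),
                    reverse=True)
    return scored[:k]
```

A uniform distribution over four tokens (entropy ln 4) would be selected before a near-deterministic one (entropy near 0), so the rollout budget goes to genuinely ambiguous positions.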

Limitations

  • Traditional next-token prediction provides only single-token feedback, failing to exercise fast weights' ability to retain information across multi-token sequences.

  • Generating multi-token sequences for every prefix position creates prohibitive computational costs without selective training strategies.

Keywords

fast weight architectures, attention-based transformers, next-token prediction, next-sequence prediction, reinforcement learning, prediction entropy, multi-token rollouts, self-supervised sequence-level rewards, group relative policy optimization, needle-in-a-haystack retrieval, long-context question answering, LongBench
