Reinforced Fast Weights with Next-Sequence Prediction

Hee Seung Hwang, Xindi Wu, Sanghyuk Chun, Olga Russakovsky
Published: February 18, 2026
Authors: 4
Word count: 11,529

Fast weights need sequence-level training, not token-level training; reinforced next-sequence prediction provides it.

Abstract

Fast weight architectures offer a promising alternative to attention-based transformers for long-context modeling by maintaining constant memory overhead regardless of context length. However, their potential is limited by the next-token prediction (NTP) training paradigm. NTP optimizes single-token predictions and ignores semantic coherence across multiple tokens following a prefix. Consequently, fast weight models, which dynamically update their parameters to store contextual information, learn suboptimal representations that fail to capture long-range dependencies. We introduce REFINE (Reinforced Fast weIghts with Next sEquence prediction), a reinforcement learning framework that trains fast weight models under the next-sequence prediction (NSP) objective. REFINE selects informative token positions based on prediction entropy, generates multi-token rollouts, assigns self-supervised sequence-level rewards, and optimizes the model with group relative policy optimization (GRPO). REFINE is applicable throughout the training lifecycle of pre-trained language models: mid-training, post-training, and test-time training. Our experiments on LaCT-760M and DeltaNet-1.3B demonstrate that REFINE consistently outperforms supervised fine-tuning with NTP across needle-in-a-haystack retrieval, long-context question answering, and diverse tasks in LongBench. REFINE provides an effective and versatile framework for improving long-context modeling in fast weight architectures.
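The abstract notes that REFINE optimizes the model with group relative policy optimization (GRPO). The paper's exact formulation is not reproduced here, but the core of GRPO is that each rollout's reward is normalized against the other rollouts generated from the same prefix, replacing a learned value baseline. A minimal sketch of that group-relative advantage computation (the function name and the `eps` stabilizer are illustrative, not from the paper):

```python
def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages in the style of GRPO.

    rewards: scalar rewards for a group of rollouts sampled from
    the same prefix. Each reward is normalized by the group's mean
    and standard deviation, so rollouts compete against their own
    group rather than against a learned critic.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    # eps guards against division by zero when all rewards are equal.
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, a group with rewards `[1.0, 0.0]` yields advantages close to `[+1, -1]`: the better rollout is reinforced and the worse one is suppressed, even though no absolute reward scale is assumed.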

Key Takeaways

  1. Fast weight architectures use fixed-size memory to handle long contexts efficiently with constant memory overhead.

  2. Next-token prediction misaligns with fast weights' design; next-sequence prediction better tests multi-token contextual information retention.

  3. Reinforcement learning enables selective application of sequence-level training, avoiding the computational explosion of exhaustive rollout generation.
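The selectivity in the last takeaway comes from the entropy-based position selection described in the abstract: rollouts are generated only at prefix positions where the model is most uncertain. A minimal sketch of that selection step, assuming per-position next-token distributions are available (all names here are illustrative, not the paper's API):

```python
import math

def entropy(probs):
    """Shannon entropy of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_rollout_positions(token_dists, k):
    """Return indices of the k prefix positions with the highest
    predictive entropy.

    token_dists: list of next-token distributions, one per position.
    Targeting high-entropy positions concentrates sequence-level
    supervision where the model is uncertain, instead of paying for
    multi-token rollouts at every prefix position.
    """
    scored = sorted(range(len(token_dists)),
                    key=lambda i: entropy(token_dists[i]),
                    reverse=True)
    return scored[:k]
```

A uniform distribution over four tokens (entropy ln 4) would be selected before a near-deterministic one (entropy near 0), so the rollout budget goes to genuinely ambiguous positions.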

Limitations

  • Traditional next-token prediction provides only single-token feedback, failing to exercise fast weights' ability to retain information across multi-token sequences.

  • Generating multi-token sequences for every prefix position creates prohibitive computational costs without selective training strategies.

Keywords

fast weight architectures, attention-based transformers, next-token prediction, next-sequence prediction, reinforcement learning, prediction entropy, multi-token rollouts, self-supervised sequence-level rewards, group relative policy optimization, needle-in-a-haystack retrieval, long-context question answering, LongBench
