OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration
Shaobo Wang, Xuan Ouyang, and 10 others
As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic s...