Optimal Splitting of Language Models from Mixtures to Specialized Domains
Summary
A study proposing a scaling-law-based method for optimally allocating compute when splitting pre-trained language models into specialized domains.
Key Points
- Proposes a split model training method that improves on the two-stage paradigm of general pre-training followed by continued pre-training in specialized domains.
- Utilizes scaling laws to accurately predict model loss using model size N, pre-training tokens D, and specialization tokens D'.
- Designed to allow extrapolation to larger model sizes and token counts.
- Confirmed consistent performance improvements across various model sizes and computing budgets in common knowledge and reasoning benchmarks.
- Provides a method to pre-determine the optimal computing allocation for each specialized domain in a multi-domain setting.
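The scaling-law fitting described above can be sketched as follows. The exact functional form used in the paper is not given in this summary, so the sketch assumes a hypothetical Chinchilla-style law extended with a specialization-token term: predicted loss depends on model size N, pre-training tokens D, and specialization tokens D'. All parameter names and values below are illustrative assumptions, not the paper's.

```python
import numpy as np
from scipy.optimize import curve_fit

def predicted_loss(X, E, A, alpha, B, beta, C, gamma):
    """Hypothetical loss law: irreducible loss E plus power-law terms in
    model size N, pre-training tokens D, and specialization tokens Dp.
    This form is an assumption for illustration, not the paper's exact law."""
    N, D, Dp = X
    return E + A / N**alpha + B / D**beta + C / Dp**gamma

# Synthetic "training runs" generated from known parameters, standing in
# for measured losses at various (N, D, D') configurations.
rng = np.random.default_rng(0)
true_params = (1.7, 400.0, 0.34, 1100.0, 0.28, 300.0, 0.30)
N = rng.uniform(1e7, 1e9, 60)
D = rng.uniform(1e9, 1e11, 60)
Dp = rng.uniform(1e8, 1e10, 60)
y = predicted_loss((N, D, Dp), *true_params) + rng.normal(0, 1e-3, 60)

# Fit the law to the observations. Extrapolation to larger models and
# token budgets then amounts to evaluating predicted_loss at larger
# N, D, D' with the fitted parameters.
p0 = (2.0, 300.0, 0.3, 1000.0, 0.3, 300.0, 0.3)  # assumed initial guess
popt, _ = curve_fit(predicted_loss, (N, D, Dp), y, p0=p0, maxfev=20000)
fit_pred = predicted_loss((N, D, Dp), *popt)
```

With a fitted law of this kind, comparing predicted losses across candidate (N, D, D') allocations is what lets the compute split for each specialized domain be chosen before training.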
Notable Quotes & Details
- Accepted at ICLR 2026 'Workshop on Navigating and Addressing Data Problems for Foundation Models'
- Authors: Skyler Seto, Pierre Ablin, Anastasiia Filippova, Jiayuan Ye (National University of Singapore), Louis Bethune, Angelos Katharopoulos, David Grangier
Intended Audience
AI researchers, machine learning engineers