LLM - Pretraining Dataset Research
DataDecide: How to Predict Best Pretraining Data with Small Experiments (arXiv:2504.11393)
Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources (arXiv:2504.04152)
BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining (arXiv:2508.10975)
Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset (arXiv:2412.02595)
The Data-Quality Illusion: Rethinking Classifier-Based Quality Filtering for LLM Pretraining (arXiv:2510.00866)
Data, Data Everywhere: A Guide for Pretraining Dataset Construction (arXiv:2407.06380)
Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models (arXiv:2505.22232)
Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset (arXiv:2508.15096)
Reuse, Don't Retrain: A Recipe for Continued Pretraining of Language Models (arXiv:2407.07263)