Smallish LLM pre-training datasets - a gaunernst Collection

updated Sep 30, 2024

roneneldan/TinyStories

Updated Aug 12, 2024 • 2.14M rows • 89.6k downloads • 1.01k likes

Note: V2 - 2GB

allenai/c4

Updated Jan 9, 2024 • 10.4B rows • 762k downloads • 584 likes

Note: realnewslike subset - 15GB

HuggingFaceFW/fineweb-edu

Updated Jul 11, 2025 • 3.5B rows • 619k downloads • 1.1k likes

Note: sample-10BT subset - 28GB

HuggingFaceTB/smollm-corpus

Updated Sep 6, 2024 • 237M rows • 59.2k downloads • 457 likes
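
The notes name a specific subset for two of the datasets (realnewslike for allenai/c4, sample-10BT for HuggingFaceFW/fineweb-edu). A minimal sketch of how those subsets might be opened with the Hugging Face `datasets` library follows. It assumes the subset names in the notes are the config names that `load_dataset` expects, and that a `train` split exists. The `SUBSETS` dict and the `stream_subset` and `peek` helpers are illustrative names, not part of the collection. TinyStories and smollm-corpus are left out because the collection does not give a config name for them.

```python
from itertools import islice

# Subset (config) names taken from the collection notes.
# Assumption: these strings are valid `name` values for load_dataset.
SUBSETS = {
    "allenai/c4": "realnewslike",
    "HuggingFaceFW/fineweb-edu": "sample-10BT",
}


def stream_subset(repo_id, split="train"):
    """Open the noted subset of repo_id as a streaming dataset."""
    # Imported lazily so SUBSETS can be inspected without `datasets` installed.
    from datasets import load_dataset

    # streaming=True iterates over remote shards instead of downloading
    # the full 15GB / 28GB subset up front.
    return load_dataset(repo_id, name=SUBSETS[repo_id], split=split, streaming=True)


def peek(repo_id, n=3):
    """Return the first n records of the noted subset as a list of dicts."""
    return list(islice(stream_subset(repo_id), n))
```

Calling `peek("allenai/c4")` needs network access and the `datasets` package. Streaming is chosen here only because the noted subsets are tens of gigabytes; dropping `streaming=True` downloads and caches the full subset instead.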