• This is a distillation experiment with Qwen2-1.5B as the teacher and Qwen2-0.5B as the student model; a minimal training-loop sketch is given after this list.
  • Training samples were drawn from the Pile dataset.
  • Optimizer: SM3; scheduler: cosine with warmup; lr = 2e-5.
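
The exact training recipe isn't published in this card, so the following is a minimal sketch of logit distillation under the stated settings (SM3, cosine schedule with warmup, lr = 2e-5). The temperature, sequence length, batch handling, and warmup/step counts are illustrative assumptions, and the `torch-optimizer` package is assumed as the SM3 implementation:

```python
# Minimal logit-distillation sketch; details not stated in the card are assumed.
import torch
import torch.nn.functional as F
from torch_optimizer import SM3  # pip install torch_optimizer
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    get_cosine_schedule_with_warmup,
)

device = "cuda" if torch.cuda.is_available() else "cpu"
teacher = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-1.5B", torch_dtype=torch.bfloat16
).to(device).eval()
student = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-0.5B", torch_dtype=torch.bfloat16
).to(device)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

optimizer = SM3(student.parameters(), lr=2e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=10_000  # assumed counts
)
temperature = 2.0  # assumed; not stated in the card


def distill_step(batch_texts):
    enc = tokenizer(
        batch_texts, return_tensors="pt", truncation=True,
        max_length=512, padding=True,
    ).to(device)
    with torch.no_grad():
        t_logits = teacher(**enc).logits
    s_logits = student(**enc).logits
    # KL divergence between temperature-softened teacher and student
    # next-token distributions, scaled by T^2 as in standard distillation.
    loss = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    return loss.item()
```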

Qwen2 is the new series of Qwen large language models. For Qwen2, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters, including a Mixture-of-Experts model. This repo contains the distilled 0.5B Qwen2 language model.
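
The weights ship as BF16 safetensors; a standard transformers snippet to load and sample from the checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("aloobun/d-Qwen2-0.5B", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("aloobun/d-Qwen2-0.5B")

inputs = tokenizer("Distillation transfers knowledge from a large model to", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```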
