GoodGlinda-7B Training Code
The model uses a hierarchical three-tier architecture and was trained entirely on consumer hardware. I trained it on an Intel Core i7-12700 with an RTX 4060 (8GB) and an RTX 5070 Ti (16GB), overclocked and undervolted, in an asymmetric configuration that left the 5070 Ti idle 30% of the time. At hour 14, the 4060 hit 83°C and throttled. I replaced the thermal paste at hour 18 and watched temperatures stabilize at 79°C for the remaining 54 hours. I recommend watercooling or another liquid-cooling setup to save yourself the trouble.
I initially wasted two days trying to implement pipeline parallelism before admitting defeat and switching to DeepSpeed ZeRO-2 with CPU offloading.
My Hardware Setup
- CPU: Intel Core i7-12700 (12th Gen)
- GPU 0: RTX 4060 (8GB VRAM) - Auxiliary/Offloading (throttled at 83°C before paste fix)
- GPU 1: RTX 5070 Ti (16GB VRAM) - Primary training (30% idle time due to asymmetry)
- RAM: 64GB DDR5-4800
- Storage: 2TB NVMe Gen4
- Power Supply: 850W (upgraded from 650W mid-training after voltage drops)
Quick Start
Install dependencies:
pip install -r requirements.txt
Generate the dataset (I used DeepSeek-V2 as teacher):
python prepare_data.py \
--output_dir ./data \
--num_samples 50000 \
--teacher_model deepseek-ai/deepseek-llm-7b-chat
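prepare_data.py itself is only summarized in this README, so here is a minimal sketch of what the teacher-generation step could look like. The prompts.txt file, the JSONL output format, and the prompt/response field names are assumptions for illustration, not the actual pipeline.

```python
# Hypothetical sketch of prepare_data.py: sample responses from the teacher
# model and write prompt/response pairs as JSONL. Prompt source and output
# schema are assumptions.
import argparse
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--output_dir", default="./data")
    parser.add_argument("--num_samples", type=int, default=50000)
    parser.add_argument("--teacher_model", default="deepseek-ai/deepseek-llm-7b-chat")
    args = parser.parse_args()

    tokenizer = AutoTokenizer.from_pretrained(args.teacher_model)
    model = AutoModelForCausalLM.from_pretrained(
        args.teacher_model, torch_dtype=torch.bfloat16, device_map="auto"
    )

    # Assumed prompt source: one instruction per line.
    with open("prompts.txt") as f:
        prompts = [line.strip() for line in f if line.strip()]

    with open(f"{args.output_dir}/train.jsonl", "w") as out:
        for prompt in prompts[: args.num_samples]:
            messages = [{"role": "user", "content": prompt}]
            inputs = tokenizer.apply_chat_template(
                messages, add_generation_prompt=True, return_tensors="pt"
            ).to(model.device)
            output = model.generate(
                inputs, max_new_tokens=512, do_sample=True, temperature=0.7
            )
            response = tokenizer.decode(
                output[0][inputs.shape[-1]:], skip_special_tokens=True
            )
            out.write(json.dumps({"prompt": prompt, "response": response}) + "\n")


if __name__ == "__main__":
    main()
```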
Train (72 hours, single run):
deepspeed --num_gpus=2 train.py \
--deepspeed ds_config.json \
--model_name Qwen/Qwen2.5-7B-Instruct \
--output_dir ./output \
--num_train_epochs 3 \
--learning_rate 2e-4 \
--warmup_steps 500
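The full train.py is only a "simplified skeleton" in this repo, but the CLI flags above map onto a fairly standard Hugging Face Trainer setup. The sketch below shows that wiring under those assumptions; the dataset path, tokenization, and collator are omitted, and the quantized + LoRA-wrapped `model` is sketched separately under Key Configurations.

```python
# Hypothetical wiring of the CLI flags above into a Trainer run.
# Dataset tokenization / data collator are omitted for brevity.
from datasets import load_dataset
from transformers import Trainer, TrainingArguments


def build_trainer(model, tokenizer, deepspeed_config="ds_config.json"):
    dataset = load_dataset("json", data_files="./data/train.jsonl")["train"]  # assumed path

    args = TrainingArguments(
        output_dir="./output",
        num_train_epochs=3,
        learning_rate=2e-4,
        warmup_steps=500,
        lr_scheduler_type="cosine",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=2,
        bf16=True,                      # assumed; fp16 would also work here
        deepspeed=deepspeed_config,     # ZeRO-2 + CPU offload config
        save_steps=500,                 # matches the checkpoints/ layout
        logging_steps=10,
    )
    return Trainer(model=model, args=args, train_dataset=dataset, tokenizer=tokenizer)
```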
Key Configurations
I used 4-bit NormalFloat (NF4) with double quantization to squeeze the model into the 4060's 8GB of VRAM.
| Parameter | Value | Notes |
|---|---|---|
| Quantization | 4-bit NormalFloat | Double quantization enabled |
| Optimizer | AdamW 8-bit | CPU offloading via DeepSpeed ZeRO-2 |
| Effective Batch Size | 8 | 2 per GPU × 2 GPUs × 2 gradient accumulation steps |
| Learning Rate | 2e-4 | Cosine decay with 10% warmup |
| LoRA Rank | 64 | Targeting q, k, v, o projections |
| Training Duration | 72 hours | Continuous, single run, no restarts |
| Peak Temp (4060) | 83°C -> 79°C | After thermal paste replacement |
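Assuming the usual bitsandbytes + PEFT stack (which the table implies but the repo snippet doesn't show verbatim), the quantization and adapter settings above correspond roughly to the following; the LoRA alpha and dropout values are assumptions, and the 8-bit AdamW with CPU offload lives in the DeepSpeed config rather than here.

```python
# Sketch of the model setup matching the table above; exact arguments in the
# real train.py may differ.
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # 4-bit NormalFloat
    bnb_4bit_use_double_quant=True,      # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,                       # not stated in the table; assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,                   # assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```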
Repository Structure
├── train.py              # Main training script (simplified skeleton)
├── ds_config.json        # DeepSpeed ZeRO-2 config for asymmetric VRAM
├── prepare_data.py       # Dataset generation using DeepSeek-V2
├── verify_data.py        # Validation checks
├── requirements.txt      # Locked versions
├── logs/
│   ├── loss_curves.png       # I screengrabbed this at hour 68
│   ├── gpu_utilization.log   # Shows the 30% idle time on 5070 Ti
│   └── thermal_stats.log     # 83°C spike visible at hour 14
└── checkpoints/          # Saved every 500 steps
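verify_data.py isn't reproduced in this README. As a rough idea of the validation checks, the sketch below spot-checks 200 random samples (matching the verification protocol mentioned under Reproducibility Notes) for non-empty prompt/response fields; the file path and field names are assumptions.

```python
# Hypothetical sketch of verify_data.py: spot-check 200 random JSONL samples
# for basic structural validity. Path and field names are assumptions.
import json
import random


def verify(path="./data/train.jsonl", n_checks=200):
    with open(path) as f:
        samples = [json.loads(line) for line in f]

    checked = min(n_checks, len(samples))
    bad = 0
    for sample in random.sample(samples, checked):
        if not sample.get("prompt") or not sample.get("response"):
            bad += 1

    print(f"Checked {checked} samples, {bad} failed basic checks")
    return bad == 0


if __name__ == "__main__":
    verify()
```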
What I Learned
Pipeline Parallelism Failure: I spent two days trying to split layers across the 8GB and 16GB cards manually. It failed constantly due to communication overhead. ZeRO-2 with CPU offloading solved this in 20 minutes but left the 5070 Ti underutilized.
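For reference, a minimal ZeRO-2 config with optimizer CPU offload along these lines looks like the following. This is a generic sketch, not necessarily the exact ds_config.json shipped in the repo, which may set additional fields such as the optimizer and precision sections.

```python
# Generic ZeRO-2 + CPU-offload DeepSpeed config, written out as JSON.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 2,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "bf16": {"enabled": True},
    "gradient_clipping": 1.0,
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```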
Thermal Management: The RTX 4060 required aggressive intervention. I set a custom fan curve (80% speed at 75°C), replaced the thermal paste at hour 18 (dropped temps by 4°C), and added case fans. Without these, the card would throttle to 2.1GHz, adding roughly 40% to training time. Watercooling would probably have been a better solution, but I never tested it.
Single Run Limitations: Running multiple seeds was not feasible in a home lab, so this is a single 72-hour run. Your results may vary ±3-5% due to random initialization.
Reproducibility Notes
What you can reproduce:
- Training procedure with identical hyperparameters
- Dataset generation pipeline (requires DeepSeek-V2 API access)
- Verification protocol on 200 samples
Known limitations:
- Single training run (no seed averaging)
- Exact loss curves may vary ±3-5% due to hardware noise
- Thermal throttling events may affect timing (depends on ambient temperature)
Details on the full methodology will appear in an upcoming publication.
Troubleshooting
CUDA OOM Errors: I hit these constantly on the 4060. Reduce per_device_train_batch_size to 1, increase gradient_accumulation_steps to 4, or enable more aggressive gradient checkpointing.
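Concretely, that mitigation amounts to something like the settings below (argument names follow the Hugging Face Trainer; adjust if your train.py exposes different flags):

```python
# OOM fallback: smaller micro-batches, more accumulation, and gradient
# checkpointing to trade compute for memory.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=1,     # down from 2
    gradient_accumulation_steps=4,     # up from 2; keeps the effective batch at 8
    gradient_checkpointing=True,
    bf16=True,
    deepspeed="ds_config.json",
)
```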
Thermal Throttling: Monitor with nvidia-smi dmon. Target <83°C sustained. Consider undervolting (I stabilized at 0.95V @ 2.75GHz).
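If you prefer logging temperatures and clocks from Python rather than a separate nvidia-smi window, a small NVML polling loop works. The sketch below uses the nvidia-ml-py (pynvml) package, which is an extra dependency and not part of requirements.txt.

```python
# Simple GPU temperature/clock poller via NVML; run alongside training and
# watch for sustained readings above ~83 C or sudden SM clock drops.
import time

import pynvml

pynvml.nvmlInit()
handles = [
    pynvml.nvmlDeviceGetHandleByIndex(i)
    for i in range(pynvml.nvmlDeviceGetCount())
]

while True:
    for i, h in enumerate(handles):
        temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
        clock = pynvml.nvmlDeviceGetClockInfo(h, pynvml.NVML_CLOCK_SM)
        print(f"GPU{i}: {temp} C, SM clock {clock} MHz", flush=True)
    time.sleep(10)
```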
Slow Training: Expected throughput is ~1.8 samples/sec with my asymmetric setup. A symmetric dual-16GB setup would hit ~2.5 samples/sec, but I cannot afford that. The PCIe bottleneck is the limiting factor, not compute.
License
Apache 2.0. Commercial use permitted with attribution.