GoodGlinda-7B Training Code
The model uses a hierarchical three-tier architecture and was trained entirely on consumer hardware. I trained it on an Intel Core i7-12700 with an RTX 4060 (8GB) and an RTX 5070 Ti (16GB), overclocked and undervolted, in an asymmetric configuration that left the 5070 Ti idle 30% of the time. At hour 14, the 4060 hit 83°C and throttled. I replaced the thermal paste at hour 18 and watched temperatures stabilize at 79°C for the remaining 54 hours. I recommend watercooling or another liquid-cooling setup to save yourself the trouble.
I initially wasted two days trying to implement pipeline parallelism before admitting defeat and switching to DeepSpeed ZeRO-2 with CPU offloading.
My Hardware Setup
- CPU: Intel Core i7-12700 (12th Gen)
- GPU 0: RTX 4060 (8GB VRAM) - Auxiliary/Offloading (throttled at 83°C before paste fix)
- GPU 1: RTX 5070 Ti (16GB VRAM) - Primary training (30% idle time due to asymmetry)
- RAM: 64GB DDR5-4800
- Storage: 2TB NVMe Gen4
- Power Supply: 850W (upgraded from 650W mid-training after voltage drops)
Quick Start
Install dependencies:
pip install -r requirements.txt
Generate the dataset (I used DeepSeek-V2 as teacher):
python prepare_data.py \
--output_dir ./data \
--num_samples 50000 \
--teacher_model deepseek-ai/deepseek-llm-7b-chat
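prepare_data.py itself is only summarized in this README, so here is a minimal sketch of what the teacher-generation step could look like. The prompts.txt file, the JSONL output format, and the prompt/response field names are assumptions for illustration, not the actual pipeline.

```python
# Hypothetical sketch of prepare_data.py: sample responses from the teacher
# model and write prompt/response pairs as JSONL. Prompt source and output
# schema are assumptions.
import argparse
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--output_dir", default="./data")
    parser.add_argument("--num_samples", type=int, default=50000)
    parser.add_argument("--teacher_model", default="deepseek-ai/deepseek-llm-7b-chat")
    args = parser.parse_args()

    tokenizer = AutoTokenizer.from_pretrained(args.teacher_model)
    model = AutoModelForCausalLM.from_pretrained(
        args.teacher_model, torch_dtype=torch.bfloat16, device_map="auto"
    )

    # Assumed prompt source: one instruction per line.
    with open("prompts.txt") as f:
        prompts = [line.strip() for line in f if line.strip()]

    with open(f"{args.output_dir}/train.jsonl", "w") as out:
        for prompt in prompts[: args.num_samples]:
            messages = [{"role": "user", "content": prompt}]
            inputs = tokenizer.apply_chat_template(
                messages, add_generation_prompt=True, return_tensors="pt"
            ).to(model.device)
            output = model.generate(
                inputs, max_new_tokens=512, do_sample=True, temperature=0.7
            )
            response = tokenizer.decode(
                output[0][inputs.shape[-1]:], skip_special_tokens=True
            )
            out.write(json.dumps({"prompt": prompt, "response": response}) + "\n")


if __name__ == "__main__":
    main()
```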
Train (72 hours, single run):
deepspeed --num_gpus=2 train.py \
--deepspeed ds_config.json \
--model_name Qwen/Qwen2.5-7B-Instruct \
--output_dir ./output \
--num_train_epochs 3 \
--learning_rate 2e-4 \
--warmup_steps 500
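The full train.py is only a "simplified skeleton" in this repo, but the CLI flags above map onto a fairly standard Hugging Face Trainer setup. The sketch below shows that wiring under those assumptions; the dataset path, tokenization, and collator are omitted, and the quantized + LoRA-wrapped `model` is sketched separately under Key Configurations.

```python
# Hypothetical wiring of the CLI flags above into a Trainer run.
# Dataset tokenization / data collator are omitted for brevity.
from datasets import load_dataset
from transformers import Trainer, TrainingArguments


def build_trainer(model, tokenizer, deepspeed_config="ds_config.json"):
    dataset = load_dataset("json", data_files="./data/train.jsonl")["train"]  # assumed path

    args = TrainingArguments(
        output_dir="./output",
        num_train_epochs=3,
        learning_rate=2e-4,
        warmup_steps=500,
        lr_scheduler_type="cosine",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=2,
        bf16=True,                      # assumed; fp16 would also work here
        deepspeed=deepspeed_config,     # ZeRO-2 + CPU offload config
        save_steps=500,                 # matches the checkpoints/ layout
        logging_steps=10,
    )
    return Trainer(model=model, args=args, train_dataset=dataset, tokenizer=tokenizer)
```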
Key Configurations
I used 4-bit NormalFloat (NF4) with double quantization to squeeze the model into the 4060's 8GB of VRAM.
| Parameter | Value | Notes |
|---|---|---|
| Quantization | 4-bit NormalFloat | Double quantization enabled |
| Optimizer | AdamW 8-bit | CPU offloading via DeepSpeed ZeRO-2 |
| Effective Batch Size | 8 | 2 per GPU × 2 GPUs × 2 gradient accumulation steps |
| Learning Rate | 2e-4 | Cosine decay with 10% warmup |
| LoRA Rank | 64 | Targeting q, k, v, o projections |
| Training Duration | 72 hours | Continuous, single run, no restarts |
| Peak Temp (4060) | 83°C -> 79°C | After thermal paste replacement |
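Assuming the usual bitsandbytes + PEFT stack (which the table implies but the repo snippet doesn't show verbatim), the quantization and adapter settings above correspond roughly to the following; the LoRA alpha and dropout values are assumptions, and the 8-bit AdamW with CPU offload lives in the DeepSpeed config rather than here.

```python
# Sketch of the model setup matching the table above; exact arguments in the
# real train.py may differ.
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # 4-bit NormalFloat
    bnb_4bit_use_double_quant=True,      # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,                       # not stated in the table; assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,                   # assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```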
Repository Structure
├── train.py              # Main training script (simplified skeleton)
├── ds_config.json        # DeepSpeed ZeRO-2 config for asymmetric VRAM
├── prepare_data.py       # Dataset generation using DeepSeek-V2
├── verify_data.py        # Validation checks
├── requirements.txt      # Locked versions
├── logs/
│   ├── loss_curves.png       # I screengrabbed this at hour 68
│   ├── gpu_utilization.log   # Shows the 30% idle time on 5070 Ti
│   └── thermal_stats.log     # 83°C spike visible at hour 14
└── checkpoints/          # Saved every 500 steps
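verify_data.py isn't reproduced in this README. As a rough idea of the validation checks, the sketch below spot-checks 200 random samples (matching the verification protocol mentioned under Reproducibility Notes) for non-empty prompt/response fields; the file path and field names are assumptions.

```python
# Hypothetical sketch of verify_data.py: spot-check 200 random JSONL samples
# for basic structural validity. Path and field names are assumptions.
import json
import random


def verify(path="./data/train.jsonl", n_checks=200):
    with open(path) as f:
        samples = [json.loads(line) for line in f]

    checked = min(n_checks, len(samples))
    bad = 0
    for sample in random.sample(samples, checked):
        if not sample.get("prompt") or not sample.get("response"):
            bad += 1

    print(f"Checked {checked} samples, {bad} failed basic checks")
    return bad == 0


if __name__ == "__main__":
    verify()
```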
What I Learned
Pipeline Parallelism Failure: I spent two days trying to split layers across the 8GB and 16GB cards manually. It failed constantly due to communication overhead. ZeRO-2 with CPU offloading solved this in 20 minutes but left the 5070 Ti underutilized.
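For reference, a minimal ZeRO-2 config with optimizer CPU offload along these lines looks like the following. This is a generic sketch, not necessarily the exact ds_config.json shipped in the repo, which may set additional fields such as the optimizer and precision sections.

```python
# Generic ZeRO-2 + CPU-offload DeepSpeed config, written out as JSON.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 2,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "bf16": {"enabled": True},
    "gradient_clipping": 1.0,
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```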
Thermal Management: The RTX 4060 required aggressive intervention. I set a custom fan curve (80% speed at 75°C), replaced the thermal paste at hour 18 (dropped temps by 4°C), and added case fans. Without these, the card would throttle to 2.1GHz, adding roughly 40% to training time. Watercooling would probably have been a better solution, but I never tested it.
Single Run Limitations: Running multiple seeds was not feasible in a home lab, so this is a single 72-hour run. Your results may vary ±3-5% due to random initialization.
Reproducibility Notes
What you can reproduce:
- Training procedure with identical hyperparameters
- Dataset generation pipeline (requires DeepSeek-V2 API access)
- Verification protocol on 200 samples
Known limitations:
- Single training run (no seed averaging)
- Exact loss curves may vary ±3-5% due to hardware noise
- Thermal throttling events may affect timing (depends on ambient temperature)
Details on the full methodology will appear in an upcoming publication.
Troubleshooting
CUDA OOM Errors: I hit these constantly on the 4060. Reduce per_device_train_batch_size to 1, increase gradient_accumulation_steps to 4, or enable more aggressive gradient checkpointing.
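Concretely, that mitigation amounts to something like the settings below (argument names follow the Hugging Face Trainer; adjust if your train.py exposes different flags):

```python
# OOM fallback: smaller micro-batches, more accumulation, and gradient
# checkpointing to trade compute for memory.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=1,     # down from 2
    gradient_accumulation_steps=4,     # up from 2; keeps the effective batch at 8
    gradient_checkpointing=True,
    bf16=True,
    deepspeed="ds_config.json",
)
```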
Thermal Throttling: Monitor with nvidia-smi dmon. Target <83°C sustained. Consider undervolting (I stabilized at 0.95V @ 2.75GHz).
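If you prefer logging temperatures and clocks from Python rather than a separate nvidia-smi window, a small NVML polling loop works. The sketch below uses the nvidia-ml-py (pynvml) package, which is an extra dependency and not part of requirements.txt.

```python
# Simple GPU temperature/clock poller via NVML; run alongside training and
# watch for sustained readings above ~83 C or sudden SM clock drops.
import time

import pynvml

pynvml.nvmlInit()
handles = [
    pynvml.nvmlDeviceGetHandleByIndex(i)
    for i in range(pynvml.nvmlDeviceGetCount())
]

while True:
    for i, h in enumerate(handles):
        temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
        clock = pynvml.nvmlDeviceGetClockInfo(h, pynvml.NVML_CLOCK_SM)
        print(f"GPU{i}: {temp} C, SM clock {clock} MHz", flush=True)
    time.sleep(10)
```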
Slow Training: Expected throughput is ~1.8 samples/sec with my asymmetric setup. A symmetric dual-16GB setup would hit ~2.5 samples/sec, but I cannot afford that. The PCIe bottleneck is the limiting factor, not compute.
License
Apache 2.0. Commercial use permitted with attribution.