GoodGlinda-7B Training Code


The model runs a hierarchical three-tier architecture on consumer hardware. I trained it on an Intel Core i7-12700 with an RTX 4060 (8GB) and an RTX 5070 Ti (16GB), both overclocked and undervolted, in an asymmetric configuration that left the 5070 Ti idle 30% of the time. At hour 14, the 4060 hit 83°C and throttled. I replaced the thermal paste at hour 18 and watched temperatures stabilize at 79°C for the remaining 54 hours. If you can, use water cooling or another liquid cooling setup and save yourself the trouble.

I initially wasted two days trying to implement pipeline parallelism before admitting defeat and switching to DeepSpeed ZeRO-2 with CPU offloading.

My Hardware Setup

  • CPU: Intel Core i7-12700 (12th Gen)
  • GPU 0: RTX 4060 (8GB VRAM) - Auxiliary/Offloading (throttled at 83°C before paste fix)
  • GPU 1: RTX 5070 Ti (16GB VRAM) - Primary training (30% idle time due to asymmetry)
  • RAM: 64GB DDR5-4800
  • Storage: 2TB NVMe Gen4
  • Power Supply: 850W (upgraded from 650W mid-training after voltage drops)

Quick Start

Install dependencies:

pip install -r requirements.txt

Generate the dataset (I used DeepSeek-V2 as teacher):

python prepare_data.py \
  --output_dir ./data \
  --num_samples 50000 \
  --teacher_model deepseek-ai/deepseek-llm-7b-chat

Train (72 hours, single run):

deepspeed --num_gpus=2 train.py \
  --deepspeed ds_config.json \
  --model_name Qwen/Qwen2.5-7B-Instruct \
  --output_dir ./output \
  --num_train_epochs 3 \
  --learning_rate 2e-4 \
  --warmup_steps 500
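The actual ds_config.json ships with the repo; for orientation, a minimal ZeRO-2 config with optimizer-state CPU offloading along these lines would work (field names follow the DeepSpeed JSON schema; the exact values here are assumptions, not a copy of the repo's file):

```python
import json

# Sketch of a DeepSpeed ZeRO-2 config with CPU offloading for optimizer
# states. Batch sizes mirror the effective batch size of 8 described below
# (2 per GPU x 2 GPUs x 2 accumulation steps); other values are assumptions.
ds_config = {
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 2,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "bf16": {"enabled": True},
    "gradient_clipping": 1.0,
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```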

Key Configurations

I used 4-bit NormalFloat (NF4) with double quantization to squeeze the model into 8GB of VRAM.

Parameter             Value               Notes
Quantization          4-bit NormalFloat   Double quantization enabled
Optimizer             AdamW 8-bit         CPU offloading via DeepSpeed ZeRO-2
Effective Batch Size  8                   2 per GPU × 2 GPUs × 2 gradient accumulation steps
Learning Rate         2e-4                Cosine decay with 10% warmup
LoRA Rank             64                  Targeting q, k, v, o projections
Training Duration     72 hours            Continuous, single run, no restarts
Peak Temp (4060)      83°C -> 79°C        After thermal paste replacement
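In Hugging Face `transformers`/`peft` terms, the quantization and LoRA rows above map roughly onto the configuration below. This is a sketch, not an excerpt from train.py; in particular, `lora_alpha` and `lora_dropout` are assumptions, since the card does not state them.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NormalFloat (NF4) with double quantization, as in the table above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA rank 64 on the attention projections (q, k, v, o).
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,          # assumption: not stated in the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,      # assumption: not stated in the card
    task_type="CAUSAL_LM",
)
```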

Repository Structure

├── train.py                 # Main training script (simplified skeleton)
├── ds_config.json           # DeepSpeed ZeRO-2 config for asymmetric VRAM
├── prepare_data.py          # Dataset generation using DeepSeek-V2
├── verify_data.py           # Validation checks
├── requirements.txt         # Locked versions
├── logs/
│   ├── loss_curves.png      # I screengrabbed this at hour 68
│   ├── gpu_utilization.log  # Shows the 30% idle time on 5070 Ti
│   └── thermal_stats.log    # 83°C spike visible at hour 14
└── checkpoints/             # Saved every 500 steps

What I Learned

Pipeline Parallelism Failure: I spent two days trying to split layers across the 8GB and 16GB cards manually. It failed constantly due to communication overhead. ZeRO-2 with CPU offloading solved this in 20 minutes but left the 5070 Ti underutilized.

Thermal Management: The RTX 4060 required aggressive intervention. I set a custom fan curve (80% speed at 75°C), replaced the thermal paste at hour 18 (dropping temps by 4°C), and added case fans. Without these, the card would throttle to 2.1GHz, adding roughly 40% to training time. In hindsight, water cooling would probably have been the better solution.

Single Run Limitations: Running multiple seeds was not feasible in a home lab, so this is a single 72-hour run. Your results may vary ±3-5% due to random initialization.

Reproducibility Notes

What you can reproduce:

  • Training procedure with identical hyperparameters
  • Dataset generation pipeline (requires DeepSeek-V2 API access)
  • Verification protocol on 200 samples

Known limitations:

  • Single training run (no seed averaging)
  • Exact loss curves may vary ±3-5% due to hardware noise
  • Thermal throttling events may affect timing (depends on ambient temperature)

Details on the full methodology will appear in an upcoming publication.

Troubleshooting

CUDA OOM Errors: I hit these constantly on the 4060. Reduce per_device_train_batch_size to 1, increase gradient_accumulation_steps to 4, or enable more aggressive gradient checkpointing.
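In `transformers` terms, those mitigations amount to something like the following (argument names follow the `TrainingArguments` API; this is a sketch, not the repo's actual invocation):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=1,   # down from 2 to fit the 8GB card
    gradient_accumulation_steps=4,   # up from 2, keeping effective batch size 8
    gradient_checkpointing=True,     # trade recompute for activation memory
    bf16=True,
)
```

Note that per_device_train_batch_size=1 with gradient_accumulation_steps=4 across two GPUs preserves the effective batch size of 8 used in training.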

Thermal Throttling: Monitor with nvidia-smi dmon. Target <83°C sustained. Consider undervolting (I stabilized at 0.95V @ 2.75GHz).
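nvidia-smi dmon prints one row per GPU per interval; a small parser for spotting GPUs at or above the throttle limit might look like this (the column layout is assumed from dmon's default output, where the GPU core temperature `gtemp` is the third field):

```python
def parse_dmon_line(line):
    """Parse one data row of `nvidia-smi dmon` output.

    Assumes the default column order `gpu pwr gtemp mtemp sm mem ...`,
    so the GPU core temperature is the third field. Returns
    (gpu_index, temp_celsius), or None for comment/header lines.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split()
    return int(fields[0]), int(fields[2])

def throttle_risk(lines, limit_c=83):
    """Return GPU indices whose temperature meets or exceeds limit_c."""
    hot = set()
    for line in lines:
        parsed = parse_dmon_line(line)
        if parsed and parsed[1] >= limit_c:
            hot.add(parsed[0])
    return sorted(hot)

sample = [
    "# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk",
    "    0   115    83    78    99    54     0     0  8500  2100",
    "    1   220    68    61    70    40     0     0 10500  2750",
]
print(throttle_risk(sample))  # [0] -- the 4060 is at the 83C limit
```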

Slow Training: Expected throughput is ~1.8 samples/sec with my asymmetric setup. A symmetric dual-16GB setup would hit ~2.5 samples/sec, but that was outside my budget. The PCIe bottleneck is the limiting factor, not compute.

License

Apache 2.0. Commercial use permitted with attribution.
