Multi-Task DiT Policy: Coffee Capsules (Config Fix)
Diffusion Transformer (DiT) policy trained on villekuosmanen/bin_pick_pack_coffee_capsules for robotic bin-picking. The training config (batch size, learning rate, resize, horizon) follows the repo author's recommendations, which are based on settings that worked for him.
Training Details
| Parameter | Value |
|---|---|
| Architecture | DiT with CLIP ViT-B/16 vision encoder + CLIP text conditioning |
| Dataset | 47,865 samples, 200 episodes |
| State/Action dim | 17D: joint_pos(7) + eef_xyz(3) + rot6d(6) + gripper(1) |
| Delta actions | All dims except 6D rotation (absolute) |
| Normalization | Ramen (q02/q98 percentile, per-timestep, per-dim, clipped [-1.5, 1.5]); 6D rotation exempt |
| Batch size | 80 per GPU, 320 global (4x GPUs) |
| Training steps | 30,530 / 50,000 (walltime limit) |
| Learning rate | 3e-4, cosine schedule, 500 warmup steps |
| Diffusion | DDIM, 100 train timesteps, 20 inference steps |
| Horizon | 32 |
| Action steps | 32 |
| Obs steps | 2 |
| Vision resize | 224x224, no crop |
| Mixed precision | AMP |
| Optimizer | Adam, grad clip 1.0 |
| Hardware | 1 node, 4x NVIDIA GH200 (Isambard-AI AIP2) |
| Training time | 24h (walltime limit) |
| Final loss | ~0.004-0.006 |
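The normalization row above can be illustrated with a minimal NumPy sketch. It assumes the 17 dimensions are stored in the order listed in the table (joint_pos, eef_xyz, rot6d, gripper) and that the q02/q98 range is mapped linearly onto [-1, 1] before clipping. Both are assumptions for illustration; the function names are hypothetical and the repo's actual implementation may differ.

```python
import numpy as np

HORIZON, ACTION_DIM = 32, 17
# Assumed layout: joint_pos 0:7, eef_xyz 7:10, rot6d 10:16, gripper 16.
ROT6D = slice(10, 16)


def fit_quantile_stats(actions):
    """Per-timestep, per-dim q02/q98 over a dataset of action chunks.

    actions has shape (N, HORIZON, ACTION_DIM); each returned array has
    shape (HORIZON, ACTION_DIM).
    """
    q02 = np.quantile(actions, 0.02, axis=0)
    q98 = np.quantile(actions, 0.98, axis=0)
    return q02, q98


def normalize(actions, q02, q98, clip=1.5, eps=1e-8):
    """Map [q02, q98] to [-1, 1], clip to [-clip, clip], skip rot6d."""
    scale = np.maximum(q98 - q02, eps)  # guard against constant dims
    out = 2.0 * (actions - q02) / scale - 1.0
    out = np.clip(out, -clip, clip)
    # The 6D rotation dims are exempt: pass them through unchanged.
    out[..., ROT6D] = actions[..., ROT6D]
    return out
```

Clipping at 1.5 rather than 1.0 leaves some headroom for values outside the 2nd to 98th percentile range while still bounding outliers.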
Checkpoints
| Checkpoint | Steps | sha256 (model.safetensors) |
|---|---|---|
| checkpoint_30000 | 30k | 1e0fa327...e627b9 |
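A downloaded checkpoint can be checked against the digest in the table. The sketch below uses only the standard library; the helper names are hypothetical, and because the table shows the digest in truncated form, the comparison only covers the visible prefix and suffix.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so a ~1.3GB checkpoint is not read at once."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_truncated(full_hex, truncated):
    """Compare a full hex digest with a 'prefix...suffix' form."""
    prefix, _, suffix = truncated.partition("...")
    return full_hex.startswith(prefix) and full_hex.endswith(suffix)
```

For example, `matches_truncated(sha256_of_file("model.safetensors"), "1e0fa327...e627b9")` would check a local copy. A prefix/suffix match is weaker than comparing the full digest, so prefer the full value where it is available.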
Each checkpoint contains:
- `model.safetensors`: model weights (~1.3GB)
- `config.json`: model configuration
- `ramen_stats.pt`: normalization statistics (required for inference)
Task
Pick a single coffee capsule from the cardboard tray and drop it inside the brown cardboard container holding a plastic bag.
W&B
Training logs: wandb.ai/pravsels/dit_coffee_capsules_config_fix/runs/6pobxw3c
Usage
from multitask_dit_policy.model import MultiTaskDiTPolicy
policy = MultiTaskDiTPolicy.load("pravsels/dit_coffee_capsules_config_fix/checkpoint_30000")