Multi-Task DiT Policy: Coffee Capsules (Config Fix)

Diffusion Transformer (DiT) policy trained on villekuosmanen/bin_pick_pack_coffee_capsules for robotic bin-picking. The training configuration (batch size, learning rate, resize, horizon) follows the repo author's recommendations, which are based on settings that worked for him.

Training Details

| Parameter | Value |
|---|---|
| Architecture | DiT with CLIP ViT-B/16 vision encoder + CLIP text conditioning |
| Dataset | 47,865 samples, 200 episodes |
| State/Action dim | 17D: joint_pos(7) + eef_xyz(3) + rot6d(6) + gripper(1) |
| Delta actions | All dims except 6D rotation (absolute) |
| Normalization | Ramen (q02/q98 percentile, per-timestep, per-dim, clipped to [-1.5, 1.5]); 6D rotation exempt |
| Batch size | 80 per GPU, 320 global (4 GPUs) |
| Training steps | 30,530 / 50,000 (walltime limit) |
| Learning rate | 3e-4, cosine schedule, 500 warmup steps |
| Diffusion | DDIM, 100 train timesteps, 20 inference steps |
| Horizon | 32 |
| Action steps | 32 |
| Obs steps | 2 |
| Vision resize | 224x224, no crop |
| Mixed precision | AMP |
| Optimizer | Adam, grad clip 1.0 |
| Hardware | 1 node, 4x NVIDIA GH200 (Isambard-AI AIP2) |
| Training time | 24h (walltime limit) |
| Final loss | ~0.004-0.006 |
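
To make the State/Action and Normalization rows concrete, here is a minimal NumPy sketch. It is not the repo's implementation. The slice order simply follows the order listed in the table, and the mapping of q02 to -1 and q98 to +1 before clipping is an assumption about how the percentile normalization works; `LAYOUT`, `fit_stats`, and `normalize` are hypothetical names.

```python
import numpy as np

# Assumed ordering of the 17-D vector, read off the table row
# "joint_pos(7) + eef_xyz(3) + rot6d(6) + gripper(1)".
LAYOUT = {
    "joint_pos": slice(0, 7),
    "eef_xyz": slice(7, 10),
    "rot6d": slice(10, 16),
    "gripper": slice(16, 17),
}


def fit_stats(actions):
    """Per-timestep, per-dim 2nd and 98th percentiles.

    actions has shape (num_samples, horizon, 17); both results have
    shape (horizon, 17).
    """
    q02 = np.percentile(actions, 2, axis=0)
    q98 = np.percentile(actions, 98, axis=0)
    return q02, q98


def normalize(actions, q02, q98, clip=1.5, eps=1e-8):
    """Assumed mapping: q02 -> -1, q98 -> +1, then clip to [-clip, clip]."""
    scale = np.maximum(q98 - q02, eps)  # guard against constant dimensions
    out = 2.0 * (actions - q02) / scale - 1.0
    out = np.clip(out, -clip, clip)
    # The 6D rotation is exempt from normalization, so copy it through unchanged.
    out[..., LAYOUT["rot6d"]] = actions[..., LAYOUT["rot6d"]]
    return out
```

Statistics of this kind would be fitted once on the training set and reused at inference, which is consistent with the checkpoint shipping its own normalization statistics file.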

Checkpoints

| Checkpoint | Steps | sha256 (model.safetensors) |
|---|---|---|
| checkpoint_30000 | 30k | 1e0fa327...e627b9 |

Each checkpoint contains:

  • model.safetensors: model weights (~1.3 GB)
  • config.json: model configuration
  • ramen_stats.pt: normalization statistics (required for inference)
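
A downloaded model.safetensors can be checked against the sha256 column above. The sketch below uses only the Python standard library. Because the published digest is abbreviated, it compares just the visible prefix and suffix; the helper names are hypothetical and the local file path is up to you.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in 1 MiB chunks so a ~1.3 GB file is never held in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_abbreviated(hexdigest, abbreviated):
    """Compare against a digest written as '<prefix>...<suffix>'."""
    prefix, _, suffix = abbreviated.partition("...")
    return hexdigest.startswith(prefix) and hexdigest.endswith(suffix)
```

For example, `matches_abbreviated(sha256_of_file(path), "1e0fa327...e627b9")` should return True for an intact copy of checkpoint_30000. A match on the abbreviated form is weaker evidence than a match on the full digest.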

Task

Pick a single coffee capsule from the cardboard tray and drop it inside the brown cardboard container holding a plastic bag.

W&B

Training logs: wandb.ai/pravsels/dit_coffee_capsules_config_fix/runs/6pobxw3c

Usage

from multitask_dit_policy.model import MultiTaskDiTPolicy

policy = MultiTaskDiTPolicy.load("pravsels/dit_coffee_capsules_config_fix/checkpoint_30000")
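
With Obs steps 2 and Action steps 32, a control loop would typically keep the last two observations and request a new 32-step action chunk only once the previous chunk has been executed. The sketch below illustrates that pattern in plain Python. It does not use the policy's real inference API, which is not documented here: `get_obs`, `predict_chunk`, and `apply_action` are hypothetical callables you would supply, and repeating the first observation to fill the history is an assumption.

```python
from collections import deque

OBS_STEPS = 2      # "Obs steps" from the training table
ACTION_STEPS = 32  # "Action steps" from the training table


def run_episode(get_obs, predict_chunk, apply_action, max_steps):
    """Chunked control loop: predict ACTION_STEPS actions, execute all, repeat."""
    obs_history = deque(maxlen=OBS_STEPS)
    pending = deque()
    for _ in range(max_steps):
        obs = get_obs()
        if not obs_history:
            # Assumption: pad the history by repeating the first observation.
            obs_history.extend([obs] * OBS_STEPS)
        else:
            obs_history.append(obs)
        if not pending:
            # Hypothetical call: returns a sequence of at least ACTION_STEPS actions.
            chunk = predict_chunk(list(obs_history))
            pending.extend(chunk[:ACTION_STEPS])
        apply_action(pending.popleft())
```

Under these assumptions the policy is queried once every 32 control steps, so a 64-step episode triggers two predictions.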