PIPer Stage 1 SFT - ShareGPT Checkpoint

Intermediate checkpoint from Stage 1 of the PIPer training pipeline.

Model Description

  • Base Model: Qwen3-8B-am
  • Training: Supervised Fine-Tuning on ShareGPT conversations
  • Purpose: Instruction-following baseline for Stage 2 RL training

Training Data

ShareGPT-style multi-turn conversations used for supervised fine-tuning, establishing the instruction-following behavior that Stage 2 RL builds on.
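
ShareGPT-format data stores each sample as a list of alternating human/assistant turns. A minimal illustrative record (the values below are invented, not taken from the actual dataset):

example = {
    "conversations": [
        {"from": "human", "value": "How do I create a Python virtual environment?"},
        {"from": "gpt", "value": "Run python -m venv .venv, then activate it with source .venv/bin/activate."},
    ]
}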

Training Configuration

  • Batch Size: 256 (8 GPUs × 32 samples)
  • Learning Rate: 2e-5 with cosine schedule
  • Warmup: 10% of steps
  • Sequence Length: 4096 tokens
  • Epochs: 3
  • Training Steps: 24
  • Training Time: ~15 minutes
  • Hardware: 8x NVIDIA H200 GPUs
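
The card does not name the training stack; as a rough sketch, here is how the hyperparameters above might map onto a TRL SFTTrainer run. The data file, output directory, and the "Qwen3-8B-am" hub path are placeholders, and argument names can differ across TRL versions:

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder data file; the actual ShareGPT dataset used is not named on this card.
dataset = load_dataset("json", data_files="sharegpt.jsonl", split="train")

config = SFTConfig(
    output_dir="piper-stage1-sft",      # hypothetical output path
    per_device_train_batch_size=32,     # 8 GPUs x 32 = effective batch size 256
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                   # warmup over 10% of steps
    num_train_epochs=3,
    max_seq_length=4096,                # argument name varies across TRL versions
    bf16=True,
)

trainer = SFTTrainer(
    model="Qwen3-8B-am",                # base model; full hub path not given here
    args=config,
    train_dataset=dataset,
)
trainer.train()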

Performance

This checkpoint serves as the initialization for Stage 2 RL training, which achieves:

  • 100% pass@5 on EnvBench (20-problem evaluation)

See PIPer-Stage2-RL-Final for the final trained model.
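
For reference, pass@k is conventionally estimated with the unbiased formula from Chen et al. (2021); with exactly 5 samples per problem, pass@5 reduces to "at least one of the 5 samples passes". A minimal sketch (the exact evaluation protocol is an assumption, not specified by this card):

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n - c, k) / C(n, k), for n samples with c passing."""
    if n - c < k:
        return 1.0  # fewer than k failing samples: every k-subset contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# 100% pass@5 over the 20-problem set means every problem had at least one
# passing sample among its 5 generations.
print(pass_at_k(n=5, c=1, k=5))  # -> 1.0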

Usage

Use this checkpoint as a starting point for RL fine-tuning on environment setup tasks:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the checkpoint in bfloat16 and let Accelerate shard it across
# whatever GPUs are visible.
model = AutoModelForCausalLM.from_pretrained(
    "PIPer-Stage1-SFT-ShareGPT",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("PIPer-Stage1-SFT-ShareGPT")
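
As a quick smoke test, and assuming the checkpoint ships a Qwen-style chat template, you can generate a response like this (the prompt is illustrative):

messages = [{"role": "user", "content": "Write a script that sets up a Python dev environment."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))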

Related Models

  • PIPer-Stage2-RL-Final – the final model after Stage 2 RL training, initialized from this checkpoint

License

Same as the base model (Qwen3-8B-am).
