# Model Card for devstral-sft
This model is a fine-tuned version of [mistralai/Devstral-Small-2-24B-Instruct-2512](https://huggingface.co/mistralai/Devstral-Small-2-24B-Instruct-2512), trained with TRL (Transformer Reinforcement Learning), Hugging Face's library for training language models, which includes supervised fine-tuning. The base model was supervised fine-tuned on the open-thoughts/OpenThoughts-Agent-v1-SFT dataset, and the resulting LoRA adapter is published as Madhurprash/Devstral-Small-2-24B-Instruct-2512-SFT-LoRA-OpenThoughts. The adapter can be merged directly into the base model and evaluated on the Terminal Bench 2.0 benchmark.
## Quick start
```python
from transformers import pipeline

question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
generator = pipeline(
    "text-generation",
    model="Madhurprash/Devstral-Small-2-24B-Instruct-2512-SFT-LoRA-OpenThoughts",
    device="cuda",
)
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])
```
## Training procedure
This model was trained with SFT.
## Supervised Fine-Tuning (SFT) Guide
This guide covers the complete workflow for fine-tuning models with LoRA adapters, merging them with base models, and deploying them using vLLM.
### Table of Contents
- Fine-tuning with LoRA
- Merging LoRA Adapters
- Pushing Models to HuggingFace
- Serving with vLLM
- Configuration
### Fine-tuning with LoRA

#### Prerequisites

- A base model (e.g., Devstral, Mistral, Llama)
- A prepared training dataset
- Sufficient GPU memory
- A Python environment with the required packages
#### Training Process

1. Prepare your training data in the required format
2. Configure your training parameters
3. Run the training script
4. Monitor training progress
The LoRA adapter will be saved to the output directory specified in your training configuration.
Note: Specific training scripts should be configured based on your model architecture and dataset requirements.
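The exact data format depends on your trainer; for chat-style SFT (for example with TRL's `SFTTrainer`) a common convention is one JSON object with a `messages` list per line (JSONL). A minimal sketch of preparing and sanity-checking such a file (the file name and example content are illustrative, not prescribed by this repository):

```python
import json

# One chat-format training record per line (JSONL), as commonly
# expected by SFT trainers such as TRL's SFTTrainer.
records = [
    {
        "messages": [
            {"role": "user", "content": "List files in the current directory."},
            {"role": "assistant", "content": "You can run `ls -la` to list all files."},
        ]
    }
]

with open("train.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Sanity-check: every line parses and has a non-empty messages list.
with open("train.jsonl") as f:
    for line in f:
        row = json.loads(line)
        assert row["messages"], "each record needs at least one message"
```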
### Merging LoRA Adapters
After training, you have two options:
#### Option 1: Merge LoRA Adapter with Base Model

Use the generic merge script, which loads its configuration from `vLLM/config.yaml`:

```bash
python merge_lora.py
```
With a custom configuration:

```bash
python merge_lora.py --config /path/to/config.yaml
```
Or override specific parameters:

```bash
python merge_lora.py \
  --base-model "mistralai/Devstral-Small-2-24B-Instruct-2512" \
  --adapter-path "./outputs/devstral-sft" \
  --output-path "./outputs/merged-devstral-sft"
```
#### Option 2: Use the Mistral-Specific Merge Script

For Mistral models specifically:

```bash
python merge_mistral_lora.py
```

This script is hardcoded for `Mistral3ForConditionalGeneration` and uses the original configuration.
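Whichever script you use, a LoRA merge computes the same thing: the low-rank update is folded into the frozen base weight, `W_merged = W + (alpha / r) * B @ A`. A tiny pure-Python illustration of that arithmetic (matrix sizes and values are made up for the example):

```python
def matmul(X, Y):
    # Naive matrix multiply, sufficient for small illustrative matrices.
    return [
        [sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]

def merge_lora(W, A, B, alpha, r):
    # LoRA stores a low-rank update delta = (alpha / r) * B @ A;
    # merging adds it into the frozen base weight W element-wise.
    scale = alpha / r
    delta = matmul(B, A)
    return [
        [W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
        for i in range(len(W))
    ]

# Toy example: a 2x2 base weight and a rank-1 adapter (r=1, alpha=2).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # shape 2 x r
A = [[0.5, 0.5]]     # shape r x 2
merged = merge_lora(W, A, B, alpha=2.0, r=1)
print(merged)  # [[2.0, 1.0], [2.0, 3.0]]
```

After merging, the adapter matrices are no longer needed at inference time, which is why a merged model can be served without `enable_lora`.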
### Pushing Models to HuggingFace

The `push_to_hf.py` script provides three modes for uploading to the HuggingFace Hub:
#### Mode 1: Merge and Push (Default)

Merge the LoRA adapter with the base model and push the merged model:

```bash
python push_to_hf.py \
  --hf-repo-id "your-username/your-model-name" \
  --mode merge
```
With a HuggingFace token:

```bash
python push_to_hf.py \
  --hf-repo-id "your-username/your-model-name" \
  --hf-token "your_hf_token_here" \
  --mode merge
```
#### Mode 2: Push Adapter Only

Push only the LoRA adapter without merging:

```bash
python push_to_hf.py \
  --hf-repo-id "your-username/your-model-name" \
  --mode adapter \
  --adapter-path "./outputs/devstral-sft" \
  --hf-token "your_hf_token_here"
```
#### Mode 3: Push Existing Merged Model

Push an already-merged model:

```bash
python push_to_hf.py \
  --hf-repo-id "your-username/your-model-name" \
  --mode existing \
  --model-path "/path/to/merged/model"
```
#### Authentication

You can provide your HuggingFace token in three ways:

- As an argument: `--hf-token "your_token"`
- Cached login: run `huggingface-cli login` beforehand
- Environment variable: set `HF_TOKEN` in your environment
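A script implementing that precedence might resolve the token as follows; `resolve_hf_token` is a hypothetical helper for illustration, not part of `push_to_hf.py`:

```python
import os

def resolve_hf_token(cli_token=None):
    # Hypothetical resolution order mirroring the three options above:
    # an explicit --hf-token argument wins, then the HF_TOKEN environment
    # variable; returning None lets downstream libraries fall back to the
    # credentials cached by `huggingface-cli login`.
    if cli_token:
        return cli_token
    return os.environ.get("HF_TOKEN")

# Examples:
print(resolve_hf_token("abc123"))  # abc123 -- the argument wins
os.environ["HF_TOKEN"] = "env456"
print(resolve_hf_token())          # env456 -- falls back to the environment
```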
### Serving with vLLM
The vLLM server supports serving merged models or using LoRA adapters dynamically.
#### Start the Server

Navigate to the vLLM directory and run:

```bash
cd ../../vLLM
python serve.py
```

The server will:

- Read configuration from `config.yaml`
- Start on `http://localhost:8000`
- Provide OpenAI-compatible API endpoints
#### Configuration Options

Edit `vLLM/config.yaml` to configure:

**For merged models:**

```yaml
model_information:
  model_config:
    is_model_local: true
    model_path: "/path/to/merged/model"
  vllm_engine_config:
    enable_lora: false
```
For LoRA Adapters:
model_information:
model_config:
is_model_local: false
model_id: "mistralai/Devstral-Small-2-24B-Instruct-2512"
vllm_engine_config:
enable_lora: true
lora_modules:
devstral-sft: "/path/to/adapter"
max_loras: 1
max_lora_rank: 8
#### API Usage

Once the server is running, you can use it like any OpenAI-compatible API:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="devstral-sft",  # or your model name
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ],
)
print(response.choices[0].message.content)
```
### Configuration

#### Main Configuration File: `vLLM/config.yaml`

```yaml
general:
  name: "agentic-SLM-vllm-deployment"
  description: "vLLM deployment configuration"

model_information:
  model_config:
    is_model_local: false
    model_id: "your-model-id"
    model_path: "/path/to/local/model"
    trust_remote_code: true
    dtype: "auto"
  vllm_engine_config:
    max_model_len: 32768
    tensor_parallel_size: 8
    tool_call_parser: "mistral"
    enable_auto_tool_choice: true
    enable_lora: false
    lora_modules:
      adapter-name: "/path/to/adapter"
    max_loras: 1
    max_lora_rank: 8

inference_parameters:
  temperature: 0.6
  max_tokens: 8192

lora_merge:
  base_model: "mistralai/Devstral-Small-2-24B-Instruct-2512"
  adapter_path: "/path/to/adapter"
  output_path: "/path/to/output"
```
#### Key Configuration Parameters

- `is_model_local`: Set to `true` to load from a local path, `false` for the HuggingFace Hub
- `model_id`: HuggingFace model ID (when `is_model_local: false`)
- `model_path`: Local path to the model (when `is_model_local: true`)
- `enable_lora`: Set to `true` to enable dynamic LoRA adapter loading
- `lora_modules`: Dictionary of adapter names and paths
- `max_model_len`: Maximum context length
- `tensor_parallel_size`: Number of GPUs for tensor parallelism
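These constraints can be checked programmatically before starting the server. The following is a hypothetical helper, not part of the repository; it operates on the parsed contents of `config.yaml` (e.g., the dict returned by `yaml.safe_load`):

```python
def validate_model_config(cfg):
    # Minimal consistency checks for the config keys described above.
    model = cfg["model_information"]["model_config"]
    engine = cfg["model_information"]["vllm_engine_config"]

    if model.get("is_model_local"):
        assert model.get("model_path"), "local model requires model_path"
    else:
        assert model.get("model_id"), "Hub model requires model_id"

    if engine.get("enable_lora"):
        assert engine.get("lora_modules"), "enable_lora needs lora_modules"
        assert engine.get("max_lora_rank", 0) > 0, "set max_lora_rank"

cfg = {
    "model_information": {
        "model_config": {"is_model_local": False, "model_id": "your-model-id"},
        "vllm_engine_config": {
            "enable_lora": True,
            "lora_modules": {"devstral-sft": "/path/to/adapter"},
            "max_lora_rank": 8,
        },
    }
}
validate_model_config(cfg)  # raises AssertionError on inconsistent settings
print("config OK")
```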
### Complete Workflow Example

Here's a complete example workflow:
#### 1. Fine-tune the Model

```bash
# Your training script here
python train.py --output-dir ./outputs/devstral-sft
```
#### 2. Update the Configuration

Edit `../../vLLM/config.yaml`:

```yaml
lora_merge:
  base_model: "mistralai/Devstral-Small-2-24B-Instruct-2512"
  adapter_path: "./outputs/devstral-sft"
  output_path: "./outputs/merged-devstral-sft"
```
#### 3. Merge the LoRA Adapter

```bash
python merge_lora.py
```
#### 4. Push to HuggingFace

```bash
python push_to_hf.py \
  --hf-repo-id "your-username/devstral-sft" \
  --mode merge \
  --hf-token "your_token"
```
#### 5. Configure the vLLM Server

Update `../../vLLM/config.yaml` to use your model:

```yaml
model_information:
  model_config:
    is_model_local: false
    model_id: "your-username/devstral-sft"
```
#### 6. Start the vLLM Server

```bash
cd ../../vLLM
python serve.py
```
#### 7. Test the API

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-username/devstral-sft",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
### Troubleshooting

#### Common Issues

**Out of memory errors**

- Reduce `max_model_len`
- Reduce `tensor_parallel_size`
- Use a smaller batch size
**HuggingFace authentication failed**

- Run `huggingface-cli login`
- Or provide a token with `--hf-token`
**vLLM server won't start**

- Check GPU availability
- Verify the model path is correct
- Check `config.yaml` syntax
**LoRA adapter not loading**

- Verify the adapter path exists
- Check that `enable_lora: true` is set in the config
- Ensure `max_lora_rank` matches your adapter's rank
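PEFT records the adapter's rank as `"r"` in `adapter_config.json`, so the last check can be automated. A sketch using a hypothetical helper and a synthetic adapter directory (the real check would point at your actual adapter path):

```python
import json
import os
import tempfile

def adapter_rank(adapter_dir):
    # PEFT writes the LoRA rank as "r" in adapter_config.json;
    # it must not exceed the server's max_lora_rank setting.
    with open(os.path.join(adapter_dir, "adapter_config.json")) as f:
        return json.load(f)["r"]

# Illustrative check against a fake adapter directory:
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "adapter_config.json"), "w") as f:
        json.dump({"r": 8, "lora_alpha": 16}, f)
    max_lora_rank = 8
    rank = adapter_rank(d)
    print(rank <= max_lora_rank)  # True -> the server config is compatible
```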
## Framework versions

- PEFT: 0.18.0
- TRL: 0.26.2
- Transformers: 5.0.0.dev0
- PyTorch: 2.9.1
- Datasets: 4.4.1
- Tokenizers: 0.22.1
## Citations

Cite TRL as:

```bibtex
@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}
```