# Model Card for devstral-sft
This model is a fine-tuned version of [mistralai/Devstral-Small-2-24B-Instruct-2512](https://huggingface.co/mistralai/Devstral-Small-2-24B-Instruct-2512), trained with TRL (Transformer Reinforcement Learning), Hugging Face's library for training language models, which includes supervised fine-tuning. The base model was supervised fine-tuned on the open-thoughts/OpenThoughts-Agent-v1-SFT dataset, and the resulting LoRA adapter is published as Madhurprash/Devstral-Small-2-24B-Instruct-2512-SFT-LoRA-OpenThoughts. The adapter can be merged directly into the base model and evaluated on the Terminal Bench 2.0 benchmark.
## Quick start
```python
from transformers import pipeline

question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
generator = pipeline(
    "text-generation",
    model="Madhurprash/Devstral-Small-2-24B-Instruct-2512-SFT-LoRA-OpenThoughts",
    device="cuda",
)
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])
```
## Training procedure
This model was trained with SFT.
## Supervised Fine-Tuning (SFT) Guide
This guide covers the complete workflow for fine-tuning models with LoRA adapters, merging them with base models, and deploying them using vLLM.
### Table of Contents
- Fine-tuning with LoRA
- Merging LoRA Adapters
- Pushing Models to HuggingFace
- Serving with vLLM
- Configuration
### Fine-tuning with LoRA

#### Prerequisites

- A base model (e.g., Devstral, Mistral, Llama)
- A prepared training dataset
- Sufficient GPU memory
- A Python environment with the required packages
#### Training Process

1. Prepare your training data in the required format
2. Configure your training parameters
3. Run the training script
4. Monitor training progress
The LoRA adapter will be saved to the output directory specified in your training configuration.
Note: Specific training scripts should be configured based on your model architecture and dataset requirements.
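The exact data format depends on your trainer; for chat-style SFT (for example with TRL's `SFTTrainer`) a common convention is one JSON object with a `messages` list per line (JSONL). A minimal sketch of preparing and sanity-checking such a file (the file name and example content are illustrative, not prescribed by this repository):

```python
import json

# One chat-format training record per line (JSONL), as commonly
# expected by SFT trainers such as TRL's SFTTrainer.
records = [
    {
        "messages": [
            {"role": "user", "content": "List files in the current directory."},
            {"role": "assistant", "content": "You can run `ls -la` to list all files."},
        ]
    }
]

with open("train.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Sanity-check: every line parses and has a non-empty messages list.
with open("train.jsonl") as f:
    for line in f:
        row = json.loads(line)
        assert row["messages"], "each record needs at least one message"
```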
### Merging LoRA Adapters
After training, you have two options:
#### Option 1: Merge LoRA Adapter with Base Model

Use the generic merge script, which loads its configuration from `vLLM/config.yaml`:

```bash
python merge_lora.py
```
With a custom configuration:

```bash
python merge_lora.py --config /path/to/config.yaml
```
Or override specific parameters:

```bash
python merge_lora.py \
  --base-model "mistralai/Devstral-Small-2-24B-Instruct-2512" \
  --adapter-path "./outputs/devstral-sft" \
  --output-path "./outputs/merged-devstral-sft"
```
#### Option 2: Use the Mistral-Specific Merge Script

For Mistral models specifically:

```bash
python merge_mistral_lora.py
```

This script is hardcoded for `Mistral3ForConditionalGeneration` and uses the original configuration.
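Whichever script you use, a LoRA merge computes the same thing: the low-rank update is folded into the frozen base weight, `W_merged = W + (alpha / r) * B @ A`. A tiny pure-Python illustration of that arithmetic (matrix sizes and values are made up for the example):

```python
def matmul(X, Y):
    # Naive matrix multiply, sufficient for small illustrative matrices.
    return [
        [sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]

def merge_lora(W, A, B, alpha, r):
    # LoRA stores a low-rank update delta = (alpha / r) * B @ A;
    # merging adds it into the frozen base weight W element-wise.
    scale = alpha / r
    delta = matmul(B, A)
    return [
        [W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
        for i in range(len(W))
    ]

# Toy example: a 2x2 base weight and a rank-1 adapter (r=1, alpha=2).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # shape 2 x r
A = [[0.5, 0.5]]     # shape r x 2
merged = merge_lora(W, A, B, alpha=2.0, r=1)
print(merged)  # [[2.0, 1.0], [2.0, 3.0]]
```

After merging, the adapter matrices are no longer needed at inference time, which is why a merged model can be served without `enable_lora`.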
### Pushing Models to HuggingFace

The `push_to_hf.py` script provides three modes for uploading to the HuggingFace Hub:
#### Mode 1: Merge and Push (Default)

Merge the LoRA adapter with the base model and push the merged model:

```bash
python push_to_hf.py \
  --hf-repo-id "your-username/your-model-name" \
  --mode merge
```
With a HuggingFace token:

```bash
python push_to_hf.py \
  --hf-repo-id "your-username/your-model-name" \
  --hf-token "your_hf_token_here" \
  --mode merge
```
#### Mode 2: Push Adapter Only

Push only the LoRA adapter without merging:

```bash
python push_to_hf.py \
  --hf-repo-id "your-username/your-model-name" \
  --mode adapter \
  --adapter-path "./outputs/devstral-sft" \
  --hf-token "your_hf_token_here"
```
#### Mode 3: Push Existing Merged Model

Push an already-merged model:

```bash
python push_to_hf.py \
  --hf-repo-id "your-username/your-model-name" \
  --mode existing \
  --model-path "/path/to/merged/model"
```
#### Authentication

You can provide your HuggingFace token in three ways:

- As an argument: `--hf-token "your_token"`
- Cached login: run `huggingface-cli login` beforehand
- Environment variable: set `HF_TOKEN` in your environment
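A script implementing that precedence might resolve the token as follows; `resolve_hf_token` is a hypothetical helper for illustration, not part of `push_to_hf.py`:

```python
import os

def resolve_hf_token(cli_token=None):
    # Hypothetical resolution order mirroring the three options above:
    # an explicit --hf-token argument wins, then the HF_TOKEN environment
    # variable; returning None lets downstream libraries fall back to the
    # credentials cached by `huggingface-cli login`.
    if cli_token:
        return cli_token
    return os.environ.get("HF_TOKEN")

# Examples:
print(resolve_hf_token("abc123"))  # abc123 -- the argument wins
os.environ["HF_TOKEN"] = "env456"
print(resolve_hf_token())          # env456 -- falls back to the environment
```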
### Serving with vLLM
The vLLM server supports serving merged models or using LoRA adapters dynamically.
#### Start the Server

Navigate to the vLLM directory and run:

```bash
cd ../../vLLM
python serve.py
```

The server will:

- Read configuration from `config.yaml`
- Start on `http://localhost:8000`
- Provide OpenAI-compatible API endpoints
#### Configuration Options

Edit `vLLM/config.yaml` to configure:

**For merged models:**

```yaml
model_information:
  model_config:
    is_model_local: true
    model_path: "/path/to/merged/model"
  vllm_engine_config:
    enable_lora: false
```
For LoRA Adapters:
model_information:
model_config:
is_model_local: false
model_id: "mistralai/Devstral-Small-2-24B-Instruct-2512"
vllm_engine_config:
enable_lora: true
lora_modules:
devstral-sft: "/path/to/adapter"
max_loras: 1
max_lora_rank: 8
#### API Usage

Once the server is running, you can use it like any OpenAI-compatible API:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="devstral-sft",  # or your model name
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ],
)
print(response.choices[0].message.content)
```
### Configuration

#### Main Configuration File: `vLLM/config.yaml`

```yaml
general:
  name: "agentic-SLM-vllm-deployment"
  description: "vLLM deployment configuration"

model_information:
  model_config:
    is_model_local: false
    model_id: "your-model-id"
    model_path: "/path/to/local/model"
    trust_remote_code: true
    dtype: "auto"
  vllm_engine_config:
    max_model_len: 32768
    tensor_parallel_size: 8
    tool_call_parser: "mistral"
    enable_auto_tool_choice: true
    enable_lora: false
    lora_modules:
      adapter-name: "/path/to/adapter"
    max_loras: 1
    max_lora_rank: 8

inference_parameters:
  temperature: 0.6
  max_tokens: 8192

lora_merge:
  base_model: "mistralai/Devstral-Small-2-24B-Instruct-2512"
  adapter_path: "/path/to/adapter"
  output_path: "/path/to/output"
```
#### Key Configuration Parameters

- `is_model_local`: Set to `true` to load from a local path, `false` for the HuggingFace Hub
- `model_id`: HuggingFace model ID (when `is_model_local: false`)
- `model_path`: Local path to the model (when `is_model_local: true`)
- `enable_lora`: Set to `true` to enable dynamic LoRA adapter loading
- `lora_modules`: Dictionary of adapter names and paths
- `max_model_len`: Maximum context length
- `tensor_parallel_size`: Number of GPUs for tensor parallelism
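These constraints can be checked programmatically before starting the server. The following is a hypothetical helper, not part of the repository; it operates on the parsed contents of `config.yaml` (e.g., the dict returned by `yaml.safe_load`):

```python
def validate_model_config(cfg):
    # Minimal consistency checks for the config keys described above.
    model = cfg["model_information"]["model_config"]
    engine = cfg["model_information"]["vllm_engine_config"]

    if model.get("is_model_local"):
        assert model.get("model_path"), "local model requires model_path"
    else:
        assert model.get("model_id"), "Hub model requires model_id"

    if engine.get("enable_lora"):
        assert engine.get("lora_modules"), "enable_lora needs lora_modules"
        assert engine.get("max_lora_rank", 0) > 0, "set max_lora_rank"

cfg = {
    "model_information": {
        "model_config": {"is_model_local": False, "model_id": "your-model-id"},
        "vllm_engine_config": {
            "enable_lora": True,
            "lora_modules": {"devstral-sft": "/path/to/adapter"},
            "max_lora_rank": 8,
        },
    }
}
validate_model_config(cfg)  # raises AssertionError on inconsistent settings
print("config OK")
```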
### Complete Workflow Example

Here's a complete example workflow:
#### 1. Fine-tune the Model

```bash
# Your training script here
python train.py --output-dir ./outputs/devstral-sft
```
#### 2. Update the Configuration

Edit `../../vLLM/config.yaml`:

```yaml
lora_merge:
  base_model: "mistralai/Devstral-Small-2-24B-Instruct-2512"
  adapter_path: "./outputs/devstral-sft"
  output_path: "./outputs/merged-devstral-sft"
```
#### 3. Merge the LoRA Adapter

```bash
python merge_lora.py
```
#### 4. Push to HuggingFace

```bash
python push_to_hf.py \
  --hf-repo-id "your-username/devstral-sft" \
  --mode merge \
  --hf-token "your_token"
```
#### 5. Configure the vLLM Server

Update `../../vLLM/config.yaml` to use your model:

```yaml
model_information:
  model_config:
    is_model_local: false
    model_id: "your-username/devstral-sft"
```
#### 6. Start the vLLM Server

```bash
cd ../../vLLM
python serve.py
```
#### 7. Test the API

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-username/devstral-sft",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
### Troubleshooting

#### Common Issues

**Out of memory errors**

- Reduce `max_model_len`
- Reduce `tensor_parallel_size`
- Use a smaller batch size
**HuggingFace authentication failed**

- Run `huggingface-cli login`
- Or provide a token with `--hf-token`
**vLLM server won't start**

- Check GPU availability
- Verify the model path is correct
- Check `config.yaml` syntax
**LoRA adapter not loading**

- Verify the adapter path exists
- Check that `enable_lora: true` is set in the config
- Ensure `max_lora_rank` matches your adapter's rank
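PEFT records the adapter's rank as `"r"` in `adapter_config.json`, so the last check can be automated. A sketch using a hypothetical helper and a synthetic adapter directory (the real check would point at your actual adapter path):

```python
import json
import os
import tempfile

def adapter_rank(adapter_dir):
    # PEFT writes the LoRA rank as "r" in adapter_config.json;
    # it must not exceed the server's max_lora_rank setting.
    with open(os.path.join(adapter_dir, "adapter_config.json")) as f:
        return json.load(f)["r"]

# Illustrative check against a fake adapter directory:
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "adapter_config.json"), "w") as f:
        json.dump({"r": 8, "lora_alpha": 16}, f)
    max_lora_rank = 8
    rank = adapter_rank(d)
    print(rank <= max_lora_rank)  # True -> the server config is compatible
```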
## Framework versions

- PEFT: 0.18.0
- TRL: 0.26.2
- Transformers: 5.0.0.dev0
- PyTorch: 2.9.1
- Datasets: 4.4.1
- Tokenizers: 0.22.1
## Citations

Cite TRL as:

```bibtex
@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}
```