Model Card for devstral-sft

This model is a fine-tuned version of mistralai/Devstral-Small-2-24B-Instruct-2512, trained with TRL (Transformer Reinforcement Learning), Hugging Face's library for training language models with reinforcement learning, including supervised fine-tuning. The base model was supervised fine-tuned on the open-thoughts/OpenThoughts-Agent-v1-SFT dataset, and the resulting LoRA adapter is published as Madhurprash/Devstral-Small-2-24B-Instruct-2512-SFT-LoRA-OpenThoughts. The adapter can be merged directly into the base model and evaluated on the Terminal Bench 2.0 benchmark.

Quick start

from transformers import pipeline

# Loading the LoRA adapter repo directly requires `peft` to be installed
question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
generator = pipeline(
    "text-generation",
    model="Madhurprash/Devstral-Small-2-24B-Instruct-2512-SFT-LoRA-OpenThoughts",
    device="cuda",
)
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])

Training procedure

This model was trained with SFT.

Supervised Fine-Tuning (SFT) Guide

This guide covers the complete workflow for fine-tuning models with LoRA adapters, merging them with base models, and deploying them using vLLM.

Table of Contents

  1. Fine-tuning with LoRA
  2. Merging LoRA Adapters
  3. Pushing Models to HuggingFace
  4. Serving with vLLM
  5. Configuration

Fine-tuning with LoRA

Prerequisites

  • Base model (e.g., Devstral, Mistral, Llama)
  • Training dataset prepared
  • Sufficient GPU memory
  • Python environment with required packages

Training Process

  1. Prepare your training data in the required format
  2. Configure your training parameters
  3. Run the training script
  4. Monitor training progress

The LoRA adapter will be saved to the output directory specified in your training configuration.

Note: Specific training scripts should be configured based on your model architecture and dataset requirements.
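As a concrete illustration of step 1, the sketch below converts raw prompt/response records into the chat-style `messages` format commonly expected by SFT trainers. This is a minimal stdlib example with a hypothetical `to_chat_format` helper; the exact schema depends on your trainer and dataset.

```python
import json

def to_chat_format(records):
    """Convert raw prompt/response pairs into chat-style SFT examples."""
    examples = []
    for rec in records:
        examples.append({
            "messages": [
                {"role": "user", "content": rec["prompt"]},
                {"role": "assistant", "content": rec["response"]},
            ]
        })
    return examples

raw = [{"prompt": "List files in a directory.", "response": "Use `ls -la`."}]
formatted = to_chat_format(raw)

# Write one JSON object per line (JSONL), a common SFT input format
jsonl = "\n".join(json.dumps(ex) for ex in formatted)
print(jsonl)
```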


Merging LoRA Adapters

After training, you have two options:

Option 1: Merge LoRA Adapter with Base Model

Use the generic merge script that loads configuration from vLLM/config.yaml:

python merge_lora.py

With custom configuration:

python merge_lora.py --config /path/to/config.yaml

Override specific parameters:

python merge_lora.py \
  --base-model "mistralai/Devstral-Small-2-24B-Instruct-2512" \
  --adapter-path "./outputs/devstral-sft" \
  --output-path "./outputs/merged-devstral-sft"

Option 2: Use Mistral-Specific Merge Script

For Mistral models specifically:

python merge_mistral_lora.py

This script is hardcoded for Mistral3ForConditionalGeneration and uses the original configuration.
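Conceptually, both merge scripts fold the low-rank update into the base weights, W' = W + (α/r)·B·A, after which the adapter is no longer needed at inference time. A stdlib illustration with toy 2×2 matrices (in practice `peft` performs this per layer via `merge_and_unload()`):

```python
def matmul(a, b):
    """Naive matrix multiply for small nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def merge_lora(W, A, B, alpha, r):
    """Fold the LoRA update into the base weight: W' = W + (alpha / r) * B @ A."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[W[i][j] + scale * BA[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# Toy example: hidden size 2, rank r=1
W = [[1.0, 0.0], [0.0, 1.0]]   # base weight (2x2)
B = [[1.0], [2.0]]             # LoRA B matrix (2x1)
A = [[0.5, 0.5]]               # LoRA A matrix (1x2)
merged = merge_lora(W, A, B, alpha=2.0, r=1)
print(merged)  # [[2.0, 1.0], [2.0, 3.0]]
```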


Pushing Models to HuggingFace

The push_to_hf.py script provides three modes for uploading to HuggingFace Hub:

Mode 1: Merge and Push (Default)

Merge the LoRA adapter with the base model and push the merged model:

python push_to_hf.py \
  --hf-repo-id "your-username/your-model-name" \
  --mode merge

With HuggingFace token:

python push_to_hf.py \
  --hf-repo-id "your-username/your-model-name" \
  --hf-token "your_hf_token_here" \
  --mode merge

Mode 2: Push Adapter Only

Push only the LoRA adapter without merging:

python push_to_hf.py \
  --hf-repo-id "your-username/your-model-name" \
  --mode adapter \
  --adapter-path "/path/to/adapter" \
  --hf-token "your_hf_token_here"

Mode 3: Push Existing Merged Model

Push an already merged model:

python push_to_hf.py \
  --hf-repo-id "your-username/your-model-name" \
  --mode existing \
  --model-path "/path/to/merged/model"

Authentication

You can provide your HuggingFace token in three ways:

  1. As an argument: --hf-token "your_token"
  2. Cached login: Run huggingface-cli login beforehand
  3. Environment variable: Set HF_TOKEN in your environment
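That lookup order can be sketched as a small helper (a hypothetical `resolve_hf_token`, not part of `push_to_hf.py`; falling back to the cached login is what `huggingface_hub` does when no token is passed):

```python
import os

def resolve_hf_token(cli_token=None):
    """Resolve a HuggingFace token: CLI argument first, then the HF_TOKEN
    environment variable, then None (huggingface_hub falls back to the
    token cached by `huggingface-cli login`)."""
    if cli_token:
        return cli_token
    env_token = os.environ.get("HF_TOKEN")
    if env_token:
        return env_token
    return None

print(resolve_hf_token("abc"))  # CLI argument wins
```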

Serving with vLLM

The vLLM server supports serving merged models or using LoRA adapters dynamically.

Start the Server

Navigate to the vLLM directory and run:

cd ../../vLLM
python serve.py

The server will:

  • Read configuration from config.yaml
  • Start on http://localhost:8000
  • Provide OpenAI-compatible API endpoints

Configuration Options

Edit vLLM/config.yaml to configure:

For Merged Models:

model_information:
  model_config:
    is_model_local: true
    model_path: "/path/to/merged/model"
  vllm_engine_config:
    enable_lora: false

For LoRA Adapters:

model_information:
  model_config:
    is_model_local: false
    model_id: "mistralai/Devstral-Small-2-24B-Instruct-2512"
  vllm_engine_config:
    enable_lora: true
    lora_modules:
      devstral-sft: "/path/to/adapter"
    max_loras: 1
    max_lora_rank: 8

API Usage

Once the server is running, you can use it like any OpenAI-compatible API:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="devstral-sft",  # or your model name
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ]
)

Configuration

Main Configuration File: vLLM/config.yaml

general:
  name: "agentic-SLM-vllm-deployment"
  description: "vLLM deployment configuration"

model_information:
  model_config:
    is_model_local: false
    model_id: "your-model-id"
    model_path: "/path/to/local/model"
    trust_remote_code: true
    dtype: "auto"

  vllm_engine_config:
    max_model_len: 32768
    tensor_parallel_size: 8
    tool_call_parser: "mistral"
    enable_auto_tool_choice: true
    enable_lora: false
    lora_modules:
      adapter-name: "/path/to/adapter"
    max_loras: 1
    max_lora_rank: 8

  inference_parameters:
    temperature: 0.6
    max_tokens: 8192

lora_merge:
  base_model: "mistralai/Devstral-Small-2-24B-Instruct-2512"
  adapter_path: "/path/to/adapter"
  output_path: "/path/to/output"

Key Configuration Parameters

  • is_model_local: Set to true to load from local path, false for HuggingFace Hub
  • model_id: HuggingFace model ID (when is_model_local: false)
  • model_path: Local path to model (when is_model_local: true)
  • enable_lora: Set to true to enable dynamic LoRA adapter loading
  • lora_modules: Dictionary of adapter names and paths
  • max_model_len: Maximum context length
  • tensor_parallel_size: Number of GPUs for tensor parallelism
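The interactions between these parameters can be checked with a small validation sketch (a hypothetical `validate_model_config` helper operating on the parsed config as a plain dict, not part of `serve.py`):

```python
def validate_model_config(cfg):
    """Check that the config is internally consistent: local models need
    model_path, Hub models need model_id, and enable_lora requires at
    least one entry in lora_modules."""
    errors = []
    model_cfg = cfg["model_information"]["model_config"]
    engine_cfg = cfg["model_information"]["vllm_engine_config"]

    if model_cfg["is_model_local"] and not model_cfg.get("model_path"):
        errors.append("is_model_local is true but model_path is missing")
    if not model_cfg["is_model_local"] and not model_cfg.get("model_id"):
        errors.append("is_model_local is false but model_id is missing")
    if engine_cfg.get("enable_lora") and not engine_cfg.get("lora_modules"):
        errors.append("enable_lora is true but no lora_modules are configured")
    return errors

cfg = {
    "model_information": {
        "model_config": {"is_model_local": False, "model_id": "your-model-id"},
        "vllm_engine_config": {"enable_lora": True, "lora_modules": {}},
    }
}
print(validate_model_config(cfg))
```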

Complete Workflow Example

Here's a complete example workflow:

1. Fine-tune Model

# Your training script here
python train.py --output-dir ./outputs/devstral-sft

2. Update Configuration

Edit ../../vLLM/config.yaml:

lora_merge:
  base_model: "mistralai/Devstral-Small-2-24B-Instruct-2512"
  adapter_path: "./outputs/devstral-sft"
  output_path: "./outputs/merged-devstral-sft"

3. Merge LoRA Adapter

python merge_lora.py

4. Push to HuggingFace

python push_to_hf.py \
  --hf-repo-id "your-username/devstral-sft" \
  --mode merge \
  --hf-token "your_token"

5. Configure vLLM Server

Update ../../vLLM/config.yaml to use your model:

model_information:
  model_config:
    is_model_local: false
    model_id: "your-username/devstral-sft"

6. Start vLLM Server

cd ../../vLLM
python serve.py

7. Test the API

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-username/devstral-sft",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Troubleshooting

Common Issues

  1. Out of Memory Errors

    • Reduce max_model_len
    • Reduce tensor_parallel_size
    • Use a smaller batch size
  2. HuggingFace Authentication Failed

    • Run huggingface-cli login
    • Or provide token with --hf-token
  3. vLLM Server Won't Start

    • Check GPU availability
    • Verify model path is correct
    • Check config.yaml syntax
  4. LoRA Adapter Not Loading

    • Verify adapter path exists
    • Check enable_lora: true in config
    • Ensure max_lora_rank matches your adapter
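For the last point, the adapter's rank is recorded as `r` in the `adapter_config.json` that PEFT writes alongside the adapter weights, so a quick compatibility check looks like this (hypothetical helper; the config is built in-memory here rather than read from disk):

```python
import json

def check_lora_rank(adapter_config_json, max_lora_rank):
    """Return True if the adapter's rank fits within vLLM's max_lora_rank."""
    adapter_cfg = json.loads(adapter_config_json)
    return adapter_cfg["r"] <= max_lora_rank

# PEFT's adapter_config.json records the training rank under "r"
adapter_config = json.dumps({"r": 8, "lora_alpha": 16})
print(check_lora_rank(adapter_config, max_lora_rank=8))  # True: rank 8 fits
print(check_lora_rank(adapter_config, max_lora_rank=4))  # False: rank 8 exceeds limit
```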

Framework versions

  • PEFT: 0.18.0
  • TRL: 0.26.2
  • Transformers: 5.0.0.dev0
  • PyTorch: 2.9.1
  • Datasets: 4.4.1
  • Tokenizers: 0.22.1

Citations

Cite TRL as:

@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}