DRIFT

DRIFT is a fine-tuned version of Qwen2.5-VL with enhanced reasoning capabilities, optimized for multimodal reasoning tasks. The model is introduced in the paper Directional Reasoning Injection for Fine-Tuning MLLMs. Code and further details are available in the GitHub repository: https://github.com/WikiChao/DRIFT

Usage

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils
import torch

model_id = "ChaoHuangCS/DRIFT-VL-7B"

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the checkpoint is stored in bfloat16
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Example usage with an image
from PIL import Image

image = Image.open("your_image.jpg")
prompt = "Analyze this image and explain your reasoning step by step."

# Format the input
messages = [
    {"role": "user", "content": [{"type": "image", "image": image}, {"type": "text", "text": prompt}]}
]

# Apply chat template
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)  # helper from qwen_vl_utils

inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")
inputs = inputs.to(model.device)

with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens (drop the echoed prompt)
generated_ids_trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
response = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0]
print(response)
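
The base Qwen2.5-VL architecture also accepts video input through the same pipeline. Below is a minimal sketch reusing the model and processor loaded above; the video path, prompt, and sampling defaults are placeholders, and video handling depends on the installed qwen_vl_utils / transformers versions.

# Example usage with a video (path is a placeholder)
messages = [
    {"role": "user", "content": [
        {"type": "video", "video": "file:///path/to/your_video.mp4"},
        {"type": "text", "text": "Describe what happens in this video step by step."}
    ]}
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")
inputs = inputs.to(model.device)

with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=512)

generated_ids_trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])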

Fine-tuning Details

This model was fine-tuned using:

  • Base Model: Qwen2.5-VL
  • Merged Model: DeepSeek-R1 (reasoning source model; see the sketch after this list)
  • Training Method: Custom reasoning-focused fine-tuning
  • Dataset: Multimodal reasoning datasets
  • Architecture: Preserves original Qwen2.5-VL architecture
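
As a rough intuition for what merging a reasoning model into a base model can look like, here is a generic, task-arithmetic-style sketch: it treats the per-parameter delta between a reasoning-distilled LLM and a plain base LLM as a "reasoning direction" and adds a scaled copy into the target backbone. This is only an illustration under stated assumptions, not the DRIFT procedure from the paper; the model IDs, merge strength, and output path are illustrative.

import torch
from transformers import AutoModelForCausalLM

# Illustrative choices; not necessarily what DRIFT uses
reasoning_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # reasoning-distilled LLM
base_id = "Qwen/Qwen2.5-7B"                               # plain base LLM
alpha = 0.5                                               # merge strength (illustrative)

reasoning_sd = AutoModelForCausalLM.from_pretrained(
    reasoning_id, torch_dtype=torch.bfloat16).state_dict()
base_model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
base_sd = base_model.state_dict()

# "Reasoning direction": per-parameter delta between the reasoning model and the base
merged_sd = {}
for name, p_base in base_sd.items():
    p_reason = reasoning_sd.get(name)
    if p_reason is not None and p_reason.shape == p_base.shape:
        merged_sd[name] = p_base + alpha * (p_reason - p_base)
    else:
        merged_sd[name] = p_base  # keep parameters without a matching counterpart

base_model.load_state_dict(merged_sd)
base_model.save_pretrained("merged-reasoning-backbone")  # hypothetical output path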

Performance

The model has been optimized for:

  • Enhanced reasoning capabilities
  • Better multimodal understanding
  • Improved step-by-step thinking processes
  • More accurate visual question answering

Citation

If you use this model, please cite the paper Directional Reasoning Injection for Fine-Tuning MLLMs.

License

This model is released under the MIT license.
