Qwen3-VL-2B-Thinking-Unredacted-MAX-FP8

Qwen3-VL-2B-Thinking-Unredacted-MAX-FP8 is an FP8-compressed evolution built on top of prithivMLmods/Qwen3-VL-2B-Thinking-abliterated-v1. This variant leverages BF16 · FP8 (F8_E4M3) precision formats to significantly reduce memory footprint and improve inference efficiency, while preserving the unredacted multimodal reasoning and structured thinking strengths of the original 2B Thinking architecture. The result is a compact yet highly capable 2B vision-language model optimized for unrestricted, detailed reasoning, step-by-step analysis, and dense captioning across complex visual inputs, with enhanced hardware efficiency.

The model uses FP8 (8-bit floating point) weight and activation quantization (W8A8 FP8), which is hardware-accelerated on supported GPUs. Compression follows the dynamic W8A8 FP8 quantization recipe (see the LLM Compressor examples).
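To illustrate the idea behind the FP8-dynamic recipe, the sketch below shows per-tensor dynamic scaling in pure Python: a scale is computed from the tensor's maximum magnitude so values fit the E4M3 representable range (max ≈ 448), and dequantization multiplies the scale back. This is a simplified illustration, not the exact pipeline used for this checkpoint; real E4M3 quantization additionally rounds each value to 3 mantissa bits, which is omitted here.

```python
# Simplified per-tensor FP8 (E4M3) dynamic quantization sketch.
# Real FP8 also rounds to the 3-bit E4M3 mantissa; this shows only
# the dynamic scaling/clamping step used in W8A8 FP8-dynamic recipes.

E4M3_MAX = 448.0  # largest finite magnitude representable in E4M3

def fp8_dynamic_quantize(values):
    """Scale values into the E4M3 range using a per-tensor scale."""
    amax = max(abs(v) for v in values)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    quantized = [max(-E4M3_MAX, min(E4M3_MAX, v / scale)) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate original values by multiplying the scale back."""
    return [q * scale for q in quantized]

weights = [0.02, -1.5, 3.2, -0.75]
q, s = fp8_dynamic_quantize(weights)
restored = dequantize(q, s)
```

Because the scale is recomputed per tensor at quantization time ("dynamic"), no calibration dataset is needed for this step.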

Key Highlights

  • BF16 · FP8 (F8_E4M3) Compression: Transformer Engine–based FP8 quantization reduces VRAM usage and improves throughput while maintaining strong multimodal reasoning fidelity.
  • Unredacted MAX Training: Retains the abliterated fine-tuning strategy designed to minimize internal refusal behaviors and improve instruction adherence.
  • 2B Thinking Architecture: Built on top of prithivMLmods/Qwen3-VL-2B-Thinking-abliterated-v1 (derived from Qwen/Qwen3-VL-2B-Thinking), enabling structured reasoning and stepwise analysis in a lightweight footprint.
  • Unrestricted Multimodal Reasoning: Designed for deep analysis of artistic, technical, abstract, or high-complexity visual content without standard safety-driven refusals.
  • High-Fidelity Captions: Produces dense, descriptive outputs suitable for dataset generation, metadata enrichment, or accessibility workflows.
  • Dynamic Resolution Support: Retains Qwen3-VL’s ability to process varying image resolutions and aspect ratios effectively.
  • Optimized Deployment: FP8 compression enables smoother deployment on Hopper and compatible GPU architectures, even on lower VRAM systems compared to larger variants.

Quick Start with Transformers

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Load the 2B Thinking Unredacted MAX FP8 model
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Qwen3-VL-2B-Thinking-Unredacted-MAX-FP8",
    torch_dtype="auto",
    device_map="auto"
)

processor = AutoProcessor.from_pretrained(
    "prithivMLmods/Qwen3-VL-2B-Thinking-Unredacted-MAX-FP8"
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Provide a detailed caption and reasoning for this image."},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

image_inputs, video_inputs = process_vision_info(messages)

# Move inputs to the same device the model was dispatched to
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256)

generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

print(output_text)

Intended Use

  • Advanced Red-Teaming: Evaluating multimodal robustness and probing behavioral edge cases.
  • Compact Research Deployment: Running structured multimodal reasoning experiments on smaller GPU setups.
  • Refusal Mechanism Research: Studying behavioral shifts after abliterated fine-tuning in a lightweight architecture.
  • Structured Visual Reasoning Research: Exploring step-by-step multimodal reasoning efficiency at the 2B scale.
  • Creative Storytelling & Captioning: Producing rich visual descriptions for datasets and narrative projects.

Limitations & Risks

Critical Note: This model is designed to minimize built-in refusal mechanisms.

  • Sensitive Content Exposure: The model may generate explicit or controversial descriptions if prompted accordingly.
  • User Responsibility: Generated outputs must be handled responsibly and used within ethical and legal boundaries.
  • Hardware Requirements: Although significantly lighter than the 32B/8B/4B variants, FP8 inference still requires a compatible GPU and sufficient VRAM for high-resolution image inputs and long generations.
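For a rough sense of the VRAM savings, the arithmetic below compares weight-only memory for 2B parameters at 1 byte/param (FP8) versus 2 bytes/param (BF16). These figures are back-of-the-envelope estimates only and exclude activations, the KV cache, and framework overhead.

```python
# Weight-only memory estimate: parameter count x bytes per parameter.
# Excludes activations, KV cache, and runtime overhead.

def weight_memory_gib(num_params, bytes_per_param):
    return num_params * bytes_per_param / (1024 ** 3)

params = 2e9                               # ~2B parameters
fp8_gib = weight_memory_gib(params, 1)     # FP8:  ~1.86 GiB
bf16_gib = weight_memory_gib(params, 2)    # BF16: ~3.73 GiB
```

In practice, total VRAM use is higher than the weight figure alone, and grows with image resolution and generation length.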

Acknowledgements

Thanks to the authors of the following works:

  • Uncensor any LLM with abliteration – Maxime Labonne
  • Using FP8 and FP4 with Transformer Engine – docs.nvidia
  • Remove Refusals with Transformers – Sumandora
  • LLM Compressor – vllm-project
  • FP8 Floating-Point 8: An Introduction to Efficient, Lower-Precision AI Training – nvidia

Model Stats

  • Format: Safetensors
  • Model size: 2B params
  • Tensor types: BF16 · F8_E4M3