Qwen3.5-27B-NVFP4-Full (W4A4)
NVFP4 quantization of Qwen/Qwen3.5-27B with all linear layers quantized, including the DeltaNet linear attention projections that are typically excluded.
Key differences from standard NVFP4 checkpoints
| Component | Standard NVFP4 (e.g., Sehyo) | This checkpoint |
|---|---|---|
| MoE experts | FP4 | FP4 |
| Shared experts | FP4 | FP4 |
| Self-attention (q/k/v/o) | FP4 | FP4 |
| DeltaNet (in_proj_qkv, in_proj_z, out_proj) | BF16 | FP4 |
| DeltaNet (in_proj_a, in_proj_b) | BF16 | BF16 (N=48, below CUTLASS tile minimum) |
| Model size | 27 GB | 20 GB |
Performance (DGX Spark / GB10 / SM121)
Measured with vLLM 0.19.1 + FlashInfer 0.6.7, CUTLASS W4A4 backend, no MTP:
| Metric | Standard NVFP4 | This checkpoint | Improvement |
|---|---|---|---|
| Decode (tg32) | 7.93 tok/s | 11.98 tok/s | +51% |
| Decode @ d4096 | 7.66 tok/s | 11.90 tok/s | +55% |
| Decode @ d8192 | 7.92 tok/s | 11.80 tok/s | +49% |
| Prefill (pp2048) | 1855 tok/s | 2383 tok/s | +28% |
The speedup comes from eliminating ~5 GB of BF16 weight loads per token for the DeltaNet layers, replacing them with ~1.4 GB of FP4 loads.
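The "Improvement" column and the weight-traffic figures can be re-derived with a few lines of arithmetic. The sketch below only reuses numbers already quoted in this section; the helper name `improvement_percent` is illustrative and not part of any library.

```python
# Throughput figures copied from the performance table (tok/s):
# (standard NVFP4, this checkpoint)
benchmarks = {
    "Decode (tg32)": (7.93, 11.98),
    "Decode @ d4096": (7.66, 11.90),
    "Decode @ d8192": (7.92, 11.80),
    "Prefill (pp2048)": (1855.0, 2383.0),
}


def improvement_percent(baseline: float, candidate: float) -> float:
    """Relative throughput gain of `candidate` over `baseline`, in percent."""
    return (candidate / baseline - 1.0) * 100.0


for name, (standard, this_checkpoint) in benchmarks.items():
    gain = improvement_percent(standard, this_checkpoint)
    print(f"{name}: +{gain:.0f}%")

# Approximate per-token DeltaNet weight traffic quoted in the text (GB).
bf16_gb, fp4_gb = 5.0, 1.4
saved_gb = bf16_gb - fp4_gb
print(f"DeltaNet weight loads: {saved_gb:.1f} GB less per token "
      f"({bf16_gb / fp4_gb:.1f}x reduction)")
```

The decode gains are larger than the prefill gain, which is consistent with decode being dominated by weight loads, as the explanation above describes.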
Quality benchmarks (0-shot, 200-sample subsets)
| Benchmark | Metric | This checkpoint | BF16 typical | Recovery |
|---|---|---|---|---|
| ARC-Challenge | acc_norm | 63.5% | ~66% | ~96% |
| HellaSwag | acc_norm | 74.0% | ~78% | ~95% |
| TruthfulQA MC2 | acc | 54.2% | ~55% | ~99% |
| Winogrande | acc | 51.5% | ~52% | ~99% |
The checkpoint recovers roughly 95-99% of typical BF16 scores on these knowledge and reasoning benchmarks, which suggests that quantizing the DeltaNet linear attention layers to FP4 is near-lossless.
Note: GSM8K results are excluded because the model's thinking/reasoning output format interferes with lm-eval-harness answer extraction, producing unreliable scores. Subjective quality in interactive use (Open WebUI, chat API) is excellent, with reasoning intact.
Quantization details
- Method: llm-compressor oneshot with calibrated NVFP4 (W4A4)
- Calibration: 256 samples from HuggingFaceH4/ultrachat_200k, max_seq_length=4096
- Format: compressed-tensors nvfp4-pack-quantized with calibrated input_global_scale
- Excluded layers: in_proj_a, in_proj_b (N=48, CUTLASS FP4 requires N%64==0), conv1d (3D), norms, A_log, dt_bias, lm_head, embed_tokens
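The exclusion rules above can be expressed as a small filter. This is a minimal sketch, not the code used to build the checkpoint: the helper `is_fp4_candidate`, the substring matching, and the example module names and sizes other than N=48 are assumptions made for illustration.

```python
# Name fragments of tensors that stay unquantized, taken from the
# "Excluded layers" list. Substring matching is an illustrative choice.
EXCLUDED_NAME_PARTS = (
    "in_proj_a", "in_proj_b", "conv1d", "norm",
    "A_log", "dt_bias", "lm_head", "embed_tokens",
)

# The card states that CUTLASS FP4 requires N % 64 == 0.
FP4_N_MULTIPLE = 64


def is_fp4_candidate(param_name: str, out_features: int) -> bool:
    """Return True if a linear layer could be quantized under the stated rules."""
    if any(part in param_name for part in EXCLUDED_NAME_PARTS):
        return False
    return out_features % FP4_N_MULTIPLE == 0


# Hypothetical module names; only N=48 for in_proj_a comes from the card,
# the other output size is a placeholder, not read from the checkpoint.
examples = [
    ("model.layers.0.linear_attn.in_proj_a", 48),
    ("model.layers.0.linear_attn.out_proj", 5120),
]
for name, n in examples:
    print(f"{name} (N={n}): FP4 candidate = {is_fp4_candidate(name, n)}")
```

The N=48 projections fail both checks: they are listed by name, and 48 is not a multiple of 64.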
Usage
vLLM (recommended)
Requires vLLM >= 0.19.1 with PR #38423 (W4A4 SM120/SM121 support) and FlashInfer >= 0.6.7.
vllm serve rdtand/Qwen3.5-27B-NVFP4-DeltaNet-Included \
--trust-remote-code \
--kv-cache-dtype fp8 \
--attention-backend flashinfer \
--speculative-config '{"method":"mtp","num_speculative_tokens":3}'
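Once the server is running, it can be queried through vLLM's OpenAI-compatible chat endpoint. The following is a minimal standard-library sketch; it assumes the server listens on vLLM's default port 8000 on the same machine, and the helper names `build_chat_request` and `chat` are illustrative.

```python
import json
import urllib.request

MODEL_ID = "rdtand/Qwen3.5-27B-NVFP4-DeltaNet-Included"
# Assumption: `vllm serve` was started locally with its default port (8000).
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=300) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("Explain NVFP4 in one sentence."))` prints the model's reply; adjust `BASE_URL` if the server was started with a different host or port.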
Quality notes
FP4 activation quantization on DeltaNet layers was widely assumed to be destructive for model quality. Our analysis shows that the quantization error on these layers (SNR ~24 dB, relative error ~26%) is comparable to that of other layer types (SNR ~24 dB, relative error ~26%). The model produces coherent output with reasoning capabilities intact.
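For readers who want to run a similar per-layer comparison, the sketch below shows one common way to compute SNR and relative error between a reference tensor and its quantized round-trip. The card does not state which definitions it used, so the L2-based formulas here are an assumption and may not reproduce the figures above. The `fake_quantize_int4` function is a plain uniform 4-bit stand-in, not the NVFP4 format.

```python
import math


def quantization_error(reference, quantized):
    """Return (snr_db, relative_error) using L2 norms (assumed definitions)."""
    signal_power = sum(x * x for x in reference)
    noise_power = sum((x - q) ** 2 for x, q in zip(reference, quantized))
    if noise_power == 0:
        return math.inf, 0.0
    snr_db = 10.0 * math.log10(signal_power / noise_power)
    relative_error = math.sqrt(noise_power / signal_power)
    return snr_db, relative_error


def fake_quantize_int4(values):
    """Symmetric uniform 4-bit round-trip; a stand-in, NOT NVFP4."""
    scale = max(abs(v) for v in values) / 7.0
    return [max(-7, min(7, round(v / scale))) * scale for v in values]


# Synthetic "weights" used only to exercise the functions.
weights = [math.sin(0.37 * i) for i in range(256)]
snr_db, rel_err = quantization_error(weights, fake_quantize_int4(weights))
print(f"SNR = {snr_db:.1f} dB, relative error = {rel_err:.1%}")
```

Applying the same function to each layer type's tensors gives numbers that can be compared side by side, which is the kind of comparison the paragraph above describes.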
Required llm-compressor fix
Quantizing the DeltaNet layers requires vllm-project/llm-compressor#2566, which fixes model_free_ptq for models with non-contiguous fused attention layers (Qwen3.5's interleaved self_attn + linear_attn architecture).
Acknowledgments