Qwen3.6-35B-A3B-PRISM-NVFP4

NVFP4 (W4A4) quantization of a PRISM-tuned Qwen3.6-35B-A3B. ~24 GB on disk, multimodal + MTP draft head preserved. Designed for NVIDIA Blackwell (SM120/SM121).

PRISM softens over-refusal behaviour and removes bias / propaganda patterns while preserving task performance, coherence, and multimodal capability.

Model details

  • Base: Qwen/Qwen3.6-35B-A3B (35B total, ~3B active per token, 256 routed experts)
  • PRISM: refusal-softening, bias + propaganda removal
  • Format: compressed-tensors NVFP4 (FP4 E2M1 weights + activations, UE4M3 per-block-16 scales)
  • Kept BF16: vision encoder, lm_head, router gates, embeddings, linear-attention SSM state
  • Runtime targets: vLLM (--quantization compressed-tensors), Blackwell tensor cores
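To make the format concrete, here is a minimal, illustrative sketch of NVFP4-style block quantization: FP4 E2M1 has eight representable magnitudes, and each block of 16 values shares one scale. This is not the compressed-tensors kernel; the real format stores scales in UE4M3 (an unsigned FP8), while the sketch uses a plain float scale for clarity.

```python
# Sketch of NVFP4-style quantization: E2M1 value grid + one scale per 16-value block.
# Assumption: scale kept as a plain float; real NVFP4 encodes it as UE4M3.

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable E2M1 magnitudes

def quantize_block(block):
    """Dequantized values after snapping a block (<=16 floats) to E2M1 points."""
    assert len(block) <= 16
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # map largest magnitude onto E2M1 max (6.0)

    def snap(x):
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        return (-mag if x < 0 else mag) * scale

    return [snap(x) for x in block], scale

deq, s = quantize_block([0.1, -0.5, 2.4, 6.0, -3.1, 0.0, 1.2, 0.9])
```

With only eight magnitudes per block, the shared scale carries most of the dynamic range, which is why sensitive tensors (vision encoder, lm_head, router gates) are kept in BF16.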

Files

File                                                     Purpose
model.safetensors                                        language-model + vision-encoder weights
model_mtp.safetensors                                    MTP draft head (optional, for speculative decoding)
model.safetensors.index.json                             weight map
config.json, generation_config.json                      model + generation config
tokenizer*, processor_config.json, chat_template.jinja   tokenizer + chat template
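The index file maps each tensor name to the shard file that holds it, in the standard safetensors index layout. A small sketch, using an illustrative excerpt (the tensor names are hypothetical, not this model's actual keys), showing how to group tensors by shard:

```python
import json
from collections import defaultdict

# Illustrative excerpt of a model.safetensors.index.json; real tensor names
# and sizes differ.
index_json = """{
  "metadata": {"total_size": 25769803776},
  "weight_map": {
    "model.embed_tokens.weight": "model.safetensors",
    "model.layers.0.mlp.experts.0.down_proj.weight": "model.safetensors",
    "mtp.head.weight": "model_mtp.safetensors"
  }
}"""

index = json.loads(index_json)
per_file = defaultdict(list)
for tensor, shard in index["weight_map"].items():
    per_file[shard].append(tensor)

for shard, tensors in sorted(per_file.items()):
    print(f"{shard}: {len(tensors)} tensor(s)")
```

The same grouping run against the real index shows which tensors live in the optional MTP shard, so it can be skipped when speculative decoding is not used.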

Serving (vLLM)

vllm serve Ex0bit/Qwen3.6-35B-A3B-PRISM-NVFP4 \
  --quantization compressed-tensors \
  --dtype auto \
  --max-model-len 32768 \
  --trust-remote-code

Requires vLLM with Blackwell NVFP4 kernels. On SM121 (DGX Spark), use a vLLM build with SM121-aware patches; stock PyPI wheels will fault on the missing cvt.rn.satfinite.e2m1x2.f32 PTX instruction.
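Once serving, vLLM exposes an OpenAI-compatible API (port 8000 by default). A minimal stdlib sketch that builds a chat-completions request for this model; the request itself is commented out since it requires the server above to be running:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # vLLM's default OpenAI-compatible endpoint

payload = {
    "model": "Ex0bit/Qwen3.6-35B-A3B-PRISM-NVFP4",
    "messages": [{"role": "user", "content": "Summarize NVFP4 in one sentence."}],
    "max_tokens": 128,
}

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is up:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(json.dumps(payload, indent=2))
```

Any OpenAI-compatible client works the same way; only the base URL and model name need to match the serve command.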

Known-working community Docker images (Apache 2.0, tested on GB10):

  • ghcr.io/aeon-7/vllm-spark-omni-q36: vLLM HEAD + GB10 patches + flashinfer sm_120 kernels; also supports DFlash speculative decoding.
  • avarok/dgx-vllm-nvfp4-kernel: generic NVFP4 MoE image with software-E2M1 conversion and Marlin-MoE default.

License

Apache 2.0, inherited from the base model.

☕ Support Our Work

If you enjoy our work and find it useful, please consider sponsoring or supporting us!

Ko-fi
