# Qwen3.6-35B-A3B-PRISM-NVFP4
NVFP4 (W4A4) quantization of a PRISM-tuned Qwen3.6-35B-A3B. ~24 GB on disk, multimodal + MTP draft head preserved. Designed for NVIDIA Blackwell (SM120/SM121).
PRISM softens over-refusal behaviour and removes bias / propaganda patterns while maintaining and enhancing task performance, coherence, and multimodal capability.
## Model details
- Base: Qwen/Qwen3.6-35B-A3B (35B total, ~3B active per token, 256 routed experts)
- PRISM: refusal-softening, bias + propaganda removal
- Format: compressed-tensors NVFP4 (FP4 E2M1 weights + activations, UE4M3 per-block-16 scales)
- Kept in BF16: vision encoder, lm_head, router gates, embeddings, linear-attention SSM state
- Runtime targets: vLLM (`--quantization compressed-tensors`), Blackwell tensor cores
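To make the NVFP4 layout above concrete, here is a minimal NumPy sketch of FP4 E2M1 quantization with one shared scale per block of 16 values. This is illustrative only: the real compressed-tensors kernels store the scale in UE4M3 and pack two FP4 codes per byte, neither of which is modeled here.

```python
import numpy as np

# The 8 non-negative magnitudes representable in FP4 E2M1
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block16(block: np.ndarray) -> tuple[float, np.ndarray]:
    """Quantize 16 values to the E2M1 grid with one shared scale.

    Illustrative sketch: real NVFP4 stores the scale as UE4M3, not float64.
    """
    amax = float(np.abs(block).max())
    scale = amax / 6.0 if amax > 0 else 1.0  # map the largest magnitude onto 6.0
    scaled = block / scale
    # Round each magnitude to the nearest grid point, then restore the sign.
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    return scale, np.sign(scaled) * E2M1_GRID[idx]

w = np.random.default_rng(0).normal(size=16)
scale, codes = quantize_block16(w)
w_hat = codes * scale  # dequantize
```

Because the widest gap on the E2M1 grid (between 4 and 6) is 2, the per-element reconstruction error is bounded by `scale` after nearest rounding, which is why one scale per 16 values is enough to keep the error small.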
## Files
| File | Purpose |
|---|---|
| `model.safetensors` | language-model + vision encoder weights |
| `model_mtp.safetensors` | MTP draft head (optional, for speculative decoding) |
| `model.safetensors.index.json` | weight map |
| `config.json`, `generation_config.json` | model + generation config |
| `tokenizer*`, `processor_config.json`, `chat_template.jinja` | tokenizer + chat template |
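If your runtime supports it, the MTP draft head can drive speculative decoding. A hedged sketch for vLLM is below; the `--speculative-config` flag and the `"method"` value vary across vLLM versions and model integrations, so check the speculative-decoding docs for your build before relying on it.

```shell
vllm serve Ex0bit/Qwen3.6-35B-A3B-PRISM-NVFP4 \
  --quantization compressed-tensors \
  --trust-remote-code \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
```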
## Serving (vLLM)
```shell
vllm serve Ex0bit/Qwen3.6-35B-A3B-PRISM-NVFP4 \
  --quantization compressed-tensors \
  --dtype auto \
  --max-model-len 32768 \
  --trust-remote-code
```
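Once the server is up, it exposes an OpenAI-compatible API (port 8000 by default). A quick smoke test:

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Ex0bit/Qwen3.6-35B-A3B-PRISM-NVFP4",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32
      }'
```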
Requires vLLM with Blackwell NVFP4 kernels. On SM121 (DGX Spark), use a vLLM build with SM121-aware patches; stock PyPI wheels will fault on the missing `cvt.rn.satfinite.e2m1x2.f32` PTX instruction.
Known-working community Docker images (Apache 2.0, tested on GB10):
- `ghcr.io/aeon-7/vllm-spark-omni-q36`: vLLM HEAD + GB10 patches + flashinfer sm_120 kernels; also supports DFlash speculative decoding.
- `avarok/dgx-vllm-nvfp4-kernel`: generic NVFP4 MoE image with software-E2M1 conversion and Marlin-MoE default.
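One way to launch the first image is sketched below; the entrypoint, port, and mount points are assumptions on my part, so defer to the image's own README for the exact invocation.

```shell
docker run --gpus all -p 8000:8000 \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  ghcr.io/aeon-7/vllm-spark-omni-q36 \
  vllm serve Ex0bit/Qwen3.6-35B-A3B-PRISM-NVFP4 \
    --quantization compressed-tensors \
    --trust-remote-code
```

Mounting the Hugging Face cache avoids re-downloading the ~24 GB of weights on every container start.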
## License
Apache 2.0, inherited from the base model.