Qwen3-Coder-30B-A3B-REAM AWQ 4-bit

REAM-pruned and AWQ-quantized variant of Qwen/Qwen3-Coder-30B-A3B-Instruct, built end-to-end from the upstream BF16 base. 128 experts merged down to 96 (~30B → ~23B total params; 3B active retained), then 4-bit AWQ-quantized for AMD RDNA4 (gfx1201) inference with SGLang.

Model Details

Base model Qwen/Qwen3-Coder-30B-A3B-Instruct (upstream BF16, 128 experts)
Architecture Qwen3 MoE (96 experts post-REAM, top-8)
Parameters ~23B total / 3B active
Pruning REAM (Router Expert Aware Merging) — Samsung SAIL merge.py with saliency=reap, grouping=ream, merging=logits+weights, mix_ratio=0.0,0.3,0.7. REAM merges similar experts into representative survivors (it does not drop experts the way Cerebras REAP does).
Layers 48
Context 256K (max_position_embeddings=262144)
Quantization Native AWQ 4-bit, group_size=128, GEMM kernel format
Calibration llmcompressor GPTQ, 256 samples × 1024 tokens, code+thinking mix; ignore=['lm_head'] (router mlp.gate is preserved BF16 — targets=Linear skips the router)

The full REAM merge args are captured in config.json["merge_args"] for reproducibility.

Usage with SGLang

The recommended path on RDNA4 (gfx1201) is --quantization moe_wna16 --dtype bfloat16. Per-Linear AWQ + fp16 routes 128 experts × 48 layers × 3 projections through individual GEMV kernels per forward, which is much slower (and on some configurations triggers a HSA exception). The MoE-fused kernel is the fast and stable path.

# vLLM
from vllm import LLM
llm = LLM(model="mattbucci/Qwen3-Coder-30B-A3B-REAM-AWQ",
          quantization="moe_wna16", dtype="bfloat16")

# SGLang (CLI)
python -m sglang.launch_server \
  --model-path mattbucci/Qwen3-Coder-30B-A3B-REAM-AWQ \
  --quantization moe_wna16 --dtype bfloat16

For other inference engines, this is a standard AWQ 4-bit checkpoint (group_size=128, asymmetric, fused MoE) and should load via transformers + autoawq without modification.

Notes

  • REAM ≠ REAP. REAM (Samsung SAIL) merges experts; REAP (Cerebras cerebras/Qwen3-Coder-REAP-25B-A3B) drops them. Different algorithms with different artifacts; not interchangeable.
  • The shared_expert_gate (a Linear with output dim 1) is preserved in BF16 — AWQ's group-quantization packing requires output dim divisible by 8.

Hardware

REAM-merged + calibrated + smoke-tested on 2× AMD Radeon AI PRO R9700 (gfx1201, RDNA4, 64 GB total VRAM) with ROCm 7.2 and SGLang v0.5.10/v0.5.11.

Source code for the build pipeline (RDNA4 patches, calibration scripts, conversion utilities): https://github.com/mattbucci/2x-R9700-RDNA4-GFX1201-sglang-inference

Downloads last month
37
Safetensors
Model size
23B params
Tensor type
I32
·
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for mattbucci/Qwen3-Coder-30B-A3B-REAM-AWQ

Quantized
(145)
this model