Qwen3-Coder-30B-A3B-REAM AWQ 4-bit

REAM-pruned and AWQ-quantized variant of Qwen/Qwen3-Coder-30B-A3B-Instruct, built end-to-end from the upstream BF16 base. 128 experts merged down to 96 (~30B → ~23B total params; 3B active retained), then 4-bit AWQ-quantized for AMD RDNA4 (gfx1201) inference with SGLang.

Model Details


Base model	Qwen/Qwen3-Coder-30B-A3B-Instruct (upstream BF16, 128 experts)
Architecture	Qwen3 MoE (96 experts post-REAM, top-8)
Parameters	~23B total / 3B active
Pruning	REAM (Router Expert Aware Merging) — Samsung SAIL `merge.py` with `saliency=reap, grouping=ream, merging=logits+weights, mix_ratio=0.0,0.3,0.7`. REAM merges similar experts into representative survivors (it does not drop experts the way Cerebras REAP does).
Layers	48
Context	256K (`max_position_embeddings=262144`)
Quantization	Native AWQ 4-bit, group_size=128, GEMM kernel format
Calibration	llmcompressor GPTQ, 256 samples × 1024 tokens, code+thinking mix; `ignore=['lm_head']` (router `mlp.gate` is preserved BF16 — `targets=Linear` skips the router)

The full REAM merge args are captured in config.json["merge_args"] for reproducibility.

Usage with SGLang

The recommended path on RDNA4 (gfx1201) is --quantization moe_wna16 --dtype bfloat16. Per-Linear AWQ + fp16 routes 128 experts × 48 layers × 3 projections through individual GEMV kernels per forward, which is much slower (and on some configurations triggers a HSA exception). The MoE-fused kernel is the fast and stable path.

# vLLM
from vllm import LLM
llm = LLM(model="mattbucci/Qwen3-Coder-30B-A3B-REAM-AWQ",
          quantization="moe_wna16", dtype="bfloat16")

# SGLang (CLI)
python -m sglang.launch_server \
  --model-path mattbucci/Qwen3-Coder-30B-A3B-REAM-AWQ \
  --quantization moe_wna16 --dtype bfloat16

For other inference engines, this is a standard AWQ 4-bit checkpoint (group_size=128, asymmetric, fused MoE) and should load via transformers + autoawq without modification.

Notes

REAM ≠ REAP. REAM (Samsung SAIL) merges experts; REAP (Cerebras cerebras/Qwen3-Coder-REAP-25B-A3B) drops them. Different algorithms with different artifacts; not interchangeable.
The shared_expert_gate (a Linear with output dim 1) is preserved in BF16 — AWQ's group-quantization packing requires output dim divisible by 8.

Hardware

REAM-merged + calibrated + smoke-tested on 2× AMD Radeon AI PRO R9700 (gfx1201, RDNA4, 64 GB total VRAM) with ROCm 7.2 and SGLang v0.5.10/v0.5.11.

Source code for the build pipeline (RDNA4 patches, calibration scripts, conversion utilities): https://github.com/mattbucci/2x-R9700-RDNA4-GFX1201-sglang-inference

Downloads last month: 37

Safetensors

Model size

23B params

Tensor type

I32

BF16

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for mattbucci/Qwen3-Coder-30B-A3B-REAM-AWQ

Base model

Qwen/Qwen3-Coder-30B-A3B-Instruct

Quantized

(145)

this model