Qwen3-Coder-30B-A3B-REAM AWQ 4-bit
REAM-pruned and AWQ-quantized variant of Qwen/Qwen3-Coder-30B-A3B-Instruct, built end-to-end from the upstream BF16 base. 128 experts merged down to 96 (~30B → ~23B total params; 3B active retained), then 4-bit AWQ-quantized for AMD RDNA4 (gfx1201) inference with SGLang.
Model Details
| Base model | Qwen/Qwen3-Coder-30B-A3B-Instruct (upstream BF16, 128 experts) |
| Architecture | Qwen3 MoE (96 experts post-REAM, top-8) |
| Parameters | ~23B total / 3B active |
| Pruning | REAM (Router Expert Aware Merging) — Samsung SAIL merge.py with saliency=reap, grouping=ream, merging=logits+weights, mix_ratio=0.0,0.3,0.7. REAM merges similar experts into representative survivors (it does not drop experts the way Cerebras REAP does). |
| Layers | 48 |
| Context | 256K (max_position_embeddings=262144) |
| Quantization | Native AWQ 4-bit, group_size=128, GEMM kernel format |
| Calibration | llmcompressor GPTQ, 256 samples × 1024 tokens, code+thinking mix; ignore=['lm_head'] (router mlp.gate is preserved BF16 — targets=Linear skips the router) |
The full REAM merge args are captured in config.json["merge_args"] for reproducibility.
Usage with SGLang
The recommended path on RDNA4 (gfx1201) is --quantization moe_wna16 --dtype bfloat16. Per-Linear AWQ + fp16 routes 128 experts × 48 layers × 3 projections through individual GEMV kernels per forward, which is much slower (and on some configurations triggers a HSA exception). The MoE-fused kernel is the fast and stable path.
# vLLM
from vllm import LLM
llm = LLM(model="mattbucci/Qwen3-Coder-30B-A3B-REAM-AWQ",
quantization="moe_wna16", dtype="bfloat16")
# SGLang (CLI)
python -m sglang.launch_server \
--model-path mattbucci/Qwen3-Coder-30B-A3B-REAM-AWQ \
--quantization moe_wna16 --dtype bfloat16
For other inference engines, this is a standard AWQ 4-bit checkpoint (group_size=128, asymmetric, fused MoE) and should load via transformers + autoawq without modification.
Notes
- REAM ≠REAP. REAM (Samsung SAIL) merges experts; REAP (Cerebras
cerebras/Qwen3-Coder-REAP-25B-A3B) drops them. Different algorithms with different artifacts; not interchangeable. - The
shared_expert_gate(a Linear with output dim 1) is preserved in BF16 — AWQ's group-quantization packing requires output dim divisible by 8.
Hardware
REAM-merged + calibrated + smoke-tested on 2× AMD Radeon AI PRO R9700 (gfx1201, RDNA4, 64 GB total VRAM) with ROCm 7.2 and SGLang v0.5.10/v0.5.11.
Source code for the build pipeline (RDNA4 patches, calibration scripts, conversion utilities): https://github.com/mattbucci/2x-R9700-RDNA4-GFX1201-sglang-inference
- Downloads last month
- 37
Model tree for mattbucci/Qwen3-Coder-30B-A3B-REAM-AWQ
Base model
Qwen/Qwen3-Coder-30B-A3B-Instruct