RedHatAI/Qwen3-32B-NVFP4A16

@@ -27,7 +27,7 @@ base_model: Qwen/Qwen3-32B
   - **Activation quantization:** FP16
 - **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
 - **Release Date:** 6/25/2025
-- **Version:** 1.0
 - **Model Developers:** RedHatAI
 This model is a quantized version of [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B).
@@ -35,7 +35,7 @@ It was evaluated on a several tasks to assess the its quality in comparison to t
 ### Model Optimizations
-This model was obtained by quantizing the weights of [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) to FP4 data type, ready for inference with vLLM>=0.9.1
 This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.
 Only the weights of the linear operators within transformer blocks are quantized using [LLM Compressor](https://github.com/vllm-project/llm-compressor).
@@ -53,7 +53,7 @@ from transformers import AutoTokenizer
 model_id = "RedHatAI/Qwen3-32B-NVFP4A16"
 number_gpus = 2
-sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
 tokenizer = AutoTokenizer.from_pretrained(model_id)
@@ -161,8 +161,7 @@ tokenizer.save_pretrained(SAVE_DIR)
 This model was evaluated on the well-known OpenLLM v1, OpenLLM v2, HumanEval, and HumanEval_64 benchmarks. All evaluations were conducted using [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness).
-### Accuracy
 <table>
   <thead>
     <tr>
@@ -176,115 +175,135 @@ This model was evaluated on the well-known OpenLLM v1, OpenLLM v2, HumanEval, an
   <tbody>
     <tr>
       <td rowspan="7"><b>OpenLLM V1</b></td>
-      <td>mmlu</td>
       <td></td>
       <td></td>
       <td></td>
     </tr>
-  <tr>
-    <td>MMLU</td>
-    <td></td>
-    <td></td>
-    <td></td>
-  </tr>
-  <tr>
-    <td>ARC Challenge (0-shot)</td>
-    <td></td>
-    <td></td>
-    <td></td>
-  </tr>
-  <tr>
-    <td>GSM8K (8-shot, strict-match)</td>
-    <td></td>
-    <td></td>
-    <td></td>
-  </tr>
-  <tr>
-    <td>Hellaswag (10-shot)</td>
-    <td></td>
-    <td></td>
-    <td></td>
-  </tr>
-  <tr>
-    <td>Winogrande (5-shot)</td>
-    <td></td>
-    <td></td>
-    <td></td>
-  </tr>
-  <tr>
-    <td>TruthfulQA (0-shot, mc2)</td>
-    <td></td>
-    <td></td>
-    <td></td>
-  </tr>
-  <tr>
-    <td><b>Average</b></td>
-    <td><b></b></td>
-    <td><b></b></td>
-    <td><b>%</b></td>
-  </tr>
-  <tr>
-    <td rowspan="7"><b>OpenLLM V2</b></td>
-    <td>MMLU-Pro (5-shot)</td>
-    <td></td>
-    <td></td>
-    <td></td>
-  </tr>
-  <tr>
-    <td>IFEval (0-shot)</td>
-    <td></td>
-    <td></td>
-    <td></td>
-  </tr>
-  <tr>
-    <td>BBH (3-shot)</td>
-    <td></td>
-    <td></td>
-    <td></td>
-  </tr>
-  <tr>
-    <td>Math-lvl-5 (4-shot)</td>
-    <td></td>
-    <td></td>
-    <td></td>
-  </tr>
-  <tr>
-    <td>GPQA (0-shot)</td>
-    <td></td>
-    <td></td>
-    <td></td>
-  </tr>
-  <tr>
-    <td>MuSR (0-shot)</td>
-    <td></td>
-    <td></td>
-    <td></td>
-  </tr>
-  <tr>
-    <td><b>Average</b></td>
-    <td><b></b></td>
-    <td><b></b></td>
-    <td><b>%</b></td>
-  </tr>
-  <tr>
-    <td><b>Coding</b></td>
-    <td>HumanEval pass@1</td>
-    <td></td>
-    <td></td>
-    <td></td>
-  </tr>
-  <tr>
-    <td></td>
-    <td>HumanEval_64 pass@2</td>
-    <td></td>
-    <td></td>
-    <td></td>
-  </tr>
- </tbody>
 </table>
 ### Reproduction
 The results were obtained using the following commands:

   - **Activation quantization:** FP16
 - **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
 - **Release Date:** 6/25/2025
+- **Version:** 1.0
 - **Model Developers:** RedHatAI
 This model is a quantized version of [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B).
 ### Model Optimizations
+This model was obtained by quantizing the weights of [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) to FP4 data type, ready for inference with vLLM>=0.9.1
 This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.
 Only the weights of the linear operators within transformer blocks are quantized using [LLM Compressor](https://github.com/vllm-project/llm-compressor).
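
The effect of the bit-width change on weight storage can be estimated with simple arithmetic. The sketch below is illustrative only: it assumes a round 32 billion parameters (taken from the model name), and it ignores the quantization scales stored alongside FP4 weights as well as any layers left unquantized. `weight_gib` is a hypothetical helper, not part of vLLM or LLM Compressor.

```python
def weight_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate storage for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


NUM_PARAMS = 32e9  # assumption: round figure from the model name

fp16_gib = weight_gib(NUM_PARAMS, 16)  # original 16-bit weights
fp4_gib = weight_gib(NUM_PARAMS, 4)    # 4-bit quantized weights
reduction = 1 - fp4_gib / fp16_gib     # fraction of storage saved

print(f"FP16: {fp16_gib:.1f} GiB, FP4: {fp4_gib:.1f} GiB, reduction: {reduction:.0%}")
```

Under these assumptions the quantized weights occupy one quarter of the original size. The real saving is likely somewhat smaller, since only the linear operators inside the transformer blocks are quantized.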
 model_id = "RedHatAI/Qwen3-32B-NVFP4A16"
 number_gpus = 2
+sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 This model was evaluated on the well-known OpenLLM v1, OpenLLM v2, HumanEval, and HumanEval_64 benchmarks. All evaluations were conducted using [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness).
+<h3>Accuracy</h3>
 <table>
   <thead>
     <tr>
   <tbody>
     <tr>
       <td rowspan="7"><b>OpenLLM V1</b></td>
+      <td>MMLU</td>
+      <td>80.94</td>
+      <td>80.57</td>
+      <td>99.55%</td>
+    </tr>
+    <tr>
+      <td>ARC Challenge (0-shot)</td>
+      <td>68.34</td>
+      <td>68.43</td>
+      <td>100.12%</td>
+    </tr>
+    <tr>
+      <td>GSM8K (8-shot, strict-match)</td>
+      <td>87.34</td>
+      <td>87.72</td>
+      <td>100.43%</td>
+    </tr>
+    <tr>
+      <td>Hellaswag (10-shot)</td>
+      <td>71.16</td>
+      <td>70.48</td>
+      <td>99.05%</td>
+    </tr>
+    <tr>
+      <td>Winogrande (5-shot)</td>
+      <td>69.93</td>
+      <td>70.09</td>
+      <td>100.23%</td>
+    </tr>
+    <tr>
+      <td>TruthfulQA (0-shot, mc2)</td>
+      <td>58.63</td>
+      <td>58.96</td>
+      <td>100.56%</td>
+    </tr>
+    <tr>
+      <td><b>Average</b></td>
+      <td><b>72.72</b></td>
+      <td><b>72.71</b></td>
+      <td><b>99.98%</b></td>
+    </tr>
+    <tr>
+      <td rowspan="7"><b>OpenLLM V2</b></td>
+      <td>MMLU-Pro (5-shot)</td>
+      <td>54.48</td>
+      <td>51.61</td>
+      <td>94.73%</td>
+    </tr>
+    <tr>
+      <td>IFEval (0-shot)</td>
+      <td>88.85</td>
+      <td>88.49</td>
+      <td>99.59%</td>
+    </tr>
+    <tr>
+      <td>BBH (3-shot)</td>
+      <td>62.61</td>
+      <td>62.14</td>
+      <td>99.25%</td>
+    </tr>
+    <tr>
+      <td>Math-lvl-5 (4-shot)</td>
+      <td>56.87</td>
+      <td>56.27</td>
+      <td>98.94%</td>
+    </tr>
+    <tr>
+      <td>GPQA (0-shot)</td>
+      <td>30.45</td>
+      <td>30.29</td>
+      <td>99.47%</td>
+    </tr>
+    <tr>
+      <td>MuSR (0-shot)</td>
+      <td>39.15</td>
+      <td>40.48</td>
+      <td>103.40%</td>
+    </tr>
+    <tr>
+      <td><b>Average</b></td>
+      <td><b>55.40</b></td>
+      <td><b>54.88</b></td>
+      <td><b>99.06%</b></td>
+    </tr>
+    <tr>
+      <td><b>Coding</b></td>
+      <td>HumanEval Instruct pass@1</td>
+      <td>88.41</td>
+      <td>87.20</td>
+      <td>98.63%</td>
+    </tr>
+    <tr>
       <td></td>
+      <td>HumanEval 64 Instruct pass@2</td>
+      <td>90.27</td>
+      <td>89.66</td>
+      <td>99.32%</td>
+    </tr>
+    <tr>
+      <td></td>
+      <td>HumanEval 64 Instruct pass@8</td>
+      <td>92.20</td>
+      <td>92.13</td>
+      <td>99.92%</td>
+    </tr>
+    <tr>
+      <td></td>
+      <td>HumanEval 64 Instruct pass@16</td>
+      <td>92.96</td>
+      <td>93.27</td>
+      <td>100.33%</td>
+    </tr>
+    <tr>
       <td></td>
+      <td>HumanEval 64 Instruct pass@32</td>
+      <td>93.58</td>
+      <td>94.47</td>
+      <td>100.95%</td>
+    </tr>
+    <tr>
       <td></td>
+      <td>HumanEval 64 Instruct pass@64</td>
+      <td>93.90</td>
+      <td>95.73</td>
+      <td>101.95%</td>
     </tr>
+  </tbody>
 </table>
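
The percentage in the last column of each row appears to be the quantized score divided by the unquantized score. The header row is not visible in this diff, so the column order (baseline, quantized, recovery) is an assumption. A minimal sketch of that calculation, using a hypothetical `recovery` helper and values copied from the table above:

```python
def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score, to 2 decimals."""
    return round(quantized / baseline * 100, 2)


# (baseline, quantized) pairs copied from the table; column order is assumed
scores = {
    "OpenLLM V2 average": (55.40, 54.88),
    "MuSR (0-shot)": (39.15, 40.48),
    "HumanEval Instruct pass@1": (88.41, 87.20),
}

for name, (baseline, quantized) in scores.items():
    print(f"{name}: {recovery(baseline, quantized):.2f}%")
```

Recomputing other rows from the rounded scores can differ from the published figure by 0.01, presumably because the published recovery was computed from unrounded scores.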
 ### Reproduction
 The results were obtained using the following commands: