Update README_EVALUATION.md
README_EVALUATION.md CHANGED +44 -6
|
@@ -28,10 +28,11 @@ Math: see data/*
|
|
| 28 |
For model generation on single seed, please use the following command:
|
| 29 |
|
| 30 |
```
|
| 31 |
-
bash generate_livecodebench.sh ${model_path} ${seed} ${output_path}
|
| 32 |
-
bash generate_aime.sh ${model_path} ${seed} aime24 ${output_path}
|
| 33 |
-
bash generate_aime.sh ${model_path} ${seed} aime25 ${output_path}
|
| 34 |
```
|
|
|
|
| 35 |
|
| 36 |
Or you can use our configured seeds to reproduce our results on AIME 24/25 (avg@64) and LiveCodeBench v5/v6 (avg@8) as follows:
|
| 37 |
|
|
@@ -52,7 +53,7 @@ python evaluate_aime.py --modelfolder ${output_path} --test_data data/aime25.jso
|
|
| 52 |
We also left our generations into cache.tar.gz as references.
|
| 53 |
|
| 54 |
```
|
| 55 |
-
LiveCodeBench AceReason-Nemotron-7B (Avg@8)
|
| 56 |
=================================================================
|
| 57 |
Months Corrects Total Accuracy
|
| 58 |
2023-05 180 272 66.17647058823529
|
|
@@ -84,7 +85,7 @@ Months Corrects Total Accuracy
|
|
| 84 |
v5 1142 2232 51.16487455197132
|
| 85 |
v6 621 1400 44.357142857142854
|
| 86 |
|
| 87 |
-
LiveCodeBench AceReason-Nemotron-14B (Avg@8)
|
| 88 |
=================================================================
|
| 89 |
Months Corrects Total Accuracy
|
| 90 |
2023-05 211 272 77.57352941176471
|
|
@@ -116,6 +117,38 @@ Months Corrects Total Accuracy
|
|
| 116 |
v5 1350 2232 60.483870967741936
|
| 117 |
v6 775 1400 55.357142857142854
|
| 118 |
|
| 119 |
AceReason-Nemotron-7B
|
| 120 |
====================================
|
| 121 |
AIME2024 (Avg@64) 68.64583333333334
|
|
@@ -125,4 +158,9 @@ AceReason-Nemotron-14B
|
|
| 125 |
====================================
|
| 126 |
AIME2024 (Avg@64) 78.43749999999997
|
| 127 |
AIME2025 (Avg@64) 67.65625
|
| 128 |
-
|
| 28 |
For model generation on single seed, please use the following command:
|
| 29 |
|
| 30 |
```
|
| 31 |
+
bash generate_livecodebench.sh ${model_path} ${seed} ${output_path} ${model_type}
|
| 32 |
+
bash generate_aime.sh ${model_path} ${seed} aime24 ${output_path} ${model_type}
|
| 33 |
+
bash generate_aime.sh ${model_path} ${seed} aime25 ${output_path} ${model_type}
|
| 34 |
```
|
| 35 |
+
Please specify model_type as r1 for AceReason-Nemotron-1.0 models, and qwen for AceReason-Nemotron-1.1 models.
|
| 36 |
|
| 37 |
Or you can use our configured seeds to reproduce our results on AIME 24/25 (avg@64) and LiveCodeBench v5/v6 (avg@8) as follows:
|
| 38 |
|
|
|
|
| 53 |
We also left our generations into cache.tar.gz as references.
|
| 54 |
|
| 55 |
```
|
| 56 |
+
LiveCodeBench AceReason-Nemotron-1.0-7B (Avg@8)
|
| 57 |
=================================================================
|
| 58 |
Months Corrects Total Accuracy
|
| 59 |
2023-05 180 272 66.17647058823529
|
|
|
|
| 85 |
v5 1142 2232 51.16487455197132
|
| 86 |
v6 621 1400 44.357142857142854
|
| 87 |
|
| 88 |
+
LiveCodeBench AceReason-Nemotron-1.0-14B (Avg@8)
|
| 89 |
=================================================================
|
| 90 |
Months Corrects Total Accuracy
|
| 91 |
2023-05 211 272 77.57352941176471
|
|
|
|
| 117 |
v5 1350 2232 60.483870967741936
|
| 118 |
v6 775 1400 55.357142857142854
|
| 119 |
|
| 120 |
+
LiveCodeBench AceReason-Nemotron-1.1-7B (Avg@8)
|
| 121 |
+
=================================================================
|
| 122 |
+
Months Corrects Total Accuracy
|
| 123 |
+
2023-05 205 272 75.36764705882354
|
| 124 |
+
2023-06 255 312 81.73076923076923
|
| 125 |
+
2023-07 356 432 82.4074074074074
|
| 126 |
+
2023-08 208 288 72.22222222222223
|
| 127 |
+
2023-09 287 352 81.5340909090909
|
| 128 |
+
2023-10 278 352 78.97727272727273
|
| 129 |
+
2023-11 234 280 83.57142857142857
|
| 130 |
+
2023-12 263 320 82.1875
|
| 131 |
+
2024-01 215 288 74.65277777777777
|
| 132 |
+
2024-02 182 256 71.09375
|
| 133 |
+
2024-03 270 360 75.0
|
| 134 |
+
2024-04 254 296 85.8108108108108
|
| 135 |
+
2024-05 221 288 76.73611111111111
|
| 136 |
+
05/23-05/24 3228 4096 78.80859375
|
| 137 |
+
2024-06 309 368 83.96739130434783
|
| 138 |
+
2024-07 235 344 68.31395348837209
|
| 139 |
+
2024-08 292 528 55.303030303030305
|
| 140 |
+
2024-09 211 376 56.11702127659574
|
| 141 |
+
2024-10 254 424 59.905660377358494
|
| 142 |
+
2024-11 269 456 58.99122807017544
|
| 143 |
+
2024-12 239 392 60.96938775510204
|
| 144 |
+
2025-01 194 408 47.549019607843135
|
| 145 |
+
06/24-01/25 2003 3296 60.77063106796116
|
| 146 |
+
2025-02 203 408 49.754901960784316
|
| 147 |
+
2025-03 306 544 56.25
|
| 148 |
+
2025-04 41 96 42.708333333333336
|
| 149 |
+
v5 1283 2232 57.482078853046595
|
| 150 |
+
v6 726 1400 51.857142857142854
|
| 151 |
+
|
| 152 |
AceReason-Nemotron-7B
|
| 153 |
====================================
|
| 154 |
AIME2024 (Avg@64) 68.64583333333334
|
|
|
|
| 158 |
====================================
|
| 159 |
AIME2024 (Avg@64) 78.43749999999997
|
| 160 |
AIME2025 (Avg@64) 67.65625
|
| 161 |
+
|
| 162 |
+
AceReason-Nemotron-1.1-7B
|
| 163 |
+
====================================
|
| 164 |
+
AIME2024 (Avg@64) 72.60416666666667
|
| 165 |
+
AIME2025 (Avg@64) 64.84375
|
| 166 |
+
```
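The Accuracy column in the tables above is consistent with 100 × Corrects / Total, and the range rows (for example 06/24-01/25) are consistent with summing the monthly Corrects and Total columns before dividing. A minimal Python sketch of that arithmetic follows, using the AceReason-Nemotron-1.1-7B monthly rows. The names `MONTHLY`, `accuracy`, and `aggregate` are illustrative and are not taken from the repository's evaluation scripts.

```python
# Monthly (corrects, total) pairs copied from the AceReason-Nemotron-1.1-7B table.
MONTHLY = {
    "2024-06": (309, 368),
    "2024-07": (235, 344),
    "2024-08": (292, 528),
    "2024-09": (211, 376),
    "2024-10": (254, 424),
    "2024-11": (269, 456),
    "2024-12": (239, 392),
    "2025-01": (194, 408),
}


def accuracy(corrects: int, total: int) -> float:
    """Percentage of correct generations."""
    return 100.0 * corrects / total


def aggregate(rows):
    """Sum corrects and totals over several rows, then compute one accuracy."""
    rows = list(rows)
    corrects = sum(c for c, _ in rows)
    total = sum(t for _, t in rows)
    return corrects, total, accuracy(corrects, total)


corrects, total, acc = aggregate(MONTHLY.values())
print(corrects, total, round(acc, 2))  # 2003 3296 60.77
```

Pooling the counts before dividing weights each month by its number of generations, which is not the same as averaging the monthly percentages.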
|
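The updated README asks for `model_type` to be `r1` for AceReason-Nemotron-1.0 models and `qwen` for AceReason-Nemotron-1.1 models. One way to avoid passing the wrong value is to derive it from the checkpoint name, as in the hedged sketch below. The helpers `model_type_for` and `aime_command`, the name-matching heuristic, the example path, and the seed value are all assumptions for illustration; only the script name and positional argument order come from the README.

```python
import os


def model_type_for(model_path: str) -> str:
    """Pick the model_type argument: "qwen" for 1.1 models, "r1" otherwise."""
    # Assumption: 1.1 checkpoints carry "1.1" in the final path component,
    # and every other checkpoint passed in is a 1.0 model.
    name = os.path.basename(model_path.rstrip("/"))
    return "qwen" if "1.1" in name else "r1"


def aime_command(model_path: str, seed: int, split: str, output_path: str) -> list:
    """Build the generate_aime.sh argument list in the README's positional order."""
    return ["bash", "generate_aime.sh", model_path, str(seed), split,
            output_path, model_type_for(model_path)]


# Hypothetical checkpoint path and seed, used only to show the resulting command.
print(" ".join(aime_command("ckpts/AceReason-Nemotron-1.1-7B", 100, "aime25", "out")))
```

The returned list can be handed to `subprocess.run` unchanged. The heuristic treats any name without "1.1" as a 1.0 model, so it should only be used with AceReason-Nemotron checkpoints.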