Excited to share our paper: Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning
A common assumption in test-time reasoning is that giving a model more chances to think or verify should improve performance. Our results show that this is only partly true.
We introduce SEVRA, a serving-layer controller that decides when a frozen reasoning model should keep its initial answer and when it should actively verify it. Instead of treating verification as always useful, SEVRA asks a more deployment-focused question:
Is this specific attempt likely recoverable by verification?
We evaluate this through helpful fixes, harmful flips, extra calls, and realized token cost.
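To make these four quantities concrete, here is a minimal sketch of how they could be tallied from logged attempts. The `Attempt` record and its field names are hypothetical and not taken from the SEVRA codebase. It assumes a "helpful fix" is a verified example that goes from wrong to right, and a "harmful flip" is one that goes from right to wrong.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Attempt:
    """One logged example (hypothetical schema, for illustration only)."""
    initial_correct: bool   # was the first answer right?
    verified: bool          # did the controller trigger verification?
    final_correct: bool     # was the answer right after any verification?
    initial_tokens: int     # tokens spent on the initial solve
    verify_tokens: int      # tokens spent on verification (0 if skipped)


def summarize(attempts: List[Attempt]) -> Dict[str, float]:
    """Tally helpful fixes, harmful flips, extra calls, and realized tokens."""
    n = len(attempts)
    # Assumed definition: verification turned a wrong answer into a right one.
    helpful = sum(a.verified and not a.initial_correct and a.final_correct
                  for a in attempts)
    # Assumed definition: verification turned a right answer into a wrong one.
    harmful = sum(a.verified and a.initial_correct and not a.final_correct
                  for a in attempts)
    # One extra model call per verified example.
    extra_calls = sum(a.verified for a in attempts)
    # Realized cost counts tokens actually generated, not the allotted budget.
    tokens = sum(a.initial_tokens + a.verify_tokens for a in attempts)
    return {
        "helpful_fixes": helpful,
        "harmful_flips": harmful,
        "extra_calls": extra_calls,
        "realized_tokens": tokens,
        "initial_accuracy": sum(a.initial_correct for a in attempts) / n,
        "final_accuracy": sum(a.final_correct for a in attempts) / n,
    }
```

Under these definitions, the net accuracy change from verification is (helpful fixes minus harmful flips) divided by the number of examples, which is why a policy can spend extra calls and still end up no better than keeping the initial answers.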
Some key takeaways:
* Selective verification improves over always verifying on MATH500 while reducing harmful flips.
* On GSM8K, the controller verifies only a small fraction of examples but still improves accuracy.
* However, a longer initial solve can sometimes match selective verification with fewer realized tokens.
* Cheap serving-visible features, such as completion status, token count, and finalizer use, nearly match larger learned gates.
* On CommonsenseQA, always-on verification hurts, showing that the best test-time compute action is workload-dependent.
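As a rough illustration of what a gate over serving-visible features might look like, here is a small rule-based sketch. This is not SEVRA's actual controller: the `ServingSignals` type, the rule order, and the `near_budget_frac` threshold are all assumptions made for the example. It also assumes "finalizer use" means an extra step was needed to extract a final answer.

```python
from dataclasses import dataclass


@dataclass
class ServingSignals:
    """Signals visible at the serving layer (hypothetical schema)."""
    completed: bool        # did generation finish before hitting the budget?
    token_count: int       # tokens generated in the initial solve
    used_finalizer: bool   # assumed: a finalizer step was needed to get an answer


def should_verify(s: ServingSignals, token_budget: int,
                  near_budget_frac: float = 0.9) -> bool:
    """Return True if the attempt looks risky enough to verify.

    Illustrative heuristic only; the thresholds are placeholders and would
    need tuning per workload.
    """
    if not s.completed:
        # A truncated solve is a natural candidate for recovery.
        return True
    if s.used_finalizer:
        # Needing a finalizer suggests the model did not conclude cleanly.
        return True
    # Otherwise verify only when the solve ran close to its token budget.
    return s.token_count >= near_budget_frac * token_budget
```

For example, `should_verify(ServingSignals(True, 300, False), token_budget=1024)` keeps the initial answer, while a truncated attempt is sent to verification. Any real thresholds would have to be fit on held-out data for the workload in question, since the CommonsenseQA result suggests a gate that helps on math can hurt elsewhere.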
The main deployment lesson is simple:
Tune the initial reasoning budget first. Then use selective recovery when explicit checks, bounded retries, auditability, or regression-risk control matter.
Paper: Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning (2606.19808)
Code: https://github.com/Sajib-006/SEVRA
Replay dashboard: sevra-space/sevra-replay
Would love feedback from the community, especially on broader test-time compute allocation, risk-aware verification, and practical serving policies for reasoning models.