@gm8xx8
rStar-Math shows SLMs can rival or surpass OpenAI o1 in math reasoning w/out distillation from larger models, using MCTS and three key factors:
1. Code-Augmented CoT Synthesis: MCTS rollouts generate step-by-step verified reasoning trajectories to train the policy SLMs (first sketch after this list).
2. Enhanced PRM: A novel training approach avoids naïve step-level score annotations, yielding a stronger process preference model (PPM) (second sketch after this list).
3. Self-Evolution Framework: Four rounds of self-evolution refine reasoning with millions of synthesized solutions for 747k problems.
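A toy sketch of what the code-augmented MCTS synthesis in point 1 boils down to, under heavy assumptions: the policy SLM is replaced by a canned step proposer, the problem is a one-liner with known answer 17, and only steps whose Python actually executes survive. Names like propose_steps, run_steps, and GROUND_TRUTH are illustrative, not the paper's API.

```python
import math, random

GROUND_TRUTH = 17   # assumed known answer for the toy problem "3*4 + 5"
MAX_DEPTH = 2       # number of reasoning steps in a full solution

def propose_steps(depth):
    """Stand-in for the policy SLM: candidate Python one-line steps."""
    if depth == 0:
        return ["a = 3 * 4", "a = 3 + 4", "a = 3 *"]   # last one is malformed
    return ["answer = a + 5", "answer = a - 5"]

def run_steps(steps):
    """Execute the partial solution; return its namespace, or None on error.
    This is the code-execution filter that discards broken steps."""
    env = {}
    try:
        exec("\n".join(steps), env)
        return env
    except Exception:
        return None

class Node:
    def __init__(self, steps, parent=None):
        self.steps, self.parent = steps, parent
        self.children, self.visits, self.value = [], 0, 0.0

def uct(child, parent, c=1.4):
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(math.log(parent.visits) / child.visits)

def expand(node):
    for step in propose_steps(len(node.steps)):
        steps = node.steps + [step]
        if run_steps(steps) is not None:        # keep only executable steps
            node.children.append(Node(steps, node))

def rollout(node):
    """Randomly complete the solution, then score it against the known answer."""
    steps = list(node.steps)
    while len(steps) < MAX_DEPTH:
        cands = [s for s in propose_steps(len(steps)) if run_steps(steps + [s])]
        if not cands:
            return 0.0
        steps.append(random.choice(cands))
    env = run_steps(steps)
    return 1.0 if env and env.get("answer") == GROUND_TRUTH else 0.0

def mcts(iterations=64):
    root = Node([])
    for _ in range(iterations):
        node = root
        while node.children:                                  # selection
            node = max(node.children, key=lambda ch: uct(ch, node))
        if len(node.steps) < MAX_DEPTH:
            expand(node)                                      # expansion
            if node.children:
                node = random.choice(node.children)
        reward = rollout(node)                                # simulation
        while node:                                           # backpropagation
            node.visits += 1
            node.value += reward
            node = node.parent
    return root

root = mcts()
for child in root.children:
    # Per-step Q-values like these later rank good vs. bad steps for the PPM.
    print(child.steps[-1], "-> Q =", round(child.value / max(child.visits, 1), 2))
```

Steps whose subtrees keep reaching the correct answer accumulate high Q-values; verified trajectories become training data for the policy SLM, and the per-step Q-values become the ranking signal for the PPM in point 2.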
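For point 2, a minimal sketch of the pairwise preference training, assuming a Bradley-Terry style ranking loss over (preferred, rejected) step pairs drawn from those Q-values; pairwise_preference_loss and the toy tensors are placeholders, not the released code.

```python
import torch

# Assumed PPM objective: rank MCTS-preferred steps above rejected ones
# instead of regressing noisy per-step score annotations.
def pairwise_preference_loss(score_pos: torch.Tensor, score_neg: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(s_pos - s_neg), averaged over the batch of step pairs."""
    return -torch.nn.functional.logsigmoid(score_pos - score_neg).mean()

# Toy usage: scalar scores a PPM head might assign to 4 (preferred, rejected) pairs.
pos = torch.tensor([1.2, 0.3, 0.8, 2.0], requires_grad=True)
neg = torch.tensor([0.1, 0.5, -0.4, 1.0])
loss = pairwise_preference_loss(pos, neg)
loss.backward()  # gradients push preferred-step scores above rejected ones
print(float(loss))
```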
Performance Highlights:
> Achieves 90.0% on MATH, improving Qwen2.5-Math-7B by +31.2% and surpassing OpenAI o1-preview by +4.5%.
> Boosts Phi3-mini-3.8B from 41.4% to 86.4%.
> Solves 53.3% of AIME problems, ranking in the top 20% of high school competitors.
don’t sleep on small models.
https://arxiv.org/abs/2501.04519